fatihturker's comments | Hacker News

Author here.

Thank you for all the kind and curious comments.

For 72B models, around *36GB of memory works fine*, by the way. I ran the benchmark and shared the results on the website: https://opengraviton.github.io/index.html

While working on this research I realized something important: the way most current models are trained is extremely inefficient. Because of that, I started developing *graviton-native*, which trains AI models from scratch using more efficient architectures.

The idea is to design models that are optimized for efficiency from the beginning. My expectation is that this approach could bring around *~70% efficiency improvement*. Combined with OpenGraviton, I believe this could eventually make it possible to run *trillion-parameter scale models locally*.

You can find the paper here: https://opengraviton.github.io/paper.html

And the repository here: https://github.com/opengraviton/graviton-native

Right now I’m training a *72B model* using this approach. I’ll share the results soon and update the website.


One question I'm interested in exploring:

If models become heavily compressed and streamed from SSD, where do people think the real bottleneck moves to — storage bandwidth, memory bandwidth, or kernel efficiency?
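As a rough illustration of where that ceiling sits (my own back-of-envelope numbers, not measurements from the project): if every weight has to be read once per generated token, the token rate is capped by bandwidth divided by the compressed model size. The bandwidth figures below are typical ballpark values I'm assuming for each tier.

```python
# Back-of-envelope: bandwidth ceiling for streamed dense inference,
# assuming every weight is read once per generated token.
# All numbers are illustrative assumptions, not benchmarks.

def max_tokens_per_sec(params: float, bits_per_weight: float,
                       bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if the full model is read per token."""
    model_bytes = params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / model_bytes

MODEL = 72e9  # 72B parameters, packed at 2 bits/weight -> 18 GB

for label, bw in [("PCIe 4.0 NVMe SSD", 7.0),
                  ("dual-channel DDR5", 80.0),
                  ("high-end unified memory", 400.0)]:
    rate = max_tokens_per_sec(MODEL, bits_per_weight=2.0, bandwidth_gb_s=bw)
    print(f"{label}: <= {rate:.2f} tokens/s")
```

Under those assumptions a dense 72B model streamed entirely from NVMe tops out well below 1 token/s, which is why sparsity, caching hot layers in RAM, and MoE-style routing (so only a fraction of weights is touched per token) matter so much for the SSD-streaming story.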


It’s inspired by ideas similar to BitNet, but I wouldn’t call it “next-gen BitNet.” BitNet focuses mainly on model representation, while OpenGraviton is about inference — pushing the limits of running large models efficiently on consumer hardware. Similar motivation (more efficient models), different layer (inference engine).


One question I’m particularly curious about:

At what point does SSD bandwidth become the main bottleneck for inference when weights are heavily compressed? If anyone has experience with streaming layers or low-bit runtimes, would love to hear how you approach it.


I've been thinking about whether extreme weight compression could fundamentally change the hardware requirements for large language models.

Most LLM deployments assume large GPU clusters mainly because of memory constraints (VRAM / RAM). But if weights are aggressively compressed — for example using ternary representations ({-1, 0, +1}) — the memory footprint drops dramatically.

In theory this could reduce model size by roughly an order of magnitude compared to FP16 weights.
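As a toy illustration of that footprint math (my own 2-bit packing scheme, not any project's actual codec): four ternary weights fit in one byte, which is already 8x smaller than FP16; the information-theoretic floor of ~1.58 bits/weight is where "roughly an order of magnitude" comes from.

```python
import numpy as np

# Toy sketch: pack ternary weights {-1, 0, +1} at 2 bits each (4 per byte).
# Illustrates the footprint arithmetic only; real codecs are fancier.

def pack_ternary(w: np.ndarray) -> np.ndarray:
    """Map {-1,0,+1} -> {0,1,2} and pack four 2-bit codes per byte."""
    codes = (w + 1).astype(np.uint8)          # -1,0,+1 -> 0,1,2
    codes = np.pad(codes, (0, (-len(codes)) % 4)).reshape(-1, 4)
    return (codes[:, 0] | (codes[:, 1] << 2) |
            (codes[:, 2] << 4) | (codes[:, 3] << 6)).astype(np.uint8)

def unpack_ternary(packed: np.ndarray, n: int) -> np.ndarray:
    """Inverse of pack_ternary; n is the original weight count."""
    codes = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1)
    return codes.reshape(-1)[:n].astype(np.int8) - 1

w = np.random.choice([-1, 0, 1], size=1000).astype(np.int8)
packed = pack_ternary(w)
assert np.array_equal(unpack_ternary(packed, len(w)), w)
print(f"fp16: {len(w) * 2} bytes, packed ternary: {packed.nbytes} bytes")
```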

If you combine that with:

• dynamic sparsity

• memory-mapped weight streaming from NVMe

• speculative decoding

• fast tensor unpacking on GPU/Metal

it raises an interesting possibility:

Could extremely large models (100B–500B+) become runnable on consumer machines, even if they stream weights from SSD instead of holding everything in RAM?

Of course bandwidth, latency, and compute efficiency become major bottlenecks.
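To make the memory-mapped streaming idea concrete, here is a minimal sketch (the file layout and names are my own invention): weights stay on SSD, the OS pages each layer in on first touch, and the zero-copy view is dropped after the layer's forward pass so resident RAM stays close to a single layer's size.

```python
import mmap
import os

import numpy as np

# Minimal sketch of mmap-based layer streaming. The flat fp16 file layout
# is an assumption for illustration; a real engine would have a header,
# per-layer offsets, and quantized payloads.

def write_demo_model(path: str, n_layers: int = 4, layer_elems: int = 1024):
    rng = np.random.default_rng(0)
    with open(path, "wb") as f:
        for _ in range(n_layers):
            f.write(rng.standard_normal(layer_elems)
                       .astype(np.float16).tobytes())

def stream_forward(path: str, n_layers: int, layer_elems: int):
    layer_bytes = layer_elems * 2  # fp16 = 2 bytes/element
    stats = []
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for i in range(n_layers):
            # Zero-copy view onto the file; the OS pages it in on demand
            # and may evict it once released.
            w = np.frombuffer(mm, dtype=np.float16,
                              count=layer_elems, offset=i * layer_bytes)
            stats.append(float(w.astype(np.float32).mean()))  # stand-in
            del w  # release the view so the mapping can close cleanly
    return stats

write_demo_model("demo_weights.bin")
print(stream_forward("demo_weights.bin", 4, 1024))
os.remove("demo_weights.bin")
```

The `mean()` call stands in for the layer's matmul; the interesting part is that peak resident memory is bounded by one layer plus whatever the page cache keeps warm, not by total model size.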

I'm curious if anyone here has experimented with:

• ternary / ultra-low-bit networks

• SSD-streamed inference

• sparse LLM architectures

• MoE-style routing combined with quantization

Would love to hear thoughts on whether this approach is realistic or fundamentally limited by bandwidth and compute.


My attempts to try ternary encodings from Unsloth with llama.cpp on ROCm failed miserably. Either ggml or ROCm simply can't run it at this time on gfx1201, and CPU isn't fast enough.


Author here.

I'm currently working on further speed improvements — it's already around 8× faster in some cases, but there’s still potential for more optimization.

Since this is an open-source project, community support is very important. I believe AI shouldn’t be controlled or driven by only a few companies, so contributions, feedback, and ideas are always very welcome. Feel free to open an issue or PR if you'd like to help.


Happy to help if needed. The project is already tested and benchmarked with several models and everything is working as expected. If you run into any specific issues, feel free to open an issue or PR.


Author here.

The architecture page explains how ternary quantization, dynamic sparsity, and mmap layer streaming work together to push models far beyond normal RAM limits.

Happy to answer questions about the implementation or benchmarks.

