Isn't GPT so complex that it requires hundreds of GB of ram to be used? How's it...

garblegarble · on March 24, 2023

I'm similarly skeptical, but that said I'm running 30B parameter LLMs on my 32GB M1 MacBook Pro every day now. The trick is quantising them down to 4 (or even 3) bits, which massively reduces the memory requirements. Have a look at [1].
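As a back-of-the-envelope sketch of why that works (my own arithmetic, not taken from the paper): the weights take roughly parameter count times bits per weight. The function name is made up for illustration, and this ignores activations, the attention cache and any per-block quantisation overhead, so real usage is somewhat higher.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal 30B parameters at a few precisions (assumed values for illustration).
for bits in (16, 4, 3):
    print(f"30B params at {bits}-bit: ~{weight_memory_gb(30e9, bits):.1f} GB")
```

So a nominal 30B model drops from about 60GB at 16-bit to about 15GB at 4-bit, which is what lets it fit in 32GB.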

The devs working on llama.cpp have been discussing ways to further reduce the memory requirements by mmapping the large weights files (I thought LLMs mutated the weights as they run inference, but they clearly know more than me about the internals), bringing it within reach of phone memory.
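For anyone unfamiliar with the mmap idea, here's a minimal sketch using Python's standard mmap module. It is not llama.cpp's actual code; the file layout (a flat run of little-endian float32 values) and the helper names are assumptions for illustration. The point is that the OS pages data in on demand, and a read-only mapping is enough as long as inference only reads the weights.

```python
import mmap
import struct


def write_fake_weights(path, values):
    """Write values as little-endian float32 (a made-up, flat file layout)."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}f", *values))


def read_weight(path, index):
    """Read one float32 through a read-only memory map of the file."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Only the page(s) backing this offset need to be resident.
            (value,) = struct.unpack_from("<f", mm, index * 4)
            return value
```

Usage would be something like `write_fake_weights("w.bin", [0.5, 1.5, 2.5])` followed by `read_weight("w.bin", 1)`.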

So, iPhones are not as far off the computational capacity to run these models as you'd think. Memory (and to a greater extent, battery and cooling) are the limiting factors. iPads even less so, given they run M1 chips and have much larger batteries & much more RAM

[1] https://arxiv.org/abs/2210.17323

qumpis · on March 24, 2023

Off topic, but for what purpose are you running LLMs locally (especially every day)? My understanding was that the prompting required to make them work at all was too great.

garblegarble · on March 24, 2023

A little bit of research, a little bit of actual useful tasks - I'm interested in summarisation, which alpaca is decent at (even compared to existing summarisation-specific models I've tried)

My other motivation is making sure I understand what offline LLMs can do... while I use GPT-3 and 4 extensively, I don't want to send something over the wire if I don't have to (e.g. if I can summarise e-mails locally, I'd rather do that than send them to OpenAI).

It's also surprisingly good at defining things if I'm somewhere with no internet connectivity and want to look something up (although obviously that's not really what it's good at & hallucination risks abound)

johnthuss · on March 24, 2023

What prompt are you using for summarization? I’ve tried several variations without consistent results.

garblegarble · on March 24, 2023

On alpaca, I've found "Below is an instruction that describes a task. Write a response that appropriately completes the request. Summarise the following text: " or "Give me a 5 word summary of the following: " to work fairly well using the 30B weights.

It's certainly nowhere close to the quality of OpenAI summarisation, just better than what I previously had locally (e.g. in summarising a family history project with transcripts of old letters, gpt-3.5-turbo was able to accurately read between the lines summarising an original poem which I found amazing).

I half wonder if the change in spelling from US -> UK makes a difference...

I'd run a test on that but I've just broken my alpaca setup for longer prompts (switched to use mainline llama.cpp, which required a model conversion & some code changes, and it's no longer allocating enough memory)

bigfudge · on March 24, 2023

Necessary if you have sensitive datasets you can’t share with a US company.

_rs · on March 24, 2023

Off topic slightly, but are you running into limits with 32GB RAM that the 64GB model would meaningfully be adequate for? Do you wish you had one of the larger RAM models?

garblegarble · on March 24, 2023

I've been pretty happy with 32GB, but the 30B models do push near to the limits. I don't see a big difference between the quality of 65B (running on a 64GB x86 host) and 30B on M1 (although that may be down to the 4-bit quantisation, so take that with a grain of salt). I'm just glad that I have it on an M1... I have a 3080 in my PC, but when I got that I was thinking more of Stable Diffusion and YOLO tasks rather than LLMs, and it just doesn't have the VRAM for LLMs.

Alpaca seems like it could be significantly improved with better training (some of the old training data was truncated), so I think there's a decent amount of improvement to be had at the current model size.

In the future though... what would really be a meaningful change would be a larger context size - the 8k tokens of GPT-4 was a big improvement for my uses... I would guess a future local llm with larger context would exceed 32GB, but that's speculation beyond my expertise, I don't know how context size and network size scale.

If it was a PC I'd say go for 64GB, but hard to recommend that given how much Apple charge for RAM upgrades. On my next upgrade (2+ years time, hopefully) I'll likely opt for 64GB+ though
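One part of the context-size question I can put rough numbers on is the key/value cache, which grows linearly with context length. This is only a sketch: the formula assumes plain multi-head attention with one K and one V vector per layer per token, fp16 cache entries, and the LLaMA-30B shape as I recall it from the paper (60 layers, model dimension 6656). It says nothing about compute cost or about how large a model needs to be to make use of the extra context.

```python
def kv_cache_bytes(n_layers: int, d_model: int, n_ctx: int, bytes_per_value: int = 2) -> int:
    """Size of the attention K/V cache for one sequence, in bytes."""
    # Factor of 2: one key vector and one value vector per layer per token.
    return 2 * n_layers * n_ctx * d_model * bytes_per_value


# Assumed LLaMA-30B shape; context lengths chosen for illustration.
for n_ctx in (2048, 8192):
    gb = kv_cache_bytes(60, 6656, n_ctx) / 1e9
    print(f"context {n_ctx}: ~{gb:.1f} GB of KV cache")
```

Under those assumptions the cache is a bit over 3GB at 2k context and about four times that at 8k, on top of the weights.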

_rs · on March 24, 2023

Yeah, it is expensive. My other strong consideration is battery life, since DRAM is always running; going from 32 to 64 would be a hit to battery life regardless of workload, but hard to say exactly how big of a hit.

I'm curious, which configuration of the M1 MBP do you have?

garblegarble · on March 24, 2023

I went for the 16" with M1 Max w/32 GPU cores and 1TB SSD (500GB free, I offload most large files to my NAS/iCloud). On the added power usage, my understanding is that's less of a concern due to using LPDDR5?

The only drawback I've found with the M1 Max model is the added weight from the bigger heatsink just makes it a hair heavier than I'd like when picking it up at the front with one hand when open... and that in the winter time the case is cold no matter what you're running, I used to love that my Intel MBP acted as a mini leg warmer :-)

endisneigh · on March 23, 2023

yes, there are billions of parameters necessary. but large language models only came out about 5 years ago. I'm confident 5 years from now the parameters necessary to get gpt-4 performance will be decreased by orders of magnitude.

at the very least, even if that's not the case, inference will be drastically less gpu heavy by then I suspect.

chatmasta · on March 23, 2023

There will also be hardware improvements (as always) and ASIC chips specifically designed for running this kind of model. For example, see this "Optical Transformers" paper [0] and its HN discussion [1] from last month.

[0] https://arxiv.org/abs/2302.10360

[1] https://news.ycombinator.com/item?id=34905210

tlrobinson · on March 24, 2023

I could also imagine a sort of two-tier approach, where the on-device model can handle the majority of queries, but recognize when it should pass the query on to a larger model running in the cloud.
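A rough sketch of what that routing could look like. Everything here is hypothetical: the idea that the local model reports a confidence score, the 0.7 threshold, and the toy stand-in models are all assumptions for illustration, not how any shipping system works.

```python
from typing import Callable, Tuple


def route_query(
    query: str,
    local_model: Callable[[str], Tuple[str, float]],
    cloud_model: Callable[[str], str],
    min_confidence: float = 0.7,
) -> Tuple[str, str]:
    """Answer on-device when the local model is confident, else escalate."""
    answer, confidence = local_model(query)
    if confidence >= min_confidence:
        return answer, "device"
    return cloud_model(query), "cloud"


def toy_local(query: str) -> Tuple[str, float]:
    # Stand-in: pretend short queries are easy and long ones are hard.
    confidence = 0.9 if len(query.split()) <= 5 else 0.3
    return f"local answer to: {query}", confidence


def toy_cloud(query: str) -> str:
    # Stand-in for a call to a larger hosted model.
    return f"cloud answer to: {query}"


print(route_query("define mmap", toy_local, toy_cloud))
print(route_query("summarise this long thread about running models on phones", toy_local, toy_cloud))
```

The hard part in practice is the confidence signal itself, which this sketch just assumes exists.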

s1k3s · on March 23, 2023

Wait, so there's a way to make a model as smart as GPT but with fewer parameters? Isn't the parameter count why it's so good?

month13 · on March 23, 2023

This is an older paper, but DeepMind alleges in their Chinchilla paper that far better performance can be extracted with fewer parameters; quote

"We find that current large language models are significantly under-trained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant."

It's difficult to evaluate an LLM's performance as it's all qualitative, but Meta's LLaMA has been doing quite well, even at 13B parameters.
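To make "under-trained" concrete, here's a quick ratio of training tokens to parameters. The parameter and token counts are as I recall them from the respective papers, so treat them as approximate; the helper name is just for illustration.

```python
# (name, parameters, training tokens), figures as recalled from the papers
MODELS = [
    ("GPT-3", 175e9, 300e9),
    ("Chinchilla", 70e9, 1.4e12),
    ("LLaMA-13B", 13e9, 1.0e12),
]


def tokens_per_param(n_tokens: float, n_params: float) -> float:
    """How many training tokens the model saw per parameter."""
    return n_tokens / n_params


for name, params, tokens in MODELS:
    ratio = tokens_per_param(tokens, params)
    print(f"{name}: ~{ratio:.1f} training tokens per parameter")
```

GPT-3 comes out under 2 tokens per parameter, Chinchilla around 20, and LLaMA-13B well above that, which fits the idea that smaller models can close much of the gap with more data.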

astrange · on March 24, 2023

Chinchilla is aimed at finding a cost-performance tradeoff as well, not the optimal amount of training. If cost is no barrier because it'll be used forever, then probably there's no amount of training that's good enough.

monocasa · on March 23, 2023

The rumor I've heard is that GPT4 didn't meaningfully increase the parameter count versus GPT3.5, but instead focused on training and structural improvements.

qumpis · on March 24, 2023

Well, the inference time of gpt4 seems to be far greater than gpt3, so it could hint at a difference in parameter count.

_oghd · on March 24, 2023

If you watch their announcement livestream video, it looked just as fast as normal ChatGPT.

I think what we have access to is a fair bit slower.

jazzyjackson · on March 24, 2023

You can train a small model to behave like the large model at a subset of tasks.
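One common way to do that is knowledge distillation (my label for it, there are other approaches): the small "student" is trained to match the large "teacher's" output distribution. Below is a minimal sketch of just the loss term in plain Python, with made-up logits and an assumed temperature of 2.0; a real setup would compute this over batches inside a training framework.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's softened distribution to the student's."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Made-up logits over a 3-token vocabulary.
teacher = [1.0, 2.0, 3.0]
print(distillation_loss(teacher, [1.0, 2.0, 3.0]))  # student matches the teacher
print(distillation_loss(teacher, [3.0, 2.0, 1.0]))  # student disagrees
```

The loss is zero when the student matches the teacher and grows as they diverge, so minimising it on a narrow set of prompts pulls the student towards the teacher on those tasks only.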

endisneigh · on March 23, 2023

that's a complicated question to answer. what I'd say is that more parameters makes the model more robust, but there are diminishing returns. optimizations are under way

dachryn · on March 24, 2023

Don't underestimate how many of those parameters are actually necessary to support multiple languages.

If you focus on English only, this can easily reduce the parameters 5-fold.

spacebanana7 · on March 24, 2023

Could you explain how supporting multiple languages increases the parameter count so much? I'm genuinely curious.

LLMs seem to be comfortable with hundreds of programming languages, DSLs and application specific syntaxes so how does supporting a couple more natural languages become so expensive?

I see how more training data would be needed, but I don't understand how that maps to a greater parameter count.

pornel · on March 24, 2023

Hundreds of GBs of RAM in a phone is just ~6 years away if Moore's law holds. It's also likely that memory requirements will be shrunk through software and ML improvements.

AgentOrange1234 · on March 24, 2023

I don’t think RAM on phones has been increasing exponentially? For good reason — it uses a lot of power.

pornel · on March 24, 2023

It has been increasing. iPhones went from 128MB to 6GB, and there are Android phones with 18GB now.

RAM hasn't been increasing as steeply as it could, but if there's a strong use-case for it, it may happen. Also consider that Apple is in control of the whole chipset and software, so they could implement things like turning the extra RAM on only during ML computation.
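To put rough numbers on the "~6 years" estimate, this is just compound doubling. The doubling periods are assumptions (Moore's law is about transistor counts, not phone RAM specifically), and the function name is made up for illustration.

```python
def projected_ram_gb(current_gb: float, years: float, doubling_period_years: float) -> float:
    """RAM after `years` if capacity doubles every `doubling_period_years`."""
    return current_gb * 2 ** (years / doubling_period_years)


# Starting points from above: 6GB (iPhone) and 18GB (high-end Android).
for start_gb in (6, 18):
    for period in (1.5, 2.0):
        projected = projected_ram_gb(start_gb, 6, period)
        print(f"{start_gb}GB, doubling every {period} years: ~{projected:.0f}GB in 6 years")
```

With a 2-year doubling, 18GB becomes 144GB and 6GB becomes 48GB after 6 years; it takes the 18-month rate and the 18GB starting point to get to 288GB.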

datpiff · on March 24, 2023

> Hundreds of GBs of RAM in a phone is just ~6 years away if Moore's law holds.

It's not expected to. The consensus seems to be that it ends around 2025: https://arxiv.org/abs/1511.05956

bonestamp2 · on March 24, 2023

Not to mention storage. The model is estimated to be 500-600GB. That's a lot of storage to tie up on a phone.