SlowLlama: Finetune llama2-70B and codellama on MacBook Air without quantization (github.com/okuvshynov)
156 points by behnamoh on Oct 6, 2023 | 54 comments


What’s the use case here? The author notes doing inference from disk is too slow. Fine-tune a model and do nothing with it? Fine-tune and then upload it to a cloud machine? Not sure how often I’d like to try uploading multi-gigabyte files from a consumer internet connection with narrow upload bandwidth.


Not sure about 70b, but this is a really useful repo for fine-tuning the 7b models on M1 machines—and all the examples in the README seem to use 7b.


Only LLaMA2, or does it also work on Mistral-7B?


Depending on how much RAM you have, fine-tune then quantize & run with llama.cpp, which works quite well.


> Not sure how often I’d like to try uploading multi gigabyte files from a consumer internet connection with narrow upload bandwidth

It's 2023... this is a fairly standard use case (e.g. YouTubers uploading videos).


I'm on the mid-tier plan available to me. 500 Mbps down, 20 Mbps up. It would take me, bare minimum, about 6.67 minutes to upload 1 gigabyte. I'll be the first to admit that I don't know much about AI/ML, so I could be wrong, but assuming a fine-tuned model is the same size as the base model, it would take me 90 minutes to upload the llama2 7B model (13.5 GB).
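
Quick sanity check on that math in Python (ignoring protocol overhead):

    # minutes to upload a file of size_gb gigabytes at mbps megabits/sec
    def upload_minutes(size_gb, mbps):
        return size_gb * 8000 / mbps / 60

    print(upload_minutes(1.0, 20))    # ~6.7 minutes per gigabyte
    print(upload_minutes(13.5, 20))   # ~90 minutes for llama2 7B at fp16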

If it were standard practice for many people, we might finally get symmetric cable. But most people aren't YouTubers. In my experience, people who aren't into tech usually have no idea their upload speeds are any slower than their download speeds, much less 25 times slower. They usually just let their phone upload their photos in the background, and it just works. Hell, I can't even find my upload speed on the consumer section of my ISP's site, only in a PDF for their business plans. Not even in legalese.


Spectrum? Lots of cable internet right now has upstream limitations, but fiber is making it to more homes and upstream limitations are (slowly) improving on cable.

As a society we have something like a 20% CAGR on upstream speeds over the past few years. It may get better even faster, because DOCSIS 4 is coming soon and is full duplex over the same channels.

But even at 20 Mbps, uploading a 70B model (well over 100 GB at fp16) might take you 12+ hours... after you spent many hours fine-tuning it. It's annoying but not an unmanageable problem.


In my country there’s a tendency to provide at most 50 Mbit/s up on consumer connections, but a few providers do offer more. I’m on an FTTH plan with 130 Mbit/s up now, and think this would suffice for most current use cases. There’s also the possibility of bundling multiple connections with enterprise-grade hardware; last time I looked that was a 1k one-time investment. I’m dubious as to whether that would actually boost upload speeds, though.


Same here, I'm on 600 Mbps up/down and I enjoy the up more than the down.


Which country? In Spain most of the broadband nowadays is fiber to the home and symmetric, I have one gigabit up/down. Pretty cheap too.


Yea, it takes a long time. I regularly upload 60 GB files on a 20 Mbps up residential connection. It usually takes 3-4 hours (maybe more, I usually start the upload when I'm done working for the day).

It's not that big of a deal as long as you don't sit there and stare at the progress bar the entire time.


> But most people aren't YouTubers. In my experience, people who aren't into tech usually have no idea their upload speeds are any slower than their download speeds, much less 25 times slower.

People who aren’t into tech aren’t trying to fine tune Llama2 and then upload it to a cloud machine


Be a lot cooler if they did.


It uses LoRA, so what you'd upload are just the adapter weights, which are much smaller than the full model.
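
Back-of-the-envelope on adapter size (assuming rank-16 adapters on the four attention projections of a 7B Llama-style model; the actual slowllama config may differ):

    # each adapted matrix gets two low-rank factors, A (d x r) and B (r x d)
    n_layers, d_model, rank = 32, 4096, 16
    n_targets = 4 * n_layers                  # q_proj, k_proj, v_proj, o_proj
    params = n_targets * 2 * d_model * rank
    print(params / 1e6)                       # ~16.8M trainable parameters
    print(params * 2 / 1e6)                   # ~34 MB at fp16, vs ~13,500 MB for the full 7B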


Fine-tune for local use?


> ...it offloads parts of model to SSD or main memory on both forward/backward passes

So... does this murder SSDs?
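
Back-of-the-envelope, with made-up numbers (I don't know how much slowllama actually writes per pass, and reads don't count against wear at all):

    # hypothetical wear math -- both inputs are assumptions, not measurements
    writes_per_iter_tb = 0.14     # suppose ~140 GB (fp16 70B) written each iteration
    drive_endurance_tbw = 600     # typical rating for a 1 TB consumer TLC drive
    print(drive_endurance_tbw / writes_per_iter_tb)   # ~4,300 iterations to rated wear-out

So probably not "murder", but it's real wear if the offload traffic is write-heavy.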


Having the SSD soldered to the mainboard usually doesn’t matter… except for this particular use case.

You could stick an SSD in a Thunderbolt enclosure and use that, but at that point you might as well rent a cloud GPU instance for a buck an hour.


> for a buck an hour.

Every decent GPU cloud I've seen is $3+/hour. Where do you get ~$1/hour?


Take a look at this great comment, which provides a list of affordable services. https://news.ycombinator.com/item?id=34358781

Basically, it depends a lot on what you mean by "decent". Your standards for performance might be a lot higher than those of the person you're replying to.


So I keep getting told "Macs don't use memory like Windows", "8GB is fine for everything on Mac" (with quotes from people who use it for Photoshop) and "the SSD is so fast that if you use virtual memory it's not noticeable anyway"... Is 8GB suitable for LLM use like this, and has anybody actually used it?


LLMs are notoriously RAM-hungry. It’s not fair to judge what a MacBook comes with against what LLaMa2 needs. That is not normal use of a MacBook.


Apple really shouldn't sell Macs with less than 16 GiB of RAM. The first set of M1s got some bad press because so many folks were plagued by pop-ups saying they were out of memory -- from heavy browser usage, not even intense workloads.


As someone with a base model M1 Air, I’ve never experienced this issue from normal workloads. The ONLY times I’ve exceeded memory capacity were doing expensive computations without chunking, and running minikube.

I’m neither a Node dev nor an AI person, though, so that probably helps.


Have a half-dozen Google Docs open at once and you could force an 8 GB machine to start swapping. Just saying.


This is only slightly hyperbolic, having now checked it. Each GDoc tab (in Safari) took ~600 MiB. What caused the swap was the other browser window with two Gmail tabs, each of which took over 1 GiB, which, combined with some other apps I had open, pushed it over.

I stand corrected. 8 GiB might be enough IFF you're not doing heavy multi-tasking during development, and you're not coding in JS or its offshoots.

I will say the comments about "the SSD is so fast you won't notice" are true from an end-user perspective, in that I experienced no visible lag in any app. This may not hold for the smaller-capacity M2 models, since they have slower SSDs than the base M1s.


> This is only slightly hyperbolic, after having checked it. Each GDoc tab (in Safari) took ~600 MiB.

It very much depends on the size of the document and how long it’s been open. I’ve actually had Google Docs use over 3 GB for a single document in Safari before.


Those are OS or app bugs. The way that dialog works isn't based just on how much RAM you have.


Macs manage memory and swap in a way that makes them feel much more responsive when you have a lot of common applications open at once (or something like a lot of browser tabs at once), but if you’re literally doing a single task that uses a given amount of ram, you need that much ram or you’ll slow way down.


8GB of memory is not currently suitable for LLM use.

A 7B model @ Q4 needs roughly 6.5GB of memory.

I have a 64GB M1. If my memory pressure is high and I start up a model that goes into swap, the machine becomes unbearably slow.
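
Rough math behind that figure, assuming ~4.5 bits/weight once quantization scales are counted, plus an fp16 KV cache at 4k context:

    weights_gb = 7e9 * 4.5 / 8 / 1e9               # ~3.9 GB of quantized weights
    kv_gb = 2 * 32 * 32 * 128 * 4096 * 2 / 1e9     # ~2.1 GB KV cache (K and V, 32 layers)
    print(weights_gb + kv_gb + 0.5)                # ~6.6 GB with some runtime overhead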


> A 7B model @ Q4 needs roughly 6.5GB of memory.

Mistral 7B q4 seems to use ~5 GB of VRAM

A few months ago I could not run Llama 7B q4 on my laptop's RTX 3070 (8 GB VRAM), but Mistral 7B runs very well and still leaves approx 2 GB of VRAM free, even with a big context.

Not sure if this is an optimization in the inference code or if Mistral 7B is just more memory-efficient.
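
If I had to guess, a big chunk of it is the KV cache: Mistral uses grouped-query attention (8 KV heads vs. Llama 2 7B's 32), so the cache is roughly 4x smaller. Sketch of that arithmetic, using the published config numbers (treat as approximate):

    # KV cache = 2 (K and V) * layers * kv_heads * head_dim * context * bytes
    def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx, bytes_per=2):
        return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per / 1e9

    print(kv_cache_gb(32, 32, 128, 4096))   # Llama 2 7B (MHA): ~2.1 GB
    print(kv_cache_gb(32, 8, 128, 4096))    # Mistral 7B (GQA): ~0.5 GB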


More than that for the 8K+ context people are running now.


8 GB on a Mac is better than 8 GB on Windows, but 16 GB is a whole lot better


Before I got a desktop, I used to run my laptop (16GB RAM + a 6GB Nvidia GPU) headless to just barely squeeze the 33B models onto the GPU + RAM, and play with it on another device.

...I am not an OSX person, but if you can boot headless, you can squeeze a 7B model on there with full context


8 GB on a MacBook is kinda rough. I regret it.


If you need a contiguous block of addressable RAM, there is nothing a Mac does differently: if a process needs more than an 8 GB block, you will need more than 8 GB of RAM.


I use an M1 Pro every day for work, and while it runs fantastically, I call bullshit on the 8GB claim.


I wonder the same thing; I have 64 GB on a ThinkPad from 3 years ago...


I love that I can now do an LLM fine-tune on my local MacBook. IMHO, this is why Facebook open-sourced Llama. Rockstar programmers like this guy are going to make it scale out on cheap hardware and give them an edge over even their well-funded competitors.


Please excuse my ignorance, but what would be the use case of fine tuning? I'm quite interested in this application of it for hobby projects, but I'm not sure what value I would be able to gain.


Fine-tuning changes how the model behaves: it updates the model weights, which changes the text the model generates.

Fine-tuning is how you can turn raw base models into something that you can chat with, and into something that can follow commands. Fine-tuning can also be used to control the style of speech the model uses, and what it will and won't be willing to discuss. To an extent, it can also teach the model to understand new things.
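
For a concrete picture, this is roughly what a LoRA fine-tune setup looks like with the Hugging Face peft library (not slowllama's internals, just the common recipe; model name and hyperparameters are illustrative):

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)   # freezes base weights, injects adapters
    model.print_trainable_parameters()    # a fraction of a percent of the total

From there, a normal training loop (or the transformers Trainer) on your dataset updates only the adapter weights.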


Is this how I would take a model like CodeLlama and feed it a load of code from my favorite programming language and its most useful packages?


Yes, exactly. Fine-tuning is a good way to impart this kind of knowledge to the model.


I'm hitting 3.9 tok/s with a context of 300 tokens on Android (Snapdragon 778G) via UserLAnd, and this is with an older, unoptimized build of llama.cpp


Side question: is there a benefit to doing this on a MacBook vs. a PC?


Yes, and it revolves around memory.

If you're running code on the CPU, Apple Silicon systems have significantly higher memory bandwidth than most x86 systems -- up to 800 GB/sec on M2 Ultra.

If you're running code on the GPU, an Apple Silicon system can be configured with up to 192 GB of unified memory, whereas most discrete graphics cards top out around 16-24 GB of VRAM. (There are a few larger compute-focused cards like Nvidia A100, but they're incredibly expensive.)
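
A crude way to see why the bandwidth matters: single-stream decoding has to stream every weight through memory once per generated token, so bandwidth sets a hard ceiling on speed (ignoring KV cache and compute):

    # rough upper bound on decode speed: bandwidth / bytes touched per token
    def max_tok_per_sec(params_b, bytes_per_param, bw_gb_s):
        return bw_gb_s / (params_b * bytes_per_param)

    print(max_tok_per_sec(70, 2, 800))   # 70B fp16 at 800 GB/s: ~5.7 tok/s ceiling
    print(max_tok_per_sec(7, 2, 100))    # 7B fp16 at ~100 GB/s DDR5: ~7 tok/s ceiling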


There’s no _single_ consumer or datacenter GPU currently breaking 100 GB of VRAM. In this regard, the Apple offering is pretty unique, but its compute capability is lagging.

Also the 800GB/s figure for memory bandwidth is impressive ... for a CPU. GPUs regularly hit 2TB/s or more.

TL;DR on Apple for ML: Bigger models, slower calculations.


Not quite: there are 128 GiB AMD Instinct parts, though the prices are very... enterprise: https://www.amd.com/en/products/server-accelerators/instinct...


Wow that's cool. I didn't know about it, thanks for sharing!

Side note, it looks like the AMD MI250 supports FP16, and the raw TFLOPS is ~15% higher than an Nvidia A100 80GB, with a significant advantage for AMD in terms of memory bandwidth. The price ranges for the two cards seem to roughly overlap.

It seems this card is a pretty good deal barring crazy software edge cases. @AMD, if you ever see this, I would be interested to test your card and publish a complete writeup. I'm currently fine-tuning LLMs on A100/H100 and have a good reference point.


Personally speaking I never realized how much fan noise and hot air coming from the machine distracted me until it wasn't there anymore.


Macs have a larger iGPU and faster RAM than pretty much any comparable PC, which are precisely the two things you want for LLMs.

Also, lots of people only own a Mac.

You'd have to be kinda crazy to buy a Mac explicitly for running/training genai though.


I don't think you want an iGPU for LLMs. A 4090 + RAM offload would be many times faster.


A 3090 + RAM offload is not very fast on my desktop. It's fine, but the speed hit is large.

And I was thinking regular laptops as a baseline vs the base M1/M2.


MacBooks don't have enough RAM to train without offloading to SSD which would be even slower.


From the article: ~25-30 min per iteration...



