What’s the use case here? The author notes doing inference from disk is too slow. Fine-tune a model and do nothing with it? Fine-tune it and then upload it to a cloud machine? I’m not sure how often I’d want to upload multi-gigabyte files from a consumer internet connection with narrow upload bandwidth.
I'm on the mid-tier plan available to me. 500 Mbps down, 20 Mbps up. It would take me, bare minimum, about 6.67 minutes to upload 1 gigabyte. I'll be the first to admit that I don't know much about AI/ML, so I could be wrong, but assuming a fine-tuned model is the same size as the base model, it would take me 90 minutes to upload the Llama 2 7B model (13.5 GB).
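As a sanity check on those numbers, the upload-time arithmetic is just size divided by link rate; a minimal Python sketch (assuming decimal gigabytes and the nominal 20 Mbps plan rate, ignoring protocol overhead):

```python
def upload_minutes(size_gb: float, up_mbps: float) -> float:
    """Time to push size_gb gigabytes through an up_mbps uplink."""
    bits = size_gb * 8e9              # 1 GB = 8e9 bits (decimal GB)
    seconds = bits / (up_mbps * 1e6)  # Mbps = 1e6 bits per second
    return seconds / 60

print(round(upload_minutes(1.0, 20), 1))   # ~6.7 minutes per GB
print(round(upload_minutes(13.5, 20), 1))  # ~90 minutes for a 13.5 GB model
```

Real-world transfers will be somewhat slower once TCP/TLS overhead and rate variability are factored in.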
If it were standard practice for many people, we might finally get symmetric cable. But most people aren't YouTubers. In my experience, people who aren't into tech usually have no idea their upload speeds are any slower than their download speeds, much less 25 times slower. They usually just let their phone upload their photos in the background, and it just works. Hell, I can't even find my upload speed on the consumer section of my ISP's site, only in a PDF for their business plans. Not even in legalese.
Spectrum? Lots of cable internet right now has upstream limitations, but fiber is making it to more homes and upstream limitations are (slowly) improving on cable.
As a society we have something like a 20% CAGR on upstream speeds over the past few years. It may get better even faster, because DOCSIS 4 is coming soon and is full duplex over the same channels.
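For context on what a 20% CAGR implies (the 20% figure is from the comment above; treat it as a rough assumption), compound growth gives a doubling time of just under four years:

```python
import math

cagr = 0.20  # assumed annual growth rate of upstream speeds
doubling_years = math.log(2) / math.log(1 + cagr)
print(round(doubling_years, 1))      # ~3.8 years to double

# e.g. a 20 Mbps uplink compounding for 5 years
print(round(20 * (1 + cagr) ** 5))   # ~50 Mbps
```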
But even at 20 Mbps, uploading a 70B model might take you 12 hours... after you spent many hours fine-tuning it. It's annoying but not an unmanageable problem.
In my country, consumer connections tend to top out at 50 Mbit/s up, though a few providers do offer more. I'm on an FTTH plan with 130 Mbit/s up now, and I think that would suffice for most current use cases. There's also the option of bonding multiple connections with enterprise-grade hardware; last time I looked, that was about a 1k one-time investment. I'm dubious as to whether it would actually boost upload speeds, though.
Yea, it takes a long time. I regularly upload 60 GB files on a 20 Mbps up residential connection. It usually takes 3-4 hours (maybe more, I usually start the upload when I'm done working for the day).
It's not that big of a deal as long as you don't sit there and stare at the progress bar the entire time.
> But most people aren't YouTubers. In my experience, people who aren't into tech usually have no idea their upload speeds are any slower than their download speeds, much less 25 times slower.
People who aren’t into tech aren’t trying to fine-tune Llama 2 and then upload it to a cloud machine.
So I keep getting told "Macs don't use memory like Windows", "8GB is fine for everything on Mac" (with quotes from people who use it for Photoshop) and "the SSD is so fast that if you use virtual memory it's not even noticeable"...
Is 8GB suitable for LLM use like this, and has anybody actually used it?
Apple really shouldn't sell Macs with less than 16GiB of RAM. The first set of M1s got some bad press because so many folks were plagued by pop-ups saying they were out of memory -- from heavy browser usage not even intense workloads.
As someone with a base model M1 Air, I’ve never experienced this issue from normal workloads. The ONLY times I’ve exceeded memory capacity were doing expensive computations without chunking, and running minikube.
I’m neither a Node dev nor an AI person, though, so that probably helps.
This is only slightly hyperbolic, having checked it myself. Each GDoc tab (in Safari) took ~600 MiB. What caused the swap was the other browser window with two Gmail tabs, each of which took over 1 GiB; combined with some other apps I had open, that tipped it over.
I stand corrected. 8 GiB might be enough IFF you're not doing heavy multi-tasking during development, and you're not coding in JS or its offshoots.
I will say that the claim "the SSD is so fast you won't notice" is true from an end-user perspective, in that I experienced no visible lag in any app. This may not hold for the smaller-capacity M2 models, since they have slower SSDs than the base M1s.
> This is only slightly hyperbolic, after having checked it. Each GDoc tab (in Safari) took ~600 MiB.
It very much depends on the size of the document and how long it’s been open. I’ve actually had a single GDoc exceed 3 GB in Safari before.
Macs manage memory and swap in a way that makes them feel much more responsive when you have a lot of common applications open at once (or something like a lot of browser tabs at once), but if you’re literally doing a single task that uses a given amount of ram, you need that much ram or you’ll slow way down.
A few months ago I could not run Llama 7B q4 on my laptop's RTX 3070 (8 GB VRAM), but Mistral 7B runs very well and still leaves approx 2 GB of VRAM free even with a big context.
Not sure if this is an optimization in inference algorithms or if Mistral 7B is just more efficient.
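A rough back-of-the-envelope suggests why a q4 7B fits with room to spare: the weights alone come to about 4 GB, leaving headroom for the KV cache and runtime overhead. A sketch, assuming ~4.5 bits per weight as a typical average for 4-bit formats once quantization scales are included (actual figures vary by format):

```python
def quantized_weight_gb(params_billions: float,
                        bits_per_weight: float = 4.5) -> float:
    """Approximate memory for quantized weights alone (no KV cache,
    no runtime overhead). bits_per_weight is an assumed average."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(round(quantized_weight_gb(7), 1))  # ~3.9 GB for a 7B model at ~q4
```

On an 8 GB card that leaves roughly 4 GB for context and overhead, which is consistent with the ~2 GB-free observation above.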
Before I got a desktop, I used to run my laptop (16GB RAM + a 6GB Nvidia GPU) headless to just barely squeeze the 33B models onto the GPU + RAM, and play with it on another device.
...I am not an OSX person, but if you can boot headless, you can squeeze a 7B on there with full context.
If you need a contiguous block of addressable RAM, there is nothing a Mac does differently: you will need more than 8 GB of RAM if a single process needs more than an 8 GB block.
I love that I can now do an LLM fine tune on my local Macbook. IMHO, this is why Facebook open sourced Llama. Rockstar programmers like this guy are going to make it scale out on cheap hardware and give them an edge over even their well funded competitors.
Please excuse my ignorance, but what would be the use case of fine tuning? I'm quite interested in this application of it for hobby projects, but I'm not sure what value I would be able to gain.
Fine-tuning changes how the model works, by updating model weights and changing the way it generates text.
Fine-tuning is how you turn a raw base model into something you can chat with, and into something that can follow instructions. Fine-tuning can also be used to control the style of speech the model uses, and what it will and won't be willing to discuss. To an extent, it can also teach the model to understand new things.
If you're running code on the CPU, Apple Silicon systems have significantly higher memory bandwidth than most x86 systems -- up to 800 GB/sec on M2 Ultra.
If you're running code on the GPU, an Apple Silicon system can be configured with up to 192 GB of unified memory, whereas most discrete graphics cards top out around 16-24 GB of VRAM. (There are a few larger compute-focused cards like Nvidia A100, but they're incredibly expensive.)
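One way to see what that bandwidth buys you: single-stream autoregressive decoding is usually memory-bandwidth-bound, so a common rough ceiling on tokens/sec is bandwidth divided by the bytes of weights read per token. A sketch using the 800 GB/s figure above and a ~13.5 GB fp16 7B model (both assumptions; real throughput will be lower):

```python
def decode_ceiling_tok_s(model_gb: float, bandwidth_gb_s: float) -> float:
    """Rough upper bound on decode speed when each generated token
    must stream all weights from memory (ignores compute and caches)."""
    return bandwidth_gb_s / model_gb

print(round(decode_ceiling_tok_s(13.5, 800)))  # ~59 tokens/s ceiling
```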
There’s no _single_ consumer or datacenter GPU that is currently breaking 100 GB of VRAM. In this regard, the Apple offering is pretty unique, but their compute capability is lagging.
Also, the 800 GB/s figure for memory bandwidth is impressive ... for a CPU. GPUs regularly hit 2 TB/s or more.
TL;DR on Apple for ML: Bigger models, slower calculations.
Wow that's cool. I didn't know about it, thanks for sharing!
Side note: it looks like the AMD MI250 supports FP16, and its raw TFLOPS figure is ~15% higher than an Nvidia A100 80GB's, with a significant advantage for AMD in memory bandwidth. The price ranges for the two cards seem to roughly overlap.
It seems this card is a pretty good deal barring crazy software edge cases. @AMD, if you ever see this, I would be interested to test your card and publish a complete writeup. I'm currently fine-tuning LLMs on A100/H100 and have a good reference point.
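For reference, plugging in the commonly quoted spec-sheet figures (MI250: ~362 FP16 matrix TFLOPS, ~3.2 TB/s HBM2e; A100 80GB SXM: ~312 FP16 tensor TFLOPS, ~2.0 TB/s — these are assumptions from public datasheets, and real-workload numbers differ):

```python
mi250_tflops, mi250_tb_s = 362.0, 3.2  # FP16 matrix TFLOPS, memory TB/s (spec sheet)
a100_tflops, a100_tb_s = 312.0, 2.0    # FP16 tensor TFLOPS, memory TB/s (spec sheet)

print(round((mi250_tflops / a100_tflops - 1) * 100))  # ~16% more raw FP16
print(round((mi250_tb_s / a100_tb_s - 1) * 100))      # ~60% more bandwidth
```

On paper that roughly matches the ~15% TFLOPS gap mentioned above; whether it holds in practice depends heavily on the software stack.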