I love the idea, that's the future. However you should be aware that the explanation of the second law of thermodynamics generated by the LLM you used in your App Store screenshot is wrong: the LLM has it backwards. Energy transfers from less stable states to more stable states, not the reverse. (I use LLMs for science education apps like https://apps.apple.com/fr/app/explayn-learn-chemistry/id6448..., so I am quite used to spotting that kind of error in LLM outputs...)
Local, app-embedded, and purpose-built targeted experts are clearly the future in my mind for a variety of reasons. Looking at TPUs in Android devices and the Neural Engine in Apple hardware, it's pretty clear.
Xcode already has ML tooling (Create ML), for example, that can not only embed and integrate models in apps but also fine-tune them, and so on. It's obvious to me that at some point most apps will have models embedded in the app (or on the device) for specific purposes.
No AI can compare to humans, and even we specialize. You wouldn't hire a plumber to perform brain surgery, and you wouldn't hire a neurosurgeon to fix your toilet. Mixture-of-experts AI models are a thing, of course, but when we look at how we primarily interact with technology and the functionality it provides, it's generally pretty well siloed to specific purposes.
A small model trained/tuned for a specific domain and context, working on your on-device data, would likely do nearly as well as, if not better than, even ChatGPT for some applications. Think of the next version of device keyboards doing RAG + LLM over your text messages to generate replies. Stack that with speech-to-text, vision, multimodal models, and who knows what else, and yeah, interesting.
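To make the keyboard idea concrete, here's a toy sketch of that RAG + LLM loop in Python — the embed() and generate() helpers are hypothetical stand-ins for small on-device models (e.g. something served via Core ML or a llama.cpp-style runtime), not any real API:

    # Toy illustration of "RAG + LLM over your own messages".
    # embed() and generate() are hypothetical placeholders for on-device models.
    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Stand-in for a small on-device sentence-embedding model.
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        return rng.standard_normal(384)

    def generate(prompt: str) -> str:
        # Stand-in for a small on-device LLM.
        return "(draft reply generated locally)"

    messages = ["Dinner at 7 still on?", "Can you send the invoice?", "Happy birthday!!"]
    incoming = "Are we still meeting tonight?"

    # Retrieve: rank past messages by cosine similarity to the incoming text.
    q = embed(incoming)
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(messages, key=lambda m: cosine(embed(m), q), reverse=True)
    context = "\n".join(ranked[:2])

    # Augment + generate: the retrieved snippets ground the local model's reply.
    reply = generate(f"Context from my messages:\n{context}\n\nReply to: {incoming}")
    print(reply)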
Throw in the automatic scaling, latency, and privacy and the wins really stack up.
Some random app developer can integrate a model in their application and scale higher with better performance than ChatGPT without setting money on fire.
> Local, app-embedded, and purpose-built targeted experts are clearly the future in my mind for a variety of reasons. Looking at TPUs in Android devices and the Neural Engine in Apple hardware, it's pretty clear.
I think that’s only true for delay-intolerant or privacy-focused features. For most situations, a remote model running on an external server will outperform a local model. There is no thermal, battery or memory headroom for the local model to ever do better. The cost is a mere hundred milliseconds of delay at most.
I expect most models triggered on consumer devices to run remotely, with a degraded local service option in case of connection problems.
Snapchat filters, iPhone photo processing/speech to text/always-on Hey Siri/OCR/object detection and segmentation - there are countless applications and functionality doing this on device today (and for years). For something like the RAG approach I mentioned, syncing and coordinating your local content with a remote API would be more taxing on the battery, just in terms of the radio, than what we already see from on-device neural engines and TPUs as leveraged by the functionality I described.
These applications would also likely be very upload heavy (photo/video inference - massive upload, tiny JSON response) which could very likely end up taxing cell networks further. Even RAG is thousands of tokens in and a few hundred out (in most cases).
There's also the issue of Nvidia GPUs having > 1 yr lead times and the exhaustion of GPUs available from various cloud providers. LLMs especially use tremendous resources for training and this increase is leading to more and more contention for available GPU resources. People are going to be looking more and more to save the clouds and big GPUs for what you really need to do there - big training.
Plus, not everyone can burn $1m/day like ChatGPT.
If AI keeps expanding and eating more and more functionality the remote-first approach just isn't sustainable.
There will likely always be some sort of blend (with serious heavy lifting being cloud, of course) but it's going to shift more and more to local and on-device. There's just no other way.
> Snapchat filters, iPhone photo processing/speech to text/always-on Hey Siri/OCR/object detection and segmentation - there are countless applications and functionality doing this on device today (and for years)
But those are peanuts compared to what will be possible in the (near) future. You think content-aware fill is neat? Wait until you can zoom out of a photo 50% or completely change the angle.
That’ll cost gobs of processing power and thus time and battery, much more than a 20MB burst transfer of a photo and the back-synced modifications.
> If AI keeps expanding and eating more and more functionality the remote-first approach just isn't sustainable.
It’ll definitely create a large moat around companies with lots of money or extremely efficient proprietary models.
> That’ll cost gobs of processing power and thus time and battery
The exact same thing was said about the functionality we're describing, yet there it is. Imagine describing that to someone in 2010 who's already complaining about iPhone battery life. The response would be a carbon copy of yours.
In the five years from the iPhone 8 to the iPhone 14, TOPS on the Neural Engine went from 0.6 to 17[0]. The iPhone 15 more than doubled that and stands at 35 TOPS[1]. Battery life is better than ever, and that's a 58x gain just in the Neural Engine, not even counting GPU, CPU, performance cores, etc.
Over that same period of time Nvidia GPUs only increased about 9x[2] - they're pushing the fundamentals much harder as a law of large numbers-ish issue.
So yeah, I won't have to wait long for zooming out of a photo 50%, completely changing the angle, or who knows what else to be done locally. In fact, for these use cases, increasingly advanced optics, processing, outside-visual-range sensors, etc., make my point even stronger - even more data going to the cloud when the device is best suited to be doing the work anyway.
Look at it this way - Apple sold over 97 million iPhones in 2023. Assuming the lower averages that's 1,649,000,000 combined TOPS out there.
Cloud providers benefit from optimization and inherent oversubscription but by comparison Nvidia sold somewhere around 500,000,000 TFLOPS worth of H100s last year.
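Back-of-the-envelope, for anyone who wants to check the figures (the iPhone numbers are from above; the H100 unit count and per-card TFLOPS are my own rough assumptions, picked only to reproduce the quoted total):

    # Quick check of the figures above. iPhone numbers come from the comment;
    # the H100 unit count (~500k) and ~1000 TFLOPS/card are rough assumptions.
    iphones_sold_2023 = 97_000_000
    tops_per_iphone = 17                      # lower-end average used above
    fleet_tops = iphones_sold_2023 * tops_per_iphone
    print(f"iPhone fleet: {fleet_tops:,} TOPS")              # 1,649,000,000

    h100_units = 500_000                      # assumption
    tflops_per_h100 = 1_000                   # assumption, dense FP16-ish
    print(f"H100s sold:  {h100_units * tflops_per_h100:,} TFLOPS")  # 500,000,000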
Mainframe and serial terminal to desktop to thin client and terminal server - around and around we go.
Stability is actually defined by having a lower energy level. That explains why energy can only flow from a less stable system to a more stable system: the more stable system does not have available energy to give.
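In symbols (my shorthand, not the parent's): energy flows downhill, and for chemistry at constant temperature and pressure the usual spontaneity criterion is the Gibbs free energy:

    % Energy flows "downhill": a transfer is spontaneous only if the final,
    % more stable configuration sits at lower energy than the initial one.
    \Delta E \;=\; E_{\mathrm{final}} - E_{\mathrm{initial}} \;<\; 0
    % At constant temperature and pressure (chemistry's usual setting):
    \Delta G \;=\; \Delta H - T\,\Delta S \;<\; 0 \;\;\Longrightarrow\;\; \text{spontaneous}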
EDIT: Attempting to converse with any Q4_K_M 7B parameter model on a 15 Pro Max... the phone just melts down. It feels like it is producing about one token per minute. MLC-Chat can handle 7B parameter models just fine even on a 14 Pro Max, which has less RAM, so I think there is an issue here.
EDIT 2: Even using StableLM, I am experiencing a total crash of the app fairly consistently if I chat in one conversation, then start a new conversation and try to chat in that. On a related note, since chat history is saved... I don't think it's necessary to have a confirmation prompt if the user clicks the "new chat" shortcut in the top right of a chat.
-----
That does seem much nicer than MLC Chat. I really like the selection of models and saving of conversations.
Microsoft recently re-licensed Phi-2 to be MIT instead of non-commercial, so I would love to see that in the list of models. Similarly, there is a Dolphin-Phi fine tune.
The topic of discussion here is Mistral-7B v0.2, which is also missing from the model list, unfortunately. There are a few Mistral fine tunes in the list, but obviously not the same thing.
I also wish I could enable performance metrics to see how many tokens/sec the model was running at after each message, and to see how much RAM is being used.
Wow, thanks so much for taking the time to test it out and share such great feedback!
Thrilled about all those developments! More model options as well as link-based GGUF downloads on the way.
On the 7b models: I’m very sorry for the poor experience. I wouldn’t recommend 7b over Q2_K at the moment, unless you’re on a 16GB iPad (or an Apple Silicon Mac!). This needs to be much clearer, as you observed the consequences can be severe. The larger models, and even 3b Q6_K, can be crash-prone due to memory pressure. Will work on improving handling of low-level out-of-memory errors very soon.
Will also investigate the StableLM crashes, I’m sorry about that! Hopefully Testflight recorded a trace. Just speculating, it may be a similar issue to the larger models, due to the higher-fidelity quant (Q6_K) combined with the context length eventually running out of RAM. Could you give the Q4_K_M a shot? I heard something similar from a friend yesterday, I’m curious if you have a better time with that — perhaps that’s a more sensible default.
Re: the overly-protective new chat alert, I agree, thanks for the suggestion. I’ll incorporate that into the next build. Can I credit you? Let me know how you’d like for me to refer to you, and I’d be happy to.
Finally, please feel free to email me any further feedback, and thanks again for your time and consideration!
I just checked and MLC Chat is running the 3-bit quantized version of Mistral-7B. It works fine on the 14 Pro Max (6GB RAM) without crashing, and is able to stay resident in memory on the 15 Pro Max (8GB RAM) when switching with another not-too-heavy app. 2-bit quantization just feels like a step too far, but I’ll give it a try.
Regarding credit, I definitely don’t need any. Just happy to see someone working on a better LLM app!
FYI, just submitted a new update for review with a few small but hopefully noticeable changes, thanks in no small part to your feedback:
1. StableLM Zephyr 3b Q4_K_M is now the built-in model, replacing the Q6_K variant.
2. More aggressive RAM headroom calculation, with forced fallback to CPU rather than failing to load or crashing.
3. New status indicator for Metal when model is loaded (filled bolt for enabled, vs slashed bolt for disabled.)
4. Metal will now also be enabled for devices with 4GB RAM or less, but only when the selected model can comfortably fit in RAM. Previously, only devices with at least 6GB had Metal enabled.
The fallback does seem to work! Although the 4-bit 7B models only run at 1 token every several seconds.
I still wish Phi-2, Dolphin Phi-2, and TinyLlama-Chat-v1.0 were available, but I understand you have plans to make it easier to download any model in the future.
edit: I can't reply to you below: Do you have the right app? There's no TestFlight, just an App Store link. If it's ChatOnMac then it should have a dropdown at the top of the chat room to select a model. If it's empty or otherwise bugged out, please let me know what you see in the top menu. It filters the available model presets based on how much RAM you have available, so let me know what specific device you have and I can look into it. Thank you.
The model presets are also configurable by forking the bot and loading your own via GitHub (bots run inside sandboxed hidden webviews inside the app). But this is not ergonomically friendly just yet.
I was excited when I saw this, but I'm having trouble with it (and it looks like I'm not the only one). As others have pointed out, the download link on your site does open TestFlight. I've since deleted that version and installed the official version from the AppStore after revisiting this thread in search of answers.
I now have the full version installed on my iPhone 15 pro, and I have added my OpenAI key, but none of the models I've selected (3.5 Turbo, 4, 4 Turbo) work. My messages in the chat have a red exclamation next to them which opens an error message stating 'Load failed' when clicked. If I click 'Retry Message' the entire app crashes.
Apologies for the rough edges and bad experience - I’ve just soft-launched without announcement until this post. I will have a hotfix up soon. Thanks for the report.
> Do you have the right app? There's no TestFlight, just an App Store link
On chatonmac.com, the "Download on the App Store" button does not link to the App Store for me either - I get a modal titled "Public Beta & Launch Day News" with "Join the TestFlight Beta" and "Launch Day Newsletter Signup Form".
Hello, I like your app and the ethics you push forward. Do you plan to add the ability to request DALL-E 3 images within the chat? I’ve yet to find an app which does that and lets me use my own API key.
In your experience, how could these local LLMs become snappier than using streamed API calls? How far are they if not? How soon do you guess they’ll get there?
I understand the motivation includes factors other than performance, I’m just curious about performance as it applies to UX.
Honestly I think being able to run any kind of LLM on a phone is a miracle. I'm astonished at how good (and how fast) Mistral 7B runs under MLC Chat on iOS, considering the constraints of the device.
I don't use it as more than a cool demo though, because the large hosted LLMs (I tend to mostly use GPT-4) are massively more powerful.
But... I'm still intrigued at the idea of a local, slow LLM on my phone enhanced with function calling capabilities, and maybe usable for RAG against private data.
The rate of improvement in these smaller models over the past 6 months has been incredible. We may well find useful applications for them even despite their weaknesses compared to GPT-4 etc.
What does snappier even mean in this context? The latency from connecting to a server over most network connections isn’t really noticeable when talking about text generation. If the server with a beefy datacenter-class GPU were running the same Mistral you can run on your phone, it would be spitting out hundreds of tokens per second. Most responses would appear on your screen before you blink.
There is no expectation that phones will ever be comparable in performance for LLMs.
Mistral runs at a decent clip on phones, but we’re talking like 11 tokens per second, not hundreds of tokens per second.
Server-based models tend to be only slightly faster than Mistral on my phone because they’re usually running much larger, much more accurate/useful models. Models which currently can’t fit onto phones.
Running models locally is not motivated by performance, except if you’re in places without reliable internet.
These data center targeted GPUs can only output that many tokens per second for large batches. These tokens are shared between hundreds or even thousands of users concurrently accessing the same server.
That’s why, despite these GPUs delivering very high throughput in tokens/second, responses do not appear instantly, and individual users observe non-trivial latency.
Another interesting consequence: running these ML models with batch size = 1 (when running on end-user computers or phones) is practically guaranteed to bottleneck on memory. Compute performance and tensor cores are irrelevant for this use case; the only number which matters is memory bandwidth.
For example, I’ve tested my Mistral implementation on a desktop with an Nvidia 1080 Ti versus a laptop with a Radeon Vega 7 inside a Ryzen 5 5600U. The performance difference between them is close to 10x because of memory: 484 GB/second for the GDDR5X in the desktop versus 50 GB/second for the dual-channel DDR4-3200 in the laptop. This is despite theoretical compute performance differing only by a factor of 6.6; the numbers are 10.6 versus 1.6 TFlops.
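That bottleneck gives a handy rule of thumb: at batch size 1, every generated token has to stream essentially all the weights through memory once, so tokens/second is capped at roughly bandwidth divided by model size. A quick sketch with the numbers above (the ~3.8 GB figure for a 4-bit 7B model is my assumption, not from the comment):

    # Rough ceiling for batch-size-1 generation: every token must read ~all the
    # weights once, so tokens/s <= memory bandwidth / bytes of weights touched.
    def max_tokens_per_s(bandwidth_gb_s: float, model_gb: float) -> float:
        return bandwidth_gb_s / model_gb

    for name, bw in [("GTX 1080 Ti (GDDR5X)", 484), ("Ryzen 5 5600U (DDR4-3200)", 50)]:
        print(f"{name}: ~{max_tokens_per_s(bw, 3.8):.0f} tok/s upper bound")
    # The ~10x bandwidth gap shows up directly as a ~10x tokens/s gap,
    # even though the compute gap is only ~6.6x.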
> These data center targeted GPUs can only output that many tokens per second for large batches.
No… my RTX 3090 can output 130 tokens per second with Mistral on batch size 1. A more powerful GPU (with faster memory) should easily be able to crack 200 tokens per second at batch size 1 with Mistral.
At larger batch sizes, the token rate would be enormous.
Microsoft’s high performing Phi-2 model breaks 200 tokens per second on batch size 1 on my RTX 3090. TinyLlama-1.1B is 350 tokens per second, though its usefulness may be questionable.
We’re just used to datacenter GPUs being used for much larger models, which are much slower, and cannot fit on today’s phones.
I wonder are you using a quantized version of Mistral? An Nvidia 3090 has 936 GB/second of memory bandwidth, so 130 tokens/second ≈ 7.2 GB per token. In the original 16-bit format, the model takes about 13GB.
Anyway, while these datacenter servers can deliver these speeds for a single session, they don’t do that because large batches result in much higher combined throughput.
> I wonder are you using a quantized version of Mistral?
Yes, we’re comparing phone performance versus datacenter GPUs. That is the discussion point I was responding to originally. That person appeared to be asking when phones are going to be faster than datacenters at running these models. Phones are not running un-quantized 7B models. I was using the 4-bit quantized models, which are close to what phones would be able to run, and a very good balance of accuracy vs speed.
> Anyway, while these datacenter servers can deliver these speeds for a single session, they don’t do that because large batches result in much higher combined throughput.
I don’t agree… batching will increase latency slightly, but it shouldn’t affect throughput for a single session much if it is done correctly. I admit it probably will have some effect, of course. The point of batching is to make use of the unused compute resources, balancing compute vs memory bandwidth better. You should still be running through the layers as fast as memory bandwidth allows, not stalling on compute by making the batch size too large. Right?
We don’t see these speeds because datacenter GPUs are running much larger models, as I have said repeatedly. Even GPT-3.5 Turbo is huge by comparison, since it is believed to be 20B parameters. It would run at about a third of the speed of Mistral. But, GPT-4 is where things get really useful, and no one knows (publicly) just how huge that is. It is definitely a lot slower than GPT-3.5, which in turn is a lot slower than Mistral.
There are other interesting graphs there; they also measured the latency. They found a very strong dependency between batch size and latency, both for the first token (i.e. pre-fill) and the time between subsequent tokens. Note how batch size = 40 delivers the best throughput in tokens/second for the server, yet the first output token takes almost 4 seconds to generate - probably too slow for an interactive chat.
BTW, I used development tools in the browser to measure latency for the free ChatGPT 3.5, and got about 900 milliseconds till the first token. OpenAI probably balanced throughput versus latency very carefully because their user base is large, and that balance directly affects their costs.
The chart you pointed out is very interesting, but it largely supports my point.
The blue line is easiest to read, so let’s look at how the tokens/sec scale for a single user session as the batch size increases. It starts out at about 100 tokens/s for 5 users = 20 tokens/s/user. At the next point, it is about 19t/s/u. Beyond this point, we start losing some ground, but even by the final data point, it is still over 11t/s/u.
The throughput is affected by less than 2x even with the most unreasonably large batch size. (Unreasonable, because the time to first token is unacceptable for an interactive chat, as you pointed out.)
But, with a batch size that is balanced appropriately, the throughput for a single user session is effectively unchanged whether the service is batching at N=3 or N=10. (Or presumably N=1, but the chart doesn’t include that.) The time to first token is also a reasonable 1 second delay, which is similar to what OpenAI is providing in your testing.
So, with the right batching balance, batching increases the total throughput of the server, but does not affect the throughput or latency for any individual session very much. It does have some impact, of course. Model size and quantization seem to have a much larger impact than batching, from an end user standpoint.
I don't think running raw llama.cpp under termux in a shell on your phone, after downloading and compiling it from scratch, is really comparable to 'I made an app'.
What we're seeing here might be a classic case of iOS Freedom Choking Syndrome: when a device's lack of freedom spreads to its owner and chokes cerebral circulation.
I see the effort required to create the little app, but inference via llama.cpp or Core ML is trivial and the models are open weights, so it makes more sense to have a free app for this: most of the value is in the LLM, which is free.
I think there is some cost associated with iPhone app development ($100-$300 plus submission costs), as opposed to Android, when it comes to publishing, so it seems fair enough for an individual to charge a dollar or two to recoup that.
I'd argue in this space besides the model weights, a lot of the value comes from a nice, not-too-fancy but nevertheless intuitive and delightful UI. I mean I've used the free MLC Chat app which runs Mistral 7B fine, and because it's free, I have very low expectations of its UI design. If someone is making a new app with a nicer UI, I really don't mind paying a buck or two.
That makes sense, but if that's the case I would like to see a bit more polish. For instance:
1. Ability to forward messages to app.
2. Ability to run the same questions to the different LLMs installed.
3. Some updated list of GGUF files I can download with a description of the model highlight.
4. Advanced things like checking token perplexity to identify the parts of the chat the LLM is most unsure about and highlighting them?
I could continue, because it's full of obvious things like that. Put in this effort, and I'll pay you $10 for the app, not $2. But if it's not just barebones but genuinely very low effort, it seems it's not going anywhere, since free apps with the exact same capabilities will emerge and will not be so different.
TL;DR: No, nearly all these apps will use GPU (via Metal), or CPU, not Neural Engine (ANE).
Why? I suggest a few main reasons:
1) No Neural Engine API
2) CoreML has challenges modeling LLMs efficiently right now.
3) Not Enough Benefit (For the Cost... Yet!)
This is my best understanding based on my own work and research for a local LLM iOS app. Read on for more in-depth justifications of each point!
---
1) No Neural Engine API
- There is no developer API to use the Neural Engine programmatically, so CoreML is the only way to be able to use it.
2) CoreML has challenges modeling LLMs efficiently right now.
- Its most-optimized use cases seem tailored for image models, as it works best with fixed input lengths[1][2], which are fairly limiting for general language modeling (are all prompts, sentences and paragraphs, the same number of tokens? do you want to pad all your inputs?).
- CoreML features limited support for the leading approaches for compressing LLMs (quantization, whether weights-only or activation-aware). Falcon-7b-instruct (fp32) in CoreML is 27.7GB [3], Llama-2-chat (fp16) is 13.5GB [4] — neither will fit in memory on any currently shipping iPhone. They'd only barely fit on the newest, highest-end iPad Pros.
- HuggingFace‘s swift-transformers[5] is a CoreML-focused library under active development to eventually help developers with many of these problems, in addition to an `exporters` cli tool[6] that wraps Apple's `coremltools` for converting PyTorch or other models to CoreML.
3) Not Enough Benefit (For the Cost... Yet!)
- ANE & GPU (Metal) have access to the same unified memory. They are both subject to the same restrictions on background execution (you simply can't use them in the background, or your app is killed[7]).
- So the main benefit from unlocking the ANE would be multitasking: running an ML task in parallel with non-ML tasks that might also require the GPU: e.g. SwiftUI Metal Shaders, background audio processing (shoutout Overcast!), screen recording/sharing, etc. Absolutely worthwhile to achieve, but for the significant work required and the lack of ecosystem currently around CoreML for LLMs specifically, the benefits become less clear.
- Apple's hot new ML library, MLX, only uses Metal for GPU[8], just like Llama.cpp. More nuanced differences arise on closer inspection related to MLX's focus on unified memory optimizations. So perhaps we can squeeze out some performance from unified memory in Llama.cpp, but CoreML will be the only way to unlock ANE, which is lower priority according to lead maintainer Georgi Gerganov as of late this past summer[9], likely for many of the reasons enumerated above.
I've learned most of this while working on my own private LLM inference app, cnvrs[10] — would love to hear your feedback or thoughts!
Conceptually, to the best of my understanding, nothing too serious; perhaps the inefficiency of processing a larger input than necessary?
Practically, a few things:
If you want to have your cake & eat it too, they recommend Enumerated Shapes[1] in their coremltools docs, where CoreML precompiles up to 128 (!) variants of input shapes, but again this is fairly limiting (1 tok, 2 tok, 3 tok... up to 128 token prompts.. maybe you enforce a minimum, say 80 tokens to account for a system prompt, so up to 200 tokens, but... still pretty short). But this is only compatible with CPU inference, so that reduces its appeal.
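For anyone curious what that looks like in practice, here's a rough, untested sketch of an EnumeratedShapes conversion with coremltools — the TinyLM module is a toy stand-in (an embedding plus a mean-pool), not a real language model, and the shape list is arbitrary:

    # Sketch: converting a traced PyTorch toy model to Core ML with a fixed set
    # of allowed sequence lengths via EnumeratedShapes. Illustrative only.
    import numpy as np
    import torch
    import coremltools as ct

    class TinyLM(torch.nn.Module):
        # Toy stand-in: just enough structure to demonstrate the conversion path.
        def __init__(self):
            super().__init__()
            self.emb = torch.nn.Embedding(32000, 64)
        def forward(self, input_ids):
            return self.emb(input_ids).mean(dim=1)

    traced = torch.jit.trace(TinyLM().eval(), torch.zeros(1, 64, dtype=torch.int64))

    # Up to 128 enumerated shapes are allowed; each becomes a precompiled variant.
    seq_lengths = [32, 64, 96, 128]
    shapes = ct.EnumeratedShapes(shapes=[[1, n] for n in seq_lengths], default=[1, 64])

    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(name="input_ids", shape=shapes, dtype=np.int32)],
        minimum_deployment_target=ct.target.iOS16,
    )
    mlmodel.save("TinyLM.mlpackage")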
It seems like its current state was designed for text embedding models, where you normalize input length by chunking (often 128 or 256 tokens) and operate on the chunks — and indeed, that’s the only text-based CoreML model that Apple ships today, a Bert embedding model tuned for Q&A[2], not an LLM.
You could use a fixed input length that’s fairly large; I haven’t experimented with it since I grasped the memory requirements, but from what I gather from HuggingFace’s announcement blog post[3], it seems that is what they do with swift-transformers & their CoreML conversions, handling the details for you[4][5]. I haven’t carefully investigated the implementation, but I’m curious to learn more!
You can be sure that no one is more aware of all this than Apple — they published "Deploying Transformers on the Apple Neural Engine" in June 2022[6]. I look forward to seeing what they cook up for developers at WWDC this year!
---
[1] "Use `EnumeratedShapes` for best performance. During compilation the model can be optimized on the device for the finite set of input shapes. You can provide up to 128 different shapes." https://apple.github.io/coremltools/docs-guides/source/flexi...
[5] `use_flexible_shapes` "When True, inputs are allowed to use sequence lengths of `1` up to `maxSequenceLength`. Unfortunately, this currently prevents the model from running on GPU or the Neural Engine. We default to `False`, but this can be overridden in custom configurations." https://github.com/huggingface/exporters/pull/37/files#diff-...
Oh man I’m a big fan, swyx!! Latent Space & AI.engineer are fantastic resources to the community. Thank you for the kind words & the prompt!
It’s still early days, but at a high level, I have a few goals:
- expand accessibility and increase awareness of the power & viability of small models — the scene can be quite impenetrable for many!
- provide an easy-to-use, attractive, efficient app that’s a good platform citizen, taking full advantage of Apple’s powerful device capabilities;
- empower more people to protect their private conversation data, which has material value to large AI companies;
- incentivize more experimentation, training & fine-tuning efforts focused on small, privately-runnable models.
I’d love to one day become your habitual ChatGPT alternative, as high a bar as that may be.
I have some exciting ideas, from enabling a user generated public gallery of characters; to expanding into multimodal use cases, like images & speech; composing larger workflows on top of LLMs, similar to Shortcuts; grounding open models against web search indices for factuality; and further out, more speculative ideas, including exposing tools like JavaScriptCore to models as a tool, like Python in ChatGPT’s code interpreter.
But I’m sure you’ve also given a lot of thought to the future of AI on device with smol — what are some dreams you have for truly private AI that’s always with you?
I have a 2020 16in MacBook Pro. I think it's the last generation of Intel chips. I've been struggling to get some of the LLM models like Mixtral to run on it.
I hate the idea of needing to buy another $3k laptop less than 4 years after spending that much on my current machine. But if I want to get serious about developing non-chatgpt services, do I need a new M2 or M3 chip to get this stuff running locally?
We should be happy that compute is once again improving and machines are getting outdated rapidly. Which is better - a world where your laptop is competitive for 5+ years but everything stays the same? Or one where entire new realms of advancement open up every 18 months?
It’s a no contest option 2 for me.
Just use llama.cpp with any of the available UIs. It will be usable with 4-bit quantization on CPU. You can use any of the “Q4_K_M” GGUF models that TheBloke puts out on Hugging Face.
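If you'd rather drive llama.cpp from a script than through one of the UIs, a minimal sketch with the llama-cpp-python bindings looks roughly like this (the model filename is a placeholder for whichever Q4_K_M GGUF you downloaded):

    # Minimal llama.cpp usage via the llama-cpp-python bindings (a sketch; the
    # model path is a placeholder).
    from llama_cpp import Llama

    llm = Llama(
        model_path="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
        n_ctx=4096,      # context window
        n_threads=8,     # tune to your CPU; AVX/AVX2 kernels are used on Intel
    )

    out = llm("[INST] Explain what a GGUF file is in one sentence. [/INST]",
              max_tokens=128, temperature=0.7)
    print(out["choices"][0]["text"])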
I'd suggest using a cloud VM with a GPU attached. For normal stuff like LLM inference, I just rent an instance with a small (cheap) GPU. But when I need to do something more exotic like train an image model from scratch, I can temporarily spin up a cluster that has high-end expensive A100s. This way I don't have to invest in expensive hardware like an M3 that can still only do a small part of the full range.
You can do a lot with either a VM instance with a GPU or within Google Colab. If you are just starting out and doing this stuff mostly a few hours a week, I'd recommend going that way for a while.
If you want to run local, I’d get an M2 with 64GB of RAM. That will enable you to run 30b models and Mixtral 8x7B. You need around 50GB to run those at 5/6-bit quant.
I’m getting about 20 tokens/second on my 64GB M2 MBP with the Mixtral Q5_K_M GGUF in llama.cpp via text-generation-webui, with 35(?) layers being sent to Metal for acceleration.
I’m really pleased with the performance compared to my dual-3090 desktop rig; the MBP is actually faster.
Data point: my MacBook Pro 16" with the M3 Max (64GB) runs 34b model inference about as fast (or slightly faster) as ChatGPT runs GPT-4.
I am now running phind-codellama:34b-v2-q8_0 through ollama and the experience is very good.
All that said, though, every model I tried couldn't hold a candle to GPT-4: they all produce crappy results, aren't good at translation, and can't really do much for me. They are toys, I go "ooh" and "aah" over them, then realize they aren't that useful and go back to using GPT-4.
Perhaps 34B is still not enough to get anything reasonable.
Ollama (https://ollama.ai/) is a popular choice for running local LLM models and should work fine on Intel. It wraps llama.cpp, so it shouldn't require an M2/M3.
On your CPU, you should be able to leverage the same AVX acceleration used on Linux and Windows machines. It's not going to make any GPU owners envious, but it might be enough to keep you satisfied with your current hardware.
It runs faster and cooler than the unaccelerated alternative. Probably cooler than my 3070 too; my laptop sat at ~50C when using AVX to generate Stable Diffusion Turbo images.
Does your mac support an external GPU? A mid to high end nvidia card may or may not outperform the M3 GPU at a lower or similar price. You can also stick it in a PC or resell it separately.
My 64GB M2 MBP is faster running inference than my dual-3090 desktop rig, and at 64GB of unified memory it can hold slightly bigger models than the 48GB of VRAM in the desktop. The performance of the M2/M3 with a big unified memory is very impressive. Not much difference between M2/M3 though, if all other things are the same.
I’m intrigued and currently downloading this app. Love the idea of having offline direct access to this model. One small-ish thing though: Looks like the URL for the privacy policy (http://opusnoma.com/privacy) linked from the App Store page goes nowhere. Actually, opusnoma.com is likewise offline.
I've had a successful offline LLM app[1] on the App Store since June, last year. Works on all iPhones since iPhone 11 and ships with a 3B RedPajama Chat model and has an optional download for 7B Llama 2 based model on newer iPhones and Apple Silicon iPads. I'm currently working on an update to bring more 3B and 7B models to the iOS app.
Are there any models out there that don’t come trained or tweaked or system prompted into somebody else’s idea of ethical or professional conduct? I tested out a bunch of these apps and asked them to write an explicit story to see if they would, and despite this being entirely legal, none would do so. Are we entering some new Orwellian era?
Are these LLMs you can run locally giving answers deterministically just as with, say, StableDiffusion? In StableDiffusion if you reuse the exact same version of SD / model and same query and seed, you always get the same result (at least I think so).
Even with Stable Diffusion, determinism is “best effort” - there are flags you can set in Torch to make it more deterministic at a performance cost, but it’s explicitly disclaimed.
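For reference, the standard PyTorch knobs are along these lines (whether a given Stable Diffusion frontend actually exposes them is another matter):

    # Standard PyTorch reproducibility settings (a sketch; a given Stable
    # Diffusion frontend may or may not expose these).
    import torch

    torch.manual_seed(1234)                      # fix the RNG for sampling noise
    torch.use_deterministic_algorithms(True)     # error out on non-deterministic ops
    torch.backends.cudnn.deterministic = True    # force deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False       # disable autotuned (variable) kernels
    # Some CUDA ops additionally require this env var *before* CUDA init:
    #   CUBLAS_WORKSPACE_CONFIG=:4096:8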
I think they’re referring to CUDA (and possibly other similar runtimes) being able to schedule floating point ops non-deterministically, combined with floating point arithmetic being potentially non-associative. I’m not personally sure how big an issue that would be for the output though.
I have never spotted any difference when regenerating (a recent) image with the same settings/seed/noise and I do it often. Haven't compared the bits though.
Older images are often difficult to reproduce for me - I believe due to changes in tooling (mostly updating Auto1111).
Differences in output are generally varying levels of difficulty of “spot the difference” and rarely change the overall image composition by much. I always use nondeterministic algos and it doesn’t have any effect on my ability to refine prompts effectively.
Someone mentions temperature in the context of algorithms and I can't stop thinking: cool, simulated annealing. I haven't seen temperature used in any other family of algorithms before this.
If you squint, it’s the same thing. Simulated annealing generally attempts to sample from the Boltzmann distribution. (Presumably because actual annealing is a thermodynamic thing, and you can often think of annealing in a way that the system is a sample from the Boltzmann distribution.)
And softmax is exactly the function that maps energies into the corresponding normalized probabilities under the Boltzmann distribution. And transformers are generally treated as modeling the probabilities of strings, and those probabilities are expressed as energies under the Boltzmann distribution (i.e., logits are on a log scale), and asking your favorite model a question works by sampling from the Boltzmann distribution based on the energies (log probabilities) the model predicts, and you can sample that distribution at any temperature you like.
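Concretely, that's all "temperature" is in an LLM sampler — the logits play the role of (negative) energies and get divided by T before the softmax. A tiny sketch:

    # Temperature sampling: softmax at temperature T is exactly the Boltzmann
    # distribution p_i ∝ exp(logit_i / T).
    import numpy as np

    def sample(logits: np.ndarray, temperature: float = 1.0) -> int:
        z = logits / temperature
        z = z - z.max()                    # for numerical stability
        p = np.exp(z) / np.exp(z).sum()    # softmax = normalized Boltzmann weights
        return int(np.random.choice(len(p), p=p))

    logits = np.array([2.0, 1.0, 0.1])
    print(sample(logits, temperature=0.2))  # near-greedy: almost always index 0
    print(sample(logits, temperature=2.0))  # "hotter": much more random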
This app got through review pretty easily, especially since I flagged potentially offensive content which makes it age 12+. In comparison to social media these apps are positively angelic.
Understood, but at some point it becomes the responsibility of the user of the hammer if they use it in an attack or hurt someone else or themselves with it. LLMs are LLMs; anyone using one who doesn't understand that it is a language model, a machine, and roughly how it works at a high level probably shouldn't use it, and it shouldn't be Apple's responsibility to keep hammers out of everyone's hands due to the few who can't hit a nail and stub their thumb.
Apple forcefully takes on that responsibility, and their customers love it (as clearly evidenced by their domination of the market). If you don't want a hammer that gets reviewed and screened by Apple before you can use it, then you're on the wrong platform.
Yeah that’s fine. But we still don’t have any tangible specifics in regards to what Apple needs to be reviewing for a locally hosted LLM. What can possibly be the criteria?
Probably an alternative version of this app or a similar app can provide an option to load your own models. Is that a problem for Apple to allow?
What I have found in my personal (and perhaps biased and anecdotal) experience is that there is a large cadre of LLM- and AI-hating people who inevitably start rambling about (a) safety and (b) copyright violations. Which I find reflects more about their mental model of the world and less about reality, in that they tend to be statists and collectivists who want a central authority to “protect” them. Which obviously gets under my skin, as that unfortunate mass instinct is what enabled big government, mass surveillance, and totalitarianism. Just my 2 cents, which undoubtedly many on HN may disagree with, and that’s fine, especially as I’m actually very interested in hearing more specifics from the AI-safeguards crowd, as admittedly perhaps I’m missing something terrible about this new hammer that has been invented, a hammer I find to be incredibly useful for nailing in all sorts of ways.
I'm not sure how Apple could make an explicit policy for this. My theory is that they won't, but rather are going to roll out their own LLM that runs locally and is optimized for on-device hardware, which non-Apple code will not be able to use. This won't make all the LLMs go away, but it will make running them very unattractive since they'll be battery-hungry and slow compared to the official app.
I wouldn't say "love it" as much as I would say "don't know" or "don't care".
Your average person doesn't spend a second thinking about Apple's app store policies. In fact most people don't even install new apps unless absolutely necessary.
Where to leave feedback? I am trying the Mistral dolphin model but getting GGML ASSERT errors referencing Users/tito lol (not me). Using iPhone 14 Pro Max.
Sorry for the confusing experience, and thank you for sharing this!
I’ve just submitted a new update for review with a few small but hopefully noticeable changes, thanks in no small part to your feedback:
1. StableLM Zephyr 3b Q4_K_M is now the built-in model, replacing the Q6_K variant.
2. More aggressive RAM headroom calculation, with forced fallback to CPU rather than failing to load as you observed, or crashing outright in some nasty edge cases.
3. New status indicator for Metal when model is loaded (filled bolt for enabled, vs slashed bolt for disabled.)
4. Metal will now also be enabled for devices with 4GB RAM or less, but only when the selected model can comfortably fit in RAM. Previously, only devices with at least 6GB had Metal enabled.
Thank you so much for taking the time to test and share your experience! Feel free to reach out anytime at britt [at] bl3 [dot] dev.
iPhone 15 and iPhone 14 Pro, 14 Pro Max have exactly the same CPU and amount of RAM (Apple A16 Bionic and 6GB). This is also true for iPhone 14 and iPhone 13 Pro, Pro Max (Apple A15 Bionic and also 6GB).
I don't play games or do anything too resource-demanding on my phone normally. Pro models typically have more memory than non-pro models and running LLMs on device might be the only scenario where it can realistically make a difference for me.
Smaller 3B LLMs (like phi-2) work fine on newer non pro models, at full context lengths. Running 7B models on even 8GB iPhone 15 Pro and Pro Max phones involves reducing the context lengths to 1k or fewer tokens, because the full context length KV cache won't fit on these devices.
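To put rough numbers on that, using Mistral-7B's published architecture (32 layers, 8 KV heads of dimension 128) and an fp16 cache — the rest is just arithmetic:

    # Back-of-the-envelope KV-cache size for Mistral-7B. Architecture numbers
    # are from the public config; everything else is arithmetic.
    layers, kv_heads, head_dim, bytes_fp16 = 32, 8, 128, 2

    def kv_cache_mb(context_tokens: int) -> float:
        per_token = 2 * layers * kv_heads * head_dim * bytes_fp16  # K and V
        return context_tokens * per_token / 2**20

    for ctx in (1024, 4096, 8192):
        print(f"{ctx:5d} tokens -> ~{kv_cache_mb(ctx):.0f} MB of KV cache")
    # ~128 MB at 1k tokens vs ~512 MB at 4k — on top of ~4 GB of 4-bit weights,
    # which is why context gets cut down on 8GB phones.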
edit: my bad, I misread the price and it's really hard to see the price after you bought it to double check.
$10 for something that (I think) doesn't work on most phones but isn't gated to ones it works on feels hostile.
Probably there's no way to gate, in that case I'd suggest not charging for it. Or I guess adding a daily usage limit that's lifted with an IAP.
I'll admit I was off-put by the price to begin with, which probably amplifies what a slap in the face it feels like to pay and get something that doesn't work at all.
So if they know it won't work, and do not put that info into the store's compatibility matrix, then it's still a bait/switch to me. Compare to the Resident Evil page, which does set the store limits on what devices can download it.
You can set the minimum deployment target to iOS 17, and then if someone has an iPhone X*, 11, or SE you can alert them to get a refund when they open the app, using either a device check or a total-memory check. That way you remove most of the issues with older devices.
It's clearly spelled out, App Store refunds work more often than they don't...
...and it's a $1.99 risk ffs.
Tangential point: It's super easy to go off the rails and on a rant, while the real reason behind someone's "bait/switch" is external, trivial, and benign. We tend to judge others by their actions, but ourselves by our intentions. I used a German company's excellent sleep supplement (and later worked for that company, too) which was being bashed on Facebook as "non FDA approved snake oil". Meanwhile, the FDA refused (and still refuses) to even look at anything outside actual drugs, even if you wanted them to. Sometimes your hands are just tied.
Thanks, but wikipedia_en_all_nopic_2023-12.zim is still 56 GB, whereas the BZ2-compressed Wiki2Touch archives are only about 14 GB for the latest (and only 8 GB for an archive from 2012 which I'm using).
I think great caution should be used with modern physics and chemistry - it may be a way to get yourself killed for sorcery.
But if you want to stay alive, then I'd recommend including a few books about creating modern medicine from scratch - like making aspirin from willow bark and penicillin from moldy bread.
Why do none of these apps allow you to set the system prompt? I find these LLM apps kind of useless without being able to refine the way in which the model will respond to later questions.
- save characters (system prompt + temperature, and a name & cosmetic color)
- download & experiment with models from 1b, 3b, & 7b, and quant options q2k, q4km, q6k
- save, search, continue, & export past chats
along with smaller touches:
- custom theme colors
- haptics
I downloaded this on my 14 Pro and it completely locked up the system to the point where even the power button wouldn’t work. I couldn’t use my phone for about 10 minutes.
I’ve just submitted a new update for review with a few small but hopefully noticeable changes, thanks to your feedback:
1. StableLM Zephyr 3b Q4_K_M is now the built-in model, replacing the Q6_K variant.
2. More aggressive RAM headroom calculation, with forced fallback to CPU rather than failing to load or crashing/hanging in such a nasty fashion.
3. New status indicator for Metal when model is loaded (filled bolt for enabled, vs slashed bolt for disabled.)
4. Metal will now also be enabled for devices with 4GB RAM or less, but only when the selected model can comfortably fit in RAM. Previously, only devices with at least 6GB ever had Metal enabled.
I really appreciate your taking the time to test — the hanging you experienced was unacceptable, and I truly am sorry for the inconvenience. I hope you’ll give it another chance once this update is live, but either way I’m grateful for your help in isolating and eliminating this issue!
I’m very sorry about your experience. That’s definitely not what I was aiming for, and I can imagine that was a nasty surprise. Any hang like that is unacceptable, full stop.
My understanding is Metal is currently causing hangs on devices when there is barely enough RAM to fit the model and prompt, but not quite enough to run. Will work on falling back to CPU to avoid this kind of experience much more aggressively than today.
Thank you for taking the time to both try it out and to share your experience; I will use it to ensure it’s better in the future.
Thanks for the response. Unfortunately on my device the behavior makes it impossible to report a bug using a screenshot as requested in the app. I can give you more device info if you want to narrow down the cause.
I’m very sorry to hear you had such a poor experience as well. I’m sure it’s little consolation at this point having been inconvenienced as you have — it’s certainly not what I aim for in my work!
Thanks. I did test your new version but unfortunately similar issues. App completely hung and entire OS was sluggish. iPhone 13 Pro, iOS 17.1.2. Unfortunately I won’t have time to test any more but very good luck with the project.
BTW and FYI, I need to reduce the font size on my iOS device to be smaller than I like in order to use your add/replace API key pages. If the font is "larger than normal" I can't see/focus on the box to enter or paste in the API key. Just increase your iOS system font size to trigger this. Thanks in advance for fixing; will try out the app!
Thanks for the detailed report - will fix asap, along with releasing the macOS v1.0. I've just soft launched this so far but have more to come so please let me know anything else.
I definitely do not want any liability of user-generated content or PII or similar. I have no analytics, besides the standard Apple opt-in crash/reporting (not using any 3rd-party service and not sending anything to my own servers).
It downloads configuration from GitHub and HuggingFace directly. It also has OpenAI integration, directly to their servers via BYOK.
There is no system prompt. Unless LlamaIndex or some other source cites something from Mistral, I am inclined to believe they just copied it from Llama.
<s>[INST] What is your favorite condiment? [/INST]
"Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"</s> [INST] The right amount of what? [/INST]
Note that the sentence starting "Well, I'm quite partial" isn't inside the tags.
ollama run Mistral "<s>[INST] What is your favorite condiment? [/INST] Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen</s> [INST] The right amount of what? [/INST]"
That's the whole context with two user inputs in the INST tags, and one assistant output between/outside of the tags. They're just simulating the beginning of a conversation. You can see this very clearly in the JSON version in the next code block:
messages = [
{"role": "user", "content": "What is your favourite condiment?"},
{"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
{"role": "user", "content": "Do you have mayonnaise recipes?"}
]
Yes, but this whole block of text gets passed to the LLM on each call as the conversation history. The [INST] tags tell the LLM which parts were inputs (instructions) as opposed to its own outputs.
In actual fact, system prompts are just instructions formatted the way the instruction tuning expects (i.e., with [INST]). It's just text all the way down.
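To make "text all the way down" concrete, here's roughly how that messages list gets flattened into the [INST] string the model actually sees — an illustrative sketch of the Mistral-style template, not the exact tokenizer code (which treats <s>/</s> as special tokens rather than literal text):

    # Rough sketch of flattening a chat history into the Mistral-style template.
    def to_prompt(messages):
        out = "<s>"
        for m in messages:
            if m["role"] == "user":
                out += f"[INST] {m['content']} [/INST]"
            else:  # assistant turns sit outside the [INST] tags
                out += f" {m['content']}</s>"
        return out

    messages = [
        {"role": "user", "content": "What is your favourite condiment?"},
        {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice."},
        {"role": "user", "content": "Do you have mayonnaise recipes?"},
    ]
    print(to_prompt(messages))
    # <s>[INST] What is your favourite condiment? [/INST] Well, I'm quite partial
    # to a good squeeze of fresh lemon juice.</s>[INST] Do you have mayonnaise recipes? [/INST]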