Thank you! We'll continue adding performance metrics as more data comes in.
A Qwen 2.5 500M will get you to ≈45 tok/sec on an iPhone 13. Inference speed is roughly inversely proportional to model size.
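To make "roughly inversely proportional" concrete, here's a back-of-envelope sketch; the only measured data point is the ~45 tok/s figure above, and the scaling assumption ignores memory bandwidth, quantization format, and thermal throttling, so treat the outputs as rough estimates, not benchmarks:

```ts
// Anchor: Qwen 2.5 0.5B at ~45 tok/s on an iPhone 13 (from the comment above).
const anchor = { params: 0.5e9, toksPerSec: 45 };

// Assumes decode speed scales ~1/params at a fixed quantization level.
function estimateToksPerSec(params: number): number {
  return anchor.toksPerSec * (anchor.params / params);
}

console.log(estimateToksPerSec(1.5e9).toFixed(1)); // ≈ 15 tok/s for a 1.5B model
console.log(estimateToksPerSec(3.0e9).toFixed(1)); // ≈ 7.5 tok/s for a 3B model
```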
Yes, speeds are consistent across frameworks, although (and don't quote me on this) I believe React Native is slightly slower because it interfaces with the C++ engine through a set of bridges.
I also want to add on that I really appreciate the benchmarks.
When I was working with RAG via llama.cpp through RN early last year, I had pretty acceptable tok/sec results up through 7-8B quantized models (on phones like the S24+ and iPhone 15 Pro). MLC was definitely higher tok/sec, but it's really tough to beat the community support and availability of the GGUF ecosystem.
Looking at the current benchmarks table, I was curious: what do you think is wrong with the Samsung S25 Ultra?
Most of the standard mobile CPU benchmarks (Geekbench, AnTuTu, et al.) show a 20-40% performance gain over the S23/S24 Ultra. It also bucks the trend elsewhere in the table, where devices are ranked roughly as you'd expect (i.e. newer devices perform better).
- "You are, undoubtedly, the worst pirate i have ever heard of"
- "Ah, but you have heard of me"
Yes, we are indeed a young project. Not two weeks, but a couple of months. Welcome to AI, most projects are young :)
Yes, we are wrapping llama.cpp. For now. Ollama, too, began by wrapping llama.cpp. That is the mission of open-source software - to enable the community to build on each other's progress.
We're enabling the first cross-platform in-app inference experience for GGUF models, and we'll soon be shipping our own inference kernels, fully optimized for mobile, to push performance further. Stay tuned.
How does it work? How does one model on the device get shared across many apps? Does each app have its own inference SDK running, or is there one inference engine shared by many apps (like Ollama does)? If it's the latter, what's the communication protocol to the inference engine?
Great question. Currently, each app is sandboxed - so each model file is downloaded inside each app's sandbox. We're working on enabling file sharing across multiple apps so you don't have to redownload the model.
With respect to the inference SDK, yes, you'll need to install the (React Native/Flutter) framework inside each app you're building.
The SDK is very lightweight (our own iOS app is <30 MB, which includes the inference SDK and a ton of other stuff).
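A minimal sketch of that per-app flow in a React Native app, assuming Expo's FileSystem module; the Hugging Face URL is just an example (check the exact filename on the model page), and ensureModel is an illustrative helper, not the actual Cactus API:

```ts
import * as FileSystem from 'expo-file-system';

// Example GGUF file on Hugging Face; any GGUF model works the same way.
const MODEL_URL =
  'https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q4_k_m.gguf';
const LOCAL_PATH = `${FileSystem.documentDirectory}qwen2.5-0.5b-instruct-q4_k_m.gguf`;

// Each app downloads the model into its own sandbox. Until cross-app file
// sharing ships, two apps on the same phone each keep their own copy.
async function ensureModel(): Promise<string> {
  const info = await FileSystem.getInfoAsync(LOCAL_PATH);
  if (!info.exists) {
    await FileSystem.downloadAsync(MODEL_URL, LOCAL_PATH);
  }
  return LOCAL_PATH; // pass this path to the in-app inference SDK
}
```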
I would like to see it as an app, tbh! If I could run it as an APK with a nice GUI for picking different models to run, that would be a killer feature.
Thanks for the feedback. You're right to point out that Google AI Edge is cross-platform and more flexible than our phrasing suggested.
The core distinction is in the ecosystem: Google AI Edge runs TFLite models, whereas Cactus is built for GGUF. This is a critical difference for developers who want to use the latest open-source models.
One major outcome of this is model availability. New open-source models are released in GGUF format almost immediately. Finding or reliably converting them to TFLite is often a pain. With Cactus, you can run new GGUF models the day they drop on Hugging Face.
Quantization level also plays a role. GGUF has mature support for quantization far below 8-bit. This is effectively essential for mobile. Sub-8-bit support in TFLite is still highly experimental and not broadly applicable.
Last, Cactus excels at CPU inference. While TFLite is great, its peak performance often relies on specific hardware accelerators (GPUs, DSPs). GGUF is designed for strong performance on standard CPUs, offering a more consistent baseline across the wide variety of devices app developers have to support.
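Rough memory math behind the quantization point, as a back-of-envelope sketch only (the bits-per-weight values are approximate, and real GGUF files add overhead for embeddings, the KV cache, and runtime buffers):

```ts
// Approximate weights-only footprint: params * bits-per-weight / 8 bytes.
function approxSizeGB(params: number, bitsPerWeight: number): number {
  return (params * bitsPerWeight) / 8 / 1e9;
}

const params = 7e9; // a 7B model
console.log(approxSizeGB(params, 16).toFixed(1));  // ~14.0 GB at fp16 -- no chance on a phone
console.log(approxSizeGB(params, 8).toFixed(1));   // ~7.0 GB at Q8   -- too big for most devices
console.log(approxSizeGB(params, 4.5).toFixed(1)); // ~3.9 GB at ~Q4  -- feasible on 8 GB+ phones
```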
GGUF is more suitable for the latest open-source models, I agree there. Quant2/Q4 will probably be critical as well if we don't see a jump in RAM. But then again, I wonder when/if MediaPipe will support GGUF as well.
PS, I see you are in the latest YC batch? (below you mentioned BF). Good luck and have fun!
So far, our focus is on supporting models with fully open-sourced weights. Providers who are sensitive about their weights typically lock those weights up in their cloud and don't run their models locally on consumer devices anyway.
I believe there are some frameworks pioneering model encryption, but I think we're a few steps away from wide adoption.
To your previous point: Cactus fully supports tool calling (for models that have been instruction-tuned for it, e.g. Qwen 1.7B).
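For illustration, "tool calling" here means the common OpenAI-style function-tools pattern that instruction-tuned models are trained on; the schema below is a generic example with a hypothetical tool, and the actual completion call is omitted because the exact Cactus API may differ:

```ts
// A hypothetical tool definition the app exposes to the model.
const tools = [
  {
    type: 'function',
    function: {
      name: 'get_weather', // placeholder tool for the example
      description: 'Get the current weather for a city',
      parameters: {
        type: 'object',
        properties: { city: { type: 'string' } },
        required: ['city'],
      },
    },
  },
];

// Given these tools plus a prompt like "What's the weather in Lagos?", a
// tool-capable model is expected to emit a structured call such as
//   { "name": "get_weather", "arguments": { "city": "Lagos" } }
// which the app executes and feeds back to the model for the final answer.
console.log(JSON.stringify(tools, null, 2));
```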
for "turning your old phones into local llm servers", Cactus is likely not the best tool. We'd recommend something like actual Ollama or Exo