For translation jobs, I've experimented with Llama 2 70B (running on Replicate) ...

kcorbitt · on Sept 12, 2023

Yes, if you're just using Llama 2 off the shelf (without fine-tuning) I don't think there are a lot of workloads where it makes sense as a replacement for GPT-3.5. The one exception being for organizations where data security is non-negotiable and they really need to host on-prem. The calculus changes drastically though when you bring fine-tuning in, which lets a much smaller model outperform a larger one on many classes of task.

Also, it's worth noting that Replicate started out with a focus on image generation, and their current inference stack for LLMs is extremely inefficient. A significant fraction of the 100x cost difference you mentioned can be made up by using an optimized inference server like vLLM. Replicate knows about this and is working hard on improving their stack, it's just really early for all of us. :)

bfirsh · on Sept 12, 2023

Founder of Replicate here. It's early indeed.

OpenAI aren't doing anything magic. We're optimizing Llama inference at the moment and it looks like we'll be able to roughly match GPT 3.5's price for Llama 2 70B.

Running a fine-tuned GPT-3.5 is surprisingly expensive. That's where using Llama makes a ton of sense. Once we’ve optimized inference, it’ll be much cheaper to run a fine-tuned Llama.

yixu34 · on Sept 13, 2023

We're working on LLM Engine (https://llm-engine.scale.com) at Scale, which is our open source, self-hostable framework for open source LLM inference and fine-tuning. We have similar findings to Replicate: Llama 2 70B can be comparable to GPT 3.5 price, etc. Would be great to discuss this further!

Dowwie · on Sept 13, 2023

How heavy of a lift is it to optimize inference?

Arctic_fly · on Sept 12, 2023

> Llama 7B wasn't up to the task fyi, producing very poor translations.

From what I've read and personally experimented with, none of the Llama 2 models are well-suited to translation in particular (they were mainly trained on English data). Still, there are a number of tasks that they're really good at if fine-tuned correctly, such as classification and data extraction.

> I believe that OpenAI priced GPT-3.5 aggressively cheap in order to make it a non-brainer to rely on them rather than relying on other vendors (even open source models).

I think you're definitely right about that, and in most cases just using GPT 3.5 for one-off tasks makes the most sense. I think when you get into production workflows that scale, that's when using a small fine-tuned models starts making more sense. You can drop the system prompt and get data in the format you'd expect it in, and train on GPT-4's output to sometimes get better accuracy than 3.5 would give you right off the bat. And keep in mind, while you can do the same thing with a fine-tuned 3.5 model, it's going to cost 8x the base 3.5 price per token.

kelseyfrog · on Sept 12, 2023

Is that because translation is typically an encoder-decoder task and llama is decoder only or is there something else about it that makes the last difficult for llama?

FeepingCreature · on Sept 12, 2023

If you don't make it learn other-language texts, it won't be able to speak that language.

mikewang · on Sept 14, 2023

As I learned that 85% of its trainig data is English. Othere languanges composed of 15%.

AnonymousPlanet · on Sept 12, 2023

Cost isn't the only incentive not to use an LLM service that resides in a foreign country. Around here, there are industries for which it's pretty much a no-brainer to avoid anything that sends data across the atlantic.

unoti · on Sept 12, 2023

Although it wouldn't surprise me if today's Azure OpenAI offerings route to certain US-centric regions, I'd be very surprised if Azure isn't working day and night to try to provision OpenAI capacity everywhere they can in the world.

(Disclaimer: I work in the cloud organization at Microsoft, and these are totally my own thoughts and opinions and don't reflect any kind of inside knowledge I have. I think I can say that provisioning LLM capacity and GPU's is something we basically all have a tremendous amount of passion about.)

AnonymousPlanet · on Sept 12, 2023

Let's say a French company would offer the same service in the US, swearing no data would be ever siphoned out of the US and no French intelligence service would be allowed to review the data. Would you be comfortable with your patient records being stored there or the business secrets of US companies?

Do you believe Microsoft can actually make the same promises and keep them? You don't have to answer the last question, of course, but please think about it. It doesn't matter where the LLM is located but who controls it and who holds the resulting data.

ozgune · on Sept 13, 2023

I don't think this is a promise Microsoft can make. The US Cloud Act states that Microsoft falls under US jurisdiction and it's legally bound to share foreign data if asked by US law enforcement.

"The CLOUD Act asserts that U.S. data and communication companies must provide stored data for a customer or subscriber on any server they own and operate when requested by warrant, but provides mechanisms for the companies or the courts to reject or challenge these if they believe the request violates the privacy rights of the foreign country the data is stored in."

https://en.wikipedia.org/wiki/CLOUD_Act

fomine3 · on Sept 13, 2023

Worldwide big corps already utilized Microsoft 365 especially SharePoint. That's Microsoft's advantage.

carom · on Sept 13, 2023

I do think large tech companies do pretty well with customer data. As a former Googler I would be comfortable with my Gmail data residing in a foreign datacenter.

Aerbil313 · on Sept 13, 2023

They do pretty well, except the Room_641A in the building which is allowed to do anything they what with production branch without it being visible to ordinary workers.

https://en.m.wikipedia.org/wiki/Room_641A

deet · on Sept 13, 2023

Azure GPT 4 is already available in: Australia East, Canada East, East US, East US 2, France Central, Japan East, Sweden Central, Switzerland North, UK South (https://learn.microsoft.com/en-us/azure/ai-services/openai/c...)

ttt3ts · on Sept 12, 2023

You can run 70B LLAMA on dual 4090s/3090s with quantization. Going with dual 3090s you can get a system that can run LLAMA 2 70B with 12K context for < $2K.

I built two such a systems after burning that much in a week on ChatGPT.

coryrc · on Sept 13, 2023

> I built two such a systems after burning that much in a week on ChatGPT.

What are you doing!?

ttt3ts · on Sept 14, 2023

Have a client with many thousands of csv, json, xml files detailing insurance prices. Fundimentally they all contained the same data but wildly different formats because they were produced by different companies and teams. I used ChatGPT to deduce their format so I could normalize them. Easily underbid their current contractor who was using humans for the work and now I have an easy quarterly billing. :)

TBC, I probably could have optimized tokens but contract was profitable and time critical.

coryrc · on Sept 14, 2023

Thanks for sharing!

zakki · on Sept 12, 2023

Would you mind to share all your PC HW (mobo, casing, cooling, etc) for this dual GPU configuration? Thanks.

ttt3ts · on Sept 12, 2023

The one you could build for under 2K is last gen hardware.

* Chenbro Rackmount 4U Server Chassis RM42300-F (rack mount case Remove the air filter on 120mm fan. Put two decent 80mm exhaust at rear). * Two used air cooled 3090s. About $650 a piece on ebay. Check slot width and make sure everything will fit on your motherboard. Do a burn in when you get them cause used GPUs can be hit or miss. * 5950x CPU (overkill just had it) * 128GB DDR4 * Motherboard with x570 chipset and dual pcie x16. These will birificate to x8 pcie 4.0 lanes to each GPU. This is enough bandwidth to push GPUs to max IME * 1200W+ ATX power supply. * ebay "u.2 pcie 3.84TB" and adaptor for m.2 NVME slot. (again what I had & it is cheap)

If you're going to really beat the thing I would power limit the 3090s to 320w (from 350w). Perf change is not really notable and keeps temps better.

efreak · on Sept 15, 2023

From people hosting image generation models on Stable Horde I've heard that you can pretty severely underclock/undervolt your GPUs and keep them stable, massively reducing heat output and energy cost without losing nearly as much performance. I'm not sure if this transfers into text generation or not, this was from image generation workers that have a few seconds downtime between requests; however it might be worth a bit of research if you happen to be running consumer GPUs.

----- From TheUnamusedFox, in August: > 3090 down to ~260-270 watts (from 400) with minimal gen speed impact. Same with a 3080ti. It seems to be more stable with image generation than gaming, at least on my two cards. If I try to game or benchmark with this undervolt it is an instant crash.

From another user:

> this undervolting stuff is pretty sweet. > undervolted_limits.png [1] > max_power_limits.png [2] > this is my before and after. > a solid 200 watt drop for only 9.2% loss of performance > not to mention the 30 degree drop in temps

[1]: https://cdn.discordapp.com/attachments/1143237412663869570/1... [2]: https://cdn.discordapp.com/attachments/1143237412663869570/1...

zakki · on Sept 13, 2023

Thank you so much.

apstls · on Sept 13, 2023

Are there any good resources related to expanding context windows, or even just the mechanics of how they actually work as properties of a model?

ttt3ts · on Sept 13, 2023

Lots. LLAMA 2 was trained on 4K context windows but can run on arbitrary length just the results become garbage as you go longer.

I refer you to https://blog.gopenai.com/how-to-speed-up-llms-and-use-100k-c... for an "easy" to digest summary

Reviving1514 · on Sept 13, 2023

Edit: Nevermind, saw you posted elsewhere. Thank you!

Can you share your system specs? I was looking into something similar but my costs were closer to 6 to 8k for the whole system.

0x008 · on Sept 14, 2023

is the $2K you mentioned the total cost of ownership?

ramesh31 · on Sept 12, 2023

>For about 1000 input tokens (and resulting 1000 output tokens), to my surprise, GPT-3.5 turbo was 100x cheaper than Llama 2.

You'll never get actual economics out of switching to open models without running your own hardware. That's the whole point. There's orders of magnitude difference in price, where a single V100/3090 instance can run llama2-70b inference for ~0.50$/hr.

YetAnotherNick · on Sept 12, 2023

No, they can't run it. llama 70 with 4 bit quantization takes ~50 GB VRAM for decent enough context size. You need A100, or 2-3 V100 or 4 3090 which all costs roughly roughly $3-5/h

ramesh31 · on Sept 12, 2023

Wrong. I am running 8bit GGML with 24GB VRAM on a single 4090 with 2048 context right now

YetAnotherNick · on Sept 12, 2023

Which model? I am talking about 70b as mentioned clearly. 70b 8b is 70GB just for the model itself. How much token/second are you getting with single 4090?

ramesh31 · on Sept 12, 2023

Offloading 40% of layers to CPU, about 50t/s with 16 threads.

pocketarc · on Sept 12, 2023

That is more than an order of magnitude better than my experience; I get around 2 t/s with similar hardware. I had also seen others reporting similar figures to mine so I assumed it was normal. Is there a secret to what you're doing?

ramesh31 · on Sept 12, 2023

>Is there a secret to what you're doing?

Core speed and memory bandwidth matter a lot. This is on a Ryzen 7950 with DDR5.

jpdus · on Sept 13, 2023

Care to share your detailed stack and command to reach 50t/s? I also have a 7950 with DDR 5 and I don't even get 50 t/s on my two RTX 4090s....

brucethemoose2 · on Sept 12, 2023

TBH, Replicate is not a great way to run 7B beyond experimentation. You want a host with cheap consumer GPUs (like vast.ai) since the 4-bit requirements are so modest.

You either need a backend with good batching support (vLLM), or if you don't need much throughput, an extremely low end GPU or no GPU at all for exLlama/llama.cpp.

OpenAI benefits from quantization/batching, optimized kernels and very high utilization on their end, so the huge price gap vs a default HF Transformers instance is understandable. But even then, you are probably right about their aggressive pricing.

As for quality, you need a llama model finetunes on the target language (many already exist on Huggingface) and possibly custom grammar if your backend supports it.

halflings · on Sept 12, 2023

I don't think translation is a great use case for ChatGPT and LLAMA. These models are overwhelmingly trained on English, and LLAMA2 which should have more data from other languages is still focused on languages w/ Latin/Cyrillic characters (so won't work well for Arabic, Hebrew, or CJK languages).

You're better off using models specialized in translation; General purpose LLMs are more useful when fine-tuning on specific tasks (some form of extraction, summarization, generative tasks, etc.), or for general chatbot-like uses.

og_kalu · on Sept 12, 2023

>You're better off using models specialized in translation

For a couple dozen languages, GPT-4 is by far the best translator you can get your hand on so basically no.

daniels11 · on Sept 12, 2023

I will say that GPT-4 is just incredibly expensive. For my app I only use it for advanced translations/corrections, and usually a combination of GPT-3.5+Wiktionary is able to get the more simple stuff done

all2 · on Sept 12, 2023

> GPT-3.5+Wiktionary

Can you share more about your app and what you're doing?

daniels11 · on Sept 12, 2023

Sure! I'm building a personalized AI language learning tutor using Open AI's API and ElevenLabs (for Text to Speech).

Right now it's basically a chat bot that you can use to practice conversing with. It provides corrections for the things you type. Eventually I'd like to try adding Whisper as well to allow users to speak out loud.

When you hover over a word, you get a translation. Initially I thought using Open AI for every word translation would be too much, but I've been able to get it down to ~36-40 tokens/request. (3-4 cents/1000 requests). I also began parsing and uploading some of this [Wiktionary data](https://kaikki.org/dictionary/rawdata.html) and am working on a feature that integrates the GPT-3.5 translation with this Wiktionary data.

A lot of these features are still in the works but you can feel free to try it if you like (https://trytutor.app).

two_in_one · on Sept 14, 2023

What would be the best local standalone solution for translation model? Personal use, mostly self-education. 2 popular languages both ways (like en-spa, fr-ger). Free, pretrained off the github would be the best. I can try and train say 100M params LLM on 4090 RTX. But I'm not sure satisfactory result are achievable.

achileas · on Sept 12, 2023

There are plenty of examples in the literature of using LLMs for translation beating the metrics of non-LLM models, even for languages for which there isn't a lot of data. Transliterating non-Latin characters helps a lot with accuracy as well.

daniels11 · on Sept 12, 2023

what models would you use for translation? I am working on a language learning tutor (trytutor.app, very early) and GPT-3.5 turbo has been working fine, for the most part.

For foreign language corrections ("correct this German sentence and give a reason for the correction"), GPT-3.5 doesn't quite have the horsepower so I use GPT-4

robertnishihara · on Sept 14, 2023

It shouldn't be 100x. We've built an LLM API at Anyscale, and the price comparison works out as follows (per million tokens)

- Llama-2-70B: $1 (on Anyscale Endpoints [1]) - GPT-3.5-turbo: $1.50 - $2 (OpenAI [2])

[1] https://app.endpoints.anyscale.com/ [2] https://openai.com/pricing

nborwankar · on Sept 12, 2023

Llama and GPT are auto-regressive decoder only architectures which for pure translation jobs are not the optimal architectures. Training seq2seq models or encoder/decoder models on datasets of sentence pairs designed for translation will likely allow you to use much smaller models. You will not be wasting parameters on general “language understanding” capability that Llama and GPT have if pure translation is all you need. T5 or Flan-T5 might be good starting points.

mr_o47 · on Sept 13, 2023

I’m actually replicate user. I have experimented with LLAMA2 on the replicate and I have similar experience

But you are totally correct about the pricing part it can get expensive

I’m running this photo service https://msdosimagetools.ngrok.dev/

Its doing 200+ photos every day and I’m using open source models behind the scene on replicate. My costs increasing day by day

Plus this is hosted locally

octacat · on Sept 12, 2023

Google Maps was also cheap. Initially. So it is aggressively cheap now, but would aggressively change later.

flangola7 · on Sept 13, 2023

Google maps has always been free.

blitz_skull · on Sept 13, 2023

Their API, however, is not. (After a certain usage threshold)

Mystery-Machine · on Sept 14, 2023

What?

mrybczyn · on Sept 12, 2023

Yes, openAI is dumping the market with chat-gpt 3.5. Vulture capital behaviour at its finest, and I'm sure government regulations will definitely catch on to this in 20 or 30 years...

It's cheaper than the ELECTRICITY cost of running a llama-70 on your own M1.Max (very energy efficient chip) assuming free hardware.

I guess they are also getting a pretty good cache hit rate - there are only so many questions people ask at scale. But still, it's dumping.

sacred_numbers · on Sept 12, 2023

Based on my research, GPT-3.5 is likely significantly smaller than 70B parameters, so it would make sense that it's cheaper to run. My guess is that OpenAI significantly overtrained GPT-3.5 to get as small a model as possible to optimize for inference. Also, Nvidia chips are way more efficient at inference than M1 Max. OpenAI also has the advantage of batching API calls which leads to better hardware utilization. I don't have definitive proof that they're not dumping, but economies of scale and optimization seem like better explanations to me.

csjh · on Sept 12, 2023

What makes you think 3.5 is significantly smaller than 70B?

hutzlibu · on Sept 12, 2023

I also do not have proof of anything here, but can't it be both?

They have lots of money now and the market lead. They want to keep the lead and some extra electricity and hardware costs are surely worth it for them, if it keeps the competition from getting traction.

haxton · on Sept 12, 2023

gpt3.5 turbo is (mostly likely) Curie which is (most likely) 6.7b params. So, yeah, makes perfect sense that it can't compete with a 70b model on cost.

JackRumford · on Sept 14, 2023

These sites say 154B:

https://www.ankursnewsletter.com/p/gpt-4-gpt-3-and-gpt-35-tu...

https://blog.wordbot.io/ai-artificial-intelligence/gpt-3-5-t...

why_only_15 · on Sept 12, 2023

gpt3.5 turbo is a new model, not Curie. As others have stated, it probably uses Mixture of Experts which lowers inference cost.

csjh · on Sept 12, 2023

Is there a source on that? I've never seen anyone think it's below even 70B

ronyfadel · on Sept 12, 2023

It still does a much better job at translation than llama 2 70b even, at 6.7b params

two_in_one · on Sept 12, 2023

If it's MOE that may explain why it's faster and better...

yumraj · on Sept 12, 2023

sarthaksrinivas · on Sept 12, 2023

Mixture of Experts Model - https://en.wikipedia.org/wiki/Mixture_of_experts

jiggawatts · on Sept 12, 2023

I thought it was fairly well established that GPT 3.5 has something like 130B parameters and that GPT 4 is on the order of 600-1,000

avion23 · on Sept 13, 2023

I remember:

- gpt-3.5 175b params

- gpt-4 1800b params

PUSH_AX · on Sept 12, 2023

You think they are caching? Even though one of the parameters is temperature? Can of worms, and should be reflected in the pricing if true, don't get me started if they are charging per token for cached responses.

I just don't see it.

why_only_15 · on Sept 12, 2023

You can keep around the KV cache from previous generations which lowers the cost of prompts significantly.

read_if_gay_ · on Sept 12, 2023

turbo is likely nowhere near 70b.

avereveard · on Sept 12, 2023

Together AI has new aggressive pricing where 70b are on par with gpt35 and everything smaller is fairly cheaper. The catch is the only 32k context length model as of today is their llama 7b which is fairly limited.

MuffinFlavored · on Sept 12, 2023

I thought Llama was opensource/free and you could run it yourself?

thewataccount · on Sept 12, 2023

You (currently) need a GPU to run any of the useful models. I haven't really seen a business use-case that runs it on the user's computer, but given the hardware requirements it wouldn't be very feasible to expect.

So you'll have to figure out how to run/scale the model inference. Cloud GPU instances are generally very expensive, and once you start needing to horizontally scale it'll get messy fast.

At least at the moment it's expensive, especially if it's either very light usage or very intensive usage - you either need just a few seconds of compute occasionally, or lots of compute all the time requiring scaling.

The "lucky" ones in this scenario are small-medium businesses that can use one or a few cards on-site for their traffic. Even then when you take the cost of an A100 + maintaining it, etc. OpenAI's offering still looks attractive.

I know there's a few services that try to provide an api similar to what openai has, and some software to self orchestrate it, I'm curious how those compare...

hereonout2 · on Sept 12, 2023

> once you start needing to horizontally scale it'll get messy fast.

It gets expensive fast, but not messy, these things scale horizontally really well. All the state is encapsulated in the request, no replication, synchronisation, user data to worry about. I'd rather have the job of horizontally scaling llama2 than a relational database.

thewataccount · on Sept 12, 2023

For sure, and yeah it wouldn't be terrible you're right. You'd just need the api servers + a load balancer.

My thing is that dynamically doing that is still a lot compared to just calling a single endpoint and all of that is handled for you.

But for sure this is a very decent horizontal use-case.

loudmax · on Sept 12, 2023

You can run the smaller Llama variants on consumer grade hardware, but people typically rent GPUs from the cloud to run the larger variants. It is possible to run even larger variants on a beefy workstation or gaming rig, but the performance on consumer hardware usually makes this impractical.

So the comparison would be the cost of renting a cloud GPU to run Llama vs querying ChatGPT.

ramesh31 · on Sept 12, 2023

>So the comparison would be the cost of renting a cloud GPU to run Llama vs querying ChatGPT.

Yes, and it doesn't even come close. Llama2-70b can run inference at 300+tokens/s on a single V100 instance at ~$0.50/hr. Anyone who can should be switching away from OpenAI right now.

cheptsov · on Sept 21, 2023

How do you fit Llama2-70b into V100? V100 is 16GB. Llama2-70b 4bit would require up to 40GB. Also, what do you use for inference to get 300+tokens/s?

thewataccount · on Sept 12, 2023

What's the best way to use LLama2-70b without existing infrastructure for orchestrating it?

mjirv · on Sept 12, 2023

I stumbled upon OpenRouter[0] a few days ago. Easiest I’ve seen by far (if you want SaaS, not hosting it yourself).

[0] https://openrouter.ai

ramesh31 · on Sept 12, 2023

>What's the best way to use LLama2-70b without existing infrastructure for orchestrating it?

That's an exercise left to the reader for now, and is where your value/moat lies.

thewataccount · on Sept 12, 2023

> That's an exercise left to the reader for now, and is where your value/moat lies.

Hopefully more on-demand services enter the space. Currently where I am we don't have the resources for any type of self orchestration and our use case is so low/sporadic that we can't simply have a dedicated instance.

Last I saw the current services were rather expensive but I should recheck.

pdntspa · on Sept 13, 2023

I bought an old server off ServerMonkey for like $700 with a stupid amount of RAM and CPUs and it runs Llama2-70b fine, if a little slowly. Good for experimenting

axpy906 · on Sept 12, 2023

Unfortunately, Lama2 is not a fully open source license.

kuchenbecker · on Sept 12, 2023

Compute costs money.

yessenzhar · on Sept 13, 2023

We provide per token based Llama 2 70B API at Deep Infra, $1/1M tokens, which is 25-50% cheaper than ChatGPT.

tuckerconnelly · on Sept 13, 2023

Can you provide a larger context length? Looking for a replacement of GPT-3.5 16k model. Might be interested for a higher-scale project.

computerex · on Sept 12, 2023

Replicate has terrible pricing. Have you tried deepinfra?

refulgentis · on Sept 12, 2023

For use cases well within the capabilities of an LLM from last year, fine-tuned LLaMa 2 13B should/will blow ChatGPT out of the water: think "rate the sentiment of this text from 0-10".

I believe this because LLaMa-2 13B is more than good enough to handle what I call "quick search", i.e.

``` User: "What's the weather in Milwaukee?"

System: Here's some docs, answer concisely in one sentence.

AI: It's 73 degrees Farenheit. ```

YMMV on cost still, depends on cloud vendor, and my intuition agrees with yours: GPT-3.5 is priced low enough that there isn't a case where it makes sense to use another model. It strikes me now that's there's a good reason for that intuition: OpenAI's $/GPU hour is likely <= any other vendor's and inference time of LLaMa 2 ~= GPT.

I do think this will change with local LLMs. They've been way over-hyped for months, but after LLaMa 2, the challenges remaining are more sociological than technical.

For months now it's been one-off $LATEST_BUZZY_MODEL.c stunts that run on desktop.

The vast majority of the _actual_ usage and progress is coming from porn-y stuff, and the investment occurs in one-off stunts.

That split of effort, and lack of engineering rigor, is stunting progress overall.

Microsoft has LLaMa-2 ONNX available on GitHub[1]. There's budding but very small projects in different languages to wrap ONNX. Once there's a genuine cross-platform[2] ONNX wrapper that makes running LLaMa-2 easy, there will be a step change. It'll be "free"[3] to run your fine-tuned model that does as well as GPT-4.

It's not clear to me exactly when this will occur. It's "difficult" now, but only because the _actual usage_ in the local LLM community doesn't have a reason to invest in ONNX, and it's extremely intimidating to figure out how exactly to get LLaMa-2 running in ONNX. Microsoft kinda threw it up on GitHub and moved on, the sample code even still needs a PyTorch model. I see at least one very small company on HuggingFace that _may_ have figured out full ONNX.

Funnily enough, ONNX is getting a spike in mindshare over the last month in the _Stable Diffusion_ community. There's decent cross-pollination between local art and local LLMs, ex. LoRA's were first a thing for Stable Diffusion. So I'm hoping we see this sooner rather than later.

[1] https://github.com/microsoft/Llama-2-Onnx

[2] Definition of cross-platform matters a ton here, what I mean is "I can import $ONNX_WRAPPER_LIB on iOS / Android / Mac / Windows and call Llama2.reply(String prompt, ...)"

[3] Runs on somebody else's computer, where "somebody else" is the user, instead of a cloud vendor.

homarp · on Sept 12, 2023

you already have TVM for the cross platform stuff

see https://tvm.apache.org/docs/how_to/deploy/android.html

or https://octoml.ai/blog/using-swift-and-apache-tvm-to-develop...

or https://github.com/mlc-ai/mlc-llm

refulgentis · on Sept 12, 2023

My deepest thanks, I owe you one. Overlooked this completely. & spent dozens of hours learning way too much to still fall short of understanding how to make it work in ONNX.