Llama2.c L2E LLM – Multi OS Binary and Unikernel Release

AMICABoard · on Aug 26, 2023

Have you ever wanted to boot and inference a herd of 1000's of Virtual baby Llama 2 models on big ass enterprise servers? No? Well, now you can! (Almost baremetal)

Also drop the binary portable run.com cosmocc build on any OS and run! Truly portable. (Soon baremetal)

Special Thanks & Credits:

llama2.c - @karpathy

cosmopolitan - @jart

unikraft - @unikraft

Would love to hear your feed back here!

dingdingdang · on Aug 26, 2023

Would love some examples, practical a-z stuff; how does one make a couple of these run on a local (yes, large) server for instance?

AMICABoard · on Aug 26, 2023

It's just the beginning, still optimization and some figuring out to do.

This fork is based on karpathy's llama.c and we try to mirror its progress and add our patches to it to add performance, be binary portable or run as unikernel. However there is a catch, this doesn't currently infer the 7b or bigger meta llama 2 models yet. It's too slow and memory consuming.

My plan is to get to a stage where we can actually infer larger models at a comfortable speed like llama.cpp / ggml does, add GPU acceleration along the way.

Also this doesn't have a web api, I'll be adding that in the next update, then it would actually make sense to deploy it on a server to test it out.

Right now you would have to manually spawn vm instances with qemu like this:

qemu-system-x86_64 -m 256m -accel kvm -kernel L2E_qemu-x86_64

or

qemu-system-x86_64 -m 256m -accel kvm -kernel L2E_qemu-x86_64 -nographic

and that's not very practical especially as there is no web api yet. So see this more as a tech preview - release early release often thing.

Yeah as I'll get time, I'll be adding a build for firecracker and also write instructions to spawn 100's of baby llama 2 kvm qemu / firecracker builds on a powerful server.

Thank you for your interest. As per your suggestion, a comprehensive howto is planned. Feel free to add any issue / wants / suggestions to https://github.com/trholding/llama2.c , I'll address those as I get time.

I'm stuck with bigger IRL projects, but if there is deep interest from the community I'll be sure to spend more time on this.

Datagenerator · on Aug 26, 2023

This and also how to let them talk to each other and follow the conversation after setting up only the initial topic - that would be awesome!

AMICABoard · on Aug 26, 2023

If you add it as an issue, I'll address it in future. I like the idea, but with the current toy size small models it won't be too fun (I did a manual try). Once larger models run at good pace, it would be absolutely cool.

Datagenerator · on Aug 26, 2023

Thanks and love the reference to the wonderful blinking guru meditation we used to see on our Amiga

AMICABoard · on Aug 26, 2023

I grew up with the Amiga 500, it was my first love, how can I forget those times!

politelemon · on Aug 26, 2023

Is this similar to llama.cpp? I'm not very versed in this area.

AMICABoard · on Aug 26, 2023

This is a fork of https://github.com/karpathy/llama2.c

karpathy's llama2.c is like llama.cpp but it is written in C and the python training code is available in that same repo. llama2.c's goal is to be a elegant single file C implementation of the inference and an elegant python implementation for training.

His goal is for people to understand how llama 2 and LLM's work, so he keeps it simple and sweet. As the project progresses, so will features and performance improvements added.

Currently it can infer baby (small) Story models trained by Karpathy at a fast pace. It can also infer Meta LLAMA 2 7b models, but at a very slow rate such as 1 token per second.

So currently this can be used for learning or as a tech preview.

Our friendly fork tries to make it portable, performant and more usable (bells and whistles) over time. Since we mirror upstream closely, the inference capabilities of our fork is similar but slightly faster if compiled with acceleration. What we try to do different is that we try to make this bootable (not there yet) and portable. Right now you can get binary portablity - use the same run.com on any x86_64 machine running on any OS, it will work (possible due to cosmopolitan toolchain). The other part that works is unikernels - boot this as unikernel in VM's (possible due unikraft unikernel & toolchain).

See our fork currently as a release early and release often toy tech demo. We plan to build it out into a useful product.

regularfry · on Aug 26, 2023

Inspired by, but no shared code as far as I can see on a brief scan.

AMICABoard · on Aug 26, 2023

It is not inspired by, it's friendly fork, so its more than inspiration. Code has diverged a bit, yes I do try to share. Karpathy's llama2.c upstream project has clearly stated goals of elegance and simplicity, so code that has lot of pre-processor directives like ours won't help upstream. Apart from this we are very thankful to jart & contributors from the cosmopolitan project and also the unikraft folks without which binary portability or unikernels wouldn't have been possible.

TL;DR Simple elegant stuff that we added will be shared to upstream. Complex non elegant code won't be shared as it won't be accepted upstream.

regularfry · on Aug 26, 2023

That's your relationship to upstream llama2.c, which is fairly clear. politelemon was asking about the relationship to ggerganov's llama.cpp, which seems to have inspired the upstream llama2.c.

ragnarok123 · on Aug 26, 2023

Does it offer a local api so i can embed this in my python build?

AMICABoard · on Aug 26, 2023

If you add an issue you can do so on https://github.com/trholding/llama2.c

I have a planned a web api. Are you looking for python binding? I'm interested to know. The issues would keep me organized.

sa-code · on Aug 26, 2023

The ability to hook it into frameworks like streamlit would be huge. Sorta like llama-cpp-python

AMICABoard · on Aug 26, 2023

Agreed and yes would be awesome... But currently this does not infer large meta 7b model efficiently - like 1 token or lower per seconds. But the small toy story (not so useful) models are fast.

If the above mentioned API / python binding is ready, I'll make a streamlit interface demo. The streamlit demo should be simple. But I have to figure out python binding.

AMICABoard · on Aug 26, 2023

not yet

beardedwizard · on Aug 26, 2023

"serving as an information gateway for students without constant reliance on the internet." But llms without knowledge augmentation are terrible at this! How can students know what they read is true?

AMICABoard · on Aug 26, 2023

What you said is true, its something we would have to figure out on the way.

What I have in mind is -

1. Topic specialized models which are frequently updated maybe every month or two.

2. Fact Checking & Moderation specialized, models which moderate or do fact checking on other model's output.

Kind of a chicken and egg problem. But I believe on the way we will be able to minimize the effects of hallucinations through output validation (both neural and rule based).

beardedwizard · on Aug 26, 2023

Pretty reasonable, I suggest you mention something along these lines to avoid further misconceptions

AMICABoard · on Aug 26, 2023

Will be updated with the next commit.

beardedwizard · on Aug 26, 2023

An aside, any idea why this comment is being downvoted? Seems like a productive exchange.

AMICABoard · on Aug 26, 2023

Pressing the up arrow beside the comments is upvote right? I hope this doesn't have some inverted logic.

AMICABoard · on Aug 26, 2023

No idea. I have just upvoted your comment.

beardedwizard · on Aug 26, 2023

I figured it wasn't you :), anyway cheers buddy!

AMICABoard · on Aug 26, 2023

I have updated the description and added the requested clarification.

To avoid trolls and invisible downvotes, you can always use github issues. You are welcome there.

Thanks for the initial question. Ultimately it seems that there must always be a human in the loop somewhere.

bsima · on Aug 26, 2023

How can students know what the teacher tells them is true?

beardedwizard · on Aug 26, 2023

How can students ever know the truth when history is written by the victors? :)

While similar, I believe revisionist history is different than llm hallucination, but hard to say which is more dangerous. I guess it depends on the content. I suspect an llm won't jail you for questioning it, just happily hallucinate an apology and new answer :)

AMICABoard · on Aug 26, 2023

That's something we would have to figure out on the way - especially due to hallucinations. See the post under this for ideas on how I would try to solve that.