
It's just the beginning; there's still optimization and some figuring out to do.

This fork is based on karpathy's llama2.c, and we try to mirror its progress and add our patches on top to improve performance, make it binary portable, and run it as a unikernel. However, there is a catch: this doesn't currently infer the 7B or bigger Meta Llama 2 models yet. It's too slow and memory consuming.
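
For context, the upstream workflow this fork mirrors looks roughly like this (a sketch based on karpathy's llama2.c; the exact build targets and model checkpoints for this fork's portable / unikernel builds differ):

# build the plain CPU inference binary, as in upstream llama2.c
gcc -O3 -o run run.c -lm
# run it against one of the small TinyStories checkpoints (the file name here is just an example)
./run stories15M.bin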

My plan is to get to a stage where we can actually infer larger models at a comfortable speed, like llama.cpp / ggml does, and add GPU acceleration along the way.

Also, this doesn't have a web API yet; I'll be adding that in the next update. Then it would actually make sense to deploy it on a server to test it out.

Right now you would have to manually spawn VM instances with qemu like this:

qemu-system-x86_64 -m 256m -accel kvm -kernel L2E_qemu-x86_64

or, to keep the console in your terminal instead of opening a graphical window:

qemu-system-x86_64 -m 256m -accel kvm -kernel L2E_qemu-x86_64 -nographic

and that's not very practical, especially as there is no web API yet. So see this more as a tech preview, a release-early-release-often kind of thing.

As I get time, I'll be adding a build for Firecracker and also writing instructions for spawning hundreds of baby Llama 2 KVM qemu / Firecracker instances on a powerful server.
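
Until then, here is a minimal sketch of how one might spawn a batch by hand: just a plain shell loop over the same qemu command as above (the log file names are only illustrative, and a Firecracker setup would need its own kernel/rootfs config):

# launch 100 headless instances, 256 MB RAM each, logging each console to a file
for i in $(seq 1 100); do
  qemu-system-x86_64 -m 256m -accel kvm -kernel L2E_qemu-x86_64 -nographic < /dev/null > llama_$i.log 2>&1 &
done
# wait for all background VMs to exit
wait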

Thank you for your interest. As per your suggestion, a comprehensive howto is planned. Feel free to file any issues / wants / suggestions at https://github.com/trholding/llama2.c , and I'll address them as I get time.

I'm stuck with bigger IRL projects, but if there is deep interest from the community I'll be sure to spend more time on this.



