Refact Code LLM: 1.6B LLM for code that reaches 32% HumanEval (refact.ai)
181 points by kateklink on Sept 4, 2023 | 100 comments


This post is misleading, in a way that is hard to do accidentally.

  - They compare the performance of this model to the worst 7B code llama model.  The base code llama 7B python model scores 38.4% on humaneval, versus the non-python model, which only scores 33%.
  - They compare their instruct tuned model to non-instruct-tuned models.  Instruction tuning can add 20% or more to humaneval performance.  For example, WizardLM 7B scores 55% on humaneval [1], and I've trained a 7B model that scores 62% [2].
  - For another example of instruction tuning, Stablecode instruct tuned benchmarks at 26%, not the 20% they cite for the base model [3]
  - Starcoder, when prompted properly, scores 40% on humaneval [4]
  - They do not report their base model performance (as far as I can tell)
This is interesting work, and a good contribution, but it's important to compare similar models.

[1] https://github.com/nlpxucan/WizardLM

[2] https://huggingface.co/vikp/llama_coder

[3] https://stability.ai/blog/stablecode-llm-generative-ai-codin...

[4] https://github.com/huggingface/blog/blob/main/starcoder.md


Hi, thank you for your attention!

> They compare the performance of this model to the worst 7B code llama model. The base code llama 7B python model scores 38.4% on humaneval, versus the non-python model, which only scores 33%.

We are comparing multilingual models, and we are not focused on Python-finetuned versions.

> They compare their instruct tuned model to non-instruct-tuned models. Instruction tuning can add 20% or more to humaneval performance. For example, WizardLM 7B scores 55% on humaneval [1], and I've trained a 7B model that scores 62% [2].

> For another example of instruction tuning, Stablecode instruct tuned benchmarks at 26%, not the 20% they cite for the base model [3]

We have two separate comparisons (see https://huggingface.co/smallcloudai/Refact-1_6B-fim): one for completion-based models and one for instruction-following models, with different HumanEval formats. But we consider our model first and foremost a completion (FIM) model, and we used 85% non-instruction-following data to make the final model. Chat functionality is really limited for such small models.

> Starcoder, when prompted properly, scores 40% on humaneval

Yep, that is right. But it's worth mentioning that the StarCoder model reached 40% only after being further finetuned exclusively on Python.

> They do not report their base model performance (as far as I can tell)

Our base model gets around 20-23% on HumanEval. But that number isn't very indicative, since the base model was trained with 50% non-code data (considering the model's size, it was really hard to keep the model converging).


That’s an impressive result

The OpenRAIL license seems to reference some sort of limitations on safety and unethical use, but I can't see where in the repo it's spelled out precisely what the authors have in mind.


One misleading thing is the notion that you need a 1-2B model to run on commodity hardware.

This is not really true. Llama 7B runs with Vulkan/llama.cpp on ~8GB smartphones and ~12GB laptops. That ease is going to get much better over time, as lower RAM hardware starts dropping out of the market and the Vulkan implementations get more widespread.
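For a rough sense of the arithmetic behind that (a sketch only; the 0.5 GB overhead figure for KV cache and runtime buffers is an illustrative assumption, not a measurement):

  # Back-of-the-envelope RAM needed to hold a quantized model.
  def approx_memory_gb(params_billion, bits_per_weight, overhead_gb=0.5):
      weights_gb = params_billion * bits_per_weight / 8   # billions of params x bits / 8 ~= GB of weights
      return weights_gb + overhead_gb

  print(approx_memory_gb(7, 4))     # ~4.0 GB: a 4-bit 7B model fits an 8GB phone
  print(approx_memory_gb(1.6, 8))   # ~2.1 GB: an 8-bit 1.6B model fits a 3GB GPU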

For users trying to run LLMs on 8GB or less machines, the AI Horde approach of distributed models seems much more practical anyway.


Ah, but have no fear - as lower RAM hardware starts dropping out of the market, the RAM usage of Microsoft Teams will increase to compensate!

(Not even /s - while the developers of LLM applications may have 64GB RAM in their laptops or desktops, the less-technical early adopters of LLMs running locally are likely to be power users with lower-powered laptops, much more stringent RAM limits, and numerous line-of-business applications and browser tabs contending for that RAM. Causing those applications to be swapped onto disk will almost certainly result in a degraded overall experience that could easily be blamed on the LLM application itself.)


Yes, 7B is perfectly usable on low-end hardware if you're using it for instruction tuning/chat.

But for code completion in an IDE, where it has to react as you type, every 100 millisecond delay in response time is noticeable.

Even with a 24GB GPU, a 7B model doesn't feel snappy enough for code-completion in an IDE.


GPU RAM quantity isn’t typically correlated to inference rate. Precision/quantization levels do affect model size, which will affect inference rate. However, I would expect a smaller model to be faster (less RAM).


Llama (and many other LLMs, I presume) is so memory-bandwidth-bound that model size is a decent indicator of inference rate.

The smaller the model, the less has to be read from RAM for every single token.

Batching mixes up this calculus a bit.
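A crude way to see it (the bandwidth and size numbers below are illustrative assumptions): if decoding is memory-bandwidth-bound, every generated token requires streaming roughly the whole set of weights through the memory bus, so the ceiling is bandwidth divided by model size.

  # Naive single-stream decode ceiling for a memory-bandwidth-bound model.
  def max_tokens_per_sec(model_size_gb, mem_bandwidth_gb_s):
      return mem_bandwidth_gb_s / model_size_gb

  print(max_tokens_per_sec(3.5, 50))   # 4-bit 7B on ~50 GB/s laptop RAM   -> ~14 tok/s
  print(max_tokens_per_sec(1.0, 50))   # a ~1 GB 1.6B model on the same RAM -> ~50 tok/s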


This can be addressed with token streaming and input caching.

Would that be enough? shrug


This is true! Although I'm also really excited at the potential speed (both for loading the model and token generation) of a 1B model for things like code completion.


> the AI Horde approach of distributed models seems much more practical anyway.

I wasn't aware this was a term of art. Is there a definitive blog post or product explaining this approach?


This is a reference to Kobold Horde, a distributed volunteer network of GPUs that can be inferenced upon.


^

I didn't mean to imply splitting llama up between machines (though that is a thing with llama.cpp), but a pool of clients and servers who make requests and process them:

https://lite.koboldai.net/

A few users with half decent PCs can serve a much larger group of people, and the "lesser" hosts can host smaller models to "earn" access to larger ones.


Perhaps the wrong thread to ask this question... Is it not possible to load a model on something like an NVMe M.2 drive instead of RAM? It's slower of course, but only 5-10x if I understand correctly.


Yes but they’re slow enough on normal hardware for that 5-10x to be painful…


Can you RAID them?


Technically yes?

But it's way beyond the point where it's going to help LLMs. CPU RAM is already "too slow" in machines big enough for multiple NVMe SSDs.


Yeah, but I remember thinking to myself every few years that surely next year will be the year that base-model machines start at 32/64/… GB - but alas, it's nearly the end of 2023 and your average computer still seems stuck on a measly 16GB! I don't think average RAM size on consumer machines has increased at all in the last ~8 years or so.


It actually kind of makes sense.

RAM is only about 6x the speed of SSDs for sequential access. Most people don't actually need truly random access to all that much data; they're mostly streaming video or loading video game assets onto their GPU. So they shift spending to other components like the video card, monitors, etc. that actually provide significant value.

Which is how you get people with 16 GB of system RAM using graphics cards that also have 16GB of RAM.


7B runs on my 4GB VRAM machine (8GB system memory), i.e. quantization helps a lot too.


Hey, I have a genuine question:

What is the point of a new model that isn’t better than the best possible model (example: OpenAI GPT-4)?

What’s the point in having a smaller model? Who cares?

—-

This is a real, genuine question that I don’t have a clear answer to. Excuse my ignorance, plz enlighten your boi.


GPT4 is expensive to run, even more expensive to finetune, and for all practical purposes can’t be run offline (because the model is too big to run outside of a huge data center). Evaluation latency is also an issue for many usecases, and you have to share your query with openai, so you can’t run sensitive queries. The output is also controlled/censored by OpenAI.

Here’s a few usecases that I wouldn’t want to use OpenAI/GPT for

- Advanced autocomplete for texting and private communications

- Querying sensitive document databases like emails

- Traveling in low connectivity areas

- Politically incorrect usecases (generating erotic content for example)

List kinda goes on and on


> GPT4 is expensive to run, even more expensive to finetune

GPT4 can't even be finetuned at the moment (though I expect that to change).


It can be finetuned. Bing is a finetuned GPT-4.


I'd assume that that "can't" there is about what's publicly available, not what's technically possible.


It’s obviously technically feasible, it’s just not commercially offered…


IMO, the main reasons are (but are definitely not limited to):

- You can fine tune these models for very specific tasks, which GPT-4 might not be as good at.

- Open source models are free. You can use them as much as you want without worrying about a $xx,xxx bill at the end of the month which makes tinkering with them easier.

- Smaller models like this can run on consumer hardware, even phones, and can run offline.

- Privacy and not having to abide by a third parties terms. You don't have to deal with "As a large language model...", especially with uncensored models.

- Tools like jsonformer https://github.com/1rgs/jsonformer are not possible with OpenAIs API.

- It's also just really cool, let's be honest.


1) People can run a 1.6B model for free on consumer hardware.

2) Any model that runs on computational resources you own or lease will have more privacy than an explicit cloud offering, and running completely on your own local hardware is private. This means you don't have to think twice about asking the LLM about the proprietary code or information you are working on.

3) Smaller models gain the performance improvements from all the other advances in runtimes and quantization, allowing for even more consumer-friendly offline use.

4) Oh yeah, offline use. That could expand use cases to having LLMs baked into operating systems directly, including leading phones.

5) Showing what's possible: pushing towards the benchmarks of the best possible model while using less computational resources. This also makes the hosts of the best possible model realize that they could either A) use less computational resources and increase capacity for their users, or B) further improve their own model because of competition. Basically, if ChatGPT 4 were using similar improvements in technology across all areas of reasoning, there never would have been a rate limit on ChatGPT 4.

6) More demand for other computational resources. Nvidia is backordered until maybe Q2 2024 right now. If people realize AMD or even their ARM chips can offer the same performance with the right combination of hardware and software, it alleviates pressure on other ventures that want compute power.


The other answers are great, but to add more

- You can run it behind an air-gap, where your systems are disconnected from the world.

- You can run it on the edge with low or no internet connectivity

- You do not need to worry about breaching geographic data restrictions, e.g.: medical data from Country X cannot leave Country X


Your question sounds like asking why we need Alpine Linux when we have Ubuntu, or why we have SQLite when we have Postgres.

I think the point is to reach a baseline of something being super lightweight yet still useful that could be production for a number of use cases.


You can use it 100% locally, and it doesn't cost anything.


Imagine being on Mars and running on a small PV panel and needing to code a bugfix in your oxygen supply system through the wire with Microsoft Earth(tm) or something


Just trying out the official container image for self-hosting alongside the VSCode extension - I've got to say I'm really impressed with the scaffolding, especially for an early-stage project.

The web interface for the LLM server is especially nice and clean compared to many of the others I've tried - and it "just works". Very interested to see how this evolves.


What's the difference between 1% and 99% on HumanEval? What does it really tell you?


For pass@1, HumanEval tells how well the model solves tasks from a set, given only one chance to solve each. It's not a perfect metric; there are others like DS-1000 and MBPP (we have included them on the HuggingFace model card). HumanEval is good for benchmarking against other models, as it gives a quick idea of how powerful the model is.


> given only one chance to solve it

My understanding is that there are two usages of the pass@{number} syntax. The HumanEval/Codex paper interprets the {number} as the number of attempts [0]. However, language modelers seem to use it to denote the number of few-shot example demonstrations given in the context. These are starkly different, and I wish the syntax wasn't overloaded.

---

[0] https://arxiv.org/pdf/2107.03374.pdf

> Kulal et al. (2019) evaluate functional correctness using the pass@k metric, where k code samples are generated per problem, a problem is considered solved if any sample passes the unit tests, and the total fraction of problems solved is reported.
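For reference, the unbiased estimator defined in the Codex paper [0] (generate n samples per problem, count the c that pass the unit tests, then estimate the chance that at least one of k randomly drawn samples passes):

  import numpy as np

  def pass_at_k(n, c, k):
      # n: samples generated per problem, c: samples that passed the unit tests
      if n - c < k:
          return 1.0
      return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

  print(pass_at_k(n=200, c=20, k=1))   # 0.10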


I don't trust any benchmarks for any LLM that's not coming from FB, Google, OpenAI, Anthropic, or Microsoft. These models are so dynamic that simple benchmark numbers never tell the whole story of the quality of the model. Take, for instance, a recent post by Anyscale claiming their fine-tuning of Llama 2 was competitive with OpenAI's model. The reality is that their fine-tuned model is basically worthless, and was only competitive along a single metric on a very narrow, commoditized task. It's a great way to get clicks by posting these metrics, though.


They could have easily benchmarked with the Spider SQL test set but they didn’t.

I have a feeling that the more robust models might be the ones that don’t perform best on benchmarks right away.


The community has fine-tuned some really good llama models (much better than llama-chat), but I get what you're saying.

I've been testing the best-performing models on the huggingface leaderboard lately. Some of them are really impressive, and others are so bad that I second-guess the prompt format or whether the benchmarked model is actually the same one I'm testing.


Which models were really bad?


I was keeping track of the good ones, and don't have many notes on the bad ones.

I do remember testing "LoKuS" last week and it was quite terrible (sometimes gave completely off-topic answers). It scored as one of the highest 13B models on the leaderboard (~65 average), but appears to be removed now.


This is the goal of humaneval, correct?


Congrats on your achievement! I'm curious about your end goal. Do you aim to beat GitHub Copilot's performance and convince devs to use Refact for code completion instead of GitHub Copilot? I want to understand the motivation behind these different code-completion models that are not solely for academic research.


We want to help developers who need either an on-premise or a permissively licensed code assistant; Copilot offers neither. We also wanted to lower the barriers for self-hosting, so the model runs on most GPUs with just 3GB of RAM, plus make the code completions fast and efficient (understanding the entire context, not just the previous tokens).


You can use it in practice; that was the goal of this particular model! It's fast and runs on your own hardware if you want it to.


The title is misleading. This model is not "SOTA for the size"; there are smaller models that do 10-18% better in absolute score. The text says it's SOTA "among similar models", where they probably compare with other models with permissive licensing.


"Permissive" usually refers to Free Software or Open Source licenses without copyleft requirements. OpenRAIL is a proprietary license because it imposes usage restrictions, contrary to both the Free Software and Open Source definitions.


AFAIK there is only one model that does better: phi-1, and it's Python-only and does not support fill-in-the-middle, so you can't really use it.


Phi-1-small also scores higher with 350M parameters. It helps to be specific about what the comparison is against when claiming SOTA.


License text: https://drive.google.com/file/d/16NqKiAkzyZ55NClubCIFup8pT2j... [PDF]

See last page for restrictions


> In any way that violates any applicable national, federal, state, local or international law or regulation;

Darn! Foiled again! I was planning on breaking some federal laws, but the license says that I can't ;( \s

The OpenRAIL license has to be the worst license in existence claiming to be "open".

> You shall undertake reasonable efforts to use the latest version of the Model.

Plea to folks releasing models: Please stop using this user-hostile and deranged license


Thanks. Those look like pretty relaxed terms.


Say I want to fine tune a Golang specific model. How much $ and effort would I have to put in? Would using this as a base help in any way compared to starting from llama?


Maybe it makes sense to start from Code Llama, not Llama :D I think a Golang-specific model will not be that different from a multi-language model, but it will definitely work better after finetuning on your code. Check out the Refact self-hosting Docker in a couple of days; finetune will be there soon. It will take you 1 GPU and almost no money )


All these LLMs are pretty general if I understand correctly. Are there any efforts to create specialized models (other than for coding)? Or, what would be even better, "extract" certain areas from existing LLMs as a way to specialize them? With the goal to drastically reduce model size to be able to run on less powerful devices.

E.g. a model specializing in chemistry doesn't need to include data on world's history or to be able to write poetry.


I am not an expert, but it still has to learn human language/grammar/what have you, and that is where scale seems to matter. Fine-tuning on a subset of knowledge after that is typically how domain specialisation is achieved, by my understanding.


Domain specialization is done by continuing the full training process. Fine tuning is more for changing the style of the output than adding new knowledge.


What if the initial training already contains all necessary data for a particular specialization? What would be the benefit of continuing the training process?


Imagine someone tells you about how someone committed a crime and asks you to summarise. Now imagine the same question is asked to a lawyer. Even if you both knew the same facts, the response would be very different in style, highlighted points, mentioned references, etc. The domain specific fine tuning does exactly that. Sure, sometimes you can get very close by changing the prompt to include "respond like a lawyer in situation X with following extra rules", but not always and the fine-tuning gives better results and shorter prompt.


I was wondering about that too. Would it be possible in the future to have a more modular approach to LLMs? Have a module that is responsible for basic knowledge/language/grammar and then other more specialized modules that are added selectively.

I don't know enough about fine-tuning, not sure if the process is capable of removing "unused" parts of the model (I guess not possible, similar to un-learning).


There are various methods for removing unused parts of the model, like distillation. The idea is generally that you always lose performance, but hopefully you lose more size/runcost than you do performance, proportionately.
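As a minimal sketch of what distillation looks like in practice (soft-target knowledge distillation with a temperature, written in PyTorch; the temperature value is just an illustrative default):

  import torch.nn.functional as F

  def soft_target_kd_loss(student_logits, teacher_logits, T=2.0):
      # The student matches the teacher's temperature-softened distribution;
      # the T*T factor keeps gradient magnitudes comparable across temperatures.
      return F.kl_div(
          F.log_softmax(student_logits / T, dim=-1),
          F.softmax(teacher_logits / T, dim=-1),
          reduction="batchmean",
      ) * (T * T)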


So, so many. There are RAG-specific models (Contextual AI), finance-specific models (Bloomberg GPT, Brightwave), contact-center models (Cresta), even telco models (Anthropic).


Very interesting. Thanks for replying!


Another model that we'll soon forget it ever existed.


For the sake of not giving Microsoft and a few other tech giants immense power over the world, I really do hope the cost and efficiency of LLMs improve dramatically, until we can get GPT-4-equivalent models trained on a few graphics cards and running offline on an iPhone. Really rooting for these kinds of projects until someone makes the breakthrough.


You may be interested in what we’re working on at Symbolica AI.

We’re using formal logic in the form of abstract rewrite systems over a causal graph to perform geometric deep learning. In theory it should be able to learn the same topological structure of data that neural networks do, but using entirely discrete operations and without the random walk inherent to stochastic gradient descent.

Current experiments are really promising, and assuming the growth curve continues as we scale up, you should be able to train a GPT-4-scale LLM in a few weeks on commodity hardware (we are currently using a desktop with four 4090s), and be able to do both inference and continual fine-tuning/online learning on device.


> We’re using formal logic in the form of abstract rewrite systems over a causal graph to perform geometric deep learning. In theory it should be able to learn the same topological structure of data that neural networks do, but using entirely discrete operations and without the random walk inherent to stochastic gradient descent.

Abstract rewriting like a computer algebra system's (e.g. Wolfram's) term-rewriting equation simplification method?


Heavily influenced by Wolfram's work on metamathematics and the physics project, in so far as using a rewrite system to uncover an emergent topology; we're just using it to uncover the topology of certain data (assuming that the manifold hypothesis is correct), rather than the topology of fundamental physics as he did.


Sounds cool, but what are the drawbacks?


Biggest drawback is that since the structure is all discrete, it is inherently weak at modeling statistical distributions. For example, it'll likely never best a neural network at stock market prediction or medical data extrapolation.

However, for things that are discrete and/or causal in nature, we expect it to outperform deep learning by a wide margin. We're focused on language to start, but want to eventually target planning and controls problems as well, such as self-driving and robotics.

Another drawback is that the algorithm as it stands today is based on a subgraph isomorphism search, which is hard. Not hard as in tricky to get right like Paxos or other complex algorithms; NP-hard, so very difficult to scale. We have some fantastic Ph.D.s working with us who focus on optimization of subgraph isomorphism search, and category theorists working to formalize what constraints we can relax without affecting the learning mechanism of the rewrite system, so we're confident that it's achievable, but the time horizon is currently unknown.
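For readers unfamiliar with the problem: subgraph isomorphism asks whether one graph contains a copy of another, and the search cost grows combinatorially with graph size. A toy illustration with networkx (the graphs here are made-up examples, not anything from our system):

  import networkx as nx
  from networkx.algorithms import isomorphism

  haystack = nx.erdos_renyi_graph(30, 0.2, seed=0)   # "data" graph to search in
  needle = nx.cycle_graph(5)                         # pattern to look for

  matcher = isomorphism.GraphMatcher(haystack, needle)
  print(matcher.subgraph_is_isomorphic())            # exhaustive search in the worst case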


It doesn't exist at scale yet.


Especially interested in learning directly on geometries, please keep us updated and share results


Would definitely recommend Bronstein et. al's work on geometric deep learning! https://geometricdeeplearning.com

That's effectively the right hand side of the bridge that we're building between formal logic and deep learning. So far their work has been viewed mainly as descriptive, helping to understand neural networks better, but as their abstract calls out: "it gives a constructive procedure to incorporate prior physical knowledge into neural architectures and provide principled way to build future architectures yet to be invented". That's us (we hope)!


I would like to subscribe to your newsletter, we'd be super interested in this at Brainchain AI.

Drop me a link at (my first name) @ brainchain dot AI if you'd like to chat, I'd love to hear more about what you're working on!


Really cool stuff! Do you have any recommendations of where we could learn more?


The key word there is models. Per the leaked GPT-4 details, it's not a single model but a 16-expert mixture of experts (MoE). There's probably quite a lot of complexity on the backend in sourcing the right model for the right query. In short, it's probably better for the open-source community to focus on single models for specific tasks, as evidenced by Code Llama. A system like GPT-4 is still difficult to replicate. Getting something to run on consumer hardware for specific tasks like code gen at almost GPT-4 level is doable.


>There’s probably quite a lot of complexity on the backend in sourcing the right model for the right query.

This isn't how sparse MoE models work. There isn't really any complexity like that, and a different expert can be picked for each token.

Sparse models aren't an ensemble of models.


There are many MoE architectures, and I suppose we don't know for sure which one OpenAI is using. The "selection" of the right mix of experts is something that the network learns, and it's not a complex process. Certainly no more complex than training an LLM.
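A minimal sketch of what that learned selection can look like (top-k token routing in PyTorch; OpenAI's actual router is not public, so this is purely illustrative):

  import torch
  import torch.nn as nn

  class TopKRouter(nn.Module):
      def __init__(self, d_model, n_experts, k=2):
          super().__init__()
          self.gate = nn.Linear(d_model, n_experts)  # learned gate, trained with the rest of the network
          self.k = k

      def forward(self, x):                          # x: (n_tokens, d_model)
          scores = self.gate(x)                      # (n_tokens, n_experts)
          weights, expert_ids = scores.topk(self.k, dim=-1)
          weights = weights.softmax(dim=-1)          # per-token mixing weights
          return weights, expert_ids                 # caller dispatches tokens to the chosen experts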


When I wrote "backend", it was a poor choice of word. "Meta-model" is probably a better choice of wording.

I hope it did not detract too much from the point of focusing on subtasks and modalities for FOSS, as GPT-4 was built on a $163 million budget.

Finally, good point. We've got no idea what OpenAI's MoE approach is or how it works. I went back to Meta's 2022 NLLB-200 system paper and they didn't even publish the exact details of the router (gate).


Yeah, good point on the importance of FOSS focusing on subtasks... because FOSS isn't going to be spending $150M+ training a model any time soon without something like government backing.


I think with or without algorithmic advantages hardware will improve for local model running. There’s an immense amount of capital being invested in hardware improvement and that will absolutely trickle down.

My sincere belief is that local models is the way of the future, with flexible base models adapted via Lora and context to specific use cases. I think open source models and techniques are inexorable at this point barring some sort of regulatory moat and will rival commercial models in all but extreme cases.


I don't. How do you maintain control and prevent mass harm in that case? I don't see any way out other than gatekeeping similar to what we apply to ownership and use of high explosives and radiological weapon tooling.

At all other times I support tech freedom. I use libre software, I use Tor, I donate to privacy and FOSS organizations constantly. I only write my software projects under an AGPL license. AI is qualitatively different. A world run amok with intelligent infinite Sybils is not good for anyone. I hope massive compute continues to be necessary; it may be the only hard chokepoint we have to keep a handle on the beast.


> For the sake of not giving Microsoft and a few other tech giants immense power over the world

I agree with and appreciate the sentiment, but it feels way too late for that. These people do have and exert direct control over pretty much all of our digital devices. It's funny (or sad) that we only seem to care about this when shiny doodads like AI come around every so-often.


That could also help tech giants build even larger/more capable models cheaply. Ideally there would be a hard ceiling of LLM capability that even massive amounts of hardware couldn't exceed, allowing inexpensive hardware to catch up.


I personally hope that LLMs have no such limits. The good these tools can do is immeasurable.

I can already run Llama 2 @70b on my laptop, and that’ll look like a quaint old AI artifact in 5-7 years. I think the consumer market will keep pace yet stay well below SotA, just as it always has. That still leaves plenty of room for incredible open-source stuff!


To be fair, if that is achieved, then the massive models that tech giants produce will probably be phenomenal.


We’ve finished training a new code model Refact LLM which took us about a month. The main use-case is for blazing-fast code completion with fill-in-the-middle, additionally, the model could reply to chat prompts.

It has much better performance than all of the code models of similar size, and almost reaches the same HumanEval score as StarCoder while being 10x smaller.

Thanks to its small size, it works on most modern GPUs, requiring just 3GB of RAM.

You can try self-hosting it in Refact https://github.com/smallcloudai/refact/ and get a local fast copilot alternative with decent suggestions.

Weights and model card https://huggingface.co/smallcloudai/Refact-1_6B-fim.

We would love to hear your feedback!
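For anyone who just wants to poke at the weights, here's a quick sketch of trying it with plain transformers (the FIM special-token names below follow the StarCoder-style convention and are an assumption; check the model card for the exact prompt format and whether trust_remote_code is required):

  from transformers import AutoModelForCausalLM, AutoTokenizer

  checkpoint = "smallcloudai/Refact-1_6B-fim"
  tokenizer = AutoTokenizer.from_pretrained(checkpoint)
  model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)

  # Fill-in-the-middle: the model completes the code between prefix and suffix.
  prompt = "<fim_prefix>def median(numbers):\n<fim_suffix>\n    return result\n<fim_middle>"
  inputs = tokenizer(prompt, return_tensors="pt")
  outputs = model.generate(**inputs, max_new_tokens=64)
  print(tokenizer.decode(outputs[0]))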


How does it compare to Copilot? A metric I'd like to see is the % of proposed completions accepted by a human user. If you had an extension that 50% of the time proposed a Copilot completion and 50% of the time a Refact completion (blind to the user), you could come up with a metric like this.


Does ctransformer (https://github.com/marella/ctransformers#supported-models) support running refact?

I see that model type "gpt_refact" in https://huggingface.co/smallcloudai/Refact-1_6B-fim/blob/mai...


Is it possible to run it as an LSP so that it can be used in editors other than VSCode and JetBrains? (sorry if this question is completely mad, my understanding of how these things work is extremely limited)


Yes, it's coming up in a couple of weeks.


Great, thanks. I'll keep an eye out.


Hi, I tried to finetune the Refact model using the evolved Code Alpaca dataset, but the loss is always bigger than 2. I tried some different params but it doesn't work. Can you give me some advice?


> almost reaches the same HumanEval

how can you tell that HumanEval hasn't leaked into your training data in some form?


Hi! We ran LSH filtering over our datasets to remove all code that could be similar to HumanEval samples.
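For readers curious what that kind of decontamination looks like, here's a minimal sketch with the datasketch library (the whitespace shingling, threshold, and toy data are illustrative assumptions, not our actual pipeline):

  from datasketch import MinHash, MinHashLSH

  def minhash(code, num_perm=128):
      m = MinHash(num_perm=num_perm)
      for token in code.split():               # crude whitespace shingling for illustration
          m.update(token.encode("utf8"))
      return m

  benchmark_solutions = ["def add(a, b):\n    return a + b"]                   # stand-in for HumanEval
  training_samples = ["def add(a, b):\n    return a + b", "print('hello')"]    # stand-in for training data

  lsh = MinHashLSH(threshold=0.8, num_perm=128)
  for i, solution in enumerate(benchmark_solutions):
      lsh.insert("bench_%d" % i, minhash(solution))

  clean = [s for s in training_samples if not lsh.query(minhash(s))]
  print(clean)                                 # the near-duplicate of the benchmark is dropped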


So, we have to trust your procedure...


You can check whether the model predicts canonical solutions from HumanEval. I understand it's not ideal, but at least you can check it yourself.

There are a bunch of other benchmarks too, check out the page https://huggingface.co/smallcloudai/Refact-1_6B-fim

Also, feel free to run any new benchmarks


Tangentially related: Refact recently shared 4 bounties worth $9,000 to help improve their tech!

https://algora.io/org/smallcloudai/bounties

Disclaimer: I'm a cofounder of Algora, the platform enabling these bounties.



Model Stats

- Architecture: LLaMA-like model with multi-query attention

- Objectives: Fill-in-the-Middle, Chat

- Context size: 4096 tokens

- Pretraining tokens: 1.2T

- Finetuning tokens: 40B

- Precision: bfloat16

- GPUs: 64 NVIDIA A5000

- Training time: 28 days



