Refact Code LLM: 1.6B LLM for code that reaches 32% HumanEval (refact.ai)
181 points by kateklink on Sept 4, 2023 | 100 comments


This post is misleading, in a way that is hard to do accidentally.

  - They compare the performance of this model to the worst 7B code llama model.  The base code llama 7B python model scores 38.4% on humaneval, versus the non-python model, which only scores 33%.
  - They compare their instruct tuned model to non-instruct-tuned models.  Instruction tuning can add 20% or more to humaneval performance.  For example, WizardLM 7B scores 55% on humaneval [1], and I've trained a 7B model that scores 62% [2].
  - For another example of instruction tuning, Stablecode instruct tuned benchmarks at 26%, not the 20% they cite for the base model [3]
  - Starcoder, when prompted properly, scores 40% on humaneval [4]
  - They do not report their base model performance (as far as I can tell)
This is interesting work, and a good contribution, but it's important to compare similar models.

[1] https://github.com/nlpxucan/WizardLM

[2] https://huggingface.co/vikp/llama_coder

[3] https://stability.ai/blog/stablecode-llm-generative-ai-codin...

[4] https://github.com/huggingface/blog/blob/main/starcoder.md


Hi, thank you for your attention!

> They compare the performance of this model to the worst 7B code llama model. The base code llama 7B python model scores 38.4% on humaneval, versus the non-python model, which only scores 33%.

We are comparing multilingual models, and we are not focused on Python-finetuned versions.

> They compare their instruct tuned model to non-instruct-tuned models. Instruction tuning can add 20% or more to humaneval performance. For example, WizardLM 7B scores 55% on humaneval [1], and I've trained a 7B model that scores 62% [2].

> For another example of instruction tuning, Stablecode instruct tuned benchmarks at 26%, not the 20% they cite for the base model [3]

We have two separate comparisons (see https://huggingface.co/smallcloudai/Refact-1_6B-fim): one for completion-based models and one for instruction-following models, with different HumanEval formats. But we consider our model first and foremost a completion (FIM) model, and we used 85% non-instruction-following data to make the final model. Chat functionality is really limited for such small models.

> Starcoder, when prompted properly, scores 40% on humaneval

Yep, that is right. But it's worth mentioning that the StarCoder model reached 40% only after being further finetuned exclusively on Python.

> They do not report their base model performance (as far as I can tell)

Our base model gets around 20-23% on HumanEval. But that number isn't very indicative, since the base model was trained with 50% non-code data (considering the model's size, it was really hard to keep the model converging).


That’s an impressive result

The OpenRAIL license seems to reference some sort of limitations on safety and unethical use, but I can't see where in the repo it's spelled out precisely what the authors have in mind.


One misleading thing is the notion that you need a 1-2B model to run on commodity hardware.

This is not really true. Llama 7B runs with Vulkan/llama.cpp on ~8GB smartphones and ~12GB laptops. That ease is going to get much better over time, as lower RAM hardware starts dropping out of the market and the Vulkan implementations get more widespread.
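For a rough sense of the arithmetic behind that (a sketch only; the 0.5 GB overhead figure for KV cache and runtime buffers is an illustrative assumption, not a measurement):

  # Back-of-the-envelope RAM needed to hold a quantized model.
  def approx_memory_gb(params_billion, bits_per_weight, overhead_gb=0.5):
      weights_gb = params_billion * bits_per_weight / 8   # billions of params x bits / 8 ~= GB of weights
      return weights_gb + overhead_gb

  print(approx_memory_gb(7, 4))     # ~4.0 GB: a 4-bit 7B model fits an 8GB phone
  print(approx_memory_gb(1.6, 8))   # ~2.1 GB: an 8-bit 1.6B model fits a 3GB GPU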

For users trying to run LLMs on 8GB or less machines, the AI Horde approach of distributed models seems much more practical anyway.


Ah, but have no fear - as lower RAM hardware starts dropping out of the market, the RAM usage of Microsoft Teams will increase to compensate!

(Not even /s - while the developers of LLM applications may have 64GB RAM in their laptops or desktops, the less-technical early adopters of LLMs running locally are likely to be power users with lower-powered laptops, much more stringent RAM limits, and numerous line-of-business applications and browser tabs contending for that RAM. Causing those applications to be swapped onto disk will almost certainly result in a degraded overall experience that could easily be blamed on the LLM application itself.)


Yes, 7B is perfectly usable on low-end hardware if you're using it for instruction tuning/chat.

But for code completion in an IDE, where it has to react as you type, every 100 millisecond delay in response time is noticeable.

Even with a 24GB GPU, a 7B model doesn't feel snappy enough for code-completion in an IDE.


GPU RAM quantity isn’t typically correlated to inference rate. Precision/quantization levels do affect model size, which will affect inference rate. However, I would expect a smaller model to be faster (less RAM).


Llama (and many other LLMs, I presume) is so memory-bandwidth-bound that model size is a decent indicator of inference rate.

The smaller the model, the less has to be read from RAM for every single token.

Batching mixes up this calculus a bit.
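A crude way to see it (the bandwidth and size numbers below are illustrative assumptions): if decoding is memory-bandwidth-bound, every generated token requires streaming roughly the whole set of weights through the memory bus, so the ceiling is bandwidth divided by model size.

  # Naive single-stream decode ceiling for a memory-bandwidth-bound model.
  def max_tokens_per_sec(model_size_gb, mem_bandwidth_gb_s):
      return mem_bandwidth_gb_s / model_size_gb

  print(max_tokens_per_sec(3.5, 50))   # 4-bit 7B on ~50 GB/s laptop RAM   -> ~14 tok/s
  print(max_tokens_per_sec(1.0, 50))   # a ~1 GB 1.6B model on the same RAM -> ~50 tok/s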


This can be addressed with token streaming and input caching.

Would that be enough? shrug


This is true! Although I'm also really excited at the potential speed (both for loading the model and token generation) of a 1B model for things like code completion.


> the AI Horde approach of distributed models seems much more practical anyway.

I wasn't aware this was a term of art. Is there a definitive blog post or product explaining this approach?


This is a reference to Kobold Horde, a distributed volunteer network of GPUs that can be inferenced upon.


^

I didn't mean to imply splitting llama up between machines (though that is a thing with llama.cpp), but a pool of clients and servers who make requests and process them:

https://lite.koboldai.net/

A few users with half decent PCs can serve a much larger group of people, and the "lesser" hosts can host smaller models to "earn" access to larger ones.


Perhaps the wrong thread to ask this question... Is it not possible to load a model on something like an NVMe M.2 drive instead of RAM? It's slower of course, but only 5-10x if I understand correctly.


Yes but they’re slow enough on normal hardware for that 5-10x to be painful…


Can you RAID them?


Technically yes?

But it's way beyond the point where it's going to help LLMs. CPU RAM is already "too slow" in machines big enough for multiple NVMe SSDs.


Yeah, but I remember thinking to myself every few years that surely next year will be the year that base-model machines start at 32/64/… GB - but alas, it's nearly the end of 2023 and your average computer still seems stuck on a measly 16GB! I don't think average RAM size on consumer machines has increased at all in the last ~8 years or so.


It actually kind of makes sense.

RAM is only about 6x the speed of SSDs for sequential access. Most people don't actually need truly random access to all that much data; they're mostly streaming video or loading video game assets onto their GPU. So they shift spending to other components like the video card, monitors, etc. that actually provide significant value.

Which is how you get people with 16 GB of system RAM using graphics cards that also have 16GB of RAM.


7B runs on my 4GB VRAM machine (8GB system memory), i.e. quantization helps a lot too.


Hey, I have a genuine question:

What is the point of a new model that isn’t better than the best possible model (example: OpenAI GPT-4)?

What’s the point in having a smaller model? Who cares?

—-

This is a real, genuine question that I don’t have a clear answer to. Excuse my ignorance, plz enlighten your boi.


GPT4 is expensive to run, even more expensive to finetune, and for all practical purposes can’t be run offline (because the model is too big to run outside of a huge data center). Evaluation latency is also an issue for many usecases, and you have to share your query with openai, so you can’t run sensitive queries. The output is also controlled/censored by OpenAI.

Here’s a few usecases that I wouldn’t want to use OpenAI/GPT for

- Advanced autocomplete for texting and private communications

- Querying sensitive document databases like emails

- Traveling in low connectivity areas

- Politically incorrect usecases (generating erotic content for example)

List kinda goes on and on


> GPT4 is expensive to run, even more expensive to finetune

GPT4 can't even be finetuned at the moment (though I expect that to change).


It can be finetuned. Bing is a finetuned GPT-4.


I'd assume that that "can't" there is about what's publicly available, not what's technically possible.


It’s obviously technically feasible, it’s just not commercially offered…


IMO, the main reasons are (but are definitely not limited to):

- You can fine tune these models for very specific tasks, which GPT-4 might not be as good at.

- Open source models are free. You can use them as much as you want without worrying about a $xx,xxx bill at the end of the month which makes tinkering with them easier.

- Smaller models like this can run on consumer hardware, even phones, and can run offline.

- Privacy and not having to abide by a third parties terms. You don't have to deal with "As a large language model...", especially with uncensored models.

- Tools like jsonformer https://github.com/1rgs/jsonformer are not possible with OpenAIs API.

- It's also just really cool, let's be honest.


1) People can run a 1.6B model for free on consumer hardware.

2) Any model that runs on computational resources you own or lease will have more privacy than an explicit cloud offering, and running completely on your own local hardware is private. This means you don't have to think twice about asking the LLM about the proprietary code or information you are working on.

3) Smaller models gain the performance improvements from all the other advances in runtimes and quantization, allowing for even more consumer-friendly offline use.

4) Oh yeah, offline use. That could expand use cases to having LLMs baked into operating systems directly, including leading phones.

5) Showing what's possible: pushing towards the benchmarks of the best possible model while using less computational resources. This also makes the hosts of the best possible model realize that they could either A) use less computational resources and increase capacity for their users, or B) further improve their own model because of competition. Basically, if ChatGPT 4 were using similar improvements in technology across all areas of reasoning, there never would have been a rate limit on ChatGPT 4.

6) More demand for other computational resources. Nvidia is backordered until maybe Q2 2024 right now. If people realize AMD or even their ARM chips can offer the same performance with the right combination of hardware and software, it alleviates pressure on other ventures that want compute power.


The other answers are great, but to add more

- You can run it behind an air-gap, where your systems are disconnected from the world.

- You can run it on the edge with low or no internet connectivity

- You do not need to worry about breaching geographic data restrictions, e.g.: medical data from Country X cannot leave Country X


Your question sounds like asking why we need Alpine Linux when we have Ubuntu, or why we have SQLite when we have Postgres.

I think the point is to reach a baseline of something being super lightweight yet still useful that could be production for a number of use cases.


You can use it 100% locally, and it doesn't cost anything.


Imagine being on Mars and running on a small PV panel and needing to code a bugfix in your oxygen supply system through the wire with Microsoft Earth(tm) or something


Just trying out the official container image for self-hosting alongside the VSCode extension - I've got to say I'm really impressed with the scaffolding, especially for an early-stage project.

The web interface for the LLM server is especially nice and clean compared to many of the others I've tried - and it "just works". Very interested to see how this evolves.


What's the difference between 1% and 99% on HumanEval? What does it really tell you?


For pass@1, HumanEval tells how well the model solves tasks from a set, given only one chance to solve each. It's not a perfect metric; there are others like DS-1000 and MBPP (we have included them on the HuggingFace model card). HumanEval is good for benchmarking against other models, as it gives a quick idea of how powerful the model is.


> given only one chance to solve it

My understanding is that there are two usages of the pass@{number} syntax. The HumanEval/Codex paper interprets the {number} as the number of attempts [0]. However, language modelers seem to use it to denote the number of few-shot example demonstrations given in the context. These are starkly different, and I wish the syntax wasn't overloaded.

---

[0] https://arxiv.org/pdf/2107.03374.pdf

> Kulal et al. (2019) evaluate functional correctness using the pass@k metric, where k code samples are generated per problem, a problem is considered solved if any sample passes the unit tests, and the total fraction of problems solved is reported.
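For reference, the unbiased estimator defined in the Codex paper [0] (generate n samples per problem, count the c that pass the unit tests, then estimate the chance that at least one of k randomly drawn samples passes):

  import numpy as np

  def pass_at_k(n, c, k):
      # n: samples generated per problem, c: samples that passed the unit tests
      if n - c < k:
          return 1.0
      return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

  print(pass_at_k(n=200, c=20, k=1))   # 0.10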


I don't trust any benchmarks for any LLM that's not coming from FB, Google, OpenAI, Anthropic, or Microsoft. These models are so dynamic that simple benchmark numbers never tell the whole story of the quality of the model. Take, for instance, a recent post by Anyscale claiming their fine-tuning of Llama 2 was competitive with OpenAI's model. The reality is that their fine-tuned model is basically worthless, and was only competitive along a single metric on a very narrow, commoditized task. It's a great way to get clicks by posting these metrics, though.


They could have easily benchmarked with the Spider SQL test set but they didn’t.

I have a feeling that the more robust models might be the ones that don’t perform best on benchmarks right away.


The community has fine-tuned some really good llama models (much better than llama-chat), but I get what you're saying.

I've been testing the best-performing models on the huggingface leaderboard lately. Some of them are really impressive, and others are so bad that I second-guess the prompt format or whether the benchmarked model is actually the same one I'm testing.


Which models were really bad?


I was keeping track of the good ones, and don't have many notes on the bad ones.

I do remember testing "LoKuS" last week and it was quite terrible (sometimes gave completely off-topic answers). It scored as one of the highest 13B models on the leaderboard (~65 average), but appears to be removed now.


This is the goal of humaneval, correct?


Congrats on your achievement! I'm curious about your end goal. Do you aim to beat GitHub Copilot's performance and convince devs to use Refact for code completion instead of GitHub Copilot? I want to understand the motivation behind these different code-completion models that are not solely for academic research.


We want to help developers who need either an on-premise or a permissively licensed code assistant; Copilot offers neither. We also wanted to lower the barriers for self-hosting, so the model runs on most GPUs with just 3GB of RAM, plus make the code completions fast and efficient (understanding the entire context, not just the previous tokens).


You can use it in practice; that was the goal of this particular model! It's fast and runs on your own hardware if you want it to.


The title is misleading. This model is not "SOTA for the size"; there are smaller models that do 10-18% better in absolute score. The text says it's SOTA "among similar models", where they probably compare with other models with permissive licensing.


"Permissive" usually refers to Free Software or Open Source licenses without copyleft requirements. OpenRAIL is a proprietary license because it imposes usage restrictions, contrary to both the Free Software and Open Source definitions.


AFAIK there is only one model that does better: phi-1, and it's Python-only and does not support fill-in-the-middle, so you can't really use it.


Phi-1-small also scores higher with 350M parameters. It helps to be specific about what the comparison is against when claiming SOTA.


License text: https://drive.google.com/file/d/16NqKiAkzyZ55NClubCIFup8pT2j... [PDF]

See last page for restrictions


> In any way that violates any applicable national, federal, state, local or international law or regulation;

Darn! Foiled again! I was planning on breaking some federal laws, but the license says that I can't ;( \s

The OpenRAIL license has to be the worst license in existence claiming to be "open".

> You shall undertake reasonable efforts to use the latest version of the Model.

Plea to folks releasing models: Please stop using this user-hostile and deranged license


Thanks. Those look like pretty relaxed terms.


Say I want to fine tune a Golang specific model. How much $ and effort would I have to put in? Would using this as a base help in any way compared to starting from llama?


Maybe it makes sense to start from Code Llama, not Llama :D I think a Golang-specific model will not be that different from a multi-language model, but it will definitely work better after finetuning on your code. Check out the Refact self-hosting Docker in a couple of days; finetune will be there soon. It will take you 1 GPU and almost no money )


All these LLMs are pretty general if I understand correctly. Are there any efforts to create specialized models (other than for coding)? Or, what would be even better, "extract" certain areas from existing LLMs as a way to specialize them? With the goal to drastically reduce model size to be able to run on less powerful devices.

E.g. a model specializing in chemistry doesn't need to include data on world's history or to be able to write poetry.


I am not an expert, but it still has to learn human language/grammar/what have you, and that is where scale seems to matter. Fine-tuning on a subset of knowledge after that is typically how domain specialisation is achieved, by my understanding.


Domain specialization is done by continuing the full training process. Fine tuning is more for changing the style of the output than adding new knowledge.


What if the initial training already contains all necessary data for a particular specialization? What would be the benefit of continuing the training process?


Imagine someone tells you about how someone committed a crime and asks you to summarise. Now imagine the same question is asked to a lawyer. Even if you both knew the same facts, the response would be very different in style, highlighted points, mentioned references, etc. The domain specific fine tuning does exactly that. Sure, sometimes you can get very close by changing the prompt to include "respond like a lawyer in situation X with following extra rules", but not always and the fine-tuning gives better results and shorter prompt.


I was wondering about that too. Would it be possible in the future to have a more modular approach to LLMs? Have a module that is responsible for basic knowledge/language/grammar and then other more specialized modules that are added selectively.

I don't know enough about fine-tuning, not sure if the process is capable of removing "unused" parts of the model (I guess not possible, similar to un-learning).


There are various methods for removing unused parts of the model, like distillation. The idea is generally that you always lose performance, but hopefully you lose more size/runcost than you do performance, proportionately.
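As a minimal sketch of what distillation looks like in practice (soft-target knowledge distillation with a temperature, written in PyTorch; the temperature value is just an illustrative default):

  import torch.nn.functional as F

  def soft_target_kd_loss(student_logits, teacher_logits, T=2.0):
      # The student matches the teacher's temperature-softened distribution;
      # the T*T factor keeps gradient magnitudes comparable across temperatures.
      return F.kl_div(
          F.log_softmax(student_logits / T, dim=-1),
          F.softmax(teacher_logits / T, dim=-1),
          reduction="batchmean",
      ) * (T * T)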


So, so many. There are RAG-specific models (Contextual AI), finance-specific models (Bloomberg GPT, Brightwave), contact-center models (Cresta), even telco models (Anthropic).


Very interesting. Thanks for replying!


Another model that we'll soon forget it ever existed.


For the sake of not giving Microsoft and a few other tech giants immense power over the world, I really do hope the cost and efficiency of LLMs improve dramatically, until we can get GPT-4-equivalent models trained on a few graphics cards and running offline on an iPhone. Really rooting for these kinds of projects until someone makes the breakthrough.


You may be interested in what we’re working on at Symbolica AI.

We’re using formal logic in the form of abstract rewrite systems over a causal graph to perform geometric deep learning. In theory it should be able to learn the same topological structure of data that neural networks do, but using entirely discrete operations and without the random walk inherent to stochastic gradient descent.

Current experiments are really promising, and assuming the growth curve continues as we scale up, you should be able to train a GPT-4-scale LLM in a few weeks on commodity hardware (we are currently using a desktop with four 4090s), and be able to do both inference and continual fine-tuning/online learning on device.


> We’re using formal logic in the form of abstract rewrite systems over a causal graph to perform geometric deep learning. In theory it should be able to learn the same topological structure of data that neural networks do, but using entirely discrete operations and without the random walk inherent to stochastic gradient descent.

Abstract rewriting like a computer algebra system's (e.g. Wolfram's) term-rewriting equation simplification method?


Heavily influenced by Wolfram's work on metamathematics and the physics project, in so far as using a rewrite system to uncover an emergent topology; we're just using it to uncover the topology of certain data (assuming that the manifold hypothesis is correct), rather than the topology of fundamental physics as he did.


Sounds cool, but what are the drawbacks?


Biggest drawback is that since the structure is all discrete, it is inherently weak at modeling statistical distributions. For example, it'll likely never best a neural network at stock market prediction or medical data extrapolation.

However, for things that are discrete and/or causal in nature, we expect it to outperform deep learning by a wide margin. We're focused on language to start, but want to eventually target planning and controls problems as well, such as self-driving and robotics.

Another drawback is that the algorithm as it stands today is based on a subgraph isomorphism search, which is hard. Not hard as in tricky to get right like Paxos or other complex algorithms; NP-hard, so very difficult to scale. We have some fantastic Ph.D.s working with us who focus on optimization of subgraph isomorphism search, and category theorists working to formalize what constraints we can relax without affecting the learning mechanism of the rewrite system, so we're confident that it's achievable, but the time horizon is currently unknown.
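For readers unfamiliar with the problem: subgraph isomorphism asks whether one graph contains a copy of another, and the search cost grows combinatorially with graph size. A toy illustration with networkx (the graphs here are made-up examples, not anything from our system):

  import networkx as nx
  from networkx.algorithms import isomorphism

  haystack = nx.erdos_renyi_graph(30, 0.2, seed=0)   # "data" graph to search in
  needle = nx.cycle_graph(5)                         # pattern to look for

  matcher = isomorphism.GraphMatcher(haystack, needle)
  print(matcher.subgraph_is_isomorphic())            # exhaustive search in the worst case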


It doesn't exist at scale yet.


Especially interested in learning directly on geometries, please keep us updated and share results


Would definitely recommend Bronstein et. al's work on geometric deep learning! https://geometricdeeplearning.com

That's effectively the right hand side of the bridge that we're building between formal logic and deep learning. So far their work has been viewed mainly as descriptive, helping to understand neural networks better, but as their abstract calls out: "it gives a constructive procedure to incorporate prior physical knowledge into neural architectures and provide principled way to build future architectures yet to be invented". That's us (we hope)!


I would like to subscribe to your newsletter, we'd be super interested in this at Brainchain AI.

Drop me a link at (my first name) @ brainchain dot AI if you'd like to chat, I'd love to hear more about what you're working on!


Really cool stuff! Do you have any recommendations of where we could learn more?


The key word there is models. Per the leaked GPT-4 details, it's not a single model but a 16-expert mixture of experts (MoE). There's probably quite a lot of complexity on the backend in sourcing the right model for the right query. In short, it's probably better for the open-source community to focus on single models for specific tasks, as evidenced by Code Llama. A system like GPT-4 is still difficult to replicate. Getting something to run on consumer hardware for specific tasks like code gen at almost GPT-4 level is doable.


>There’s probably quite a lot of complexity on the backend in sourcing the right model for the right query.

This isn't how sparse MoE models work. There isn't really any complexity like that, and a different expert can be picked for each token.

Sparse models aren't an ensemble of models.


There are many MoE architectures, and I suppose we don't know for sure which one OpenAI is using. The "selection" of the right mix of experts is something that the network learns, and it's not a complex process. Certainly no more complex than training an LLM.
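A minimal sketch of what that learned selection can look like (top-k token routing in PyTorch; OpenAI's actual router is not public, so this is purely illustrative):

  import torch
  import torch.nn as nn

  class TopKRouter(nn.Module):
      def __init__(self, d_model, n_experts, k=2):
          super().__init__()
          self.gate = nn.Linear(d_model, n_experts)  # learned gate, trained with the rest of the network
          self.k = k

      def forward(self, x):                          # x: (n_tokens, d_model)
          scores = self.gate(x)                      # (n_tokens, n_experts)
          weights, expert_ids = scores.topk(self.k, dim=-1)
          weights = weights.softmax(dim=-1)          # per-token mixing weights
          return weights, expert_ids                 # caller dispatches tokens to the chosen experts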


When I wrote "backend", it was a poor choice of word. "Meta-model" is probably a better choice of wording.

I hope it did not detract too much from the point of focusing on subtasks and modalities for FOSS, as GPT-4 was built on a $163 million budget.

Finally, good point. We've got no idea what OpenAI's MoE approach is or how it works. I went back to Meta's 2022 NLLB-200 system paper and they didn't even publish the exact details of the router (gate).


Yeah, good point on the importance of FOSS focusing on subtasks... because FOSS isn't going to be spending $150M+ training a model any time soon without something like government backing.


I think with or without algorithmic advantages hardware will improve for local model running. There’s an immense amount of capital being invested in hardware improvement and that will absolutely trickle down.

My sincere belief is that local models is the way of the future, with flexible base models adapted via Lora and context to specific use cases. I think open source models and techniques are inexorable at this point barring some sort of regulatory moat and will rival commercial models in all but extreme cases.


I don't. How do you maintain control and prevent mass harm in that case? I don't see any way out other than gatekeeping similar to what we apply to ownership and use of high explosives and radiological weapon tooling.

At all other times I support tech freedom. I use libre software, I use Tor, I donate to privacy and FOSS organizations constantly. I only write my software projects under an AGPL license. AI is qualitatively different. A world run amok with intelligent infinite Sybils is not good for anyone. I hope massive compute continues to be necessary; it may be the only hard chokepoint we have to keep a handle on the beast.


> For the sake of not giving Microsoft and a few other tech giants immense power over the world

I agree with and appreciate the sentiment, but it feels way too late for that. These people do have and exert direct control over pretty much all of our digital devices. It's funny (or sad) that we only seem to care about this when shiny doodads like AI come around every so-often.


That could also help tech giants build even larger/more capable models cheaply. Ideally there would be a hard ceiling of LLM capability that even massive amounts of hardware couldn't exceed, allowing inexpensive hardware to catch up.


I personally hope that LLMs have no such limits. The good these tools can do is immeasurable.

I can already run Llama 2 @70b on my laptop, and that’ll look like a quaint old AI artifact in 5-7 years. I think the consumer market will keep pace yet stay well below SotA, just as it always has. That still leaves plenty of room for incredible open-source stuff!


To be fair, if that is achieved, then the massive models that tech giants produce will probably be phenomenal.


We’ve finished training a new code model Refact LLM which took us about a month. The main use-case is for blazing-fast code completion with fill-in-the-middle, additionally, the model could reply to chat prompts.

It has much better performance than all of the code models of similar size, and almost reaches the same HumanEval score as StarCoder while being 10x smaller.

Thanks to its small size, it works on most modern GPUs, requiring just 3GB of RAM.

You can try self-hosting it in Refact https://github.com/smallcloudai/refact/ and get a local fast copilot alternative with decent suggestions.

Weights and model card https://huggingface.co/smallcloudai/Refact-1_6B-fim.

We would love to hear your feedback!
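For anyone who just wants to poke at the weights, here's a quick sketch of trying it with plain transformers (the FIM special-token names below follow the StarCoder-style convention and are an assumption; check the model card for the exact prompt format and whether trust_remote_code is required):

  from transformers import AutoModelForCausalLM, AutoTokenizer

  checkpoint = "smallcloudai/Refact-1_6B-fim"
  tokenizer = AutoTokenizer.from_pretrained(checkpoint)
  model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)

  # Fill-in-the-middle: the model completes the code between prefix and suffix.
  prompt = "<fim_prefix>def median(numbers):\n<fim_suffix>\n    return result\n<fim_middle>"
  inputs = tokenizer(prompt, return_tensors="pt")
  outputs = model.generate(**inputs, max_new_tokens=64)
  print(tokenizer.decode(outputs[0]))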


How does it compare to Copilot? A metric I'd like to see is the % of proposed completions accepted by a human user. If you had an extension that 50% of the time proposed a Copilot completion and 50% of the time a Refact completion (blind to the user), you could come up with a metric like this.


Does ctransformer (https://github.com/marella/ctransformers#supported-models) support running refact?

I see that model type "gpt_refact" in https://huggingface.co/smallcloudai/Refact-1_6B-fim/blob/mai...


Is it possible to run it as an LSP so that it can be used in editors other than VSCode and JetBrains? (sorry if this question is completely mad, my understanding of how these things work is extremely limited)


Yes, it's coming up in a couple of weeks.


Great, thanks. I'll keep an eye out.


Hi, I tried to finetune the Refact model using the evolved Code Alpaca dataset, but the loss is always bigger than 2. I tried some different params but it doesn't work. Can you give me some advice?


> almost reaches the same HumanEval

how can you tell that HumanEval hasn't leaked into your training data in some form?


Hi! We ran LSH filtering over our datasets to remove all code that could be similar to HumanEval samples.
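For readers curious what that kind of decontamination looks like, here's a minimal sketch with the datasketch library (the whitespace shingling, threshold, and toy data are illustrative assumptions, not our actual pipeline):

  from datasketch import MinHash, MinHashLSH

  def minhash(code, num_perm=128):
      m = MinHash(num_perm=num_perm)
      for token in code.split():               # crude whitespace shingling for illustration
          m.update(token.encode("utf8"))
      return m

  benchmark_solutions = ["def add(a, b):\n    return a + b"]                   # stand-in for HumanEval
  training_samples = ["def add(a, b):\n    return a + b", "print('hello')"]    # stand-in for training data

  lsh = MinHashLSH(threshold=0.8, num_perm=128)
  for i, solution in enumerate(benchmark_solutions):
      lsh.insert("bench_%d" % i, minhash(solution))

  clean = [s for s in training_samples if not lsh.query(minhash(s))]
  print(clean)                                 # the near-duplicate of the benchmark is dropped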


So, we have to trust your procedure...


You can check whether the model predicts canonical solutions from HumanEval. I understand it's not ideal, but at least you can check it yourself.

There are a bunch of other benchmarks too, check out the page https://huggingface.co/smallcloudai/Refact-1_6B-fim

Also, feel free to run any new benchmarks


Tangentially related: Refact recently shared 4 bounties worth $9,000 to help improve their tech!

https://algora.io/org/smallcloudai/bounties

Disclaimer: I'm a cofounder of Algora, the platform enabling these bounties.



Model Stats

- Architecture: LLaMA-like model with multi-query attention

- Objectives: Fill-in-the-Middle, Chat

- Context size: 4096 tokens

- Pretraining tokens: 1.2T

- Finetuning tokens: 40B

- Precision: bfloat16

- GPUs: 64 NVIDIA A5000

- Training time: 28 days



