This is finetuned to the benchmarks and nowhere close to O1-Preview in any other...

mluo · on Feb 11, 2025

We beat O1-preview and even many other 7B models over many math benchmarks, which was TEST set (not in training set at all).

If you want to make the model fully generalist, feel free to train it over coding datasets (such as RL with passing unit tests as reward).

zamadatix · on Feb 11, 2025

It's already good accomplishment as it is but I think it'd be very surprising to show training such a small model as a generalist scales to the same magnitude as specialized finetuning. At some point you have to fit more background data and relations in the same amount of information space... but it's hard to say how much that is the case for a given size vs what we just haven't optimized yet. Unfortunately I think that will have to wait for someone with more compute before we can verify this * a dozen one way or the other :).

Side question, since it sounds like you were involved: how big is the impact on benchmarks of taking this 1.5B model down from fp32 to fp8 or similar? The focus on parameters alone sometimes feels like comparing house sizes by their lengths alone. And, if you were indeed involved, thanks for making all of this open and available!

mluo · on Feb 11, 2025

For quantization, very big impact for small models, can drop at much as 10% on AIME. Our model does best on bfloat16 ;)

Come checkout our repo at: https://github.com/agentica-project/deepscaler

rvnx · on Feb 11, 2025

It is great discovery, it could even open a next step in AI with MoM "Mixture of Models", where small fine-tuned models take each part of a task (instead of the current MoE)

mluo · on Feb 11, 2025

Check out one of my prior work: https://stylus-diffusion.github.io/

This work scales up selection/routing over many models/LoRAs

rvnx · on Feb 11, 2025

Love it, will check, thank you for showing / sharing all of that!

numba888 · on Feb 12, 2025

o1 is more than just math solver. And you cannot possibly train that much in a small model.

However smaller specialized models looks to be the right way to handle world's complexity. Sort of mixture of experts on one level above. Orchestrating them will be another problem. Possible solution is generalists model "to rule them all".

mdp2021 · on Feb 12, 2025

Have you considered the very practical importance of running specialized models for specialized tasks on common hardware (maybe a couple of CPU cores in a couple GB of RAM)?

numba888 · on Feb 12, 2025

Small models are just tools. Even many of them will make only a toolset. They don't evolve in AGI by themselves. But putting them together in a structure (brain) may result in something close. Like big smart calculator. It takes more to create a 'character' similar to, say, terminator.

janalsncm · on Feb 11, 2025

I disagree. They demonstrated a way to dramatically reduce training costs, 18x cheaper than R1. That alone is worth attention.

Also beating O1 on any benchmark is nontrivial.

nabakin · on Feb 11, 2025

I'm not so sure it's impressive even for mathematical tasks.

When ChatGPT came out, there was a flood of fine-tuned LLMs claiming ChatGPT-level performance for a fraction of the size. Every single time this happened, it was misleading.

These LLMs were able to score higher than ChatGPT because they took a narrow set of benchmarks and fine-tuned for those benchmarks. It's not difficult to fine-tune an LLM for a few benchmarks, cheaply and beat a SOTA generalist LLM at that benchmark. Comparing a generalist LLM to a specialist LLM is like comparing apples to oranges. What you want is to compare specialist LLMs to other specialist LLMs.

It would have been much more interesting and valuable if that was done here. Instead, we have a clickbait, misleading headline and no comparisons to math specialized LLMs which certainly should have been performed.

torginus · on Feb 11, 2025

But if that's the case - what do the benchmarks even mean then?

nabakin · on Feb 11, 2025

Automated benchmarks are still very useful. Just less so when the LLM is trained in a way to overfit to them, which is why we have to be careful with random people and the claims they make. Human evaluation is the gold standard, but even it has issues.

torginus · on Feb 12, 2025

The question is how do you train your LLMs to not 'cheat'?

Imagine you have an exam coming up, and the set of questions leaks - how do you prepare for the exam then?

Memorizing the test problems would be obviously problematic, but maybe practicing the problems that appear on the exam would be less so, or just giving extra attention to the topics that will come up would be even less like cheating.

The more honest approach you choose, the more indicative your training would be of exam results but everybody decides how much cheating they allow for themselves, which makes it a test of the honesty not the skill of the student.

nabakin · on Feb 12, 2025

I think the only way is to check your dataset for the benchmark leak and remove it before training, but (as you say) that's assuming an honest actor is training the LLM, going against the incentives of leaving the benchmark leak in the training data. Even then, a benchmark leak can make it through those checks.

I think it would be interesting to create a dynamic benchmark. For example, a benchmark which uses math and a random value determined at evaluation for the answer. The correct answer would be different for each run. Theoretically, training on it wouldn't help beat the benchmark because the random value would change the answer. Maybe this has already been done.

avbanks · on Feb 11, 2025

A lot of people in community are weary of benchmarks for this exact reason.

pona-a · on Feb 11, 2025

I tested it on basic long addition problems. It frequently misplaced the decimal signs, used unnecessary reasoning tokens (like restating previously done steps) and overall seemed only marginally more reliable than the base DeepSeek 1.5B.

On my own pet eval, writing a fast Fibonacci algorithm in Scheme, it actually performed much worse. It took a much longer tangent before arriving at fast doubling algorithm, but then completely forgot how to even write S-expressions, proceeding to instead imagine Scheme uses a Python-like syntax while babbling about tail recursion.

viraptor · on Feb 12, 2025

> On my own pet eval, writing a fast Fibonacci algorithm in Scheme,

This model was trained on math problems datasets only, it seems. It makes sense that it's not any better at programming.

pona-a · on Feb 12, 2025

The original model, aside from its programming mistakes, also misremembered the doubling formula. I hoped to see that solved, which it was, as well as maybe a more general performance boost from recovering some distillation loss.

ekidd · on Feb 15, 2025

This model can't code at all.

It does high school math homework, plus maybe some easy physics. And it does them surprisingly well. Outside of that, it fails every test prompt in my set.

It's a pure specialist model.

Aiguru31415666 · on Feb 12, 2025

It's absolutely worth to look into.

It's a great find