I'm testing it for systems programming brainstorming, code reviews, and writing Python unit tests, and my impression is that it's a Sonnet 3.5-level model for most tasks. I said a few things here: https://www.youtube.com/watch?v=xjCqi9JK440 but in general this is really an open-weights frontier model, the first one we get (IMHO Llama 3.1 405B does not fit the definition, and the actual model quality is far from the benchmarks). Also, the extreme inference speed due to MoE and other design choices improves the user experience a lot. I also tested asking questions with very large contexts (PDFs, large C files) at play, and it performs very well.
Also, don't just focus on this model but check out what DeepSeek's mission is, and the CEO's words in the recently released interview. They want to be the DJI / Bambulab of AI, basically: leaders and not followers, and after V3 it's hard to say they don't have the right brains to do that.
This matches my experience exactly. I use all the main models almost every day, switching between them several times a day to compare and contrast their output. DeepSeek is shockingly good! How they got there on a relatively small budget is mind-blowing.
Do you mind sharing your workflow for how you use this model? Is it integrated with an IDE, are you using the command line, or maybe a UI where you ask each question and then copy/paste the results?
> IMHO llama 3.1 405B does not fit the definition, and the actual model quality is far from the benchmarks
Very true. Meta has been disappointing so far, and it doesn't help the blog post that it opens with a graph of a completely unrepresentative benchmark (MMLU) showing things like Llama 3.1-405B besting Gemini 1.5 Pro, 4o-mini above Haiku 3.5, and so on.
But all this means is that the leap for open-weights models is even bigger with DeepSeek V3.
If you understand how LLMs work, you should disregard tests such as:
- How many 'r's are in Strawberry?
- Finding the fourth word of the response
These tests are at odds with the tokenizer and next-word prediction model.
They do not accurately represent an LLM's capabilities.
It's akin to asking a blind person to identify colors.
Ask an LLM to spell "Strawberry" one character per line. Claude's output, for example:
> Here's "strawberry" spelled out one character per line:
s
t
r
a
w
b
e
r
r
y
Most LLMs can handle that perfectly, meaning they can abstract over tokens into individual characters. Yet most lack the ability to perform the multi-level inference needed to then count the individual 'r's.
From this perspective, I think it's the opposite. Something like the strawberry test is a good indicator of how well an LLM can connect steps that are individually easy but not readily interconnected.
The funny thing about those "tests" is that LLMs are judged by their ability to do that themselves, as opposed to their ability to write code that does it. The best LLMs still fail at doing the task themselves, because they fundamentally are not designed to do anything except predict tokens. But they absolutely can write code that does it perfectly, and can write code that does so many things better than that.
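For instance, the character-counting task that trips the models up "in their head" is trivial as generated code (a minimal Python sketch; the function name is just illustrative):

    def count_char(word: str, ch: str) -> int:
        # the "hard" strawberry question is a one-liner as code
        return word.lower().count(ch.lower())

    print(count_char("Strawberry", "r"))  # 3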
I'm not going to argue these are good tests (if you asked a coworker these questions they'd look at you weird), but what surprised me is how well you can take a sentence never written down before, put it through base64 encoding, and then ask an LLM to decode it. The good models can do this surprisingly well.
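If you want to reproduce that test yourself, here's a quick standard-library sketch (the sentence is just a placeholder):

    import base64

    sentence = "a sentence that has never been written down before"  # placeholder
    encoded = base64.b64encode(sentence.encode("utf-8")).decode("ascii")
    print(encoded)   # paste this into the model and ask it to decode it

    # sanity check: decode locally to compare with the model's answer
    print(base64.b64decode(encoded).decode("utf-8"))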
I know future GPU development is addressing the constrained RAM problem, but it is nonetheless a massive problem for local inference. MoE seems to solve a compute problem at the expense of compounding the RAM problem. So I have a question: my understanding is that the typical MoE model starts each output token with a decision about which expert model(s) to send inference tasks to. How often do the vast majority of predictions end up being sent to the same expert(s)? Wouldn't it be more practical, from both a training and inference perspective, to do the same mixture-of-experts model but choose experts at a much higher level of granularity? Maybe at the level of the whole response, or a clause, or a sentence? At least then you could load an expert into RAM and expect to use it without having to do massive I/O loading/unloading constantly.
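To make the per-token routing concrete, here's a rough sketch of top-k gating as used in typical MoE layers (plain NumPy, heavily simplified; real routers like DeepSeek's add softmax gating weights, load-balancing terms, and shared experts, none of which is shown here):

    import numpy as np

    def route_tokens(token_states, gate_weights, k=2):
        # token_states: (num_tokens, d_model); gate_weights: (d_model, num_experts)
        logits = token_states @ gate_weights          # one score per expert, per token
        topk = np.argsort(logits, axis=-1)[:, -k:]    # indices of the k highest-scoring experts
        return topk                                   # each token gets its own expert set

    # toy usage: 4 tokens, model dim 8, 16 experts
    rng = np.random.default_rng(0)
    print(route_tokens(rng.standard_normal((4, 8)), rng.standard_normal((8, 16)), k=2))

Because the choice happens per token, consecutive tokens can land on completely different experts, which is why response- or sentence-level routing, as suggested above, would change the memory access pattern so much.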
A lot of talk about how much cheaper it is than all other models.
It remains to be seen what the pricing will be when run by non-Deepseek providers. They might be loss leading.
The comparison for cheap models should also include Gemini 2.0 Flash Exp. I could see it being even cheaper when it stops being free, if it ever does. There's definitely a scenario where Google just keeps it free-ish for a long time with relatively high limits.
Just a minor clarification: DeepSeek's pricing for this model is temporary, to match their previous model. They announced [1] that it will be the following after February 8:
DeepSeek: $0.27 per million input tokens, $1.10 per million output tokens (66 tokens/s)
Still much cheaper than the others for input pricing, though.
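As a rough cost sketch with those post-February-8 numbers (the token counts are just an example workload):

    # example workload: 2M input tokens, 500k output tokens
    input_tokens, output_tokens = 2_000_000, 500_000
    cost = input_tokens / 1e6 * 0.27 + output_tokens / 1e6 * 1.10
    print(f"${cost:.2f}")  # $1.09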
For what it's worth, as always, 99% of benchmarks are very unreliable, and per-task performance still differs greatly between models, with plenty of cases where results are wildly different.
I have a task I use in my work where Gemini 1.5-Pro is SOTA. Handily beating o1, Sonnet-3.5, Gemini-exp and everyone else, very consistently and significantly.
The newer/bigger models are better at reasoning and especially coding, but there's plenty of tasks that have little overlap with those skills.