I'm testing it for systems programming brainstorming, code reviews, and writing Python unit tests, and my impression is that it's a Sonnet 3.5-level model for most tasks. I said a few things here: https://www.youtube.com/watch?v=xjCqi9JK440 but in general this is really an open-weights frontier model, the first one we get (IMHO Llama 3.1 405B does not fit the definition, and the actual model quality is far from the benchmarks). Also, the extreme inference speed due to MoE and other design choices improves the user experience a lot. I also tested asking questions with very large contexts (PDFs, large C files) at play, and it performs very well.
Also, don't just focus on this model but check out what DeepSeek's mission is, and the CEO's words in the recently released interview. They want to be the DJI / Bambulab of AI, basically: leaders and not followers, and after V3 it's hard to say they don't have the right brains to do that.
This matches my experience exactly. I use all the main models almost every day, switching between them several times a day to compare and contrast their output. DeepSeek is shockingly good! How they got there on a relatively small budget is mind-blowing.
Do you mind sharing your workflow for how you use this model? Is it integrated with an IDE, are you using the command line, or maybe a UI where you ask each question and then copy/paste the results?
> IMHO llama 3.1 405B does not fit the definition, and the actual model quality is far from the benchmarks
Very true. Meta has been disappointing so far, and it doesn't help the blog post that it opens with a graph of a completely unrepresentative benchmark (MMLU) showing things like Llama 3.1-405B besting Gemini 1.5 Pro, 4o-mini above Haiku 3.5, and so on.
But all this means is that the leap for open-weights models is even bigger with DeepSeek V3.
If you understand how LLMs work, you should disregard tests such as:
- How many 'r's are in Strawberry?
- Finding the fourth word of the response
These tests are at odds with the tokenizer and next-word prediction model.
They do not accurately represent an LLM's capabilities.
It's akin to asking a blind person to identify colors.
Ask an LLM to spell "Strawberry" one character per line. Claude's output, for example:
> Here's "strawberry" spelled out one character per line:
s
t
r
a
w
b
e
r
r
y
Most LLMs can handle that perfectly, meaning they can abstract over tokens into individual characters. Yet most lack the ability to perform the multi-level inference needed to then count the individual 'r's.
From this perspective, I think it's the opposite. Something like the strawberry test is a good indicator of how well an LLM can connect steps that are individually easy but not readily interconnected.
The funny thing about those "tests" is that LLMs are judged by their ability to do that themselves, as opposed to their ability to write code that does it. The best LLMs still fail at doing the task themselves, because they fundamentally are not designed to do anything except predict tokens. But they absolutely can write code that does it perfectly, and can write code that does so many things better than that.
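For instance, the character-counting task that trips the models up "in their head" is trivial as generated code (a minimal Python sketch; the function name is just illustrative):

    def count_char(word: str, ch: str) -> int:
        # the "hard" strawberry question is a one-liner as code
        return word.lower().count(ch.lower())

    print(count_char("Strawberry", "r"))  # 3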
I'm not going to argue these are good tests (if you asked a coworker these questions they'd look at you weird), but what surprised me is how well you can take a sentence never written down before, put it through base64 encoding, and then ask an LLM to decode it. The good models can do this surprisingly well.
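If you want to reproduce that test yourself, here's a quick standard-library sketch (the sentence is just a placeholder):

    import base64

    sentence = "a sentence that has never been written down before"  # placeholder
    encoded = base64.b64encode(sentence.encode("utf-8")).decode("ascii")
    print(encoded)   # paste this into the model and ask it to decode it

    # sanity check: decode locally to compare with the model's answer
    print(base64.b64decode(encoded).decode("utf-8"))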
I know future GPU development is addressing the constrained RAM problem, but it is nonetheless a massive problem for local inference. MoE seems to solve a compute problem at the expense of compounding the RAM problem. So I have a question: my understanding is that the typical MoE model starts each output token with a decision about which expert model(s) to send inference tasks to. How often do the vast majority of predictions end up being sent to the same expert(s)? Wouldn't it be more practical, from both a training and inference perspective, to do the same mixture-of-experts model but choose experts at a much higher level of granularity? Maybe at the level of the whole response, or a clause, or a sentence? At least then you could load an expert into RAM and expect to use it without having to do massive I/O loading/unloading constantly.
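To make the per-token routing concrete, here's a rough sketch of top-k gating as used in typical MoE layers (plain NumPy, heavily simplified; real routers like DeepSeek's add softmax gating weights, load-balancing terms, and shared experts, none of which is shown here):

    import numpy as np

    def route_tokens(token_states, gate_weights, k=2):
        # token_states: (num_tokens, d_model); gate_weights: (d_model, num_experts)
        logits = token_states @ gate_weights          # one score per expert, per token
        topk = np.argsort(logits, axis=-1)[:, -k:]    # indices of the k highest-scoring experts
        return topk                                   # each token gets its own expert set

    # toy usage: 4 tokens, model dim 8, 16 experts
    rng = np.random.default_rng(0)
    print(route_tokens(rng.standard_normal((4, 8)), rng.standard_normal((8, 16)), k=2))

Because the choice happens per token, consecutive tokens can land on completely different experts, which is why response- or sentence-level routing, as suggested above, would change the memory access pattern so much.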
A lot of talk about how much cheaper it is than all other models.
It remains to be seen what the pricing will be when run by non-Deepseek providers. They might be loss leading.
The comparison for cheap models should also include Gemini 2.0 Flash Exp. I could see it being even cheaper when it stops being free, if it ever does. There's definitely a scenario where Google just keeps it free-ish for a long time with relatively high limits.
Just a minor clarification: DeepSeek's pricing for this model is temporary, to match their previous model. They announced [1] that it will be the following after February 8:
DeepSeek: $0.27 per million input tokens, $1.10 per million output tokens (66 tokens/s)
Still much cheaper than the others for input pricing, though.
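As a rough cost sketch with those post-February-8 numbers (the token counts are just an example workload):

    # example workload: 2M input tokens, 500k output tokens
    input_tokens, output_tokens = 2_000_000, 500_000
    cost = input_tokens / 1e6 * 0.27 + output_tokens / 1e6 * 1.10
    print(f"${cost:.2f}")  # $1.09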
For what it's worth, as always, 99% of benchmarks are very unreliable, and per-task performance still differs greatly between models, with plenty of cases where results are wildly different.
I have a task I use in my work where Gemini 1.5-Pro is SOTA. Handily beating o1, Sonnet-3.5, Gemini-exp and everyone else, very consistently and significantly.
The newer/bigger models are better at reasoning and especially coding, but there's plenty of tasks that have little overlap with those skills.