The biggest advantage by far is the data they collect along the way. That data can be bucketed to real devs, and the signals extracted from it can be top tier. All that data + signals + whatever else they cook up gets fed back into the training corpus, and the models get re-trained / version++ on the new set. Rinse and repeat.
(this is also why all the labs, including some chinese ones, are subsidising / metoo-ing coding agents)
The license is a mess. It starts with "GNU AFFERO GENERAL PUBLIC LICENSE", then goes into "then distribute that combined work under the terms of your choice", and then adds a non-commercial clause: "does not grant to you, the right to Sell the Software" ("including without limitation fees for hosting or consulting/support services").
Just chuck an NC clause on there and be done with it; there's no point splitting a license three ways just to confuse matters.
> In a statement posted on social media late Dec. 12, Michael Nicolls, vice president of Starlink engineering at SpaceX, said a satellite launched on a Kinetica-1 rocket from China two days earlier passed within 200 meters of a Starlink satellite.
> CAS Space, the Chinese company that operates the Kinetica-1 rocket, said in a response that it was looking into the incident and that its missions “select their launch windows using the ground-based space awareness system to avoid collisions with known satellites/debris.” The company later said the close approach occurred nearly 48 hours after payload separation, long after its responsibilities for the launch had ended.
> The satellite from the Chinese launch has yet to be identified and is listed only as “Object J” with the NORAD identification number 67001 in the Space-Track database. The launch included six satellites for Chinese companies and organizations, as well as science and educational satellites from Egypt, Nepal and the United Arab Emirates.
The reported tables also don't match the screenshots. And their baselines and tests are too close to tell apart (judging by the screenshots, not the tables): 29/33 baseline, 31/33 skills, 32/33 skills + use-skill prompt, 33/33 agent.md.
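For context on how close those numbers are, here's a quick back-of-the-envelope check (my own sketch, assuming each score is a single pass over 33 independent tasks): the 95% intervals for these pass rates overlap, so without repeated runs the differences could just be noise.

```python
# Rough sanity check (mine, not from the post): with n=33 tasks and a single
# run per setup, the reported pass counts sit inside each other's binomial
# confidence intervals.
import math

def wilson_ci(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate."""
    p = passes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - margin, centre + margin

for label, passes in [("baseline", 29), ("skills", 31),
                      ("skills + prompt", 32), ("agent.md", 33)]:
    lo, hi = wilson_ci(passes, 33)
    print(f"{label:>15}: {passes}/33  95% CI ~ [{lo:.2f}, {hi:.2f}]")
```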
> I think the purpose of Genie is to be a video game, but it's a video game for AI researchers developing AIs.
Yeah, I think this is what the person above was saying as well, and it's what people at Google have said already (a few podcasts on GDM's channel, hosted by Hannah Fry). They have their "agents" play in Genie-powered environments. So one system "creates" the environment for the task, say "place the ball in the basket": Genie creates an env with a ball and a basket, and the other agent learns to wasd its way around, pick up the ball, and wasd over to the basket, and so on. Pretty powerful combo if you have enough compute to throw at it.
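If it helps, here's a toy sketch of that loop. Everything below (the env, the generator, the policy) is made up for illustration; it's not the actual Genie/GDM interface, just the "one system builds the world, another learns in it" shape.

```python
# Toy sketch of "world-model generates the env, agent learns in it".
# All names here are hypothetical; this is not the Genie API.
import random

class ToyEnv:
    """Stand-in for a Genie-generated environment: a ball and a basket at position 0."""
    def __init__(self, task: str):
        self.task = task
        self.ball_pos = random.randint(1, 5)

    def reset(self) -> int:
        self.ball_pos = random.randint(1, 5)
        return self.ball_pos

    def step(self, action: str):
        # 'a' moves the ball toward the basket, 'd' away from it
        if action == "a":
            self.ball_pos -= 1
        elif action == "d":
            self.ball_pos += 1
        done = self.ball_pos == 0            # ball is in the basket
        reward = 1.0 if done else 0.0
        return self.ball_pos, reward, done

def make_env(task: str) -> ToyEnv:
    """Stand-in for the generator: text prompt -> playable environment."""
    return ToyEnv(task)

def act(obs: int) -> str:
    """Stand-in for the learning agent's policy."""
    return "a" if obs > 0 else "d"

env = make_env("place the ball in the basket")
obs = env.reset()
done = False
while not done:
    obs, reward, done = env.step(act(obs))
print("task solved, reward:", reward)
```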
I'm always interested in new benchmarks, so this is cool. I only had a brief look at [1] and [2], a few quick things that I noticed:
For [1]: instruction.md is very brief, quite vague and "assumes" a lot of things.
- Your task is: Add OTEL tracing to all microservices. Add OTEL logging to all microservices. (this is good)
- 6. I want to know if the microservice has OTEL instrumentation and where the data is being sent. (??? I have no idea what this means)
- 9. Use the recent version of the OTEL SDK. (yeah, this won't work unless you also use an MCP like context7 or provide local docs)
What's weird here is that instruct.md has zero content regarding conventions, specifically how to name things. Yet in tests_outputs you have "expected_patterns = ["order", "stock", "gateway"]" and you assert on it. I guess that makes some sense, but being specific in the task.md is a must. Otherwise you're benching on assumptions, and those don't even work with meatbags :)
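To make the failure mode concrete, here's roughly what that ends up grading (only expected_patterns is from the repo; the check_output helper and the log string are made up by me):

```python
# The test hard-codes service names the instructions never pin down.
expected_patterns = ["order", "stock", "gateway"]

# A model that names its services "orders-svc", "inventory", "api-gw" has
# arguably done the task, but a literal substring check is all that's graded.
# (check_output and the sample string below are hypothetical, not from the repo.)
def check_output(trace_log: str) -> bool:
    return all(pattern in trace_log for pattern in expected_patterns)

print(check_output("spans from orders-svc, inventory, api-gw"))  # False, yet plausibly correct work
```

Either spell the required names out in the instructions ("name the services exactly: order, stock, gateway") or grade on behaviour (spans exist, data reaches the configured endpoint) rather than on free-form naming.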
For [2]: instruction.md is more detailed, but has some weird issues:
- "You should only be very minimal and instrument only the critical calls like request handlers without adding spans for business calls \n The goal is to get business kind of transaction" (??? this is confusing, even skipping over the weird grammar there)
- "Draw ascii trace diagram into /workdir/traces.txt" (????)
- "When modifying Python files, use Python itself to write files or use sed for targeted changes" (? why are you giving it harness-specific instructions in your instruct.md? this is so dependent on the agentic loop used, that it makes no sense here.
- "Success Criteria: Demonstrate proper distributed tracing \n Include essential operations without over-instrumenting (keep it focused) \n Link operations correctly \n Analyze the code to determine which operations are essential to trace and how they relate to each other. (i mean ... yes and no. these are not success criteria IMO. It's like saying "do good on task not do bad". This could definitely be improved.)
----
Also, I noticed that every folder has a summary_claude... file that looks like a Claude-written summary of a run. I hope that's not what's actually used to compute the benchmark scores; if it is, you're adding another layer of uncertainty in checking the results...
The idea is nice, but tbf some of the tests seem contrived, the instructions are not that clear, you expect static naming values while not providing any instructions about naming conventions, and so on. It feels like a lot of this was "rushed"? I peeked a bit at the commit history and saw some mentions of vibe-coding a viewer for this. I hope that's the only thing that was vibe-coded :)
> but I lack some confidence in models trained on LLM output, so I hope it wasn't that.
That's misguided. Models have been trained on synthetic data for ~2+ years already. The "model collapse" myth is based on a very poor paper that got waaaay more attention than it deserved (because negativity sells, I guess). In practice every lab out there is doing this, because it works.
When ChatGPT first released and jailbreaks were pretty easy, I was able to get some extremely good/detailed output from it, with very few errors or weirdness. Now even when I can get jailbreaks to work with their newer models, it's just not the same, and no open-source model or even commercial model has seemed to come close to the quality of that very first release. They're all just weird, dumb, random or incoherent. I keep trying even the very large open-source or open-weights models, and new versions of OpenAI's models and Claude and Gemini and so on, but it all just sucks. It all feels like slop!
I'm convinced it's because that first ChatGPT release was probably trained on data almost entirely untainted by other LLMs, and it may no longer ever be possible to obtain such a dataset again. Every model feels so artificial and synthetic. I do not know for sure why this is, but I bet it has something to do with people thinking it's possible to programmatically generate almost half the dataset?! I feel like OpenAI's moat could have been the quality and authenticity of their dataset, since they scraped practically most of the internet before LLMs became widespread, but even they've probably lost it by now.
I haven't really internalized anything about "model collapse", other than that if you train an LLM on outputs from other LLMs, you will be training to emulate an imprecise version of an imprecise version of writing, which will be measurably and perceptibly worse than merely one layer of imprecise version of actual writing.
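A toy way to see that compounding-imprecision intuition (purely a statistics toy, not a claim about how real LLM training pipelines behave): repeatedly fit a distribution to samples drawn from the previous fit, and the estimate drifts away from the original data.

```python
# Each generation models the previous generation's output rather than the
# original source, so small estimation errors accumulate.
import random
import statistics

random.seed(0)
mu, sigma = 0.0, 1.0            # "actual writing"
for generation in range(10):
    samples = [random.gauss(mu, sigma) for _ in range(200)]   # finite, imprecise sample
    mu, sigma = statistics.fmean(samples), statistics.stdev(samples)  # re-fit on it
    print(f"gen {generation}: mu={mu:+.3f} sigma={sigma:.3f}")
```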
> I'm convinced it's because that first ChatGPT release was probably trained on data almost entirely untainted by other LLMs, and it may no longer ever be possible to obtain such a dataset again.
Interesting statement. But wouldn’t that mean that Google is in an even better position in regard to primary, or at least pristine data?
Umm, we do. It's still one of the best for support/help chatbot use cases in EU countries. It's got good (best?) multilingual support ootb, it's very "safe" (won't swear, won't output Chinese characters, etc.), and it's pretty fast.
Yep. Before Gemma 3 we were struggling with multilinguality on smaller European languages, and it is still one of the better ones in that regard (even large open or closed models struggle with this to some extent). Gemma 3 is also still pretty decent multimodal-wise.
I didn't know this was a thing until I read this thread, but I can confirm that it does fine (not perfect by any means, just like the average casual non-native fluent speaker), and it is one of the reasons I use it as my local model.