The biggest advantage by far is the data they collect along the way. That data can be bucketed to real devs, and the signals extracted from it can be top tier. All that data + signals + whatever else they cook up gets fed back into the training corpus, and the models get re-trained / version++ on the new set. Rinse and repeat.
(this is also why all the labs, including some chinese ones, are subsidising / metoo-ing coding agents)
The license is a mess. It starts with "GNU AFFERO GENERAL PUBLIC LICENSE", then goes into "then distribute that combined work under the terms of your choice", and then adds a non-commercial clause: "does not grant to you, the right to Sell the Software" ("including without limitation fees for hosting or consulting/support services").
Just chuck an NC clause on there and be done with it; there's no point splitting a license three ways just to confuse matters.
> In a statement posted on social media late Dec. 12, Michael Nicolls, vice president of Starlink engineering at SpaceX, said a satellite launched on a Kinetica-1 rocket from China two days earlier passed within 200 meters of a Starlink satellite.
> CAS Space, the Chinese company that operates the Kinetica-1 rocket, said in a response that it was looking into the incident and that its missions “select their launch windows using the ground-based space awareness system to avoid collisions with known satellites/debris.” The company later said the close approach occurred nearly 48 hours after payload separation, long after its responsibilities for the launch had ended.
> The satellite from the Chinese launch has yet to be identified and is listed only as “Object J” with the NORAD identification number 67001 in the Space-Track database. The launch included six satellites for Chinese companies and organizations, as well as science and educational satellites from Egypt, Nepal and the United Arab Emirates.
The reported tables also don't match the screenshots. And their baselines and tests are too close to tell apart (judging by the screenshots, not the tables): 29/33 baseline, 31/33 skills, 32/33 skills + use-skill prompt, 33/33 agent.md.
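For context on how close those numbers are, here's a quick back-of-the-envelope check (my own sketch, assuming each score is a single pass over 33 independent tasks): the 95% intervals for these pass rates overlap, so without repeated runs the differences could just be noise.

```python
# Rough sanity check (mine, not from the post): with n=33 tasks and a single
# run per setup, the reported pass counts sit inside each other's binomial
# confidence intervals.
import math

def wilson_ci(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate."""
    p = passes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - margin, centre + margin

for label, passes in [("baseline", 29), ("skills", 31),
                      ("skills + prompt", 32), ("agent.md", 33)]:
    lo, hi = wilson_ci(passes, 33)
    print(f"{label:>15}: {passes}/33  95% CI ~ [{lo:.2f}, {hi:.2f}]")
```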
> I think the purpose of Genie is to be a video game, but it's a video game for AI researchers developing AIs.
Yeah, I think this is what the person above was saying as well, and it's what people at Google have said already (a few podcasts on GDM's channel, hosted by Hannah Fry). They have their "agents" play in Genie-powered environments. So one system "creates" the environment for the task, say "place the ball in the basket": Genie creates an env with a ball and a basket, and the other agent learns to wasd its way around, pick up the ball, and wasd over to the basket, and so on. Pretty powerful combo if you have enough compute to throw at it.
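If it helps, here's a toy sketch of that loop. Everything below (the env, the generator, the policy) is made up for illustration; it's not the actual Genie/GDM interface, just the "one system builds the world, another learns in it" shape.

```python
# Toy sketch of "world-model generates the env, agent learns in it".
# All names here are hypothetical; this is not the Genie API.
import random

class ToyEnv:
    """Stand-in for a Genie-generated environment: a ball and a basket at position 0."""
    def __init__(self, task: str):
        self.task = task
        self.ball_pos = random.randint(1, 5)

    def reset(self) -> int:
        self.ball_pos = random.randint(1, 5)
        return self.ball_pos

    def step(self, action: str):
        # 'a' moves the ball toward the basket, 'd' away from it
        if action == "a":
            self.ball_pos -= 1
        elif action == "d":
            self.ball_pos += 1
        done = self.ball_pos == 0            # ball is in the basket
        reward = 1.0 if done else 0.0
        return self.ball_pos, reward, done

def make_env(task: str) -> ToyEnv:
    """Stand-in for the generator: text prompt -> playable environment."""
    return ToyEnv(task)

def act(obs: int) -> str:
    """Stand-in for the learning agent's policy."""
    return "a" if obs > 0 else "d"

env = make_env("place the ball in the basket")
obs = env.reset()
done = False
while not done:
    obs, reward, done = env.step(act(obs))
print("task solved, reward:", reward)
```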
I'm always interested in new benchmarks, so this is cool. I only had a brief look at [1] and [2], a few quick things that I noticed:
For [1]: instruction.md is very brief, quite vague and "assumes" a lot of things.
- Your task is: Add OTEL tracing to all microservices. Add OTEL logging to all microservices. (this is good)
- 6. I want to know if the microservice has OTEL instrumentation and where the data is being sent. (??? I have no idea what this means)
- 9. Use the recent version of the OTEL SDK. (yeah, this won't work unless you also use an MCP like context7 or provide local docs)
What's weird here is that instruct.md has zero content regarding conventions, specifically how to name things. Yet in tests_outputs you have "expected_patterns = ["order", "stock", "gateway"]" and you assert on it. I guess that makes some sense, but being specific in the task.md is a must. Otherwise you're benching on assumptions, and those don't even work with meatbags :)
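To make the failure mode concrete, here's roughly what that ends up grading (only expected_patterns is from the repo; the check_output helper and the log string are made up by me):

```python
# The test hard-codes service names the instructions never pin down.
expected_patterns = ["order", "stock", "gateway"]

# A model that names its services "orders-svc", "inventory", "api-gw" has
# arguably done the task, but a literal substring check is all that's graded.
# (check_output and the sample string below are hypothetical, not from the repo.)
def check_output(trace_log: str) -> bool:
    return all(pattern in trace_log for pattern in expected_patterns)

print(check_output("spans from orders-svc, inventory, api-gw"))  # False, yet plausibly correct work
```

Either spell the required names out in the instructions ("name the services exactly: order, stock, gateway") or grade on behaviour (spans exist, data reaches the configured endpoint) rather than on free-form naming.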
For [2]: instruction.md is more detailed, but has some weird issues:
- "You should only be very minimal and instrument only the critical calls like request handlers without adding spans for business calls \n The goal is to get business kind of transaction" (??? this is confusing, even skipping over the weird grammar there)
- "Draw ascii trace diagram into /workdir/traces.txt" (????)
- "When modifying Python files, use Python itself to write files or use sed for targeted changes" (? why are you giving it harness-specific instructions in your instruct.md? this is so dependent on the agentic loop used, that it makes no sense here.
- "Success Criteria: Demonstrate proper distributed tracing \n Include essential operations without over-instrumenting (keep it focused) \n Link operations correctly \n Analyze the code to determine which operations are essential to trace and how they relate to each other. (i mean ... yes and no. these are not success criteria IMO. It's like saying "do good on task not do bad". This could definitely be improved.)
----
Also, I noticed that every folder has a summary_claude... file that looks like a Claude-written summary of a run. I hope that's not what's actually used to compute the benchmark scores; if it is, you're adding another layer of uncertainty in checking the results...
The idea is nice, but tbf some of the tests seem contrived, the instructions are not that clear, you expect static naming values while not providing any instructions about naming conventions, and so on. It feels like a lot of this was "rushed"? I peeked a bit at the commit history and saw some mentions of vibe-coding a viewer for this. I hope that's the only thing that was vibe-coded :)
> but I lack some confidence in models trained on LLM output, so I hope it wasn't that.
That's misguided. Models have been trained on synthetic data for ~2+ years already. The "model collapse" myth is based on a very poor paper that got waaaay more attention than it deserved (because negativity sells, I guess). In practice every lab out there is doing this, because it works.
When ChatGPT first released and jailbreaks were pretty easy, I was able to get some extremely good/detailed output from it, with very few errors or weirdness. Now even when I can get jailbreaks to work with their newer models, it's just not the same, and no open-source model or even commercial model has seemed to come close to the quality of that very first release. They're all just weird, dumb, random or incoherent. I keep trying even the very large open-source or open-weights models, and new versions of OpenAI's models and Claude and Gemini and so on, but it all just sucks. It all feels like slop!
I'm convinced it's because that first ChatGPT release was probably trained on data almost entirely untainted by other LLMs, and it may no longer ever be possible to obtain such a dataset again. Every model feels so artificial and synthetic. I do not know for sure why this is, but I bet it has something to do with people thinking it's possible to programmatically generate almost half the dataset?! I feel like OpenAI's moat could have been the quality and authenticity of their dataset, since they scraped practically most of the internet before LLMs became widespread, but even they've probably lost it by now.
I haven't really internalized anything about "model collapse", other than that if you train an LLM on outputs from other LLMs, you will be training to emulate an imprecise version of an imprecise version of writing, which will be measurably and perceptibly worse than merely one layer of imprecise version of actual writing.
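A toy way to see that compounding-imprecision intuition (purely a statistics toy, not a claim about how real LLM training pipelines behave): repeatedly fit a distribution to samples drawn from the previous fit, and the estimate drifts away from the original data.

```python
# Each generation models the previous generation's output rather than the
# original source, so small estimation errors accumulate.
import random
import statistics

random.seed(0)
mu, sigma = 0.0, 1.0            # "actual writing"
for generation in range(10):
    samples = [random.gauss(mu, sigma) for _ in range(200)]   # finite, imprecise sample
    mu, sigma = statistics.fmean(samples), statistics.stdev(samples)  # re-fit on it
    print(f"gen {generation}: mu={mu:+.3f} sigma={sigma:.3f}")
```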
> I'm convinced it's because that first ChatGPT release was probably trained on data almost entirely untainted by other LLMs, and it may no longer ever be possible to obtain such a dataset again.
Interesting statement. But wouldn’t that mean that Google is in an even better position in regard to primary, or at least pristine data?
Umm, we do. It's still one of the best for support/help chatbot use cases in EU countries. It's got good (best?) multilingual support ootb, it's very "safe" (won't swear, won't output Chinese characters, etc.), and it's pretty fast.
Yep. Before Gemma 3 we were struggling with multilinguality on smaller European languages, and it is still one of the better ones in that regard (even large open or closed models struggle with this to some extent). Gemma 3 is also still pretty decent multimodal-wise.
I didn't know this was a thing until I read this thread, but I can confirm that it does fine (not perfect by any means, just like the average casual non-native fluent speaker), and it is one of the reasons I use it as my local model.