

This has been known for a long time to those interested in the field.


You need recursion at some point: you can't account for all possible scenarios of combinations, as you would need an infinite number of layers.


> infinite number of layers

That’s not as impossible as it seems: Gaussian Processes are equivalent to a neural network with infinitely many hidden units, and any multilayer NN can be approximated by one with a single, larger layer of hidden units.
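To make that concrete, here is a minimal NumPy sketch (my own toy setup, not from the thread) that checks the infinite-width claim empirically: the output covariance of a random single-hidden-layer ReLU network with i.i.d. Gaussian weights agrees, up to Monte Carlo error, with the analytic arc-cosine GP kernel.

```python
import numpy as np

# Monte Carlo check of Neal's result: a single hidden layer of ReLU units with
# i.i.d. Gaussian weights has the same output covariance as the analytic
# infinite-width GP kernel (the order-1 arc-cosine kernel of Cho & Saul).
rng = np.random.default_rng(0)
d, width, trials = 3, 512, 5000        # input dim, hidden units, random networks
x1 = np.array([1.0, 0.5, -0.2])
x2 = np.array([-0.3, 1.2, 0.7])

def network_outputs(x_a, x_b):
    # Hidden weights ~ N(0, 1/d), output weights ~ N(0, 1/width)
    W = rng.normal(0.0, np.sqrt(1.0 / d), size=(trials, width, d))
    v = rng.normal(0.0, np.sqrt(1.0 / width), size=(trials, width))
    f_a = np.einsum("tn,tn->t", v, np.maximum(W @ x_a, 0.0))
    f_b = np.einsum("tn,tn->t", v, np.maximum(W @ x_b, 0.0))
    return f_a, f_b

def arccos_kernel(x_a, x_b):
    # Analytic infinite-width covariance for ReLU hidden units
    na, nb = np.linalg.norm(x_a), np.linalg.norm(x_b)
    theta = np.arccos(np.clip(x_a @ x_b / (na * nb), -1.0, 1.0))
    return na * nb * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2.0 * np.pi * d)

f1, f2 = network_outputs(x1, x2)
print("empirical covariance:", np.mean(f1 * f2))   # ~0.10 for these inputs
print("analytic GP kernel  :", arccos_kernel(x1, x2))
```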


"a single, larger layer of hidden units"

Does this not mean that the entire model must cycle to operate any given part? Division into concurrent "modules" (the term appearing in this paper) allows each module's operating frequency to be optimized independently and intentionally.

Also, what certainty is there that everything is best modelled with multilayer NN? Diversity of algorithms, independently optimized, could yield benefits.

Further, can we hope that modularity will create useful points of observability? The inherent serialization that develops between modules could be analyzed, and possibly reveal great insights.

Finally, isn't there a possibility that AGI could be achieved more rapidly by factoring the various processes into discrete modules, as opposed to solving every conceivable difficulty in a monolithic manner, whatever the algorithm?

That's a lot of questions. Seems like identifying possible benefits is easy enough that this approach is worth exploring. We shall see, I suppose. At the very least we know the modularization of HRM has a valid precedent: real biological brains.


It would not surprise me if all of these tangential advances in various models and approaches ultimately become part of a larger framework of modules designed to handle certain tasks, similar to how your medulla oblongata regulates breathing and heart rate, your amygdala sorts out memory and hormone production, and your cingulate gyrus helps control motor function.

We have a great example (us); we just need to hone and replicate it.


I mean, recurrence is an attempt to allow approximation of recursive processes, no?


Either they overtook other LLMs by simply using more compute (which is reasonable to think, as they have a lot of GPUs), or I'm willing to bet there is benchmark contamination. I don't think their engineering team came up with better techniques than those used to train other LLMs, and Elon has a history of making deceptive announcements.


How do you explain Grok 4 achieving new SOTA on ARC-AGI-2, nearly doubling the previous commercial SOTA?

https://x.com/arcprize/status/1943168950763950555


They could still have trained the model in such a way as to focus on benchmarks, e.g. training on more examples of ARC style questions.

What I've noticed when testing previous versions of Grok is that they looked better on paper, but when I actually used them the responses were always worse than Sonnet and Gemini, despite the higher benchmark scores.

Occasionally I test Grok to see if it could become my daily driver but it's never produced better answers than Claude or Gemini for me, regardless of what their marketing shows.


> They could still have trained the model in such a way as to focus on benchmarks, e.g. training on more examples of ARC style questions

That's kind of the idea behind ARC-AGI. Training on available ARC benchmarks does not generalize. Unless it does... in which case, mission accomplished.


It still seems possible to spend effort building up an ARC-style dataset, and that would game the test. The ARC questions I saw were not on some completely unknown topic; they were generally hard versions of existing problems in well-known domains. I'm not super familiar with this area in general, though, so I'd be curious if I'm wrong.


ARC-AGI isn't question- or knowledge-based, though, but "Infer the pattern and apply it to a new example you haven't seen before." The problems are meant to be easy for humans but hard for ML models, like a next-level CAPTCHA.

They have walked back the initial notion that success on the test requires, or demonstrates, the emergence of AGI. But the general idea remains, which is that no amount of pretraining on the publicly-available problems will help solve the specific problems in the (theoretically-undisclosed) test set unless the model is exhibiting genuine human-like intelligence.

Getting almost 16% on ARC-AGI-2 is pretty interesting. I wish somebody else had done it, though.
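For readers who haven't seen the format, here is a toy illustration of that "infer the pattern, apply it to a new example" setup, with grids as small integer matrices. This is a made-up task of my own, not one from the actual benchmark:

```python
# Toy illustration of the ARC format (not an actual benchmark task):
# each task gives a few input -> output grid pairs; the solver must infer
# the transformation and apply it to a held-out test input.
train_pairs = [
    ([[0, 1], [2, 0]], [[0, 2], [1, 0]]),   # outputs are the inputs transposed
    ([[3, 0], [0, 4]], [[3, 0], [0, 4]]),
]
test_input = [[5, 6], [0, 7]]

def infer_and_apply(pairs, grid):
    # A human quickly spots "transpose"; the benchmark's point is that the
    # rule changes from task to task, so memorizing past tasks shouldn't help.
    transpose = lambda g: [list(row) for row in zip(*g)]
    assert all(transpose(inp) == out for inp, out in pairs)
    return transpose(grid)

print(infer_and_apply(train_pairs, test_input))   # [[5, 0], [6, 7]]
```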


I’ve seen some of the problems before, like https://o3-failed-arc-agi.vercel.app/

It is not hard to build datasets that contain these types of problems, and I would expect LLMs to generalize over them well. I don't see how this is really any different from any other type of problem LLMs are good at, given they have a dataset to study.

I get that they keep the test updated with secret problems, but I don't see what stops companies from gaming this just by investing in building their own datasets, even if it means paying teams of smart people to generate them.
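As a purely hypothetical sketch of what that gaming could look like: pick random rules from a small library of grid transformations and mass-produce input/output pairs. Whether such synthetic data actually generalizes to the hidden test set is exactly the open question in this thread.

```python
import random

# Hypothetical generator of ARC-style training examples: choose a rule from a
# small library of grid transformations and emit (input, output) pairs.
RULES = {
    "transpose": lambda g: [list(r) for r in zip(*g)],
    "mirror":    lambda g: [list(reversed(r)) for r in g],
    "flip_rows": lambda g: list(reversed(g)),
}

def random_grid(h, w, colors=10):
    return [[random.randrange(colors) for _ in range(w)] for _ in range(h)]

def make_task(n_pairs=3, size=4):
    name, rule = random.choice(list(RULES.items()))
    pairs = []
    for _ in range(n_pairs):
        grid = random_grid(size, size)
        pairs.append({"input": grid, "output": rule(grid)})
    return {"rule": name, "pairs": pairs}

print(make_task())
```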


The other question is whether enough examples of this type of task are helpful and generalizable in some way. If so, why wouldn't you integrate that dataset into your LLM's training pipeline?


I use Grok with repomix to review my code, and it tends to give decent answers; it's a bit better at giving actual actionable issues with code examples than, say, Gemini 2.5 Pro.

But the lack of a CLI tool like codex, claude code or gemini-cli is preventing it from being a daily driver. Launching a browser and having to manually upload repomixed content is just blech.

With gemini I can just go `gemini -p "@repomix-output.xml review this code..."`
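Until an official CLI exists, one workaround is to call the API directly from a small script. This is only a sketch: the OpenAI-compatible endpoint at api.x.ai and the model name below are assumptions on my part and should be checked against the current xAI docs.

```python
import os
import sys
import requests

# Rough CLI substitute: send a repomix dump plus a prompt to the xAI API.
# ASSUMPTIONS: the OpenAI-compatible endpoint URL and the model name are
# placeholders; verify both against the current xAI documentation.
API_URL = "https://api.x.ai/v1/chat/completions"
MODEL = "grok-4"  # placeholder model name

def review(repomix_path: str, prompt: str) -> str:
    with open(repomix_path, "r", encoding="utf-8") as f:
        repo_dump = f.read()
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {os.environ['XAI_API_KEY']}"},
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": f"{prompt}\n\n{repo_dump}"}],
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # e.g. python grok_review.py repomix-output.xml "review this code..."
    print(review(sys.argv[1], sys.argv[2]))
```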


Well, try it again and report back.


As I said, either by benchmark contamination (the benchmark is semi-private and could have been obtained by people from other companies whose models have been benchmarked) or by having more compute.


I still don't understand why people point to this chart as meaning anything. Cost per task is a fairly arbitrary X axis and in no way represents any sort of time scale. I would love to be told how they didn't underprice their model and give it an arbitrary amount of time to work.


Anecdotally, output in my tests is pretty good. It's at least competitive with SOTA from other providers right now.


Location: France
Remote: Yes or hybrid
Willing to relocate: Yes
Technologies: Playwright, JavaScript/HTML, Bash
Résumé/CV: https://drive.google.com/file/d/1SkC2fa3sKozpCvDC1QPMUBT9--3...
Email: dbagory[at]icloud[dot]com

I am a QA engineer with experience in automated test development for the web, web development, system administration, software integration, and writing documentation. I am interested in any role that requires one or more of these skills.


This sounds like an RNN with extra steps.


