
Wondering if you could add “Flux 1.1 Pro Ultra” to the site? It’s supposed to be the best among the Flux family of models, and far better than Flux Dev (3rd among your current candidates) at prompt adherence.

Adding it would also provide a fair assessment for a leading open source model.

The site is a great idea and features very interesting prompts. :)


...which is why the web apps of top LLM providers, such as ChatGPT, Claude.ai, and Gemini, nudge you to connect Google Drive and, where appropriate, GitHub repos. They also allow the user/dev to provide feedback to revise the results.

All the training and interaction data will help make them formidable.


Words are more than just symbols; they represent concepts and patterns we observe in the world, in our society, and inside ourselves.

Translation is only possible because we are all humans and have experienced broadly similar concepts, but there's a limit to it, especially in the social milieu and in how we conceptualize ourselves in society.

To truly understand another people and culture at a deep level, you need to learn their native tongue and experience their living environment. This is what I've internalized as a long-time learner and teacher of languages.


>> ...median urban household income...

You seem to have missed the word 'disposable' when citing the chart and table in [0].

China's nominal GDP per capita is already on par with Malaysia's.

https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nomi...

Does your analysis of the relative standards of living of China's and Malaysia's populations still stand?

It's been more than 10 years since I last visited China, while I've been to Malaysia a couple of times not too long ago. From what I see in random social media clips of regular people, their urban standards of living do not differ much.


Au contraire, reasoning LLMs free me from mundane cognitive tasks, e.g. selecting websites to read from among search results, assessing the level of evidence in papers before reading them in more detail, routine coding, and many other repetitive tasks. This lets me spend mental effort on more abstract and strategic thinking.
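
For the first of those tasks, here's a minimal sketch of LLM-assisted search-result triage (assuming the OpenAI Python client; the model name and the prompt are illustrative, not a recommendation):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def triage(query, results):
        """Ask the model to order search results by likely relevance."""
        listing = "\n".join(f"{i}. {r}" for i, r in enumerate(results))
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative choice of model
            messages=[{
                "role": "user",
                "content": (f"Query: {query}\n"
                            f"Rank these results, most relevant first, "
                            f"as a comma-separated list of indices:\n{listing}"),
            }],
        )
        return resp.choices[0].message.content

    print(triage("evidence on LLM debugging ability",
                 ["SEO blog post", "arXiv paper with experiments", "vendor ad"]))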

We should seize the opportunity to elevate our mental skills to a higher plane, aided by these “cognitive elves”.


With agentic RL training and sufficient data, AI operating at the level of an average senior engineer should become plausible within a couple to a few years.

Top-tier engineers who integrate a deep understanding of business and user needs into technical design will likely be safe until we get full-fledged AGI.


Why in a few years? What training data is missing such that we can't have senior-level agents today?


Training data, especially interaction data from agentic coding tools, is important for that. See also: the Windsurf acquisition.


On the other hand, I'm pretty sure you will need senior engineers not only for design but also for debugging. You don't want to hit a wall when your agentic coder runs into a bug that it just won't fix.


There’s a recent article with experiments suggesting LLMs are better at bug fixing than at writing code, IIRC. It’s from a company with a relevant product, though.


Why do you expect AIs to learn programming, but not debugging?


1) Debugging is much harder than writing code that works

2) AIs are demonstrably much, much worse at debugging code than writing fresh code

Ex: "Oh, I see the problem! Let me fix that" -> proceeds to create a new bug while not fixing the old one


Debugging is harder for humans, too.


> 4.5/o3 doesn't seem hugely more intelligent than 3.0

I disagree regarding 3.0, but perhaps that feels true for 4.0, or even 3.5, on some queries.

The reason is that when LLMs are asked questions whose answers can be interpolated or retrieved from their training data, they will likely use widely accepted human knowledge or patterns to compose their responses. (This is a simplification of how LLMs work, just to illustrate the key point here.) This knowledge has been refined and has evolved through decades of human experiments and experiences.

Domain experts of varying intelligence will likely come up with similar replies to these largely routine questions as well.

The difference shows up when you pose a query that demands deep reasoning or requires expertise in multiple fields. Then, frontier reasoning models like o3 can sometimes form creative solutions that are not textbook answers.

I strongly suspect that Reinforcement Learning with feedback from high-quality simulations or real environments will be key for these models' capabilities to surpass those of human experts.
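
To make that feedback loop concrete, here is a toy sketch (my own illustration, not how any lab actually trains): the "simulator" is reduced to an automatic checker, and the policy shifts probability mass toward answers the checker verifies.

    import random

    class ToyPolicy:
        """Crude stand-in for an LLM: samples answers, reweights rewarded ones."""
        def __init__(self, vocab):
            self.weights = {a: 1.0 for a in vocab}

        def sample(self):
            answers, w = zip(*self.weights.items())
            return random.choices(answers, weights=w)[0]

        def update(self, rewarded):
            for a in rewarded:
                self.weights[a] += 1.0  # stand-in for a real gradient step

    def reward(answer, checker):
        # The "simulator": an automatic verifier, e.g. a unit test or proof checker.
        return 1.0 if checker(answer) else 0.0

    policy = ToyPolicy(["4", "5", "22"])
    checker = lambda a: a == "4"  # ground-truth feedback from the environment
    for _ in range(200):
        samples = [policy.sample() for _ in range(8)]
        policy.update([a for a in samples if reward(a, checker) > 0])
    print(policy.weights)  # probability mass shifts toward the verified answer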

Superhuman milestones, equivalent to those achieved by AlphaGo and AlphaZero between 2016 and 2018, might be reached in several fields over the coming years. This will likely happen first in fields with rapid feedback loops and highly accurate simulators, e.g. math problem solving (as opposed to novel mathematical research) and coding (as opposed to product innovation).


She appears to be a very good product developer. However, unless the hypothetical VC knows more about her upcoming plan, it doesn't look like a fundable business with a strong moat and the potential to become a unicorn or larger.


Livebench.ai actually suggests the new version is better on most things.

https://livebench.ai/#/


Current models are quite far from human-level physical reasoning (paper below). Upcoming models trained on world simulation will probably do much better.

PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

https://phybench-official.github.io/phybench-demo/


This is more of a physics math aptitude test. You can already see that the best model at math is already halfway to saturating it. It might not indicate usefulness in actual physical reasoning; at the very least, that claim seems like a bit of a stretch.

