
The distinction is that LLMs are not being used for what they were trained for in this case. In the vast majority of cases, someone using an LLM is not interested in what some mixture of OpenAI employees' ratings and the average person would say about a topic; they are interested in the correct answer.

When I ask ChatGPT for code, I don't want it to imitate humans; I want it to be better than humans. The reward function should then be based on code that actually works, not code that is similar to what humans write.
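
To make that concrete, here is a toy sketch of a "does the code actually work" reward. This is my own illustration of the idea, not any lab's actual training setup:

    import subprocess
    import sys
    import tempfile

    def reward(candidate_code: str, test_code: str) -> float:
        # Toy verifiable reward: 1.0 if the candidate passes the supplied
        # tests (the combined script exits cleanly), 0.0 otherwise.
        # A real setup would sandbox and resource-limit this.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(candidate_code + "\n\n" + test_code)
            path = f.name
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=30)
        return 1.0 if result.returncode == 0 else 0.0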


How about you want to solve Sudoku, say. You simply specify that you want the output to have unique numbers in each row, unique numbers in each column, and no repeated numbers in any 3x3 box.
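
A minimal sketch of what that specification looks like as a checker (my own illustration; the point is you only state the constraints, not how to satisfy them):

    def is_valid_sudoku(grid):
        # grid: 9x9 list of lists of ints; True iff every row, column,
        # and 3x3 box contains each of the digits 1-9 exactly once.
        def ok(cells):
            return sorted(cells) == list(range(1, 10))
        rows = all(ok(row) for row in grid)
        cols = all(ok([grid[r][c] for r in range(9)]) for c in range(9))
        boxes = all(ok([grid[r][c]
                        for r in range(br, br + 3)
                        for c in range(bc, bc + 3)])
                    for br in (0, 3, 6) for bc in (0, 3, 6))
        return rows and cols and boxes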

I feel like this is a very different type of programming, even if in some cases it would wind up being the same thing.


"AlphaProof is a system that trains itself to prove mathematical statements in the formal language Lean. It couples a pre-trained language model with the AlphaZero reinforcement learning algorithm, which previously taught itself how to master the games of chess, shogi and Go."


If you're going to suggest something you think an LLM can't do, I think at the very least, as a show of good faith, you should try it out. I've lost count of the number of times people have told me LLMs can't do shit that they very evidently can.


I explicitly say in my response that LLMs could do it. As a show of good faith, you should try reading the entire comment.

Yes, I'm using simple examples to demonstrate a particular difference, because using "real" examples makes getting the point across a lot harder.

You're also just wrong. I did in fact test it, and both GPT-3.5 Turbo and 4o failed. Not only with the rule change, but with the mere task of providing possible moves. I only included the admission that they may succeed as a matter of due diligence, in that I cannot conclusively rule out that they could get the right answer, given the randomization and API-specific pre-prompting involved.

> "For chess board r1bk3r/p2pBpNp/n4n2/1p1NP2P/6P1/3P4/P1P1K3/q5b1 (FEN notation), what are the available moves for pawn B5"


I did read your entire comment, and that is what prompted my response: from my perspective, your entire premise was based on LLMs failing at simple examples, and yet, despite admitting you thought there was a chance an LLM would succeed at your example, it didn't seem you'd bothered to check.

The argument you are making is based on the fact that the example is simple. If the example were not simple, you would not be able to use it to dismiss LLMs.

I am not surprised that GPT-3.5 and 4o failed; they are both terrible models. GPT-4o is multimodal, but it is far buggier than GPT-4. I tried with Claude 3.5 Sonnet and it got it on the first try. It was also able to compute the moves when told the rule change.


To be clear, the construction given here violates the finite additivity property of a measure. It's got nothing to do with the countable/uncountable additivity property.


Yeah, but this is only possible if the sets in question are uncountably infinite.


Yes, but you are not taking an uncountable union. You are taking a finite union.


Essentially the difficulty arises from attempting to assign a measure (area) to every single subset of the sphere while requiring that rotations preserve this measure. The paradox can be viewed as a proof that you cannot assign a measure to every subset of the sphere in a consistent way.
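
Stated compactly (my paraphrase of the standard formulation), the paradox rules out any assignment of the following kind:

    % No finitely additive, rotation-invariant "area" can be defined on
    % all subsets of the sphere S^2 once the total area is normalised to 1.
    \[
      \nexists\, \mu : \mathcal{P}(S^2) \to [0,\infty] \ \text{such that}\quad
      \mu(S^2) = 1, \quad
      \mu(A \cup B) = \mu(A) + \mu(B) \ \text{for disjoint } A, B, \quad
      \mu(\rho A) = \mu(A) \ \text{for every rotation } \rho .
    \]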

The way measure theory resolves this is by showing that if you restrict to appropriate subsets, called measurable subsets, you can get all the nice properties you would expect.

It turns out that basically every set you would ever write down explicitly is measurable. In fact, the existence of a non-measurable set is independent of ZF, which means you need the axiom of choice (the same axiom used here in the Banach-Tarski paradox) in order to construct a non-measurable set. So measure theory doesn't really lose a great deal by restricting in this way, which is why it gives such a great theory of integration.


The only source I can find for this estimate is from a year ago. I feel like efficiency has gone up a lot since then.


Same as usage



Anabolic steroids will kill you; idk why you'd want to mess with them.


And yet that doesn't rule out that it can. See the New York Times lawsuit.


From old pieces of articles that are quoted all over the internet? That's not surprising.


That's still sufficient both for The Times and for it to be a potential problem in this case.


I don't see how the point about the typical human is relevant. Either you can reason or you can't; the ARC test is supposed to be an objective way to measure this. Clearly a vanilla LLM currently cannot do it, and yet somehow an expert crafting a super-specific prompt is supposed to be impressive.


The point is that if you have some test of whether an AI is intelligent that the vast majority of living humans would fail, or would do worse on than GPT-4o (let alone future LLMs), then it's not a very persuasive argument.

