
Quote: "GPT-4 will change the world"

No it won't. Neither will 5 or 6, which I suspect will be out within two years at most. But whatever comes after that (the 7th or 8th) definitely will.



I actually think it might get so “smart” that we will no longer be able to understand it, and will have no way to verify its accuracy because we’ll have nothing to benchmark it against.

Will be great.


I expect this one day, but only after there are more methodological breakthroughs. I am doubtful we get there by just scaling up current methods.

What I predict is that AI will stagnate at the upper end of human-level intelligence, except in narrow domains where we can simulate unlimited data and thereby allow self-play learning, such as in Go. But such domains are limited. For everything else, in the absence of unlimited data (generated via self-exploration or self-play), the best AI can do is learn from human output, which carries with it all the limitations and flaws of human cognition.


Two of the most interesting domains (mathematics and programming) do have external forms of validation. In mathematics you can use a proof assistant like Lean. In programming there are types, tests, and formal verification.
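
For instance, a throwaway Lean 4 check illustrates what that external validation buys you: the kernel either accepts a proof or it doesn't, no human judgement required.

  -- Lean 4: the kernel accepts these only if the proofs actually check out.
  example : 2 + 2 = 4 := rfl

  theorem add_zero' (n : Nat) : n + 0 = n := rfl

  -- Change 4 to 5 in the first line and Lean rejects it outright.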

Where the AI will struggle is anything where forward progress requires a lab. In this case it will not provide as much acceleration, as it still has to move at the speed of the human in the lab doing the experiment for it. It can maybe make better decisions than the human and get a small speedup, but it won’t be anywhere close to the possible speedup for formal sciences.

And the AI will really struggle with anything that involves mechanical sympathy gained through practice. No amount of book knowledge or raw intelligence can teach an AI how to play the violin. For that, you need a violin, a hand, and a feedback loop. Indeed, there are probably many skills at which humans already learn at a near-optimal rate, because the bottleneck is the physical environment rather than our brains.


It would be interesting to see how these LLMs work in a REPL environment. The model could take a (language, version, compiler, sourceText) tuple as input and generate output in the same form. It could also be fed the compiler output as a (warningCount, errorCount, stdoutText, stderrText) tuple, so it can learn to produce error-free output.
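
A rough sketch of that loop in Python, assuming gcc is on the PATH; generate() is a hypothetical stand-in for the model, and only the compiler call is real:

  # Sketch of the compile-and-feed-back loop described above.
  # compile_c() is real; generate() is a hypothetical stand-in for the model.
  import os, subprocess, tempfile

  def compile_c(source_text):
      """Compile C source with gcc, return (warningCount, errorCount, stdoutText, stderrText)."""
      with tempfile.NamedTemporaryFile(suffix=".c", mode="w", delete=False) as f:
          f.write(source_text)
          path = f.name
      result = subprocess.run(["gcc", "-Wall", "-c", path, "-o", os.devnull],
                              capture_output=True, text=True)
      os.unlink(path)
      return (result.stderr.count("warning:"), result.stderr.count("error:"),
              result.stdout, result.stderr)

  def refine(generate, source, rounds=5):
      """Feed compiler diagnostics back to the model until it emits error-free code."""
      for _ in range(rounds):
          feedback = compile_c(source)
          if feedback[1] == 0:              # errorCount == 0: done
              return source
          source = generate("c", "c17", "gcc", source, feedback)
      return source

  if __name__ == "__main__":
      print(compile_c("int main(void) { return 0; }"))   # -> (0, 0, '', '')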

It would also be interesting to feed it language concepts ("if statement", "while loop", etc.) in the form of (language, version, concept, sourceText), to see if it can form an abstract understanding of those concepts, and successfully port code from one language to another.


> I am doubtful we get there by just scaling up current methods.

Why? The last two years show that throwing more compute at this during training and execution is clearly helping.

LLMs currently lack access to real-world data, for instance to measure a physical quantity or run a randomized controlled trial. But given how much of human output is just manipulation of language tokens, and that we have quite a lot of sample data to feed into the networks, I am confident there are still some steps ahead of us that are essentially just more of the same.

Another important data point is how little data we need to feed into a human brain to get somewhere smart.


Have they? I would say the last year has shown that “throwing more compute” is not the endgame, with Chinchilla, Galactica, and LLaMA being some of the prominent papers raising questions.

Despite all of the (presumed) compute thrown at GPT-4, hallucination is still a massive problem (per their technical report), and “more RL” isn't solving that and won't.

It’s hard to believe that some number of FLOPs will fix the fact that next-token prediction is inherently not factually grounded and can produce very inaccurate outputs.

I think we’re plateauing with current model architectures and overestimating LLMs because their output sounds so much better. Chess is a great example: if LLMs of this size can’t come close to Stockfish, which can run on my 10-year-old CPU, how much knowledge is all this compute really encoding?


I don't interpret the results of Chinchilla et al. as showing that there are fundamental limits to what LLMs can do. From the Chinchilla paper's abstract:

"we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally"

> GPT4 hallucinations

My interpretation is that we are just seeing Dunning-Kruger. We are at the overconfidence stage, but quickly descending into the Valley of Despair ("I am an AI and can't..."). My reading of the report is that hallucinations are already reduced a lot.

> is inherently not factually grounded

Aren't we all just symbol manipulation machines? Isn't truth just created by assigning symbols? Is there more value in experiencing an apple falling on your head than in reading about an apple falling onto Newton's head?

> If LLMs of this size can’t come close to Stockfish, which can run on my 10-year-old CPU, how much knowledge is all this compute really encoding?

If we assume that GPT-4 is maybe a bit bigger than GPT-3, then it clocks in at around 1000 GB = 1 TB. A large chess database is something like 5 GB, and I would assume that training keeps only a fraction of that, maybe 100 MB, to encode chess moves. I think it is quite an achievement that GPT-4 can play at all, just by reading moves from text files online.
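
Spelling out the fraction those (guessed) numbers imply:

  # Back-of-envelope with the guessed figures above (not known GPT-4 numbers).
  model_bytes = 1000 * 10**9    # assumed ~1 TB of weights
  chess_bytes = 100 * 10**6     # assumed ~100 MB devoted to chess
  print(f"{chess_bytes / model_bytes:.3%}")   # -> 0.010% of the model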


I agree re: Chinchilla’s main point. What I was trying to point out is that it’s not clear that simply adding “more compute”, which over the past two years has meant “increasing model sizes”, is going to keep scaling as was stated.

It is still unclear whether we even have enough training tokens to adequately train some of these models (GPT-4's details are unknown; I'm thinking of PaLM here). Galactica gets around that by repeating high-quality tokens; Anthropic’s work, conversely, shows that repeated tokens can significantly degrade performance.

I expect there will be higher yield/more focus on refining datasets or training objectives (e.g. GNN work by Leskovec and Liang) in the near future rather than just throwing more compute at CommonCrawl.

OpenAI claims significantly reduced hallucination, yet by their own metric it scores only up to 80% on factual-accuracy tests and 60% on TruthfulQA, so hallucination is still far too frequent to be reliable, despite the presumably large effort put into RLHF and the incorporation of ChatGPT data.

The problem space of potentially inaccurate outputs seems too large for RLHF to be a good solution.

> Aren't we all just symbol manipulation machines? Isn't truth just created by assigning symbols? Is there more value in experiencing an apple falling on your head than in reading about an apple falling onto Newton's head?

Yes, there is actually: having experienced gravity grounds Newton’s law in reality and provides a sanity check. What’s the evidence that we are purely symbol-manipulation machines?

In reality, human decision making is very “multimodal” when you move away from low hanging fruit.

> If we assume that GPT-4 is maybe..

I assume you’re offering a simplified explanation of how LLMs work, but continuing this hypothetical, I don’t really follow the logic behind that math. If LLMs truly encode knowledge and contain a “model” of reality, the whole point is that the model is reasoning as it plays chess, not merely regurgitating.

It “plays” chess (except that it makes illegal moves and appears to play in very predictable ways, as discussed in more depth in the other HN chess post) in a way that suggests it is still doing “fill in the blank” rather than innately understanding or modelling the game, which is the claim being made about LLMs.


> Why?

These AIs (GPT-4, Stable Diffusion) are being trained on human output. How can you far exceed the capability of humans if you are being trained on human output? The limitations of human cognition are present in the training data. It's no coincidence that Stable Diffusion/MidJourney are producing artworks about 97% as beautiful as the most competent human artists, and we never see something far more beautiful than what a human could produce even by chance. Ditto with the code that GPT-4 produces. It inherently can't exceed the capabilities of the most competent human, because such solutions are not present in the information available in the input data.

Let's take Go as an example. The reason DeepMind created a superhuman Go agent is because of unlimited self-play. But if DeepMind limited itself to training only on games that humans have played (analogous to how GPT-4, Stable Diffusion were trained), this take-off would have been impossible. At best, the resulting agent would have been as intelligent as human players, perhaps very slightly better, but not much. That's why I think LLMs need to go the way of self-play or self-teaching somehow if we expect them to ever get far beyond humans intellectually.
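
To make the self-play point concrete, here is a toy sketch (nothing to do with Go, just the mechanism): a tabular agent that generates all of its own training data by playing both sides of one-pile Nim, with no human games anywhere in the loop.

  # Toy self-play: tabular learning on one-pile Nim (take 1-3 stones,
  # taking the last stone wins). The agent plays both sides and learns
  # purely from its own games -- no human data involved.
  import random
  from collections import defaultdict

  Q = defaultdict(float)            # (stones_left, move) -> estimated value
  EPS, ALPHA = 0.1, 0.1

  def pick(stones, greedy=False):
      moves = [m for m in (1, 2, 3) if m <= stones]
      if not greedy and random.random() < EPS:
          return random.choice(moves)          # occasional exploration
      return max(moves, key=lambda m: Q[(stones, m)])

  for _ in range(50_000):
      stones, history = 10, []
      while stones > 0:
          move = pick(stones)
          history.append((stones, move))
          stones -= move
      # The player who took the last stone wins; moves alternate players.
      for i, (s, m) in enumerate(reversed(history)):
          reward = 1.0 if i % 2 == 0 else -1.0
          Q[(s, m)] += ALPHA * (reward - Q[(s, m)])

  # Usually settles on the optimal move from 10 stones: take 2,
  # leaving the opponent a multiple of 4.
  print(pick(10, greedy=True))

A supervised version of the same agent could never get better than the games it was shown; the self-play version is limited only by how much it plays.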

> The last two years show that throwing more compute at this during training and execution is clearly helping.

That's because these models were far below human capability and have been catching up to it. This doesn't refute the thesis. My prediction applies from the point where GPT is about as smart as the smartest humans.



