Have they? I would say the last year has shown that “throwing more compute” is not the endgame, with Chinchilla, Galactica and LLaMA being some prominent papers raising questions about it.
Despite all of the (presumed) compute thrown at GPT-4, hallucination is still a massive problem (by their own technical report), and “more RL” isn’t solving that and won’t.
It’s hard to believe that a certain number of FLOPs will fix the fact that next-token prediction is inherently not factually grounded and can produce very inaccurate outputs.
I think we’re plateauing with current model architectures and overestimating LLMs because their output sounds so much better; chess is a great example. If LLMs of this size can’t come close to Stockfish, which can run on my 10-year-old CPU, how much knowledge is all this compute really encoding?
I don't interpret the results of Chinchilla et al. as showing that there are fundamental limits to what LLMs can do. From the Chinchilla paper's abstract:
"we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally"
> GPT4 hallucinations
My interpretation is that we are just seeing Dunning-Kruger. We are at the overconfidence stage, but quickly descending into the Valley of Despair ("I am an AI and can't..."). My reading of the report is that hallucinations are already reduced a lot.
> is inherently not factually grounded
Aren't we all just symbol-manipulation machines? Isn't truth just created by assigning symbols? Is there more value in experiencing an apple falling on your head than in reading about an apple falling onto Newton's head?
> If LLMs of this size can’t come close to stockfish which can run on my 10 year old CPU, how much knowledge is all this compute really encoding?
If we assume that GPT-4 is maybe a bit bigger than GPT-3, then it clocks in at around 1000 GB = 1 TB. A large chess database is something like 5 GB, but I would assume that training keeps only a fraction of that, maybe 100 MB, to encode chess moves. I think it is quite an achievement that GPT-4 can play at all, but it does so by having read moves from text files online.
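To spell out the arithmetic behind those figures (the parameter count and bytes-per-parameter below are my assumptions, since GPT-4's actual size is not public):

```python
# Rough arithmetic behind the "~1 TB" and "fraction of 5 GB" figures.
# All numbers are assumptions; GPT-4's real parameter count is not public.
params = 500e9           # hypothetical: "a bit bigger" than GPT-3's 175B
bytes_per_param = 2      # 16-bit weights
model_size_gb = params * bytes_per_param / 1e9
print(f"~{model_size_gb:.0f} GB of weights")            # ~1000 GB = ~1 TB

chess_db_gb = 5          # a large chess game database, as above
print(f"chess data ~= {chess_db_gb / model_size_gb:.2%} of the model")  # ~0.5%
```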
I agree re: Chinchilla’s main point. What I was trying to point out is that it’s not clear that simply adding “more compute”, which over the past two years has meant “increasing model sizes”, is going to keep scaling as was claimed.
It is still unclear whether we even have enough training tokens to adequately train some of these models (GPT-4’s details are unknown, so I’m thinking of PaLM here). Galactica gets around that by repeating high-quality tokens; Anthropic’s work, conversely, shows that repeated tokens can significantly degrade performance.
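As a rough illustration of why tokens become the bottleneck: under the Chinchilla recipe, the token budget grows with the square root of compute, just like the parameter count. The C ≈ 6·N·D approximation and the ~20-tokens-per-parameter figure below are commonly cited rules of thumb I'm assuming here, not exact numbers from the paper:

```python
# Back-of-envelope Chinchilla-style scaling (coefficients are rough rules of thumb).
# Training compute: C ~= 6 * N * D   (N = parameters, D = training tokens)
# Compute-optimal:  D ~= 20 * N      (scale params and tokens together)

def compute_optimal(train_flops: float) -> tuple[float, float]:
    """Return (params, tokens) that roughly balance a given FLOP budget."""
    tokens_per_param = 20.0
    # C = 6 * N * (20 * N)  =>  N = sqrt(C / 120)
    n_params = (train_flops / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for flops in (1e21, 1e23, 1e25):
    n, d = compute_optimal(flops)
    print(f"{flops:.0e} FLOPs -> ~{n:.1e} params, ~{d:.1e} tokens")
```

At 1e25 FLOPs this already asks for roughly six trillion tokens, which is exactly where the “do we have enough high-quality data” question bites.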
I expect there will be higher yield from, and more focus on, refining datasets and training objectives (e.g. the GNN work by Leskovec and Liang) in the near future, rather than from just throwing more compute at CommonCrawl.
OpenAI claims significantly reduced hallucination, yet by their own metrics GPT-4 scores only up to around 80% on their internal factual-accuracy evals and about 60% on TruthfulQA, so hallucination is still far too frequent for the model to be reliable, despite the presumably large effort put into RLHF and the incorporation of ChatGPT data.
The problem space of potentially inaccurate outputs seems too large for RLHF to be a good solution.
> Aren't we all just symbol manipulation machines? Isn't truth just created by assigning symbols? Is there more value in experiencing an apple falling on your head than reading on an apple falling onto Newton's head?
Yeah, there is actually: having experienced gravity grounds Newton’s law in reality and provides a sanity check. What’s the evidence to suggest that we are purely symbol-manipulation machines?
In reality, human decision-making is very “multimodal” once you move away from low-hanging fruit.
> If we assume that GPT-4 is maybe..
I assume you’re offering a simplified explanation of how LLMs work, but continuing with this hypothetical, I don’t really follow the logic behind that math. If LLMs truly encode knowledge and contain a “model” of reality, the whole point is that the model is reasoning as it plays chess, not necessarily regurgitating.
It “plays” chess (except that it makes illegal moves and appears to play in very predictable ways, as discussed in more detail in the other HN chess post) in a way that suggests it is still doing “fill in the blank” rather than innately understanding/modelling the game, which is the claim being made about LLMs.
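To make the “illegal moves” point concrete, here’s a minimal sketch of checking a model-proposed continuation against the actual rules, using the python-chess library; the opening prefix and the “predicted” move are made up for illustration:

```python
# Minimal sketch: is the model-proposed move even legal in the current position?
# Requires `pip install chess`; the game prefix and proposed move are invented.
import chess

board = chess.Board()
for san in ["e4", "e5", "Nf3", "Nc6"]:  # a common opening prefix, as a stand-in
    board.push_san(san)

proposed = "Bxe5"  # pretend this is the LLM's predicted next move
try:
    board.push_san(proposed)
    print(f"{proposed} is legal here")
except ValueError:
    # Next-token prediction carries no built-in model of the board state,
    # so nothing in the objective prevents emitting an illegal move.
    print(f"{proposed} is illegal in this position")
```

Nothing in the text-prediction objective enforces that constraint; the legality check has to live outside the model.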