> I don’t think this rebuts the significance of emergence, since metrics like exact match are what we ultimately want to optimize for many tasks. Consider asking ChatGPT what 15 + 23 is—you want the answer to be 38, and nothing else. Maybe 37 is closer to 38 than -2.591, but assigning some partial credit to that answer seems unhelpful for testing ability to do that task, and how to assign it would be arbitrary.
Not sure I can agree with this. Let's say, for the sake of argument, that normal calculators don't exist. I can choose to run my calculation through ChatGPT or do it myself. It's true that I would vastly prefer the answer to always be correct, but it's not true that there is never value in bounding the error.
Put another way: would you rather have a calculator that is at most 5% off 100% of the time, or a calculator that is 100% correct 95% of the time but may output garbage the remaining 5%?
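To put rough numbers on that, here's a quick Python sketch of my own toy model of the two failure modes (the error ranges are assumptions I picked for illustration, not anything from the article):

```python
import random

random.seed(0)
trials = 100_000
err_bounded, err_mostly_exact = [], []

for _ in range(trials):
    a, b = random.randint(1, 1000), random.randint(1, 1000)
    truth = a + b

    # Calculator A: always answers, but may be up to 5% off.
    bounded = truth * (1 + random.uniform(-0.05, 0.05))
    err_bounded.append(abs(bounded - truth))

    # Calculator B: exactly right 95% of the time, garbage otherwise.
    exact = truth if random.random() < 0.95 else random.uniform(-10_000, 10_000)
    err_mostly_exact.append(abs(exact - truth))

print("A: mean error", round(sum(err_bounded) / trials, 2), "worst", round(max(err_bounded), 2))
print("B: mean error", round(sum(err_mostly_exact) / trials, 2), "worst", round(max(err_mostly_exact), 2))
```

Which one you'd rather have depends on whether average error or worst-case error is what actually hurts you.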
Upon further reflection: it seems mighty ambitious to expect a statistical model to always be correct. If always being correct were possible, you probably didn't need a statistical model to begin with. Given that the model will fail from time to time, it seems especially useful to be able to bound the error somehow. +1 for smooth metrics, I guess.
I guess all of this is slightly off on a tangent, as Wei seems to just be arguing that "there might be things LLMs can do, or will learn to do in the future, that we didn't train them for, and we won't necessarily be able to predict it from the capabilities of smaller models." I agree with this, just not with the way he arrives at the conclusion.
I can do math myself, thank you very much. If I take the effort to pull out a math calculating machine, the calculation is important enough that I want 99.99% certainty that the answer is right. And if the answer does turn out to be wrong, I want 99.999% certainty that the failure is related to data capture (my input) and not the calculating mechanism itself.
Within the context of LLMs, the goal should be to make them recognize that the problem at hand is not well suited to their own set of capabilities, and to defer to an external system. In the case of ChatGPT, any numeric answer to "15+23" (including, and especially, "38") is wrong. The correct answer is "sorry, as a Large Language Model I can retrieve information that is encoded in English, and I am not especially suited to solving math problems." An LLM-empowered Alexa could be proactive, run a calculator program for you, and report back "According to the bc program, the answer is 38. Here's the full log so you can check whether this is what you wanted."
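A minimal sketch of that kind of deferral (the `looks_like_arithmetic` check is a hypothetical stand-in, and it assumes `bc` is installed; a real system would route this through the model's own tool-calling machinery):

```python
import re
import subprocess

def looks_like_arithmetic(question: str) -> bool:
    # Hypothetical check: only digits, whitespace, and basic operators.
    return bool(re.fullmatch(r"[\d\s+\-*/().]+", question.strip()))

def answer(question: str) -> str:
    if looks_like_arithmetic(question):
        # Defer to an external calculator instead of guessing tokens.
        result = subprocess.run(
            ["bc", "-l"], input=question + "\n", capture_output=True, text=True
        )
        return f"According to bc, the answer is {result.stdout.strip()}"
    return "Sorry, I'm not especially suited to this problem; try another tool."

print(answer("15+23"))  # According to bc, the answer is 38
```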
This is not so much AGI as deliberate programmatic patching, not unlike the disclaimers when you ask ChatGPT about controversial topics.
You might be able to do math yourself, but there are many practical reasons why you wouldn't or couldn't do all sorts of math -- even simple math.
For example, how would you find the median of a billion random numbers without a computer? You wouldn't have enough time in your lifetime, not to mention that you'd probably make a lot of mistakes, despite the process being very simple in principle.
A computer could do this easily, and even were it to be a little off in its answer, that would be far preferable to devoting your entire lifetime to finding the answer yourself.
That's just one very simple example. It's not hard to find plenty more.
Yes, I am glad we have computers. I am also glad that I can code a median function that can efficiently go through a billion numbers; no need to ask the artificial averaged embodiment of millions of dumb teenagers for its duh-best-guess-lol.
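For the record, that median function really is a couple of lines with numpy. The sketch below uses 10 million numbers rather than a billion, since a billion float64s is roughly 8 GB of RAM; for the full-size case you'd reach for a memory-mapped array or a streaming selection algorithm.

```python
import numpy as np

rng = np.random.default_rng(42)
numbers = rng.random(10_000_000)  # scale up to 1e9 with enough RAM or np.memmap

# np.median does an exact selection/sort under the hood; no guessing involved.
print(np.median(numbers))
```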
I mean... LLMs are based on neural networks, right? Neural networks can approximate any function. It's possible the LLM has formed internal regions that extract the operands and do something very similar to traditional calculation. I think "predicting the next token" is probably a bit reductive as a description of what's happening inside the LLM, based on my understanding of image classification networks.
Humans don't work like binary computers either. Every person probably has some error rate on calculations as well. Maybe the LLM is much closer to humans in doing math than either is to a physical processor.
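That's easy to sanity-check at toy scale. The scikit-learn sketch below (my own toy setup, nothing measured on an actual LLM) trains a small MLP on random (a, b) → a + b pairs; it gets very close to 38 for 15 + 23, but it remains an approximation with an error rate rather than exact arithmetic.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(50_000, 2))   # pairs (a, b)
y = X.sum(axis=1)                            # target: a + b

model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
model.fit(X, y)

pred = model.predict(np.array([[15.0, 23.0]]))[0]
print(f"predicted 15 + 23 ≈ {pred:.3f}")     # close to 38, rarely exactly 38
```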