> Yet, despite the paucity of negative examples, everyone figures it out.
After spending more than a year babbling nonsense and discovering, each time, a tiny bit more about what certain combinations of phonemes mean, based on the positive or negative responses you get.
> You probably did not need to crash a car for 10k generations before finally making it down the street, nor simulate it in your head.
Are you sure we don't simulate in our head what would happen if we drove the car into the lamp post / brick wall / other car / person, etc.? I find it highly unlikely that this kind of learning does not involve a large amount of simulation.
> There's a lot you can do unreasonably well despite virtually no prior experience.
That's true, but there's a lot we can't do well without repetitive practice, and most things that we can do well in a one-shot fashion depend on having prior practice or familiarity with similar things.
You're digging your heels in on a rehash of a model from the 40s, glibly dismissing the problems it doesn't account for, brought up by linguists in the 50s and 60s, as if they were unaware that babies go through a period of babbling. The time spent acquiring language is already priced into those arguments, and it isn't enough to account for acquisition as pure reward and training.
>Are you sure we don't simulate in our head what would happen if we drove the car into the lamp post / brick wall / other car / person, etc.?
You left out the 10k-times part. You're ignoring the huge amounts of training data these models need even for basic inferences. No, I don't think it takes all that much full-scale simulation to distill car speed as a function of pedal input and estimate the control problem involved.
In many instances, humans can seemingly extrapolate from far less data. The algorithms to do this are missing. Training with loads more data isn't a viable long-term substitute.
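To make that concrete, here is a toy sketch (in Python) of what I mean by distilling a forward model from a handful of observations and then inverting it for control. Everything in it is hypothetical: made-up pedal/speed pairs and a plain least-squares line, nothing more.

```python
# Toy sketch: fit a crude forward model of car speed from a handful of
# pedal observations, then invert it to pick a pedal setting for a target
# speed. All numbers are made up; the point is only how little data it needs.

# A few (pedal position, steady-state speed) observations -- hypothetical.
observations = [(0.1, 8.0), (0.3, 24.0), (0.5, 41.0), (0.7, 55.0)]

# Ordinary least squares for speed ~= a * pedal + b, done by hand.
n = len(observations)
sum_x = sum(p for p, _ in observations)
sum_y = sum(s for _, s in observations)
sum_xx = sum(p * p for p, _ in observations)
sum_xy = sum(p * s for p, s in observations)
a = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
b = (sum_y - a * sum_x) / n

def predict_speed(pedal: float) -> float:
    """Forward model: mentally 'simulate' the speed for a pedal setting."""
    return a * pedal + b

def pedal_for(target_speed: float) -> float:
    """Inverse model: the control problem, solved by inverting the fit."""
    return (target_speed - b) / a

if __name__ == "__main__":
    print(f"predicted speed at pedal 0.4: {predict_speed(0.4):.1f}")  # one cheap rollout
    print(f"pedal needed for speed 30:    {pedal_for(30.0):.2f}")
```

Four data points and a straight line already give a usable controller in this toy; whatever mental simulation people do, it plausibly operates at that level of fidelity rather than at the level of 10k full crashes.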
>> Training with loads more data isn't a viable long-term substitute.
Depends. In principle, you can't learn an infinite language from finite examples only, and you need both positive and negative ones for super-regular languages. Gold's result and so on. OK so far.
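For reference, the standard form of the result I'm waving at is roughly the following (a sketch of Gold's 1967 theorem; the exact variant needed depends on the learning setting):

```latex
% Sketch of Gold (1967), identification in the limit.
\textbf{Theorem (Gold, 1967).} Let $\mathcal{C}$ be a class of languages over a
vocabulary $\Sigma$ that contains every finite language and at least one
infinite language (a ``superfinite'' class). Then $\mathcal{C}$ is not
identifiable in the limit from positive examples alone. The regular languages
already form such a class, so they, and every richer class, require negative
evidence or prior constraints on the learner.
```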
The problem is that in order to get infinite strings from a human language, you need to use its unbounded capacity for embedding parenthetical clauses: John, the friend of Mary, who married June, who is the daughter of Susan, who went to school with Babe, who ...
But while this is possible in principle, in practice there's a limit to how long such a sentence can be; or any sentence, really. Most of the utterances humans generate are going to be not only finite but relatively short, "short" relative to the physical limit on utterance length a human could plausibly produce (which must be something around the length of the Iliad, considering that the speaker has to keep the entire utterance in memory or lose the thread, and that the Iliad probably went on for as long as one could stand to recite from memory. Or perhaps to listen to someone recite from memory...).
Obviously, there are only a finite number of sentences of finite length, given a fixed vocabulary, so _in practice_ language, as spoken by humans, is not actually-really infinite. Or, let's say that humans really do have a generator of infinite language in our heads, but an outside observer would never see the entire language being produced, because finite universe.
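As a back-of-the-envelope check on "finite but astronomically large" (the vocabulary size and length caps below are made-up numbers, chosen only to show the growth rate):

```python
# Count the distinct word strings of length 1..max_len over a fixed vocabulary.
# Vocabulary size and length caps are made up; the point is the growth rate.

def sentence_count(vocab_size: int, max_len: int) -> int:
    """Sum of vocab_size**k for k = 1..max_len."""
    return sum(vocab_size ** k for k in range(1, max_len + 1))

if __name__ == "__main__":
    for cap in (5, 10, 20):
        total = sentence_count(20_000, cap)
        print(f"strings of at most {cap:2d} words: about 10^{len(str(total)) - 1}")
```

Even at twenty words the count dwarfs any corpus anyone could collect, which is the sense in which the language is finite only on paper.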
Which means that Chomsky's argument about the poverty of the stimulus might apply to human learning, because it's very clear we learn some kind of complete model of language as we grow up; but it doesn't need to apply to statistical modelling, i.e. the approximation of language by taking statistics over large text corpora. Given that those large corpora will only have finite utterances, and relatively short ones at that (as I'm supposing above), it should be possible to at least learn the structure of everyday spoken language, just from text statistics.
So training with lots of data can be a viable long-term solution, as long as what's required is only to model the practical parts of language, rather than the entire language. I think we've had plenty of evidence that this should be possible since the 1980s or so.
Now, if someone wanted to get a language model to write like Dostoyevsky...
Your argument is that maybe we can brute-force, with statistics, sentences long enough that no one notices we run out past a certain point?
Everything you said applies to computers too. Real machines have physical memory constraints.
Sure, the set of real sentences may be technically finite, but the growth per word is exponential and you don't have the compute resources to keep up.
Information is not about what is said but about what could be said. It doesn't matter so much that not every valid permutation of words is uttered; what matters is that for any set of circumstances there exist words to describe it. Each new word in the string carries information in the sense that it reduces the set of possibilities relative to what they were before I relayed my message. A machine which picks the maximum-likelihood message in all circumstances is by definition not conveying information. It's spewing entropy.
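To put a number on that (a sketch using the standard Shannon self-information of a message, which is my framing rather than anything spelled out above): a message carries -log2(p) bits, where p is the probability the receiver already assigned to it, so a message the receiver could predict with certainty carries zero bits.

```python
import math

# Self-information of a received message: bits = -log2(p), where p is the
# probability the receiver assigned to that message before receiving it.

def bits(p: float) -> float:
    """Bits of information in a message the receiver expected with probability p."""
    return 0.0 if p >= 1.0 else -math.log2(p)

if __name__ == "__main__":
    # Each word that narrows the space of possibilities contributes information:
    print(f"word with prior probability 1/8    -> {bits(1 / 8):.1f} bits")
    print(f"word with prior probability 1/1024 -> {bits(1 / 1024):.1f} bits")
    # An output that is always the single most likely continuation is fully
    # predictable, so receiving it conveys nothing new:
    print(f"always the argmax (probability 1)  -> {bits(1.0):.1f} bits")
```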
Now, now. Who said anything about information? I was just talking about modelling text. Like, the distribution of token collocations in a corpus of natural language. We know that's perfectly doable; it's been done for years. And to avoid exponential blowups, just use the Markov property or, in any case, do some fudgy approximation of this and that, and you're good to go.
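Something like this toy bigram (first-order Markov) sketch is all I have in mind by modelling text; the corpus is made up and the code is purely illustrative, not a claim about how any particular system works:

```python
import random
from collections import defaultdict

# Toy bigram (first-order Markov) text model: collect token-collocation
# counts from a corpus, then sample text from the conditional distribution.
# The "corpus" is made up; a real one would just be much more of the same.

corpus = ("john married june . june is the daughter of susan . "
          "susan went to school with babe . babe went to school .")

def train(text: str) -> dict:
    """Count how often each token follows each other token."""
    counts = defaultdict(lambda: defaultdict(int))
    tokens = text.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts: dict, start: str, length: int = 12) -> str:
    """Sample a continuation by repeatedly drawing the next token."""
    out = [start]
    for _ in range(length):
        followers = counts.get(out[-1])
        if not followers:
            break
        words, weights = zip(*followers.items())
        out.append(random.choices(words, weights=weights)[0])
    return " ".join(out)

if __name__ == "__main__":
    model = train(corpus)
    print(generate(model, "john"))
```

Conditioning only on the previous token is exactly the fudgy approximation: it keeps the table small at the cost of forgetting everything past one word of context.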
>> Your argument is that maybe we can brute-force, with statistics, sentences long enough that no one notices we run out past a certain point?
No, I wasn't saying that; I was saying that we only need to model sentences that are short enough that nobody will notice that the plot is lost with longer ones.
To clarify, because it's late and I'm tired and probably not making a lot of sense and bothering you: I'm saying that statistics can capture some surface regularities of natural language, but not all of natural language, mainly because there's no way to display the entirety of natural language for its statistics to be captured.
Oh god, that's an even worse mess. I mean: statistics can only get you so far. But that might be good enough depending on what you're trying to do. I think that's what we're seeing with those GPT things.
>I was saying that we only need to model sentences that are short enough that nobody will notice that the plot is lost with longer ones.
That's one of the things on my short list of unsolved problems. People remember oddly specific and arbitrarily old details. Clearly not a lossless memory, but also not an agnostic token window that starts dropping stuff after n tokens.
I think we agree then that a plain superficial model gets you surprisingly far, but does lose the plot. It is certainly enough for things that are definable purely as and within text (the examples I gave). Beyond that, who knows.