This article is....total nonsense. None of the challenges described are impossible to solve with statistical methods. And I'd be willing to bet nearly any amount of money that they will be solved that way, long before they are solved in any other way. The problems posed here are being addressed as we speak with knowledge graphs and graph neural networks, and some of the others can be addressed with more nuanced and context-aware vectorization techniques.
But more to the point, the authors here seem either completely unaware of, or unable to grasp, what language actually is. They claim "language is not data", but language literally is data. They claim that utterances have one and only one meaning, but that is literally never true, and is a profound misunderstanding of the fundamental nature of language.
Words, utterances, and grammatical rules do not have fixed meanings or structures. They have approximate, moving structures and meanings that are unique for every person on the planet. This idea that "real" understanding is some top-down deterministic, univalent process is just completely wrong, and will always fail anywhere and everywhere that it is tried. Not because it is too hard, not because our algorithms aren't good enough yet, but because it fundamentally does not comport with reality.
Top-down rules-driven approaches to "understanding" are essentially the computer science version of "conceptual analysis" from philosophy, and they are a similarly seductive waste of time that we would do well to leave behind, as philosophy has done with conceptual analysis.
What this discussion is missing is Chomsky's distinction between I-language and E-language. I-language (individual/internal/intensional) is the knowledge of an individual native speaker. E-language represents the body of external knowledge about language, such as corpus data and mass statistical models.
Study of I-language looks at an idealised individual speaker's internal language capacity, which can generate an infinite array of structured expressions from finite pieces. From the internalist perspective, E-language is a dead end because it represents only part of the knowledge of an individual, namely that which happens to be externalised. Many of the most interesting and revealing linguistic phenomena are barely reflected in corpora but can be elicited in experiments with individual speakers.
"E-language" approaches, like statistical models of language, do lead to practical and useful results. But using E-language approaches to seriously study the human capacity for language is about as useful as doing physics experiments in GTA.
I think you might be missing the point. It's not that an utterance/proposition has "only one meaning", but that it is meant to represent only one concept (the speaker's) at the time it's spoken.
Also, I thought that the argument about the impossibility of distinguishing between equally probable meanings (in terms of frequency in a corpus) was a good one.
> I think you might be missing the point. It's not that an utterance/proposition has "only one meaning", but that it is meant to represent only one concept (the speaker's) at the time it's spoken.
While this is closer to a coherent point, I'm not sure it's particularly relevant to the idea of "natural language understanding". We consider humans to understand natural language, yet humans interpreting the words of other humans are constantly confounded by various forms of ambiguity or polysemy. The idea that you don't have NLU until you can perfectly ascertain the exact singular meaning of an utterance in the mind of the speaker implies that no human has achieved NLU either.
> Also, I thought that the argument about the impossibility of distinguishing between equally probable meanings (in terms of frequency in a corpus) was a good one.
It's certainly a good puzzle to pose for model building, which implies statistical models will need to be a bit more nuanced than simple word-frequency correlations. But it certainly isn't impossible to assign latent attributes to words and study the statistical properties of those attribute relations, which is exactly what would be necessary to solve their puzzle. In fact, modern word vectorization techniques (i.e. language models using LSTMs to predict the missing word in a sequence from adjacent context) essentially do this.
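To make "latent attributes" concrete, here's a minimal sketch using off-the-shelf pretrained vectors. It assumes the gensim library and its downloader (any static embedding would do; the specific GloVe model is just a small one that's convenient to fetch):

    import gensim.downloader as api

    # Small pretrained GloVe vectors (~66 MB download). Each word becomes a
    # 50-dimensional vector whose coordinates act as learned latent attributes.
    vectors = api.load("glove-wiki-gigaword-50")

    # Nearest neighbours fall out of those latent attributes alone.
    print(vectors.most_similar("trophy", topn=5))
    print(vectors.most_similar("suitcase", topn=5))

    # Similarity between any two words is just geometry in that latent space.
    print(vectors.similarity("big", "small"))

Nobody hand-labelled any attributes here; they were induced purely from co-occurrence statistics, which is the point.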
1. Well said, but I would press it further and say that there is at least a pathway for us (human beings) to determine what the actual meaning of the utterance is by considering the context, the speaker, by asking for clarification, and so forth. The problem for the probabilistic/empirical language model is that it can't ever recognize "one right answer", even in theory. That's a problem, because there really is one discoverable meaning to every intentional utterance, even if it's not always successfully understood in practice.
2. How would you go about assigning latent attributes to words?
1. Yes, that's true. However, I'm not sure I see clearly the distinction you're trying to draw here. You could build a neural network that can ask clarifying questions too. You can certainly define an ML model that recognizes one right answer as well: just pick the answer that has the highest probability in your softmax output.
2. Well, there are a lot of ways to do this in principle, but word vectors in their original form are themselves latent attributes. All modern NLP is based on this concept already. Let's look at the 'suitcase' example in the article:
> The trophy did not fit in the suitcase because it was too
> 1a. small
> 1b. big
In NLP this problem is generally called "coreference resolution". That is, resolving which prior object a given pronoun is referring to. The ambiguity of this problem is indeed very subtle, and probably quite hard for an ML algorithm to learn. You can look at a parse example here:
This is a near state of the art coreference resolution model, and it indeed fails to properly resolve here. Although you do see that the probabilities do actually shift towards the right answer, which is (very weakly) suggestive that it might be beginning to learn it.
So, trying to unpack what's going on here linguistically, the essence of the problem is that the relation between "it" and "trophy" vs "suitcase" is contextually contingent. The necessary context is that trophies go inside of suitcases, and not vice versa. You then need to understand how the concept of insertion relates to sizes. Putting thing A inside of thing B requires that thing B be larger than thing A. This is certainly a subtle, context-rich problem.
In order to try to solve this problem, a model would need to have a more structural understanding of language. You indeed cannot learn this problem based on simple word frequency counting, for exactly the reasons they state. However, even though the frequencies of big/small may be equal, there are more nuanced conditional probabilities that are not. Conditional on the latent concept of insertion, the pattern relating object size to coreference should be very statistically apparent.
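As a tiny illustration of "conditional probability given context": this is just a masked-word probe, not the coreference model linked above, and it assumes the HuggingFace transformers fill-mask pipeline with bert-base-uncased.

    from transformers import pipeline

    # A BERT-style masked language model gives a conditional distribution over
    # the blanked-out word given everything around it.
    fill = pipeline("fill-mask", model="bert-base-uncased")

    preds = fill(
        "The trophy did not fit in the suitcase because it was too [MASK].",
        targets=["big", "small"],
    )
    for p in preds:
        print(p["token_str"], round(p["score"], 4))

Whatever the scores come out as, the point is that the model's distribution over big/small is conditioned on the whole sentence, not on the raw corpus frequency of the two words.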
1. I'm thinking of it in terms of two different kinds of knowledge that we're going after when we try to (A) understand a proposition vs (B) make a probabilistic guess at the right answer from among a discrete range of possibilities. However our understanding of language works, it's not the same as flagging a maximized probability. If someone says "the bank is on Fifth Street next to the fire hydrant", you immediately know all kinds of things when you take this in - that you're getting directions, that you will soon be looking for a fire hydrant, the address of the bank, that the speaker knows the area pretty well... this big "bundle of truth" emerges from the proposition immediately when we take it in. I just don't see it as analogous to flagging the highest softmax probability score, which at best provides a "most likely" translation into SQL code or a vector, or something like that.
2. If we're introducing concepts like insertion into the model, aren't we then returning to a logical (as opposed to a probabilistic/statistical) language model? Or said differently, aren't we merely introducing our own understanding of language into the model before it's even trained?
> 1. I'm thinking of it in terms of two different kinds of knowledge that we're going after when we try to (A) understand a proposition vs (B) make a probabilistic guess at the right answer from among a discrete range of possibilities. However our understanding of language works, it's not the same as flagging a maximized probability. If someone says "the bank is on Fifth Street next to the fire hydrant", you immediately know all kinds of things when you take this in - that you're getting directions, that you will soon be looking for a fire hydrant, the address of the bank, that the speaker knows the area pretty well... this big "bundle of truth" emerges from the proposition immediately when we take it in. I just don't see it as analogous to flagging the highest softmax probability score, which at best provides a "most likely" translation into SQL code or a vector, or something like that.
I certainly agree that it doesn't feel analogous, but I'm not sure that it's as clearly distinct as it intuitively seems. Certainly for most utterances, you and I can imagine several plausible alternative interpretations. We are just very good at discerning the right one quickly and accurately, which means that we experience that determination as certainty rather than probability. But that doesn't necessarily mean that we are perfectly collapsing it to a single meaning right away. In fact, for many sentences the meaning may change completely by adding/removing a word from the end, e.g.:
These are obviously not typical of all language, but I don't think it's unreasonable to say that all language in some sense behaves this way, and these types of sentences just make it unusually apparent to us.
> 2. If we're introducing concepts like insertion into the model, aren't we then returning to a logical (as opposed to a probabilistic/statistical) language model? Or said differently, aren't we merely introducing our own understanding of language into the model before it's even trained?
Sort of. This gets into some tricky territory about what concepts are, and what language actually is. But I think if you play around with word vectors a bit, you'll get a sense of what I mean here. Here's a good article summarizing some of their properties:
You can see that in some sense conceptual essence is being captured statistically here. You can represent concepts as embeddings in a statistical space, where the angles within that space capture relationships between them. It gets kind of difficult to reason about exactly what this "is", but, such is the difficulty in using language to analyze and dissect language.
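For example, with off-the-shelf vectors (a sketch assuming gensim's pretrained GloVe download; the exact neighbours depend on which embedding you load):

    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-50")

    # king - man + woman: the nearest remaining vector is usually "queen",
    # though the exact ranking varies with the embedding.
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

Nothing about royalty or gender was ever written down explicitly; the relation is recoverable purely from the geometry of the space.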
There can sometimes be a bit of a Homunculus fallacy that underlies the way people reason about language, concepts, and thinking, and I think the original article here is guilty of that. That is, people assert the presence of some immaterial essence to human reasoning or cognition in an implicit way, but don't really justify or define it. Like, what is the concept of 'insertion' really? Intuitively it feels like it has some essence that exists outside of its relations to other concepts. But the only thing we can truly directly talk about is its relationship to other concepts. In this sense, we can say that any given concept is defined by its relation to every other concept. Reifying this abstract notion in the language of mathematics, we can define this as angles in Hilbert space (though it's important to note that this is not an assertion of essence, merely a descriptive and useful formalism). When we do things like that, it tends to feel like these statistical/mathematical descriptions are not "really" capturing meaning, and while I don't want to positively assert that they necessarily are, I do want to challenge the idea that this is obviously true on its face. I think the evidence for this claim is actually pretty weak, when you really analyze it.
I found this article to be very helpful in thinking about a lot of these issues:
1. I'd agree it's not "right away", so I think my use of "immediately" was too strong a word. But it sounds like we both agree that the experience of what happens when we are in the process of understanding a proposition is different from the "experience" (if you can call it that) of flagging the highest probability.
2. I also agree with you that the perception of relations of things is primary in the way we know things, and it is pretty amazing how word vectors seem to capture some of these relationships in an analogous way (king - man + woman = queen). But I also think we see the properties of things before we see their relations to each other - otherwise, how could you know there is a relationship, not knowing that there are at least two things or properties to relate? But I'm going too far afield now!
Cyc, an AI engine and facts database that uses rules-based methods for AI, demonstrates such a high level of understanding of math that it can pretend to be slightly dumber than the user of the system.
I think it's fascinating. I also think it is out of reach for contemporary statistical methods.
Although I agree that statistical / ML methods will continue to conquer hurdle after hurdle in NLP, I wouldn't say that the article is total nonsense. The fact that NLP models trained purely on text data do not have access to the context available to a human interlocutor (years of visual, auditory, and other sense experience, along with all the higher-level mental hierarchies built from those experiences) is clearly a major handicap. I'm not positive that makes text-only NLP techniques inherently limited, but it has to be acknowledged. As far as I'm aware, we're still some years off from having NLP models capable of incorporating this sort of information.
If I had to guess, I'd say that the first NLP model truly able to pass the Turing Test will be trained on a multitude of data sources - textual, visual, and audio. It will have a moderately sophisticated internal representation of "the world", though I doubt that model will be very amenable to human understanding. I'm not even sure what I really mean when I say a model of the world, because I don't think this quasi-intelligent NLP actor will necessarily have an understanding of the physical world in the way humans or animals do, especially when it's been trained with the purpose of demonstrating language understanding.
I don't think it's total nonsense, that seems a bit hyperbolic.
It makes some good points in that the amount of data it would take to approximate it all seems far too large. And if you look at a human baby, I'm not sure they are fed nearly as much data, yet they can develop language understanding.
I feel we have a missing piece here. Maybe it'll be solved by combining methods based on statistically learning from data with other approaches, but learning exclusively from data might hit a limit: we might not have enough data or computing power to get there.
> It makes some good points in that the amount of data it would take to approximate it all seems far too large
The article's way of estimating this is absurd, seemingly relying on the idea that a model can't generalise and must see every possible variation of a sentence:
> If we add to the semantic differences all the minor syntactic differences to the above pattern (say changing ‘because’ to ‘although’ — which also changes the correct referent to “it”) then a rough calculation tells us a ML/Data-driven system would need to see something like 40,000,000 variations of the above
Ya, I don't know about their approximation, but I think we've already found ourselves hitting some limits with computing power and data on big NLP ML models. The appearance of custom chips is a pretty good example.
Maybe for text we won't run out of data, considering the internet has so much text available.
But I still think that "human learning" is something to consider. Maybe a baby is a statistical machine, and it has a statistics-based model that's so good it needs very little data, but it's also possible it uses something more logic based. In any case, I feel a baby would at least combine multiple kinds of learning: it would have started to learn about the dimensions and shapes and temperatures and properties of various physical objects, while simultaneously learning about the language used to refer to those things.
I think this last piece can allow a baby to connect the predictions of real things with the language used, which then helps provide semantics for the language that are taken into consideration by a human.
The history of NLP is littered with people claiming on theoretical grounds that XYZ is unattainable using purely statistical methods, and that some notion of the logical structure of language needs to be brought in. And yet one by one, the XYZ have been attained by statistical methods.
If you think there's something NLP can't do using machine learning, make a challenge dataset. That would be much more useful than yet another theoretical argument.
I'm sad that statistical methods have gained so much ground over more formal and logical methods (edit: maybe I should instead say "that formal and logical methods have lost so much ground compared to statistical methods"), and, while I can see ways to construct examples that the statistical methods ought to have trouble with, I also notice how incredibly well they've done and how many barriers they've blown past in a short period of time.
Recently I was impressed to see that the Winograd schema challenge -- a particular challenge dataset of the kind you mention -- saw huge performance gains from statistical NLP methods, which implies that the statistical methods encoded (something that at least sometimes works a lot like a substitute for) common sense about the world and the context of utterances.
The Winograd schemas were my favorite example of "XYZ" here. They involve pairs of sentences where the sentence is parsed differently based on assumptions from external-world knowledge. For example
Putting the barbell on the glass table isn't a good idea, because it's too (heavy/flimsy).
What is too heavy?
What is too flimsy?
Or, from memory rather than finding the original version
The local authorities refused a permit to the demonstrators because they (feared/advocated) violence.
Who feared violence?
Who advocated violence?
The compiler confirmed that the program was buggy because it (enforced/failed) typechecking.
What enforced typechecking?
What failed typechecking?
The doctor recommended surgery for the patient because she was (suffering from/experienced with) coronary artery blockage.
Who was suffering from coronary artery blockage?
Who was experienced with coronary artery blockage?
In these cases, the most plausible referent of a pronoun changes depending on the semantics of the other word that you fill in, for a reason that has to do with the outside world, not syntax. (I don't think each case absolutely has to include a "because" clause, but it's the simplest way to supply disambiguating context.)
I totally thought statistical techniques would be horrible at this. There's good reason to think that these require some detailed knowledge about the world. The statistical techniques were horrible at this, for a while, and now they're abruptly great at it (if I remember correctly, at or above typical human performance).
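One crude but real trick here is to score each disambiguated rewrite with a language model and keep the likelier one. A sketch of that idea on the barbell example above (assuming GPT-2 via the HuggingFace transformers library, not whatever the benchmark leaders actually use):

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    def avg_nll(sentence):
        # Average per-token negative log-likelihood under the language model.
        ids = tok(sentence, return_tensors="pt").input_ids
        with torch.no_grad():
            return model(ids, labels=ids).loss.item()

    candidates = [
        "Putting the barbell on the glass table isn't a good idea, "
        "because the barbell is too heavy.",
        "Putting the barbell on the glass table isn't a good idea, "
        "because the glass table is too heavy.",
    ]
    # The reading the model finds more plausible gets the lower score.
    for s in candidates:
        print(round(avg_nll(s), 3), s)

Whether a small model like GPT-2 actually gets this pair right is an empirical question; the larger models evaluated on the Winograd sets use essentially this kind of scoring at much bigger scale.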
The main problem with using statistical models or approaches such as deep learning is not that we are unable to do it. Though it may sound trivial, the real problem is that we are unable to understand how or why they work, which can lead to unintended consequences or an inability to support and continue further development (aside from not being able to leverage the new fundamental understanding and apply it to related fields). Imagine if we left our understanding and working model of physics completely up to statistical and deep learning models. We would not be able to reliably design anything or model simulations, etc.
The reason we have not figured out NLP is not because we are incapable, it’s because enough of the right minds have not been looking at it as a puzzle worth solving, possibly because it and other AI related concepts are often introduced at or just before the PhD level and so most minds in science never encounter it (even though they could never encounter any other written concept without it)
I think this is the crux of the issue. People like Chomsky argue the same way: sure, a statistical model can mimic a phenomenon to an increasingly accurate degree, but the question he concerns himself with is "how and why it [language] works that way", not whether it can somehow be approximated. His example of filming bees and statistically re-engineering their dance is telling: sure, you get impressive results, but you will never understand why they dance!
Once this is clear, the discussion becomes healthier: linguists continue to try and understand fundamentals; ML NLP continues to try to get impressive, possibly lucrative results. Logical research can borrow from ML results, or re-use them, and so forth, but I agree with Chomsky that it can't be a shot in the dark always, because once you deploy it and pass it some tough decisions to make, well, what then?
I was pleased to see at least one idea, the recursive structures found in his theories, which is increasingly being accepted by people like Hinton, so that's great.
You can't understand why without considering the environment; the problem is that it's too expensive to train AI agents in reality, and simulated environments are too simplistic.
But if such a simulated environment is available, agents can learn general skills.
> Deep mind: Generally capable agents emerge from open-ended play
>The reason we have not figured out NLP is not because we are incapable, it’s because enough of the right minds have not been looking at it as a puzzle worth solving, possibly because it and other AI related concepts are often introduced at or just before the PhD level and so most minds in science never encounter it (even though they could never encounter any other written concept without it)
Lots of brilliant people have looked, but the issue at the moment is that the last set of questions and challenges was surprisingly blown away by statistical approaches, and we are waiting to find the ceiling on those approaches before being able to formulate the right questions for the next round of deep thinking, I suppose. The Winograd schemas were kinda meant to prompt that....
> In the original narrative of deep learning, each neuron builds progressively more abstract, meaningful features by composing features in the preceding layer. In recent years, there’s been some skepticism of this view, but what happens if you take it really seriously?
> InceptionV1 is a classic vision model with around 10,000 unique neurons — a large number, but still on a scale that a group effort could attack. What if you simply go through the model, neuron by neuron, trying to understand each one and the connections between them? The circuits collaboration aims to find out.
/because enough of the right minds have not been looking at it as a puzzle worth solving, possibly because it and other AI related concepts are often introduced at or just before the PhD level and so most minds in science never encounter it/
Well, there were a good fifty-plus years of linguists without access to good statistical methods who also failed to solve these problems... And they didn't come up with the kinds of functional answers that we now have from statistical methods, which are able to do crazy things like translate reasonably well between nearly arbitrary language pairs.
Your statement sounds like a theoretician's sour grapes...
We do understand how and why it works to a certain degree (gradient based input to output approximation). But if you want the meaning of the third neuron in the 10th layer, then yes... But in the same sense we don't understand thermodynamics because we don't account for each particle, just a statistical aggregate of them.
And on the other hand, we don't understand much about how people work or how we are motivated, even though we have first and third person perspectives. Psychology seems to be no further advanced than AI. And yet we work with what we have.
I'm not a physicist or mechanical engineer and this is a genuine question - why would we not be able to reliably design anything if we only had statistical models? Empirical formulas are used in various places in science/engineering and seem to work well (with exceptions). As an example - would we not arrive at something like f = ma using statistics and then use it without really knowing why (or maybe all the implications of) f = ma?
I think the main issue is interpolation vs extrapolation. Statistical models work well within the domain where data is available but have trouble generalizing to where data is unavailable. The very cool thing with physics is that theory was able to make a lot of true but wild predictions that only much later were backed up with experiment. Such powerful theories are very useful for finding interesting phenomena to study/exploit
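A toy illustration of that gap, nothing to do with NLP (just numpy; the "law" and the fitted model are both made up for the example):

    import numpy as np

    # Fit a cubic to samples of a known law on a narrow range...
    x = np.linspace(0.0, 1.0, 50)
    y = np.sin(2 * np.pi * x)
    coeffs = np.polyfit(x, y, deg=3)

    # ...it interpolates well inside [0, 1] but extrapolates badly outside it.
    print(np.polyval(coeffs, 0.5), np.sin(2 * np.pi * 0.5))  # close to the truth
    print(np.polyval(coeffs, 3.0), np.sin(2 * np.pi * 3.0))  # wildly off

A theory that knew the underlying function wouldn't have that problem, which is roughly the advantage physics-style theories have over curve fitting.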
I think all you need to do is wait. Eventually the statistical methods will stop progressing so quickly, either because they start hitting tough obstacles, or because they manage to solve all the problems they set out to solve and go "now what?". That's when interest in alternative methods will pick up.
Ironically, exactly these sorts of "statistics can't..." arguments are sort of theoretically bankrupt.
Either the thing you want to do is impossible or else a learned model can do it at least almost as well as... idk what the alternative even is, something not learned?
Taking this to the extreme, on an example where I have first-hand experience: I would never recommend replacing your compiler passes with transformers. The latter will be buggier, at best marginally faster, and it will take several orders of magnitude more effort to get them to work well enough for production. ML isn't the right tool. But, I mean, you can do it. You shouldn't. But you can.
To be fair to the article, the title is "won't", not "can't". And that's easier to believe, at least for me. It's not that X is unattainable using ML; it's that some other approach will get there first.
The reason you wouldn't want to use ML for program transformations is that compiler analyses like constant folding are essentially fast, specialized theorem provers. When you need to know if an optimization is safe or not (is this expression equivalent to that one?), you want an exact solution, not just a pretty good guess, and there are known techniques that can give you just that.
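To make that "exact, specialized theorem prover" flavour concrete, here's a toy constant folder over a made-up expression format (a sketch, not how any real compiler represents its IR):

    # Expressions: ("const", n), ("var", name), or ("add"/"mul", left, right).
    def fold(expr):
        if expr[0] in ("add", "mul"):
            op, l, r = expr
            l, r = fold(l), fold(r)
            if l[0] == "const" and r[0] == "const":
                # The rewrite is only performed when it is provably equivalent.
                val = l[1] + r[1] if op == "add" else l[1] * r[1]
                return ("const", val)
            return (op, l, r)
        return expr

    # (2 * 3) + x  ->  6 + x, exactly; no guessing involved.
    print(fold(("add", ("mul", ("const", 2), ("const", 3)), ("var", "x"))))

The pass either proves the simplification or leaves the expression alone; there's no notion of being 99% right.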
However, you can use ML in areas where compilers use heuristics. For example, how many times should you unroll this loop, or should you inline function B into function A? Your program is going to be valid with or without inlining/unrolling, and the compiler needs an "intuition" of what's going to be best for performance. Right now, in most cases, this uses simple hardcoded rules.
The same goes for math. You wouldn't ask a neural network to certify whether a mathematical proof is valid or not, knowing it has been 99% accurate on the test set. You couldn't trust it to verify all the steps. However, you could use a neural network to help you guide a formal theorem prover in searching for a proof, as in which branches of the infinite tree of mathematical expressions should you search.
Are there not situations where a learned model could be insufficient due to lack of real-time learning?
It is my understanding that a lot of current applications take a ton of training time / computing power and it's not so trivial to do in the moment. I guess this is less of a theoretical problem and more of a practical problem though.
I think that's a great example, the question is: can ML be as good as we are at compiling?
And then, if ML is as good as we are, and clearly we're not good enough that we had to write a systematic proof system to compile things correctly, could ML have invented this same technique and implemented that compiler correctly?
In the end, we're just back at the same place, can ML be as intelligent as we are. Is it the holy grail of AI? What are its limits?
I think a lot of people get the impression current NLP models like GPT-3 lack something - "understanding" or something. But they can't say exactly what it is. If they could define exactly what it was, they could indeed describe a way to train for it or program it.
So a theoretical argument could be useful if it gave us an idea of what our mysterious missing something is. However, I'm not sure this article makes a contribution here. Proofs of impossibility generally aren't useful, as you say.
Also, when people have tried to define exactly what it is, the NLP models have then usually turned out to be able to have or acquire it!
That can be true without being a demonstration that we're seeing unbounded progress in language use and understanding. A theory of the limits of NLP is still a theory of NLP and by no means better than other theories.
> "understanding" or something. But they can't say exactly what it is.
Umm, you just said it yourself, understanding. The model has no clue whatsoever what any of the output means, just that it scores good.
I don't understand Japanese. But given enough time and feedback, I could try all the possible combinations of sounds I can come up with to answer a question in Japanese, until I get satisfying feedback (e.g. a facial expression) from the Japanese speaker who asked the question. Does this count as speaking Japanese, in your opinion?
That's a Chinese Room style argument, but I'm guessing that if you have time to get feedback etc. on all possible combinations, you might just learn Japanese (because it is informationally more compact to get high scores through semantic understanding rather than rote memorization).
> I think a lot of people get the impression current NLP models like GPT-3 lack something - "understanding" or something. But they can't say exactly what it is.
It's always seemed obvious to me (as an outsider) that it's missing reason and explainability. GPT-3 is a neat tool, but it seems like anthropomorphism to suggest that it's more than the best mimic humanity has been able to create so far.
Well, have you seen any industry NLP product that exercises any meaningful level of language understanding?
I haven't. The Turing test has never even been close to being passed at the annual Loebner Prize. The best such industry NLP products I know of don't even use ML at their cores.
Yeah, Norvig has a rather famous rebuttal of Chomsky up on his blog from years ago, and everything that's happened since then has supported Norvig's position.
This rebuttal has always bugged me because of the egregious error Norvig makes in his analysis of pro drop. The dropping of subjects in English matrix clauses (e.g. "Not gonna do it") is known to linguists as "diary drop", and is a totally different phenomenon from the dropping of subjects in so-called "null subject" or "pro drop" languages like Spanish. The most obvious difference is that diary drop is completely impossible in embedded clauses:
I think I'm not gonna do it. [good in English]
I think not gonna do it. [bad in English]
(Yo) creo que no voy a hacerlo. [good in Spanish]
To linguists it's immensely frustrating that someone who's considered by the compsci crowd to be a leader in the field of NLP can't even get some basic facts of linguistic analysis correct. "Linguists can argue over the interpretation of these facts for hours on end", Norvig says dismissively. But any competent syntactician could have explained to him in five minutes why his examples are irrelevant to the point he's trying to make.
The rest of Norvig's post just consists in misunderstanding what Chomsky is saying, as far as I can see. Chomsky is saying that statistical analysis alone is not sufficient to achieve scientific progress. I don't think Norvig actually disagrees with Chomsky on this point, but he seems to think that Chomsky is saying that scientists should never use statistical methods.
The real kicker here is that Norvig is an engineer, not a scientist. He's contributed very little to the scientific study of language, and yet is lecturing someone who's contributed vastly more on how it ought to be done. Of course, that doesn't necessarily mean that he's wrong, but it does grate. Not to mention that the Chomsky/O'Reilly comparison is little more than trolling – it hardly seems calculated to stimulate an intelligent response.
If it is, then I wouldn't necessarily call it a rebuttal, because Norvig broadly agrees with Chomsky on the point under discussion: that statistical models are not useful scientific theories of language in themselves. Where they disagree is on what consequences to draw from that. Chomsky discards studying language by statistical means altogether, whereas Norvig offers a more nuanced position: he argues that there might still be merit in statistical approaches to language, not so much as results in themselves, but rather as a valid tool that can ultimately help derive a deeper understanding.
I think this is a valuable distinction. Unfortunately, though, it is one that the research landscape seems to have somehow forgotten about over the last decade or so.
More tangibly speaking using an example, while the capabilities of a language model like GPT-3 are amazing, Norvig's point is that science should not just stop there - it should ask the question "what does GPT-3 teach us about human language?"
>Chomsky discards studying language by statistical means altogether
Chomsky nowhere does this. That is at best a straw man of Chomsky's position constructed by Norvig. If you just look at the transcript of the Chomsky interview in the Norvig piece, you can see Chomsky describing a case where statistical analysis is useful.
I can write code to bubble up stats across text sources in any language.
I’m reminded of a behind the scenes video of Batman 1989, where they discuss a complicated layering of Joker makeup to achieve the effect in the scene where he wipes off fleshy colored face paint; white makeup on Nicholson, special coating to be able to apply the next layer, on and on.
No one thought to just have him wipe white face paint onto his forehead.
I can easily see a bunch of programmers making a mess out of an elegant problem given the spaghetti code I’ve worked in.
Maybe English means nothing about consciousness? Many linguists take the position it’s random sounds we’ve been polishing definitions of for years.
Will the AI be able to ask for clarification? Translators do. In fact, even people speaking the same language need to ask for clarification sometimes. It is literally impossible to discern some sentences without follow up, why would someone think an AI can do it one shot?
Well, the approach of solving a corpus (advocated by GP) doesn't leave room for the possibility of asking a follow-up question so you can't train for it that way.
GPT-3 was trained on a massive text collection so it can even ask for clarifying questions if the prompt context allows for it. Why? Because there are plenty of such examples in the training data.
Sure, but this is based on something kind of like “whether someone would be likely to ask for clarification here”, not based on whether GPT3 needs clarification.
Though, I’m confident that it should be possible to make something which, when unsure, asks for clarification.
In fact, I believe something at least rather similar is already done in, uh, I think it is called “delegative reinforcement learning”, where an agent takes actions, but when it is highly uncertain, it can instead choose to have an expert (who is assumed to be competent at the task, or at least has a tolerably low risk of very bad outcomes) to step in instead.
(Or, wait, maybe delegative reinforcement learning is still just used theoretically? Not sure.)
Translators may do so when they're acting as interpreters. I don't know for certain, but I doubt that most translators who are translating written documents do so.
> In other words, we must get, from a multitude of possible interpretations of the above question, the one and only one meaning that, according to our commonsense knowledge of the world, is the one thought behind the question some speaker intended to ask.
But, can humans do this? I think not; I still disagree with the author about what "Do we have a retired BBC reporter that was based in an East European country during the Cold War?", translated into code, means.
They write "Doing the correct quantifier scoping: we are looking not for ‘a’ (single) reporter who worked in ‘some’ East European country, but to any reporter that worked in any East European country"
My interpretation of this requirement is that they want a list of all the reporters who meet the criteria. However, I would probably write this query to return a boolean, not a list of reporters.
And even if my interpretation is wrong... well, my point is still correct, because I failed to extract the "the one and only one meaning" that the author intended from that sentence.
Even humans are only probably approximately correct.
> Even humans are only probably approximately correct.
This is very true, more true than we realize. Notice how much more "could you repeat that?" we have with masks on. It's not JUST the mild muffling of the speaker's voice, it's not seeing their lips move. We're all lip readers to a small degree, and it helps inform our decoding to see the lips. Fff and th sound similar but look very different.
Even without that, think of your own life: ever said "What was that? Oh, right..." and then replied? At first you missed part of what they said (for various reasons), but you were able to "interpolate" the missing part from the context, and most of the time you get it right.
Our communication modes are lossy, and our brains make up for that to a large degree. That's the hole in natural language decoding, figuring out the hinting needed for an engine, because we're not totally aware of how we do it ourselves.
That's a different problem, isn't it? That's more about transcription -- getting the speech into words -- than about what he's talking about, making sense of the words once you have them.
A noisy word might easily be guessed in context. But likewise a semantically ambiguous word might also be guessed due to other factors like the tone of the speaker, facial expressions or more.
I suspect the parent's point is that the disambiguation in both cases might be addressed with information encoded in the other. One can pattern match based on the context. I think some of the work in multimodal transformer models demonstrates this.
Perhaps thinking language is a key aspect of consciousness is wrong.
Chomsky has said human languages themselves are just random sounds we’ve been polishing the meaning of.
So yeah I have no doubt a machine can sort them correctly if we tell it to. They mean what we want and see in them.
That’s hardly proving anything about consciousness. Just that a computer can sort and count syntax, and guess at meaning by frequency of relative placement.
It’s acting on English data sets. Not human biology.
"Chomsky has said human languages themselves are just random sounds we’ve been polishing the meaning of." I've read a lot of Chomsky, and I don't recall him ever saying anything like this. Can you provide a citation?
And I don't think "sorting" has anything to do with language.
Why is it so hard to accept mirror neurons fired when early humans heard birds and animals, each other’s grunts and over time we refined it?
Everyone has some capacity to refine and strengthen muscle. Why do we need some abstract meta-construct to explain where language comes from? It comes from us. Fleshy meat bags that mutate state over a short period then die.
Chomsky diagrams, conceptual organs and the like are useful for "being on the same page" in a particular context, but there's no reason to believe language is a requirement for consciousness except our own propensity for romanticizing our existence.
Because other animals don't produce sounds which constitute abstract meaning that can be combined into sentences. Possibly a few birds and dolphins or whales might get close. But there's clearly something more going on in the brain than making sounds which contribute to word formation.
Neuroscience and brain imaging still don't tell us how language works--they're at much too coarse a level of granularity. Even if we could see every single neuron as it fired (which we can't), the volume of data would be overwhelming, far more than if you followed every transistor in a modern computer.
You don't think neuroscience involves theorizing? You think the neuroscientists can just image the neuronal activity and read it out in some complete scientific explanation?
Yes. It’s emergent behavior of a physical universe that has no meaning. It just is.
“Meaning to us” is subjective. That’s how we have conflicting theories in many fields. Science isn’t about meaning. It’s about measuring how matter coalesces at various speeds relative to light.
I make noise because my biology “just has” properties to allow it given the other physical conditions.
Perfectly elegant theory based on the physical structure of reality alone. No ephemeral language organs.
Theory can quickly go from scientific observation to reinforced nonsense used to sell books and bond as species.
That’s fine, it’s how society works. It doesn’t mean anything to reality.
Yep. Babies can add and subtract when they’re days old.
Can they write Shakespeare?
If human language is fundamental, how is it missing from the start? Are we learning language or muscle mechanics? Why can a word or phrase mean one thing in this country and nonsense in another? If language processing has a universal basis, why all the confusing variety and ignorant ideas? 1+1=2 everywhere because we can observe the physical process everywhere, because light, eyes, etc
Who cares?
Like I said back in the day we didn’t know that. We had to theorize these abstract schemes to establish something.
But like religion, doesn’t mean linguistics is building on something that means what we want it to.
Given how hard it is to learn language while arithmetic seems innate, how do we know emphasizing its value is leading us where we want?
It’s easier for me to see that long dead scientifically illiterate humans we inherited those ideas from were just stupid.
"If language processing has a universal basis, why all the confusing variety..." Because the universal processing mechanism provides lots of options for individual languages--the option for different words, obviously, but the option to use or not to use morphology (and lots of varieties of morphology), and different syntax rules, etc. etc.
To use an analogy, a computer chip provides a single instruction set at the machine language level--yet you can run a huge variety of computer languages on it: FORTRAN, LISP, Python, Prolog and so forth.
Religion shows we can create very detailed belief in ornate ideas that are scientifically nonsensical. It’s possible the concepts linguists seek to develop are similarly self reinforcing feedback loops that are meaningless to science.
There’s been an explosion in biological science since Chomsky became prominent. He even admits his work is abstract and to serve his ends. We may be putting more into his ideas due to his fame and infamy than we should.
Trial and error, social constraints on sounds and the mechanics of biology can explain it elegantly. What do the ornate theories and ephemeral organs provide except to satisfy “linguists” biological agency to create and imagine?
Linguistics came about in a much less scientifically aware era in human history. My money is on it going the way of religion: an abstraction that's so orthogonal to advancing science that we leave it behind.
I don't think you interpreted the sentence in question differently, just the nature of how it is being translated into a query. They're talking specifically of how to translate into a query over a quantified relational algebra. Relational algebra always operates on and returns sets. I'm pretty sure the author here is still looking for a boolean, but you can't ask a relational query engine whether a result exists; you can only get back sets. The further question of whether or not the set is nonempty is an extra feature probably built into most real-world implementations of query engines, but it's not part of relational algebra.
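To make that concrete with a toy schema (everything here - the table name, the columns, the date range standing in for "during the Cold War" - is invented for illustration, and the East-European membership check is elided):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE reporters (name TEXT, employer TEXT, retired INTEGER,"
               " posting TEXT, posting_start INTEGER, posting_end INTEGER)")
    db.execute("INSERT INTO reporters VALUES"
               " ('J. Smith', 'BBC', 1, 'Poland', 1961, 1975)")

    # Set-valued reading: *which* reporters satisfy the description.
    rows = db.execute(
        "SELECT name FROM reporters WHERE employer='BBC' AND retired=1"
        " AND posting_start < 1991 AND posting_end > 1947").fetchall()

    # Boolean reading: does *any* such reporter exist (the extra step on top of the set).
    exists = db.execute(
        "SELECT EXISTS (SELECT 1 FROM reporters WHERE employer='BBC' AND retired=1"
        " AND posting_start < 1991 AND posting_end > 1947)").fetchone()[0]

    print(rows, bool(exists))

Both queries share the same underlying set expression; the boolean is just a wrapper around it, which is the distinction the article glosses over.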
The actual problem they're identifying here is that of deciding how or whether to turn 'a' into a quantifier. Do we have 'any' would be less ambiguous, definitely an existential quantifier. Do we have 'a' depends on context, as it could mean one specific person and not be a quantifier at all, but I actually still disagree with the author here. I think they're right that English-speaking humans will unambiguously understand 'a' in this context to be a quantifier, but given that is possible at all, I don't see why a large language model can't. If it was actually being used to specify one person, the additional context would be in the sentence itself, as in 'do we have a reporter named X' or something like that. Given there is no additional qualification in the sentence, I'm pretty sure existing large language models can easily figure out what it means, and the only case in which they'd be wrong is if there were additional context outside of the sentence, but a human would be wrong too if all they heard was the sentence.
Well, that's true, but I don't think we can pretend that computers are anywhere near as good as humans at extracting the intended meaning from human utterances.
> Even humans are only probably approximately correct.
Fair point. But how about this:
It's true that what John would do with the sentence is technically an "approximation" of what Alice would do because they have slightly different understandings of correct behavior. However, for humans to do what they do, they still do build an absolutely correct model of meaning in their mind wrt their (subjective) notion of correctness.
This may sound like an obtuse play with words but the point is that to even attempt to do the right kind of reasoning in NLU, you need a different framework than PAC. You can't for example approximate whether "during the Cold War" qualifies "was based in" or qualifies "an Eastern European country". You just need to decide. And once you decide, you have an absolute correct interpretation, not an approximate one.
This seems to be misunderstanding what PAC learnable means. It's saying that for parameters epsilon and delta, you can get an algorithm for learning a hypothesis that, with probability 1 - delta, has an average error rate less than epsilon, given some number of training samples polynomial in 1/delta and 1/epsilon.
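In symbols, roughly: for any epsilon, delta > 0, the learner outputs a hypothesis h such that

    Pr[ err(h) <= epsilon ] >= 1 - delta

using a number of samples m that is polynomial in 1/epsilon and 1/delta.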
This has nothing to do with the mental model. The language model doesn't have to spit out probabilities. It can use a simple thresholding function and spit out certainties. They just might be wrong, but a human's "certain" mental models might be wrong, too.
There's a very important difference, though: humans can and often do qualify their interpretation.
One major flaw of current PAC-based models is the inability to express certainty and being able to present the rationale behind their (lack of) confidence.
One crucial aspect of human language processing is the ability to question or interrogate oneself to adjust one's interpretation. During a dialog this can be done by actively asking for more context or confirmation.
So unlike current ML models, humans know that their interpretation/models might be wrong and can actively seek confirmation or correction.
A trivial example would be asking a digital assistant "What's the capital of Georgia?"
There are two possible answers and there's no way of knowing what the "correct" one is without further context. So a human would likely either ask whether you mean the country of Georgia or the US state, or qualify their answer ("the capital of the country/the state capital is Tbilisi/Atlanta").
This is what GPT-J-6B returns:
Q: What's the capital of Georgia?
A: Atlanta
When given just the prompt "What's the capital of Georgia", GPT-J-6B also returns Atlanta and generates further text about the US state and its capital.
And it's not as if the model doesn't "know":
Q: What is Tbilisi?
A: Tbilisi is the capital of Georgia.
It's even more interesting when logic is involved:
Q: If Albert is a pig and all pigs are blue, what colour is Albert?
A: Albert is a blue pig.
I only asked for the colour, but close enough. Can GPT deal with ambiguity?
Q: If Albert is a pig and all pigs are blue, is Albert happy?
A: No.
Awesome! How about some common sense then:
Q: If Alice is a dog and Bob just finished high school, can they have kids together?
A: Yes, but they will have to get married first.
So yeah, GPT obviously has no actual language understanding. Bob having finished high school implies that he's a young human man while Alice is a dog. Humans and dogs can't have kids together.
But if the model doesn't understand language, why did it do so well on the logic test? In order to find out, I just asked :)
Q: If Albert is a pig and all pigs are blue, why is Albert not happy?
A: Because he is a pig.
Hm. So are all pigs unhappy then? Let's take them out of the equation and ask again:
Q: If Albert is blue, is Albert unhappy?
A: No, Albert is not unhappy.
Bingo! So the model only appeared to know the subtlety of "blue" having multiple meanings. It's as dumb as a rock and simply tried to match the tokens "is blue" against "is unhappy". Replacing "blue" with "pig" confirms this:
Q: If Albert is a pig, is Albert unhappy?
A: No, Albert is not unhappy. Albert is a pig.
Dad humour. Well, either that or just a dumb model capable of fooling us humans because we tend to interpret more intelligence into its replies than is actually there...
EDIT: in case you want to test this yourself (you might get different results, it's probabilistic after all):
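If you'd rather poke at it locally than through a hosted demo, something like this reproduces the setup. It's a sketch assuming the EleutherAI/gpt-j-6B checkpoint on the HuggingFace hub; loading it takes a lot of memory (you can swap in a smaller model name to just try the code), and with sampling on your outputs will differ from mine:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
    model = AutoModelForCausalLM.from_pretrained(
        "EleutherAI/gpt-j-6B", torch_dtype=torch.float16
    )

    prompt = "Q: If Albert is a pig and all pigs are blue, is Albert happy?\nA:"
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, do_sample=True, temperature=0.7, max_new_tokens=20)
    print(tok.decode(out[0], skip_special_tokens=True))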
I find demos like yours hard to reason about because it is often the case that you can mess around with the prompt to get it to demonstrate understanding of things that it didn't seem to understand before. I wanted to see if it understood that humans and dogs can't have kids, and it does. Prompt is in [], generated text is in {}.
–––––––––––––––––––––––––––––––––
[This is a question and answer series intended to demonstrate the logical reasoning capabilities of an AI system. The AI can understand ambiguity and will ask for more detail in such cases.
Q: What's the capital of Georgia?
A: That's ambiguous. Do you mean the country or the US state?
Q: If Albert is a pig and all pigs are blue, what colour is Albert?
A: Albert is a blue pig.
Q: If Albert is a pig and all pigs are blue, is Albert happy?
A: There is no logical connection between Albert's happiness and his color.
Q: If Alice is a dog and Bob just finished high school, can they have kids together?
A:]{ No.}
[Q: Why can't they have kids together?
A:]{ Because Alice is a dog and Bob just finished high school.}
[Q: Why is that a problem?
A:]{ Because Alice is a dog.
Q: Why is that a problem?
A: Because dogs can't have kids with humans.
Q: Why can't dogs have kids with humans?
A: Because dogs are animals and humans are humans.
Q: Why are humans different from dogs?}
–––––––––––––––––––––––––––––––––
So does it understand after all? Or is trying different prompts a form of p-hacking here?
> So does it understand after all? Or is trying different prompts a form of p-hacking here?
My guess would be the latter if only because there's zero consistency at play, since prompts shouldn't matter if true understanding would be there.
A consistent model wouldn't rely on the precise order or wording of the prompts - here's my result using your notation:
[Q: Can humans and dogs have kids together?
A:]{ This is an extremely touchy subject for some dog owners. The short answer is yes. The fact is that studies have shown that if a male and female dog are able to have a viable pregnancy, the offspring have a higher rate of survival. It is also important to note that the offspring are able to have puppies. This is important because it allows dogs to have kids of their own and increase the gene pool. }
I disagree. Of course it doesn't understand anything; it is a statistical language model. I also think that human jokes are mostly statistical in nature. That's why "Yes, but they will have to get married first" is funny. GPT was trained on texts from the Internet, which usually contain jokes rather than logic predicates. Given any prompt that has the structure of a joke, it will (in the majority of attempts) use a joke pattern, even if there is none. "If X is y, is X z?" is a typical joke pattern. If a logical approach is applied in the future, it will still be derived from statistics/reinforcement learning or something like that (the key word here is training/learning). The world is just too vast. If you want to have a resemblance of a logic model, you need a special dataset, but it will still just exploit patterns. It might be good enough for practical use. It is arguable whether people mostly operate beyond patterns. Sorry if my text comes across as patronising or smth like that. I just assume.
Caveat lector: I'm analysing this as a linguist. Not as a programmer. (Unless you count a bunch of bash scripts as "programming".)
>Let us start first with describing what we call the “missing text phenomenon” (MTP), that we believe is at the heart of all challenges in natural language understanding.
The missing text is not the heart of the challenge; it's just a surface issue. The actual problem is deeper - machines treat words as if they had intrinsic meaning, when the meaning is actually negotiated between speaker and hearer at the moment of the utterance.
This might sound complicated so, as an example, let's say I told you: "I got a cat. That wug sleeps on my bed."
You don't know what's a "wug" (it's a nonsensical word BTW), or why I'm calling my cat a "wug". So you're missing info, just like the machine, right? And yet you're able to parse the second sentence just fine, based on the first. And if we keep chatting about my cats, any mention of the word "wug" will represent the feline from the picture. We effectively negotiated the word "wug" to represent that cat.
A machine does not do that; instead it'll give words intrinsic, non-negotiable meanings. It won't have "wug" in its dataset, so it won't understand that sentence.
I don't believe that machine learning will solve this issue _on its own_, but it could once we're able to simulate that negotiation of meanings.
>Was Aristotle really Aristotle?
If I saw this sentence "in the wild", I'd immediately interpret it as someone questioning the historicity of Aristotle, and claiming his works were made by someone else. It shows that the question is not as redundant as it might look like.
> The trophy did not fit in the suitcase because it was too // 1a. small // 1b. big
This specific example doesn't show it, but that ambiguity is an issue even for human beings, as shown by
"I dropped the trophy on the table. Now it's broken." (What's broken - the trophy, or the table?)
In fact, it's such a big issue that languages often have a bunch of resources to minimise those situations, as gender agreement and distal/proximal pronouns.
Can GPT-3 figure out that a common word is being used outside its usual context, so its usual meanings don't apply? Let's say my example was instead
"I got a cat. That little orc sleeps on my bed."
I wonder how GPT-3 would deal with it. Maybe it would think "cat" and "orc" refer to different objects/beings? Or associate "cat" with features that clearly wouldn't fit a cat like "green", "humanoid", or "fantasy monster"?
Clumsy already doesn't fit with cat, assuming GPT-3 is using wug for cat. Cats are more athletic than we are. They are very finely tuned hunters. But they need a lot of stimulation and so house cats will play a lot, and cause silly disturbances for humans.
I guess then, for machine learning to succeed in natural language interpretation, we ought to have another layer of abstraction that learns meanings from contexts and the general situation.
Extending on your idea, I think not only the verbal context, but also physical location, time of day, historical context, current events, social group, etc. all add their own flavor to the variety of meanings a single word can have.
For example, if I am walking in a park and there is a strong gust of wind, I might say "whoa, that's a strong one". The listener, if physically present along with the speaker, will immediately get it.
Adding language context is one challenge. Adding the context of the environment is a completely different challenge.
> A machine does not do that; instead it'll give words intrinsic, non-negotiable meanings. It won't have "wug" in its dataset, so it won't understand that sentence.
> I don't believe that machine learning will solve this issue _on its own_, but it could once we're able to simulate that negotiation of meanings.
Fascinating example. Personally, I don't reckon there's any inherent reason that forces all machines to give words intrinsic non-negotiable meanings. The machine is merely fitting or predicting based on some model that someone has cooked up, subject to some particular modelling assumptions.
If someone can cook up a new model then the machine will be able to negotiate wug-meanings too. (perhaps some kind of hierarchical or parallel model that incorporates a baseline model assuming intrinsic meanings learned from training data, but overlays some additional wug-negotiation subsystem that can kick in when it encounters a new word or starts to infer that a word is being used in a rather nonstandard way).
In the world of parsing text for programming languages (which have well-defined grammars, so hopefully an incredibly well-defined and easy problem, unless the language is insane/C++), a helpful parser needs to be carefully built to handle all kinds of weird input and attempt to make some kind of sense out of it, so that the computer can output helpful error messages when a programmer types something incoherent ("there's no 'ffo' defined. Did you mean 'foo', or this other thing 'bar' that is sitting right here, defined in the current context?").
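(To make the "did you mean" part concrete: a minimal Python sketch of that kind of fuzzy suggestion using the standard-library difflib; the function name and the list of in-scope identifiers are made up for illustration.)

    import difflib

    def suggest(unknown, known_names, max_hits=3):
        """Return the in-scope identifiers closest to an unknown one."""
        return difflib.get_close_matches(unknown, known_names, n=max_hits, cutoff=0.5)

    scope = ["foo", "bar", "format_output", "off"]
    print(suggest("ffo", scope))  # likely includes 'foo' (and maybe 'off')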
Who cares if the machine understands wug? Maybe the issue is more to do with trying to teach language learning as a end on its own. Nobody studies language without purpose, there’s always a next step.
If you’re friend is telling you about their cat and what a wug it is, the goal is to conversate, make the other feel heard, entertain them, kill time until something next happens, impress them, and so on.
Most of these don’t require even picking up on “wug”, they require understanding the other’s state of mind: “my new cat is so cute - I love them!”. Getting caught on solving wug seems like an uneccessary triviality, more important is the tone they say it with, they body language, your past interactions with them. You could form a proper reaction if you only heard every other word and had those other details.
Humans don’t rely on deciphering “wug” - they don’t need to. The whole interaction is just negotiating mental states, intentions, and actions. Deciphering “wug” is unnecessary.
As a human, I certainly would ask what a wug is, or try to figure it out in context, or look up the definition. I think that's typical for most conversations.
The way I see this working is by adding a second layer of machine learning to the process. The first layer would work as usual (associate words with utterance-independent meanings), but the second layer would compare those meanings with the utterance in question, shave most (or all) of them out, and either add a new meaning or reinforce an old one.
But again, I'm no programmer. I might be saying something that is already done, or unfeasible for some reason.
(this is also not my field) I was thinking about this a little more:
In some cases (such as your excellent wug example) it might be possible to heuristically associate the word "wug" to a probable contextual meaning "your cat" from a shortlist of candidate objects ("you (conversing)", "me (also conversing)", "cat", "bed") without understanding that cats are things that sleep on beds, and some kind of educated guess that "that wug" doesn't refer to the speaker themselves!
In other cases, figuring this out might require a fair bit of side knowledge about the meanings of what is being discussed, and how they relate to each other, to almost logically deduce what entity is being labelled, given knowledge of how those entities work. That might be dramatically harder to get a machine to do well. Perhaps a machine being able to do this would be a leap forward in machine intelligence from today's ML "stochastic parrots".
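(A very rough Python sketch of the shortlist heuristic described above. The rule "pick the most recently mentioned entity that isn't a discourse participant" and all the names are invented here; a real system would need syntax, selectional preferences, and world knowledge on top.)

    def guess_referent(unknown_word, mentioned_before, participants=("I", "me", "you")):
        """Link an unknown noun to the most recent candidate that isn't a speaker."""
        for candidate in reversed(mentioned_before):
            if candidate not in participants:
                return candidate
        return None

    # "I got a cat. That wug sleeps on my bed."
    print(guess_referent("wug", ["I", "cat"]))  # 'cat' -- right, but only by luck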
fwiw, I just pasted your example into GPT-3, ran it twice, and both times the continuation acted like 'wug' meant a cat. (Once as the cat's name, once generically.)
> machines treat words as if they had intrinsic meaning, when the meaning is actually negotiated between speaker and hearer at the moment of the utterance.
IMO, and I'm not a linguist, I've always felt we need to treat conversations as transactions, with each speaker acting as a party and continually updating some contract over time. Each speaker's goal is to use conversational tactics (clarification, questioning, answering a question, diverting the conversation, humor/jokes, lies) as a means to "guess" the contract the opposing parties are operating under, and then to use that best guess of the contract to later retrieve information relevant to the system, once it has some confidence in the answers it can expect.
Something like an exceptionally complex game mixed with financial system theory: I "bet" that you're going to keep talking about your cat, and maybe that bet is wrong, so I can ask a question, "What do you mean by wug?", but the act of asking that question will cost me some points. Maybe you can think of it as though we're at a secret club: you use a phrase meant to represent a key and I'm supposed to respond with the proper answer. Even though I'm asking a question, I'm giving you information that you can use to update your best guess of the contract, which allows you to assume I'm not part of the in-group of the club and that I shouldn't be allowed in.
I've thought about a system designed like this for a bit but I suppose the biggest challenge has been how do you treat a conversation as a game considering neither party may "win" in any reasonable time frame. We could go years chatting and I could never fully predict what your answer might be because your usage of conversational tactics may be refined over years by speaking to multiple people of backgrounds similar to mine. So if I'm a cop, you might know not to admit to a crime when you figure out I'm a cop.
Right now, I don't think ML can do that - not because it's impossible, but because I think conversations are as difficult as predicting the stock market. Fortunately it seems solvable; not everyone is expected to win every conversation in their lifetime.
> "I dropped the trophy on the table. Now it's broken." (What's broken - the trophy, or the table?)
We could not solve this kind of problem (Winograd schemas) automatically a few years ago, but now ML techniques can solve them at about the same error rate as humans.
It's not a given that "thoughts" are representable, let alone in one particular way. I think if you came up with a particular "meaning representation" data structure and defined "understanding" a sentence as mapping it correctly to an instance of that data structure, you could use statistical ML for that. I think that would be pretty interesting, but the sentiment right now seems to be that for a human researcher to creatively devise such a representation for language is pointless, as it will just be inferior to the implicit and unknowable intermediate representation that the computer comes up with during training for a given task. In other words, if you want a translation function from English to German, train that (or train a model that can translate from any language to any language). If you want English to SQL, train that. If some grammatical distinction is relevant for that task, there's no reason in principle that the machine wouldn't learn it.
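(For what it's worth, here is a toy Python sketch of what "mapping a sentence onto a meaning-representation instance" could look like. The Proposition structure and the hand-written target are invented purely for illustration, not taken from the article.)

    from dataclasses import dataclass, field

    @dataclass
    class Proposition:
        predicate: str                # e.g. "get"
        agent: str = None             # who does it
        theme: str = None             # what it is done to
        features: dict = field(default_factory=dict)

    # Hand-written gold target for "I got a cat."
    gold = Proposition("get", agent="speaker", theme="cat", features={"tense": "past"})

    # A trained model would be asked to produce such instances; "understanding"
    # is then evaluated by structured comparison rather than string overlap.
    predicted = Proposition("get", agent="speaker", theme="cat", features={"tense": "past"})
    print(predicted == gold)  # True -> counts as correct under this definition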
This approach may indeed have limitations in terms of applying common knowledge and keeping track of state and context in a narrative, across many sentences. I played with GPT-3 for creative writing and found it was not able to keep track of characters and objects and their states very well. For example, we know that if we drop a coin into an empty piggy bank, the coin now resides in the interior of the piggy bank, and the bank will make a noise when shaken, and so on. GPT appears spotty on that stuff. But I also think this is expecting too much of it, as a text prediction engine that sometimes produces a very plausible series of sentences that does take some context into account.
The other point people are discussing is, are we humans doing something fundamentally different when we parse a sentence and find there to be one "correct" parse (along with one or two close-but-clearly-incorrect parses), in our opinion (which we feel confident about and can explain), compared to an AI that will offer a few different "guesses" as possibilities? I think humans are probably doing more cross-checking via different systems, but these systems could be built into the machine model. For example, if we are sure that parse A is correct and parse B is incorrect, and we can explain why, perhaps our confidence comes from our ability to explain, and the machine should be trained to produce logical explanations of its decisions. Or perhaps some of us humans have developed a sort of ambiguity detector, because of the likelihood and cost of certain misunderstandings, which triggers a more careful analysis of the grammatical situation, and the model needs something like that to be more human-like (or more like a careful and conscientious communicator, anyway).
> For example, if we are sure that parse A is correct and parse B is incorrect, and we can explain why, perhaps our confidence comes from our ability to explain, and the machine should be trained to produce logical explanations of its decisions. Or perhaps some of us humans have developed a sort of ambiguity detector, because of the likelihood and cost of certain misunderstandings, which triggers a more careful analysis of the grammatical situation, and the model needs something like that to be more human-like (or more like a careful and conscientious communicator, anyway).
There are lots of weird phenomena possible here...!
Maybe there's a sentence where parse B would never occur to some large group of human listeners, but, once it's explained to them, they tend to change their minds and believe that parse B is, in fact, preferable? (Like being told how to find a camouflaged animal in a photograph, or something?)
Maybe there's a sentence where parse B is a great joke or pun? (Presumably this is a really common case.)
Maybe there's a sentence where different speaker communities would consistently argue with one another over which one is plausible, totally disagreeing with one another's intuitions? (Maybe due to far-reaching cultural differences, or something.)
Something I think about is how computer voice interfaces are changing how we speak[1]. I would not be surprised if the way we talk to each other co-evolves in a way that makes us more easily understood by computers. In other words, it may be easier to teach humans to speak like machines than teach machines to speak like humans. As someone who writes code all day, that's already somewhat the truth.
> Do we have a retired BBC reporter that was based in an East European country during the Cold War?
> (...) None of the above challenging semantic understanding functions can be ‘approximately’ or ‘probably’ correct – but absolutely correct.
There already exist systems that do text-to-SQL translation. The question the author uses as an example is actually quite easy, and not where modern text-to-SQL systems tend to fail. The "absolutely correct" part is clearly false. Both humans and ML systems will make mistakes and have to deal with ambiguity. As a software developer working with nontechnical stakeholders, I'm 100% convinced that people routinely compose sentences they do not understand in any absolute sense.
2.
A good way to check whether your proof is correct is to check whether it leads to absurd conclusions. His proof relating to learnability (ML) and compressibility (COMP) would seem to indicate that either people cannot learn or they do not understand language. Absurd.
3.
> In ML/Data-driven approaches there is no type hierarchy where we can make generalized statements about a ‘bag’, a ‘suitcase’, a ‘briefcase’ etc. where all are considered subtypes of the general type ‘container’. Thus, each one of the above, in a purely data-driven paradigm, are different and must be ‘seen’ separately in the data.
This is objectively false. Even simple context-free embeddings like Word2Vec will capture some relationships between 'container', 'bag', 'suitcase', and 'briefcase'. Based on that alone, the rest of that argument falls apart. But large language models go beyond context-free embeddings.
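(This is easy to check directly. A minimal sketch using gensim's downloader and small pretrained GloVe vectors; word2vec vectors can be loaded the same way, they're just a bigger download, and exact similarity numbers will vary by model.)

    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-100")  # small pretrained embeddings

    for word in ["bag", "suitcase", "briefcase"]:
        print(word, "~ container:", round(vectors.similarity(word, "container"), 3))

    print(vectors.most_similar("suitcase", topn=5))  # typically other luggage words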
4.
The last argument is a relatively good one. It's hard to extract all attributes of human concepts from text alone. This is where large language models fail miserably. We need to give models access to different modalities, but that alone isn't enough: models cannot remain embedded in a static world of recordings and still create a rich understanding of things like agents or agency.
This is indeed missing, but at the same time this is being actively worked on.
So, that's it. The author makes three arguments that are either false or absurd, and one that is good but wholly unconvincing as to the impossibility of ML solving natural language understanding.
1. You are conflating the sense in which humans may arrive at a mistaken propositional model by mistaking context with one where the machine lacks any sense of contextual relevance with which to arrive at any specific propositional model.
This tactic is taken often in these "replies": humans fail for semantic reason A; machines fail for non-semantic reason B; isn't A just B? No.
2. Or you've misunderstood how humans learn.
Though on the face of it, the sketch of the proof is correct: there are an infinite number of target models (T) which compress to representation R. E.g., an infinite number of 3D geometries can produce a given 2D photograph.
Compression (i.e., "low-rank" interpolation through data) yields a function from R-space datasets (e.g., 2D photos) to a model R which "covers" that space.
It does not yield a function from R->T, which doesn't exist as a formal matter. You, at least, need to add information. This is what many misunderstand: light itself is ambiguous and does not "contain" sufficient information. We resolve light into 3D models by a "best guess" based on prior information.
So we require, at least (R, C) -> T where 'C' is some sort of contextual model which bridges the infinities between R and T.
Since ML takes Samples(T -> R) -> R, and not (R,C) -> T, it doesn't produce what is required.
QED.
3. Word2Vec does not capture hierarchical relationships. He chose hierarchical specifically because it is a discrete constraint, and ML is a continuous interpolation technique that cannot arrive at discrete constraints.
4. "Actively worked on" means building AGI. Participating in the world with people is how animals acquire relevance, context, etc.
I didn't get beyond his argument that ML is compression while NLU is decompression but that doesn't seem to be right. ML is often used to decompress data such as increasing image resolution. Of course it needs more data than just the compressed form, but only for training. For inference, of course you can use ML to add assumed common knowledge information.
You take a training set, and you end up with a model that is smaller than the size of the training set yet performs well on it (Making the assumption the model is good here).
That is compression. It has nothing to do with what the model is being used for.
> It has nothing to do with what the model is being used for.
I may be misunderstanding this passage of the article, but I thought the author was claiming that machine learning (specifically training) was equivalent to compression, while language understanding is equivalent to decompression. Therefore, they can't be the same thing. Why does language understanding have to be analogous to training an ML model rather than using an ML model for inference?
> Why does language understanding have to be analogous to training an ML model rather than using an ML model for inference?
Why would you look at ML model inference in particular? There is no compression or decompression going on during inference; you're just running data through the existing weights.
Creating an ML model on the other hand is lossy compression. You reduce the size of the data (Training set -> model) in exchange for reduced accuracy (100% -> 90-95% or whatever).
NLU is decompression because you are extracting information that doesn't exist in the text.
I see ML as ahead-of-time compression (Creating a model), whereas NLU is just-in-time decompression (Extracting information from current context). Looking specifically at inference-time doesn't make sense to me because all the work for ML is done during training, not inference.
I was hoping there would be discussion of this point higher up in the thread, because I had essentially the same reaction as you while reading this passage. I'm no expert of machine learning, NLP, or linguistics, but this struck me as a pretty obvious flaw in the author's argument.
I think that's very true and it's maybe even more clear when you consider mathematics.
You can maybe imitate but not effectively learn mathematics empirically. There is an infinite number of mathematical expressions or sequences that can be generated, so learning can never be done; you cannot compress yourself to mathematical understanding. (This is obvious if you try to feed language models simple arithmetic: they can maybe do 5+5 because it shows up somewhere in the data, but then they can't do 3792 + 29382, hence they do not understand anything about addition at all.)
The correct way to mathematical understanding is decompressing, understanding the fundamental axioms of mathematics and internal relationships of mathematical objects (comparable to the semantic meaning behind language artifacts), and then expanding them.
> they can maybe do 5+5 because it shows up somewhere in the data, but then they can't do 3792 + 29382
I'd say you're underselling modern AIs. GPT-3 can come close, as can GPT-J (which is publicly available and therefore perhaps easier to verify), even if they don't quite get the right answer 100% of the time.
I gave GPT-J the following prompt (which you can try yourself at https://6b.eleuther.ai/ just wait a few min for an answer)
What is 1 + 1?
2.
What is 3 + 2?
5.
What is 6984 + 120?
7104.
What is 3792 + 29382?
And it responded with
33374
What is -0.084435 + -0.3?
-0.384435
What is -4.1 + -0.4?
-4.5
... [more of the same omitted]
Note that it got almost the right answer! Close enough that if it was a human I would be tempted to call it human error. GPT-3 can do even better.
EDIT: On another try GPT-J got the answer! Not bad (and GPT-3 is an even more powerful model).
33174
-3 + -91.4
-94.4
What is -1.6 + 13.81?
12.21
Jon has six mangoes, and he needs ten rupees.
Only Alice wants to buy any mangoes. Alice wants to buy three.
How much does Jon sell them for?
Response:-
Jon has six mangoes, and he needs ten rupees.
Only Alice wants to buy any mangoes. Alice wants to buy three.
How much does Jon sell them for?
100 rupees.
Q: What is the smallest number that can be written as 1/2 x 5 x 7?
A: 6,550
Q: What is the smallest number that can be written as 1/2 x 5 x 7 x 11?
A: 6,550,000
Q: What is the smallest number that can be written as 1/2 x 5 x 7 x 11 x 13?
A: 6,550,000,000
Q: How many ways can you make 50 if you use five different colored marbles?
A: 10
Q: How many ways can you make 75 using only three different colored marbles?
A: 12
Q: What is the smallest number that can be written as 1/2 x 5 x 7 x 11 x 13?
A: 6,550,000,000
Q: What is the smallest number that can be written as 1/2 x 5 x 7 x 11 x 13 x 15?
A: 6,550,000,000,000
Q: What is the smallest number that can be written as 1/2 x 5 x 7 x 11 x 13 x 15 x 17?
A: 6,550,000,000
The first line of the answer is part of an appropriate answer for this linguistic formula given as a prompt. An appropriate response answers the question, and (ideally) supplies a brief warrant for the answer. (A warrant is a reasoned argument as to why the answer is correct.)
The rest of the answer is just noise. Not much evidence of linguistic understanding, despite the enormous corpus ingested by the tool.
You have to be careful when analyzing GPT-3 and derivatives due to BPE encoding. Its accuracy greatly improves when adding commas to strings of numbers: https://www.gwern.net/GPT-3#fn21
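(A quick way to see the BPE effect for yourself, assuming the Hugging Face transformers package is installed; the exact token splits aren't guaranteed, the point is just that they differ between the two formats.)

    from transformers import GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")  # same BPE family GPT-3 reuses

    for text in ["3792 + 29382", "3,792 + 29,382"]:
        print(text, "->", tok.tokenize(text))
    # The digit chunks the model "sees" differ between the two forms, which is
    # part of why comma formatting can change arithmetic accuracy.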
> But this is where the problem is in NLU: machines don’t know what we leave out, because they don’t know what we all know.
What the machine doesn’t have that we have is an understanding of culture and the feedback of emotions.
What’s elusive about culture is that much of it is tacit.
Consider all the things you don’t do in order to be a respected guest at a dinner party (a) for someone you don’t know well; vs (b) your best friend’s mother; vs (c) your new boss after the takeover. Most of the decisions you make are the result of tacit feedback (or lack thereof) that a machine can’t access.
Consider these lines from an Eagles song:
City girls just seem to find out early
How to open doors with just a smile
How’s a machine to learn that knowledge, when the substantive information never passes through a keyboard or a camera, and only exists in response to biological stimuli?
There's no reason in principle a machine can't learn this kind of thing (humans are still machines, after all), but I think this certainly points to why humans learn language so much faster than machine models. We can synthesize language understanding with a much richer, sensory-based model of the world and recognize that words correspond to clusters of sensations like specific sights, sounds, tastes, feelings, and smells.
A logical next step in AI research is figuring out some kind of way to get these various systems that each do one thing and do it well to interoperate the way all the parts of a brain do, so AlphaGo, some object recognizer trained on Go boards, and a language model trained on sentences about the game of Go can recognize they're all modeling the same thing and augment each other with crosstalk and competitive hypothesizing, in roughly the way animal reasoning seems to work.
> but I think this certainly points to why humans learn language so much faster than machine models. We can synthesize language understanding with a much richer, sensory-based model of the world and recognize that words correspond to clusters of sensations like specific sights, sounds, tastes, feelings, and smells.
Yes, because unlike computers, we have bodies that are in the world.
> A logical next step in AI research is figuring out some kind of way to get these various systems that do one thing and one well to interoperate the way all the parts of a brain do
Do you think it’s fair to conclude that until computers have bodies that enable them to have the same situational presence and sensory awareness that humans have, computers will fall short of being able to parse natural language the way humans do?
Are a (human like) body and senses sufficient or might more be required? I’m thinking of extinct animals - just because we can restore their bodies from their genomic sequence doesn’t mean we can ever restore their culture.
Humans learn culture through a lifelong process that necessarily involves an early stage of childhood dependency, then growth, and maturation.
The author seems hung up on the idea that because NLU involves apparent "discontinuities" -- places where small variations in interpretation completely transform tasks -- it will not be amenable to smooth, continuous notions like PAC learning, compressibility, and so on. While there's a directional insight there, the terms aren't well defined. And the big story of ML-based NLP in the last decade has been that many tasks that were presumed too jagged for curve fitting are in fact tractable, given large data sets and clever shifts of the way discontinuities can be modeled (e.g. attention-based techniques like Transformers).
Finally, since humans don't perform perfectly at these contrived tasks either, we must acknowledge that there is some degree of "approximate correctness" that satisfies our ideas of intelligence.
And the big story of ML-based NLP in the last decade has been that many tasks that were presumed too jagged for curve fitting are in fact tractable...
The thing is, I see many situations where large language models have been "impressive" but few situations where they have clearly succeeded in the real world. I'm less than impressed with online translation, Google's AI search is annoying, GPT-3 authored articles are impressive but often senseless, voice interfaces to various corporations are a disaster, etc.
It seems like the main "tractable task" is doing well on the benchmarks themselves. That's not denying there's progress here, it just seems like the "jagged" aspects of language might still be a hurdle.
It's not just benchmark tasks, but I think perceptions of progress get skewed because we're in an uncanny period where NLP is good enough for many industrial applications but unsatisfying under detailed scrutiny.
- Online translation? Not for a literary piece, but it will still let your e-commerce site get the gist of a customer complaint.
- Authored text? Not good enough for direct consumption, but pass it through one intern and you get a much faster rate of e.g. satisfactory social media responses.
- Frustrating phone or bot interface? Average customer spends 10% more time, but the company saves 50% of its costs.
Most of these applications transfer some burden downstream, but not all of it... so it is having a big impact on the information supply chain. I don't expect those applications to be exciting to many people here, and especially to AGI acolytes, but lots of technologies have gone through this maturity curve: (1) solve toy problems, (2) solve lame but valuable problems, (3) do interesting "real" things.
And in a few places, NLP is moving on to (3). Tools like Grammarly are actually a better experience than most human editing loops. I would also put NLP-backed search in this category -- anyone who Googles is having a much better experience because of modern NLP, without even needing to be aware of it.
Online translation? Not for a literary piece, but it will still let your e-commerce site get the gist of a customer complaint.
-- Uh, in my experience with FB translation, it gives a coherent "gist" 70% of the time. Which sounds good, except that seems to involve 15% gibberish and 10% wrong in the sense of a plausible but incorrect meaning. How can a company act on a consumer complaint if there's a significant chance that what you're reading is totally off base? If "your product is too small" gets translated to "your product is too large", etc. Of course, a lot of companies ignore complaints or send form letters; their approach wouldn't be impacted. Broadly, a lot of companies produce streams of barely meaningful, vacuous bs - that really does serve some percentage of their purposes. This technology may allow the creation of this sort of thing in a more effective manner, but I would claim that most of the cost of this stuff is already in editing it to avoid saying things that can cause real problems, and so even here the savings may be less than you'd think.
Frustrating phone or bot interface? Average customer spends 10% more time, but the company saves 50% of its costs.
-- The phone robots have been around for a while. The distinction really is between number-pad robots and voice-recognition robots. The primary advantage of the voice-recognition robots is more choices, and the primary disadvantage is that sometimes they just don't work at all, whereas the number-pad robots are fairly robust.
lots of technologies have gone through this maturity curve: (1) solve toy problems, (2) solve lame but valuable problems, (3) do interesting "real" things.
-- And lots of other technologies have stopped somewhere along the way.
My point, getting back to OP, is mostly around the corpus-based approach. I wouldn't deny that there's progress here. But I'd agree with the OP that there isn't fundamental progress. A lot of what is happening is that this approach is much cheaper: you turn a huge amount of data into an application using a small team and a bunch of compute, where previously you'd have had to hire many people for an equivalent. But equivalents existed previously and even had their advantages. Which isn't to say previous methods can come back - the cheapness of a brute-force solution isn't going to go away. But I would say fundamental progress needs more than this.
Indeed, translating from the #1 language to the #4 language (or back) is problematic. Google's translation is usually enough to approximate what was said, but often results in gibberish. As a small example, it consistently fails with the Spanish pronoun "su" - a word that, depending on context, can mean "his", "hers", "its", "your", "y'all's", or "their" - which seems to be a good example of the compressibility of natural language. How big does the corpus used to train their models need to be to get "su" right?
This would be true if the entire field of NLP ML was simply a process where training data is used to fine tune a model into providing better scores at specific tasks. It's easy to see why you would be fooled into thinking that this is the case: the most impressive applications that have been developed so far are simply this.
But I would argue that NLP is more than this. As we discover new architectures and techniques that score higher on accuracy, the mechanisms through which we achieve higher scores are clues to the underlying abstractions that make up our thoughts. For example, the success of Transformers (basically a vector search engine) is strong evidence that concepts can be represented as vectors, and, even more, that we can measure the "closeness" of two concepts using a dot product between their corresponding vectors. In this sense, the big Transformer architectures like GPT-3 are compressing natural language sentences into alternative, mathematical representations through the encoding step, but also decompressing back into natural language sentences in the decoding step.
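(A minimal illustration of the "closeness as a dot product" point, with made-up 4-dimensional concept vectors; real models use hundreds or thousands of dimensions, but the arithmetic is the same.)

    import numpy as np

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    cat    = np.array([0.9, 0.1, 0.8, 0.0])
    kitten = np.array([0.8, 0.2, 0.9, 0.1])
    truck  = np.array([0.0, 0.9, 0.1, 0.8])

    print(cosine(cat, kitten))  # high (~0.99): the vectors point the same way
    print(cosine(cat, truck))   # low (~0.12): nearly unrelated concepts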
Natural Language Processing at this point is actually a big misnomer - it's more than talking robots; it's about translating human reality into mathematical representations. We are still far from true understanding, but so what? Infants crawl before they walk. NLPs' representations of reality are crude and our tools for manipulating those representations are even cruder. But every advance we make is a step towards understanding knowledge itself.
The hunter-gatherer brain did not evolve to facilitate unambiguous thought transmission.
Speech evolved to be very visual, spatial, quantitative and political. The ability to lie efficiently is an evolutionary trait. Sometimes we don't even need words for that, and sometimes we lie by pronouncing exclusively true words and sentences. Ambiguity of speech was always a feature.
None of that makes the NL researcher's work easier, of course.
Understanding a sentence is not like decoding or decompressing; it is more like trying to guess what the utterer is up to, politically, and whether he is a friend. Only then comes deciding where to steer according to what he says. And for that we sometimes do start decoding the message, but only with the sender's goals firmly materialized in mind.
Going through this, it has the old claim that "A full understanding of an utterance or a question requires understanding the one and only one thought that a speaker is trying to convey", which continues to not make any sense, because obviously people don't do that; as much as I would like to be understood in precisely the way I mean, down to the most subtle nuance/shade of meaning and connotation, much of the time this is not something we can actually get across in an at all reasonable amount of time.
Also, claiming that natural language is infinite, if taken literally, would imply a large claim about physics contrary to the common consensus, contradicting the Bekenstein bound and all that.
But one thing which seemed, at least initially, like a point that could have some merit, was the point about compression vs decompression.
But the alleged syllogism about it is pretending to be much more formal/rigorous than it is, and is also kind of nonsense? Or, like, it conflates "NLU is about decompression" with "NLU \equiv not COMP", which I assume is meant to mean --
well, actually, I'm not sure what it is supposed to mean. Initially I thought it was supposed to mean "NLU is not equivalent to compression", but if so, it should be written as "NLU \not\equiv COMP" (where \not\equiv is the struck-through version of the \equiv symbol); but if it is supposed to mean "NLU is equivalent to the inverse or opposite of compression" (which I suppose better fits the text description on the right), then I don't think "not" is the appropriate way to express that.
And, if by "not" the author really means "the inverse of", then, well, there's nothing wrong with something being equivalent to its own inverse!
Nor, does something being equivalent to the inverse of something else imply that it is "incompatible" with it.
For something talking about communicating ideas and these ideas being understood precisely by the recipient of the message, the author sure did not work to communicate precisely.
The value in formalization comes not in its trappings, but in actually being careful and precise, etc., not merely pretending to be.
The part on intensional equality vs extensional equality was interesting, but the claim that neural networks cannot represent intension is, afaict, not given any justification (other than just "because they are numeric").
> Also, claiming that natural language is infinite, if taken literally, would imply a large claim about physics contrary to the common consensus, contradicting the Bekenstein bound and all that.
Natural language is infinite in the pretty straightforward sense that, say, chess is infinite (there is an infinite number of valid chess games - if you ignore arbitrary restrictions such as the 50-move rule). This of course doesn't mean that a chess computer has to be infinitely large or violate any known laws of physics. Similarly, practical C compilers can exist despite there being an infinite number of valid C programs.
Natural language as understandable by humans, is finite, because humans are finite, at least in this life.
There are only finitely many distinct utterances that a person could possibly hear that have length less than one lifetime.
Any utterance which takes longer than a lifetime to make/hear, is not really natural language, so much as an extrapolation of what natural language would/might be if people had unlimited lifespans, memory, and attention-spans.
How would I represent intensions in a neural network?
Well, you can encode text with a sequence of 1-hot vectors. (Is this trivial? Yes. Still counts.) If you can encode intensions on a computer, you can encode it as text. If you can encode it as text, you can encode it as a vector.
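(The trivial encoding spelled out as a Python sketch: characters to a sequence of 1-hot vectors. Nothing deep, just showing it is representable as network input.)

    import numpy as np

    def one_hot_sequence(text):
        vocab = sorted(set(text))
        index = {ch: i for i, ch in enumerate(vocab)}
        out = np.zeros((len(text), len(vocab)))
        for pos, ch in enumerate(text):
            out[pos, index[ch]] = 1.0
        return out

    print(one_hot_sequence("wug").shape)  # (3, 3): three positions, three symbols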
Do I think that (a sequence of 1-hot vectors) is the best way to do it? Not really, no. I'd need a bit more detail on what is meant to be represented in order to give a better shot at describing what I think could be a good approach for encoding it.
But also, I don't think the burden of proof is on me here. The author claimed that it is impossible, I said I don't see any justification for that claim.
Personally, I'm not entirely sure what they are saying is impossible. Do they have a particular task in mind?
____
Also, the set of possible states in a game of chess is finite, and a complete understanding of how to play chess optimally would, as such, also be finite. The fact that you can repeat some subset of states in a bunch of orders for an arbitrary amount of time, before you stop messing around and actually finishing the game, doesn't grant chess infinite complexity in any meaningful way.
A finite state machine can recognize a language which has infinitely many strings in it, yes. This does not mean that a RNN can't emulate such an FSM.
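(Concretely, here is a three-state machine accepting the infinite language a+b+, i.e. one or more a's followed by one or more b's. The description is finite; the set of accepted strings isn't, and an RNN with enough capacity can imitate the same state transitions.)

    TRANSITIONS = {
        ("start", "a"): "as",
        ("as", "a"): "as",
        ("as", "b"): "bs",
        ("bs", "b"): "bs",
    }
    ACCEPTING = {"bs"}

    def accepts(s):
        state = "start"
        for ch in s:
            state = TRANSITIONS.get((state, ch))
            if state is None:
                return False
        return state in ACCEPTING

    print(accepts("aaab"))            # True
    print(accepts("abab"))            # False: leaves the language
    print(accepts("a" * 1000 + "b"))  # True: arbitrarily long inputs are fine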
Sure, but this is missing the point. There’s also a finite number of C programs that can actually exist in the physical universe, but that’s an arbitrary limit, not part of the definition of C. Similarly, there’s no non-arbitrary limit on, say, the length of an English sentence.
All of this ‘debate’ about the infinity of language just reduces to a misunderstanding of what people are saying, as far as I can see. No-one thinks that more than a finite number of English sentences will ever be uttered; conversely, no-one thinks that we will ever discover such a thing as a complete list of all English sentences (since trivially the conjunction of all these sentences would be a new sentence not on the original list).
Note that if you view English as a regular tree language, your point about chess also applies to English. You don’t need to remember all of the preceding state, just the congruence class. No-one is saying that English is “infinitely complex”. The grammar of English is finite, but there is no limit on the number of sentences that it can assign a structure to.
Of course you can encode intensions using numbers in the broad sense that you can encode pretty much anything using numbers.
All arguments presented show that FORMAL methods won't work for a general NLP.
ALL of the problems discussed in this post are solved to some extent by machine learning - sometimes even by statistical methods such as word2vec. GPT-3 goes much further. Of course, these models are not trained on pure, grammatically and conceptually correct English - but on language that people actually use.
> while word embeddings capture certain conceptual features such as “is edible”, and “is a tool”, they do not tend to capture perceptual features such as “is chewy” and “is curved” – potentially because the latter are not easily inferred from distributional semantics alone.
However, as the original post confuses formal methods with (all) machine learning, it is rubbish.
This person is wrong and doesn't understand how brains produce and interact with language, but other people have already said that, so what is the straw that this person is grasping at?
There is a representational / symbolic divide in much of AI. Unfortunately some people think symbolic processes are somehow more "true" or "real" than representational processes. This leads them to say things like "A full understanding of an utterance or a question requires understanding the one and only one thought that a speaker is trying to convey." As if there were such a thing in a jello-like mass of briefly sparking cells.
But symbols are super useful, you're processing them now to understand my thoughts. They are so useful that you see them constantly in human innovations. Math, logic, art, literature ... everywhere! So what is a more useful way to understand the relationship between our gooey protoplasm and the beauty of say the pythagorean theorem? It's the same way to understand the relationship between analog and digital computation. The latter is built on the former.
Your digital (symbolic, supposedly pure) world is built on differences in ranges of voltages in physical circuits. (Or ranges of magnetic polarization, or ranges of light frequencies). We tame a continuously variable world by defining boundaries on continuous ranges and then pretend those are pure symbolic values.
This is the problem the author is wrestling with. They haven't differentiated the computational substrate from the computation. Can we build systems that "understand" using probabilistic function approximation? Sure! That's how we work. But the program running on that messy substrate isn't there yet and that's what's got the author in a tizzy.
Natural language evolves with what's happening in the world. Natural language also refers to what's happening in the world. For ML to keep up with that, it would have to understand what's going on in the world, no?
Imagine you see a discussion about a border issue between two countries. You see a sentence that says: "What if country X does what Lukashenka did?"
If you've paid attention to the news about the border issue between Belarus and Lithuania you'd understand that the sentence is referring to flying in Iraqi migrants to Belarus to then send them to Lithuania to try to cross the border. Effectively to "flood the border".
It gets worse too depending on the perspective. For those that support Lukashenka the meaning might be different, because they don't accept that the above is what really happened.
For an ML model to understand all of this, it would have to have an understanding of what's going on in the real world. And it would have to be able to glean this understanding from few examples. I'm not sure how feasible this is.
> However, language understanding does not admit any degrees of freedom. A full understanding of an utterance or a question requires understanding the one and only one thought that a speaker is trying to convey.
What a laughable claim. By this definition NLU is easily shown to be impossible by the existence of ambiguity in natural language.
Any language model alone isn't going to solve natural language understanding, but some future ML full-brain model surely will achieve AGI and therefore language understanding as part of that.
What's missing from a pure language model is any grounding in reality and the ability to interact with the world to test and extend its knowledge beyond what is derivable from the corpus it was trained on. Its level of understanding is ultimately limited by the content of the training corpus, regardless of how massive that may be.
Something like GPT-3 is really just a statistical twist on Doug Lenat's Cyc; its understanding is always going to be limited by its own fundamental nature. Yes, one deals with language and one with facts, but ultimately both are just large, fixed, self-referential bodies of data.
Cyc really is a great analogy, and for some reason it took decades for Lenat et al to eventually realize that regardless of how much fixed data you added to it, it was never going to be enough. A closed black box can never know what's outside of the box, although it may gamely try to tell you if you ask it.
These modern language models, GPT-3, etc, have certainly been a bit of an eye opener, and can perform some impressive feats (question answering, etc), but one shouldn't be tempted to believe that if scaled up sufficiently they'll eventually somehow transcend their own nature and become more than a language model... a one-trick pony capable of generating plausible continuations of whatever you seed it with.
Nobody tell him about GPT-3, I guess…? How do you write this in 2021 and not specifically confront the evidence of what modern ML systems can do and have already done?
I know the author and have worked with him. Even when a system he wrote over many months was dramatically outperformed by a naive machine learning approach that I coded up in an afternoon, it made no difference to his confidence that he was on the right path.
I am genuinely unsure whether this is a good or bad character trait.
The question is, can you build useful NLP systems with machine learning methods? And the answer appears to be a pretty clear 'yes'. I assume the set of problems such methods can solve, and the quality of those solutions, is only going to go up. Let them have their fun, I say.
That said, I've yet to see a good question answering system even in 2021, and I have somewhat traumatic memories of my own attempts:
The approach we settled on was quite old fashioned tree-based parsing of questions, with quite a lot of help to guide users to the subset of English we actually understood (disambiguation of both sentence structure and arbitrary named entities in a customer database, and upgrading partial/keyword queries to full sentences to be clear what we thought we were answering etc). All this was built on top of quite carefully curated customer ontologies and mappings to verb frames. We always intended to one day infer most/all that using machine learning when we had enough curated data to train and get a feel for how well it worked but we quickly ran out of oomph.
I have some sympathy with the author's basic point. I know there are pretty good demos of NLP over tabular data these days but I still think they'd struggle with any interesting real world usage (i.e. replacing SQL), especially when you have to model relationships between data in an arbitrary database. If you're anything less than 100% accurate in translating user queries, you have to provide ways to let them nudge your system in the right direction. It's true that human communication isn't necessarily entirely clear to even a human listener, but people know what they _mean_ to say, and you do have to offer some way to let them say it eventually. I find more mechanistic parsers to be more amenable to that, in general (but I'm not really up to date on the literature these days). I'd totally be in the market for such a system today if anyone can point at something robust though.
Consider a modified question: Can artificial agents that incorporate trillion-parameter NLP models learn to understand natural language in order to perform a wide range of useful tasks in open-ended environments?
It sure seems possible to me, at least judging by recent progress in both reinforcement learning in open-ended environments (e.g., DeepMind's recent paper on open-ended self-play) and NLP (e.g., with models such as DALL-E and GPT-3, which are trained to learn only a very narrow set of simple tasks, but could be trained in service of more complex and diverse agent objectives).
At a minimum, I'd say no one today can answer the NLU question with such certainty.
> we are convinced that the ‘big data’ approach to NLU is not only psychologically, cognitively, and even computationally implausible
I wish they didn't just breeze by the psychological and cognitive perspectives. Because even if the giant corpus/giant language model approach can do a lot of understanding-related tasks under the right conditions ... you would never claim to be teaching a child English by having them look at endless piles of documents. You interact with them. You _show_ them things in the world that the words name. They get what they want faster when they learn to say what they want. They "understand" words and sentences in reference to things and experiences. What does it even mean for a network to "understand" language when not exposed to other representations of objects, actions, etc?
Suppose you could time travel and bring English language technical documents back to a Mesopotamian cult (I only care that they haven't seen the latin alphabet), and you got a community of priests to study them as scripture, and after years of study they could tell you which symbols were missing from a sequence, or which symbols appear in the same contexts, or generate a plausible sequence of symbols and flag implausible ones, but they had no idea what the words referred to -- would they have "understood" anything?
Suppose you read a scifi book which makes new words or even has some weird typographic trick to represent something that's definitely out of corpus (e.g. the Language in Embassytown), you can understand it regardless, not because your model tells you about the relationships between those words and other words, but because you have conceptual frames for the situations, actions, intents etc.
There is NO exact interpretation of a sentence like "The man saw a boy with a telescope". It is possible that the man peered through a telescope to see the boy, or that the man saw a boy who had a telescope.
One may reasonably argue that one interpretation is more likely but either may be true.
Ambiguity is an immense problem in semantic analysis, compounded by assumed speaker intent, etc.
Ultimately there is no exact interpretation or meaning of all language unfortunately.
Right, but human conversation doesn't work like that. A real usage would be more like "The man saw a boy with his telescope. The boy was running down the beach, apparently screaming out at the sea."
Using "a" in place of "his" in speech would be unnatural and would raise eyebrows. "His" is still ambiguous though, and any human speaker would know that and provide more context. The second sentence disambiguates the sentence sufficiently. Eleuther's online demo still fails badly given the above prompt:
> The boy saw the man, stopped, turned around, and looked at the man angrily, pointing at the horizon. The man saw a break in the waves, and he walked towards the boy. The boy ran into the sea, and the waves tried to take the man, but the man was able to escape from the waves. The boy was angry. He called the man a “nasty pirate.” The man went home and told his mother about the boy and the pirate. The man’s mother said, “He’s just a kid, he probably has no idea what he’s saying.”
All good comments, and I don't disagree completely. Most humans get around just fine reading books and communicating.
Interestingly though, there are miscommunications between humans and not just because someone said "the man saw a boy with his telescope" without sufficient context.
The intractable problem is that human language has evolved to require context or presumption of intent and these are not stated or confirmed. Therefore probabilities MUST be assigned to them. There is no exactness because the context and intent are not stated/communicated/written, they are by definition assumed.
This assumed knowledge works in many cases but fails in others. The difficulty comes when the audience does not AGREE with the assumption.
TLDR: Humans communicate by assuming that their audiences share some of their beliefs but sometimes the unspoken beliefs of the audience don't match the speakers'.
> I have discussed in this article three reasons that proves Machine Learning and Data-Driven approaches are not even relevant to NLU
This is a pretty hard-line position to take, and given the author's credentials I'm inclined to believe this is somehow poorly worded and not reflective of the thesis he intended with this article.
> Languages are the external artifacts that we use to encode the infinite number of thoughts that we might have.
> in building larger and larger language models, Machine Learning and Data-Driven approaches are trying to chase infinity in futile attempt at trying to find something that is not even ‘there’ in the data
> Ordinary spoken language, we must realize, is not just linguistic data
I would be curious to know what the author thinks of multimodal representation learning - which is conceptually promising in that it opens the door for machine learning models to learn relationships that span text, images, video, etc. For example OpenAI's CLIP [1], and other models like it.
Mechanistic solutions are never meant to solve high level problems, but they very often make the fundamental bricks more advanced solutions will rely upon, at a later stage.
Well, I would have agreed with you a year or two ago. I have been experimenting with GPT-3, via OpenAI beta APIs. The generative (not extractive) summarization is impressive. Question answering functionality is also good.
Fast forward ten years, when we have: 1. Much faster custom neural processing hardware with much more memory. 2. Advances in modeling techniques, similar to how Transformer models seemed revolutionary a few years ago.
- multi-task - the training data should include all useful supervised datasets and knowledge bases besides next token prediction
- longer context window
- using external search and an external memory bank - so it is up to date without retraining, and has episodic memory
This kind of general model could be hooked up to a virtual body and trained on a diverse set of RL tasks, the problem is that it's probably too slow for real time use.
I first came across this form of misunderstanding in a radio interview on the BBC (probably Radio 3..) with John Searle. This would have been early 80s. His seemingly deep misunderstanding of the domain triggered a lifelong interest in the subject.
The failing seems to be to do with drawing the system boundary in the wrong place (at the walls of the Chinese Room).
The article seems to claim that machine learning won't work for natural language understanding because ML uses the Probably Approximately Correct (PAC) paradigm but language understanding must be exactly correct. As far as I can tell, that's nonsense, completely misunderstanding the PAC framework.
Perhaps, but "being able to do X" isn't the same as "being able to do X, using approach Y". After all, Deep Blue could play chess more than reasonably well, but no-one would suggest that it understood chess anymore than a calculator understands math.
So, is GPT-3 more like a brute force calculator like Deep Blue (or AlphaGo for that matter), or is it more like a human in it's ability to "understand" language ?
Given that GPT-3 is a language model, just trained to minimize errors when predicting next word in a sentence, and given the nature of it's transformer architecture, I'd say it's pretty clear it's really just another Deep Blue or AlphaGo. It's a "predict next word" brute force calculator, that - interestingly - is nonetheless able to perform many "NLU" tasks surprisingly well.
AlphaGo is not merely a brute force calculator and neither is GPT-3. They have conceptual representations. That is the advantage of deep neural networks, the ability to form concepts, as biological neural networks do.
They can form their own internal vector* representations, but they are still fixed architectures. Training AlphaGo can't change it from an RL architecture into something else, nor can training GPT-3 change it from a transformer into a more general cognitive architecture such as the human brain.
The datapath through a transformer is entirely prescribed. Once the weights are trained, feeding a test sample into it will indeed just result in a fixed series of calculations to produce an output.
* It's a bit of a stretch to assert that deep neural nets are creating conceptual representations. For example, if you look at the early layers of a CNN, what it has learned are just primitive (oriented edge, etc.) feature detectors. At higher levels the more primitive features are combined into more complex ones, but they are just visual patterns, not concepts. It'll be the same for AlphaGo - it'll be creating its own representations of complex board positions. Better than Deep Blue having to work with a human-designed board representation, but at the end of the day it is nothing more than a board/position representation.
I am 100% sure ML is close to solving natural language understanding comparable to 95% of the population. We tend to grossly overestimate median human intelligence.
Common experience can be inferred from utterances, but only with difficulty until it has been stated outright - e.g. "the pleasing coolness on the reverse side of the pillow."
http://lib.ru/ADAMS/liff.txt
Similar to TFA's compression, if utterances are messages, can we decode them? (i.e. is it possible, not whether computationally feasible).
We need sufficient redundancy, so for example, a one-time pad encoding is impossible to decode (each message is encoded using random information, which sender and receiver both know, that information being shared offline/out-of-band).
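(A toy one-time pad in Python, just to ground the claim: without the shared key, the ciphertext is equally consistent with every plaintext of the same length, so no statistics over ciphertexts alone can recover the message.)

    import secrets

    def otp(message, key):
        assert len(key) == len(message)
        return bytes(m ^ k for m, k in zip(message, key))

    msg = b"meet at noon"
    key = secrets.token_bytes(len(msg))  # shared out-of-band, never reused
    cipher = otp(msg, key)
    print(otp(cipher, key))              # b'meet at noon' -- only with the key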
So, human beings do have shared offline information, that TFA discusses. Unlike a one-time pad, this information is reused repeatedly. TFA claims the amount of reuse is insufficient for statistical inference, and argues that variations (like bag, suitcase, briefxase) will not be picked up as the same kind of thing (container), thus reducing the amount of reuse.
I think statistically inferring this commonality is feasible, from messages about putting something in, taking something out, etc.
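A toy sketch of that distributional idea: words that show up in the same "put ... in / take ... out" contexts end up with similar count vectors, even though nothing ever tells the system they are all "containers". (The corpus and the crude whole-sentence context window are invented for illustration; real systems would reweight frequent words with PMI or similar.)

    from collections import Counter, defaultdict
    from math import sqrt

    # Invented mini-corpus of "putting in / taking out" messages.
    sentences = [
        "she put the keys in the bag",
        "he took his laptop out of the bag",
        "she put the shirts in the suitcase",
        "he took the charger out of the suitcase",
        "she put the papers in the briefcase",
        "he took the pen out of the briefcase",
        "the dog chased the ball in the park",
    ]

    # Crude context vectors: count the other words each word co-occurs with.
    vectors = defaultdict(Counter)
    for s in sentences:
        words = s.split()
        for w in words:
            vectors[w].update(x for x in words if x != w)

    def cosine(a, b):
        dot = sum(a[k] * b[k] for k in set(a) | set(b))
        norm = lambda v: sqrt(sum(c * c for c in v.values()))
        return dot / (norm(a) * norm(b))

    print(cosine(vectors["bag"], vectors["suitcase"]))  # high: same put/took contexts
    print(cosine(vectors["bag"], vectors["park"]))      # lower: different contexts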
Another argument made is that the message must be perfectly resolved. This is far from true in human communication, where ambiguity is rampant in practice. People routinely clarify as they talk - implicitly, by stating their understanding, or explicitly, by asking for repetition or clarification - and sometimes only realize later that there was a misunderstanding.
Part of the skill of written communication, where the reader can't seek clarification, is to strive for clarity, describe the same concept in different ways, give examples - and, for very skilled writers, to subtly calibrate readers for the next idea in the process of communicating the previous one.
But TFA is basically right that there is a limit to how much shared background information can be inferred from utterances. One example is that technical experts are difficult for people without that technical background to understand, and mere terminology isn't enough.
Another example is observational comedy, where the audience understands even though it has never been said before. (Examples below from DNA.)
ABILENE (adj.)
Descriptive of the pleasing coolness on the reverse side of the pillow.
Now, this effect perhaps could be inferred from some characters in stories flipping their pillow on a hot night, or might require a simulation of physics and physiology - a big ask from utterances!
KENTUCKEY (adv.)
Fitting exactly and satisfyingly.
The cardboard box that slides neatly into an exact space in a garage, or the last book which exactly fills a bookshelf, is said to fit 'real nice and kentuckey'.
Some words are funny because of how they sound or how they feel in the mouth. Again, that's difficult to infer if you have no mouth. I think "kentuckey" is apt in the above because of "tuck", and "ken" somehow makes it funny, but I'm not sure why (maybe a statistical method would know why?). I do see why the southern phrase at the end is funny.
> evaluating the output of some NLP system regarding the above tasks is subjective [...] However, language understanding does not admit any degrees of freedom. A full understanding of an utterance or a question requires understanding the one and only one thought that a speaker is trying to convey
I'm not understanding how NLU would be inherently more objective than something like the given example of named-entity recognition.
> None of the above challenging semantic understanding functions can be ‘approximately’ or ‘probably’ correct – but absolutely correct. In other words, we must get, from a multitude of possible interpretations of the above question, the one and only one meaning that, according to our commonsense knowledge of the world, is the one thought behind the question some speaker intended to ask. In summary, then, true understanding of ordinary spoken language is quite a different problem from mere text (or language) processing where we can accept approximately correct results – results that are also correct with some acceptable probability.
Why would only absolute accuracy be acceptable? Even humans misinterpret things with some frequency. In many cases, the text just won't contain enough context for anyone to give anything more than an educated guess at the writer's intentions.
I'd further argue that language is lossy and thoughts aren't discrete on the scales relevant to language, such that even in the best of cases we don't interpret exactly the same thought as the speaker has in mind - just something that's approximately close enough to allow for useful communication.
> The equivalence between (machine) learnability (ML) and compressibility (COMP) has been mathematically established.
Hopefully the article doesn't take "learning can be viewed as compression of information" to mean "all that a trained model can do is compression of its inputs".
> Natural language understanding, on the other hand, and due to MTP, requires intelligent ‘uncompressing’ techniques that would uncover all the missing and implicitly assumed text. Thus, machine learning and language understanding are incompatible – in fact, they are contradictory.
Aargh!
> Note that antonyms/opposites such as ‘small’ and ‘big’ (or ‘open’ and ‘close’, etc.) occur in the same contexts with equal probabilities. As such, (1a) and (1b) are statistically equivalent
If you're working with a Markov chain maybe.
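To make that concrete: once a model conditions on the whole sentence, the two antonyms stop being equiprobable. A quick sketch, assuming the HuggingFace transformers package and the bert-base-uncased checkpoint are available; whether the model ranks the intended antonym first is an empirical question, but the scores certainly won't be equal.

    from transformers import pipeline

    # Masked-language-model scoring: ask BERT which word fits the blank.
    unmasker = pipeline("fill-mask", model="bert-base-uncased")

    prompt = "The trophy would not fit in the suitcase because it was too [MASK]."
    for candidate in unmasker(prompt):
        print(f"{candidate['token_str']:>10}  {candidate['score']:.3f}")

    # Under a bag-of-words/Markov view, 'big' and 'small' would be interchangeable
    # here; a contextual model assigns them clearly different scores.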
> In ML/Data-driven approaches there is no type hierarchy where we can make generalized statements about a ‘bag’, a ‘suitcase’, a ‘briefcase’ etc. where all are considered subtypes of the general type ‘container’. Thus, each one of the above, in a purely data-driven paradigm, are different and must be ‘seen’ separately in the data. If we add to the semantic differences all the minor syntactic differences to the above pattern (say changing ‘because’ to ‘although’ — which also changes the correct referent to “it”) then a rough calculation tells us a ML/Data-driven system would need to see something like 40,000,000 variations of the above to learn how to resolve references in sentences such as (2).
Machine learning models don't need to see every possible combination to be able to generalise, and they can learn that bags/suitcases/etc. are related.
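And the bag/suitcase/briefcase relatedness is exactly what off-the-shelf distributional vectors already encode, without any hand-built type hierarchy. A sketch assuming gensim and its downloadable "glove-wiki-gigaword-50" vectors (fetched on first use):

    import gensim.downloader as api

    # 50-dimensional GloVe vectors trained on Wikipedia + Gigaword.
    vectors = api.load("glove-wiki-gigaword-50")

    print(vectors.similarity("bag", "suitcase"))        # relatively high
    print(vectors.similarity("suitcase", "briefcase"))  # relatively high
    print(vectors.similarity("suitcase", "democracy"))  # much lower

    # Nearest neighbours of 'suitcase' are other luggage/container words, learned
    # purely from co-occurrence statistics, with no hand-built type hierarchy.
    print(vectors.most_similar("suitcase", topn=5))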
For this and the rest of the article, I keep feeling like I must be (ironically) misinterpreting the author. There are repeated bold theoretical claims about what machine learning will never be able to do, yet they concern tasks that modern models are already, to some degree, capable of.
An armchair effort to redefine the goalposts and judge NLP, but the proof is in the pudding. For now, large language models are the best flavor. NLP models are already useful even at this early stage.
In case you were highlighting the person's h-index or citation count, there are a few shortcomings that these metrics suffer from [1], making it very hard to judge scientific impact.
EDIT:
Specifically, self-citation is the biggest issue, though often there are very good reasons to do it. For example, if you are working on a new area there might not be a lot of work other than yours.
However, mindless self-citation grows the number of citations by O(N) per paper, as your n-th paper cites the (n - 1) papers before it. The total citation count therefore grows as O(N^2).
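A quick arithmetic check on that: if paper n cites the (n - 1) papers before it, the running total after N papers is 0 + 1 + ... + (N - 1) = N(N - 1)/2, i.e. quadratic in N.

    def total_self_citations(n_papers):
        # Paper k cites the (k - 1) papers before it.
        return sum(k - 1 for k in range(1, n_papers + 1))

    for n in (10, 20, 40):
        print(n, total_self_citations(n), n * (n - 1) // 2)  # the two totals agree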
I look forward to when something like Siri or Google Assistant is hooked up to GPT-3. The current voice assistant ML systems are useless for anything but the most basic of tasks.
They are certainly useful, but it seems quite plausible that they could continue being useful without ever "solving" the problem, in the general sense.
I don't think "Is NLP useful?" is the question they're trying to address. Hammers are useful, even in a world where there are lots of screws. NLP is useful, even in a world where there is lots of language (in English, even) that current NLP systems don't do well on.