The actual wide narrative is that the current language models hallucinate and lie, and there is no coherent plan to avoid this. Google? Microsoft? This is a much less important question than whether or not anyone is going to push this party-trick level technology onto a largely unsuspecting public.
I think you're focusing on a few narrow examples where LLMs are underperforming and generalising about the technology as a whole. This ignores the fact that Microsoft already has a successful LLM-based product in the market with GitHub Copilot. It's a real tool (not a party-trick technology) that people actually pay for and use every day.
Search is one application, and it might be crap right now, but for Microsoft it only needs to provide incremental value; for Google it's life or death. Microsoft is still better positioned in both the enterprise (Azure, Office 365, Teams) and developer (GitHub, VS Code) markets.
Copilot mostly spews distracting nonsense, but when it's useful (like with repetitive boilerplate where it doesn't have to "think" much) it's really nice. But if that's the bar, I don't think we're ready for something like search, which is much more difficult and important to get right for the average person to get more good than harm from it.
Few people seem to know this, but you can disable auto-suggest in Copilot, so it only suggests things when you proactively ask it to. I only prompt it when I know it will be helpful and it's a huge time saver when used that way.
Sometimes, Copilot is brilliant. I have encountered solutions that are miles better than anything I had found on the internet or expected to find in the first place.
The issue involved heavy numerical computation with numpy, and it found a library call that covered my issue exactly.
I've had similar experiences. Sometimes it just knows what you want and saves you a minute searching. Sometimes way more than a minute.
But I find it also hallucinates in code, coming up with function calls that aren't in the API but sound like a natural thing to call.
Overall it's a positive though: your other coding tools make it pretty easy to tell when a suggestion is for something made up, and the benefit of it filling in your next little thought is very real.
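For example, something in the spirit of the sketch below: the first call is deliberately made up (numpy has no such function), which is exactly the kind of suggestion I mean, and the tooling catches it immediately.

    import numpy as np

    # The kind of suggestion an assistant might hallucinate: it reads naturally,
    # but numpy has no such function, so the interpreter/IDE flags it right away.
    # smoothed = np.moving_average(signal, window=5)  # AttributeError in real numpy

    # One real way to get a simple moving average with plain numpy:
    def moving_average(signal, window=5):
        kernel = np.ones(window) / window
        return np.convolve(signal, kernel, mode="valid")

    signal = np.sin(np.linspace(0, 10, 100))
    print(moving_average(signal).shape)  # (96,)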
Google's search results are pretty terrible. I actually have a hard time telling which is a result and which is an ad anymore tbh. I really don't think the bar is that high.
That's what I find so funny. Again, UX innovation over LLMs is what makes ChatGPT so hot right now, like Hansel, but I mean, the product is tragically flawed, like all LLMs at the moment.
I believe that's because people are using it wrong. Asking for facts is its weakness. Aiding creativity and, more narrowly, productivity (by way of common-sense reasoning) is its greatest strength.
The product Microsoft is showing off is a fact-finding engine. Just look at the demo: they have built the AI model into the search experience, and the demo shows it used exclusively to provide (supposedly) factual information [0]. It's not the users' fault that companies are building the wrong product.
To be clear, I do include Microsoft here in the audience of "using it wrong."
It's also very tempting, when you get coherent text out, to believe it. Hopefully the underlying tech will get better and/or people will understand its weaknesses... except that the inability to recognize clear misinformation gives me pause.
I think there's also potential value in being able to give feedback on results. If I try to search for something on Google right now and it doesn't give me what I want, my only options are to try a different query or give up. This puts the onus on me to learn how to "ask" properly. On the other hand, using something like ChatGPT and asking it a question gives me the option to tell it "no, you got this part wrong, try again". This isn't necessarily useful for all queries, but some queries might have answers that you can verify easily.
Over the weekend, I was shopping for laptops and tried searching "laptops AMD GPU at least 2560x1440 resolution at least 16 GB RAM", and of course Google gave all sorts of results that didn't fit those criteria. I could use quotes around "16 GB RAM", but then some useful results might get excluded (e.g. a table with "RAM" or even "Memory" in one column and "16 GB" in another, or a laptop with a higher resolution like 4K), and I'd still get many incorrect results (e.g. an Amazon page for a laptop with 1920x1080 resolution and then a different laptop in "similar options" with 2560x1440 resolution but an Nvidia GPU).

I decided to try using ChatGPT to list me some laptops with those criteria; it immediately listed five correct models. I asked for five more, and it gave one correct option and four incorrect ones, but when I pointed out the mistakes and asked for 10 more results that did fit my criteria, it was able to correctly do this. Because I can easily verify externally whether a given laptop fits my criteria or not, I'm not at risk of acting on false information.

The only limitation is that ChatGPT currently won't search the internet and has data limited to 2021 and earlier. If it had access to current data, I think there would be a lot of places where it would be useful, especially given that it wouldn't necessarily replace existing search engines, but complement them.
I would argue this would be better done by Google or someone else specializing in faceted search over structured data. GPT may smooth over results that are coded as near misses (e.g. USB vs USB3), but as you said it gave you nearly half incorrect data. There are also ways, with something like Toolformer, that it could call the right APIs and maybe interpret the data, but as they stand, LLMs aren't the right tech for fetching data like this.
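Roughly what I mean by faceted search over structured data; the laptop records and field names below are made up purely for illustration.

    # Hypothetical structured laptop records; fields are illustrative only.
    laptops = [
        {"model": "A", "gpu_vendor": "AMD", "res": (2560, 1440), "ram_gb": 16},
        {"model": "B", "gpu_vendor": "NVIDIA", "res": (2560, 1440), "ram_gb": 32},
        {"model": "C", "gpu_vendor": "AMD", "res": (1920, 1080), "ram_gb": 16},
        {"model": "D", "gpu_vendor": "AMD", "res": (3840, 2160), "ram_gb": 16},
    ]

    # Facets are hard constraints, so "at least" really means at least,
    # with no fuzzily matched near misses.
    def matches(laptop):
        return (laptop["gpu_vendor"] == "AMD"
                and laptop["res"][0] >= 2560 and laptop["res"][1] >= 1440
                and laptop["ram_gb"] >= 16)

    print([l["model"] for l in laptops if matches(l)])  # ['A', 'D']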
Most of OP's dissatisfaction with shopping for laptops on Google stems from query-understanding failures. Google needs to understand that the challenge is for them to evolve their UX so that users can intuitively tune their searches in a natural manner.
A pure LLM approach is going to quickly lose its novelty, as reprompting is frustrating and it is difficult to maintain a long-lived conversation context that actually 'learns' how to interact with the user (as a person would).
Maybe, but I think the point of all this discussion is that Google _hasn't_ done something like this. It's not an unreasonable take that their lack of progress on this front is exactly why solutions like this are noticeable improvements in the first place. Sure, Bing AI isn't better than Google with ChatGPT, but the fact that it's a discussion at all is a sign of how far Google has fallen; if we're setting the bar at the same place for both Microsoft and Google for search products, then Google has already lost their lead, and that's a story on its own.
Agreed: it fits areas where correctness can be sub-90%. But when a BA is making business docs, do they want creativity in querying factual data for a report they are creating? And how tempting is it to use it for more than just "creative" tasks?
Microsoft have the advantage that people saw GPT-2, GPT-3, and ChatGPT and how those models progressed and improved. Bard is Google's first public AI product, so it looks like GPT-2, while Microsoft are teasing GPT-4. People will assume that Google are a long way off fixing the accuracy problem because there isn't any trajectory or iteration, while they believe Microsoft will crack it quite soon because they've already seen how things can change.
There's a lesson for founders in this. If you develop in secret and try to launch a perfect product, then anything less than perfect is unforgivable. If you launch early with something that has obvious problems, people will forgive them because they see the potential and trust you to fix them.
>people will forgive them because they see the potential and trust you to fix them.
That seems very optimistic to me. Having seen Siri, Google Assistant, Cortana, and Alexa, I trust that changes will be made, and some of them will even be positive, but it will generally be a net negative until they are completely irrelevant.
Notice how neither of these announcements mention their digital assistants getting an upgrade to be less garbage.
I think your "wide" narrative is actually still a very narrow narrative.
The actual wide narrative is this: yes, the models lie and hallucinate, but people are realizing now that this is essentially what every human AND website already does! Every human presents their "facts" and "viewpoints" as though they know the whole truth, but really they are just parroting whatever talking points they got from the BBC or The Guardian or Fox News, and all of those journalists are just using other sources with their own biases and inaccuracies. Basically, it's bullshit and inaccuracies all the way down!
I was chatting with my friends over dinner last night about ChatGPT, and we concluded that while it does have inaccuracies, it's still better than asking humans for information and still better than the average Google SEO-spam website. That is, what makes us think a random human-made website about, say, space travel is more or less accurate than ChatGPT, or than what our friend Bob thinks about space travel?
The truth is, most of the information we receive on a daily basis is inaccurate or hallucinated to some degree, we just have gotten "used" to taking whatever the BBC or Bloomberg or ArsTechnica says as "the truth."
I'd strongly disagree with the idea that ChatGPT is essentially as trustworthy as humans and human-generated content because humans occasionally bullshit and misrepresent reality.
You can rationalize what people are saying based on their experiences, opinions, and backgrounds. You can engage in the Socratic method with people, to unpack where their claims come from and get to the grounding "truth" of primary experience.
You can't do any of these things with ChatGPT, because ChatGPT isn't grounded – it goes in circles at a level of abstraction where truth doesn't exist.
The reality, as usual, is that Google is far, far ahead of the other companies. Just like Waymo is ahead of Tesla, and DeepMind is ahead of OpenAI by miles.
They even have a far more advanced language model, they just don't release it publicly. It's scary how far Google is ahead, with its army of PhDs. They're the ones that pioneered the papers and techniques that OpenAI used -- but they did it 5 years ago.
This is just Google being Google ... they sunsetted Reader when it was popular, went through like 20 different chat products (GMail Chat, Hangouts, Google Meet, etc. etc.) and cannibalized their own projects. But as far as technology, they've got AI they're not disclosing to the world yet.
I'm not sure if you're being serious or not, but I'm going to assume serious and write a serious response.
Having PhDs or doing research does not equal the ability to create products people want to use and/or pay for. History has shown us time and time again there are some people who are amazing at creating original and groundbreaking research and other people who are amazing at turning research into money making, people-pleasing products. See all the research that came out of Xerox PARC which ended up doing nothing for Xerox and everything for Apple (and other companies).
Google has been spending a fortune on research in AI for 15+ years and, if anything, the company's main product (Search) has only gotten worse! They have been second-best, third-best, or moribund in mobile phones, cloud computing, video games, social media, and many others I've forgotten.
Now I'm not sure what the moral of the story here is but I can say it definitely isn't that doing the most research equals success because it clearly isn't that! I'd say it's probably a culture issue and also a motivation issue (which are clearly related). You're sitting on a money printing machine, employees all earning 350k+ per year, in an office filled with bean-bags, gourmet food, and living a chill life with nice and comfortable working hours... where is the motivation and drive to try and build a really innovative and amazing new product? "Sounds like a lot of work man..." It surprises me little that OpenAI beat them to the punch.
Even if those benchmarks showing their models are X% above SoTA actually translate into qualitatively significant improvements (which they probably would), Google still has the most to lose and little to gain even if they do release widely. Their best-case outcome is to not lose any users, and maybe gain back a few % of the users they've lost in recent years. Search result quality has been declining noticeably for a while now, and users want something different.
I'm bearish on Google, but you're correct and shouldn't be in light gray text. Bard is a smaller version of LaMDA, not PaLM, so we know for sure they had a much more advanced model some time ago.
If Google can't get things out of the lab and into products that accrue value to their business, they'll go the way of Xerox PARC: a legacy of research innovations that others successfully capitalized on.
For many that may be a laudable end goal. For shareholders though it’s probably a tough pill to swallow.
Innovator's dilemma. Search has huge margins because it is cheap to run; LLMs are not. Google gets 80% of its revenue from search, and Microsoft is forcing them to put a dent in those margins while laughing their asses off. Or, as Nadella said in an interview, "we made Google dance".
There is a project called Stable Attribution which can tell you what training-set sources were used for generating your image. The same tech applied to ChatGPT results would let it operate like a traditional search engine (and make it easier to filter out hallucinated factoids or citations).
Stable Attribution is highly misleading: it does NOT tell you which images in the training set were used to generate your image. It shows you images in the training set that are most visually similar to the image that you show it.
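To make that concrete: tools like this are essentially doing nearest-neighbour search over image embeddings, which answers "what in the training set looks most like this output", not "what was used to produce it". A rough sketch with random stand-in embeddings (the real thing would use something like CLIP features):

    import numpy as np

    # Random stand-ins for precomputed image embeddings, purely for illustration.
    rng = np.random.default_rng(0)
    training_embeddings = rng.normal(size=(1000, 512))  # 1000 training images
    generated_embedding = rng.normal(size=512)          # the generated image

    # Cosine similarity of every training image against the generated one.
    norms = np.linalg.norm(training_embeddings, axis=1) * np.linalg.norm(generated_embedding)
    sims = training_embeddings @ generated_embedding / norms

    # "Attribution" here is just the top-k most visually similar training images;
    # it says nothing about which images actually influenced the generation.
    top_k = np.argsort(sims)[::-1][:5]
    print(top_k, sims[top_k])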
Stable Diffusion 2 supports negative prompt weights, and amusingly you can give it a negative prompt of "weird looking hands" and it will generate much better hands!
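If anyone wants to try that programmatically, here's a minimal sketch of a plain negative prompt (not per-token weighting) with the Hugging Face diffusers library; it assumes diffusers, torch, and a CUDA GPU are available, and the model id and prompts are just examples.

    import torch
    from diffusers import StableDiffusionPipeline

    # Load a Stable Diffusion 2.x checkpoint (assumes GPU + diffusers installed).
    pipe = StableDiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
    ).to("cuda")

    # The negative prompt steers generation away from the listed concepts.
    image = pipe(
        prompt="portrait photo of a person waving at the camera",
        negative_prompt="weird looking hands, extra fingers, blurry",
        num_inference_steps=30,
    ).images[0]
    image.save("waving.png")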
First of all, the "citing sources" needs to be out-of-band from the regular response, since we've already seen that these systems are entirely comfortable inventing fictitious sources.
Second, if I give you a wonderfully written paragraph or two about the history of the British Parliament, and 3 links that supposedly back me up, how likely are you to check the links? Because that's what is actually going to happen. The LM will not "cite" a source; it will provide one out-of-band, and your motivation to read it will need to be high, which is unlikely given the apparent quality of the LM's own answer.
You seem to be talking about ChatGPT, not Bing Chat. Bing Chat literally uses the search engine queries and links to those sources. I have seen its summaries include mistakes, but I have never seen it invent sources (I've tried maybe 500 chat queries).
It's ironic that you're very confidently presenting erroneous information here. I'd really recommend trying the actual product, or at least looking at the demos. It has some problems. It does not have the same problems that ChatGPT does, because it does not rely solely on the LLM's baked-in data.
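For anyone unfamiliar with the distinction being drawn here, the retrieval-grounded pattern looks roughly like the sketch below; web_search and llm_complete are stand-in placeholders, not any real Bing or OpenAI API.

    def web_search(query):
        """Placeholder for a real search backend; returns title/url/snippet dicts."""
        return [{"title": "Example page", "url": "https://example.com", "snippet": "..."}]

    def llm_complete(prompt):
        """Placeholder for a call to a language model."""
        return "Answer text, with citations like [1]."

    def grounded_answer(question):
        # 1. Retrieve live documents instead of relying only on baked-in training data.
        results = web_search(question)
        sources = "\n".join(
            "[{}] {} ({}): {}".format(i + 1, r["title"], r["url"], r["snippet"])
            for i, r in enumerate(results)
        )
        # 2. Ask the model to answer only from those sources and cite them by number.
        prompt = ("Using only the numbered sources below, answer the question and "
                  "cite sources by number.\n\nSources:\n" + sources +
                  "\n\nQuestion: " + question)
        return llm_complete(prompt)

    print(grounded_answer("When did Bing Chat launch?"))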
They will, because, as so often, perfection and unintended consequences matter less than "having the new tech", both for the companies and for part of the users.
Most importantly, while both AIs sometimes get very important and fundamental things wrong, they get enough right to be a lot of help with some tasks.
> Most importantly, while both AIs sometimes get very important and fundamental things wrong, they get enough right to be a lot of help with some tasks.
that's only true if you can easily and cheaply identify when they are correct. But at this time, and for the foreseeable future, that's not the case.
That's also true if the fallout from wrong usage costs less than the savings, which for huge corporations is often the case, even in situations where it shouldn't be.
Also, you don't have to use this tech just to "get the truth"; you can use it to generate things where it's sometimes rather simple to identify and fix mistakes, and where, due to human error, you have an "identify and fix mistakes" step anyway.
Do most of these kinds of usages likely have a negative impact on society? Sure. But for adoption that doesn't matter, because "negative impact on society" doesn't cost the main adopters money, or at least not more than they make from it.
I said it will happen because you can make/save money with it.
And it's not just big corporations; it's already good enough, from a financial point of view, to be used for many applications by companies of all kinds and sizes.
And "it makes money so people will do it" is a truth we have to live with as long as we don't fundamentally overhaul our economic system.
>The actual wide narrative is that the current language models hallucinate and lie
So do people. And ChatGPT is a whole lot smarter than most people I know. It's funny that we've blown the Turing test out of the water at this point, and people are still claiming it's not enough.
The comparison is not against "most people" though. When we search the web we usually want an answer from an "expert" not some random internet poster. If you compare ChatGPT even to something like WebMD, well, I'll trust the latter over ChatGPT in an instant.
It's no better for other domains either. It can give programming advice, but it's often wrong in important ways and so no, I'd rather have the answer verified by an actual developer who knows whatever technology I'm asking about.
And finally, when you talk about what is "enough", I'd ask "for what?" This is what people in this thread are saying: that ChatGPT is not enough for the majority of what people wish it to be, but it may be enough for some tasks, such as a creative-writing aid or other human-in-the-loop tasks.
Also, ChatGPT isn't smart. It is very good at stringing words together, a facility which, when over-developed in humans, is not generally termed "smart". We reserve that sort of term for the capacity to reason, something ChatGPT has absolutely no capability for.
You and I live in a rarefied world. It's far better at writing than most people I know, and I suspect it's (sadly) better at logic/math than many people, which, I fully admit, isn't saying much.
They are being integrated into the most widely used information retrieval systems (search engines). It's not enough that they are "smarter than most people"; they have to always be correct when the question asked of them has a definitive answer, otherwise they are just another dangerous avenue for misinformation.
Yes, not all questions have definitive answers, which is fine; then you can argue that they are better than going to the smartest human you know, and that might be enough. Although I personally would still disagree with this argument, since I think it's better that the answer provided is "I don't know".
We have not blown the Turing Test out of the water. I guarantee you that out of two conversations, I can tell which one is ChatGPT and which is human 95%+ of the time. (even leaving aside cheap tricks like asking about sensitive topics and getting the "I am a bot!" response)
The Turing test originated in the 1950s. The goal posts haven't moved much in 70 years. The development of these new language models is revealing that, as impressive as the models are at generating language, it is possible that the Turing test was misconceived if the goal was to identify AGI.
At the time, it was inconceivable that a program could interact the way that ChatGPT does (or the way that DALL-E does) without AGI. We now know that this is not the case, and that means it might finally be time to recognize that the Turing test, while a brilliant idea at the time, doesn't actually differentiate in the way that we want it to.
70 years without moving the goal posts is, frankly, pretty good.
Au contraire, the whole history of AI is one of moving goal posts. One professor I worked with quipped that a field is called AI only so long as it remains unsolved.
Logic arguments and geometric analogies were once considered the epitome of human thinking. They were the first to fall. Computer vision, expert systems, complex robotic systems, and automated planning and scheduling were all Turing-hard problems at some point. Even Turing thought that chess was a domain which required human intellect to master, until Deep Blue. Then it was assumed Go would be different. Even in the realm of chat bots, Eliza successfully passed the Turing test when it was first released. Most people who interacted with it could not believe that there was a simple algorithm underlying its behavior.
> One professor I worked with quipped that a field is called AI only so long as it remains unsolved.
Not just one professor you worked with, this has been a common observation across the field for decades.
But the deeper debate about this is absolutely not about moving goal posts, it is about research revealing that our intuitions were (and thus likely still are) wrong. People thought that very conscious, high-cognition tasks like playing chess likely represented the high water mark of "intelligence". They turned out to be wrong. Ditto for other similar tasks.
There have been people in the AI field, for as long as I've been reading pop-sci articles and books about it, who have cautioned against these sorts of beliefs, but they've generally been ignored in favor of "<new approach> will get us to AGI!". It didn't happen for "expert systems", it didn't happen for the first round of neural nets, it didn't happen for the game playing systems, it didn't happen for the schedulers and route creators.
The critical thing that has been absent from all the high-achieving approaches to AI (or some subset of it) thus far is that the systems do not have a generalized capacity for learning (both cognitive learning and proprioceptive learning). We've been able to build systems that are extremely good at a task; we have failed (thus far) at building systems which start out with limited abilities and grow (exponentially, if you want to compare it with humans and other animals) from there. Some left-field AI folks would also say that the lack of embodiment hampers progress towards AGI, because actual human/animal intelligence is almost always situated in a physical context, and that for humans in particular, we manipulate that context ahead of time to alter the cognitive demands we will face.
Also, most people do not accept that Eliza passed the Turing test. The program was a good model of a Rogerian psychotherapist, but could not engage in generalized conversation (without sounding like a relentlessly monofocal Rogerian psychotherapist, to a degree that was obviously non-human). The program did "fool" people into feeling that they were talking to a person, but in a highly constrained context, which violates the premise of the Turing test.
Anyway, as is clear, I don't think that we've moved the goal posts. It's just that some hyperactive boys (and they've nearly all been boys) got over-excited about computer systems capable of doing frontal-lobe tasks and forgot about the overall goal (which might be OK, if they did not make such outlandish claims).