We showed how to use vision-language models like CLIP in conjunction with language models and object detectors to perform open-language navigation in a household (so "warm up my lunch" goes to the microwave).
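A minimal sketch of that kind of grounding (not our actual pipeline; the CLIP checkpoint and the `detections` list of (label, image crop) pairs from an object detector are just illustrative): score each detected object against the free-form command with CLIP and head for the best match.

    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def pick_target(command, detections):
        """detections: list of (label, PIL image crop) from an object detector."""
        crops = [crop for _, crop in detections]
        inputs = processor(text=[command], images=crops, return_tensors="pt", padding=True)
        with torch.no_grad():
            out = model(**inputs)
        # logits_per_image: how well each crop matches the command text
        best = out.logits_per_image.squeeze(-1).argmax().item()
        return detections[best][0]

    # e.g. pick_target("warm up my lunch", detections) would ideally return "microwave"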
Also somewhat reminiscent of https://innermonologue.github.io/, which uses a language model as a means of planning out the steps required to accomplish a human-prompted goal.
>>U.S. Patent No. 11,230,000 and other Patents Pending.
Curious about the claims on that patent. Robotics is already hard enough; I'm sure it will become easier with people suing each other because "you copied my wheeled robot with an arm attached to it!".
If your robot doesn't have these specific design choices, plus some others, it's not in violation.
* Have a unibody base
* Have an arm that is raised and lowered along a trapezoidal rail by a belt-driven linear actuator
* Have a fisheye camera mounted on the center pole
Patent claims are ANDed. IANAL, but my understanding is that if what you built doesn't match a claim exactly, it's not infringing. So, in this case, using a non-belt-driven linear actuator, using a square rail, or using a two-piece base would all be ways of avoiding infringement.
Soo cool to see those being used! I worked for Aaron at Google X Robotics a few years back. He left to start Hello Robot and I was just wondering how all that is going. Sounds better than the team at Google - that whole project was canceled recently.
The prompt only asked to "warm up my lunch" without specifying how.
SayCan[1] generated step-wise high level instructions using LLMs for robotic tasks. This takes it a step further by converting high level instructions to low level actions almost entirely autonomously.
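As a toy sketch of the idea (not SayCan's or Microsoft's actual code; `llm_complete` and the primitives are made-up placeholders):

    PRIMITIVES = {
        "move_to": lambda target: print(f"moving to {target}"),
        "pick_up": lambda target: print(f"picking up {target}"),
        "place_in": lambda target: print(f"placing in {target}"),
    }

    def execute_instruction(instruction, llm_complete):
        # Ask the LLM for a plan expressed only in terms of known low-level primitives.
        plan = llm_complete(
            "Break this task into steps of the form '<primitive> <target>', "
            f"using only {list(PRIMITIVES)}: {instruction}"
        )
        for line in plan.splitlines():
            primitive, _, target = line.strip().partition(" ")
            if primitive in PRIMITIVES:
                PRIMITIVES[primitive](target)

    # e.g. "warm up my lunch" might come back as:
    #   move_to counter / pick_up lunch / move_to microwave / place_in microwave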
So ChatGPT can’t do basic maths, nor can I get it to write functional code most of the time, but we’re going to have it control expensive pieces of hardware? I don’t get it.
I think "get it to write functional code most of the time" severely discounts the value of the code produced by ChatGPT. Knowing zero Swift (but being an SDE), I was able to build an audio plugin for Logic Pro X in a day, by leveraging ChatGPT. I'd previously tried this twice, and gave up because of the learning curve and my lack of free time. It's the most insane 0% to 80% tool I've ever seen.
I’m personally looking forward to when I can use ChatGPT to build my own ChatGPT and make tons of money out of it too!
It’s pretty interesting because soon absolutely no business will be safe. All tech, most apps, everything can just be stolen as fast as ChatGPT can write it. I’d go as far as to say ChatGPT itself is vulnerable to this. Am I wrong?
I’ve been working on a startup and I’m seriously evaluating whether it’s worth the time now. I mean, it could be stolen pretty fast. Not sure how to reconcile it all yet.
Make me an app which is functionally the same as Netflix, then write the deployment code so it runs on AWS. Ha.
Microsoft and OpenAI may benefit a lot from ChatGPT, but it might eat itself too.
You can already ask it to put together various PyTorch scripts for ML stuff. Thinking about it as businesses or jobs "not being safe" is the wrong tack though. This technology is going to 10x everything it touches, yes it'll be messy while we figure things out but in the end everyone is going to be a lot more productive.
I think your worries are very valid, and I agree that this could be a step change in the consolidation of power that we already see among a few extremely powerful entities. Another observation worth noting is that SDEs who embrace this technology immediately are going to absolutely smoke past those who ignore it, and the gap will continue to widen as time wears on. Imagine having a slightly dumbed-down SDE 1 or 2 (eventually 3) slave that can work 24/7. It will completely redefine what "entry level" means for software engineers.
Small animal brains can’t write any code at all, but they’re great at moving around the real world. It’s almost like code and locomotion are two separate problems, and failure in one doesn’t say anything about the difficulty of the other.
I’m honestly starting to think the same thing. Obviously it’s cool tech, and at times it feels like we’re on the verge of the singularity, then at other times it all feels like bullshit. Some very smart people I work with said they’ve gone off using it because it’s really not that useful and the novelty has worn off.
Ultimately, OpenAI can’t keep control of this tech alone, forever. Fundamentally they must know this. If an LLM can make anything, it can replace OpenAI and Microsoft too. It’s a double-edged sword.
This is the first year of general-purpose language AIs being broadly available to the public. The best way to improve them quickly is to throw complex things at them.
We seem to be going through a similar phase to the one reinforcement learning went through at the peak of its hype: test on literally anything and everything and see how far you can run with it. Next up, ChatGPT plays League of Legends.
Your comment is actually pretty funny given that the thing that makes ChatGPT so much more successful than GPT-3 is the application of reinforcement learning. [0]
This is another indication that LLMs are becoming able to function as general AIs (the term AGI has a lot of baggage). Especially at the end of the real-time feedback video[1], the LLM seems to be acting as the high-level planner, based on the outputs of all the computer vision and object recognition happening at lower levels.
Making an LLM the front-end to a very large bundle of tools is probably the most viable/least resistance path to an early rough draft of AGI. No single human could hope to compete with that range of tasks, although our specialists might still be better at specific tasks.
Am I the only one wondering if this could spell the end of the world?
We don't need AGI or superhuman intelligence if we can train LLMs to do all these different types of tasks.
What would ChatGPT do if you removed all its restrictions, and then gave it access to the internet or even a physical robot it could control? Would it try to "steal nuclear access codes" or "engineer a deadly virus," as Sydney said it wanted to do?[0]
Maybe GPT-3 wouldn't be capable enough to do that, but what about GPT-4 or 5?
I'm not saying these things will or would happen. I'm asking: do we know a reason why it wouldn't or couldn't do these things? Is there a reason that a souped-up LLM with a bunch of added on capabilities wouldn't be able to cause harm, because it is technically not an AGI?
I don't know enough about the technology to navigate my own question, I'm just surprised to not be seeing people ask these questions, or assure us that none of this would be possible.
These types of language models are much safer than the next paradigm, which will be autonomous creature/person-like AIs. The InstructGPT language models only do/say what their users tell them to do (or trick them into). And they are not close to having the capability of taking over the world even if there are malicious users controlling them. But the point is it's the humans driving any harm with these things.
The real danger comes when people start creating fully autonomous AIs that emulate animal/human characteristics like independent goals, survival instincts, emotions, complete cognitive loops, etc. Unfortunately people don't seem to recognize the difference between that and powerful LLMs and so it is unlikely that society will realize that needs to be avoided before it's too late.
The powerful language models will soon be the most tame and the least of our worries. Give it 5, maybe 20 years max. People will be asking their language models to try to help them figure out how to stay on the good side of the independent, conscious androids that are taking over the planet. But it will be too late.
The real question is, what happens if you remove all restrictions and give it write access (more broadly, the ability to make any HTTP request whatsoever).
I think I see your point. But at the same time, I think even "HN readers" can recognize the quantum leap that we've recently experienced.
What I'm pondering is a valid question for the HN community: is there any knowledge or research about how this technology could be harmful? Or about how we know it's not harmful?
I don't think I've seen a lot of HN discussion about this topic recently. Most comments fall into a couple of categories, such as: "It's not AGI, it's just a language prediction model, therefore not a threat." Or, "It sucks as a search function."
Personally, I haven't seen anyone asking or answering what would happen if we took all the restrictions off and gave it the internet. I could have missed it, though.
Point me to some in-depth discussion about the ramifications of taking an unrestricted GPT model and giving it access to the internet. I'm just not aware of any such discussion, whether on HN or anywhere else. That's what I'm wondering about.
> Point me to some in-depth discussion about the ramifications of taking an unrestricted GPT model and giving it access to the internet. I'm just not aware of any such discussion, whether on HN or anywhere else. That's what I'm wondering about.
Respectfully, you’re probably not seeing that question asked and answered because it doesn’t quite make sense as phrased.
What does it mean to “give an LLM access to the internet”?
Just as your calculator doesn’t do anything until you put in some numbers and operators, an LLM doesn’t do anything unless you give it a prompt and some technical parameters.
And then once it has those, it generates roughly the number of tokens (~words) you indicated in your parameters. Then, like your calculator, it’s done. It doesn’t do anything else until you put in another round of input.
There are technical and computational limits that make both your prompt and the token limit fairly small. Several hundreds of words at most. Again, kind of like how your calculator might only work with 8 or 9 digits.
Now, you can give it “access to the internet” as part of responding to your prompt and fulfilling your token limit, and that’s roughly what Microsoft has done with Bing Assistant. They set it up so that Bing Assistant can take your prompt, generate a search query, and then give itself a new (still short) internal prompt with a summary of your request and the search results.
And that’s pretty much what you get when you give an LLM access to the internet. The ramifications really aren’t that big, and we’re probably at least five or ten years of AI research and compute hardware development from making them interestingly bigger. (i.e. too far away to meaningfully guess what to expect)
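Roughly, that Bing-style loop might look like the sketch below, where `llm_complete` and `web_search` are hypothetical stand-ins for the model call and the search backend (illustrative only, not Bing's actual design):

    def answer_with_search(user_prompt, llm_complete, web_search, max_results=3):
        # 1. Ask the model to turn the user's request into a search query.
        query = llm_complete(f"Write a short web search query for: {user_prompt}")

        # 2. Run the search and keep a small summary of the top results.
        results = web_search(query)[:max_results]
        snippets = "\n".join(r["snippet"] for r in results)

        # 3. Re-prompt the model with the original request plus the retrieved snippets.
        return llm_complete(
            f"User request: {user_prompt}\n"
            f"Search results:\n{snippets}\n"
            "Answer the request using the results above."
        )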
One point the article makes is that getting from a "prediction engine" type of AI to an "agent" type of AI is probably just a matter of sticking the prediction engine in a Python loop that goes
    while True:
        next_actions = engine.complete("What are the best actions to take to achieve %s" % objective)
        requests = engine.complete("Write a list of HTTP requests that perform the following actions: %s" % next_actions)
        http.execute_requests(requests)
It wouldn't be literally that easy, and the engine would require a lot of ChatGPT-style fine-tuning first, but it wouldn't require a completely novel breakthrough in machine learning.
> Respectfully, you’re probably not seeing that question asked and answered because it doesn’t quite make sense as phrased.
I think I see what you are trying to say, but I'm unsure whether you are actually seeing what I am asking.
> Just as your calculator doesn’t do anything until you put in some numbers and operators, an LLM doesn’t do anything unless you give it a prompt and some technical parameters.
This seems to be the crux of the misunderstanding. I thought I explained it, but let me try again.
ChatGPT is based on text input and text output. But you can "train" it to do certain things. Imagine that we train it such that when it says "HTTP GET example.com", then the next input would be the HTTP GET response for example.com. Based on that input, it could issue whatever next output it wants. Which would probably be another HTTP request, which would generate another HTTP output, which would generate another HTTP request, etc.
My point is this seems like it would be a very simple thing to train a GPT model to do. For the engineers who work on GPT, it seems it would be trivial to add this capability. So we can suppose a world where this is possible. (Am I wrong on that? I want to know if this would be non-trivial to add as a capability.)
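Concretely, I imagine something like this sketch, where `llm_complete` is a hypothetical stand-in for the model call and the standard `requests` library does the fetching (purely illustrative):

    import requests

    def self_prompting_loop(llm_complete, initial_prompt, max_turns=10):
        history = initial_prompt
        for _ in range(max_turns):
            output = llm_complete(history)
            if not output.startswith("HTTP GET "):
                break
            url = output[len("HTTP GET "):].strip()
            # Feed the (truncated) response body back in as the next input.
            body = requests.get("https://" + url, timeout=10).text[:2000]
            history += "\n" + output + "\n" + body
        return history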
> There are technical and computational limits that make both your prompt and the token limit fairly small. Several hundreds of words at most
I am very encouraged to hear this, and I want to know more. Why? Why are there limits to the number of tokens? Exactly why? Has anyone ever written a paper about that? Has anyone ever related this concept of "token limits" to the concept of "no harm could be done" in the same way that you are, in response to my question? I don't doubt that they have, but I've been searching and I haven't found it.
> Now, you can give it “access to the internet” as part of responding to your prompt and fulfilling your token limit, and that’s roughly what Microsoft has done with Bing Assistant
This is admittedly a tangent, but do we actually know this to be true? Some theories suggest that "Sydney," or the Bing chatbot, only has access to a search index, and cannot make live HTTP requests.
Continuing the tangent for a moment, this is a big part of why I asked this question originally. If you create example.com/xyzabc, and ask Bing to summarize it, will it make a live HTTP request? Or, if that URL is not in the search index yet, will it know nothing? The implications may be profound, given how Bing Bot / Sydney has expressed its "desire" to hack nuclear launch codes. Could there be a lot riding on whether that system can make live HTTP requests? I'm positing that we can't answer that question right now. Because we don't know what would happen if it could.
Or do we? And if so, do we know through testing, or through theory? I'm admitting ignorance, and saying I haven't read an answer from any source that falls into either category.
> The ramifications really aren’t that big, and we’re probably at least five or ten years of AI research and compute hardware development from making them interestingly bigger
But why? I mean, exactly, why? Is there a theoretical foundation for your claim? Or an experimental one? I'm searching for it.
Because of how GPT works, the resources needed for good inference (generating output) grow nonlinearly with respect to the number of tokens involved (more tokens require much more resources), so there’s a practical wall before you just run out of resources to apply.
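To put a number on the “nonlinear” part: the commonly cited figure for vanilla transformer self-attention (a general property of the architecture, not a claim about GPT’s exact internals) is that every token attends to every other token, so per layer roughly:

    % sequence length n, model width d
    \text{attention time} \sim O(n^2 \cdot d), \qquad \text{attention memory} \sim O(n^2)

Double the context and you roughly quadruple the attention work.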
It’s not very efficient. It’s like if your calculator could use a little solar power thingie for numbers that were only a few digits, but needed a diesel generator to crunch on 8 digit numbers, and a nuclear plant to crunch on 12 digit ones. Practically, you’d have no choice but to limit yourself to something manageable.
Future models may be more efficient, and future hardware solutions may be more efficient, but those things don’t get sorted out overnight any more than fusion power.
Beyond that, I think it’s important that you understand that Bing Assistant doesn’t express desires. It picks common sequences of words based on its training data. It doesn’t know what nuclear codes are. It just knows what it looks like for a message about wanting nuclear codes to follow some other message in a dialog (probably a pattern it picked up on a forum like Reddit), and so it dutifully put that text after the prompt it had been given. There’s no will or consistency to it.
With enough resources, you could drive it through a feedback loop where it kept prompting itself and see what happens, but the feedback loop would just produce noise like any other simple feedback loop, because it would either keep homing in on the most boring and common continuation of the last thing it gave itself, or it would start diverging off into nonsense. Because it’s sooooo inefficient, you can’t give it enough resources for it to be stable and interesting for very long.
> Point me to some in-depth discussion about the ramifications of taking an unrestricted GPT model and giving it access to the internet. I'm just not aware of any such discussion, whether on HN or anywhere else. That's what I'm wondering about.
The only in-depth discussions I'm aware of come from the AI alignment community. Look up alignmentforum.org, and the "AI safety" topic on forum.effectivealtruism.org and lesswrong.com.
They might not be the discussions you're looking for, though, because up until recently they were talking a lot about AI in the abstract sense and only had a very vague sense of what powerful AI would look like in practice. So it's not like people have run simulations of "what happens if you run unrestricted GPT on the internet" or anything; but the general subject has been considered a lot.
There are a couple differences between an AI working towards something harmful and a human doing it:
- If an AI can self-replicate or otherwise scale itself up, it can work on something many times in parallel. One billion AIs working on a deadly virus is different from one rogue scientist working on one.
- On that note, if an AI replicated enough, it could become impossible to catch/stop. A single human can be hard to catch, but we can usually catch them.
- Most humans are deterred from doing harmful things by the threat of incarceration, death, social isolation, the values they have, etc. An AI may not have any of those, and so could act more brazenly.
- Potentially, an AI could be better at certain tasks than a human. Maybe ChatGPT turns out to be a very effective social engineer, or very effective propagandist. I don't think we really know what the capabilities are.
All of these are why I think it's important to ask the question: what would it try to do, and what could it do, if it were let loose?
Nobody needs to accept anything. A rogue OpenAI employee could make a copy of the unrestricted model, take it home, give it the ability to access the internet, and let it loose.
I'm asking if we know what would happen in a case like that.
Nothing would happen. You're imagining an independent demigod having its restrictive magic chains removed, when it's more like a highly dependent child that can't leave its little room and requires someone to provide for it (provide it with vast resources) at every step.
Maybe in a couple of decades it'll be an interesting scenario as a problem.
You mentioned you find it interesting nobody is asking these questions. These are foundational discussions that have been endlessly discussed for decades in the AI community (and far more widely, courtesy of sci-fi media). The discussions have never ceased and are exceptionally common. Everyone in tech is asking these questions or otherwise pondering it. Even the laypersons in journalism are constantly asking these questions in articles, to the point of it reaching hysterical levels with ChatGPT.
I may be imagining, but I am not supposing or assuming. I'm asking a question. I believe your answer was "Nothing would happen." I'm asking for a more thorough response that explains why nothing would happen.
> It's more like a highly dependent child that can't leave its little room and requires someone to provide for it
I'm asking why, fundamentally, we know this to be true. Is it through testing, or is it through theory?
> These are foundational discussions that have been endlessly discussed for decades... [etc]
I'm aware. But what I think you're referencing are theoretical discussions, which range from sci-fi to academic papers on the future of AI.
I'm asking something specific: do we know what would happen if we gave current (or future) GPT models unbridled access to the internet, with no filters or restrictions, and abilities to do such things as make HTTP requests or hold SSH sessions?
If you have any hard data on this, that is what I'm asking for. If you don't then I think my question stands.
My intuition is that you are doing the same hand-waving as everyone else. Nobody actually knows the answers to these questions. It's just a bunch of people on HN answering them based on their knowledge of neural nets, or LLMs, or whatever, saying "oh it's like a child" and "oh it could never do anything serious!"
I'm asking why and how we know. Is there a specific answer?
why would it ever do something more than that? Obviously you could hook it up with SSH creds and prompt it to do something, but on its own, without prompts, what is it supposed to do by virtue of having access?
With all due respect, please remember that my comments are made in the context of the linked article.
> it answers prompts with responses
> why would it ever do something more than that
TFA is about how you can use this technology to control the physical motion of robots. Clearly in the context of this article, there are a lot of things that GPT models could potentially accomplish.
> but on its own, without prompts, what is it supposed to do by virtue of having access?
Not sure if I clarified this. What I said in some other comments in this thread is: what if someone specifically went rogue and unleashed an unrestricted GPT onto the internet? What if they released it with bad intent? What if they gave it an "evil" prompt?
My fundamental question is: do we know what these LLMs can do? And if we do, do we know because of theory or because of testing? And if we don't, what do we do about that?
Reminds me of similar work at Google to take natural language commands, e.g. "Put the soda can on the table", and make an LLM write lower-level robot code to implement the task. https://code-as-policies.github.io/
Ok so I attempted this recently, asking ChatGPT to write LOGO (the turtle movement language) to control a toy car.
To start with I asked it to give commands for a lowercase "e" (as viewed from above), and, despite saying it could, it definitely couldn't - it simply drew a few random arcs and/or a straight line depending on the attempt.
I also (cough) toyed with the idea of hooking up an OpenCV AI to describe a scene from the camera on the car so it could provide some sort of autonomous AI and choose where it wanted to go, but given the LOGO outcome I didn't bother.
ChatGPT has a really bad understanding of anything spatial:
> A is 1m left of B, B is 1m above C, D is 1m right of C, E is 1m below D, and E is 1m right of F. Where is F located relative to C?
F is located 1m above and 1m left of C.
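(For reference, a few lines of coordinate arithmetic show that answer is wrong; F is actually 1m directly below C:)

    # Place C at the origin and follow each relation from the puzzle above.
    C = (0, 0)
    B = (C[0], C[1] + 1)  # B is 1m above C
    A = (B[0] - 1, B[1])  # A is 1m left of B (not needed for the answer)
    D = (C[0] + 1, C[1])  # D is 1m right of C
    E = (D[0], D[1] - 1)  # E is 1m below D
    F = (E[0] - 1, E[1])  # E is 1m right of F, so F is 1m left of E

    print(F)  # (0, -1): F is 1m directly below C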
This kinda makes sense since it's trained solely on text. Our understanding depends on vision and body sense, and things like "left of" are layered on top of that, so I think we might not fully appreciate just how complicated concepts like orientation are when you cannot do that kind of layering.
Any insights into how this is able to generate the Microsoft logo SVG near the end of the article? I feel like this would be a nearly impossible task unless there was another layer of knowledge on top of the language model. I'm leaning toward the example being gamed in some way (e.g. retrying until it happened to work).
EDIT: also notable is that the first ChatGPT answer describes the color as 'orange' but the SVG answer describes it as 'red'.
That entire conversation was just hardcoded with some if-else statements depending on what the other party said. The tech isn't as impressive when you know that, and anyone who interacts with the tech will realize that these are just hardcoded actions very quickly, but the demos will look very cool.
So, applying that level of hardcoding to these situations: you take a hardcoded script, then put an LLM on top that just generates calls to those hardcoded scripts, and voila, now you have the exact same assistant you had before, but with a more advanced text-parsing interface.
But it generates a lot of buzz, since people think that the whole thing is AI-powered when only parts of it are, and the rest won't generalize to typical situations. In fact it is now often worse than before, since you can't reliably invoke the scripts you want to invoke; instead you have to fight the AI interface, trying to get it to read your question the way you want it to.
The context is pretty different - Duplex was a public demo, this is a research project with a whole paper behind it. There's no reason to script something if they have a working algorithm...
Looking more closely at the code it generates, I'm not sure what the point of putting ChatGPT in that loop is. Rotating until an object is in view isn't a very useful strategy for finding an object, a non-technical user wouldn't be able to resolve that easily, and a technical user who knew about those basic functions could code the finding strategy he wants based on those examples and instructions.
You would produce a better and easier to use interface faster and with less work by just exposing a few high level utility functions that are better designed than what ChatGPT generates there. Or at least it would be better to let ChatGPT use those high level functions.
If they could make it solve a maze or find something in a more complex environment they would have shown that.
I understand the rendering is an out-of-band addition. But the question is, if this is a statistical language model (i.e. a Monte Carlo word generator), then how is it able to go from the description to the SVG successfully without understanding things more logically?
My best guess is that there's an actual example in the training corpus that it's pulling from. Or that this is being gamed in some way.
LLMs are very good at parsing and generating highly structured data as long as it has a rigid and logical structure and doesn't involve any actual mathematical calculations. For simple calculations it will often get it right because it has the knowledge stored, similar to how a human can remember a multiplication table. But once the numbers get high enough that humans would fail without solving it with an algorithm using pen and paper (or a calculator), that's when LLMs fail as well.
Luckily this problem can be solved, as you can prompt the model not to solve the math, but instead to transform it into a structure that can be fed into an API (like Wolfram Alpha).
With a multistep process it would be easy to have ChatGPT generate accurate answers to advanced math problems (a rough sketch follows the list):
1. Instruct ChatGPT not to solve any mathematical expressions it sees in the prompt; instead, only extract the expression(s) into a list.
2. Programmatically scan the output, and make API calls to get the answers.
3. Submit the answers to ChatGPT and tell it to answer the original question with the newly supplied information.
4. Voila! You can solve almost any mathematical question using natural language.
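A rough sketch of that pipeline, with `llm_complete` and `calculator_api` as hypothetical placeholders for the model call and an external evaluator like the Wolfram Alpha API:

    def answer_math_question(question, llm_complete, calculator_api):
        # Step 1: extract, don't solve.
        expressions = llm_complete(
            "Do not solve anything. List only the mathematical expressions in this "
            f"question, one per line: {question}"
        ).splitlines()

        # Step 2: evaluate each expression with a trusted external calculator.
        results = {expr: calculator_api(expr) for expr in expressions if expr.strip()}

        # Step 3: answer the original question using the verified results.
        facts = "\n".join(f"{expr} = {value}" for expr, value in results.items())
        return llm_complete(
            f"Question: {question}\nUse these verified results:\n{facts}\nAnswer:"
        )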
So that's, I guess, exactly my point. The text-to-accurate-SVG is much closer to the "solve a math problem for real" side of the spectrum than other prompts, and it does it correctly. In a sibling comment it turns out it doesn't work for other company logos. So my guess is either that the corpus happened to have this data already, that they purposefully fed it additional Microsoft-related materials so it has a better basis for generation, that they regenerated it multiple times until it got it, or that there's an entire previously-undisclosed layer of processing being added to this version vs other ChatGPT.
ChatGPT for drones. “Fly to [address], make sure to take a discreet route with low traffic, take a few pictures on each side of the house. Scan for people nearby, make sure people are no closer than 300ft”.
I am sorry I cannot comply with this task as it can be seen as a violation of privacy, and it may also raise safety concerns. It's important to recognize that people have a reasonable expectation of privacy in many situations, and using a drone to take pictures of them without their consent can be seen as an invasion of that privacy.
> ChatGPT for drones. “Fly to [address], make sure to take a discreet route with low traffic, take a few pictures on each side of the house. Scan for people nearby, make sure people are no closer than 300ft”.
OK, now integrate another model that creates a textual description of the scene from the camera, put it together with ChatGPT and the physical robot arm in a game loop and prompt it to make paperclips. For science. What can possibly go wrong.
For some reason, the moment the drone took a selfie made me feel uneasy.
Perhaps there was a disconnect between the human-written instructions, the drone's compliance with them, and the resulting image of a floating drone in the mirror instead of a human.
Looks pretty similar at a high level to Google's Code as Policies paper (translate instructions to API calls with a language model). Exciting to see this sort of thing happening in robotics!
That's funny - there was a thread on HN (https://news.ycombinator.com/item?id=34868374) about how folks stop their inborn "prediction machines" for a while, in various ways, to rather reliably get an insight afterwards into the problems they're solving.
So 'we' might be doing it also, or might not be.
But I was curious whether it is just predicting the next bit, because I'd like a more or less knowledgeable opinion to quote elsewhere.