
It's interesting how much hand-holding the agents need to behave reasonably. Consider the prompt governing reflection:

>What 5 high-level insights can you infer from the above statements? (example format: insight (because of 1, 5, 3))

>Given only the information above, what are 3 most salient high-level questions we can answer about the subjects in the statements?

We're giving the agents step-by-step instructions about how to think, and handling tasks like the bookkeeping of memories and the modeling of the environment outside the interaction loop.
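For reference, that reflection step boils down to something like the following (a rough sketch, not the paper's actual code; llm() stands in for whatever chat-completion call is used):

    def reflect(recent_memories, llm):
        # llm(prompt) is assumed to be any text-in/text-out completion call;
        # the prompts are the ones quoted above.
        numbered = "\n".join(f"{i + 1}. {m}" for i, m in enumerate(recent_memories))
        questions = llm(
            numbered + "\n\nGiven only the information above, what are 3 most "
            "salient high-level questions we can answer about the subjects in "
            "the statements?"
        )
        insights = llm(
            numbered + "\n\nWhat 5 high-level insights can you infer from the "
            "above statements? (example format: insight (because of 1, 5, 3))"
        )
        # The insights are then written back into the agent's memory stream as
        # new "reflection" memories.
        return questions, insights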

This isn't a criticism of the quality of the research - these are clearly the necessary steps to achieve the impressive result. But it's revealing that for all the cool things ChatGPT can do, it is so helpless to navigate this kind of simulation without being dragged along every step of the way. We're still a long way from sci-fi scenarios of AI world domination.




I have a theory about this. All these LLMs are trained mostly on written text. That's only a tiny part of our brain's output. There are other things just as important, if not more so, for learning how to think: things no one has ever written down, like the most basic common sense, physics, and our inner voices. How do we get enough data to train on those? Or do we need a different training algorithm that requires less data?


If you’re looking for research along these directions, Melanie Mitchell at the Santa Fe Institute explores these areas. There are better references from her, but this is what came to mind: https://medium.com/p/can-a-computer-ever-learn-to-talk-cf47d....


LLMs can simulate inner voices pretty well. The way they've handled memory here isn't actually necessary, and there are a number of agentic GPT papers out that show this (Reflexion, Self-Refine, etc.). I can see why they did it, though (it helps a lot for control/observation).


> The way they've handled memory here isn't actually necessary

I'm curious whether there are other methods you can point to that would handle arbitrarily long sets of 'memories' in an effective way. The use of embeddings and vector searches here seems like a way to sidestep that problem, one that's both powerful and easy to understand, and easy to generalize into multi-level referencing if there's enough space in the context window.


Every method so far basically uses embeddings and vector searches. What I mean is that how the LLM processes and uses that information doesn't need to be this handholdy.
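That shared pattern is basically just this (a minimal sketch; embed() stands in for whatever embedding model you use, and a real system would swap the linear scan for a vector index):

    import numpy as np

    class MemoryStore:
        """Minimal embedding + vector-search memory (sketch only)."""

        def __init__(self, embed):
            self.embed = embed      # any text -> vector function
            self.texts = []
            self.vectors = []

        def add(self, text):
            self.texts.append(text)
            self.vectors.append(np.asarray(self.embed(text), dtype=float))

        def retrieve(self, query, k=5):
            q = np.asarray(self.embed(query), dtype=float)
            # Cosine similarity of the query against every stored memory.
            sims = [
                float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
                for v in self.vectors
            ]
            top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
            return [self.texts[i] for i in top]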


I guess we could hook those AIs into a first-person GTA 5 and see what happens. Every second, take a screenshot, feed it into facebookresearch/segment-anything, describe the scene to ChatGPT, receive input, repeat.
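Something like this loop, waving away the glue (capture_frame, segment, describe, ask_llm, and send_input are hypothetical stand-ins, not real APIs):

    import time

    def play(capture_frame, segment, describe, ask_llm, send_input):
        # All five arguments are hypothetical callables: grab a screenshot, run
        # segment-anything on it, turn the masks into a text description, query
        # the chat model, and inject the chosen input back into the game.
        history = []
        while True:
            frame = capture_frame()
            masks = segment(frame)
            scene = describe(masks)
            action = ask_llm(
                "You are playing a first-person open-world game.\n"
                f"Scene: {scene}\n"
                f"Recent actions: {history[-5:]}\n"
                "Reply with the single input to press next."
            )
            send_input(action)
            history.append(action)
            time.sleep(1)  # "every second take a screenshot"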


Someone needs to start a Twitch account or YouTube channel focused on getting AI to play games like this through things like AutoGPT and Jarvis, just to see what the hell it gets up to, what the failure modes are, and whether it can succeed.


This is known as "embodied cognition". Current approaches involve collecting data that an agent (e.g. humanoid robot) experiences (e.g. video, audio, joint positions/accelerations), and/or generating such data in simulation.

See e.g. https://sanctuary.ai


It's already multimodal, as entropy is... entropy. In sound, vision, touch and more, the essence of universal symmetry and laws gets through, such that the AI can generalize across information patterns, not specifically text -- think of it as input instead.

Try prompts like: https://news.ycombinator.com/item?id=35510705

Encode sounds, images, etc in low resolution, and the LLM will be able to describe directions, points in time in the song, etc.

These LLMs can spit out an ASCII image of text, or a different language, or code, etc. They understand the difference between a representation and the object itself.


You're not seeing this the right way. You're making the equivalent of this argument: "Look at how much hand-holding this processor needs. We had to give it step-by-step instructions on what program to execute. We are still a long way from computers automating any significant aspect of society."

LLMs are a primitive that can be controlled by a variety of higher level algorithms.


The "higher level algorithm" of "how to do abstract thought" is unknown. Even if LLMs solve "how to do language", that was hardly the only missing piece of the puzzle. The fact that solving the language component (to the extent that ChatGPT 'solves' it) results in an agent that needs so much hand-holding to interact with a very simple simulated world shows how much is left to solve.


You've been told it doesn't need that much handholding.

https://arxiv.org/abs/2303.11366

https://arxiv.org/abs/2303.17651

Why insist otherwise?


I don't understand what you intend these papers to demonstrate. Surely the fact that the level of hand-holding they propose (both Self-Refine and Reflexion offload higher-order reasoning to a hand-crafted process) is so helpful even on extremely simple tasks demonstrates that a great deal of hand-holding is required for complex tasks. That these techniques improve upon the baseline tells us that ChatGPT is incapable of doing this sort of simple higher-order thinking internally, and the fact that the augmented models still offer only middling performance on the target tasks suggests that "not that much handholding" (as you describe them) is insufficient.


Honestly, I feel like the level of, um, I guess “hostile anthropomorphism” is the best term, here is…bizarre and off-putting.

LLMs aren’t people; they are components in information processing systems. Adding additional components alongside an LLM to compose a system with some functionality isn’t “hand-holding” the LLM. It’s just building systems with LLMs as components that demonstrate particular, often novel, capacities.

And “hand-holding” is especially wrong because implementing these other components is a one-and-done task, like implementing the LLM component. The non-LLM component isn’t a person that needs to be dedicated to babysitting the LLM. It’s, like the LLM, a component in an autonomous system.


Middling performance? Do you actually understand the benchmarks you saw? Assuming you even read it. 88% of human eval is not middling lmao. Fuck, I really have seen everything.


I don't see a benchmark in either paper that shows "88% of human eval". Which table or figure are you looking at?



But this is not raw Reflexion (it's not a result from the paper, but rather from follow-on work). The project uses significantly more scaffolding to guide the agent in how to approach the code generation problem. They design special prompts including worked examples to guide the model to generate test cases, prompt it to generate a function body, run the generated code through the tests, off-load the decision of whether to submit the code or to try to refine to hand-crafted logic, collate the results from the tests to make self-reflection easier, and so on.

This is hardly an example of minimal hand-holding. I'd go so far as to say this is MORE handholding than the paper this thread is about.


For me, an unsupervised pipeline is not handholding. The thoughts drive actions. If you can't control how those thoughts form or process memories, then I don't see what is hand-holding about it. A pipeline is one-and-done.


I would say that if you have to direct the steps of the agent's thought process:

-Generate tests

-Run tests (performed automatically)

-Gather results (performed automatically)

-Evaluate results, branch to either accept or refine

-Generate refinements

etc., then that's hand-holding. It's task-specific reasoning that the agent can't perform on its own. It presents a big obstacle to extending the agent to more complex domains, because you'd have to hand-implement a new guided thought process for each new domain, and as the domains become more complex, so do the necessary thought processes.
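To make that concrete, the guided loop in question is roughly the following (a sketch, not the actual project code; the callables stand in for the prompted LLM calls and the test harness):

    def solve(problem, generate_tests, generate_body, run_tests, reflect,
              max_refinements=3):
        # Every step of the "thought process" is scripted by the pipeline
        # rather than chosen by the model itself.
        tests = generate_tests(problem)                 # generate tests
        code = generate_body(problem, feedback=None)    # generate a function body
        for _ in range(max_refinements):
            results = run_tests(code, tests)            # run tests, gather results
            if all(r.passed for r in results):          # hand-crafted accept/refine branch
                return code
            feedback = reflect(problem, code, results)  # collate failures for the LLM
            code = generate_body(problem, feedback)     # generate refinements
        return code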


You can call it handholding. Or call it having control over the direction of the LLM's 'thought'. You can train another LLM that creates the handholding pipeline steps; then that LLM-squared can be treated as a new LLM.


The pipeline doesn't really have to be task/domain specific.


I guess we just have different meanings of hand holding then.


Framing LLMs as primitives is marketing-speak. These are high-level constructions for specific runtimes, which are difficult to test and subject to change at any time.


Does a primitive definitely need to be easy to test or deterministic?


Hah. Sounds like qubits.


> We're still a long way from sci-fi scenarios of AI world domination.

You only have to program the memory logic once. Now if you stick it in a robot that thinks with ChatGPT and moves via motors (think of those videos we’ve seen), you have a more or less independent entity (running off the innards of six 3090s or so?).


But it's not so simple to just "program the memory logic". The hand-holding offered here is sufficient to navigate this restricted simulated world, but what would be required to achieve increasingly complex behaviors? If a ChatGPT agent can't even handle this simple simulation without all this assistance, what hope does it have to act effectively in the real world?


> But it’s not so simple to just “program the memory logic”.

But, it is. The application domain here is fairly trivial, but the logic is both simple and highly general.

> but what would be required to achieve increasingly complex behaviors?

Basically, three things on top of this:

(1) more input adaptors to map external data into language, and

(2) a bigger context space to process more current & retrieved data simultaneously, and

(3) more output adaptors to map intentions expressed in language to substantive action.

But the basic memory/recall system seems fairly robust and general, as does the basic interaction system.


I think you're ignoring a lot of ways in which this system will not easily extend to more complex tasks.

-While the retrieval heuristic is sensible for the domain (a rough sketch of it follows this list), it's not applicable to all domains. In what situations should you favor more recent memories over more relevant ones?

-The prompt for evaluating importance is domain-specific, asking the model to rate on a scale of 1 to 10 how important a life event is, giving examples like "brushing teeth" (a specific action in the domain) as a 1, and college acceptance as a 10. How do you extend that to a real-world agent?

-The process of running importance evaluation over all memories is only tractable because the agents receive a very small number of short memories over the course of a day. This can't scale to a continuous stream of observations.

-Reflections help add new inferences to the agent's memory, but they can only be generated in limited quantities, guided by a heuristic. In more complex domains where many steps of reasoning may be required to solve a problem, how can an agent which relies on this sort of ad hoc reflection make progress?

-The planning step requires that the agent's actions be decomposable from high-level to fine-grained. In more challenging domains, the agent will need to reason about the fine-grained details of potential plan items to determine their feasibility.
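For reference, the retrieval heuristic in the first point scores each memory by a weighted mix of recency, importance, and relevance, roughly like this (a sketch; the weights and decay constant are placeholders, not necessarily the paper's exact values):

    import math

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    def retrieval_score(memory, query_embedding, now_hours,
                        w_recency=1.0, w_importance=1.0, w_relevance=1.0,
                        decay=0.99):
        # memory is assumed to carry: .embedding, .importance (the 1-10 LLM
        # rating, normalized to [0, 1]), and .last_accessed (timestamp in hours).
        recency = decay ** (now_hours - memory.last_accessed)  # exponential decay
        relevance = cosine(memory.embedding, query_embedding)  # similarity to query
        return (w_recency * recency
                + w_importance * memory.importance
                + w_relevance * relevance)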


> This can't scale to a continuous stream of observations.

My mind doesn’t scale to a continuous stream either.

While I’m typing this on my phone, 99.99% of my observations are immediately discarded, and since this memory ranks as zero, I very much doubt I’ll remember writing this tomorrow.


I did not read the original post, but your reflections are a great enrichment of what I think the post is about, so congratulations on this good addition.


They don't need that much handholding. There are a couple of memory-augmented GPT papers out now (Self-Refine, Reflexion, etc.). This is by far the most involved in terms of instructing memory and reflection.

It helps for control/observation but it is by no means necessary.


(thanks for the pointer to memory-augmented LLMs)


> We're giving the agents step-by-step instructions about how to think, and handling tasks like book-keeping memories and modeling the environment outside the interaction loop.

Sure, but this process seems amenable to automation based on the self-reflection that's already in the model. It's a good example of the kinds of prompts that drive human-like behaviour.


Pretty interesting when you take this insight into the human world. What does it mean to learn to think? Well, if we're like GPT, then we're just pattern matchers who've had good prompts and structural cueing built into us. At university I had a whole unit focused on teaching referencing like "(because of 1, 5, 3)", but more detailed.


ChatGPT is a stochastic word correlation machine, nothing more. It does not understand the meaning of the words it uses, and in fact wouldn't even need a dictionary definition to function. Hypothetically, we could give ChatGPT an alien-language dataset of sufficient size and it would hallucinate answers in that language, which neither it nor anybody else would be able to understand.

This isn't AI, not in the slightest. It has no understanding. It doesn't create sentences in an attempt to communicate an idea or concept, as humans do.

It's a robot hallucinating word correlations. It has no idea what it's saying, or why. That's not AI overlord stuff.


>ChatGPT is a stochastic word correlation machine

it seems humans might be too...?

My son is 4. When he was 2, I told him I love him. He clearly did not understand the concept or reciprocate.

I reinforced the word with actions that felt good: hugs, warmth, removing negative experiences/emotions, etc. Isn't that just associating words with certain "good inputs"?

My son is 4 now and he gets it more, but he still doesn't have a fully fleshed-out understanding of the concept of "love". He'll need to layer on more language linked with experience to get a better "understanding".

LLMs have the language part; it seems we'll link that with physical input/output plus a reward system, and... ? Intelligence/consciousness will emerge, maybe?

"but they don't _really_ feel" - ¯\_(ツ)_/¯ what does that even mean? if it walks like a duck and quacks like a duck...


> Intelligence/consciousness will emerge, maybe?

Extending that: LLM latent spaces are now some 100,000+ dimensional vector spaces. There's a lot of semantic association you can pack in there by positioning tokens in such a space. At this point, I'm increasingly convinced that, with a sufficiently high-dimensional latent space, adjacency search is thinking. I also think GPT-4 is already close to being effectively a thinking entity, and that it's more limited by the lack of an "inner loop" and a small context window than by the latent space size.

Also, my kids are ~4 and ~2. At times they both remind me of ChatGPT. In particular, I've recently realized that some of their "failure modes" in thinking/reacting, which I could never describe in a short way, seem to perfectly fit the idea of "too small context window".


You say it has no understanding. So people can communicate ideas/concepts while ChatGPT can't.

What if... what we think are ideas or concepts are in fact prompts recited from memory, which were planted/trained while we were growing up? In fact, I'm pretty sure our consciousness stems from, or is, memory feeding a (bigger and more advanced) stochastic correlation machine.

That ChatGPT can only do this with words does not mean the same technique cannot be used for other data, such as neural sensors or actuators.

ChatGPT could be trained with alien datasets and act accordingly. Humans can be trained with alien datasets.

See the convergence?


I'm pretty sure LLMs can be used on anything considered a language, including things we as humans wouldn't consider language.

Sam Harris was recently talking about using an LLM to process wireless signals to identify where humans were standing in a room. I've not looked up the paper on this, but from everything I understand, the general approach can apply to vast ranges of data.


>It's a robot hallucinating word correlations. It has no idea what it's saying, or why. That's not AI overlord stuff.

All that matters is economic and political impact. Definitions are irrelevant.



