
I don't understand what you intend these papers to demonstrate. Both Self-Refine and Reflexion offload the higher-order reasoning to a hand-crafted process, and the fact that this level of hand-holding is so helpful even on extremely simple tasks suggests that a great deal of hand-holding will be required for complex ones. That these techniques improve upon the baseline tells us that ChatGPT is incapable of doing this sort of simple higher-order thinking internally, and the fact that the augmented models still offer only middling performance on the target tasks suggests that "not that much handholding" (as you describe it) is insufficient.



Honestly, I feel like the level of, um, I guess “hostile anthropomorphism” is the best term here, is… bizarre and off-putting.

LLMs aren’t people; they are components in information processing systems. Adding additional components alongside LLMs to compose a system with some functionality isn’t “hand-holding” the LLM. It’s just building systems with LLMs as a component that demonstrate particular, often novel, capacities.

And “hand-holding” is especially wrong because implementing these other components is a one-and-done task, like implementing the LLM component. The non-LLM component isn’t a person who needs to be dedicated to babysitting the LLM. It’s, like the LLM, a component in an autonomous system.


Middling performance? Do you actually understand the benchmarks you saw, assuming you even read them? 88% on HumanEval is not middling, lmao. Fuck, I really have seen everything.


I don't see a benchmark in either paper that shows "88% on HumanEval". Which table or figure are you looking at?



But this is not raw Reflexion (it's not a result from the paper, but rather from follow-on work). The project uses significantly more scaffolding to guide the agent in how to approach the code-generation problem. They design special prompts (including worked examples) to guide the model to generate test cases, prompt it to generate a function body, run the generated code through the tests, offload the decision of whether to submit the code or to try to refine it to hand-crafted logic, collate the results from the tests to make self-reflection easier, and so on; roughly the loop sketched below.
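For anyone who hasn't looked at the repo, the control flow amounts to something like this. To be clear, every name here (`solve`, `run_tests`, the prompt wording) is mine, invented for illustration; this is a minimal sketch of the technique, not the project's actual code:

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class TestResults:
        all_passed: bool
        report: str  # collated pass/fail output, formatted for the model

    def solve(
        llm: Callable[[str], str],                     # prompt -> completion
        run_tests: Callable[[str, str], TestResults],  # executes code against tests
        problem: str,
        max_iters: int = 5,
    ) -> str:
        # A hand-crafted prompt (with worked examples, in the real project)
        # asks the model to write unit tests for the problem.
        tests = llm(f"Write unit tests for this problem:\n{problem}")

        # A separate hand-crafted prompt asks for a candidate solution.
        code = llm(f"Write a function that solves this problem:\n{problem}")

        for _ in range(max_iters):
            results = run_tests(code, tests)  # ordinary non-LLM code

            # The submit-or-refine decision is hard-coded logic, not the model's.
            if results.all_passed:
                return code

            # Collated failures are fed back so "self-reflection" has
            # something concrete to work with.
            reflection = llm(
                f"Your code failed these tests:\n{results.report}\n"
                "Explain what went wrong and how to fix it."
            )
            code = llm(
                f"Problem:\n{problem}\nPrevious attempt:\n{code}\n"
                f"Reflection:\n{reflection}\nWrite an improved version."
            )

        return code  # give up refining after max_iters rounds

Note where the "reasoning" lives: every branch point in that loop is fixed ahead of time by the programmer, and the model only fills in text at each step.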

This is hardly an example of minimal hand-holding. I'd go so far as to say this is MORE hand-holding than in the paper this thread is about.


For me, an unsupervised pipeline is not hand-holding: the thoughts drive the actions. If you can't control how those thoughts form or how they process memories, then I don't see what's hand-holding about it. A pipeline is one-and-done.


I would say that if you have to direct the steps of the agent's thought process:

- Generate tests

- Run tests (performed automatically)

- Gather results (performed automatically)

- Evaluate results, branch to either accept or refine

- Generate refinements

etc., then that's hand-holding. It's task-specific reasoning that the agent can't perform on its own, and it presents a big obstacle to extending the agent to more complex domains: you'd have to hand-implement a new guided thought process for each new domain (sketched below), and as the domains become more complex, so do the necessary thought processes.
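To make the extensibility point concrete: "a new guided thought process per domain" means hand-writing something like the following for every domain. This is purely my own illustrative sketch with invented names, not anything from the papers:

    from abc import ABC, abstractmethod

    class GuidedThoughtProcess(ABC):
        # Every step is hand-designed per domain; the LLM fills in text
        # at each step but never chooses the steps themselves.

        @abstractmethod
        def generate_checks(self, task: str) -> str:
            """Prompt the model for domain-appropriate checks (unit tests, for code)."""

        @abstractmethod
        def run_checks(self, attempt: str, checks: str) -> object:
            """Execute the checks automatically; ordinary non-LLM code."""

        @abstractmethod
        def collate(self, raw_results: object) -> str:
            """Format the results so the model can 'reflect' on them."""

        @abstractmethod
        def accept(self, raw_results: object) -> bool:
            """Hand-crafted accept-or-refine decision rule."""

    # A new domain, say formal theorem proving, means re-deriving every
    # step from scratch: what a "check" even is, how to run one, and how
    # to judge the outcome. (Stub only; deliberately left unimplemented.)
    class TheoremProvingProcess(GuidedThoughtProcess):
        ...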


You can call it hand-holding. Or call it having control over the direction of the LLM's 'thought'. You could train another LLM that creates the hand-holding pipeline steps; then that LLM-squared could be treated as a new LLM.


The pipeline doesn't really have to be task- or domain-specific.


I guess we just have different meanings of hand-holding, then.



