I put together an open source deep research implementation using the recently released OpenAI Agents SDK. I'd love for folks to try it out and share feedback, but I also wanted to share some honest reflections on the limitations of deep research and the underlying models.
For ref, the deep researcher does the following:
- Carries out initial research/planning on the query to understand the question/topic
- Splits the research topic into sub-topics and sub-sections
- Iteratively runs research on each sub-topic using a ReAct approach - this is done in async/parallel across sub-topics to maximise speed (see the sketch after this list)
- Consolidates all findings into a single report with references (I use a streaming methodology explained here to achieve outputs that are much longer than these models can typically produce - https://medium.com/@techsachin/longwriter-using-llm-agent-ba...)
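
To make the fan-out concrete, here's a minimal sketch of the parallel sub-topic step using the Agents SDK. The agent name, prompt and model are illustrative placeholders rather than the exact ones from my implementation (the real researcher agents also carry search tools):

    import asyncio
    from agents import Agent, Runner  # openai-agents SDK

    # Illustrative sub-topic researcher; instructions/model are placeholders.
    researcher = Agent(
        name="sub_topic_researcher",
        instructions="Research the given sub-topic and return findings with sources.",
        model="gpt-4o-mini",
    )

    async def research_sub_topic(sub_topic: str) -> str:
        # One ReAct-style agent run per sub-topic.
        result = await Runner.run(researcher, sub_topic)
        return result.final_output

    async def research_all(sub_topics: list[str]) -> list[str]:
        # Fan out across sub-topics concurrently to maximise speed.
        return await asyncio.gather(*(research_sub_topic(t) for t in sub_topics))

    findings = asyncio.run(research_all(["background", "current approaches", "open problems"]))
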
Reflections on the current limitations of this technology:
- Although a lot of newer models boast massive context windows, the quality of output degrades materially the more we stuff into the prompt. This is not surprising: LLMs are not deterministic and therefore aren't good at predictable data retrieval. If we're dealing with quoting exact numbers, we're better off taking a map-reduce approach - i.e. having a swarm of cheap agents/models each dealing with a smaller context/retrieval problem and stitching together the results, rather than one expensive model with huge amounts of information to process (see the map-reduce sketch after this list).
- The reasoning models off the shelf are pretty bad at thinking through the practical steps of a research task the way humans would (e.g. sometimes they'll try to brute-force a search query rather than breaking it into logical steps). Often you're actually better off chaining a bunch of cheap models rather than giving a big expensive model free rein to run whatever tool calls it wants. The latter still gets stuck in loops and goes down rabbit holes, much as we saw in the early AutoGPT days (see the second sketch after this list).
- Given the above, I've found that my deep research implementation - which applies a lot of 'divide and conquer' to address the issues above - runs almost as well with gpt-4o-mini as with o3-mini or even o1.
- What the above leads me to conclude is that calling any of this 'deep research' is somewhat misleading. It's 'deep' in the sense that it runs many iterations, but the name implies a level of accuracy that LLMs in general still fail to deliver. If your use case is one where you need a good overview of a topic, this is a great solution. If you're highly reliant on 100% accurate figures, you will lose trust. Deep research gets things mostly right - but not always. It also fails to handle the nuance of prioritising sources using heuristics an intern would be familiar with (e.g. when two sources contradict each other, relying on the one with greater authority).
- This also presents a commoditisation problem for OpenAI and other foundation model providers: if using a bigger and more expensive model takes me from 85% accuracy to 90%, it's still not 100%, and I'm stuck serving the use cases that were likely fine with 85% in the first place.
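
On the map-reduce point, here's roughly what I mean - a swarm of cheap extractors each seeing one small chunk, with a final cheap call stitching the partial answers together. The prompts and chunk size are arbitrary assumptions, not tuned values:

    import asyncio
    from agents import Agent, Runner

    # Cheap extractor that only ever sees one small chunk (illustrative prompt).
    extractor = Agent(
        name="figure_extractor",
        instructions="Extract any figures relevant to the question from the text. "
                     "Quote numbers exactly; reply 'not found' if absent.",
        model="gpt-4o-mini",
    )

    reducer = Agent(
        name="reducer",
        instructions="Combine these partial findings into one answer, keeping exact figures.",
        model="gpt-4o-mini",
    )

    async def map_chunk(question: str, chunk: str) -> str:
        # Map step: small context, predictable retrieval.
        result = await Runner.run(extractor, f"Question: {question}\n\nText:\n{chunk}")
        return result.final_output

    async def map_reduce(question: str, document: str, chunk_size: int = 4000) -> str:
        chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
        partials = await asyncio.gather(*(map_chunk(question, c) for c in chunks))
        # Reduce step: stitch the partial answers together with one final call.
        result = await Runner.run(reducer, "\n\n".join(partials))
        return result.final_output
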
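And the second sketch - a fixed chain of cheap models rather than one expensive model with an open-ended tool loop. Each stage has one narrow job and a bounded number of calls. WebSearchTool is the SDK's hosted search tool; this assumes it's available for the model you're using:

    import asyncio
    from agents import Agent, Runner, WebSearchTool

    # Three narrow stages instead of one autonomous loop (prompts illustrative).
    planner = Agent(
        name="planner", model="gpt-4o-mini",
        instructions="Break the query into 3-5 concrete search steps, one per line.",
    )
    searcher = Agent(
        name="searcher", model="gpt-4o-mini",
        instructions="Answer the single research step you are given using web search.",
        tools=[WebSearchTool()],
    )
    writer = Agent(
        name="writer", model="gpt-4o-mini",
        instructions="Synthesise the step findings into a concise, sourced answer.",
    )

    async def run_chain(query: str) -> str:
        steps = (await Runner.run(planner, query)).final_output.splitlines()
        findings = [(await Runner.run(searcher, s)).final_output
                    for s in steps if s.strip()]
        return (await Runner.run(writer, "\n\n".join(findings))).final_output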