Will AutoGPT-style AI Agents mostly work before the end of 2024?
24% chance

Right now, AutoGPT and BabyAGI don't really work. If you give them a complex task, they go off the rails and get stuck in loops, so most people prefer to use the models directly rather than an agent scaffold.

Will there be advancements in scaffolding this year or new models such as GPT-5 that solve these problems and allow them to mostly work? I will try the agents myself, and factor in whether it is common for people to use LLM agent frameworks for productivity assistant tasks.


Devin and SWE-agent are apparently doing better than RAG on coding tasks. I haven’t looked at how they work yet, but they made me update higher on this question.

The type of task I had in mind when writing this was something like “write me a summary of the current state of this industry or field of study”. The agent would figure out the best workflow, do web searches, read pages, do more web searches based on what it found, and ultimately complete the task with a better result than a single-pass web search and summarization.
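To make that workflow concrete, here is a rough sketch of the loop described above: search, read, propose follow-up searches from what was found, then summarize. It is not how AutoGPT or any other real framework is implemented. Every name in it (`research_agent`, `propose_queries`, the fake search index and pages) is hypothetical, and the search, read, and summarize steps are stubs standing in for a real search API and LLM calls. The seen-query and seen-URL sets plus the round limit are one simple guard against the "stuck in loops" failure mode.

```python
from typing import Callable, List, Set


def research_agent(
    task: str,
    search: Callable[[str], List[str]],
    read: Callable[[str], str],
    propose_queries: Callable[[str, List[str]], List[str]],
    summarize: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> str:
    """Iterative search-and-read loop (illustrative sketch only)."""
    notes: List[str] = []
    seen_queries: Set[str] = set()
    seen_urls: Set[str] = set()
    queries = [task]  # the first search is just the task itself

    for _ in range(max_rounds):
        # Keep only queries not tried before, dropping duplicates in order.
        new_queries = [q for q in dict.fromkeys(queries) if q not in seen_queries]
        if not new_queries:
            break  # nothing new to look up, so stop instead of looping
        for query in new_queries:
            seen_queries.add(query)
            for url in search(query):
                if url in seen_urls:
                    continue  # never re-read the same page
                seen_urls.add(url)
                notes.append(read(url))
        # Follow-up searches are based on what has been read so far.
        queries = propose_queries(task, notes)

    return summarize(task, notes)


# Hypothetical stand-ins for a search engine, a page fetcher, and an LLM.
FAKE_INDEX = {
    "solar industry": ["a.example"],
    "perovskite cells": ["b.example"],
}
FAKE_PAGES = {
    "a.example": "Solar is growing; see perovskite cells.",
    "b.example": "Perovskite cells are improving.",
}


def fake_search(query: str) -> List[str]:
    return FAKE_INDEX.get(query, [])


def fake_read(url: str) -> str:
    return FAKE_PAGES[url]


def fake_propose(task: str, notes: List[str]) -> List[str]:
    # A real agent would ask the model; here we follow one hard-coded lead.
    return ["perovskite cells"] if any("perovskite" in n for n in notes) else []


def fake_summarize(task: str, notes: List[str]) -> str:
    return f"{task}: " + " ".join(notes)


report = research_agent(
    "solar industry", fake_search, fake_read, fake_propose, fake_summarize
)
print(report)
```

The single-pass baseline I'm comparing against is the same code with `max_rounds=1`: one search, one read, one summary.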

Another way of framing the criteria is “Will AI agents be worth using instead of using the LLM directly for a significant portion of tasks?”

bought Ṁ30 YES

I think that advancements this year will mean that agents "work" in the sense you describe in this question, but I do not expect them to produce high quality outcomes for more challenging tasks. Reckon the problem of "going off the rails" will mostly be fixed for standard tasks.

bought Ṁ25 NO

@RemNi My ongoing experience with agent frameworks is that they simply go nuts, and spending the same amount of time actually coding a system is far easier.

I assume so, but just to make sure: does this resolve YES if new, better agent frameworks get developed on top of LLMs (as opposed to AutoGPT and BabyAGI being improved)?

Also, now that I'm really thinking about this, how do you intend to resolve the question?

@inaimathi Yes, new agent frameworks that operate similarly will count.

I'm not sure I understand what you mean by how I intend to resolve. It's fundamentally a subjective assessment of the usefulness and reliability of the agents, but let me know if you have specific questions about what would count.

@ahalekelly Yeah; that's what I was getting at (I wanted to know if you had specific metrics/test processes in mind or if it was going to be a subjective resolution).

Most problems with agent frameworks will improve with the new, much longer context lengths, so I think they will shape up. But for now the main problem seems to be that they tend to lose focus and have trouble communicating between agents.
