Note: I'll reimburse anyone the cost of adding an answer (regardless of what it is), and also reimburse any further answers that refer to specific benchmarks (e.g. AgentBench).
I'd like to find out what Manifold thinks are the 5 best benchmarks of this kind.
Ideally they should have fairly objective protocols for evaluating the agent/system under test, but since I can't define that rigorously myself, I'll let people add and vote for whatever they like.
At the end of January 2026, I'll conduct a poll to select the 5 winners.
I won't bet.
@CraigDemel this seems like something a simple calculating computer program could do, with no general intelligence involved, right?
@TheAllMemeingEye I think he might have meant solving this in polynomial time? https://en.wikipedia.org/wiki/Circuit_satisfiability_problem
@ProjectVictory At the time I wrote this, the LLMs I tried were bad at predicting the output of a single NAND gate with various inputs, even after being corrected. Which I found humorous, given how many NAND gates they incorporate.
@CraigDemel I guess it might be a necessary but certainly not sufficient condition for AGI, kinda like being able to draw specific shapes in ASCII art.
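(For what it's worth, the "simple calculating program" point is easy to demonstrate: a NAND gate is a two-line function and its entire behaviour is a four-row truth table. A minimal sketch in Python, with illustrative names of my own choosing:)

```python
# A NAND gate: outputs 0 only when both inputs are 1.
# Predicting its output is a single table lookup, no intelligence required.
def nand(a: int, b: int) -> int:
    return 0 if (a and b) else 1

# The full truth table for one NAND gate:
table = {(a, b): nand(a, b) for a in (0, 1) for b in (0, 1)}
print(table)  # {(0, 0): 1, (0, 1): 1, (1, 0): 1, (1, 1): 0}
```

So the interesting question isn't whether a program can evaluate gates, but whether an LLM can do it reliably from a text description.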
Note: I'll reimburse anyone the cost of adding an answer.
Would I get reimbursed the 1000 mana cost for adding any of the following?
"Opinion poll of Manifold userbase"
"Opinion poll of general public"
"Equal success rate at user-controlled Turing test as median biological human"
"No remaining job roles in which its performance is evaluated by superiors as lower quality than the median biological human employee in said roles"
@TheAllMemeingEye Yes. I'll reimburse you for your favourite. I'll also reimburse any specific benchmark that anyone adds, even if they add multiple (I'll clarify the deal in the description).
@singer the fact that top 5 answers resolve Yes and there's only four answers is quite funny to me. I think it shows the current state of things quite well.