
The GAIA benchmark (https://arxiv.org/abs/2311.12983) aims to test the next level of capability of AI agents.
Quoting from the paper: "GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins."
This market will resolve based on when an AI system performs as well as or better than humans on all three levels of the benchmark. I'll use the human scores from Table 4 of the paper: 93.9% on Level 1, 91.8% on Level 2, and 87.3% on Level 3.
(I'm using the conjunction of all three levels rather than the average, to be somewhat conservative about when this level of capability counts as achieved.)
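To make the conjunction criterion concrete, here is a minimal sketch of the check in Python. This isn't official tooling and the names are mine; the thresholds are just the human scores quoted above.

```python
# Minimal sketch of the resolution check described above.
# Function and variable names are illustrative; thresholds are the
# human scores from Table 4 of the GAIA paper.

HUMAN_SCORES = {"level_1": 93.9, "level_2": 91.8, "level_3": 87.3}

def beats_humans_on_all_levels(submission: dict) -> bool:
    """True only if the submission matches or exceeds the human score on
    every level (a conjunction), not merely on the average."""
    return all(submission.get(level, 0.0) >= score
               for level, score in HUMAN_SCORES.items())

# Example: strong on Levels 1 and 2 but below humans on Level 3 -> not sufficient.
print(beats_humans_on_all_levels({"level_1": 95.0, "level_2": 93.0, "level_3": 80.0}))  # False
print(beats_humans_on_all_levels({"level_1": 94.0, "level_2": 92.0, "level_3": 88.0}))  # True
```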
If a given submission was likely trained on the test set (based on my judgement), I won't consider it valid.
This market resolves based on the date of publication/submission of a credible document or leaderboard entry indicating that the corresponding performance on GAIA was reached (not the date at which the system was originally created).
Each date will resolve YES if this publication/submission takes place before that date (UTC). Otherwise NO.
(I may add more options later to provide finer date resolution.)