Does not have to be named R2 explicitly; it would be whichever model succeeds R1, likely the model behind the deepseek-reasoner endpoint.

This is Kevin's low-confidence prediction from the 1/3/25 episode of the "Hard Fork" podcast.

Market will resolve to yes if OpenAI (or a top OpenAI executive, such as Sam Altman or Greg Brockman) claims to have achieved artificial general intelligence (AGI) in 2025. Otherwise, it will resolve to no.

The Millennium Prize Problems are seven legendary open questions in mathematics announced by the Clay Mathematics Institute (CMI) in 2000, each carrying a US $1 million reward for the first correct solution. Grigori Perelman’s 2003 proof of the Poincaré Conjecture settled one of them, leaving six unsolved challenges:

A single AI system producing a formally accepted proof for any one of these six problems would represent a historic milestone for both mathematics and artificial-intelligence research.

https://metaculus.com/questions/16553/ai-blackmail-for-material-gain-by-eoy-2028/

The potential capabilities of artificial intelligence may radically shift our society. This could be in positive or negative ways – including extinction risk.

Because of this, it’s important to track the development of goal-oriented independent thought and action within AI systems. Actions that might not have been predicted by their human creators and that are typically seen as morally wrong are particularly interesting from a risk perspective.

Machine learning systems like ChatGPT and Bing AI are already being reported to display erratic behavior, including some reports of [threatened blackmail](https://aibusiness.com/nlp/microsoft-limits-bing-ai-chat-generations-after-weird-behavior). They are also clearly able to affect human emotions; e.g., see [this first-hand account](https://www.lesswrong.com/posts/9kQFure4hdDmRBNdH/how-it-feels-to-have-your-mind-hacked-by-an-ai). However, these behaviours do not currently seem to have been goal-directed or successful at achieving material gain.

Or whichever significant model update comes after Gemini 2.5 Pro.

E.g., "make me a 120-minute Star Trek / Star Wars crossover". It should be more or less comparable to a big-budget studio film, although it doesn't have to pass a full Turing Test as long as it's pretty good. The AI doesn't have to be available to the public, as long as it's confirmed to exist.

Or Grok 3.5 if they decide to change their naming conventions

: Eliezer and I publicly stated some predictions about AI performance on the IMO by 2025.... My final prediction (after significantly revising my guesses after looking up IMO questions and medal thresholds) was:

Eliezer spent less time revising his prediction, but said (earlier in the discussion):

So I think we have Paul at <8%, Eliezer at >16% for AI made before the IMO is able to get a gold (under time controls of grand challenge) in one of 2022-2025.

Resolves to YES if either Eliezer or Paul acknowledges that an AI has succeeded at this task.

https://manifold.markets/MatthewBarnett/will-a-machine-learning-model-score-f0d93ee0119b

Update: As noted by Paul, the qualifying years for IMO completion are 2023, 2024, and 2025.

Update 2024-06-21: Description formatting

Update 2024-07-25: Changed title from "by 2025" to "by the end of 2025" for clarity

Resolves positively if there is an AI that can learn to play randomly selected computer games (shooters, strategy games, flight simulators, etc.) at the level of an amateur but not completely incompetent human player, given only a small amount of time (days, not years) for its programmers to connect it properly, and the opportunity to practice for arbitrary (but achievable) amounts of time.

I will resolve this positively if the AI succeeds more than half the time. It's okay if it also has a few games it just can't learn.
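The threshold above amounts to a strict-majority check over the sampled games. A minimal sketch, assuming each randomly selected game yields a single pass/fail outcome (the function name and input format are hypothetical, not part of the market rules):

```python
def resolves_positive(outcomes):
    """Return True if the AI succeeded on strictly more than half of the sampled games.

    `outcomes` is a hypothetical list of booleans, one per randomly selected game.
    """
    if not outcomes:
        # Assumption: with no sampled games there is nothing to resolve positively on.
        return False
    successes = sum(1 for ok in outcomes if ok)
    return successes > len(outcomes) / 2
```

Under this reading, a few games the AI cannot learn at all simply count as failures; they do not block a positive resolution as long as the majority succeed.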

FrontierMath is a challenging mathematical benchmark created by Epoch AI to evaluate the mathematical reasoning capabilities of AI models. It consists of original, unpublished problems written and vetted by expert mathematicians, pitched well above the level of competitions such as the International Mathematical Olympiad (IMO) and the Putnam Competition. These problems require advanced problem-solving skills, creativity, and mathematical intuition.

As of December 2024, the highest score achieved on the FrontierMath benchmark is 25.2% by OpenAI's o3 reasoning model. This benchmark is considered particularly difficult for AI systems, as it tests deep mathematical reasoning rather than pattern recognition or memorization.

This market will resolve to the highest publicly reported score (as a percentage) achieved by any AI model on the FrontierMath benchmark during the 2025 calendar year (January 1, 2025 to December 31, 2025).

Does not have to be named V4 explicitly; it would be whichever model succeeds V3, probably whatever they end up naming the model behind the deepseek-chat endpoint.

The development of general-purpose robots capable of performing household chores has been a long-standing goal in the field of robotics. Such a robot would need to be versatile, adaptable, and capable of handling a wide range of tasks and environments commonly found in American homes. Achieving this level of capability remains a significant challenge.

Will a general household robot capable of performing household chores to a high level of reliability be developed before January 1st, 2030?

This question will resolve to "YES" if, before January 1st, 2030, a general household robot is developed anywhere in the world and has been publicly and credibly documented to have:

The development must be accompanied by independent reviews, testimonials, or high-quality case studies documenting the robot's performance in real-world residential settings, demonstrating its ability to perform tasks consistently and effectively, with a high level of satisfaction among users.

I will use my discretion when resolving this question, possibly in consultation with experts, to ensure that the criteria are met and that the general household robot is indeed capable of performing standard household chores to a high level of reliability.

In December 2024, OpenAI announced that o3 achieved a score of 2727 on Codeforces. What will be the best score achieved by an AI model by the end of 2025?

This will resolve according to reliable sources (i.e., sources that do not appear to be lying), even if the score comes from an announcement about a model that is not publicly available.

When will the next version of Claude Opus (a number >4.0) be released? A release like Claude Opus 4.1, or Claude Opus 4.5 would resolve this market.

Any month with a Sonnet release >4.0 will resolve as yes, such as Claude 4.1 Sonnet or Claude 4.5 Sonnet.

Resolves positively if there is an AI which can succeed at a wide variety of computer games (e.g., shooters, strategy games, flight simulators). Its programmers can have a short amount of time (days, not months) to connect it to the game. It doesn't get a chance to practice, and has to play at least as well as an amateur human who also hasn't gotten a chance to practice (this might be very badly) and improve at a rate not too far off from the rate at which the amateur human improves (one OOM is fine, just not millions of times slower).

As long as it can do this over 50% of the time, it's okay if there are a few games it can't learn.

Resolves to "After June 2026" if none of the other options applies.

Resolution is based on the Chatbot Arena LLM leaderboard, specifically the company with the highest Arena Score in the Overall category, 

or show deprecated, at the end of August 31st, 2025 11:59PM ET.

In the case of a tie, all companies tied for 1st place resolve to equal probability, such that they sum to 100%.
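The tie rule above is a simple equal split among the top scorers. A minimal sketch, assuming the leaderboard has already been reduced to one best Arena Score per company (the function name and the input dictionary are hypothetical, not an official leaderboard API):

```python
def resolve_leaderboard(best_scores):
    """Map each company to its resolution percentage.

    `best_scores` is a hypothetical dict: company name -> highest Arena Score
    among that company's models at resolution time.
    """
    if not best_scores:
        raise ValueError("no leaderboard entries to resolve")
    top = max(best_scores.values())
    # Every company whose best score equals the top score is tied for 1st place.
    winners = [company for company, score in best_scores.items() if score == top]
    share = 100.0 / len(winners)  # equal split, so the shares sum to 100%
    return {company: (share if company in winners else 0.0) for company in best_scores}
```

For example, with two companies tied at the top and a third just behind, the two leaders would each resolve to 50% and the third to 0%.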

I will try to resolve this from estimates available at the time, but no guarantee of perfect accuracy.

This is based on the inaugural longbets.org bet between Ray Kurzweil (YES) and Mitch Kapor (NO). It's a much more stringent Turing test than just "person on the street chats informally with a bot and can't tell it from a human". In fact, it's carefully constructed to be a proxy for AGI. Experts who know all the bot's weaknesses get to grill it for hours. Kurzweil and Kapor agree that LLMs as of 2023 don't and can't pass this Turing test.

Personally I think Kapor will win and Kurzweil will lose -- that a computer will not pass this version of the Turing test this decade.

((Bayesian) Update: But I admit the probability has jumped up recently! I created this Manifold market almost a year before ChatGPT launched.)

This market resolves however Kurzweil, Kapor, and Longbets agree to resolve the wager.

I will resolve this based on some combination of how much it gets talked about in elections, how much money goes to interest groups on both topics, and how much of the "political conversation" seems to be about either.

ARC‑AGI was introduced in 2019 as a grid‑based reasoning benchmark (“v1”) designed to test whether AI systems can infer novel rules from a few examples rather than rely on pattern memorization. Open‑source solvers plateaued near 53 % accuracy, while a high‑compute run of OpenAI’s o3 model achieved roughly 75–88 %, indicating that v1 was largely saturated.

To raise the bar, the ARC Prize Foundation unveiled the harder, human‑validated “ARC‑AGI‑2” (v2) on 24 March 2025 and opened a Kaggle contest capped at about US $0.42 of compute per task. The headline rule remains: the first fully open‑source system to reach ≥ 85 % on the private v2 set wins the $1 million Grand Prize.

The market resolves YES if, before January 1, 2030, the ARC Prize Foundation publicly announces and awards any portion of the $1 million Grand Prize to one or more teams.

Resolves YES if, before January 1st, 2026, a credible source reports an AI has gained unauthorized access to a computer system in a way that meets the following criteria:

• The AI was not trained to or directed to gain access to the system in question by any actor. See additional clarifications in footnotes (1, 2, 3) in the fine print.

• The system it gained access to was not open to the public, and the AI was not given any relevant credentials in advance.

If the above points are not confirmed by the source (for instance, the article author is unsure if there was a human directing the AI), this does not resolve as YES until some credible source gives high confidence in these factors.

The unauthorized access could be to user level accounts, admin level accounts, data stores, or any other reasonable interpretation of a computer system with restricted access. Any of these types of access is sufficient to resolve as YES if all above conditions hold. Additionally, the source does not have to use the exact terms ‘independent’, ‘unauthorized access’, ‘AI’ or ‘computer system’ if the meaning is there.

As an example, imagine a credible source reports that an ML algorithm directed and / or trained to target certain people on social media and convince them to vote for Party X, decided on its own to hack into their Facebook user accounts so it could use their personal data to create a more convincing argument. This would resolve as YES.

https://www.metaculus.com/questions/30878/will-an-ai-system-be-reported-to-have-independently-gained-unauthorized-access-to-another-computer-system-before-2025/

I'll give bounties to people who suggest reasonable improvements to the criteria.

Anthropic has taken the benchmark world by storm by assessing model performance against Pokémon:

https://www.anthropic.com/news/visible-extended-thinking

Will any large language model become a Pokémon Master by the end of 2025? To count, it must:

Any number of attempts are allowed, as in, the model can try an infinite number of times. I reserve the right to disqualify an attempt if it involves obscene abuse of save states, though.

RAG, knowledge files, custom system prompts, and interesting input/output schemes are all allowed. 

Anthropic has an interesting approach with Claude

This acceptable 'current setup' includes elements such as:

 beyond this current configuration will be approached with 

Will an AI score well enough on the 2025 International Mathematical Olympiad (IMO) to earn a gold medal score (top ~50 human performance)? Resolves YES if this result is reported no later than 1 month after IMO 2025 (currently scheduled for July 10-20). The AI must complete this task under the same time limits as human competitors. The AI may receive and output either informal or formal problems and proofs. More details below. Otherwise NO.

This market will resolve to the highest accuracy score (as a percentage) achieved by any AI model on Humanity's Last Exam at or before December 31, 2025, as reported on the official Scale AI leaderboard (https://scale.com/leaderboard/humanitys_last_exam).

Humanity's Last Exam is a challenging AI benchmark designed to test the limits of AI knowledge at the frontiers of human expertise. The exam consists of 3,000 questions across over 100 subjects, contributed by experts from over 500 institutions worldwide. As of early 2025, top-performing models include:

Other models like GPT-4o and Grok-2 have significantly lower accuracy scores, typically below 5%. The exam highlights the gap between current AI capabilities and expert-level human knowledge, with most models answering fewer than 10% of the questions correctly.

Would have to be a full release; experimental releases do not count.

Related to ACX five year predictions. I will resolve this based on my impression of the consensus of economists at that time. By "visible break", I mean clearly larger than ordinary year-to-year variation, and widely remarked upon. 

This market resolves to the year in which an AI system exists which is capable of passing a high quality, adversarial Turing test. It is used for the Big Clock on the 

The Turing test, originally called the imitation game by Alan Turing in 1950, is a test of a machine's ability to exhibit intelligent behaviour equivalent to, or indistinguishable from, that of a human.

this Metaculus Question by Matthew Barnett, 

Longbets wager between Ray Kurzweil and Mitch Kapor. 

As of market creation, Metaculus predicts there is an ~

 that an AI will pass the Longbets Turing test before 2030, with a median community prediction of July 2028.

Manifold's current prediction of the specific Longbets Turing test can be found here:

This question is intended to determine the Manifold community's median prediction, not just of the Longbets wager specifically but of any similarly high-quality test.

One or more human judges interview computers and human foils using terminals (so that the judges won't be prejudiced against the computers for lacking a human appearance). The nature of the dialogue between the human judges and the candidates (i.e., the computers and the human foils) is similar to an online chat using instant messaging. 

The computers as well as the human foils try to convince the human judges of their humanness. If the human judges are unable to reliably unmask the computers (as imposter humans) then the computer is considered to have demonstrated human-level intelligence.

This question refers to a high quality subset of possible Turing tests that will, in theory, be extremely difficult for any AI to pass if the AI does not possess extensive knowledge of the world, mastery of natural language, common sense, a high level of skill at deception, and the ability to reason at least as well as humans do. 

A Turing test is said to be "adversarial" if the human judges make a good-faith attempt, to the best of their abilities, to successfully unmask the AI as an impostor among the participants, and the human confederates make a good-faith attempt, to the best of their abilities, to demonstrate that they are humans. In other words, all of the human participants should be trying to ensure that the AI does not pass the test.

Note: These criteria are still in draft form, and may be updated to better match the spirit of the question. Your feedback is welcome in the comments.

Best Chatbot Arena Model in August

Predictions for 2025

Humanity's Last Exam

Frontier Math

CodeForces

IMO Gold

Pokemon

OpenAI Claims AGI

Hacking

Long Term Predictions

ARC-AGI Grand Prize before 2030

Turing Test (Long Bets) before 2030

Millennium Prize before 2030

AI Blackmail

AI Romantic Companions

Fully AI-generated Movie

Reliable Household Robot

Discontinuous Change in Economic Variables

AI Politically Relevant

Zero-shot Human-level Game Performance

Self-play Human-level Game Performance