Will a single model achieve superhuman performance on all OpenAI gym environments by 2025?

2026

25% chance

"OpenAI gym environments" means all environments that come with the `gym` Python package as of 2022-04-08 (this includes the MuJoCo environments, even though they require third-party software).

Note: single *model*, not single *algorithm*. If I decide the model is actually many environment-specific models stuck together (e.g. a hierarchical model that predicts the current environment and delegates to a subagent), it will not count. A single model that *learns* a hierarchy without any inductive bias pushing it towards that *would* count, however.

Apr 8, 11:38pm: For the environments that don't have a human performance benchmark, I'll accept if the model is better than SOTA from two years before the publication date.

Technical AI Timelines

My guess is that a lab could train a model that achieves superhuman performance on all OpenAI gym environments (this includes Montezuma's Revenge, so my sense is that it would be hard for most individuals to do easily), but I don't think one would. What are the odds a lab would train its multi-game RL policy on every single environment, including https://www.gymlibrary.dev/environments/toy_text/? Maybe 20%?

So then the question is whether there will be a model that was trained on Montezuma's Revenge (or can generalize to it) and also generalizes to these sillier toy text environments, or whether someone fine-tunes an open-source Atari-trained model on these sillier environments. 25%?

Then an extra 10% because there are worlds I haven't accounted for, and these tend to push towards the event happening rather than not happening.

20% + (100% - 20%) × 25% + 10% = 50%
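For anyone who wants to plug in their own numbers, here is a minimal sketch of the arithmetic above. The function and parameter names are made up for illustration. It assumes the second estimate is conditional on the first scenario not happening, and that the last term is a flat additive fudge factor, which is how the sum above treats them.

```python
def combined_estimate(p_direct: float,
                      p_indirect_if_not_direct: float,
                      p_unmodeled: float) -> float:
    """Combine the three rough guesses from the comment above.

    p_direct: chance a lab trains one model on every gym environment.
    p_indirect_if_not_direct: chance the question resolves YES some other
        way (generalization or fine-tuning), given the direct route fails.
    p_unmodeled: flat bump for scenarios not accounted for (an assumption,
        added on top rather than derived from a probability model).
    """
    # Direct route, plus the indirect route in the remaining worlds.
    p_modeled = p_direct + (1.0 - p_direct) * p_indirect_if_not_direct
    # Cap at 1.0 so a large fudge factor cannot produce an invalid probability.
    return min(1.0, p_modeled + p_unmodeled)


estimate = combined_estimate(0.20, 0.25, 0.10)
print(f"{estimate:.0%}")
```

Changing only the first guess from 20% to 40% moves the total from 50% to 65%, so the estimate is fairly sensitive to how likely a lab is to bother with every environment.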

@NoaNabeshima I think that if they can get 80% of environments then they can also get 100% of environments, and training on the extra 20% is a relatively low cost (especially if there are transfer gains, which seems likely), and in return they get to say "we solved all the environments" which has historically been a pretty big motivator for RL research.

@vluzko Gato didn't train on toy text (I think), I imagine because it doesn't seem that helpful/important.

https://arxiv.org/pdf/2205.06175.pdf

predicted NO

@NoaNabeshima or just wasn't being tracked by DM engineers

predicted NO

@vluzko I mean they did train on control environments which seem similarly trivial. shrug

@NoaNabeshima I think Gato is a little weird as an example; more central RL papers often try to solve as many environments as they can. E.g. Agent57 cared a lot about beating the sub-benchmark of "all Atari environments", and I think there might be a similar push for a "GymX" that solves all Gym environments. Not guaranteed, of course, but if I were writing a paper that solved most gym environments I would definitely spend a weekend running it on all the tiny esoteric environments too.

predicted NO

@vluzko "all Atari environments" and "all control environments" seem like more natural kinds than "all gym environments" to me. Also, as a model learns to solve more environments, I think training on the easy environments will matter less, because it'll be more obvious that your model could solve them.