Meta AI recently achieved 90th percentile Diplomacy play (no restrictions afaict): https://ai.facebook.com/blog/cicero-ai-negotiates-persuades-and-cooperates-with-people/.
Within one year, will AI be mildly superhuman at (full-press) Diplomacy in the sense of having a higher Elo rating than any human player? If Elo ratings are not available for some reason, I may accept an alternative such as winning a tournament against the best human players. I will not accept any alternative that does not involve some kind of direct, well-incentivized competition between the AI and the best human players.
I made this related market to basically express the question of "will people try to make further progress on this at all", because I think getting a decent sample of test games against humans might be more annoying than people expect.
Chess and Go AI both seem to have taken something like 10-15 years to go from 90th percentile human to mildly superhuman. Things are different now and I don't know how much that says about Diplomacy AI timelines, but it does seem like evidence for it taking more than a year.
I wonder to what extent sampling randomness affected Cicero's results. The probability of exactly 10 wins in 40 games is 0.029 given a win rate of 1/7 (which would make Cicero just an average player) and 0.144 given a win rate of 1/4, so a likelihood ratio of about 5, which is okay but leaves some room for doubt. The score they give is 25.8%, which is a little bit higher and implies some of the games were draws, which complicates this calculation.
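For anyone who wants to check the arithmetic, here is a minimal sketch of the calculation above. It assumes each game is an independent win/lose trial with a fixed win rate (a plain binomial model), which ignores draws and partial scores; the function and variable names are just illustrative.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Exactly 10 wins in 40 games under the two win rates discussed above.
p_average = binom_pmf(10, 40, 1 / 7)  # average player in a 7-player game
p_strong = binom_pmf(10, 40, 1 / 4)   # hypothetical stronger player

# How much more the data favor the 1/4 hypothesis over the 1/7 hypothesis.
likelihood_ratio = p_strong / p_average

print(f"P(10 of 40 | 1/7) = {p_average:.3f}")
print(f"P(10 of 40 | 1/4) = {p_strong:.3f}")
print(f"likelihood ratio  = {likelihood_ratio:.1f}")
```

Swapping in other win rates for the second argument of the ratio is an easy way to see how quickly the evidence weakens or strengthens under this simplified model.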
@StevenK But note that Cicero could also be better than 90th percentile. The probability of exactly 10 wins in 40 games would be about 0.02 given an underlying Cicero win rate of 40%, which for all I know might be better than any human player; I don't know where to look for the data. And a lot of the human players that did better than 25.8% in the sample got lucky themselves. So I think instead of thinking of Cicero as 90th percentile, we should think of it as somewhere unknown between the 50th and 100th percentile.
@StevenK Though apparently scoring for Cicero didn't work the way I thought: "For our experiments, games end at the end of 1908, and are scored according to the sum-of-squares scoring system, in which each player’s share of the score is proportional to the square of the number of SCs they control."
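To make the quoted rule concrete, here is a rough sketch of sum-of-squares scoring as that sentence describes it. The supply-center counts are made up for illustration (not from any real game), the function name is mine, and any special cases the paper may define beyond the quoted sentence are not handled.

```python
def sum_of_squares_shares(supply_centers: dict[str, int]) -> dict[str, float]:
    """Each player's share is their SC count squared, divided by the sum of all squares."""
    total = sum(count**2 for count in supply_centers.values())
    return {power: count**2 / total for power, count in supply_centers.items()}


# Hypothetical board at the end of 1908 (34 supply centers in total).
board = {"AUS": 0, "ENG": 5, "FRA": 10, "GER": 4, "ITA": 3, "RUS": 6, "TUR": 6}

shares = sum_of_squares_shares(board)
for power, share in shares.items():
    print(f"{power}: {share:.1%}")
```

Since the seven shares always sum to 1, the average share per player is 1/7 (about 14.3%) no matter how the board looks, so that is still the baseline for an average player under this scoring.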