Meta AI recently achieved 90th percentile Diplomacy play (no restrictions afaict): https://ai.facebook.com/blog/cicero-ai-negotiates-persuades-and-cooperates-with-people/.
Within one year, will AI be superhuman at Diplomacy, which for the purposes of this market means an Elo rating corresponding to a 90% win rate against the best human players?
Nov 22, 11:51pm:
Will AI for Diplomacy be superhuman by 2024? → Will AI for Diplomacy be strongly superhuman by 2024?
To me, this seems possible but unlikely given my very limited sense of how much people will continue to work on this. Maybe people are going to continue to push on it more than I realize though?
@StephenMalina Even if researchers put their all into it, here's what I see as the basic argument why it won't happen. Diplomacy is a 7-player game where each player starts with 2 or 3 neighbors. Because conflict is mostly a matter of "more armies wins", any pair of players can defeat any one neighboring player early on if they so choose. Whether they so choose is substantially random and depends on whim and on tactical expediency that in turn depends on the way that moves unpredictably play out. If AI players are noticeably AI and not human, it also depends on how human players feel about allying with AI players. Throughout the game, it remains the case that success depends on your opponents not allying against you. Later on, there are stalemate lines, where if an opponent controls enough territory, there's nothing you can do to force a win. So it's hard to see how a 90% win rate is possible without highly reliable superhuman psychological manipulation. I would update a lot if someone who had played a significant amount of Diplomacy thought a 90% win rate was an attainable criterion. As it stands, I think people are just betting on the words "strongly superhuman" because they analogize it to Chess or Go in a way that I don't think is right.
@StevenK Compare to whether AI will reach a 90% win rate against top human players at three player chess. No matter how good the AI is, it seems to me that sometimes its two opponents will gang up on it at key points, and to reliably avoid that, it would need a model of how the human mind responds to board positions that's deterministic enough that it can reliably steer into board positions that cause players to behave in a given way.
Mildly superhuman version of this market: https://manifold.markets/vluzko/will-ai-for-diplomacy-be-mildly-sup
"a 90% win rate against the best human players"
Given that it's a seven player game, sometimes the other players happen to ally against you, and there's luck involved (in the same sense as there's luck in rock-paper-scissors, because people play mixed strategies), a 90% win rate sounds like it would almost require a mind hacking level of persuasion, but maybe I'm missing something.
@StevenK I don't know much about Diplomacy specifically, but I've played some similar games, and I think superhuman level is quite possible and achievable. The problem is, I imagined superhuman level as something like "Elo higher than any human player, with some margin", which is still a much lighter threshold than "90% probability of not losing".
@vluzko Are you currently intending to resolve according to a literal 90% win rate in multiplayer games or some other criterion that you're still thinking about? Multiplayer seems essential to the game and I don't see any way to measure Diplomacy skill in terms of a 90% chance of not losing against any individual player. Maybe someone who has played Diplomacy a lot could weigh in?
@vluzko I don't have a suggestion, but I do have another difficulty, which is that Cicero's games weren't played all the way to a win/draw:
"For our experiments, games end at the end of 1908, and are scored according to the sum-of-squares scoring system, in which each player’s share of the score is proportional to the square of the number of SCs they control."
If future experiments also use this kind of blitz scoring, it means games will rarely play out all the way to someone winning by the standard rules.
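For anyone unfamiliar with sum-of-squares scoring, here is a minimal Python sketch of the rule as quoted above: each player's share is their supply-center count squared, divided by the sum of squares over all seven players. The function name and the example board are made up for illustration; they are not taken from the Cicero paper.

```python
def sum_of_squares_shares(sc_counts):
    """Return each player's score share under sum-of-squares scoring.

    sc_counts: supply-center (SC) counts, one entry per player.
    """
    squares = [n * n for n in sc_counts]
    total = sum(squares)
    if total == 0:
        raise ValueError("at least one player must control a supply center")
    return [s / total for s in squares]


# Hypothetical end-of-1908 board: the 34 SCs split among 7 players.
counts = [10, 7, 6, 5, 4, 2, 0]
shares = sum_of_squares_shares(counts)

for n, share in zip(counts, shares):
    print(f"{n:2d} SCs -> {share:.1%} of the score")
```

Because the counts are squared, the leader's share grows faster than their SC count, but a 90% share still requires nearly wiping out the rest of the board before the turn limit.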
@StevenK It looks like the Diplodocus experiments for no-press Diplomacy AI used a similar scoring rule, with limited but slightly longer games and sum-of-squares scoring at the end. So a question it could make sense to ask is "If turn limit + sum of squares scoring is used for future full-press Diplomacy AI on a reasonable sized sample of games against top human players, will it score at least as well as Cicero (25.8%) or Diplodocus (26-27%) did in their respective games against a wider range of players?"
@StevenK One could also ask about Elo directly. From the Diplodocus paper:
"Elo ratings were computed using a standard generalization of BayesElo (Coulom, 2005) to multiple players (Hunter, 2004) (see Appendix I for details). This gives similar rankings as average score, but also attempts to correct for both the average strength of the opponents, since some games may have stronger or weaker opposition, as well as for which of the seven European powers a player was assigned in each game, since some starting positions in Diplomacy are advantaged over others. To regularize the model, a weak Bayesian prior was applied such that each player’s rating was normally distributed around 0 with a standard deviation of around 350 Elo."
The best-scoring Diplodocus, which scores 27% (compared to an average of 1/7), has an Elo of 181, where I think the median player has 0. I haven't looked into the details, but note:
"400 points in Elo systems generally corresponds to a 10-fold increase in expected winning odds or expected average score"
@StevenK Maybe it's easier to score 25% against a population that scores 25% against a population that scores 25% against the general population of players than it is to score 90% against the general population of players, just because some bad luck can't be eliminated. So as another complication, maybe the assumptions behind Elo break down here.
@StevenK If I'm not mistaken, getting a 90% score would require the AI to get 54 shares to 1 share for each of 6 human players, so it ends up with 54/60 = 0.9. That's a log10(54) * 400 ≈ 693 point Elo difference. A 90% score at the end of a blitz game is probably even more stringent than an eventual 90% win rate, because it means the AI has to complete its wins faster, but on the other hand, people are claiming blitz games are relatively easy for AI.
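The 54-to-1 arithmetic can be checked with a short Python sketch. It assumes the simplified model implied by that calculation: a player's strength is 10^(Elo/400), expected score share is own strength divided by total strength at the table, and all six opponents are equally rated. The function names are hypothetical, and the real BayesElo fit in the Diplodocus paper also corrects for opponent strength and starting power, which this ignores.

```python
import math


def elo_gap_from_share(share, n_opponents=6):
    """Elo gap over equally rated opponents needed for a given score share.

    Assumed model: share = g / (g + n_opponents), with g = 10 ** (gap / 400).
    """
    strength_ratio = n_opponents * share / (1.0 - share)
    return 400.0 * math.log10(strength_ratio)


def share_from_elo_gap(gap, n_opponents=6):
    """Inverse of the above: expected score share for a given Elo gap."""
    g = 10 ** (gap / 400.0)
    return g / (g + n_opponents)


print(round(elo_gap_from_share(0.9)))        # about 693, i.e. 400 * log10(54)
print(round(share_from_elo_gap(0.0), 3))     # 0.143, the 1/7 baseline
print(round(share_from_elo_gap(100.0), 3))   # about 0.229 for a 100-point edge
```

Under these assumptions a 100-point edge is worth only about 23% of the score against six equal opponents, while a 90% share needs roughly seven times that gap.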
@StevenK To rephrase some of what I've said earlier in the thread: it seems much more likely to me that there will be a tower of 7 AIs on top of the best human, each of which scores points as if it had 100 more Elo when playing against the next lowest AI in the tower, than a single AI that scores points as if it had 700 more Elo when playing against the best human.