Update: results are posted here: https://firstsigma.github.io/midterm-elections-forecast-comparison
The summary results are in the post linked above; see the post for details and analysis.
It's interesting to compare forecasts between different prediction platforms, but it's rare for them to have questions that are identical enough to compare easily. Elections offer one helpful opportunity.
I will score several prediction platforms on a set of 10 questions on the outcome of the 2022 US midterm elections: Senate and House control, and several individual Senate and gubernatorial races with high forecasting interest.
For each prediction platform, I will take the predicted probabilities on Monday evening (the evening before Election Day) and compute the average log score on these questions. This is a measure of prediction accuracy - a higher log score (closer to zero) means better accuracy.
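For concreteness, here is a minimal sketch in Python of how the average log score could be computed. The question names, probabilities, and outcomes below are made-up placeholders, not actual forecasts.

```python
import math

# Hypothetical predicted probabilities that the Republican candidate wins each
# question, and whether the Republican actually won (placeholder values).
predictions = {"Senate control": 0.55, "House control": 0.85, "PA Senate": 0.45}
outcomes = {"Senate control": False, "House control": True, "PA Senate": False}

def avg_log_score(predictions, outcomes):
    """Average natural-log score: log(p) if the event happened, log(1 - p) if
    it did not. Higher (closer to zero) means more accurate."""
    scores = [
        math.log(p if outcomes[q] else 1 - p)
        for q, p in predictions.items()
    ]
    return sum(scores) / len(scores)

print(round(avg_log_score(predictions, outcomes), 3))
```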
I plan to compare these prediction platforms:
538 (statistical modeling and poll aggregation)
Polymarket (real-money prediction market)
PredictIt (real-money prediction market)
Election Betting Odds (prediction market aggregator)
Manifold (play-money prediction market)
Metaculus (prediction aggregator)
Others can be added; just add a comment with the data on their predictions for each of the questions listed below.
Important note: the election is much closer to one overall prediction than a set of independent predictions, because the races are highly correlated. The forecast that scores best is probably going to be the forecast that landed closest to the mark on the broader question of how much the nation overall went left or right, or how far left or right the polls were biased - and a large part of this is chance. So despite the large number of individual races, each election cycle can be thought of as roughly one data point, and to truly measure accuracy well, we'd need to run this experiment several times over different election cycles.
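To illustrate why, here is a toy simulation of my own (not part of the scoring methodology): every race is a toss-up shifted by a single shared national swing, and two forecasters lean slightly in opposite directions. The forecaster that happens to lean toward the realized swing tends to score better across the whole slate at once, which is the sense in which one cycle is closer to one data point than ten.

```python
import math
import random

def avg_log_score(p, outcomes):
    """Average log score for a forecaster who assigns probability p to a
    Republican win in every race."""
    return sum(math.log(p if won else 1 - p) for won in outcomes) / len(outcomes)

rng = random.Random(0)
n_cycles = 10_000
agree = 0
for _ in range(n_cycles):
    # Shared national swing (e.g. a common polling error) correlates all races.
    swing = rng.gauss(0, 0.15)
    outcomes = [rng.random() < 0.5 + swing for _ in range(10)]
    # Forecaster A leans slightly Republican, forecaster B slightly Democratic.
    a = avg_log_score(0.55, outcomes)
    b = avg_log_score(0.45, outcomes)
    agree += (a > b) == (swing > 0)

# Fraction of simulated cycles where the forecaster leaning toward the realized
# swing also scored better on the whole slate of races.
print(agree / n_cycles)
```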
Of course, I've also created meta prediction markets on which prediction platform will be the most accurate: https://manifold.markets/group/election-forecast-comparison
Questions compared
I selected this set of 10 questions to compare across prediction platforms:
Senate control
House control
Senate races
Pennsylvania - Mehmet Oz (R) vs John Fetterman (D)
Nevada - Adam Laxalt (R) vs Catherine Cortez Masto (D)
Georgia - Herschel Walker (R) vs Raphael Warnock (D)
Wisconsin - Ron Johnson (R) vs Mandela Barnes (D)
Ohio - J. D. Vance (R) vs Tim Ryan (D)
Arizona - Blake Masters (R) vs Mark Kelly (D)
Governor races
Texas - Greg Abbott (R) vs Beto O'Rourke (D)
Pennsylvania - Doug Mastriano (R) vs Josh Shapiro (D)
These were selected as races that had a high level of interest across the prediction platforms. The main reason for using a limited set of questions is that not all prediction platforms made forecasts on all races; in practice, the limiting factor was which questions were on Metaculus. (I did later find a couple more races on Metaculus, but did not add them to my list because I had already preregistered the question set.) Using a smaller set of questions also makes the data collection easier for me.
Not all of them are highly competitive races, which is a good thing for looking at how accurate and well-calibrated predictions are across a range of competitiveness.
Fine print on methodology
In the event that the winner of an election is not one of the current major-party candidates, I will exclude that race from the calculation. This is to normalize slightly different questions between platforms - some ask which candidate will win, others ask which party will win.
For 538, I use the forecasts on this page https://projects.fivethirtyeight.com/2022-election-forecast/, i.e. the Deluxe model. I also score the Classic and Lite models for comparison.
For PredictIt, I compute the inferred Republican win probability as the average of the Republican YES price and 1 - the Democratic YES price. I do not use the NO prices, because the YES prices are what the platform highlights most prominently.
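As a small sketch (with hypothetical prices, quoted as dollars between 0 and 1), the inferred probability would be computed like this:

```python
def inferred_rep_probability(rep_yes_price: float, dem_yes_price: float) -> float:
    """Average the Republican YES price with the complement of the Democratic
    YES price to get one implied Republican win probability."""
    return (rep_yes_price + (1 - dem_yes_price)) / 2

# Hypothetical prices: Republican YES at $0.40, Democratic YES at $0.63.
print(inferred_rep_probability(0.40, 0.63))  # 0.385
```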
For Metaculus, I will use the Metaculus Prediction. I will also score the Metaculus Community Prediction for comparison.
For Manifold, there are often multiple questions on the same race, sometimes with slight differences in resolution criteria. I used only the prediction on the market featured on the main midterms map page https://manifold.markets/midterms.
Manifold has a separate instance for the Salem Center/CSPI Tournament, which I will also compare. The market mechanics are the same, but it uses a separate play-money currency and has a similar but distinct user base.
This tournament does not have a question on the Texas governor race, so I will substitute the prediction from main Manifold there. (For the purposes of comparing main Manifold to Salem Manifold, this is equivalent to excluding this question.)
See prediction questions on which platforms will be most accurate here: https://manifold.markets/group/election-forecast-comparison