I’m writing an undergraduate thesis comparing real-money and play-money prediction markets, using Polymarket and Manifold as my respective data sources. Their relative accuracy is one of a few questions I plan to investigate.
The data: paired price time series of markets with identical resolution criteria. Polymarket’s price is the mid of the best bid and ask, Manifold’s the AMM price. Topics span sports futures, politics, econ, crypto prices, awards, and whatever other pairs I could find. Shooting for a sample size of at least 150.
I’ll probably use the prices one week before resolution, at least for the purpose of resolving this market. I’ll bound Polymarket’s prices between 0.01 and 0.99 for a fair test. I’ll restrict the analysis to a priori plausibly independent markets (which throws out a lot of politics markets). There’s a fairly big range of liquidity/number of traders in the markets.
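For concreteness, here’s a rough sketch of those two choices (mid price, price ~one week out). The column names are just illustrative, not my actual cleaning pipeline.

```python
# Rough sketch only: hypothetical column names (best_bid, best_ask, timestamp,
# price) and a supplied resolution time, not the real data pipeline.
import pandas as pd

def polymarket_price(best_bid: float, best_ask: float) -> float:
    """Polymarket price = midpoint of the best bid and best ask."""
    return (best_bid + best_ask) / 2

def price_one_week_out(prices: pd.DataFrame, resolution_time: pd.Timestamp) -> float:
    """Pick the sampled price closest to one week before resolution."""
    target = resolution_time - pd.Timedelta(days=7)
    closest = (prices["timestamp"] - target).abs().idxmin()
    return prices.loc[closest, "price"]
```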
The test: a permutation test on differences in log scores. Each market’s forecast is scored ln(p) if the event happened and ln(1-p) if it didn’t; higher log score = more accurate. Then I’ll take the sum of differences in log scores across Polymarket-Manifold pairs. This is the test statistic.
If there were no systematic difference in accuracy, the sign of each difference in log scores would be random. This lets us generate a distribution of test statistics under the hypothesis that Polymarket and Manifold are equally accurate: assign a random sign to each empirical log-score difference, compute the test statistic, then repeat (say) 10,000 times. If the true test statistic is greater than 95% of these values, we can reject the hypothesis of equal accuracy at 0.05 significance.
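A minimal sketch of the scoring and permutation step, assuming one Polymarket price, one Manifold price, and a 0/1 outcome per pair (names are placeholders, not the final analysis code):

```python
# Minimal sketch of the test described above; inputs are one entry per paired market.
import numpy as np

rng = np.random.default_rng(0)

def log_score(p, outcome):
    """ln(p) if the event happened, ln(1 - p) if it didn't; higher = more accurate."""
    return np.where(outcome, np.log(p), np.log(1 - p))

def paired_permutation_test(p_poly, p_mani, outcomes, n_perm=10_000):
    """One-sided sign-flip test of 'Polymarket is more accurate than Manifold'."""
    p_poly = np.clip(np.asarray(p_poly, dtype=float), 0.01, 0.99)  # bound Polymarket prices as above
    p_mani = np.asarray(p_mani, dtype=float)
    outcomes = np.asarray(outcomes, dtype=bool)
    diffs = log_score(p_poly, outcomes) - log_score(p_mani, outcomes)
    observed = diffs.sum()                              # the test statistic
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    null = (signs * diffs).sum(axis=1)                  # statistics under 'equal accuracy'
    return np.mean(null >= observed)                    # one-sided p-value

# e.g. compare paired_permutation_test(poly_prices, manifold_prices, outcomes) against 0.05
```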
This market resolves YES iff this procedure shows Polymarket is more accurate than Manifold at p<0.05.
I anticipate I’ll have run this test sometime in the next 1-3 months, but it could be as soon as next week; it depends on when I get around to it given my other courses etc. I won’t trade in this market.
Update 2025-04-16 (PST) (AI summary of creator comment):
- Exclusion of manipulated markets: any market with a clearly manipulated resolution (e.g. the Ukraine market or the Bitcoin reserve event) will be excluded from the analysis.
- Purpose: this ensures that only markets with genuine, independently determined resolutions are used to assess accuracy.
@Kingfisher plausible! my intuition is that >150 markets would be enough, but the test i’m using is non-parametric, so it does have less statistical power than e.g. a t-test
also worth noting log scores tend to reward/penalise probabilities near 0 or 1 a lot, so i suspect a lot of the result hinges on how well each market prices 90-100% or 0-10% events
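e.g. a quick illustration (my own numbers, just to show how fast the penalty grows near the extremes):

```python
import math
# log score for a confident forecast when the event does NOT happen
for p in (0.90, 0.95, 0.99):
    print(p, round(math.log(1 - p), 2))
# 0.9 -> -2.3, 0.95 -> -3.0, 0.99 -> -4.61
```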
@brod It depends on the type of market. Manifold>Polymarket on most 2024 election markets. On others IDK, that would be interesting.
@HillaryClinton Agreed, excited to see results.
@Brad do you have a plan to handle Polymarket markets with clearly-manipulated resolutions? For example, Polymarket's "Will Trump create Bitcoin reserve in first 100 days" is at 10%, due to coordinated manipulation of the consensus mechanism (see comments), while the Manifold consensus is that this has already resolved YES. (Arguably, the Manifold one is correct.)
- Polymarket: https://polymarket.com/event/will-trump-create-a-national-bitcoin-reserve-in-his-first-100-days
- Manifold: https://manifold.markets/AaronSimansky/what-will-happen-within-donald-trum -> "Trump create a national Bitcoin reserve" sub-question
@brod Are the probability pairs generally pretty close to each other? Should be easier to detect a difference when the forecasts disagree a lot.
@Kingfisher will avoid any markets with manipulated resolutions like the ukraine one a few weeks ago - didn’t know about the bitcoin reserve one!
@travis Still cleaning data but here’s the Manifold price as a function of Polymarket’s price over about 100 markets (prices sampled daily)
[scatterplot: Manifold price vs. Polymarket price]
@brod What are the probabilities with the horizontal “manifold lines” in that chart? Eg looks like maybe 90%, 85%, etc? And what’s up with all the manifold markets near 0% with high polymarket probabilities? Mind sharing an example?
(After you’re done, would love to see the dataset uploaded, but totally understand if you’d rather not until the project is complete!)
@Ziddletwix @travis took a closer look - a few illiquid markets and a few fuck ups in pairing on my part, whoops! corrected version:
[corrected scatterplot: Manifold price vs. Polymarket price]
the remaining lines (see around (0.1, 0.85), (0.6, 0.2), and (0.95, 0.35)) are markets that didn’t get much attention on manifold and stayed mispriced for a while. in particular:
How many SpaceX Starship launches reach space in 2024?
$PNUT listed on Coinbase in 2024?
my main fuck up was accidentally pairing a market on the november 2024 FOMC decision with one on the november 2023 decision - that was the weird set of points at the bottom of the previous chart, my bad!
@brod ah got it, so this plot includes multiple points per market (at different times). For the final test, will it just be a single probability per market (IIUC from the description, ~1 wk before resolution), or will it also use multiple data points?
Cool to see the details!
@Ziddletwix yep that’s right - final analysis will just be the one data point per market (to avoid issues from correlated data points). will also need to get more markets for the final analysis
@brod makes sense!
If the true test statistic is greater than 95% of these values, we can reject the hypothesis of equal accuracy at 0.05 significance.
This market resolves YES iff this procedure shows Polymarket is more accurate than Manifold at p<0.05
so to confirm, this is 95% one-sided? (i.e. just for polymarket more accurate than manifold)
@Kingfisher fwiw i don't think p=0.05 is such a high bar to clear here, since the pairing helps a fair bit (compared to a difference in means).
rough intuition: assume 150 questions, there's some true prob of the event occurring (i used a uniform sequence), & simulate outcomes. assume manifold & poly always diverge by some delta in the log odds (+/- delta/2 compared to that true prob in log odds). but poly is better, so 60% of the time that delta points in the right direction, & 40% of the time it points in the wrong direction.
with delta=0.2 (so if true prob = 0.5, you'd have manifold/poly with like a ~5pp gap), & poly is "right" 60% of the time. that should be detected ~most of the time (60%+) @ 95% confidence. "poly is only right 60% of the time, and the markets never disagree by more than 5pp" isn't a super high bar imo—paired tests are fairly strong (for the narrow thing they claim to test).
(that being said, not sure how relevant that naive sim will be bc i'd expect the results will mostly be dominated by their performance on those occasional cases of extreme divergence. my guess is that poly will fare better on those—fewer markets, more users, higher stakes, etc, so fewer blindspots/forgotten markets—in which case it couldn't be too hard to detect the difference if brad can get to 150+ markets. but i understand taking the NO side given that it covers all cases lacking statistical power in addition to other odd surprises. tbh my prediction would hinge quite a bit on seeing a simple scatterplot like the one above but with one data point per market + the final list of all markets included—a lot of this may come down to data cleaning/filters).
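for reference, roughly the kind of naive sim i mean, under my reading of the setup above (i'm treating "right direction" as poly's deviation pointing toward the eventual outcome 60% of the time); just a sketch, not exactly what i ran:

```python
# Rough power-simulation sketch: 150 questions, true probabilities on a uniform
# grid, both platforms delta/2 away from the true log odds on opposite sides,
# and Polymarket's deviation pointing toward the realised outcome 60% of the time.
import numpy as np

rng = np.random.default_rng(0)

logit = lambda p: np.log(p / (1 - p))
expit = lambda x: 1 / (1 + np.exp(-x))

def log_score(p, outcome):
    return np.where(outcome, np.log(p), np.log(1 - p))

def sign_flip_p_value(diffs, n_perm=10_000):
    """One-sided paired permutation test: observed sum vs. random-sign sums."""
    observed = diffs.sum()
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    return np.mean((signs * diffs).sum(axis=1) >= observed)

def one_run(n=150, delta=0.2, p_poly_right=0.6):
    true_p = np.linspace(0.05, 0.95, n)
    outcomes = rng.random(n) < true_p
    toward = np.where(outcomes, 1.0, -1.0)            # direction of the realised outcome, in log odds
    poly_right = rng.random(n) < p_poly_right         # poly leans the right way 60% of the time
    shift = np.where(poly_right, toward, -toward) * delta / 2
    p_poly = expit(logit(true_p) + shift)
    p_mani = expit(logit(true_p) - shift)             # manifold always on the opposite side
    diffs = log_score(p_poly, outcomes) - log_score(p_mani, outcomes)
    return sign_flip_p_value(diffs)

power = np.mean([one_run() < 0.05 for _ in range(200)])
print(f"share of simulated runs with p < 0.05: {power:.2f}")
```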
@Ziddletwix I tried a simulation like that. I used a random direction for the error, but a larger average error for manifold than polymarket. It was hitting <0.05 about a third of the time, but after I saw Brad's plot, I increased the error to try to match it (just eyeballing) and it's getting <0.05 about half the time. I tried adding big outliers, but surprisingly it didn't make much difference, I guess because it increases the variance of the test statistic and makes <0.05 harder to achieve.
@Ziddletwix yep, one sided test
(also appreciate your & everyone’s comments here, good to get feedback on design and super cool people have taken an interest)
@travis yup. also, in log score, variation tends to be less punished than correctness (obviously that's a simplification, depends on the exact #s & scale you use, but i think it's the general intuition). e.g. for two events that both happen, if polymarket had [0.5, 0.5], versus manifold's [0.4, 0.6] (i.e. same EV forecast but manifold has more variation), poly has a better log score, as expected. but if instead polymarket is [0.52, 0.52] and manifold is [0.5, 0.5] (i.e. poly is just a little bit more correct), poly's log-score advantage is ~2x bigger than in the first case. my sim assumed poly's forecast EV was more correct than manifold's, not just that it had more variation.
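quick check of those numbers:

```python
import math
# two events that both happen; compare summed log scores
ls = lambda ps: sum(math.log(p) for p in ps)
gap_1 = ls([0.5, 0.5]) - ls([0.4, 0.6])    # poly [0.5, 0.5] vs manifold [0.4, 0.6]   -> ~0.041
gap_2 = ls([0.52, 0.52]) - ls([0.5, 0.5])  # poly [0.52, 0.52] vs manifold [0.5, 0.5] -> ~0.078
print(gap_1, gap_2, gap_2 / gap_1)         # the second gap is roughly 2x the first
```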
@brod
I'm surprised there still seem to be horizontal clusters in both Polymarket and Manifold. I'd expected patterns like that to be mirrored along the axis, which should result in vertical clusters on Manifold and horizontal ones on Polymarket. But then I'm not clear what's causing these clusters in the first place
@AlexanderTheGreater there are multiple data points per market in this plot. So if a market on Manifold is forgotten about and the price doesn't change for weeks, but the Polymarket price is shifting, you’ll get a horizontal line
@AlexanderTheGreater haha yep ziddletwix is right. also the polymarket price is the middle of the bid/ask, so the price can move if people place/remove orders even if no transactions take place, unlike manifold
Really excited to see what happens with this!
Will you be requiring Manifold markets to have a certain number of traders? Manifold says somewhere between 10-20 traders is where calibration stops getting more accurate, and also that they haven't conducted thorough analysis on the effect of liquidity yet: https://manifold.markets/calibration
@MingCat thank you! I didn’t have any hard cutoff for traders in mind, but all the markets I’ve got so far have >10. I’d guess if the market’s on Polymarket too it must be somewhat popular to trade on. And where multiple Manifold markets on one topic exist I’ve chosen whichever has more traders.