This question will resolve to the minimum Brier score achieved on the leaderboard by a qualified submission. The calibrated random baseline scores 85; lower scores are better. See an example of the Brier score in action.

See the competition page.

For true/false and multiple-choice questions, we evaluate models using the Brier score, which is divided by 2 to normalize it to the 0%–100% range. For numerical questions, we use the L1 distance, bounded between 0% and 100%. We denote these question types as T/F, MCQ, and Numerical, respectively. To evaluate aggregate performance, we use a combined metric (T/F + MCQ + Numerical), which has a lower bound of 0%; a score of 0% indicates perfect prediction on all three question types. For more details, please check out the Autocast paper.
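The scoring described above can be sketched as follows. This is a minimal illustration, not the official evaluation code: the function names, the assumption that numerical predictions and outcomes are scaled to [0, 1], and the simple summation of the three per-type scores are all assumptions; see the Autocast paper for the exact definitions.

```python
import numpy as np

def brier_score_pct(probs, outcome_index):
    """Normalized Brier score for one T/F or MCQ question.

    `probs` gives the predicted probability of each option. The raw
    multi-class Brier score lies in [0, 2], so dividing by 2 maps it
    to [0, 1]; reported here as a percentage (0% = perfect).
    """
    probs = np.asarray(probs, dtype=float)
    target = np.zeros_like(probs)
    target[outcome_index] = 1.0  # one-hot encoding of the realized outcome
    return 100.0 * np.sum((probs - target) ** 2) / 2.0

def l1_score_pct(pred, outcome):
    """Bounded L1 distance for a numerical question.

    Assumes predictions and outcomes are rescaled to [0, 1], so the
    distance is bounded; reported as a percentage (0% = perfect).
    """
    return 100.0 * min(abs(pred - outcome), 1.0)

# Uninformed guesses illustrate why 0% means perfect prediction
# and why random baselines score well above it.
tf = brier_score_pct([0.5, 0.5], 1)    # uniform T/F guess -> 25.0
mcq = brier_score_pct([0.25] * 4, 2)   # uniform 4-option MCQ guess -> 37.5
num = l1_score_pct(0.4, 0.7)           # numerical miss of 0.3 -> 30.0
combined = tf + mcq + num              # combined metric (assumed: simple sum)
```

A perfect forecaster would score 0% on each component, giving a combined score of 0%, matching the lower bound stated above.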