
Only the sub-benchmarks that are scored as an accuracy (i.e. from 0-100%) will be included. (I think that's all of them, but I'm not sure.)
It must be a single model. If Model A achieves 75% on half and Model B achieves 75% on the other half, that does not resolve the question YES.
Ensemble models are fine, but something like "run Model A on this benchmark and Model B on this other benchmark" is not. If there is model selection, it must be learned, and it cannot include the current benchmark as an input.
Update 2025-05-01 (PST) (AI summary of creator comment):
- If no BIG-bench results are available for any major models by the resolution date, the market will be resolved as N/A.
- NO will not be resolved based solely on SOTA results from 2023.
- YES will not be resolved based on personal predictions.