Will published performance on GSM8K-test exceed 90% by 1st April 2023?
Ṁ134, resolved Mar 15
Resolved YES
https://arxiv.org/abs/2110.14168
https://paperswithcode.com/dataset/gsm8k
State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high test performance, despite the conceptual simplicity of this problem distribution. To increase performance, we propose training verifiers to judge the correctness of model completions. At test time, we generate many candidate solutions and select the one ranked highest by the verifier. We demonstrate that verification significantly improves performance on GSM8K, and we provide strong empirical evidence that verification scales more effectively with increased data than a finetuning baseline.
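The test-time procedure described in the abstract (generate many candidate solutions, then keep the one the verifier ranks highest) can be illustrated with a minimal sketch. This is not the paper's implementation: `select_best` and `toy_verifier` are hypothetical names, and the toy verifier is a hand-written arithmetic check standing in for a trained verifier model.

```python
from typing import Callable, List


def select_best(candidates: List[str], score: Callable[[str], float]) -> str:
    """Return the candidate solution with the highest verifier score."""
    if not candidates:
        raise ValueError("need at least one candidate solution")
    return max(candidates, key=score)


def toy_verifier(solution: str) -> float:
    """Hypothetical stand-in for a trained verifier.

    Scores a string of the form "a + b = c" as 1.0 when the arithmetic
    is consistent and 0.0 otherwise (including unparseable input).
    """
    left, _, right = solution.partition("=")
    try:
        a, b = (int(term) for term in left.split("+"))
        return 1.0 if a + b == int(right) else 0.0
    except ValueError:
        return 0.0


# Assumed example: three sampled completions for the same problem.
candidates = ["3 + 4 = 8", "3 + 4 = 7", "3 + 4 = 12"]
best = select_best(candidates, toy_verifier)
print(best)
```

In the setting the abstract describes, the scoring function would be a learned model that judges the correctness of each completion; only the selection step is shown here.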
This question is managed and resolved by Manifold.
Related questions
What will be the best performance on SWE-bench Verified by December 31st 2025?
Will any model get above human level on the Simple Bench benchmark before September 1st, 2025?
69% chance
MMLU 99% #5: Will SOTA for MMLU (average) pass 99% by the start of 2028?
44% chance
MMLU 99% #3: Will SOTA for MMLU (average) pass 99% by the start of 2026?
16% chance
What will be the best score on the GAIA benchmark before 2025?
64% chance
MMLU 99% #4: Will SOTA for MMLU (average) pass 99% by the start of 2027?
12% chance
Will there be a commercial 6G network live by 1st of Jan 2030?
77% chance
What will be the best score on the GPQA benchmark before 2025?
87% chance