Will published performance on GSM8K-test exceed 90% by 1st April 2023?
Ṁ134
Resolved YES on Mar 15
https://arxiv.org/abs/2110.14168
https://paperswithcode.com/dataset/gsm8k
State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high test performance, despite the conceptual simplicity of this problem distribution. To increase performance, we propose training verifiers to judge the correctness of model completions. At test time, we generate many candidate solutions and select the one ranked highest by the verifier. We demonstrate that verification significantly improves performance on GSM8K, and we provide strong empirical evidence that verification scales more effectively with increased data than a finetuning baseline.
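The test-time procedure described in the abstract (sample many candidate solutions, score each with a verifier, keep the top-ranked one) can be illustrated with a minimal sketch. This is not the paper's implementation: `select_best`, `toy_verifier`, and the hard-coded scores are hypothetical names and values introduced only for illustration, and a real verifier would be a trained model rather than a lookup table.

```python
from typing import Callable, Sequence


def select_best(candidates: Sequence[str], verifier: Callable[[str], float]) -> str:
    """Return the candidate solution that the verifier scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate solution")
    return max(candidates, key=verifier)


# Hypothetical stand-in for a trained verifier: a fixed table of scores.
TOY_SCORES = {
    "16 - 3 - 4 = 9 eggs, 9 * 2 = 18 dollars": 0.91,
    "16 - 3 = 13 eggs, 13 * 2 = 26 dollars": 0.22,
    "16 * 2 = 32 dollars": 0.05,
}


def toy_verifier(solution: str) -> float:
    # Assumption: unknown solutions get the lowest possible score.
    return TOY_SCORES.get(solution, 0.0)


best = select_best(list(TOY_SCORES), toy_verifier)
print(best)
```

In this sketch the quality of the final answer depends entirely on how well the verifier ranks candidates, which is why the abstract frames verifier training as the main lever for improving performance.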
This question is managed and resolved by Manifold.
Related questions
Will GPT4/Opus report >50% score on ARC in 2024?
32% chance
Will any model get above human level (92%) on the Simple Bench benchmark before September 1st, 2025?
55% chance
What will be the best score on the GPQA benchmark before 2025?
82% chance
Will >50% of the tasks in the WebArena benchmark be solved by EOY 2024?
62% chance
What will be the best score on the GAIA benchmark before 2025?
46% chance
MMLU 99% #5: Will SOTA for MMLU (average) pass 99% by the start of 2028?
44% chance
MMLU 99% #2: Will SOTA for MMLU (average) pass 99% by the start of 2025?
12% chance
Will Grok achieve 98% or greater on ARC by the end of November 2024?
3% chance
MMLU 99% #3: Will SOTA for MMLU (average) pass 99% by the start of 2026?
16% chance
MMLU 99% #4: Will SOTA for MMLU (average) pass 99% by the start of 2027?
12% chance