
Only the sub benchmarks that are scored as an accuracy (i.e. from 0-100%) will be included (I think that's all of them but I'm not sure)
It must be a single model. If Model A achieves 75% on half and Model B achieves 75% on the other half, that does not resolve the question YES.
Ensemble models are fine, but something like "run Model A on this benchmark and Model B on this other benchmark" is not. If there is model selection, it must be learned, and it cannot include the current benchmark as an input.
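The single-model rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the official resolution procedure: it assumes the bar applies to every accuracy-scored sub-benchmark (the worst-case reading), that reaching exactly 75% counts, and the model names, task names, and scores are all hypothetical.

```python
from typing import Dict

THRESHOLD = 75.0  # percent; assumption: reaching exactly 75% counts as passing


def model_passes(scores: Dict[str, float], threshold: float = THRESHOLD) -> bool:
    """True if ONE model reaches the threshold on every accuracy-scored sub-benchmark.

    Assumption: the bar is applied to the worst-case sub-benchmark, not the average.
    """
    return bool(scores) and min(scores.values()) >= threshold


def resolves_yes(results: Dict[str, Dict[str, float]],
                 threshold: float = THRESHOLD) -> bool:
    """True if at least one single model clears the bar on its own.

    Scores are never combined across models, so Model A covering one half
    and Model B covering the other half does not count.
    """
    return any(model_passes(scores, threshold) for scores in results.values())


# Hypothetical scores: each model clears 75% on only one of the two tasks.
results = {
    "model_a": {"task_1": 80.0, "task_2": 60.0},
    "model_b": {"task_1": 55.0, "task_2": 90.0},
}
print(resolves_yes(results))  # False: no single model clears both tasks
```

Under the same assumptions, a learned ensemble would be entered as one key in `results`, while a hand-written per-benchmark routing between two models would not qualify as a single entry.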
Feb 8, 2:38pm:
BIG-bench accuracy 75%: Will SOTA for a single model on BIG-bench pass 75% by the start of 2024? → BIG-bench accuracy 75% #1: Will SOTA for a single model on BIG-bench pass 75% by the start of 2024?
This is resolving NO, but not in a way I like: many of the benchmarks simply are not used anymore (BIG-bench-hard is more common now), so worst-case performance is below 75% in a somewhat trivial way. Average accuracy on BIG-bench-hard is above 80% now, but GPT-4 and Gemini only report the average, not the worst case.
The link no longer works, but judging by the URL, the new link appears to be this. @VincentLuczkow Is that right?
@Shump See description: "Only the sub benchmarks that are scored as an accuracy (i.e. from 0-100%) will be included (I think that's all of them but I'm not sure)"