Resolves YES if, before 2030, a neural net with <10B parameters achieves all of: >75% on GPQA, >80% on SWE-bench Verified, and >95% on MATH.
Arbitrary scaffolding allowed (retrieval over fixed DB is ok), no talking with other AI, no internet access. We'll use whatever tools are available at the time to determine whether such an AI memorized the answers to these datasets; if verbatim memorization obviously happened, the model will be disqualified.
Edit: we'll allow up to 1 minute of wall-clock time per question.
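For concreteness, here is a minimal sketch of the resolution check as I read the criteria above. The candidate scores are placeholders, and the memorization and scaffolding conditions aren't modeled here.

```python
# Hypothetical candidate; values are placeholders, not real results.
candidate = {
    "params_billion": 9.5,           # must be < 10B
    "gpqa": 0.76,                    # must be > 0.75
    "swe_bench_verified": 0.81,      # must be > 0.80
    "math": 0.96,                    # must be > 0.95
    "max_seconds_per_question": 55,  # must fit the 1-minute wall-clock cap
}

resolves_yes = (
    candidate["params_billion"] < 10
    and candidate["gpqa"] > 0.75
    and candidate["swe_bench_verified"] > 0.80
    and candidate["math"] > 0.95
    and candidate["max_seconds_per_question"] <= 60
)
print(resolves_yes)
```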
New research from Meta describing "Memory Layers", which resemble both attention and a vector DB keyed on the model's latent space.
https://ai.meta.com/research/publications/memory-layers-at-scale/
I think it's quite clear that active params will end up being a smaller and smaller proportion of a model's total parameters (MoEs were only the beginning of this), with most parameters used very sparsely, in the same vein as associative memory. My sense is that techniques like these don't fall under this question's retrieval exemption (since they're trained parameters, they count toward the budget), but they do point to the same principle.
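For intuition, here is a toy sketch of the associative-memory pattern the paper points at: a query in latent space retrieves from a large table of trained key/value parameters, with only a few slots active per lookup. This is illustrative only; the sizes and names are made up and it is not Meta's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_slots, k = 64, 100_000, 8              # latent dim, memory slots, top-k

keys = rng.standard_normal((n_slots, d))    # trained parameters
values = rng.standard_normal((n_slots, d))  # trained parameters

def memory_lookup(query: np.ndarray) -> np.ndarray:
    """Attention-like read: softmax over the top-k most similar keys."""
    scores = keys @ query                    # similarity to every key
    top = np.argpartition(scores, -k)[-k:]   # only k slots are 'active'
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()
    return w @ values[top]                   # weighted sum of their values

out = memory_lookup(rng.standard_normal(d))
print(out.shape)  # (64,)
```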
I didn't explicitly mention wall-clock time, but I said "Arbitrary scaffolding allowed", so unless anyone objects I'll add "Must use below X minutes of wall-clock time per question". I am conflicted between 1 minute (an upper bound on how long users would be willing to wait) and something higher, since the spirit of this question is upper-bound-y.
@JacobPfau Added a 1-minute cap. Since we're talking about 10B models on arbitrarily optimized hardware, this isn't much of a constraint. I expect that'll allow >100k tokens/question.
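For reference, a back-of-the-envelope for what that budget implies under the 1-minute cap; the 100k-token figure is the estimate above, not a measurement.

```python
# Sustained decode rate needed to spend ~100k tokens within the 1-minute cap.
tokens_per_question = 100_000
seconds_per_question = 60
print(tokens_per_question / seconds_per_question)  # ~1667 tokens/s
```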
@JoeBoyle I don't think I ought to share it. No point giving up so much alpha while prices for downstream markets remain this good.
@JoeBoyle Sorry that you'll have to wait, but here's this to keep me honest: de3be1f4472c9adb4a479b97d140d6615b7189536d8916d52c5426aa0291fd28
I may consider sharing by April or so.
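For anyone holding onto that digest: a minimal sketch of how it could be checked once the text is shared, assuming the posted string is a SHA-256 digest of the revealed text. The revealed text here is a placeholder.

```python
import hashlib

posted_digest = "de3be1f4472c9adb4a479b97d140d6615b7189536d8916d52c5426aa0291fd28"
revealed_text = "<the text the author later shares, byte-for-byte>"  # placeholder

digest = hashlib.sha256(revealed_text.encode("utf-8")).hexdigest()
print(digest == posted_digest)  # True only if the revealed text matches exactly
```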
@JacobPfau I'm also happy to bet YES on a "before 2027" version of this market.
@AdamK Yeah, given Qwen/o1 progress I agree that 2027 is possible. I've made a question here: https://manifold.markets/JacobPfau/is-scale-unnecessary-for-intelligen?play=true
Do current larger models reach those scores? Or is improvement AND compression currently necessary?
@KimberlyWilberLIgt Improvement on SWE-bench Verified is necessary; the others have been roughly hit by o1. I chose these numbers as my sense of in-domain expert performance.
@IasonKoukas Thanks for catching this @CraigDemel
Pinging @AdamK to make sure your limit orders are in the right direction