Did a Google Deepmind AI really get Silver on IMO 2024?
Resolved YES (Jul 25)

The Verge posted an article claiming this, which has since been deleted. Possibly The Verge just accidentally broke the press embargo? If anyone has the full text, that would be appreciated.

https://www.theverge.com/2024/7/24/24205267/google-deepminds-ai-makes-leap-in-math-reasoning?showComments=1


A lot more people had a different interpretation of the title than I expected, and on reflection I think their interpretation was reasonable. So I think an N/A resolution makes sense. @mods can you do that?

I think I personally disagree? In a soft, common-sense, spirit-of-the-market way, I don't think it would have been a significantly bigger achievement if Google had dedicated all of their chips to this for 4 hours so the massively parallel search could run faster. And in a hard "what do the criteria say" sense, the description seems to point the meaning of the title at the Verge's claims, and the Verge's claims are true.

That's separate from what I'd do as a mod, though. Hm.

The Verge's claim is true, but that is different from actually getting a silver at the IMO.

It has been precedent on Manifold for a very long time that descriptions override titles! Titles are often necessarily ambiguous because they're just short.

If DeepMind got silver by seeing the test beforehand, would that count? The description seems to allow it, but nobody would claim it got silver legitimately.

Or what if the DeepMind AI had just searched the internet for 2024 IMO solutions? Wow, I too can achieve silver on the IMO.

No? Because at that point the verge article and title would egregiously mismatch? Descriptions can clarify if the clarification is reasonable (which this one is), but not if it's unreasonable.

Like, this title is much better than markets like (made-up example) "Will GPT-4 be human-level at math?" where the description says "resolves if GPT-4's MATH score is greater than 75%", but we don't N/A those.

@jacksonpolack this wasn't a situation where the resolution criteria in the description were clear but the title was ambiguous. The description contained almost no information about the resolution criteria; they were underspecified because it was late and I didn't think edge cases like this were likely.

I mean, it's your market; I just think the line saying "AlphaProof and AlphaGeometry were able to solve 4 of the 6 IMO problems", which is true modulo concerns about formalization that I don't think are important, is what would count over the title.

Just my 2 cents: I think the description doesn't tell us that much. It points in the direction of "If the Verge article is confirmed, then YES", but it could also be read as "Will the full text support the claim, or will there be something in there saying they didn't really achieve it?" (I didn't read it that way, but it's an understandable reading.)

But then the question becomes: what does really achieving a silver mean? I don't think there's an obvious answer, which is why I didn't think this interpretation made much sense, but it could come down to a subjective take by the author.

Sorry for the unclear criteria, everyone. The original intent was to see whether the leaked article was real, and Google's own blog headline says "achieves silver-medal standard," so I'm resolving YES.

Should resolve N/A, then. Your original intent obviously wasn't clear to many, and Google exaggerating the achievement of its own model isn't exactly the word of God.

@ZacharyKreiser (impartial voice) so true

I think the market title really doesn't reflect your stated intent. Part of why I bet NO was, e.g., a scenario where Google had just redefined the challenge enough to make the job easier, resulting, e.g., in the Verge headline.

Incidentally, I think Google got close enough to silver that I don't care much about the mana I lost.

I think this is a reasonable resolution, but the title here probably should have been "Is the Verge story about a Google DeepMind AI getting silver on IMO 2024 credible?"

The general precedent, of course, is for the criteria in the description to override the title summary, but making them match is good when possible. I suppose the title could still be edited to that for consistency.

I disagree that the criteria in the description amount to "Is the Verge story about a Google DeepMind AI getting silver on IMO 2024 credible?"

The description about the Verge article reads as context for why one would suspect that google’s AI had achieved a silver medal result in the first place. Having read the description (before the press embargo was lifted), the possible realities were

  1. The Verge story is complete bunk - it’s a hoax or a prank or something

  2. The Verge story is legitimate and factually reported - Google’s AI truly achieved a silver medal result

  3. The Verge story is legitimate but inaccurately reported - its author believed that Google’s AI achieved a silver medal result, but this is not the case in reality

The market seemed to be about determining which of these realities we were in. Nothing in the title or description would lead one to believe that reality 3 should resolve YES.

The dispute over timing is not just a quibble - we would not say that a human achieved a silver medal were they allotted the same amount of time as the AI:

Mathematician Timothy Gowers on X:

https://x.com/wtgowers/status/1816509811111940221?s=46

It is not only about the allotted time. The problems were translated into formal language by humans, so the computer received human help with the one step where checking correctness is hard. That help is substantial and likely necessary for the stated performance; even small imprecisions in the translation step would likely render solutions invalid.
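
To make that concrete, here is a toy sketch (my own illustration, not DeepMind's actual formalization) of what a formalized statement looks like in Lean, the language AlphaProof works in. Everything hinges on the translation being faithful: prove a slightly mistranslated statement and you have solved a different problem.

```lean
import Mathlib

-- Toy illustration only (a made-up inequality, not an IMO 2024 problem):
-- an AM-GM-style statement rendered in Lean 4. The kernel checks a proof
-- of exactly this statement; if the human translator had weakened a
-- hypothesis or flipped an inequality during formalization, a verified
-- proof would say nothing about the original natural-language problem.
theorem toy_am_gm (a b : ℝ) : a * b ≤ ((a + b) / 2) ^ 2 := by
  nlinarith [sq_nonneg (a - b)]
```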

This should resolve YES. Some considerations:

  • The question was created in the context of the uncertain veracity of a leak from the Verge. The Verge had indeed leaked embargoed information. This was the basis of my bet on YES.

  • DeepMind’s accomplishment differs from the “human” test conditions in two ways: first, the problems were manually converted to formal language, and second, the system was run longer than the standard test conditions allow. Though some such programs may not be highly parallelizable, both AlphaZero and AlphaProof appear to be based on algorithms parallelizable enough that time-to-solve is just a matter of your compute budget. I don’t think matching the human time limit ought to be required for a YES resolution, however.

  • I think that it is eminently reasonable to say that a computer program which solved the mathematical substance of 4 IMO problems, whose solutions would have merited 28 points at the IMO if converted to natural language, has reached the “Silver medalist level,” or “got Silver” (see the score arithmetic sketched after this list).

  • Headlines in the news are all of the form “DeepMind Gets Silver at IMO.” The nuances of the test conditions were clearly not enough to discourage the media from making this claim.

  • The lack of specification in the resolution criteria should favor a decently liberal reading of whether “Deepmind AI really [got] Silver.” The market was not created to precisely measure some milestone or operationalize a specific test. It was made to gauge the veracity of a rumor which turned out to be true.
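
To make the 28-point figure concrete, here is the score arithmetic behind the "silver-medal standard" headline, as a minimal sketch; the 29-point gold cutoff is the figure reported in DeepMind's announcement, and the rest follows from each IMO problem being worth 7 points.

```python
# Each IMO problem is graded out of 7 points.
POINTS_PER_PROBLEM = 7
problems_solved = 4                   # AlphaProof (3) + AlphaGeometry 2 (1)
score = problems_solved * POINTS_PER_PROBLEM

GOLD_CUTOFF_2024 = 29                 # as reported in DeepMind's announcement

print(score)                          # 28 -- one point below gold
print("gold" if score >= GOLD_CUTOFF_2024 else "silver-medal standard")
```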

Afaict the current meta for letting AI solve IMO-style math problems is to translate the problems into a formal language. When I bought my NO position this morning, I was okay with the AI using a (~trivial) translation into some formal language.

I do care a lot more about compute limits, because in principle I can solve all solvable formal problems by enumerating and verifying all potential proofs, except I just don't have enough compute.
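
That "enumerate and verify" point is literally implementable. A minimal sketch, assuming a sound, decidable proof checker `verifies` (a hypothetical stand-in here; in practice something like Lean's kernel plays that role):

```python
from itertools import count, product

def brute_force_prove(theorem: str, alphabet: str, verifies):
    """Enumerate candidate proofs in order of length and return the first
    one the checker accepts. Halts iff `theorem` has a proof; the only
    obstacle is compute, since runtime grows exponentially in proof length."""
    for length in count(1):                       # ever-longer candidates
        for symbols in product(alphabet, repeat=length):
            candidate = "".join(symbols)
            if verifies(candidate, theorem):      # checking is the cheap step
                return candidate
```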

If I were to nitpick, I'd say that I now believe that Deepmind is able to get silver, but hasn't done it so far.

Tbc, if the algorithm is actually nicely parallelizable, and the total budget to get it done in 3 hours is only 24 times higher than what they spent, then I personally don't care about that nitpick and would prefer if the market resolved YES.
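
The back-of-envelope version of that 24x figure (my arithmetic; the near-linear scaling is exactly the unproven assumption), starting from DeepMind's report that the hardest problem took up to three days of search:

```python
# Compressing ~72 sequential hours into a 3-hour window under near-linear
# strong scaling (an assumption, not a measured property of AlphaProof):
sequential_hours = 72                  # "up to three days" per DeepMind
window_hours = 3
workers_needed = sequential_hours / window_hours
print(workers_needed)                  # 24.0 -- a 24x larger compute budget
```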

(I sold my position at 91%, and expect the market resolves as YES)

Strongly disagree with your first point. If OP meant to ask "Is this leak legit?", then that is what the question should have been. When the question is "Did a Google Deepmind AI really get Silver on IMO 2024?", that includes not only the question of whether Deepmind claims to have done so, but also whether they really did it.

I’m sorry. The word “really” implies skepticism about the currently available information. The original intention of “really” was in reference to the fact that the result was only a rumor, but when new information came out it changed the implication of the title, and I shouldn’t have picked such a word.

I'm not confident in a YES resolution here:

  • “First, the problems were manually translated into formal mathematical language for our systems to understand.” This implies humans manually converted the actual problems. (Even though they had an AI formalizer network?)

  • The system exceeded competition time limits by an order of magnitude

The resolution criteria are ambiguous, but neither the time constraint nor the fact that this was formulated in Lean seems important to the prediction, which is about capability leaps.

In particular, the context for this market is the Verge breaking an embargo, and yes, that was the case.

Which is why I'm not going in more on "no": the risk of resolution disputes.

This question isn't asking if the Verge broke a press embargo (I agree that's near 100% true). It's asking if they got "silver".

Regardless, I don't believe "were able to solve 4 out of 6 problems" is a correct interpretation of what happened -- would you say someone "was able to get an 800 on the math SAT" if they spent 3 days on it?

This market didn’t fall out of a coconut tree. I agree the resolution criteria should always be explicit and unambiguous, but also, like, duh: the Verge was reporting on a real press release, and the rest is cope.