Will the first AI model that receives a score of 75 or higher on Humanity's Last Exam be capable (with agent scaffolding) of replacing a software engineer?
Resolves based on my personal judgement, in particular whether it is cost- and time-effective for ZeroPath to use it to replace one of our engineers (or enable us to accomplish the same amount with fewer people). Example tasks it should be capable of:
"Fix this error we're getting on BetterStack."
"Move our Redis cache from DigitalOcean to AWS."
"Add and implement a cancellation feature for ZeroPath scans."
"Add the results of this evaluation to our internal benchmark."
I will not be betting, but let it be known that I am pessimistic about the state of current evals.
Update 2025-03-17 (PST) (AI summary of creator comment): Clarification on team size reduction:
"Fewer people" in the parenthetical ("or enable us to accomplish the same amount with fewer people") is defined to mean eighty percent of the original team size. This means the AI could allow us to do the same amount of engineering work with 20% fewer people than was possible in March 2024.
Is there a reason this market closes EoY? I'm not expecting HLE to be "saturated," nor a drop-in remote software engineer AI to be achieved, by the closing date.
Although I just realized it might be saturated by some kind of insanely expensive internal system like o5, which would imply I should vote no, since that wouldn't be cost-effective (nor likely publicly available) to use for software development. Also, OpenAI would totally break the bank over that sort of thing because of hype/sustaining more Stargate investment money.
Initial progress on HLE is partially due to regression to the mean, since the benchmark was created by excluding questions that then-current systems could answer. (Any equivalent system created later would by random chance get more questions right.) There has clearly been real progress, though.
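To make the selection effect concrete, here is a toy Monte Carlo sketch (my own illustration; the uniform difficulty distribution is an assumption, not anything from HLE's actual construction). Two statistically identical models are simulated, and the benchmark keeps only the questions the first one missed:

```python
import random

random.seed(0)
N = 100_000

kept = []  # solve probabilities of questions that survive filtering
for _ in range(N):
    p = random.random()       # latent per-question solve probability
    if random.random() >= p:  # model A answers wrong -> question is kept
        kept.append(p)

# Model B is statistically identical to A, yet scores ~33% on the
# filtered set purely by chance: E[p | A wrong] = (1/6)/(1/2) = 1/3.
b_score = sum(random.random() < p for p in kept) / len(kept)
print(f"Model B: {b_score:.1%} on questions model A missed")
```

The exact number depends on the assumed difficulty distribution, but any filtering on one model's misses guarantees that an equally capable later model scores well above zero.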
The benchmark in general is mostly knowledge-based rather than reasoning-based, and relies on very niche and obscure facts, so a boxed superintelligence (if there were such a thing) still might not get 75%.
Any correlation between niche subject knowledge and software engineering ability is incidental. A broadly more advanced model will also be more advanced at software engineering. I can be confident of this because current models are already very close to that threshold.
I do not expect HLE to saturate this year (unless the holdout set is leaked), but I do expect an AI capable of fully replacing senior software engineers to be demoed in Q3 this year.
@Haiku Why do you expect an AI capable of fully replacing senior software engineers this year? GPT-5, as ~[o4 based on a GPT-4.5/mixture-of-models architecture], doesn't seem likely to do that, and I don't see any particular reason to expect a substantially better AI system than GPT-5 this year (always possible, ofc).
Obstacles seem to include: (much?) longer context length, better long-range chat memory aside from context, much better use of context when a substantial portion of the limit is filled, fewer hallucinations, much better calibration on implied or explicit confidence in outputs, better error correction, and ofc generally better reasoning/learning/intelligence. All while keeping inference costs significantly below what OpenAI spent to get their initial best scores on HLE & FrontierMath with o3.
Obviously there will be multiple models making significant progress on most/all of those limitations, but it seems unlikely that any one model, such as GPT-5, gets sufficiently better at all of them by end of year while staying within affordable compute cost constraints.
@DavidHiggs You make some very good points. I should expect context length / memory to be the sticking point this year.
I'm going to clarify "fewer people" to mean "sixty percent of the people". It'd be possible for mundane developer tools to shrink our team size a little bit, and the point of this question is to see whether or not passing HLE means you get something close to a drop-in remote worker you can send arbitrary issues and tasks to resolve.
@MilfordHammerschmidt That bar is not consistent with the prior criteria. One-shotting Claude with a prompt you can inspect for neutrality suggests 10-15% while maintaining the current productivity of your team.
https://claude.ai/share/e0439f38-c1d0-4fad-900f-e2cd958c815a
Generally these types of clarifications need to be made before substantial volume has been bet on the question. Not to mention, your prior criteria were already specific. An AI capable of:
"Fix this error we're getting on BetterStack."
"Move our Redis cache from DigitalOcean to AWS."
"Add and implement a cancellation feature for ZeroPath scans."
"Add the results of this evaluation to our internal benchmark."
is already clearly "something close to a drop-in remote worker you can send arbitrary issues and tasks to resolve" and not just a "mundane developer tool." Further, there are plenty of reasons you might not cut your team in practice despite AI, none of which speak to whether there is AI that is "cost- and time-effective for ZeroPath to use it to replace one of our engineers." For example, maybe your engineers have become much more productive and the ambitions of your company have grown in parallel.
Why replace your preexisting and specific operationalization (the one 20 traders have bet 6,000 mana on) with a new requirement that is frankly not consistent with the prior language of the resolution? Just strike it or ask a new question.
@MilfordHammerschmidt Fair enough, but the point about holding productivity constant should still stand. If you don't make hires you would otherwise have made because your team is 25% more productive with access to a drop-in agentic AI model than it would be with only "mundane developer tools," it should probably count. In practice I hope it's clear once we've reached this point.
@AdamK I'm expecting the dataset to leak and the test to become invalid as it gets included in training data, as has happened with every other benchmark.
@AdamK I made another comment with more detail, but basically I think the most likely way HLE is saturated this year is some kind of prohibitively expensive internal model like o5 (or whatever is ~1 reasoning-model generation more advanced than GPT-5). Either you couldn't use it as a software engineer, or it wouldn't be worth the cost.
Humanity's Last Exam is a small set of questions (~2.5k last I checked). It will probably leak online over time, and, being small, is easily memorized. To score >75, a model doesn't need to be any better than models that exist today; it just needs to have seen those answers.
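A back-of-envelope calculation makes this concrete (both numbers below are illustrative assumptions, not measurements):

```python
# Expected score from leakage alone: memorized questions are answered
# perfectly, the rest at the model's baseline rate.
baseline = 0.25  # assumed score on genuinely unseen HLE-style questions
leaked = 0.67    # assumed fraction of the ~2.5k questions seen in training

expected = leaked + (1 - leaked) * baseline
print(f"Expected score: {expected:.0%}")  # -> 75%, with zero capability gain
```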
I'd note that the glamorous name "Humanity's Last Exam" wasn't, like, voted on by a group of experts. The people who published it just decided to name it that. The name shouldn't carry any more weight than if I decided to name my YouTube channel "humanity's greatest gamer".
@Bayesian If the resolution criteria focus on whether such a model is "capable of" being a software engineer, it seems like it wouldn't need to be publicly available. For instance, a model that saturates HLE might be broadly understood to be a ~Pareto improvement over another model that already meets the criterion of substituting for a software engineer.
@AdamK But as you imply, demonstration of sufficient ability to ~replace a software engineer is a high bar of evidence and will likely only come from a publicly available model, or from some kind of extensive and unusual inside information.