Will the first AI model that receives a score of 75 or higher be capable (with agent scaffolding) of replacing a software engineer?
Resolves based on my personal judgement, in particular whether it is cost- and time-effective for ZeroPath to use it to replace one of our engineers (or accomplish the same amount with fewer people). Example tasks it should be capable of:
"Fix this error we're getting on BetterStack."
"Move our Redis cache from DigitalOcean to AWS."
"Add and implement a cancellation feature for ZeroPath scans."
"Add the results of this evaluation to our internal benchmark."
I will not be betting, but let it be known that I am pessimistic about the state of current evals.
Update 2025-03-17 (PST) (AI summary of creator comment): Clarification on team size reduction:
"Fewer people" in the parenthetical (or enable us to accomplish the same amount with fewer people) is defined to mean eighty percent of the original team size. This means the AI could allow us to do the same amount of engineering work with 20% fewer people than was possible in March 2024.
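As a rough illustration of the eighty-percent threshold, here is a minimal Python sketch. The function name and the headcounts are hypothetical, and it assumes the threshold is read as "at most eighty percent of the original headcount"; the actual resolution still rests on the creator's judgement.

```python
def meets_team_size_threshold(original_size: int, reduced_size: int, percent: int = 80) -> bool:
    """Return True if reduced_size is at most `percent` percent of original_size.

    Hypothetical helper: reads the eighty-percent rule as an upper bound
    on the headcount needed to do the same amount of engineering work.
    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    if original_size <= 0 or reduced_size < 0:
        raise ValueError("team sizes must be positive (original) and non-negative (reduced)")
    return reduced_size * 100 <= original_size * percent


# Illustrative headcounts only, not ZeroPath's actual team size.
print(meets_team_size_threshold(10, 8))  # 8 of 10 is exactly 80%
print(meets_team_size_threshold(10, 9))  # 9 of 10 is 90%, above the threshold
```

Under this reading, a team of 10 would need to be able to do the same work with 8 or fewer people.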