In 2029, will any AI be able to construct "reasonably" bug-free code of >= 10k LOC from a natural language specification? (Gary Marcus benchmark #4)

82% chance

The fourth question from this post: https://garymarcus.substack.com/p/dear-elon-musk-here-are-five-things

The full text is: "In 2029, AI will not be able to reliably construct bug-free code of more than 10,000 lines from natural language specification or by interactions with a non-expert user. [Gluing together code from existing libraries doesn’t count.]"

Judgment will be by me, not Gary Marcus.

Ambiguous whether this means start or end of 2029, so I have set it for the end.

For this question I am not using the exact text of the question, because I think "bug-free" is (1) silly and (2) untestable. I will instead accept if it produces code of >= 10k LOC with no more bugs than an implementation by a human. (Trading off many small bugs against a few significant bugs will unfortunately come down to my subjective impression of which is "better".)

I am also ignoring the "no gluing libraries together" requirement, because I don't know what he means. Does he want an AI that writes 10k LOC of assembly? I will accept code that is calling/using libraries at <= the rate that normal human programmers do.
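
As a rough illustration, the acceptance rule described above could be sketched as follows. This is only a sketch under stated assumptions: the `Implementation` record, its fields, and the calls-per-line measure of library usage are hypothetical and not part of the question. The subjective weighing of small bugs against significant bugs is not modeled.

```python
from dataclasses import dataclass

MIN_LOC = 10_000  # threshold taken from the question text


@dataclass
class Implementation:
    """Hypothetical summary of one implementation of the same specification."""
    loc: int            # lines of code written by the author (AI or human)
    bugs: int           # bugs found by whatever review process is used
    library_calls: int  # calls into third-party libraries


def library_call_rate(impl: Implementation) -> float:
    """Library calls per line of code; an assumed, crude proxy for 'gluing'."""
    return impl.library_calls / impl.loc if impl.loc else 0.0


def meets_criteria(ai: Implementation, human: Implementation) -> bool:
    """True if the AI output is long enough, has no more bugs than the human
    version, and leans on libraries no more heavily than the human version."""
    return (
        ai.loc >= MIN_LOC
        and ai.bugs <= human.bugs
        and library_call_rate(ai) <= library_call_rate(human)
    )


if __name__ == "__main__":
    ai = Implementation(loc=12_000, bugs=14, library_calls=600)
    human = Implementation(loc=11_000, bugs=15, library_calls=660)
    print(meets_criteria(ai, human))
```

In practice the bug counts would come from human review or a benchmark, and the final call remains a judgment rather than a mechanical check.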

Sep 16, 2:26pm: Some additional clarifications:

If there was a benchmark that, for instance, compared human to AI code, allowed both to ask follow up questions about the initial natural language prompt, allowed tests, allowed multiple submissions, etc. (so roughly the workflow of "human consultant is hired to write a ~10k LOC project") I will accept that.
If there's an agent that can pass this for some "typical" coding tasks but not for highly-specialized tasks (e.g. it can write a website, a data structure library, or implement some standard ML workflows but can't write highly secure code or an efficient optimizing compiler) I will accept that.
To frame it another way: if it can write small-medium projects that a median FAANG coder can do, but not projects that only FAANG coders who implement research-level code can do, I will accept that. (To be clear, I don't mean "research-level quality"; I mean "production/industry quality, research-level difficulty/complexity".)

I think it will happen in 2026 or 2027

My sense is that "no gluing together libraries" just means that the LOC in the libraries used don't count towards the total LOC. As long as the AI writes 10K original lines of code, I think it should be fine.

https://threadreaderapp.com/thread/1618426056163356675.html

I think it's plausible an AI system could implement something like a simple programming language / compiler in the next decade. Programming language codebases are fairly long and simple and don't require super abstract thinking, plus there's a lot of them out there to learn from.

No chance.

This is about eight levels beyond the “Turing test” everyone is so skeptical about.

10k LOC is a very serious system.

(Note: Advent of Code solutions are under 100 lines, as are algo-interview problems; those will be doable half a decade before this one.)

I think my bet here might depend on the prompts that would be asked.

Ex: "Make a [common data structure] for this set of data." vs "Make an art agnostic 2D platformer with similar psychics to Mario."

Physics... not psychics autocorrect. We really need the ability to edit comments!

@SneakySly I am not sure if a 90th percentile coder could implement something like Mario. It seems reasonably likely to me that they can, and if so I would require the AI to be able to do that.

@vluzko Fair. I guess this limitation is really a lack of specificity of Gary's - but I wish we had some example prompts!

This question still seems too vague and underspecified. Depending on the task, programming 10k lines can be nearly trivial or incredibly difficult.

That's true. I think the spirit of the question is "will an AI be 90th percentile human coder level on small-medium sized programming projects". I will add some stuff to the description to try to clarify.