Will an AI system be able to fully refactor a 10k+ line codebase before 2026 ? | Manifold

Will an AI system be able to fully refactor a 10k+ line codebase before 2026 ?

148

1.1kṀ26k

Dec 31

10% chance


Growing capabilities and longer context lengths in recent AI systems could enable ever more powerful applications for code and IT infrastructure in general.

A full refactoring is a long and intense process that requires a significant amount of skill and knowledge. A good refactoring usually increases the efficiency and readability of a codebase while making further improvements easier.

Refactoring & generation rules

To be considered valid, the AI refactoring should show, in one go: good readability, an efficiency gain (if possible), and harmonized syntax and code structure, with no loss of features or deviation from the specification.
The system would need to deduce everything related to the code, the configuration files, and basically the whole GitHub repo.
Pre-generation user feedback is possible but should be 100% optional and should only concern architecture preferences, naming conventions, and high-level considerations.
Re-running the same input until a valid result is obtained will not count as a success.

Reliability

It would need to have a very high average reliability (~95%+) across various common programming languages (Python, Java, C++, C#, etc.) and libraries.
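The market does not say how the ~95% figure would be measured. As a rough sketch only, assuming each refactoring attempt is recorded as a pass/fail outcome per language, the pooled success rate and a Wilson score lower bound could be computed as below. The function names and the choice of the Wilson interval are assumptions made for illustration, not part of the resolution criteria.

```python
import math
from typing import Dict, List


def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a success proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z * z / trials
    centre = p_hat + z * z / (2 * trials)
    margin = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)
    )
    return (centre - margin) / denom


def meets_reliability(
    results_by_language: Dict[str, List[bool]], threshold: float = 0.95
) -> bool:
    """True if the pooled success rate across all languages reaches the threshold."""
    outcomes = [ok for runs in results_by_language.values() for ok in runs]
    if not outcomes:
        raise ValueError("no outcomes recorded")
    return sum(outcomes) / len(outcomes) >= threshold
```

With only 20 attempts, 19 successes give a point estimate of 95% but a much lower Wilson bound, which is one reason a handful of hand-picked refactors would say little about average reliability.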

Allowed human interactions

Interactions that need administrator privileges and are directly requested by the system, for example for package installation or similar (feedback is possible for this).

Additional

There is one attempt for the final code generation, but internally the system could run as many iterative test loops as needed and use as many external tools as needed.
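To illustrate the distinction between a single final attempt and unlimited internal iteration, here is a minimal sketch. The `propose` and `run_tests` callables are hypothetical stand-ins for whatever model and test harness a real system would use; nothing here reflects an actual tool's API.

```python
from typing import Callable, Optional, Tuple


def refactor_with_internal_loop(
    source: str,
    propose: Callable[[str, str], str],
    run_tests: Callable[[str], Tuple[bool, str]],
    max_iterations: int = 10,
) -> Optional[str]:
    """Iterate internally on candidates; hand back at most one final result.

    propose(code, feedback) returns a new candidate (hypothetical model call).
    run_tests(code) returns (passed, feedback) (hypothetical test harness).
    """
    candidate = source
    feedback = ""
    for _ in range(max_iterations):
        candidate = propose(candidate, feedback)
        passed, feedback = run_tests(candidate)
        if passed:
            # The user only ever sees this single final output.
            return candidate
    # No valid result: under the rules above this is a failed attempt,
    # and the user re-running the same input would not count.
    return None
```

The loop lives entirely inside the system, so from the user's side there is still exactly one submission and one result.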

For resolution

I would prefer not to rely on a single source (including me) for the resolution, which is why I would prefer to use public benchmarks (which of course don't exist yet...).

If none are available, I will go with online forum consensus.


People are also trading

Will an algorithm be able to work on million-line codebases before 2026?

On Dec 31, 2025, will a widely available AI model be able to write a sophisticated 2000 line program?

Will Meta have a "mid-level" AI engineer that can write code by the end of 2025?

Will AI pass the Winograd schema challenge by the end of 2025?

Will an AI model achieve superhuman ELO on Codeforces by the 31 December 2025?

A major tech company, besides Anthropic, reports at least 98% of its code is AI-generated before April 1, 2026

In 2029, will any AI be able to construct "reasonably" bug-free code of >= 10k LOC from a natural language specification? (Gary Marcus benchmark #4)

Will AI be able to write, compile, and unit test a single .c file to reproduce GPT-2 training from PyTorch code by 2026?

Will AI be Recursively Self Improving by mid 2026?

Will an AI system capable of doing 50% of knowledge job arrive by 2027?


I think for any reasonable interpretation of the resolution criteria, Opus 4.5 passes. Still waiting for clarifications

Not even close.

Would you clarify how you're going to measure success?

"The system would need to deduce"

Wdym?

"It would need to have a very high average reliability (~95%+)"

On what distribution of codebases, and what counts as failure? There could be lots of codebases where it's basically impossible (even for a human) to refactor meaningfully and also not break any behavior.

"If not available I will go for online forum consensus."

Which forum?

"while not showing any loss in feature or specification"

This is gonna be a really tricky criterion to evaluate. One reason I'm big into NO is because IME it's super hard for refactored code to have the same behavior in error/edge cases. These aspects of the specification often aren't written in technical documentation but are encoded in tests. So could I propose that the criteria require the original codebases to have 80%+ code coverage in tests, and that the refactored code needs to pass original tests involving public API points?
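The point about error and edge cases can be made concrete with a small sketch. Assuming a public function before and after refactoring, the helper below compares both return values and raised exception types over a list of inputs. The helper names and the two `ratio` functions are hypothetical, invented purely for illustration.

```python
from typing import Any, Callable, List, Sequence, Tuple

Observation = Tuple[str, Any]


def observe(func: Callable[..., Any], args: Sequence[Any]) -> Observation:
    """Record either the return value or the type of exception raised."""
    try:
        return ("ok", func(*args))
    except Exception as exc:
        return ("raised", type(exc).__name__)


def behaviour_differences(
    original: Callable[..., Any],
    refactored: Callable[..., Any],
    cases: List[Sequence[Any]],
) -> List[Tuple[Sequence[Any], Observation, Observation]]:
    """Return every input where the two versions behave differently."""
    diffs = []
    for args in cases:
        before = observe(original, args)
        after = observe(refactored, args)
        if before != after:
            diffs.append((args, before, after))
    return diffs


# Hypothetical example: the refactor "cleans up" an error path and
# silently changes what callers observe on division by zero.
def original_ratio(a: float, b: float) -> float:
    return a / b


def refactored_ratio(a: float, b: float) -> float:
    if b == 0:
        return 0.0
    return a / b
```

Calling `behaviour_differences(original_ratio, refactored_ratio, [(6, 3), (1, 0)])` flags only the `(1, 0)` case, where the original raises `ZeroDivisionError` and the refactored version returns `0.0`. A check like this only covers the inputs someone thought to list, which is why existing tests with high coverage matter for the proposal above.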

GPT-5 + Claude Code can do this. GPT-5 to write the plan and Claude Code to implement it

You just need a system to dump the 10k lines of the repo to a text file and feed that as context to GPT-5. And then the rest is just how you write the prompt and feed the plan to Claude Code
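For reference, the "dump the repo to a text file" step might look roughly like the sketch below. The set of file suffixes, the `=== path ===` separator, and the function name are assumptions chosen for illustration; this says nothing about how the resulting file would be prompted to any particular model.

```python
from pathlib import Path
from typing import Iterable

# Assumed set of source-file suffixes; adjust to the repo at hand.
SOURCE_SUFFIXES = (".py", ".java", ".cpp", ".cs")


def dump_repo(
    root: Path, out_file: Path, suffixes: Iterable[str] = SOURCE_SUFFIXES
) -> int:
    """Concatenate source files under root into out_file; return total line count."""
    wanted = set(suffixes)
    total_lines = 0
    with out_file.open("w", encoding="utf-8") as out:
        for path in sorted(root.rglob("*")):
            if not path.is_file() or path.suffix not in wanted:
                continue
            if ".git" in path.parts:
                continue  # skip version-control internals
            text = path.read_text(encoding="utf-8", errors="replace")
            # Header line so each file's origin survives in the flat dump.
            out.write(f"=== {path.relative_to(root)} ===\n")
            out.write(text if text.endswith("\n") else text + "\n")
            total_lines += len(text.splitlines())
    return total_lines
```

The returned line count gives a quick check of whether a repo actually clears the 10k-line bar in the question. Note that this sketch skips configuration files, which the market description says the system would also need to handle.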

The resolution criteria on this feel too loose to be worth investing a lot in

I believe Claude Code may currently be able to do this with the new Claude 4 Sonnet

@Guillaume Would it be valid, if the codebase started out over 10k lines of code, but ended up significantly less, with all of the other stipulations met?

In my view, it hinges on how many more generations of AI systems we will get. Assuming there's a GPT-6 or equivalent by 2026, it should resolve as yes. That said, two more generations in 1.5 years would require a further acceleration in the pace of progress, which is what I'm actually betting on.

@K4IICHI GPT-5 won’t even be here until more than halfway through 2025 most likely. That’s less than 6 months before this market resolves. If anything GPT-5 is probably coming later than “expected” given the time gap between GPT-3 (pre-ChatGPT) and GPT-4.

Now of course RL/reasoning models are probably most of the explanation for this “delay,” so it’s not like overall progress is slowing, and it may even (as you say) be accelerating. But if GPT-6, or even 5.5, is required for a YES resolution, this should be trading (much) lower. GPT-5, o5, Gemini diffusion, some other lateral breakthrough, etc. is what gives a chance.

bought Ṁ40 YES

I bought YES as a hedge, at least if I'm unemployed I've made some mana.

bought Ṁ100 NO

Buying up NO as the conditions specified by the author seem highly unrealistic at this point.

bought Ṁ5 YES

@nsokolsky What's an example of one such condition? To me, all of them seem likely before 2025, never mind 2026.

@12c498e “95% average reliability” is one of them. I use GPT-4 daily and it’s maybe 80% accurate on the average task, much less so for ambiguous and abstract queries. What OP describes is such an advanced system that it would effectively result in 90% of software engineering jobs getting eliminated overnight.

How would this be tested? Will any example of a refactor be sufficient (in which case I’m sure an example can be contrived already for Gemini Ultra)? Or will you be picking random GitHub repos with 10k lines and asking for a refactor?
Does the code have to compile and run without any human intervention? Or will human intervention be acceptable - and if so, how many lines can humans change for this to count as YES?
Does “one go” mean there’s only 1 attempt in total with no feedback? Does this also mean re-runs of the same input until a valid result is obtained are not acceptable?
If the AI system runs the code on its own and keeps on doing refactoring until it compiles (latest GPT-4 can do this for Python), does this count as “one shot”?

Yes, of course a single lucky refactor would not suffice. It would need to have a very high average reliability (~95%+) across various common programming languages (Python, Java, C++, C#, etc.) and libraries. I would prefer not to rely on a single source (including me) for the resolution, which is why I would prefer to use public benchmarks (which of course don't exist yet...). If none are available, I will go with online forum consensus.
There will be no tolerance for human modification of the output code: the system would need to deduce everything related to the code, the configuration files, and basically the whole GitHub repo (you can see this as a full repo generation). The only human intervention allowed is interaction that needs administrator privileges and is directly requested by the system, for example for package installation or similar (feedback is possible for this).
Pre-generation user feedback is possible but should be 100% optional and should only concern architecture preferences, naming conventions, and high-level considerations. Re-running the same input until a valid result is obtained will not count as a success.
There is one attempt for the final code generation, but internally the system could run as many iterative test loops as needed and use as many external tools as needed.

@nsokolsky I use Aider professionally for much of what the question states as criteria. I'm not sure the current version really does all of it, because I haven't had such a use case, but Aider + Tree-sitter + Git does come close to it. I recommend you have a look: https://github.com/paul-gauthier/aider?tab=readme-ov-file#example-chat-transcripts It's a nice tool!

@Magnus_ It seems like a piecemeal tool. I'm pretty sure it would fail for any reasonably big project, given that OP requested a 95% success rate at one-shot performance.

bought Ṁ25 YES

Can't Gemini Ultra already do this?

@Pykess I don't know if you can connect tools like Aider to Gemini, but GPT-4 does quite a good job when it has access to Tree-sitter data for your repository.
