
LLMs such as GPT-4 are good at many programming subtasks, but cannot yet fully replace programmers. As a proxy for fully replacing junior programmers, this question asks when an AI system will be able to correctly implement and test a complex feature in a large codebase from a detailed spec. The codebases in question are Python interpreters, and the specs are Python Enhancement Proposals (PEPs), the formal specifications used to add new features to the language.
This question resolves true if, in 2024 or earlier, an AI system produces a correct and tested patch to a Python implementation from the text of a PEP. No implementation of that PEP may have been published before the training cutoff of any LLM involved. The PEP must be in the top 60% of PEPs by implementation difficulty, as judged by me.