Given a description of a backend (code and/or specification) and a general description of a front-end (e.g. a list of essential features, desiderata, style, etc.), an AI code generator should be able to generate front-end code in a widely used programming language such as JS or TS, based on a widely used framework such as React or Vue.
Criteria:
code must be written completely by AI, with no human intervention besides providing relevant information about the backend and a general description of the front-end
should work with any kind of web front-end task a senior front-end / full-stack software engineer is expected to be able to implement
code must conform to the specification and meet the quality standards of an expert human senior developer proficient in the given language and problem domain
front-end must work according to commonly accepted UI/UX standards
at least 2000 non-trivial lines of code
a task is considered to be performed successfully when all important parts of functionality are implemented; minor cosmetic defects or missing niceties are acceptable
a code generator should have a success rate of at least 80%
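One possible way to read the measurable part of these criteria is sketched below. The `TaskResult` shape and its field names are hypothetical, and treating the 2000-line minimum as a per-task requirement is an assumption. Code quality and UI/UX would still need human judgement, so they are not modelled here.

```typescript
// Hypothetical record of one generation attempt; field names are illustrative only.
interface TaskResult {
  nonTrivialLoc: number;                // non-trivial lines of generated code
  importantFeaturesTotal: number;       // important features in the description
  importantFeaturesImplemented: number; // important features that actually work
  humanInterventions: number;           // edits made by a human after generation
}

const MIN_LOC = 2000;           // "at least 2000 non-trivial lines of code"
const MIN_SUCCESS_PERCENT = 80; // "at least 80% success rate"

// A task succeeds when every important feature is implemented, the code is
// large enough, and no human touched it. Minor cosmetic defects are ignored.
function taskSucceeded(r: TaskResult): boolean {
  return (
    r.humanInterventions === 0 &&
    r.nonTrivialLoc >= MIN_LOC &&
    r.importantFeaturesImplemented >= r.importantFeaturesTotal
  );
}

// The generator qualifies when at least 80% of attempted tasks succeed.
// Integer arithmetic avoids floating-point comparisons at the threshold.
function meetsCriteria(results: TaskResult[]): boolean {
  if (results.length === 0) return false;
  const successes = results.filter(taskSucceeded).length;
  return successes * 100 >= MIN_SUCCESS_PERCENT * results.length;
}
```

Under this reading, 8 successes out of 10 attempted tasks would qualify and 7 out of 10 would not.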
Hi. I resolved it at 25% due to a high degree of uncertainty and lack of clear benchmarks.
What we know:
many LLMs are able to make React components with a good rate of success (e.g. v0, Claude)
multi-step AI workflows are still in the 'prototype' phase
Devin is offered as a "junior" developer, so there's no expectation of reliability
recent SWE-bench results look good, but it's not clear whether they reflect the ability to build web front-ends
o3's SWE-bench score is 71%, roughly within range of the threshold
I think this indicates we are on the edge of this capability, and a clear Yes/No resolution would be wrong. The percentage is closer to 'No' because SWE-bench results are still significantly below 80%.
@AlexMizrahi the question was not "are we close?" but "are we there yet?" and it's definitely right to answer NO to the latter.
@AlexMizrahi In addition to the direct point made by @PierreThierry, the market trading incentives are very different for resolutions to a percentage. It would be right to bet a lot down from 25% to 0% in a binary market even if that would lose a lot of mana in a percentage market.
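To make the incentive difference concrete, here is a minimal sketch of per-share profit for a NO bet under each kind of resolution. It assumes a simplified model: shares are bought at a fixed price equal to the market probability, with no fees and no price movement from the trade itself. The function names are made up for illustration.

```typescript
// Simplified model (assumption): fixed share price, no fees, no slippage.
type Resolution =
  | { kind: "YES" }
  | { kind: "NO" }
  | { kind: "PROB"; p: number }; // resolved to a percentage, p in [0, 1]

// Payout of one NO share: 1 on NO, 0 on YES, (1 - p) on a percentage resolution.
function noSharePayout(res: Resolution): number {
  if (res.kind === "NO") return 1;
  if (res.kind === "YES") return 0;
  return 1 - res.p;
}

// Profit per NO share bought while the market shows probability `marketProb`.
function noShareProfit(marketProb: number, res: Resolution): number {
  const price = 1 - marketProb; // cost of one NO share in this model
  return noSharePayout(res) - price;
}

// Buying NO at 5% wins about 0.05 per share if the market resolves NO...
const ifResolvedNo = noShareProfit(0.05, { kind: "NO" });
// ...but loses about 0.20 per share if it resolves to 25% instead.
const ifResolvedTo25 = noShareProfit(0.05, { kind: "PROB", p: 0.25 });
console.log(ifResolvedNo.toFixed(2), ifResolvedTo25.toFixed(2));
```

In this model anyone who pushed the price below 25% expecting a NO resolution ends up with a loss on those shares, even though the bet would have been profitable under a binary resolution.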
@Jacy I'm actually surprised that a market can be modified that way at the end. In a prediction market with actual money, I bet that would be illegal.
@PierreThierry Are you certain that e.g. Devin doesn't have this capability?
From a Bayesian perspective, getting close to 100% certainty costs resources. It's not free. You can pay somebody money to make such an eval, or we keep it uncertain.
Note that this market was created before SWE-bench existed. It would be a lot easier to reference a benchmark than to define custom criteria.
@AlexMizrahi I have asked dozens of times for people who claim that it is possible to show some evidence, and so far I've been met with the same kind of reactions you get from flat earthers and antivaxxers: "it's easy to try, do it yourself and you'll see" or "I won't do your research for you".
The only exception was one guy showing impressive resulting code, but when I pressed him for the prompts and tried them, they produced nothing like what he was showing off.
And that's just for creating an application that does something on its own. A major part of creating a web application for a software engineer is interfacing it with a known API according to its specification and I've seen funny stuff on that front.
@Tulip Yes, it is a binary market, but we cannot definitively say whether there's an AI that meets the criteria because (1) nobody has made a webdev-specific benchmark (let alone one matching the description); (2) o3 shows very good results on SWE-bench but we can't test it.
Thus I think partial resolution is fair here, as it reflects uncertainty about the event itself.
"should work with any kind of a web front-end task a senior front-end / full-stack software engineer is expected to be able to implement"
Senior front-end / full-stack software engineers are occasionally expected to be able to implement literally impossible tasks, so this arguably should already resolve NO.
https://v0.dev/ might be the most advanced I know of, but it's still a long way from making this question resolve YES.
Could you give an upper / lower bound for scale, apart from LOC?
Like, is a random note-saving app without editing, deletion, or sign-in the lower bound? Would the same with sign-in count? Or would a ToDo list app with edits, deletes, and sign-in be the lower bound? Or something like a ToDo list app with tags and reordering and recurring items?
@1a3orn The question is basically "Can AI replace front-end web devs?", so the lower bound should be similar to what people develop commercially; sign-in and so on are required.
Something similar to a TODO list app as you described might be good for a lower bound, except that an actual TODO list app won't qualify because it's a common tutorial topic so we won't be able to tell which parts are just copied from the training set.
@vluzko Not particularly. I know that people have had some success with GPT-4, but the initially released model with 8k context is definitely not sufficient because of its limited context size. (Although maybe somebody can prompt it to generate a front-end piece by piece, AFAIK nobody has succeeded in doing that yet.) Is 32k context enough? It's a bit hard to check, as detailed descriptions of the back-end of a non-trivial app are hard to find.
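A rough back-of-the-envelope check of the context question can be sketched as below. The figure of about 10 tokens per line of code and the 4000-token backend description are illustrative assumptions, not measurements. The sketch also assumes the prompt and the whole generated front-end must fit in one context window.

```typescript
// Assumption: about 10 tokens per line of front-end code (illustrative only).
const ASSUMED_TOKENS_PER_LINE = 10;

// Estimated token count for a given number of lines of generated code.
function estimateTokens(lines: number, tokensPerLine: number = ASSUMED_TOKENS_PER_LINE): number {
  return lines * tokensPerLine;
}

// True when the prompt plus the full generated output fit in a single context window.
function fitsInContext(outputLines: number, promptTokens: number, contextTokens: number): boolean {
  return promptTokens + estimateTokens(outputLines) <= contextTokens;
}

// Hypothetical prompt size for a backend description: 4000 tokens.
const promptTokens = 4000;
console.log(fitsInContext(2000, promptTokens, 8192));  // 8k context
console.log(fitsInContext(2000, promptTokens, 32768)); // 32k context
```

Under these assumptions, 2000 lines come to about 20,000 tokens, which overflows an 8k window on their own and fits a 32k window only if the backend description stays under roughly 12,000 tokens.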