When I test my vibe-coded Rust rewrite of Beagle, an industry-standard SoTA genotype imputation program, how will the accuracy change?
Accuracy will be measured by R^2 (estimated dosage vs. actual dosage) on the 1KG+HGDP samples (80/20 test/train split), with the test set downsampled to microarray markers.
The R^2 of Beagle is already quite good, usually a bit above 0.8 for microarray-style data. The "%" in the question refers to the absolute difference in R^2 between standard Beagle and the rewritten version (e.g., if Beagle has R^2 = 0.80 and the rewrite has R^2 = 0.78, that counts as 2% worse).
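For concreteness, here is a minimal sketch (in Rust, since that's the language of the rewrite) of the metric as I understand it: the squared Pearson correlation between imputed and true alt-allele dosages at the masked markers. The function name and the flattened per-marker handling are my own illustration, not the actual evaluation code.

```rust
/// Squared Pearson correlation (R^2) between imputed and true allele dosages.
/// Sketch only: assumes dosages (0..=2 copies of the alt allele) have already
/// been extracted for the masked markers.
fn dosage_r2(estimated: &[f64], actual: &[f64]) -> Option<f64> {
    assert_eq!(estimated.len(), actual.len());
    let n = estimated.len() as f64;
    if n == 0.0 {
        return None;
    }
    let mean_e = estimated.iter().sum::<f64>() / n;
    let mean_a = actual.iter().sum::<f64>() / n;
    let (mut cov, mut var_e, mut var_a) = (0.0, 0.0, 0.0);
    for (&e, &a) in estimated.iter().zip(actual) {
        cov += (e - mean_e) * (a - mean_a);
        var_e += (e - mean_e).powi(2);
        var_a += (a - mean_a).powi(2);
    }
    if var_e == 0.0 || var_a == 0.0 {
        return None; // monomorphic: R^2 undefined
    }
    Some((cov * cov) / (var_e * var_a))
}

fn main() {
    // Hypothetical dosages at a handful of masked marker/sample pairs.
    let estimated = [0.1, 0.9, 1.8, 0.0, 1.1];
    let actual = [0.0, 1.0, 2.0, 0.0, 1.0];
    println!("R^2 = {:.3}", dosage_r2(&estimated, &actual).unwrap());
}
```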
Vibe-coded definition:
- 100% of the LoC written by AI (I can still edit config files like .gitignore or Cargo.toml).
- I will ask the AIs broad, open-ended questions (like "find logic mistakes" or "figure out why X happens"), usually to enable communication between Claude and Gemini. I can also ask them to complete tasks, like "add speed benchmarks" or "add difficult integration tests that measure XYZ."
- I'm not planning on reading all of the code, though I'll skim the diffs.
- So far, I've been mostly using Gemini 3 Pro for review and Opus 4.5 for implementation, but I am free to use any AI tool (e.g. Codex).
Feel free to ask questions to gain more information.