When I test my vibe-coded Rust rewrite of Beagle, an industry-standard SoTA genotype imputation program, how will the accuracy change?
Accuracy will be measured by R^2 (estimated dosage vs. actual dosage) on the 1KG+HGDP samples (80/20 test/train split), with the test set downsampled to microarray markers.
The R^2 of Beagle is already quite good, usually a bit above 0.8 for microarray-style data. The "%" in the question refers to the absolute difference in R^2 between standard Beagle and the rewritten version (e.g., if Beagle has R^2 = 0.80 and the rewrite has R^2 = 0.78, that counts as 2% worse).
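For concreteness, here is a minimal sketch (in Rust, since that's the language of the rewrite) of the metric as I understand it: the squared Pearson correlation between imputed and true alt-allele dosages at the masked markers. The function name and the flattened per-marker handling are my own illustration, not the actual evaluation code.

```rust
/// Squared Pearson correlation (R^2) between imputed and true allele dosages.
/// Sketch only: assumes dosages (0..=2 copies of the alt allele) have already
/// been extracted for the masked markers.
fn dosage_r2(estimated: &[f64], actual: &[f64]) -> Option<f64> {
    assert_eq!(estimated.len(), actual.len());
    let n = estimated.len() as f64;
    if n == 0.0 {
        return None;
    }
    let mean_e = estimated.iter().sum::<f64>() / n;
    let mean_a = actual.iter().sum::<f64>() / n;
    let (mut cov, mut var_e, mut var_a) = (0.0, 0.0, 0.0);
    for (&e, &a) in estimated.iter().zip(actual) {
        cov += (e - mean_e) * (a - mean_a);
        var_e += (e - mean_e).powi(2);
        var_a += (a - mean_a).powi(2);
    }
    if var_e == 0.0 || var_a == 0.0 {
        return None; // monomorphic: R^2 undefined
    }
    Some((cov * cov) / (var_e * var_a))
}

fn main() {
    // Hypothetical dosages at a handful of masked marker/sample pairs.
    let estimated = [0.1, 0.9, 1.8, 0.0, 1.1];
    let actual = [0.0, 1.0, 2.0, 0.0, 1.0];
    println!("R^2 = {:.3}", dosage_r2(&estimated, &actual).unwrap());
}
```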
Vibe-coded definition:
- 100% of the LoC written by AI (I can still edit config files like .gitignore or Cargo.toml).
- I will ask the AIs broad, open-ended questions (like "find logic mistakes" or "figure out why X happens"), usually to enable communication between Claude and Gemini. I can also ask them to complete tasks, like "add speed benchmarks" or "add difficult integration tests that measure XYZ."
- I'm not planning on reading all of the code, though I'll skim the diffs.
- So far, I've been mostly using Gemini 3 Pro for review and Opus 4.5 for implementation, but I am free to use any AI tool (e.g. Codex).
Feel free to ask questions to gain more information.