Will I be able to reproduce a NanoGPT 1.6B training run in less than two weeks of wall time on a 6xMI100 node?
Resolved NO on Feb 15.

The MI100 is AMD's datacenter accelerator card from 2020.

I have access to a node with six of them and a single EPYC 7xx3 64c CPU.

Each MI100 supposedly achieves ~200 FP16 TFLOPS, 1/8th of an H100 PCIe, with around half the memory bandwidth (1.2 TB/s vs 2 TB/s).
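As a rough sanity check of whether the run is even plausible on paper, here is a back-of-envelope estimate using the standard ~6·N·D approximation for transformer training FLOPs. The token count and MFU below are my illustrative assumptions, not figures from the market description:

```python
# Back-of-envelope wall-clock estimate via the ~6 * params * tokens FLOPs rule.
# Token budget and MFU are assumptions for illustration, not measured values.
def training_days(params, tokens, gpus, peak_flops_per_gpu, mfu):
    total_flops = 6 * params * tokens          # approximate training compute
    effective_rate = gpus * peak_flops_per_gpu * mfu  # sustained cluster FLOP/s
    return total_flops / effective_rate / 86400       # seconds -> days

# Hypothetical scenario: 1.6B params, ~30B tokens, 6x MI100 at 30% MFU.
days = training_days(1.6e9, 30e9, 6, 200e12, 0.30)
print(f"{days:.1f} days")  # ~9.3 days under these assumptions
```

Under those (optimistic) assumptions the run fits inside 14 days, but the margin disappears quickly if ROCm only sustains a low MFU.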

Hyperbolic recently announced a 16 hour speedrun of modded_nanogpt 1.6B on an 8xH100 PCIe node: https://x.com/Yuchenj_UW/status/1861477701821047287

Can I perform a similar run on this hardware?

AMD's software stack has a poor reputation and I have not historically managed to get anything like the rated FLOPs.

Using FP16 or a similar dtype is allowed; anything smaller than that is not.

Modifications to modded_nanogpt to support AMD/ROCm are allowed.

Resolution Criteria

This market will resolve YES if:

  • I successfully complete a training run of a https://github.com/KellerJordan/modded-nanogpt model with 1.6B parameters to the llm.c baseline validation loss (2.46), in less than 14 days (336 hours) of wall-clock time, on the above hardware, before the resolution date.

The market will resolve NO if:

  • No run taking less than 14 days of wall-clock time is completed before the resolution date.



Best loss achieved on the 6xMI100 box before the close date was 2.6; the requirement to resolve YES was 2.46. The 20241127 run in the screenshot is the Hyperbolic run, included as a benchmark.

The run still in progress at the time of close had only reached a validation loss of ~3.0.

I'm surprised nobody took the opportunity near the start of the market to bet against my success given the generally negative views of ROCm online.


Started a 124M run as a proof of concept today, having finally gotten all the dependencies to compile and (maybe) work with gfx908 support, including hipBLASLt.
This working makes me more hopeful that a 1.6B run will succeed.
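For anyone attempting something similar, a quick smoke test like the sketch below is useful before launching a long run: it checks that the PyTorch build can see the accelerators (on ROCm builds the HIP devices appear under the `cuda` API) and that a GEMM goes through the BLAS backend (rocBLAS/hipBLASLt on gfx908). The sizes are illustrative, and it falls back to CPU so it also runs on a machine without a GPU:

```python
# Smoke test: confirm the PyTorch build sees accelerators and can run a GEMM.
# On a ROCm build this exercises the rocBLAS/hipBLASLt path used in training.
# Matrix size is an illustrative assumption; falls back to CPU if no GPU.
import torch

def gemm_smoke_test(n=1024):
    use_gpu = torch.cuda.is_available()  # True on ROCm builds as well
    dev = "cuda" if use_gpu else "cpu"
    # FP16 GEMM on the accelerator; FP32 on CPU for broad compatibility.
    dtype = torch.float16 if use_gpu else torch.float32
    a = torch.randn(n, n, dtype=dtype, device=dev)
    b = torch.randn(n, n, dtype=dtype, device=dev)
    c = a @ b
    if use_gpu:
        torch.cuda.synchronize()  # surface any async HIP errors here
    return c.shape

if torch.cuda.is_available():
    print("HIP runtime:", torch.version.hip)  # None on non-ROCm builds
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i))
print("GEMM ok:", gemm_smoke_test())
```

If the device list is empty or the matmul raises, fixing that is cheaper than discovering it hours into a 1.6B run.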

1.6B training update with an early run relative wall time chart:

Lab dash with near-real-time status: https://lun.horse/lab

Looks like the current run is a failure: ~10B tokens in, validation loss is still above 2.7.

Insufficient time remains before the (arbitrary) deadline I set on this question for the current attempt to finish, and the last attempt was a failure.
