Will I be able to reproduce a NanoGPT 1.6B training run in less than two weeks of wall time on a 6xMI100 node?
Resolved NO on Feb 15.

The MI100 is AMD's datacenter accelerator card from 2020.

I have access to a node with six of them and a single EPYC 7xx3 64c CPU.

Each MI100 supposedly achieves ~200 FP16 TFLOPS, 1/8th of an H100 PCIe, with around half the memory bandwidth (1.2 TB/s vs 2 TB/s).
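As a rough sanity check of whether the run is even plausible on paper, here is a back-of-envelope estimate using the standard ~6·N·D approximation for transformer training FLOPs. The token count and MFU below are my illustrative assumptions, not figures from the market description:

```python
# Back-of-envelope wall-clock estimate via the ~6 * params * tokens FLOPs rule.
# Token budget and MFU are assumptions for illustration, not measured values.
def training_days(params, tokens, gpus, peak_flops_per_gpu, mfu):
    total_flops = 6 * params * tokens          # approximate training compute
    effective_rate = gpus * peak_flops_per_gpu * mfu  # sustained cluster FLOP/s
    return total_flops / effective_rate / 86400       # seconds -> days

# Hypothetical scenario: 1.6B params, ~30B tokens, 6x MI100 at 30% MFU.
days = training_days(1.6e9, 30e9, 6, 200e12, 0.30)
print(f"{days:.1f} days")  # ~9.3 days under these assumptions
```

Under those (optimistic) assumptions the run fits inside 14 days, but the margin disappears quickly if ROCm only sustains a low MFU.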

Hyperbolic recently announced a 16 hour speedrun of modded_nanogpt 1.6B on an 8xH100 PCIe node: https://x.com/Yuchenj_UW/status/1861477701821047287

Can I perform a similar run on this hardware?

AMD's software stack has a poor reputation and I have not historically managed to get anything like the rated FLOPs.

Using FP16 or a similar dtype is allowed; anything smaller than that is not.

Modifications to modded_nanogpt to support AMD/ROCm are allowed.

Resolution Criteria

This market will resolve YES if:

  • I successfully complete a training run of a https://github.com/KellerJordan/modded-nanogpt model with 1.6B parameters to the llm.c baseline validation loss (2.46), in less than 14 days (336 hours) of wall-clock time, on the above hardware, before the resolution date.

The market will resolve NO if:

  • No run taking less than 14 days of wall-clock time is completed before the resolution date.



Best loss achieved on the 6xMI100 box before the close date was 2.6; the requirement to resolve YES was 2.46. The 20241127 run in the screenshot is the Hyperbolic run, included as a benchmark.

The run still in progress at the time of close had only reached a validation loss of ~3.0.

I'm surprised nobody took the opportunity near the start of the market to bet against my success given the generally negative views of ROCm online.


Started a 124M run as a proof of concept today, having finally gotten all the dependencies to compile and (maybe) work with gfx908 support, including hipBLASLt.
This working makes me more hopeful that a 1.6B run will succeed.
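For anyone attempting something similar, a quick smoke test like the sketch below is useful before launching a long run: it checks that the PyTorch build can see the accelerators (on ROCm builds the HIP devices appear under the `cuda` API) and that a GEMM goes through the BLAS backend (rocBLAS/hipBLASLt on gfx908). The sizes are illustrative, and it falls back to CPU so it also runs on a machine without a GPU:

```python
# Smoke test: confirm the PyTorch build sees accelerators and can run a GEMM.
# On a ROCm build this exercises the rocBLAS/hipBLASLt path used in training.
# Matrix size is an illustrative assumption; falls back to CPU if no GPU.
import torch

def gemm_smoke_test(n=1024):
    use_gpu = torch.cuda.is_available()  # True on ROCm builds as well
    dev = "cuda" if use_gpu else "cpu"
    # FP16 GEMM on the accelerator; FP32 on CPU for broad compatibility.
    dtype = torch.float16 if use_gpu else torch.float32
    a = torch.randn(n, n, dtype=dtype, device=dev)
    b = torch.randn(n, n, dtype=dtype, device=dev)
    c = a @ b
    if use_gpu:
        torch.cuda.synchronize()  # surface any async HIP errors here
    return c.shape

if torch.cuda.is_available():
    print("HIP runtime:", torch.version.hip)  # None on non-ROCm builds
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i))
print("GEMM ok:", gemm_smoke_test())
```

If the device list is empty or the matmul raises, fixing that is cheaper than discovering it hours into a 1.6B run.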

1.6B training update with an early run relative wall time chart:

Lab dash with near-real-time status: https://lun.horse/lab

Looks like the current run is a failure: ~10B tokens in, validation loss is still above 2.7.

Insufficient time remains before the (arbitrary) deadline I set on this question for the current attempt to finish, and the last attempt was a failure.
