Summary: We trained a SmolLM 3B-parameter model to 100B tokens using TorchTitan with FP8 and hit 41.4k tokens/sec/GPU and 45% MFU. That is a 3x throughput boost and roughly 53% higher MFU (29.4% → 45%) over the original Nanotron baseline. This is the fastest publicly known training run for a 3B model in its class, surpassing the previously published SmolLM3 results from Hugging Face.
Background
The blue diagram below is from the official SmolLM3 training, a clean and efficient 3B model trained on 10.5T tokens using Nanotron:
It's a solid setup: 36 layers, grouped-query attention, NoPE, multilingual tokenizer, and good hyperparameters. But their tokens/sec/GPU plateaued at 14k, and MFU was capped at 29.4%.
Our goal was to squeeze out every last FLOP.
So we implemented a pretraining setup for a 3B-parameter model using TorchTitan with native FP8 kernels, without changing the model architecture or batch size. Here's what happened:
What We Changed
1. Switched from Nanotron to TorchTitan
TorchTitan brings fused CUDA kernels, efficient attention/MLP pipelines, and tight memory scheduling, built on PyTorch-native pieces such as torch.compile and FSDP2.
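As a rough illustration of that PyTorch-native pattern (not TorchTitan's actual code; the model constructor below is hypothetical and the script assumes a torchrun launch):

```python
# Rough sketch of the pattern TorchTitan builds on (FSDP2 sharding +
# torch.compile). Not TorchTitan's actual code; `build_3b_model()` is a
# hypothetical constructor, and the script is meant to be launched via torchrun.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard  # FSDP2 API (PyTorch >= 2.6)

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = build_3b_model()          # hypothetical 3B decoder-only transformer
for block in model.layers:
    fully_shard(block)            # shard each transformer block separately
fully_shard(model)                # shard remaining params (embeddings, lm head)
model = torch.compile(model)      # kernel fusion / codegen via torch.compile
```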
2. Moved to FP8 precision
We trained the model in FP8 with bf16 fallback on:
- LayerNorm
- Input/Output embeddings
- Final logits projection
FP8 dropped memory usage and increased arithmetic intensity, and FP8 matmuls run at roughly twice the peak tensor-core throughput of bf16 on this hardware.
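Concretely, the bf16 fallback just means those modules are excluded from FP8 conversion. Here is a minimal sketch using torchao's float8 API, which TorchTitan's FP8 path builds on; the module names checked below ("lm_head", "output") are assumptions, not our exact configuration:

```python
import torch
from torchao.float8 import convert_to_float8_training

def should_convert_to_fp8(module: torch.nn.Module, fqn: str) -> bool:
    # Convert every nn.Linear to FP8 except the final logits projection.
    # Norms and embeddings are not nn.Linear, so they stay in bf16 anyway.
    return "lm_head" not in fqn and "output" not in fqn

def apply_fp8(model: torch.nn.Module) -> torch.nn.Module:
    model = model.to(torch.bfloat16)          # bf16 base precision
    convert_to_float8_training(model, module_filter_fn=should_convert_to_fp8)
    return model
```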
3. NoPE Support with Kernel Fusion
Despite NoPE (which skips rotary position embeddings on every 4th layer), TorchTitan allowed us to fuse rotary + QKV + MHA into a single CUDA op without instability.
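For context, NoPE here simply means the rotary embedding is skipped on those layers. A sketch of the per-layer branching; `apply_rotary_emb` stands in for whatever RoPE kernel is used (fused or not):

```python
# Sketch only: `apply_rotary_emb` is a placeholder for the actual (fused)
# RoPE kernel; the every-4th-layer schedule matches the SmolLM3 NoPE setup.
def maybe_apply_rope(q, k, freqs_cis, layer_idx: int, nope_every: int = 4):
    if (layer_idx + 1) % nope_every == 0:
        return q, k                           # NoPE layer: no position encoding
    return apply_rotary_emb(q, k, freqs_cis)  # regular RoPE layer
```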
4. Stability tricks
- eps=1e-8 in AdamW
- Custom FP8-aware loss scaling
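A sketch of what these two settings look like in a training step; the model, data loader, and scale schedule are illustrative, and the dynamic scaler below is a generic stand-in for our FP8-aware loss scaling, not the exact implementation:

```python
import torch
import torch.nn.functional as F

# Hypothetical model/loader; only eps=1e-8 is taken from the actual setup,
# the other hyperparameters are illustrative.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4,
                              betas=(0.9, 0.95), weight_decay=0.1, eps=1e-8)

loss_scale = 2.0 ** 14                        # illustrative starting scale
for step, (inputs, targets) in enumerate(loader):
    optimizer.zero_grad(set_to_none=True)
    logits = model(inputs)                    # [batch, seq, vocab]
    loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    (loss * loss_scale).backward()            # scale up before backward
    found_inf = False
    for p in model.parameters():              # unscale and check for overflow
        if p.grad is not None:
            p.grad.div_(loss_scale)
            found_inf |= not torch.isfinite(p.grad).all().item()
    if found_inf:
        loss_scale /= 2                       # back off after an overflow
    else:
        optimizer.step()
        if step % 1000 == 999:
            loss_scale *= 2                   # slowly grow the scale back
```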
Why This Matters
This is the first publicly known training run of a 3B model to hit 41k+ tokens/sec/GPU and 45% MFU without using custom silicon, TPU v5e, or unreleased hardware.
Note: while our run used B200 GPUs, this hardware difference accounts for only a small portion of the 3x speedup. The vast majority of the performance gains come from our software optimizations.
If SmolLM represents the "reference implementation" of 3B LLM training, then this is the new state-of-the-art for open LLMs in this size class.
In fact, with a similar setup, SmolLM3 needed 24 days on 384 H100 GPUs. At our throughput, training on 10.5 trillion tokens with 384 GPUs instead of our 128 would have taken about 8.1 days (24 days × 14,000 / 41,400 ≈ 8.1).
Results: TorchTitan Rewrite
| Metric | SmolLM (Nanotron) | Ours (TorchTitan + FP8) |
|---|---|---|
| Tokens/sec/GPU | 14,000 | 41,400 |
| MFU | 29.4% | 45% |
| Optimizer | AdamW (bf16) | AdamW (bf16/FP8 hybrid) |
| Training duration (10.5T tokens, 384 GPUs) | 24 days | 8.1 days (projected) |
What's Next?
We're working on:
- Publicly releasing our TorchTitan training recipe for the 3B model class (code coming soon)
- Releasing our first Vision Action Language Model with 3B parameters, which will power Lightcone