Summary: We trained a SmolLM 3B-parameter model to 100B tokens using TorchTitan with FP8 and hit 41.4k tokens/sec/GPU and 45% MFU. That is a 3x throughput boost and roughly 53% higher MFU (29.4% → 45%) over the original Nanotron baseline. This is the fastest publicly known training run for a 3B model in its class, surpassing the previously published SmolLM3 results from Hugging Face.
Background
The blue diagram below is from the official SmolLM3 training, a clean and efficient 3B model trained on 10.5T tokens using Nanotron:
It's a solid setup: 36 layers, grouped-query attention, NoPE, multilingual tokenizer, and good hyperparameters. But their tokens/sec/GPU plateaued at 14k, and MFU was capped at 29.4%.
Our goal was to squeeze out every last FLOP.
So we implemented a pretraining setup for a 3B-parameter model using TorchTitan with native FP8 kernels, without changing the model architecture or batch size. Here's what happened:
What We Changed
1. Switched from Nanotron to TorchTitan
TorchTitan brings fused CUDA kernels, efficient attention/MLP pipelines, and tight memory scheduling, built on PyTorch-native pieces such as torch.compile and FSDP2.
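As a rough illustration of that PyTorch-native pattern (not TorchTitan's actual code; the model constructor below is hypothetical and the script assumes a torchrun launch):

```python
# Rough sketch of the pattern TorchTitan builds on (FSDP2 sharding +
# torch.compile). Not TorchTitan's actual code; `build_3b_model()` is a
# hypothetical constructor, and the script is meant to be launched via torchrun.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard  # FSDP2 API (PyTorch >= 2.6)

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = build_3b_model()          # hypothetical 3B decoder-only transformer
for block in model.layers:
    fully_shard(block)            # shard each transformer block separately
fully_shard(model)                # shard remaining params (embeddings, lm head)
model = torch.compile(model)      # kernel fusion / codegen via torch.compile
```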
2. Moved to FP8 precision
We trained the model in FP8 with bf16 fallback on:
- LayerNorm
- Input/Output embeddings
- Final logits projection
FP8 dropped memory usage and increased arithmetic intensity, and FP8 matmuls run at roughly twice the peak tensor-core throughput of bf16 on this hardware.
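Concretely, the bf16 fallback just means those modules are excluded from FP8 conversion. Here is a minimal sketch using torchao's float8 API, which TorchTitan's FP8 path builds on; the module names checked below ("lm_head", "output") are assumptions, not our exact configuration:

```python
import torch
from torchao.float8 import convert_to_float8_training

def should_convert_to_fp8(module: torch.nn.Module, fqn: str) -> bool:
    # Convert every nn.Linear to FP8 except the final logits projection.
    # Norms and embeddings are not nn.Linear, so they stay in bf16 anyway.
    return "lm_head" not in fqn and "output" not in fqn

def apply_fp8(model: torch.nn.Module) -> torch.nn.Module:
    model = model.to(torch.bfloat16)          # bf16 base precision
    convert_to_float8_training(model, module_filter_fn=should_convert_to_fp8)
    return model
```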
3. NoPE Support with Kernel Fusion
Despite NoPE (which skips rotary position embeddings on every 4th layer), TorchTitan allowed us to fuse rotary + QKV + MHA into a single CUDA op without instability.
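For context, NoPE here simply means the rotary embedding is skipped on those layers. A sketch of the per-layer branching; `apply_rotary_emb` stands in for whatever RoPE kernel is used (fused or not):

```python
# Sketch only: `apply_rotary_emb` is a placeholder for the actual (fused)
# RoPE kernel; the every-4th-layer schedule matches the SmolLM3 NoPE setup.
def maybe_apply_rope(q, k, freqs_cis, layer_idx: int, nope_every: int = 4):
    if (layer_idx + 1) % nope_every == 0:
        return q, k                           # NoPE layer: no position encoding
    return apply_rotary_emb(q, k, freqs_cis)  # regular RoPE layer
```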
4. Stability tricks
- eps=1e-8 in AdamW
- Custom FP8-aware loss scaling
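A sketch of what these two settings look like in a training step; the model, data loader, and scale schedule are illustrative, and the dynamic scaler below is a generic stand-in for our FP8-aware loss scaling, not the exact implementation:

```python
import torch
import torch.nn.functional as F

# Hypothetical model/loader; only eps=1e-8 is taken from the actual setup,
# the other hyperparameters are illustrative.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4,
                              betas=(0.9, 0.95), weight_decay=0.1, eps=1e-8)

loss_scale = 2.0 ** 14                        # illustrative starting scale
for step, (inputs, targets) in enumerate(loader):
    optimizer.zero_grad(set_to_none=True)
    logits = model(inputs)                    # [batch, seq, vocab]
    loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    (loss * loss_scale).backward()            # scale up before backward
    found_inf = False
    for p in model.parameters():              # unscale and check for overflow
        if p.grad is not None:
            p.grad.div_(loss_scale)
            found_inf |= not torch.isfinite(p.grad).all().item()
    if found_inf:
        loss_scale /= 2                       # back off after an overflow
    else:
        optimizer.step()
        if step % 1000 == 999:
            loss_scale *= 2                   # slowly grow the scale back
```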
Why This Matters
This is the first publicly known training run of a 3B model to hit 41k+ tokens/sec/GPU and 45% MFU without using custom silicon, TPU v5e, or unreleased hardware.
Note: while our run used B200 GPUs, this hardware difference accounts for only a small portion of the 3x speedup. The vast majority of the performance gains come from our software optimizations.
If SmolLM represents the "reference implementation" of 3B LLM training, then this is the new state-of-the-art for open LLMs in this size class.
In fact, with a similar setup, SmolLM3 needed 24 days on 384 H100 GPUs. At our throughput, training on 10.5 trillion tokens with 384 GPUs instead of our 128 would have taken about 8.1 days (24 days × 14,000 / 41,400 ≈ 8.1).
Results: TorchTitan Rewrite
| Metric | SmolLM (Nanotron) | Ours (TorchTitan + FP8) |
|---|---|---|
| Tokens/sec/GPU | 14,000 | 41,400 |
| MFU | 29.4% | 45% |
| Optimizer | AdamW (bf16) | AdamW (bf16/FP8 hybrid) |
| Training duration (10.5T tokens, 384 GPUs) | 24 days | 8.1 days (projected) |
What's Next?
We're working on:
- Publicly releasing our TorchTitan training recipe for the 3B model class (code coming soon)
- Releasing our first Vision Action Language Model with 3B parameters, which will power Lightcone