DiffusionGemma: 4x faster text generation
DiffusionGemma is an experimental open model that explores text diffusion and generates entire blocks of text simultaneously rather than token by token. Released under an Apache 2.0 license, the 26B Mixture of Experts model draws on the Gemma 4 family and Gemini Diffusion research and integrates a novel diffusion head to maximize generation speed, delivering up to 4x faster text generation on dedicated GPUs.
Targeted at researchers and developers working on speed-critical, interactive local workflows, the model activates only 3.8B parameters during inference and can fit within 18 GB of VRAM when quantized. It generates 256 tokens in parallel with bi-directional attention, which lets every token attend to all others. This offers advantages for in-line editing and code infilling, and for working with amino acid sequences and mathematical graphs.
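To make the difference between token-by-token and bi-directional attention concrete, here is a minimal sketch in plain Python. It is an illustration only, not DiffusionGemma's implementation: the function names are hypothetical, and the only figure taken from the text is the 256-token block size. It builds a causal mask (each position sees itself and earlier positions) and a full mask (each position sees the whole block), then counts what a position in the middle of the block can see.

```python
BLOCK = 256  # tokens generated in parallel, as stated in the text


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every position in the block."""
    return [[True] * n for _ in range(n)]


def visible_positions(mask, i):
    """Indices that position i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


causal = causal_mask(BLOCK)
bidir = bidirectional_mask(BLOCK)

# A position in the middle of the block, e.g. a gap being infilled.
mid = BLOCK // 2
causal_visible = len(visible_positions(causal, mid))  # left context only
bidir_visible = len(visible_positions(bidir, mid))    # left and right context

print(f"causal: position {mid} sees {causal_visible} of {BLOCK} positions")
print(f"bi-directional: position {mid} sees {bidir_visible} of {BLOCK} positions")
```

Under the causal mask, the middle position never sees the text to its right, which is exactly the context an infilling or in-line editing task needs. The full mask removes that restriction.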
By shifting the decode bottleneck from memory bandwidth to compute, DiffusionGemma reaches 1,000+ tokens per second on an NVIDIA H100 and 700+ tokens per second on an NVIDIA GeForce RTX 5090.
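The arithmetic behind these figures can be sketched in a few lines of Python. The throughput numbers and the 256-token block size come from the text; everything else is an assumption made for illustration: the helper names are hypothetical, the cost model counts roughly 2 FLOPs per active parameter per token, and it assumes 2 bytes per parameter. It also ignores attention cost and the fact that a block of tokens in a Mixture of Experts model may route to more experts than a single token does, so treat the result as a rough intuition rather than a measurement.

```python
BLOCK_TOKENS = 256  # tokens generated in parallel, as stated in the text


def seconds_per_block(tokens_per_second, block_tokens=BLOCK_TOKENS):
    """Wall-clock time to produce one block at a given throughput."""
    return block_tokens / tokens_per_second


def flops_per_weight_byte(tokens_per_pass, bytes_per_param=2.0):
    """Rough arithmetic intensity of one pass over the weights.

    Assumption: about 2 FLOPs (one multiply-add) per active parameter per
    token, and each weight is read from memory once per pass.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


# Throughput figures quoted in the text are lower bounds ("1,000+", "700+"),
# so these block times are upper bounds.
h100_block_s = seconds_per_block(1000)
rtx5090_block_s = seconds_per_block(700)

token_by_token = flops_per_weight_byte(1)
block_parallel = flops_per_weight_byte(BLOCK_TOKENS)
intensity_gain = block_parallel / token_by_token

print(f"H100: at most {h100_block_s:.3f} s per {BLOCK_TOKENS}-token block")
print(f"RTX 5090: at most {rtx5090_block_s:.3f} s per {BLOCK_TOKENS}-token block")
print(f"FLOPs per weight byte: {token_by_token:.0f} vs {block_parallel:.0f} "
      f"({intensity_gain:.0f}x)")
```

In this simplified model, each weight fetched from memory is reused across all 256 tokens of a block instead of a single token, so far more computation is done per byte moved. That is the sense in which the bottleneck moves from memory bandwidth toward compute.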