Accelerating Gemma 4: faster inference with multi-token prediction drafters

Google's Gemma 4 models gain a significant inference speedup from Multi-Token Prediction (MTP) drafters, a speculative decoding technique in which a lightweight draft model proposes several tokens that the main model then verifies in parallel. By cutting the cost of token-by-token generation, Gemma 4 can produce up to 3x more tokens per second. The optimization should make large language models more practical on resource-constrained hardware. AI-assisted, human-reviewed.

Google has released Multi-Token Prediction (MTP) drafters for its Gemma 4 family of open models, a speculative decoding technique that can deliver up to a 3x speedup in tokens-per-second without degrading output quality or reasoning logic. The drafters are available today under the same Apache 2.0 license as Gemma 4, with model weights on Hugging Face, Kaggle, and support in transformers, MLX, vLLM, SGLang, and Ollama.

Overview

Standard LLM inference is memory-bandwidth bound: the processor spends most of its time moving billions of parameters from VRAM to compute units just to generate a single token. This creates a latency bottleneck, especially on consumer-grade hardware. Speculative decoding decouples token generation from verification by pairing a heavy target model (e.g., Gemma 4 31B) with a lightweight drafter (the MTP model). The drafter predicts several future tokens at once in less time than the target model takes to process one token; the target model then verifies all suggested tokens in parallel.
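To make the memory-bandwidth claim concrete, here is a back-of-the-envelope sketch. It assumes a dense model whose weights are all read once per generated token, and it ignores KV cache traffic and compute time. The 16-bit weight size and the 1000 GB/s bandwidth figure are illustrative assumptions, not numbers from Google.

```python
def min_ms_per_token(params_billion: float, bytes_per_param: float,
                     bandwidth_gb_s: float) -> float:
    """Rough lower bound on per-token latency for a dense model.

    Assumes every weight is streamed from memory once per token.
    """
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes = GB
    return weight_gb / bandwidth_gb_s * 1000.0    # seconds -> milliseconds


# Hypothetical setup: 31B parameters, 2 bytes each, 1000 GB/s of bandwidth.
latency_ms = min_ms_per_token(31, 2, 1000)
print(f"about {latency_ms:.0f} ms per token at best")
```

Under these assumptions the weights alone cost tens of milliseconds per token regardless of how fast the arithmetic units are, which is why verifying several drafted tokens in one pass over the weights is attractive.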

If the target model agrees with the draft, it accepts the entire sequence in a single forward pass and generates one additional token of its own in the process. An application can therefore emit the full drafted sequence plus one token in roughly the time it normally takes to generate a single token.
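The draft-then-verify loop described above can be sketched with toy models. This is a minimal illustration under simplifying assumptions, not Gemma 4's implementation: decoding is greedy, so acceptance is an exact-match check, and `target`, `drafter`, `speculative_step`, and `generate` are hypothetical names. A real engine scores all draft positions in one batched forward pass; the list comprehension below only emulates that.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_step(target: NextToken, drafter: NextToken,
                     context: List[Token], k: int) -> List[Token]:
    """Run one draft-then-verify step and return the tokens to append."""
    # 1. The drafter proposes k tokens autoregressively (cheap).
    draft: List[Token] = []
    for _ in range(k):
        draft.append(drafter(context + draft))

    # 2. The target picks its own token after every draft prefix.
    #    (Emulates a single parallel verification pass.)
    target_choices = [target(context + draft[:i]) for i in range(k + 1)]

    # 3. Accept the longest prefix on which drafter and target agree.
    accepted: List[Token] = []
    for i in range(k):
        if draft[i] != target_choices[i]:
            break
        accepted.append(draft[i])

    # 4. Add the target's own token: a correction at the first mismatch,
    #    or a bonus token when the whole draft was accepted.
    accepted.append(target_choices[len(accepted)])
    return accepted


def generate(target: NextToken, drafter: NextToken, prompt: List[Token],
             n_new: int, k: int = 4) -> Tuple[List[Token], int]:
    """Generate n_new tokens; also return how many verify steps were used."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, drafter, out, k))
        steps += 1
    return out[:len(prompt) + n_new], steps


# Toy models: the target counts upward modulo 10; the drafter agrees
# except that it guesses wrong whenever the last token is 4.
def target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def drafter(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


tokens, steps = generate(target, drafter, prompt=[0], n_new=8)
print(tokens, "in", steps, "verification steps")
```

Because every emitted token is either confirmed or produced by the target, the output matches what the target would generate on its own under greedy decoding; the drafter only changes how many target passes are needed.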

What the MTP drafters do

The MTP drafters are specialized speculative decoding models designed for the Gemma 4 family, which includes the 26B mixture-of-experts (MoE) model, the 31B dense model, and the E2B and E4B edge models. Key architectural enhancements include:

  • KV cache sharing: The draft models reuse the target model's activations and share its KV cache, so they avoid recomputing context the larger model has already processed.
  • Efficient embedder clustering: For the E2B and E4B edge models, where the final logit calculation is a major bottleneck, Google implemented an efficient clustering technique in the embedder to further accelerate generation.
  • Hardware-specific optimizations: For the 26B MoE model on Apple Silicon, processing multiple requests simultaneously (batch sizes of 4 to 8) unlocks up to roughly a 2.2x speedup locally. Similar gains appear on NVIDIA A100 GPUs as batch size increases.
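The article does not say how the embedder clustering works. As a purely illustrative assumption, one common way to cut the cost of the final logit calculation is a two-stage lookup: score a small set of cluster centroids first, then score only the tokens in the best clusters. All names and data below are hypothetical, and the sketch makes no claim to match Google's technique.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def clustered_argmax(hidden: Vector, centroids: List[Vector],
                     members: List[List[int]], embeddings: List[Vector],
                     top_clusters: int = 1) -> int:
    """Approximate the argmax over vocabulary logits in two stages."""
    # Stage 1: rank clusters by centroid score (few dot products).
    ranked = sorted(range(len(centroids)),
                    key=lambda c: dot(hidden, centroids[c]),
                    reverse=True)
    # Stage 2: compute logits only for tokens in the best clusters.
    candidates = [t for c in ranked[:top_clusters] for t in members[c]]
    return max(candidates, key=lambda t: dot(hidden, embeddings[t]))


# Toy vocabulary of 6 tokens in 2-D, pre-grouped into 2 clusters.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.8, -0.1],
              [0.0, 1.0], [0.1, 0.9], [-0.1, 0.8]]
centroids = [[0.9, 0.0], [0.0, 0.9]]
members = [[0, 1, 2], [3, 4, 5]]

hidden = [0.2, 1.0]
print(clustered_argmax(hidden, centroids, members, embeddings))
```

In this toy case the lookup scores 2 centroids and 3 tokens instead of all 6 tokens; with a real vocabulary the saving would be far larger. The shortcut is approximate in general, which matters less for a drafter because the target model still verifies every token.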

Tradeoffs

  • No quality degradation: Because the primary Gemma 4 model retains final verification, output quality and reasoning accuracy remain identical to standard inference.
  • Batch-size dependency: The speedup varies by hardware and batch size. Single-request inference on some architectures (e