A Linux kernel optimization aimed at reducing power consumption has been found to cause catastrophic QUIC packet loss on Cloudflare's edge network. The issue arises from the kernel's aggressive collapsing of CPU idle states, which triggers a bug in the CUBIC congestion control implementation used for QUIC.
Overview
CUBIC is a loss-based congestion control algorithm that governs how TCP and QUIC connections probe for available bandwidth, back off when they detect loss, and recover afterward. The algorithm uses a congestion window (cwnd) to limit the number of bytes that can be in flight at any moment. A larger cwnd allows the sender to push more data per round trip, while a smaller cwnd throttles it.
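For reference, the cubic growth curve that gives the algorithm its name can be sketched in a few lines. The constants and formula follow RFC 9438 (C = 0.4, beta = 0.7), assuming the window is cut to beta * w_max at the loss that starts the epoch. The function names are illustrative and are not taken from any particular implementation; units are segments and seconds for simplicity.

```python
# Constants as given in RFC 9438.
C = 0.4     # scaling constant of the cubic curve
BETA = 0.7  # multiplicative decrease factor applied on loss


def cubic_k(w_max: float) -> float:
    """Seconds the curve needs to climb back to w_max after a loss."""
    return (w_max * (1.0 - BETA) / C) ** (1.0 / 3.0)


def cubic_window(t: float, w_max: float) -> float:
    """Target cwnd (in segments) t seconds after the current epoch began."""
    return C * (t - cubic_k(w_max)) ** 3 + w_max


if __name__ == "__main__":
    w_max = 100.0  # cwnd just before the last loss, in segments
    for t in (0.0, 2.0, cubic_k(w_max), 6.0):
        print(f"t={t:5.2f}s  cwnd={cubic_window(t, w_max):7.2f}")
```

The curve starts at beta * w_max, flattens as it approaches w_max, and then accelerates again to probe for more bandwidth. Because the output depends only on the time elapsed since the epoch began, any adjustment to the epoch start time directly changes where on the curve the sender believes it is.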
The Bug
The bug appears once the connection exits slow start and enters congestion avoidance. From that point, CUBIC oscillates rapidly between recovery and congestion avoidance, which keeps the cwnd pinned at its minimum value. The oscillation is triggered by the kernel's idle-period optimization, which shifts the epoch forward by the idle duration rather than resetting it.
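The difference between shifting the epoch and resetting it can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the kernel's or Cloudflare's code: CubicEpoch, resume_by_shifting, resume_by_resetting, and elapsed_on_curve are hypothetical names, and times are plain seconds.

```python
from dataclasses import dataclass


@dataclass
class CubicEpoch:
    """Hypothetical holder for the start time of the current CUBIC epoch."""
    start: float  # seconds


def resume_by_shifting(epoch: CubicEpoch, idle_duration: float) -> CubicEpoch:
    """Slide the epoch forward so the idle gap does not count as growth time."""
    return CubicEpoch(start=epoch.start + idle_duration)


def resume_by_resetting(now: float) -> CubicEpoch:
    """Discard the old epoch and begin a fresh one at the current time."""
    return CubicEpoch(start=now)


def elapsed_on_curve(epoch: CubicEpoch, now: float) -> float:
    """Time the sender believes it has spent on the cubic curve."""
    return now - epoch.start


if __name__ == "__main__":
    epoch = CubicEpoch(start=10.0)   # epoch began at t = 10 s
    idle_start, now = 12.0, 17.0     # connection was idle from 12 s to 17 s
    shifted = resume_by_shifting(epoch, now - idle_start)
    print(elapsed_on_curve(epoch, now))                     # no adjustment
    print(elapsed_on_curve(shifted, now))                   # epoch shifted
    print(elapsed_on_curve(resume_by_resetting(now), now))  # epoch reset
```

Shifting preserves the two seconds of progress made before the idle period, while resetting discards it. The shift is only as accurate as the idle duration fed into it, which is where the bug described here comes in.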
The fix measures the idle duration from the moment bytes_in_flight actually transitioned to zero, rather than from the time the last packet was sent. With this change, the recovery boundary stops chasing the send time, and the cwnd can grow along the expected CUBIC curve.
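The two ways of measuring idle time can be compared with a small sketch. It assumes a simplified sender that tracks only bytes in flight and two timestamps. FlightTracker and its methods are hypothetical names chosen for illustration, not the API of Cloudflare's implementation, and times are integer milliseconds.

```python
class FlightTracker:
    """Toy sender state: bytes in flight plus the two candidate idle anchors."""

    def __init__(self) -> None:
        self.bytes_in_flight = 0
        self.last_sent_ms = None  # time of the most recent send
        self.drained_ms = None    # time bytes_in_flight last dropped to zero

    def on_sent(self, now_ms: int, size: int) -> None:
        self.bytes_in_flight += size
        self.last_sent_ms = now_ms
        self.drained_ms = None    # data is in flight, so the path is not idle

    def on_acked(self, now_ms: int, size: int) -> None:
        self.bytes_in_flight -= size
        if self.bytes_in_flight == 0:
            self.drained_ms = now_ms

    def idle_from_last_send(self, now_ms: int) -> int:
        """Measurement before the fix: counts in-flight time as idle."""
        return 0 if self.last_sent_ms is None else now_ms - self.last_sent_ms

    def idle_from_drain(self, now_ms: int) -> int:
        """Measurement after the fix: counts only time with nothing in flight."""
        return 0 if self.drained_ms is None else now_ms - self.drained_ms


if __name__ == "__main__":
    tracker = FlightTracker()
    tracker.on_sent(1000, 1200)    # one packet leaves at t = 1000 ms
    tracker.on_acked(1050, 1200)   # its ACK arrives one 50 ms round trip later
    print(tracker.idle_from_last_send(1060))
    print(tracker.idle_from_drain(1060))
```

In this scenario the send-based measurement reports 60 ms of idle time and the drain-based one reports 10 ms. The 50 ms difference is the round trip during which the packet was still in flight, so an epoch shifted by the send-based value moves further forward than the connection was actually idle.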
Tradeoffs
The fix highlights the tradeoffs between energy efficiency and latency sensitivity in transport protocols. The Linux kernel optimization aimed to reduce power consumption by collapsing CPU idle states, but this optimization clashed with the latency-sensitive requirements of QUIC. The fix trades microjoules for microseconds, prioritizing latency over energy efficiency.
The investigation took weeks of instrumenting qlogs and analyzing visualizations, but the fix itself changed just three lines of code. It has been contributed to Cloudflare's open-source implementation of QUIC and HTTP/3, and the company continues to experiment with and tune its model-based BBRv3 implementation.