Kubernetes v1.36: PSI Metrics for Kubernetes Graduates to GA

With PSI metrics graduating to GA, Kubernetes v1.36 offers a new lens on resource saturation: high-fidelity signals that identify CPU, memory, and I/O bottlenecks before they become outages, and a more accurate alternative to traditional utilization metrics.

Overview

Kubernetes v1.36 graduates Pressure Stall Information (PSI) metrics to general availability (GA). PSI provides high-fidelity signals for identifying CPU, memory, and I/O bottlenecks before they become outages. Where traditional utilization metrics report how busy a resource is, PSI reports how long tasks were stalled waiting for it and how much time was lost as a result.

What PSI does

PSI fills the gap left by traditional utilization metrics by providing cumulative totals and moving averages of time spent in a stalled state. This allows operators to distinguish between transient spikes and sustained resource tension. The metrics are collected at the node, pod, and container levels, giving users a detailed view of resource contention.
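To make the distinction between cumulative totals and moving averages concrete, here is a minimal Python sketch that parses one line in the Linux kernel's PSI text format (as found in files such as /proc/pressure/cpu) and derives a stall fraction from two cumulative totals. The function names and the sample line are illustrative assumptions, not part of Kubernetes or of the original article, and the sample values are made up rather than measured.

```python
def parse_psi_line(line: str) -> dict:
    """Parse one line of a kernel PSI file, e.g. a line from /proc/pressure/cpu."""
    kind, *fields = line.split()
    values = dict(field.split("=", 1) for field in fields)
    return {
        "kind": kind,                       # "some" or "full"
        "avg10": float(values["avg10"]),    # % of time stalled, 10 s window
        "avg60": float(values["avg60"]),    # % of time stalled, 60 s window
        "avg300": float(values["avg300"]),  # % of time stalled, 300 s window
        "total_us": int(values["total"]),   # cumulative stall time in microseconds
    }


def stall_fraction(total_before_us: int, total_after_us: int, interval_s: float) -> float:
    """Fraction of a wall-clock interval spent stalled, from two cumulative totals."""
    return (total_after_us - total_before_us) / (interval_s * 1_000_000)


# Illustrative sample line (hypothetical values, not real measurements).
sample = parse_psi_line("some avg10=1.50 avg60=0.40 avg300=0.10 total=123456")
```

A short spike tends to show up in avg10 while avg300 stays low, whereas sustained pressure lifts all three windows. Sampling the cumulative total at a chosen interval lets you compute the same kind of ratio over any window you like.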

Tradeoffs and Performance Testing

To address concerns about the resource overhead of collecting and serving PSI metrics, SIG Node ran extensive performance validation on high-density workloads. The testing focused on two sources of overhead: the Kubelet and the kernel. The results showed that the Kubelet's collection logic is lightweight and fits into its standard housekeeping cycles with no significant impact on resource usage. The kernel overhead was also found to be negligible: the difference between clusters with PSI enabled in the kernel and clusters with it disabled was small and consistent.

To use PSI metrics in a Kubernetes cluster, nodes must meet three requirements: run Linux kernel 4.20 or later, use cgroup v2, and have PSI enabled at the OS level. Once these prerequisites are met, users can scrape the Kubelet's /metrics/cadvisor endpoint with a Prometheus-compatible monitoring solution, or query the Summary API, to collect and visualize the new PSI metrics.
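The three node prerequisites can be checked with a short script run on the node itself. The following Python sketch is one hedged way to do it, not an official tool. It assumes the cgroup v2 unified hierarchy is mounted at /sys/fs/cgroup (where a cgroup.controllers file then exists) and that an enabled PSI kernel exposes /proc/pressure/cpu. The function names and the root parameter are hypothetical helpers introduced for illustration.

```python
import os
import platform
import re


def kernel_at_least(release: str, minimum=(4, 20)) -> bool:
    """Return True if a kernel release string such as '5.15.0-91-generic' is >= minimum."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) >= minimum


def check_psi_prereqs(root: str = "/") -> dict:
    """Check the node-level prerequisites; 'root' is a hypothetical override for testing."""
    return {
        # Linux kernel 4.20 or later
        "kernel_ok": kernel_at_least(platform.release()),
        # Assumption: cgroup v2 unified hierarchy mounted at /sys/fs/cgroup
        "cgroup_v2": os.path.exists(os.path.join(root, "sys/fs/cgroup/cgroup.controllers")),
        # PSI enabled at the OS level exposes the /proc/pressure files
        "psi_enabled": os.path.exists(os.path.join(root, "proc/pressure/cpu")),
    }


if __name__ == "__main__":
    for name, ok in check_psi_prereqs().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

This only inspects the host. It says nothing about the Kubelet's configuration or about whether your monitoring stack is scraping the endpoint.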

In summary, Kubernetes v1.36's PSI metrics provide a powerful tool for identifying resource bottlenecks and improving cluster performance. With its lightweight collection logic and negligible kernel overhead, PSI is a valuable addition to any Kubernetes cluster.
