Overview
Kubernetes v1.36 introduces Pressure Stall Information (PSI) metrics as a generally available feature, providing high-fidelity signals for identifying CPU, memory, and I/O bottlenecks before they become outages. Where traditional utilization metrics show how busy a resource is, PSI reports how long tasks were stalled waiting for it, which makes it a more accurate indicator of contention.
What PSI does
PSI fills the gap left by traditional utilization metrics by reporting both cumulative totals and moving averages of the time tasks spend stalled. This allows operators to distinguish transient spikes from sustained resource pressure. The metrics are collected at the node, pod, and container levels, giving users a detailed view of where contention occurs.
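To make the distinction between moving averages and cumulative totals concrete, here is a minimal Python sketch that parses the text format the Linux kernel uses for PSI files such as /proc/pressure/cpu. The article does not describe this format; the layout below ("some" and "full" lines with avg10, avg60, avg300, and a total in microseconds) comes from the kernel's PSI interface, and the sample values, class name, and function names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PressureLine:
    kind: str        # "some" (at least one task stalled) or "full" (all stalled)
    avg10: float     # percent of the last 10 seconds spent stalled
    avg60: float     # percent of the last 60 seconds spent stalled
    avg300: float    # percent of the last 300 seconds spent stalled
    total_us: int    # cumulative stall time in microseconds


def parse_pressure(text: str) -> Dict[str, PressureLine]:
    """Parse the contents of a PSI file into one record per line kind."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        values = dict(field.split("=", 1) for field in fields)
        result[kind] = PressureLine(
            kind=kind,
            avg10=float(values["avg10"]),
            avg60=float(values["avg60"]),
            avg300=float(values["avg300"]),
            total_us=int(values["total"]),
        )
    return result


def stall_percent(prev_total_us: int, curr_total_us: int, interval_s: float) -> float:
    """Percent of an interval spent stalled, derived from two cumulative totals."""
    return (curr_total_us - prev_total_us) * 100 / (interval_s * 1_000_000)


# Illustrative sample only; real values come from the node.
SAMPLE = """\
some avg10=1.50 avg60=0.75 avg300=0.20 total=1200000
full avg10=0.00 avg60=0.00 avg300=0.00 total=300000
"""

if __name__ == "__main__":
    parsed = parse_pressure(SAMPLE)
    print(parsed["some"].avg10, parsed["some"].total_us)
    # Two hypothetical samples of the cumulative total, taken 10 seconds apart.
    print(stall_percent(1_200_000, 1_700_000, 10.0))
```

The moving averages give a quick read on recent pressure, while differencing two cumulative totals lets a monitoring system compute the stall share over any window it chooses.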
Tradeoffs and Performance Testing
To address concerns about the resource overhead of collecting and serving PSI metrics, SIG Node conducted extensive performance validation on high-density workloads. The testing focused on two sources of overhead: the kubelet and the kernel. The results showed that the kubelet's collection logic is lightweight and fits into its standard housekeeping cycles, with no significant impact on resource usage. The kernel overhead was also found to be negligible: the difference in resource usage between clusters with PSI enabled in the kernel and clusters with it disabled stayed consistent.
To use PSI metrics in a Kubernetes cluster, nodes must meet specific requirements, including running Linux kernel 4.20 or later, using cgroup v2, and having PSI enabled at the OS level. Once these prerequisites are met, users can scrape the /metrics/cadvisor endpoint with a Prometheus-compatible monitoring solution, or query the Summary API, to collect and visualize the new PSI metrics.
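The three prerequisites named above can be checked on a node with a short script. The following Python sketch is one possible approach, not an official tool: the function names are hypothetical, and the file paths are assumptions based on common Linux conventions (a cgroup v2 hierarchy mounted at /sys/fs/cgroup exposes cgroup.controllers, and /proc/pressure is present only when PSI is enabled).

```python
import os
import re


def kernel_supports_psi(release: str) -> bool:
    """Return True if a kernel release string such as '5.15.0-91-generic' is 4.20 or later."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return False
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (4, 20)


def check_node(root: str = "/") -> dict:
    """Check the three prerequisites on the local node.

    `root` is a hypothetical override that makes it possible to point the
    checks at a different filesystem root, for example in a test.
    """
    return {
        "kernel_4_20_or_later": kernel_supports_psi(os.uname().release),
        # Assumption: cgroup v2 is mounted at /sys/fs/cgroup, where the
        # unified hierarchy exposes a cgroup.controllers file.
        "cgroup_v2": os.path.exists(
            os.path.join(root, "sys/fs/cgroup/cgroup.controllers")
        ),
        # Assumption: /proc/pressure/cpu exists only when PSI is enabled.
        "psi_enabled": os.path.exists(os.path.join(root, "proc/pressure/cpu")),
    }


if __name__ == "__main__":
    for name, ok in check_node().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

Running the script directly on a node prints one line per prerequisite. A node that reports any of them as missing would need a kernel upgrade, a switch to cgroup v2, or PSI turned on at the OS level before the metrics appear.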
In summary, the PSI metrics in Kubernetes v1.36 give operators a powerful tool for identifying resource bottlenecks and improving cluster performance. With lightweight collection logic in the kubelet and negligible kernel overhead, PSI is a valuable addition to any cluster whose nodes meet the prerequisites.