Kubernetes v1.36: Advancing Workload-Aware Scheduling

Kubernetes v1.36 overhauls its scheduling architecture to treat AI/ML and batch jobs as first-class citizens, splitting the Workload API's static templates from the PodGroup API's runtime state. The new PodGroup scheduling cycle enables atomic, all-or-nothing workload processing (critical for gang scheduling), while topology-aware placement and workload-aware preemption debut to cut latency and resource fragmentation in large-scale clusters.

Kubernetes v1.36 introduces a significant architectural evolution for scheduling AI/ML and batch workloads: the Workload API becomes a static template, and the PodGroup is elevated to a first-class runtime API. This release also debuts a dedicated PodGroup scheduling cycle, topology-aware scheduling, workload-aware preemption, and Dynamic Resource Allocation (DRA) support for PodGroups. The Job controller can now automatically create and manage Workload and PodGroup objects for qualifying Jobs, enabling native gang scheduling without additional tooling.

Overview

In Kubernetes v1.35, the project introduced the first tranche of workload-aware scheduling improvements, including the foundational Workload API, basic gang scheduling support built on a Pod-based framework, and an opportunistic batching feature. Kubernetes v1.36 builds on this foundation with a clean separation of API concerns: the Workload API now acts as a static template, while the new PodGroup API handles the runtime state. This separation improves performance and scalability, as the PodGroup API allows per-replica sharding of status updates.

What's new in v1.36

Workload and PodGroup APIs

The Workload and PodGroup APIs are now part of the scheduling.k8s.io/v1alpha2 API group, completely replacing the previous v1alpha1 API version. The Workload serves as a static template object, while the PodGroup manages the runtime state. The kube-scheduler can directly read the PodGroup, which contains all information required by the scheduler, without needing to watch or parse the Workload object itself.
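As a rough illustration of the split, a static Workload template and its runtime PodGroup might look like the sketch below. The apiVersion and kinds come from the release notes above; the field names under spec are assumptions for illustration, not the actual alpha schema.

```yaml
# Sketch only: field names under spec are illustrative assumptions,
# not the real v1alpha2 schema.
apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload                # static template: the desired shape of the workload
metadata:
  name: training-template
spec:
  podGroups:
  - name: workers
    minCount: 8               # gang requirement: schedule all 8 Pods or none
---
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup                # runtime state: what the kube-scheduler watches
metadata:
  name: training-workers
spec:
  minCount: 8
```

The payoff of the split is that the scheduler only needs the compact PodGroup object, and status updates can be sharded per replica instead of contending on one large Workload object.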

PodGroup scheduling cycle and gang scheduling

The kube-scheduler now features a dedicated PodGroup scheduling cycle. Instead of evaluating and reserving resources sequentially Pod-by-Pod, the scheduler evaluates the group as a unified operation. When a PodGroup member is popped from the scheduling queue, the scheduler fetches the rest of the queued Pods for that group, sorts them deterministically, and executes an atomic scheduling cycle:

  1. Takes a single snapshot of the cluster state to prevent race conditions.
  2. Attempts to find valid Node placements for all Pods in the group using a PodGroup scheduling algorithm.
  3. Applies the scheduling decision atomically for the entire PodGroup.

If the group fails to meet its requirements, none of the Pods are bound, and they are returned to the scheduling queue to retry later after a backoff period.
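The all-or-nothing shape of the cycle can be sketched as a toy function. This is a deliberately simplified model, not the kube-scheduler's implementation: nodes are reduced to a single free-CPU number, and the "snapshot" is just a copied dict, but the three steps above (snapshot, place every Pod or fail, apply atomically) map directly onto the code.

```python
def schedule_pod_group(pods, nodes):
    """All-or-nothing sketch of a PodGroup scheduling cycle.

    pods:  list of (name, cpu_request) tuples forming one group
    nodes: dict of node name -> free CPU, read as a single snapshot
    Returns {pod: node} if every Pod fits, else None (nothing is bound).
    """
    free = dict(nodes)                      # 1. one snapshot of cluster state
    placements = {}
    for name, cpu in sorted(pods):          # deterministic Pod ordering
        # 2. find a feasible node for this Pod against the snapshot
        node = next((n for n, c in free.items() if c >= cpu), None)
        if node is None:
            return None                     # group infeasible: bind nothing
        placements[name] = node
        free[node] -= cpu                   # reserve within the snapshot
    return placements                       # 3. apply atomically as one decision
```

A failed group returns None with no partial placements, mirroring how the real cycle sends the whole group back to the queue for a backoff-delayed retry.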

Limitations: The first version of the PodGroup scheduling cycle does not guarantee finding a valid placement for heterogeneous Pod groups or Pod groups with inter-Pod dependencies (e.g., affinity, anti-affinity, topology spread constraints).

Topology-aware scheduling

Topology-aware scheduling allows you to define topology constraints directly on a PodGroup, ensuring its Pods are co-located within specific physical or logical domains. The scheduler extends the PodGroup scheduling cycle with a dedicated placement-based algorithm consisting of three phases: generate candidate placements, evaluate each proposed placement, and score all feasible placements. Currently, topology-aware scheduling does not trigger Pod preemption to satisfy constraints.
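A constraint of this kind might be expressed on the PodGroup roughly as follows. The topologyConstraints field and its shape are assumptions based on the description above; topology.kubernetes.io/zone is a standard well-known node label.

```yaml
# Sketch: the constraint field names are assumptions, not the real schema
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: training-workers
spec:
  minCount: 8
  topologyConstraints:
  - topologyKey: topology.kubernetes.io/zone   # keep all member Pods in one zone
```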

Workload-aware preemption

When a PodGroup cannot be scheduled, the scheduler can use workload-aware preemption to try making scheduling possible. This mechanism treats the entire PodGroup as a single preemptor unit, searching across the entire cluster to preempt Pods from multiple Nodes simultaneously. It introduces two new concepts to the PodGroup API: PodGroup priority (overrides individual Pod priorities) and PodGroup disruptionMode (dictates whether Pods within a PodGroup can be preempted independently or together in an all-or-nothing fashion).
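The two new concepts might surface on the PodGroup spec along these lines; the exact field names and the AllOrNothing value are illustrative assumptions.

```yaml
# Sketch: field names and values are illustrative assumptions
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: training-workers
spec:
  priority: 10000                # overrides the priorities of individual member Pods
  disruptionMode: AllOrNothing   # preempt the whole group together, or not at all
```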

DRA ResourceClaim support for workloads

PodGroups can now act as the replicable unit for a ResourceClaimTemplate. For ResourceClaimTemplates referenced by a PodGroup's spec.resourceClaims, Kubernetes generates a single ResourceClaim for the entire PodGroup, regardless of how many Pods the group contains. Such ResourceClaims are reserved for the whole PodGroup, and a single PodGroup reference in status.reservedFor can stand in for far more consumers than the 256-entry limit on individual Pod references, allowing high-cardinality sharing of devices.
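A PodGroup-scoped claim could look roughly like this; spec.resourceClaims comes from the text above, but the nested field names are assumptions for illustration.

```yaml
# Sketch: nested field names are assumptions
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: training-workers
spec:
  resourceClaims:
  - name: shared-gpus
    resourceClaimTemplateName: gpu-template  # one ResourceClaim is generated for
                                             # the whole group, not one per Pod
```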

Integration with the Job controller

When the WorkloadWithJob feature gate is enabled, the Job controller automatically creates a Workload and a corresponding runtime PodGroup for each qualifying Job, sets .spec.schedulingGroup on every Pod the Job creates, and marks the Job as the owner of the generated objects. The integration applies only when the Job has a well-defined, fixed shape: .spec.parallelism is greater than 1, .spec.completionMode is Indexed, .spec.completions equals .spec.parallelism, and schedulingGroup is not already set on the Pod template.
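A Job that satisfies all four conditions, and would therefore get a Workload and PodGroup generated for it, looks like this (the fields are standard batch/v1 Job fields; the container image is a placeholder):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gang-training
spec:
  completionMode: Indexed    # required: Indexed completion mode
  parallelism: 4             # required: greater than 1
  completions: 4             # required: equal to parallelism
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example.com/trainer:latest  # placeholder
```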

Getting started

All workload-aware scheduling improvements in v1.36 are available as Alpha features. To try them out, you must enable the GenericWorkload feature gate on both the kube-apiserver and kube-scheduler, and ensure the scheduling.k8s.io/v1alpha2 API group is enabled. Specific features require additional feature gates:

  • Gang scheduling: Enable GangScheduling on the kube-scheduler.
  • Topology-aware scheduling: Enable TopologyAwareWorkloadScheduling on the kube-scheduler.
  • Workload-aware preemption: Enable WorkloadAwarePreemption on the kube-scheduler (requires GangScheduling).
  • DRA ResourceClaim support: Enable DRAWorkloadResourceClaims on the kube-apiserver, kube-controller-manager, kube-scheduler, and kubelet.
  • Workload API integration with the Job controller: Enable WorkloadWithJob on the kube-apiserver and kube-controller-manager.
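Putting the gates together, the component flags might look like the sketch below; --feature-gates and --runtime-config are the standard mechanisms for enabling gates and alpha API groups, and the gate names are taken from the list above (all other required flags are elided).

```shell
# kube-apiserver: enable the alpha API group and the API-side gates
kube-apiserver \
  --runtime-config=scheduling.k8s.io/v1alpha2=true \
  --feature-gates=GenericWorkload=true,WorkloadWithJob=true,DRAWorkloadResourceClaims=true

# kube-scheduler: enable the scheduling-side gates
kube-scheduler \
  --feature-gates=GenericWorkload=true,GangScheduling=true,TopologyAwareWorkloadScheduling=true,WorkloadAwarePreemption=true

# kube-controller-manager and kubelet also need WorkloadWithJob and
# DRAWorkloadResourceClaims respectively, per the list above
```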

What's next

For v1.37, the community is working on graduating the Workload and PodGroup APIs to Beta, introducing minCount mutability for elastic jobs, multi-level Workload hierarchies (e.g., for JobSet or LeaderWorkerSet), graduating topology-aware scheduling and workload-aware preemption to Beta, and developing a unified controller integration API.

Bottom line

Kubernetes v1.36 marks a major step toward treating AI/ML and batch workloads as first-class citizens in the scheduler. The separation of Workload and PodGroup APIs, combined with the new PodGroup scheduling cycle, topology-aware placement, and workload-aware preemption, provides a foundation for more efficient resource utilization in large-scale clusters. The integration with the Job controller makes gang scheduling accessible without additional tooling, though the current constraints limit it to static, indexed, fully-parallel Jobs.
