Kubernetes v1.36 introduces a significant architectural evolution for scheduling AI/ML and batch workloads: the Workload API becomes a static template, while the new first-class PodGroup API carries the runtime state. This release also debuts a dedicated PodGroup scheduling cycle, topology-aware scheduling, workload-aware preemption, and Dynamic Resource Allocation (DRA) support for PodGroups. The Job controller can now automatically create and manage Workload and PodGroup objects for qualifying Jobs, enabling native gang scheduling without additional tooling.
Overview
In Kubernetes v1.35, the project introduced the first tranche of workload-aware scheduling improvements, including the foundational Workload API, basic gang scheduling support built on a Pod-based framework, and an opportunistic batching feature. Kubernetes v1.36 builds on this foundation with a clean separation of API concerns: the Workload API now acts as a static template, while the new PodGroup API handles the runtime state. This separation improves performance and scalability, as the PodGroup API allows per-replica sharding of status updates.
What's new in v1.36
Workload and PodGroup APIs
The Workload and PodGroup APIs are now part of the scheduling.k8s.io/v1alpha2 API group, completely replacing the previous v1alpha1 API version. The Workload serves as a static template object, while the PodGroup manages the runtime state. The kube-scheduler can directly read the PodGroup, which contains all information required by the scheduler, without needing to watch or parse the Workload object itself.
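To make the split concrete, here is a rough sketch of the two objects. The overall shape follows the description above, but the individual field names (podGroupTemplates, workloadRef, minCount) are illustrative assumptions rather than the final v1alpha2 schema.

```yaml
# Hypothetical sketch; exact v1alpha2 field names may differ.
apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload
metadata:
  name: training-job
spec:
  # Static template: describes the shape of the Pod groups
  # but carries no runtime state.
  podGroupTemplates:
  - name: workers
    policy:
      gang:
        minCount: 8
---
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: training-job-workers
spec:
  # Runtime object: contains everything the kube-scheduler needs,
  # so it never has to watch or parse the parent Workload.
  workloadRef:
    name: training-job
  minCount: 8
```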
PodGroup scheduling cycle and gang scheduling
The kube-scheduler now features a dedicated PodGroup scheduling cycle. Instead of evaluating and reserving resources sequentially Pod-by-Pod, the scheduler evaluates the group as a unified operation. When a PodGroup member is popped from the scheduling queue, the scheduler fetches the rest of the queued Pods for that group, sorts them deterministically, and executes an atomic scheduling cycle:
- Takes a single snapshot of the cluster state to prevent race conditions.
- Attempts to find valid Node placements for all Pods in the group using a PodGroup scheduling algorithm.
- Applies the scheduling decision atomically for the entire PodGroup.
If the group fails to meet its requirements, none of the Pods are bound, and they are returned to the scheduling queue to retry later after a backoff period.
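As a concrete illustration, here is what all-or-nothing semantics could look like for a group of eight Pods. The minCount and schedulingGroup names are assumptions carried over from the descriptions in this post; the real field shapes may differ.

```yaml
# Hypothetical sketch of gang semantics.
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: trainer
spec:
  # The scheduler binds either all eight Pods or none of them.
  minCount: 8
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer-0
spec:
  # Assumed shape: links the Pod to its PodGroup. Controllers such
  # as the Job controller can set this automatically (see below).
  schedulingGroup: trainer
  containers:
  - name: worker
    image: busybox:1.36
    command: ["sleep", "600"]
```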
Limitations: The first version of the PodGroup scheduling cycle does not guarantee finding a valid placement for heterogeneous Pod groups or Pod groups with inter-Pod dependencies (e.g., affinity, anti-affinity, topology spread constraints).
Topology-aware scheduling
Topology-aware scheduling allows you to define topology constraints directly on a PodGroup, ensuring its Pods are co-located within specific physical or logical domains. The scheduler extends the PodGroup scheduling cycle with a dedicated placement-based algorithm consisting of three phases: generate candidate placements, evaluate each proposed placement, and score all feasible placements. Currently, topology-aware scheduling does not trigger Pod preemption to satisfy constraints.
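What such a constraint might look like on a PodGroup is sketched below; topologyConstraint and topologyKey are assumed names chosen for illustration.

```yaml
# Hypothetical sketch: co-locate the whole group in one domain.
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: trainer
spec:
  minCount: 16
  # Assumed field: all Pods in the group must land on Nodes that
  # share the same value of this label, e.g. one zone, one rack,
  # or one accelerator island.
  topologyConstraint:
    topologyKey: topology.kubernetes.io/zone
```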
Workload-aware preemption
When a PodGroup cannot be scheduled, the scheduler can use workload-aware preemption to try making scheduling possible. This mechanism treats the entire PodGroup as a single preemptor unit, searching across the entire cluster to preempt Pods from multiple Nodes simultaneously. It introduces two new concepts to the PodGroup API: PodGroup priority (overrides individual Pod priorities) and PodGroup disruptionMode (dictates whether Pods within a PodGroup can be preempted independently or together in an all-or-nothing fashion).
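Here is a sketch of how the two new concepts could surface on the API; both field names and the enum value below are assumptions for illustration only.

```yaml
# Hypothetical sketch of group-level preemption settings.
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: trainer
spec:
  minCount: 8
  # Assumed name: a group-level priority that overrides the
  # priorities of the individual member Pods.
  priorityClassName: high-priority-batch
  # Assumed name and value: whether members may be preempted
  # independently, or only together as a whole group.
  disruptionMode: AllOrNothing
```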
DRA ResourceClaim support for workloads
PodGroups can now represent the replicable unit for a ResourceClaimTemplate. For ResourceClaimTemplates referenced by a PodGroup's spec.resourceClaims, Kubernetes generates one ResourceClaim for the entire PodGroup, no matter how many Pods are in the group. ResourceClaims referenced by PodGroups become reserved for the entire PodGroup, and a single PodGroup reference in status.reservedFor can represent many more than 256 Pods, allowing high-cardinality sharing of devices.
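Below is a sketch of a PodGroup sharing one claim across all of its Pods. The ResourceClaimTemplate uses the GA resource.k8s.io/v1 schema; the shape of the PodGroup's resourceClaims entries is an assumption modeled on the Pod API, and example.com-nic is a placeholder device class.

```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: shared-interconnect
spec:
  spec:
    devices:
      requests:
      - name: nic
        exactly:
          deviceClassName: example.com-nic
---
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: trainer
spec:
  minCount: 512
  # One ResourceClaim is generated from the template for the whole
  # group, and status.reservedFor holds a single PodGroup entry
  # instead of one entry per Pod (previously capped at 256).
  resourceClaims:
  - name: interconnect
    resourceClaimTemplateName: shared-interconnect
```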
Integration with the Job controller
When the WorkloadWithJob feature gate is enabled, the Job controller automatically creates a Workload and a corresponding runtime PodGroup for each qualifying Job, sets .spec.schedulingGroup on every Pod the Job creates, and makes the Job the owner of the generated objects. The integration applies only when the Job has a well-defined, fixed shape: .spec.parallelism is greater than 1, .spec.completionMode is set to Indexed, .spec.completions equals .spec.parallelism, and schedulingGroup is not already set on the Pod template.
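For example, the following Job qualifies: it is Indexed, its parallelism is greater than 1, and its completions equal its parallelism. With WorkloadWithJob enabled, the Job controller would generate the Workload and PodGroup for it and stamp the scheduling group onto each of its Pods.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  completionMode: Indexed   # required for the integration
  parallelism: 4            # must be greater than 1
  completions: 4            # must equal parallelism
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox:1.36
        command: ["sleep", "600"]
```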
Getting started
All workload-aware scheduling improvements in v1.36 are available as Alpha features. To try them out, you must enable the GenericWorkload feature gate on both the kube-apiserver and kube-scheduler, and ensure the scheduling.k8s.io/v1alpha2 API group is enabled. Specific features require additional feature gates:
- Gang scheduling: Enable GangScheduling on the kube-scheduler.
- Topology-aware scheduling: Enable TopologyAwareWorkloadScheduling on the kube-scheduler.
- Workload-aware preemption: Enable WorkloadAwarePreemption on the kube-scheduler (requires GangScheduling).
- DRA ResourceClaim support: Enable DRAWorkloadResourceClaims on the kube-apiserver, kube-controller-manager, kube-scheduler, and kubelet.
- Workload API integration with the Job controller: Enable WorkloadWithJob on the kube-apiserver and kube-controller-manager.
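One convenient way to flip all of these switches locally is a kind cluster, since kind can propagate feature gates and runtime config to every control plane component. The gate names below come straight from the list above.

```yaml
# kind-config.yaml: enable the alpha gates cluster-wide.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  GenericWorkload: true
  GangScheduling: true
  TopologyAwareWorkloadScheduling: true
  WorkloadAwarePreemption: true
  DRAWorkloadResourceClaims: true
  WorkloadWithJob: true
runtimeConfig:
  "scheduling.k8s.io/v1alpha2": "true"
```

Create the cluster with kind create cluster --config kind-config.yaml. On a production control plane, the equivalent is the --feature-gates flag on each component plus --runtime-config=scheduling.k8s.io/v1alpha2=true on the kube-apiserver.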
What's next
For v1.37, the community is working on graduating the Workload and PodGroup APIs to Beta, introducing minCount mutability for elastic jobs, multi-level Workload hierarchies (e.g., for JobSet or LeaderWorkerSet), graduating topology-aware scheduling and workload-aware preemption to Beta, and developing a unified controller integration API.
Bottom line
Kubernetes v1.36 marks a major step toward treating AI/ML and batch workloads as first-class citizens in the scheduler. The separation of Workload and PodGroup APIs, combined with the new PodGroup scheduling cycle, topology-aware placement, and workload-aware preemption, provides a foundation for more efficient resource utilization in large-scale clusters. The integration with the Job controller makes gang scheduling accessible without additional tooling, though the current constraints limit it to static, indexed, fully-parallel Jobs.