PyTorch 2.12 Release Blog

"PyTorch 2.12's CUDA acceleration overhaul yields a 100x speedup for batched eigenvalue decomposition, a crucial operation in deep learning, while also introducing performance enhancements for other linear algebra functions, marking a significant milestone in the library's pursuit of high-performance tensor computation."

PyTorch 2.12 is now available, bringing a 100x speedup for batched eigenvalue decomposition on CUDA, a new device-agnostic graph capture API, and support for exporting models using Microscaling (MX) quantization formats. The release includes 2,926 commits from 457 contributors since version 2.11.

Performance improvements

The headline performance change is an overhaul of the backend selection for linalg.eigh on CUDA. The legacy MAGMA backend has been deprecated in favor of cuSolver, and the dispatch heuristics now use syevj_batched unconditionally. For batched symmetric/Hermitian eigenvalue problems, this yields up to 100x speedups over the previous release. Workloads that previously took minutes now run in seconds by processing many small or medium matrices as a single GPU operation.
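The batched path is exercised through the same `torch.linalg.eigh` call as before; no code changes are needed to benefit. A minimal sketch (the batch and matrix sizes are illustrative, and the snippet falls back to CPU when no GPU is present):

```python
import torch

# A batch of 4096 small symmetric matrices. On PyTorch 2.12 with CUDA, the
# cuSolver syevj_batched path decomposes the whole batch in one GPU launch.
device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(4096, 8, 8, device=device)
sym = (a + a.transpose(-1, -2)) / 2       # symmetrize each matrix

# Batched eigendecomposition: eigvals is (4096, 8), eigvecs is (4096, 8, 8)
eigvals, eigvecs = torch.linalg.eigh(sym)
```

The key point is that the batch dimension is handled inside a single dispatch rather than a Python loop over matrices.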

The Adagrad optimizer now supports fused=True, performing the entire optimizer step in a single CUDA kernel. Adagrad joins Adam, AdamW, and SGD in offering a fused variant.
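Opting in is a one-argument change. A hedged sketch that only passes `fused=True` when a CUDA device is available, so it also runs on CPU and on builds predating the new kwarg:

```python
import torch

model = torch.nn.Linear(16, 4)
before = model.weight.detach().clone()

# fused=True for Adagrad is new in 2.12 and targets CUDA; guard it so this
# sketch stays runnable elsewhere.
opt_kwargs = {"lr": 0.1}
if torch.cuda.is_available():
    opt_kwargs["fused"] = True
opt = torch.optim.Adagrad(model.parameters(), **opt_kwargs)

loss = model(torch.randn(8, 16)).pow(2).mean()
loss.backward()
opt.step()          # single fused CUDA kernel when fused=True
opt.zero_grad()
```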

New APIs and export capabilities

torch.accelerator.Graph is a new device-agnostic API for graph capture and replay, providing a unified abstraction over backend-specific implementations such as torch.xpu.XPUGraph. Each backend can register its own implementation through a lightweight GraphImplInterface. Alongside this, c10::Stream and torch.Stream now expose an is_capturing() method, replacing device-specific alternatives.
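The intended usage pattern can be sketched with feature detection. Note that beyond the class name given in the release notes, the `capture_begin`/`capture_end`/`replay` method names below are an assumption modeled on the existing `torch.cuda.CUDAGraph` API, not a confirmed signature; the snippet falls back to eager execution when the API or an accelerator is absent:

```python
import torch

def capture_or_eager(fn, *args):
    # torch.accelerator.Graph is new in 2.12; the capture protocol below is
    # assumed from torch.cuda.CUDAGraph. Without it, just run eagerly.
    graph_cls = getattr(getattr(torch, "accelerator", None), "Graph", None)
    if graph_cls is None or not torch.cuda.is_available():
        return fn(*args)
    g = graph_cls()
    g.capture_begin()       # assumed API, mirroring CUDAGraph
    out = fn(*args)
    g.capture_end()
    g.replay()
    return out

x = torch.randn(4)
y = capture_or_eager(torch.sin, x)
```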

torch.export.save and torch.export.load now correctly serialize and deserialize tensors with the float8_e8m0fnu dtype used as the shared block-scale exponent in MX formats (MXFP4, MXFP6, MXFP8). This unblocks the full export-to-deployment workflow for models using Microscaling quantization, which is relevant for teams deploying large language models to cost-constrained or edge environments.

Control-flow regions using torch.cond can now be captured and replayed as part of CUDA Graphs. By leveraging CUDA 12.4's conditional IF nodes, branches are evaluated entirely on the GPU within a single graph capture. This currently works with the eager and cudagraphs backends; Inductor support is planned for a future release.

Distributed and profiling updates

Custom operators can now accept ProcessGroup objects directly as arguments. All c10d functional collective ops have been updated to accept both ProcessGroup objects and string names.

The PyTorch Profiler Events API now exposes flow IDs, flow types, activity types, unfinished events, and Python function events. NCCL collective traces can be correlated across ranks using a new seq_num field.
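The Events API entry point is unchanged; 2.12 enriches the event objects it returns. A minimal sketch of the existing surface (the new fields such as flow IDs are attributes on these same events in 2.12 and are not shown here):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Profile a single matmul on CPU and inspect the recorded events.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    torch.mm(torch.randn(32, 32), torch.randn(32, 32))

events = prof.events()
names = [e.name for e in events]    # e.g. "aten::mm" appears here
```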

FlightRecorder's trace analyzer now supports ncclx and gloo backends alongside existing nccl and xccl backends, and recognizes torchcomms operations.

Platform-specific updates

  • CUDA: torch.cuda.graph now accepts an enable_annotations kwarg that injects annotation metadata into individual kernels. CUDA Green Contexts support specifying a workqueue limit.
  • ROCm: AMD GPUs (ROCm >= 7.02) now support expandable memory segments. rocSHMEM support enables symmetric memory collective operations. hipSPARSELt is enabled by default on ROCm >= 7.12, bringing semi-structured (2:4) sparsity support. FlexAttention on AMD GPUs uses two-stage pipelining, delivering 5-26% speedups on MI350X.
  • Apple MPS: Apple Silicon binary wheels now ship with ahead-of-time-compiled Metal-4 shaders, eliminating runtime shader compilation overhead on first run.

Deprecations and breaking changes

TorchScript is now deprecated. torch.export replaces the torch.jit.trace and torch.jit.script APIs, and ExecuTorch replaces the embedded TorchScript runtime.

The CUDA 12.8 binary wheel is deprecated and will no longer be published as part of the standard release matrix. The default wheel remains CUDA 13.0. CUDA 13.2 has been added as an experimental build. Users on older architectures (Pascal, Volta) should use the CUDA 12.6 wheel. Users on newer GPUs (Blackwell) should use CUDA 13.0+ wheels, which require an NVIDIA driver upgrade to 580.65.06 (Linux) or 580.88 (Windows).

Planned breaking changes for torchcomms in PyTorch 2.13+ include eager initialization of ProcessGroup, changes to P2P operations, and making torchcomms a required package for PyTorch Distributed.

Bottom line

PyTorch 2.12 is a significant performance and platform compatibility release. The 100x speedup for batched eigendecomposition directly addresses a longstanding gap with CuPy. The new device-agnostic graph API and MX quantization export support make the framework more practical for production deployment across diverse hardware.
