Training an LLM in Swift, Part 1: Taking matrix mult from Gflop/s to Tflop/s

A Swift developer's step-by-step optimization of matrix multiplication, the crucial inner operation in large language model training, yielded a roughly 100-fold performance boost, taking throughput from single-digit Gflop/s into the Tflop/s range. By leveraging the Metal API to exploit the GPU's parallelism, the work demonstrates that substantial machine-learning speedups are attainable on Apple's M-series chips, with real implications for training large language models on consumer Apple hardware.

Overview

The article charts the journey from a slow, naive Swift implementation of matrix multiplication to a Metal-powered GPU version that runs roughly 100 times faster on Apple's M-series chips.

The initial Swift implementation was 15-20 times slower than the equivalent C code, managing only 2.8 Gflop/s. Adopting Swift 6.2's MutableSpan and InlineArray features significantly improved CPU performance before the work moved to the GPU.
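The article's Span-based code is not reproduced in this summary, but the baseline being optimized is the classic triple-loop multiply over flat row-major buffers. A minimal sketch (function name and layout are illustrative, not the author's actual code):

```swift
// Naive row-major matrix multiply: C[m×n] = A[m×k] · B[k×n].
// A minimal sketch; the article's optimized CPU version uses Swift 6.2's
// MutableSpan and InlineArray to cut allocation and access overhead.
func matmul(_ a: [Float], _ b: [Float], m: Int, k: Int, n: Int) -> [Float] {
    var c = [Float](repeating: 0, count: m * n)
    for i in 0..<m {
        for p in 0..<k {
            let aip = a[i * k + p]  // hoist A[i][p] out of the inner loop
            for j in 0..<n {
                c[i * n + j] += aip * b[p * n + j]
            }
        }
    }
    return c
}
```

For example, multiplying the 2×2 matrices [1 2; 3 4] and [5 6; 7 8] yields [19 22; 43 50]. Note the loop order (i, p, j): keeping the innermost loop over contiguous elements of B and C is one of the cheapest cache-friendliness wins available before any SIMD or GPU work.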

What it does

The optimized Swift code uses the Metal API to perform matrix multiplication on the GPU, achieving a substantial speedup over the initial implementation. The Metal code is divided into two parts: the inner kernel, written in Metal/C++, and the outer invocation machinery, written in Swift.
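The article's kernel itself is not shown here, but a minimal Metal/C++ compute kernel of this shape, with an assumed buffer layout (row-major A, B, C plus a small dimensions struct), would look roughly like:

```metal
#include <metal_stdlib>
using namespace metal;

// One GPU thread computes one element of C = A · B.
// The buffer layout and struct are assumptions for illustration,
// not the article's actual kernel.
struct Dims { uint m, k, n; };

kernel void matmul(device const float *A [[buffer(0)]],
                   device const float *B [[buffer(1)]],
                   device float       *C [[buffer(2)]],
                   constant Dims      &d [[buffer(3)]],
                   uint2 gid [[thread_position_in_grid]])
{
    if (gid.x >= d.n || gid.y >= d.m) { return; }
    float acc = 0.0f;
    for (uint p = 0; p < d.k; ++p) {
        acc += A[gid.y * d.k + p] * B[p * d.n + gid.x];
    }
    C[gid.y * d.n + gid.x] = acc;
}
```

Mapping one thread per output element is the simplest possible decomposition; the tiling and packing improvements mentioned later refine exactly this part.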

The inner kernel performs the actual matrix multiplication, while the outer machinery handles the invocation of the kernel and the management of the buffers. The developer also experimented with threading on the GPU, achieving an easy win by simply changing the threadsPerThreadgroup parameter.
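Host-side, the outer machinery boils down to encoding a compute pass. A hedged Swift sketch (names, sizes, and the 16×16 threadgroup are illustrative; this requires a Metal-capable device and the kernel above compiled into the default library):

```swift
import Metal

let (m, n) = (512, 512)  // assumed output dimensions for illustration

let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!
let pipeline = try! device.makeComputePipelineState(
    function: device.makeDefaultLibrary()!.makeFunction(name: "matmul")!)

let cmd = queue.makeCommandBuffer()!
let enc = cmd.makeComputeCommandEncoder()!
enc.setComputePipelineState(pipeline)
// ... setBuffer calls binding A, B, C, and the dims struct ...
let grid = MTLSize(width: n, height: m, depth: 1)
let tg = MTLSize(width: 16, height: 16, depth: 1)  // the tunable knob
enc.dispatchThreads(grid, threadsPerThreadgroup: tg)
enc.endEncoding()
cmd.commit()
cmd.waitUntilCompleted()
```

The `tg` value passed as threadsPerThreadgroup is the parameter the article tunes: it controls how the grid of threads is grouped for scheduling on the GPU's cores, and a poorly chosen size can leave much of the hardware idle.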

Tradeoffs

The optimized implementation has several tradeoffs, including increased complexity and the need for manual memory management. The use of Metal and the GPU also introduces additional overhead, such as the need to copy data between the CPU and GPU.

However, the significant performance gains make these tradeoffs worthwhile for large language model training workloads. The developer notes that the optimized implementation is still not perfect, with opportunities for further improvement, such as more efficient tiling and packing of the matrices.

When to use it

The optimized matrix multiplication implementation is suitable for large language model training workloads on Apple's M-series chips. It can be used as a drop-in replacement for the standard matrix multiplication implementation, providing a significant speedup without requiring substantial changes to the surrounding code.

In conclusion, the Metal-backed implementation delivers a speedup large enough to change what is practical: multiplication that crawled along at a few Gflop/s in naive Swift now runs at Tflop/s rates on the GPU. The added complexity and memory-management burden are real costs, but for large language model training workloads they are well worth paying.
