Coding

Train Your Own LLM from Scratch

A new educational workshop shows developers how to train a GPT-style large language model (LLM) from scratch, with no pre-trained weights or proprietary datasets required. Built in plain PyTorch and inspired by Andrej Karpathy's nanoGPT, the step-by-step pipeline runs on public data and commodity hardware, offering a hands-on path into the internals of modern LLMs. AI-assisted, human-reviewed.

A new educational workshop enables developers to build a working GPT-style language model from the ground up, using only public code and minimal computational resources. The project, inspired by Andrej Karpathy’s nanoGPT, strips down the complexity of large language models (LLMs) into an accessible, step-by-step training pipeline that runs on consumer hardware — including laptops with Apple Silicon, NVIDIA GPUs, or even CPU-only systems.

Overview

The workshop, hosted on GitHub under the repository llm-from-scratch, is designed to be completed in a single session. It guides users through writing every component of a GPT model in PyTorch, from tokenization to text generation, without relying on pre-trained weights or black-box calls such as Hugging Face's AutoModel.from_pretrained(). The target model size is approximately 10 million parameters (Medium config), which trains in about 45 minutes on an M3 Pro chip.

Three model configurations are provided:

  • Tiny: ~0.5M parameters, 2 layers, 2 attention heads, 128 embedding dimensions — trains in ~5 minutes
  • Small: ~4M parameters, 4 layers, 4 heads, 256 embedding dimensions — ~20 minutes
  • Medium (default): ~10M parameters, 6 layers, 6 heads, 384 embedding dimensions — ~45 minutes
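The three configurations above map naturally onto a small config object, in the style nanoGPT uses. The sketch below is illustrative; the field and dictionary names are assumptions, not the repository's actual code:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    n_layer: int            # number of transformer blocks
    n_head: int             # attention heads per block
    n_embd: int             # embedding dimension
    vocab_size: int = 65    # character-level vocabulary
    block_size: int = 256   # context length

# The three workshop presets (names are hypothetical)
CONFIGS = {
    "tiny":   GPTConfig(n_layer=2, n_head=2, n_embd=128),
    "small":  GPTConfig(n_layer=4, n_head=4, n_embd=256),
    "medium": GPTConfig(n_layer=6, n_head=6, n_embd=384),
}
```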

All use character-level tokenization with a vocabulary size of 65 and a context length (block_size) of 256.
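Character-level encoding of this kind is simple enough to sketch in a few lines. This is a minimal illustration of the idea, not the workshop's actual code; variable names are assumptions:

```python
# Build a character-level tokenizer from a corpus (here, a stand-in snippet
# rather than the full shakespeare.txt).
text = "First Citizen: Before we proceed any further, hear me speak."
chars = sorted(set(text))                      # unique characters = vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> char

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)
```

On the full Shakespeare corpus the same construction yields the 65-character vocabulary mentioned above, and encode/decode are exact inverses, so no information is lost.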

What it does

The workshop is structured into six parts, each focusing on a core component of the LLM pipeline:

  1. Tokenization: Implement a character-level tokenizer. The guide explains why Byte Pair Encoding (BPE) struggles on small datasets like Shakespeare (~1MB), where most merged tokens occur too rarely to learn useful representations, making character-level encoding more effective at this scale.
  2. Transformer Architecture: Build the full GPT model, including token and positional embeddings, multi-head self-attention, layer normalization, MLP blocks, and residual connections.
  3. Training Loop: Code the complete training process — forward pass, cross-entropy loss, backpropagation, AdamW optimizer, gradient clipping, and learning rate scheduling.
  4. Text Generation: Implement inference with sampling techniques such as temperature scaling and top-k filtering for autoregressive text generation.
  5. Putting It All Together: Train the model on the provided shakespeare.txt dataset, analyze loss curves, and explore scaling effects.
  6. Competition: Challenge users to train the best AI poet by experimenting with datasets, hyperparameters, and model size.

The project supports local execution via uv.
