Whitematter

March 14, 2026

I wanted to know what happens between loss.backward() and the weights changing. Not conceptually - I mean the actual bytes. Which multiply goes where. How a convolution gradient turns into a transposed convolution. What the chain rule looks like when it's not a diagram in a textbook but 83,000 lines of C++.

So I built a deep learning framework from scratch: tensors, autograd, layers, optimizers, SIMD kernels, GPU shaders, the whole stack. Then wrapped it in a web platform where you can train models from a browser.


The Numbers

 Tensor operations    100+
 Layer types          20+ (Conv2d, LSTM, MultiHeadAttention, ...)
 C++ source           ~90,000 lines
 GPU backends         Metal (macOS) + CUDA (NVIDIA)
 MNIST convergence    99%+ accuracy, 3 epochs

How the System Fits Together

Three layers. The browser talks to a Python API. The API generates C++ code, compiles it, and runs it. Training metrics stream back in real time.

graph LR
    subgraph Browser
        A["React / Next.js<br/>Architecture builder<br/>Training dashboard"]
    end

    subgraph API ["FastAPI Backend"]
        B["Claude API"] --> C["Code Generator"]
        C --> D["CMake Compiler"]
        D --> E["Worker Process"]
    end

    subgraph Engine ["libwhitematter (C++)"]
        F["Tensor + Autograd"]
        G["Layers"]
        H["SIMD / BLAS / GPU"]
    end

    A -- "REST + SSE" --> B
    E -- "compile & exec" --> F
    F --> G
    F --> H

    style A fill:#1a1a1a,stroke:#555,color:#c0c0c0
    style F fill:#1a1a1a,stroke:#555,color:#c0c0c0

No Python in the training loop. The backend transpiles an architecture description into C++ source, compiles it, links it against the static library, and runs the binary directly.


Tensors and Autograd

The tensor is a contiguous float buffer with shape metadata, stride info, and a pointer to the function that created it. That pointer is the autograd system - every operation records a closure that knows how to compute its gradient.

auto x = Tensor::randn({64, 784}, true);   // requires_grad=true
auto W = Tensor::randn({784, 128}, true);
auto h = x->matmul(W)->relu();
auto loss = h->sum();
loss->backward();
// x->grad and W->grad now hold ∂loss/∂x and ∂loss/∂W

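backward() walks the graph in reverse topological order. Each node calls its stored closure, which combines the incoming gradient with the local derivative and passes the result on to the node's inputs. Matmul backward for C = A @ B produces grad @ B^T for A and A^T @ grad for B, where grad is ∂loss/∂C. ReLU backward zeroes the gradient wherever the input was negative. Convolution backward with respect to the input is a transposed convolution of the output gradient with the same kernel.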

Every backward function is written by hand. That's the point.
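Here is a minimal sketch of that mechanism on scalars instead of tensors. The names (Value, mul, relu, topo_sort) are illustrative and not Whitematter's actual API; the point is only to show closures recorded at forward time and replayed in reverse topological order.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <unordered_set>
#include <vector>

// Hypothetical scalar node; a real tensor would hold a float buffer instead.
struct Value {
    float data = 0.0f;
    float grad = 0.0f;
    std::vector<std::shared_ptr<Value>> parents;  // inputs that produced this node
    std::function<void()> backward_fn;            // closure: pushes grad to the inputs
};
using ValuePtr = std::shared_ptr<Value>;

ValuePtr make_value(float x) {
    auto v = std::make_shared<Value>();
    v->data = x;
    return v;
}

ValuePtr mul(const ValuePtr& a, const ValuePtr& b) {
    auto out = make_value(a->data * b->data);
    out->parents = {a, b};
    Value* o = out.get();  // raw pointer avoids a shared_ptr cycle through the closure
    out->backward_fn = [a, b, o] {
        a->grad += b->data * o->grad;  // d(ab)/da = b
        b->grad += a->data * o->grad;  // d(ab)/db = a
    };
    return out;
}

ValuePtr relu(const ValuePtr& a) {
    auto out = make_value(a->data > 0.0f ? a->data : 0.0f);
    out->parents = {a};
    Value* o = out.get();
    out->backward_fn = [a, o] {
        if (a->data > 0.0f) a->grad += o->grad;  // gradient passes only where input > 0
    };
    return out;
}

// Depth-first post-order: every node is appended after all of its inputs.
void topo_sort(const ValuePtr& v, std::unordered_set<Value*>& seen,
               std::vector<ValuePtr>& order) {
    if (!seen.insert(v.get()).second) return;
    for (const auto& p : v->parents) topo_sort(p, seen, order);
    order.push_back(v);
}

void backward(const ValuePtr& root) {
    assert(root != nullptr);
    std::unordered_set<Value*> seen;
    std::vector<ValuePtr> order;
    topo_sort(root, seen, order);
    root->grad = 1.0f;  // d(root)/d(root)
    // Reverse order guarantees a node's grad is complete before its closure runs.
    for (auto it = order.rbegin(); it != order.rend(); ++it) {
        if ((*it)->backward_fn) (*it)->backward_fn();
    }
}
```

Calling backward(relu(mul(a, b))) with a = 3 and b = 2 leaves a->grad at 2 and b->grad at 3. Gradients accumulate with +=, so a value used twice in the graph receives both contributions.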

flowchart TD
    X["x"] --> MM["matmul"]
    W["W"] --> MM
    MM --> R["relu"]
    R --> S["sum"]
    S --> L["loss"]

    L -. "1" .-> S
    S -. "ones" .-> R
    R -. "mask" .-> MM
    MM -. "grad @ Wᵀ" .-> X
    MM -. "xᵀ @ grad" .-> W

    style L fill:#1a1a1a,stroke:#555,color:#c0c0c0
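The two matmul edges in that graph can be written out directly. This is a sketch with a hypothetical row-major Mat type and a naive triple loop, not the framework's own kernels; it only demonstrates that the backward pass is two multiplications against transposed operands.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major dense matrix; a hypothetical stand-in for a 2-D tensor.
struct Mat {
    std::size_t rows, cols;
    std::vector<float> v;
    Mat(std::size_t r, std::size_t c) : rows(r), cols(c), v(r * c, 0.0f) {}
    float& at(std::size_t i, std::size_t j) { return v[i * cols + j]; }
    float at(std::size_t i, std::size_t j) const { return v[i * cols + j]; }
};

Mat matmul(const Mat& a, const Mat& b) {
    assert(a.cols == b.rows);
    Mat c(a.rows, b.cols);
    for (std::size_t i = 0; i < a.rows; ++i)
        for (std::size_t k = 0; k < a.cols; ++k)
            for (std::size_t j = 0; j < b.cols; ++j)
                c.at(i, j) += a.at(i, k) * b.at(k, j);
    return c;
}

Mat transpose(const Mat& a) {
    Mat t(a.cols, a.rows);
    for (std::size_t i = 0; i < a.rows; ++i)
        for (std::size_t j = 0; j < a.cols; ++j)
            t.at(j, i) = a.at(i, j);
    return t;
}

struct MatmulGrads {
    Mat dA;  // same shape as A
    Mat dB;  // same shape as B
};

// Given C = A @ B and grad = dLoss/dC, return dLoss/dA and dLoss/dB.
MatmulGrads matmul_backward(const Mat& a, const Mat& b, const Mat& grad) {
    assert(grad.rows == a.rows && grad.cols == b.cols);
    return {matmul(grad, transpose(b)), matmul(transpose(a), grad)};
}
```

The shapes are a useful sanity check: for A of shape (m, k) and B of shape (k, n), grad is (m, n), so grad @ Bᵀ comes out (m, k) and Aᵀ @ grad comes out (k, n), matching the operands they belong to.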

Broadcasting follows NumPy rules - shapes are right-aligned, dimensions of size 1 expand. Bias addition, attention masking, and batch-wise scaling all rely on it.
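The shape rule itself fits in a few lines. The function below is a sketch of the NumPy rule as described above; the name broadcast_shape and the Shape alias are assumptions for illustration, not the framework's API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

using Shape = std::vector<std::size_t>;

// Compute the broadcast result shape, or throw if the shapes are incompatible.
Shape broadcast_shape(const Shape& a, const Shape& b) {
    const std::size_t n = std::max(a.size(), b.size());
    Shape out(n, 1);
    for (std::size_t i = 0; i < n; ++i) {
        // Walk from the rightmost dimension; missing leading dimensions count as 1.
        const std::size_t da = i < a.size() ? a[a.size() - 1 - i] : 1;
        const std::size_t db = i < b.size() ? b[b.size() - 1 - i] : 1;
        if (da != db && da != 1 && db != 1) {
            throw std::invalid_argument("shapes are not broadcastable");
        }
        out[n - 1 - i] = (da == 1) ? db : da;
    }
    return out;
}
```

A bias of shape {128} added to activations of shape {64, 128} broadcasts to {64, 128}. Shapes {3} and {4} throw, because neither dimension is 1.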


Layers

20+ layer types, each implementing forward() and parameters():

 CONVOLUTION      RECURRENT      ATTENTION       NORMALIZATION
 ───────────      ─────────      ─────────       ─────────────
 Conv2d           LSTM           MultiHead       BatchNorm2d
 Conv1d           GRU            Grouped Query   LayerNorm
 ConvTranspose2d                 KV Cache        GroupNorm
 Grouped Conv                    RoPE            RMSNorm
 Dilated Conv                    Sinusoidal PE

 ACTIVATION       POOLING              UTILITY
 ──────────       ───────              ───────
 ReLU             MaxPool2d            Dropout
 GELU             AvgPool2d            Flatten
 SiLU             AdaptiveAvgPool2d    Sequential
 Mish                                  Embedding
 Tanh                                  Upsample

ResNet-18 on CIFAR-10 looks like:

Sequential model({
    new Conv2d(3, 64, 3, 1, 1),
    new BatchNorm2d(64),
    new ReLU(),
    // ... residual blocks with skip connections
    new AdaptiveAvgPool2d(1),
    new Flatten(),
    new Linear(512, 10)
});

Every layer handles its own weight initialization, tracks running stats where needed (BatchNorm), and computes gradients through its backward pass.
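To make the running-stats point concrete, here is a sketch of batch normalization for a single channel. It is a simplification under stated assumptions: no learnable scale and shift, the biased batch variance is used for both normalization and the running estimate, and the momentum and eps defaults are borrowed from common practice rather than taken from Whitematter's source. The struct name is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct BatchNormChannel {
    float running_mean = 0.0f;
    float running_var = 1.0f;
    float momentum = 0.1f;  // assumed default
    float eps = 1e-5f;      // assumed default
    bool training = true;

    std::vector<float> forward(const std::vector<float>& x) {
        assert(!x.empty());
        // Eval mode normalizes with the accumulated statistics.
        float mean = running_mean;
        float var = running_var;
        if (training) {
            // Training mode normalizes with the statistics of this batch...
            mean = 0.0f;
            for (float v : x) mean += v;
            mean /= static_cast<float>(x.size());
            var = 0.0f;
            for (float v : x) var += (v - mean) * (v - mean);
            var /= static_cast<float>(x.size());
            // ...and folds them into the running estimates for later eval use.
            running_mean = (1.0f - momentum) * running_mean + momentum * mean;
            running_var = (1.0f - momentum) * running_var + momentum * var;
        }
        const float inv_std = 1.0f / std::sqrt(var + eps);
        std::vector<float> y(x.size());
        for (std::size_t i = 0; i < x.size(); ++i) y[i] = (x[i] - mean) * inv_std;
        return y;
    }
};
```

The two modes differ most visibly at batch size 1: in training mode a single sample always normalizes to zero, while eval mode still produces a meaningful value from the running statistics.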


Making It Fast

Naive matrix multiplication in C++ is slow. Cache misses kill you. Whitematter stacks three levels of optimization:

SIMD - Element-wise ops use vector instructions: AVX2 on x86-64 (8 floats per instruction), NEON on Apple Silicon (4 floats per instruction). The instruction set is chosen at compile time.
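A compile-time-dispatched element-wise add might look roughly like this. It is a sketch, not the framework's kernel: the function name add_f32 is made up, and the preprocessor guards rely on the standard compiler-defined macros __AVX2__ and __ARM_NEON. Without either macro the scalar loop does all the work.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX2__)
#include <immintrin.h>
#elif defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// out[i] = a[i] + b[i] for i in [0, n)
void add_f32(const float* a, const float* b, float* out, std::size_t n) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    std::size_t i = 0;
#if defined(__AVX2__)
    // 256-bit registers: 8 floats per iteration, unaligned loads and stores.
    for (; i + 8 <= n; i += 8) {
        const __m256 va = _mm256_loadu_ps(a + i);
        const __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
#elif defined(__ARM_NEON)
    // 128-bit registers: 4 floats per iteration.
    for (; i + 4 <= n; i += 4) {
        const float32x4_t va = vld1q_f32(a + i);
        const float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));
    }
#endif
    // Scalar tail for leftover elements (and the whole array on other targets).
    for (; i < n; ++i) out[i] = a[i] + b[i];
}
```

The tail loop matters: tensor sizes are rarely a multiple of the vector width, and reading past the end of a buffer to fill a register is undefined behavior.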

BLAS - Matmul dispatches to system BLAS (Apple Accelerate, OpenBLAS). Hand-tuned GEMM routines that exploit cache hierarchy. Roughly 10x over a naive triple loop. Convolutions use im2col - unfold receptive fields into columns, multiply by flattened kernels.
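The im2col step is easier to see in code. The sketch below handles one image, square kernels, and zero padding, and uses a plain loop where a real build would call a BLAS GEMM. The names Cols, im2col, and conv2d_im2col are illustrative assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Cols {
    std::vector<float> data;  // (C*K*K) x (out_h*out_w), row-major
    int rows, out_h, out_w;
};

// Unfold a C x H x W image so each column holds one receptive field.
Cols im2col(const std::vector<float>& img, int C, int H, int W,
            int K, int stride, int pad) {
    assert(img.size() == static_cast<std::size_t>(C) * H * W);
    Cols c;
    c.out_h = (H + 2 * pad - K) / stride + 1;
    c.out_w = (W + 2 * pad - K) / stride + 1;
    c.rows = C * K * K;
    const int n = c.out_h * c.out_w;
    c.data.assign(static_cast<std::size_t>(c.rows) * n, 0.0f);  // zeros act as padding
    for (int ch = 0; ch < C; ++ch)
        for (int ky = 0; ky < K; ++ky)
            for (int kx = 0; kx < K; ++kx) {
                const int row = (ch * K + ky) * K + kx;
                for (int oy = 0; oy < c.out_h; ++oy)
                    for (int ox = 0; ox < c.out_w; ++ox) {
                        const int iy = oy * stride + ky - pad;
                        const int ix = ox * stride + kx - pad;
                        if (iy < 0 || iy >= H || ix < 0 || ix >= W) continue;
                        c.data[row * n + oy * c.out_w + ox] = img[(ch * H + iy) * W + ix];
                    }
            }
    return c;
}

// weights: out_c x (C*K*K), row-major. Result: out_c x (out_h*out_w).
std::vector<float> conv2d_im2col(const std::vector<float>& img,
                                 const std::vector<float>& weights,
                                 int C, int H, int W, int out_c,
                                 int K, int stride, int pad) {
    const Cols c = im2col(img, C, H, W, K, stride, pad);
    assert(weights.size() == static_cast<std::size_t>(out_c) * c.rows);
    const int n = c.out_h * c.out_w;
    std::vector<float> out(static_cast<std::size_t>(out_c) * n, 0.0f);
    // Plain GEMM loop standing in for a BLAS sgemm call.
    for (int o = 0; o < out_c; ++o)
        for (int r = 0; r < c.rows; ++r) {
            const float w = weights[o * c.rows + r];
            for (int j = 0; j < n; ++j) out[o * n + j] += w * c.data[r * n + j];
        }
    return out;
}
```

The trade is memory for speed: the unfolded matrix duplicates every pixel up to K*K times, but the whole convolution becomes one large matrix multiply that a tuned GEMM can run efficiently.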

GPU - Metal compute shaders on macOS, CUDA kernels for NVIDIA. A unified Device abstraction: tensor->to(Device::Metal) is all it takes.

flowchart LR
    OP["Operation"] --> D{"Device?"}
    D -- "CPU" --> S["SIMD<br/>AVX2 / NEON"]
    D -- "Metal" --> M["Metal Shaders"]
    D -- "CUDA" --> C["CUDA Kernels"]
    S --> B["BLAS<br/>for GEMM"]

    style OP fill:#1a1a1a,stroke:#555,color:#c0c0c0
    style D fill:#1a1a1a,stroke:#555,color:#c0c0c0

Compiled with -O3 -ffast-math -funroll-loops. Memory allocation uses an object pool to recycle tensor buffers during training.
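A buffer pool along those lines can be quite small. This is a sketch under assumptions: the class name BufferPool, the exact-size bucketing, and the single-threaded design are all choices made for illustration and may differ from what Whitematter does.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Recycles float buffers by exact element count. Not thread-safe.
class BufferPool {
public:
    // Returns a buffer of n floats. Reused buffers keep their stale contents.
    std::vector<float> acquire(std::size_t n) {
        assert(n > 0);
        auto it = free_.find(n);
        if (it != free_.end() && !it->second.empty()) {
            std::vector<float> buf = std::move(it->second.back());
            it->second.pop_back();
            ++reused_;
            return buf;
        }
        return std::vector<float>(n);  // pool miss: allocate fresh
    }

    // Hands a buffer back so a later acquire of the same size can reuse it.
    void release(std::vector<float>&& buf) {
        const std::size_t n = buf.size();
        free_[n].push_back(std::move(buf));
    }

    std::size_t reused() const { return reused_; }

private:
    std::unordered_map<std::size_t, std::vector<std::vector<float>>> free_;
    std::size_t reused_ = 0;
};
```

Exact-size buckets suit training well, since every step allocates the same set of activation and gradient shapes; after the first iteration nearly every acquire is a hit.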


The Training Pipeline

You describe a model in plain English. Claude suggests an architecture. You refine it in a visual node graph. Hit train.

flowchart TD
    A["'ResNet for CIFAR-10<br/>with dropout 0.3'"] --> B["Claude generates<br/>architecture JSON"]
    B --> C["Python transpiler<br/>emits C++ source"]
    C --> D["CMake compiles<br/>links libwhitematter.a"]
    D --> E["Binary executes<br/>stdout: epoch=1 loss=2.31 acc=0.22"]
    E --> F["Worker parses metrics<br/>pushes via SSE"]
    F --> G["Browser renders<br/>live loss curves"]

    style A fill:#1a1a1a,stroke:#555,color:#c0c0c0

The code generator maps architecture JSON to a complete C++ training script: includes, model definition, data loading, optimizer setup, training loop, metric printing. It writes the script to a temp directory and invokes CMake, then a worker process supervises execution. Loss, accuracy, and learning rate stream to the browser via SSE. You can cancel mid-training.
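On the binary's side, the metric printing could be as simple as the sketch below. The key=value layout follows the sample line in the diagram above; the function names are hypothetical, and the generated script may format things differently.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Builds one metrics line, e.g. "epoch=1 loss=2.31 acc=0.22".
std::string format_metrics(int epoch, float loss, float acc) {
    assert(epoch >= 0);
    char buf[96];
    std::snprintf(buf, sizeof(buf), "epoch=%d loss=%.2f acc=%.2f",
                  epoch, static_cast<double>(loss), static_cast<double>(acc));
    return std::string(buf);
}

// Prints the line and flushes so a supervising process sees it right away.
void emit_metrics(int epoch, float loss, float acc) {
    std::printf("%s\n", format_metrics(epoch, loss, acc).c_str());
    std::fflush(stdout);
}
```

The explicit flush is the important part. When stdout is a pipe rather than a terminal, the C runtime typically buffers it in blocks, so without flushing the dashboard would receive metrics in bursts instead of once per epoch.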

Bundled training utilities:

  • Optimizers: SGD, Adam, AdamW, RMSprop
  • Schedulers: step, exponential, cosine annealing, warmup + cosine, plateau-adaptive
  • Mixed precision (fp16 with loss scaling)
  • Gradient accumulation and clipping
  • Early stopping, checkpointing
  • ONNX export
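As one example from that list, a warmup + cosine schedule reduces to a single function of the step count. This is a sketch of the usual formulation; the function name, the parameter names, and the choice to count warmup from step + 1 are assumptions, not the framework's actual scheduler.

```cpp
#include <cassert>
#include <cmath>

// Learning rate at a 0-based step: linear warmup to base_lr over warmup_steps,
// then cosine decay from base_lr down to min_lr at total_steps.
float warmup_cosine_lr(int step, int warmup_steps, int total_steps,
                       float base_lr, float min_lr) {
    assert(warmup_steps >= 0 && total_steps > warmup_steps);
    if (step < warmup_steps) {
        return base_lr * static_cast<float>(step + 1) / static_cast<float>(warmup_steps);
    }
    const float kPi = 3.14159265358979f;
    float progress = static_cast<float>(step - warmup_steps) /
                     static_cast<float>(total_steps - warmup_steps);
    if (progress > 1.0f) progress = 1.0f;  // hold min_lr past the end
    return min_lr + 0.5f * (base_lr - min_lr) * (1.0f + std::cos(kPi * progress));
}
```

With 10 warmup steps, 110 total steps, and a base rate of 0.1, the rate climbs from 0.01 to 0.1 over the first ten steps, sits at 0.05 halfway through the decay, and reaches min_lr at the end.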

One-click deploy to AWS EC2 - provisions an instance, uploads the binary, exposes a REST inference endpoint.


Reference Models

The model zoo ships with full implementations: ResNet-18 (residual blocks, BatchNorm, adaptive pooling), MobileNetV2 (inverted residuals, depthwise separable convolutions), and a GPT decoder (causal multi-head attention, positional encoding, autoregressive generation - trained on Shakespeare as proof-of-concept).

All three use only Whitematter's layer primitives. Reading the source is how you learn what these architectures actually are - not abstractions over abstractions, but the literal matrix operations.


Why C++

Because I wanted to understand it.

PyTorch is great. If you're training for production, use PyTorch. But PyTorch is a building - you walk in, press buttons, things happen. I wanted to build the building. Writing matmul backward by hand teaches you it's just two transposed multiplications. Implementing BatchNorm teaches you why training and eval modes exist. Writing convolution as im2col teaches you that convolutions are matrix multiplies in disguise.

The web platform exists so the framework doesn't require a C++ toolchain to use. But the framework exists because the learning is in the implementation.