CIFAR-10 Classifier | Henry Wang

Upload an image. The model classifies it as one of: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, or truck.

Runs locally via WebAssembly. No data leaves your browser.

Note: this demo runs a checkpoint from roughly epoch 20 (~83% test accuracy). About 180 of the planned 200 epochs remain, and the full-run target is 93%+. Training on a single RTX 2070 SUPER is slow, so the model will be updated as training progresses.

The demo uses a ResNet-18 trained from scratch with whitematter: 11 million parameters, CUDA-accelerated training on an RTX 2070 SUPER, exported to ONNX, and running locally through ONNX Runtime Web.

The Architecture

ResNet-18. Four groups of residual blocks, two blocks per group. Each block has two 3x3 convolutions and a skip connection. The skip connection gives the gradient a shorter path through the network.

flowchart TD
    IN["Input<br/>3x32x32"] --> C1["Conv 3x3, 64<br/>+ BN + ReLU"]
    C1 --> L1["Layer 1<br/>64 channels<br/>32x32"]
    L1 --> L2["Layer 2<br/>128 channels<br/>16x16"]
    L2 --> L3["Layer 3<br/>256 channels<br/>8x8"]
    L3 --> L4["Layer 4<br/>512 channels<br/>4x4"]
    L4 --> P["Global AvgPool<br/>512x1x1"]
    P --> FC["Linear 512→10"]
    FC --> OUT["Softmax<br/>10 classes"]

    style IN fill:#1a1a1a,stroke:#555,color:#c0c0c0
    style OUT fill:#1a1a1a,stroke:#555,color:#c0c0c0
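The 11 million figure can be checked against the diagram. A rough sketch, assuming the usual CIFAR variant of ResNet-18 (3x3 stem, a 1x1 projection plus BatchNorm in the first block of Layers 2 to 4). The function names are made up for illustration, and whether whitematter's convolutions carry a bias is left as a flag because the post does not say:

```cpp
#include <cassert>
#include <cstdint>

// Learnable parameters in one k x k convolution, optionally with one bias per output channel.
std::int64_t conv_params(int c_in, int c_out, int k, bool bias) {
    assert(c_in > 0 && c_out > 0 && k > 0);
    return static_cast<std::int64_t>(c_in) * c_out * k * k + (bias ? c_out : 0);
}

// BatchNorm learns gamma and beta per channel (running statistics are not learned).
std::int64_t bn_params(int channels) { return 2LL * channels; }

// One BasicBlock: two 3x3 convs with BN, plus a 1x1 projection with BN when the width changes.
std::int64_t basic_block_params(int c_in, int c_out, bool conv_bias) {
    std::int64_t n = conv_params(c_in, c_out, 3, conv_bias) + bn_params(c_out)
                   + conv_params(c_out, c_out, 3, conv_bias) + bn_params(c_out);
    if (c_in != c_out) n += conv_params(c_in, c_out, 1, conv_bias) + bn_params(c_out);
    return n;
}

// Total for the assumed layout: stem, four layers of two blocks, Linear 512 -> 10.
std::int64_t resnet18_cifar_params(bool conv_bias) {
    const int widths[4] = {64, 128, 256, 512};
    std::int64_t n = conv_params(3, 64, 3, conv_bias) + bn_params(64);  // stem
    int c_in = 64;
    for (int w : widths) {
        n += basic_block_params(c_in, w, conv_bias);  // first block may change width
        n += basic_block_params(w, w, conv_bias);     // second block keeps it
        c_in = w;
    }
    n += 512 * 10 + 10;  // final linear layer, weights plus bias
    return n;
}
```

Under these assumptions the total is 11,173,962 without conv biases and 11,178,762 with them. Either way it rounds to 11 million, and almost all of it sits in the 3x3 convolutions of Layers 3 and 4.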

Each "Layer" is two BasicBlocks. Each BasicBlock:

auto out = conv1.forward(x);      // 3x3 conv
out = bn1.forward(out);           // batch normalize
out = out->relu();                // activate
out = conv2.forward(out);         // 3x3 conv
out = bn2.forward(out);           // batch normalize
out = out->add(x);                // skip connection: add input directly
out = out->relu();                // activate

That add(x) is the residual connection. Without it, an 18-layer network is harder to train because gradients attenuate through repeated multiplications. With it, each block only has to learn the residual: the difference between its output and its input.

When spatial dimensions change (32x32 → 16x16), the skip connection uses a 1x1 convolution to match dimensions:

flowchart LR
    subgraph BasicBlock
        X["x"] --> C1["Conv 3x3<br/>stride=2"]
        C1 --> BN1["BN"] --> R1["ReLU"]
        R1 --> C2["Conv 3x3"] --> BN2["BN"]
        X --> D["Conv 1x1<br/>stride=2"] --> DBN["BN"]
        BN2 --> ADD["+ add"]
        DBN --> ADD
        ADD --> R2["ReLU"]
    end

    style X fill:#1a1a1a,stroke:#555,color:#c0c0c0
    style ADD fill:#1a1a1a,stroke:#555,color:#c0c0c0
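The add only works if both branches produce the same shape. A small sketch of the standard output-size formula, assuming padding 1 on the 3x3 convolutions and padding 0 on the 1x1 projection (the post does not state the padding, so treat both as assumptions; the function name is illustrative):

```cpp
#include <cassert>

// Output size of a convolution along one spatial axis:
// floor((in + 2*pad - kernel) / stride) + 1
int conv_out_size(int in, int kernel, int stride, int pad) {
    assert(stride > 0 && in + 2 * pad >= kernel);
    return (in + 2 * pad - kernel) / stride + 1;
}
```

With a 32x32 input, the main branch gives conv_out_size(32, 3, 2, 1) = 16 and the projection gives conv_out_size(32, 1, 2, 0) = 16, so the two tensors line up at the add.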

Convolution as Matrix Multiply

A convolution looks like a sliding window, but implementing it that way is slow. The actual implementation unfolds the input into a matrix (im2col) and multiplies by the flattened kernel. One GEMM call replaces the 6-deep nested loop over output channels, output positions, input channels, and kernel offsets.

Input [B, C_in, H, W]
  ↓ im2col
Column matrix [B*H_out*W_out, C_in*kH*kW]
  ↓ GEMM (matmul)
Output [B*H_out*W_out, C_out]
  ↓ reshape
Output [B, C_out, H_out, W_out]
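A minimal CPU sketch of that pipeline for a single image (B = 1), using the same column-matrix layout as above. This is not whitematter's code: the struct and function names are invented for illustration, and the GEMM is a naive triple loop where a BLAS call would normally go.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ConvShape {
    int c_in, h, w;      // input  [C_in, H, W]
    int c_out, kh, kw;   // kernel [C_out, C_in, kH, kW]
    int stride, pad;
    int h_out() const { return (h + 2 * pad - kh) / stride + 1; }
    int w_out() const { return (w + 2 * pad - kw) / stride + 1; }
};

// Unfold the input into a [H_out*W_out, C_in*kH*kW] matrix.
// Taps that fall outside the image keep their initial value of zero (zero padding).
std::vector<float> im2col(const std::vector<float>& x, const ConvShape& s) {
    const int rows = s.h_out() * s.w_out();
    const int cols = s.c_in * s.kh * s.kw;
    std::vector<float> col(static_cast<std::size_t>(rows) * cols, 0.0f);
    for (int oy = 0; oy < s.h_out(); ++oy)
        for (int ox = 0; ox < s.w_out(); ++ox)
            for (int c = 0; c < s.c_in; ++c)
                for (int ky = 0; ky < s.kh; ++ky)
                    for (int kx = 0; kx < s.kw; ++kx) {
                        const int iy = oy * s.stride + ky - s.pad;
                        const int ix = ox * s.stride + kx - s.pad;
                        if (iy < 0 || iy >= s.h || ix < 0 || ix >= s.w) continue;
                        const int row = oy * s.w_out() + ox;
                        const int k = (c * s.kh + ky) * s.kw + kx;
                        col[static_cast<std::size_t>(row) * cols + k] =
                            x[(c * s.h + iy) * s.w + ix];
                    }
    return col;
}

// Convolution as col [rows, K] times W^T [K, C_out], stored as [C_out, H_out, W_out].
std::vector<float> conv2d_im2col(const std::vector<float>& x,
                                 const std::vector<float>& weight,
                                 const ConvShape& s) {
    const int rows = s.h_out() * s.w_out();
    const int K = s.c_in * s.kh * s.kw;
    assert(x.size() == static_cast<std::size_t>(s.c_in) * s.h * s.w);
    assert(weight.size() == static_cast<std::size_t>(s.c_out) * K);
    const std::vector<float> col = im2col(x, s);
    std::vector<float> y(static_cast<std::size_t>(s.c_out) * rows, 0.0f);
    for (int r = 0; r < rows; ++r)
        for (int co = 0; co < s.c_out; ++co) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k)
                acc += col[static_cast<std::size_t>(r) * K + k] *
                       weight[static_cast<std::size_t>(co) * K + k];
            y[static_cast<std::size_t>(co) * rows + r] = acc;  // reshape step
        }
    return y;
}
```

The column matrix repeats each input pixel up to kH*kW times. That memory cost is the price of turning the convolution into one large matrix multiply.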

On CPU, the GEMM dispatches to OpenBLAS. On GPU, the whole convolution dispatches to cuDNN, whose implicit-GEMM algorithms fuse im2col + GEMM into a single optimized kernel. The fusion avoids materializing the intermediate column matrix, which saves memory and memory bandwidth on top of the GPU's raw parallel throughput.

// cuDNN replaces ~200 lines of im2col + GEMM with:
cudnnConvolutionForward(dnn, &alpha,
    input_desc, d_input,
    filter_desc, d_filter,
    conv_desc, algo, workspace, ws_size,
    &beta, output_desc, d_output);

Algorithm selection affects throughput. IMPLICIT_PRECOMP_GEMM precomputes the im2col offsets. WINOGRAD uses a mathematical transform to reduce the number of multiplications for 3x3 kernels. cuDNN can pick the fastest one for your tensor sizes, either by heuristic or by benchmarking the candidates.
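To make the Winograd idea concrete, here is the 1D case F(2,3): two outputs of a 3-tap filter computed with 4 multiplications instead of 6. This is a textbook illustration, not what cuDNN runs internally, and the function names are invented for the example.

```cpp
#include <array>
#include <cassert>

// Direct form: 2 outputs from 4 inputs and a 3-tap filter, 6 multiplications.
std::array<float, 2> conv1d_direct(const std::array<float, 4>& d,
                                   const std::array<float, 3>& g) {
    return {d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]};
}

// Winograd F(2,3): the same 2 outputs with 4 data multiplications.
std::array<float, 2> conv1d_winograd(const std::array<float, 4>& d,
                                     const std::array<float, 3>& g) {
    // Filter transform. It depends only on the filter, so it can be computed once and reused.
    const float u0 = g[0];
    const float u1 = 0.5f * (g[0] + g[1] + g[2]);
    const float u2 = 0.5f * (g[0] - g[1] + g[2]);
    const float u3 = g[2];
    // The four multiplications between transformed data and transformed filter.
    const float m1 = (d[0] - d[2]) * u0;
    const float m2 = (d[1] + d[2]) * u1;
    const float m3 = (d[2] - d[1]) * u2;
    const float m4 = (d[1] - d[3]) * u3;
    // Output transform: additions only.
    return {m1 + m2 + m3, m2 - m3 - m4};
}
```

Nesting the same trick in two dimensions gives F(2x2, 3x3), which produces a 2x2 output tile with 16 multiplications instead of 36. The extra additions and the transforms are the trade-off.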

Batch Normalization

Every conv layer is followed by BatchNorm. It normalizes activations to zero mean, unit variance, then scales and shifts:

y = gamma * (x - mean) / sqrt(var + eps) + beta

gamma and beta are learnable. During training, mean and var are computed per channel across the batch, and running averages of both are tracked. At inference, those frozen running averages are used instead of batch statistics.
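A sketch of the forward pass for a single channel, showing the training and inference paths side by side. The struct name, the momentum of 0.1, and eps of 1e-5 are illustrative assumptions, not values taken from whitematter. For simplicity the running variance is updated with the biased batch variance; some frameworks use the unbiased estimate there.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// BatchNorm state for one channel (hypothetical helper, not whitematter's class).
struct BatchNorm1Channel {
    float gamma = 1.0f, beta = 0.0f;                 // learnable scale and shift
    float running_mean = 0.0f, running_var = 1.0f;   // used at inference
    float momentum = 0.1f, eps = 1e-5f;              // assumed defaults

    // x holds every value of this channel across the batch (N = B*H*W entries).
    std::vector<float> forward(const std::vector<float>& x, bool training) {
        assert(!x.empty());
        float mean = running_mean, var = running_var;
        if (training) {
            const float n = static_cast<float>(x.size());
            mean = 0.0f;
            for (float v : x) mean += v;
            mean /= n;
            var = 0.0f;
            for (float v : x) var += (v - mean) * (v - mean);
            var /= n;  // biased batch variance
            // Track running averages for later use at inference.
            running_mean = (1.0f - momentum) * running_mean + momentum * mean;
            running_var = (1.0f - momentum) * running_var + momentum * var;
        }
        const float inv_std = 1.0f / std::sqrt(var + eps);
        std::vector<float> y(x.size());
        for (std::size_t i = 0; i < x.size(); ++i)
            y[i] = gamma * (x[i] - mean) * inv_std + beta;
        return y;
    }
};
```

Calling forward(x, true) normalizes with the batch statistics and updates the running averages. Calling forward(x, false) leaves the state untouched and normalizes with the running averages only.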

Without it, the distribution of activations shifts every time you update weights. BatchNorm keeps things stable, lets you use higher learning rates, and acts as a mild regularizer.

The backward pass has three separate gradients (input, gamma, beta), each requiring the saved mean and inverse standard deviation from the forward pass:

// Sums run over the N = B*H*W values of one channel.
// grad_gamma = sum(grad_output * x_norm)
// grad_beta  = sum(grad_output)
// grad_input = (gamma * inv_std / N) * (N * grad_output
//              - sum(grad_output)
//              - x_norm * sum(grad_output * x_norm))
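Written out for one channel, the three gradients look roughly like this. It is a sketch with invented names, taking the saved normalized activations and inverse standard deviation as inputs. grad_input carries a factor of gamma because the upstream gradient passes through the scale before it reaches the normalized values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct BNGrads {
    std::vector<float> grad_input;
    float grad_gamma;
    float grad_beta;
};

// Backward pass for one channel. x_norm and inv_std are the values saved by the forward pass.
BNGrads batchnorm_backward(const std::vector<float>& grad_output,
                           const std::vector<float>& x_norm,
                           float gamma, float inv_std) {
    assert(grad_output.size() == x_norm.size() && !x_norm.empty());
    const std::size_t count = x_norm.size();
    const float n = static_cast<float>(count);

    // Two reductions over the channel feed all three gradients.
    float sum_dy = 0.0f, sum_dy_xnorm = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        sum_dy += grad_output[i];
        sum_dy_xnorm += grad_output[i] * x_norm[i];
    }

    BNGrads g;
    g.grad_gamma = sum_dy_xnorm;
    g.grad_beta = sum_dy;
    g.grad_input.resize(count);
    const float scale = gamma * inv_std / n;
    for (std::size_t i = 0; i < count; ++i)
        g.grad_input[i] =
            scale * (n * grad_output[i] - sum_dy - x_norm[i] * sum_dy_xnorm);
    return g;
}
```

A quick sanity check: the entries of grad_input sum to zero whenever x_norm has zero mean, because shifting every input by the same constant does not change the normalized output.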

Training

SGD with momentum (0.9), weight decay (5e-4), cosine annealing from lr=0.1 to 0. Data augmentation: pad 4 pixels, random crop back to 32x32, random horizontal flip.
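The augmentation can be sketched as a single pass that samples from a virtual zero-padded image. This assumes zero padding and a [C, H, W] row-major float layout, neither of which the post specifies. The function names are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Crop a size x size window from the zero-padded image at offset (dx, dy),
// each in [0, 2*pad], then optionally mirror it horizontally.
// img is [channels, size, size], row-major.
std::vector<float> crop_and_flip(const std::vector<float>& img, int channels,
                                 int size, int pad, int dx, int dy, bool flip) {
    assert(img.size() == static_cast<std::size_t>(channels) * size * size);
    assert(dx >= 0 && dx <= 2 * pad && dy >= 0 && dy <= 2 * pad);
    std::vector<float> out(img.size(), 0.0f);
    for (int c = 0; c < channels; ++c)
        for (int y = 0; y < size; ++y)
            for (int x = 0; x < size; ++x) {
                // Coordinates in the original image that this output pixel samples.
                const int sy = y + dy - pad;
                const int sx = (flip ? size - 1 - x : x) + dx - pad;
                if (sy < 0 || sy >= size || sx < 0 || sx >= size) continue;  // padding
                out[(c * size + y) * size + x] = img[(c * size + sy) * size + sx];
            }
    return out;
}

// CIFAR-10 settings from the text: pad 4, crop back to 32x32, flip with probability 0.5.
std::vector<float> augment(const std::vector<float>& img, std::mt19937& rng) {
    std::uniform_int_distribution<int> offset(0, 8);  // 2 * pad
    std::bernoulli_distribution coin(0.5);
    const int dx = offset(rng);
    const int dy = offset(rng);
    const bool flip = coin(rng);
    return crop_and_flip(img, 3, 32, 4, dx, dy, flip);
}
```

An offset of (4, 4) with no flip returns the image unchanged. Every other offset shifts it by up to 4 pixels and fills the exposed border with zeros.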

flowchart LR
    D["CIFAR-10<br/>50K images"] --> A["Augment<br/>pad + crop + flip"]
    A --> F["Forward<br/>20 conv layers"]
    F --> L["Cross-Entropy<br/>Loss"]
    L --> B["Backward<br/>compute gradients"]
    B --> O["SGD + Momentum<br/>update weights"]
    O --> D

    style D fill:#1a1a1a,stroke:#555,color:#c0c0c0
    style L fill:#1a1a1a,stroke:#555,color:#c0c0c0

200 epochs. ~9 minutes per epoch on the RTX 2070 SUPER. An untrained network starts at chance (10% accuracy, same as guessing). By epoch 10 it's at ~80%. Cosine annealing lowers the learning rate gradually, letting the optimizer settle into a minimum instead of bouncing around it.

Epoch	Loss	Test Accuracy
1	1.86	46%
5	0.73	75%
10	0.53	79%
15	0.45	83%
50	0.18	91%
200	0.04	93%+

Weight decay penalizes large weights (grad += 0.0005 * weight). Without it the network overfits: train accuracy hits 99% while test accuracy plateaus at 85%.
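Putting the optimizer settings from this section together: a sketch of the cosine schedule and one SGD step. It assumes the learning rate is updated once per epoch and uses the common momentum form v = momentum * v + grad. Whitematter's exact update rule is not shown in the post, so treat both as assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine annealing from base_lr at epoch 0 down to 0 at total_epochs.
float cosine_lr(float base_lr, int epoch, int total_epochs) {
    assert(total_epochs > 0 && epoch >= 0 && epoch <= total_epochs);
    const float pi = 3.14159265358979f;
    return 0.5f * base_lr * (1.0f + std::cos(pi * epoch / total_epochs));
}

// One SGD step with momentum and L2 weight decay folded into the gradient.
void sgd_step(std::vector<float>& weight, const std::vector<float>& grad,
              std::vector<float>& velocity, float lr,
              float momentum = 0.9f, float weight_decay = 5e-4f) {
    assert(weight.size() == grad.size() && weight.size() == velocity.size());
    for (std::size_t i = 0; i < weight.size(); ++i) {
        const float g = grad[i] + weight_decay * weight[i];  // grad += 0.0005 * weight
        velocity[i] = momentum * velocity[i] + g;
        weight[i] -= lr * velocity[i];
    }
}
```

With base_lr = 0.1 and 200 epochs, the schedule gives 0.1 at epoch 0, 0.05 at epoch 100, and 0 at epoch 200. Even with a zero gradient, the decay term pulls each weight slightly toward zero on every step.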

From Trained Weights to Browser

The pipeline has four steps: train in C++, export weights, convert to ONNX, load in JavaScript.

flowchart LR
    T["C++ Training<br/>cuDNN + cuBLAS"] --> C["Checkpoint<br/>.ckpt file"]
    C --> E["Export Tool<br/>weights → binary"]
    E --> O["Python Script<br/>binary → ONNX"]
    O --> W["ONNX Runtime Web<br/>WASM inference"]

    style T fill:#1a1a1a,stroke:#555,color:#c0c0c0
    style W fill:#1a1a1a,stroke:#555,color:#c0c0c0

The export tool writes all 122 tensors (weights, biases, BatchNorm running statistics) to a flat binary file with named entries. A Python script reads this and constructs the ONNX graph with every conv, batchnorm, and skip connection wired up explicitly.
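The post does not document the binary layout, so the following is a purely hypothetical example of what a "flat file with named entries" could look like: each entry stores a name, a shape, and raw float32 data in host byte order. The struct, the field order, and the tensor name in the usage note are all invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

struct NamedTensor {
    std::string name;
    std::vector<std::uint32_t> shape;
    std::vector<float> data;
};

// HYPOTHETICAL entry layout (not whitematter's actual format):
//   u32 name length, name bytes, u32 rank, u32 dims[rank], f32 data[prod(dims)]
void write_u32(std::ostream& out, std::uint32_t v) {
    out.write(reinterpret_cast<const char*>(&v), sizeof v);
}

std::uint32_t read_u32(std::istream& in) {
    std::uint32_t v = 0;
    in.read(reinterpret_cast<char*>(&v), sizeof v);
    return v;
}

void write_tensor(std::ostream& out, const NamedTensor& t) {
    std::size_t count = 1;
    for (std::uint32_t d : t.shape) count *= d;
    assert(count == t.data.size());  // shape must match the payload
    write_u32(out, static_cast<std::uint32_t>(t.name.size()));
    out.write(t.name.data(), static_cast<std::streamsize>(t.name.size()));
    write_u32(out, static_cast<std::uint32_t>(t.shape.size()));
    for (std::uint32_t d : t.shape) write_u32(out, d);
    out.write(reinterpret_cast<const char*>(t.data.data()),
              static_cast<std::streamsize>(t.data.size() * sizeof(float)));
}

NamedTensor read_tensor(std::istream& in) {
    NamedTensor t;
    t.name.resize(read_u32(in));
    in.read(&t.name[0], static_cast<std::streamsize>(t.name.size()));
    t.shape.resize(read_u32(in));
    std::size_t count = 1;
    for (std::uint32_t& d : t.shape) {
        d = read_u32(in);
        count *= d;
    }
    t.data.resize(count);
    in.read(reinterpret_cast<char*>(t.data.data()),
            static_cast<std::streamsize>(count * sizeof(float)));
    return t;
}
```

Writing an entry such as {"layer1.0.bn1.running_mean", {64}, ...} to a std::stringstream and reading it back should round-trip exactly. Storing the name and shape next to the data is what lets a converter script match each tensor to a node in the ONNX graph without relying on file order.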

ONNX Runtime Web loads the 42MB model in your browser and runs inference on its WebAssembly backend in ~50ms per image.

Limitations

CIFAR-10 images are 32x32 pixels. The model works well on CIFAR-10 test images but struggles with real photos: they are much higher resolution, so most of their detail is lost at 32x32, and they come from a different distribution. A model trained on ImageNet (224x224, 1000 classes) would generalize better, but would need a lot more compute.

The main integration issue was the difference between the cuDNN 8 and cuDNN 9 APIs. The batchnorm API in particular behaves subtly differently between the two. Future runs should compare against PyTorch's output tensor-by-tensor from the start.

Next target: a small language model exported to ONNX. Whitematter already has MultiHeadAttention, RoPE, RMSNorm, and a GPT training script, so the main work is training, export, and browser-side inference.