ResNet-18 for Image Classification: Residual Learning on MNIST and CIFAR-10

Image classification is one of the best-studied problems in computer vision, but there is a significant gap between a working tutorial implementation and a system you can actually deploy and iterate on. This article documents the architecture, training formulation, and system design choices behind a classification framework we built on ResNet-18 and evaluated on MNIST and CIFAR-10.

The residual connection

The central idea in ResNet, introduced by He et al. (2016), is simple: instead of asking a stack of layers to learn a mapping $\mathcal{H}(x)$ directly, reformulate the problem so the layers learn the residual $\mathcal{F}(x) = \mathcal{H}(x) - x$. The original mapping becomes $\mathcal{H}(x) = \mathcal{F}(x) + x$, implemented as a shortcut connection that adds the block’s input to its output.

$$\mathbf{y} = \mathcal{F}(\mathbf{x},\, \{W_i\}) + \mathbf{x}$$

When the input and output dimensions match, this addition requires no parameters. When they differ (across downsampling layers), a $1 \times 1$ convolution projects the shortcut to the correct dimension:

$$\mathbf{y} = \mathcal{F}(\mathbf{x},\, \{W_i\}) + W_s\,\mathbf{x}$$

The practical consequence is that gradients can flow directly through the shortcut path back to early layers during backpropagation. This eases the optimisation of deep networks and addresses what He et al. call the degradation problem: plain (shortcut-free) networks deeper than roughly 20 layers reached higher training error than their shallower counterparts.
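Differentiating the identity-shortcut form $\mathbf{y} = \mathcal{F}(\mathbf{x},\, \{W_i\}) + \mathbf{x}$ makes the gradient argument concrete. The following is a sketch, writing $\mathcal{L}$ for the training loss and $\mathbf{I}$ for the identity matrix (both symbols are introduced here for illustration), treating gradients as row vectors, and ignoring the activation applied after the addition:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}}\left(\mathbf{I} + \frac{\partial \mathcal{F}}{\partial \mathbf{x}}\right) = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} + \frac{\partial \mathcal{L}}{\partial \mathbf{y}}\,\frac{\partial \mathcal{F}}{\partial \mathbf{x}}$$

The first term passes the upstream gradient to $\mathbf{x}$ unchanged, however small the Jacobian of $\mathcal{F}$ happens to be.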

ResNet-18 specification

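ResNet-18 consists of an initial $7 \times 7$ convolution followed by four layer groups, each containing two residual blocks with $3 \times 3$ convolutions, and a final global average pooling step before the classification head. The output sizes in the table are for $32 \times 32$ CIFAR-10 inputs; they imply that the max-pooling layer which follows the first convolution in the ImageNet configuration is omitted, since it would otherwise halve layer1 to 8×8.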

Layer group	Blocks	Channels	Stride	Output size (CIFAR-10)
conv1	—	64	2	16×16
layer1	2	64	1	16×16
layer2	2	128	2	8×8
layer3	2	256	2	4×4
layer4	2	512	2	2×2
avgpool	—	—	—	1×1
fc	—	10	—	—
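The output sizes in the table can be checked with the usual convolution size formula, $\lfloor (n + 2p - k)/s \rfloor + 1$. The sketch below assumes padding 3 for the $7 \times 7$ convolution and padding 1 for the $3 \times 3$ convolutions (the conventional choices, not stated above), and no pooling between conv1 and layer1, as the table implies. The function names are illustrative.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def feature_map_sizes(input_size: int) -> dict:
    """Trace the side length of the feature map through the table's stages."""
    sizes = {}
    # conv1: 7x7, stride 2; padding 3 is an assumption (the usual choice).
    size = conv_out(input_size, kernel=7, stride=2, padding=3)
    sizes["conv1"] = size
    # Only the first 3x3 convolution of a group carries the group's stride;
    # padding 1 (assumed) keeps the stride-1 convolutions size-preserving.
    for name, stride in [("layer1", 1), ("layer2", 2), ("layer3", 2), ("layer4", 2)]:
        size = conv_out(size, kernel=3, stride=stride, padding=1)
        sizes[name] = size
    sizes["avgpool"] = 1  # global average pooling collapses each map to 1x1
    return sizes


print(feature_map_sizes(32))  # CIFAR-10 input
print(feature_map_sizes(28))  # MNIST input
```

Under the same assumptions, a 28×28 MNIST input passes through 14, 14, 7, 4 and 2 before pooling, so both datasets reach layer4 at 2×2.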

Total trainable parameters: approximately 11.2 million. The model is adapted at the input layer to handle one-channel MNIST images (by setting in_channels=1) versus three-channel CIFAR-10 inputs, with everything downstream unchanged.
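The parameter total can be reproduced by counting weights layer by layer. This is a sketch under assumptions that the text does not spell out: convolutions carry no bias, every convolution (including the $1 \times 1$ projection) is followed by a batch-normalisation layer with a learnable scale and shift, and the classification head is a single linear layer with bias. The helper names are hypothetical.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a k x k convolution without bias (assumed, as BN follows)."""
    return c_in * c_out * k * k


def bn_params(channels: int) -> int:
    """Learnable scale and shift of a batch-normalisation layer."""
    return 2 * channels


def basic_block_params(c_in: int, c_out: int) -> int:
    """Two 3x3 conv + BN pairs, plus a 1x1 projection when the width changes."""
    total = conv_params(c_in, c_out, 3) + bn_params(c_out)
    total += conv_params(c_out, c_out, 3) + bn_params(c_out)
    if c_in != c_out:  # projection shortcut: 1x1 conv + BN (BN is an assumption)
        total += conv_params(c_in, c_out, 1) + bn_params(c_out)
    return total


def resnet18_params(in_channels: int = 3, num_classes: int = 10) -> int:
    total = conv_params(in_channels, 64, 7) + bn_params(64)  # conv1 + BN
    c_in = 64
    for c_out in (64, 128, 256, 512):  # layer1..layer4, two blocks each
        total += basic_block_params(c_in, c_out)
        total += basic_block_params(c_out, c_out)
        c_in = c_out
    total += 512 * num_classes + num_classes  # fc weights + biases
    return total


print(resnet18_params(in_channels=3))  # CIFAR-10 variant: 11181642
print(resnet18_params(in_channels=1))  # MNIST variant:    11175370
```

Switching to a single input channel only changes conv1, which is why the two variants differ by just 6,272 weights.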

Each residual block applies the sequence: convolution → batch normalisation → ReLU → convolution → batch normalisation, then adds the shortcut before the final activation:

$$\mathbf{y} = \text{ReLU}\!\left(\text{BN}\!\left(W_2 * \text{ReLU}\!\left(\text{BN}(W_1 * \mathbf{x})\right)\right) + \mathbf{x}\right)$$

Batch normalisation before each activation stabilises the distribution of inputs to each layer, allowing higher learning rates and reducing sensitivity to initialisation.

Training formulation

The loss function is standard cross-entropy over the $K=10$ classes. For a single example with true class $k$ and softmax output probabilities $\hat{p}$:

$$\mathcal{L} = -\log \hat{p}_k = -\log \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}$$
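As a minimal sketch of this loss for one example, the function below evaluates $-\log \hat{p}_k$ directly from the logits $z$. It subtracts the largest logit before exponentiating, a standard trick that leaves the result unchanged but avoids overflow; the function name is illustrative and not taken from the framework.

```python
import math


def cross_entropy(logits, true_class):
    """Return -log softmax(logits)[true_class] for a single example."""
    # Subtracting the max logit leaves the softmax unchanged but avoids overflow.
    z_max = max(logits)
    log_sum_exp = z_max + math.log(sum(math.exp(z - z_max) for z in logits))
    return log_sum_exp - logits[true_class]


# Ten equal logits give a uniform prediction, so the loss is log(10).
print(cross_entropy([0.0] * 10, true_class=3))
```

A loss close to $\log 10 \approx 2.30$ at the start of training is therefore what an untrained 10-class model should produce.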

For MNIST we used Adam with a learning rate of $10^{-3}$ and trained for 10 epochs with a batch size of 64. For CIFAR-10, SGD with momentum and a cosine annealing schedule performed better than Adam in our experiments, a pattern commonly reported for ResNet training on CIFAR-scale datasets:

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\!\frac{t\pi}{T}\right)$$

where $t$ is the current epoch, $T$ is the total number of epochs, $\eta_{\max} = 0.01$, and $\eta_{\min} = 10^{-5}$.
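The schedule is easy to evaluate by hand. The sketch below implements the formula with the stated $\eta_{\max}$ and $\eta_{\min}$ and prints the learning rate at the start, midpoint and end of a 50-epoch run, the CIFAR-10 run length given in the results table. The function name is illustrative.

```python
import math


def cosine_lr(t, total_epochs, eta_max=0.01, eta_min=1e-5):
    """Learning rate at epoch t under the cosine annealing schedule above."""
    cosine = math.cos(math.pi * t / total_epochs)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + cosine)


T = 50  # CIFAR-10 run length from the results table
for t in (0, T // 2, T):
    print(t, cosine_lr(t, T))
```

The rate starts at $\eta_{\max}$, passes through the midpoint of the two bounds halfway through training, and ends at $\eta_{\min}$.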

Dataset normalisation

Both datasets are normalised per-channel to zero mean and unit variance using statistics computed over the training set. For CIFAR-10, the per-channel means and standard deviations are:

$$\mu = (0.4914,\ 0.4822,\ 0.4465), \quad \sigma = (0.2023,\ 0.1994,\ 0.2010)$$

For MNIST, the single-channel statistics are $\mu = 0.1307$, $\sigma = 0.3081$.
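Applied to a single pixel, the normalisation is $(v - \mu_c)/\sigma_c$ for each channel $c$. The sketch below assumes pixel values have already been rescaled from 8-bit integers to the range $[0, 1]$, which is the scale these statistics correspond to; the constant and function names are illustrative.

```python
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.2023, 0.1994, 0.2010)
MNIST_MEAN = (0.1307,)
MNIST_STD = (0.3081,)


def normalise_pixel(pixel, mean, std):
    """Normalise one pixel given as per-channel values in [0, 1]."""
    if not (len(pixel) == len(mean) == len(std)):
        raise ValueError("pixel, mean and std must have the same channel count")
    return tuple((v - m) / s for v, m, s in zip(pixel, mean, std))


# An 8-bit RGB pixel is first rescaled to [0, 1], then standardised.
rgb = tuple(v / 255.0 for v in (128, 128, 128))
print(normalise_pixel(rgb, CIFAR10_MEAN, CIFAR10_STD))
print(normalise_pixel((0.0,), MNIST_MEAN, MNIST_STD))  # black MNIST background
```

The same statistics must be applied at inference time; normalising test images with different values shifts the input distribution the network was trained on.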

System architecture

The system separates concerns across three layers: user interfaces (a CLI and a PyQt5 GUI), application logic (trainer and inference classes), and core components (model, data loaders, utilities). The CLI is suited to scripting and remote training; the GUI wraps the same trainer classes, displays loss curves in real time, and provides direct access to checkpoints and export functions.

Checkpointing saves the model state dict alongside the optimiser state and current epoch, so training can be resumed cleanly after interruption. The best checkpoint (by validation accuracy) is saved separately from the latest checkpoint.
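The latest/best bookkeeping can be sketched independently of any deep learning framework. The example below is an illustration, not the framework's code: it uses `pickle` as a stand-in serialiser so that it runs with the standard library alone (a PyTorch implementation would typically call `torch.save` instead), and the file names, dictionary keys and accuracy values are all hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(directory, epoch, model_state, optim_state, val_acc, best_acc):
    """Write latest.pkl on every call and best.pkl when val_acc improves.

    Returns the updated best validation accuracy.
    """
    checkpoint = {
        "epoch": epoch,
        "model_state": model_state,
        "optim_state": optim_state,
        "val_acc": val_acc,
    }
    targets = ["latest.pkl"]
    if val_acc > best_acc:
        best_acc = val_acc
        targets.append("best.pkl")
    for name in targets:
        with open(os.path.join(directory, name), "wb") as fh:
            pickle.dump(checkpoint, fh)
    return best_acc


def load_checkpoint(directory, name="latest.pkl"):
    """Read a checkpoint dictionary back from disk."""
    with open(os.path.join(directory, name), "rb") as fh:
        return pickle.load(fh)


with tempfile.TemporaryDirectory() as ckpt_dir:
    best = 0.0
    # Made-up validation accuracies; the dicts stand in for real state dicts.
    for epoch, acc in enumerate([0.90, 0.93, 0.92], start=1):
        best = save_checkpoint(ckpt_dir, epoch, {"w": epoch}, {"step": epoch}, acc, best)
    resume_epoch = load_checkpoint(ckpt_dir)["epoch"] + 1
    best_epoch = load_checkpoint(ckpt_dir, "best.pkl")["epoch"]
    print(resume_epoch, best_epoch)
```

Storing the epoch alongside both state dicts is what lets a resumed run restore the optimiser's momentum buffers and continue the learning-rate schedule from the right point.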

Results

On the held-out test sets:

Dataset	Epochs	Test accuracy	Inference speed (GPU)
MNIST	10	99.3–99.5%	~1000 images/sec
CIFAR-10	50	92–94%	~800 images/sec

The MNIST result is near the practical ceiling for this architecture without data augmentation. The CIFAR-10 result is consistent with published ResNet-18 baselines on this benchmark; pushing past 94% typically requires heavier augmentation (Cutout, AutoAugment) or a wider network.

Inference pipeline

The inference engine supports both single-image and batch-directory processing. Predictions are returned as a ranked list of class probabilities with configurable top-$k$ and confidence threshold filtering. Output is serialised to JSON, with batch results also written to CSV for downstream analysis.

For a single-image prediction, the top class and its confidence are determined as:

$$\hat{k} = \arg\max_{j} \hat{p}_j, \qquad \text{confidence} = \hat{p}_{\hat{k}} \times 100\%$$
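The ranking, top-$k$ selection and threshold filtering described above might look like the following sketch. The function name, the output keys and the example classes are assumptions made for illustration; the actual JSON schema of the inference engine is not shown in this article.

```python
import json


def rank_predictions(probs, class_names, top_k=3, min_confidence=0.0):
    """Return the top_k classes by probability, dropping those below the threshold.

    Confidence is reported as a percentage, matching the formula above.
    """
    ranked = sorted(zip(class_names, probs), key=lambda pair: pair[1], reverse=True)
    return [
        {"class": name, "confidence": round(p * 100.0, 2)}
        for name, p in ranked[:top_k]
        if p >= min_confidence
    ]


probs = [0.05, 0.70, 0.25]      # illustrative softmax output
names = ["cat", "dog", "ship"]  # illustrative class names
result = rank_predictions(probs, names, top_k=2, min_confidence=0.10)
print(json.dumps(result))
```

The first entry of the returned list corresponds to $\hat{k}$ whenever its probability clears the threshold; an empty list signals that no class was confident enough to report.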

The batch pipeline processes a directory of images and aggregates a confusion matrix and per-class precision, recall, and F1 scores using the standard definitions. These are exported as a Bokeh interactive HTML dashboard alongside static PNG and CSV outputs.
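For reference, the standard definitions can be computed from the confusion matrix alone: precision is the diagonal entry divided by its column sum, recall is the diagonal entry divided by its row sum, and F1 is their harmonic mean. The sketch below uses made-up labels and hypothetical function names, and returns 0.0 where a denominator is zero (one common convention).

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """matrix[i][j] counts examples of true class i predicted as class j."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_metrics(matrix):
    """Precision, recall and F1 per class; a zero denominator yields 0.0."""
    n = len(matrix)
    metrics = []
    for c in range(n):
        tp = matrix[c][c]
        predicted = sum(matrix[r][c] for r in range(n))  # column sum: TP + FP
        actual = sum(matrix[c])                          # row sum: TP + FN
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return metrics


y_true = [0, 0, 1, 1, 2, 2]  # illustrative labels
y_pred = [0, 1, 1, 1, 2, 0]  # illustrative predictions
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)
print(per_class_metrics(cm))
```

In this toy example class 1 has perfect recall but a precision of 2/3, because one class-0 image was also predicted as class 1.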

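He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).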