Semantic Segmentation of Land Cover from Multispectral Imagery: A CNN Approach

Land cover mapping from satellite imagery is a classical remote sensing problem. What changed over the last decade is that deep convolutional models now substantially outperform traditional methods (maximum likelihood classification, random forests over hand-crafted spectral indices) on this task, provided you have labelled training data and sufficient compute for fine-tuning.

This article documents the approach we used to train a semantic segmentation model on optical multiband imagery and deliver it as a QGIS plugin for field use by analysts who are not ML practitioners.

Data

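We used Sentinel-2 Level-2A imagery (atmospherically corrected surface reflectance) on a 10 m grid. Sentinel-2 carries 13 spectral bands; we used 10, dropping the three 60 m bands (B1, B9, B10), which contribute little to land cover discrimination at this scale. B10, the cirrus band, is not distributed in Level-2A products in any case. Six of the retained bands (B5–B7, B8A, B11, B12) are natively 20 m and have to be resampled to the 10 m grid before stacking.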

Bands used: B2 (Blue), B3 (Green), B4 (Red), B5–B7 (Red Edge), B8 (NIR), B8A (Narrow NIR), B11–B12 (SWIR).
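Because the model depends on the band order, it helps to fix that order in one place. The following is a minimal NumPy sketch, not the project's actual loader: the `BAND_ORDER` constant and the `stack_bands` helper are hypothetical names, and the per-band arrays are assumed to be already resampled to a common 10 m grid. It returns a channel-first array, as PyTorch expects.

```python
import numpy as np

# Assumed channel order, following the band list above.
BAND_ORDER = ["B2", "B3", "B4", "B5", "B6", "B7", "B8", "B8A", "B11", "B12"]


def stack_bands(bands):
    """Stack per-band 2-D arrays into one (10, H, W) float32 array.

    `bands` maps a band name to a 2-D array. All arrays must already share
    the same grid (resampling is not handled here).
    """
    missing = [name for name in BAND_ORDER if name not in bands]
    if missing:
        raise KeyError(f"missing bands: {missing}")
    shapes = {bands[name].shape for name in BAND_ORDER}
    if len(shapes) != 1:
        raise ValueError(f"bands must share one grid, got shapes {sorted(shapes)}")
    return np.stack([bands[name] for name in BAND_ORDER], axis=0).astype(np.float32)
```

Failing loudly on a missing band or a mismatched grid is deliberate here: a silently reordered or misaligned stack still produces a prediction, just a wrong one.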

Training labels were produced by digitising polygons against high-resolution Google Earth imagery and Sentinel-2 false-colour composites, covering five classes:

Class	Label
Built-up / Urban	0
Cropland	1
Dense Vegetation	2
Sparse Vegetation / Grassland	3
Water	4

Architecture

We used a U-Net with a ResNet-34 encoder initialised from ImageNet-pretrained weights. Because those weights expect 3-channel RGB input, the first convolutional layer was adapted to 10 channels: its original filters were averaged across the three input channels, and that average was used for each of the 10 input bands.
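The first-layer adaptation can be sketched as follows. This is an illustration in NumPy rather than the project's code: `adapt_first_conv` is a hypothetical helper, and the weight layout `(out_channels, in_channels, kH, kW)` is the one PyTorch uses for `Conv2d`. The same two operations (mean over the channel axis, then repeat) apply unchanged to a torch tensor.

```python
import numpy as np


def adapt_first_conv(weight_rgb, in_channels):
    """Expand (out, 3, kH, kW) conv weights to (out, in_channels, kH, kW).

    Every new input channel receives the mean of the three RGB filters.
    """
    if weight_rgb.ndim != 4 or weight_rgb.shape[1] != 3:
        raise ValueError("expected weights of shape (out, 3, kH, kW)")
    mean_filter = weight_rgb.mean(axis=1, keepdims=True)  # (out, 1, kH, kW)
    return np.repeat(mean_filter, in_channels, axis=1)    # (out, C, kH, kW)


# ResNet-34's first layer has 64 filters of size 7x7; random values stand in
# for the pretrained weights here.
rgb_weights = np.random.default_rng(0).normal(size=(64, 3, 7, 7))
weights_10 = adapt_first_conv(rgb_weights, in_channels=10)
```

Some implementations additionally rescale the result by `3 / in_channels` so that first-layer activations keep a similar magnitude; whether that was done here is not stated. Libraries such as `segmentation_models_pytorch` also apply their own first-layer adaptation when `in_channels` is not 3, so an explicit step like this is only needed if that behaviour is overridden.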

The full forward pass through encoder and decoder is:

$$\hat{y} = \text{Softmax}\!\left(D\!\left(E(X; \theta_E); \theta_D\right)\right)$$

where $E$ is the ResNet-34 encoder, $D$ is the decoder with skip connections from encoder feature maps, and $X \in \mathbb{R}^{H \times W \times 10}$ is the multispectral input patch.

Skip connections concatenate encoder feature maps at each resolution level to the upsampled decoder feature maps, preserving spatial detail lost during downsampling.
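The shape bookkeeping of one skip connection can be made concrete with a small NumPy sketch. The helper names are hypothetical and nearest-neighbour upsampling is an assumption chosen for brevity. The shapes correspond to a 256×256 input to ResNet-34: the deepest feature map has 512 channels at 8×8, and the skip from the previous stage has 256 channels at 16×16.

```python
import numpy as np


def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def merge_skip(decoder_feat, encoder_feat):
    """Upsample the decoder features and concatenate the encoder skip."""
    up = upsample2x(decoder_feat)
    if up.shape[1:] != encoder_feat.shape[1:]:
        raise ValueError("spatial sizes must match after upsampling")
    return np.concatenate([up, encoder_feat], axis=0)  # stack along channels


deepest = np.zeros((512, 8, 8))   # encoder output at 1/32 resolution
skip = np.ones((256, 16, 16))     # encoder feature map at 1/16 resolution
merged = merge_skip(deepest, skip)
```

In an actual decoder block the concatenated tensor is then passed through convolutions that reduce the channel count again before the next upsampling step.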

Loss Function

We used a combined loss: Dice Loss and Focal Loss, summed with equal weight.

Dice Loss addresses class imbalance by operating on the overlap ratio rather than per-pixel cross-entropy:

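$$\mathcal{L}_\text{Dice} = 1 - \frac{2 \sum_i p_i g_i + \varepsilon}{\sum_i p_i + \sum_i g_i + \varepsilon}$$

where, for a given class, $p_i \in [0,1]$ is the predicted probability of that class at pixel $i$, $g_i \in \{0, 1\}$ indicates whether pixel $i$ belongs to that class in the ground truth, and $\varepsilon = 10^{-6}$ is a smoothing term. The loss is computed per class and averaged over the five classes. In the Focal Loss that follows, $p_i$ instead denotes the probability assigned to the true class of pixel $i$.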
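A minimal NumPy sketch of the Dice formula above, evaluated per class and then averaged over classes. The averaging is an assumption about the multiclass reduction, and the function is illustrative rather than the library implementation used in training.

```python
import numpy as np


def dice_loss(probs, target, eps=1e-6):
    """Multiclass Dice loss.

    probs:  (C, H, W) softmax probabilities.
    target: (H, W) integer class labels in [0, C).
    """
    num_classes = probs.shape[0]
    # One-hot encode the labels and move the class axis first: (C, H, W).
    one_hot = np.moveaxis(np.eye(num_classes)[target], -1, 0)
    intersection = (probs * one_hot).sum(axis=(1, 2))
    denominator = probs.sum(axis=(1, 2)) + one_hot.sum(axis=(1, 2))
    dice_per_class = (2 * intersection + eps) / (denominator + eps)
    return float(1 - dice_per_class.mean())
```

Because each class contributes one ratio regardless of how many pixels it covers, a rare class weighs as much in the average as a dominant one, which is the property that makes the loss robust to imbalance.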

Focal Loss down-weights easy examples (well-classified pixels) to focus gradient updates on hard, misclassified pixels:

$$\mathcal{L}_\text{Focal} = -\frac{1}{N} \sum_i (1 - p_i)^\gamma \log p_i$$

where $N$ is the number of pixels and $\gamma = 2$, the standard value from Lin et al. (2017).
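To see the down-weighting at work: with $\gamma = 2$, a pixel whose true class is predicted at 0.9 has its cross-entropy term scaled by $(1 - 0.9)^2 = 0.01$, while a pixel predicted at 0.1 is scaled by $(1 - 0.1)^2 = 0.81$. The sketch below implements the formula in NumPy under the assumption that the probability in the formula is the one assigned to the true class; it is an illustration, not the library code used in training.

```python
import numpy as np


def focal_loss(probs, target, gamma=2.0):
    """Multiclass focal loss without class weighting.

    probs:  (C, H, W) softmax probabilities.
    target: (H, W) integer class labels in [0, C).
    """
    # Probability assigned to the true class of every pixel: (H, W).
    p_true = np.take_along_axis(probs, target[None, ...], axis=0)[0]
    p_true = np.clip(p_true, 1e-12, 1.0)  # avoid log(0)
    return float(np.mean(-((1 - p_true) ** gamma) * np.log(p_true)))
```

Setting `gamma=0` removes the modulating factor and recovers plain cross-entropy, which is a convenient sanity check.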

Total loss:

$$\mathcal{L} = \mathcal{L}_\text{Dice} + \mathcal{L}_\text{Focal}$$

Training

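import segmentation_models_pytorch as smp
import torch

# U-Net with an ImageNet-pretrained ResNet-34 encoder, 10 input bands, 5 classes
model = smp.Unet(
    encoder_name="resnet34",
    encoder_weights="imagenet",
    in_channels=10,
    classes=5,
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
# T_max is the number of scheduler steps from the initial LR down to the minimum.
# Note: if the scheduler is stepped once per epoch, T_max=50 with 80 epochs means
# the LR reaches its minimum at epoch 50 and then rises again; T_max=80 would
# give a single monotonic decay over the whole run.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

# Both losses take raw logits (N, 5, H, W) and integer targets (N, H, W)
dice_loss = smp.losses.DiceLoss(mode="multiclass")
focal_loss = smp.losses.FocalLoss(mode="multiclass", gamma=2.0)

def criterion(logits, targets):
    """Equal-weight sum of Dice and Focal loss."""
    return dice_loss(logits, targets) + focal_loss(logits, targets)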
Training ran for 80 epochs on 256×256 pixel patches. Augmentations were standard: random horizontal and vertical flips, 90° rotations, and random brightness/contrast jitter applied to each spectral band independently. At inference time, patches are extracted with a stride of 128 pixels, so adjacent windows overlap by half.
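The augmentations described above can be sketched in NumPy as follows. This is a hedged illustration: the `augment` function, the 0.5 flip probability, and the `jitter` magnitude of 0.1 are assumptions, and the brightness offset presumes reflectance values scaled to roughly [0, 1]. Geometric transforms are applied identically to the patch and its label mask; the radiometric jitter touches the patch only.

```python
import numpy as np


def augment(patch, mask, rng, jitter=0.1):
    """Randomly flip, rotate and jitter one training sample.

    patch: (C, H, W) float array, square in H and W.
    mask:  (H, W) integer label array.
    rng:   numpy.random.Generator.
    """
    if rng.random() < 0.5:  # horizontal flip
        patch, mask = patch[:, :, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:  # vertical flip
        patch, mask = patch[:, ::-1, :], mask[::-1, :]
    k = int(rng.integers(0, 4))  # number of 90-degree rotations
    patch = np.rot90(patch, k, axes=(1, 2))
    mask = np.rot90(mask, k, axes=(0, 1))

    # Per-band contrast (gain around the band mean) and brightness (offset),
    # drawn independently for every spectral band.
    channels = patch.shape[0]
    gain = rng.uniform(1 - jitter, 1 + jitter, size=(channels, 1, 1))
    offset = rng.uniform(-jitter, jitter, size=(channels, 1, 1))
    band_mean = patch.mean(axis=(1, 2), keepdims=True)
    patch = (patch - band_mean) * gain + band_mean + offset
    return np.ascontiguousarray(patch), np.ascontiguousarray(mask)
```

Jittering bands independently changes the ratios between bands, which makes the model less reliant on exact spectral ratios; a large `jitter` value would distort those ratios beyond anything seen in real imagery, so the magnitude is worth tuning.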

Results

Evaluation on a held-out test area (a geographic split rather than a random one, to test generalisation across spatial domains):

Class	IoU	F1
Built-up	0.74	0.85
Cropland	0.68	0.81
Dense Vegetation	0.81	0.89
Sparse Vegetation	0.62	0.77
Water	0.91	0.95
Mean (mIoU)	0.75	0.85
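For reference, per-class IoU and F1 can both be derived from a confusion matrix, and for a single class they are linked by $F_1 = 2\,\text{IoU} / (1 + \text{IoU})$; an IoU of 0.74, for example, corresponds to an F1 of about 0.85. The sketch below is illustrative (`per_class_iou_f1` is a hypothetical helper, and the small confusion matrix is made up for the example, not taken from the evaluation).

```python
import numpy as np


def per_class_iou_f1(confusion):
    """Per-class IoU and F1 from a confusion matrix.

    confusion[i, j] = number of pixels of true class i predicted as class j.
    Classes absent from both truth and prediction yield a division by zero
    and are not handled here.
    """
    confusion = np.asarray(confusion, dtype=np.float64)
    tp = np.diag(confusion)
    fp = confusion.sum(axis=0) - tp  # predicted as the class, but wrong
    fn = confusion.sum(axis=1) - tp  # belong to the class, but missed
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1


# Toy two-class example (made-up counts).
iou, f1 = per_class_iou_f1([[8, 2], [1, 9]])
```

The identity holds per class only; the mean of the F1 column is not, in general, the same function of the mean IoU.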

Water bodies are the easiest class (strong NDWI signal). Sparse vegetation and cropland are hardest to separate, which is expected given their spectral overlap during certain phenological periods.
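The NDWI contrast mentioned above is easy to compute from the input stack. The model learns its own features and does not use NDWI explicitly; the sketch is only meant to show why water is so separable. It uses the standard green/NIR definition, $(\text{Green} - \text{NIR}) / (\text{Green} + \text{NIR})$, and assumes the 10-band channel order listed in the Data section, which puts B3 at index 1 and B8 at index 6. The reflectance values in the example are made up.

```python
import numpy as np

GREEN_IDX, NIR_IDX = 1, 6  # B3 and B8 in the assumed 10-band order


def ndwi(stack):
    """NDWI for a (10, H, W) reflectance stack; positive values suggest water."""
    green = stack[GREEN_IDX].astype(np.float64)
    nir = stack[NIR_IDX].astype(np.float64)
    total = green + nir
    # Guard against division by zero on no-data pixels.
    return np.divide(green - nir, total, out=np.zeros_like(total), where=total != 0)


# Two illustrative pixels: water (low NIR) and vegetation (high NIR).
stack = np.zeros((10, 1, 2))
stack[GREEN_IDX, 0] = [0.06, 0.05]
stack[NIR_IDX, 0] = [0.02, 0.40]
index = ndwi(stack)
```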

QGIS Plugin

Delivering the model as a QGIS plugin was a deliberate product decision. Analysts doing land use assessments already work in QGIS; wrapping inference behind a plugin interface meant they could run the model on a new raster directly from within their existing workflow, without a Python environment or command-line knowledge.

The plugin:

Accepts a multiband raster layer (in the correct band order) as input
Runs sliding window inference with overlap, reconstructing the full-extent segmentation map
Outputs a classified raster layer added to the current QGIS project
Optionally vectorises the raster output to a polygon layer for further editing

The inference wrapper handles the patch extraction, overlap blending (averaging logits in overlapping regions), and CRS/transform preservation automatically.
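The patch extraction and logit averaging can be sketched as below. This is a simplified NumPy illustration, not the plugin's code: `predict_fn` is a hypothetical callable standing in for the model, the image is assumed to be at least one patch in size, and georeferencing (CRS and transform) is left out entirely. The last window along each axis is shifted back so that it ends exactly at the image border.

```python
import numpy as np


def window_starts(size, patch, stride):
    """Start offsets that cover [0, size); the last window is shifted to fit."""
    if size <= patch:
        return [0]
    starts = list(range(0, size - patch + 1, stride))
    if starts[-1] != size - patch:
        starts.append(size - patch)
    return starts


def sliding_window_predict(image, predict_fn, num_classes, patch=256, stride=128):
    """Classify a (C, H, W) image by averaging logits over overlapping windows.

    predict_fn maps a (C, h, w) tile to logits of shape (num_classes, h, w).
    Returns an (H, W) array of class indices.
    """
    _, height, width = image.shape
    logit_sum = np.zeros((num_classes, height, width), dtype=np.float64)
    count = np.zeros((height, width), dtype=np.float64)
    for top in window_starts(height, patch, stride):
        for left in window_starts(width, patch, stride):
            tile = image[:, top:top + patch, left:left + patch]
            logit_sum[:, top:top + patch, left:left + patch] += predict_fn(tile)
            count[top:top + patch, left:left + patch] += 1
    # Every pixel is covered at least once, so count is never zero.
    return (logit_sum / count).argmax(axis=0)
```

Averaging logits before the argmax, rather than voting on per-window class labels, lets a confident prediction from a window where the pixel sits near the centre outweigh an uncertain one from a window where it sits at the edge.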

References

Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. (2017). Focal Loss for Dense Object Detection. ICCV 2017.
Ronneberger, O., Fischer, P. & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI 2015.