Research  15 April 2025

Semantic Segmentation of Land Cover from Multispectral Imagery: A CNN Approach

Our methodology for training a convolutional segmentation model to classify land cover types from optical multiband satellite imagery, and how we packaged it as a QGIS plugin.

computer-visioncnnremote-sensinggeospatial

Land cover mapping from satellite imagery is a classical remote sensing problem. What changed over the last decade is that deep convolutional models now substantially outperform traditional methods (maximum likelihood classification, random forests over hand-crafted spectral indices) on this task, provided you have labelled training data and sufficient compute for fine-tuning.

This article documents the approach we used to train a semantic segmentation model on optical multiband imagery and deliver it as a QGIS plugin for field use by analysts who are not ML practitioners.

Data

We used Sentinel-2 Level-2A imagery (surface reflectance, atmospherically corrected) at 10m ground resolution. Sentinel-2 provides 13 spectral bands; we used 10 after dropping the three 60m-resolution bands (B1, B9, B10) that contribute little information for land cover discrimination at this scale.

Bands used: B2 (Blue), B3 (Green), B4 (Red), B5–B7 (Red Edge), B8 (NIR), B8A (Narrow NIR), B11–B12 (SWIR).

Training labels were produced by digitising polygons against high-resolution Google Earth imagery and Sentinel-2 false-colour composites, covering five classes:

ClassLabel
Built-up / Urban0
Cropland1
Dense Vegetation2
Sparse Vegetation / Grassland3
Water4

Architecture

We used a U-Net architecture with a ResNet-34 encoder pretrained on ImageNet. The encoder weights were initialised from ImageNet pretraining (adapted to accept 10-channel input by averaging the original 3-channel weights across the channel dimension for the first convolutional layer).

The U-Net decoder path:

y^=Softmax ⁣(D ⁣(E(X;θE);θD))\hat{y} = \text{Softmax}\!\left(D\!\left(E(X; \theta_E); \theta_D\right)\right)

where EE is the ResNet-34 encoder, DD is the decoder with skip connections from encoder feature maps, and XRH×W×10X \in \mathbb{R}^{H \times W \times 10} is the multispectral input patch.

Skip connections concatenate encoder feature maps at each resolution level to the upsampled decoder feature maps, preserving spatial detail lost during downsampling.

Loss Function

We used a combined loss: Dice Loss and Focal Loss, summed with equal weight.

Dice Loss addresses class imbalance by operating on the overlap ratio rather than per-pixel cross-entropy:

LDice=12ipigi+εipi+igi+ε\mathcal{L}_\text{Dice} = 1 - \frac{2 \sum_i p_i g_i + \varepsilon}{\sum_i p_i + \sum_i g_i + \varepsilon}

where pi[0,1]p_i \in [0,1] is the predicted probability for the correct class at pixel ii, gi{0,1}g_i \in \{0, 1\} is the ground truth, and ε=106\varepsilon = 10^{-6} is a smoothing term.

Focal Loss down-weights easy examples (well-classified pixels) to focus gradient updates on hard, misclassified pixels:

LFocal=1Ni(1pi)γlogpi\mathcal{L}_\text{Focal} = -\frac{1}{N} \sum_i (1 - p_i)^\gamma \log p_i

with γ=2\gamma = 2, a standard value from the original Lin et al. (2017) paper.

Total loss:

L=LDice+LFocal\mathcal{L} = \mathcal{L}_\text{Dice} + \mathcal{L}_\text{Focal}

Training

import segmentation_models_pytorch as smp
import torch

model = smp.Unet(
    encoder_name="resnet34",
    encoder_weights="imagenet",
    in_channels=10,
    classes=5,
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

dice_loss = smp.losses.DiceLoss(mode="multiclass")
focal_loss = smp.losses.FocalLoss(mode="multiclass", gamma=2.0)

def criterion(logits, targets):
    return dice_loss(logits, targets) + focal_loss(logits, targets)

Training ran for 80 epochs on 256×256 pixel patches (with a stride of 128 for overlap during inference), with standard augmentations: random horizontal/vertical flips, 90° rotations, and random brightness/contrast jitter applied to spectral bands independently.

Results

Evaluation on a held-out test area (geographic split, not random split — to test generalisation across spatial domains):

ClassIoUF1
Built-up0.740.85
Cropland0.680.81
Dense Vegetation0.810.89
Sparse Vegetation0.620.77
Water0.910.95
Mean (mIoU)0.750.85

Water bodies are the easiest class (strong NDWI signal). Sparse vegetation and cropland are hardest to separate, which is expected given their spectral overlap during certain phenological periods.

QGIS Plugin

Delivering the model as a QGIS plugin was a deliberate product decision. Analysts doing land use assessments already work in QGIS; wrapping inference behind a plugin interface meant they could run the model on a new raster directly from within their existing workflow, without a Python environment or command-line knowledge.

The plugin:

  1. Accepts a multiband raster layer (in the correct band order) as input
  2. Runs sliding window inference with overlap, reconstructing the full-extent segmentation map
  3. Outputs a classified raster layer added to the current QGIS project
  4. Optionally vectorises the raster output to a polygon layer for further editing

The inference wrapper handles the patch extraction, overlap blending (averaging logits in overlapping regions), and CRS/transform preservation automatically.


Lin, T. et al. (2017). Focal Loss for Dense Object Detection. ICCV 2017.
Ronneberger, O., Fischer, P. & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI 2015.