Research  18 March 2025

Liver Segmentation from CT Scans: U-Net with Combined Dice–BCE Loss and HU Windowing

Our modular PyTorch pipeline for segmenting liver regions from CT imagery — covering the U-Net architecture, Hounsfield unit windowing, combined loss formulation, and the full suite of segmentation metrics.

Tags: computer-vision, segmentation, medical-imaging, unet, pytorch

Medical image segmentation differs from natural image segmentation in ways that matter architecturally and in training. The foreground class is typically a small fraction of the image (a liver occupies perhaps 5–15% of a CT slice’s pixels), the input modality has specific physical meaning that must be respected in preprocessing, and the evaluation metrics used clinically are not the same as those used in general computer vision benchmarks. This article documents the approach we took to liver segmentation from CT slices: the preprocessing choices, the model architecture, the loss function formulation, and the full metric suite we track.

Hounsfield unit windowing

CT pixel values are Hounsfield units (HU), a quantitative scale where water is 0 HU, air is approximately −1000 HU, and dense bone is approximately +1000 HU. Liver parenchyma sits in the range of roughly 40–60 HU. Feeding the full HU range to the network is counterproductive: the liver's signal occupies a narrow band, and the contrast between liver and surrounding structures is best preserved by windowing to a range around that band before normalisation.

We apply a window centred at $c = 50$ HU with width $w = 250$ HU, clipping values outside $[c - w/2,\ c + w/2]$, then normalising to $[0, 1]$:

$$x_{\text{win}} = \text{clip}\!\left(x,\ c - \tfrac{w}{2},\ c + \tfrac{w}{2}\right)$$

$$x_{\text{norm}} = \frac{x_{\text{win}} - \left(c - \tfrac{w}{2}\right)}{w}$$

This is equivalent to setting the window level and width on a clinical PACS workstation to values a radiologist would use when reviewing liver cases. The parameters are configurable in the training config rather than hardcoded, since optimal windowing varies by task and scanner protocol.
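The windowing step can be sketched in a few lines of NumPy. This is a minimal illustration of the two formulas, not the pipeline's actual code: the function name `window_hu` and its argument names are assumptions, and the defaults simply mirror the `hu_center` and `hu_width` values quoted in this article.

```python
import numpy as np


def window_hu(x, center=50.0, width=250.0):
    """Clip HU values to [center - width/2, center + width/2], then rescale to [0, 1].

    `window_hu` is a hypothetical helper name used only for this sketch.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    lo = center - width / 2.0   # lower window edge, -75 HU for the defaults
    hi = center + width / 2.0   # upper window edge, +175 HU for the defaults
    x_win = np.clip(np.asarray(x, dtype=np.float32), lo, hi)
    return (x_win - lo) / width


# Air, lower window edge, water, window centre, upper window edge, dense bone
hu = np.array([-1000.0, -75.0, 0.0, 50.0, 175.0, 1000.0])
print(window_hu(hu))  # approximately 0, 0, 0.3, 0.5, 1, 1
```

With the default window, everything at or below −75 HU collapses to 0 and everything at or above +175 HU collapses to 1, so air and dense bone no longer consume dynamic range.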

U-Net architecture

The model follows the encoder–decoder structure of U-Net (Ronneberger et al., 2015), with skip connections that concatenate encoder feature maps to the upsampled decoder feature maps at each resolution level. This allows the decoder to recover spatial detail lost during downsampling, which is essential for segmentation tasks where mask boundaries matter.

For an input $X \in \mathbb{R}^{1 \times H \times W}$, the encoder produces feature maps $E_k$ at spatial resolutions $H/2^k \times W/2^k$ for $k = 0, 1, 2, 3, 4$. The decoder upsamples back through the same levels. At each level $k$, the decoder input is the concatenation of the upsampled features from the deeper level $k+1$ and the corresponding encoder feature map:

$$D_k = \text{ConvBlock}\!\left([\text{Up}(D_{k+1})\ \|\ E_k]\right)$$

where $\|$ denotes channel-wise concatenation. The output is a single-channel logit map with the same spatial dimensions as the input, passed through a sigmoid during inference to obtain the probability mask $\hat{p} \in [0,1]^{H \times W}$.

The channel depths follow a standard doubling schedule: 64, 128, 256, 512, 1024. The implementation is modular — channel depths are a configurable parameter, so a shallower or narrower network can be instantiated from the same codebase for resource-constrained deployment.
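The encoder, skip concatenation, and configurable channel schedule described above can be sketched in PyTorch as follows. This is an illustrative sketch rather than the pipeline's implementation: the class names `ConvBlock` and `UNet`, the use of padded 3×3 convolutions with batch normalisation, and transposed convolutions for upsampling are all assumptions (the original U-Net paper uses unpadded convolutions and no batch norm). Input height and width are assumed to be divisible by 2 raised to the number of pooling steps.

```python
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Two padded 3x3 convolutions, each followed by batch norm and ReLU (assumed design)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)


class UNet(nn.Module):
    """U-Net whose depth and width are set by the `channels` tuple."""

    def __init__(self, in_channels=1, channels=(64, 128, 256, 512, 1024)):
        super().__init__()
        self.encoders = nn.ModuleList()
        prev = in_channels
        for ch in channels:
            self.encoders.append(ConvBlock(prev, ch))
            prev = ch
        self.pool = nn.MaxPool2d(2)
        self.ups = nn.ModuleList()
        self.decoders = nn.ModuleList()
        # Pair each deeper level with the next shallower one, e.g. (1024, 512), ..., (128, 64)
        for deep, shallow in zip(channels[:0:-1], channels[-2::-1]):
            self.ups.append(nn.ConvTranspose2d(deep, shallow, kernel_size=2, stride=2))
            # After concatenation the decoder block sees 2 * shallow channels
            self.decoders.append(ConvBlock(2 * shallow, shallow))
        self.head = nn.Conv2d(channels[0], 1, kernel_size=1)  # single-channel logit map

    def forward(self, x):
        skips = []
        for i, enc in enumerate(self.encoders):
            x = enc(x)
            if i < len(self.encoders) - 1:  # the deepest level has no skip and no pooling
                skips.append(x)
                x = self.pool(x)
        for up, dec, skip in zip(self.ups, self.decoders, reversed(skips)):
            x = dec(torch.cat([up(x), skip], dim=1))  # channel-wise concatenation
        return self.head(x)  # raw logits; apply sigmoid at inference


# A narrower variant than the 64..1024 default, instantiated from the same class
model = UNet(channels=(8, 16, 32, 64, 128)).eval()
with torch.no_grad():
    logits = model(torch.randn(2, 1, 64, 64))
probs = torch.sigmoid(logits)
print(logits.shape)  # torch.Size([2, 1, 64, 64])
```

Passing a shorter or smaller `channels` tuple is one way to obtain the shallower or narrower network mentioned above without touching the model code.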

Loss function

Class imbalance is the central training challenge in medical segmentation. In a 512×512 CT slice with a liver occupying 10% of pixels, there are roughly nine background pixels for every foreground pixel. Unweighted binary cross-entropy is dominated by the background term, so a network can reach a low aggregate loss while under-segmenting the liver. We address this with a combined Dice loss and weighted BCE loss.

Dice loss directly optimises a soft version of the Dice coefficient (the metric we care about) and is far less sensitive to class imbalance because true negatives never enter the ratio: it compares the overlap with the sum of predicted and true foreground:

$$\mathcal{L}_{\text{Dice}} = 1 - \frac{2\sum_i \hat{p}_i\, g_i + \varepsilon}{\sum_i \hat{p}_i + \sum_i g_i + \varepsilon}$$

where $\hat{p}_i$ is the predicted probability at pixel $i$, $g_i \in \{0,1\}$ is the ground truth, and $\varepsilon = 10^{-6}$ prevents division by zero on empty masks.

Binary cross-entropy with positive-class weighting applies a higher weight to foreground pixels to counteract the imbalance directly. Passing raw logits (before sigmoid) to BCEWithLogitsLoss is numerically more stable than applying a sigmoid first and then computing plain BCE:

$$\mathcal{L}_{\text{BCE}} = -\frac{1}{N}\sum_i \left[w_+\, g_i \log \sigma(z_i) + (1 - g_i)\log(1 - \sigma(z_i))\right]$$

where $z_i$ is the raw logit, $\sigma$ is the sigmoid, $N$ is the number of pixels, and $w_+ = 5.0$ is the positive class weight, tunable in config.

The combined loss is a weighted sum:

$$\mathcal{L} = \alpha\, \mathcal{L}_{\text{Dice}} + (1 - \alpha)\, \mathcal{L}_{\text{BCE}}$$

with $\alpha = 0.8$ in our default configuration, weighting Dice more heavily since it directly reflects our evaluation objective. Both $\alpha$ and $w_+$ are configurable.
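A compact PyTorch sketch of the combined loss is given below. The class name `DiceBCELoss` and its argument names are assumptions chosen to mirror the config keys in this article; the sketch also assumes the Dice sums run over every pixel in the batch, which the formulas above leave open.

```python
import torch
import torch.nn as nn


class DiceBCELoss(nn.Module):
    """alpha * soft Dice loss + (1 - alpha) * positively weighted BCE on logits.

    Hypothetical helper for illustration; Dice is pooled over the whole batch (assumption).
    """

    def __init__(self, dice_weight=0.8, bce_pos_weight=5.0, eps=1e-6):
        super().__init__()
        self.dice_weight = dice_weight
        self.eps = eps
        # pos_weight scales only the foreground term of the cross-entropy
        self.bce = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([bce_pos_weight]))

    def forward(self, logits, target):
        target = target.float()
        probs = torch.sigmoid(logits)  # probabilities are needed only for the Dice term
        intersection = (probs * target).sum()
        dice = (2.0 * intersection + self.eps) / (probs.sum() + target.sum() + self.eps)
        dice_loss = 1.0 - dice
        bce_loss = self.bce(logits, target)  # computed on raw logits for stability
        return self.dice_weight * dice_loss + (1.0 - self.dice_weight) * bce_loss


criterion = DiceBCELoss()
logits = torch.zeros(1, 1, 2, 2)                       # every pixel predicted at p = 0.5
target = torch.tensor([[[[1.0, 1.0], [0.0, 0.0]]]])   # half foreground, half background
print(criterion(logits, target).item())  # roughly 0.816 = 0.8 * 0.5 + 0.2 * 3 ln 2
```

In the toy example the Dice term is 0.5 and the weighted BCE term is $3\ln 2 \approx 2.08$, which shows how the foreground weight of 5 makes missed liver pixels cost more than false alarms.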

Training configuration

Training is fully config-driven with no hardcoded hyperparameters in the training code:

{
  "loss": "dice_bce",
  "dice_weight": 0.8,
  "bce_pos_weight": 5.0,
  "learning_rate": 0.0005,
  "batch_size": 8,
  "epochs": 100,
  "hu_center": 50,
  "hu_width": 250,
  "early_stopping_patience": 50
}

The optimiser is Adam with learning rate $\eta = 5 \times 10^{-4}$. Early stopping monitors validation Dice with a patience of 50 epochs, saving the best checkpoint and restoring it at the end of training. A fixed random seed is set at the start of each run for reproducibility.
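The early-stopping rule can be sketched in plain Python. The class name `EarlyStopping`, its `step` method, and the `min_delta` parameter are assumptions made for illustration; the checkpoint save and restore are indicated only as comments, since the article does not show how the pipeline stores weights.

```python
class EarlyStopping:
    """Track the best validation Dice and signal a stop after `patience` stale epochs.

    Hypothetical helper; higher scores are treated as better.
    """

    def __init__(self, patience=50, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_score = float("-inf")
        self.best_epoch = -1
        self.epochs_without_improvement = 0

    def step(self, epoch, val_dice):
        """Return (improved, should_stop) for this epoch's validation Dice."""
        if val_dice > self.best_score + self.min_delta:
            self.best_score = val_dice
            self.best_epoch = epoch
            self.epochs_without_improvement = 0
            return True, False
        self.epochs_without_improvement += 1
        return False, self.epochs_without_improvement >= self.patience


# Toy run with patience 3 instead of the configured 50, on made-up scores
stopper = EarlyStopping(patience=3)
toy_scores = [0.80, 0.85, 0.84, 0.86, 0.85, 0.85, 0.84, 0.90]
stopped_at = None
for epoch, score in enumerate(toy_scores):
    improved, should_stop = stopper.step(epoch, score)
    # if improved: save the model weights as the best checkpoint here
    if should_stop:
        stopped_at = epoch
        break
# after the loop: reload the best checkpoint before evaluation
print(stopped_at, stopper.best_epoch, stopper.best_score)  # 6 3 0.86
```

Note that with `epochs` set to 100 and a patience of 50, this rule can only end a run early if the best validation Dice was reached before epoch 50.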

Evaluation metrics

We track a broader suite of metrics than is typical in general computer vision, reflecting the clinical context. Let TP, FP, FN, TN denote true/false positives/negatives at the pixel level.

Dice coefficient (primary metric):

$$\text{Dice} = \frac{2\,\text{TP}}{2\,\text{TP} + \text{FP} + \text{FN}}$$

Intersection over Union:

$$\text{IoU} = \frac{\text{TP}}{\text{TP} + \text{FP} + \text{FN}}$$

Precision and recall:

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}, \qquad \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$

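Relative Volume Difference (RVD) is a signed measure of volumetric agreement between the predicted mask $\hat{V}$ and the ground truth mask $V$; positive values indicate over-segmentation. It compares volumes only, so false positives and false negatives cancel each other out, and a prediction can score an RVD near zero while overlapping poorly. RVD is therefore read alongside the overlap metrics: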

$$\text{RVD} = \frac{|\hat{V}| - |V|}{|V|}$$
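The overlap metrics and RVD can be computed from a pair of binary masks as sketched below. The function name `segmentation_metrics` is an assumption, and returning NaN for an undefined ratio (for example an empty ground truth) is one possible convention rather than the pipeline's documented behaviour.

```python
import numpy as np


def segmentation_metrics(pred, gt):
    """Pixel-level Dice, IoU, precision, recall and RVD for two binary masks.

    Hypothetical helper; undefined ratios are returned as NaN (assumed convention).
    """
    pred = np.asarray(pred).astype(bool)
    gt = np.asarray(gt).astype(bool)
    tp = int(np.sum(pred & gt))
    fp = int(np.sum(pred & ~gt))
    fn = int(np.sum(~pred & gt))

    def ratio(num, den):
        return num / den if den > 0 else float("nan")

    return {
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "iou": ratio(tp, tp + fp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        # predicted volume is TP + FP, true volume is TP + FN
        "rvd": ratio((tp + fp) - (tp + fn), tp + fn),
    }


# A 4x4 ground-truth square and the same square shifted right by one pixel
gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 2:6] = True
pred = np.zeros((8, 8), dtype=bool)
pred[2:6, 3:7] = True
print(segmentation_metrics(pred, gt))
# dice 0.75, iou 0.6, precision 0.75, recall 0.75, rvd 0.0
```

The shifted square has exactly the right volume, so RVD is 0 even though a quarter of the liver is missed: four false positives compensate for four false negatives.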

Average Surface Distance (ASD) and 95th percentile Hausdorff Distance (HD95) measure boundary accuracy — important for surgical planning applications where the liver surface location matters, not just the interior overlap.

$$\text{HD95} = \max\!\left(\text{perc}_{95}(d(S_{\hat{p}}, S_g)),\ \text{perc}_{95}(d(S_g, S_{\hat{p}}))\right)$$

where $d(A, B)$ is the set of distances from each point on surface $A$ to the nearest point on surface $B$, and $\text{perc}_{95}$ is the 95th percentile. HD95 is used instead of the maximum Hausdorff distance because it is robust to outliers caused by isolated segmentation errors at image borders.
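A brute-force NumPy sketch of both surface metrics on 2D masks follows. Several choices here are assumptions, since the article does not specify them: surfaces are taken as foreground pixels with at least one 4-connected background neighbour, ASD is the symmetric mean over both directions, distances are in pixels unless an isotropic `spacing` is supplied, and empty masks return NaN. The pairwise distance matrix is fine for single slices but would need a KD-tree or distance transform for large 3D surfaces.

```python
import numpy as np


def surface_points(mask):
    """Coordinates of foreground pixels that touch background (4-connectivity)."""
    mask = np.asarray(mask).astype(bool)
    padded = np.pad(mask, 1, constant_values=False)  # image border counts as background
    interior = (
        padded[:-2, 1:-1] & padded[2:, 1:-1]      # up and down neighbours
        & padded[1:-1, :-2] & padded[1:-1, 2:]    # left and right neighbours
    )
    return np.argwhere(mask & ~interior)


def directed_distances(a_pts, b_pts):
    """For each point in a_pts, the Euclidean distance to its nearest point in b_pts."""
    diff = a_pts[:, None, :].astype(float) - b_pts[None, :, :].astype(float)
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)


def hd95_and_asd(pred, gt, spacing=1.0):
    """Return (HD95, ASD); hypothetical helper, NaN if either surface is empty."""
    sp, sg = surface_points(pred), surface_points(gt)
    if len(sp) == 0 or len(sg) == 0:
        return float("nan"), float("nan")
    d_pg = directed_distances(sp, sg) * spacing
    d_gp = directed_distances(sg, sp) * spacing
    hd95 = max(np.percentile(d_pg, 95), np.percentile(d_gp, 95))
    asd = (d_pg.sum() + d_gp.sum()) / (len(d_pg) + len(d_gp))
    return float(hd95), float(asd)


# A 4x4 square and the same square shifted right by two pixels
gt = np.zeros((12, 12), dtype=bool)
gt[2:6, 2:6] = True
pred = np.zeros((12, 12), dtype=bool)
pred[2:6, 4:8] = True
print(hd95_and_asd(pred, gt))  # HD95 = 2.0, ASD = 1.0
```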

Results

On a held-out validation case (image 017.png):

| Metric | Value |
|---|---|
| Dice | 0.9345 |
| IoU | 0.877 |
| Precision | 0.935 |
| Recall | 0.934 |
| Accuracy | 0.995 |

The Dice of 0.9345 and IoU of 0.877 indicate strong spatial overlap. Precision and recall are well-balanced at 0.935 and 0.934 respectively, so the model is neither over-segmenting nor missing substantial liver volume. At 512×512 resolution, this corresponds to 9,343 true positive pixels, 649 false positives, and 661 false negatives out of 262,144 total pixels. The accuracy of 0.995 is dominated by the 251,491 true negative background pixels and is therefore far less informative than the overlap metrics.
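As a worked check, the table values can be reproduced from the pixel counts quoted above using the metric definitions from the previous section. The accuracy formula (TP + TN) / N and the RVD value are derived here from those counts; they are not separately reported figures.

```python
# Pixel counts quoted for the validation case at 512x512 resolution
tp, fp, fn = 9343, 649, 661
total = 512 * 512
tn = total - tp - fp - fn          # remaining pixels are true negatives

dice = 2 * tp / (2 * tp + fp + fn)
iou = tp / (tp + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / total
rvd = ((tp + fp) - (tp + fn)) / (tp + fn)   # derived from the counts, not reported
liver_fraction = (tp + fn) / total          # share of the slice that is liver

print(round(dice, 4), round(iou, 3), round(precision, 3), round(recall, 3), round(accuracy, 3))
# 0.9345 0.877 0.935 0.934 0.995
print(tn, round(rvd, 4), round(liver_fraction, 3))
# 251491 -0.0012 0.038
```

The liver covers about 3.8% of this slice, which is why accuracy sits at 0.995 even though more than 1,300 pixels are misclassified.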

Results at this level are consistent with published U-Net baselines on the LiTS (Liver Tumour Segmentation) benchmark, where top-performing U-Net variants typically achieve Dice in the 0.92–0.96 range on liver (not tumour) segmentation.

Inference output structure

The inference pipeline produces a structured output directory:

inference_results/
├── masks/          ← predicted binary masks + ground truth if available
├── overlays/       ← prediction overlaid on original CT slice
├── overlays_with_gt/  ← green = prediction, red = ground truth
├── error_maps/     ← yellow = TP, red = FP, blue = FN
└── metrics.csv     ← per-image metric table

The error map colour encoding (yellow/red/blue for TP/FP/FN) is directly interpretable by a clinician reviewing results: yellow regions are correctly segmented liver, red regions are false positives (predicted liver where there is none), blue regions are missed liver tissue.
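The colour encoding can be sketched as a small NumPy function. The function name `error_map`, the exact RGB triplets, and the choice to leave true negatives black are assumptions for illustration; the article specifies only the yellow/red/blue assignment.

```python
import numpy as np

# Assumed RGB values for the yellow / red / blue encoding
COLOURS = {"tp": (255, 255, 0), "fp": (255, 0, 0), "fn": (0, 0, 255)}


def error_map(pred, gt):
    """Return an H x W x 3 uint8 image colouring TP, FP and FN pixels.

    Hypothetical helper; true negatives are left black (assumption).
    """
    pred = np.asarray(pred).astype(bool)
    gt = np.asarray(gt).astype(bool)
    rgb = np.zeros(pred.shape + (3,), dtype=np.uint8)
    rgb[pred & gt] = COLOURS["tp"]    # correctly segmented liver
    rgb[pred & ~gt] = COLOURS["fp"]   # predicted liver where there is none
    rgb[~pred & gt] = COLOURS["fn"]   # missed liver tissue
    return rgb


pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [1, 0]])
print(error_map(pred, gt))
# top-left yellow (TP), top-right red (FP), bottom-left blue (FN), bottom-right black (TN)
```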


Ronneberger, O., Fischer, P. & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI 2015.
Bilic, P. et al. (2023). The Liver Tumor Segmentation Benchmark (LiTS). Medical Image Analysis.