Liver Segmentation from CT Scans: U-Net with Combined Dice–BCE Loss and HU Windowing
Our modular PyTorch pipeline for segmenting liver regions from CT imagery — covering the U-Net architecture, Hounsfield unit windowing, combined loss formulation, and the full suite of segmentation metrics.
Medical image segmentation differs from natural image segmentation in ways that matter architecturally and in training. The foreground class is typically a small fraction of the image (a liver occupies perhaps 5–15% of a CT slice’s pixels), the input modality has specific physical meaning that must be respected in preprocessing, and the evaluation metrics used clinically are not the same as those used in general computer vision benchmarks. This article documents the approach we took to liver segmentation from CT slices: the preprocessing choices, the model architecture, the loss function formulation, and the full metric suite we track.
Hounsfield unit windowing
CT pixel values are Hounsfield units (HU), a quantitative scale where water is 0 HU, air is approximately −1000 HU, and dense bone is approximately +1000 HU. Liver parenchyma sits in the range of roughly 40–60 HU. Feeding the full HU range to the network is counterproductive — the liver’s signal occupies a narrow band and the contrast between liver and surrounding structures is maximised by windowing to that band before normalisation.
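We apply a window centred at $c = 50$ HU with width $w = 250$ HU, clipping values outside $[c - w/2,\; c + w/2] = [-75,\; 175]$ HU, then normalising to $[0, 1]$:
$$x_{\text{norm}} = \frac{\operatorname{clip}(x,\; c - w/2,\; c + w/2) - (c - w/2)}{w}$$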
This is equivalent to setting the window level and width on a clinical PACS workstation to values a radiologist would use when reviewing liver cases. The parameters are configurable in the training config rather than hardcoded, since optimal windowing varies by task and scanner protocol.
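The windowing step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the pipeline's actual code: the function name `window_hu` is hypothetical, the defaults are taken from the `hu_center` and `hu_width` entries of the training config, and rescaling to [0, 1] is an assumption.

```python
import numpy as np


def window_hu(hu: np.ndarray, center: float = 50.0, width: float = 250.0) -> np.ndarray:
    """Clip a CT slice given in Hounsfield units to a window and rescale it.

    Assumption: the clipped values are rescaled linearly to [0, 1].
    """
    if width <= 0:
        raise ValueError("width must be positive")
    lo = center - width / 2.0  # lower window bound, -75 HU for the defaults
    hi = center + width / 2.0  # upper window bound, 175 HU for the defaults
    clipped = np.clip(hu.astype(np.float32), lo, hi)
    return (clipped - lo) / (hi - lo)


# Example values: air, lower bound, water, mid-window, upper bound, dense bone
sample = np.array([-1000.0, -75.0, 0.0, 50.0, 175.0, 1000.0])
windowed = window_hu(sample)
```

Everything below the window collapses to 0 and everything above it to 1, so air and dense bone no longer consume dynamic range that the network needs for soft tissue.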
U-Net architecture
The model follows the encoder–decoder structure of U-Net (Ronneberger et al., 2015), with skip connections that concatenate encoder feature maps to the upsampled decoder feature maps at each resolution level. This allows the decoder to recover spatial detail lost during downsampling, which is essential for segmentation tasks where mask boundaries matter.
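For an input $x \in \mathbb{R}^{H \times W}$, the encoder produces feature maps $e_l$ at resolutions $\frac{H}{2^l} \times \frac{W}{2^l}$ for $l = 0, \dots, 4$. The decoder upsamples back through the same levels. At each level $l$, the decoder input is the concatenation of the upsampled features from the previous (coarser) level and the corresponding encoder feature map:
$$d_l = \operatorname{Conv}\big(\big[\operatorname{Up}(d_{l+1});\; e_l\big]\big), \qquad d_4 = e_4$$
where $[\cdot\,;\cdot]$ denotes channel-wise concatenation. The output is a single-channel logit map $z$ of the same spatial dimension as the input, passed through sigmoid during inference to obtain the probability mask $p = \sigma(z)$.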
The channel depths follow a standard doubling schedule: 64, 128, 256, 512, 1024. The implementation is modular — channel depths are a configurable parameter, so a shallower or narrower network can be instantiated from the same codebase for resource-constrained deployment.
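The encoder, skip connections and configurable channel depths can be sketched as follows. This is an illustrative PyTorch module under stated assumptions, not the pipeline's implementation: the class name `UNetSketch`, the use of BatchNorm, padded 3×3 convolutions, max pooling and transposed-convolution upsampling are all choices made for this example.

```python
import torch
import torch.nn as nn


def double_conv(in_ch: int, out_ch: int) -> nn.Sequential:
    """Two 3x3 conv + BatchNorm + ReLU layers; padding keeps the spatial size."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class UNetSketch(nn.Module):
    """Hypothetical U-Net with configurable channel depths (one entry per level)."""

    def __init__(self, in_channels: int = 1, channels=(64, 128, 256, 512, 1024)):
        super().__init__()
        self.encoders = nn.ModuleList()
        prev = in_channels
        for ch in channels:
            self.encoders.append(double_conv(prev, ch))
            prev = ch
        self.pool = nn.MaxPool2d(2)
        self.upsamples = nn.ModuleList()
        self.decoders = nn.ModuleList()
        # Walk back up from the deepest level: (1024 -> 512), (512 -> 256), ...
        for deep, shallow in zip(reversed(channels[1:]), reversed(channels[:-1])):
            self.upsamples.append(nn.ConvTranspose2d(deep, shallow, kernel_size=2, stride=2))
            # After concatenation with the skip, the decoder sees 2 * shallow channels.
            self.decoders.append(double_conv(2 * shallow, shallow))
        self.head = nn.Conv2d(channels[0], 1, kernel_size=1)  # single-channel logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skips = []
        for i, encoder in enumerate(self.encoders):
            x = encoder(x)
            if i < len(self.encoders) - 1:  # the deepest level has no skip or pooling
                skips.append(x)
                x = self.pool(x)
        for up, decoder, skip in zip(self.upsamples, self.decoders, reversed(skips)):
            x = up(x)
            x = torch.cat([x, skip], dim=1)  # channel-wise concatenation
            x = decoder(x)
        return self.head(x)


# A narrower variant instantiated from the same class (depths chosen for the demo).
model = UNetSketch(channels=(8, 16, 32, 64, 128))
model.eval()
with torch.no_grad():
    logits = model(torch.zeros(1, 1, 64, 64))
    probs = torch.sigmoid(logits)
```

With four pooling steps, this sketch expects input height and width to be divisible by 16, which holds for 512×512 slices.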
Loss function
Class imbalance is the central training challenge in medical segmentation. In a 512×512 CT slice with a liver occupying 10% of pixels, there are roughly nine background pixels for every foreground pixel. Unweighted binary cross-entropy averages over all pixels, so background dominates the loss and a model can score well while under-segmenting the liver. We address this with a combined Dice loss and weighted BCE loss.
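Dice loss directly optimises a soft version of the Dice coefficient (the metric we care about) and is far less sensitive to class imbalance than a pixel-averaged loss, because it operates on the ratio of overlap to the total predicted and true foreground:
$$\mathcal{L}_{\text{Dice}} = 1 - \frac{2\sum_i p_i y_i + \epsilon}{\sum_i p_i + \sum_i y_i + \epsilon}$$
where $p_i$ is the predicted probability at pixel $i$, $y_i \in \{0, 1\}$ is the ground truth, and $\epsilon$ prevents division by zero on empty masks.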
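Binary cross-entropy with positive-class weighting applies a higher weight to foreground pixels to counteract the imbalance directly. Using raw logits (before sigmoid) with BCEWithLogitsLoss is numerically more stable than applying sigmoid first:
$$\mathcal{L}_{\text{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\Big[w_{+}\, y_i \log \sigma(z_i) + (1 - y_i)\log\big(1 - \sigma(z_i)\big)\Big]$$
where $z_i$ is the raw logit at pixel $i$, $y_i$ is the ground truth, $N$ is the number of pixels, and $w_{+}$ is the positive class weight (bce_pos_weight in the config).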
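The combined loss is a weighted sum:
$$\mathcal{L} = \lambda_{\text{Dice}}\,\mathcal{L}_{\text{Dice}} + \lambda_{\text{BCE}}\,\mathcal{L}_{\text{BCE}}$$
with $\lambda_{\text{Dice}} = 0.8$ in our default configuration, weighting Dice more heavily since it directly reflects our evaluation objective. Both weights are configurable.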
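A combined loss of this kind might look like the following in PyTorch. It is a sketch under assumptions, not the project's code: the class name `DiceBCELoss` is hypothetical, the BCE weight of 0.2 is assumed (the config only lists `dice_weight`), and Dice is computed per sample and then averaged over the batch.

```python
import torch
import torch.nn as nn


class DiceBCELoss(nn.Module):
    """Weighted sum of soft Dice loss and positively weighted BCE on raw logits."""

    def __init__(self, dice_weight: float = 0.8, bce_weight: float = 0.2,
                 pos_weight: float = 5.0, eps: float = 1e-6):
        super().__init__()
        self.dice_weight = dice_weight
        self.bce_weight = bce_weight  # assumption: not given in the source config
        self.eps = eps
        # pos_weight scales the loss term of foreground (y = 1) pixels.
        self.bce = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([pos_weight]))

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        probs = torch.sigmoid(logits)
        dims = (1, 2, 3)  # sum over channel, height, width for each sample
        intersection = (probs * target).sum(dim=dims)
        denominator = probs.sum(dim=dims) + target.sum(dim=dims)
        dice = (2.0 * intersection + self.eps) / (denominator + self.eps)
        dice_loss = 1.0 - dice.mean()
        bce_loss = self.bce(logits, target)  # BCE is applied to logits, not probabilities
        return self.dice_weight * dice_loss + self.bce_weight * bce_loss


# Toy batch: a 4x4 "liver" square inside an 8x8 slice.
target = torch.zeros(2, 1, 8, 8)
target[:, :, 2:6, 2:6] = 1.0
confident_logits = (2.0 * target - 1.0) * 10.0  # +10 on liver, -10 elsewhere

loss_fn = DiceBCELoss()
good_loss = loss_fn(confident_logits, target)
bad_loss = loss_fn(-confident_logits, target)  # every pixel predicted wrongly
```

Keeping the sigmoid inside the Dice branch only means the BCE branch still benefits from the log-sum-exp formulation used by `BCEWithLogitsLoss`.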
Training configuration
Training is fully config-driven with no hardcoded hyperparameters in the training code:
{
  "loss": "dice_bce",
  "dice_weight": 0.8,
  "bce_pos_weight": 5.0,
  "learning_rate": 0.0005,
  "batch_size": 8,
  "epochs": 100,
  "hu_center": 50,
  "hu_width": 250,
  "early_stopping_patience": 50
}
The optimiser is Adam with a learning rate of $5 \times 10^{-4}$. Early stopping monitors validation Dice with a patience of 50 epochs, saving the best checkpoint and restoring it at the end of training. A fixed random seed is set at the start of each run for reproducibility.
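To illustrate what config-driven training and patience-based early stopping can look like, here is a small standard-library sketch. The names `TrainConfig` and `EarlyStopping` are hypothetical, and the validation Dice history is made up purely to exercise the logic; checkpoint saving is left to the caller.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Typed view of the JSON config; field names mirror the keys shown above."""
    loss: str
    dice_weight: float
    bce_pos_weight: float
    learning_rate: float
    batch_size: int
    epochs: int
    hu_center: float
    hu_width: float
    early_stopping_patience: int


class EarlyStopping:
    """Track the best validation Dice and signal when patience runs out."""

    def __init__(self, patience: int):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = -1
        self.epochs_without_improvement = 0

    def step(self, epoch: int, val_dice: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if val_dice > self.best_score:
            self.best_score = val_dice
            self.best_epoch = epoch  # the caller would save a checkpoint here
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


CONFIG_JSON = """
{
  "loss": "dice_bce", "dice_weight": 0.8, "bce_pos_weight": 5.0,
  "learning_rate": 0.0005, "batch_size": 8, "epochs": 100,
  "hu_center": 50, "hu_width": 250, "early_stopping_patience": 50
}
"""
config = TrainConfig(**json.loads(CONFIG_JSON))

# Demo with a short patience and an invented validation Dice history.
stopper = EarlyStopping(patience=3)
history = [0.80, 0.85, 0.84, 0.83, 0.82, 0.90]
stopped_at = None
for epoch, val_dice in enumerate(history):
    if stopper.step(epoch, val_dice):
        stopped_at = epoch
        break
```

Unpacking the JSON into a dataclass means a missing or misspelled key fails at startup rather than partway through a run.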
Evaluation metrics
We track a broader suite of metrics than is typical in general computer vision, reflecting the clinical context. Let TP, FP, FN, TN denote true/false positives/negatives at the pixel level.
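Dice coefficient (primary metric):
$$\text{Dice} = \frac{2\,TP}{2\,TP + FP + FN}$$
Intersection over Union:
$$\text{IoU} = \frac{TP}{TP + FP + FN}$$
Precision and recall:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$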
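Relative Volume Difference (RVD) measures volumetric agreement between the predicted mask $P$ and the ground truth mask $G$. Because it compares only total volumes, it does not penalise compensating errors: false positives and false negatives cancel.
$$\text{RVD} = \frac{|P| - |G|}{|G|} = \frac{FP - FN}{TP + FN}$$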
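The pixel-level metrics above reduce to a handful of counts, as the following NumPy sketch shows. The helper names are hypothetical, the signed form of RVD (positive for over-segmentation) is an assumption, and undefined ratios are returned as NaN by choice.

```python
import numpy as np


def confusion_counts(pred: np.ndarray, gt: np.ndarray):
    """Return (TP, FP, FN, TN) for two binary masks of the same shape."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = int(np.sum(pred & gt))
    fp = int(np.sum(pred & ~gt))
    fn = int(np.sum(~pred & gt))
    tn = int(np.sum(~pred & ~gt))
    return tp, fp, fn, tn


def overlap_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Dice, IoU, precision, recall and signed RVD from pixel counts."""
    def safe_div(num: float, den: float) -> float:
        return num / den if den else float("nan")  # e.g. empty ground truth

    return {
        "dice": safe_div(2 * tp, 2 * tp + fp + fn),
        "iou": safe_div(tp, tp + fp + fn),
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
        # |P| - |G| = (TP + FP) - (TP + FN) = FP - FN; assumed signed here
        "rvd": safe_div(fp - fn, tp + fn),
    }


# Toy example: a 2x2 ground truth square, over-segmented by one column.
gt = np.zeros((4, 4), dtype=np.uint8)
gt[1:3, 1:3] = 1
pred = np.zeros((4, 4), dtype=np.uint8)
pred[1:3, 1:4] = 1

counts = confusion_counts(pred, gt)
metrics = overlap_metrics(*counts)
```

In the toy case the prediction covers all four true pixels plus two extra ones, so recall is 1 while precision, Dice and RVD all register the over-segmentation.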
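Average Surface Distance (ASD) and 95th percentile Hausdorff Distance (HD95) measure boundary accuracy, which is important for surgical planning applications where the liver surface location matters, not just the interior overlap.
$$\text{ASD} = \frac{1}{|S_P| + |S_G|}\Big(\sum_{d \in D(S_P, S_G)} d + \sum_{d \in D(S_G, S_P)} d\Big)$$
$$\text{HD95} = \max\big(P_{95}(D(S_P, S_G)),\; P_{95}(D(S_G, S_P))\big)$$
where $S_P$ and $S_G$ are the predicted and ground truth surfaces, $D(A, B)$ is the set of distances from each point on surface $A$ to the nearest point on surface $B$, and $P_{95}$ is the 95th percentile. HD95 is used instead of the maximum Hausdorff distance because it is robust to outliers caused by isolated segmentation errors at image borders.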
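A brute-force version of the two surface metrics is sketched below for 2D masks. Several details are assumptions made for the example: the surface is defined as foreground pixels with at least one 4-connected background neighbour, both metrics are taken symmetrically over the two directions, and `spacing` is a hypothetical parameter for physical pixel size. Distance-transform implementations scale better on large volumes; this version favours readability.

```python
import numpy as np


def surface_points(mask: np.ndarray) -> np.ndarray:
    """(row, col) coordinates of foreground pixels with a background 4-neighbour."""
    m = mask.astype(bool)
    padded = np.pad(m, 1, constant_values=False)  # image border counts as background
    interior = (
        padded[:-2, 1:-1] & padded[2:, 1:-1] & padded[1:-1, :-2] & padded[1:-1, 2:]
    )
    return np.argwhere(m & ~interior)


def directed_distances(a: np.ndarray, b: np.ndarray, spacing=(1.0, 1.0)) -> np.ndarray:
    """Distance from every point in `a` to its nearest point in `b`."""
    scale = np.asarray(spacing, dtype=float)
    diff = (a[:, None, :] - b[None, :, :]) * scale
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)


def asd_and_hd95(pred: np.ndarray, gt: np.ndarray, spacing=(1.0, 1.0)):
    """Symmetric average surface distance and 95th percentile Hausdorff distance."""
    s_pred, s_gt = surface_points(pred), surface_points(gt)
    if len(s_pred) == 0 or len(s_gt) == 0:
        return float("nan"), float("nan")  # undefined when either mask is empty
    d_pg = directed_distances(s_pred, s_gt, spacing)
    d_gp = directed_distances(s_gt, s_pred, spacing)
    asd = (d_pg.sum() + d_gp.sum()) / (len(d_pg) + len(d_gp))
    hd95 = max(np.percentile(d_pg, 95), np.percentile(d_gp, 95))
    return float(asd), float(hd95)


# Toy example: a 5x5 square and the same square shifted right by one pixel.
gt = np.zeros((10, 10), dtype=np.uint8)
gt[2:7, 2:7] = 1
pred = np.zeros((10, 10), dtype=np.uint8)
pred[2:7, 3:8] = 1

asd_same, hd95_same = asd_and_hd95(gt, gt)
asd_shift, hd95_shift = asd_and_hd95(pred, gt)
```

For the shifted square, half of the boundary pixels of each mask coincide with the other boundary and the other half lie one pixel away, so the average distance is half a pixel while HD95 is one pixel.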
Results
On a held-out validation case (image 017.png):
| Metric | Value |
|---|---|
| Dice | 0.9345 |
| IoU | 0.877 |
| Precision | 0.935 |
| Recall | 0.934 |
| Accuracy | 0.995 |
The Dice of 0.9345 and IoU of 0.877 indicate strong spatial overlap. Precision and recall are well-balanced at 0.935 and 0.934 respectively — the model is neither over-segmenting nor missing substantial liver volume. At 512×512 resolution, this corresponds to 9,343 true positive pixels, 649 false positives, and 661 false negatives out of roughly 262,000 total pixels.
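As a consistency check, the table values can be re-derived from the pixel counts quoted above. The only derived quantity is the true negative count, taking the total as 512 × 512 = 262,144 pixels, so TN = 262,144 − 9,343 − 649 − 661 = 251,491. Accuracy is taken here in its usual sense of (TP + TN) divided by the total pixel count.

```latex
\begin{align*}
\text{Dice}      &= \frac{2 \cdot 9343}{2 \cdot 9343 + 649 + 661} = \frac{18686}{19996} \approx 0.9345 \\
\text{IoU}       &= \frac{9343}{9343 + 649 + 661} = \frac{9343}{10653} \approx 0.877 \\
\text{Precision} &= \frac{9343}{9343 + 649} = \frac{9343}{9992} \approx 0.935 \\
\text{Recall}    &= \frac{9343}{9343 + 661} = \frac{9343}{10004} \approx 0.934 \\
\text{Accuracy}  &= \frac{9343 + 251491}{262144} = \frac{260834}{262144} \approx 0.995
\end{align*}
```

The high accuracy is driven almost entirely by the 251,491 background pixels, which is why Dice rather than accuracy serves as the primary metric.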
Results at this level are consistent with published U-Net baselines on the LiTS (Liver Tumour Segmentation) benchmark, where top-performing U-Net variants typically achieve Dice in the 0.92–0.96 range on liver (not tumour) segmentation.
Inference output structure
The inference pipeline produces a structured output directory:
inference_results/
├── masks/ ← predicted binary masks + ground truth if available
├── overlays/ ← prediction overlaid on original CT slice
├── overlays_with_gt/ ← green = prediction, red = ground truth
├── error_maps/ ← yellow = TP, red = FP, blue = FN
└── metrics.csv ← per-image metric table
The error map colour encoding (yellow/red/blue for TP/FP/FN) is directly interpretable by a clinician reviewing results: yellow regions are correctly segmented liver, red regions are false positives (predicted liver where there is none), blue regions are missed liver tissue.
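The colour encoding can be produced directly from the two binary masks. The sketch below is illustrative only: the function name `error_map` is hypothetical, and leaving true negatives black is an assumption (a real pipeline might blend the colours over the CT slice instead).

```python
import numpy as np

TP_COLOUR = (255, 255, 0)  # yellow: correctly segmented liver
FP_COLOUR = (255, 0, 0)    # red: predicted liver where there is none
FN_COLOUR = (0, 0, 255)    # blue: missed liver tissue


def error_map(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Return an (H, W, 3) uint8 RGB image colouring TP, FP and FN pixels."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    rgb = np.zeros(pred.shape + (3,), dtype=np.uint8)  # assumption: TN stays black
    rgb[pred & gt] = TP_COLOUR
    rgb[pred & ~gt] = FP_COLOUR
    rgb[~pred & gt] = FN_COLOUR
    return rgb


# 2x2 toy masks covering all four cases: TP, FN on the top row; FP, TN below.
gt = np.array([[1, 1], [0, 0]], dtype=np.uint8)
pred = np.array([[1, 0], [1, 0]], dtype=np.uint8)
rgb = error_map(pred, gt)
```

The three boolean masks are mutually exclusive, so the order of the assignments does not affect the result.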
References
Ronneberger, O., Fischer, P. & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI 2015.
Bilic, P. et al. (2023). The Liver Tumor Segmentation Benchmark (LiTS). Medical Image Analysis.