// Medical Computer Vision

Trained a colorectal-cancer detection + segmentation model on full-resolution abdominal CT

We trained a colorectal-cancer detection + segmentation model - lesion-level localization plus voxel-level segmentation on full-resolution abdominal CT - for a seed-stage medical imaging startup. The GPUs on their 8×A100 box were idling below 30% behind a NIfTI dataloader and CPU-side 3D augmentation, so we rebuilt the data path to be GPU-native: Zarr-chunked volumes, cuCIM preprocessing on-device, Kornia augmentation in the same GPU memory. Single-epoch wallclock collapsed from ~9h to ~50m, GPU utilization went from sub-30% to >90%, and the iteration loop finally caught up with the radiologists' labeling cadence.

offices: Europe

size: Seed-stage

industry: Medical imaging AI - anonymized

revenue: -

// Outcomes

The numbers that matter

  • 9h → ~50m

    single-epoch wallclock

  • >90%

    GPU utilization (was <30%)

  • On-prem

    no PHI egress, hospital VLAN

01 · GPUs idle at sub-30% while the data loader does all the work

The Challenge

The customer is an early-stage medical imaging startup. The model they needed to ship was a single-head detection + segmentation network on full-resolution abdominal CT - lesion-level localization (colon polyps, suspicious wall thickening, primary-tumor candidates) and voxel-level segmentation of the same lesions, so a downstream radiologist gets both a flag and a contour they can review in the PACS viewer. The clinical promise is real: CT colonography and contrast-enhanced abdominal CT are routine, and a second-reader AI that surfaces missed lesions with a localized mask has a credible path to clinical adoption. The problem was that the team's training loop wasn't keeping up with their own labeling throughput.

A single annotated CT volume in their cohort is ~512×512×800 voxels at sub-millimeter spacing - typically 0.5–2 GB on disk per study, with multiple series (arterial, portal-venous, delayed) per patient. Their training pipeline read NIfTI files end-to-end on every epoch, decompressed on the CPU, ran 3D augmentations (random crops, elastic deformations, intensity jitter, rotations) on the CPU, and then pushed the result to the GPU. The result on their 8×A100 box was textbook starvation: GPU utilization hovering around 25%, single-epoch wallclock around 9 hours on the curated training set, and an iteration loop where a single hyperparameter sweep took most of a week.

Two non-negotiables shaped the rebuild. PHI couldn't leave the hospital VLAN - DICOM ingest, training data, and model artifacts all had to live inside the customer-controlled environment, with no cloud step in the loop. And the rebuild had to land on the same 8×A100 box the customer already owned; raising more capital to buy more GPUs because the data loader was inefficient wasn't a plan their seed round could afford.

02 · Move the entire data path to the GPU. Stop re-reading volumes from disk.

Approach

Step 1: Zarr-backed chunked storage instead of NIfTI re-reads.

We converted the curated cohort once, end-to-end, into a Zarr store with 3D chunks sized to the patch size the model trains on (96×96×96 by default, configurable per experiment). Compression with Blosc/Zstd kept the on-disk footprint comparable to NIfTI while making random patch reads constant-time and parallel-safe. The Zarr layout sits on the customer's NVMe-backed shared storage; the dataloader maps chunks directly into pinned host memory and DMAs them into GPU memory without ever rehydrating a full volume.
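A minimal sketch of that layout, assuming the zarr v2 / numcodecs API - paths, array keys, and compression settings here are illustrative, not the customer's actual build script:

```python
import nibabel as nib
import numpy as np
import zarr
from numcodecs import Blosc

PATCH = (96, 96, 96)  # chunk size matches the default training patch size

def nifti_to_zarr(nifti_path: str, store_path: str, key: str) -> None:
    """One-time conversion of a CT series into a Blosc/Zstd-compressed, patch-chunked Zarr array."""
    volume = nib.load(nifti_path).get_fdata(dtype=np.float32)
    root = zarr.open_group(store_path, mode="a")
    root.create_dataset(
        key,
        data=volume,
        chunks=PATCH,
        compressor=Blosc(cname="zstd", clevel=3, shuffle=Blosc.BITSHUFFLE),
        overwrite=True,
    )

def read_random_patch(store_path: str, key: str, rng: np.random.Generator) -> np.ndarray:
    """Read one training patch; the cost is a handful of chunk reads, independent of volume size."""
    arr = zarr.open_group(store_path, mode="r")[key]
    origin = [int(rng.integers(0, dim - size + 1)) for dim, size in zip(arr.shape, PATCH)]
    window = tuple(slice(o, o + size) for o, size in zip(origin, PATCH))
    return arr[window]  # only the touched chunks are decompressed
```

Sizing the chunks to the training patch is the point: a random patch read touches at most a few chunks instead of rehydrating a 0.5–2 GB series.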

Step 2: cuCIM for GPU-resident preprocessing.

cuCIM (NVIDIA's GPU-accelerated image-processing library) takes over the work that used to run on the CPU: HU windowing, resampling to isotropic spacing, body-mask computation, and per-patient intensity normalization - all on the device, in CuPy/Torch-compatible memory. Volumes that come off Zarr go straight into cuCIM kernels and stay on the GPU. No CPU bounce, no SimpleITK re-encode, no tensor copy back across PCIe.
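cuCIM operates on CuPy-compatible device memory, so a minimal sketch with CuPy/cupyx conveys the shape of the step; the window bounds, target spacing, and zoom-based resampling below are illustrative assumptions rather than the exact kernels used in the engagement:

```python
import cupy as cp
from cupyx.scipy import ndimage as cundi
import torch

def preprocess_on_gpu(volume_hu: cp.ndarray,
                      spacing_mm: tuple[float, float, float],
                      target_mm: float = 1.0,
                      window: tuple[float, float] = (-150.0, 250.0)) -> torch.Tensor:
    """HU window, resample to isotropic spacing, z-score normalize - all without leaving the device."""
    lo, hi = window
    vol = cp.clip(volume_hu.astype(cp.float32), lo, hi)   # soft-tissue HU window (illustrative bounds)
    factors = tuple(s / target_mm for s in spacing_mm)    # per-axis scale: source spacing / target spacing
    vol = cundi.zoom(vol, factors, order=1)               # linear resample to isotropic voxels, on the GPU
    vol = (vol - vol.mean()) / (vol.std() + 1e-6)         # per-patient intensity normalization
    return torch.from_dlpack(vol)                         # zero-copy hand-off to PyTorch (recent CuPy/PyTorch)
```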

Step 3: Kornia for 3D augmentation in the same GPU memory.

Augmentation is where most CPU pipelines collapse: 3D elastic deformations, random affine transforms, and intensity jitter on volumetric data are expensive enough that even a 32-core CPU can't feed 8 A100s. Kornia runs the same family of augmentations natively on the GPU - affine, elastic, intensity, noise, and the random patch sampling itself - operating directly on the cuCIM output tensor. Augmentation is part of the training step's compute graph, not a separate process boundary the GPUs wait on.
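A minimal sketch of that step with Kornia's 3D augmentation modules; the specific transforms and parameters, and the assumption that patches come off Zarr slightly oversized so the random crop has room, are illustrative choices rather than the production transform graph:

```python
import torch
import kornia.augmentation as K

# Assumes patches are read slightly oversized (e.g. 112³) so the 96³ random crop has freedom.
augment = torch.nn.Sequential(
    K.RandomAffine3D(degrees=15.0, translate=(0.05, 0.05, 0.05), p=0.8),  # random affine, on the GPU
    K.RandomCrop3D(size=(96, 96, 96)),                                    # random patch sampling
).cuda()

def augment_batch(patches: torch.Tensor) -> torch.Tensor:
    """patches: (B, C, D, H, W) float tensor already resident in GPU memory."""
    out = augment(patches)
    return out + 0.01 * torch.randn_like(out)  # simple intensity jitter, also on device
```

Because these modules are ordinary PyTorch ops, they sit inside the training step rather than behind a worker-pool boundary.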

Step 4: Eval-gated, deterministic, reproducible.

The augmentation pipeline is seeded per-batch and the Zarr chunking is content-addressed, so any training run is exactly reproducible from the run config plus the dataset hash. Validation runs without augmentation through the same code path - same cuCIM normalization, same patch geometry - so the eval numbers measure the model, not a second preprocessing pipeline drifting away from training. Training metrics, sweeps, and validation per-cohort breakdowns all land in a Weights & Biases workspace the radiologists and ML team share.
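A sketch of the reproducibility plumbing under those assumptions - the helper names, hashing scheme, and W&B project name are hypothetical, not the customer's exact implementation:

```python
import hashlib
import torch
import wandb

def batch_seed(run_seed: int, global_step: int) -> int:
    """Deterministic per-batch seed: the same run config and step replay the same augmentations."""
    digest = hashlib.sha256(f"{run_seed}:{global_step}".encode()).digest()
    return int.from_bytes(digest[:8], "little")

def dataset_fingerprint(chunk_manifest: list[str]) -> str:
    """Hash the sorted chunk manifest so the run log pins the exact dataset version."""
    h = hashlib.sha256()
    for entry in sorted(chunk_manifest):
        h.update(entry.encode())
    return h.hexdigest()

run = wandb.init(project="ct-colorectal-seg", config={"seed": 1234, "patch": [96, 96, 96]})
wandb.config.update({"dataset_hash": dataset_fingerprint(["chunk-000", "chunk-001"])})

for step in range(10):
    torch.manual_seed(batch_seed(run.config.seed, step))  # seeds the augmentation samplers for this batch
    # ... read patches off Zarr, preprocess, augment, forward/backward ...
    wandb.log({"train/step": step})
run.finish()
```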

03 · GPU-bound training. Iteration loop in hours, not days.

Result

  • Single-epoch wallclock dropped from ~9 hours to ~50 minutes on the same 8×A100 box - about a 10× training-throughput lift, no new hardware.
  • GPU utilization moved from <30% (CPU-bound dataloader) to >90% sustained (GPU-bound on the actual model).
  • Random 3D patch reads are constant-time off the Zarr store regardless of cohort size - the pipeline scales as the customer adds annotated studies, instead of slowing down.
  • Augmentation runs in the same GPU memory as preprocessing and the model - zero PCIe round-trips per batch, no CPU augmentation worker pool to tune.
  • The whole pipeline runs inside the hospital VLAN - DICOM ingest → Zarr build → training → eval, no PHI egress, no cloud dependency.
  • Training pipeline ships to the customer's repo with the Zarr build script, the cuCIM/Kornia transform graph, the W&B project, and a runbook the team can re-run as new annotated studies come in.

The win wasn't a new model architecture or a clever loss - those weren't the bottleneck. The bottleneck was a CPU-bound dataloader and a redundant disk path masquerading as a training problem. Once the data path lives on the GPU, the team's iteration cadence catches up to the rate the radiologists are producing labels, and the actual model work - the part the medical results depend on - finally gets the compute it needed all along.

// Expert insight

Medical imaging teams burn enormous amounts of GPU time waiting on a CPU-bound NIfTI dataloader and a 3D augmentation pipeline that never made it off the CPU. Zarr for storage, cuCIM for preprocessing, Kornia for augmentation - once the whole data path lives on the GPU, the iteration cadence finally matches the rate the radiologists are producing labels. That's the engagement.
Michał Pogoda-Rosikoń

Co-founder @ bards.ai

// Ready to ship?

Let's build something that delivers numbers like these.

Book a meeting