[TPAMI] Large-Scale 3D Medical Image Pre-training with Geometric Context Priors

Recently, the HKUST Smart Lab team completed a groundbreaking project on CT foundation models. Published in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), this work presents VoCo, a new foundation model for 3D medical images, together with a comprehensive evaluation benchmark.

Introduction

The scarcity of annotations poses a significant challenge in medical image analysis, as annotation demands extensive effort from radiologists, especially for high-dimensional 3D medical images. Large-scale pre-training has emerged as a promising label-efficient solution, owing to its use of large-scale data, large models, and advanced pre-training techniques. However, its development for medical images remains underexplored. The primary challenge lies in harnessing large-scale unlabeled data and learning high-level semantics without annotations. We observe that 3D medical images exhibit consistent geometric context, i.e., consistent geometric relations between different organs, which suggests a promising way to learn consistent representations. Motivated by this, we introduce a simple yet effective Volume Contrast (VoCo) framework that leverages geometric context priors for self-supervision. Given an input volume, we extract base crops from different regions to construct positive and negative pairs for contrastive learning. We then predict the contextual position of a random crop by contrasting its similarity to the base crops. In this way, VoCo implicitly encodes the inherent geometric context into model representations, facilitating high-level semantic learning without annotations. Extensive experiments highlight the superiority of VoCo, showcasing promising transferability to unseen modalities and datasets. VoCo notably enhances performance on datasets with limited labeled cases and significantly expedites fine-tuning convergence.

Method

The pivotal step is to generate position labels for self-supervision, for which we leverage the inherent geometric context priors in 3D medical images. Given an input volume V, we first randomly crop a sub-volume k, with the objective of constructing positive and negative pairs with k for contrastive learning. Specifically, we employ position encoding to generate n non-overlapping base crops q_i, where each base crop represents a distinct region of the input volume.
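As a concrete illustration, here is a minimal sketch of this base-crop generation step, assuming the crops are laid out on a regular non-overlapping grid in the axial plane; the function name, the grid size, and the (top, left, height, width) box convention are our own assumptions rather than the released VoCo implementation:

```python
import torch

def generate_base_crops(volume: torch.Tensor, grid=(4, 4)):
    """Partition a volume (C, D, H, W) into non-overlapping base crops.

    Each crop q_i covers one cell of a grid over the axial (H, W) plane,
    so every base crop corresponds to a distinct region of the volume.
    Returns the crops and their (top, left, height, width) boxes.
    """
    C, D, H, W = volume.shape
    gh, gw = grid
    ch, cw = H // gh, W // gw
    crops, boxes = [], []
    for i in range(gh):
        for j in range(gw):
            crops.append(volume[:, :, i * ch:(i + 1) * ch, j * cw:(j + 1) * cw])
            boxes.append((i * ch, j * cw, ch, cw))
    return crops, boxes
```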

In human anatomy, different organs occupy distinct regions, which offers a natural way to form positive and negative pairs. As shown in the framework figure below, the random crop k overlaps the positive base crops q_pos, whereas the negative base crops q_neg, which share no overlap with k, are more likely (though not guaranteed) to encompass different organs. For example, k and q_pos both contain the stomach, pancreas, veins, aorta, and vena cava, while k and q_neg capture different organ information. Thus, we can employ the position encoding to construct positive and negative pairs for contrastive learning.

Previous contrastive learning methods mainly employ the InfoNCE loss to maximize the mutual information of positive pairs. In this paper, we instead generate labels with specific values to supervise the degree of correlation of positive pairs, i.e., labels that reflect how similar k and q_pos are. We observe that the correlation between k and q_pos is associated with their overlap proportion. Intuitively, if a positive base crop q_pos shares a larger overlap area with k, then this q_pos is more similar to k. Thus, we assign the overlap proportions as the values of the position labels y, enabling us to measure the similarity between k and q_pos. In contrast, the position labels y of q_neg are set to 0.
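To make the label assignment concrete, here is a minimal sketch in the same spirit; normalizing the overlap by the area of k (so that the labels across all base crops sum to 1 when the crops tile the plane) is our assumption about the exact normalization:

```python
def position_labels(k_box, q_boxes):
    """Assign each base crop q_i a position label y_i in [0, 1].

    y_i is the overlap proportion between the random crop k and q_i:
    positive base crops (overlapping k) receive y_i > 0, while negative
    base crops (disjoint from k) receive y_i = 0.
    Boxes are (top, left, height, width) in the axial plane.
    """
    kt, kl, kh, kw = k_box
    labels = []
    for (qt, ql, qh, qw) in q_boxes:
        inter_h = max(0, min(kt + kh, qt + qh) - max(kt, qt))
        inter_w = max(0, min(kl + kw, ql + qw) - max(kl, ql))
        # Normalize by the area of k (our assumption), so that the labels
        # over all base crops sum to 1 when the crops tile the plane.
        labels.append((inter_h * inter_w) / (kh * kw))
    return labels
```

For instance, with a 4x4 grid and a random crop no larger than one grid cell, k overlaps at most four base crops, so y is sparse and most entries are 0.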

Figure: Overall framework of VoCo. (a) First, we generate base crops q with corresponding position labels y. Then we feed the random crop k and the base crops q into the network for contextual position prediction. Specifically, we employ a student-teacher module to project k and q separately, where the teacher projector receives no gradients and is updated from the student projector via Exponential Moving Average (EMA). Finally, we conduct volume contrast between k and q to predict the similarity s, which is supervised by the position labels y. (b) We use the position labels to supervise the intra-volume contrast among k, q_pos, and q_neg, all drawn from the same volume. (c) We extract the random crop k_A and the base crops q_B from different volumes V_A and V_B for inter-volume contrast.
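In code, the volume contrast step might look like the following sketch; the projector interfaces, the cosine-similarity prediction, the clamping, and the L1 regression loss are simplifications of ours, not the exact VoCo objective:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.99):
    # The teacher projector receives no gradients; its parameters track
    # the student projector via Exponential Moving Average (EMA).
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

def volume_contrast_loss(student, teacher, k_feat, q_feat, y):
    """Predict the contextual position of k by contrasting it to base crops.

    k_feat: (B, C) backbone feature of the random crop k
    q_feat: (B, n, C) backbone features of the n base crops q
    y:      (B, n) position labels (overlap proportions; 0 for negatives)
    """
    zk = F.normalize(student(k_feat), dim=-1)         # student projects k
    with torch.no_grad():
        zq = F.normalize(teacher(q_feat), dim=-1)     # frozen teacher projects q
    # Predicted similarity s between k and every base crop, clamped to
    # [0, 1] to match the range of the position labels y.
    s = torch.einsum('bc,bnc->bn', zk, zq).clamp(min=0.0)
    return F.l1_loss(s, y)                            # supervised by y
```

Here the student and teacher would typically be small MLP projectors, with the teacher initialized as a copy of the student and refreshed by ema_update after each optimizer step.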

Experiments

We build the largest benchmark in this field to date, open-sourcing the implementations of more than 50 downstream tasks spanning segmentation, classification, registration, and vision-language processing (VLP). Extensive experiments demonstrate the superiority of VoCo, with consistent and significant improvements across 51 tasks: on average +3.62% over the baseline and +2.19% over the second-best model.

Conclusion

In this paper, we proposed a simple yet effective Volume Contrast (VoCo) framework for large-scale 3D medical image pre-training. Inspired by the consistent geometric relations between different organs, we leveraged geometric context priors to learn consistent semantic representations for self-supervised learning (SSL). VoCo can also be seamlessly integrated into a semi-supervised learning framework for omni-supervised pre-training. To facilitate the study of large-scale 3D medical image pre-training, we curated PreCT-160K, the largest medical image pre-training dataset to date, encompassing 160K CT volumes that cover diverse anatomical structures. We further delved into the scaling law of model capacity and proposed guidelines for tailoring different model sizes to various medical tasks. To evaluate the effectiveness of pre-training, we established a comprehensive evaluation benchmark encompassing 51 downstream datasets across various tasks. Extensive experiments highlighted the superior performance of VoCo compared with previous methods.


Resources

For more details, please see our paper, “Large-Scale 3D Medical Image Pre-training with Geometric Context Priors,” published in TPAMI.

Citation:
L. Wu, J. Zhuang and H. Chen, “Large-Scale 3D Medical Image Pre-Training With Geometric Context Priors,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2025.3639593.