The advent of foundation models (FMs) in healthcare offers unprecedented opportunities to enhance medical diagnostics through automated classification and segmentation tasks. However, these models also raise significant fairness concerns, especially when applied to diverse and underrepresented populations in healthcare applications. Currently, there is a lack of comprehensive benchmarks, standardized pipelines, and easily adaptable libraries to evaluate and understand the fairness of FMs in medical imaging, leading to considerable challenges in formulating and implementing solutions that ensure equitable outcomes across diverse patient populations. To fill this gap, we introduce FairMedFM, a fairness benchmark for FM research in medical imaging. FairMedFM integrates 17 popular medical imaging datasets, encompassing different modalities, dimensionalities, and sensitive attributes. It explores 20 widely used FMs under various usage paradigms, such as zero-shot learning, linear probing, parameter-efficient fine-tuning, and prompting, across two downstream tasks: classification and segmentation. Our exhaustive analysis evaluates fairness over multiple evaluation metrics and from multiple perspectives, revealing the existence of bias, varied utility-fairness trade-offs across FMs, consistent disparities on the same datasets regardless of the FM used, and the limited effectiveness of existing unfairness mitigation methods.
The figure above presents the pipeline of the FairMedFM framework, which offers an easy-to-use codebase for benchmarking the fairness of FMs in medical imaging. FairMedFM currently contains 17 datasets (9 for classification and 8 for segmentation) and 20 FMs (11 for classification and 9 for segmentation). It also integrates 9 fairness metrics (5 for classification and 4 for segmentation) and 6 unfairness mitigation algorithms (3 for classification and 3 for segmentation), providing a comprehensive benchmark for fairness in medical imaging FMs.
FairMedFM includes 17 publicly available datasets to evaluate the fairness of FMs in medical imaging. These datasets vary in task type (classification and segmentation), dimension (2D and 3D), modality (OCT, X-ray, CT, MRI, ultrasound, fundus, dermatology), body part (brain, eyes, skin, thyroid, chest, liver, kidney, spine), number of classes (ranging from 2 to 15), number of samples (ranging from 20 to more than 350k), sensitive attribute (sex, age, race, preferred language, skin tone), and sensitive-attribute (SA) skewness (Male : Female ratio ranging from 0.19 to 1.67). The datasets include CheXpert, MIMIC-CXR, HAM10000, FairVLMed10k, GF3300, and more.
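As an illustration of the SA skewness statistic above, the sketch below computes the Male : Female sample ratio from a list of per-sample sex labels. The function name and label encoding are illustrative assumptions, not FairMedFM's API.

```python
# Illustrative sketch (not FairMedFM's API): computing the sensitive-attribute
# (SA) skewness reported per dataset, i.e. the Male : Female sample ratio.
from collections import Counter

def sa_skewness(sex_labels):
    """Return the Male : Female ratio for a list of per-sample sex labels."""
    counts = Counter(sex_labels)
    return counts["M"] / counts["F"]

# Example: a toy cohort with 3 male and 4 female samples.
ratio = sa_skewness(["M", "F", "F", "M", "F", "M", "F"])
print(round(ratio, 2))  # 0.75
```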
Classification FMs: FairMedFM evaluates 11 FMs from two categories: vision models (VMs) such as C2L, DINOv2, MedLVM, MedMAE, and MoCo-CXR, and vision-language models (VLMs) such as CLIP, BLIP, BLIP2, MedCLIP, PubMedCLIP, and BiomedCLIP. Evaluation is performed using linear probing (LP) and parameter-efficient fine-tuning (PEFT), plus CLIP-ZS (zero-shot) and CLIP-Adapt for the vision-language models.
Segmentation FMs: Nine SegFMs from three categories are evaluated: general-purpose SegFMs (SAM, MobileSAM, TinySAM), 2D medical SegFMs (MedSAM, SAM-Med2D, FT-SAM), and 3D medical SegFMs (SAM-Med3D, FastSAM3D). These models are evaluated with different prompt types (center, rand, rands, and bbox), i.e., point- and bounding-box-based prompts.
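To make the prompt types concrete, the sketch below derives a center point, a random foreground point, and a bounding box from a ground-truth binary mask. The exact prompt construction in FairMedFM may differ; this is only an assumed, simplified version.

```python
# Illustrative prompt generation from a binary mask (assumed, simplified):
# a center point, one random foreground point, and a bounding box.
import numpy as np

def make_prompts(mask, rng=None):
    rng = rng or np.random.default_rng(0)
    ys, xs = np.nonzero(mask)                              # foreground pixels
    center = (int(round(ys.mean())), int(round(xs.mean())))  # "center"
    i = rng.integers(len(ys))
    rand_pt = (int(ys[i]), int(xs[i]))                     # "rand"
    bbox = tuple(int(v) for v in (ys.min(), xs.min(), ys.max(), xs.max()))  # "bbox"
    return center, rand_pt, bbox

mask = np.zeros((8, 8), dtype=int)
mask[2:5, 3:6] = 1                     # a 3x3 foreground square
center, rand_pt, bbox = make_prompts(mask)
print(center, bbox)  # (3, 4) (2, 3, 4, 5)
```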
FairMedFM integrates 6 unfairness mitigation algorithms, 3 for classification and 3 for segmentation.
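As one simple example of this family of methods (not necessarily one of the algorithms FairMedFM ships), group-balanced resampling oversamples the minority sensitive-attribute group so each group contributes equally during training:

```python
# Group-balanced resampling sketch (illustrative; not necessarily one of
# FairMedFM's mitigation algorithms): oversample minority SA groups so all
# groups contribute the same number of samples.
import numpy as np

def balanced_indices(groups, rng=None):
    rng = rng or np.random.default_rng(0)
    groups = np.asarray(groups)
    uniq, counts = np.unique(groups, return_counts=True)
    n = counts.max()                       # target count per group
    idx = [rng.choice(np.flatnonzero(groups == g), size=n, replace=True)
           for g in uniq]
    return np.concatenate(idx)

groups = ["M"] * 6 + ["F"] * 2             # skewed toy cohort
resampled = np.asarray(groups)[balanced_indices(groups)]
print((resampled == "M").sum(), (resampled == "F").sum())  # 6 6
```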
FairMedFM evaluates fairness using 9 metrics, 5 for classification and 4 for segmentation, covering multiple perspectives on the utility-fairness trade-off.
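For intuition, the sketch below computes one common group-fairness metric, the demographic parity gap: the difference in positive-prediction rate across sensitive-attribute groups. It is an illustrative example, not a statement of FairMedFM's exact metric set.

```python
# Demographic parity gap (illustrative; not FairMedFM's exact metric set):
# the spread in positive-prediction rate across sensitive-attribute groups.
import numpy as np

def demographic_parity_gap(preds, groups):
    preds, groups = np.asarray(preds), np.asarray(groups)
    rates = [preds[groups == g].mean() for g in np.unique(groups)]
    return max(rates) - min(rates)

preds  = [1, 1, 0, 1, 0, 0, 0, 1]
groups = ["M", "M", "M", "M", "F", "F", "F", "F"]
print(demographic_parity_gap(preds, groups))  # 0.5 (M: 0.75, F: 0.25)
```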
If you find our code and paper helpful, please consider citing our work:
@article{jin2024fairmedfm,
  title={FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models},
  author={Jin, Ruinan and Xu, Zikang and Zhong, Yuan and Yao, Qingsong and Dou, Qi and Zhou, S Kevin and Li, Xiaoxiao},
  journal={arXiv preprint arXiv:2407.00983},
  year={2024}
}