FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models

The University of British Columbia
The Chinese University of Hong Kong
University of Science and Technology of China
* Equal contribution. Authors are listed in alphabetical order.
† Co-Corresponding authors.

Overview

The advent of foundation models (FMs) in healthcare offers unprecedented opportunities to enhance medical diagnostics through automated classification and segmentation tasks. However, these models also raise significant fairness concerns, especially when applied to diverse and underrepresented populations in healthcare applications. Currently, there is a lack of comprehensive benchmarks, standardized pipelines, and easily adaptable libraries for evaluating and understanding the fairness of FMs in medical imaging, which makes it difficult to formulate and implement solutions that ensure equitable outcomes across diverse patient populations. To fill this gap, we introduce FairMedFM, a fairness benchmark for FM research in medical imaging. FairMedFM integrates 17 popular medical imaging datasets, encompassing different modalities, dimensionalities, and sensitive attributes. It explores 20 widely used FMs, with various usages such as zero-shot learning, linear probing, parameter-efficient fine-tuning, and prompting, across two downstream tasks -- classification and segmentation. Our exhaustive analysis evaluates fairness with multiple metrics and from multiple perspectives, revealing the existence of bias, varied utility-fairness trade-offs across FMs, consistent disparities on the same datasets regardless of the FM used, and the limited effectiveness of existing unfairness mitigation methods.

  • We offer a comprehensive evaluation pipeline covering 17 diverse medical imaging datasets, 20 FMs, and their various usages. The benchmark addresses the need for a consistent, standardized process to investigate the fairness of FMs in medical imaging.
  • With FairMedFM, we conducted a thorough analysis from various perspectives, where we found that
    • Bias is prevalent in using FMs for medical imaging tasks, and the fairness-utility trade-off in these tasks is influenced not only by the choice of FMs but also by how they are used;
    • There are significant, dataset-dependent disparities between SA groups for most FMs;
    • Consistent disparities across SA groups occur for different FMs on the same dataset; and
    • Existing bias mitigation strategies do not demonstrate strong effectiveness in FM parameter-efficient fine-tuning scenarios.

Overview of the FairMedFM framework: a standardized pipeline to investigate fairness, featuring diverse datasets (2D, 2.5D, and 3D), comprehensive functionalities (various FMs, tasks, usages, and debiasing algorithms), and thorough evaluation metrics.

A Comprehensive Medical FM Integration Pipeline

The figure above presents the pipeline of the FairMedFM framework, which offers an easy-to-use codebase for benchmarking the fairness of FMs in medical imaging. FairMedFM currently contains 17 datasets (9 for classification and 8 for segmentation) and 20 FMs (11 for classification and 9 for segmentation). It also integrates 9 fairness metrics (5 for classification and 4 for segmentation) and 6 unfairness mitigation algorithms (3 for classification and 3 for segmentation), providing a comprehensive benchmark for fairness in medical imaging FMs.

Datasets

FairMedFM includes 17 publicly available datasets to evaluate the fairness of FMs in medical imaging. These datasets vary in task type (classification and segmentation), dimension (2D and 3D), modality (OCT, X-ray, CT, MRI, ultrasound, fundus, dermatology), body part (brain, eyes, skin, thyroid, chest, liver, kidney, spine), number of classes (from 2 to 15), number of samples (from 20 to more than 350k), sensitive attribute (sex, age, race, preferred language, skin tone), and SA skewness (Male : Female ratios from 0.19 to 1.67). Included datasets are CheXpert, MIMIC-CXR, HAM10000, FairVLMed10k, GF3300, and more.
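To make concrete how such heterogeneous data can be exposed through one uniform interface, the sketch below shows a minimal dataset wrapper that returns an (image, label, sensitive attribute) triple for every sample. The class and argument names are illustrative assumptions, not the actual FairMedFM interface.

```python
# Minimal sketch (not the FairMedFM API): every sample carries its sensitive attribute (SA)
# so downstream code can compute per-group metrics or rebalance groups.
import torch
from torch.utils.data import Dataset


class FairImageDataset(Dataset):
    """Returns (image, label, sensitive_attribute) per sample."""

    def __init__(self, images, labels, sensitive_attrs, transform=None):
        assert len(images) == len(labels) == len(sensitive_attrs)
        self.images = images                    # e.g. pre-loaded tensors or file paths
        self.labels = labels                    # class index (classification) or mask (segmentation)
        self.sensitive_attrs = sensitive_attrs  # e.g. 0 = male, 1 = female
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = self.images[idx]
        if self.transform is not None:
            img = self.transform(img)
        return img, self.labels[idx], self.sensitive_attrs[idx]
```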

Models

Classification FMs: FairMedFM uses 11 FMs from two categories: vision models (VMs) such as C2L, DINOv2, MedLVM, MedMAE, and MoCo-CXR, and vision-language models (VLMs) such as CLIP, BLIP, BLIP2, MedCLIP, PubMedCLIP, and BiomedCLIP. Evaluation is performed using linear probing (LP) and parameter-efficient fine-tuning (PEFT), plus CLIP-ZS (zero-shot) and CLIP-Adapt for the vision-language models.
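To make the linear-probing usage concrete, here is a minimal sketch: the pre-trained encoder stays frozen and only a linear head is trained. A torchvision ResNet-50 stands in for the actual FMs (DINOv2, MedMAE, CLIP image encoders, etc.), whose loading code differs per model.

```python
# Linear probing (LP) sketch: freeze a pre-trained encoder, train only a linear head.
# The ResNet-50 backbone is a stand-in assumption, not one of the benchmarked FMs.
import torch
import torch.nn as nn
from torchvision import models


class LinearProbe(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # keep the foundation model frozen
            p.requires_grad = False
        self.head = nn.Linear(feat_dim, num_classes)  # the only trainable part

    def forward(self, x):
        with torch.no_grad():
            feats = self.encoder(x)
        return self.head(feats)


# Stand-in encoder with its classification layer removed (2048-d features).
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = nn.Identity()
model = LinearProbe(backbone, feat_dim=2048, num_classes=2)

optimizer = torch.optim.AdamW(model.head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
# A training loop over (image, label, sensitive_attribute) batches would go here;
# only model.head receives gradient updates.
```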

Segmentation FMs: Nine SegFMs from three categories are used: general SegFMs (SAM, MobileSAM, TinySAM), 2D Med-SegFMs (MedSAM, SAM-Med2D, FT-SAM), and 3D Med-SegFMs (SAM-Med3D, FastSAM3D). These models are evaluated with different prompt types: center, rand, rands, and bbox.
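The sketch below illustrates how such prompts can be derived from a ground-truth mask during evaluation: the mask centroid for center, random foreground points for rand/rands, and a tight bounding box for bbox. The exact prompt construction in FairMedFM may differ; this only conveys the idea.

```python
# Sketch: derive point and box prompts from a binary ground-truth mask (assumed non-empty).
import numpy as np


def mask_to_prompts(mask: np.ndarray, n_rand: int = 3, seed: int = 0):
    """mask: binary (H, W) array. Returns point / box prompts in (x, y) convention."""
    ys, xs = np.nonzero(mask)
    rng = np.random.default_rng(seed)

    # "center": mask centroid (note: may fall outside strongly concave masks).
    center = np.array([xs.mean(), ys.mean()])

    # "rand"/"rands": one or several random foreground points.
    idx = rng.choice(len(xs), size=min(n_rand, len(xs)), replace=False)
    rand_points = np.stack([xs[idx], ys[idx]], axis=1)

    # "bbox": tight bounding box as (x_min, y_min, x_max, y_max).
    bbox = np.array([xs.min(), ys.min(), xs.max(), ys.max()])

    return {"center": center, "rands": rand_points, "bbox": bbox}
```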

Unfairness Mitigation Methods

FairMedFM integrates several bias mitigation strategies categorized as:

  • Group rebalancing: Adjusting the representation of different subgroups (see the sketch after this list).
  • Adversarial training: Penalizing models for recognizing sensitive attributes.
  • Fairness constraints: Adding fairness metrics to the training objective.
  • Subgroup-tailored modeling: Different model parameters for different subgroups.
  • Domain generalization: Improving model performance across various domains.
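As one concrete example, group rebalancing can be implemented by oversampling under-represented sensitive-attribute groups. The sketch below uses PyTorch's WeightedRandomSampler; the SA array and group encoding are hypothetical, and the exact rebalancing method in FairMedFM may differ.

```python
# Group rebalancing sketch: oversample the minority SA group so each group contributes
# roughly equally per epoch.
import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# Hypothetical per-sample SA labels for a skewed dataset, e.g. 0 = male, 1 = female.
sa = np.array([0] * 800 + [1] * 200)

group_counts = np.bincount(sa)                      # [800, 200]
group_weights = 1.0 / group_counts                  # rarer group gets a larger weight
sample_weights = torch.as_tensor(group_weights[sa], dtype=torch.double)

sampler = WeightedRandomSampler(sample_weights, num_samples=len(sa), replacement=True)
# loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)
# Batches drawn from `loader` then contain both SA groups in roughly equal proportion.
```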

Evaluation Metrics

FairMedFM evaluates fairness using the following metrics:

  • Utility: AUC for classification, Dice similarity score (DSC) for segmentation.
  • Group fairness: Metrics such as delta AUC, equalized odds (EqOdds), delta DSC, DSC skewness, and expected calibration error gap (ECEΔ).
  • Utility-fairness trade-off: Metrics such as equity-scaled AUC (AUC_ES) and equity-scaled DSC (DSC_ES); see the sketch after this list.
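For the classification metrics, the sketch below computes per-group AUCs, delta AUC (max minus min group AUC), and an equity-scaled AUC using the common definition AUC_ES = AUC / (1 + Σ_g |AUC_g − AUC|); the exact definitions used in FairMedFM may differ slightly.

```python
# Sketch of delta AUC and equity-scaled AUC; each SA group must contain both classes
# for its AUC to be defined.
import numpy as np
from sklearn.metrics import roc_auc_score


def fairness_aucs(y_true, y_score, sa):
    """y_true, y_score, sa: 1-D arrays; sa holds each sample's sensitive-attribute group."""
    y_true, y_score, sa = map(np.asarray, (y_true, y_score, sa))
    overall = roc_auc_score(y_true, y_score)

    group_aucs = {g: roc_auc_score(y_true[sa == g], y_score[sa == g]) for g in np.unique(sa)}
    delta_auc = max(group_aucs.values()) - min(group_aucs.values())
    auc_es = overall / (1.0 + sum(abs(a - overall) for a in group_aucs.values()))

    return {"AUC": overall, "group AUCs": group_aucs, "delta AUC": delta_auc, "AUC_ES": auc_es}
```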

A Simple & Extendible Codebase

Our codebase consists of five modules.
  1. Dataloader provides a consistent interface for loading and processing imaging data across various modalities and dimensions, supporting both classification and segmentation tasks.
  2. Model is a one-stop library that includes implementations of the most popular pre-trained foundation models for medical image analysis.
  3. Usage Wrapper encapsulates foundation models for various use cases and tasks, including linear probe, zero-shot inference, PEFT, promptable segmentation, etc.
  4. Trainer offers a unified workflow for fine-tuning and testing wrapped models, and includes state-of-the-art unfairness mitigation algorithms.
  5. Evaluation includes a set of metrics and tools to visualize and analyze fairness across different tasks.

BibTeX

If you find our code and paper helpful, please consider citing our work: (Citation coming soon)

@article{}