TLDR: The research introduces CALIN, an inference-time calibration method for Multimodal Large Language Models (MLLMs) used in medical image classification. It addresses calibration biases and demographic unfairness in few-shot in-context learning through a bi-level procedure: population-level and subgroup-level null-input probing. Experiments on three medical imaging datasets show that CALIN significantly improves confidence calibration, reduces demographic bias, and maintains prediction accuracy, enabling safer and fairer deployment of MLLMs in clinical practice.
Multimodal large language models (MLLMs) are showing immense promise in various fields, and their application in medical image analysis, particularly with few-shot in-context learning, holds significant potential. These models can learn new tasks with just a few examples, reducing the need for extensive training data typically required by traditional deep learning methods. However, for these powerful AI tools to be safely and ethically deployed in real-world clinical settings, it’s crucial to thoroughly examine the accuracy and reliability of their predictions, especially concerning potential biases and fairness across different patient demographics.
A recent research paper, titled “Exposing and Mitigating Calibration Biases and Demographic Unfairness in MLLM Few-Shot In-Context Learning for Medical Image Classification,” delves into this critical area. Authored by Xing Shen, Justin Szeto, Mingyang Li, Hengguan Huang, and Tal Arbel, the study highlights a significant challenge: MLLMs can exhibit calibration biases and demographic unfairness in their predictions and confidence scores. This means the model might be overly confident in wrong predictions, or its performance and reliability could vary unfairly across different groups, such as by sex or age, potentially leading to harm for underrepresented populations.
The core problem lies in ensuring that the confidence scores provided by MLLMs are truly reliable and fair across all demographic subgroups. Traditional calibration methods often fall short in few-shot in-context learning settings: they typically require either additional training data, which is unavailable here, or access to the internal parameters of large "black-box" models such as GPT-4o or Gemini 1.5, which is rarely feasible.
To address these challenges, the researchers introduce a novel inference-time calibration method called CALIN (CALibration INference-time). CALIN is designed to mitigate these biases and ensure fair confidence calibration without needing extra training data or access to the MLLM’s internal workings. It operates through a clever bi-level procedure that estimates the necessary calibration from a population level down to a subgroup level before the actual inference takes place.
How CALIN Works
CALIN employs a two-step approach to adjust the confidence scores:
- Population-Level Calibration (L1): This initial step estimates the overall calibration needed for the entire population. It uses a technique called "multimodal null-input probing": the model is fed a content-free, semantic-free query (e.g., an arbitrary image without specifying sex or diagnosis). The idea is that, given no specific information, the model's predicted confidence distribution should be uniform. CALIN then calculates a "calibration matrix" that aligns the model's general confidence with this uniform distribution (a minimal sketch follows this list).
- Subgroup-Level Calibration (L2): Building on L1, this step focuses on calibrating for specific demographic subgroups. For instance, it might use a null query conditioned on an attribute like “male” or “elder.” This helps capture variations in calibration needed for different groups.
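To make the null-input probing idea concrete, here is a minimal Python sketch. It assumes the calibration matrix is a diagonal rescaling built from the model's label probabilities on the content-free query (in the spirit of contextual calibration); the function names and numbers are illustrative and do not reproduce the paper's exact formulation.

```python
import numpy as np

def calibration_matrix(null_probs: np.ndarray) -> np.ndarray:
    """Diagonal calibration matrix from label probabilities on a content-free
    (null) query. An unbiased model would answer such a query uniformly,
    so each class is rescaled by 1 / p_null."""
    return np.diag(1.0 / null_probs)

def apply_calibration(probs: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Rescale a prediction's class probabilities and renormalize."""
    calibrated = W @ probs
    return calibrated / calibrated.sum()

# Illustrative numbers only: on a semantic-free query the model leans 70/30
# toward "disease", revealing a population-level bias toward that label.
null_probs = np.array([0.7, 0.3])       # P(disease), P(no disease) on the null input
W_pop = calibration_matrix(null_probs)  # population-level (L1) calibration

raw = np.array([0.8, 0.2])              # raw confidence on a real query
print(apply_calibration(raw, W_pop))    # ~[0.63, 0.37]: overconfidence pulled back
```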
A crucial aspect of CALIN is how it combines these two levels. While L2 aims for subgroup-specific fairness, relying solely on it can be unstable due to inherent biases in language models. Therefore, CALIN regularizes L2 with L1 using an exponential decay mechanism. This ensures that the final calibration captures subgroup variability accurately while also being stable and robust, especially when subgroup-specific estimations might be less reliable.
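The exact form of the exponential decay is specified in the paper; one plausible reading, sketched below under stated assumptions, interpolates between the subgroup-level (L2) and population-level (L1) calibration weights, with the influence of L2 shrinking exponentially as it diverges from L1. All names and constants here are illustrative, not the authors' implementation.

```python
import numpy as np

def combine_bilevel(w_pop: np.ndarray, w_sub: np.ndarray, decay: float = 1.0) -> np.ndarray:
    """Blend subgroup-level calibration weights (L2) with population-level
    weights (L1). The weight on L2 decays exponentially as it diverges from
    L1, so an unreliable subgroup estimate falls back to the population one."""
    divergence = np.abs(w_sub - w_pop).sum()
    alpha = float(np.exp(-decay * divergence))  # in (0, 1]; shrinks as L2 strays from L1
    return alpha * w_sub + (1.0 - alpha) * w_pop

# Per-class calibration weights (1 / p_null) obtained from null-input probing.
w_pop = 1.0 / np.array([0.7, 0.3])    # population-level null query (L1)
w_sub = 1.0 / np.array([0.55, 0.45])  # null query conditioned on "male" (L2)
print(combine_bilevel(w_pop, w_sub))  # regularized subgroup-level calibration
```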
Experimental Validation
The effectiveness of CALIN was rigorously tested on three publicly available medical imaging datasets: PAPILA (for fundus image classification related to glaucoma), HAM10000 (for skin cancer classification from dermatoscopic images), and MIMIC-CXR (for chest X-ray classification related to pleural effusion). The experiments were conducted using GPT-4o-mini, a model from the GPT family.
The results were compelling. CALIN consistently outperformed the vanilla few-shot in-context learning method across various metrics. It significantly reduced the Expected Calibration Error (ECE), indicating more reliable confidence scores across the entire patient population. More importantly, CALIN achieved a notable decrease in the Confidence Calibration Error Gap (CCEG) across all datasets, effectively mitigating confidence calibration bias associated with demographic attributes like sex and age. The study also showed that CALIN improved overall prediction accuracies and exhibited a minimal trade-off between fairness and utility, as measured by the Equity-Scaling Measure of Calibration Error (ESCE).
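For readers unfamiliar with these metrics, the sketch below computes a standard binned ECE and treats CCEG as the gap between the largest and smallest per-subgroup calibration errors (e.g., across sex or age groups); the paper's exact definitions may differ in detail, and the data here are toy values.

```python
import numpy as np

def ece(conf: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    """Binned Expected Calibration Error: weighted average gap between mean
    confidence and accuracy inside each confidence bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    err, n = 0.0, len(conf)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            err += mask.sum() / n * abs(conf[mask].mean() - correct[mask].mean())
    return err

def cceg(conf: np.ndarray, correct: np.ndarray, groups: np.ndarray) -> float:
    """Confidence Calibration Error Gap: spread between the best- and
    worst-calibrated demographic subgroups."""
    per_group = [ece(conf[groups == g], correct[groups == g]) for g in np.unique(groups)]
    return max(per_group) - min(per_group)

# Toy data only: predicted confidences, 0/1 correctness, and a sex attribute.
conf = np.array([0.9, 0.8, 0.6, 0.7])
correct = np.array([1, 0, 1, 1])
groups = np.array(["male", "male", "female", "female"])
print(ece(conf, correct), cceg(conf, correct, groups))
```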
An ablation study further validated CALIN’s bi-level framework, demonstrating that combining population-level and subgroup-level calibration yields superior results compared to using either level alone. This highlights the importance of the integrated approach for achieving both accuracy and fairness.
Looking Ahead
While CALIN marks a significant step forward in ensuring fair and reliable MLLM predictions in medical imaging, the authors acknowledge certain limitations. The study focused primarily on the GPT family of models and on tasks whose labels are represented by single tokens. Future research could explore its applicability across different MLLM architectures, varying model sizes, and tasks requiring multi-token labels. The paper and its associated codebase are publicly available.