TLDR: The research introduces CALIN, an inference-time calibration method for Multimodal Large Language Models (MLLMs) used in medical image classification. It addresses calibration biases and demographic unfairness in few-shot in-context learning through a bi-level procedure: population-level and subgroup-level null-input probing. Experiments on three medical imaging datasets show that CALIN significantly improves confidence calibration, reduces demographic bias, and maintains prediction accuracy, enabling safer and fairer deployment of MLLMs in clinical practice.
Multimodal large language models (MLLMs) are showing immense promise in various fields, and their application in medical image analysis, particularly with few-shot in-context learning, holds significant potential. These models can learn new tasks with just a few examples, reducing the need for extensive training data typically required by traditional deep learning methods. However, for these powerful AI tools to be safely and ethically deployed in real-world clinical settings, it’s crucial to thoroughly examine the accuracy and reliability of their predictions, especially concerning potential biases and fairness across different patient demographics.
A recent research paper, titled “Exposing and Mitigating Calibration Biases and Demographic Unfairness in MLLM Few-Shot In-Context Learning for Medical Image Classification,” delves into this critical area. Authored by Xing Shen, Justin Szeto, Mingyang Li, Hengguan Huang, and Tal Arbel, the study highlights a significant challenge: MLLMs can exhibit calibration biases and demographic unfairness in their predictions and confidence scores. This means the model might be overly confident in wrong predictions, or its performance and reliability could vary unfairly across different groups, such as by sex or age, potentially leading to harm for underrepresented populations.
The core problem lies in ensuring that the confidence scores provided by MLLMs are truly reliable and fair across all demographic subgroups. Traditional calibration methods often fall short in few-shot in-context learning settings: they typically require either additional training data, which is unavailable here, or access to the internal parameters of large "black-box" models such as GPT-4o or Gemini 1.5, which is rarely feasible.
To address these challenges, the researchers introduce a novel inference-time calibration method called CALIN (CALibration INference-time). CALIN is designed to mitigate these biases and ensure fair confidence calibration without needing extra training data or access to the MLLM’s internal workings. It operates through a clever bi-level procedure that estimates the necessary calibration from a population level down to a subgroup level before the actual inference takes place.
How CALIN Works
CALIN employs a two-step approach to adjust the confidence scores:
- Population-Level Calibration (L1): This initial step estimates the overall calibration needed for the entire population. It uses a technique called "multimodal null-input probing": the model is fed a content-free, semantic-free query (e.g., an arbitrary image without specifying sex or diagnosis). The idea is that, given no specific information, the model's predicted confidence distribution should be uniform. CALIN then calculates a "calibration matrix" that aligns the model's general confidence with this uniform distribution (a minimal sketch follows this list).
- Subgroup-Level Calibration (L2): Building on L1, this step focuses on calibrating for specific demographic subgroups. For instance, it might use a null query conditioned on an attribute like “male” or “elder.” This helps capture variations in calibration needed for different groups.
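To make the null-input probing idea concrete, here is a minimal Python sketch. It assumes the calibration matrix is a diagonal rescaling built from the model's label probabilities on the content-free query (in the spirit of contextual calibration); the function names and numbers are illustrative and do not reproduce the paper's exact formulation.

```python
import numpy as np

def calibration_matrix(null_probs: np.ndarray) -> np.ndarray:
    """Diagonal calibration matrix from label probabilities on a content-free
    (null) query. An unbiased model would answer such a query uniformly,
    so each class is rescaled by 1 / p_null."""
    return np.diag(1.0 / null_probs)

def apply_calibration(probs: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Rescale a prediction's class probabilities and renormalize."""
    calibrated = W @ probs
    return calibrated / calibrated.sum()

# Illustrative numbers only: on a semantic-free query the model leans 70/30
# toward "disease", revealing a population-level bias toward that label.
null_probs = np.array([0.7, 0.3])       # P(disease), P(no disease) on the null input
W_pop = calibration_matrix(null_probs)  # population-level (L1) calibration

raw = np.array([0.8, 0.2])              # raw confidence on a real query
print(apply_calibration(raw, W_pop))    # ~[0.63, 0.37]: overconfidence pulled back
```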
A crucial aspect of CALIN is how it combines these two levels. While L2 aims for subgroup-specific fairness, relying solely on it can be unstable due to inherent biases in language models. Therefore, CALIN regularizes L2 with L1 using an exponential decay mechanism. This ensures that the final calibration captures subgroup variability accurately while also being stable and robust, especially when subgroup-specific estimations might be less reliable.
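The exact form of the exponential decay is specified in the paper; one plausible reading, sketched below under stated assumptions, interpolates between the subgroup-level (L2) and population-level (L1) calibration weights, with the influence of L2 shrinking exponentially as it diverges from L1. All names and constants here are illustrative, not the authors' implementation.

```python
import numpy as np

def combine_bilevel(w_pop: np.ndarray, w_sub: np.ndarray, decay: float = 1.0) -> np.ndarray:
    """Blend subgroup-level calibration weights (L2) with population-level
    weights (L1). The weight on L2 decays exponentially as it diverges from
    L1, so an unreliable subgroup estimate falls back to the population one."""
    divergence = np.abs(w_sub - w_pop).sum()
    alpha = float(np.exp(-decay * divergence))  # in (0, 1]; shrinks as L2 strays from L1
    return alpha * w_sub + (1.0 - alpha) * w_pop

# Per-class calibration weights (1 / p_null) obtained from null-input probing.
w_pop = 1.0 / np.array([0.7, 0.3])    # population-level null query (L1)
w_sub = 1.0 / np.array([0.55, 0.45])  # null query conditioned on "male" (L2)
print(combine_bilevel(w_pop, w_sub))  # regularized subgroup-level calibration
```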
Experimental Validation
The effectiveness of CALIN was rigorously tested on three publicly available medical imaging datasets: PAPILA (for fundus image classification related to glaucoma), HAM10000 (for skin cancer classification from dermatoscopic images), and MIMIC-CXR (for chest X-ray classification related to pleural effusion). The experiments were conducted using GPT-4o-mini, a model from the GPT family.
The results were compelling. CALIN consistently outperformed the vanilla few-shot in-context learning method across various metrics. It significantly reduced the Expected Calibration Error (ECE), indicating more reliable confidence scores across the entire patient population. More importantly, CALIN achieved a notable decrease in the Confidence Calibration Error Gap (CCEG) across all datasets, effectively mitigating confidence calibration bias associated with demographic attributes like sex and age. The study also showed that CALIN improved overall prediction accuracies and exhibited a minimal trade-off between fairness and utility, as measured by the Equity-Scaling Measure of Calibration Error (ESCE).
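For readers unfamiliar with these metrics, the sketch below computes a standard binned ECE and treats CCEG as the gap between the largest and smallest per-subgroup calibration errors (e.g., across sex or age groups); the paper's exact definitions may differ in detail, and the data here are toy values.

```python
import numpy as np

def ece(conf: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    """Binned Expected Calibration Error: weighted average gap between mean
    confidence and accuracy inside each confidence bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    err, n = 0.0, len(conf)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            err += mask.sum() / n * abs(conf[mask].mean() - correct[mask].mean())
    return err

def cceg(conf: np.ndarray, correct: np.ndarray, groups: np.ndarray) -> float:
    """Confidence Calibration Error Gap: spread between the best- and
    worst-calibrated demographic subgroups."""
    per_group = [ece(conf[groups == g], correct[groups == g]) for g in np.unique(groups)]
    return max(per_group) - min(per_group)

# Toy data only: predicted confidences, 0/1 correctness, and a sex attribute.
conf = np.array([0.9, 0.8, 0.6, 0.7])
correct = np.array([1, 0, 1, 1])
groups = np.array(["male", "male", "female", "female"])
print(ece(conf, correct), cceg(conf, correct, groups))
```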
An ablation study further validated CALIN’s bi-level framework, demonstrating that combining population-level and subgroup-level calibration yields superior results compared to using either level alone. This highlights the importance of the integrated approach for achieving both accuracy and fairness.
Looking Ahead
While CALIN marks a significant step forward in ensuring fair and reliable MLLM predictions in medical imaging, the authors acknowledge certain limitations. The study focused primarily on the GPT family of models and on tasks whose labels are represented by single tokens. Future research could explore its applicability across different MLLM architectures, varying model sizes, and tasks requiring multi-token labels. The paper and its associated codebase are publicly available.