
Improving Multimodal AI: Understanding How Modalities Work Together

TLDR: A new framework called Modality Composition Awareness (MCA) is proposed to make multimodal retrieval systems, especially those using unified MLLM encoders, more robust. It tackles the “modality shortcut” problem where models over-rely on one modality. MCA uses two objectives: one to ensure composed inputs are more discriminative than their unimodal parts, and another to align composed embeddings with prototypes from their unimodal components. This leads to significant improvements in out-of-distribution performance while maintaining in-domain accuracy.

In the rapidly evolving world of artificial intelligence, multimodal retrieval systems are becoming increasingly vital. These systems allow us to search for and find relevant content across different types of data, such as text, images, and audio. Imagine searching for a specific product using both a picture and a detailed text description – that’s multimodal retrieval in action, powering everything from advanced AI search engines to content creation tools.

Traditionally, multimodal retrieval often relied on separate AI models (encoders) for each type of data, like one for text and another for images. These models would then try to align their outputs so that related text and images would be close to each other in a shared digital space. A well-known example of this is CLIP.
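The dual-encoder idea can be sketched in a few lines: each encoder maps its input into a shared embedding space, and retrieval reduces to ranking candidates by cosine similarity. The toy vectors below are hypothetical stand-ins for the outputs of separate text and image encoders such as CLIP's.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between rows of a and rows of b,
    # computed on L2-normalized vectors
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Toy embeddings: one text query and three candidate images
# (in a real system these come from trained encoders, not hand-picked values)
text_query = np.array([[0.9, 0.1, 0.0]])
image_candidates = np.array([
    [0.8, 0.2, 0.1],    # candidate closely aligned with the query
    [0.1, 0.9, 0.3],    # unrelated candidate
    [0.0, 0.2, 0.95],   # unrelated candidate
])

scores = cosine_sim(text_query, image_candidates)
best = int(np.argmax(scores))  # index of the best-matching image
```

Because both modalities live in one space, "find the image matching this caption" is just a nearest-neighbor lookup over precomputed image embeddings.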

However, with the rise of powerful Multimodal Large Language Models (MLLMs), a new approach has emerged. MLLMs can process different types of inputs, including combinations of text and images, using a single, unified AI architecture. This offers great flexibility and advanced capabilities, allowing for more complex queries like “find an image similar to this one, but with these specific changes described in text.”

The Challenge of Modality Shortcuts

While MLLMs bring many advantages, researchers have identified a significant challenge: modality shortcut learning. When a unified encoder is trained using conventional methods, it can sometimes learn to over-rely on the strongest signal from one modality while ignoring the complementary information from others. This leads to a lack of robustness, especially when the system encounters new or slightly different types of data (known as out-of-distribution scenarios).

For instance, if you ask an AI to find an image based on a picture and a text instruction like “put up convertible roof, remove snow, place SUV standing on flat asphalt,” the model might focus only on the visual similarity of the cars and ignore the specific textual modifications. This means it takes a “shortcut,” failing to truly understand the combined meaning of the input.

Introducing Modality Composition Awareness (MCA)

To tackle this problem, a new framework called Modality Composition Awareness (MCA) has been proposed. MCA is designed to explicitly model the structural relationships between a combined (multimodal) input and its individual (unimodal) parts. It does this through two key objectives:

1. Modality Composition Preference (MCP): This objective ensures that the AI’s understanding of a combined input is more distinct and useful than its understanding of any single part of that input. In simpler terms, if a query has both text and an image, the AI should find it more informative than if it only had the text or only the image. This discourages the AI from taking shortcuts and relying on just one dominant modality.

2. Modality Composition Regularization (MCR): This objective encourages the AI’s representation of a combined input to be consistent with a “prototype” created by simply blending its individual parts. This helps to keep the combined representation grounded in the meanings of its constituent modalities, preventing it from straying too far or becoming arbitrary. Simple blending techniques, like mean pooling or gated fusion, are used to create these prototypes.
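The article does not give the paper's exact loss formulas, but the two objectives can be sketched as follows, assuming cosine-similarity embeddings: MCP as a margin loss that pushes the composed embedding to match the target better than the strongest unimodal embedding does, and MCR as a consistency term between the composed embedding and a mean-pooled prototype of its parts. Function names, the margin value, and the loss weighting are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def mcp_loss(z_comp, z_text, z_img, z_target, margin=0.2):
    # Modality Composition Preference (sketch): the composed embedding
    # should score higher against the target than either unimodal part,
    # penalizing reliance on a single dominant modality ("shortcut").
    s_comp = F.cosine_similarity(z_comp, z_target, dim=-1)
    s_text = F.cosine_similarity(z_text, z_target, dim=-1)
    s_img = F.cosine_similarity(z_img, z_target, dim=-1)
    s_uni = torch.maximum(s_text, s_img)  # strongest unimodal signal
    return F.relu(margin - (s_comp - s_uni)).mean()

def mcr_loss(z_comp, z_text, z_img):
    # Modality Composition Regularization (sketch): keep the composed
    # embedding close to a prototype blended from its unimodal parts.
    # Mean pooling is shown; gated fusion is another option.
    prototype = 0.5 * (z_text + z_img)
    return (1 - F.cosine_similarity(z_comp, prototype, dim=-1)).mean()

# Toy usage with random tensors standing in for encoder outputs
torch.manual_seed(0)
z_text, z_img, z_comp, z_target = (torch.randn(4, 8) for _ in range(4))
total = mcp_loss(z_comp, z_text, z_img, z_target) + 0.1 * mcr_loss(z_comp, z_text, z_img)
```

In training, `total` would be added to the usual contrastive retrieval loss, so the model is rewarded both for matching targets and for composing modalities rather than shortcutting.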


Promising Results for Robust Retrieval

Extensive experiments have shown that MCA significantly improves the robustness of multimodal retrieval systems. While maintaining strong performance on familiar data, MCA delivers substantial gains when dealing with new or out-of-distribution scenarios. This is crucial for real-world applications where AI systems need to generalize well beyond their training data.

The research also revealed that both MCP and MCR contribute uniquely to these improvements, working together to form a powerful constraint against modality shortcuts. Interestingly, MCA proved even more effective in situations where the quality of one modality was lower (e.g., low-resolution images), helping the model to leverage complementary information from other modalities rather than collapsing onto a single, clearer signal.

The choice of how to blend the unimodal parts (the “mixer” in MCR) also played a role, with a “gated fusion” method yielding the best results. This suggests that while the core principle of compositional consistency is important, the specific implementation can further enhance performance.
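A gated-fusion mixer can be sketched as a small learned module that decides, per embedding dimension, how much of each unimodal embedding contributes to the prototype. The parameterization below (a single linear gate over the concatenated inputs) is a common pattern and an assumption on my part; the paper's exact design may differ.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    # Sketch of a gated fusion mixer for building unimodal prototypes.
    # A sigmoid gate produces per-dimension weights in (0, 1), so the
    # output is a convex blend of the text and image embeddings.
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, z_text, z_img):
        g = torch.sigmoid(self.gate(torch.cat([z_text, z_img], dim=-1)))
        return g * z_text + (1 - g) * z_img

mixer = GatedFusion(dim=8)
proto = mixer(torch.randn(4, 8), torch.randn(4, 8))
```

Unlike mean pooling, the gate is trained jointly with the rest of the model, letting the prototype lean toward whichever modality is more informative for a given input.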

Qualitative examples demonstrate MCA’s ability to make AI models truly integrate information from both text and images. For instance, in a fashion search, a baseline model might find a polka dot and white item but ignore the original clothing style in the image. MCA, however, successfully combines both modalities to select the desired target. This work highlights Modality Composition Awareness as a fundamental principle for building more robust and reliable multimodal retrieval systems using MLLMs. You can read the full research paper here: MCA: Modality Composition Awareness for Robust Composed Multimodal Retrieval.

Ananya Rao
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
