TLDR: DeLeaker is a novel, lightweight, and optimization-free method that mitigates semantic leakage in Text-to-Image (T2I) models. Semantic leakage is the unintended transfer of features between distinct entities in generated images. DeLeaker intervenes during inference by dynamically reweighting attention maps to suppress cross-entity interactions and strengthen individual entity identities. The research introduces the SLIM dataset and a new evaluation framework to systematically assess leakage mitigation. Experiments show DeLeaker outperforms existing methods, preserving image quality and fidelity, and its effectiveness is largely attributed to self-identity strengthening and cross-entity image-text suppression.
Text-to-Image (T2I) models, typically powered by diffusion-based architectures, have made incredible strides in generating realistic and creative images from simple text descriptions. Yet despite these advances, they face a persistent challenge known as semantic leakage.
Semantic leakage occurs when features from one entity in a generated image unintentionally transfer to another, distinct entity. Imagine asking a model to generate a cow and a horse on a farm, only for the horse to end up with cow-like ears or a cow-like mouth. This is semantic leakage: a subtle yet significant error in semantic fidelity. Although it is a form of image-text misalignment, it has remained largely unexplored.
Previous attempts to tackle this issue often relied on layout-based controls, assigning entities to fixed regions using external inputs such as bounding boxes. While these methods worked for simple scenes, they struggled with more complex interactions between entities and tended to be computationally expensive, requiring optimization during the generation process.
Introducing DeLeaker: A Dynamic Solution
A new approach called DeLeaker has been introduced to address semantic leakage. DeLeaker is a lightweight, optimization-free method that operates at inference time: it intervenes while the image is being generated, without prior training or external guidance. Its core mechanism is direct manipulation of the model's attention maps throughout the diffusion process.
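To give a feel for what an inference-time attention intervention looks like, here is a minimal PyTorch sketch. It is not the paper's code: it assumes a standard scaled dot-product attention step and a hypothetical `edit_attention` callback where DeLeaker-style reweighting logic would live.

```python
# Minimal sketch (an assumption, not the paper's code): standard attention
# whose post-softmax map can be edited before it is applied to the values.
import torch

def attention_with_intervention(q, k, v, edit_attention=None):
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    probs = scores.softmax(dim=-1)  # attention map: (..., queries, keys)
    if edit_attention is not None:
        probs = edit_attention(probs)                    # reweight entries
        probs = probs / probs.sum(dim=-1, keepdim=True)  # renormalize rows
    return probs @ v
```

In a real pipeline, a wrapper like this would replace the attention call inside each transformer block for the denoising steps where the intervention is active, leaving the model's weights untouched.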
DeLeaker applies two complementary interventions: it dynamically reweights attention maps to suppress excessive interactions between different entities while simultaneously strengthening each entity's own identity. This targeted intervention mitigates leakage without sacrificing the overall quality or fidelity of the generated image.
The method works in three main steps. First, it automatically extracts entity-specific masks from early image-text attention, essentially identifying where each entity should appear in the image. Second, it suppresses connections between entities in both image-text and image-image attention maps, reducing unwanted feature transfer. Finally, it enhances the self-identity of each entity by increasing the attention between its corresponding text and image tokens.
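As a rough illustration of these three steps, here is a hedged PyTorch sketch on toy tensors. The threshold, suppression, and strengthening factors are made-up placeholders, the helper names are hypothetical, and a real implementation would operate inside the model's attention layers rather than on standalone matrices.

```python
# Hedged sketch of the three steps on toy tensors; thresholds and scale
# factors below are illustrative assumptions, not values from the paper.
import torch

def extract_entity_masks(text_to_image_attn, entity_token_ids, thresh=0.5):
    """Step 1: derive a boolean image-token mask per entity from early
    image-text attention. text_to_image_attn: (text_tokens, image_tokens)."""
    masks = {}
    for entity, tok in entity_token_ids.items():
        amap = text_to_image_attn[tok]                       # (image_tokens,)
        amap = (amap - amap.min()) / (amap.max() - amap.min() + 1e-8)
        masks[entity] = amap > thresh
    return masks

def reweight_image_text(attn, masks, entity_token_ids,
                        suppress=0.1, strengthen=2.0):
    """Steps 2-3 (image-text part): cut each entity region's attention to
    other entities' text tokens, boost attention to its own text token.
    attn: (image_tokens, text_tokens)."""
    attn = attn.clone()
    for a in masks:
        for b, tok_b in entity_token_ids.items():
            factor = strengthen if a == b else suppress
            attn[masks[a], tok_b] *= factor
    return attn / attn.sum(dim=-1, keepdim=True)             # renormalize

def suppress_image_image(attn, masks, suppress=0.1):
    """Step 2 (image-image part): damp attention between image tokens that
    belong to different entities. attn: (image_tokens, image_tokens)."""
    attn = attn.clone()
    for a in masks:
        for b in masks:
            if a != b:
                attn[masks[a].unsqueeze(-1) & masks[b]] *= suppress
    return attn / attn.sum(dim=-1, keepdim=True)
```

The paper's actual masks, schedules, and scaling choices will differ; the point is only the shape of the intervention: masks first, then asymmetric scaling of cross-entity versus self-entity attention, then renormalization.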
The SLIM Dataset and Evaluation Framework
To systematically evaluate semantic leakage and the effectiveness of mitigation strategies, the researchers also introduced the Semantic Leakage in Images (SLIM) dataset. This is the first dataset specifically designed for this purpose, comprising 1,130 human-verified samples that cover diverse leakage scenarios, including visually similar entities, spatial interactions, and multi-entity compositions. The dataset was built using images generated by the FLUX.1-dev model and prompts created by GPT-4o, followed by a rigorous human filtering process.
Alongside SLIM, a novel automatic evaluation framework was developed. It uses a comparative setup, contrasting a mitigated image against its original version, and breaks the complex visual comparison down into discrete logical steps that leverage the reasoning capabilities of Vision-Language Models (VLMs): identifying visual differences between entities, assessing the 'typicality' of each entity in both images, and finally issuing a comparative judgment of which image better preserves distinct identities. The automatic pipeline was extensively validated through a human study comprising 980 responses.
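To make that flow concrete, here is a hedged Python sketch of the comparative pipeline. `query_vlm` is a hypothetical stand-in for whatever VLM API is actually used, and the prompts are loose paraphrases, not the paper's wording.

```python
# Hedged sketch of the comparative evaluation flow; `query_vlm` is a
# hypothetical callable (images, prompt -> text), prompts are paraphrases.
def evaluate_pair(original_img, mitigated_img, entities, query_vlm):
    # Step 1: surface the visual differences between the two entities.
    diffs = query_vlm(
        images=[original_img, mitigated_img],
        prompt=f"List visual differences between the {entities[0]} and the "
               f"{entities[1]} across these two images.")
    # Step 2: rate each entity's 'typicality' in each image separately.
    typicality = {
        name: query_vlm(
            images=[img],
            prompt=f"Given these differences: {diffs}\nRate how typical "
                   f"each of {entities} looks for its category.")
        for name, img in (("original", original_img),
                          ("mitigated", mitigated_img))
    }
    # Step 3: grounded comparative judgment over both images.
    return query_vlm(
        images=[original_img, mitigated_img],
        prompt=f"Given these assessments: {typicality}\nWhich image better "
               f"preserves each entity's distinct identity?")
```

Decomposing the judgment this way keeps each VLM call simple and lets the final verdict cite the intermediate findings rather than making a single opaque comparison.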
Promising Results and Future Directions
Experiments with the FLUX model demonstrated that DeLeaker consistently outperforms all evaluated baselines, including those that rely on external information, achieving effective leakage mitigation without compromising image fidelity or quality. Human evaluations strongly confirmed these findings, with raters judging DeLeaker's outputs as improvements in a clear majority of cases. An ablation study further revealed that self-identity strengthening and cross-entity image-text suppression are the method's most influential components.
The research also found that semantic leakage becomes more pronounced with increasing prompt complexity, validating the use of complex scenarios in the SLIM dataset as stress tests. DeLeaker’s ability to preserve image content and quality, even in cases without initial leakage, highlights its non-intrusive nature.
This work not only provides a practical, lightweight solution for semantic leakage in Text-to-Image models but also establishes a comprehensive foundation for its systematic study. The code and the SLIM dataset will be made publicly available, encouraging further research into more controlled and reliable generative models. Future work could expand the SLIM dataset to new domains, use it to train leakage classifiers, or fine-tune models to inherently avoid semantic leakage. The approach could also be extended to other modalities like 3D or video. You can read the full research paper here: DeLeaker: Dynamic Inference-Time Reweighting for Semantic Leakage Mitigation in Text-to-Image Models.