TLDR: The paper introduces InfoAug, a novel data augmentation technique for contrastive learning that uses mutual information to identify “twin patches” as positive samples. Unlike traditional methods that rely on augmented views of the same entity, InfoAug discovers cross-entity positive pairs by tracking patches in videos and estimating their mutual information. This approach, combined with a dual-branch training pipeline, consistently improves the performance of various state-of-the-art contrastive learning frameworks on image classification benchmarks.
Self-supervised learning, particularly contrastive learning, has made significant strides in teaching computers to understand images and videos without extensive human labeling. These methods typically work by bringing different augmented versions of the same image closer together in a learned representation space, while pushing apart representations of different images. This approach, known as ‘instance discrimination,’ helps models learn to be ‘view invariant’ – meaning they recognize an object regardless of minor changes like color or rotation.
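To make the instance-discrimination objective concrete, here is a minimal sketch of the InfoNCE-style loss that frameworks like SimCLR build on. This is an illustration rather than code from the paper; the temperature value and the assumption that positives sit on the diagonal of a batch-wise similarity matrix are standard simplifications:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.5):
    """Instance-discrimination (InfoNCE) loss: two augmented views of the
    same image are positives; every other image in the batch is a negative."""
    z1 = F.normalize(z1, dim=1)          # (N, D) embeddings of view 1
    z2 = F.normalize(z2, dim=1)          # (N, D) embeddings of view 2
    logits = z1 @ z2.t() / temperature   # (N, N) cosine-similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)
```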
However, the way positive samples are selected in traditional contrastive learning often relies on human assumptions about what constitutes a ‘positive pair’ (e.g., two different crops of the same image). The authors of a new research paper, “Mutual Information Guided Visual Contrastive Learning”, argue that human visual learning goes beyond just recognizing different views of the same entity. Humans can also identify relationships between different entities in a scene that are inherently connected, even if they aren’t identical.
Introducing InfoAug: A Mutual Information Approach
To address this, Hanyang Chen and Yanchao Yang propose a novel data augmentation technique called InfoAug. This method aims to unify positive sample determination by incorporating ‘cross-entity’ positive pairs based on their mutual information. Imagine two birds flying together in the sky; knowing the position of one bird reduces the uncertainty about the other. This shared information makes them ‘positive samples’ in a more natural, real-world sense, even though they are distinct entities.
InfoAug works by first splitting the initial frame of a video into multiple patches. For each patch, a representative point is tracked across subsequent video frames to capture its motion trajectory. By observing the trajectories of any two patches simultaneously, the system can empirically estimate the mutual information between them. The patch that exhibits the highest mutual information with a given patch is then identified as its ‘twin patch’ – a cross-entity positive sample.
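The paper's exact estimator is not reproduced here, but the following sketch shows one plausible way to carry out this step: discretize each tracked point's motion, estimate mutual information between pairs of symbol sequences with a plug-in histogram estimator, and pick the highest-scoring partner as the twin. The angular binning scheme and the names `trajectories` and `find_twin_patches` are illustrative assumptions:

```python
import numpy as np

def discretize_motion(traj, n_bins=8):
    """Quantize per-frame displacement directions of one tracked point
    into n_bins angular bins (an illustrative choice, not from the paper)."""
    d = np.diff(traj, axis=0)                      # (T-1, 2) frame-to-frame displacements
    angles = np.arctan2(d[:, 1], d[:, 0])          # direction of motion per frame
    return np.digitize(angles, np.linspace(-np.pi, np.pi, n_bins + 1)[1:-1])

def mutual_information(a, b, n_bins=8):
    """Histogram (plug-in) estimate of I(A; B) from two symbol sequences."""
    joint = np.zeros((n_bins, n_bins))
    for x, y in zip(a, b):
        joint[x, y] += 1
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0                                 # avoid log(0) terms
    return float((joint[nz] * np.log(joint[nz] / (px[:, None] * py[None, :])[nz])).sum())

def find_twin_patches(trajectories):
    """For each patch, pick the other patch whose trajectory shares the
    most mutual information with it (its 'twin patch')."""
    symbols = [discretize_motion(t) for t in trajectories]  # one sequence per patch
    twins = []
    for i in range(len(symbols)):
        mi = [mutual_information(symbols[i], symbols[j]) if j != i else -np.inf
              for j in range(len(symbols))]
        twins.append(int(np.argmax(mi)))
    return twins
```

Intuitively, two patches whose motion sequences are highly predictive of each other (the two birds flying together) yield a large estimate, while independently moving patches yield a value near zero.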
How InfoAug Enhances Learning
The core idea is to make the model ‘mutual information aware’ in addition to being ‘view invariant.’ To achieve this, InfoAug employs a ‘two-branch training’ pipeline. One branch handles the traditional view-based data augmentation, ensuring the model learns view-invariant features. The second branch incorporates the newly discovered twin patches, encouraging the model to learn representations that capture the mutual information between different parts of a scene. These two learning objectives are decoupled using separate projection heads, allowing each to be optimized effectively.
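The sketch below shows how such a decoupled two-branch objective could be wired, reusing the `info_nce_loss` function from earlier. The module names `head_view` and `head_twin`, the layer sizes, and the loss weight `lam` are hypothetical choices for illustration, not the authors' implementation:

```python
import torch.nn as nn

class InfoAugModel(nn.Module):
    """Shared encoder with two decoupled projection heads: one for the
    standard view-invariance objective, one for the twin-patch objective."""
    def __init__(self, encoder, dim=512, proj_dim=128):
        super().__init__()
        self.encoder = encoder
        self.head_view = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, proj_dim))
        self.head_twin = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, proj_dim))

def infoaug_loss(model, view1, view2, patch, twin, lam=1.0):
    """Total loss = view-invariance branch + twin-patch branch,
    each optimized through its own projection head."""
    z1 = model.head_view(model.encoder(view1))   # augmented view 1
    z2 = model.head_view(model.encoder(view2))   # augmented view 2
    p  = model.head_twin(model.encoder(patch))   # anchor patch
    t  = model.head_twin(model.encoder(twin))    # its twin patch
    return info_nce_loss(z1, z2) + lam * info_nce_loss(p, t)
```

Using separate heads lets the encoder serve both objectives without forcing the view-invariant and mutual-information-aware features into the same projection space.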
The researchers evaluated InfoAug across seven prominent state-of-the-art contrastive learning frameworks, including SimCLR, BYOL, and MoCo, on various image classification benchmarks like CIFAR-10, CIFAR-100, and STL-10. The results consistently showed that InfoAug improved the performance of every baseline-benchmark combination. This demonstrates InfoAug’s effectiveness as a framework-agnostic technique that can be integrated into existing contrastive learning pipelines.
Looking Ahead
While InfoAug shows promising results, the authors acknowledge limitations, particularly on large, in-the-wild video datasets where observations may be too sparse, or too corrupted by camera jitter, for reliable mutual information estimation. Future work could use more points to represent each patch for more robust estimation, and integrate InfoAug with temporal contrastive learning methods to create a truly unified approach that captures both spatial and temporal relationships within video sequences.
In essence, InfoAug offers a more natural and comprehensive way to define positive samples in contrastive learning, moving beyond simple augmented views to leverage the inherent relationships between different elements in a scene, guided by the principle of mutual information.


