TLDR: This research paper explores data augmentation techniques for improving natural disaster assessment using the CrisisMMD multimodal dataset, which combines text and images from social media. For visual data, diffusion-based methods like Real Guidance and DiffuseMix were used, showing benefits for convolutional models but mixed results for transformer-based models. Text augmentation involved back-translation and transformer-based paraphrasing, which generally improved performance, whereas image captioning-based augmentation surprisingly degraded it. The study also investigated multimodal and multi-view learning, confirming the superiority of combining text and images, but highlighting challenges in effectively integrating complex augmented views. Overall, the work demonstrates the potential of targeted augmentation strategies to build more robust disaster assessment systems.
Natural disasters strike with little warning, and timely, accurate information is crucial for effective humanitarian response. Social media platforms have emerged as a vital real-time source during these events, offering a flood of data from affected areas. However, leveraging this data effectively is challenging due to issues like class imbalance and limited sample sizes in existing datasets.
A recent study, titled "Multimodal Learning with Augmentation Techniques for Natural Disaster Assessment," by Adrian-Dinu Urse, Dumitru-Clementin Cercel, and Florin Pop from NUST POLITEHNICA Bucharest, explores advanced data augmentation techniques to enhance dataset diversity and improve model performance for natural disaster classification. The researchers focused on the CrisisMMD dataset, which combines both textual and visual information from disaster-related tweets.
Addressing Data Challenges with Augmentation
The core of this research lies in its innovative approach to data augmentation for both images and text. For visual data, the team investigated two diffusion-based methods: Real Guidance and DiffuseMix. Real Guidance subtly modifies original images to create realistic synthetic versions, doubling the training dataset size while maintaining context. DiffuseMix, a more advanced technique, uses prompt-based transformations, masked blending, and fractal-based modifications to generate diverse augmented images, specifically targeting underrepresented classes like “Affected Individuals” and “Infrastructure and Utility Damage.”
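As a rough illustration of the Real Guidance idea, the sketch below runs a Stable Diffusion image-to-image pipeline with low edit strength and a class-conditioned prompt, so the output stays close to the real photo. The checkpoint, prompt template, and strength value are assumptions for illustration, not the paper's exact configuration.

```python
# Hedged sketch of Real Guidance-style image augmentation.
# Model checkpoint, prompt wording, and strength are assumptions.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def augment_image(image_path: str, class_name: str) -> Image.Image:
    """Create a subtly modified synthetic copy guided by the real image."""
    original = Image.open(image_path).convert("RGB").resize((512, 512))
    # Class-conditioned prompt (assumed template, not the authors' exact prompt).
    prompt = f"a photo of {class_name} after a natural disaster"
    # Low strength keeps the result close to the original, preserving context.
    result = pipe(prompt=prompt, image=original, strength=0.3, guidance_scale=7.5)
    return result.images[0]
```

Each original training image would then be paired with one synthetic copy, which is how the dataset size is doubled.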
The impact of these image augmentations varied across different model architectures. Convolutional neural networks (like ResNet18 and ResNet50) generally benefited, showing improved accuracy and F1-scores. However, transformer-based models (like ViT and MambaViT) sometimes saw a decrease in performance, suggesting that the augmentations could introduce visual noise that interfered with their attention mechanisms.
For textual data, three strategies were employed to increase linguistic diversity. Back-translation passed each tweet through a chain of languages (English to French, French to German, German back to French, and finally French to English) to produce paraphrased versions. Paraphrasing with transformers used the Mistral-7B-Instruct model to rewrite tweets while preserving their meaning and social media style. The third method, caption-based augmentation, generated descriptive captions for the accompanying images using the BLIP-2 model and concatenated them with the original tweet text, enriching the textual input with visual context.
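A minimal back-translation sketch following the chain above is shown here; the Helsinki-NLP MarianMT checkpoints are assumptions, since the summary does not name the exact translation models used.

```python
# Hedged sketch of back-translation with MarianMT models (checkpoints assumed).
from transformers import MarianMTModel, MarianTokenizer

# One Hugging Face checkpoint per hop of the EN -> FR -> DE -> FR -> EN chain.
HOPS = [
    "Helsinki-NLP/opus-mt-en-fr",
    "Helsinki-NLP/opus-mt-fr-de",
    "Helsinki-NLP/opus-mt-de-fr",
    "Helsinki-NLP/opus-mt-fr-en",
]

def translate(texts, checkpoint):
    """Translate a batch of texts with a single MarianMT checkpoint."""
    tokenizer = MarianTokenizer.from_pretrained(checkpoint)
    model = MarianMTModel.from_pretrained(checkpoint)
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch, max_new_tokens=128)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

def back_translate(tweets):
    """Run tweets through every hop of the chain to obtain paraphrases."""
    texts = tweets
    for checkpoint in HOPS:
        texts = translate(texts, checkpoint)
    return texts

print(back_translate(["Flooding has destroyed several bridges in the area."]))
```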
Back-translation and paraphrasing generally led to slight but consistent improvements in text classification models. However, caption-based augmentation surprisingly reduced performance, likely due to a mismatch between the augmented training data and the unaugmented test data, causing models to overfit to features present only during training.
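For concreteness, caption-based augmentation could be implemented roughly as follows, assuming BLIP-2 through Hugging Face Transformers; the checkpoint and the separator used when concatenating the caption onto the tweet are illustrative choices, not the authors' exact setup.

```python
# Hedged sketch of caption-based text augmentation with BLIP-2.
# Checkpoint and concatenation format are assumptions.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

def caption_augment(tweet_text: str, image_path: str) -> str:
    """Append a generated image caption to the original tweet text."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
    generated = model.generate(**inputs, max_new_tokens=30)
    caption = processor.batch_decode(generated, skip_special_tokens=True)[0].strip()
    # "[CAPTION]" is an assumed separator token, not the paper's format.
    return f"{tweet_text} [CAPTION] {caption}"
```

If such captions are added only at training time, the train/test mismatch described above can arise, since test tweets never carry the extra caption features.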
Combining Modalities for Better Understanding
Beyond unimodal improvements, the study also explored multimodal and multi-view learning setups, combining text and image information. Multimodal classification, which integrates both textual and visual features, consistently outperformed unimodal approaches on the original CrisisMMD dataset. When back-translated text was combined with Real Guidance image augmentations, some multimodal models, like RoBERTa-ViT, showed significant gains. However, the best-performing model, RoBERTa-MambaViT, sometimes experienced a slight performance decrease with augmentations, indicating sensitivity to the introduced variations.
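A simple late-fusion setup in the spirit of RoBERTa-ViT might look like the sketch below, where the text and image [CLS] features are concatenated and fed to a small classification head; the fusion head, hidden sizes, and checkpoints are assumptions rather than the paper's reported architecture.

```python
# Hedged sketch of late-fusion multimodal classification (RoBERTa + ViT).
# Fusion head and checkpoints are assumptions.
import torch
import torch.nn as nn
from transformers import RobertaModel, ViTModel

class RobertaViTFusion(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.text_encoder = RobertaModel.from_pretrained("roberta-base")
        self.image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        fused_dim = (self.text_encoder.config.hidden_size
                     + self.image_encoder.config.hidden_size)
        self.classifier = nn.Sequential(
            nn.Linear(fused_dim, 512), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(512, num_classes),
        )

    def forward(self, input_ids, attention_mask, pixel_values):
        # Use the first-token hidden state of each encoder as a modality summary.
        text_feat = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]
        image_feat = self.image_encoder(pixel_values=pixel_values).last_hidden_state[:, 0]
        return self.classifier(torch.cat([text_feat, image_feat], dim=-1))
```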
Multi-view learning, a more complex approach that incorporates original and augmented data representations during training, did not outperform baseline multimodal models in this study. The researchers suggest that the increased complexity of these models might require more extensive training, and the mismatch between multi-view training and classic multimodal inference during evaluation could limit their benefits.
Conclusion
This research highlights the potential of diffusion-based image augmentations and effective text augmentation techniques to improve disaster assessment models, particularly for underrepresented classes. While augmentations can significantly enhance model performance, their effectiveness depends on the specific model architecture and on careful integration. The study also underscores the challenges of effectively combining multiple data sources, especially with complex learning strategies like multi-view learning, and points toward future work on refining augmentation filtering and evaluating on broader disaster-related datasets.


