Multi-modal Robustness Analysis Against Language and Visual Perturbations


Different real-world perturbations used in this study.


Joint visual and language modeling on large-scale datasets has recently shown good progress in multi-modal tasks compared to single-modal learning. However, the robustness of these approaches against real-world perturbations has not been studied. In this work, we perform the first extensive robustness study of such models against various real-world perturbations, focusing on video and language. We focus on text-to-video retrieval and propose two large-scale benchmark datasets, MSRVTT-P and YouCook2-P, which utilize 90 different visual and 35 different textual perturbations. The study reveals some interesting findings: 1) The studied models are more robust when text is perturbed than when video is perturbed. 2) The transformer text encoder is more robust to non-semantic-changing text perturbations and to visual perturbations than word embedding approaches. 3) Using two-branch encoders in isolation is typically more robust than architectures that use cross-attention. We hope this study will serve as a benchmark and guide future research in robust multimodal learning.

Sample perturbations

Severity increasing from left to right.

Box jumbling

Random rotation

Motion blur
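
To make the idea of a severity-controlled visual perturbation concrete, here is a minimal sketch of horizontal motion blur on a single frame, using only NumPy. It is an illustration, not the benchmark's actual implementation: the function name `motion_blur`, the box-kernel formulation, and the `SEVERITY_TO_KERNEL` mapping are all assumptions made for this example.

```python
import numpy as np

# Hypothetical mapping from severity level to blur kernel width (assumption,
# not taken from the benchmark): a wider kernel gives a stronger blur.
SEVERITY_TO_KERNEL = {1: 3, 2: 5, 3: 9, 4: 13, 5: 17}


def motion_blur(frame: np.ndarray, kernel_size: int) -> np.ndarray:
    """Blur a frame of shape (H, W, C) along the horizontal axis.

    Each output pixel is the mean of `kernel_size` horizontally adjacent
    input pixels, which approximates linear camera or object motion.
    """
    if kernel_size < 1:
        raise ValueError("kernel_size must be at least 1")
    pad_left = (kernel_size - 1) // 2
    pad_right = kernel_size - 1 - pad_left
    # Replicate edge pixels so the output keeps the input width.
    padded = np.pad(
        frame.astype(np.float64),
        ((0, 0), (pad_left, pad_right), (0, 0)),
        mode="edge",
    )
    width = frame.shape[1]
    blurred = np.zeros(frame.shape, dtype=np.float64)
    # Sum the shifted copies of the frame, then divide to get the window mean.
    for offset in range(kernel_size):
        blurred += padded[:, offset:offset + width, :]
    return blurred / kernel_size


def perturb_video(video: np.ndarray, severity: int) -> np.ndarray:
    """Apply the same blur to every frame of a video of shape (T, H, W, C)."""
    kernel_size = SEVERITY_TO_KERNEL[severity]
    return np.stack([motion_blur(frame, kernel_size) for frame in video])
```

Calling `perturb_video(video, severity=3)` would blur every frame with a 9-pixel window under the assumed mapping; higher severities widen the window.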

Robustness analysis

A performance and robustness visualization of multimodal models on YouCook2-P and MSRVTT-P. y-axis: relative robustness (lower is better); x-axis: R@5 on text-to-video retrieval; the size of each circle indicates FLOPs. The models used are VideoCLIP, UniVL, COOT, and HowTo100M MIL. The results vary by dataset: for example, VideoCLIP trained from scratch is both more robust and a better performer on MSRVTT than on YouCook2. This relationship indicates a difference in how models handle clips from long, complex activities compared to short videos of a simple activity. The top performers are consistently VideoCLIP and UniVL, which are also the larger models.
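
The two axes in the figure can be illustrated with a short sketch. R@5 is the fraction of text queries whose correct video appears among the five highest-scoring videos. The sketch assumes a square similarity matrix in which query `i` matches video `i`. The exact definition of relative robustness is not given on this page, so `relative_drop` below is only one plausible reading (the fraction of clean performance lost under perturbation, for which lower is better), and both function names are hypothetical.

```python
import numpy as np


def recall_at_k(sim: np.ndarray, k: int = 5) -> float:
    """Text-to-video recall@k.

    sim[i, j] is the similarity between text query i and video j.
    Assumption: the correct video for query i is video i (the diagonal).
    """
    n = sim.shape[0]
    correct = sim[np.arange(n), np.arange(n)]
    # Rank of the correct video = number of videos scored strictly higher.
    ranks = (sim > correct[:, None]).sum(axis=1)
    return float((ranks < k).mean())


def relative_drop(clean_score: float, perturbed_score: float) -> float:
    """Fraction of clean performance lost under perturbation.

    Assumed definition for illustration only; 0.0 means no degradation.
    """
    return (clean_score - perturbed_score) / clean_score


# Toy example: three queries, three videos.
sim_clean = np.array([
    [0.9, 0.1, 0.2],
    [0.0, 0.8, 0.1],
    [0.3, 0.2, 0.7],
])
# After a perturbation, query 0 now scores two wrong videos above the right one.
sim_perturbed = np.array([
    [0.1, 0.9, 0.8],
    [0.0, 0.8, 0.1],
    [0.3, 0.2, 0.7],
])

r1_clean = recall_at_k(sim_clean, k=1)
r1_perturbed = recall_at_k(sim_perturbed, k=1)
drop = relative_drop(r1_clean, r1_perturbed)
```

In this toy case, R@1 falls from all three queries correct to two of three, so one third of the clean performance is lost.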