Despite their strong performance in multimodal emotion reasoning, existing Multimodal Large Language Models (MLLMs) often overlook scenarios involving emotion conflicts, where emotional cues from different modalities are inconsistent. To fill this gap, we first introduce CA-MER, a new benchmark designed to examine MLLMs under realistic emotion conflicts. It consists of three subsets: video-aligned, audio-aligned, and consistent, in which either a single modality or all modalities reflect the true emotion.
Evaluations on our CA-MER reveal that current state-of-the-art emotion MLLMs systematically over-rely on audio signals during emotion conflicts, neglecting critical cues from the visual modality. To mitigate this bias, we propose MoSEAR, a parameter-efficient framework that promotes balanced modality integration.
MoSEAR consists of two modules: (1) MoSE (Modality-Specific Experts), which fine-tunes modality-specific LoRA experts with a gating network and dynamic modality weights to balance audio-visual contributions during training, and (2) AR (Attention Reallocation), which reallocates attention across modalities at inference time without introducing additional parameters.
Our framework offers two key advantages: it mitigates emotion conflicts and improves performance on consistent samples, without incurring a trade-off between audio and visual modalities. Experiments on multiple benchmarks, including MER2023, EMER, DFEW, and our CA-MER, demonstrate that MoSEAR achieves state-of-the-art performance, particularly under modality conflict conditions.
Overview of MoSEAR: (a) MoSE (Modality-Specific Experts): We implement modality-specific LoRA experts with shared A matrices and multiple expert B matrices per modality. A gating network selects the most suitable expert for each modality, and dynamic modality weights balance audio-visual contributions during training. (b) AR (Attention Reallocation): During inference, we monitor attention distributions across modalities and dynamically reallocate attention weights to mitigate modality bias without introducing additional parameters.
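To make the MoSE design above concrete, here is a minimal PyTorch sketch, assuming a standard LoRA parameterization with a shared down-projection A per modality, several expert up-projections B per modality, a softmax gate, and learnable modality weights. The class and argument names (MoSELinear, modality_masks, soft expert routing) are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the MoSE idea: shared A matrix and multiple expert B matrices
# per modality, a gating network over experts, and dynamic modality weights.
import torch
import torch.nn as nn

class MoSELinear(nn.Module):
    def __init__(self, in_dim, out_dim, rank=8, num_experts=4, modalities=("audio", "visual")):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)   # frozen pretrained projection
        self.base.weight.requires_grad_(False)
        # Shared low-rank A matrix (down-projection) per modality
        self.A = nn.ParameterDict({
            m: nn.Parameter(torch.randn(rank, in_dim) * 0.01) for m in modalities
        })
        # Multiple expert B matrices (up-projections) per modality
        self.B = nn.ParameterDict({
            m: nn.Parameter(torch.zeros(num_experts, out_dim, rank)) for m in modalities
        })
        # Gating network: scores experts for each token of its modality
        self.gate = nn.ModuleDict({m: nn.Linear(in_dim, num_experts) for m in modalities})
        # Dynamic modality weights balancing audio-visual contributions during training
        self.modality_logits = nn.Parameter(torch.zeros(len(modalities)))
        self.modalities = modalities

    def forward(self, x, modality_masks):
        # x: (batch, seq, in_dim); modality_masks[m]: (batch, seq) bools for modality-m tokens
        out = self.base(x)
        w = torch.softmax(self.modality_logits, dim=0)
        for i, m in enumerate(self.modalities):
            mask = modality_masks[m].unsqueeze(-1).float()
            h = x @ self.A[m].t()                              # (batch, seq, rank)
            gate = torch.softmax(self.gate[m](x), dim=-1)      # (batch, seq, num_experts)
            # Soft mixture over expert B matrices (top-1 routing would also fit the description)
            expert_out = torch.einsum("bse,eor,bsr->bso", gate, self.B[m], h)
            out = out + w[i] * mask * expert_out
        return out
```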
CA-MER Benchmark Overview: CA-MER is designed to evaluate MLLMs' ability to handle emotion conflicts. The benchmark includes a systematic annotation pipeline and diverse conflict scenarios across three subsets: video-aligned, audio-aligned, and consistent. Each subset is carefully curated to reflect realistic emotion conflicts where different modalities convey inconsistent emotional cues.
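A hypothetical evaluation loop over the three CA-MER subsets is sketched below. The sample fields and the predict_emotion call are placeholders for whatever interface the released benchmark and model expose; only the subset names come from the benchmark description.

```python
# Hypothetical per-subset evaluation on CA-MER; field names and model API are assumptions.
from collections import defaultdict

def evaluate(model, samples):
    # samples: iterable of dicts with assumed keys:
    #   "subset": "video-aligned" | "audio-aligned" | "consistent"
    #   "video", "audio": modality inputs; "label": ground-truth emotion
    correct, total = defaultdict(int), defaultdict(int)
    for s in samples:
        pred = model.predict_emotion(video=s["video"], audio=s["audio"])  # placeholder API
        total[s["subset"]] += 1
        correct[s["subset"]] += int(pred == s["label"])
    # Per-subset accuracy exposes modality bias: a model that over-relies on audio
    # scores high on the audio-aligned subset but drops on the video-aligned one.
    return {k: correct[k] / total[k] for k in total}
```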
Audio Bias Problem: Our analysis on CA-MER reveals two key findings: current emotion MLLMs systematically over-rely on audio signals during emotion conflicts, and in doing so they neglect critical cues from the visual modality.
Our Solution: MoSEAR addresses this through modality-specific experts (MoSE) for training and attention reallocation (AR) for inference, effectively mitigating bias without the trade-offs of simple token balancing.
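A minimal sketch of the attention-reallocation (AR) step follows, assuming access to the softmax attention weights and per-modality key masks at inference. It measures the attention mass on each modality's tokens and interpolates it toward an even audio-visual split; the even target and the alpha factor are illustrative choices, not values from the paper.

```python
# Training-free attention reallocation at inference: rescale per-modality attention
# mass toward a balanced split, leaving attention on other (e.g. text) tokens unchanged.
import torch

def reallocate_attention(attn, audio_mask, visual_mask, alpha=0.5):
    # attn: (batch, heads, query_len, key_len) softmax-normalized attention weights
    # audio_mask / visual_mask: (batch, key_len) bools marking each modality's tokens
    a_mask = audio_mask[:, None, None, :].float()
    v_mask = visual_mask[:, None, None, :].float()

    a_mass = (attn * a_mask).sum(-1, keepdim=True)   # attention mass on audio tokens
    v_mass = (attn * v_mask).sum(-1, keepdim=True)   # attention mass on visual tokens
    total = a_mass + v_mass + 1e-8

    # Interpolate each modality's mass toward an even split of the audio-visual total
    new_a = alpha * 0.5 * total + (1 - alpha) * a_mass
    new_v = alpha * 0.5 * total + (1 - alpha) * v_mass

    scaled = attn * (a_mask * new_a / (a_mass + 1e-8) + v_mask * new_v / (v_mass + 1e-8))
    scaled = scaled + attn * (1.0 - a_mask - v_mask)  # keep non-audio/visual attention as-is
    # Row sums are preserved up to numerical error; renormalize for safety
    return scaled / scaled.sum(-1, keepdim=True)
```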
Qualitative Results: MoSEAR effectively handles emotion conflicts by correctly identifying and reasoning about inconsistent emotional cues across modalities, while baseline methods often fail in conflict scenarios.
@inproceedings{han2025mosear,
title={Benchmarking and Bridging Emotion Conflicts for Multimodal Emotion Reasoning},
author={Han, Zhiyuan and Zhu, Beier and Xu, Yanlong and Song, Peipei and Yang, Xun},
booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
year={2025}
}
We thank the following projects for their excellent open-source contributions: Emotion-LLaMA, MiniGPT-v2, and AffectGPT.