Benchmarking and Bridging Emotion Conflicts for Multimodal Emotion Reasoning

Zhiyuan Han1, Beier Zhu2, Yanlong Xu1, Peipei Song1, Xun Yang1
1University of Science and Technology of China, 2Nanyang Technological University
ACM International Conference on Multimedia (ACM MM) 2025 Oral
MoSEAR Teaser

Example of an emotion conflict case with reasoning outputs from Emotion-LLaMA and our MoSEAR.
(a) A video-aligned sample in which the character's facial expression conveys clear disappointment.
(b) Under this emotion conflict, our MoSEAR produces the correct emotion reasoning, while Emotion-LLaMA does not.

💡 Tip: Movie: Coming Home, directed by Yimou Zhang. Background: The man's beloved wife suffers from amnesia and no longer recognizes him. Despite his calm tone, his facial expression reveals sorrow and suppressed emotion.

Video

Abstract

Despite their strong performance in multimodal emotion reasoning, existing Multimodal Large Language Models (MLLMs) often overlook scenarios involving emotion conflicts, where emotional cues from different modalities are inconsistent. To fill this gap, we first introduce CA-MER, a new benchmark designed to examine MLLMs under realistic emotion conflicts. It consists of three subsets: video-aligned, audio-aligned, and consistent, in which either a single modality (video or audio) or all modalities reflect the true emotion.

Evaluations on CA-MER reveal that current state-of-the-art emotion MLLMs systematically over-rely on audio signals during emotion conflicts, neglecting critical cues from the visual modality. To mitigate this bias, we propose MoSEAR, a parameter-efficient framework that promotes balanced modality integration.

MoSEAR consists of two modules: (1) MoSE (Modality-Specific Experts), which learns modality-specific LoRA experts with dynamic modality weighting during training, and (2) AR (Attention Reallocation), which rebalances cross-modal attention at inference time without introducing additional parameters.

Our framework offers two key advantages: it mitigates emotion conflicts and improves performance on consistent samples—without incurring a trade-off between audio and visual modalities. Experiments on multiple benchmarks—including MER2023, EMER, DFEW, and our CA-MER—demonstrate that MoSEAR achieves state-of-the-art performance, particularly under modality conflict conditions.

Framework Architecture

MoSEAR Architecture

Overview of MoSEAR: (a) MoSE (Modality-Specific Experts): We implement modality-specific LoRA experts with shared A matrices and multiple expert B matrices per modality. A gating network selects the most suitable expert for each modality, and dynamic modality weights balance audio-visual contributions during training. (b) AR (Attention Reallocation): During inference, we monitor attention distributions across modalities and dynamically reallocate attention weights to mitigate modality bias without introducing additional parameters.
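
The following PyTorch sketch illustrates what a MoSE-style layer could look like under the description above: one shared low-rank A matrix and several expert B matrices per modality, a per-modality gating network, and a dynamic modality weight. The class name, dimensions, and top-1 gating rule are our assumptions, not the authors' released implementation.

```python
# Minimal PyTorch sketch of a MoSE-style layer, based only on the caption above.
# The class name, dimensions, and top-1 gating rule are illustrative assumptions,
# not the authors' released implementation.
import torch
import torch.nn as nn


class MoSELinear(nn.Module):
    """Frozen base projection + modality-specific LoRA experts.

    Each modality shares one low-rank A matrix and owns several expert B
    matrices; a per-modality gating network picks one expert per sample.
    """

    def __init__(self, d_in, d_out, rank=8, num_experts=4,
                 modalities=("audio", "visual")):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        for p in self.base.parameters():           # backbone weights stay frozen
            p.requires_grad_(False)
        self.shared_A = nn.ParameterDict({         # one shared A matrix per modality
            m: nn.Parameter(torch.randn(rank, d_in) * 0.01) for m in modalities
        })
        self.expert_B = nn.ParameterDict({         # num_experts B matrices per modality
            m: nn.Parameter(torch.zeros(num_experts, d_out, rank)) for m in modalities
        })
        self.gate = nn.ModuleDict({                # gating network per modality
            m: nn.Linear(d_in, num_experts) for m in modalities
        })

    def forward(self, x, modality, modality_weight=1.0):
        # x: (batch, tokens, d_in) tokens of a single modality.
        logits = self.gate[modality](x.mean(dim=1))        # (batch, num_experts)
        expert_idx = logits.argmax(dim=-1)                 # hard top-1 choice; a soft or
                                                           # straight-through gate would be
                                                           # used for end-to-end training
        B = self.expert_B[modality][expert_idx]            # (batch, d_out, rank)
        low_rank = torch.einsum("rd,btd->btr", self.shared_A[modality], x)
        delta = torch.einsum("bor,btr->bto", B, low_rank)  # LoRA update
        return self.base(x) + modality_weight * delta      # dynamic modality weight
```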

CA-MER Benchmark

CA-MER Benchmark

CA-MER Benchmark Overview: CA-MER is designed to evaluate MLLMs' ability to handle emotion conflicts. The benchmark is built with a systematic annotation pipeline and covers diverse scenarios across three subsets: video-aligned, audio-aligned, and consistent. The two conflict subsets are carefully curated so that different modalities convey inconsistent emotional cues, while the consistent subset contains samples in which all modalities agree.
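
As a rough illustration of how the three subsets could be scored, the sketch below reports per-subset accuracy. The annotation file name, JSON fields, and predict() interface are hypothetical placeholders; the released benchmark defines the actual format.

```python
# Hypothetical evaluation loop over the three CA-MER subsets. The annotation
# file name, JSON fields, and predict() interface are placeholders, not the
# benchmark's actual format.
import json
from collections import defaultdict

SUBSETS = ("video_aligned", "audio_aligned", "consistent")


def evaluate(predict, annotation_file="ca_mer_annotations.json"):
    """predict(sample) -> predicted emotion label (str)."""
    with open(annotation_file) as f:
        samples = json.load(f)

    correct, total = defaultdict(int), defaultdict(int)
    for sample in samples:
        subset = sample["subset"]                  # one of SUBSETS
        total[subset] += 1
        correct[subset] += int(predict(sample) == sample["emotion"])

    # Conflict robustness is read from the video/audio-aligned splits,
    # consistent-sample performance from the third.
    return {s: correct[s] / max(total[s], 1) for s in SUBSETS}
```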

Understanding MLLM Reasoning in Emotion Conflicts

Attention Analysis

Audio Bias Problem: Our analysis reveals that current emotion MLLMs systematically over-rely on audio signals during emotion conflicts. We identify two key findings: (1) when emotional cues conflict, the models' attention concentrates on audio tokens while critical visual cues are neglected; (2) simply rebalancing token contributions across modalities mitigates this bias only at the cost of an audio-visual performance trade-off. A minimal probe of finding (1) is sketched below.
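
The sketch below shows one way such a bias could be probed, assuming access to a decoder's attention weights; the tensor layout and token slices are illustrative, not tied to a specific model's API.

```python
# Illustrative probe of modality bias: sum the attention that generated answer
# tokens pay to audio vs. visual tokens. The tensor layout and token slices are
# assumptions about a generic decoder, not a specific model's API.
import torch


def modality_attention_mass(attn, audio_slice, visual_slice, answer_slice):
    """attn: (layers, heads, query_len, key_len) attention weights."""
    ans = attn[:, :, answer_slice, :]                      # queries = answer tokens
    audio_mass = ans[..., audio_slice].sum(dim=-1).mean().item()
    visual_mass = ans[..., visual_slice].sum(dim=-1).mean().item()
    return audio_mass, visual_mass


# Toy usage: uniform attention over 60 audio keys and 40 visual keys.
attn = torch.full((32, 16, 8, 100), 1.0 / 100)
print(modality_attention_mass(attn,
                              audio_slice=slice(0, 60),
                              visual_slice=slice(60, 100),
                              answer_slice=slice(4, 8)))   # -> (0.6, 0.4)
```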

Our Solution: MoSEAR addresses this through modality-specific experts (MoSE) for training and attention reallocation (AR) for inference, effectively mitigating bias without the trade-offs of simple token balancing.
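
Below is a simplified sketch of an inference-time reallocation step in the spirit of AR, based on our reading of the caption above: when the audio share of a query's attention crosses a threshold, mass is shifted toward the visual tokens and the row is renormalized, with no extra parameters. The threshold tau and shift rule alpha are our simplifications, not the paper's exact formulation.

```python
# Simplified sketch of an inference-time reallocation step in the spirit of AR:
# when the audio share of a query's attention exceeds a threshold, attention
# mass is shifted toward the visual tokens and the row is renormalized. No new
# parameters are introduced; tau and alpha are illustrative choices, not the
# paper's exact formulation.
import torch


def reallocate_attention(attn_row, audio_slice, visual_slice, tau=0.6, alpha=0.5):
    """attn_row: (key_len,) attention distribution of one query token."""
    attn = attn_row.clone()
    audio_mass = attn[audio_slice].sum()
    visual_mass = attn[visual_slice].sum()
    if audio_mass > tau * (audio_mass + visual_mass):        # audio-dominated step
        shift = alpha * (audio_mass - visual_mass)           # mass moved to vision
        attn[audio_slice] *= (audio_mass - shift) / audio_mass.clamp_min(1e-8)
        attn[visual_slice] *= (visual_mass + shift) / visual_mass.clamp_min(1e-8)
    return attn / attn.sum().clamp_min(1e-8)                 # keep a valid distribution
```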

Experimental Results

Performance on CA-MER Benchmark

CA-MER Results

Qualitative Comparison

Qualitative Example 1 Qualitative Example 2

Qualitative Results: MoSEAR effectively handles emotion conflicts by correctly identifying and reasoning about inconsistent emotional cues across modalities, while baseline methods often fail in conflict scenarios.

Citation

@inproceedings{han2025mosear,
  title={Benchmarking and Bridging Emotion Conflicts for Multimodal Emotion Reasoning},
  author={Han, Zhiyuan and Zhu, Beier and Xu, Yanlong and Song, Peipei and Yang, Xun},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  year={2025}
}

Acknowledgements

We thank the following projects for their excellent open-source contributions: Emotion-LLaMA, MiniGPT-v2, and AffectGPT.