Despite their strong performance in multimodal emotion reasoning, existing Multimodal Large Language Models (MLLMs) often overlook scenarios involving emotion conflicts, where emotional cues from different modalities are inconsistent. To fill this gap, we first introduce CA-MER, a new benchmark designed to examine MLLMs under realistic emotion conflicts. It consists of three subsets: video-aligned, audio-aligned, and consistent, in which either a single modality or all modalities reflect the true emotion.
Evaluations on our CA-MER reveal that current state-of-the-art emotion MLLMs systematically over-rely on audio signals during emotion conflicts, neglecting critical cues from the visual modality. To mitigate this bias, we propose MoSEAR, a parameter-efficient framework that promotes balanced modality integration.
MoSEAR consists of two modules: (1) MoSE (Modality-Specific Experts), which fine-tunes modality-specific LoRA experts with a gating network and dynamic modality weights to balance audio-visual contributions during training, and (2) AR (Attention Reallocation), which reallocates attention across modalities at inference time without introducing additional parameters.
Our framework offers two key advantages: it mitigates emotion conflicts and improves performance on consistent samples, without incurring a trade-off between audio and visual modalities. Experiments on multiple benchmarks, including MER2023, EMER, DFEW, and our CA-MER, demonstrate that MoSEAR achieves state-of-the-art performance, particularly under modality conflict conditions.
Overview of MoSEAR: (a) MoSE (Modality-Specific Experts): We implement modality-specific LoRA experts with shared A matrices and multiple expert B matrices per modality. A gating network selects the most suitable expert for each modality, and dynamic modality weights balance audio-visual contributions during training. (b) AR (Attention Reallocation): During inference, we monitor attention distributions across modalities and dynamically reallocate attention weights to mitigate modality bias without introducing additional parameters.
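To make the MoSE design above concrete, here is a minimal PyTorch sketch, assuming a standard LoRA parameterization with a shared down-projection A per modality, several expert up-projections B per modality, a softmax gate, and learnable modality weights. The class and argument names (MoSELinear, modality_masks, soft expert routing) are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the MoSE idea: shared A matrix and multiple expert B matrices
# per modality, a gating network over experts, and dynamic modality weights.
import torch
import torch.nn as nn

class MoSELinear(nn.Module):
    def __init__(self, in_dim, out_dim, rank=8, num_experts=4, modalities=("audio", "visual")):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)   # frozen pretrained projection
        self.base.weight.requires_grad_(False)
        # Shared low-rank A matrix (down-projection) per modality
        self.A = nn.ParameterDict({
            m: nn.Parameter(torch.randn(rank, in_dim) * 0.01) for m in modalities
        })
        # Multiple expert B matrices (up-projections) per modality
        self.B = nn.ParameterDict({
            m: nn.Parameter(torch.zeros(num_experts, out_dim, rank)) for m in modalities
        })
        # Gating network: scores experts for each token of its modality
        self.gate = nn.ModuleDict({m: nn.Linear(in_dim, num_experts) for m in modalities})
        # Dynamic modality weights balancing audio-visual contributions during training
        self.modality_logits = nn.Parameter(torch.zeros(len(modalities)))
        self.modalities = modalities

    def forward(self, x, modality_masks):
        # x: (batch, seq, in_dim); modality_masks[m]: (batch, seq) bools for modality-m tokens
        out = self.base(x)
        w = torch.softmax(self.modality_logits, dim=0)
        for i, m in enumerate(self.modalities):
            mask = modality_masks[m].unsqueeze(-1).float()
            h = x @ self.A[m].t()                              # (batch, seq, rank)
            gate = torch.softmax(self.gate[m](x), dim=-1)      # (batch, seq, num_experts)
            # Soft mixture over expert B matrices (top-1 routing would also fit the description)
            expert_out = torch.einsum("bse,eor,bsr->bso", gate, self.B[m], h)
            out = out + w[i] * mask * expert_out
        return out
```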
CA-MER Benchmark Overview: CA-MER is designed to evaluate MLLMs' ability to handle emotion conflicts. The benchmark includes a systematic annotation pipeline and diverse conflict scenarios across three subsets: video-aligned, audio-aligned, and consistent. Each subset is carefully curated to reflect realistic emotion conflicts where different modalities convey inconsistent emotional cues.
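A hypothetical evaluation loop over the three CA-MER subsets is sketched below. The sample fields and the predict_emotion call are placeholders for whatever interface the released benchmark and model expose; only the subset names come from the benchmark description.

```python
# Hypothetical per-subset evaluation on CA-MER; field names and model API are assumptions.
from collections import defaultdict

def evaluate(model, samples):
    # samples: iterable of dicts with assumed keys:
    #   "subset": "video-aligned" | "audio-aligned" | "consistent"
    #   "video", "audio": modality inputs; "label": ground-truth emotion
    correct, total = defaultdict(int), defaultdict(int)
    for s in samples:
        pred = model.predict_emotion(video=s["video"], audio=s["audio"])  # placeholder API
        total[s["subset"]] += 1
        correct[s["subset"]] += int(pred == s["label"])
    # Per-subset accuracy exposes modality bias: a model that over-relies on audio
    # scores high on the audio-aligned subset but drops on the video-aligned one.
    return {k: correct[k] / total[k] for k in total}
```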
Audio Bias Problem: Our analysis on CA-MER reveals two key findings: current emotion MLLMs systematically over-rely on audio signals during emotion conflicts, and in doing so they neglect critical cues from the visual modality.
Our Solution: MoSEAR addresses this through modality-specific experts (MoSE) for training and attention reallocation (AR) for inference, effectively mitigating bias without the trade-offs of simple token balancing.
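A minimal sketch of the attention-reallocation (AR) step follows, assuming access to the softmax attention weights and per-modality key masks at inference. It measures the attention mass on each modality's tokens and interpolates it toward an even audio-visual split; the even target and the alpha factor are illustrative choices, not values from the paper.

```python
# Training-free attention reallocation at inference: rescale per-modality attention
# mass toward a balanced split, leaving attention on other (e.g. text) tokens unchanged.
import torch

def reallocate_attention(attn, audio_mask, visual_mask, alpha=0.5):
    # attn: (batch, heads, query_len, key_len) softmax-normalized attention weights
    # audio_mask / visual_mask: (batch, key_len) bools marking each modality's tokens
    a_mask = audio_mask[:, None, None, :].float()
    v_mask = visual_mask[:, None, None, :].float()

    a_mass = (attn * a_mask).sum(-1, keepdim=True)   # attention mass on audio tokens
    v_mass = (attn * v_mask).sum(-1, keepdim=True)   # attention mass on visual tokens
    total = a_mass + v_mass + 1e-8

    # Interpolate each modality's mass toward an even split of the audio-visual total
    new_a = alpha * 0.5 * total + (1 - alpha) * a_mass
    new_v = alpha * 0.5 * total + (1 - alpha) * v_mass

    scaled = attn * (a_mask * new_a / (a_mass + 1e-8) + v_mask * new_v / (v_mass + 1e-8))
    scaled = scaled + attn * (1.0 - a_mask - v_mask)  # keep non-audio/visual attention as-is
    # Row sums are preserved up to numerical error; renormalize for safety
    return scaled / scaled.sum(-1, keepdim=True)
```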
Qualitative Results: MoSEAR effectively handles emotion conflicts by correctly identifying and reasoning about inconsistent emotional cues across modalities, while baseline methods often fail in conflict scenarios.
@inproceedings{han2025mosear,
title={Benchmarking and Bridging Emotion Conflicts for Multimodal Emotion Reasoning},
author={Han, Zhiyuan and Zhu, Beier and Xu, Yanlong and Song, Peipei and Yang, Xun},
booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
year={2025}
}
We thank the following projects for their excellent open-source contributions: Emotion-LLaMA, MiniGPT-v2, and AffectGPT.