MLLM Can See? Dynamic Correction Decoding for Hallucination Mitigation

Published in arXiv preprint, 2024

Multimodal Large Language Models (MLLMs) have shown remarkable capabilities in vision-language tasks. However, they often suffer from hallucinations: generating text that contradicts the visual input or describes objects and details that are not actually present in it. This is particularly problematic in applications that require factual accuracy.

This paper presents Dynamic Correction Decoding (DCD), a novel decoding strategy that actively leverages visual information to mitigate hallucinations. DCD operates in two phases: (1) hallucination detection through cross-modal consistency checking, and (2) dynamic correction by generating alternative tokens when inconsistencies are detected.
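To make the two-phase loop concrete, here is a minimal sketch of what a single decoding step could look like. This is not the authors' implementation: the cosine-similarity consistency score, the threshold `tau`, the `top_k` re-ranking, and the function name `dcd_decode_step` are all illustrative assumptions; the paper only specifies cross-modal consistency checking followed by generation of alternative tokens.

```python
import torch

def dcd_decode_step(lm_logits, token_embeds, image_feats, tau=0.35, top_k=10):
    """Hypothetical sketch of one detection-and-correction decoding step.

    lm_logits:    [vocab] next-token logits from the MLLM's language head
    token_embeds: [vocab, d] token embeddings assumed to live in a shared
                  vision-language space
    image_feats:  [d] pooled visual features for the current image
    tau:          alignment threshold below which the candidate token is
                  treated as a potential hallucination (made-up value)
    """
    # Phase 1: detection. Check the greedy candidate against the image.
    candidate = int(torch.argmax(lm_logits))
    align = torch.cosine_similarity(token_embeds[candidate], image_feats, dim=0)
    if align >= tau:
        return candidate  # consistent with the visual input, keep it

    # Phase 2: correction. Re-rank the top-k alternatives by combining the
    # language-model log-probability with the vision-language alignment score.
    topk_logits, topk_ids = torch.topk(lm_logits, top_k)
    align_scores = torch.cosine_similarity(
        token_embeds[topk_ids], image_feats.unsqueeze(0), dim=-1
    )
    combined = torch.log_softmax(topk_logits, dim=0) + align_scores
    return int(topk_ids[torch.argmax(combined)])
```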

Key contributions include:

  • A real-time hallucination detection mechanism based on vision-language alignment scores
  • A correction module that generates contextually appropriate alternatives during decoding (integration sketched after this list)
  • Evaluation on multiple vision-language benchmarks showing a 40-50% reduction in factual errors
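The sketch below shows how a per-step corrector like the one above might be wrapped into an ordinary greedy generation loop. It assumes HuggingFace-style model and tokenizer interfaces and, for brevity, omits passing the image to the model call itself; none of this is taken from the paper.

```python
import torch

@torch.no_grad()
def generate_with_correction(model, tokenizer, image_feats, prompt_ids,
                             correct_step, max_new_tokens=64):
    """Greedy generation with a detect-and-correct hook applied at every step.

    `correct_step` is any callable with the signature of the hypothetical
    dcd_decode_step sketch above. `model` is assumed to expose its input
    token embeddings via get_input_embeddings() (HuggingFace convention).
    """
    ids = prompt_ids.clone()
    token_embeds = model.get_input_embeddings().weight  # [vocab, d]
    for _ in range(max_new_tokens):
        # Next-token logits for the last position (image inputs elided here).
        logits = model(input_ids=ids.unsqueeze(0)).logits[0, -1]
        next_id = correct_step(logits, token_embeds, image_feats)
        ids = torch.cat([ids, torch.tensor([next_id], dtype=ids.dtype)])
        if next_id == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[prompt_ids.numel():])
```

Because the correction is applied token by token, the extra cost is one alignment scoring per step rather than a second full decoding pass, which is what keeps this style of method compatible with standard autoregressive generation.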

Experimental results demonstrate that DCD significantly improves the reliability of MLLMs while maintaining generation fluency, making it suitable for deployment in production systems where accuracy is critical.

Recommended citation: Xu, H., et al. (2024). "MLLM Can See? Dynamic Correction Decoding for Hallucination Mitigation." arXiv preprint arXiv:2406.00000.
Download Paper