Multimodal Image Understanding for Explainable Anomaly Detection

MUXAD

Basic research project

January 2025 - December 2027

Collaborating partners

University of Ljubljana, Faculty of Computer and Information Science

Funding

ARRS (J2-60055)

Researchers

Vitjan Zavrtanik, MSc

Project overview

Rapid advances in artificial intelligence, particularly in computer vision and natural language processing, have allowed deep learning to achieve impressive performance across many tasks. However, fundamental challenges remain concerning how deeply AI systems understand their inputs and how well they can explain their decisions. This project addresses these issues by focusing on anomaly detection in images through multimodal models that not only detect whether and where something is anomalous but also understand and explain why.

The core objective is to integrate visual and linguistic information to tackle three key challenges in contemporary AI: semantic image understanding, multimodal image understanding, and multimodal explanations. The first research challenge, Semantic Image Understanding, targets the limitations of current anomaly detection methods by enhancing models’ ability to recognize complex logical and structural anomalies beyond surface-level defects. The second challenge, Multimodal Image Understanding, seeks to develop zero-shot anomaly detection approaches that leverage vision-language models to detect anomalies without prior exposure to specific object classes, supplemented by textual descriptions of anomalies at both task and instance levels. The third challenge, Multimodal Explanations, focuses on enriching visual anomaly explanations with textual descriptions, improving the intuitiveness and transparency of the models.
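As a rough illustration of the zero-shot idea in the second challenge, the sketch below scores an image by comparing its embedding with the embeddings of a "normal" and an "anomalous" text prompt. It is a minimal sketch under stated assumptions, not the project's method: the function names, the temperature value, and the three-dimensional toy embeddings are all hypothetical, and in practice the embeddings would come from a pretrained vision-language model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_anomaly_score(image_emb, normal_text_emb, anomalous_text_emb,
                            temperature=0.07):
    """Return the softmax probability assigned to the 'anomalous' prompt.

    The temperature is an assumed value chosen for illustration only.
    """
    logit_normal = cosine_similarity(image_emb, normal_text_emb) / temperature
    logit_anomalous = cosine_similarity(image_emb, anomalous_text_emb) / temperature
    # Subtract the maximum logit for a numerically stable two-way softmax.
    m = max(logit_normal, logit_anomalous)
    exp_normal = math.exp(logit_normal - m)
    exp_anomalous = math.exp(logit_anomalous - m)
    return exp_anomalous / (exp_normal + exp_anomalous)


# Hypothetical toy embeddings standing in for real model outputs.
normal_prompt = [1.0, 0.0, 0.0]     # e.g. "a photo of a flawless object"
anomalous_prompt = [0.0, 1.0, 0.0]  # e.g. "a photo of a damaged object"
good_image = [0.9, 0.1, 0.0]
damaged_image = [0.2, 0.8, 0.1]

good_score = zero_shot_anomaly_score(good_image, normal_prompt, anomalous_prompt)
damaged_score = zero_shot_anomaly_score(damaged_image, normal_prompt, anomalous_prompt)
```

No training images of the inspected object class are needed here: the only class-specific input is the pair of text prompts, which is what makes the scoring zero-shot. Applying the same comparison to patch embeddings instead of a whole-image embedding would give a coarse localization map.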

MUXAD aims to advance anomaly detection by using multimodal AI to create models that are not only accurate but also interpretable and explainable, a significant step toward transparent AI systems.

Expected contributions of the project are:

Enhanced semantic image understanding for detecting complex and logical anomalies beyond surface defects.
Development of zero-shot multimodal anomaly detection methods that combine visual and linguistic data without prior exposure to specific classes.
Creation of multimodal explanation techniques that combine visual anomaly localization with rich textual descriptions.
Application of the developed methods to manufacturing visual inspection and medical imaging interpretation.

Work packages: The work programme will be divided into four main work packages, addressing the following objectives:

Development of advanced methods for semantic image understanding aimed at detecting complex anomalies (WP1).
Creation of multimodal image understanding approaches integrating vision and language for zero-shot anomaly detection (WP2).
Development of methods for generating multimodal explanations combining visual and textual descriptions of anomalies (WP3).
Application of the developed methods to real-world use cases in manufacturing visual inspection and medical imaging interpretation (WP4).

Project phases:

Year 1: Focus on local and global appearance learning, object composition learning, and dataset curation (WP1).
Year 2: Activities on zero-shot anomaly detection, text-based knowledge injection, text-based weakly labelled supervision, and manufacturing visual inspection (WP2, WP4).
Year 3: Focus on text-driven explanations and modeling uncertainty in vision-language models (WP3).

Financer

ARRS, Slovenian Research Agency
