MuDoC: A Multimodal Document-grounded Conversational AI System for Classrooms

Overview

Multimodal AI is an important step toward building effective tools that leverage multiple modalities in human-AI communication. However, building a trustworthy, interactive multimodal AI system grounded in long documents remains a challenge. This work addresses the research gap of directly leveraging grounded visuals from documents, alongside their textual content, for response generation. We propose MuDoC, an interactive conversational AI agent based on LLMs that generates document-grounded responses with interleaved text and figures. MuDoC's intelligent textbook interface promotes trustworthiness and enables verification of system responses by allowing instant navigation to the source text and figures in the documents. We also discuss qualitative observations of MuDoC's responses, highlighting its strengths and limitations.

AAAI-MAKE’25, AIED’25

Technologies and Tools

LLMs, Interleaved Text-and-Image Generation, Multimodal Embedding Models, Document Layout Analysis, Weaviate VectorDB, AWS, ReactJS, Docker
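The core retrieval idea behind a system like this can be sketched as follows: text passages and figure captions from a document are embedded into a shared vector index, the top-scoring chunks of either modality are retrieved for a query, and they are interleaved (with page references for verification) into the context given to the LLM. The sketch below is a minimal, hypothetical illustration of that pattern; the toy bag-of-words embedding and the `Chunk` fields stand in for a real multimodal embedding model and the Weaviate index, which are not shown here.

```python
import math
from dataclasses import dataclass

@dataclass
class Chunk:
    modality: str   # "text" or "figure" (figure chunks hold captions)
    content: str    # passage text or figure caption
    page: int       # source page, enabling navigation back to the document

def embed(text: str) -> dict:
    """Toy embedding: lowercase word counts (placeholder for a real
    multimodal embedding model)."""
    vec: dict = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, index: list, k: int = 2) -> list:
    """Rank all chunks (both modalities) against the query; keep top-k."""
    q = embed(query)
    ranked = sorted(index, key=lambda c: cosine(q, embed(c.content)),
                    reverse=True)
    return ranked[:k]

def build_context(query: str, index: list) -> str:
    """Interleave retrieved text and figure chunks with page citations,
    so responses can cite and link back to their sources."""
    parts = []
    for c in retrieve(query, index):
        tag = "[FIGURE]" if c.modality == "figure" else "[TEXT]"
        parts.append(f"{tag} (p.{c.page}) {c.content}")
    return "\n".join(parts)

# Hypothetical example index drawn from a textbook
index = [
    Chunk("text", "Neural networks learn weights via backpropagation", 12),
    Chunk("figure", "Diagram of a feedforward neural network with hidden layers", 13),
    Chunk("text", "Decision trees split on feature thresholds", 40),
]
print(build_context("how do neural networks learn", index))
```

In a production system, the page number carried on each chunk is what makes instant navigation to the source possible, and retrieved figure chunks would carry the image itself so it can be interleaved into the generated answer.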

Team

Karan Taneja (Project Lead, System Development, Study Deployment), Anjali Singh (Experiment Design), Ashok Goel (Advisor)