MuDoC: A Multimodal Document-grounded Conversational AI System for Classrooms
Overview
Multimodal AI is an important step toward building effective tools that leverage multiple modalities in human-AI communication. However, building a trustworthy and interactive multimodal AI system that grounds its conversations in long documents remains a challenge. This work fills the research gap of directly leveraging grounded visuals from documents alongside their textual content for response generation. We propose ‘MuDoC’, an interactive conversational AI agent based on LLMs that generates document-grounded responses with interleaved text and figures. MuDoC’s intelligent textbook interface promotes trustworthiness and enables verification of system responses by allowing instant navigation to the source text and figures in the documents. We also discuss qualitative observations of MuDoC’s responses, highlighting its strengths and limitations.
Related Papers
Technologies and Tools
LLMs, Interleaved Text-and-Image Generation, Multimodal Embedding Models, Document Layout Analysis, Weaviate VectorDB, AWS, ReactJS, Docker
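To illustrate how these pieces fit together, the sketch below shows the core retrieval idea behind a document-grounded system: text passages and figures are embedded into a shared vector space and the chunks nearest to a query embedding are returned for the LLM to ground its response in. This is a minimal, hypothetical sketch using toy vectors and plain cosine similarity; MuDoC itself uses a multimodal embedding model and Weaviate, and the chunk records and vectors here are illustrative assumptions, not the actual implementation.

```python
import math

# Toy records standing in for embedded document chunks. In a real system,
# text passages and figures extracted via document layout analysis would be
# embedded by a multimodal model and stored in a vector DB (e.g., Weaviate).
# The ids, pages, and 3-d vectors below are made up for illustration.
CHUNKS = [
    {"id": "text-12", "kind": "text",   "page": 3, "vec": [0.9, 0.1, 0.0]},
    {"id": "fig-2",   "kind": "figure", "page": 4, "vec": [0.8, 0.3, 0.1]},
    {"id": "text-40", "kind": "text",   "page": 9, "vec": [0.0, 0.2, 0.9]},
]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, k=2):
    """Return the k chunks most similar to the query embedding.

    Because text and figure chunks live in the same embedding space,
    a single query can surface both, enabling interleaved responses
    that cite the source page for verification.
    """
    ranked = sorted(CHUNKS, key=lambda c: cosine(query_vec, c["vec"]),
                    reverse=True)
    return ranked[:k]

# A query embedding close to the first topic retrieves both the text
# passage and the related figure.
results = retrieve([1.0, 0.2, 0.0])
```

In this toy run, the top two results are the text chunk `text-12` and the figure chunk `fig-2`, which a generator could then weave into an interleaved text-and-figure answer with links back to pages 3 and 4.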
Team
Karan Taneja (Project Lead, System Development, Study Deployment, Classroom Deployment), Anjali Singh (Experiment Design), Issac Lo (LTI Tool Development), Shay Samat (LTI Tool Development), Ashok Goel (Advisor)
Deployment as Jill Watson
MuDoC has been deployed as Jill Watson in Prof. David Joyner’s Fall 2025 Knowledge-based AI course in Georgia Tech’s OMSCS program. Watch the intro video below!