MuDoC: A Multimodal Document-grounded Conversational AI System for Classrooms
Overview
Multimodal AI is an important step towards building effective tools that leverage multiple modalities in human-AI communication. Building a trustworthy, interactive, multimodal document-grounded AI system for interacting with long documents remains a challenge. This work fills a research gap by directly leveraging grounded visuals from documents alongside their textual content for response generation. We propose MuDoC, an interactive conversational AI agent based on LLMs that generates document-grounded responses with interleaved text and figures. MuDoC's intelligent-textbook interface promotes trustworthiness and enables verification of system responses by allowing instant navigation to the source text and figures in the documents. We also discuss qualitative observations of MuDoC's responses, highlighting its strengths and limitations.
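The core retrieval idea behind a system like MuDoC can be illustrated with a minimal sketch: both text passages and figure captions are embedded into a shared vector space, the best-matching segments are retrieved for a query, and each segment carries its source location so the interface can navigate back to the original page. The embedding function, segment schema, and `retrieve` helper below are all hypothetical stand-ins (a toy bag-of-words vector in place of a real multimodal embedding model and vector database), not MuDoC's actual implementation.

```python
import math
from collections import Counter

def toy_embed(text):
    # Hypothetical stand-in for a real multimodal embedding model:
    # a simple bag-of-words vector keyed by normalized token.
    tokens = [t.strip(".,:?") for t in text.lower().split()]
    return Counter(tokens)

def cosine(a, b):
    # Cosine similarity between two sparse bag-of-words vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Document segments: text passages AND figure captions indexed together,
# each with a page number so the UI can jump to the grounded source.
segments = [
    {"type": "text", "page": 3,
     "content": "Neural networks learn weights via gradient descent."},
    {"type": "figure", "page": 4,
     "content": "Figure 2: A feed-forward neural network architecture."},
    {"type": "text", "page": 9,
     "content": "Support vector machines maximize the margin."},
]

def retrieve(query, k=2):
    # Rank all segments (text and figures alike) by similarity to the query.
    qv = toy_embed(query)
    ranked = sorted(segments,
                    key=lambda s: cosine(qv, toy_embed(s["content"])),
                    reverse=True)
    return ranked[:k]

hits = retrieve("How does a neural network work?")
# The retrieved text and figure segments would then be interleaved into
# the LLM prompt, and their page numbers power source navigation.
```

In the full system, the toy embedding would be replaced by a multimodal embedding model and the in-memory list by a vector database (e.g., Weaviate, as listed under Technologies and Tools), but the retrieve-then-ground flow is the same.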
Related Papers
Technologies and Tools
LLMs, Interleaved Text-and-Image Generation, Multimodal Embedding Models, Document Layout Analysis, Weaviate VectorDB, AWS, ReactJS, Docker
Team
Karan Taneja (Project Lead, System Development, Study Deployment), Anjali Singh (Experiment Design), Ashok Goel (Advisor)