Building a Voice-Enabled RAG Chat Agent with Local ML Models

I recently built a comprehensive Retrieval-Augmented Generation (RAG) system that supports both voice and text queries, with a focus on privacy, performance, and scalability. The project demonstrates how to build production-ready AI applications using local ML models while maintaining cost efficiency and data privacy.

Demo Video

Project Overview

The Voice RAG Chat Agent Demo is a full-stack application that combines document indexing, vector search, and conversational AI with voice capabilities. The system is built with FastAPI, PostgreSQL (pgvector), Redis, and runs ML models locally for optimal privacy and performance.

Key Features

  • Document Indexing Pipeline: OCR → TF-IDF Filtering → NER → Embeddings → Vector Database
  • Voice/Text Query Processing: STT → Question Rectification → Vector Search → TTS
  • Local ML Models: All processing (OCR, STT, TTS, Embeddings, NER) runs locally for privacy and cost efficiency
  • Conversation History: Session-based multi-turn conversations with context awareness
  • QnA Caching: Redis-based caching for fast retrieval of recent question-answer pairs

Architecture

Document Indexing Flow

The document indexing pipeline transforms raw documents into searchable vector embeddings:

Documents → OCR Engine → Text Blob → TF-IDF Filter → NER Models → Structured Knowledge (JSON) → Sentence-Transformers Embeddings → Vector Database (pgvector)

Query Processing Flow

The query processing pipeline handles both voice and text queries:

User Query (Voice/Text) → Local STT (if voice) → Question Rectification → Structured Query → Retrieval (Recent QnA / Vector DB) → Response Generation → Local TTS (if voice) → User Response

Technical Highlights

Hybrid Model Serving

The system uses a hybrid approach where STT (Whisper), TTS (Coqui TTS), Embeddings (sentence-transformers), and NER (spaCy) run as embedded models locally in the backend container. These models are loaded into memory on first use and reused for subsequent requests, eliminating network latency and API costs. The LLM (OpenAI GPT) is used via API only for question rectification and response generation, providing advanced language understanding while keeping other models local for cost and latency optimization.

Latency Optimization

Several optimization strategies were implemented to reduce latency:

  • Model Pre-loading: STT, TTS, and embedding models are kept resident in memory, saving 2-5 seconds per request
  • QnA Caching: Redis enables <100ms responses for cache hits versus 1-6 seconds for the full pipeline
  • Optimized TTS: Using glow-tts (2-3x faster than tacotron2-DDC, reducing synthesis from ~19s to ~6-8s)
  • Smart Query Rectification: Skips LLM calls for simple queries, saving 1-2 seconds
  • Parallel Processing: Thread pool executor handles CPU-intensive STT/TTS operations asynchronously

Overall voice query latency was reduced from ~27-30s to ~15-18s while maintaining accuracy.

Handling Ambiguity

The system employs multi-layer ambiguity resolution:

  • LLM-based Question Rectification: Clarifies ambiguous queries by resolving pronouns/references from conversation history and expanding abbreviations
  • Configurable Similarity Thresholds: Vector search uses thresholds (default: 0.7) to ensure only relevant results are considered
  • QnA Cache Validation: Requires high similarity (0.85) before returning cached responses
  • Pre-defined Canned Responses: Prevents hallucination when no relevant information is found
  • NER Extraction: Provides context-aware filtering to identify when queries contain entities not present in indexed documents

Scalability

The system is designed for horizontal scaling:

  • Stateless Backend: Enables horizontal scaling with multiple instances behind a load balancer
  • Redis-based Session Storage: Provides shared state across instances
  • Database Scaling: PostgreSQL connection pooling and read replicas, with pgvector HNSW indexes maintaining performance at scale
  • Redis Cluster Caching: Handles high-throughput operations and reduces database load by 30-50%
  • FastAPI Async Capabilities: Non-blocking I/O handles concurrent requests efficiently
  • Resource Management: Model instances are shared across requests within a process (models loaded once per process)

Estimated capacity: single instance handles ~50-100 concurrent requests; horizontal scaling with 10 instances supports ~500-1,000 concurrent requests.

Data Privacy

The system ensures comprehensive privacy protection:

  • Local Processing: STT (Whisper), TTS (Coqui TTS), embeddings (sentence-transformers), and NER (spaCy) models run entirely on-premises
  • No External Transmission: Audio and document content never leave the server infrastructure
  • Ephemeral Session Data: Redis session data has configurable TTL (default: 1 hour)
  • Isolated Conversations: Conversation history is isolated per session ID
  • Minimal External API Usage: Only OpenAI GPT is called externally, and only receives text queries (not audio or documents)

Technology Stack

  • Backend: FastAPI, Python 3.11
  • Database: PostgreSQL with pgvector extension
  • Cache: Redis
  • ML Models:
    • OCR: Tesseract
    • STT: OpenAI Whisper "small" model
    • TTS: Coqui TTS (glow-tts model)
    • Embeddings: sentence-transformers
    • NER: spaCy
  • LLM: OpenAI GPT (for question rectification and response generation)
  • Frontend: HTML/CSS/JavaScript

Key Achievements

  • Reduced voice query latency from ~27-30s to ~15-18s through comprehensive optimization
  • Achieved <100ms response times for cached queries via Redis
  • Implemented privacy-first architecture with local ML model processing
  • Built scalable system architecture supporting 500-1,000 concurrent requests
  • Created end-to-end pipeline from document ingestion to voice-enabled query processing

Repository

The complete source code, documentation, and setup instructions are available on GitHub:

https://github.com/montyd1905/voice-rag-chat-agent-demo

The repository includes:

  • Complete Docker Compose setup for easy deployment
  • Detailed documentation and architecture diagrams
  • Sample articles for testing
  • Configuration examples and troubleshooting guides