Building a Voice-Enabled RAG Chat Agent with Local ML Models

I recently built a comprehensive Retrieval-Augmented Generation (RAG) system that supports both voice and text queries, with a focus on privacy, performance, and scalability. The project demonstrates how to build production-ready AI applications using local ML models while maintaining cost efficiency and data privacy.

Demo Video

Project Overview

The Voice RAG Chat Agent Demo is a full-stack application that combines document indexing, vector search, and conversational AI with voice capabilities. The system is built with FastAPI, PostgreSQL (pgvector), Redis, and runs ML models locally for optimal privacy and performance.

Key Features

Document Indexing Pipeline: OCR → TF-IDF Filtering → NER → Embeddings → Vector Database
Voice/Text Query Processing: STT → Question Rectification → Vector Search → TTS
Local ML Models: All processing (OCR, STT, TTS, Embeddings, NER) runs locally for privacy and cost efficiency
Conversation History: Session-based multi-turn conversations with context awareness
QnA Caching: Redis-based caching for fast retrieval of recent question-answer pairs

Architecture

Document Indexing Flow

The document indexing pipeline transforms raw documents into searchable vector embeddings:

Documents → OCR Engine → Text Blob → TF-IDF Filter → NER Models → Structured Knowledge (JSON) → Sentence-Transformers Embeddings → Vector Database (pgvector)

Query Processing Flow

The query processing pipeline handles both voice and text queries:

User Query (Voice/Text) → Local STT (if voice) → Question Rectification → Structured Query → Retrieval (Recent QnA / Vector DB) → Response Generation → Local TTS (if voice) → User Response

Technical Highlights

Hybrid Model Serving

The system uses a hybrid approach where STT (Whisper), TTS (Coqui TTS), Embeddings (sentence-transformers), and NER (spaCy) run as embedded models locally in the backend container. These models are loaded into memory on first use and reused for subsequent requests, eliminating network latency and API costs. The LLM (OpenAI GPT) is used via API only for question rectification and response generation, providing advanced language understanding while keeping other models local for cost and latency optimization.

Latency Optimization

Several optimization strategies were implemented to reduce latency:

Model Pre-loading: STT, TTS, and embedding models are kept resident in memory, saving 2-5 seconds per request
QnA Caching: Redis enables <100ms responses for cache hits versus 1-6 seconds for the full pipeline
Optimized TTS: Using glow-tts (2-3x faster than tacotron2-DDC, reducing synthesis from ~19s to ~6-8s)
Smart Query Rectification: Skips LLM calls for simple queries, saving 1-2 seconds
Parallel Processing: Thread pool executor handles CPU-intensive STT/TTS operations asynchronously

Overall voice query latency was reduced from ~27-30s to ~15-18s while maintaining accuracy.

Handling Ambiguity

The system employs multi-layer ambiguity resolution:

LLM-based Question Rectification: Clarifies ambiguous queries by resolving pronouns/references from conversation history and expanding abbreviations
Configurable Similarity Thresholds: Vector search uses thresholds (default: 0.7) to ensure only relevant results are considered
QnA Cache Validation: Requires high similarity (0.85) before returning cached responses
Pre-defined Canned Responses: Prevents hallucination when no relevant information is found
NER Extraction: Provides context-aware filtering to identify when queries contain entities not present in indexed documents

Scalability

The system is designed for horizontal scaling:

Stateless Backend: Enables horizontal scaling with multiple instances behind a load balancer
Redis-based Session Storage: Provides shared state across instances
Database Scaling: PostgreSQL connection pooling and read replicas, with pgvector HNSW indexes maintaining performance at scale
Redis Cluster Caching: Handles high-throughput operations and reduces database load by 30-50%
FastAPI Async Capabilities: Non-blocking I/O handles concurrent requests efficiently
Resource Management: Model instances are shared across requests within a process (models loaded once per process)

Estimated capacity: single instance handles ~50-100 concurrent requests; horizontal scaling with 10 instances supports ~500-1,000 concurrent requests.

Data Privacy

The system ensures comprehensive privacy protection:

Local Processing: STT (Whisper), TTS (Coqui TTS), embeddings (sentence-transformers), and NER (spaCy) models run entirely on-premises
No External Transmission: Audio and document content never leave the server infrastructure
Ephemeral Session Data: Redis session data has configurable TTL (default: 1 hour)
Isolated Conversations: Conversation history is isolated per session ID
Minimal External API Usage: Only OpenAI GPT is called externally, and only receives text queries (not audio or documents)

Technology Stack

Backend: FastAPI, Python 3.11
Database: PostgreSQL with pgvector extension
Cache: Redis
ML Models:
- OCR: Tesseract
- STT: OpenAI Whisper "small" model
- TTS: Coqui TTS (glow-tts model)
- Embeddings: sentence-transformers
- NER: spaCy
LLM: OpenAI GPT (for question rectification and response generation)
Frontend: HTML/CSS/JavaScript

Key Achievements

Reduced voice query latency from ~27-30s to ~15-18s through comprehensive optimization
Achieved <100ms response times for cached queries via Redis
Implemented privacy-first architecture with local ML model processing
Built scalable system architecture supporting 500-1,000 concurrent requests
Created end-to-end pipeline from document ingestion to voice-enabled query processing

Repository

The complete source code, documentation, and setup instructions are available on GitHub:

https://github.com/montyd1905/voice-rag-chat-agent-demo

The repository includes:

Complete Docker Compose setup for easy deployment
Detailed documentation and architecture diagrams
Sample articles for testing
Configuration examples and troubleshooting guides