**A curated, developer-friendly learning path for building real-time voice AI agents, from your first STT call to scaling production telephony.**
**English** · [中文版本](/README_zh.html)
Voice AI has moved from research demos into shipping products in under three years. The modern stack is converging around a clear pattern: a real-time transport layer (WebRTC or telephony), a streaming pipeline of speech-to-text → LLM → text-to-speech, and a turn-taking model that decides when the agent should speak. This list is structured to mirror that learning order: start with the foundations, pick a framework, then drill into individual components and production concerns.
Learning resources are tagged 🟢 Beginner, 🟡 Intermediate, or 🔴 Advanced (blogs, podcasts, and communities in sections 17-19 are intentionally left untagged). The list prefers free official docs and vendor-neutral guides, and flags where authors have commercial interests.
How to use this list
Read top-to-bottom if you're brand new. The recommended path:
1. Foundations: understand the pipeline and latency budget
2. Frameworks: pick one (LiveKit Agents or Pipecat are the safest open-source bets) and ship a hello-world
3. Components (STT, TTS, LLM, VAD, turn detection): swap pieces to learn what each layer does
4. Transport & telephony: connect to a real phone number
5. Evaluation, production, ethics: make it safe enough to ship
Companion book: Voice Agents Handbook
If you want this material in a tighter, opinionated, production-grade form, I wrote the Voice Agents Handbook: building production voice AI with LiveKit, plus appendices on choosing your stack and the LiveKit ecosystem beyond agents. Available now on Kindle (and in paperback).
The README you're reading collects the field's best free resources. The book is the curated path through them, with the patterns I've used shipping voice agents for tradespeople, lawyers, and immigration consultants.
Disclosure: I maintain this repo and authored the handbook. Free sample (Introduction + Chapter 1) at handbook.mahimai.ca.
Table of contents
Expand the 21 sections
1. [Foundational concepts and learning paths](#1-foundational-concepts-and-learning-paths)
2. [Frameworks and orchestration platforms](#2-frameworks-and-orchestration-platforms)
3. [Speech-to-text (STT / ASR)](#3-speech-to-text-stt--asr)
4. [Text-to-speech (TTS)](#4-text-to-speech-tts)
5. [LLMs for voice and real-time AI](#5-llms-for-voice-and-real-time-ai)
6. [Voice activity detection and turn-taking](#6-voice-activity-detection-and-turn-taking)
7. [Audio enhancement and noise suppression](#7-audio-enhancement-and-noise-suppression)
8. [WebRTC fundamentals](#8-webrtc-fundamentals)
9. [Telephony and SIP](#9-telephony-and-sip)
10. [Tutorials and hands-on projects](#10-tutorials-and-hands-on-projects)
11. [GitHub starter repos and awesome lists](#11-github-starter-repos-and-awesome-lists)
12. [Datasets and benchmarks](#12-datasets-and-benchmarks)
13. [Beginner-accessible research papers](#13-beginner-accessible-research-papers)
14. [Evaluation and testing](#14-evaluation-and-testing)
15. [Production, deployment, and scaling](#15-production-deployment-and-scaling)
16. [Ethics, safety, and regulation](#16-ethics-safety-and-regulation)
17. [Blogs and newsletters](#17-blogs-and-newsletters)
18. [Podcasts](#18-podcasts)
19. [Communities](#19-communities)
20. [Conferences and events](#20-conferences-and-events)
21. [Hackathons and competitions](#21-hackathons-and-competitions)
1. Foundational concepts and learning paths
Start here. These resources establish the mental model of the voice agent pipeline and the latency budget you'll fight for the rest of your career.
AI Voice Agents (LiveKit Docs): The canonical "what is a voice agent" reference, covering the Agents framework, sessions, and the STT-LLM-TTS pipeline vs realtime model split. 🟢 Beginner
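To make the idea of a latency budget concrete, here is a minimal sketch that adds up per-stage delays for one conversational turn and reports the slowest stage. The stage names and every millisecond value are illustrative assumptions, not measurements from any provider; replace them with numbers from your own traces.

```python
# Illustrative latency budget for a single conversational turn.
# All stage names and values are assumed placeholders, not benchmarks.
BUDGET_MS = {
    "network_uplink": 40,
    "stt_final_transcript": 200,
    "turn_detection": 150,
    "llm_first_token": 300,
    "tts_first_audio": 150,
    "network_downlink": 40,
}


def total_latency_ms(budget):
    """Sum the per-stage delays, assuming the stages run strictly in sequence."""
    return sum(budget.values())


def slowest_stage(budget):
    """Return the name of the stage that consumes the most time."""
    return max(budget, key=budget.get)


if __name__ == "__main__":
    total = total_latency_ms(BUDGET_MS)
    worst = slowest_stage(BUDGET_MS)
    print(f"voice-to-voice latency: {total} ms")
    print(f"largest contributor: {worst} ({BUDGET_MS[worst]} ms)")
```

Summing the stages is a simplification: real pipelines stream and overlap stages, so the sum is best read as an upper bound that shows where optimization effort pays off first.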
The frameworks below all let you wire STT, an LLM, and TTS together. For open-source production work, LiveKit Agents and Pipecat are the two safest bets; for managed dashboards, Vapi, Retell, and Bland win on time-to-first-call.
Pipecat: Quickstart: Scaffolds a Deepgram + OpenAI + Cartesia pipeline via the Pipecat CLI (uv tool install pipecat-ai-cli, then pipecat init quickstart); talk to it in the browser in ~5 minutes. 🟢 Beginner
Ultravox (fixie-ai/ultravox): Open-weight multimodal speech LLM (Llama/Gemma/Qwen variants) that skips the separate ASR stage for ~150 ms TTFT. 🔴 Advanced
Managed platforms
Vapi: Quickstart: Dashboard-first; ship an agent on a free US phone number in under 5 minutes. 🟢 Beginner
ElevenLabs Agents: Quickstart: Build and embed a voice agent widget on any website in 5 minutes (formerly branded "Conversational AI," now ElevenAgents). 🟢 Beginner
Realtime / speech-to-speech APIs
OpenAI Realtime API: Guide: Official guide to gpt-realtime (now GA) over WebRTC, WebSockets, or SIP. 🟡 Intermediate
Google Gemini Live API: Overview: Low-latency, bidirectional voice + vision agents with barge-in and tool use, on Gemini 3 native audio. 🟡 Intermediate
Twilio ConversationRelay: WebSocket bridge that handles STT/TTS so you focus on LLM logic; works with any LLM. 🟡 Intermediate
Pick one streaming STT and learn it deeply before shopping around. Deepgram, AssemblyAI, and Whisper-derivatives cover most use cases. (All-in-one ASR + end-of-turn models like Deepgram Flux are covered under turn-taking.)
Commercial APIs
Deepgram Nova-3: STT benchmarks: Primer on WER, latency, and cost alongside Deepgram's product reference; Nova-3 now spans 36+ languages with multilingual code-switching. 🟢 Beginner
AssemblyAI Universal-3 Pro: Streaming STT walkthrough that doubles as a function-calling tutorial; Universal-3 Pro is the current flagship, adding natural-language keyterm prompting. 🟡 Intermediate
Moonshine: Tiny on-device ASR (tiny 27M / base 61M params); v2 adds an ergodic streaming encoder built for latency-critical live transcription on edge devices. 🟡 Intermediate
Best Speech-to-Text Providers in 2026 (Coval): Independent benchmark across 14 providers (WER, latency, end-of-turn, cost), with guidance on testing against your own traffic. 🟡 Intermediate
Chatterbox (resemble-ai/chatterbox): Resemble AI's MIT-licensed TTS that beats ElevenLabs in blind preference tests; ~5 s zero-shot voice cloning, emotion-exaggeration control, and a built-in PerTh watermark. Turbo variant (350M) hits sub-150 ms first audio; Multilingual (V3, 0.5B) covers 23+ languages. 🟡 Intermediate
Kokoro 82M: Tiny Apache-licensed model that tops community ELO arenas; runs on CPU. 🟢 Beginner
Piper (OHF-Voice/piper1-gpl): Fast local neural TTS optimized for Raspberry Pi; perfect for offline projects. 🟢 Beginner
Coqui TTS (idiap fork): Maintained fork of Coqui-TTS / XTTS v2; still battle-tested, though Chatterbox now leads on zero-shot cloning quality. 🟡 Intermediate
Orpheus-TTS: Llama-3B-based emotive TTS with ~200 ms streaming and emotion tags. 🟡 Intermediate
Sesame CSM: Conversational, context-aware multi-speaker TTS using a Llama backbone with the Mimi codec. 🔴 Advanced
A voice agent's perceived intelligence is bounded by how fast the LLM streams its first token. Sub-300 ms TTFT changes the conversation feel entirely.
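Time to first token is easy to measure once the model output is exposed as an iterator of text chunks. The sketch below times any such iterator; `measure_ttft` and `fake_llm_stream` are hypothetical names introduced here for illustration, and the fake stream stands in for whichever provider SDK you actually use.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def measure_ttft(stream: Iterable[str]) -> Tuple[Optional[float], str]:
    """Consume a token stream and return (TTFT in ms, full text).

    TTFT is None when the stream yields no non-empty token.
    """
    start = time.perf_counter()
    ttft_ms: Optional[float] = None
    parts = []
    for token in stream:
        # Record the delay the first time a non-empty token arrives.
        if ttft_ms is None and token:
            ttft_ms = (time.perf_counter() - start) * 1000.0
        parts.append(token)
    return ttft_ms, "".join(parts)


def fake_llm_stream(first_token_delay_s: float = 0.05) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response."""
    time.sleep(first_token_delay_s)  # simulated queueing and prefill time
    yield "Hello"
    yield ", "
    yield "world"


if __name__ == "__main__":
    ttft, text = measure_ttft(fake_llm_stream())
    print(f"TTFT: {ttft:.0f} ms, text: {text!r}")
```

Start the clock where the user would perceive the wait to begin (for example, when the final transcript is sent), since a timer started after the request is already in flight under-reports the delay.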
OpenAI Realtime API guide: Flagship S2S product with WebRTC/WebSocket transport (gpt-realtime, now GA). 🟡 Intermediate
Google Gemini Live: Real-time multimodal voice/video with barge-in and broad language support, on Gemini 3 native audio. 🟡 Intermediate
Moshi (kyutai-labs): Open full-duplex speech-text foundation model (~200 ms, Mimi codec). Kyutai's broader stack now includes Unmute (cascaded STT+LLM+TTS with tool use), Kyutai STT/TTS, and Hibiki (streaming translation). 🔴 Advanced
Deepgram Flux: All-in-one conversational STT with built-in end-of-turn detection (median EOT <300 ms), integrated with Deepgram's Voice Agent API; collapses STT and turn detection into a single model. 🟡 Intermediate
Pipecat Smart Turn v3: Whisper-Tiny-based audio semantic VAD with fast CPU inference (~12 ms on a standard instance per the v3 repo), BSD-2 licensed. 🟡 Intermediate
pipecat-ai/smart-turn: Repo with model code, training scripts, and integration examples (~8M params, Whisper-Tiny base). 🟡 Intermediate
Krisp Turn-Taking: Commercial turn-taking model used alongside any STT/LLM/TTS stack. 🟡 Intermediate
ai-coustics VAD: VAD bundled with real-time speech enhancement, noise suppression, and voice isolation in a single audio preprocessing SDK; useful when you want cleanup and turn-taking signals from the same component. 🟢 Beginner
ai-coustics: Real-time speech enhancement SDK covering noise cancellation, voice isolation, and VAD; on-device and cloud deployment. See the docs and developer platform. 🟢 Beginner
Krisp SDK: Commercial-grade real-time noise and background-voice cancellation; the de facto standard for voice comms (Python, Node.js, Go, C++ SDKs). LiveKit's background voice cancellation and Pipecat Cloud both build on Krisp. Enterprise access via contact form. 🟢 Beginner
DeepFilterNet (Rikorose/DeepFilterNet): Open-source, low-complexity real-time speech enhancement for full-band audio; designed to run on embedded devices. The strongest actively-developed OSS noise suppressor. 🟡 Intermediate
RNNoise (xiph/rnnoise): Classic hybrid DSP + deep-learning noise suppression; a tiny, well-understood baseline, but no longer actively maintained. 🟡 Intermediate
Noise Suppression Guide 2026 (Picovoice): Algorithms, intelligibility metrics (SII / STI / STOI), and implementation tradeoffs; note the commercial author. 🟡 Intermediate
8. WebRTC fundamentals
WebRTC is the default transport for voice agents that don't run over the phone network. Understanding ICE, STUN, TURN, and SFU architecture is non-negotiable for production work.
MDN WebRTC API: Authoritative free reference for RTCPeerConnection, getUserMedia, and signaling. 🟢 Beginner
Build a Voice Agent with LiveKit (AssemblyAI): End-to-end walkthrough wiring LiveKit Agents + AssemblyAI Universal-3 Pro + Cartesia, run locally then on the Agents Playground. 🟡 Intermediate
Clone these instead of writing boilerplate from scratch.
livekit/agents: The flagship open-source Python/Node framework for production voice agents (tip: pair it with the LiveKit Docs MCP server and Agent Skill for AI-assisted builds). 🟢 → 🔴
pipecat-ai/pipecat: Vendor-neutral framework with 40+ STT/LLM/TTS service plugins. 🟢 → 🔴
wildminder/awesome-ai-voice: Actively maintained 2026 list of open-source TTS, voice-cloning, and audio/music-generation models. 🟢 Beginner
12. Datasets and benchmarks
You'll rarely train from scratch, but knowing which dataset a model was trained on explains its accents, languages, and failure modes.
LibriSpeech ASR Corpus: ~1,000 hours of English audiobooks; nearly every ASR paper benchmarks against it. 🟢 Beginner
Mozilla Common Voice: Crowdsourced multilingual dataset (100+ languages); the easiest legal way to fine-tune ASR. 🟢 Beginner
Common Voice on HuggingFace: One-line load_dataset() access for hands-on experiments. The official mozilla-foundation releases top out around v17; newer corpus versions (up to v22) are hosted on community mirrors. 🟢 Beginner
Open ASR Leaderboard: Live comparison of 60+ ASR models on WER and real-time factor. 🟢 Beginner
LJSpeech Dataset: ~24 hours of single-speaker English audio; baseline corpus for Tacotron 2 and VITS. 🟢 Beginner
VCTK Corpus: ~110 English speakers with diverse accents; widely used for multi-speaker TTS. 🟡 Intermediate
VoxCeleb (Oxford VGG): Million-utterance "in the wild" dataset for speaker identification and verification. 🟡 Intermediate
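Several entries above rank models by word error rate (WER): the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The following is a minimal sketch of that definition. It only lowercases and splits on whitespace, which is an assumption made for brevity; published benchmarks usually apply heavier text normalization, so scores from this function will not match leaderboard numbers exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deletion ("the") and one substitution ("lights" -> "light") over 5 words.
    print(word_error_rate("turn on the kitchen lights", "turn on kitchen light"))  # 0.4
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.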
13. Beginner-accessible research papers
These are the landmark papers behind the models you'll actually use. Read the Whisper and Common Voice papers first: they're unusually approachable.
You can't ship what you can't measure. Voice-agent evaluation is fundamentally probabilistic: a single transcript can pass and fail across runs, so simulation and statistics matter more than fixed test cases.
Coval: Voice AI Testing Platform: Defines the core voice-agent metrics: TTFB, WER, resolution rate, simulated accents, and interruptions. 🟢 Beginner
Future AGI simulate-sdk: Open-source voice AI simulation SDK for testing AI agents; generates synthetic conversations for evaluation. 🟡 Intermediate
Future AGI: Open-source platform to simulate, evaluate, trace, guardrail, and optimize voice and AI agent apps in one feedback loop, with persona-driven simulation and 50+ eval metrics. 🟡 Intermediate
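Because the same scenario can pass on one run and fail on the next, a pass rate is more informative when reported with an uncertainty range. One common choice, used here as an illustration and not taken from any tool listed above, is the Wilson score interval for a binomial proportion. The function name and the 95% default (`z = 1.96`) are assumptions of this sketch.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate observed over repeated runs."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= passes <= runs:
        raise ValueError("passes must be between 0 and runs")

    p = passes / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half_width = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    low, high = wilson_interval(passes=18, runs=20)
    print(f"pass rate 90%, 95% interval: {low:.1%} to {high:.1%}")
```

With 18 passes out of 20 simulated calls the interval spans roughly 70% to 97%, a reminder that a handful of runs cannot distinguish a reliable agent from a mediocre one.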
15. Production, deployment, and scaling
Real production voice infrastructure is the hardest unsolved problem in this space. Read these before quoting anyone a per-minute price.
Voice AI Space: Vendor-neutral hub for the voice AI ecosystem: a curated product and tool directory, the Voice AI Newsroom, tutorials and repos, a jobs board, and community meetups.
Vapi Discord: Builder community for Vapi voice agents; invite from the homepage.
Retell AI Community: Forum for Retell developers building phone-call voice agents.
ElevenLabs Discord: Large TTS, voice cloning, and Conversational AI community with daily help threads.
Deepgram Discord: STT/TTS/Voice Agent API support and build-with-us threads.
Reddit: r/LocalLLaMA: Active threads on local Whisper/Parakeet, on-device TTS, and end-to-end voice stacks.
Reddit: r/AI_Agents: General AI-agent community where voice topics surface frequently.
20. Conferences and events
AI Engineer World's Fair: Biggest AI-engineering conference; the Voice track has hosted major launches from ElevenLabs, Vapi, LiveKit, Pipecat, and Cartesia. The 2026 edition runs 29 June - 2 July 2026 at Moscone West, San Francisco.
AI Engineer YouTube channel: All World's Fair and Summit talks are posted free; the best library of recent voice-AI talks.
VOICE & AI (Modev): Long-running voice technology conference with a broader CX and voicebot focus. Oct 5–7, 2026.
Interspeech 2026: Top academic speech-science conference; intimidating but worth knowing, since most landmark papers debut here. Sydney, Australia, 27 September - 1 October 2026.
21. Hackathons and competitions
ElevenHacks (weekly sprints): Weekly themed challenges with credits and prizes; low-pressure way to ship one project per week. 🟢 Beginner
AI Engineer World's Fair Hackathon: Co-located with the conference; $10K prizes judged by 3,000+ AI engineers, with a strong voice track. Runs Jun 27, 9:00 AM to Jun 28, 5:00 PM (PDT). 🟡 Intermediate
lablab.ai AI Hackathons: Continuous calendar of short online hackathons frequently sponsored by voice-AI vendors. 🟢 Beginner
Devpost: Voice AI Hackathons: Centralized search for active voice-AI hackathons; the best way to find what's open right now. 🟢 Beginner
Suggested learning path
Week 1: Foundations: Read the LiveKit pipeline post and Voice AI Illustrated Primer (sections 1, 8).
Week 2: First agent: Finish the LiveKit or Pipecat quickstart end-to-end (sections 2, 10).
Week 4: Turn-taking, audio cleanup & telephony: Add Silero VAD, a turn detector, and a speech-enhancement pass; connect a SIP trunk (sections 6, 7, 9).
Week 5: Production: Add evaluation, observability, and read the FCC/EU AI Act material (sections 14, 15, 16).
Ongoing: Subscribe to two newsletters, follow Voice AI Space, and join the Voice AI community group on LinkedIn (sections 17, 18, 19).
Contributing
Pull requests welcome. Resources must be active in the last 12 months, accessible to developers, and vendor-neutral or clearly labeled when authored by a commercial party. Open an issue to suggest additions or removals. See CONTRIBUTING.md for the full contribution guide.