voiceai

**A curated, developer-friendly learning path for building real-time voice AI agents, from your first STT call to scaling production telephony.**

[![Awesome](https://awesome.re/badge.svg)](https://awesome.re) [![License: MIT](https://img.shields.io/github/license/mahimairaja/voiceai?style=flat-square&color=blue)](LICENSE) [![Stars](https://img.shields.io/github/stars/mahimairaja/voiceai?style=flat-square&logo=github&color=yellow)](https://github.com/mahimairaja/voiceai/stargazers) [![Last commit](https://img.shields.io/github/last-commit/mahimairaja/voiceai?style=flat-square&color=informational)](https://github.com/mahimairaja/voiceai/commits/main) [![Resources](https://img.shields.io/badge/resources-190%2B-5b21b6?style=flat-square)](#table-of-contents) [![PRs welcome](https://img.shields.io/badge/PRs-welcome-brightgreen?style=flat-square)](#contributing)

**English** · [中文版本](/README_zh.html)

Voice AI has moved from research demos into shipping product in under three years. The modern stack is converging around a clear pattern: a real-time transport layer (WebRTC or telephony), a streaming pipeline of speech-to-text → LLM → text-to-speech, and a turn-taking model that decides when the agent should speak. This list is structured to mirror that learning order: start with the foundations, pick a framework, then drill into individual components and production concerns.

Learning resources are tagged 🟢 Beginner, 🟡 Intermediate, or 🔴 Advanced (blogs, podcasts, communities, and conferences in sections 17-20 are intentionally left untagged). The list favors free official docs and vendor-neutral guides, and flags entries whose authors have commercial interests.

How to use this list

Read top-to-bottom if you’re brand new. The recommended path:

Foundations → understand the pipeline and latency budget
Frameworks → pick one (LiveKit Agents or Pipecat are the safest open-source bets) and ship a hello-world
Components (STT, TTS, LLM, VAD, turn detection) → swap pieces to learn what each layer does
Transport & telephony → connect to a real phone number
Evaluation, production, ethics → make it safe enough to ship

📘 Companion book: Voice Agents Handbook

If you want this material in a tighter, opinionated, production-grade form, I wrote the Voice Agents Handbook: building production voice AI with LiveKit, plus appendices on choosing your stack and the LiveKit ecosystem beyond agents. Available now on Kindle (and in paperback).

The README you’re reading collects the field’s best free resources. The book is the curated path through them, with the patterns I’ve used shipping voice agents for tradespeople, lawyers, and immigration consultants.

Disclosure: I maintain this repo and authored the handbook. Free sample (Introduction + Chapter 1) at handbook.mahimai.ca.

Table of contents

1. [Foundational concepts and learning paths](#1-foundational-concepts-and-learning-paths)
2. [Frameworks and orchestration platforms](#2-frameworks-and-orchestration-platforms)
3. [Speech-to-text (STT / ASR)](#3-speech-to-text-stt--asr)
4. [Text-to-speech (TTS)](#4-text-to-speech-tts)
5. [LLMs for voice and real-time AI](#5-llms-for-voice-and-real-time-ai)
6. [Voice activity detection and turn-taking](#6-voice-activity-detection-and-turn-taking)
7. [Audio enhancement and noise suppression](#7-audio-enhancement-and-noise-suppression)
8. [WebRTC fundamentals](#8-webrtc-fundamentals)
9. [Telephony and SIP](#9-telephony-and-sip)
10. [Tutorials and hands-on projects](#10-tutorials-and-hands-on-projects)
11. [GitHub starter repos and awesome lists](#11-github-starter-repos-and-awesome-lists)
12. [Datasets and benchmarks](#12-datasets-and-benchmarks)
13. [Beginner-accessible research papers](#13-beginner-accessible-research-papers)
14. [Evaluation and testing](#14-evaluation-and-testing)
15. [Production, deployment, and scaling](#15-production-deployment-and-scaling)
16. [Ethics, safety, and regulation](#16-ethics-safety-and-regulation)
17. [Blogs and newsletters](#17-blogs-and-newsletters)
18. [Podcasts](#18-podcasts)
19. [Communities](#19-communities)
20. [Conferences and events](#20-conferences-and-events)
21. [Hackathons and competitions](#21-hackathons-and-competitions)

1. Foundational concepts and learning paths

Start here. These resources establish the mental model of the voice agent pipeline and the latency budget you’ll fight for the rest of your career.

Voice AI & Voice Agents: An Illustrated Primer: Kwindla Hultman Kramer’s free, regularly-updated long-form primer. The de facto textbook for the field. 🟢 Beginner
Voice Agent Architecture: STT, LLM, and TTS Pipelines Explained (LiveKit): Visual walkthrough of streaming patterns, turn detection, and where latency accumulates. 🟢 Beginner
Everything You Need to Know About Voice AI Agents (Deepgram): End-to-end primer covering feature extraction, ASR, LLM reasoning, and synthesis. 🟢 Beginner
AI Voice Agents (LiveKit Docs): The canonical “what is a voice agent” reference, covering the Agents framework, sessions, and the STT-LLM-TTS pipeline vs realtime model split. 🟢 Beginner
Core Latency in AI Voice Agents (Twilio): Visual explanation of end-of-turn detection, silence thresholds, and smart endpointing. 🟢 Beginner
Advice on Building Voice AI in June 2025 (Daily.co): Practical P50/P95 latency-budget guidance from Pipecat’s creators. 🟡 Intermediate
How Intelligent Turn Detection Solves the Biggest Challenge in Voice Agents (AssemblyAI): Endpointing is the most underestimated problem; this is the clearest deep-dive. 🟡 Intermediate
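The latency budget these resources keep returning to is, at its simplest, a sum of per-stage delays tracked at percentiles rather than averages. Here is a minimal Python sketch of that arithmetic. The stage names and every timing below are invented for illustration and do not come from any of the resources above; the nearest-rank percentile is just one common convention.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical per-turn timings in milliseconds, one dict per conversation turn.
turns = [
    {"endpointing": 300, "stt": 120, "llm_ttft": 250, "tts_ttfb": 90, "network": 60},
    {"endpointing": 320, "stt": 150, "llm_ttft": 400, "tts_ttfb": 110, "network": 80},
    {"endpointing": 280, "stt": 100, "llm_ttft": 220, "tts_ttfb": 85, "network": 55},
    {"endpointing": 500, "stt": 200, "llm_ttft": 900, "tts_ttfb": 150, "network": 120},
]

# Voice-to-voice latency for a turn is the sum of its stages.
totals = [sum(turn.values()) for turn in turns]

p50 = percentile(totals, 50)
p95 = percentile(totals, 95)
print(f"totals={totals} p50={p50} ms p95={p95} ms")
```

With these made-up numbers a single slow turn sets the P95, which is why the guides above budget against tail latency and not the mean.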

2. Frameworks and orchestration platforms

The frameworks below all let you wire STT, an LLM, and TTS together. For open-source production work, LiveKit Agents and Pipecat are the two safest bets; for managed dashboards, Vapi, Retell, and Bland win on time-to-first-call.

Open-source frameworks

LiveKit Agents: Voice AI Quickstart: Working assistant in <10 min via Python or TypeScript, runs on top of WebRTC. 🟢 Beginner
Pipecat: Quickstart: Scaffolds a Deepgram + OpenAI + Cartesia pipeline via the Pipecat CLI (uv tool install pipecat-ai-cli, then pipecat init quickstart); talk to it in the browser in ~5 minutes. 🟢 Beginner
Ultravox (fixie-ai/ultravox): Open-weight multimodal speech LLM (Llama/Gemma/Qwen variants) that skips the separate ASR stage for ~150 ms TTFT. 🔴 Advanced

Managed platforms

Vapi: Quickstart: Dashboard-first; ship an agent on a free US phone number in under 5 minutes. 🟢 Beginner
Retell AI: Introduction & Quickstart: Phone-agent platform with $10 free credit on signup. 🟢 Beginner
Bland AI: Send Your First Phone Call: Minimal API tutorial for placing your first AI phone call. 🟢 Beginner
ElevenLabs Agents: Quickstart: Build and embed a voice agent widget on any website in 5 minutes (formerly branded “Conversational AI,” now ElevenAgents). 🟢 Beginner

Realtime / speech-to-speech APIs

OpenAI Realtime API: Guide: Official guide to gpt-realtime (now GA) over WebRTC, WebSockets, or SIP. 🟡 Intermediate
Google Gemini Live API: Overview: Low-latency, bidirectional voice + vision agents with barge-in and tool use, on Gemini 3 native audio. 🟡 Intermediate
Twilio ConversationRelay: WebSocket bridge that handles STT/TTS so you focus on LLM logic; works with any LLM. 🟡 Intermediate

Vendor-neutral comparisons

Vapi vs Pipecat vs LiveKit (AssemblyAI): Architecture-focused comparison of pipeline control and transport choices. 🟡 Intermediate
11 Voice Agent Platforms Compared (Softcery): Broad market map with use-case recommendations. 🟢 Beginner
Best Voice Agent Stack (Hamming AI): Buy-vs-build framework with concrete cost, latency, and time-to-launch numbers. 🟡 Intermediate

3. Speech-to-text (STT / ASR)

Pick one streaming STT and learn it deeply before shopping around. Deepgram, AssemblyAI, and Whisper-derivatives cover most use cases. (All-in-one ASR + end-of-turn models like Deepgram Flux are covered under turn-taking.)

Commercial APIs

Deepgram Nova-3: STT benchmarks: Primer on WER, latency, and cost alongside Deepgram’s product reference; Nova-3 now spans 36+ languages with multilingual code-switching. 🟢 Beginner
AssemblyAI Universal-3 Pro: Streaming STT walkthrough that doubles as a function-calling tutorial; Universal-3 Pro is the current flagship, adding natural-language keyterm prompting. 🟡 Intermediate
OpenAI Whisper / gpt-4o-transcribe API docs: Easiest cloud STT if you already use OpenAI. 🟢 Beginner
Soniox multilingual benchmark: Public WER comparison across 60 languages. 🟢 Beginner
Cartesia Ink 2: Streaming STT paired with Sonic TTS for a single-vendor low-latency stack. 🟢 Beginner

Open source

openai/whisper: The original repo and the de facto starting point for any DIY ASR project. 🟢 Beginner
SYSTRAN/faster-whisper: CTranslate2 reimplementation up to 4× faster with INT8; recommended for self-hosted Whisper. 🟡 Intermediate
NVIDIA NeMo (Parakeet / Canary): Top-of-leaderboard open ASR models with streaming inference recipes. 🔴 Advanced
Moonshine: Tiny on-device ASR (tiny 27M / base 61M params); v2 adds an ergodic streaming encoder built for latency-critical live transcription on edge devices. 🟡 Intermediate

Benchmarks and explainers

Open ASR Leaderboard (HuggingFace): Community leaderboard across 11 datasets: your reference for open-source picks. 🟢 Beginner
Artificial Analysis: Speech-to-Text: Independent leaderboard ranking 48+ STT providers by WER, speed, and cost. 🟢 Beginner
Best Speech-to-Text Providers in 2026 (Coval): Independent benchmark across 14 providers (WER, latency, end-of-turn, cost), with guidance on testing against your own traffic. 🟡 Intermediate
Best Speech-to-Text APIs in 2026 (Deepgram): Provider comparison guide; note the commercial author. 🟢 Beginner
Streaming vs Batch ASR (Arun Baby): Engineer-friendly explainer of RNN-T and Conformer streaming architectures. 🟡 Intermediate
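Most of the leaderboards above rank models by word error rate (WER): the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. Below is a minimal sketch, assuming plain lowercasing and whitespace tokenization. Real benchmarks apply heavier text normalization, so numbers from this function will not match published scores.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a free match
            )
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("a" -> "the") and one deletion ("at") over 7 reference words.
wer = word_error_rate("book a table for two at seven", "book the table for two seven")
print(f"WER = {wer:.3f}")
```

Running the same function over transcripts of your own call audio is usually more informative than any public leaderboard number.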

4. Text-to-speech (TTS)

Latency, not raw quality, is what kills voice agents: prioritize providers offering true streaming with first-byte under 200 ms.

Commercial APIs

ElevenLabs Docs: Industry-leading quality, voice cloning, and Agents platform in one SDK. 🟢 Beginner
Cartesia Sonic Quickstart: Sonic 3.5, sub-90 ms first-byte latency, designed specifically for voice agents. 🟢 Beginner
Deepgram Aura-2: Low-latency streaming TTS (Aura-2) that pairs cleanly with Deepgram STT. 🟢 Beginner
OpenAI TTS (gpt-4o-mini-tts): Easiest plug-in TTS for the OpenAI stack. 🟢 Beginner
Artificial Analysis: TTS leaderboard: ELO, price, and speed comparison covering Rime, PlayHT, Hume, Inworld, and others. 🟢 Beginner

Open source

Chatterbox (resemble-ai/chatterbox): Resemble AI’s MIT-licensed TTS that beats ElevenLabs in blind preference tests; ~5 s zero-shot voice cloning, emotion-exaggeration control, and a built-in PerTh watermark. Turbo variant (350M) hits sub-150 ms first audio; Multilingual (V3, 0.5B) covers 23+ languages. 🟡 Intermediate
Kokoro 82M: Tiny Apache-licensed model that tops community ELO arenas; runs on CPU. 🟢 Beginner
Piper (OHF-Voice/piper1-gpl): Fast local neural TTS optimized for Raspberry Pi; perfect for offline projects. 🟢 Beginner
Coqui TTS (idiap fork): Maintained fork of Coqui-TTS / XTTS v2; still battle-tested, though Chatterbox now leads on zero-shot cloning quality. 🟡 Intermediate
Orpheus-TTS: Llama-3B-based emotive TTS with ~200 ms streaming and emotion tags. 🟡 Intermediate
Sesame CSM: Conversational, context-aware multi-speaker TTS using a Llama backbone with the Mimi codec. 🔴 Advanced

Streaming and ethics

Streaming TTS for Low-Latency Agents (Picovoice): Clear taxonomy of single, output-streaming, and dual-streaming TTS. 🟡 Intermediate
Ethics of Voice Cloning & Deepfakes (Deepgram): Vendor-neutral discussion of misuse, regulation, and developer responsibility. 🟢 Beginner
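The first-byte target at the top of this section is easy to check yourself: time how long a streaming response takes to yield its first non-empty audio chunk. The sketch below uses a hypothetical `fake_tts_stream` generator as a stand-in for a provider's streaming response; it is not any vendor's SDK, and the 50 ms delay is invented.

```python
import time
from typing import Iterable, Iterator, Tuple

def time_to_first_chunk(stream: Iterable[bytes]) -> Tuple[float, int]:
    """Return (seconds until the first non-empty chunk, total bytes received)."""
    start = time.perf_counter()
    first = None
    total = 0
    for chunk in stream:
        if chunk and first is None:
            first = time.perf_counter() - start
        total += len(chunk)
    if first is None:
        raise RuntimeError("stream produced no audio")
    return first, total

def fake_tts_stream(delay_s: float = 0.05, chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a provider's streaming audio response."""
    time.sleep(delay_s)  # simulated synthesis and network delay before the first audio
    for _ in range(chunks):
        yield b"\x00" * 320  # 10 ms of 16 kHz, 16-bit mono silence

ttfb, total = time_to_first_chunk(fake_tts_stream())
print(f"first chunk after {ttfb * 1000:.0f} ms, {total} bytes total")
```

To compare real providers, pass the chunk iterator their SDK returns in place of the fake stream, and measure from the same network location your agent will run in.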

5. LLMs for voice and real-time AI

A voice agent’s perceived intelligence is bounded by how fast the LLM streams its first token. Sub-300 ms TTFT changes the conversation feel entirely.

Low-latency inference

Groq: LPU-based inference cloud delivering ~10× faster Llama tokens/sec than commodity GPUs. 🟢 Beginner
Cerebras Inference: Wafer-scale chip inference with very high throughput on Llama models. 🟢 Beginner
SambaNova Cloud: Reconfigurable Dataflow inference; stable throughput at low latency. 🟢 Beginner

Speech-to-speech models

OpenAI Realtime API guide: Flagship S2S product with WebRTC/WebSocket transport (gpt-realtime, now GA). 🟡 Intermediate
Google Gemini Live: Real-time multimodal voice/video with barge-in and broad language support, on Gemini 3 native audio. 🟡 Intermediate
Moshi (kyutai-labs): Open full-duplex speech-text foundation model (~200 ms, Mimi codec). Kyutai’s broader stack now includes Unmute (cascaded STT+LLM+TTS with tool use), Kyutai STT/TTS, and Hibiki (streaming translation). 🔴 Advanced
Speech-to-Speech Models in 2026: Three Architectural Bets (Krzysztof Sopyla): Vendor-neutral comparison of full-duplex (Moshi), near-duplex multimodal (Qwen-Omni), and cascade approaches, with FullDuplexBench numbers and tradeoffs. 🟡 Intermediate

Voice-specific prompting and tools

OpenAI Voice Agents Guide: Compares chained vs S2S architectures with prompt and tool best practices. 🟢 Beginner
ElevenLabs Voice Agent Prompting Guide: Production-grade prompt structure tuned for voice; vendor-neutral lessons. 🟡 Intermediate
Voice AI Prompt Engineering Guide (VoiceInfra): Explains why voice prompts must be 60–70% shorter than chat prompts, with templates. 🟢 Beginner
Tool Definition and Use for Voice Agents (LiveKit Docs): Defining @function_tool tools and raw-schema tools inside a voice agent. 🟡 Intermediate

6. Voice activity detection and turn-taking

Pure VAD is no longer enough: modern agents combine acoustic VAD with a small semantic model that predicts end-of-utterance from words and prosody.

Silero VAD: MIT-licensed pre-trained VAD; <1 ms per chunk on CPU. The de facto VAD inside LiveKit and Pipecat. 🟢 Beginner
py-webrtcvad: Python bindings for Google’s classic WebRTC VAD; lightweight baseline. 🟢 Beginner
LiveKit Turn Detector: blog post: How a small transformer-based EOU model complements VAD with semantic context. 🟡 Intermediate
LiveKit turn-detector model on HuggingFace: Open-weights multilingual EOU model running ONNX on CPU in under 500 MB. 🟡 Intermediate
Deepgram Flux: All-in-one conversational STT with built-in end-of-turn detection (median EOT <300 ms), integrated with Deepgram’s Voice Agent API; collapses STT and turn detection into a single model. 🟡 Intermediate
Pipecat Smart Turn v3: Whisper-Tiny-based audio semantic VAD with fast CPU inference (~12 ms on a standard instance per the v3 repo), BSD-2 licensed. 🟡 Intermediate
pipecat-ai/smart-turn: Repo with model code, training scripts, and integration examples (~8M params, Whisper-Tiny base). 🟡 Intermediate
Krisp Turn-Taking: Commercial turn-taking model used alongside any STT/LLM/TTS stack. 🟡 Intermediate
The Complete Guide to AI Turn-Taking (Tavus): Reader-friendly overview of why pure VAD fails in real conversations. 🟢 Beginner
Tackling Turn Detection in Voice AI (Notch): Engineer-first walkthrough combining VAD probability, volume, and TTS markers. 🟡 Intermediate
Evaluating End-of-Turn Detection Models (Deepgram): Methodology plus a head-to-head of Flux, Pipecat Smart Turn, and LiveKit EOU; note the commercial author. 🟡 Intermediate
ai-coustics VAD: VAD bundled with real-time speech enhancement, noise suppression, and voice isolation in a single audio preprocessing SDK; useful when you want cleanup and turn-taking signals from the same component. 🟢 Beginner
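The pattern described at the top of this section, acoustic VAD combined with a semantic end-of-utterance (EOU) model, can be sketched as a small decision function. This is an illustration only: the thresholds are hypothetical, and it does not reproduce how LiveKit, Pipecat, Deepgram, or Krisp implement turn detection.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TurnConfig:
    min_silence_ms: int = 200    # never respond before this much silence
    max_silence_ms: int = 1500   # always respond after this much silence
    eou_threshold: float = 0.7   # hypothetical confidence cutoff for the semantic model

def should_respond(silence_ms: int, eou_probability: float,
                   cfg: TurnConfig = TurnConfig()) -> bool:
    """Decide whether the agent should start speaking.

    silence_ms: how long the acoustic VAD has reported no speech.
    eou_probability: the semantic model's estimate that the utterance is complete.
    """
    if silence_ms >= cfg.max_silence_ms:
        return True   # long pause: respond even if the sentence looks unfinished
    if silence_ms < cfg.min_silence_ms:
        return False  # too early: probably a gap between words
    # In between, let the semantic model break the tie.
    return eou_probability >= cfg.eou_threshold

print(should_respond(300, 0.9))   # complete-sounding sentence, short pause
print(should_respond(300, 0.2))   # trailing "and then I..." with the same pause
```

The two calls use the same silence duration and differ only in the semantic score, which is the case pure VAD cannot separate.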

7. Audio enhancement and noise suppression

The audio reaching your VAD and STT is often noisy, reverberant, or mixed with background voices. Cleaning the signal before the rest of the pipeline is frequently the difference between an agent that ships and one that frustrates users in real-world conditions (cars, cafés, call centres). In 2026 every major voice-AI vendor ships a deep-learning suppressor on top of WebRTC’s classic noise-suppression chain.

ai-coustics: Real-time speech enhancement SDK covering noise cancellation, voice isolation, and VAD; on-device and cloud deployment. See the docs and developer platform. 🟢 Beginner
Krisp SDK: Commercial-grade real-time noise and background-voice cancellation; the de facto standard for voice comms (Python, Node.js, Go, C++ SDKs). LiveKit’s background voice cancellation and Pipecat Cloud both build on Krisp. Enterprise access via contact form. 🟢 Beginner
DeepFilterNet (Rikorose/DeepFilterNet): Open-source, low-complexity real-time speech enhancement for full-band audio; designed to run on embedded devices. The strongest actively-developed OSS noise suppressor. 🟡 Intermediate
RNNoise (xiph/rnnoise): Classic hybrid DSP + deep-learning noise suppression; a tiny, well-understood baseline, but no longer actively maintained. 🟡 Intermediate
Koala Noise Suppression (Picovoice): On-device, cross-platform voice isolation with self-serve access (browser, mobile, desktop, Raspberry Pi). 🟢 Beginner
Noise Suppression Guide 2026 (Picovoice): Algorithms, intelligibility metrics (SII / STI / STOI), and implementation tradeoffs; note the commercial author. 🟡 Intermediate

8. WebRTC fundamentals

WebRTC is the default transport for voice agents that don’t run over the phone network. Understanding ICE, STUN, TURN, and SFU architecture is non-negotiable for production work.

MDN WebRTC API: Authoritative free reference for RTCPeerConnection, getUserMedia, and signaling. 🟢 Beginner
MDN: Introduction to WebRTC Protocols: Beginner-friendly explanation of ICE, STUN, TURN, and SDP. 🟢 Beginner
WebRTC.org Getting Started: Official Google-maintained intro, splitting WebRTC into media-capture and connectivity. 🟢 Beginner
GetStream: WebRTC for the Brave: Free multi-module tutorial covering networking basics through advanced topics. 🟢 Beginner
Why WebRTC Beats WebSockets for Voice AI (LiveKit): 2025 explainer aimed at AI builders, comparing transports in plain English. 🟡 Intermediate
Daily Docs: Intro to Video Architecture (P2P vs SFU): One of the clearest beginner write-ups of P2P vs SFU. 🟢 Beginner
P2P, SFU, MCU, Hybrid: WebRTC Architecture Guide (Forasoft): Vendor-neutral 2026 breakdown of the four architectures with current OSS tooling (mediasoup, Janus, Jitsi). 🟡 Intermediate
Agora: How WebRTC Works: Side-by-side WebRTC vs WebSockets walkthrough with signaling diagrams. 🟢 Beginner

9. Telephony and SIP

The phone network has its own physics. Once you know which SIP trunk provider to point at LiveKit or Pipecat, you can ship.

Twilio Programmable Voice: TwiML, Voice API, and PSTN connectivity in one hub; the default starting point. 🟢 Beginner
Twilio: Voice AI Assistant with OpenAI Realtime + Python: Step-by-step junior-friendly tutorial wiring Twilio Media Streams to an LLM. 🟢 Beginner
Twilio SIP Quickstart: Clearest beginner explainer of SIP basics, SIP Domains, and softphone setup. 🟢 Beginner
Telnyx Voice API: Strong Twilio alternative with WebSocket media streaming and AI Assistant tooling. 🟢 Beginner
Telnyx: How to Set Up a SIP Trunk: Friendly walkthrough of SIP trunking architecture, codecs, and authentication. 🟢 Beginner
Plivo Voice API Documentation: XML call control and audio-streaming integrations for AI agents. 🟢 Beginner
SignalWire Voice Docs: Built on FreeSWITCH; SWML, TwiML-compatible API, and an AI Agents SDK. 🟡 Intermediate
LiveKit SIP Primer: Best diagram of how a call flows from PSTN → trunk → SIP service → agent. 🟢 Beginner
LiveKit SIP Trunk Setup: Practical guide for wiring Twilio/Telnyx/Plivo/Wavix/Sinch trunks into LiveKit. 🟡 Intermediate
Pipecat Telephony Overview: Differences between WebSocket-based telephony and SIP-based call control. 🟡 Intermediate

10. Tutorials and hands-on projects

Pick one tutorial and finish it before starting another. Voice AI is unforgiving of half-built pipelines.

LiveKit Voice AI Quickstart: Official 10-minute walkthrough in Python or Node with starter templates. 🟢 Beginner
Build Your First AI Voice Agent in Python (LiveKit): End-to-end Python tutorial covering streaming, latency, and deployment. 🟢 Beginner
Pipecat Quickstart: Build and deploy a Deepgram + OpenAI + Cartesia bot via the Pipecat CLI in roughly 10 minutes. 🟢 Beginner
How to Build a Real-Time Voice Agent with Pipecat (AssemblyAI): Production-oriented walkthrough including local testing and Pipecat Cloud deployment. 🟡 Intermediate
Build a Voice Agent with LiveKit (AssemblyAI): End-to-end walkthrough wiring LiveKit Agents + AssemblyAI Universal-3 Pro + Cartesia, run locally then on the Agents Playground. 🟡 Intermediate
Deepgram: Build a Voice AI Agent: Step-by-step guide wiring Deepgram STT, GPT, and Aura TTS. 🟢 Beginner
Build a Voice Assistant with Twilio ConversationRelay + LiteLLM: Provider-agnostic tutorial supporting OpenAI, Anthropic, or DeepSeek. 🟡 Intermediate
freeCodeCamp: Build Advanced AI Agents (LiveKit, Exa, LangChain): Free 3-part video course covering interactive voice agents end-to-end. 🟢 Beginner
freeCodeCamp: Build a Voice AI Agent with Open-Source Tools: Hands-on local stack covering open-source STT, a local LLM, and system TTS, plus the cascaded vs end-to-end tradeoff. 🟡 Intermediate

11. GitHub starter repos and awesome lists

Clone these instead of writing boilerplate from scratch.

livekit/agents: The flagship open-source Python/Node framework for production voice agents (tip: pair it with the LiveKit Docs MCP server and Agent Skill for AI-assisted builds). 🟢 → 🔴
pipecat-ai/pipecat: Vendor-neutral framework with 40+ STT/LLM/TTS service plugins. 🟢 → 🔴
livekit-examples/agent-starter-python: Production-ready starter with Dockerfile, eval suite, turn detector, and core plugins. 🟢 Beginner
livekit-examples (org): Official collection of LiveKit Python/React/Swift/Android starters. 🟢 Beginner
pipecat-ai/pipecat-examples: Sample apps for push-to-talk, websocket, telephony, and multimodal use cases. 🟢 → 🟡
elevenlabs/elevenlabs-examples: Runnable Next.js and Python examples for TTS, STT, and real-time agents. 🟢 Beginner
kwindla/macos-local-voice-agents: Pipecat example hitting sub-800 ms voice-to-voice latency entirely on M-series Macs. 🟡 Intermediate
zzw922cn/awesome-speech-recognition-speech-synthesis-papers: Comprehensive curated index of ASR, TTS, voice conversion, and speech-LLM papers. 🟡 Intermediate
wildminder/awesome-ai-voice: Actively maintained 2026 list of open-source TTS, voice-cloning, and audio/music-generation models. 🟢 Beginner

12. Datasets and benchmarks

You’ll rarely train from scratch, but knowing which dataset a model was trained on explains its accents, languages, and failure modes.

LibriSpeech ASR Corpus: ~1,000 hours of English audiobooks; nearly every ASR paper benchmarks against it. 🟢 Beginner
Mozilla Common Voice: Crowdsourced multilingual dataset (100+ languages); the easiest legal way to fine-tune ASR. 🟢 Beginner
Common Voice on HuggingFace: One-line load_dataset() access for hands-on experiments. The official mozilla-foundation releases top out around v17; newer corpus versions (up to v22) are hosted on community mirrors. 🟢 Beginner
Open ASR Leaderboard: Live comparison of 60+ ASR models on WER and real-time factor. 🟢 Beginner
Artificial Analysis: Speech: Independent benchmarks of commercial STT and TTS providers. 🟢 Beginner
LJSpeech Dataset: ~24 hours of single-speaker English audio; baseline corpus for Tacotron 2 and VITS. 🟢 Beginner
VCTK Corpus: ~110 English speakers with diverse accents; widely used for multi-speaker TTS. 🟡 Intermediate
VoxCeleb (Oxford VGG): Million-utterance “in the wild” dataset for speaker identification and verification. 🟡 Intermediate

13. Beginner-accessible research papers

These are the landmark papers behind the models you’ll actually use. Read the Whisper and Common Voice papers first: they’re unusually approachable.

Whisper: Robust Speech Recognition via Large-Scale Weak Supervision (2022): Behind the most popular open ASR model; unusually clear prose for an ML paper. 🟡 Intermediate
HuggingFace Whisper fine-tuning blog (companion): Hands-on walkthrough that lets you “feel” the Whisper paper in code. 🟢 Beginner
VITS: Conditional VAE with Adversarial Learning for End-to-End TTS (2021): The single-stage TTS model behind many open-source voice cloners. 🟡 Intermediate
Tacotron 2: Natural TTS Synthesis (2017): Landmark seq2seq + WaveNet-vocoder paper that made neural TTS sound natural. 🟡 Intermediate
Conformer: Convolution-augmented Transformer for ASR (2020): The architecture inside NVIDIA Parakeet, Canary, and many leaderboard models. 🟡 Intermediate
wav2vec 2.0: Self-Supervised Learning of Speech Representations (2020): Showed that pretraining on unlabeled audio drastically reduces labeled-data needs. 🟡 Intermediate
Common Voice: A Massively-Multilingual Speech Corpus (2020): Short, accessible paper describing how Common Voice is built and validated. 🟢 Beginner
Moshi: A Speech-Text Foundation Model for Real-Time Dialogue (2024): The first real-time full-duplex spoken LLM; introduces the Mimi codec and the “Inner Monologue” method (time-aligned text before audio tokens). 🔴 Advanced
Open ASR Leaderboard preprint (2025): Reproducible benchmark of 60+ ASR models across 11 datasets; the modern landscape map. 🟡 Intermediate
Full-Duplex-Bench: Evaluating Full-Duplex Spoken Dialogue Models on Turn-Taking (2025): A reproducible benchmark for interruption handling and turn-taking in speech-to-speech models. 🟡 Intermediate

14. Evaluation and testing

You can’t ship what you can’t measure. Voice-agent evaluation is fundamentally probabilistic: a single transcript can pass and fail across runs, so simulation and statistics matter more than fixed test cases.

Coval: Voice AI Testing Platform: Defines the core voice-agent metrics: TTFB, WER, resolution rate, simulated accents, and interruptions. 🟢 Beginner
Coval: How to Evaluate Voice Agents (Practical Guide): One of the most cited 2025 guides on probabilistic vs deterministic evaluation. 🟢 Beginner
Cekura: Metrics Overview: Predefined metrics, instruction-following checks, and simulation framework. 🟢 Beginner
Cekura: Performance Testing for Voice Agents: Practical 2025 guide on multi-turn simulation and edge-case generation. 🟡 Intermediate
Hamming AI: Production-focused QA platform with simulation, load testing, and 50+ metrics. 🟡 Intermediate
Hamming: Voice Agent Evaluation Metrics Guide: Reference of latency percentiles, WER, MOS-style quality, and task completion with formulas. 🟡 Intermediate
LiveKit: Understand and Improve Agent Latency: Per-turn latency metrics (e2e, LLM TTFT, TTS TTFB) and where to optimize. 🟡 Intermediate
Twilio: How Do You Know if Your Voice AI Agents Are Working?: Vendor-neutral 2025 guide arguing for business-outcome metrics over raw WER/latency. 🟢 Beginner
Future AGI simulate-sdk: Open-source voice AI simulation SDK for testing AI agents; generates synthetic conversations for evaluation. 🟡 Intermediate
Future AGI: Open-source platform to simulate, evaluate, trace, guardrail, and optimize voice and AI agent apps in one feedback loop, with persona-driven simulation and 50+ eval metrics. 🟡 Intermediate
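Because the same scenario can pass on one run and fail on the next, a pass rate says more when it comes with an interval. Here is a minimal sketch using the Wilson score interval. The choice of interval and the run counts are assumptions for illustration, not something the platforms above prescribe.

```python
import math

def wilson_interval(passes: int, runs: int, z: float = 1.96):
    """Wilson score interval for a pass rate (z = 1.96 is roughly 95% confidence)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# Hypothetical result: a simulated caller scenario passed 18 of 20 runs.
low, high = wilson_interval(18, 20)
print(f"pass rate 0.90, interval {low:.2f} to {high:.2f}")
```

Twenty runs at a 90% pass rate still leave an interval of roughly 0.70 to 0.97, so a small batch of simulated calls cannot distinguish a good agent from a mediocre one.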

15. Production, deployment, and scaling

Real production voice infrastructure is the hardest unsolved problem in this space. Read these before quoting anyone a per-minute price.

LiveKit: Deploy and scale agents on LiveKit Cloud: Real-world write-up on stateful load balancing, autoscaling, and warm pools. 🟡 Intermediate
LiveKit: Why You Shouldn’t Build Voice Agents Directly on Model APIs: Honest breakdown of what raw model APIs don’t give you. 🟡 Intermediate
Latent Space: OpenAI Realtime API: The Missing Manual: Field-tested guide from Pipecat’s creator on Realtime API production realities. 🟡 Intermediate
TWIML: Building Voice AI Agents That Don’t Suck (Kwindla Kramer): One-hour discussion on real production architecture and turn-taking. 🟡 Intermediate
AWS: Voice Agents with Pipecat and Amazon Bedrock: Full architecture walkthrough including latency optimization and Nova Sonic. 🟡 Intermediate
Deepgram: STT API Pricing Breakdown: Vendor-by-vendor per-minute economics: required reading before signing any contract. 🟢 Beginner
Sierra: Shipping and Scaling AI Agents: Case study on Sonos, SiriusXM, and OluKai voice deployments. 🟡 Intermediate
Sierra: Constellation of Models: How a leading CX company composes 15+ models per agent. 🟡 Intermediate
LiveKit Agent Observability: Built-in tracing, transcripts, and per-stage latency for LiveKit Cloud. 🟢 Beginner
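Before quoting a per-minute price, it helps to add the components up explicitly. The sketch below shows only the arithmetic. Every rate and usage figure in it is a placeholder invented for illustration, not a real vendor price, and the split into STT, LLM, TTS, and transport is an assumption about how your stack is billed.

```python
def cost_per_minute(stt_per_min: float, llm_per_1k_tokens: float,
                    tts_per_1k_chars: float, transport_per_min: float,
                    llm_tokens_per_min: float, tts_chars_per_min: float) -> float:
    """Sum per-minute cost across pipeline stages, all in the same currency."""
    llm = llm_per_1k_tokens * llm_tokens_per_min / 1000
    tts = tts_per_1k_chars * tts_chars_per_min / 1000
    return stt_per_min + llm + tts + transport_per_min

# Placeholder values only: replace with quotes from your own providers
# and usage measured from your own call logs.
estimate = cost_per_minute(
    stt_per_min=0.01,
    llm_per_1k_tokens=0.002,
    tts_per_1k_chars=0.03,
    transport_per_min=0.005,
    llm_tokens_per_min=1500,
    tts_chars_per_min=400,
)
print(f"estimated cost: ${estimate:.3f} per minute")
```

The usage figures matter as much as the rates: a talkative agent or a long system prompt resent on every turn changes the LLM and TTS terms far more than a small difference in list price.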

16. Ethics, safety, and regulation

If you’re shipping a voice agent in 2026, disclosure and consent are no longer optional. The FCC and EU AI Act both have teeth.

FCC: AI-Generated Voices in Robocalls Illegal (Feb 2024): The landmark TCPA ruling every U.S. voice-agent dev must read. 🟢 Beginner
EU AI Act: Article 50 (Transparency for Deepfakes & AI Interactions): Authoritative text of EU disclosure rules; transparency obligations apply from 2 August 2026 (systems already on the market before that date have until 2 December 2026 to comply). 🟡 Intermediate
European Commission: Code of Practice on AI-Generated Content: Official EU implementation guidance on watermarking and labelling; the finalized Code was published on 10 June 2026. 🟡 Intermediate
FTC: Approaches to Address AI-Enabled Voice Cloning: Plain-English summary of the Voice Cloning Challenge winners and Impersonation Rule. 🟢 Beginner
FTC: Proposed Rule on AI Impersonation of Individuals (Feb 2024): Direct source on U.S. impersonation-fraud rules covering AI deepfakes. 🟢 Beginner
Pindrop: Voice Intelligence & Security Report: Industry report documenting the sharp rise in deepfake fraud attempts. 🟢 Beginner
Voice Cloning Ethics (CAMB.AI): Practical overview of consent frameworks, ELVIS Act, and EU AI Act. 🟢 Beginner
NCLC: Top Six TCPA/Robocall Developments 2024/2025: Consumer-protection lens on what’s actually being enforced. 🟡 Intermediate

17. Blogs and newsletters

Subscribe to two or three to stay current: the field moves quickly.

LiveKit Blog: Engineering deep-dives on WebRTC, agents framework releases, and production patterns.
Deepgram Learn: Tutorials on STT/TTS, voice agent design, evals, and pipeline architecture.
Cartesia Blog: State-space TTS models, Sonic releases, and yearly “State of Voice AI” reports.
ElevenLabs Blog: Product and research announcements with implementation notes.
Daily.co Blog (Pipecat): Posts from Pipecat’s maintainers covering scaling and feature releases.
Voice AI & Voice Agents: An Illustrated Primer: Free, regularly-updated long-form primer.
Voice AI Space: Vendor-neutral hub for the voice AI ecosystem: a curated product and tool directory, the Voice AI Newsroom, tutorials and repos, a jobs board, and community meetups.
Voice AI Newsletter (Krisp): “Future of Voice AI” interview series with founders.
Voice AI Weekly (Vapi): Weekly Substack rounding up news, products, and tools.

18. Podcasts

Deepgram AI Minds: Founder and builder interviews across the voice AI ecosystem.
The Future of Voice AI (Krisp): Weekly founder interviews focused on enterprise voice AI architecture.
TWIML AI Podcast: voice episodes: Strong technical interviews; the Kwin Kramer episode is a great starting point.
This Week In Voice (Project Voice): News-roundtable format covering conversational AI.

19. Communities

LiveKit Community Slack: Direct access to maintainers and other agent builders.
Pipecat Discord: Active community with weekly office hours; invite link from the homepage.
HuggingFace Discord: #ml-for-audio-and-speech: 200k-member server with strong audio/speech channels.
Vapi Discord: Builder community for Vapi voice agents; invite from the homepage.
Retell AI Community: Forum for Retell developers building phone-call voice agents.
ElevenLabs Discord: Large TTS, voice cloning, and Conversational AI community with daily help threads.
Deepgram Discord: STT/TTS/Voice Agent API support and build-with-us threads.
Reddit: r/LocalLLaMA: Active threads on local Whisper/Parakeet, on-device TTS, and end-to-end voice stacks.
Reddit: r/AI_Agents: General AI-agent community where voice topics surface frequently.

20. Conferences and events

AI Engineer World’s Fair: Biggest AI-engineering conference; the Voice track has hosted major launches from ElevenLabs, Vapi, LiveKit, Pipecat, and Cartesia. The 2026 edition runs 29 June - 2 July 2026 at Moscone West, San Francisco.
AI Engineer YouTube channel: All World’s Fair and Summit talks are posted free; the best library of recent voice-AI talks.
AI Engineer Summit Online: Voice playlist: Curated playlist including voice-track sessions from leading labs.
AIEWF 2025 Recap (Latent Space): Written deep-dive into 2025’s voice-track talks and major launches.
VOICE & AI (Modev): Long-running voice technology conference with a broader CX and voicebot focus. Runs Oct 5–7, 2026.
Interspeech 2026: Top academic speech-science conference; intimidating but worth knowing, since most landmark papers debut here. Sydney, Australia, 27 September - 1 October 2026.

21. Hackathons and competitions

ElevenHacks (weekly sprints): Weekly themed challenges with credits and prizes; low-pressure way to ship one project per week. 🟢 Beginner
AI Engineer World’s Fair Hackathon: Co-located with the conference; $10K prizes judged by 3,000+ AI engineers, with a strong voice track. Runs Jun 27, 9:00 AM to Jun 28, 5:00 PM (PDT). 🟡 Intermediate
lablab.ai AI Hackathons: Continuous calendar of short online hackathons frequently sponsored by voice-AI vendors. 🟢 Beginner
Devpost: Voice AI Hackathons: Centralized search for active voice-AI hackathons; the best way to find what’s open right now. 🟢 Beginner

Suggested learning path

Week 1: Foundations: Read the LiveKit pipeline post and Voice AI Illustrated Primer (sections 1, 8).
Week 2: First agent: Finish the LiveKit or Pipecat quickstart end-to-end (sections 2, 10).
Week 3: Components: Swap STT, TTS, and LLM providers; benchmark latency (sections 3, 4, 5).
Week 4: Turn-taking, audio cleanup & telephony: Add Silero VAD, a turn detector, and a speech-enhancement pass; connect a SIP trunk (sections 6, 7, 9).
Week 5: Production: Add evaluation, observability, and read the FCC/EU AI Act material (sections 14, 15, 16).
Ongoing: Subscribe to two newsletters, follow Voice AI Space, and join the Voice AI community group on LinkedIn (sections 17, 18, 19).

Contributing

Pull requests welcome. Resources must be active in the last 12 months, accessible to developers, and vendor-neutral or clearly labeled when authored by a commercial party. Open an issue to suggest additions or removals. See CONTRIBUTING.md for the full contribution guide.

⭐ Stargazers and contributors

📜 License

MIT. Fork it, ship it.
