What this means in practice
A voice assistant is a chain of three components that pass data between each other in a few seconds:
-
Speech-to-text (STT): the user speaks, the STT component transcribes the audio into text. This is the foundation – if STT mis-hears the user, nothing downstream can recover. Quality depends on the language, the model's ability to process the accent, the background noise and any domain-specific vocabulary.
-
Language model (LLM): the transcribed text goes to a language model, which understands the intent and generates a written response. This is the same family of technology that powers ChatGPT or Claude. Crucially the LLM is what makes the assistant a conversation partner rather than a search box – it can carry context across turns, refine its answer when corrected, and call tools.
-
Text-to-speech (TTS): the written response goes to a TTS component, which generates the spoken audio the user hears. Quality here is what makes the assistant sound credible (or robotic). For projects in healthcare, accessibility or public services, voice naturalness directly affects whether the system is trusted.
Each of those components can run in the cloud (you call an API and the audio leaves your infrastructure) or self-hosted (the component runs on your own server and the audio never leaves). You can mix and match: cloud STT + self-hosted LLM, or any other combination. But every cloud component is an API call, which generally boils down to cost per token (or cost per minute in the case of phone agents) and an external data flow. Every self-hosted component means upfront infrastructure work but no per-call cost and no audio leaving your control.
Streaming audio is the one ingredient you cannot skip. A 2-way conversational interaction needs the system to stream audio continuously, not pass complete utterances back and forth. Without streaming, the conversation feels like a slow turn-taking exercise. With streaming, the agent can adjust to the user's conversational flow, handle interruptions, pauses and changes in pace naturally. Build streaming in from day one, not as a retrofit.