How to Build a Voice Assistant: From idea to working voice agent in 1–2 weeks – the cloud and self-hosted paths, compared by people who have shipped both

How to Build a Voice Assistant

The short answer

To build a voice assistant that works, decide first whether you need the cloud path (ElevenLabs + your LLM provider of choice), the self-hosted path (Parakeet + LiveKit + Kokoro + Ollama) – or something in between. The answer depends on data sensitivity, cost over 12-24 months, compliance and the energy footprint your project can defend. Both paths use the same three components (speech-to-text, language model, text-to-speech) and the same tool-calling layer (via MCP or the LiveKit Agent Framework) that lets the assistant actually do things – book a call, query a database, trigger a workflow – not just talk back. They differ on where the audio goes, what it costs at scale and how natural the resulting voice sounds. Many projects start with the cloud path to validate the concept in 1-2 weeks, then migrate to self-hosted once the use case is proven.

What this means in practice

A voice assistant is a chain of three components that pass data between each other in a few seconds:

  1. Speech-to-text (STT): the user speaks, the STT component transcribes the audio into text. This is the foundation – if STT mis-hears the user, nothing downstream can recover. Quality depends on the language, the model's ability to process the accent, the background noise and any domain-specific vocabulary.

  2. Language model (LLM): the transcribed text goes to a language model, which understands the intent and generates a written response. This is the same family of technology that powers ChatGPT or Claude. Crucially the LLM is what makes the assistant a conversation partner rather than a search box – it can carry context across turns, refine its answer when corrected, and call tools.

  3. Text-to-speech (TTS): the written response goes to a TTS component, which generates the spoken audio the user hears. Quality here is what makes the assistant sound credible (or robotic). For projects in healthcare, accessibility or public services, voice naturalness directly affects whether the system is trusted.
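
In code, the chain is simply three functions composed in sequence. A minimal sketch with placeholder implementations – in a real system each stage would call a provider API (cloud path) or a local model (self-hosted); the function names here are illustrative, not any vendor's API:

```python
# Minimal voice-assistant pipeline: STT -> LLM -> TTS.
# All three stages are stand-ins; swap each for a real provider call.

def transcribe(audio: bytes) -> str:
    """STT stage: audio in, text out (placeholder)."""
    return audio.decode("utf-8")  # stand-in: pretend the audio is already text

def respond(text: str, history: list[str]) -> str:
    """LLM stage: carries context across turns (placeholder)."""
    history.append(text)
    return f"You said: {text}"

def synthesize(text: str) -> bytes:
    """TTS stage: text in, audio out (placeholder)."""
    return text.encode("utf-8")

def handle_turn(audio: bytes, history: list[str]) -> bytes:
    """One conversational turn through the full chain."""
    return synthesize(respond(transcribe(audio), history))

history: list[str] = []
reply = handle_turn(b"book me a call", history)
```

Because each stage is just a callable with a narrow contract, swapping a cloud STT for a self-hosted one is a one-line change – which is what makes the mix-and-match and validate-then-migrate strategies practical.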

Each of those components can run in the cloud (you call an API and the audio leaves your infrastructure) or self-hosted (the component runs on your own server and the audio never leaves). You can mix and match: cloud STT + self-hosted LLM, or any other combination. But every cloud component is an API call, which generally boils down to cost per token (or cost per minute in the case of phone agents) and an external data flow. Every self-hosted component means upfront infrastructure work but no per-call cost and no audio leaving your control.

Streaming audio is the one ingredient you cannot skip. A 2-way conversational interaction needs the system to stream audio continuously, not pass complete utterances back and forth. Without streaming, the conversation feels like a slow turn-taking exercise. With streaming, the agent can adjust to the user's conversational flow, handle interruptions, pauses and changes in pace naturally. Build streaming in from day one, not as a retrofit.
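
The difference between turn-based and streaming handling can be sketched with async generators: instead of waiting for a complete utterance, each stage consumes and yields chunks as they arrive. A toy illustration (chunks are strings for readability; a real pipeline passes audio frames):

```python
import asyncio

async def mic_chunks():
    """Stand-in for a microphone stream delivering small chunks."""
    for chunk in ["how ", "do I ", "book ", "a call?"]:
        await asyncio.sleep(0)  # yield control, as a real audio source would
        yield chunk

async def streaming_stt(chunks):
    """Emits partial transcripts as chunks arrive, not one final result."""
    partial = ""
    async for chunk in chunks:
        partial += chunk
        yield partial  # downstream stages can react before the user finishes

async def main():
    partials = []
    async for transcript in streaming_stt(mic_chunks()):
        partials.append(transcript)
    return partials

partials = asyncio.run(main())
```

The point of the sketch: every stage sees intermediate state, so the LLM can start reasoning and the TTS can start speaking before the utterance ends – retrofitting this onto a request/response design means rewriting every interface.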

Key components

STT – speech to text

  • Parakeet, Whisper, Moonshine for self-hosted
  • Cloud STT for fast prototyping
  • Quality depends on the model's ability to process the language, accent and domain vocabulary

LLM – language model

  • GPT-4 / GPT-4o via Microsoft AI Foundry for the cloud path
  • Self-hosted Llama, Mistral or Gemma via Ollama for the sovereign path
  • Combined with a live streaming server, enables context-aware turn-taking

TTS – text to speech

  • Cloud TTS like ElevenLabs for the most natural voice quality today
  • Self-hosted Kokoro, Piper or Coqui for sovereignty (still closing the quality gap)
  • Voice quality directly affects trust in many domains, especially healthcare, accessibility and public-service contexts

Outcomes

Cloud path: 1-2 weeks to MVP

Get API keys, embed a widget and ship a working prototype fast – ideal for validating the concept

Self-hosted: full sovereignty

Everything runs on your infrastructure, the audio never leaves, and there is no per-conversation cost

Hybrid path: validate then migrate

Most projects start with cloud, prove the use case, then migrate to self-hosted before going to scale

Real-time conversation

Turn-taking, interruption handling and natural pause timing are architecture decisions, not features you bolt on later

Tool calling and action

A voice assistant that can actually do things – book a call, query a database, trigger a workflow – via MCP or the LiveKit Agent Framework
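
The tool-calling layer reduces to a registry plus a dispatcher: the LLM decides which tool to call and with what arguments, and the agent executes it. A minimal sketch – the tool names and the call format here are illustrative, not the MCP wire protocol:

```python
# Minimal tool-calling layer: the LLM picks a tool and arguments,
# the dispatcher runs it. Tool bodies are placeholders.

TOOLS = {}

def tool(fn):
    """Register a function so the assistant can call it by name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def book_call(name: str, slot: str) -> str:
    return f"Call booked for {name} at {slot}"

@tool
def query_database(sql: str) -> str:
    return f"(would run: {sql})"

def dispatch(call: dict) -> str:
    """Execute a tool call of the form {'name': ..., 'args': {...}}."""
    fn = TOOLS[call["name"]]
    return fn(**call["args"])

result = dispatch({"name": "book_call",
                   "args": {"name": "Alex", "slot": "Tuesday 10:00"}})
```

Frameworks like MCP or the LiveKit Agent Framework standardise exactly this contract – how tools are described to the model and how calls come back – so your tools stay the same whichever path you pick.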

Want to talk it through? Book a call – free of charge, full of value.

How it works

1. Decide cloud vs self-hosted

  • Map data sensitivity, expected scale and compliance requirements
  • Model the cost over 12-24 months, not just the first month
  • Decide whether real-time interruption is essential or whether request/response is enough
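
The cost-modelling step can be made concrete with a back-of-envelope comparison. The numbers below are placeholders, not quotes – substitute your provider's actual per-minute rate and your real hosting costs:

```python
def cumulative_cost(months, fixed_setup, monthly_fixed,
                    per_minute, minutes_per_month):
    """Total cost after `months` for one deployment option."""
    return fixed_setup + months * (monthly_fixed + per_minute * minutes_per_month)

# Placeholder figures -- replace with real quotes before deciding.
minutes = 10_000  # conversation minutes per month

cloud = cumulative_cost(24, fixed_setup=0, monthly_fixed=0,
                        per_minute=0.15, minutes_per_month=minutes)
self_hosted = cumulative_cost(24, fixed_setup=15_000, monthly_fixed=400,
                              per_minute=0.0, minutes_per_month=minutes)
```

At this volume the cloud path costs more over 24 months; at low volume the ranking flips – which is exactly why you model 12-24 months, not the first month.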

2. Build the working agent

  • Cloud path: pick STT/LLM/TTS providers, embed a widget, ship in 1-2 weeks
  • Self-hosted path: deploy n8n + Whisper + Ollama + Piper on Docker, integrate with your website
  • Test in a real acoustic environment with real users, not in a quiet lab
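
The self-hosted bullet can be sketched as a Compose file. Image names, ports and flags below are illustrative assumptions – check each project's own documentation for current images and configuration before deploying:

```yaml
# Sketch of a self-hosted voice stack. Images and ports are assumptions
# for illustration; verify against each project's docs.
services:
  n8n:
    image: n8nio/n8n                 # workflow engine orchestrating the pipeline
    ports: ["5678:5678"]
  whisper:
    image: onerahmet/openai-whisper-asr-webservice   # community STT image
    ports: ["9000:9000"]
  ollama:
    image: ollama/ollama             # local LLM runtime (Llama, Mistral, Gemma)
    ports: ["11434:11434"]
    volumes: ["ollama:/root/.ollama"]
  piper:
    image: rhasspy/wyoming-piper     # TTS server
    command: --voice en_US-lessac-medium
    ports: ["10200:10200"]
volumes:
  ollama:
```

Nothing in this stack makes an external API call, which is the whole point of the sovereign path: the audio, transcripts and model weights all stay on your machines.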

3. Operate and scale

  • Monitor accuracy, latency and conversation quality
  • Tune voices, prompts and tool calling against actual conversations
  • Migrate from cloud to self-hosted once the use case is validated and scale starts to bite

Why N3XTCODER

We bring a decade of impact-tech experience and over 160 AI projects since 2019. Through our free AI for Impact course, more than 100,000 people have learned how to use AI for the common good. We do not run inspiration days. We run scoping sessions and build engagements that ship, the way we have delivered AI for the organisations below:

  • Mother Earth AI – self-hosted voice agent for climate communication, K3-Preis 2023 winner, used in museums and on "Mutter Erde Telefon" Raspberry Pi installations
  • A leading member network – production retrieval-augmented generation (RAG) chatbot serving 1,000+ HumHub members on n8n + Qdrant + GPT-4 via Microsoft EU, delivered in four sprints
  • GDV (German Insurers Association) – AI Knowledge Assistant over tens of thousands of policy documents for 400+ member companies, on Azure AI Search + GPT-4o via Microsoft AI Foundry. Halved research time, prevented shadow AI use, increased internal employee satisfaction
  • A leading German association – AI Member Platform ("Association GPT") combining chat-based discovery with traditional category filters, on Microsoft AI Foundry + pgvector
  • A leading donation platform – AI email agent classifying enquiries and drafting replies with mandatory human review, currently in pilot, on n8n and Azure OpenAI
  • Default stack: n8n in Berlin, Qdrant or pgvector for vector search, Azure OpenAI / GPT-4o via Microsoft AI Foundry, plus open-source EU alternatives like Mistral, Milvus and self-hosted Ollama / Whisper / Piper for sovereign deployments.

Honest constraints

Cloud TTS still beats open-source TTS on naturalness. ElevenLabs and similar commercial services produce the most lifelike voices today. Open-source tools like Piper and Coqui are catching up fast, but if voice quality is the hill to die on, the cloud path makes more sense.

Open-source voice AI in less common languages is uneven. German, English and French are well-supported by both Whisper and Piper. Smaller European languages and dialects are patchier. Test with your actual user group before committing.

Voice models eat energy. Voice pipelines are heavier than text pipelines. For projects with a real carbon constraint – like Mother Earth AI – this shapes the architecture choice from day one.

Real-time interruption requires WebSocket-style architecture. Request/response systems are easier to build but feel sluggish. If your use case needs natural turn-taking, design it in from the start.
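
Interruption handling in practice means the assistant's speech output must be cancellable mid-utterance. A minimal asyncio sketch of barge-in – speaking runs as a task that the listener cancels the moment new user speech is detected (the timing here is simulated; a real system would drive this from voice-activity detection on the audio stream):

```python
import asyncio

async def speak(text: str, spoken: list[str]):
    """Streams the reply word by word; cancellable at any point."""
    for word in text.split():
        spoken.append(word)
        await asyncio.sleep(0.05)  # stand-in for streaming one audio chunk

async def conversation():
    spoken: list[str] = []
    task = asyncio.create_task(
        speak("let me explain all ten options in detail", spoken))
    await asyncio.sleep(0.12)  # simulated: the user starts talking here
    task.cancel()              # barge-in: stop TTS immediately
    try:
        await task
    except asyncio.CancelledError:
        pass                   # expected: playback was interrupted
    return spoken

spoken = asyncio.run(conversation())
```

In a request/response design there is no running task to cancel – the reply is delivered whole – which is why interruption support has to be an architecture decision, not a patch.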

Build your voice assistant with N3XTCODER

Tell us about the use case, the language, the audience and the constraints. We will reply with a proposed architecture and a date, usually within a working day.

Simon Stegemann
Co-Founder and CEO
