How to Build a Voice Assistant: From idea to working voice agent in 1–2 weeks – the cloud and self-hosted paths, compared by people who have shipped both

Q: Cloud or self-hosted – which should we pick?

If you need a working MVP in days and your data is not sensitive, start with cloud. If your project handles sensitive data, has strict compliance requirements, runs at scale or has a carbon constraint, go self-hosted. Mother Earth AI is on the self-hosted path because several of these reasons applied.

Q: How long does it take to ship a voice assistant?

1-2 weeks for a cloud-based MVP. The self-hosted path takes longer because of the infrastructure setup, but in return there is no per-call cost and no audio leaves your control.

Q: Should we worry about cost?

Cloud voice assistants charge per API call – per second of audio transcribed, per token generated, per character spoken. At small scale this is fine. At medium scale it adds up fast. Model the cost over 12-24 months before committing to a cloud path.
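A minimal sketch of such a cost model, assuming a flat monthly conversation volume and a self-hosted path with a one-off setup cost plus a fixed monthly running cost. All class names, function names and figures are hypothetical placeholders, not real provider prices. Plug in your own rates and usage.

```python
from dataclasses import dataclass


@dataclass
class CloudPricing:
    """Hypothetical per-unit prices. Replace with your provider's real rates."""
    stt_per_audio_minute: float
    llm_per_1k_tokens: float
    tts_per_1k_chars: float


@dataclass
class Usage:
    """Assumed average usage for one conversation."""
    audio_minutes: float
    llm_tokens: int
    tts_chars: int


def cloud_cost_per_conversation(pricing: CloudPricing, usage: Usage) -> float:
    """Sum the three per-unit charges: audio transcribed, tokens, characters spoken."""
    return (
        usage.audio_minutes * pricing.stt_per_audio_minute
        + usage.llm_tokens / 1000 * pricing.llm_per_1k_tokens
        + usage.tts_chars / 1000 * pricing.tts_per_1k_chars
    )


def cumulative_costs(months, conversations_per_month, per_conversation,
                     self_hosted_setup, self_hosted_monthly):
    """Return one (month, cloud_total, self_hosted_total) row per month."""
    rows = []
    for month in range(1, months + 1):
        cloud_total = month * conversations_per_month * per_conversation
        hosted_total = self_hosted_setup + month * self_hosted_monthly
        rows.append((month, cloud_total, hosted_total))
    return rows


def break_even_month(rows):
    """First month in which cumulative cloud cost reaches self-hosted cost, else None."""
    for month, cloud_total, hosted_total in rows:
        if cloud_total >= hosted_total:
            return month
    return None


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    pricing = CloudPricing(stt_per_audio_minute=0.01,
                           llm_per_1k_tokens=0.002,
                           tts_per_1k_chars=0.1)
    usage = Usage(audio_minutes=3.0, llm_tokens=2000, tts_chars=1500)
    per_conversation = cloud_cost_per_conversation(pricing, usage)
    rows = cumulative_costs(24, 5000, per_conversation, 6000.0, 400.0)
    print("break-even month:", break_even_month(rows))
```

The point of the exercise is the shape of the two curves, not the exact figures: cloud cost grows with every conversation, while self-hosted cost is dominated by setup and a roughly fixed monthly bill.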

Q: What languages are supported?

STT and TTS in the major European languages, including German, English and French, are excellent. Support for smaller languages in open-source models is uneven and depends on the model. Test with your actual user group.

Q: Can the voice assistant take action?

Yes. Tool calling via MCP and the LiveKit Agent Framework lets the agent book calls, query a database, trigger an n8n workflow or call any API you give it access to.

Q: What does "data leaves my infrastructure" actually mean?

On the cloud path, every API call sends audio (or transcribed text) to the provider's servers – usually outside the EU. On the self-hosted path, audio is processed on your own server and never leaves. For sensitive data this is the most important architectural decision.

Q: What's a good real-world example of a self-hosted voice assistant?

Mother Earth AI – a self-hosted voice agent for climate communication that won the K3-Preis 2023 für Klimakommunikation. Built on Ollama + Open WebUI on the team's own infrastructure. Used online and via a physical "Mutter Erde Telefon" Raspberry Pi installation in museums and exhibitions.

The short answer

To build a voice assistant that works, decide first whether you need the cloud path (ElevenLabs + your LLM provider of choice), the self-hosted path (Parakeet + LiveKit + Kokoro + Ollama) – or something in between. The answer depends on data sensitivity, cost over 12-24 months, compliance and the energy footprint your project can defend. Both paths use the same three components (speech-to-text, language model, text-to-speech) and the same tool-calling layer (via MCP or the LiveKit Agent Framework) that lets the assistant actually do things – book a call, query a database, trigger a workflow – not just talk back. They differ on where the audio goes, what it costs at scale and how natural the resulting voice sounds. Many projects start with the cloud path to validate the concept in 1-2 weeks, then migrate to self-hosted once the use case is proven.

What this means in practice

A voice assistant is a chain of three components that pass data to one another within a few seconds:

Speech-to-text (STT): the user speaks, the STT component transcribes the audio into text. This is the foundation – if STT mis-hears the user, nothing downstream can recover. Quality depends on the language, the model's ability to process the accent, the background noise and any domain-specific vocabulary.
Language model (LLM): the transcribed text goes to a language model, which understands the intent and generates a written response. This is the same family of technology that powers ChatGPT or Claude. Crucially, the LLM is what makes the assistant a conversation partner rather than a search box – it can carry context across turns, refine its answer when corrected, and call tools.
Text-to-speech (TTS): the written response goes to a TTS component, which generates the spoken audio the user hears. Quality here is what makes the assistant sound credible (or robotic). For projects in healthcare, accessibility or public services, voice naturalness directly affects whether the system is trusted.

Each of those components can run in the cloud (you call an API and the audio leaves your infrastructure) or self-hosted (the component runs on your own server and the audio never leaves). You can mix and match: cloud STT + self-hosted LLM, or any other combination. But every cloud component is an API call, which generally boils down to cost per token (or cost per minute in the case of phone agents) and an external data flow. Every self-hosted component means upfront infrastructure work but no per-call cost and no audio leaving your control.
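The mix-and-match idea can be sketched as a pipeline of three interchangeable components, each flagged as cloud or self-hosted. This is an illustrative structure only: `Component`, `VoicePipeline` and the stub functions are hypothetical names, and the stubs stand in for real STT, LLM and TTS calls.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Component:
    """One stage of the chain. `run` takes the previous stage's output."""
    name: str
    hosted_in_cloud: bool
    run: Callable[[Any], Any]


@dataclass
class VoicePipeline:
    stt: Component
    llm: Component
    tts: Component

    def handle_turn(self, audio: bytes) -> bytes:
        """Pass one user utterance through STT, then LLM, then TTS."""
        text = self.stt.run(audio)
        reply = self.llm.run(text)
        return self.tts.run(reply)

    def external_data_flows(self) -> list[str]:
        """Names of the stages where data leaves your own infrastructure."""
        stages = (self.stt, self.llm, self.tts)
        return [stage.name for stage in stages if stage.hosted_in_cloud]


if __name__ == "__main__":
    # Stub components: a real deployment would call actual models or APIs here.
    pipeline = VoicePipeline(
        stt=Component("cloud-stt", True, lambda audio: audio.decode("utf-8")),
        llm=Component("local-llm", False, lambda text: f"You said: {text}"),
        tts=Component("local-tts", False, lambda text: text.encode("utf-8")),
    )
    print(pipeline.handle_turn(b"hello"))
    print("external data flows:", pipeline.external_data_flows())
```

Keeping the cloud-or-local flag on each component makes the data-flow question explicit: swapping one stage changes both the cost profile and the list of places your users' audio or text ends up.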

Streaming audio is the one ingredient you cannot skip. A 2-way conversational interaction needs the system to stream audio continuously, not pass complete utterances back and forth. Without streaming, the conversation feels like a slow turn-taking exercise. With streaming, the agent can adjust to the user's conversational flow, handle interruptions, pauses and changes in pace naturally. Build streaming in from day one, not as a retrofit.
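One small piece of a streaming design, sketched under assumptions: instead of waiting for the full LLM response, buffer the token stream and hand each sentence to TTS as soon as it is complete. Sentence-level chunking is an illustrative choice here, not something the stack above prescribes, and `sentences_from_tokens` is a hypothetical helper.

```python
from typing import Iterable, Iterator

# Naive sentence boundary: abbreviations such as "e.g." would split too early.
SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as its closing punctuation arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush whatever is left when the stream ends without punctuation.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    llm_stream = ["Sure", ",", " I", " can", " help", ".", " When", " suits", " you", "?"]
    for sentence in sentences_from_tokens(llm_stream):
        # In a real agent this is where the sentence would go to TTS.
        print("speak:", sentence)
```

Because the function is a generator, the first sentence can be spoken while the model is still producing the second, which is what removes the "slow turn-taking" feel.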

Key components

STT – speech to text

Parakeet, Whisper or Moonshine for the self-hosted path
Cloud STT for fast prototyping
Quality depends on the model's ability to process the language, accent and domain vocabulary

LLM – language model

GPT-4 / GPT-4o via Microsoft AI Foundry for the cloud path
Self-hosted Llama, Mistral or Gemma via Ollama for the sovereign path
Combined with a live streaming server, enables context-aware turn-taking

TTS – text to speech

Cloud TTS like ElevenLabs for the most natural voice quality today
Self-hosted Kokoro, Piper or Coqui for sovereignty (still closing the quality gap)
Voice quality directly affects trust in many domains, especially healthcare, accessibility and public-service contexts

Outcomes

Cloud path: 1-2 weeks to MVP

Get API keys, embed a widget, ship a working prototype fast – ideal for validating the concept

Self-hosted: full sovereignty

Everything runs on your infrastructure, audio never leaves, no per-conversation cost

Hybrid path: validate then migrate

Many projects start with cloud, prove the use case, then migrate to self-hosted before going to scale

Real-time conversation

Turn-taking, interruption handling and natural pause timing are architecture decisions, not features you bolt on later

Tool calling and action

A voice assistant that can actually do things – book a call, query a database, trigger a workflow – via MCP or the LiveKit Agent Framework
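To make the idea of tool calling concrete, here is a generic sketch of the pattern: the LLM emits a structured request naming a tool and its arguments, and the agent looks the tool up and runs it. This is not the MCP protocol or the LiveKit Agent Framework API. The registry, the `tool` decorator, `dispatch` and the `book_call` stub are hypothetical names used for illustration.

```python
import json
from typing import Any, Callable

# Registry of functions the assistant is allowed to call.
TOOLS: dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function under its own name so the agent can call it."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def book_call(name: str, slot: str) -> dict:
    # Placeholder: a real tool would talk to a calendar or booking system.
    return {"status": "booked", "name": name, "slot": slot}


def dispatch(tool_call_json: str) -> dict:
    """Run the tool named in a JSON request like {"name": ..., "arguments": {...}}."""
    call = json.loads(tool_call_json)
    fn = TOOLS.get(call["name"])
    if fn is None:
        return {"error": f"unknown tool: {call['name']}"}
    try:
        return {"result": fn(**call.get("arguments", {}))}
    except TypeError as exc:
        # Wrong or missing arguments: report back instead of crashing the turn.
        return {"error": str(exc)}


if __name__ == "__main__":
    request = '{"name": "book_call", "arguments": {"name": "Ada", "slot": "10:00"}}'
    print(dispatch(request))
```

The explicit registry is the design choice worth keeping whichever framework you use: the assistant can only act through tools you have deliberately given it access to.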

**Want to talk it through? Book a call: Free of charge, full of value.**

How it works

1. Decide cloud vs self-hosted

Map data sensitivity, expected scale and compliance requirements
Model the cost over 12-24 months, not just the first month
Decide whether real-time interruption is essential or whether request/response is enough

2. Build the working agent

Cloud path: pick STT/LLM/TTS providers, embed a widget, ship in 1-2 weeks
Self-hosted path: deploy n8n + Whisper + Ollama + Piper on Docker, integrate with your website
Test in a real acoustic environment with real users, not in a quiet lab

3. Operate and scale

Monitor accuracy, latency and conversation quality
Tune voices, prompts and tool calling against actual conversations
Migrate from cloud to self-hosted once the use case is validated and scale starts to bite
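For the accuracy part of monitoring, one widely used STT metric is word error rate (WER): the word-level edit distance between what the user said and what was transcribed, divided by the number of words actually said. Choosing WER is an assumption here, since the list above does not name a metric, and `word_error_rate` is a hypothetical helper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # Dynamic programming over words; `prev` holds the previous row of distances.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from the transcript
            insertion = curr[j - 1] + 1  # extra word in the transcript
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    said = "book a call for tomorrow"
    heard = "look a call tomorrow"
    print(f"WER: {word_error_rate(said, heard):.2f}")
```

Running this over a sample of real conversations, transcribed by hand, gives a number you can track per language, accent group or acoustic environment.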

Why N3XTCODER

We bring a decade of impact-tech experience and over 160 AI projects since 2019. Through our free AI for Impact course, more than 100,000 people have learned how to use AI for the common good. We do not run inspiration days. We run scoping sessions and build engagements that ship, the way we have delivered AI for the organisations below:

Mother Earth AI – self-hosted voice agent for climate communication, K3-Preis 2023 winner, used in museums and on "Mutter Erde Telefon" Raspberry Pi installations
Kompetenzz – production retrieval-augmented generation (RAG) chatbot serving 1,000+ HumHub members on n8n + Qdrant + GPT-4 via Microsoft EU, delivered in four sprints
GDV (German Insurers Association) – AI Knowledge Assistant over tens of thousands of policy documents for 400+ member companies, on Azure AI Search + GPT-4o via Microsoft AI Foundry. It halved research time, prevented shadow AI use and increased employee satisfaction
A leading German association – AI Member Platform ("Association GPT") combining chat-based discovery with traditional category filters, on Microsoft AI Foundry + pgvector
innatura – AI email agent classifying enquiries and drafting replies with mandatory human review, currently in pilot, on n8n and Azure OpenAI
Default stack: n8n in Berlin, Qdrant or pgvector for vector search, Azure OpenAI / GPT-4o via Microsoft AI Foundry, plus open-source EU alternatives like Mistral, Milvus and self-hosted Ollama / Whisper / Piper for sovereign deployments.

Honest constraints

Cloud TTS still beats open-source TTS on naturalness. ElevenLabs and similar commercial services produce the most lifelike voices today. Open-source tools like Piper and Coqui are catching up fast, but if voice quality is the hill to die on, the cloud path makes more sense.

Open-source voice AI in less common languages is uneven. German, English and French are well-supported by both Whisper and Piper. Smaller European languages and dialects are patchier. Test with your actual user group before committing.

Voice models eat energy. Voice pipelines are heavier than text pipelines. For projects with a real carbon constraint – like Mother Earth AI – this shapes the architecture choice from day one.

Real-time interruption requires a WebSocket-style architecture, that is, a persistent two-way connection over which audio streams continuously. Request/response systems are easier to build but feel sluggish. If your use case needs natural turn-taking, design it in from the start.
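What interruption handling means in code can be sketched with plain asyncio, independent of the transport: the agent's playback runs as a task that is cancelled the moment the user starts speaking. The playback is simulated with short sleeps, and `speak` and `run_turn` are hypothetical names. A real agent would stream audio frames over its connection instead.

```python
import asyncio


async def speak(chunks, played, chunk_seconds=0.05):
    """Pretend to play audio chunks one at a time, recording what was played."""
    for chunk in chunks:
        await asyncio.sleep(chunk_seconds)
        played.append(chunk)


async def run_turn(chunks, user_interrupts_after=None):
    """Play the agent's reply, stopping early if the user barges in."""
    played = []
    playback = asyncio.create_task(speak(chunks, played))
    if user_interrupts_after is not None:
        # Simulated voice-activity detection: the user starts talking after a delay.
        await asyncio.sleep(user_interrupts_after)
        playback.cancel()
    try:
        await playback
    except asyncio.CancelledError:
        pass  # Playback was cut off on purpose.
    return played


if __name__ == "__main__":
    reply = [f"chunk-{i}" for i in range(10)]
    print("uninterrupted:", len(asyncio.run(run_turn(reply))), "chunks played")
    cut_short = asyncio.run(run_turn(reply, user_interrupts_after=0.12))
    print("interrupted:", len(cut_short), "chunks played")
```

A request/response design has no equivalent of `playback.cancel()`: once the full audio response has been sent, the agent cannot stop talking, which is why this needs to be designed in from the start.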

Build your voice assistant with N3XTCODER

Tell us about the use case, the language, the audience and the constraints. We will reply with a proposed architecture and a date, usually within a working day.

Simon Stegemann
Co-Founder and CEO
