AI Voice Agents: Voice agents with natural conversation flow that take action and run on infrastructure you control

AI Voice Agents from N3XTCODER

What an AI voice agent from N3XTCODER actually is

An AI voice agent is a real-time spoken interface that talks to your users, takes action for them and runs on infrastructure you control. We build voice agents on two paths: a fast cloud path (ElevenLabs / Azure AI Foundry) for prototypes that need to ship in days, and a fully self-hosted open-source path (n8n + Whisper + Ollama + Piper) when sovereignty, cost or carbon constraints rule out hyperscalers. The second path is the one we used for Mother Earth AI – the self-hosted voice agent for climate communication that won the K3-Preis 2023.

What this means in practice

Mother Earth AI is the clearest worked example. The project gives planet Earth a literal voice for climate communication. The team's constraint: the system could not drive carbon emissions by running on hyperscale AI providers. Sovereignty, autonomy and carbon independence were non-negotiable.

We collaborated with the Mother Earth team on a fully self-hosted voice agent that runs on Ollama as the LLM platform and Open WebUI as the interface, with all components on the team's own infrastructure. The voice agent now serves two surfaces: the public website at mother-earth.ai, and a physical "Mutter Erde Telefon" – a Raspberry Pi-based phone installation that travels to museums, exhibitions and climate events, where visitors can pick up the receiver and have a spoken conversation with Mother Earth without an app or screen. The project won the K3-Preis 2023 für Klimakommunikation.
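
To make the self-hosted path concrete, here is a minimal sketch of a single conversational turn through that kind of stack: Whisper transcribes the caller, a local model served by Ollama generates the reply, and Piper speaks it. The model names, the Piper voice file and the file paths are placeholder assumptions, and the real deployment streams audio and orchestrates these steps through n8n and Open WebUI rather than a script like this.

```python
# Minimal sketch of one self-hosted voice turn: Whisper (STT) -> Ollama (LLM) -> Piper (TTS).
# Assumes openai-whisper, a local Ollama server on its default port and the piper CLI are
# installed; model names and file paths below are placeholders, not the production setup.
import subprocess
import requests
import whisper

STT_MODEL = whisper.load_model("small")          # any Whisper checkpoint works
OLLAMA_URL = "http://localhost:11434/api/chat"   # default Ollama chat endpoint
PIPER_VOICE = "de_DE-thorsten-medium.onnx"       # placeholder Piper voice file

def handle_turn(audio_path: str, history: list[dict]) -> str:
    # 1. Speech-to-text on the caller's utterance
    text = STT_MODEL.transcribe(audio_path)["text"].strip()
    history.append({"role": "user", "content": text})

    # 2. Generate a reply with a local model via the Ollama chat API
    response = requests.post(
        OLLAMA_URL,
        json={"model": "llama3.1", "messages": history, "stream": False},
        timeout=120,
    )
    reply = response.json()["message"]["content"]
    history.append({"role": "assistant", "content": reply})

    # 3. Text-to-speech with the Piper CLI, writing a WAV file for playback
    subprocess.run(
        ["piper", "--model", PIPER_VOICE, "--output_file", "reply.wav"],
        input=reply.encode("utf-8"),
        check=True,
    )
    return "reply.wav"
```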

Most projects start on the cloud path – ElevenLabs for natural voice quality plus Azure OpenAI / GPT-4o via Microsoft AI Foundry for the language model. The cloud path ships an MVP in 1-2 weeks. Mother Earth AI is the counter-example we reach for when the project has a hard sovereignty, cost or carbon constraint that rules out hyperscalers – self-hosted takes longer to set up but gives you full control over data, cost and energy footprint. Learn more in our voice assistant guide.

Key components

Real-time conversation

  • Turn-taking, interruption handling and natural pause timing (see the barge-in sketch after this list)
  • Live streaming architecture with LiveKit and Fast RTC where appropriate
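
By way of illustration, here is a stripped-down sketch of the barge-in pattern behind interruption handling: the agent's playback runs in one thread, a listener watches for the caller's voice, and a shared stop flag cuts playback off mid-sentence. Audio capture, playback and voice activity detection are stubbed placeholders; in production this runs over a streaming transport such as LiveKit rather than a polling loop.

```python
# Barge-in sketch: stop the agent's TTS playback as soon as the caller starts speaking.
# Playback and voice activity detection are placeholders, not a real audio pipeline.
import threading
import time

stop_speaking = threading.Event()

def play_tts(wav_path: str) -> None:
    # Placeholder playback: "play" the reply in short chunks and check the
    # stop flag between chunks so an interruption takes effect quickly.
    for _ in range(100):                  # pretend the reply is 100 chunks long
        if stop_speaking.is_set():
            print("Caller interrupted, stopping playback")
            return
        time.sleep(0.05)

def user_is_speaking() -> bool:
    # Placeholder for voice activity detection on microphone frames
    return False

def listen_for_barge_in() -> None:
    while not stop_speaking.is_set():
        if user_is_speaking():
            stop_speaking.set()           # cut the agent off mid-sentence
        time.sleep(0.02)

speaker = threading.Thread(target=play_tts, args=("reply.wav",))
listener = threading.Thread(target=listen_for_barge_in, daemon=True)
speaker.start()
listener.start()
speaker.join()
stop_speaking.set()                        # release the listener once playback ends
```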

Two delivery paths

  • Cloud path: ElevenLabs + Azure AI Foundry for fast MVPs (1-2 weeks)
  • Self-hosted path: n8n + Whisper + Ollama + Piper for full sovereignty

Tool calling and action

  • Voice agents that can actually do things: book a call, query a database, trigger a workflow
  • Tool calling and orchestration with MCP and the LiveKit Agent Framework (see the sketch below)
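
The sketch below shows the shape of tool dispatch in plain Python: the model returns a structured tool call (a name plus arguments) and the agent routes it to a real function such as booking a call. The tool names, the ToolCall shape and the stub implementations are illustrative assumptions only; in our builds this wiring goes through MCP servers or the LiveKit Agent Framework rather than hand-rolled code.

```python
# Illustrative tool-dispatch sketch: route a structured tool call from the LLM
# to a real action. Names, argument shapes and stub bodies are placeholders,
# not the MCP or LiveKit APIs.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ToolCall:
    name: str                  # which tool the model asked for
    arguments: dict[str, Any]  # arguments the model filled in

def book_a_call(when: str, email: str) -> str:
    return f"Call booked for {when}, confirmation sent to {email}."

def query_database(question: str) -> str:
    return f"(stub) query result for: {question}"

def trigger_workflow(workflow_id: str) -> str:
    return f"(stub) workflow {workflow_id} triggered."

# Registry of tools the agent is allowed to use
TOOLS: dict[str, Callable[..., str]] = {
    "book_a_call": book_a_call,
    "query_database": query_database,
    "trigger_workflow": trigger_workflow,
}

def dispatch(call: ToolCall) -> str:
    # Refuse anything the model asks for that is not in the registry
    if call.name not in TOOLS:
        return f"Unknown tool: {call.name}"
    return TOOLS[call.name](**call.arguments)

# Example: the model decided to book a call for the user
print(dispatch(ToolCall("book_a_call", {"when": "Tuesday 10:00", "email": "user@example.org"})))
```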

Outcomes

A voice agent that delivers

Delivering self-hosted architectures for >3 years

Time to first MVP

1-2 weeks on the cloud path; longer for self-hosted, with a clear migration path from one to the other

Cutting-edge speech models

Parakeet, Whisper and Moonshine for speech-to-text; Kokoro and Piper for text-to-speech, in your language

Sovereignty by default

Self-hosted Ollama + open-source TTS / STT means audio never leaves your infrastructure if you don't want it to

Carbon-honest

Renewable-energy hosting where possible; transparent about the energy cost of voice models

Want to talk it through? Book a call: Free of charge, full of value.

How it works

1. Use case and architecture

  • Decide cloud vs self-hosted based on data sensitivity, cost over 12-24 months and compliance constraints
  • Pick the right STT, LLM and TTS components for your language and domain
  • Plan tool calling and integration points

2. Build the working agent

  • First MVP in 1-2 weeks on the cloud path
  • Test with real users in a real acoustic environment, not a lab
  • Tune voices, prompts and tool calling against actual conversations

3. Deploy and operate

  • Cloud deployment via Azure or your trusted EU provider
  • Self-hosted deployment via Docker / Kubernetes on your own infrastructure or Ionos
  • Documentation and handover so your team can operate the system

Why N3XTCODER

We bring a decade of impact-tech experience and over 160 AI projects since 2019. Through our free AI for Impact course, more than 100,000 people have learned how to use AI for the common good. We do not run inspiration days. We run scoping sessions and build engagements that ship, as we have for the organisations below:

  • Mother Earth AI – self-hosted voice agent for climate communication, K3-Preis 2023 winner, used in museums and on "Mutter Erde Telefon" Raspberry Pi installations
  • A leading member network – production retrieval-augmented generation (RAG) chatbot serving 1,000+ HumHub members on n8n + Qdrant + GPT-4 via Microsoft EU, delivered in four sprints
  • GDV (German Insurers Association) – AI Knowledge Assistant over tens of thousands of policy documents for 400+ member companies, on Azure AI Search + GPT-4o via Microsoft AI Foundry. Halved research time, prevented shadow AI use, increased internal employee satisfaction
  • A leading German association – AI Member Platform ("Association GPT") combining chat-based discovery with traditional category filters, on Microsoft AI Foundry + pgvector
  • A leading donation platform – AI email agent classifying enquiries and drafting replies with mandatory human review, currently in pilot, on n8n and Azure OpenAI
  • Default stack: n8n in Berlin, Qdrant or pgvector for vector search, Azure OpenAI / GPT-4o via Microsoft AI Foundry, plus open-source EU alternatives like Mistral, Milvus and self-hosted Ollama / Whisper / Piper for sovereign deployments.

Honest constraints

Voice agents fail when they don't allow real-time interruption. Voice-message-style request/response systems are easier to build but frustrate users. If real-time turn-taking matters, it has to be designed in from the start – not bolted on.

Voice quality is not solved. Cloud TTS providers like ElevenLabs still beat open-source TTS like Piper or Coqui on naturalness. Open source is closing the gap fast, but if voice quality is non-negotiable, the cloud path makes more sense.

Multilingual is uneven. Speech recognition and synthesis in major languages are excellent. In smaller languages and dialects they are still patchy. Test with your actual user group before committing.

Voice eats energy. Voice models are heavier than text models. We track and disclose the cost rather than hide it. For projects where carbon honesty matters – like Mother Earth AI – this shapes the architecture choice.


Build an AI voice agent with N3XTCODER

Tell us about the use case and the constraints. We will reply with a proposed architecture and a date, usually within a working day.

Simon Stegemann
Co-Founder and CEO
