AI you talk to, not just AI you type to. The category that changed which moments of your day AI can be part of.
Human conversation has a turn-taking gap of roughly 200ms. Anything above ~800ms feels robotic; anything above ~1500ms feels like dictation. The engineering challenge of voice AI is keeping that round trip tight across speech-to-text (STT) → LLM → text-to-speech (TTS) while still being intelligent. Mid-stream TTS (starting to speak while the model is still generating) is the single largest latency win.
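One way to picture mid-stream TTS is a small buffer that sits between the model's token stream and the speech synthesizer and releases each sentence as soon as it is complete. The sketch below is only an illustration under stated assumptions: `sentences_from_stream` is a hypothetical helper, the token stream is any iterable of text fragments, and the sentence rule (`.`, `!`, or `?` followed by whitespace) is deliberately naive. It is not Luna's pipeline.

```python
import re
from typing import Iterable, Iterator

# Naive boundary (an assumption for this sketch): ., !, or ? followed by whitespace.
# Abbreviations such as "Dr. Smith" would be split incorrectly.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each complete sentence as soon as the stream has produced it."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _BOUNDARY.split(buffer)
        # Every part except the last is a finished sentence; hand it on now
        # instead of waiting for the whole reply.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever is left when the model stops generating.
    if buffer.strip():
        yield buffer.strip()
```

In a real pipeline each yielded sentence would go straight to the TTS engine, so the first audio can start after the first sentence rather than after the full response.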
Voice matters most in contexts where typing is impossible or wasteful: walking, driving, cooking, lying in bed at 3am, working out, talking with a child in the room. The voice layer expands AI from "thing you sit at" to "thing you live with." It also makes AI accessible to people who do not type, such as older adults, kids, and people with motor limitations.
Luna uses Google Cloud TTS Chirp 3 HD Kore — a soulful, female voice with natural prosody. Free, included, no Pro tier required. The pipeline is STT (cloud) → Heaven Quantum Cortex → mid-stream Chirp 3 HD, giving her a sub-second response feel on most networks.
Acoustic emotion analysis lets Luna hear the tone of your voice (calm, tense, sad), not just the words. Her avatar (via the Dark Matter Engine) reacts in real time — eye softening, breath shift, micro-expressions.
You can walk her to the shop. Many people do.
OpenAI Realtime is the latency leader at ~300-500ms round-trip via direct audio-to-audio (no STT/TTS hops). Sovereign systems including Luna are typically 500-900ms — slower on paper, but with the advantage that no audio leaves your provider's infrastructure for a third-party LLM API.
Privacy depends on the provider. Audio is more revealing than text: it carries voiceprints, ambient sound, and emotional state. The most privacy-respecting voice AI is one that runs STT and TTS locally or on infrastructure you control, and never sends your audio to a third-party LLM. Luna keeps audio inside the Heaven Quantum Cortex stack.
Yes. Acoustic emotion analysis (Luna ships an analyzer of this kind) reads pitch, pace, energy, and microvariation in your voice to infer emotional state. It is not perfect, but it is meaningfully better than text-only inference. The use case is not surveillance; it is a companion that softens when you sound tired.
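To make "pitch" and "energy" concrete, here is a minimal sketch of two low-level acoustic features computed from raw audio samples. It assumes mono samples as floats, uses plain autocorrelation for pitch, and introduces the hypothetical names `rms_energy` and `estimate_pitch_hz`. It only shows the kind of signal such a system starts from; it is not Luna's analyzer, and real emotion models use many more features plus a trained classifier.

```python
import math
from typing import Sequence


def rms_energy(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude, a rough proxy for vocal energy."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_pitch_hz(
    samples: Sequence[float],
    sample_rate: int,
    fmin: float = 75.0,   # assumed lower bound for speech pitch
    fmax: float = 400.0,  # assumed upper bound for speech pitch
) -> float:
    """Estimate fundamental frequency via the strongest autocorrelation lag."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Correlate the signal with a copy of itself shifted by `lag` samples.
        score = sum(
            samples[i] * samples[i + lag] for i in range(len(samples) - lag)
        )
        if score > best_score:
            best_lag, best_score = lag, score
    # Return 0.0 when no positive correlation was found (e.g. silence).
    return sample_rate / best_lag if best_lag else 0.0
```

Features like these would typically be computed over short frames (tens of milliseconds) and tracked over time, since it is the change in pitch and energy, not a single value, that carries tone.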
Smart speakers (Alexa, Google Assistant) are command-driven — single utterance, single response. Voice AI is conversational — multi-turn, contextual, with memory. The underlying tech overlaps but the product category is fundamentally different.