    Why real-time voice AI is harder than it sounds

    By InfoForTech, February 20, 2026



    Real-time voice recognition has become so common that many of us now take it for granted. But that convenience is the product of years of deep learning research and products that yielded more frustration than results.

    It turns out that simultaneous voice transcription is one of the hardest engineering problems in modern artificial intelligence, for reasons that have more to do with the foibles of human speech and our lack of tolerance for delay than with the underlying technology.

    Voice is where many AI systems first break down, especially as companies rush to deploy agents in customer-facing environments, said Scott Stephenson, co-founder and chief executive officer of Deepgram Inc., which develops a scalable platform for automatic speech recognition and text-to-speech delivered through an application programming interface.

    Human tolerance has its limits

    “It has to do with real time,” he said. “If people are working with a product that isn’t expected to work in real time, they’ll allow more failures or silent failures.”

    A misfiring chatbot can be retried. A voice assistant that pauses, misunderstands or responds awkwardly annoys the user. Those latency constraints mean “you have to get everything that you need to get done in 500 milliseconds or less,” Stephenson said.

    Unlike text, which is standardized, speech is variable. The same word can sound dramatically different depending on accent, age, language, microphone quality, background noise or even where the speaker is standing. Stephenson called this one of the biggest problems in building robust speech systems.

    Transcription tools have been around for years, but most worked well only with clean audio. Earlier rule-based speech systems were built from layered models, and errors tended to compound from one layer to the next.

    “Each of the models was maybe 80% or 85% accurate,” Stephenson said. “When you stack five of those together, you get down to 50% accuracy.”
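The arithmetic behind that estimate can be checked with a minimal sketch. It assumes the stages run in series and fail independently, which is a simplification and not something the article states; the function name is illustrative only.

```python
def pipeline_accuracy(stage_accuracies):
    """Overall accuracy of stages chained in series.

    Assumption: each stage's errors are independent of the others,
    so the per-stage accuracies simply multiply.
    """
    result = 1.0
    for accuracy in stage_accuracies:
        result *= accuracy
    return result


# Five stacked models at the two per-model accuracies quoted above.
for per_stage in (0.80, 0.85):
    overall = pipeline_accuracy([per_stage] * 5)
    print(f"{per_stage:.0%} per stage x 5 stages -> {overall:.1%} overall")
```

Under this independence assumption the product lands at roughly 33% to 44%, in the same range as the rough 50% figure Stephenson gives. The point is the same either way: a chain of individually respectable models degrades quickly.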

    Deep learning breakthrough

    The breakthrough was end-to-end deep learning, in which models trained directly on massive datasets and inferred the rules themselves.

    But even strong models are only part of the equation. Enterprise voice systems must be deployed like infrastructure, and the needs of business buyers are fundamentally different from those of consumers. “It has to have low latency, it has to have high throughput, it has to be reliable, it has to be debuggable, it has to be adaptable and get better over time,” Stephenson said.

    Deployment options matter too. Many enterprises want voice recognition to run in their own environments for regulatory or privacy reasons. Deepgram delivers its technology using an API-first approach, but Stephenson said the differentiator is not the interface but the ability to deliver consistent performance at scale.

    Measuring quality in voice recognition is more complex than many executives assume, he said. The primary metric for speech-to-text is word error rate, or the percentage of words transcribed incorrectly. “If your word error rate is 25% or less, you can get value,” he said. But perfection is unrealistic: “There really isn’t a zero percent word error rate,” even with humans.
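As an illustration of the metric, word error rate is conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the system's output, divided by the number of words in the reference. The sketch below follows that convention; the function name and the sample sentences are made up for the example and do not come from Deepgram.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: one substituted word and one dropped word.
reference = "please transfer fifty dollars to my savings account"
hypothesis = "please transfer fifteen dollars to savings account"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

Two errors against an eight-word reference give a 25% word error rate, which sits right at the threshold Stephenson mentions. Because insertions count as errors, the metric can exceed 100% when the output is much longer than the reference.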

    Voice generation is even harder to score objectively. Stephenson said it relies heavily on human preference testing with “tens or hundreds of people” across different scenarios.

    The infrastructure burden is growing as voice agents increasingly rely on large language models and tool use behind the scenes. Latency is a physics problem at global scale. Real-time voice systems require regional endpoints because “the Earth is large enough that the speed of light matters,” Stephenson said. That’s why Deepgram is expanding its endpoint network to Europe this year, with Asia on deck.
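A back-of-the-envelope sketch shows why distance alone eats into the response budget. It assumes light in optical fiber travels at about two-thirds of its vacuum speed and that the fiber runs in a straight line; the distances are hypothetical, and real networks add routing, queuing and processing delay on top.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # speed of light in vacuum
FIBER_VELOCITY_FACTOR = 2 / 3          # assumption: typical slowdown in optical fiber
BUDGET_MS = 500                        # response budget quoted in the article


def round_trip_ms(distance_km: float) -> float:
    """Best-case round-trip propagation delay over straight-line fiber."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_PER_S * FIBER_VELOCITY_FACTOR)
    return 2 * one_way_s * 1000


# Hypothetical user-to-endpoint distances: regional, cross-continent, intercontinental.
for distance_km in (500, 5_000, 15_000):
    rtt = round_trip_ms(distance_km)
    print(f"{distance_km:>6} km: {rtt:6.1f} ms round trip "
          f"({rtt / BUDGET_MS:.0%} of the {BUDGET_MS} ms budget)")
```

Even in this best case, a 15,000 km path spends about 150 ms on propagation alone, close to a third of the budget before any model has run. A nearby regional endpoint reduces that share to a few milliseconds.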

    Because of its inherent complexity, voice AI shouldn’t be viewed as an all-or-nothing proposition. Stephenson advised testing in a few scenarios where the vocabulary is limited and expanding from there. “Don’t try to boil the ocean,” he said.

    Voice recognition may be the most natural interface humans have, but making it work reliably in real time requires disciplined engineering, global infrastructure and models trained to survive the chaos of the way people speak.

    Image: SiliconANGLE/Meta AI
