What Happens When a Voice Bot Can't Understand Your Customers

By Navvya Jain | Research & Product Analyst | Use Cases | 15 Jun 2026

TL;DR: Key Takeaways

  • Most voice AI models are trained on clean, studio-quality English audio. Indian contact center calls are not that.
  • When a model has not been trained on real Indian accents, WER (Word Error Rate) climbs to 25–40% on telephony-grade audio, well past the threshold where a voice application stays functional.
  • The problem is not the accent. It is the training data. Models learn what they have seen. If they have not seen Rajasthani Hindi or Odia-inflected English at 8kHz with background noise, they will fail on it.
  • Fixing this means training on spontaneous, real-world speech across India’s full acoustic range, not patching an English or standard Hindi model.
  • Enterprises should test any voice AI system on their own call recordings before deployment. Vendor benchmarks on clean audio will not predict live performance.

A logistics company deployed a voice bot to handle delivery confirmation calls. The bot worked well in testing. It handled Hindi queries with good accuracy in the demo. It launched across operations covering Rajasthan, Odisha, and Chhattisgarh.

Within two weeks, the failure rate on calls from Tier 2 and Tier 3 cities was high enough to escalate. Callers were repeating themselves three and four times. Many dropped off before confirming delivery. Field agents were getting flooded with manual follow-up work that the bot was supposed to replace.

The problem was not the voice AI vendor’s fault, exactly. The model they sold performed as advertised. The issue was that what it advertised, accuracy on clean, standard Hindi, had nothing to do with what the company’s callers actually sounded like.

This is not an unusual story. It is the default outcome when enterprises deploy voice AI in India without accounting for how their customers actually speak.

The Gap Between Benchmark and Reality

When a voice AI vendor publishes a Word Error Rate, they are usually measuring on a test set. That test set is often clean audio: studio quality, neutral accent, well-paced speech.

Indian contact center calls are none of those things. They come in at 8kHz or lower over telephone lines. Callers are in kitchens, on the street, at a noisy market. They speak fast. They switch between Hindi, English and their regional mother tongue mid-sentence. They use local vocabulary that does not appear in formal Hindi corpora.

A model that reaches 5% WER on clean Hindi can easily hit 25 to 40% WER on real telephony audio from Rajasthan or Odisha. At 25% WER, roughly one word in four is wrong. At 40%, fewer than two in three words are transcribed correctly. For a voice bot handling delivery confirmations, loan queries, or insurance claims, that is a structural failure rate, not a quality gap.
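To make the arithmetic concrete, here is a minimal sketch of how WER is commonly computed: the word-level edit distance between a human reference transcript and the model's output, divided by the number of reference words. The function name and the example sentences are illustrative assumptions, not part of any vendor's tooling, and production scoring usually adds text normalization that is left out here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: 8 reference words, 2 of them misrecognized.
reference = "mera parcel kab aayega please check the status"
hypothesis = "mera partial kab aayega please check the states"
print(f"WER: {word_error_rate(reference, hypothesis):.0%}")  # WER: 25%
```

In this made-up example the two wrong words are "parcel" and "status", which are exactly the words a delivery bot needs to get right. A 25% WER does not spread its errors evenly over filler words.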

Most enterprises do not discover this during procurement. They discover it after launch, when call drop rates rise and escalation queues fill.

Why Indian Accents Are Harder Than They Look

India does not have one accent. It has hundreds of distinct regional varieties, shaped by the phonemic systems of dozens of first languages.

A speaker whose first language is Odia brings Odia phonology into their Hindi and English. The retroflex consonants, the vowel inventory, the prosodic patterns, all of these differ from the standard Hindi that most ASR training corpora are built around. The same is true for a Tamil speaker’s Hindi, a Rajasthani speaker’s Hindi, or a Bengali speaker’s English.

This is not a matter of speech quality. These speakers are fluent. The issue is that the model has never encountered these acoustic patterns during training. When it hears sounds it was not trained on, it makes its best guess. Often, that guess is wrong.

The challenge compounds when callers do what most urban and semi-urban Indian speakers do naturally: switch between languages within a sentence. “Mera parcel kab aayega, the tracking says out for delivery since morning” is a single utterance mixing Hindi and English. A standard monolingual ASR model either misses the Hindi or misses the English at the transition.

We covered this in detail in our post on why Hinglish breaks standard ASR models. The short version is that code-switching requires a model trained on code-switched speech, not a patch on an existing single-language model.

Three Places This Failure Shows Up First

IVR and voice bot drop-off rates

The clearest early signal is IVR completion rate. When a voice bot cannot parse what a caller said, it asks them to repeat. Most callers will repeat once, maybe twice. After that, they either press 0 to reach a human agent or hang up entirely.

In a well-functioning IVR, completion rates for simple tasks (confirming delivery, checking a balance, renewing a policy) should sit above 70%. When the underlying ASR model is not suited to the caller population, completion rates on these same tasks drop below 40%. The rest become manual escalations.

Each escalation costs money. Each dropped call represents a customer who did not complete what they called to do.

Agent assist and call transcript quality

Many contact centers now use ASR not just for voice bots but for agent assist: real-time transcription that feeds the agent context, suggests next-best actions, or populates call records automatically.

When the transcription is inaccurate, every downstream process degrades. The sentiment analysis reads the wrong words. The compliance monitoring misses flagged phrases. The CRM record is populated with errors. Supervisors reviewing calls cannot rely on the transcript.

These failures are quieter than bot drop-offs. They do not produce an obvious metric. They show up gradually as data quality issues, compliance gaps, and agent coaching built on faulty information.

Fully automated voice workflows

The highest-stakes failure is in workflows where there is no human fallback. A voice bot collecting loan repayment confirmations, processing insurance claims, or handling patient triage operates without an agent on standby. When it misunderstands a caller, the workflow either stalls or completes incorrectly.

In financial services, a misheard amount or a missed consent phrase is not just a service failure. It can be a compliance failure. In healthcare, a misunderstood symptom or an incorrect instruction can affect patient outcomes.

The voice AI in BFSI context makes this concrete: the stakes of misrecognition in a loan or claims call are categorically different from those in a delivery confirmation bot.

What Actually Fixes It

Training data that reflects how India speaks

The root cause is training data. A model learns the acoustic patterns it was trained on. If those patterns came from read speech by urban, standard-dialect speakers in a recording studio, the model will perform well on that kind of input and struggle on everything else.

Fixing this requires training on spontaneous, naturally occurring speech from real speakers across India’s full geographic and linguistic range. Not a few hundred hours of curated studio recordings. Thousands of hours of real calls, real conversations, real acoustic conditions, covering regional accents, background noise, fast speech, informal vocabulary, and code-switching.

This is what Project Vaani was built to address. It is a speech dataset developed by IISc, ARTPARK, and Google covering 31,255 hours of spontaneous speech from 156,534 speakers across 165 districts, 31 states and union territories, and 109 languages. The data was collected in real field conditions, not studios. It reflects the full acoustic range of how India actually speaks.

Shunya Labs Vak was trained on the full annotated portion of this dataset. That is what allows it to hold the top rank with a 3.10% WER in English, not just on clean test audio, but on the kinds of recordings that reflect real-world Indian speech conditions. You can see the full benchmark data on our benchmarks page.

Testing on your own audio, not vendor test sets

No vendor’s published benchmark tells you how a model will perform on your specific caller population. A logistics company in Rajasthan has a different acoustic profile than a bank’s contact center in Kerala.

Before deploying any voice AI system, pull a representative sample of your actual call recordings. Run them through the model. Measure performance specifically on the accents and language combinations your callers use. If the vendor cannot or will not let you test on your own audio before signing a contract, that is a signal.
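One way to structure that test is to score each call and then report WER per caller segment rather than as a single average, since a good overall number can hide a failing region. The sketch below assumes each sampled call has already been scored against a human transcript, so a reference word count and an error count are available. The `ScoredCall` type, the segment labels, and all numbers are hypothetical. The 15% threshold is the functional limit cited in this article's FAQ.

```python
from collections import defaultdict
from typing import NamedTuple


class ScoredCall(NamedTuple):
    segment: str      # hypothetical label, e.g. region or language mix
    ref_words: int    # words in the human reference transcript
    word_errors: int  # substitutions + deletions + insertions


def wer_by_segment(calls):
    """Pool errors and reference words per segment, then divide."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for call in calls:
        errors[call.segment] += call.word_errors
        words[call.segment] += call.ref_words
    return {seg: errors[seg] / words[seg] for seg in words if words[seg] > 0}


# Made-up numbers for illustration only.
calls = [
    ScoredCall("rajasthan_hindi", 120, 36),
    ScoredCall("rajasthan_hindi", 80, 24),
    ScoredCall("odisha_hindi_english", 100, 38),
    ScoredCall("standard_hindi", 200, 12),
]
FUNCTIONAL_WER = 0.15  # threshold cited in this article's FAQ

for segment, wer in sorted(wer_by_segment(calls).items()):
    flag = "above threshold" if wer > FUNCTIONAL_WER else "ok"
    print(f"{segment}: {wer:.1%} ({flag})")
```

Pooling errors and words per segment, instead of averaging per-call WER, keeps very short calls from skewing the result.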

Architecture designed for real call conditions

Test audio is also recorded audio: the full utterance is already captured before the model sees it. In a live call, the model has to handle streaming audio in real time, with variable latency, compressed codec quality, and no second takes.

This means ASR architecture matters beyond accuracy. The model has to process audio as it streams in, deliver transcription with low enough latency to keep the conversation natural, and do this on standard hardware without requiring GPU clusters that most contact centers do not have.

CPU-first architecture matters here. A model built to run on standard CPUs with sub-500ms latency can be deployed inside a contact center’s own infrastructure, without routing call audio to a foreign cloud server. For enterprises with data residency requirements, this is not optional. It also removes a significant latency source from the live call path.
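A latency target like this is easier to hold a vendor to when it is checked on the slow tail rather than the average. The sketch below assumes you have logged, for each streamed audio chunk, the time in milliseconds between sending the chunk and receiving its transcript. The function name, the report fields, and the sample values are hypothetical, and the choice of the 95th percentile is an assumption rather than a standard.

```python
import statistics

LATENCY_BUDGET_MS = 500  # the sub-500ms target discussed above


def latency_report(latencies_ms):
    """Summarize per-chunk latencies and compare the slow tail to the budget."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two measurements")
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(latencies_ms, n=20)[-1]
    return {
        "median_ms": statistics.median(latencies_ms),
        "p95_ms": p95,
        "within_budget": p95 < LATENCY_BUDGET_MS,
    }


# Made-up measurements: mostly fast, with one slow chunk.
sample = [180, 210, 195, 240, 650, 220, 205, 230, 215, 200]
report = latency_report(sample)
print(report["median_ms"], report["within_budget"])
```

In this sample the median looks comfortable, but the single slow chunk pushes the tail over budget. That slow chunk is the moment a caller hears an awkward pause.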

What Enterprises Should Do Before Their Next Deployment

Ask your current voice AI vendor three questions. Their answers will tell you a lot.

First: what was your training data? Specifically, how many hours of spontaneous Indian speech were in the training corpus, from how many speakers, across how many districts and regional varieties? “Broad Indic support” is not an answer.

Second: what is your WER on 8kHz telephony audio with regional Indian accents and code-switching? If they have a clean-audio benchmark but no telephony benchmark, the gap is probably large.

Third: can I test on my own call recordings before we proceed? If the answer is no, walk away.

Check out Shunya Labs for more info.

Voice bots that fail on accents do not just produce poor customer experience. They produce bad data, compliance gaps, and operational overhead that negates the automation benefit. The fix exists. The question is whether enterprises are asking the right questions before they buy.

Frequently Asked Questions

Why do voice bots fail on Indian accents?
Most voice AI models are trained on clean, standard-dialect audio, often from urban speakers or read-speech corpora. Indian accents reflect the phonemic influence of dozens of regional languages. A model that has not been trained on these acoustic patterns will probably misrecognize them. The problem is the training data, not the speaker.

What WER is acceptable for an enterprise voice application?
Most voice applications remain functional below 15% WER (Word Error Rate). For high-stakes applications in BFSI or healthcare, the target should be below 8%. Standard models on real Indian telephony audio with regional accents often land at 25–40% WER, well above both thresholds.

Does code-switching make accent recognition worse?
Yes. A caller who switches between Hindi and a regional variety within a sentence creates acoustic transitions that monolingual models were not trained to handle. This compounds the accent problem. See our full explanation of code-switching in ASR.

How should I test a voice AI system before deployment?
Test on your own call recordings, specifically from the regions and caller populations you intend to serve. Measure performance at language transition points for code-switched audio. Do not rely solely on the vendor’s published benchmark, which is typically measured on clean audio. Check out shunyalabs.ai to know more.

Can on-premise deployment affect voice AI accuracy?
It does not affect the model’s core accuracy, but it eliminates network latency from the live call path, which affects the naturalness of real-time voice interactions. For enterprises with data sovereignty requirements, on-premise deployment also keeps call audio within their own infrastructure.

Voice bots that work in demos and fail in the field are not a vendor problem or a technology problem in the abstract. They are a data problem. The models were not trained on how your customers speak. Closing that gap is not complicated, but it requires asking the right questions before deployment, not after.

The audio from India’s Tier 2 and Tier 3 cities, its field workers, and its regional contact centers is not an edge case to be handled later. It is the majority of the calls.

Navvya Jain

Research & Product Analyst

Bio: Navvya works at the intersection of product strategy and applied AI research at Shunya Labs. With a background in human behaviour and communication, she writes about the people, markets, and technology behind voice AI, with a particular focus on how speech interfaces are reshaping access across emerging markets.