How Code-Switching Affects Voice AI Accuracy in Asia

By Navvya Jain | Research & Product Analyst | Build & Learn | 08 Jun 2026

A customer calls a bank in Mumbai. She starts in Hindi, switches to English for the account number, then drops back into Hindi to explain her complaint. That’s code-switching. The whole sentence takes ten seconds. But the voice AI catches maybe half of it.

This is not a fringe scenario. It is the default mode of communication across Asia’s largest markets. And for most enterprise voice AI systems, it is still an unsolved problem.

The business consequence is inaccurate transcription, failed service interactions, and eroded customer trust. This post explains what code-switching actually is, why it breaks standard ASR models, how the problem plays out across different Asian markets, and what enterprises should look for in a system that can handle it.

What Is Code-Switching in Voice AI?

Code-switching is when a speaker moves between two or more languages within a single conversation, or even within a single sentence. It is not a mistake or a sign of poor language ability. It is a natural feature of multilingual speech, shaped by how people learn languages and how they use them in daily life.

In Asia, this pattern is everywhere.

In India, over 250 million people communicate daily in Hinglish, a fluid mix of Hindi and English. A sentence like “Mujhe apna account statement check karna hai by end of day” is completely natural. It is not Hindi with errors. It is how the speaker actually talks.

In Singapore, a business meeting might move between English, Mandarin, and Tamil within minutes. A contact center caller might open in Malay, switch to English to read out a reference number, and close in Tamil. In the Philippines, Tagalog-English code-switching, called Taglish, is so common it is effectively the default register for urban professional conversations.

In each of these cases, a voice AI system trained on a single language will fail. It does not know which language the speaker is in at any given moment. And it has no model for what speech sounds like when both languages are active at the same time.

Why Standard ASR Models Break on Code-Switched Speech

Most automatic speech recognition (ASR) systems are built around a monolingual assumption. They are trained on a corpus of one language, built to recognize the phonemes of that language, and optimized for word error rate (WER) on single-language benchmarks. When a speaker switches languages mid-sentence, several things break at once.

Phoneme confusion. Each language has its own set of sounds. When a speaker shifts from Hindi to English, the acoustic properties of their speech change. A model trained only on Hindi does not have a reliable representation of English phonemes, and vice versa. The transition point, where the switch happens, is where errors cluster.

Language model mismatch. ASR systems use a language model to predict which words are likely to follow each other. A Hindi language model assigns low probability to English words appearing in the middle of a Hindi sentence, even when that is exactly what a real speaker says. The model “corrects” the transcription toward Hindi, producing errors.

Training data scarcity. Code-switched speech is harder to collect and annotate than single-language speech. Most available datasets are either too small or too formal to capture the full range of how people actually code-switch in real conversations. The HiACC corpus, one of the few annotated Hinglish datasets, contains 5.24 hours of speech. That is useful for research, but nowhere near enough to train a production system.
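To make the language model mismatch described above concrete, here is a minimal sketch using a toy bigram model with add-one smoothing. The three training sentences are hypothetical and romanised for readability, and the function names are illustrative only. Production ASR language models are far larger, but the effect is the same in kind: a word that never appeared in the training text scores lower than its Hindi counterpart.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their left contexts over whitespace-tokenised text."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        vocab.update(tokens)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, vocab


def log_prob(sentence, model):
    """Add-one smoothed log-probability of a sentence under the bigram model."""
    contexts, bigrams, vocab = model
    v = len(vocab) + 1  # one extra slot for any unseen word
    tokens = ["<s>"] + sentence.split()
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + v))
        for prev, word in zip(tokens, tokens[1:])
    )


# Hypothetical Hindi-only training text (assumption, for illustration).
hindi_only = [
    "mujhe apna khata dekhna hai",
    "mujhe apna bayan dekhna hai",
    "mujhe khata dekhna hai",
]
model = train_bigram(hindi_only)

monolingual = "mujhe apna khata dekhna hai"
code_switched = "mujhe apna account dekhna hai"

print(log_prob(monolingual, model))
print(log_prob(code_switched, model))  # lower: "account" never appeared in training
```

A decoder that ranks candidate transcripts by this score will prefer the all-Hindi candidate even when the speaker said "account", which is the "correction" toward Hindi described above.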

The result is measurable. Standard monolingual ASR models see roughly a 30 to 50% relative increase in word error rate when exposed to code-switched speech. For Hinglish specifically, the HiACC benchmark study found that standard monolingual models reach approximately 42% WER on code-switched input. For context, most enterprise voice applications become unreliable above 15% WER. At 42%, the system is getting fewer than three words in five right.
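For readers who want to check figures like these on their own transcripts, the following is a minimal sketch of how WER and a relative increase are usually computed: word-level edit distance divided by the number of reference words. The hypothesis transcript is invented for illustration and is not output from any real system.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_increase(baseline_wer: float, new_wer: float) -> float:
    """Relative change in WER, e.g. 0.10 -> 0.14 is a 40% relative increase."""
    return (new_wer - baseline_wer) / baseline_wer


reference = "mujhe apna account statement check karna hai by end of day"
# Invented hypothesis: four substitutions and one dropped word.
hypothesis = "mujhe apna akaunt statement chek karna hai bhai and day"

print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
print(f"Relative increase: {relative_increase(0.10, 0.14):.0%}")
```

Note that a relative increase is measured against the baseline: a model at 10% WER that rises to 14% has a 40% relative increase, not a 4% one.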

How This Plays Out Across Asian Markets

The code-switching problem is not uniform across Asia. Each market has its own language pairs, its own patterns of switching, and its own enterprise contexts where accuracy failure causes the most damage.

India

India is the largest and most complex code-switching market in the world. The combinations are not just Hindi-English. Urban speakers in Chennai code-switch between Tamil and English. In Hyderabad, Telugu-English switching is standard in professional settings. In Kerala, Malayalam and English mix fluidly in both formal and informal conversations.

The stakes are high in BFSI and healthcare, where accuracy directly affects outcomes. A voice AI system transcribing a loan officer’s call that misses a number, a condition, or a consent phrase creates real liability.

Southeast Asia

Singapore compresses the challenge into one city. English, Mandarin, Malay, and Tamil are all commonly used languages, and switching between them in a single interaction is completely normal. A compliance call at a bank might involve all four.

In the Philippines, Taglish is so embedded in daily speech that some linguists describe it as a separate register rather than a mixing of two. Contact centers, a major industry in the Philippines, deal with this constantly. An agent switching between Taglish and standard English depending on the customer is not an edge case. It is every shift.

In Malaysia, Bahasa Malaysia-English code-switching (Manglish) is the norm in urban business settings. In Indonesia, Bahasa Indonesia-English and Bahasa-Javanese combinations add another layer.

East Asia

Tonal languages like Mandarin and Cantonese bring additional challenges. Mandarin has four tones and a neutral tone, so the same syllable spoken in a different tone is a different word. English has no tonal system. When a Mandarin speaker shifts into English mid-sentence and then returns to Mandarin, the ASR system has to track not just a language change but a shift between tonal and non-tonal phonology.

English typically achieves under 8% WER in standard ASR, while tonal languages like Mandarin may reach 15–20% even in monolingual conditions. Add code-switching, and the error rate climbs further.

The Enterprise Cost

Code-switching failures are not always visible in the metrics enterprises track. They show up as high handle times, elevated repeat call rates, low IVR completion rates, and CSAT scores that trend down without a clear explanation.

A few places where the cost is most direct:

Contact centers. When a voice AI system misunderstands a code-switched query, one of three things happens: the call gets misrouted, the agent receives a wrong transcript and has to start the interaction over, or the customer repeats themselves multiple times and eventually gives up. Each of these outcomes costs time and customer goodwill.

Financial services. In lending, insurance, and banking, voice interactions often involve specific figures, terms, and consent language. A transcription error on a loan amount or a missed consent phrase is not just a service failure. It can be a compliance failure.

Healthcare. For emergency services, accurate transcription is needed regardless of language switches. A misheard instruction in a medical triage call or a missed symptom description in a health helpline interaction carries real risk.

IVR and voice bots. Fully automated voice workflows have no human fallback. If the model cannot parse a code-switched utterance, the interaction fails completely. In markets where code-switching is the norm, an IVR built on monolingual ASR will have a structural failure rate that no amount of tuning can fix.

What a Code-Switching-Ready Voice AI System Needs

Fixing the code-switching problem is not a configuration change. It requires a different approach to model architecture and training data. Here is what separates systems that handle code-switching from those that do not.

Bilingual and multilingual model training. The model has to be trained on speech that actually contains code-switching, not just separate monolingual corpora. The breakthrough lies in understanding code-switching as natural communication rather than an error to be corrected. Models trained with that framing produce very different results from those that treat it as noise.

Real-world training data. Studio recordings and read speech do not capture how code-switching actually sounds. The training corpus needs spontaneous, conversational, naturally occurring speech from real speakers in real environments.

Low-latency language detection. The system needs to identify which language is active at any given moment, fast enough to apply the right acoustic and language model at the point of the switch. Latency matters: if language detection lags by even a fraction of a second, the transition point gets transcribed incorrectly.

Coverage across language pairs, not just individual languages. Supporting Hindi and English as separate modes is not the same as supporting Hinglish. A system that can handle Hinglish but not Tanglish or Taglish is not a solution for an enterprise operating across multiple Asian markets.

Production-grade accuracy in noisy conditions. Contact center audio is compressed, the background is noisy, and speakers talk fast. Benchmark WER on clean audio does not predict real-world performance. Enterprises should test any ASR system against their own call recordings before drawing conclusions.
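The point about low-latency language detection can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not a description of any real detector: audio is reduced to a list of per-frame language labels, and the hypothetical detector simply reports each frame's language a fixed number of frames late.

```python
def delayed_detection(true_langs, lag_frames):
    """Simulate a detector that reports each frame's language lag_frames late."""
    if lag_frames < 0:
        raise ValueError("lag_frames must be non-negative")
    return [true_langs[max(0, i - lag_frames)] for i in range(len(true_langs))]


def mismatched_frames(true_langs, detected_langs):
    """Indices of frames that would be decoded with the wrong language."""
    return [
        i
        for i, (truth, guess) in enumerate(zip(true_langs, detected_langs))
        if truth != guess
    ]


# Hypothetical utterance: Hindi, a short English span, then Hindi again.
frames = ["hi"] * 5 + ["en"] * 3 + ["hi"] * 4

for lag in (0, 2):
    wrong = mismatched_frames(frames, delayed_detection(frames, lag))
    print(f"lag={lag} frames -> mismatched frame indices: {wrong}")
```

With no lag there are no mismatches. With a lag of two frames, the mismatches sit immediately after each switch, which is why short English insertions such as account numbers are the first thing a slow detector loses.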

What Enterprises Should Do Now

Most enterprises deploying voice AI in Asia have not explicitly tested their system on code-switched speech. They have tested it on clean audio in the primary language. That test does not reflect the calls they will actually receive.

Three steps that matter:

First, pull a sample of real call recordings and run them through your current ASR system. Count transcription errors specifically at language transition points. This gives you a baseline WER for code-switched input, which is almost certainly higher than the vendor’s published benchmark.

Second, evaluate any new system on your actual language pairs. Hinglish performance does not predict Tanglish performance. Mandarin-English performance does not predict Malay-English performance. Test the specific combinations your customers use.

Third, prioritise systems trained on spontaneous speech, not read speech. The difference in real-world accuracy is significant, and it shows up most clearly on code-switched input where the gap between controlled and real conditions is widest.
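As a starting point for the first step, the sketch below compares the error rate on words adjacent to a language switch with the error rate elsewhere in the same utterance. It assumes you have a human reference transcript with a hand-assigned language tag per word. The hypothesis transcript, the tags, and the function names are illustrative. Alignment uses Python's standard difflib, which approximates a proper WER alignment and ignores inserted words, so treat the numbers as indicative rather than exact.

```python
from difflib import SequenceMatcher


def switch_points(langs):
    """Indices i where word i is in a different language from word i - 1."""
    return [i for i in range(1, len(langs)) if langs[i] != langs[i - 1]]


def reference_errors(ref_tokens, hyp_tokens):
    """Flag each reference word that the hypothesis substituted or dropped."""
    wrong = [False] * len(ref_tokens)
    matcher = SequenceMatcher(a=ref_tokens, b=hyp_tokens, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            for i in range(i1, i2):
                wrong[i] = True
    return wrong


def error_rates_by_region(ref_tokens, langs, hyp_tokens, window=1):
    """Error rate within `window` words of a switch versus everywhere else."""
    if len(ref_tokens) != len(langs):
        raise ValueError("need exactly one language tag per reference word")
    wrong = reference_errors(ref_tokens, hyp_tokens)
    near = set()
    for s in switch_points(langs):
        near.update(range(max(0, s - window), min(len(ref_tokens), s + window)))
    far = set(range(len(ref_tokens))) - near

    def rate(indices):
        return sum(wrong[i] for i in indices) / len(indices) if indices else None

    return {"near_switch": rate(near), "elsewhere": rate(far)}


ref = "mujhe apna account statement check karna hai by end of day".split()
tags = ["hi", "hi", "en", "en", "en", "hi", "hi", "en", "en", "en", "en"]
# Invented ASR output for illustration only.
hyp = "mujhe apna akaunt statement chek karna hai bhai end of day".split()

print(error_rates_by_region(ref, tags, hyp))
```

Run over a few hundred real calls, a large gap between the two rates is a sign that the system is failing specifically at language transitions rather than on audio quality in general.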

Systems built specifically for Asian language environments are beginning to close the gap. Shunya Labs’ Vāķ, for example, covers 55 Indian languages including code-switched varieties and is trained on spontaneous real-world speech, so it is built for how India actually speaks, not how it speaks in a recording studio.

Frequently Asked Questions

What is code-switching in voice AI? Code-switching is when a speaker alternates between two or more languages within a conversation or a single sentence. It is a normal feature of speech in multilingual populations. In voice AI, it refers to the challenge of accurately transcribing speech that moves between languages, something most standard ASR systems are not built to handle.

Why does code-switching cause high word error rates? Standard ASR models are trained on single-language data. When a speaker switches languages mid-sentence, the model encounters sounds, words, and grammatical patterns it was not trained to expect in that context. This causes phoneme confusion at the transition point and language model errors throughout. Studies show a 30 to 50% relative increase in WER on code-switched speech compared to single-language input.

Which Asian markets are most affected by code-switching in enterprise voice AI? India is the largest market, with over 250 million daily code-switchers. Hinglish (Hindi-English), Tanglish (Tamil-English), and Telugu-English are the most common enterprise-relevant combinations. Southeast Asia is also heavily affected: Singapore (English-Mandarin-Malay-Tamil), the Philippines (Taglish), and Malaysia (Manglish) all have high rates of code-switching in professional and customer-facing contexts.

What WER should enterprises target for code-switched ASR? Most voice applications become unreliable above 15% WER. For high-stakes applications in financial services or healthcare, the target should be below 8%. Standard monolingual models on Hinglish reach approximately 42% WER in the HiACC benchmark study, well above both thresholds.

How do I know if my current voice AI handles code-switching? Test it on real call recordings from your environment, not vendor-provided test sets. Pull calls where customers are likely to code-switch and measure transcription accuracy specifically at language transition points. If your vendor’s published WER is based on clean single-language audio, assume real-world performance will be significantly worse.

What should I look for in a voice AI system that handles code-switching? Look for models trained on real spontaneous speech in your specific language pairs, not just separate monolingual models. Check whether the vendor’s benchmark data includes code-switched test sets. Evaluate latency at the language transition point. And test on your own audio before making a final decision.

Code-switching is not an edge case in Asia. It is the norm. In India, Southeast Asia, and East Asia’s multilingual business centres, the customers calling your contact center, the patients using your health helpline, and the borrowers going through your loan workflow are all switching between languages in ways that standard voice AI systems were never designed to handle.

The gap between “supports multiple languages” and “handles code-switching accurately” is where most enterprise voice AI deployments fail in practice. Closing that gap starts with knowing where your current system breaks.

Navvya Jain

Research & Product Analyst

Bio: Navvya works at the intersection of product strategy and applied AI research at Shunya Labs. With a background in human behaviour and communication, she writes about the people, markets, and technology behind voice AI, with a particular focus on how speech interfaces are reshaping access across emerging markets.