How to Replace Your IVR with a Voice AI Agent: A Practical Playbook for Indian Contact Centres

By Navvya Jain | Navvya works at the intersection of product strategy and applied AI research at Shunya Labs | Use Cases | 25 Mar 2026

TL;DR / Key Takeaways:

  • Start with a call flow audit. Ninety days of recordings, transcribed and clustered by intent. Without it, your deflection rate estimates are guesses. With it, they carry ±5% accuracy from week one of deployment.
  • The audio path is: PSTN → SBC (Ribbon/AudioCodes/CUBE) → FreeSWITCH → ASR WebSocket → LLM → TTS → back to FreeSWITCH. Total end-to-end latency target: 450-650ms on cloud, sub-300ms on-premise.
  • G.711 8kHz telephony audio must be upsampled to 16kHz PCM before sending to Zero STT. Use librosa kaiser_fast (0.3ms/chunk). Never use a polynomial resampler for real-time, it adds 210ms latency per 2 seconds of audio.
  • Hindi affirmatives (हाँ, ठीक है, अच्छा) are 200-400ms long. The default 500ms barge-in window misses 40% of them. Set the minimum detection window to 150ms for Indic deployments.
  • TRAI mandates DTMF fallback for any automated system handling financial transactions, OTP delivery, or KYC in India. Voice-only deployment is non-compliant. TRAI can direct the telco to terminate your number.
  • Indian callers provide unsolicited context in ~34% of opening utterances. A linear slot-filler built for Western caller behaviour will silently discard this information.

India’s IVR infrastructure is three decades old and has not changed meaningfully since the 1990s. Press 1 for billing. Press 2 for support. Say your account number now.

The technology still runs on VXML 2.1 state machines, rigid call trees, and DTMF menus. It was designed for a world where callers had no alternative and customer experience was not a competitive variable. That world is gone. A recent Salesforce survey found that 83% of customers expect to interact with someone immediately when contacting a company. IVR systems do the opposite.

Voice AI agents replace IVR not by layering intelligence on top of VXML but by replacing the fundamental interaction model. Instead of a menu, a caller has a conversation. Instead of routing by button press, the system routes by intent. Instead of pressing 0 to escape, a caller who needs a human gets one, automatically.

This playbook covers those technical steps. It addresses audio architecture, barge-in handling, TRAI compliance, and the pre-migration work that determines whether the deployment succeeds from week one.


Step Zero: The Call Flow Audit You Cannot Skip

Before any architecture decision, before any vendor selection, before any integration work: you need around 90 days of call recording data transcribed and analysed.

This is not optional preparation. It is the only way to know what callers actually say, which intents are genuinely automatable, and what your realistic deflection rate will be. Without it, every projection in your business case is a guess.

The audit process has four steps. First, transcribe 90 days of call recordings with a batch ASR job. Shunya Labs Zero STT handles this via the REST API. Second, run k-means or LDA clustering on the transcripts to group calls by intent. Third, have a human review the cluster labels and build a ground-truth taxonomy from actual caller language. Fourth, classify intents as deflectable (no human judgement required, bounded answer space) or non-deflectable (complaints, escalations, complex queries).

The audit takes three to four weeks. The payoff is significant. Without a call flow audit, first-month intent recognition accuracy in Indian contact centre deployments typically runs 71 to 78%. With a proper audit, teams consistently achieve 87 to 93% accuracy from week one.

Callers describe the same intent in radically different ways. ‘Net nahi chal raha’, ‘broadband down hai’, and ‘connection ka problem hai’ (three ways of saying the internet connection is not working) all express the same intent. A taxonomy built from business assumptions will miss 20-30% of real caller expressions. The audit closes that gap.
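The clustering step of the audit can be sketched end to end. This is a minimal, dependency-light illustration with a hand-rolled TF-IDF and k-means using deterministic initialisation; a real 90-day corpus would use scikit-learn's TfidfVectorizer and KMeans, and the sample transcripts below are invented.

```python
import re
import numpy as np

def tfidf_matrix(docs):
    """Bag-of-words TF-IDF over a small corpus (illustrative, not optimised)."""
    vocab = sorted({w for d in docs for w in re.findall(r"\w+", d.lower())})
    idx = {w: i for i, w in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for row, d in enumerate(docs):
        for w in re.findall(r"\w+", d.lower()):
            tf[row, idx[w]] += 1.0
    idf = np.log(len(docs) / (tf > 0).sum(axis=0)) + 1.0
    return tf * idf

def kmeans(X, init_rows, iters=10):
    """Plain k-means with deterministic initial centroids (by row index)."""
    centroids = X[list(init_rows)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(len(centroids)):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

# Invented sample transcripts: two broadband-fault calls, two billing calls.
calls = ["net nahi chal raha", "broadband down net problem",
         "bill amount galat hai", "bill zyada aaya hai"]
labels = kmeans(tfidf_matrix(calls), init_rows=(0, 2))
```

A human then labels each cluster from its member utterances, which is what produces the ground-truth taxonomy rather than a business-assumed one.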

Understanding Your Existing IVR Architecture

The migration path depends entirely on what you are replacing. Three stacks account for most Indian enterprise contact centres, and each has a different migration path.

Cisco Unified CVP (VXML 2.1)

This is the most common stack in Indian enterprise contact centres. Cisco CVP runs VXML 2.1 with JTAPI/TSAPI CTI middleware connecting to Avaya or Genesys ACD. The SIP trunk handoff happens at the CUBE (Cisco Unified Border Element).

The key technical constraint: VXML 2.1 has no concept of streaming partial results. The entire utterance must complete before the VXML application can process it. Migration here requires replacing the VXML application entirely, not just the recognition endpoint. The CUBE adds 2 to 5ms processing overhead, which is within acceptable latency budgets but must be accounted for.

Avaya Experience Portal

Custom VXML applications running on Avaya Aura call control. The CTI event flow uses either TSAPI device events or JTAPI call events. Which one your deployment uses affects how agent screen-pop works after migration. Check this before designing your integration. Avaya Aura SIP trunks on Indian PSTN interconnects use G.711 a-law exclusively.

Cloud IVR (Exotel, Knowlarity, Servetel)

The simplest migration path. These platforms run managed Asterisk/FreeSWITCH with webhook-based flow builders. Their webhooks fire on DTMF input, not speech. Migration is a webhook redirect: replace the DTMF handler endpoint with a Voice AI endpoint that accepts the same webhook payload and returns the same response format. The first migration step is enabling ASR input alongside DTMF, not replacing DTMF entirely.
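The redirect itself has roughly the following shape. The payload field names (Digits, SpeechResult) and the response format here are invented placeholders, not the documented Exotel or Knowlarity schema; the point is that the DTMF branch stays alive alongside the new ASR branch.

```python
def handle_call_webhook(payload: dict, classify_intent) -> dict:
    """Route one webhook event. DTMF keeps working (migration step one);
    speech input goes through intent classification instead of a menu."""
    digit = payload.get("Digits")
    if digit:                               # existing DTMF path, untouched
        return {"action": "dtmf_route", "digit": digit}
    speech = payload.get("SpeechResult", "")
    if not speech:                          # nothing usable: reprompt
        return {"action": "reprompt"}
    return {"action": "intent_route", "intent": classify_intent(speech)}
```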

The Full Telephony Audio Path

This is the architecture every integration engineer needs to understand before touching a line of code. Every hop adds latency. Know where each millisecond goes.

The full path is: PSTN → SBC (Ribbon, AudioCodes, or CUBE) → Media Server (FreeSWITCH or Asterisk) → ASR WebSocket (Zero STT) → NLU / LLM → TTS (Zero TTS) → back to Media Server → PSTN.

| Component | Latency Contribution | Notes |
| --- | --- | --- |
| G.711 packetisation | 20ms | Fixed. 20ms RTP packets = 160 bytes each at 8kHz. |
| SBC processing | 2-5ms | Ribbon, AudioCodes, or Cisco CUBE. |
| RTP to WebSocket transcoding | 5-8ms | At the media server (FreeSWITCH). |
| Zero STT first partial | 180-220ms | Streaming transcript, first words returned. |
| NLU / LLM processing | 40-80ms | Intent and slot extraction. |
| Zero TTS first audio | Under 100ms | First audio bytes returned. |
| Audio playout | 20ms | Buffer at media server before sending to PSTN. |
| TOTAL (cloud) | 450-650ms | p99. Sub-300ms requires on-premise deployment. |

Critical: use p99 latency, not p50

Vendor latency specs cite p50 (median). Callers perceive p99. A system with p50 = 400ms and p99 = 1200ms will feel broken to roughly 1 in 100 callers. At 50,000 calls per month, that is 500 callers per month experiencing a broken interaction. Always measure and report latency at p99, and budget accordingly.
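Measuring this is a one-liner once per-turn latencies are logged. A sketch with NumPy (the 10% slow tail below is invented for illustration):

```python
import numpy as np

def latency_report(samples_ms):
    """p50 and p99 over per-turn response latencies, in milliseconds."""
    a = np.asarray(samples_ms, dtype=float)
    return {"p50": float(np.percentile(a, 50)),
            "p99": float(np.percentile(a, 99))}

# A system can look fine at the median and broken at the tail:
samples = [400.0] * 900 + [1200.0] * 100   # invented 10% slow tail
report = latency_report(samples)           # p50 = 400.0, p99 = 1200.0
```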

The G.711 upsampling problem (most common failure point)

Indian PSTN, including BSNL, Airtel, and Jio interconnects, is almost entirely G.711 a-law at 8kHz. Shunya Labs Zero STT expects 16kHz PCM. This gap causes more Indian IVR migration failures than any other single issue.

When you upsample 8kHz audio to 16kHz, there is a hard frequency ceiling at 4kHz. No upsampling algorithm can recover frequencies above that ceiling because they were never captured. The perceptual impact is worst on Hindi retroflex consonants and English sibilants, which both rely on high-frequency spectral content above 4kHz.

Use librosa.resample(audio, orig_sr=8000, target_sr=16000, res_type='kaiser_fast') for real-time processing. The kaiser_fast resampler costs 0.3ms per 20ms chunk. A polynomial resampler costs 2.1ms per chunk, which at 20ms chunks compounds to 210ms of processing per 2 seconds of audio. Do not use a polynomial resampler for live streaming.
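The decode-and-upsample step looks like the following. The a-law decoder is a dependency-free sketch following the standard ITU-T G.711 expansion; the librosa call appears as a comment because it requires librosa at runtime.

```python
import numpy as np

def alaw_to_pcm16(alaw_bytes: bytes) -> np.ndarray:
    """Expand G.711 a-law bytes to 16-bit linear PCM (ITU-T G.711)."""
    a = np.frombuffer(alaw_bytes, dtype=np.uint8).astype(np.int32) ^ 0x55
    seg = (a >> 4) & 0x07                    # 3-bit segment (exponent)
    mant = (a & 0x0F) << 4                   # 4-bit mantissa, pre-shifted
    shift = np.maximum(seg - 1, 0)           # avoid negative shifts at seg 0
    mag = np.where(seg > 0, (mant + 0x108) << shift, mant + 8)
    return np.where(a & 0x80, mag, -mag).astype(np.int16)

# One 20ms PSTN packet is 160 a-law bytes at 8kHz; 0xD5 is near-silence.
pcm_8k = alaw_to_pcm16(b"\xd5" * 160)

# Upsample 8kHz -> 16kHz before streaming to the ASR (requires librosa):
# pcm_16k = librosa.resample(pcm_8k.astype(np.float32) / 32768.0,
#                            orig_sr=8000, target_sr=16000,
#                            res_type="kaiser_fast")
```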

RTP chunk sizing

Send 20ms RTP chunks directly to the WebSocket. One chunk = 320 samples (640 bytes) of 16kHz 16-bit mono PCM. WebSocket frame overhead is 6 bytes per frame, which is negligible.

Do not buffer to larger chunks to reduce overhead. Developers who buffer to 200ms chunks to reduce the number of WebSocket sends add 180ms of unnecessary latency with every call. Each 20ms of additional buffer is 20ms added to every response in the call.
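A frame splitter that enforces this is a few lines. This is chunk arithmetic only, no network code; the WebSocket send itself depends on your client library.

```python
def rtp_frames(pcm16: bytes, sample_rate: int = 16000, frame_ms: int = 20):
    """Split a PCM buffer into fixed 20ms frames for the ASR WebSocket.
    At 16kHz, 16-bit mono: 320 samples = 640 bytes per frame. Sending
    frames as they arrive, never buffered together, keeps latency flat."""
    frame_bytes = (sample_rate * frame_ms // 1000) * 2   # 2 bytes per sample
    return [pcm16[i:i + frame_bytes]
            for i in range(0, len(pcm16) - frame_bytes + 1, frame_bytes)]
```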

Barge-In Handling: The Feature That Makes or Breaks Experience

Barge-in is when a caller speaks while the voice agent is still talking. Legacy IVR systems handle this clumsily or not at all. A voice AI agent that does not handle barge-in correctly produces a broken experience: callers who try to interrupt get ignored, which forces them to wait for the agent to finish speaking before they can respond.

Echo cancellation must be at the media server level

Apply Acoustic Echo Cancellation (AEC) at the media server level using FreeSWITCH mod_dptools echo suppression. Do not apply it at the ASR level.

If AEC is applied at the ASR level instead, partial transcripts of TTS playback feed back into the recogniser. The agent hears itself speaking and starts transcribing its own output as new input. In open-plan Indian contact centre environments with poor headset acoustic isolation, false barge-in rates increase by approximately 35% without proper AEC.

VAD threshold calibration for Indian audio

WebRTC VAD has an aggressiveness scale from 0 to 3. For Indian contact centre environments with a 65 to 70 dB ambient noise floor, aggressiveness 2 with a 150ms onset window is the correct starting point.

| VAD Aggressiveness | False Barge-in Rate | Missed Genuine Barge-ins | Notes |
| --- | --- | --- | --- |
| 0 (least aggressive) | Very low | High (>25%) | Too permissive for noisy floors |
| 1 | Low | 12-15% missed | Under-triggers on genuine speech |
| 2 + 150ms onset | Controlled | <5% missed | Recommended for Indian contact centres |
| 3 (most aggressive) | Every 8-12 seconds | Very low | Constant false barge-ins in noisy environments |

India-specific: Hindi affirmatives are short
हाँ (haan), ठीक है (theek hai), and अच्छा (accha) are often 200-400ms in duration. The default Western barge-in detection window assumes a minimum utterance length of 500ms. At 500ms, roughly 40% of single-word Hindi affirmatives are missed: the caller says yes and the agent does not hear it. Set the minimum barge-in detection window to 150ms for all Hindi and Indic language deployments. This reduces missed affirmatives to under 5%.
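A minimal onset-window detector over per-frame VAD decisions makes the threshold concrete. The booleans would come from a frame-level VAD such as webrtcvad at 20ms frames; the closure shape is illustrative.

```python
import math

def make_barge_in_detector(frame_ms: int = 20, onset_ms: int = 150):
    """Fire only after onset_ms of consecutive speech frames. 150ms catches
    short Hindi affirmatives (haan, accha at 200-400ms) that a 500ms
    window would miss; a single non-speech frame resets the run."""
    needed = math.ceil(onset_ms / frame_ms)   # 8 frames at 20ms / 150ms
    run = 0
    def feed(is_speech: bool) -> bool:
        nonlocal run
        run = run + 1 if is_speech else 0
        return run >= needed
    return feed
```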

TTS interrupt and graceful stop

When barge-in is detected, the state machine should follow this sequence: send RTP silence to the media server immediately, play an 80ms audio fade-out (below 80ms callers report the voice cutting off rudely), then begin streaming the new utterance to Zero STT. Do not restart the dialogue state.

The context of the interrupted turn, including slots already filled and dialogue history, must carry forward. A caller who barged in mid-response should not be asked to repeat information they already provided. Losing filled slots on a barge-in is one of the top three CSAT drivers in early voice AI deployments.
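The interrupt sequence can be pinned down as code. The three callables are hypothetical hooks into your media server and ASR client, not a real SDK; the point is the order of operations and that dialogue state passes through untouched.

```python
def on_barge_in(stop_tts_audio, play_fadeout, start_asr_stream, state: dict):
    """Barge-in interrupt sequence: silence, 80ms fade, then stream the new
    utterance. Dialogue state (filled slots, history) is never reset."""
    stop_tts_audio()          # RTP silence to the media server, immediately
    play_fadeout(80)          # 80ms fade; shorter sounds like a rude cut-off
    start_asr_stream()        # caller's new utterance goes to the recogniser
    return state              # slots and dialogue history carry forward
```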

DTMF Fallback: The TRAI Compliance Requirement

This is not optional and it is not a nice-to-have. TRAI regulations require a DTMF fallback path for any automated voice system handling financial transactions, OTP delivery, or KYC verification in India.

A voice-only deployment with no DTMF fallback is non-compliant. TRAI can direct the telco to terminate the number used by a non-compliant automated system. The exposure is not just poor user experience. It is a number that stops working.

RFC 2833 vs SIP INFO: silent DTMF loss

DTMF can travel over two different paths. RFC 2833 sends DTMF as RTP telephone-event packets, in-band alongside the voice. SIP INFO sends DTMF as out-of-band SIP messages. Which one arrives depends on your SBC configuration and your carrier.

Knowlarity and Exotel use RFC 2833. Cisco CUBE typically uses SIP INFO. If your FreeSWITCH configuration handles one but not the other, DTMF input silently disappears with no error message. Configure FreeSWITCH to accept both RFC 2833 and SIP INFO simultaneously and verify before go-live.
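In FreeSWITCH this typically lands in the SIP (sofia) profile. The parameter names below reflect common sofia profile settings (liberal-dtmf accepts DTMF however it arrives); treat this as a sketch and verify against the documentation for your FreeSWITCH version.

```xml
<!-- sofia profile fragment (sketch): accept both RFC 2833 and SIP INFO -->
<profile name="external">
  <settings>
    <!-- advertise/prefer RFC 2833 telephone-event in the RTP stream -->
    <param name="dtmf-type" value="rfc2833"/>
    <!-- accept DTMF in whatever form the far end actually sends,
         including SIP INFO, regardless of what was negotiated -->
    <param name="liberal-dtmf" value="true"/>
  </settings>
</profile>
```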

Language Detection for Multilingual Inbound Calls

If your contact centre serves callers in multiple Indian languages, two architectures exist. Ask callers to select their language upfront, or use automatic language identification. Each has tradeoffs.

The cost of auto-detection

Setting language_code=auto in Zero STT adds approximately 40ms per utterance versus a pre-specified language code. This overhead comes from running an additional softmax pass over language embeddings to identify the language before transcribing.

40ms sounds small, but it compounds. Over a 20-utterance call, that is 800ms of cumulative overhead added to the response latency budget. For high-volume deployments, a language selection menu in the first three seconds of the call, just one spoken choice, is often the more efficient architecture.

Minimum utterance length for reliable detection

Language identification accuracy depends heavily on how much audio is available. Testing across Zero STT deployments shows the following accuracy by utterance length:

| Utterance Length | LID Accuracy | Recommendation |
| --- | --- | --- |
| Under 0.5 seconds | 61% | Do not attempt LID on this input |
| 0.5 to 1.2 seconds | 78% | Acceptable only if no better option |
| Over 1.2 seconds | 94% | Reliable for language routing decisions |

Do not base language routing on the caller’s first word. That first word is often just ‘hello’ or ‘haan’. Basing language selection on it causes 15 to 22% language mismatches in practice. Use the first full-sentence response, typically the caller’s reply to ‘please state your query’, as the LID anchor.
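The routing gate reduces to a threshold check on utterance duration, using the accuracy bands above:

```python
def lid_decision(utterance_seconds: float) -> str:
    """Gate language-ID routing on how much audio is available:
    under 0.5s (61% LID accuracy) -> skip; 0.5-1.2s (78%) -> tentative,
    only if no better option; over 1.2s (94%) -> safe to route on."""
    if utterance_seconds < 0.5:
        return "skip"
    if utterance_seconds <= 1.2:
        return "tentative"
    return "route"
```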

Mid-call language switches

Approximately 8% of calls lasting more than three minutes involve a language switch. This happens when a caller switches to a family member on the same call, when frustration triggers a language change, or when a caller moves between Hindi and a regional language.

Run per-utterance LID continuously throughout the call. When a language switch is detected, update the language model for the current and future turns without resetting the dialogue state. Slots already filled must persist. A caller who provided their account number in Hindi and then switches to Tamil should not be asked for their account number again.
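Slot persistence across a switch is easy to get right if the language is one field on the dialogue state rather than a key to it. A sketch; the DialogueState shape is illustrative, not a real framework type:

```python
from dataclasses import dataclass, field

@dataclass
class DialogueState:
    language: str
    slots: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

def apply_lid_result(state: DialogueState, detected: str) -> DialogueState:
    """On a mid-call language switch, swap the active language model for
    current and future turns; filled slots and history are untouched."""
    if detected != state.language:
        state.history.append(("language_switch", state.language, detected))
        state.language = detected
    return state
```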

Dialogue Manager Design: Where Most Deployments Fail

The dialogue manager is where most Indian IVR replacements underperform. The reason is almost always the same: the dialogue was designed for Western caller behaviour, not Indian caller behaviour.

Confidence-based reprompting

When Zero STT returns a critical slot, such as an account number, an amount, or a date, with word-level confidence below 0.75, the agent should reprompt. But how it reprompts matters more than whether it reprompts.

Specific reprompting reduces the repeat-attempt rate by approximately 40% compared to generic reprompting. The difference: ‘I heard fifteen thousand rupees, is that correct?’ versus ‘Sorry, I did not understand, could you please repeat?’

The caller knows the system heard something. A generic reprompt signals the system did not understand at all, which is frustrating even when it is not true. A specific reprompt signals the system got close, which is honest and faster to correct.
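As code, the gate plus the specific-confirmation wording (threshold 0.75 from the text; the phrasing template is illustrative):

```python
from typing import Optional

def reprompt_for(heard: str, confidence: float,
                 threshold: float = 0.75) -> Optional[str]:
    """Return a specific confirmation prompt when a critical slot comes
    back below threshold, or None to accept the slot. The specific form
    ('I heard X, is that correct?') cuts repeat attempts by roughly 40%
    versus a generic 'please repeat'."""
    if confidence >= threshold:
        return None
    return f"I heard {heard}, is that correct?"
```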

Escalation trigger definition

Define escalation triggers explicitly before go-live. Vague triggers mean either too many unnecessary transfers, which wastes agent time, or callers trapped in automation they cannot escape, which is a CSAT disaster.

Three conditions should always trigger automatic escalation. First: two failed reprompt attempts on the same critical slot. Second: the caller’s utterance contains escalation vocabulary in any supported language. Third: sentiment score falls below your defined threshold for two consecutive turns.

Escalation vocabulary by language (must be included in your NLU model)
English: manager, complaint, escalate, supervisor, speak to human, real person
Hindi: manager chahiye, complaint karna hai, supervisor se baat karni hai, insaan se baat karo
Tamil: manager venum, pugatchi seiya vendum, uyarntavar kitta pesanum
Telugu: manager kavali, complaint cheyali, manishi tho matladaali
Kannada: manager beka, complaint maadabeku, person jote matadabeku
Marathi: manager pahije, takraar karavi ahe, manasaashi bola
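Wired together, the three triggers look like this. The sentiment floor of -0.5 is an invented placeholder for your defined threshold, and the vocabulary dict holds only a subset of the lists above:

```python
ESCALATION_VOCAB = {
    "en": ("manager", "complaint", "escalate", "supervisor", "real person"),
    "hi": ("manager chahiye", "complaint karna hai", "insaan se baat karo"),
    "ta": ("manager venum", "uyarntavar kitta pesanum"),
}

def should_escalate(utterance: str, failed_reprompts: int,
                    sentiment_history: list,
                    sentiment_floor: float = -0.5) -> bool:
    """Escalate on any of: two failed reprompts on the same critical slot,
    escalation vocabulary in any supported language, or sentiment below
    the floor for two consecutive turns."""
    if failed_reprompts >= 2:
        return True
    text = utterance.lower()
    if any(p in text for phrases in ESCALATION_VOCAB.values() for p in phrases):
        return True
    return (len(sentiment_history) >= 2
            and all(s < sentiment_floor for s in sentiment_history[-2:]))
```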

What the Migration Looks Like End to End

For a team using Exotel or Knowlarity, the fastest migration path follows these steps in sequence:

  1. Run the call flow audit. 90 days of recordings, intent clustering, deflectable/non-deflectable classification. Three to four weeks.
  2. Build the dialogue flows for deflectable intents using the ground-truth taxonomy from the audit. Not from what the business thinks callers say.
  3. Configure Zero STT with the correct language codes for your caller population. Test on 30 minutes of actual recordings to verify WER before integration. Benchmarks at shunyalabs.ai/benchmarks.
  4. Set up the FreeSWITCH media server with AEC enabled at module level, VAD aggressiveness 2, 150ms minimum barge-in window, and both RFC 2833 and SIP INFO DTMF handling configured.
  5. Implement the resampling pipeline: G.711 a-law 8kHz → 16kHz PCM using kaiser_fast. Verify output before sending to Zero STT.
  6. Redirect the Exotel/Knowlarity webhook to your Voice AI endpoint. Start with 5% of traffic on one intent category. Measure intent recognition accuracy, fallback rate, and CSAT daily for two weeks.
  7. If accuracy exceeds 90% and CSAT holds: expand to remaining deflectable intents. If accuracy is below 88%: the call flow audit taxonomy needs refinement. Do not expand until the accuracy threshold is met.
  8. Implement DTMF fallback on all financial transaction, OTP, and KYC flows before full go-live. TRAI compliance is not a post-launch task.

The Speech Infrastructure Layer

The quality of the migration rests on the ASR and TTS models underneath everything else. An accurate dialogue manager built on a poor speech layer will underperform regardless of how well the dialogue is designed.

Shunya Labs Zero STT is trained on real audio. The training set includes regional accents, code-switched speech, and the ambient noise conditions of Indian contact centres. Full benchmark data is at shunyalabs.ai/benchmarks.

Zero TTS brings native Indic voice synthesis to the output side. For collections and BFSI deployments where caller trust affects call outcome, the quality of the voice matters. A TTS model adapted from English produces output that Indian callers identify as foreign-accented, which affects how they respond. Zero TTS is trained on Indian speech data per language, not adapted from another base.

Models run on-premise on CPU hardware without GPU infrastructure, which matters for DPDPA compliance and for contact centres operating within Indian data boundaries. Deployment documentation is at shunyalabs.ai/deployment.

References

Bowen, E. (2025). How conversational IVR enhances customer experience with AI. [online] Telnyx.com. Available at: https://telnyx.com/resources/conversational-ai-ivr [Accessed 27 Mar. 2026].

Bown, B. (2023). Future of Customer Service is Personalised & Connected: 2023. [online] Salesforce. Available at: https://www.salesforce.com/eu/blog/future-of-customer-service/.

Nair, S. (2019). Conversational IVR: Automate Customer Care Calls with AI. [online] Haptik.ai. Available at: https://www.haptik.ai/blog/conversational-ivr-automate-customer-care-calls [Accessed 27 Mar. 2026].

Reverie (2025). Future of IVR Systems: Trends Shaping Customer Experience. [online] Reverie. Available at: https://reverieinc.com/blog/future-of-ivr/ [Accessed 26 Mar. 2026].

Navvya Jain

Navvya works at the intersection of product strategy and applied AI research at Shunya Labs. With a background in human behaviour and communication, she writes about the people, markets, and technology behind voice AI, with a particular focus on how speech interfaces are reshaping access across emerging markets.