Getting Started with ASR APIs: Node.js Quickstart

By Harish Kumar | Senior Business Analyst | Build & Learn | 23 Oct 2025

Ever wonder how your phone transcribes your voice messages or how virtual assistants understand your commands? The magic behind it is Automatic Speech Recognition (ASR). ASR APIs allow developers to integrate this powerful technology into their own applications.

What is an ASR API?

An ASR API is a service that converts spoken language (audio) into written text. You send an audio file to the API, and it returns a transcription. This simple process enables complex features like:

  • 🎬 Auto-generated subtitles
  • 🗣️ Voice-controlled applications
  • 📞 Speech analytics for customer calls

Before we dive into the code, you'll need three things for most ASR providers:

  1. An API Key: Sign up with an ASR provider (like Google Cloud Speech-to-Text, AssemblyAI, Deepgram, or AWS Transcribe) to get your unique API key. This key authenticates your requests.
  2. An Audio File: Have a sample audio file (e.g., in .wav, .mp3, or .m4a format) ready to test. For this guide, we'll assume you have a file named sample.wav.
  3. API Endpoint: The URL for the service. We'll use the ShunyaLabs endpoint, https://tb.shunyalabs.ai/transcribe.

Integrating ASR APIs with Node.js

Let's go step by step and build a working Node.js script that sends an audio file to the ShunyaLabs Pingala ASR API, retrieves the transcription, and displays it neatly in your terminal.

We'll use the following dependencies:

  • axios — for HTTP communication
  • form-data — to handle multipart file uploads

Step 1: Set Up Your Environment

Make sure you have Node.js v14+ installed, then set up your project (the script uses ES module import syntax, so add "type": "module" to the package.json that npm init generates):

# Create a project folder
mkdir asr-node-demo && cd asr-node-demo

# Initialize npm
npm init -y

# Install dependencies
npm install axios form-data

Step 2: Building the Node.js Script

Create a file named transcribe_shunya.js and let's build it section by section.

Part A: Configuration

First, we'll import the necessary libraries and set up our configuration variables at the top of the file. This makes them easy to change later.

// transcribe_shunya.js
import fs from "fs";
import axios from "axios";
import FormData from "form-data";

// --- Configuration ---
const API_KEY = "YOUR_SHUNYA_LABS_API_KEY";
const API_URL = "https://tb.shunyalabs.ai/transcribe";
const AUDIO_FILE_PATH = "sample.wav";
// --------------------

Here's what each variable does:

  • API_KEY: Your personal authentication token.
  • API_URL: The endpoint where transcription jobs are submitted.
  • AUDIO_FILE_PATH: Path to your local audio file.
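
One optional addition: a wrong file path is the most common first error, so a quick sanity check right after the configuration block can save a failed upload. This small sketch reuses the fs import from above:

// Fail fast if the audio file is missing or empty
if (!fs.existsSync(AUDIO_FILE_PATH) || fs.statSync(AUDIO_FILE_PATH).size === 0) {
  console.error(`Audio file not found or empty: ${AUDIO_FILE_PATH}`);
  process.exit(1);
}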

Part B: Submitting the Transcription Job

This function handles the initial POST request. It streams your audio file into a multipart form, sets both the language and the output script to auto-detect, and sends everything to the Pingala API to start the process.

async function submitTranscriptionJob(apiUrl, apiKey, filePath) {
  console.log("1. Submitting transcription job...");

  // Build the multipart form: the audio stream plus detection options
  const form = new FormData();
  form.append("file", fs.createReadStream(filePath));
  form.append("language_code", "auto"); // auto-detect the spoken language
  form.append("output_script", "auto"); // auto-detect the output script

  try {
    const response = await axios.post(apiUrl, form, {
      headers: {
        "X-API-Key": apiKey,  // authenticates the request
        ...form.getHeaders(), // sets the multipart Content-Type boundary
      },
    });

    console.log("   -> Job submitted successfully!");
    return response.data;
  } catch (error) {
    console.error("   -> Error submitting job:", error.response?.data || error.message);
    return null;
  }
}
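
A practical note on large files: axios applies no timeout by default, but if you want to bound the wait or allow very large multipart uploads, you can extend the axios.post call above with two standard axios options:

const response = await axios.post(apiUrl, form, {
  headers: {
    "X-API-Key": apiKey,
    ...form.getHeaders(),
  },
  timeout: 120000,         // abort if the request takes longer than 2 minutes
  maxBodyLength: Infinity, // don't cap the size of the multipart upload
});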

Part C: Displaying the Transcription Result

Once the API finishes processing, it returns a JSON response containing your transcription and metadata.
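
The exact schema varies by provider; an illustrative shape, inferred from the fields our script reads below, might look like this:

{
  "success": true,
  "text": "ਸਤ ਸ੍ਰੀ ਅਕਾਲ! ਤੁਸੀਂ ਕਿਵੇਂ ਹੋ?",
  "segments": [
    { "start": 0.0, "end": 2.4, "speaker": "SPEAKER_00", "text": "ਸਤ ਸ੍ਰੀ ਅਕਾਲ! ਤੁਸੀਂ ਕਿਵੇਂ ਹੋ?" }
  ]
}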

function printTranscriptionResult(result) {
  if (!result || !result.success) {
    console.log("❌ Transcription failed.");
    return;
  }

  console.log("\n✅ Transcription Complete!");
  console.log("=".repeat(50));
  console.log("Final Transcript:\n");
  console.log(result.text || "No transcript found");
  console.log("=".repeat(50));

  if (result.segments && result.segments.length) {
    console.log("\nSpeaker Segments:");
    result.segments.forEach((seg) => {
      console.log(`[${seg.start}s → ${seg.end}s] ${seg.speaker}: ${seg.text}`);
    });
  }
}

Part D: Putting It All Together

Finally, the main function orchestrates the entire process by calling our functions in the correct order.

async function main() {
  const result = await submitTranscriptionJob(API_URL, API_KEY, AUDIO_FILE_PATH);
  
  if (result) {
    printTranscriptionResult(result);
  }
}

main();

Step 3: Run the Node.js Script

With your audio file in the same folder, run:

node transcribe_shunya.js

If everything's set up correctly, you'll see output like this (the sample transcript is Punjabi: "Sat Sri Akal! How are you?"):

1. Submitting transcription job...
   -> Job submitted successfully!

✅ Transcription Complete!
==================================================
Final Transcript:

ਸਤ ਸ੍ਰੀ ਅਕਾਲ! ਤੁਸੀਂ ਕਿਵੇਂ ਹੋ?
==================================================

How It Works Behind the Scenes

Here's what your script actually does step by step:

  1. Upload: The script sends your audio and metadata to ShunyaLabs' ASR REST API.
  2. Processing: The backend model (Pingala V1) performs multilingual ASR, handling Indian languages, varied accents, and differences in speech clarity.
  3. Response: The API returns a JSON response with:
    • Full text transcript
    • Timestamps for each segment
    • Speaker diarization info (if enabled)

This same pattern of submitting audio and retrieving a transcript, often with a polling step in between for long-running jobs, is used by nearly every ASR provider, from Google Cloud to AssemblyAI to Pingala.
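
Our script uses the synchronous variant, where the transcript comes back directly in the POST response. For providers that return a job ID instead, the polling step looks roughly like this (the /jobs path and the status values here are hypothetical, not a documented ShunyaLabs endpoint):

async function pollUntilDone(apiUrl, apiKey, jobId) {
  // Repeatedly query a (hypothetical) job-status endpoint
  while (true) {
    const { data } = await axios.get(`${apiUrl}/jobs/${jobId}`, {
      headers: { "X-API-Key": apiKey },
    });
    if (data.status === "completed") return data; // transcript is ready
    if (data.status === "failed") throw new Error(data.error);
    await new Promise((resolve) => setTimeout(resolve, 3000)); // wait 3s, then retry
  }
}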

Best Practices

  1. Keep files under 10 MB for WebSocket requests (REST supports larger uploads).
  2. Store API keys securely in an environment variable instead of in source code (see the snippet after this list):
    export SHUNYA_API_KEY="your_key_here"
  3. Use clean mono audio (16 kHz) for best accuracy.
  4. Experiment with request parameters like:
    • language_code: "hi" to force Hindi instead of auto-detection
    • output_script: "Devanagari" for Hindi text in Devanagari script
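
For the key-storage tip above, the script can read the variable at startup instead of hard-coding it. Replace the API_KEY constant from Part A with:

// Prefer the environment variable; never commit real keys to source control
const API_KEY = process.env.SHUNYA_API_KEY;
if (!API_KEY) {
  console.error("Set SHUNYA_API_KEY before running this script.");
  process.exit(1);
}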

Final Thoughts

You've just built a working speech-to-text integration in Node.js using the ShunyaLabs Pingala ASR API, the same technology that powers real-time captioning, transcription tools, and voice analytics systems.

With its multilingual support, low-latency streaming, and simple REST/WebSocket APIs, Pingala makes it easy for developers to bring accurate, fast, and inclusive ASR into any workflow, whether for India or the world.

Automatic Speech Recognition bridges the gap between humans and machines, making technology more natural and inclusive.

As models like Pingala V1 continue to improve in accuracy and efficiency, ASR is becoming not only smarter but also accessible to every app that can listen.

Harish Kumar

Senior Business Analyst

Bio: Harish Kumar is a data-driven professional with 3.7+ years of experience in analytics and product management, having worked across startups like Noon, Zomato, and Junglee Games.

He specialises in turning data into actionable insights, driving growth, and building scalable systems from the ground up. Passionate about solving complex business problems and creating measurable impact, he currently explores opportunities in analytics, product strategy, and business growth.