OpenAI Whisper is a speech recognition model that converts audio into text. It supports multiple tasks such as transcription, translation and language detection, making it highly useful for working with audio data.
- Converts speech into text (speech-to-text)
- Supports multiple languages and accents
- Can translate audio into English
- Works well even with noisy audio
How OpenAI Whisper Works
Whisper processes audio through multiple stages to convert speech into accurate text.
- Audio Preprocessing: The input audio is split into smaller segments and converted into spectrograms, which represent sound frequencies visually
- Feature Extraction: The model extracts important linguistic and acoustic patterns from these spectrograms
- Language Identification: If the language is unknown, the model detects it automatically
- Speech Recognition: The model predicts the most likely sequence of words based on the extracted features
- Translation (Optional): The recognized text can be translated into another language if required
- Post-processing: The output is refined using language rules to improve accuracy and readability
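As an illustration of the preprocessing stage, here is a minimal NumPy sketch that splits a waveform into fixed 30-second segments, the window size Whisper operates on internally. The function name and zero-padding scheme are illustrative, not Whisper's actual implementation, which also converts each segment into a log-mel spectrogram.

```python
import numpy as np

def split_into_segments(waveform, sample_rate=16000, segment_seconds=30):
    """Split a 1-D waveform into fixed-length segments, zero-padding the last one."""
    segment_len = sample_rate * segment_seconds
    # Pad so the total length is a multiple of the segment length
    pad = (-len(waveform)) % segment_len
    padded = np.pad(waveform, (0, pad))
    return padded.reshape(-1, segment_len)

# 70 seconds of audio at 16 kHz -> three 30-second segments (last one padded)
audio = np.zeros(70 * 16000, dtype=np.float32)
segments = split_into_segments(audio)
print(segments.shape)  # (3, 480000)
```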
Implementation Using the OpenAI API
Step 1: Install the OpenAI Library
!pip install -q openai
Step 2: Import Library
Import the OpenAI library and create a client object, replacing "YOUR_API_KEY" with your generated API key in the code below.
To learn how to get an OpenAI API key, refer to: OpenAI API Key
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")
Step 3: Transcribe Audio
Converts speech into text in the same language.
audio_file = open("Path to an audio file", "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file
)
print(transcript.text)
Step 4: Translate Audio to English
Translates audio in any supported language into English text, using the translations endpoint.
audio_file = open("audio.mp3", "rb")
translation = client.audio.translations.create(
    model="whisper-1",
    file=audio_file
)
print(translation.text)
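The API rejects audio files above a size limit (25 MB at the time of writing), so long recordings must be split before upload. Below is a minimal sketch using only Python's standard wave module, assuming an uncompressed WAV input; the chunk_seconds value and output file naming are illustrative, and compressed formats like MP3 would need an audio library instead.

```python
import wave

def split_wav(path, chunk_seconds=60):
    """Split a WAV file into chunks of at most chunk_seconds each."""
    out_paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = src.getframerate() * chunk_seconds
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = f"chunk_{index}.wav"
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)  # header frame count is fixed up on close
                dst.writeframes(frames)
            out_paths.append(out_path)
            index += 1
    return out_paths
```

Each resulting chunk can then be passed to the transcription or translation endpoint in its own request.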
Implementation Using Hugging Face
Step 1: Set Up the Environment
First, install the required libraries. Run the following commands one at a time in your command prompt.
pip install transformers --upgrade
pip install torch torchaudio
Step 2: Import Required Modules
This step sets up the foundational components required to build the speech-to-text pipeline.
- WhisperProcessor: Prepares audio input for the Whisper model (feature extraction + decoding).
- WhisperForConditionalGeneration: Loads the Whisper speech-to-text model developed by OpenAI.
- torch: Core deep learning framework used to run the model and handle tensors.
- torchaudio: Used to load and preprocess audio files before feeding them into the model.
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import torchaudio
Step 3: Load Model and Processor
We load the pre-trained Whisper Small model developed by OpenAI from Hugging Face.
- Processor: Converts audio into model-ready features and handles tokenization.
- Model: Generates text tokens from processed audio.
model_name = "openai/whisper-small"
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)
Step 4: Download and Load Audio
This step downloads a sample audio file from Hugging Face and saves it locally. Then torchaudio.load() reads the file and returns the audio waveform along with its sampling rate, preparing the speech input for the Whisper model.
import requests

url = "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac"
r = requests.get(url)
with open("sample.flac", "wb") as f:
    f.write(r.content)

audio, sampling_rate = torchaudio.load("sample.flac")
Step 5: Resample Audio to 16 kHz
Whisper requires audio sampled at 16 kHz. If the loaded audio has a different sampling rate, we resample it.
if sampling_rate != 16000:
    resampler = torchaudio.transforms.Resample(sampling_rate, 16000)
    audio = resampler(audio)
Step 6: Preprocess Audio
The processor converts the raw audio waveform into numerical features that the Whisper model can understand.
- audio.squeeze().numpy(): removes extra dimensions and converts the tensor to NumPy format
- sampling_rate=16000: ensures correct audio frequency
- return_tensors="pt": returns PyTorch tensors
inputs = processor(
    audio.squeeze().numpy(),
    sampling_rate=16000,
    return_tensors="pt"
)
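A quick NumPy illustration of why .squeeze() is needed here: torchaudio returns audio with a leading channel dimension, and squeezing removes it so the processor receives a plain 1-D array. The shapes below assume mono audio; stereo input would need channel mixing first.

```python
import numpy as np

# torchaudio.load returns a (channels, samples) tensor; mono audio is (1, N)
mono = np.zeros((1, 16000))
print(mono.shape)            # (1, 16000)
print(mono.squeeze().shape)  # (16000,)
```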
Step 7: Generate Transcription
- torch.no_grad(): Disables gradient computation (faster inference, less memory usage).
- model.generate(): Uses the Whisper model to generate text token IDs from the audio features.
- The output predicted_ids contains the predicted text tokens, which will be decoded into readable text in the next step.
with torch.no_grad():
    predicted_ids = model.generate(inputs["input_features"])
Step 8: Decode Output
- batch_decode(): converts the predicted token IDs into readable text.
- skip_special_tokens=True: removes unnecessary special tokens.
- [0]: extracts the first (and here only) transcription from the batch.
transcription = processor.batch_decode(
    predicted_ids,
    skip_special_tokens=True
)[0]
print("Transcription:", transcription)
Advantages
- Delivers high accuracy in transcription and translation, especially for podcasts, lectures and interviews
- Supports multiple languages, enabling transcription and translation across diverse datasets
- Handles background noise, accents and technical terms effectively
- Provides open-source models, allowing customization and research flexibility
- Offers both local (CLI) and cloud-based API options for different use cases
- Cost efficient compared to many other speech-to-text solutions
Applications
- Converts speech into text for podcasts, meetings and lectures
- Generates subtitles and captions for videos
- Translates audio across different languages
- Powers voice assistants and speech-based interfaces
- Enables transcription for customer support and call analysis
Limitations
- Performance may drop with extremely noisy or low-quality audio
- Large audio files require splitting due to size limits
- Real-time processing can be slower depending on hardware
- May struggle with highly domain-specific vocabulary
- Requires computational resources for large scale usage