OpenAI Whisper

Last Updated : 14 Apr, 2026

OpenAI Whisper is a speech recognition model that converts audio into text. It supports multiple tasks such as transcription, translation and language detection, making it highly useful for working with audio data.

  • Converts speech into text (speech-to-text)
  • Supports multiple languages and accents
  • Can translate audio into English
  • Works well even with noisy audio

Working of OpenAI Whisper

Whisper processes audio through multiple stages to convert speech into accurate text.

  1. Audio Preprocessing: The input audio is split into smaller segments and converted into spectrograms, which represent sound frequencies visually (a spectrogram sketch follows this list)
  2. Feature Extraction: The model extracts important linguistic and acoustic patterns from these spectrograms
  3. Language Identification: If the language is unknown, the model detects it automatically
  4. Speech Recognition: The model predicts the most likely sequence of words based on the extracted features
  5. Translation (Optional): The recognized text can be translated into another language if required
  6. Post-processing: The output is refined using language rules to improve accuracy and readability
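
As a concrete illustration of the preprocessing stage, the sketch below computes a log-mel spectrogram with torchaudio, the same kind of representation Whisper consumes internally (Whisper uses an 80-channel log-mel spectrogram over 16kHz audio). The file name "sample.wav" is a placeholder, and the window and hop settings are assumptions chosen to approximate Whisper's front end.

Python
import torch
import torchaudio

# Load an audio file (placeholder path) and resample to 16kHz,
# the rate Whisper expects
waveform, sr = torchaudio.load("sample.wav")
if sr != 16000:
    waveform = torchaudio.transforms.Resample(sr, 16000)(waveform)

# 80-channel mel spectrogram, roughly matching Whisper's front end:
# 25 ms windows (n_fft=400) with a 10 ms hop (hop_length=160)
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=400,
    hop_length=160,
    n_mels=80
)(waveform)

# Log scaling, since Whisper works on log-mel features
log_mel = torch.log(mel + 1e-10)
print(log_mel.shape)  # (channels, 80, time_frames)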

Implementation Using OpenAI

Step 1: Install the OpenAI Library

!pip install -q openai

Step 2: Import Library and Create a Client

Import the OpenAI library and create a client with your generated API key, replacing "YOUR_API_KEY" with your own key in the code below.

To learn how to get an OpenAI API key, refer to: OpenAI API Key

Python
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")

Step 3: Transcribe Audio

Converts speech into text in the same language.

Python
audio_file = open("Path to an audio file", "rb")

transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file
)

print(transcript.text)
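
The transcriptions endpoint also accepts a response_format parameter (such as "text", "srt" or "vtt"), which is handy for generating subtitles directly. A small variation on the call above, reusing the same client and a placeholder file path:

Python
audio_file = open("Path to an audio file", "rb")

# Request SRT-formatted output (subtitle text with timestamps);
# with a non-JSON format the API returns a plain string
srt_transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file,
    response_format="srt"
)

print(srt_transcript)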

Step 4: Translate Audio to English

Translates audio from any supported language into English. Note that this uses the translations endpoint rather than the transcriptions endpoint.

Python
audio_file = open("audio.mp3", "rb")

translation = client.audio.translations.create(
    model="whisper-1",
    file=audio_file
)

print(translation.text)

Implementation Using Hugging Face

Step 1: Set Up the Environment

First, install the required libraries. Run the following commands one by one in your command prompt.

pip install transformers --upgrade
pip install torch torchaudio

Step 2: Import Required Modules

This step sets up the foundational components required to build the speech-to-text pipeline.

  • WhisperProcessor: Prepares audio input for the Whisper model (feature extraction + decoding).
  • WhisperForConditionalGeneration: Loads the Whisper speech-to-text model developed by OpenAI.
  • torch: Core deep learning framework used to run the model and handle tensors.
  • torchaudio: Used to load and preprocess audio files before feeding them into the model.
Python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import torchaudio

Step 3: Load Model and Processor

We load the pre-trained Whisper Small model developed by OpenAI from Hugging Face.

  • Processor: Converts audio into model-ready features and handles tokenization.
  • Model: Generates text tokens from processed audio.
Python
model_name = "openai/whisper-small"

processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)


Step 4: Download and Load Audio

This step downloads a sample audio file from Hugging Face and saves it locally. Then torchaudio.load() reads the file and returns the audio waveform and its sampling rate, preparing the speech input for the Whisper model.

Python
import requests

url = "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac"

r = requests.get(url)

with open("sample.flac", "wb") as f:
    f.write(r.content)

audio, sampling_rate = torchaudio.load("sample.flac")

Step 5: Resampling Audio to 16kHz

Whisper requires audio sampled at 16kHz. If the loaded audio has a different sampling rate, we resample it.

Python
if sampling_rate != 16000:
    resampler = torchaudio.transforms.Resample(sampling_rate, 16000)
    audio = resampler(audio)
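
Note that torchaudio.load() returns a tensor of shape (channels, samples). The sample file used here is mono, but for stereo audio the squeeze() call in the next step would leave a 2-D array, so a reasonable precaution (an addition beyond the original flow) is to downmix to mono first:

Python
# Downmix stereo or multi-channel audio to mono by averaging
# channels; Whisper expects a single-channel waveform
if audio.shape[0] > 1:
    audio = audio.mean(dim=0, keepdim=True)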

Step 6: Preprocess Audio

The processor converts the raw audio waveform into numerical features that the Whisper model can understand.

  • audio.squeeze().numpy(): removes extra dimensions and converts the tensor to NumPy format
  • sampling_rate=16000: ensures correct audio frequency
  • return_tensors="pt": returns PyTorch tensors
Python
inputs = processor(
    audio.squeeze().numpy(),
    sampling_rate=16000,
    return_tensors="pt"
)

Step 7: Generate Transcription

  • torch.no_grad(): Disables gradient computation (faster inference, less memory usage).
  • model.generate(): Uses the Whisper model to generate text token IDs from the audio features.
  • The output predicted_ids contains the predicted text tokens, which will be decoded into readable text in the next step.
Python
with torch.no_grad():
    predicted_ids = model.generate(inputs["input_features"])

Step 8: Decode Output

  • batch_decode(): converts the predicted token IDs into readable text.
  • skip_special_tokens=True: removes unnecessary special tokens.
  • [0]: extracts the final transcription from the batch.
Python
transcription = processor.batch_decode(
    predicted_ids,
    skip_special_tokens=True
)[0]

print("Transcription:", transcription)

Advantages

  • Delivers high accuracy in transcription and translation, especially for podcasts, lectures and interviews
  • Supports multiple languages, enabling transcription and translation across diverse datasets
  • Handles background noise, accents and technical terms effectively
  • Provides open-source models, allowing customization and research flexibility
  • Offers both local (CLI) and cloud-based API options for different use cases
  • Cost efficient compared to many other speech-to-text solutions

Applications

  • Converts speech into text for podcasts, meetings and lectures
  • Generates subtitles and captions for videos
  • Translates audio across different languages
  • Powers voice assistants and speech based interfaces
  • Enables transcription for customer support and call analysis

Limitations

  • Performance may drop with extremely noisy or low quality audio
  • Large audio files require splitting due to size limits (see the sketch after this list)
  • Real time processing can be slower depending on hardware
  • May struggle with highly domain specific vocabulary
  • Requires computational resources for large scale usage
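
For the size-limit issue above, a common workaround is to split long recordings into chunks before transcribing each piece. A minimal sketch using pydub (an extra dependency, installed with pip install pydub, which also requires ffmpeg; the file name is a placeholder):

Python
from pydub import AudioSegment

# Split a long recording into 10-minute chunks so each piece
# stays under the API upload limit
audio = AudioSegment.from_file("long_audio.mp3")

chunk_ms = 10 * 60 * 1000  # 10 minutes in milliseconds
for i in range(0, len(audio), chunk_ms):
    chunk = audio[i:i + chunk_ms]
    chunk.export(f"chunk_{i // chunk_ms}.mp3", format="mp3")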