OpenAI Whisper is a speech recognition model that converts audio into text. It supports multiple tasks such as transcription, translation and language detection, making it highly useful for working with audio data.
- Converts speech into text (speech-to-text)
- Supports multiple languages and accents
- Can translate audio into English
- Works well even with noisy audio
How OpenAI Whisper Works
Whisper processes audio through multiple stages to convert speech into accurate text.
- Audio Preprocessing: The input audio is split into smaller segments and converted into spectrograms, which represent sound frequencies visually
- Feature Extraction: The model extracts important linguistic and acoustic patterns from these spectrograms
- Language Identification: If the language is unknown, the model detects it automatically
- Speech Recognition: The model predicts the most likely sequence of words based on the extracted features
- Translation (Optional): The recognized text can be translated into another language if required
- Post-processing: The output is refined using language rules to improve accuracy and readability
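As an illustration of the preprocessing stage, here is a minimal NumPy sketch that splits a waveform into fixed 30-second segments, the window size Whisper operates on internally. The function name and zero-padding scheme are illustrative, not Whisper's actual implementation, which also converts each segment into a log-mel spectrogram.

```python
import numpy as np

def split_into_segments(waveform, sample_rate=16000, segment_seconds=30):
    """Split a 1-D waveform into fixed-length segments, zero-padding the last one."""
    segment_len = sample_rate * segment_seconds
    # Pad so the total length is a multiple of the segment length
    pad = (-len(waveform)) % segment_len
    padded = np.pad(waveform, (0, pad))
    return padded.reshape(-1, segment_len)

# 70 seconds of audio at 16 kHz -> three 30-second segments (last one padded)
audio = np.zeros(70 * 16000, dtype=np.float32)
segments = split_into_segments(audio)
print(segments.shape)  # (3, 480000)
```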
Implementation Using the OpenAI API
Step 1: Install the OpenAI Library
!pip install -q openai
Step 2: Import Library
Import the OpenAI library and create a client object, replacing "YOUR_API_KEY" with your generated API key in the code below.
To learn how to get an OpenAI API key, refer to: OpenAI API Key
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")
Step 3: Transcribe Audio
Converts speech into text in the same language.
audio_file = open("Path to an audio file", "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_file
)
print(transcript.text)
Step 4: Translate Audio to English
Translates audio in any supported language into English text, using the translations endpoint.
audio_file = open("audio.mp3", "rb")
translation = client.audio.translations.create(
    model="whisper-1",
    file=audio_file
)
print(translation.text)
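The API rejects audio files above a size limit (25 MB at the time of writing), so long recordings must be split before upload. Below is a minimal sketch using only Python's standard wave module, assuming an uncompressed WAV input; the chunk_seconds value and output file naming are illustrative, and compressed formats like MP3 would need an audio library instead.

```python
import wave

def split_wav(path, chunk_seconds=60):
    """Split a WAV file into chunks of at most chunk_seconds each."""
    out_paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = src.getframerate() * chunk_seconds
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = f"chunk_{index}.wav"
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)  # header frame count is fixed up on close
                dst.writeframes(frames)
            out_paths.append(out_path)
            index += 1
    return out_paths
```

Each resulting chunk can then be passed to the transcription or translation endpoint in its own request.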
Implementation Using Hugging Face
Step 1: Set Up the Environment
First, install the required libraries. Run the following commands one at a time in your command prompt.
pip install transformers --upgrade
pip install torch torchaudio
Step 2: Import Required Modules
This step sets up the foundational components required to build the speech-to-text pipeline.
- WhisperProcessor: Prepares audio input for the Whisper model (feature extraction + decoding).
- WhisperForConditionalGeneration: Loads the Whisper speech-to-text model developed by OpenAI.
- torch: Core deep learning framework used to run the model and handle tensors.
- torchaudio: Used to load and preprocess audio files before feeding them into the model.
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import torchaudio
Step 3: Load Model and Processor
We load the pre-trained Whisper Small model developed by OpenAI from Hugging Face.
- Processor: Converts audio into model-ready features and handles tokenization.
- Model: Generates text tokens from processed audio.
model_name = "openai/whisper-small"
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)
Step 4: Download and Load Audio
This step downloads a sample audio file from Hugging Face and saves it locally. Then torchaudio.load() reads the file and returns the audio waveform along with its sampling rate, preparing the speech input for the Whisper model.
import requests

url = "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac"
r = requests.get(url)
with open("sample.flac", "wb") as f:
    f.write(r.content)

audio, sampling_rate = torchaudio.load("sample.flac")
Step 5: Resample Audio to 16 kHz
Whisper requires audio sampled at 16 kHz. If the loaded audio has a different sampling rate, we resample it.
if sampling_rate != 16000:
    resampler = torchaudio.transforms.Resample(sampling_rate, 16000)
    audio = resampler(audio)
Step 6: Preprocess Audio
The processor converts the raw audio waveform into numerical features that the Whisper model can understand.
- audio.squeeze().numpy(): removes extra dimensions and converts the tensor to NumPy format
- sampling_rate=16000: ensures correct audio frequency
- return_tensors="pt": returns PyTorch tensors
inputs = processor(
    audio.squeeze().numpy(),
    sampling_rate=16000,
    return_tensors="pt"
)
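A quick NumPy illustration of why .squeeze() is needed here: torchaudio returns audio with a leading channel dimension, and squeezing removes it so the processor receives a plain 1-D array. The shapes below assume mono audio; stereo input would need channel mixing first.

```python
import numpy as np

# torchaudio.load returns a (channels, samples) tensor; mono audio is (1, N)
mono = np.zeros((1, 16000))
print(mono.shape)            # (1, 16000)
print(mono.squeeze().shape)  # (16000,)
```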
Step 7: Generate Transcription
- torch.no_grad(): Disables gradient computation (faster inference, less memory usage).
- model.generate(): Uses the Whisper model to generate text token IDs from the audio features.
- The output predicted_ids contains the predicted text tokens, which will be decoded into readable text in the next step.
with torch.no_grad():
    predicted_ids = model.generate(inputs["input_features"])
Step 8: Decode Output
- batch_decode(): converts the predicted token IDs into readable text.
- skip_special_tokens=True: removes unnecessary special tokens.
- [0]: extracts the first (and here only) transcription from the batch.
transcription = processor.batch_decode(
    predicted_ids,
    skip_special_tokens=True
)[0]
print("Transcription:", transcription)
Advantages
- Delivers high accuracy in transcription and translation, especially for podcasts, lectures and interviews
- Supports multiple languages, enabling transcription and translation across diverse datasets
- Handles background noise, accents and technical terms effectively
- Provides open-source models, allowing customization and research flexibility
- Offers both local (CLI) and cloud-based API options for different use cases
- Cost efficient compared to many other speech-to-text solutions
Applications
- Converts speech into text for podcasts, meetings and lectures
- Generates subtitles and captions for videos
- Translates audio across different languages
- Powers voice assistants and speech-based interfaces
- Enables transcription for customer support and call analysis
Limitations
- Performance may drop with extremely noisy or low-quality audio
- Large audio files require splitting due to size limits
- Real-time processing can be slower depending on hardware
- May struggle with highly domain-specific vocabulary
- Requires computational resources for large scale usage