Mistral recently launched its first open-source multimodal audio models, Voxtral Small and Voxtral Mini, both specifically optimized for audio understanding and transcription tasks.
My focus in this blog will be on Voxtral Mini 3B, a compact, open-source model designed for real-time audio-to-text tasks like transcription, summarization, and Q&A. With its efficient size and support for long-context reasoning, Voxtral Mini is especially powerful when paired with high-throughput inference frameworks like vLLM, making it ideal for building fast, offline audio applications.
In this tutorial, I’ll walk you through how to:
- Run Voxtral Mini 3B with vLLM on Colab Pro
- Deploy an ngrok-tunneled API endpoint to access your model from anywhere
- Build a Streamlit-based audio assistant that performs transcription, summarization, and Q&A using raw audio input
What Is Mistral’s Voxtral?
Voxtral is Mistral’s fully open‑source audio model family designed for powerful speech understanding. Voxtral comes in two model sizes:
- Voxtral Small: 24 B parameters, ideal for production-scale, enterprise use
- Voxtral Mini: 3 B parameters, optimized for running locally on edge devices

Source: Mistral
Voxtral accepts raw audio inputs (such as .wav or .mp3) and is trained to efficiently generate transcriptions and summaries from spoken content. Voxtral Mini is the smaller variant, optimized for fast inference and offline deployment.
It follows Mistral’s model format and is compatible with high-throughput inference frameworks like vLLM, making it an excellent choice for real-time transcription or lightweight speech applications.
How to Run Voxtral Mini 3B Using vLLM
In this section, I'll walk you through how to serve Voxtral Mini 3B using vLLM on Colab Pro with a T4 GPU enabled. vLLM is chosen for its high-throughput, low-latency serving capabilities, which are ideal for models like Voxtral that require fast streaming responses with audio support.
Additionally, we'll use pyngrok to expose the vLLM server via a public endpoint, making it accessible to applications running on your local machine.
Step 1: Install dependencies
Let’s begin by installing the dependencies within our Colab environment. Run the following command in the Colab cell:
!uv pip install -U transformers accelerate "vllm[audio]" --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
We begin by installing the required libraries using uv, a faster Python package installer. This pulls the nightly build of vllm[audio], which includes experimental support for audio-language models like Voxtral. If you do not have uv installed in your environment, first install it by running the following code:
!pip install uv
Step 2: Install Mistral-Common
Next, we install the mistral-common package, which provides essential utilities required to interact with Voxtral and other Mistral-family models. This package includes tokenizers aligned with official implementations and Pydantic-based validation for message structures.
!pip install mistral-common --upgrade
!python -c "import mistral_common; print(mistral_common.__version__)"
The mistral-common package contains utility modules for Voxtral and other Mistral family models. It includes tokenizers, typed messages in Pydantic formats, and audio helper functions. This step ensures compatibility with Voxtral's internal handling of audio inputs and structured messages.
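To see what these utilities look like in practice, here is a minimal sketch you can run in a Colab cell once the package is installed; the audio path is a placeholder you would replace with your own file:
from mistral_common.audio import Audio
from mistral_common.protocol.instruct.messages import AudioChunk, TextChunk, UserMessage

# Load a local audio file (placeholder path) and wrap it as an AudioChunk
audio = Audio.from_file("sample.wav", strict=False)
audio_chunk = AudioChunk.from_audio(audio)

# Combine audio and text into a single multimodal user message in OpenAI-compatible format
text_chunk = TextChunk(text="Transcribe this recording.")
message = UserMessage(content=[audio_chunk, text_chunk]).to_openai()
print(type(message))
This is the same pattern the Streamlit app uses later when it sends audio to the model.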
Step 3: Set up vLLM
Now that the basic Mistral dependencies are ready, we set up vLLM by cloning its main repository.
Step 3.1: Clone the vLLM repository
We clone the official vLLM GitHub repository, which gives us access to built-in audio examples and serves as the base for launching offline or hosted inference.
!git clone https://github.com/vllm-project/vllm
Step 3.2: Run a sample audio inference
Let's run a quick sanity check to ensure our setup is correct.
!python vllm/examples/offline_inference/audio_language.py --num-audios 2 --model-type voxtral
I ran the offline audio_language.py example, which runs audio-language inference on two pre-defined samples using the Voxtral model and confirms whether audio decoding and generation work properly.
Step 4: Set up pyngrok
With vLLM now set up, let's configure pyngrok to expose the vLLM server on Colab through a publicly accessible internet address.
- Start by creating a free account on https://ngrok.com/
- Scroll down to the Connect tab, copy your authtoken, and save it as a secret in Colab Secrets.

Next, we install pyngrok within the Colab environment and set the authentication token using the ngrok authtoken we copied previously. If you are only building a prototype, you can pass the token here directly.
!pip install pyngrok -q
from pyngrok import ngrok
ngrok.set_auth_token("NGROK_TOKEN")
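For anything beyond a quick prototype, it is safer to read the token from Colab Secrets instead of hard-coding it. Here is a minimal sketch, assuming you stored the secret under the name NGROK_TOKEN and granted this notebook access to it:
from google.colab import userdata
from pyngrok import ngrok

# Fetch the token stored in Colab Secrets (assumed secret name: NGROK_TOKEN)
ngrok.set_auth_token(userdata.get("NGROK_TOKEN"))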
Step 5: Serve the model
This is the key step, where we launch the Voxtral Mini 3B model via vLLM.
!vllm serve mistralai/Voxtral-Mini-3B-2507 --tokenizer_mode mistral --config_format mistral --load_format mistral --max-model-len 4864
The code snippet includes several important flags:
- --tokenizer_mode mistral: This uses the Mistral-specific tokenizer for accurate tokenization.
- --config_format mistral and --load_format mistral: These flags ensure that both the model configuration and weights are loaded in Mistral's custom format, maintaining compatibility.
- --max-model-len 4864: This flag sets the maximum input context length to 4864 tokens.
Note: Keep this cell running as this is an active model server.
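If you would rather keep the notebook free for the next steps instead of leaving this cell running and switching to the Colab terminal, one common workaround is to launch the server in the background. This is a sketch rather than part of the original workflow; the log file name is arbitrary:
# Launch the server in the background and write its logs to vllm_server.log
!nohup vllm serve mistralai/Voxtral-Mini-3B-2507 --tokenizer_mode mistral --config_format mistral --load_format mistral --max-model-len 4864 > vllm_server.log 2>&1 &
# Check the log until the server reports it is listening on port 8000
!tail -n 30 vllm_server.log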
Step 6: Get the ngrok public endpoint
Once your vLLM server is running, run the following code line by line in a Python terminal:
from pyngrok import ngrok
ngrok.set_auth_token(NGROK_TOKEN)
public_url = ngrok.connect(8000)
print("Public endpoint:", public_url)
The above code will expose the model to the internet at a temporary public URL like:
https://80xxxxxxxxxx.ngrok-free.app
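You can quickly verify that the endpoint is reachable before wiring it into the app. Here is a minimal sketch, assuming you swap the placeholder for the URL printed above (note the /v1 suffix, and that vLLM does not check the API key unless you configure one):
from openai import OpenAI

# Replace the placeholder with your ngrok URL
client = OpenAI(api_key="dummy-key", base_url="https://80xxxxxxxxxx.ngrok-free.app/v1")

# Listing models should return mistralai/Voxtral-Mini-3B-2507 if the server is up
for model in client.models.list():
    print(model.id)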
Save this URL, which will be used in the local config.py to connect to the remote model. After the run is complete, terminate the connection by running:
ngrok.kill()
Building an Audio Transcription and Summarization Demo
In this section, we will build a Streamlit UI that:
- Takes audio as input (user upload)
- Sends it to the model over the vLLM server
- Displays the transcription, summary, and allows the user to ask questions
Step 1: Setting up dependencies
Before building the app, we define all required dependencies in a requirements.txt file. This ensures consistent environments across local runs or Colab notebooks.
streamlit>=1.28.0
openai>=1.0.0
mistral-common>=0.0.12
huggingface-hub>=0.19.0
pyngrok>=6.0.0
requests>=2.28.0
pydub>=0.25.1
Here’s why we need each:
- streamlit: This library helps in building the interactive audio interface.
- openai: It interfaces with the Voxtral-compatible vLLM server using OpenAI's SDK.
- mistral-common: It provides Voxtral-specific utilities like AudioChunk and TextChunk.
- pyngrok: This exposes our local vLLM server to the internet, allowing the Streamlit app to access it.
- pydub: This library helps in converting audio files for inference.
- huggingface-hub and requests: These are used for model and config fetching (if needed).
Step 2: Setting up the environment variable
To keep configuration clean and portable, we store API details as environment variables. You can define them inside a .env file:
VOXTRAL_API_KEY=dummy-key
VOXTRAL_API_BASE=https://80xxxxxxxxxx.ngrok-free.app/v1
VOXTRAL_MODEL_NAME=mistralai/Voxtral-Mini-3B-2507
VOXTRAL_API_BASE should point to your running vLLM instance, which can be localhost (http://localhost:8000) or an ngrok public endpoint that looks like https://80xxxxxxxxxx.ngrok-free.app.
Note: Ensure that you append /v1 at the end of API base for OpenAI compatibility.
Step 3: Setting up the config file
To make your app modular, easily adjustable, and ready for production, we’ll define a central configuration class in config.py. This file will control model settings, API access, supported audio formats, languages, and UI preferences.
Step 3.1: Load environment variables
Before we define the config class, we load any environment variables saved in a .env file. This keeps sensitive information separate from the code.
import os
from typing import Optional
from pathlib import Path

# Load environment variables from .env file
env_file = Path(".env")
if env_file.exists():
    with open(env_file, 'r') as f:
        for line in f:
            if line.strip() and not line.startswith('#'):
                key, value = line.strip().split('=', 1)
                os.environ[key] = value
This snippet checks if a .env file exists in the root directory. If found, it reads key-value pairs and sets them as environment variables. These values can now be accessed using os.getenv(), ensuring that secrets don’t get hardcoded into the app.
Step 3.2: Define the Config class
Now, we wrap all settings into a clean Config class for easy access and reuse.
class Config:
    # API Configuration
    VOXTRAL_API_KEY: str = os.getenv("VOXTRAL_API_KEY", "EMPTY")
    VOXTRAL_API_BASE: str = os.getenv("VOXTRAL_API_BASE", "http://localhost:8000/v1")

    # Model Configuration (the default must match the name passed to `vllm serve`)
    MODEL_NAME: str = os.getenv("VOXTRAL_MODEL_NAME", "mistralai/Voxtral-Mini-3B-2507")

    # Default Parameters
    DEFAULT_TEMPERATURE: float = 0.2
    DEFAULT_TOP_P: float = 0.95

    # Audio Configuration
    MAX_AUDIO_SIZE_MB: int = 100
    SUPPORTED_AUDIO_FORMATS: list = ['mp3', 'wav', 'm4a', 'flac', 'ogg']
    SUPPORTED_LANGUAGES: list = [
        "English", "Spanish", "French", "German", "Italian", "Portuguese",
        "Russian", "Chinese", "Japanese", "Korean", "Arabic", "Hindi"
    ]

    # UI Configuration
    STREAMLIT_THEME: dict = {
        "primaryColor": "#1f77b4",
        "backgroundColor": "#ffffff",
        "secondaryBackgroundColor": "#f0f2f6",
        "textColor": "#262730",
        "font": "sans serif"
    }

    @classmethod
    def get_api_config(cls) -> dict:
        return {
            "api_key": cls.VOXTRAL_API_KEY,
            "base_url": cls.VOXTRAL_API_BASE,
            "model_name": cls.MODEL_NAME
        }

    @classmethod
    def validate_config(cls) -> bool:
        if not cls.VOXTRAL_API_BASE:
            return False
        return True

    @classmethod
    def get_language_code(cls, language_name: str) -> Optional[str]:
        language_mapping = {
            "English": "en",
            "Spanish": "es",
            "French": "fr",
            "German": "de",
            "Italian": "it",
            "Portuguese": "pt",
            "Russian": "ru",
            "Chinese": "zh",
            "Japanese": "ja",
            "Korean": "ko",
            "Arabic": "ar",
            "Hindi": "hi"
        }
        return language_mapping.get(language_name)
Together, this Config class acts as the centralized control hub for your Voxtral-powered app. Whether you're configuring model parameters, managing audio formats, or setting the app theme, this setup ensures clean abstraction, code reusability, and maintainability.
The class includes:
- get_api_config() method: This method centralizes all API-related credentials. Wherever your app makes a call to the Voxtral model, you can simply call Config.get_api_config() to retrieve everything required in one go.
- validate_config() method: Before starting inference, this method allows you to validate that essential configuration values (like the base URL) are properly defined. If not, you can catch the issue early and alert the user.
- get_language_code() method: This method enables seamless multilingual Q&A by mapping user-friendly language names from the UI (e.g., “French”) to standardized ISO language codes (e.g., “fr”). If an unsupported language is passed, it safely returns None.
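As a quick illustration of how these helpers are used elsewhere in the app, here is a minimal sketch (the printed values depend on your .env):
from config import Config

# Validate configuration before making any inference calls
if not Config.validate_config():
    raise RuntimeError("VOXTRAL_API_BASE is not set")

# Retrieve the API credentials and model name in one go
api = Config.get_api_config()
print(api["base_url"], api["model_name"])

# Map a UI language name to its ISO code ("fr"); unsupported names return None
print(Config.get_language_code("French"))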
Step 4: Build the Streamlit app
Let’s walk through each sub-step that builds up the app interface and logic.
Step 4.1: Set up the layout and styling
In this step, we initialize the Streamlit frontend, load dependencies, and set up CSS style. This forms the visual and functional foundation of the app.
import streamlit as st
import tempfile
import os
from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage
from mistral_common.audio import Audio
from openai import OpenAI
import time
from config import Config

# Page configuration
st.set_page_config(
    page_title="Voxtral Audio Assistant",
    page_icon="🎵",
    layout="wide",
    initial_sidebar_state="expanded"
)

# Custom CSS
st.markdown("""
<style>
    /* background colors below are illustrative placeholders */
    .main-header {
        font-weight: bold;
        text-align: center;
    }
    .section-header {
        font-weight: bold;
    }
    .info-box {
        background-color: #f0f2f6;
        border-radius: 0.5rem;
        border-left: 4px solid #1f77b4;
    }
    .success-box {
        background-color: #f0f2f6;
        border-radius: 0.5rem;
        border-left: 4px solid #28a745;
    }
    .chat-message {
        border-radius: 0.5rem;
    }
    .user-message {
        background-color: #f0f2f6;
        border-left: 4px solid #2196f3;
    }
    .assistant-message {
        background-color: #f0f2f6;
        border-left: 4px solid #9c27b0;
    }
</style>
""", unsafe_allow_html=True)
In the above Python and CSS code:
- We import essential libraries for UI, file handling, and audio/text message formatting using Voxtral’s mistral_common package.
- We then configure the Streamlit app's layout, which is a dual-pane interface with a sidebar that is expanded by default.
- Finally, we inject custom CSS to visually separate different parts of the UI, such as headers, audio sections, chat bubbles, and summary boxes. This improves the visual hierarchy of the app with styled titles, info messages, and distinct chat bubbles for user vs. assistant replies.
Step 4.2: Initialize session and client
Next, we ensure the app behaves consistently across reloads by initializing the session variables and the API client.
def init_session_state():
    defaults = {
        'transcription': "",
        'summary': "",
        'chat_history': [],
        'audio_file_path': None
    }
    for key, default_value in defaults.items():
        if key not in st.session_state:
            st.session_state[key] = default_value

def initialize_client():
    config = Config.get_api_config()
    client = OpenAI(
        api_key=config["api_key"],
        base_url=config["base_url"],
    )
    # Test connection
    try:
        models = client.models.list()
        return client
    except Exception as e:
        st.error(f"Failed to connect to Voxtral API: {str(e)}")
        st.info("Make sure your ngrok tunnel is running in Google Colab")
        return None
This code snippet sets up the application's internal state and handles the connection to the Voxtral API server. It includes two key components:
- Session initialization: It ensures that user data such as transcriptions, summaries, and the uploaded audio file path persist correctly across app interactions and reloads.
- Client initialization: It connects to the Voxtral Mini 3B model via the OpenAI-compatible endpoint exposed through vLLM (served over ngrok). If the connection fails, an error message is shown in the UI.
Step 4.3: Audio chunking and real-time transcription
This step handles the core logic of preparing the audio file and generating a transcription using Voxtral Mini 3B.
def file_to_chunk(file_path: str) -> AudioChunk:
    audio = Audio.from_file(file_path, strict=False)
    return AudioChunk.from_audio(audio)

def transcribe_audio(client, audio_file_path):
    try:
        with open(audio_file_path, "rb") as f:
            response = client.audio.transcriptions.create(
                file=f,
                model=Config.MODEL_NAME,
                response_format="text",
                stream=True
            )
        transcription = ""
        progress_bar = st.progress(0)
        status_text = st.empty()
        # Collect all chunks first to get total count
        chunks = list(response)
        total_chunks = len(chunks)
        for i, chunk in enumerate(chunks):
            delta = chunk.choices[0].get("delta", {}).get("content")
            if delta:
                transcription += delta
            progress = min((i + 1) / max(total_chunks, 1), 1.0)
            progress_bar.progress(progress)
            status_text.text(f"Transcribing... {len(transcription)} characters")
        progress_bar.empty()
        status_text.empty()
        return transcription
    except Exception as e:
        st.error(f"Error during transcription: {str(e)}")
        return None
There are two key functions here:
- Audio chunk conversion: First, we convert the uploaded audio file into the AudioChunk format, which is the input requirement for Voxtral’s API.
- Streaming transcription with progress feedback: We send the audio to the model in a streamed fashion (stream=True) and decode it chunk by chunk. A progress bar dynamically updates as tokens are received, giving users real-time feedback on transcription progress.
Step 4.4: Generate summary from audio
This step sends the audio input along with a textual prompt to the Voxtral model, which returns a concise, structured summary of the audio content.
def generate_summary(client, audio_file_path):
    try:
        audio_chunk = file_to_chunk(audio_file_path)
        text_chunk = TextChunk(text="Please provide a comprehensive summary of this audio content, highlighting the key points and main themes discussed.")
        user_msg = UserMessage(content=[audio_chunk, text_chunk]).to_openai()
        response = client.chat.completions.create(
            model=Config.MODEL_NAME,
            messages=[user_msg],
            temperature=Config.DEFAULT_TEMPERATURE,
            top_p=Config.DEFAULT_TOP_P,
        )
        return response.choices[0].message.content
    except Exception as e:
        st.error(f"Error generating summary: {str(e)}")
        return None
The above code uses several key methods and objects to generate a summary from the uploaded audio:
- audio_chunk: First, we convert the uploaded audio file into an AudioChunk for Voxtral’s audio processing.
- text_chunk: Then, a text prompt is used to instruct the model to summarize the content in a clear and comprehensive manner.
- UserMessage: This combines the audio_chunk and text_chunk into a single multimodal message in an OpenAI-compatible format.
- response.choices[0].message.content: This extracts the actual summary text from the model’s response.
Step 4.5: Multilingual Q&A over audio
Now, we have the summary. Let’s enable the users to ask natural language questions about the uploaded audio content. This uses Voxtral’s multimodal capabilities by combining audio with text prompts to generate context-aware answers.
The following function also supports multilingual responses by dynamically modifying the prompt based on the selected language.
def ask_question(client, audio_file_path, question, language="English"):
    try:
        audio_chunk = file_to_chunk(audio_file_path)
        if language != "English":
            question = f"Please answer the following question in {language}: {question}"
        text_chunk = TextChunk(text=question)
        user_msg = UserMessage(content=[audio_chunk, text_chunk]).to_openai()
        response = client.chat.completions.create(
            model=Config.MODEL_NAME,
            messages=[user_msg],
            temperature=Config.DEFAULT_TEMPERATURE,
            top_p=Config.DEFAULT_TOP_P,
        )
        return response.choices[0].message.content
    except Exception as e:
        st.error(f"Error asking question: {str(e)}")
        return None
Here is an outline of what’s happening here:
- The ask_question() function allows users to ask any question about the uploaded audio.
- It begins by converting the uploaded file into an AudioChunk using the reusable file_to_chunk() function, which prepares the audio in the format expected by Voxtral’s multimodal API.
- Next, it checks the selected language. If it’s not English, the user’s question is prefixed with an instruction asking the model to respond in the desired language (e.g., Hindi or Spanish).
- The user’s question is wrapped in a TextChunk, and both the audio and text are bundled into a multimodal UserMessage.
- We send this message to the model, along with temperature and top-p values from the configuration to control generation quality.
- The model then returns a single-turn response.
Step 4.6: Setting up the sidebar
The sidebar functions as a control panel, allowing users to manage session settings, select the output language, test the connection to the Voxtral API, and customize the model's response behavior through intuitive sliders.
def render_sidebar():
    with st.sidebar:
        st.markdown('<h3 class="section-header">Configuration</h3>', unsafe_allow_html=True)
        # Connection status
        st.markdown('<h4>Connection Status</h4>', unsafe_allow_html=True)
        if st.button("Test Connection"):
            client = initialize_client()
            if client:
                st.success("Connected to Voxtral API")
            else:
                st.error("Connection failed")
        # Language selection
        selected_language = st.selectbox("Select language for Q&A:", Config.SUPPORTED_LANGUAGES)
        # Model configuration
        st.markdown('<h4>Model Settings</h4>', unsafe_allow_html=True)
        temperature = st.slider("Temperature", 0.0, 1.0, Config.DEFAULT_TEMPERATURE, 0.1)
        top_p = st.slider("Top P", 0.0, 1.0, Config.DEFAULT_TOP_P, 0.05)
        if st.button("Clear Session"):
            for key in ['transcription', 'summary', 'chat_history', 'audio_file_path']:
                st.session_state[key] = "" if key in ['transcription', 'summary'] else [] if key == 'chat_history' else None
            st.rerun()
    return selected_language, temperature, top_p
Here’s a breakdown of the above code:
- Sidebar: The st.sidebar object opens a collapsible sidebar where we store configuration tools for the app.
- Connection test: A "Test Connection" button triggers the initialize_client() function, which attempts to connect to the Voxtral API and reports success or failure using Streamlit alerts.
- Language selection: A dropdown menu lets users pick one of the 12 supported languages. This selection later informs the model to respond in that language.
- Model parameters: Sliders allow the user to control model parameters such as:
  - temperature: This controls creativity. Lower values make responses more deterministic, while higher ones make them more diverse.
  - top_p: This parameter controls nucleus sampling, which restricts generation to the most probable tokens.
- Session reset: A "Clear Session" button resets all session state variables and re-runs the app to start fresh.
Step 4.7: Audio upload and processing section
This section enables users to upload audio files (like .mp3, .wav, etc.), which are temporarily saved. Once uploaded, users can choose to either transcribe the audio or generate a high-level summary.
def render_audio_processing():
    st.markdown('<h3 class="section-header">Audio Upload & Processing</h3>', unsafe_allow_html=True)
    uploaded_file = st.file_uploader(
        "Choose an audio file",
        type=Config.SUPPORTED_AUDIO_FORMATS,
        help="Upload an audio file to transcribe and analyze"
    )
    if uploaded_file is not None:
        with tempfile.NamedTemporaryFile(delete=False, suffix=f".{uploaded_file.name.split('.')[-1]}") as tmp_file:
            tmp_file.write(uploaded_file.getvalue())
            st.session_state.audio_file_path = tmp_file.name
        st.success(f"File uploaded: {uploaded_file.name}")
        # Initialize client
        client = initialize_client()
        col1, col2 = st.columns(2)
        with col1:
            if st.button("Generate Summary", type="primary"):
                with st.spinner("Generating summary..."):
                    summary = generate_summary(client, st.session_state.audio_file_path)
                    if summary:
                        st.session_state.summary = summary
        with col2:
            if st.button("Transcribe Audio", type="secondary"):
                with st.spinner("Transcribing audio..."):
                    transcription = transcribe_audio(client, st.session_state.audio_file_path)
                    if transcription:
                        st.session_state.transcription = transcription
    if st.session_state.summary:
        st.markdown('<h4>Summary</h4>', unsafe_allow_html=True)
        st.markdown(f'<div class="success-box">{st.session_state.summary}</div>', unsafe_allow_html=True)
    if st.session_state.transcription:
        st.markdown('<h4>Audio Transcription</h4>', unsafe_allow_html=True)
        st.text_area("Transcription", st.session_state.transcription, height=200, label_visibility="collapsed")
We require a modular structure to facilitate easy audio uploads, processing, and persistent storage of results across user interactions. Here is how I did it:
- File uploader: First, we need a drag-and-drop file uploader that accepts the audio formats defined in Config.SUPPORTED_AUDIO_FORMATS. Once a file is uploaded, it is written to a temporary file and its path saved to st.session_state.audio_file_path.
- Client initialization: A Voxtral-compatible client is initialized to allow calls for summary and transcription.
- Summary and transcription buttons: Next, we define two buttons:
  - Generate summary: When clicked, this button sends the uploaded audio to the model with a summary prompt. A progress spinner is shown, and if successful, the output is saved to st.session_state.summary.
  - Transcribe audio: This button sends the audio file to the transcription endpoint. Real-time progress is shown using a spinner, and the result is stored in st.session_state.transcription.
- Finally, both results are displayed: the summary in a styled box and the transcription in a text area.
Step 4.8: Interactive multilingual Q&A
This step enables users to ask questions about the uploaded audio in multiple languages (e.g., English, Hindi, Spanish). The selected language is used to format the model’s response, and each question-answer pair is stored in a session-managed conversation history and displayed using chat bubbles.
def render_qa_section(selected_language):
    st.markdown('<h3 class="section-header">Multilingual Q&A</h3>', unsafe_allow_html=True)
    if st.session_state.audio_file_path:
        st.markdown(f'<div class="info-box">Selected language: <strong>{selected_language}</strong></div>', unsafe_allow_html=True)
        question = st.text_input(
            f"Ask a question about the audio (in {selected_language}):",
            placeholder="e.g., What is the main topic discussed?"
        )
        if st.button("Ask Question", type="primary") and question:
            client = initialize_client()
            with st.spinner("Processing your question..."):
                answer = ask_question(client, st.session_state.audio_file_path, question, selected_language)
                if answer:
                    # Chat history
                    st.session_state.chat_history.append({
                        "question": question,
                        "answer": answer,
                        "language": selected_language,
                        "timestamp": time.strftime("%H:%M:%S")
                    })
                    st.success("Question answered!")
                else:
                    st.error("Failed to get answer. Please try again.")
        if st.session_state.chat_history:
            st.markdown('<h4>Conversation History</h4>', unsafe_allow_html=True)
            for chat in reversed(st.session_state.chat_history):
                st.markdown(f'<div class="chat-message user-message"><strong>Question:</strong> {chat["question"]}</div>', unsafe_allow_html=True)
                st.markdown(f'<div class="chat-message assistant-message"><strong>Answer:</strong> {chat["answer"]}</div>', unsafe_allow_html=True)
                st.markdown("---")
    else:
        st.markdown('<div class="info-box">Please upload an audio file to start asking questions.</div>', unsafe_allow_html=True)
Here is a breakdown of what’s happening in the above code snippet:
- Ask Question button: A text input field allows users to type a question in the selected language and click the Ask Question button. Once clicked, it:
  - Initializes the Voxtral-compatible client.
  - Sends the question along with the uploaded audio to the model.
  - Displays a spinner while waiting for the model's response.
- Response handling: If a response is received, it is:
  - Stored in st.session_state.chat_history with a timestamp.
  - Rendered with role-specific formatting using the custom CSS from earlier (user vs. assistant bubble).
  - If the model fails, a clear error message is shown.
- Conversation history: If previous Q&A pairs exist, they are shown in reverse chronological order with a separator.
Step 4.9: Launching the app
This final step brings together all previously defined components, including initializing session state, rendering the sidebar, handling audio uploads, transcription, summaries, and Q&A using a two-column layout.
Once the server is running and all dependencies are installed, this step ensures that the app is served to the browser via Streamlit and is ready for interaction.
def main():
    st.markdown('<h1 class="main-header">🎵 Voxtral Audio Assistant</h1>', unsafe_allow_html=True)
    # Initialize session state
    init_session_state()
    # Render sidebar and get configuration
    selected_language, temperature, top_p = render_sidebar()
    col1, col2 = st.columns([1, 1])
    with col1:
        render_audio_processing()
    with col2:
        render_qa_section(selected_language)
    st.markdown("---")
    st.markdown("""
    <div style="text-align: center;">
        <p>Powered by <strong>Voxtral Mini 3B</strong> | Built with Streamlit</p>
        <p>Supports multiple languages for audio analysis and Q&A</p>
    </div>
    """, unsafe_allow_html=True)

if __name__ == "__main__":
    main()
The main() function acts as the entry point for the app. It controls the layout, orchestrates rendering, and ensures all session variables and UI components are initialized properly. Here is a summary of what the above-defined functions describe:
- init_session_state(): This function prepares default values like transcription, summary, and chat_history to persist app state across interactions.
- render_sidebar(): It loads configuration widgets including the language selector, connection test, and model parameters.
- render_audio_processing(): This function lets users upload and process audio files for transcription and summary generation.
- render_qa_section(): Finally, this function enables multilingual question-answering on uploaded audio.
Once everything is ready, install the dependencies:
pip install -r requirements.txt
Make sure the vLLM server is reachable at the URL specified in VOXTRAL_API_BASE, which config.py loads from your .env file. Once all dependencies are installed, launch the Streamlit application by running the following command in the terminal:
streamlit run app.py
Your browser will open the interface automatically. The app is now fully functional with audio upload, real-time transcription, summaries, and multilingual Q&A, powered by Voxtral Mini 3B.
You can find all the code we’ve covered here in this GitHub repository I’ve set up.
Conclusion
Voxtral Mini 3B is an efficient, open-weight model for speech transcription and understanding. In this tutorial, we ran it using vLLM, set up an API, and built an audio summarizer app using Streamlit.
Whether you're building a podcast summarizer, meeting assistant, or voice-command app, Voxtral Mini can serve as the foundation for fast and local audio reasoning.

I am a Google Developers Expert in ML (Gen AI), a Kaggle 3x Expert, and a Women Techmakers Ambassador with 3+ years of experience in tech. I co-founded a health-tech startup in 2020 and am pursuing a master's in computer science at Georgia Tech, specializing in machine learning.
