
Mistral's Voxtral: A Guide With Demo Project

Learn how to run Mistral's Voxtral Mini 3B model with vLLM, set up an API, and build a Streamlit-based audio summarizer and Q&A app.
Jul 23, 2025  · 12 min read

Mistral recently launched its first open-source multimodal audio models, Voxtral Small and Voxtral Mini, built on Mistral's existing language model backbones and specifically optimized for audio understanding and transcription tasks.

My focus in this blog will be on Voxtral Mini 3B, a compact, open-source model designed for real-time audio-to-text tasks like transcription, summarization, and Q&A. With its efficient size and support for long-context reasoning, Voxtral Mini is especially powerful when paired with high-throughput inference frameworks like vLLM, making it ideal for building fast, offline audio applications.

In this tutorial, I’ll walk you through how to:

  • Run Voxtral Mini 3B with vLLM on Colab Pro
  • Deploy an ngrok-tunneled API endpoint to access your model from anywhere
  • Build a Streamlit-based audio assistant that performs transcription, summarization, and Q&A using raw audio input


What Is Mistral’s Voxtral?

Voxtral is Mistral's fully open-source audio model family designed for powerful speech understanding. Voxtral comes in two model sizes:

  • Voxtral Small: 24B parameters, ideal for production-scale, enterprise use
  • Voxtral Mini: 3B parameters, optimized for running locally on edge devices

Image: Voxtral benchmark results (Source: Mistral)

Voxtral accepts raw audio inputs (such as .wav or .mp3) and is trained to efficiently generate transcriptions and summaries from spoken content. Voxtral Mini is the smaller variant, optimized for fast inference and offline deployment.

It follows Mistral’s model format and is compatible with high-throughput inference frameworks like vLLM, making it an excellent choice for real-time transcription or lightweight speech applications.

How to Run Voxtral Mini 3B Using vLLM

In this section, I’ll walk you through how to serve Voxtral Mini 3B using vLLM on Colab Pro with a T4 GPU enabled. vLLM is chosen for its high-throughput, low-latency serving capabilities, which are ideal for models like Voxtral that require fast streaming responses with audio support.

Additionally, we'll use pyngrok to expose the vLLM server via a public endpoint, making it accessible to applications running on your local machine.

Step 1: Install dependencies

Let’s begin by installing the dependencies within our Colab environment. Run the following command in the Colab cell:

!uv pip install -U transformers accelerate "vllm[audio]" --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly

We begin by installing the required libraries using uv, a faster Python package installer. This pulls the nightly build of vllm[audio], which includes experimental support for audio-language models like Voxtral. If you do not have uv installed in your environment, first install it by running:

!pip install uv

Step 2: Install Mistral-Common

Next, we install the mistral-common package, which provides essential utilities required to interact with Voxtral and other Mistral-family models. This package includes tokenizers aligned with official implementations and Pydantic-based validation for message structures.

!pip install mistral-common --upgrade
!python -c "import mistral_common; print(mistral_common.__version__)"

The mistral-common package contains utility modules for Voxtral and other Mistral family models. It includes tokenizers, typed messages in Pydantic formats, and audio helper functions. This step ensures compatibility with Voxtral's internal handling of audio inputs and structured messages.
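To get a feel for what mistral-common provides, here's a minimal sketch that wraps a local audio file and a text prompt into a single OpenAI-style message, using the same classes we'll rely on later in the app (the file path is just a placeholder):

from mistral_common.protocol.instruct.messages import AudioChunk, TextChunk, UserMessage
from mistral_common.audio import Audio

# Load a local audio file (placeholder path) and wrap it as an AudioChunk
audio = Audio.from_file("sample.wav", strict=False)
audio_chunk = AudioChunk.from_audio(audio)

# Pair the audio with a text instruction and convert to an OpenAI-compatible message
text_chunk = TextChunk(text="Transcribe this audio.")
message = UserMessage(content=[audio_chunk, text_chunk]).to_openai()
print(message)  # a dict with "role" and multimodal "content"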

Step 3: Set up vLLM

Now that the basic Mistral dependencies are ready, we set up vLLM by cloning its main repository.

Step 3.1: Clone the vLLM repository

We clone the official vLLM GitHub repository, which gives us access to built-in audio examples and serves as the base for launching offline or hosted inference.

!git clone https://github.com/vllm-project/vllm

Step 3.2: Run a sample audio inference 

Let's run a quick sanity check to ensure our setup is correct.

!python vllm/examples/offline_inference/audio_language.py --num-audios 2 --model-type voxtral

I ran the offline audio_language.py example, which runs audio-language inference on two pre-defined samples using the Voxtral model and confirms whether audio decoding and generation work properly.

Step 4: Set up pyngrok

With vLLM now set up, let's configure pyngrok to expose the vLLM server on Colab through a publicly accessible internet address.

First, sign in to the ngrok dashboard and copy your authentication token.

Next, we install pyngrok within the Colab environment and set the authentication token using the ngrok token we copied. If you are working on a prototype, you can paste the token here directly; otherwise, load it from an environment variable.

!pip install pyngrok -q
from pyngrok import ngrok
ngrok.set_auth_token("NGROK_TOKEN")

Step 5: Serve the model

This is the key step, where we launch the Voxtral Mini 3B model via vLLM. 

!vllm serve mistralai/Voxtral-Mini-3B-2507 --tokenizer_mode mistral --config_format mistral --load_format mistral --max-model-len 4864

The code snippet includes several important flags:

  • --tokenizer_mode mistral: This uses the Mistral-specific tokenizer for accurate tokenization.
  • --config_format mistral and --load_format mistral: These flags ensure that both the model configuration and weights are loaded in Mistral's custom format, maintaining compatibility.
  • --max-model-len 4864: This flag caps the maximum context length (prompt plus generated tokens) at 4864 tokens.

Note: Keep this cell running as this is an active model server.
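Once the server reports that it is ready, you can sanity-check it, for example from a terminal, by listing the models it serves. This is a minimal sketch assuming the default port 8000:

import requests

# Query the OpenAI-compatible /v1/models route exposed by vLLM
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])  # should include mistralai/Voxtral-Mini-3B-2507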

Step 6: Get the ngrok public endpoint

Once your vLLM server is running, run the following code line by line in a Python terminal:

from pyngrok import ngrok
ngrok.set_auth_token("NGROK_TOKEN")
public_url = ngrok.connect(8000)
print("Public endpoint:", public_url)

The above code will expose the model to the internet at a temporary public URL like:

https://80xxxxxxxxxx.ngrok-free.app

Save this URL; it will be used by the local config.py to connect to the remote model. After the run is complete, terminate the tunnel by running:

ngrok.kill()

Building an Audio Transcription and Summarization Demo

In this section, we will build a Streamlit UI that:

  1. Takes audio as input (user upload)
  2. Sends it to the model over the vLLM server
  3. Displays the transcription, summary, and allows the user to ask questions

Step 1: Setting up dependencies

Before building the app, we define all required dependencies in a requirements.txt file. This ensures consistent environments across local runs or Colab notebooks.

streamlit>=1.28.0
openai>=1.0.0
mistral-common>=0.0.12
huggingface-hub>=0.19.0
pyngrok>=6.0.0
requests>=2.28.0
pydub>=0.25.1 

Here’s why we need each:

  • streamlit: This library helps in building the interactive audio interface.
  • openai: It interfaces with the Voxtral-compatible vLLM server using OpenAI's SDK.
  • mistral-common: It provides Voxtral-specific utilities like AudioChunk and TextChunk.
  • pyngrok: This exposes our local vLLM server to the internet, allowing the Streamlit app to access it.
  • pydub: This library helps in converting audio files for inference.
  • huggingface-hub and requests: These are used for model and config fetching (if needed).

Step 2: Setting up the environment variables

To keep configuration clean and portable, we store API details as environment variables. You can define them inside a .env file:

VOXTRAL_API_KEY=dummy-key
VOXTRAL_API_BASE=https://80xxxxxxxxxx.ngrok-free.app/v1
VOXTRAL_MODEL_NAME=mistralai/Voxtral-Mini-3B-2507

VOXTRAL_API_BASE should point to your running vLLM instance, which can be localhost (http://localhost:8000) or an ngrok public endpoint that looks like https://80xxxxxxxxxx.ngrok-free.app.

Note: Ensure that you append /v1 at the end of API base for OpenAI compatibility.
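These variables are consumed by config.py and the OpenAI client later on; as a quick sketch of how the base URL and key end up being used (the same pattern appears in initialize_client() below):

import os
from openai import OpenAI

# Any non-empty API key works here, since the vLLM server does not enforce
# authentication unless started with an --api-key flag.
client = OpenAI(
    api_key=os.getenv("VOXTRAL_API_KEY", "dummy-key"),
    base_url=os.getenv("VOXTRAL_API_BASE", "http://localhost:8000/v1"),
)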

Step 3: Setting up the config file

To make your app modular, easily adjustable, and ready for production, we’ll define a central configuration class in config.py. This file will control model settings, API access, supported audio formats, languages, and UI preferences.

Step 3.1: Load environment variables

Before we define the config class, we load any environment variables saved in a .env file. This keeps sensitive information separate from the code.

import os
from typing import Optional
from pathlib import Path
# Load environment variables from .env file
env_file = Path(".env")
if env_file.exists():
    with open(env_file, 'r') as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and lines without a key=value pair
            if line and not line.startswith('#') and '=' in line:
                key, value = line.split('=', 1)
                os.environ[key] = value

This snippet checks if a .env file exists in the root directory. If found, it reads key-value pairs and sets them as environment variables. These values can now be accessed using os.getenv(), ensuring that secrets don’t get hardcoded into the app.

Step 3.2: Define the Config class

Now, we wrap all settings into a clean Config class for easy access and reuse.

class Config:
    # API Configuration
    VOXTRAL_API_KEY: str = os.getenv("VOXTRAL_API_KEY", "EMPTY")
    VOXTRAL_API_BASE: str = os.getenv("VOXTRAL_API_BASE", "http://localhost:8000/v1")
    # Model Configuration
    MODEL_NAME: str = os.getenv("VOXTRAL_MODEL_NAME", "mistralai/Voxtral-Mini-3B-2507")
    # Default Parameters
    DEFAULT_TEMPERATURE: float = 0.2
    DEFAULT_TOP_P: float = 0.95
    # Audio Configuration
    MAX_AUDIO_SIZE_MB: int = 100
    SUPPORTED_AUDIO_FORMATS: list = ['mp3', 'wav', 'm4a', 'flac', 'ogg']
    SUPPORTED_LANGUAGES: list = [
        "English", "Spanish", "French", "German", "Italian", "Portuguese", 
        "Russian", "Chinese", "Japanese", "Korean", "Arabic", "Hindi"
    ]
    # UI Configuration
    STREAMLIT_THEME: dict = {
        "primaryColor": "#1f77b4",
        "backgroundColor": "#ffffff",
        "secondaryBackgroundColor": "#f0f2f6",
        "textColor": "#262730",
        "font": "sans serif"
    }
    @classmethod
    def get_api_config(cls) -> dict:
        return {
            "api_key": cls.VOXTRAL_API_KEY,
            "base_url": cls.VOXTRAL_API_BASE,
            "model_name": cls.MODEL_NAME
        }    
    @classmethod
    def validate_config(cls) -> bool:
        if not cls.VOXTRAL_API_BASE:
            return False
        return True    
    @classmethod
    def get_language_code(cls, language_name: str) -> Optional[str]:
        language_mapping = {
            "English": "en",
            "Spanish": "es", 
            "French": "fr",
            "German": "de",
            "Italian": "it",
            "Portuguese": "pt",
            "Russian": "ru",
            "Chinese": "zh",
            "Japanese": "ja",
            "Korean": "ko",
            "Arabic": "ar",
            "Hindi": "hi"
        }
        return language_mapping.get(language_name) 

Together, this Config class acts as the centralized control hub for your Voxtral-powered app. Whether you're configuring model parameters, managing audio formats, or setting the app theme, this setup ensures clean abstraction, code reusability, and maintainability.

The class includes:

  • get_api_config() method: This method centralizes all API-related credentials. Wherever your app makes a call to the Voxtral model, you can simply do Config.get_api_config() to retrieve everything required in one go.
  • validate_config() method: Before starting inference, this method allows you to validate that essential configuration values (like the base URL) are properly defined. If not, you can catch the issue early and alert the user.
  • get_language_code() method: This method enables seamless multilingual Q&A by mapping user-friendly language names from the UI (e.g., “French”) to standardized ISO language codes (e.g., “fr”). If an unsupported language is passed, it safely returns None.
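Here's a quick usage sketch of these helpers (the printed values assume the defaults defined above):

from config import Config

api = Config.get_api_config()
print(api["base_url"])                      # e.g. http://localhost:8000/v1
print(Config.validate_config())             # True when a base URL is set
print(Config.get_language_code("French"))   # "fr"
print(Config.get_language_code("Swahili"))  # None (not in the supported list)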

Step 4: Build the Streamlit app

Let’s walk through each sub-step that builds up the app interface and logic.

Step 4.1: Set up the layout and styling

In this step, we initialize the Streamlit frontend, load dependencies, and set up CSS style. This forms the visual and functional foundation of the app.

import streamlit as st
import tempfile
import os
from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage
from mistral_common.audio import Audio
from openai import OpenAI
import time
from config import Config
# Page configuration
st.set_page_config(
    page_title="Voxtral Audio Assistant",
    page_icon="🎵",
    layout="wide",
    initial_sidebar_state="expanded"
)
# Custom CSS (color and spacing values below are placeholder choices)
st.markdown("""
<style>
    .main-header {
        font-size: 2.5rem;        /* assumed value */
        font-weight: bold;
        text-align: center;
        margin-bottom: 1rem;      /* assumed value */
    }
    .section-header {
        font-size: 1.5rem;        /* assumed value */
        font-weight: bold;
        margin-top: 1rem;         /* assumed value */
    }
    .info-box {
        background-color: #e8f4fd;   /* assumed light tint */
        padding: 1rem;               /* assumed value */
        border-radius: 0.5rem;
        border-left: 4px solid #1f77b4;
    }
    .success-box {
        background-color: #eaf7ee;   /* assumed light tint */
        padding: 1rem;               /* assumed value */
        border-radius: 0.5rem;
        border-left: 4px solid #28a745;
    }
    .chat-message {
        padding: 1rem;               /* assumed value */
        border-radius: 0.5rem;
        margin-bottom: 0.5rem;       /* assumed value */
    }
    .user-message {
        background-color: #e3f2fd;   /* assumed light tint */
        border-left: 4px solid #2196f3;
    }
    .assistant-message {
        background-color: #f3e5f5;   /* assumed light tint */
        border-left: 4px solid #9c27b0;
    }
</style>
""", unsafe_allow_html=True)

In the above Python and CSS code:

  • We import essential libraries for UI, file handling, and audio/text message formatting using Voxtral’s mistral_common package.
  • We then configure the Streamlit app's layout, a dual-pane interface with a sidebar that is expanded by default.
  • Finally, we inject custom CSS to visually separate different parts of the UI, such as headers, audio sections, chat bubbles, and summary boxes. This improves the visual hierarchy of the app with styled titles, info messages, and distinct chat bubbles for user vs. assistant replies.

Step 4.2: Initialize session and client

Next, we ensure the app behaves consistently across reloads by initializing session variables and client.

def init_session_state():
    defaults = {
        'transcription': "",
        'summary': "",
        'chat_history': [],
        'audio_file_path': None
    }
    for key, default_value in defaults.items():
        if key not in st.session_state:
            st.session_state[key] = default_value
def initialize_client():
    config = Config.get_api_config() 
    client = OpenAI(
        api_key=config["api_key"],
        base_url=config["base_url"],
    ) 
    # Test connection
    try:
        models = client.models.list()
        return client
    except Exception as e:
        st.error(f"Failed to connect to Voxtral API: {str(e)}")
        st.info("Make sure your ngrok tunnel is running in Google Colab")
        return None

This code snippet sets up the application's internal state and handles the connection to the Voxtral API server. It includes two key components:

  1. Session initialization: It ensures that user data such as transcriptions, summaries, and uploaded audio file paths persists across app interactions and reruns.
  2. Client initialization: It connects to the Voxtral Mini 3B model served by vLLM via the OpenAI-compatible endpoint (exposed through ngrok). If the connection fails, an error message is shown in the UI.

Step 4.3: Audio chunking and real-time transcription

This step handles the core logic of preparing the audio file and generating a transcription using Voxtral Mini 3B.

def file_to_chunk(file_path: str) -> AudioChunk:
    audio = Audio.from_file(file_path, strict=False)
    return AudioChunk.from_audio(audio)
def transcribe_audio(client, audio_file_path):
    try:
        with open(audio_file_path, "rb") as f:
            response = client.audio.transcriptions.create(
                file=f,
                model=Config.MODEL_NAME,
                response_format="text",
                stream=True
            )       
            transcription = ""
            progress_bar = st.progress(0)
            status_text = st.empty()
            # Collect all chunks first to get total count
            chunks = list(response)
            total_chunks = len(chunks)
            for i, chunk in enumerate(chunks):
                delta = chunk.choices[0].get("delta", {}).get("content")
                if delta:
                    transcription += delta
                    progress = min((i + 1) / max(total_chunks, 1), 1.0)
                    progress_bar.progress(progress)
                    status_text.text(f"Transcribing... {len(transcription)} characters")
            progress_bar.empty()
            status_text.empty()
            return transcription
    except Exception as e:
        st.error(f"Error during transcription: {str(e)}")
        return None

There are two key functions here:

  1. Audio chunk conversion: First, we convert the uploaded audio file into the AudioChunk format, which is the input requirement for Voxtral’s API.
  2. Streaming transcription with progress feedback: We request the transcription with stream=True and decode it chunk by chunk. The chunks are first collected to determine the total count, then a progress bar updates as each one is processed, giving users feedback on transcription progress (a simpler non-streaming variant is sketched below).
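If you don't need the progress bar, a simpler non-streaming variant is possible. This is a sketch (the helper name is mine) that assumes the same OpenAI-compatible transcription endpoint accepts a non-streamed request:

def transcribe_audio_simple(client, audio_file_path):
    """Non-streaming transcription: returns the full text in a single response."""
    with open(audio_file_path, "rb") as f:
        return client.audio.transcriptions.create(
            file=f,
            model=Config.MODEL_NAME,
            response_format="text",  # the SDK returns the transcript as a plain string
        )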

Step 4.4: Generate summary from audio

This step sends the audio input along with a textual prompt to the Voxtral model, which returns a concise, structured summary of the audio content.

def generate_summary(client, audio_file_path):
    try:
        audio_chunk = file_to_chunk(audio_file_path)
        text_chunk = TextChunk(text="Please provide a comprehensive summary of this audio content, highlighting the key points and main themes discussed.")
        user_msg = UserMessage(content=[audio_chunk, text_chunk]).to_openai()    
        response = client.chat.completions.create(
            model=Config.MODEL_NAME,
            messages=[user_msg],
            temperature=Config.DEFAULT_TEMPERATURE,
            top_p=Config.DEFAULT_TOP_P,
        )
        return response.choices[0].message.content
    except Exception as e:
        st.error(f"Error generating summary: {str(e)}")
        return None

The above code uses several key methods and functions to generate a summary from the uploaded audio:

  • audio_chunk: First, we convert the uploaded audio file into an AudioChunk for Voxtral’s audio processing.
  • text_chunk: Then, a text prompt is used to instruct the model to summarize the content in a clear and comprehensive manner.
  • UserMessage: This combines the audio_chunk and text_chunk into a single multi-modal message in an OpenAI-compatible format. 
  • response.choices[0].message.content: This expression extracts the actual summary text from the model’s response.

Step 4.5: Multilingual Q&A over audio

Now, we have the summary. Let’s enable the users to ask natural language questions about the uploaded audio content. This uses Voxtral’s multimodal capabilities by combining audio with text prompts to generate context-aware answers.

The following function also supports multilingual responses by dynamically modifying the prompt based on the selected language.

def ask_question(client, audio_file_path, question, language="English"):
    try:
        audio_chunk = file_to_chunk(audio_file_path)
        if language != "English":
            question = f"Please answer the following question in {language}: {question}"
        text_chunk = TextChunk(text=question)
        user_msg = UserMessage(content=[audio_chunk, text_chunk]).to_openai()        
        response = client.chat.completions.create(
            model=Config.MODEL_NAME,
            messages=[user_msg],
            temperature=Config.DEFAULT_TEMPERATURE,
            top_p=Config.DEFAULT_TOP_P,
        )        
        return response.choices[0].message.content
    except Exception as e:
        st.error(f"Error asking question: {str(e)}")
        return None

Here is an outline of what’s happening:

  • The ask_question() function allows users to ask any question about the uploaded audio.
  • It begins by converting the uploaded file into an AudioChunk using the reusable file_to_chunk() function, which prepares the audio in the format expected by Voxtral’s multimodal API.
  • Next, it checks the selected language. If it’s not English, the user’s question is prefixed with an instruction asking the model to respond in the desired language (e.g., Hindi or Spanish).
  • The user’s question is wrapped in a TextChunk, and both the audio and text are bundled into a multimodal UserMessage.
  • We send this message to the model, along with temperature and top-p values from the configuration to control generation quality.
  • The model then returns a single-turn response.

Step 4.6: Setting up the sidebar

The sidebar functions as a control panel, allowing users to manage session settings, select the output language, test the connection to the Voxtral API, and customize the model's response behavior through intuitive sliders.

def render_sidebar():
    with st.sidebar:
        st.markdown('<h3 class="section-header">Configuration</h3>', unsafe_allow_html=True)        
        # Connection status
        st.markdown('<h4>Connection Status</h4>', unsafe_allow_html=True)
        if st.button("Test Connection"):
            client = initialize_client()
            if client:
                st.success("Connected to Voxtral API")
            else:
                st.error("Connection failed")        
        # Language selection
        selected_language = st.selectbox("Select language for Q&A:", Config.SUPPORTED_LANGUAGES)        
        # Model configuration
        st.markdown('<h4>Model Settings</h4>', unsafe_allow_html=True)
        temperature = st.slider("Temperature", 0.0, 1.0, Config.DEFAULT_TEMPERATURE, 0.1)
        top_p = st.slider("Top P", 0.0, 1.0, Config.DEFAULT_TOP_P, 0.05)
        if st.button("Clear Session"):
            for key in ['transcription', 'summary', 'chat_history', 'audio_file_path']:
                st.session_state[key] = "" if key in ['transcription', 'summary'] else [] if key == 'chat_history' else None
            st.rerun()        
        return selected_language, temperature, top_p

Here’s a breakdown of the above code:

  • Sidebar: The st.sidebar object opens a collapsible sidebar where we store configuration tools for the app.
  • Connection test: A "Test Connection" button triggers the initialize_client() function, which attempts to connect to the Voxtral API and reports success or failure using Streamlit alerts.
  • Language selection: A dropdown menu lets users pick one of the 12 supported languages. This selection later informs the model to respond in that language.
  • Model parameters: Sliders allow the user to control model parameters such as:
    • temperature: This controls creativity. Lower values make responses more deterministic, while higher ones make them more diverse.
    • top_p: This parameter controls nucleus sampling, which restricts generation to the most probable tokens.
  • Session reset: A "Clear Session" button resets all session state variables and re-runs the app to start fresh.

Step 4.7: Audio upload and processing section

This section enables users to upload audio files (e.g., .mp3 or .wav), which are temporarily saved. Once uploaded, users can choose to either transcribe the audio or generate a high-level summary.

def render_audio_processing():
    st.markdown('<h3 class="section-header">Audio Upload & Processing</h3>', unsafe_allow_html=True)    
    uploaded_file = st.file_uploader(
        "Choose an audio file",
        type=Config.SUPPORTED_AUDIO_FORMATS,
        help="Upload an audio file to transcribe and analyze"
    )
    if uploaded_file is not None:
        with tempfile.NamedTemporaryFile(delete=False, suffix=f".{uploaded_file.name.split('.')[-1]}") as tmp_file:
            tmp_file.write(uploaded_file.getvalue())
            st.session_state.audio_file_path = tmp_file.name
        st.success(f"File uploaded: {uploaded_file.name}")
        # Initialize client
        client = initialize_client()
        col1, col2 = st.columns(2) 
        with col1:
            if st.button("Generate Summary", type="primary"):
                with st.spinner("Generating summary..."):
                    summary = generate_summary(client, st.session_state.audio_file_path)
                    if summary:
                        st.session_state.summary = summary
        with col2:
            if st.button("Transcribe Audio", type="secondary"):
                with st.spinner("Transcribing audio..."):
                    transcription = transcribe_audio(client, st.session_state.audio_file_path)
                    if transcription:
                        st.session_state.transcription = transcription
        if st.session_state.summary:
            st.markdown('<h4>Summary</h4>', unsafe_allow_html=True)
            st.markdown(f'<div class="success-box">{st.session_state.summary}</div>', unsafe_allow_html=True)
        if st.session_state.transcription:
            st.markdown('<h4>Audio Transcription</h4>', unsafe_allow_html=True)
            st.text_area("Transcription", st.session_state.transcription, height=200, label_visibility="collapsed")

We require a modular structure to facilitate easy audio uploads, processing, and persistent storage of results across user interactions. Here is how I did it:

  • File uploader: First, we need a drag-and-drop file uploader that accepts multiple audio formats defined in Config.SUPPORTED_AUDIO_FORMATS. Once a file is uploaded, it is written to a temporary file and saved to st.session_state.audio_file_path.
  • Client initialization: A Voxtral-compatible client is initialized to allow calls for summary and transcription.
  • Summary and transcription buttons: Next, we define two buttons:
    • Generate summary: When clicked, this button sends the uploaded audio to the model with a summary prompt. A progress spinner is shown, and if successful, the output is saved to st.session_state.summary.
    • Transcribe audio: This button sends the audio file to the transcription endpoint. Real-time progress is shown using a spinner, and the result is stored in st.session_state.transcription.
  • Finally, the summary is rendered in a styled box and the transcription in a text area.

Step 4.8: Interactive multilingual Q&A

This step enables users to ask questions about the uploaded audio in multiple languages (e.g., English, Hindi, Spanish). The selected language is used to format the model’s response, and each question-answer pair is stored in a session-managed conversation history and displayed using chat bubbles.

def render_qa_section(selected_language):
    st.markdown('<h3 class="section-header">Multilingual Q&A</h3>', unsafe_allow_html=True) 
    if st.session_state.audio_file_path:
        st.markdown(f'<div class="info-box">Selected language: <strong>{selected_language}</strong></div>', unsafe_allow_html=True)
        question = st.text_input(
            f"Ask a question about the audio (in {selected_language}):",
            placeholder="e.g., What is the main topic discussed?"
        )    
        if st.button("Ask Question", type="primary") and question:
            client = initialize_client()
            with st.spinner("Processing your question..."):
                answer = ask_question(client, st.session_state.audio_file_path, question, selected_language)
                if answer:
                    # Chat history
                    st.session_state.chat_history.append({
                        "question": question,
                        "answer": answer,
                        "language": selected_language,
                        "timestamp": time.strftime("%H:%M:%S")
                    })
                    st.success("Question answered!")
                else:
                    st.error("Failed to get answer. Please try again.")
        if st.session_state.chat_history:
            st.markdown('<h4>Conversation History</h4>', unsafe_allow_html=True)
            for chat in reversed(st.session_state.chat_history):
                st.markdown(f'<div class="chat-message user-message"><strong>Question:</strong> {chat["question"]}</div>', unsafe_allow_html=True)
                st.markdown(f'<div class="chat-message assistant-message"><strong>Answer:</strong> {chat["answer"]}</div>', unsafe_allow_html=True)
                st.markdown("---")
    else:
        st.markdown('<div class="info-box">Please upload an audio file to start asking questions.</div>', unsafe_allow_html=True)

Here is a breakdown of what’s happening in the above code snippet:

  • Ask Question button: A text input field allows users to type a question in the selected language and click the Ask Question button. Once clicked, it:
    • Initializes the Voxtral-compatible client.
    • Sends the question along with the uploaded audio to the model.
    • Displays a spinner while waiting for the model's response.
  • Response handling: If a response is received, it is:
    • Stored in st.session_state.chat_history with a timestamp.
    • Rendered with role-specific formatting using the custom CSS from earlier (user vs assistant bubble).
    • If the model fails, a clear error message is shown.
  • Conversation history: If previous Q&A pairs exist, they are shown in reverse chronological order with a separator.

Step 4.9: Launching the app

This final step brings together all previously defined components, including initializing session state, rendering the sidebar, handling audio uploads, transcription, summaries, and Q&A using a two-column layout.

Once the server is running and all dependencies are installed, this step ensures that the app is served to the browser via Streamlit and is ready for interaction.

def main():
    st.markdown('<h1 class="main-header">🎵 Voxtral Audio Assistant</h1>', unsafe_allow_html=True)
    # Initialize session state
    init_session_state()
    # Render sidebar and get configuration
    selected_language, temperature, top_p = render_sidebar()
    col1, col2 = st.columns([1, 1]) 
    with col1:
        render_audio_processing()
    with col2:
        render_qa_section(selected_language)
    st.markdown("---")
    st.markdown("""
    <div style="text-align: center;">
        <p>Powered by <strong>Voxtral Mini 3B</strong> | Built with Streamlit</p>
        <p>Supports multiple languages for audio analysis and Q&A</p>
    </div>
    """, unsafe_allow_html=True)
if __name__ == "__main__":
    main() 

The main() function acts as the entry point for the app. It controls the layout, orchestrates rendering, and ensures all session variables and UI components are initialized properly. Here is a summary of what the above-defined functions describe:

  • init_session_state(): This function prepares default values like transcription, summary, and chat_history to persist app state across interactions.
  • render_sidebar(): It loads configuration widgets including language selector, connection test, and model parameters.
  • render_audio_processing(): This function lets users upload and process audio files for transcription and summary generation.
  • render_qa_section(): Finally, this function enables multilingual question-answering on uploaded audio.

Once everything is ready, install the dependencies:

pip install -r requirements.txt

Make sure the vLLM server is reachable at the URL specified in VOXTRAL_API_BASE (loaded by config.py). Once all dependencies are installed, launch the Streamlit application by running the following command in the terminal:

streamlit run app.py

Your browser will open the interface automatically. The app is now fully functional with audio upload, real-time transcription, summaries, and multilingual Q&A, powered by Voxtral Mini 3B.
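If you want to verify the pipeline without the UI, here's a minimal sketch that sends an audio file plus a summary prompt straight to the server, mirroring what generate_summary() does (the audio path is a placeholder):

from openai import OpenAI
from mistral_common.protocol.instruct.messages import AudioChunk, TextChunk, UserMessage
from mistral_common.audio import Audio
from config import Config

cfg = Config.get_api_config()
client = OpenAI(api_key=cfg["api_key"], base_url=cfg["base_url"])

# Build the same multimodal message the app constructs internally
audio_chunk = AudioChunk.from_audio(Audio.from_file("meeting.mp3", strict=False))
text_chunk = TextChunk(text="Summarize this audio in three bullet points.")
msg = UserMessage(content=[audio_chunk, text_chunk]).to_openai()

response = client.chat.completions.create(model=cfg["model_name"], messages=[msg])
print(response.choices[0].message.content)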

You can find all the code we’ve covered here in this GitHub repository I’ve set up.

Conclusion

Voxtral Mini 3B is an efficient, open-weight model for speech transcription and understanding. In this tutorial, we ran it using vLLM, set up an API, and built an audio summarizer app using Streamlit.

Whether you're building a podcast summarizer, meeting assistant, or voice-command app, Voxtral Mini can serve as the foundation for fast and local audio reasoning.


Author: Aashi Dutt

I am a Google Developers Expert in ML(Gen AI), a Kaggle 3x Expert, and a Women Techmakers Ambassador with 3+ years of experience in tech. I co-founded a health-tech startup in 2020 and am pursuing a master's in computer science at Georgia Tech, specializing in machine learning.
