Hybrid Offline-Online Voice Assistant

Uploaded by Rupak Saini
ABSTRACT

This invention relates to a hybrid system that works both offline and online,
allowing users to interact through voice commands without needing constant
internet access. A wake word starts the interaction, after which speech is
processed locally without sending data online. The system includes a large
language model that runs locally for reasoning and can switch between different
modes depending on the situation. When needed, it connects online to fetch
real-time information such as weather, time, and news. A local retrieval
component helps generate responses from on-device data. The design prioritizes
privacy and quick response times, connecting to the internet only when
necessary to add up-to-date information.

A hybrid reasoning system combines local LLM processing with real-time online
data. Most general questions and logic-based tasks are handled entirely offline
using the local LLM. If a query needs up-to-date information or internet-based
sources, like weather, time, news, or media, the system uses specific online APIs.
It also has a local retrieval tool that organizes personal documents, notes, PDFs,
and stored data, allowing the assistant to answer customized questions without
needing an internet connection.

The invention also includes an action-execution layer that lets users control
the device: opening apps, automating keyboard and mouse actions, playing
music, taking screenshots, finding files, sending emails, or opening websites.
There's a reminder and alarm system that lets users schedule tasks to run
automatically in the background. An offline translator tool uses speech
recognition, local translation, and offline text-to-speech for communication in
multiple languages. The text-to-speech part has a dual system that first tries to use
Piper for offline speech and switches to an online service if needed.

This combined system offers a versatile, privacy-focused, self-contained smart
assistant that can work independently, carry out various tasks, and use real-time
information when needed.
BACKGROUND OF THE INVENTION
Voice assistants have become a common feature in many consumer devices, but
the current systems still face several important issues that affect privacy,
flexibility, independence, and the ability to work without an internet connection.
Most popular assistants like Amazon Alexa, Google Assistant, and Apple Siri
depend heavily on cloud-based systems to handle tasks such as activating the
wake word, understanding speech, interpreting language, and generating
responses. Since these systems often send or analyze user audio through the
internet, they raise privacy concerns and require a continuous internet
connection. When there's no internet, most of these assistants can't work
properly or at all. Also, they don't give users direct control over local device
functions such as launching apps, automating tasks, searching for files, sending
system commands, or carrying out personalized workflows.

Offline assistants, while able to do some tasks like speech recognition or limited
command execution, usually can't access up-to-date information such as weather,
time zones, and news.
They also lack the reasoning ability that modern large language models offer,
which means they can't handle tasks like holding conversations, translating
languages, solving math problems, or writing code with high accuracy. These
offline tools don't combine features like wake-word detection with hybrid
reasoning or the ability to retrieve personal documents. This forces users to
choose between a cloud-based assistant that offers less privacy or an offline
assistant with limited intelligence and no real-time data access.

Another major shortcoming in existing systems is the lack of an integrated
framework that can switch between offline reasoning, online support, and
autonomous local actions.
Current assistants aren't built to perform complex sequences, such as opening
apps, changing volume, setting reminders, automating messages on WhatsApp or
email, or retrieving information from personal notes. They also rarely support
retrieval-augmented generation (RAG) for local documents, and they don't offer
features like modular mode switching, personalized memory, or user-controlled
privacy settings.
Because of these issues, there's a clear need for a single, unified assistant that
provides strong spoken interaction, smart reasoning, execution of tools, retrieval
of personal knowledge, and real-time enhanced responses, all while keeping user
privacy and ensuring reliable offline operation.
This invention meets that need by combining wake-word detection, offline speech
recognition, local LLM processing, hybrid online enhancement, mode switching,
automated task execution, local RAG, reminders, media playback, and dual-mode
text-to-speech in one system. This new invention introduces a type of voice
assistant that is private, independent, intelligent, and capable of working well
both online and offline.

SUMMARY OF THE INVENTION

The invention introduces a hybrid voice assistant that works both when
connected to the internet and when it's not, offering smart interactions, the
ability to complete tasks on its own, and real-time helpful answers while keeping
user data private and ensuring it works even without a connection. It combines
several parts that run on the device itself—like recognizing the wake word,
converting speech to text, local thinking using a large language model, answering
questions by retrieving information, and performing actions—into one working
system that can function without needing a constant internet connection.

When the wake word is detected by an offline system, the assistant listens and
turns the user's speech into text using a model on the device.
Then, an intent system checks what the user is asking for and decides if the task
needs to be done offline, with information from the internet, to control the
device, translate, do calculations, write code, or look up personal documents. For
general conversations, thinking, translation, coding, and math, the assistant uses
a local large language model to give answers without needing an internet
connection. If the question is about real-time info like weather, time, or news, the
system uses a special online layer to pull data from simple APIs and combine it
with the model’s responses.

The system has a way to switch between different modes, like chatting, doing
tasks, searching, or keeping things private.
A tool layer lets the assistant do actions like opening apps, adjusting volume,
managing files, launching websites, sending messages, or playing audio from
YouTube. There’s also a scheduling feature that lets the assistant handle tasks at
specific times automatically. A local search module can go through user
documents for answers when offline, and a memory section stores preferences
and past chats locally. The assistant speaks using a text-to-speech system that
prefers to work offline but can switch to an online service if needed.

This design creates a powerful, private, and flexible assistant that brings smart
offline features together with the ability to use online resources when necessary.

DETAILED DESCRIPTION OF THE INVENTION

The invention describes a hybrid autonomous voice assistant system that
combines offline speech processing, intelligent reasoning, online augmentation,
device automation, and multimodal task execution in a single structure. The
system is designed to work offline by default, ensuring privacy and reliability,
but can access online resources when needed for real-time data or media. The
invention uses a new approach that includes wake-word activation, buffered
audio capture, local speech-to-text processing, intent classification, hybrid
reasoning, retrieval augmentation, and autonomous task execution.

The system starts in an idle monitoring state, where a dedicated offline
wake-word detection module listens to incoming audio without sending any data
to the internet.
When it detects a wake word, the system moves into an active listening mode. A
buffered audio capture system records the user's speech for a set time, with a
warm-up period to stabilize the microphone input. Noise reduction and optional
voice activity detection may be used to improve the accuracy of the speech
transcription.
The captured audio is processed by a local speech-to-text engine, such as an
on-device Whisper model, which turns spoken input into text.
The text is sent to an intent-classification and mode-routing module to determine
the nature of the user's request. The system supports multiple modes, including
Conversation Mode for general dialogue and reasoning, Task Mode for device
automation commands, Search Mode for retrieving local or online data, and
Offline Mode where all processing stays on the device.
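The mode routing described above can be sketched as a small keyword-based classifier. The rule patterns, mode names, and the "switch to ... mode" phrasing are illustrative assumptions drawn from this description; in the full system a local LLM would handle anything the rules miss.

```python
import re
from enum import Enum, auto

class Mode(Enum):
    CONVERSATION = auto()
    TASK = auto()
    SEARCH = auto()
    OFFLINE = auto()

# Hypothetical keyword rules; a real router would combine these with
# entity extraction and an LLM fallback for ambiguous requests.
_RULES = [
    (re.compile(r"\b(open|launch|screenshot|volume)\b", re.I), Mode.TASK),
    (re.compile(r"\b(weather|news|search|look up)\b", re.I), Mode.SEARCH),
]

def classify(text: str, current: Mode = Mode.CONVERSATION) -> Mode:
    """Route a transcript to a mode; explicit switch commands win."""
    m = re.search(r"switch to (\w+) mode", text, re.I)
    if m and m.group(1).upper() in Mode.__members__:
        return Mode[m.group(1).upper()]
    for pattern, mode in _RULES:
        if pattern.search(text):
            return mode
    return current  # default: stay in the current (conversation) mode
```

Mode selection then gates which tools the Router may call, as described below.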

For conversational queries, reasoning tasks, translation requests, coding
instructions, and mathematical computations, the system uses a local
large-language model (LLM) that runs completely on the user's device. This
offline reasoning layer ensures that personal or sensitive queries do not
require internet access. A calculation engine and a code-generation module
enhance the system's ability to solve mathematical problems or produce code
locally.

When the user's intent involves real-time information, such as weather updates,
time-zone conversions, or news headlines, the system activates an online
augmentation layer. This layer retrieves data from lightweight APIs and blends
the factual information with LLM-generated responses to keep the conversation
natural. A fall-back mechanism ensures reliability by switching to alternative
APIs if the primary sources fail.
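That fall-back chain can be sketched as a list of data fetchers tried in order. The fetchers here are hypothetical callables standing in for real API clients (for example an Open-Meteo wrapper); a production version would also add the timeouts and caching mentioned elsewhere in this description.

```python
def fetch_with_fallback(fetchers, offline_msg="Real-time data is unavailable offline."):
    """Try each API fetcher in order; degrade to an offline message if all fail.

    `fetchers` is a list of zero-argument callables, each a stand-in for a
    primary or alternative API client. Any exception or None result moves
    the chain on to the next source.
    """
    for fetch in fetchers:
        try:
            result = fetch()
            if result is not None:
                return result
        except Exception:
            continue  # this source failed; try the next one
    return offline_msg  # graceful degradation when every source fails
```

The same pattern covers the "offline-only response" case: with no reachable sources, the assistant simply reports that real-time data is unavailable.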

The invention also includes a tool-execution subsystem that allows users to
control local device functions through voice commands. This subsystem uses
system-level operations, automation libraries, and browser-control mechanisms
to perform actions such as opening applications, adjusting volume, taking
screenshots, managing files, launching websites, and playing YouTube audio. It
also supports email and WhatsApp automation by simulating user actions locally
through secure methods.

A retrieval-augmentation module indexes user documents, notes, and stored files
using local embeddings. This allows the assistant to answer queries based on
personal documents without sending data to the internet. A smart-memory
subsystem stores user preferences, frequently used commands, and interaction
patterns locally in a secure file for personalized behavior.

The system also has a reminder-and-alarm scheduler that manages timed
notifications and executes them automatically using background threads. An
offline translation pipeline combines speech-to-text, local machine
translation, and text-to-speech to enable multilingual interactions without
needing external services.

Output is generated using a dual-mode text-to-speech engine. The system prefers
offline synthesis using an on-device TTS model to ensure privacy and
independence from internet connectivity. If offline TTS is unavailable, an
online fallback module activates automatically to maintain operability.
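A minimal sketch of that dual-mode behaviour, assuming each engine is wrapped as a callable that returns raw audio bytes. The wrappers are hypothetical; Piper and Edge TTS each have their own real APIs, and this only shows the offline-first fallback order.

```python
def speak(text, offline_tts=None, online_tts=None):
    """Synthesize speech, preferring the offline engine.

    `offline_tts` stands in for an on-device model (e.g. a Piper voice) and
    `online_tts` for a network service (e.g. Edge TTS). The online engine is
    only consulted when the offline one is missing or raises an error.
    """
    if offline_tts is not None:
        try:
            return offline_tts(text)  # privacy-preserving default path
        except Exception:
            pass  # voice model missing or synthesis failed; fall through
    if online_tts is not None:
        return online_tts(text)  # automatic online fallback
    raise RuntimeError("no TTS engine available")
```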

Together, these modules create a unified, hybrid, autonomous assistant that can
perform intelligent reasoning, live augmentation, personal knowledge retrieval,
multimodal task execution, and continuous offline operation.

1. Overview

The invention is a hybrid voice interaction system that works offline by default
but can use online resources for time-sensitive information like weather, time
zones, and news.
The system integrates: (i) wake-word activation; (ii) speech-to-text (STT); (iii) local
large-language-model (LLM) reasoning; (iv) dynamic tool routing to offline/online
functions; and (v) adaptive text-to-speech (TTS) with offline priority and
automatic online fallback. The architecture emphasizes privacy, low latency, and
graceful degradation when connectivity is unavailable.
2. Core Pipeline

1. Wake-Word Activation

An always-on, lightweight detector, such as Porcupine or openWakeWord,
continuously monitors the microphone stream using a low-power, quantized
acoustic model. Detecting a phrase such as "Hey Luna" moves the system from the
idle state to a listening state without needing a physical button press.

2. Audio Capture & Pre-Processing

A streaming audio buffer (for example, 16 kHz mono) is filled via an
input-device abstraction that supports default and user-selected microphones.
Optional steps include automatic gain control, clipping detection, and voice
activity detection (VAD) to remove silence.
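As a rough illustration of the optional silence removal, an energy-based trim over int16 sample frames might look like the following. A real deployment would more likely use a dedicated VAD library such as webrtcvad; the frame size and threshold here are arbitrary assumptions.

```python
import math

def trim_silence(samples, frame=320, threshold=500.0):
    """Drop leading and trailing frames whose RMS energy is below a threshold.

    `samples` is a list of int16 PCM values (16 kHz mono assumed, so a
    320-sample frame is 20 ms). Interior quiet frames are kept so words are
    not chopped mid-utterance.
    """
    frames = [samples[i:i + frame] for i in range(0, len(samples), frame)]

    def rms(f):
        return math.sqrt(sum(s * s for s in f) / len(f)) if f else 0.0

    voiced = [i for i, f in enumerate(frames) if rms(f) >= threshold]
    if not voiced:
        return samples[:0]  # nothing but silence
    start, end = voiced[0] * frame, (voiced[-1] + 1) * frame
    return samples[start:end]
```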

3. Offline STT

The captured segment is transcribed by a local STT engine, such as Whisper
small/[Link] via faster-whisper. The system can auto-detect the language or use
a fixed one. Partial hypotheses may be surfaced to reduce the perceived delay.

4. Intent Understanding & Mode Controller

The transcribed text is analyzed by a deterministic intent parser (using regex
or grammar rules) and, if needed, a local LLM for clarification. A Mode
Controller handles state transitions between Conversation, Task, Search, and
Offline modes, supporting voice commands such as "switch to task mode." Mode
selection limits which tools are available and sets safety and permission
boundaries.

5. Dynamic Tool Routing (Offline-First)

A Router evaluates the user request against a tool registry:

- Offline tools (preferred): local RAG over user documents, calculator, code helper,
file search, app launcher, media control, screenshots, reminders/alarms, offline
translation (STT → Argos Translate → TTS), and local TTS.

- Online tools (selective): weather and geocoding, world time, and news feeds.

The Router uses rules (keywords, entities, confidence) and optional LLM
classification to decide which tool to use.
If an online source fails or is not allowed, the system returns an offline-only
response or a user-friendly alternative.
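The offline-first preference can be sketched with a small tool registry. The registry entries, keyword lists, and the `network_ok` flag are illustrative assumptions; the real Router would also use entities, confidence scores, and optional LLM classification as described above.

```python
# Hypothetical tool registry: each entry names the keywords it handles and
# whether it needs the network. Offline tools are always tried first.
TOOLS = [
    {"name": "local_rag",  "keywords": ("my notes", "document"), "online": False},
    {"name": "calculator", "keywords": ("calculate", "plus"),    "online": False},
    {"name": "weather",    "keywords": ("weather", "forecast"),  "online": True},
]

def route(query, network_ok=True):
    """Pick the first matching tool, preferring offline ones.

    Online tools are skipped when the network is down or disallowed;
    returning None means the request falls through to the local LLM
    for an offline-only response.
    """
    q = query.lower()
    for tool in sorted(TOOLS, key=lambda t: t["online"]):  # offline first
        if tool["online"] and not network_ok:
            continue  # policy or connectivity forbids this tool
        if any(k in q for k in tool["keywords"]):
            return tool["name"]
    return None
```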

6. Reasoning & Response Synthesis

Tool outputs are normalized into a structured Context Package. A local LLM
(running via Ollama or equivalent) composes the final response, conditioned on
system prompts that enforce conciseness, specific citation styles (for news),
or execution summaries (for actions such as "open YouTube").

The system uses adaptive text-to-speech technology. It starts with offline TTS like
Piper, which uses a stored voice model. If the offline TTS is not available, it
automatically switches to online TTS options like Edge TTS. The audio is then
played through the default device, and the system handles interruptions,
allowing users to interrupt the speech if needed.
The system has various subsystems that help with different tasks.
One of these is the Offline Knowledge Assistant, also called Local RAG. It
processes text files such as PDFs, notes, and code by breaking them into
chunks. It then creates embeddings using local tools such as
sentence-transformers or nomic-embed-text and stores them in a database such as
Chroma or FAISS. When a user asks a question, the system creates an embedding
for the question, searches for the most relevant chunks, and uses a local large
language model to form a response with sources cited. This entire process
happens offline, without using the internet.
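The chunk-embed-retrieve loop can be sketched as follows. A bag-of-words cosine similarity stands in here for the real embedding models (sentence-transformers, nomic-embed-text) and vector stores (Chroma, FAISS) named above; the chunk size is an arbitrary assumption.

```python
import math
import re
from collections import Counter

def chunk(text, size=40):
    """Split a document into overlapping word chunks (size in words)."""
    words = text.split()
    step = max(size // 2, 1)  # 50% overlap so answers spanning a boundary survive
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)] or [""]

def _vec(text):
    """Toy 'embedding': a bag-of-words term-count vector."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    """Rank chunks by similarity to the query; the top-k would be handed
    to the local LLM as cited context."""
    qv = _vec(query)
    ranked = sorted(chunks, key=lambda c: cosine(qv, _vec(c)), reverse=True)
    return ranked[:k]
```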

Another subsystem is Task Automation, which uses operating system APIs and
tools to manage tasks.
It can open applications, control volume, type text, and navigate the user
interface. To ensure safety, sensitive tasks require verbal confirmation, such as
asking if the user wants to close unsaved documents.

Reminders and alarms are handled with a scheduling system that understands
natural language.
Instructions like "in 10 minutes" or "tomorrow at 6 AM" are converted into
scheduled tasks. The system can deliver these as audio announcements with
optional repeat, confirmation, or snooze features.
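The two phrasings quoted above can be parsed with simple patterns, as in this sketch; a real scheduler would cover far more phrasings (days of the week, dates, recurring rules), and the function names are hypothetical.

```python
import re
from datetime import datetime, timedelta

def parse_reminder(phrase, now=None):
    """Turn phrases like 'in 10 minutes' or 'tomorrow at 6 am' into a
    datetime; returns None for anything it cannot parse."""
    now = now or datetime.now()

    # Relative offsets: "in 10 minutes", "in 2 hours"
    m = re.match(r"in (\d+) (second|minute|hour)s?", phrase, re.I)
    if m:
        n, unit = int(m.group(1)), m.group(2).lower()
        return now + timedelta(**{unit + "s": n})

    # Absolute next-day times: "tomorrow at 6 am", "tomorrow at 6:30 pm"
    m = re.match(r"tomorrow at (\d{1,2})(?::(\d{2}))?\s*(am|pm)?", phrase, re.I)
    if m:
        hour = int(m.group(1)) % 12
        if (m.group(3) or "").lower() == "pm":
            hour += 12
        minute = int(m.group(2) or 0)
        target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
        return target + timedelta(days=1)

    return None
```

The resulting datetime would then be handed to the background scheduler for announcement via TTS.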

For media, the system can play audio content from YouTube or local files.
When the user says something like "play lo-fi beats," the system searches for the
audio, streams it using tools like yt-dlp, and plays it through a local player with
basic controls.

Communication helpers support sending messages via email or WhatsApp. The
system converts voice messages into text, formats them, and sends them. Before
sending, it reads back the recipient and the message for confirmation.

Translation and utility functions work offline. The system uses speech-to-text,
followed by an offline translator, and then text-to-speech. For calculations
and coding, it performs tasks locally and can provide code written by a local
large language model.

Selective online augmentation allows the system to fetch weather, time, and news
information when needed.
It uses services like Open-Meteo and WorldTimeAPI, and includes caching and
retries for reliability. If the system is offline, it informs the user and
proceeds without real-time data.

The system uses a state machine to manage the interaction flow. It transitions
through states such as Idle, Listening, Transcribing, Routing, Acting,
Speaking, and back to Idle. The system can be interrupted by a wake word during
speaking, and a user can cancel the current task or switch modes. The system
also maintains a local memory store with user preferences, recent locations,
and other settings, all kept on the device.
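The state machine just described can be written down as a transition table. The event names are illustrative assumptions; the states and the wake-word interruption during speech come directly from the description above.

```python
# Allowed transitions in the interaction loop. The wake word can interrupt
# speech, and the user can cancel most stages back to Idle.
TRANSITIONS = {
    "Idle":         {"wake": "Listening"},
    "Listening":    {"audio_done": "Transcribing", "cancel": "Idle"},
    "Transcribing": {"text_ready": "Routing", "cancel": "Idle"},
    "Routing":      {"tool_chosen": "Acting", "cancel": "Idle"},
    "Acting":       {"result_ready": "Speaking", "cancel": "Idle"},
    "Speaking":     {"done": "Idle", "wake": "Listening"},
}

def step(state, event):
    """Advance the assistant's state; unknown events leave it unchanged."""
    return TRANSITIONS.get(state, {}).get(event, state)
```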

Privacy, security, and reliability are important considerations. All data,
including audio, transcripts, and memory, remains on the device unless a
specific online function is used. The system allows users to consent to tasks,
redact personal information, and disable logging if needed. Each online call
has timeouts, failovers, and cached responses, and the system checks the health
of each subsystem to handle failures gracefully.

The system is designed to be extensible. Plugins can register their
capabilities, and the Router dynamically discovers and ranks these tools. The
system supports swapping models for speech-to-text, language modelling, and
text-to-speech without changing how tools work. It also runs across multiple
platforms, including Windows, macOS, and Linux.

The main modules include the Wake-Word Engine, which detects wake phrases such
as "Hey Luna" or "Hey Assistant."
The system captures audio and uses Voice Activity Detection, then transcribes the
audio using models like Whisper. The system manages interaction modes like
conversation, task, search, and offline. Users can switch modes with voice
commands, and the current mode influences which tools are used.

The Intent layer helps understand user commands using keywords and a local
large language model as a backup.
The Tool Router considers the user's intent, mode, privacy settings, and data
freshness when choosing which tool to use. Examples include checking the
weather online or using local resources for document summaries or code
generation.

The system uses local reasoning with a large language model and supports
document-based answers through RAG.
It processes text into chunks, creates embeddings, and searches for relevant
information to answer questions. All data stays on the device.

Device automation allows the system to perform actions like opening apps,
adjusting volume, taking screenshots, and entering text.
These actions are run in a secure environment with permissions, confirmations
for sensitive tasks, and optional audio feedback.

The system selectively uses online services for time, weather, and news.
If the network fails or the system is offline, it provides alternative responses to
ensure a smooth user experience.

9. TTS with Automatic Fallback


Piper is the main text-to-speech system used for offline voice synthesis.
If a Piper voice isn't available or if the synthesis doesn't work, the system
automatically switches to Edge TTS. The audio output is adjusted to a consistent
volume level and played through the same audio system used for monitoring.

10. Smart Memory & Personalization
A local storage system, like a JSON file or simple database, keeps track of user
profiles, preferences, often-visited locations, recent tasks, and conversation
highlights.
This data is managed under the user's control, allowing them to view, export, or
clear it. This memory helps improve personalization, such as selecting a
preferred voice or setting default modes, without sending any data outside the
system.
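A JSON-file variant of this local store might look like the sketch below. The class and key names are hypothetical; the point is that reads, writes, and the user-controlled wipe all touch only a single on-device file.

```python
import json
import os

class LocalMemory:
    """On-device key-value store for preferences and recent activity.

    Everything lives in one local JSON file; nothing leaves the device.
    """

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                self.data = json.load(f)

    def set(self, key, value):
        self.data[key] = value
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.data, f)  # persist immediately

    def get(self, key, default=None):
        return self.data.get(key, default)

    def clear(self):
        """User-controlled wipe of all stored personalization."""
        self.data = {}
        if os.path.exists(self.path):
            os.remove(self.path)
```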

11. Scheduler, Alarms, and Media
A background scheduler handles alarms and reminders, like "Remind me in 10
minutes," and can also speak out alerts using TTS.
The media feature allows playing audio from YouTube (using tools like yt-dlp) in
response to commands like "Play lo-fi beats."

12. Security, Privacy, and Extensibility
All local features work without an internet connection by default, and any
network access is done with user permission and controlled by policies.
Sensitive information in logs is redacted. The system is built in a modular way,
meaning components like wake-word engines, speech-to-text, language models,
and data sources can be replaced without changing the whole design. Some
versions can run completely offline or function as part of a local network with
shared knowledge bases.
**ADVANTAGES OF THE INVENTION**


1. **Offline-First Architecture With Selective Online Augmentation**
The invention works fully without needing a constant internet connection,
which keeps your data private and the system reliable.
Online resources are only used when needed, giving the system the best of
both offline and online capabilities.

2. **Wake-Word Activated Hands-Free Interaction**
The system lets you use it hands-free by detecting a wake word offline.
This makes it easier to use in situations where you can't or don’t want to
manually activate it.

3. **Unified Intent Routing and Mode Switching**
The intelligent controller allows smooth switching between conversations,
task execution, search, and strict offline mode, making the assistant flexible
for different situations, which most other assistants don’t support.

4. **Local LLM Reasoning for Privacy-Sensitive Tasks**
Tasks like chatting, coding, translating, reasoning, and math are done
locally, ensuring your personal data doesn’t leave your device.
This provides much better privacy than cloud-based assistants.

5. **Device Automation Capabilities**
The invention can control your device locally like launching apps, changing
volume, taking screenshots, managing files, or typing text.
Most assistants can’t control your device in this detailed and flexible way.

6. **Local Retrieval-Augmented Generation (RAG)**
By indexing personal notes and files locally, the system can answer custom
questions without needing an external server.
This creates a personalized assistant that respects your data.

7. **Robust Dual TTS Mechanism**
The dual TTS system prefers offline voice synthesis but switches to online if
needed.
This ensures it works well even when your internet is slow or unreliable.

8. **Autonomous Task Scheduler**
The system can set and manage reminders and alarms that run in the
background without needing a remote server or cloud account.

9. **Extensible Architecture**
You can add new tools, data sources, and models through a modular
interface without changing the core system.
This makes it easier to update and use in the future.

10. **Enhanced User Personalization**
The smart memory system keeps track of your preferences, frequent
requests, and usage patterns locally.
This makes the assistant more personal and adaptive, without compromising
your privacy.

11. **Complete Multimodal Capability**
The assistant combines voice input, text reasoning, system actions, file
retrieval, messaging automation, and translation into one system.
This gives it more functionality than most existing assistants.

**INDUSTRIAL APPLICABILITY**
The present invention has a wide range of applications across various
industries because of its hybrid offline-online design, ability to handle tasks
on its own, and focus on data privacy.
Since it doesn’t rely on continuous cloud services, it’s ideal for environments
with limited, controlled, or restricted internet access and where sensitive
data must stay on the device. The invention can be used in consumer
electronics, enterprise environments, industrial automation, education, and
specialized fields.
In consumer electronics, it can be used as an embedded assistant in laptops,
smartphones, tablets, and wearables.
Its ability to handle speech recognition, translation, reasoning, and personal
document access offline means it can be a secure, cloud-free personal
assistant.

In the enterprise and corporate world, it can help automate workflows, manage
documents, assist with scheduling, and perform local tasks on workstations
without sending confidential data to servers. This is especially useful in
industries such as finance, healthcare, legal, and government, where data
privacy and compliance are crucial.

In manufacturing and industrial settings, the hands-free wake-word activation
and local automation allow operators to control machines, find manuals, access
instructions, and log reports using voice commands, even without an internet
connection. This is especially helpful in factories and remote areas with
limited connectivity.

In education, the invention can act as an offline tutor, answering questions,
explaining concepts, accessing local syllabus content via RAG, and helping with
translations or coding, without exposing student data.

In transportation, defense, and field work, where internet access may be
unreliable or intentionally disabled, the invention offers a reliable voice
interface for device control, navigation, alerts, and other tasks.

In accessibility and assistive technology, the assistant allows hands-free
computing for users with disabilities, providing private, personalized, and
always-on support.

Overall, the invention's modular hybrid design, secure local processing, and
ability to run independently make it useful across industries that need secure,
intelligent, offline voice-based systems.

**CONCLUSION**
The present invention offers a full hybrid intelligent assistant that combines
offline speech recognition, wake-word activation, local LLM reasoning,
personalized retrieval, autonomous task execution, and selective online
assistance into one privacy-focused system.
By mixing local processing with access to online resources when needed, the
invention achieves a good balance between reliability, real-time
performance, and user privacy, which is not found in existing assistants.

Its multi-mode interaction, smart intent routing, secure automation, local
knowledge retrieval, and flexible TTS system make it a powerful and adaptable
platform for both conversation and task execution. Features such as autonomous
reminders, message automation, offline translation, reasoning, and code
generation make it applicable in professional, educational, and industrial
environments.

Unlike traditional cloud-dependent assistants, this invention works well even
when there is no internet connection, making it suitable for areas with limited
or restricted connectivity or strict data policies. Its flexible design allows
future improvement with new tools, models, or device integrations without
changing the core system.

Overall, the invention brings a new and practical approach to intelligent voice
interaction by offering a secure, standalone assistant with a wide range of
features that work in a variety of real-world settings. It represents a major
step forward in offline AI design and sets the stage for next-generation
voice-based technologies.
