TEXT SEARCH ALGORITHMS
AND MULTIMEDIA INFORMATION RETRIEVAL
UNIT-V
Text Search Algorithms: Introduction to Text Search Techniques, Software Text Search
Algorithms, Hardware Text Search Systems
Multimedia Information Retrieval: Spoken Language Audio Retrieval, Non-Speech Audio
Retrieval, Graph Retrieval, Imagery Retrieval, Video Retrieval
Text Search Techniques
Three classical methods are used for organizing and searching textual databases:
Full Text Scanning (Streaming): Directly scanning the entire text to find matches for a query.
Word Inversion (Indexing): Creating an index of words to quickly locate relevant items, reducing
the data to be searched.
Multiattribute Retrieval: Searching based on multiple attributes of the data (e.g., metadata and
content).
The diagram shows the architecture of a text streaming search system, which includes four main components:
Database: Stores the full text of the items to be searched.
Term Detector: A hardware/software module that identifies search terms in the text stream. It checks for the presence of query terms and may also handle basic logic (e.g., AND/OR relationships between terms).
Query Resolver: Takes user queries, extracts search terms and logic, and sends the terms to the Term Detector. It receives results from the Term Detector, evaluates the query logic, and determines which items satisfy the query (possibly assigning weights to matches).
User Interface: Displays search results and status to the user, allowing retrieval of matching items.
How it works:
1. The database streams text to the Term Detector.
2. The Term Detector identifies query terms in the stream and sends detected terms to the
Query Resolver.
3. The Query Resolver processes the results, applies query logic, and passes the final results to
the User Interface for display.
4. This system allows results to be shown as soon as a match is found, unlike indexing, which
typically processes the entire query first.
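The flow above can be sketched in a few lines of Python. This is only an illustration of the architecture; the function names and the simple AND logic are assumptions, not taken from any real system.

```python
def term_detector(text_stream, terms):
    """Scan the stream once, yielding (position, term) for each detected term."""
    for i in range(len(text_stream)):
        for term in terms:
            if text_stream.startswith(term, i):
                yield i, term

def query_resolver(hits, required_terms):
    """Apply simple AND logic: the item matches if every query term was detected."""
    found = {term for _, term in hits}
    return required_terms <= found

stream = "the term detector scans the text stream"
terms = {"term", "stream"}
print(query_resolver(term_detector(stream, terms), terms))  # True: both terms occur
```

Because the detector yields hits as it scans, results can be reported as soon as they are found, matching the streaming behavior described above.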
K. JAYASRI | UNIT – V | CSM 1 Information Retrieval System (IRS)
Finite State Automata (FSA) in Text Search
Text search systems often use finite state automata (FSA) to efficiently detect patterns in a text
stream.
An FSA is a logical machine defined by five elements: a set of states, an input alphabet, a transition function, a start state, and a set of final (accepting) states.
Automata Definition: An FSA can be defined to detect the string "CPU" in a text stream. From the start state, the characters C, P, and U each advance the machine one state; any other character returns it toward the start state.
How it Works: The machine reads the stream one character at a time, following the transition function. Whenever it enters the final state, the term "CPU" has been detected.
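As a sketch, the "CPU" detector can be written as a small state machine in Python. The state numbering (0 for the start state, 3 for the final state) is illustrative:

```python
def detect_cpu(stream):
    """Yield the start index of every occurrence of 'CPU' in the stream."""
    state = 0                                # start state q0
    for pos, ch in enumerate(stream):
        if ch == "CPU"[state]:
            state += 1                       # advance on C, then P, then U
        else:
            state = 1 if ch == "C" else 0    # any other character resets the machine
        if state == 3:                       # final state reached: "CPU" detected
            yield pos - 2                    # report where the match began
            state = 0

print(list(detect_cpu("XCPUCPU")))  # [1, 4]
```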
Software Text Search Algorithms
1. Brute Force
2. Knuth-Morris-Pratt (KMP)
3. Boyer-Moore
4. Aho-Corasick
5. Shift-Add
1. Brute Force
The Brute Force method is the simplest string-matching algorithm.
It compares the search pattern with the input text character by character, starting from the
first position.
If a mismatch occurs, the pattern shifts one position to the right, and the comparison
restarts.
In the worst case this requires on the order of n x m character comparisons. For random text of length n, a pattern of length m, and an alphabet of size c, the expected number of comparisons is approximately
(n - m + 1) x (1 - c^-m) / (1 - c^-1),
which approaches n x c/(c - 1) for long texts: just over one comparison per text character for realistic alphabet sizes.
Example
Consider the input stream and search pattern from the KMP example:
Input Stream: a, b, d, a, d, e, f, g
Search Pattern: a, b, d, f
Start at position 1: a matches a, b matches b, d matches d, but pattern f mismatches text a. Shift one position.
Position 2: pattern a mismatches text b. Shift again.
This continues, checking each position in turn, leading to many repeated comparisons.
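A minimal Python sketch of the Brute Force method:

```python
def brute_force_search(text, pattern):
    """Return the first index where pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):          # each mismatch shifts the pattern one position right
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1                      # compare character by character
        if j == m:                      # full match starting at position i
            return i
    return -1

print(brute_force_search("abdadefg", "abdf"))  # -1: the pattern never occurs
print(brute_force_search("abdadefg", "adef"))  # 3
```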
2. Knuth-Morris-Pratt (KMP)
The KMP algorithm improves on Brute Force by avoiding unnecessary comparisons.
When a mismatch occurs, KMP uses information from previously matched characters
to determine how far to shift the pattern, rather than shifting one position at a time.
This is achieved by preprocessing the pattern to create a Shift Table (also called a
failure function or prefix function), which indicates how many positions to skip based
on the mismatch.
Example (shift-table construction and the matching steps are shown in the accompanying figure).
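A compact Python sketch of KMP, including construction of the shift table (failure function):

```python
def kmp_search(text, pattern):
    """Knuth-Morris-Pratt: preprocess the pattern into a failure table,
    then scan the text without ever moving backwards in it."""
    # fail[j] = length of the longest proper prefix of pattern[:j+1]
    # that is also a suffix of it; this tells us how far to shift.
    fail = [0] * len(pattern)
    k = 0
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    # Scan: on a mismatch, consult the table instead of shifting by one.
    k = 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1            # position of the first match
    return -1
```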
3. Boyer-Moore
Boyer-Moore enhances efficiency by comparing the pattern from right to left (unlike
Brute Force and KMP, which go left to right).
When a mismatch occurs, it uses two rules to determine the shift:
o Bad Character Rule: Shift the pattern to align the mismatched character in the
input stream with its next occurrence in the pattern (or skip the pattern entirely
if the character doesn’t exist in the pattern).
o Good Suffix Rule: If a suffix of the pattern matches, shift to align the next
occurrence of that suffix in the pattern.
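A sketch of the right-to-left scan using only the bad character rule (this is the Horspool simplification of Boyer-Moore; the good suffix rule is omitted for brevity):

```python
def bm_horspool(text, pattern):
    """Simplified Boyer-Moore: compare the pattern right to left; on a mismatch,
    shift by the distance from the last occurrence of the character currently
    under the pattern's final position to the end of the pattern."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return -1
    # shift[c]: how far to slide the pattern when character c is under its
    # last position; characters absent from the pattern skip the whole pattern.
    shift = {ch: m - i - 1 for i, ch in enumerate(pattern[:-1])}
    i = 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and text[i + j] == pattern[j]:
            j -= 1                       # right-to-left comparison
        if j < 0:
            return i                     # full match at position i
        i += shift.get(text[i + m - 1], m)
    return -1

print(bm_horspool("trusthardtoothbrushes", "tooth"))  # 9
```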
4. Aho-Corasick
The Aho-Corasick algorithm is designed for multiple pattern matching.
It uses a finite state machine (FSM) to process the input text and match multiple
patterns in one pass.
The FSM is defined by three functions:
o GOTO Function: A state transition graph (Figure 9.6a) that defines the next
state based on the current state and input character.
o Failure Function: When a mismatch occurs, this function (Figure 9.6b)
specifies the state to fall back to.
o Output Function: Specifies which patterns are matched at each state (Figure
9.6c).
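The three functions can be sketched in Python as follows. This is a compact, illustrative implementation (dictionaries for the GOTO trie, a BFS for the failure links), not production code:

```python
from collections import deque

def aho_corasick(text, patterns):
    """Match multiple patterns in one pass using GOTO, failure, and output functions."""
    goto, out, fail = [{}], [set()], [0]
    for pat in patterns:                     # GOTO function: a trie of the patterns
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                out.append(set())
                fail.append(0)
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pat)                  # output function: pattern ends here
    queue = deque(goto[0].values())
    while queue:                             # failure function, built breadth-first
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]                  # fall back until ch can extend a state
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]           # inherit outputs of the fallback state
    matches, state = [], 0
    for i, ch in enumerate(text):            # single pass over the text
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            matches.append((i - len(pat) + 1, pat))
    return matches

print(aho_corasick("ushers", ["he", "she", "his", "hers"]))
```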
5. Shift-Add
The Shift-Add algorithm, developed by Baeza-Yates and Gonnet, handles fuzzy matching
(allowing mismatches, "don’t care" symbols, and complements).
It uses a vector s(j) that tracks the number of mismatches between each pattern prefix and the text ending at position j. For each new text character, the vector is shifted one position and a precomputed per-character table adds 1 to every counter whose pattern position does not match that character.
Let the search pattern be ababc (m = 5), and let a segment of input text be cbbabababcaba.
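A sketch of the mismatch-counting update, using the pattern and text above. For clarity the counters are kept in a plain Python list rather than the packed bit-fields of the original algorithm:

```python
def shift_add(text, pattern, max_mismatches=0):
    """Shift-Add sketch (after Baeza-Yates & Gonnet): s[j] counts the mismatches
    between pattern[:j+1] and the text ending at the current position; a match
    is reported whenever s[m-1] <= max_mismatches."""
    m = len(pattern)
    s = [m] * m                         # before any text: every prefix "mismatches"
    hits = []
    for i, ch in enumerate(text):
        # shift: a prefix of length j+1 extends the previous prefix of length j
        s = [0] + s[:-1]
        # add: one more mismatch wherever ch differs from the pattern character
        for j in range(m):
            s[j] += ch != pattern[j]
        if s[m - 1] <= max_mismatches:
            hits.append(i - m + 1)      # start position of the (fuzzy) match
    return hits

print(shift_add("cbbabababcaba", "ababc"))     # [5]  (exact match)
print(shift_add("cbbabababcaba", "ababc", 1))  # [3, 5]  (allowing one mismatch)
```

With max_mismatches = 1 the window starting at position 3 (ababa) also qualifies, illustrating the fuzzy matching the algorithm was designed for.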
Hardware Text Search Systems
Hardware text search systems were developed to offload the computationally intensive task
of searching large text databases from general-purpose computers to specialized hardware.
This approach emerged in the 1970s due to limitations in software-based search systems,
such as:
o Scalability Issues: Software systems struggled to handle many search terms
simultaneously against the same text.
o I/O Bottlenecks: The speed of transferring data from secondary storage (like disk
drives) to the processor limited search performance.
o Indexing Overhead: Traditional software search systems rely on indexes, which can be 70% of the size of the actual data, requiring significant storage and time to build or update.
Hardware Text Search Systems – Addressing the Issues
Hardware text search systems address these issues by:
o Using dedicated hardware to perform searches, freeing up the main computer for
user interface and result retrieval.
o Eliminating the need for an index, allowing new data to be searched immediately
upon receipt.
o Providing deterministic search times, as the speed depends only on the time to
stream data from storage, not on the complexity of the query or database size.
The maximum search time in such systems is the time to search one disk, achieved by
assigning one hardware search unit per disk. This makes the system highly scalable—adding
more hardware units allows larger databases to be searched without increasing search
time.
Hardware Text Search Systems - Architecture
Database: Stores the text data on secondary storage (e.g., disk drives).
Term Detector: A hardware component that identifies matches between the query terms and the text stream.
Query Resolver: Processes the matches (e.g., applying Boolean logic) and determines if the text satisfies the query.
User Interface: The main computer that interacts with the user, receiving queries and displaying results.
1. The user submits a query via the user interface.
2. The query is sent to the hardware text search unit.
3. The term detector scans the text stream from the database, identifying matches.
4. The query resolver evaluates the matches against the query conditions (e.g., Boolean logic,
proximity).
5. Results (hits) are sent back to the user interface for display.
Hardware Text Search Systems - Term Detector Implementations
The term detector is the core of the hardware text search unit, responsible for matching query
terms against the text stream.
Three main approaches to implementing term detectors are mentioned:
o Parallel Comparators or Associative Memory
o Cellular Structure
o Universal Finite State Automata (FSA)
Term Detector - Parallel Comparators or Associative Memory
How It Works:
o Each query term is assigned to a separate comparison element (comparator).
o The text stream is serially fed into the detector, and each comparator checks for its
assigned term in parallel.
o When a match is found, the comparator sets a status flag, which is sent to the query
resolver (typically on the main computer).
Example:
o In the GESCAN system, some Boolean logic (e.g., AND, OR between terms) is resolved
directly in the term detector hardware, reducing the load on the main computer.
Term Detector - Cellular Structure
This approach uses a series of interconnected cells, each responsible for matching a single
character or small part of a query term.
Cells are typically implemented on LSI (Large-Scale Integration) chips, which can be chained
together to handle longer query terms.
Example:
o In the GESCAN Text Array Processor (TAP), a string of character cells on an LSI chip
matches query terms, while a separate resolution chip handles Boolean logic and
proximity requirements.
Term Detector - Universal Finite State Automata (FSA)
The system uses a finite state machine (FSM) to represent the query terms as states and
transitions.
The text stream is processed by the FSM, which transitions between states as characters are
matched.
Example: The High Speed Text Search (HSTS) machine by Operating Systems Inc. (OSI) uses
an algorithm similar to the Aho-Corasick software FSM but runs three parallel state
machines:
i. One for contiguous word phrases.
ii. One for embedded term matches.
iii. One for exact word matches.
Advantages:
o Efficient for complex pattern matching, including phrases and partial matches.
o Can handle multiple queries simultaneously by running parallel FSMs.
Historical Development of Hardware Text Search Systems
Rapid Search Machine (General Electric, 1970s)
Associative File Processor (AFP) by Operating Systems Inc. (OSI)
High Speed Text Search (HSTS) by OSI
GESCAN (General Electric)
Fast Data Finder (FDF) by TRW (Later Paracel)
Other Systems
Advantages of Hardware Text Search Systems
No Indexing Required:
o Eliminates the need for an index (which can be 70% of the size of the data), saving storage and allowing immediate searching of new data.
Deterministic Search Times:
o Search speed depends only on the time to stream data from the disk, making
performance predictable.
Real-Time Results:
o Hits are delivered to the user as they are found, rather than waiting for the entire search
to complete.
Scalability:
o Adding more hardware units (one per disk) allows the system to handle larger databases
without increasing search time.
Modern Applications
Genetic Analysis
o The Fast Data Finder (FDF) has found a new application in genetic analysis, as marketed
by Paracel:
Sequence Homology:
o The FDF is used to compare genetic sequences (e.g., DNA or proteins) to known families,
helping identify the functions of newly sequenced genes.
Fuzzy Matching:
o The FDF’s fuzzy matching capability is particularly useful for genetic sequences, where
approximate matches are often more relevant than exact matches.
Algorithms:
o Smith-Waterman Algorithm:
Used for finding local sequence similarities, optimal for genetic analysis.
o General Profile Algorithm:
Searches for conserved regions in nucleic acids or proteins across evolutionary
changes.
o Biology Tool Kit (BTK):
Paracel combines the FDF with BTK software to perform advanced genetic
analysis.
The FDF identifies sequences with high similarity to the query, and the BTK
software completes the analysis (e.g., scoring, alignment).
Multimedia Information Retrieval
Spoken Language Audio Retrieval
The interface shown has three columns:
Left Column: Lists speakers (e.g., Elizabeth Vargas, William Cohen).
Center Column: Shows transcribed speech with highlighted named entities (people, organizations, locations).
Right Column: Indicates topics (e.g., "Foreign relations with the United Nations").
Non-Speech Audio Retrieval
Non-speech audio retrieval covers sounds other than speech (e.g., noises and effects), which is crucial for fields like music and video production.
It focuses on the SoundFisher system (developed by Thom Blum et al., 1997, at [Link]), a user-extensible tool for sound classification and retrieval.
SoundFisher System Overview
o SoundFisher uses techniques from signal processing, psychoacoustics, speech
recognition, computer music, and multimedia databases to index and retrieve sounds.
Instead of words (as in text retrieval), it indexes sounds using a vector of acoustic
features:
o Directly measurable: duration, loudness, pitch, brightness.
o These features allow users to search for sounds within specific ranges.
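As an illustration (not SoundFisher's actual code), a few of these directly measurable features can be computed from raw samples. The formulas below (RMS for loudness, spectral centroid for brightness) are common signal-processing choices, assumed here rather than taken from the source:

```python
import numpy as np

def feature_vector(samples, sample_rate):
    """Compute a small acoustic feature vector for one sound clip."""
    duration = len(samples) / sample_rate                  # seconds
    loudness = float(np.sqrt(np.mean(samples ** 2)))       # RMS amplitude
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    brightness = float(np.sum(freqs * spectrum) / np.sum(spectrum))  # spectral centroid, Hz
    return {"duration": duration, "loudness": loudness, "brightness": brightness}

# A pure 440 Hz tone: its brightness (centroid) should sit near 440 Hz.
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
fv = feature_vector(tone, sr)
```

Storing such vectors for every sound allows range queries of the kind described above (e.g., all sounds with duration under 2 seconds and brightness above a threshold).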
Analysis of Male Laughter
The figure gives a graphical analysis of a male laughter sound across multiple dimensions:
o Amplitude: Loudness over time.
o Brightness: The perceived "sharpness" or "clarity" of the sound.
o Bandwidth: The range of frequencies present.
o Pitch: The perceived frequency of the sound.
The graphs (labeled "LAUGHTER_MALE_BRIL") visualize how these features vary, helping to characterize the sound for indexing.
This figure shows SoundFisher’s user interface for browsing and querying a sound database.
Features include:
o A list of sound categories (e.g., animals like "[Link]," "[Link]").
o Options to filter by acoustic properties (e.g., pitch, duration) or perceptual properties
(e.g., "scratchy").
o Support for complex queries like finding AIFF-encoded animal or human vocal sounds
similar to barking, ignoring duration or amplitude.
Non-Speech Audio Retrieval - System Capabilities and Evaluation
SoundFisher was tested on a database of 400 diverse sound files (e.g., nature, animals,
instruments, speech).
It supports advanced features like training the system to recognize indirect perceptual
properties (e.g., "buzziness").
Identified needs for improvement include:
o Better sound displays.
o Sound synthesis for query refinement.
o Sound separation.
o Matching feature trajectories over time.
Graph Retrieval
SageBook is a system for retrieving and adapting data graphics (e.g., bar charts, line graphs,
scatter plots) based on their content and properties.
It supports querying, indexing, and modifying graphics, similar to how text or audio
retrieval systems work but tailored for visual data representations.
Left Side: The SageBrush interface shows a query being built for a chart with horizontal
interval bars.
Right Side: The retrieved graphics are charts with similar properties (one space, horizontal
interval bars), highlighted to show they match the query using a “close graphics matching
strategy.”
Graph Retrieval - How Does It Work?
1. Query Interface (SageBrush):
– The left side of Figure 10.3 shows the SageBrush interface, where users create a query
using a graphical drag-and-drop method.
– Users select and arrange elements like spaces (e.g., a chart area), objects (e.g., bars,
lines, marks), and their properties (e.g., color, size, shape).
– For example, you can specify a chart with horizontal interval bars in a certain color.
2. Search and Matching:
– SageBook searches a library of stored graphics by matching the query’s graphical
elements (called graphemes) and their properties (e.g., bars must match in type, color,
shape, etc.).
– It also matches the underlying data represented by the graphic (e.g., numerical values or
categories).
– The system uses both exact matching (identical properties) and similarity-based
matching (close but not identical).
– In Figure 10.3, the query specifies a chart with one space and horizontal interval bars. The right side of the figure shows retrieved graphics that match these criteria, ranked by similarity.
3. Adaptation:
– Retrieved graphics can be modified. SageBook understands the syntax and semantics of
graphics, including spatial relationships, data domains (e.g., 2D coordinates), and
attributes.
– Users can manually adapt graphics, or SageBook can automatically adjust them (e.g.,
removing unmatched elements).
4. Search Strategies:
– SageBook offers multiple search strategies (three for graphical properties, four for data
properties) with varying levels of strictness.
– It also supports clustering techniques to organize large collections of graphics for easier
browsing.
Graph Retrieval – Applications
Beyond business graphics, SageBook’s approach can be applied to fields like:
– Cartography: Maps with terrain or elevation data.
– Architecture: Blueprints.
– Networking: Diagrams of routers and links.
– Military Planning: Maps with forces and defenses.
Imagery Retrieval
Imagery Retrieval covers content-based image retrieval (CBIR) systems such as IBM's Query By Image Content (QBIC) system.
Context and Importance
– With the rise of digital imagery (e.g., web images, personal photo collections), there’s a
growing need for efficient image search and retrieval.
– Traditional methods rely on metadata (like captions or tags), but these are limited
because they require manual annotation, which is time-consuming.
– CBIR systems aim to search images based on their visual content—features like color,
texture, shape, or sketches—without needing manual tags.
QBIC System Overview
The QBIC system, developed by Flickner et al. (1997), is a pioneering example of CBIR. It allows users to query image databases using visual properties instead of keywords. For instance:
Query by Color: A user can search for images dominated by a specific color, like red.
Query by Shape/Texture: Users can draw a shape or select a texture to find similar images.
Combined Queries: Users can refine searches by adding keywords to visual queries (e.g.,
searching for "red stamps" and then narrowing it to "presidents").
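As an illustration of how a query-by-color might be scored (this is not QBIC's actual implementation), images can be reduced to normalized color histograms and compared by histogram intersection:

```python
def color_histogram(pixels, bins=4):
    """Quantize RGB pixels (0-255 per channel) into a normalized histogram."""
    step = 256 // bins
    hist = [0.0] * (bins ** 3)
    for r, g, b in pixels:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1.0
    return [h / len(pixels) for h in hist]

def similarity(h1, h2):
    """Histogram intersection: 1.0 means identical color distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))

red_query = color_histogram([(250, 10, 10)] * 100)   # "mostly red" query
red_stamp = color_histogram([(240, 20, 30)] * 90 + [(255, 255, 255)] * 10)
blue_stamp = color_histogram([(20, 20, 240)] * 100)
# The red stamp scores far higher against the red query than the blue one,
# so it would be ranked first in the result grid.
```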
This shows the QBIC interface where a user queries a database of U.S. stamps (pre-1995) for those that are predominantly red. The interface includes options to specify color percentages and adjust the query.
This displays the results of the red stamp query: a grid of stamps that are primarily red.
This refines the previous query by adding the keyword "president." The results show red stamps featuring U.S. presidents, with one exception: the bottom-right stamp is of Martha Washington, part of the presidential stamp collection.
Advanced Features of QBIC
QBIC supports complex queries, such as finding images with specific combinations of visual
elements (e.g., "a coarsely textured, red round object and a green square"). To make this
possible:
QBIC uses automated and semi-automated tools to extract objects from images (e.g.,
distinguishing foreground from background).
If text captions are available, they can be used to enhance the search (as seen in Figure
10.4c).
Extending CBIR to video retrieval
Shot Detection: Videos are broken into shots, and a representative frame (called an r-frame
or keyframe) is extracted for each shot.
Motion Analysis: QBIC can analyze motion in videos, enabling queries like "find shots
panning left to right." Results are shown as ranked r-frames, which act as thumbnails;
clicking one plays the associated video shot.
Specialized Applications
The text highlights other areas of content-based retrieval:
1. Face Processing:
– Face Detection: Identifying faces in an image.
– Face Recognition: Verifying a face’s identity (e.g., the U.S. Immigration Service uses
FaceIt® at the Otay Mesa border to match drivers’ faces with registered photos for fast-
lane access).
– Face Retrieval: Finding matching faces in a database.
– FaceIt® uses real-time imaging and RF tags to verify identities, improving border security
and efficiency.
2. Human Movement and Expression Recognition:
– Systems can track movements (e.g., of heads, hands) or recognize expressions like
smiles or anger (Pentland, 1997).
– This ties into emotion recognition (Picard, 1997) for better human-computer
interaction.
3. Video Search with Named Faces:
– The Informedia Digital Video Library (Wactlar et al., 2000) extracts data from
video/audio and supports full-content search.
– Its "named face" feature links names to faces, allowing searches like "find videos with
this person."
Performance Metrics
Systems like QBIC and FaceIt® are evaluated using precision (how many retrieved items are
relevant) and recall (how many relevant items were retrieved), metrics borrowed from text
retrieval.
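These two metrics are straightforward to compute over a set of retrieved items and a set of relevant (ground-truth) items:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved items that are relevant.
    Recall: fraction of relevant items that were retrieved."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {"img1", "img2", "img3", "img4"}
relevant = {"img1", "img2", "img5"}
p, r = precision_recall(retrieved, relevant)  # p = 2/4 = 0.5, r = 2/3
```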
Challenges and Future Goals
o Object Identification: Fully automatic, domain-independent object recognition is still
difficult, so manual or semi-automated annotation tools are used.
o Semantic Access: The ultimate goal is to enable semantic-based retrieval (e.g., "find
images of happy people at a beach") rather than just visual feature matching.
Video Retrieval
Broadcast News Navigator (BNN)
BNN is a web-based tool that automates the capture, annotation, segmentation,
summarization, and visualization of broadcast news video. It integrates text, speech, and
image processing to enable content-based search and retrieval, addressing the inefficiencies
of manual video annotation.
Query Interface
Users can search across 30 news sources, select date ranges (e.g., Feb 27 to Mar 12, 2000), and search using keywords, closed captions, speech transcriptions, or named entities (people, locations, organizations).
Detailed Query
BNN generates a custom query page with menus of extracted entities (e.g., "George Bush," "New York") and keywords ("presidential primary") to refine searches.
BNN displays "story skims" with keyframes and the three most frequent named entities for each
story. Fig. 10.6a shows stories about "George Bush," while 10.6b shows "Al Gore" stories.
Story Details
Selecting a story (e.g., a March 5 story about Al Gore) shows detailed closed captions, including delegate counts from the presidential primary.
GeoNODE (Geospatial News on Demand Environment)
GeoNODE builds on BNN by adding geospatial and temporal context to news analysis, focusing on
topic detection and tracking (TDT) for broadcast news and other sources.
Functionality
Timeline View (Fig. 10.7): GeoNODE displays a timeline of stories from sources like CNN and MSNBC, showing the frequency of named entities (e.g., "George W. Bush," "Al Gore") over time. Peaks indicate high coverage periods.
Map Display
It visualizes the frequency of location mentions on a world map. Countries with more mentions are darker (e.g., North and South America are dark brown), and yellow circles indicate specific locations (larger circles for more mentions, e.g., South American capitals).
Performance:
o In tests with 65,000 documents and 100 manually identified topics, GeoNODE identified
over 80% of human-defined topics and detected 83% of stories within topics, with a
0.2% misclassification error—comparable to TDT initiative results.
Applications:
o GeoNODE helps analysts navigate news by topic, time, and place, using data from
diverse sources like broadcast video and online newspapers.
BNN and GeoNODE
Both systems highlight advancements in multimedia analysis:
o BNN focuses on story segmentation and content-based retrieval, improving speed and
accuracy over manual methods.
o GeoNODE extends this with topic detection, tracking, and geospatial visualization,
enabling deeper analysis of news trends.
o Future progress depends on multimedia corpora, standardized evaluation tasks, and
machine learning to enhance extraction and analysis.