UNIT-1
FUNCTIONAL OVERVIEW OF IRS
1. Normalizing Incoming Items:
This step is about converting various types of incoming data into
a consistent, standard format so that they can be easily
processed and searched.
● Language Encoding: Ensure that text from different
languages is properly encoded, typically in Unicode, which
allows consistent display and search across languages.
● Different File Formats: Convert items of each media type
(text, images, video, audio) into a standard format for that type. For
example:
○ Videos could be converted to formats like MPEG-2,
MPEG-1, AVI.
○ Audio files to WAV, Real Audio.
○ Images to GIF, JPEG, BMP.
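The language-encoding step can be illustrated with a minimal sketch. It assumes the source encoding of each incoming item is already known; the function name normalize_text and the sample strings are illustrative only.

```python
import unicodedata

def normalize_text(raw: bytes, source_encoding: str) -> str:
    """Decode bytes from a known encoding and return NFC-normalized Unicode."""
    text = raw.decode(source_encoding)
    # NFC composes characters so that visually identical strings compare equal.
    return unicodedata.normalize("NFC", text)

# The same word arriving in two different encodings and forms.
latin1_bytes = "caf\u00e9".encode("latin-1")
utf8_decomposed = "cafe\u0301".encode("utf-8")  # 'e' plus a combining accent

a = normalize_text(latin1_bytes, "latin-1")
b = normalize_text(utf8_decomposed, "utf-8")
print(a == b)  # True
```

After normalization both items hold the same character sequence, so a single search term matches either one.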
2. Logical Restructuring – Zoning:
Break down the content into meaningful sections. For example, if
you're processing an academic paper, divide it into sections like
Title, Author, Abstract, Main Text, Conclusion, References,
Keywords. This helps in more precise searching and better
display of search results.
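As a rough sketch of zoning, the code below assumes a hypothetical plain-text layout in which each zone starts with a label such as "Title:" at the beginning of a line. Real items rarely arrive this cleanly, so the label list and the parsing rule are assumptions for illustration.

```python
ZONE_LABELS = ("Title", "Author", "Abstract", "Main Text",
               "Conclusion", "References", "Keywords")

def split_into_zones(document: str) -> dict:
    """Group the lines of a document under the zone label that precedes them."""
    zones = {}
    current = None
    for line in document.splitlines():
        label, sep, rest = line.partition(":")
        if sep and label.strip() in ZONE_LABELS:
            # A recognized label opens a new zone.
            current = label.strip()
            zones[current] = [rest.strip()] if rest.strip() else []
        elif current is not None:
            # Any other line continues the zone that is currently open.
            zones[current].append(line.strip())
    return {name: " ".join(parts).strip() for name, parts in zones.items()}

# Sample data, invented for the example.
paper = ("Title: Indexing Basics\n"
         "Author: A. Writer\n"
         "Abstract: Short summary\n"
         "of the paper.\n"
         "Main Text: Body goes here.")
zones = split_into_zones(paper)
print(zones["Abstract"])  # Short summary of the paper.
```

With zones stored separately, a query can be restricted to one of them, for example only the Abstract.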
3. Creating a Searchable Data Structure (Indexing):
This involves several steps:
1. Identification of Processing Tokens:
○ Processing Tokens: These are the units of text that
are indexed and searched; they are often defined
more precisely than plain words.
○ Valid Word Symbols: Alphabetic characters and
numbers.
○ Inter-Word Symbols: Blanks, periods, semicolons
(these don't affect the search).
○ Special Processing Symbols: Hyphens.
2. Words are defined as continuous sequences of valid word
symbols separated by inter-word symbols.
3. Stop Algorithm:
○ Stop Words: Remove common words (like 'the', 'and')
that appear in almost every document, or words that
appear very infrequently, to save system resources.
○ Stop List: A predefined list of such stop words.
4. Characterize Tokens:
○ Word Characteristics: Identify specific features like
proper names, acronyms, numbers, dates.
○ Part of Speech Tagging: Determine if the word is a
noun, verb, etc.
○ Word Sense Disambiguation: Understand the
meaning of a word based on context.
5. Stemming Algorithm:
○ Stemming: Reduce words to their base or root form.
For example, 'computing', 'computers', and
'computation' are all reduced to 'comput'. This reduces
the number of unique words and saves storage space,
while also improving search efficiency.
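The token identification, stop list, and stemming steps can be sketched together as follows. This is a toy pipeline under stated assumptions: the stop list is a small illustrative set, hyphens are simply treated as separators, and the stemmer is a crude suffix stripper written for this example, not a standard algorithm such as Porter's.

```python
import re

STOP_LIST = {"the", "and", "of", "a", "is", "in", "to"}  # illustrative only
SUFFIXES = ("ation", "ing", "ers", "er", "s")  # crude rules, longest first

def tokenize(text: str) -> list:
    # Valid word symbols are letters and digits; everything else,
    # including hyphens in this sketch, acts as an inter-word symbol.
    return re.findall(r"[^\W_]+", text.lower())

def remove_stop_words(tokens: list) -> list:
    return [t for t in tokens if t not in STOP_LIST]

def stem(token: str) -> str:
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def processing_tokens(text: str) -> list:
    return [stem(t) for t in remove_stop_words(tokenize(text))]

print(processing_tokens("The computers and the computation of computing."))
# ['comput', 'comput', 'comput']
```

Three surface forms collapse to one stem, which is the storage and matching benefit described above. A suffix stripper this simple will also conflate unrelated words, which is why production systems use more careful rules.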
4. Creating the Searchable Data Structure:
After tokens pass through the stemming algorithm, they
are added to a searchable data structure. This structure
could be a signature file, inverted list, or PAT tree, and it
represents the semantic concepts of items in the database.
Because searches run against this structure rather than the
original items, it limits what a user can find in a search, and
how well it is built determines how efficient and accurate retrieval is.
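Of the structures named above, the inverted list is the simplest to sketch. The example below assumes items have already been reduced to processing tokens; the item numbers and tokens are made up for illustration.

```python
from collections import defaultdict

def build_inverted_index(items: dict) -> dict:
    """Map each processing token to the set of item ids that contain it."""
    index = defaultdict(set)
    for item_id, tokens in items.items():
        for token in tokens:
            index[token].add(item_id)
    return dict(index)

# Item id -> processing tokens (already stop-listed and stemmed).
items = {
    1: ["comput", "index"],
    2: ["index", "search"],
    3: ["comput", "search"],
}
index = build_inverted_index(items)
print(sorted(index["index"]))  # [1, 2]
```

A token that never made it into the index, for example a stop word, cannot be found by any query, which is the sense in which the structure limits what a search can return.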
Summary:
● Normalization: Convert and standardize different formats
and languages.
● Zoning: Break down content into logical sections.
● Token Identification: Identify important searchable tokens
and remove unnecessary ones.
● Token Characterization: Determine the specific features
and context of tokens.
● Stemming: Reduce words to their base form to save
space and improve search efficiency.
● Indexing: Create an internal structure that represents the
data and enables efficient searching.
Selective Dissemination of Information (SDI):
SDI is a system that automatically matches new information
against users' interests and delivers relevant items to them.
● How it works:
○ Search Process: The system continuously searches
new items.
○ User Profiles: Each user has a profile that describes
their interests.
○ User Mail Files: Where the system stores items
matching user interests.
● User Profile:
○ A broad search statement that describes what the
user is interested in.
○ A list of mail files to receive documents that match the
search statement.
○ When a new item matches the profile, it is sent to the
associated mail files.
● Difference from Ad Hoc Queries:
○ Profiles have many search terms and cover a wide
range of interests.
○ Ad hoc queries are short and specific.
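The SDI flow can be sketched as a loop that compares each new item against every stored profile. The matching rule used here (an item matches when it shares at least one term with the profile) is an assumption made for the example; the Profile class, user names, and mail file names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Profile:
    user: str
    terms: set        # the broad search statement, as a set of terms
    mail_files: list  # mail files that receive matching items

def disseminate(item_id: int, item_tokens: set, profiles: list, mail_store: dict) -> None:
    """Deliver a new item to the mail files of every profile it matches."""
    for profile in profiles:
        if profile.terms & item_tokens:  # assumed rule: any shared term
            for mail_file in profile.mail_files:
                mail_store.setdefault(mail_file, []).append(item_id)

profiles = [
    Profile("ana", {"solar", "wind"}, ["ana-energy"]),
    Profile("raj", {"index"}, ["raj-ir", "raj-all"]),
]
mail_store = {}
disseminate(101, {"solar", "panel"}, profiles, mail_store)
disseminate(102, {"invert", "index"}, profiles, mail_store)
print(mail_store)
# {'ana-energy': [101], 'raj-ir': [102], 'raj-all': [102]}
```

Note the direction of the search: the profiles are the stored data and each arriving item acts as the query, the reverse of an ad hoc search.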
Document Database Search:
This allows users to search all items that have been received
and stored in the system.
● Components:
○ Search Process: The mechanism that handles
searches.
○ User Queries: Specific search statements entered by
users.
○ Document Database: The collection of all processed
and stored items.
● Characteristics of Document Database:
○ Items usually do not change once stored.
○ It can be partitioned by time and allow for archiving.
● Difference from Profiles: Queries are short and focused
on specific interests.
Index Database Search:
Users can save and organize items for future reference through
indexing.
● Index Process:
○ Users can add items to an index with extra terms and
descriptions.
○ The index can point to the original item or contain
detailed information about it.
● Components:
○ Indexes: Like a library card catalog, they help
organize and find items.
○ Index Database Search Process: Lets users create
and search indexes.
○ Users can search the index and retrieve either the
index itself or the original item.
● Types of Index Files:
○ Public Index Files: Managed by library staff and
include all items in the Document Database.
○ Private Index Files: Created by individual users;
each user can have multiple private indexes.
Combined File Search:
This process integrates searches across both the document and
index databases.
● Public vs. Private Index Files:
○ Public index files cover all items and are accessible to
all users.
○ Private index files are specific to individual users and
cover a smaller subset of items.
● Database Management System:
○ Often, index files are managed using a structured
database management system, typically a relational one (RDBMS).
Automatic File Build (Information Extraction):
This process helps create indexes automatically.
● How it works:
○ Processes new documents and identifies key
information like authors, publication date, source, and
references.
○ Rules for which documents to process and how to
extract index terms are stored in Automatic File Build
Profiles.
● Candidate Index Records:
○ The result of processing new documents.
○ Reviewed and edited by users before updating the
actual index file.
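A minimal sketch of this process is shown below. It assumes the extraction rules of a build profile can be written as regular expressions over labelled header lines; the rule set, field names, and sample document are hypothetical.

```python
import re

# Hypothetical build profile: one extraction rule per index field.
EXTRACTION_RULES = {
    "author": re.compile(r"^Author:\s*(.+)$", re.MULTILINE),
    "date": re.compile(r"^Date:\s*(\d{4}-\d{2}-\d{2})$", re.MULTILINE),
    "source": re.compile(r"^Source:\s*(.+)$", re.MULTILINE),
}

def build_candidate_record(item_id: int, text: str) -> dict:
    """Extract index fields from a new item into a candidate record."""
    # The record starts unreviewed; a user confirms it before the
    # actual index file is updated.
    record = {"item_id": item_id, "reviewed": False}
    for field_name, pattern in EXTRACTION_RULES.items():
        match = pattern.search(text)
        record[field_name] = match.group(1).strip() if match else None
    return record

document = ("Author: J. Doe\n"
            "Date: 2020-05-01\n"
            "Source: Example Journal\n"
            "\n"
            "Body text of the item.")
candidate = build_candidate_record(7, document)
print(candidate["author"], candidate["date"])  # J. Doe 2020-05-01
```

Fields the rules cannot find are left as None, so the reviewer can fill them in or reject the record.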
Summary:
● SDI: Automatically matches new items to user interests and
delivers relevant information.
● Document Database Search: Allows users to search all
stored items.
● Index Database Search: Enables users to save, organize,
and search items using indexes.
● Combined File Search: Integrates document and index
searches.
● Automatic File Build: Automates the creation of index
records by extracting key information from new documents.
DIGITAL LIBRARY
DATA WAREHOUSE
IRS CAPABILITIES
Boolean Logic:
● Boolean logic allows users to combine search terms using
operators like AND, OR, and NOT. For instance, "cats AND
dogs" retrieves items containing both words, "cats OR
dogs" retrieves items containing either word, and "cats
NOT dogs" retrieves items containing "cats" but excluding
"dogs."
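Over an inverted list, the three operators map directly onto set operations. The sketch below assumes a tiny index from term to item ids; the ids are invented for the example.

```python
# Term -> ids of the items that contain it (illustrative data).
index = {
    "cats": {1, 2, 4},
    "dogs": {2, 3},
}

both = index["cats"] & index["dogs"]       # cats AND dogs
either = index["cats"] | index["dogs"]     # cats OR dogs
only_cats = index["cats"] - index["dogs"]  # cats NOT dogs

print(sorted(both), sorted(either), sorted(only_cats))
# [2] [1, 2, 3, 4] [1, 4]
```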
Proximity:
● Proximity search looks for words that appear close to each
other within a specified distance. For example, searching
"bake NEAR/5 cake" finds instances where "bake" and
"cake" appear within five words of each other, which helps
in locating related terms in context.
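A proximity test needs word positions, not just the fact that a word occurs. The sketch below interprets "within five words" as a position difference of at most five, which is an assumption; systems differ on how they count the distance and on whether order matters.

```python
def positions(tokens: list, term: str) -> list:
    """Return every position at which term occurs."""
    return [i for i, token in enumerate(tokens) if token == term]

def near(tokens: list, term_a: str, term_b: str, max_distance: int) -> bool:
    """True if some occurrence of term_a lies within max_distance of term_b."""
    return any(
        abs(a - b) <= max_distance
        for a in positions(tokens, term_a)
        for b in positions(tokens, term_b)
    )

tokens = "bake the batter until the cake is golden".split()
print(near(tokens, "bake", "cake", 5))    # True
print(near(tokens, "bake", "golden", 5))  # False
```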
Contiguous Word Phrases:
● This capability searches for exact phrases where words
appear together in the same order. For example, searching
for "climate change" returns results where these two words
are next to each other, ensuring the phrase's specific
context is maintained in the search results.
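A contiguous phrase match can be sketched as a sliding window over an item's tokens. The function name and sample sentences are illustrative.

```python
def contains_phrase(tokens: list, phrase: list) -> bool:
    """True if the phrase words occur next to each other, in order."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

phrase = ["climate", "change"]
print(contains_phrase("climate change is accelerating".split(), phrase))  # True
print(contains_phrase("change in climate policy".split(), phrase))        # False
```

The second item contains both words, so a Boolean AND would retrieve it, but the phrase search rejects it because the words are neither adjacent nor in order.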
Fuzzy Searches:
● Fuzzy searches find words that are similar to the search
term, accommodating spelling variations and typos. For
example, searching for "color" might also return "colour."
This is useful when dealing with documents containing
typographical errors or different spellings of the same word.
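One common way to measure similarity between words is edit distance, the number of single-character insertions, deletions, and substitutions needed to turn one word into another. The sketch below uses that measure with a threshold of one edit; both the measure and the threshold are assumptions, since the notes do not say how similarity is computed.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]

def fuzzy_matches(term: str, vocabulary: list, max_edits: int = 1) -> list:
    return [word for word in vocabulary if edit_distance(term, word) <= max_edits]

print(fuzzy_matches("color", ["colour", "colon", "collar", "cooler"]))
# ['colour', 'colon']
```

The result shows the trade-off: the spelling variant "colour" is found, but so is the unrelated word "colon", so fuzzy matching raises recall at some cost in precision.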
Term Masking:
● Term masking uses wildcards to replace characters in a
search term. For example, "comp*" can find "computer,"
"compete," and "compile." The asterisk (*) represents any
number of characters, while a question mark (?) can
replace a single character, broadening the search scope.
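Python's standard fnmatch module happens to use the same two wildcards, so it gives a quick way to sketch term masking against a vocabulary. The vocabulary here is made up; note that fnmatch also treats square brackets as a character class, which goes beyond what is described above.

```python
from fnmatch import fnmatchcase

def expand_mask(mask: str, vocabulary: list) -> list:
    """Return the vocabulary terms that match a masked search term."""
    return [word for word in vocabulary if fnmatchcase(word, mask)]

vocabulary = ["computer", "compete", "compile", "camp", "cat", "cot", "coat"]
print(expand_mask("comp*", vocabulary))  # ['computer', 'compete', 'compile']
print(expand_mask("c?t", vocabulary))    # ['cat', 'cot']
```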
Numeric & Date Ranges:
● This capability allows searching within specific numeric or
date ranges. For example, searching for documents from
2010 to 2020 or finding products priced between $50 and
$100. It helps in filtering search results based on
quantitative criteria, like dates or numbers.
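A range search only works if the values are stored as dates or numbers rather than as text. The sketch below assumes each item carries a publication date and a price field and treats both ends of a range as inclusive; the records are invented for the example.

```python
from datetime import date

documents = [
    {"id": 1, "published": date(2008, 3, 1), "price": 40.0},
    {"id": 2, "published": date(2015, 6, 15), "price": 75.0},
    {"id": 3, "published": date(2021, 1, 10), "price": 99.0},
]

def in_range(value, low, high) -> bool:
    # Inclusive on both ends (an assumption for this sketch).
    return low <= value <= high

by_date = [d["id"] for d in documents
           if in_range(d["published"], date(2010, 1, 1), date(2020, 12, 31))]
by_price = [d["id"] for d in documents if in_range(d["price"], 50, 100)]

print(by_date, by_price)  # [2] [2, 3]
```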
Concept & Thesaurus Expansions:
● This search capability includes related concepts or
synonyms to broaden search results. For example,
searching for "happy" might also retrieve "joyful" or
"content." Thesaurus expansions enhance search flexibility
by understanding and including variations in terminology,
ensuring comprehensive results.
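Thesaurus expansion can be sketched as a lookup that adds related terms to the query before the search runs. The one-entry thesaurus below is a hand-made stand-in, not a real resource.

```python
# Tiny illustrative thesaurus: term -> related terms.
THESAURUS = {
    "happy": {"joyful", "content"},
}

def expand_query(terms: list, thesaurus: dict) -> set:
    """Return the original terms plus any related terms from the thesaurus."""
    expanded = set(terms)
    for term in terms:
        expanded |= thesaurus.get(term, set())
    return expanded

print(sorted(expand_query(["happy", "dog"], THESAURUS)))
# ['content', 'dog', 'happy', 'joyful']
```

The expanded terms are typically combined with OR, so the search gains recall while terms without an entry, such as "dog" here, pass through unchanged.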
Natural Language Queries:
● Natural language queries allow users to search using
everyday language, mimicking human conversation. For
example, instead of using keywords, a user might ask,
"What is the capital of France?" The system interprets the
question and retrieves relevant information, making
searches more intuitive.
Multimedia Queries:
● Multimedia queries enable searching for various types of
content such as images, videos, and audio files. For
example, finding all videos related to "wildlife." This
capability is essential for databases that include diverse
media types, allowing users to locate non-textual
information easily.
Browse Capabilities
1. Ranking:
○ Ranking orders search results by relevance or
importance. This helps users see the most relevant
items first, based on criteria like keyword matches,
document popularity, or date of publication. For
example, a search for "renewable energy" will show
the most relevant articles at the top.
2. Zoning:
○ Zoning divides a document into logical sections such
as title, author, abstract, and main text. This helps in
targeted searching within specific sections. For
example, a user might search only within the
"abstract" zone to find articles with relevant
summaries.
3. Highlighting:
○ Highlighting visually emphasizes search terms in the
results. When users search for a keyword, this
feature highlights occurrences of that keyword in the
displayed documents. This makes it easier for users
to spot the relevant information quickly.
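Ranking and highlighting can be sketched together. The score used here is a plain count of query-term occurrences, chosen only because it is easy to follow; it is one of many possible relevance criteria. Square brackets stand in for visual emphasis, and the item texts are invented.

```python
import re

def score(tokens: list, query_terms: list) -> int:
    # Assumed relevance measure: total occurrences of the query terms.
    return sum(tokens.count(term) for term in query_terms)

def rank(items: dict, query_terms: list) -> list:
    """Return ids of matching items, best score first, ties by id."""
    scored = [(score(text.lower().split(), query_terms), item_id)
              for item_id, text in items.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [item_id for s, item_id in ordered if s > 0]

def highlight(text: str, query_terms: list) -> str:
    """Wrap each whole-word occurrence of a query term in brackets."""
    if not query_terms:
        return text
    alternatives = "|".join(re.escape(term) for term in query_terms)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: "[" + m.group(0) + "]", text)

items = {
    1: "Renewable energy policy",
    2: "Energy storage for renewable energy grids",
    3: "Cooking with gas",
}
query = ["renewable", "energy"]
print(rank(items, query))                 # [2, 1]
print(highlight(items[1], query))         # [Renewable] [energy] policy
```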
Miscellaneous Capabilities
1. Vocabulary Browse:
○ Vocabulary browsing allows users to explore terms
and their relationships within a specific domain or
subject. It often includes browsing through an index or
thesaurus to find related terms and expand searches
effectively. For example, exploring synonyms and
related terms for "biodiversity."
2. Iterative Search & Search History Log:
○ Iterative search involves refining searches based on
previous results to narrow down to the most relevant
information. The search history log keeps track of all
search queries, allowing users to revisit and refine
past searches for improved results.
3. Canned Query:
○ Canned queries are pre-defined searches created for
common queries. These saved searches can be
quickly executed without having to re-enter the search
criteria. For example, a canned query for "latest
technology news" would fetch up-to-date articles on
that topic.
4. Multimedia:
○ Multimedia capabilities involve searching and
retrieving various types of content like images,
videos, and audio files. For instance, users can
search for educational videos, photographs, or music
files, enabling a richer and more diverse search
experience.
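The search history log and canned queries from the list above can be sketched as a thin wrapper around whatever search function the system provides. The SearchSession class, the toy search function, and the titles are all hypothetical.

```python
class SearchSession:
    """Keeps a history of searches and a set of named, saved queries."""

    def __init__(self, search_fn):
        self.search_fn = search_fn
        self.history = []  # (query, results) pairs, oldest first
        self.canned = {}   # name -> saved query text

    def search(self, query: str) -> list:
        results = self.search_fn(query)
        self.history.append((query, results))
        return results

    def save_canned(self, name: str, query: str) -> None:
        self.canned[name] = query

    def run_canned(self, name: str) -> list:
        # A canned query runs like any other search and is logged too.
        return self.search(self.canned[name])

titles = ["New battery technology", "Technology news roundup", "Gardening tips"]

def toy_search(query: str) -> list:
    # Stand-in search: every query word must appear in the title.
    words = query.lower().split()
    return [t for t in titles if all(w in t.lower() for w in words)]

session = SearchSession(toy_search)
session.save_canned("tech-news", "technology news")
broad = session.search("technology")           # first attempt
refined = session.search("technology battery") # iterative refinement
canned = session.run_canned("tech-news")
print(len(broad), refined, canned)
# 2 ['New battery technology'] ['Technology news roundup']
```

Because every search is logged with its results, a user can look back through session.history, pick an earlier query, and refine it instead of starting over.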