Inverted File

The document describes file structures and inverted indexes used for information retrieval. It discusses lexicographical indices, inverted files, sorted array implementations of inverted files, and partitioning an inverted file index into multiple loads to fit in memory. The FAST-INV algorithm avoids explicit sorting by using a binary tree to build inverted file indexes in partitions that fit in memory, traversing the tree and postings lists to construct the final index.

Uploaded by

kidoseno85


File Structures

Information Retrieval: Data Structures and Algorithms

by W.B. Frakes and R. Baeza-Yates (Eds.)

Englewood Cliffs, NJ: Prentice Hall, 1992.

(Chapters 3-5)

File Structures for IR

lexicographical indices (indices that are sorted)
  e.g. inverted files
  e.g. Patricia (PAT) trees
cluster file structures
indices based on hashing
signature files

Inverted Files (Chapter 3)

Each document is assigned a list of keywords or attributes. Each keyword (attribute) is associated with operational relevance weights. An inverted file is the sorted list of keywords (attributes), with each keyword having links to the documents containing that keyword.

Penalty
the size of inverted files ranges from 10% to 100% or more of the size of the text itself
the index must be updated as the data set changes
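The definition can be illustrated with a minimal sketch. The function name build_inverted_file and the three sample documents are assumptions made for illustration only; relevance weights are left out.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each keyword to the sorted list of documents containing it.

    docs: dict of document number -> list of keywords (hypothetical input).
    Returns a list of (keyword, [doc numbers]) sorted by keyword.
    """
    index = defaultdict(set)
    for doc_id, keywords in docs.items():
        for kw in keywords:
            index[kw].add(doc_id)
    # Sorting by keyword is what makes this a lexicographical index.
    return [(kw, sorted(ids)) for kw, ids in sorted(index.items())]

# Made-up sample data
docs = {1: ["index", "file"], 2: ["file", "hash"], 3: ["index"]}
inverted = build_inverted_file(docs)
```

Each entry pairs a keyword with its links (here, plain document numbers), so a query for one keyword needs only one entry of the sorted list.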

Indexing Restrictions

A controlled vocabulary, which is the collection of keywords that will be indexed. Words in the text that are not in the vocabulary will not be indexed.
A list of stopwords that, for reasons of volume, will not be included in the index.
A set of rules that decide the beginning of a word or a piece of text that is indexable.
A list of character sequences to be indexed (or not indexed).
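A small sketch of how such restrictions might be applied while parsing. The stopword list, the vocabulary, and the word rule (runs of lowercase letters) are all invented for this example and are not taken from the source.

```python
import re

# Hypothetical restriction sets, chosen only for illustration
STOPWORDS = {"the", "a", "of", "and"}
VOCABULARY = {"inverted", "file", "index", "hashing"}
WORD_RULE = re.compile(r"[a-z]+")  # assumed rule: a word is a run of letters

def indexable_terms(text):
    """Return the words of text that survive all three restrictions."""
    words = WORD_RULE.findall(text.lower())
    return [w for w in words if w not in STOPWORDS and w in VOCABULARY]

terms = indexable_terms("The inverted file and the B-tree index")
```

Here "b" and "tree" are dropped because they are outside the controlled vocabulary, and "the" and "and" are dropped as stopwords.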

Sorted array implementation of an inverted file

Structures used in Inverted Files

Sorted Arrays
store the list of keywords in a sorted array and search it using a standard binary search
advantage: easy to implement
disadvantage: updating the index is expensive

Hashing Structures
Tries (digital search trees)
Combinations of these structures
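The sorted-array trade-off can be sketched with Python's standard bisect module. The parallel arrays keywords and postings, and the helper names lookup and add_keyword, are assumptions for this example; add_keyword assumes the term is not yet present.

```python
import bisect

# Parallel arrays: keywords[i] owns postings[i] (made-up sample data)
keywords = ["file", "hash", "index", "tree"]
postings = [[1, 2], [2], [1, 3], [4]]

def lookup(term):
    """Binary search: O(log n) comparisons."""
    i = bisect.bisect_left(keywords, term)
    if i < len(keywords) and keywords[i] == term:
        return postings[i]
    return []

def add_keyword(term, doc_ids):
    """Insert a new term; every later entry shifts, so this is O(n)."""
    i = bisect.bisect_left(keywords, term)
    keywords.insert(i, term)
    postings.insert(i, doc_ids)
```

The lookup is short and cheap, while each insertion moves array elements, which is the update cost noted above.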

Sorted Arrays
1. The input text is parsed into a list of words along with their location in the text. (This is a time- and storage-consuming operation.)
2. This list is inverted from a list of terms in location order to a list of terms in alphabetical order.
3. Add term weights, or reorganize or compress the files.
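The three steps can be sketched as follows. Splitting on whitespace and using the within-document frequency as the term weight are simplifying assumptions; the function names are hypothetical.

```python
from itertools import groupby

def parse(docs):
    """Step 1: list of (word, doc#, position) in location order."""
    return [(word, doc_id, pos)
            for doc_id, text in docs.items()
            for pos, word in enumerate(text.lower().split())]

def invert(word_list):
    """Step 2: reorder from location order to alphabetical order."""
    return sorted(word_list)

def add_weights(sorted_list):
    """Step 3: per term, a list of (doc#, frequency) pairs.

    Frequency stands in for the term weight (an assumption)."""
    result = {}
    for word, group in groupby(sorted_list, key=lambda t: t[0]):
        freq = {}
        for _word, doc_id, _pos in group:
            freq[doc_id] = freq.get(doc_id, 0) + 1
        result[word] = sorted(freq.items())
    return result

# Made-up sample data
docs = {1: "a b a", 2: "b c"}
index = add_weights(invert(parse(docs)))
```

Step 2 is an explicit sort here, which is the cost that the later slides on sort-free inversion try to avoid.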

Inversion of Word List

Dictionary and postings file

Idea: the file to be searched should be as short as possible, so split a single file into two pieces: a dictionary and a postings file.

e.g. data set: 38,304 records, 250,000 unique terms

(document #, frequency)
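A minimal sketch of the split, assuming the dictionary stores, for each term, the number of postings and an offset into the postings file, and the postings file stores (document #, frequency) pairs. The names split_index and fetch and the sample data are illustrative.

```python
def split_index(inverted):
    """inverted: list of (term, [(doc#, freq), ...]) sorted by term."""
    dictionary = []      # (term, number of postings, offset)
    postings_file = []   # flat list of (doc#, freq)
    for term, plist in inverted:
        dictionary.append((term, len(plist), len(postings_file)))
        postings_file.extend(plist)
    return dictionary, postings_file

def fetch(dictionary, postings_file, term):
    """Search only the short dictionary, then read one slice of postings."""
    for t, n, offset in dictionary:
        if t == term:
            return postings_file[offset:offset + n]
    return []

# Made-up sample data
inverted = [("file", [(1, 2), (3, 1)]), ("index", [(2, 4)])]
dictionary, postings_file = split_index(inverted)
```

Only the dictionary is searched, so the searched file holds one short record per unique term regardless of how many postings the term has.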

Producing an Inverted File for Large Data Sets without Sorting

Idea: avoid the use of an explicit sort by using a right-threaded binary tree

Each tree node stores the current number of term postings and the storage location of the postings list.

traverse the binary tree and the linked postings list
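A sketch of the idea, assuming an in-memory right-threaded binary search tree in which a node's right pointer is either a real child or a thread to its in-order successor. The Python list postings stands in for the linked postings list, and all names are hypothetical.

```python
class Node:
    def __init__(self, term, doc, right=None, is_thread=False):
        self.term = term
        self.left = None
        self.right = right          # right child, or successor if is_thread
        self.is_thread = is_thread
        self.count = 1              # current number of term postings
        self.postings = [doc]       # stands in for the linked postings list

def insert(root, term, doc):
    """Add one (term, doc#) pair and return the root of the tree."""
    if root is None:
        return Node(term, doc)
    cur = root
    while True:
        if term == cur.term:
            cur.postings.append(doc)
            cur.count += 1
            return root
        if term < cur.term:
            if cur.left is None:
                # New left child: its successor is its parent.
                cur.left = Node(term, doc, right=cur, is_thread=True)
                return root
            cur = cur.left
        else:
            if cur.right is None or cur.is_thread:
                # New right child inherits the parent's thread (if any).
                cur.right = Node(term, doc, cur.right, cur.is_thread)
                cur.is_thread = False
                return root
            cur = cur.right

def leftmost(node):
    while node is not None and node.left is not None:
        node = node.left
    return node

def inorder(root):
    """Yield nodes in alphabetical order with no stack and no sort."""
    cur = leftmost(root)
    while cur is not None:
        yield cur
        cur = cur.right if cur.is_thread else leftmost(cur.right)

# Made-up sample data
root = None
for term, doc in [("index", 1), ("file", 1), ("tree", 2), ("file", 3), ("hash", 2)]:
    root = insert(root, term, doc)
result = [(n.term, n.count, n.postings) for n in inorder(root)]
```

Traversing the tree visits the terms in sorted order, and reading each node's postings in turn produces the inverted file without an explicit sort.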

A Fast Inversion Algorithm

Principle 1
Large primary memories are available. If databases can be split into memory loads that can be rapidly processed and then combined, the overall cost will be minimized.

Principle 2
The input data has an inherent order that can be exploited. It is very expensive to use polynomial or even n log n sorting algorithms for large files.

FAST-INV algorithm
concept postings/ pointers See p. 13.

Sample document vector

Each entry is a document number and a concept number (one concept number for each unique word), similar to the document-word list shown on p. 7. The concept numbers are sorted within document numbers, and document numbers are sorted within the collection.

Preparation

Terminology
HCN = highest concept number in dictionary, or the number of words to be indexed
L = number of document/concept pairs in the collection
M = available primary memory size

Assumption
M >> HCN
M < L

[Figure: the load table gives the range of concepts for each primary load. Labels: (Doc, Con), Con, Load, Load File, CONPTR, Offset]

Preparation
1. Allocate an array, con_entries_cnt, of size HCN.
2. For each <doc#, con#> entry in the document vector file: increment con_entries_cnt[con#]

Offset   Postings (con#, doc#)
0        (1,1), (1,4)
2        (2,3)
3        (3,1), (3,2), (3,5)
6        (4,2), (4,3)
8        (end of file)
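Steps 1 and 2 amount to a counting pass, and a running sum of the counts gives the offset at which each concept's postings start. The sketch below uses the four-concept example (postings for concepts 1 to 4 spread over documents 1 to 5); the function name and the 1-based array layout (index 0 unused, so the arrays are one slot longer than HCN) are assumptions.

```python
def count_and_offsets(doc_con_pairs, hcn):
    """Return (con_entries_cnt, offsets), both indexed by concept number."""
    con_entries_cnt = [0] * (hcn + 1)          # index 0 unused
    for _doc, con in doc_con_pairs:
        con_entries_cnt[con] += 1
    offsets = [0] * (hcn + 2)                  # offsets[hcn + 1] = total
    for con in range(1, hcn + 1):
        offsets[con + 1] = offsets[con] + con_entries_cnt[con]
    return con_entries_cnt, offsets

# Document vector file: (doc#, con#) pairs, sorted by document, then concept
pairs = [(1, 1), (1, 3), (2, 3), (2, 4), (3, 2), (3, 4), (4, 1), (5, 3)]
con_entries_cnt, offsets = count_and_offsets(pairs, hcn=4)
```

With these pairs the counts are 2, 1, 3 and 2, so the postings for concepts 1 to 4 start at offsets 0, 2, 3 and 6, and the file ends at 8.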

Preparation (continued)
5. For each <con#, count> pair obtained from con_entries_cnt: if there is no room for documents with this concept to fit in the current load, then create an entry in the load table and initialize the next load entry; otherwise update information for the current load table entry.

Building Load Table

Terminology
LL = length of current load
S = spread of concept numbers in the current load
8 bytes = space needed for each concept/weight pair
4 bytes = space needed for each concept to store count of postings for it

Constraints
8*LL + 4*S < M
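A sketch of how the load table might be built under the constraint 8*LL + 4*S < M. It assumes S is the number of consecutive concept numbers in the load and that a concept too large to share a load is placed in a load of its own; the function name and the tuple layout are hypothetical.

```python
def build_load_table(con_entries_cnt, M):
    """Split concepts 1..HCN into consecutive loads that fit in M bytes.

    con_entries_cnt[c] = number of postings for concept c (index 0 unused).
    Returns a list of (first concept, last concept, LL) per load.
    """
    loads = []
    start, LL, S = 1, 0, 0
    for con in range(1, len(con_entries_cnt)):
        count = con_entries_cnt[con]
        # Close the current load if adding this concept breaks the constraint.
        if S > 0 and 8 * (LL + count) + 4 * (S + 1) >= M:
            loads.append((start, con - 1, LL))
            start, LL, S = con, 0, 0
        LL += count
        S += 1
    if S > 0:
        loads.append((start, start + S - 1, LL))
    return loads

counts = [0, 2, 1, 3, 2]          # postings per concept, concepts 1..4
small = build_load_table(counts, M=48)
large = build_load_table(counts, M=100)
```

With M = 48 the concepts are split into three loads, while M = 100 holds all eight postings in one load, since 8*8 + 4*4 = 80 < 100.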

