CI-6226
Lecture 5, Part 1
Index Construction & Compression
Information Retrieval and Analysis
Vasily Sidorov
1
Let’s Recall

Dictionary (Term → TermID):
  friend      1
  roman       2
  countryman  3
  lend        4
  i           5
  you         6
  ear         7

Inverted Index (TermID, Freq., Postings List of DocIDs):
  1  4  → 1 → 5 → 6 → 12
  5  2  → 1 → 8
  7  6  → 1 → 2 → 6 → 8 → 12 → 13

Can also be positional
3
Sec. 4.1
Hardware basics
• Many design decisions in information retrieval are
based on the characteristics of hardware
• We begin by reviewing hardware basics
5
Hardware basics
6
Sec. 4.1
Hardware basics
• Access to data in memory (RAM) is much faster
than access to data on disk.
• Disk seeks: No data is transferred from disk while
the disk head is being positioned.
• Therefore: Transferring one large chunk of data
from disk to memory is faster than transferring
many small chunks.
• Disk I/O is block-based: Reading and writing of
entire blocks (as opposed to smaller chunks).
• Block sizes: 8 KB to 256 KB.
7
Hardware basics
• Solid State Drives (SSD) are mitigating some of the
problems:
—~100 times faster access time than HDDs
—1–2 orders of magnitude faster I/O than HDDs
—Random access: (almost) no “disk seek” delay
• Still much slower than RAM
• Still reads/writes in blocks
• Still too expensive compared to HDD
8
Sec. 4.1
Hardware basics
• Servers used in IR systems now typically have
hundreds of GB of main memory, sometimes
several TB
• Available disk space is several (2–3) orders of
magnitude larger. 128 GB of RAM → 5–10 TB of disk
• Fault tolerance is very expensive: It’s much cheaper
to use many regular machines rather than one fault
tolerant machine.
9
Sec. 4.1
Hardware assumptions for this lecture
symbol  statistic                                           value
s       average seek time                                   5 ms = 5 × 10⁻³ s
b       transfer time per byte                              0.02 μs = 2 × 10⁻⁸ s
        processor’s clock rate                              10⁹ s⁻¹
p       low-level operation (e.g., compare & swap a word)   0.01 μs = 10⁻⁸ s
        size of main memory                                 several GB
        size of disk space                                  1 TB or more
10
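To make "one large chunk beats many small chunks" concrete, here is a rough back-of-envelope sketch in Python using the seek and transfer times above (illustrative only; real drives also have caching and controller overhead):

```python
# Rough estimate using the assumptions above:
SEEK_S = 5e-3    # s: average seek time
BYTE_S = 2e-8    # s: transfer time per byte

total_bytes = 100_000_000 * 12          # e.g., 100M postings of 12 bytes each

# One sequential read: a single seek, then stream all the data.
sequential = SEEK_S + total_bytes * BYTE_S           # ≈ 24 s

# One seek per 12-byte record: dominated entirely by seek time.
per_record = 100_000_000 * (SEEK_S + 12 * BYTE_S)    # ≈ 500,000 s (days)

print(f"{sequential:.1f} s sequential vs. {per_record:.0f} s with a seek per record")
```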
Sec. 4.2
RCV1: Our collection for this lecture
• Shakespeare’s collected works definitely aren’t large
enough for demonstrating many of the points in
this course.
• The collection we’ll use isn’t really large enough
either, but it’s publicly available and is at least a
more plausible example.
• As an example for applying scalable index
construction algorithms, we will use the Reuters
RCV1 collection.
• This is one year of Reuters newswire (part of 1995
and 1996)
11
Sec. 4.2
A Reuters RCV1 document
12
Sec. 4.2
Reuters RCV1 statistics
symbol  statistic                                        value
N       documents                                        800,000
L       avg. # tokens per doc                            200
M       terms (= word types)                             400,000
        avg. # bytes per token (incl. spaces/punct.)     6
        avg. # bytes per token (without spaces/punct.)   4.5
        avg. # bytes per term                            7.5
T       non-positional postings                          100,000,000
4.5 bytes per token vs. 7.5 bytes per term: why?
13
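A quick sanity check of these statistics (a small Python sketch; the 100M postings figure is taken from the table, not derived exactly):

```python
N = 800_000    # documents
L = 200        # avg. tokens per document

token_occurrences = N * L              # 160,000,000 -> size of a positional index
nonpositional_postings = 100_000_000   # from the table

# A non-positional index stores each (term, doc) pair only once, so it is
# smaller than the 160M token occurrences: repeated terms within a document
# collapse into a single posting.
print(token_occurrences, nonpositional_postings)
```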
Sec. 4.2
Recall IIR 1 index construction
• Documents are parsed to extract words and these
are saved with the Document ID.

Doc 1: I did enact Julius Caesar I was killed
i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble
Brutus hath told you Caesar was ambitious

Resulting (Term, Doc #) pairs:
  I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1,
  the 1, capitol 1, brutus 1, killed 1, me 1,
  so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2,
  hath 2, told 2, you 2, caesar 2, was 2, ambitious 2
14
Sec. 4.2
Key step
• After all documents have been parsed, the inverted
file is sorted by terms.

Before sorting (Term, Doc #):
  I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1,
  the 1, capitol 1, brutus 1, killed 1, me 1,
  so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2,
  hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

After sorting (Term, Doc #):
  ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2,
  caesar 2, did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1,
  killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2,
  was 1, was 2, with 2, you 2

We focus on this sort step.
We have 100M items to sort.
15
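Expressed in code, the key step sorts the (term, docID) pairs by term and then by docID; a minimal Python illustration on a few pairs from the two example documents:

```python
# A few of the (term, docID) pairs produced by parsing Doc 1 and Doc 2
postings = [("i", 1), ("did", 1), ("enact", 1), ("julius", 1),
            ("so", 2), ("let", 2), ("it", 2), ("be", 2),
            ("caesar", 1), ("caesar", 2)]

# The key step: sort by term, then by docID within each term.
postings.sort(key=lambda pair: (pair[0], pair[1]))
print(postings[:4])   # [('be', 2), ('caesar', 1), ('caesar', 2), ('did', 1)]
```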
Sec. 4.2
Scaling index construction
• In-memory index construction does not scale
—Can’t stuff entire collection into memory, sort, then
write back
• How can we construct an index for very large
collections?
• Taking into account the hardware constraints we
just learned about . . .
• Memory, disk, speed, etc.
16
Sec. 4.2
Sort-based index construction
• As we build the index, we parse docs one at a time.
—While building the index, we cannot easily exploit
compression tricks (you can, but much more complex)
• The final postings for any term are incomplete until the end.
• At 12 bytes per non-positional postings entry (term, doc,
freq), this demands a lot of space for large collections.
• T = 100,000,000 in the case of RCV1
—So… we can do this in memory in 2021, but typical
collections are much larger. E.g., the New York Times
provides an index of >150 years of newswire
• Thus: We need to store intermediate results on disk.
17
Sec. 4.2
Sort using disk as “memory”?
• Can we use the same index construction algorithm
for larger collections, but by using disk instead of
memory?
• No: Sorting T = 100,000,000 records on disk is too
slow – too many disk seeks.
• We need an external sorting algorithm.
18
Sec. 4.2
Bottleneck
• Parse and build postings entries one doc at a time
• Now sort postings entries by term (then by doc
within each term)
• Doing this with random disk seeks would be too
slow – must sort T=100M records
If every comparison took 2 disk seeks, and N items could be
sorted with N log₂ N comparisons, how long would this take?
19
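A rough answer to the question above, using the 5 ms seek time from the hardware-assumptions slide (a sketch that ignores everything except seeks):

```python
import math

N = 100_000_000      # records to sort
SEEK_S = 5e-3        # average seek time in seconds

comparisons = N * math.log2(N)         # ≈ 2.7 × 10^9 comparisons
seconds = comparisons * 2 * SEEK_S     # 2 disk seeks per comparison
print(seconds / 86_400, "days")        # ≈ 300+ days — hopelessly slow
```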
Sec. 4.2
BSBI: Blocked Sort-Based Indexing
(sorting with fewer disk seeks)
• 12-byte (4+4+4) records (term, doc, freq)
• These are generated as we parse docs
• Must now sort 100M such 12-byte records by term.
• Define a Block ~ 10M such records
—Can easily fit a couple into memory.
—Will have 10 such blocks to start with.
• Basic idea of algorithm:
—Accumulate postings for each block, sort, write to
disk.
—Then merge the blocks into one long sorted order.
20
21
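A minimal Python sketch of the BSBI block step described above: accumulate ~10M (termID, docID, freq) records, sort the block in memory, and write it to disk as a sorted run. The record iterable and run file names are illustrative assumptions, not the exact algorithm from the figure:

```python
import struct

BLOCK_SIZE = 10_000_000           # ~10M records per block, as on the slide
RECORD = struct.Struct("iii")     # 12 bytes: (termID, docID, freq)

def write_sorted_run(block, run_path):
    """Sort one in-memory block by (termID, docID, freq) and write it to disk."""
    block.sort()
    with open(run_path, "wb") as f:
        for rec in block:
            f.write(RECORD.pack(*rec))

def bsbi_make_runs(records):
    """records: iterable of (termID, docID, freq) produced while parsing docs."""
    block, run_id = [], 0
    for rec in records:
        block.append(rec)
        if len(block) >= BLOCK_SIZE:               # block full: flush a sorted run
            write_sorted_run(block, f"run{run_id}.bin")
            block, run_id = [], run_id + 1
    if block:                                      # flush the final partial block
        write_sorted_run(block, f"run{run_id}.bin")
```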
Sec. 4.2
Sorting 10 blocks of 10M records
▪First, read each block and sort within:
▪Quicksort takes 2N ln N expected steps
▪In our case 2 × (10M ln 10M) steps
▪Exercise: estimate total time to read each block
from disk and quicksort it.
▪10 times this estimate – gives us 10 sorted runs of
10M records each.
▪Done straightforwardly, need 2 copies of data on
disk
▪But can optimize this
22
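A rough take on the exercise above, using the per-byte transfer time and low-level operation cost from the hardware-assumptions slide (seek time and writing the run back out are ignored):

```python
import math

RECORDS = 10_000_000       # records in one block
BYTE_S  = 2e-8             # transfer time per byte
OP_S    = 1e-8             # one low-level operation (compare/swap)

read_time = RECORDS * 12 * BYTE_S                   # 120 MB at 20 ns/byte ≈ 2.4 s
sort_time = 2 * RECORDS * math.log(RECORDS) * OP_S  # 2N ln N steps ≈ 3.2 s

print(read_time, sort_time)
# ~2.4 s to read + ~3.2 s to quicksort per block; ×10 blocks ≈ 1 minute overall.
```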
Sec. 4.2
23
Sec. 4.2
How to merge the sorted runs?
• Can do binary merges, with a merge tree of ⌈log₂ 10⌉ = 4 layers.
• During each layer, read into memory runs in blocks of 10M,
merge, write back.
[Diagram: disk holds runs 1–4 being merged; the output is written back to disk as one merged run.]
24
Sec. 4.2
How to merge the sorted runs?
• But it is more efficient to do a multi-way merge, where you
are reading from all blocks simultaneously
• Provided you read decent-sized chunks of each block into
memory and then write out a decent-sized output chunk,
you’re not killed by disk seeks
25
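A minimal sketch of the multi-way merge over the sorted runs. It assumes each run is stored as a sorted text file of "termID docID freq" lines (an assumption for simplicity); Python's heapq.merge keeps only a small buffer per run and yields records in global sorted order:

```python
import heapq

def read_run(path):
    """Read one sorted run lazily, so only a small buffer per run is in memory."""
    with open(path) as f:
        for line in f:
            termID, docID, freq = map(int, line.split())
            yield (termID, docID, freq)

def merge_runs(run_paths, out_path):
    runs = [read_run(p) for p in run_paths]
    with open(out_path, "w") as out:
        # heapq.merge performs a k-way merge of already-sorted iterators,
        # so all runs are read simultaneously in sequential order.
        for termID, docID, freq in heapq.merge(*runs):
            out.write(f"{termID} {docID} {freq}\n")

# Usage sketch: merge_runs([f"run{i}.txt" for i in range(10)], "index.txt")
```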
Sec. 4.3
Remaining problem with sort-based
algorithm
• Our assumption was: we can keep the dictionary in
memory.
• We need the dictionary (which grows dynamically)
in order to implement a term-to-termID mapping.
• Actually, we could work with (term, docID) postings
instead of (termID, docID) postings . . .
• . . . but then intermediate files become very large.
(We would end up with a scalable, but very slow
index construction method.)
26
Sec. 4.3
SPIMI:
Single-Pass In-Memory Indexing
• Key idea 1: Generate separate dictionaries for each
block – no need to maintain term-termID mapping
across blocks.
• Key idea 2: Don’t sort. Accumulate postings in
postings lists as they occur.
• With these two ideas we can generate a complete
inverted index for each block.
• These separate indexes can then be merged into
one big index.
27
Sec. 4.3
SPIMI-Invert
Merging of blocks is analogous to BSBI.
28
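A minimal Python sketch of the SPIMI idea for a single block: no global term-termID mapping and no sorting of postings as they arrive; terms are sorted only once, when the block's index is written out. The token stream and output format are illustrative assumptions (the full algorithm also flushes a block when memory runs out):

```python
from collections import defaultdict

def spimi_invert(token_stream, block_path):
    """Build the inverted index for one block and write it to disk.
    token_stream yields (term, docID) pairs in document order."""
    dictionary = defaultdict(list)              # term -> postings list
    for term, docID in token_stream:
        postings = dictionary[term]
        if not postings or postings[-1] != docID:
            postings.append(docID)              # accumulate postings as they occur
    with open(block_path, "w") as f:
        for term in sorted(dictionary):         # sort the terms only at write time
            f.write(term + " " + " ".join(map(str, dictionary[term])) + "\n")

# Usage sketch:
# spimi_invert([("so", 2), ("let", 2), ("it", 2)], "block0.txt")
```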
Sec. 4.3
SPIMI: Compression
• Compression makes SPIMI even more efficient.
—Compression of terms
—Compression of postings
• We’ll discuss later today
29
Sec. 4.4
Distributed indexing
• For web-scale indexing (don’t try this at home!):
must use a distributed computing cluster
• Individual machines are fault-prone
—Can unpredictably slow down or fail
• How do we exploit such a pool of machines?
30
Sec. 4.4
Web search engine data centers
• Web search data centers (Google, Bing, Baidu)
mainly contain commodity machines.
https://www.pcworld.com/article/112891/article.html
• Data centers are distributed around the world
—15 locations in the world
◦ There’s one in Jurong West Avenue 2
• Estimate: Google uses ~900K servers (2011 report)
http://www.datacenterknowledge.com/archives/2011/08/01/report-google-uses-about-900000-servers
31
Google Data Center in Jurong West
32
Sec. 4.4
Massive data centers
• In a non-fault-tolerant system with 1000 nodes, if each
node has 99.9% uptime, what is the uptime of the whole
system (i.e., all nodes up)?
• Answer: 37%
• Exercise: Calculate the number of servers failing per
minute for an installation of 1 million servers.
• Colloquially, SLA is counted in “nines”, e.g., an SLA
of four “nines”: 99.99% uptime.
33
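Checking the 37% answer, plus a rough stab at the exercise; the exercise needs an assumed failure rate, so the 3-year average time between failures below is purely an illustrative assumption:

```python
nodes, node_uptime = 1000, 0.999
print(node_uptime ** nodes)      # ≈ 0.37: all 1000 nodes are up only ~37% of the time

servers = 1_000_000
avg_years_between_failures = 3   # assumption for illustration, not from the slides
minutes = avg_years_between_failures * 365 * 24 * 60
print(servers / minutes)         # ≈ 0.6 servers failing every minute
```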
Sec. 4.4
Distributed indexing
• Maintain a master machine directing the indexing
job – considered “safe”.
• Break up indexing into sets of (parallel) tasks.
• Master machine assigns each task to an idle
machine from a pool.
34
Sec. 4.4
Parallel tasks
• We will use two sets of parallel tasks
—Parsers
—Inverters
• Break the input document collection into splits
• Each split is a subset of documents (corresponding
to blocks in BSBI/SPIMI)
35
Sec. 4.4
Parsers
• Master assigns a split to an idle parser machine
• Parser reads one document at a time and emits
(term, doc) pairs
• Parser writes pairs into j partitions
• Each partition is for a range of terms’ first letters
—e.g., a-f, g-p, q-z; here j = 3
• Now to complete the index inversion
36
Sec. 4.4
Inverters
• An inverter collects all (term,doc) pairs (= postings)
for one term-partition.
• Sorts and writes to postings lists
37
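A toy single-machine sketch of the parser/inverter split (in reality these run on separate machines coordinated by the master; the a-f / g-p / q-z partitioning is the j = 3 example from the Parsers slide):

```python
from collections import defaultdict

def partition_of(term):
    """Route a term to a partition by its first letter (j = 3 partitions)."""
    c = term[0].lower()
    return "a-f" if c <= "f" else ("g-p" if c <= "p" else "q-z")

def parse(split):
    """Parser (map): emit (term, docID) pairs into per-partition segments."""
    segments = defaultdict(list)
    for docID, text in split:                 # split = subset of documents
        for term in text.lower().split():
            segments[partition_of(term)].append((term, docID))
    return segments

def invert(pairs):
    """Inverter (reduce): sort one term-partition and build postings lists."""
    index = defaultdict(list)
    for term, docID in sorted(pairs):
        if not index[term] or index[term][-1] != docID:
            index[term].append(docID)
    return index
```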
Sec. 4.4
Data flow
[Diagram: the master assigns splits to parsers and term partitions to inverters.
Parsers (map phase) emit (term, doc) pairs into segment files partitioned a-f, g-p, q-z;
each inverter (reduce phase) reads one partition from all segment files and writes its postings.]
38
Sec. 4.4
MapReduce
• The index construction algorithm we just described
is an instance of MapReduce.
• MapReduce (Dean & Ghemawat 2004) is a robust
and conceptually simple framework for distributed
computing …
• … without having to write code for the distribution
part.
• They describe the Google indexing system (ca.
2002) as consisting of a number of phases, each
implemented in MapReduce.
39
Sec. 4.4
MapReduce
• Index construction was just one phase.
• Another phase: transforming a term-partitioned
index into a document-partitioned index.
—Term-partitioned: one machine handles a subrange
of terms
—Document-partitioned: one machine handles a
subrange of documents
• As we’ll discuss in the web part of the course, most
search engines use a document-partitioned index
for better load balancing, etc.
40
Sec. 4.5
Dynamic indexing
• Up to now, we have assumed that collections are
static.
• They rarely are:
—Documents come in over time and need to be
inserted.
—Documents are deleted and modified.
• This means that the dictionary and postings lists
have to be modified:
—Postings updates for terms already in dictionary
—New terms added to dictionary
◦ #MoonbyulxPunch
43
Sec. 4.5
Simplest approach
• Maintain “big” main index
• New docs go into “small” auxiliary index
• Search across both, merge results
• Deletions
—Invalidation bit-vector for deleted docs
—Filter docs output on a search result by this
invalidation bit-vector
• Periodically, re-index into one main index
44
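A minimal sketch of query processing with a main index, an auxiliary index, and an invalidation bit-vector (represented here, purely for illustration, by plain Python dicts and a set of deleted docIDs):

```python
def search(term, main_index, aux_index, deleted):
    """Union the postings from both indexes, then drop invalidated docs."""
    docs = set(main_index.get(term, [])) | set(aux_index.get(term, []))
    return sorted(d for d in docs if d not in deleted)

# Usage sketch
main_index = {"caesar": [1, 2, 6]}
aux_index  = {"caesar": [9]}        # newly added documents
deleted    = {2}                    # invalidation bit-vector (here: a set)
print(search("caesar", main_index, aux_index, deleted))   # [1, 6, 9]
```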
Issues with multiple indexes
• Collection-wide statistics are hard to maintain
—E.g., when we spoke of spell-correction: which of
several corrected alternatives do we present to the
user?
—We said, pick the one with the most hits
• How do we maintain the top ones with multiple
indexes and invalidation bit vectors?
—One possibility: ignore everything but the main
index for such ordering
• Will see more such statistics used in results ranking
49
Dynamic indexing at search engines
• All the large search engines now do dynamic
indexing
• Their indices have frequent incremental changes
—News, blogs, Twitter, new topical web pages
• But (sometimes/typically) they also periodically
reconstruct the index from scratch
—Query processing is then switched to the new index,
and the old index is deleted
50