Chapter Two: Text/Document Operations and Term Weighting
Text Operations

Statistical Properties of Text

How is the frequency of different words distributed?
How fast does vocabulary size grow with the size of a corpus?
Such factors affect the performance of an IR system and can be used to select suitable term weights and other aspects of the system.

A few words are very common.
  The 2 most frequent words (e.g. “the”, “of”) can account for about 10% of word occurrences.
Most words are very rare.
  Half the words in a corpus appear only once (hapax legomena, “read only once”).
This is called a “heavy-tailed” distribution, since most of the probability mass lies in the long tail of rare words.
Sample Word Frequency Data

[Table of sample word-frequency data from the original slide not reproduced here.]
Zipf’s Distributions

Rank-frequency distribution: for all the words in a collection of documents, for each word w,
• f is the frequency with which w appears
• r is the rank of w in order of frequency (the most commonly occurring word has rank 1, etc.)

[Figure: distribution of sorted word frequencies according to Zipf’s law — frequency f plotted against rank r; a word w with rank r has frequency f.]
Word Distribution: Zipf’s Law

Zipf’s Law, named after the Harvard linguistics professor George Kingsley Zipf (1902–1950), attempts to capture the distribution of the frequencies (i.e., numbers of occurrences) of the words within a text.

Zipf’s Law states that when the distinct words in a text are arranged in decreasing order of their frequency of occurrence (most frequent words first), the occurrence characteristics of the vocabulary can be characterized by the constant rank-frequency law of Zipf:

    Frequency × Rank = constant

That is, if the words w in a collection are ranked by their frequency f, with rank r, they roughly fit the relation:

    r × f = c

Different collections have different constants c.
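As a hedged illustration (mine, not from the original slides), the following Python sketch computes the rank–frequency product r × f for every word in a text; under Zipf’s law these products should come out roughly constant. The regex tokenizer and the corpus file name are assumptions for the example.

from collections import Counter
import re

def rank_frequency(text):
    """Return (rank, word, frequency, rank*frequency), most frequent first."""
    words = re.findall(r"[a-z]+", text.lower())  # crude tokenizer (assumption)
    counts = Counter(words)
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]

# Usage with any plain-text corpus (hypothetical file name):
# text = open("corpus.txt").read()
# for rank, word, freq, rf in rank_frequency(text)[:10]:
#     print(rank, word, freq, rf)  # r*f stays roughly constant if Zipf holds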
Example: Zipf’s Law

The table (not reproduced here) shows the most frequently occurring words from a 336,310-document collection containing 125,720,891 total words, of which 508,209 are unique.
Zipf’s Law: Modeling Word Distribution

The collection frequency of the ith most common term is proportional to 1/i:

    f_i ∝ 1/i

If the most frequent term occurs f_1 times, then the second most frequent term has half as many occurrences, the third most frequent term a third as many, etc.

Zipf’s Law states that the frequency of the ith most frequent word is 1/i^θ times that of the most frequent word, for some θ ≥ 1. That is, the frequency of occurrence of some event (P), as a function of the rank (i) when the rank is determined by the frequency of occurrence, is a power-law function P_i ~ 1/i^θ with the exponent θ close to 1.
More Example: Zipf’s Law

Illustration of the rank-frequency law. Let the total number of word occurrences in the sample be N = 1,000,000.

Rank (R)   Term   Frequency (F)   R·(F/N)
   1       the        69,971       0.070
   2       of         36,411       0.073
   3       and        28,852       0.086
   4       to         26,149       0.104
   5       a          23,237       0.116
   6       in         21,341       0.128
   7       that       10,595       0.074
   8       is         10,099       0.081
   9       was         9,816       0.088
Methods that Build on Zipf’s Law

• Stop lists: ignore the most frequent words (upper cut-off). Used by almost all systems.
• Significant words: take the words in between the most frequent (upper cut-off) and least frequent words (lower cut-off).
• Term weighting: give differing weights to terms based on their frequency, with the most frequent words weighted less. Used by almost all ranking methods.
Explanations for Zipf’s Law

• The law has been explained by the “principle of least effort,” which makes it easier for a speaker or writer of a language to repeat certain words instead of coining new and different words.
  – Zipf’s explanation was his “principle of least effort,” which balances the speaker’s desire for a small vocabulary against the hearer’s desire for a large one.
• Zipf’s Law impact on IR:
  – Good news: stopwords account for a large fraction of text, so eliminating them greatly reduces inverted-index storage costs.
  – Bad news: for most words, gathering sufficient data for meaningful statistical analysis (e.g. for correlation analysis for query expansion) is difficult, since they are extremely rare.
Word Significance: Luhn’s Ideas

• Luhn’s idea (1958): the frequency of word occurrence in a text furnishes a useful measurement of word significance.
• Luhn suggested that both extremely common and extremely uncommon words are not very useful for indexing.
• For this, Luhn specifies two cut-off points, an upper and a lower cut-off, based on which non-significant words are excluded:
  – Words above the upper cut-off are considered too common
  – Words below the lower cut-off are considered too rare
  – Hence neither contributes significantly to the content of the text
  – The ability of words to discriminate content reaches a peak at a rank-order position halfway between the two cut-offs
• Let f be the frequency of occurrence of words in a text and r their rank in decreasing order of word frequency; then a plot relating f and r yields the curve shown on the next slide.
Luhn’s Ideas

[Figure: Luhn’s frequency-rank curve with upper and lower cut-offs; the resolving power of significant words peaks between the two cut-offs.]

Luhn (1958) suggested that both extremely common and extremely uncommon words are not very useful for document representation and indexing.
Vocabulary Size: Heaps’ Law

Dictionaries contain 600,000+ words, but the words do not include names of people, locations, products, etc.

Heaps’ law estimates the vocabulary size of a given corpus.
Vocabulary Growth: Heaps’ Law

How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus?
  This determines how the size of the inverted index will scale with the size of the corpus.

Heaps’ law estimates the vocabulary size of a given corpus: the vocabulary grows as O(n^β), where β is a constant between 0 and 1.
If V is the size of the vocabulary and n is the length of the corpus in words, Heaps’ law gives:

    V = K · n^β

with constants typically K ≈ 10–100 and β ≈ 0.4–0.6 (approximately a square root).
Heaps’ Distribution

• Distribution of the size of the vocabulary: there is a linear relationship between vocabulary size and the number of tokens on a log-log plot (i.e., a power-law relationship).
• Example: from 1,000,000,000 documents, there may be 1,000,000 distinct words. Can you agree?
Example

We want to estimate the size of the vocabulary for a corpus of 1,000,000 words. However, we only know statistics computed on smaller corpora:
  For 100,000 words, there are 50,000 unique words.
  For 500,000 words, there are 150,000 unique words.

Estimate the vocabulary size for the 1,000,000-word corpus.
How about for a corpus of 1,000,000,000 words?
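One way to work this exercise (a sketch of mine, not a solution given in the slides) is to fit the Heaps’ law constants K and β to the two known points and then extrapolate:

import math

# Known points: (corpus size n, vocabulary size V)
n1, v1 = 100_000, 50_000
n2, v2 = 500_000, 150_000

# V = K * n**beta  =>  beta = log(v2/v1) / log(n2/n1),  K = v1 / n1**beta
beta = math.log(v2 / v1) / math.log(n2 / n1)  # ≈ 0.68
K = v1 / n1 ** beta                           # ≈ 19.3

for n in (1_000_000, 1_000_000_000):
    print(f"n = {n:>13,}   V ≈ {K * n ** beta:,.0f}")
# Roughly 240,000 unique words at n = 1,000,000
# and about 27 million at n = 1,000,000,000

Note that the fitted β ≈ 0.68 lies a little above the typical 0.4–0.6 range quoted earlier, so the billion-word extrapolation should be treated as a rough estimate.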
Text Operations

• Not all words in a document are equally significant for representing the contents/meaning of the document
  – Some words carry more meaning than others
  – Nouns tend to be the most representative of a document’s content
• Therefore, one needs to preprocess the text of the documents in a collection to select the words to be used as index terms
  – Using the set of all words in a collection to index documents creates too much noise for the retrieval task
  – Reducing noise means reducing the set of words which can be used to refer to the document
• Preprocessing is the process of controlling the size of the vocabulary, i.e., the number of distinct words used as index terms
  – Preprocessing leads to an improvement in information retrieval performance
• However, some search engines on the Web omit preprocessing: there, every word in the document is an index term
Text Operations

• Text operations are the process of transforming text into its logical representation
• The main operations for selecting index terms, i.e. for choosing the words/stems (or groups of words) to be used as indexing terms, are:
  – Lexical analysis/tokenization of the text: handling digits, hyphens, punctuation marks, and the case of letters
  – Elimination of stop words: filtering out words which are not useful in the retrieval process
  – Stemming of words: removing affixes (prefixes and suffixes)
  – Construction of term categorization structures such as a thesaurus, to capture the relationships that allow expansion of the original query with related terms
Generating Document Representatives

Text processing system:
  Input text: full text, abstract, or title
  Output: a document representative adequate for use in an automatic retrieval system

The document representative consists of a list of class names, each name representing a class of words occurring in the total input text. A document will be indexed by a name if one of its significant words occurs as a member of that class.

  documents → tokenization → stop-word removal → stemming → thesaurus → index terms
Lexical Analysis/Tokenization of Text

Change the text of the documents into the words to be adopted as index terms.
Objective: identify words in the text.

Digits, hyphens, punctuation marks, and the case of letters all require decisions:
  Numbers are usually not good index terms (like 1910, 1999); but some, like 510 B.C., are unique and worth keeping.
  Hyphens: break up the words (e.g. state-of-the-art → state of the art); but some words, e.g. gilt-edged, B-49, are unique words which require their hyphens.
  Punctuation marks: remove totally unless significant, e.g. in program code: x.exe versus xexe.
  Case of letters: usually not important; all letters can be converted to upper or lower case.
Tokenization

Analyze text into a sequence of discrete tokens (words).
  Input: “Friends, Romans and Countrymen”
  Output: tokens (a token is an instance of a sequence of characters that are grouped together as a useful semantic unit for processing):
    Friends
    Romans
    and
    Countrymen
Each such token is now a candidate for an index entry, after further processing.
But what are valid tokens to emit?
Issues in Tokenization

• One word or multiple: how do you decide whether it is one token, or two, or more?
  – Hewlett-Packard → Hewlett and Packard as two tokens?
    • state-of-the-art: break up the hyphenated sequence?
    • San Francisco, Los Angeles
    • Addis Ababa, Arba Minch
  – lowercase, lower-case, lower case?
    • data base, database, data-base
• Numbers:
  • dates (3/12/91 vs. Mar. 12, 1991)
  • phone numbers
  • IP addresses (100.2.86.144)
Issues in Tokenization

• How to handle special cases involving apostrophes, hyphens, etc.? C++, C#, URLs, emails, …
  – Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token.
  – However, frequently they are not.
• The simplest approach is to ignore all numbers and punctuation and use only case-insensitive, unbroken strings of alphabetic characters as tokens (see the sketch below).
  – Generally, don’t index numbers as text, though they are often very useful; systems will often index “meta-data” (creation date, format, etc.) separately.
• Issues of tokenization are language-specific
  – They require the language to be known
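A minimal sketch (mine, not from the slides) of that simplest approach, keeping only case-folded alphabetic runs:

import re

def simple_tokenize(text):
    """Lowercase the text and keep only unbroken runs of alphabetic characters."""
    return re.findall(r"[a-z]+", text.lower())

print(simple_tokenize("Friends, Romans and Countrymen"))
# ['friends', 'romans', 'and', 'countrymen']
print(simple_tokenize("Mr. O'Neill thinks it's 1999!"))
# ['mr', 'o', 'neill', 'thinks', 'it', 's']  – apostrophes and numbers are lost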
Exercise: Tokenization

How would you tokenize the following?
  The cat slept peacefully in the living room. It’s a very old cat.
  Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.
Elimination of Stopwords

• Stopwords are extremely common words across document collections that have no discriminatory power
  – They may occur in 80% of the documents in a collection.
  – They appear to be of little value in helping select documents matching a user need, and need to be filtered out from the potential index terms
• Examples of stopwords are articles, pronouns, prepositions, conjunctions, etc.:
  – articles (a, an, the); pronouns (I, he, she, it, their, his)
  – some prepositions (on, of, in, about, besides, against, over)
  – conjunctions/connectors (and, but, for, nor, or, so, yet)
  – verbs (is, are, was, were)
  – adverbs (here, there, out, because, soon, after) and
  – adjectives (all, any, each, every, few, many, some)
  can also be treated as stopwords
• Stopwords are language dependent. A small filtering sketch follows below.
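A minimal sketch (mine; the tiny stoplist is an assumed sample — real stop lists run to a few hundred words):

# Toy stoplist for illustration only.
STOPWORDS = {"a", "an", "the", "of", "in", "and", "is", "it", "to"}

def remove_stopwords(tokens):
    """Drop tokens that appear in the stoplist."""
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["the", "cat", "slept", "in", "the", "living", "room"]))
# ['cat', 'slept', 'living', 'room']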
Stopwords

• Intuition:
  – Stopwords have little semantic content; it is typical to remove such high-frequency words.
  – Stopwords take up around 50% of the text; hence, removing them reduces document size by 30–50%.
• Benefits:
  – Smaller indices for information retrieval.
  – Good compression techniques for indices: the 30 most common words account for about 30% of the tokens in written text.
  – Better approximation of importance for classification, summarization, etc.
How to Determine a List of Stopwords?

• One method: sort terms in decreasing order of collection frequency and take the most frequent ones (a sketch follows below).
  – Problem: in a collection about insurance practices, “insurance” would become a stopword.
• Another method: build a stopword list that contains a set of articles, pronouns, etc.
  – Why do we need stop lists? With a stop list, we can exclude the commonest words from the index terms entirely.
• With the removal of stopwords, we get a better approximation of importance for classification, summarization, etc.
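A sketch (mine) of the frequency-based method; it reuses the simple_tokenize helper from the earlier tokenization sketch, and the corpus file name is hypothetical:

from collections import Counter

def derive_stoplist(tokens, k=30):
    """Take the k most frequent terms in the collection as the stoplist."""
    return {term for term, _ in Counter(tokens).most_common(k)}

# Usage:
# tokens = simple_tokenize(open("corpus.txt").read())
# stoplist = derive_stoplist(tokens, k=30)

Note the caveat from the slide: on a domain-specific collection, content words like “insurance” can land in such a list.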
Stop Words

• Stopword elimination used to be standard in older IR systems.
• But the trend is away from doing this. Most web search engines index stopwords:
  – Good query optimization techniques mean you pay little at query time for including stopwords.
  – You need stopwords for:
    • Phrase queries: “King of Denmark”
    • Various song titles, etc.: “Let it be”, “To be or not to be”
    • “Relational” queries: “flights to London”
  – Elimination of stopwords might reduce recall (e.g. for “To be or not to be”, with everything eliminated except “be”, retrieval is empty or irrelevant)
Normalization

• Normalization is canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens; it is, in a way, standardization of the text.
  – We need to “normalize” terms in the indexed text, as well as query terms, into the same form.
  – Example: we want to match U.S.A. and USA, by deleting periods in a term.
• Case folding: it is often best to lowercase everything, since users will use lowercase regardless of “correct” capitalization:
  – Republican vs. republican
  – Fasil vs. fasil vs. FASIL
  – Anti-discriminatory vs. antidiscriminatory
  – Car vs. automobile?
A small sketch of both normalizations follows below.
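A minimal sketch (mine, not the slides’) of period deletion plus case folding:

def normalize(token):
    """Delete periods (U.S.A. -> USA) and case-fold to lowercase."""
    return token.replace(".", "").lower()

print(normalize("U.S.A."))       # usa
print(normalize("Republican"))   # republican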
Normalization Issues

• Good for:
  – Allowing instances of Automobile at the beginning of a sentence to match a query for automobile
  – Helping a search engine when most users type ferrari while they are interested in a Ferrari car
• Bad for:
  – Proper names vs. common nouns, e.g. General Motors, Associated Press, …
• Possible solution:
  – Lowercase only words at the beginning of the sentence
• In IR, lowercasing everything is most practical because of the way users issue their queries
Stemming/Morphological Analysis

• Stemming reduces tokens to their “root” form in order to recognize morphological variation.
  – The process involves removal of affixes (i.e. prefixes and suffixes), with the aim of reducing variants to the same stem.
  – It often removes the inflectional and derivational morphology of a word.
    • Inflectional morphology: varies the form of a word in order to express grammatical features, such as singular/plural or past/present tense. E.g. boy → boys, cut → cutting.
    • Derivational morphology: makes new words from old ones. E.g. creation is formed from create, but they are two separate words; likewise destruction → destroy.
• Stemming is language dependent.
  – Correct stemming is language specific and can be complex.
  – For example, “compressed and compression are both accepted” stems to “compress and compress are both accept”.
Stemming

• The final output from a conflation algorithm (one that reduces variant words to the same token) is a set of classes, one for each stem detected.
  – A stem is the portion of a word which is left after the removal of its affixes (i.e., prefixes and/or suffixes).
  – Example: ‘connect’ is the stem for {connected, connecting, connection, connections}.
  – Thus, [automate, automatic, automation] all reduce to: automat.
• A class name is assigned to a document if and only if one of its members occurs as a significant word in the text of the document.
  – A document representative then becomes a list of class names, which are often referred to as the document’s index terms/keywords.
• Queries are handled in the same way.
Ways to Implement Stemming

There are basically two ways to implement stemming; a sketch of both follows below.

The first approach is to create a big dictionary that maps words to their stems.
  The advantage of this approach is that it works perfectly (insofar as the stem of a word can be defined perfectly).
  The disadvantages are the space required by the dictionary and the investment required to maintain the dictionary as new words appear.

The second approach is to use a set of rules that extract stems from words.
  The advantages of this approach are that the code is typically small, and it can gracefully handle new words as they appear.
  The disadvantage is that it occasionally makes mistakes. But, since stemming is imperfectly defined anyway, occasional mistakes are tolerable, and the rule-based approach is the one that is generally chosen.
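A toy contrast of the two approaches (mine; both the tiny dictionary and the two rules are assumptions for illustration, not a real stemmer):

# Approach 1: dictionary lookup — exact, but must be maintained by hand.
STEM_DICT = {"connected": "connect", "connections": "connect", "ponies": "poni"}

def dict_stem(word):
    return STEM_DICT.get(word, word)  # unknown words pass through unchanged

# Approach 2: rules — tiny code, handles unseen words, occasionally wrong.
def rule_stem(word):
    if word.endswith("ies"):
        return word[:-3] + "i"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

print(dict_stem("connections"), rule_stem("connections"))  # connect connection
print(dict_stem("galaxies"), rule_stem("galaxies"))        # galaxies galaxi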
Porter Stemmer

• Stemming is the operation of stripping the suffixes from a word, leaving its stem.
  – Google, for instance, uses stemming to search for web pages containing the words connected, connecting, connection and connections when users ask for a web page that contains the word connect.
• In 1979, Martin Porter developed a stemming algorithm that uses a set of rules to extract stems from words, and though it makes some mistakes, most common words seem to work out right.
  – Porter describes his algorithm and provides a reference implementation in C at http://tartarus.org/~martin/PorterStemmer/index.html
Porter Stemmer

• It is the most common algorithm for stemming English words to their common grammatical root.
• It uses a simple procedure for removing known affixes in English without using a dictionary. To get rid of plurals, the following rules are used:
  – SSES → SS      caresses → caress
  – IES → I        ponies → poni
  – SS → SS        caress → caress
  – S → (nil)      cats → cat
• A further rule deletes a final EMENT if what remains is longer than one character:
  – replacement → replac
  – cement → cement
While step 1a gets rid of plurals, step 1b removes -ed or -ing, e.g.:

  agreed  -> agree      disabled -> disable
  matting -> mat        mating   -> mate
  meeting -> meet       milling  -> mill
  messing -> mess       meetings -> meet
  feed    -> feed
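If NLTK is available (an assumption; install it with pip install nltk), its Porter implementation can be run directly on these examples:

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "cats", "replacement", "cement",
             "agreed", "matting", "mating", "meetings", "feed"]:
    print(word, "->", stemmer.stem(word))
# caresses -> caress, ponies -> poni, agreed -> agree, matting -> mat,
# mating -> mate, meetings -> meet, feed -> feed, etc.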
Stemming: Challenges

• It may produce unusual stems that are not English words:
  – e.g. removing ‘UAL’ from FACTUAL and EQUAL leaves FACT and EQ.
• It may conflate (reduce to the same token) words that are actually distinct:
  – “computer”, “computational”, and “computation” are all reduced to the same token (“comput”).
• It may not recognize all morphological derivations.
Thesauri

• Often, full-text searching cannot be accurate, since different authors may select different words to represent the same concept.
  – Problem: the same meaning can be expressed using different terms that are synonyms (a word or phrase which has the same or nearly the same meaning as another word or phrase in the same language), homonyms (a word that sounds the same or is spelled the same as another word but has a different meaning), or otherwise related terms.
  – How can we ensure that, for the same meaning, identical terms are used in the index and in the query?
Thesauri

• Thesaurus: the vocabulary of a controlled indexing language, formally organized so that a priori relationships between concepts (for example “broader” and “related”) are made explicit.
• A thesaurus contains terms and relationships between terms.
  – IR thesauri typically rely upon the use of symbols such as USE/UF (UF = used for), BT (broader term), and RT (related term) to demonstrate inter-term relationships.
  – e.g. car = automobile, truck, bus, taxi, motor vehicle
         color = colour, paint
Aim of Thesaurus

• A thesaurus tries to control the use of the vocabulary by showing a set of related words, to handle synonyms and homonyms.
• The aims of a thesaurus are therefore:
  – To provide a standard vocabulary for indexing and searching
    • The thesaurus rewrites terms to form equivalence classes, and we index such equivalences.
    • When the document contains automobile, index it under car as well (usually also vice versa).
  – To assist users with locating terms for proper query formulation: when the query contains automobile, look under car as well, expanding the query (a small sketch follows below).
  – To provide classified hierarchies that allow the broadening and narrowing of the current request according to user needs.
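A minimal sketch (mine; the tiny synonym table is an assumed toy example) of thesaurus-based query expansion:

# Toy equivalence classes: each term maps to its class of interchangeable terms.
THESAURUS = {
    "car": {"car", "automobile", "motor vehicle"},
    "automobile": {"car", "automobile", "motor vehicle"},
    "colour": {"colour", "color"},
}

def expand_query(terms):
    """Replace each query term by its full equivalence class, if known."""
    expanded = set()
    for t in terms:
        expanded |= THESAURUS.get(t, {t})
    return expanded

print(expand_query(["automobile", "red"]))
# {'car', 'automobile', 'motor vehicle', 'red'}  (set order may vary)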
Thesaurus Construction

Example: a thesaurus built to assist IR in searching for cars and vehicles:

  TERM: Motor vehicles
    UF: Automobiles
        Cars
        Trucks
    BT: Vehicles
    RT: Road Engineering
        Road Transport
More Examples

Example: a thesaurus built to assist IR in the field of computer science:

  TERM: natural languages
    UF: natural language processing (UF = used for)
    BT: languages (BT = broader term)
    NT: languages (NT = narrower term)
    TT: languages (TT = top term)
    RT: artificial intelligence (RT = related term/s)
        computational linguistics
        formal languages
        query languages
        speech recognition
Language-Specificity

Many of the above features embody transformations that are:
  Language-specific, and
  Often application-specific
These are “plug-in” addenda to the indexing process.
Both open-source and commercial plug-ins are available for handling these.
Index Term Selection

The index language is the language used to describe documents and requests.
Elements of the index language are index terms, which may be derived from the text of the document to be described, or may be arrived at independently.
  If a full-text representation of the text is adopted, then all words in the text are used as index terms (full-text indexing).
  Otherwise, the words to be used as index terms need to be selected, reducing the size of the index file, which is basic to designing an efficient searching IR system.