SEM: VI CLASS: TYCS SUB: INFORMATION RETRIEVAL (IR)
Multiple Choice Questions (Question Bank)
1) Which of the following is not a source used in Mid Infrared Spectrophotometer?
a) Nernst glower
b) High pressure mercury arc lamp
c) Globar
d) Nichrome wire.
2) Which of the following is the wave number of near infrared spectrometer?
a) 4000 – 200 cm-1
b) 200 – 10 cm-1
c) 12500 – 4000 cm-1
d) 50 – 1000 cm-1.
3) Which of the following is not a composition of Nernst glower or Nernst filament?
a) Oxides of Zirconium
b) Oxides of Barium
c) Oxides of Yttrium
d) Oxides of Thorium
4) What is the composition of Globar rod which is used as a source in Mid IR
spectroscopy?
a) Silicon carbide
b) Silver chloride
c) Silicon dioxide
d) Silver carbide
5) Bolometer, a type of detector, is also known as
a) Resistance temperature detector (RTD)
b) Thermistor
c) Thermocouple
d) Golay cell
6) Which of the following is not used as pyroelectric material used in pyroelectric
transducers in Infrared spectroscopy?
a) Triglycine Sulphate
b) Deuterated Triglycine Sulphate
c) Some Polymers
d) Tetraglycine sulphate
7) A model of information retrieval in which we can pose any query in which
search terms are combined with the operators AND, OR, and NOT
a) Ad Hoc Retrieval
b) Ranked Retrieval Model
c) Boolean Information Model
d) Proximity Query Model
8) A data structure that maps terms back to the parts of a document in which they occur
is called an
a) Postings list
b) Incidence Matrix
c) Dictionary
d) Inverted Index
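For reference, a minimal Python sketch of the data structure described in question 8: each term maps to a sorted list of the documents that contain it. The function name build_inverted_index and the three sample documents are assumptions made for illustration only, and tokenization is simplified to lowercasing and whitespace splitting.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted list of the docIDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Simplified tokenization: lowercase, split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    # Postings are kept sorted by docID so that lists can be merged linearly.
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical toy collection.
docs = {1: "new home sales", 2: "home sales rise", 3: "new rise"}
index = build_inverted_index(docs)
print(index["home"])
```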
9) Stemming increases the size of the vocabulary.
True
False
10) In information retrieval, extremely common words which would appear to be of little value in
helping select documents that are excluded from the index vocabulary are called:
a) Stop Words
b) Tokens
c) Lemmatized Words
d) Stemmed Terms
11) A crude heuristic process that chops off the ends of the words to reduce inflectional forms of words
and reduce the size of the vocabulary is called
a) Lemmatization
b) Case Folding
c) True casing
d) Stemming
12) Which of the following is a technique for context sensitive spelling correction
a) the Jaccard Coefficient
b) Soundex algorithms
c) k-gram indexes
d) Levenshtein distance
13) For a very large collection of books of classic literature the most appropriate indexing algorithm
would be
a) Block sort-based indexing algorithm
b) Single-pass in memory indexing algorithm
c) Distributed Map-Reduce indexing algorithm
d) Dynamic indexing process employing an auxiliary index
14) An index that includes sequences of words or terms of variable length that have been extracted from
a source document is called a
a) Phrase Index
b) Biword index
c) Positional index
d) Inverted Index
15) For a large collection of documents such as the internet that experience frequent change the most
appropriate indexing algorithm would be
a) Block sort-based indexing algorithm
b) Single-pass in memory indexing algorithm
c) Distributed Map-Reduce indexing algorithm
d) Dynamic indexing process employing an auxiliary index
15) Hashing is a process where an item is reduced, through a mathematical process, to an integer.
True
False
16) The formula used to estimate the vocabulary size of a collection is known as:
a) Zipf's law
b) Power law
c) Heap's law
d) Compression ratio
17) An approach to compression that takes advantage of the redundancy in the dictionary that results
from common prefixes that come from sorted terms is called:
a) Front Coding
b) Blocked storage
c) Prefix Coding
d) Variable byte encoding
18) A scheme where a weight is assigned to a term based upon the number of occurrences of the term
within a document is called
a) Bag of Words
b) Document Frequency
c) Term Frequency
d) Optimal weight
19) A measure of similarity between two vectors which is determined by measuring the angle between
them is called:
a) Cosine similarity
b) Sin similarity
c) Vector similarity
d) Vector scoring
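The measure in question 19 can be sketched in a few lines of Python. This is an illustrative sketch rather than a prescribed implementation: the function name cosine_similarity and the convention of returning 0.0 for a zero-length vector are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumed convention: an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)
```

Vectors that point in the same direction score 1 regardless of their lengths, and vectors with no terms in common score 0.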
20) A group of related documents against which information retrieval is employed is called:
a) Corpus
b) Text Database
c) Index Collection
d) Repository
21) A metric derived by taking the log of N divided by the document frequency where N is the total
number of documents in a collection is called
a) document frequency
b) tf-idf weight
c) collection frequency
d) inverse document frequency
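The metric described in question 21 can be written directly as a small function. The question only says "log", so the base-10 logarithm used here is an assumption, as is the function name.

```python
import math


def inverse_document_frequency(n_docs, doc_freq):
    """log(N / df): rare terms get a high weight, common terms a low one."""
    # Base 10 is an assumption; the base only rescales all weights uniformly.
    return math.log10(n_docs / doc_freq)
```

A term that occurs in every document gets a weight of 0, which is why the measure lowers the weight of very common terms.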
22) A web page whose content doesn't vary from one request to another is called a:
a) Text Page
b) Dynamic Page
c) Active Server Page
d) Static Page
23) A program that captures and indexes content from web pages is known as what insect:
a) Fly
b) Centipede
c) Mosquito
d) Spider
24) To evaluate the effectiveness of an IR system the output from a standard query executed against the
test IR system is compared with the known output from a:
a) internet collection
b) reference book
c) separate IR system.
d) standard test collection
25) Which of the following is NOT one of the types of queries in a complete search system discussed in
our text?
a) Wildcard Query
b) Boolean retrieval
c) Phrase Query
d) Ranked retrieval Query
26) The standard approach to information retrieval system evaluation revolves around the notion of:
a) Quantity of documents in the collection
b) Relevant and non relevant documents.
c) Accuracy
d) user happiness
27) Which of the following items is not a component of a complete search system?
a) Document cache
b) Indexers
c) Spell correction
d) Horizontal index
28) An approach to computing scores in an IR system that orders documents in the posting list of a term
by decreasing order of term frequency is called:
a) Champion list
b) Impact ordering
c) Cluster pruning
d) Tiered indexes
29) A web link within a web page that references another part of the same page is called a:
a) Out link
b) Vector
c) In link
d) Tendril
30) Information retrieval is querying of ------------ textual data.
a) structured
b) unstructured
c) Formatted
d) None
31) The number of documents in the collection that contain a term t is called
a) Document Index dit
b) Document frequency dft
c) Document Inverse dint
d) Document Incidence Matrix dimt
32) CPM stands for
a) Cost per migrating
b) Cost per making
c) Cost per manage
d) Cost per mille
33) ------------ is the fraction of the returned results that are relevant to the information need.
a) Proximity
b) Posting Merge
c) Precision
d) Posting list
34) A dictionary of terms is sometimes also referred to as
a) Corpus
b) Collection
c) Lexicon
d) None of the above
35) SEO stands for
a) Search engine order
b) Search engine organizer
c) Search engine option
d) Search engine optimization
36) ------------ filtering recommends products which are similar to the ones that a user has liked in the past.
a) Collaborative based
b) Context based
c) Collection based
d) Content based
37) ------------ is the fraction of the relevant documents in the collection returned by the system.
a) Reconnect
b) Recall
c) Reciprocal
d) Retrieved
38) ------------ is a page that contains actual information on a topic.
a) Authority
b) Hub
c) Hyperlinks
d) Image
39) Given two strings s1 and s2, the edit distance between them is sometimes known as the
a) Levenshtein distance
b) Isolated-term distance
c) k-gram overlap
d) Jaccard Coefficient
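As an illustration of the edit distance mentioned in question 39, here is a minimal dynamic-programming sketch. It assumes unit cost for each insertion, deletion, and substitution; the function name levenshtein is chosen for illustration.

```python
def levenshtein(s1, s2):
    """Minimum number of single-character edits turning s1 into s2."""
    # prev[j] holds the distance between the processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to the empty string
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

Only two rows of the table are kept at a time, so memory use is proportional to the length of s2.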
40) Hadoop is a framework that works with a variety of related tools. Common cohorts include
a) MapReduce, Hive and HBase
b) MapReduce, MySQL and Google Apps
c) MapReduce, Hummer and Iguana
d) MapReduce, Heron and Trumpet
41) The purpose of the inverse document frequency is to increase the weight of terms with high
collection frequency
a) True
b) False
42) The basic operation of a web browser is to pass a request to the web server. This request is an
address for a web page and is known as the
a) UAL: Universal Address Locator
b) HTML: Hypertext Markup Language
c) URL: Uniform Resource Locator
d) HTTP: Hypertext transfer protocol
43) Collaborative Filtering has the following problems
a) Cold Start
b) Scalability
c) Sparsity
d) All of the above
44) Input, Purpose and Output are the factors of ------------.
a) Summarization
b) Question Answering
c) Page Rank
d) Personalized Search
45) Information retrieval systems have much in common with
a) Filing systems
b) Transaction systems
c) Database systems
d) Management systems
46) A deadlock can be broken down by
a) Committing one or more transactions
b) Aborting one or more transactions
c) Rolling back one or more transactions
d) Terminating one or more transactions
47) Which one of the following is not a Test Collection and Evaluation Series?
a) Text Retrieval Conference (TREC)
b) NII Test Collections for IR Systems (NTCIR)
c) Cross Language Evaluation Forum (CLEF)
d) Collaborative Filtering
48) Information is
a) Data
b) Processed Data
c) Manipulated input
d) Computer output
49) Online transaction processing is used because
a) Disk is used for storing files
b) It is efficient
c) It can handle random queries.
d) Transactions occur in batches
50) The quality of information which is based on understanding user needs
a) Complete
b) Trustworthy
c) Relevant
d) None of the above
51) The primary storage medium for storing archival data is
a) Floppy disk
b) Magnetic disk
c) Magnetic tape
d) CD-ROM
53) Organizations have hierarchical structures because
a) It is convenient to do so
b) It is done by every organization
c) Specific responsibilities can be assigned for each level
d) It provides opportunities for promotions
54) Operational information is
a) Haphazard
b) Well organized
c) Unstructured
d) Partly structured
55) Operational information is needed for
a) Day to day operations
b) Meet government requirements
c) Long range planning
d) Short range planning
56) Data by itself is not useful unless
a) It is massive
b) It is processed to obtain information
c) It is collected from diverse sources
d) It is properly stated
57) For taking decisions data must be
a) Very accurate
b) Massive
c) Processed correctly
d) Collected from diverse sources
58) One of the applications of Personalized Search is:
a) Google
b) Yahoo
c) IBM
d) Alpha Search Engine
59) Boolean retrieval model does not provide provision for:
a) Ranked search
b) Proximity search
c) Phrase search
d) Both proximity and ranked search
60) Which is a good idea for using skip pointers?
a) Fewer skips, larger skip spans
b) None
c) Depends upon the no. of comparisons needed
d) More skips, shorter skip spans
70) Edit distance (Levenshtein distance) is a way of:
a) Context-sensitive spelling correction
b) Document correction
c) Isolated word correction
d) Phonetic correction
71) Permuterm indices are used for solving:
a) None
b) Boolean queries
c) Phrase queries
d) Wildcard queries
72) A benefit of using a hash table is
a) Do not need to rehash everything periodically if vocabulary keeps growing.
b) Lookup in a hash table is faster than lookup in a tree.
c) All of the above
d) No prefix search is required
73) Variable-size postings lists are used when:
a) More seek time is desired and the corpus is dynamic
b) Less seek time is desired and the corpus is dynamic
c) No seek time is desired and the corpus is static
d) Time is desired and the corpus is dynamic
74) Unstructured data tends to refer to information on the web and is processed using:
a) Both
b) Database systems
c) IR systems
d) None
75) If list lengths are x and y, merge takes:
a) O(Yn) operations
b) O(xy) operations
c) O(xn) operations
d) O(x+y) operations
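The merge referred to in question 75 can be sketched as a two-pointer walk over two postings lists sorted by docID. This is a minimal sketch under that sorting assumption; the function name intersect is illustrative.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID."""
    answer = []
    i = j = 0
    # Each iteration advances at least one pointer, so the loop runs
    # at most len(p1) + len(p2) times.
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer
```

The linear bound depends on both lists being sorted by the same key, which is why postings are kept in docID order.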
76) Term-document incidence matrix is:
a) Sparse
b) Depends upon the data
c) Dense
d) Cannot predict
77) Blocked sort-based Indexing is a method of:
a) Sorting with more disk seeks.
b) Merging with fewer disk seeks.
c) Comparing with fewer disk seeks.
d) Sorting with fewer disk seeks.
78) Issues in biword indexes are:
a) Any one
b) Index blowup due to bigger dictionary
c) Both
d) False positives
79) Best implementation approach for dynamic indexing is:
a) Periodic re-indexing
b) Using Invalidation bit-vector for deleted docs
c) None
d) Using logarithmic merge
80) The goal of IR is to:
a) Find documents relevant to an information need
b) Find documents relevant to an information need from a given document set
c) Find documents relevant to an information need from a large document set
d) Find documents relevant to an information need from a small document set
81) For postings of length L, no. of skip pointers required are:
a) Use L evenly-spaced skip pointers
b) Use L^2 evenly-spaced skip pointers.
c) Use L^(1/2) evenly-spaced skip pointers
d) Use 2L evenly-spaced skip pointers.
82) Postings list should be sorted by:
a) Document Frequency
b) DocID
c) TermID
d) Term frequency
83) Benefits of using B-trees:
a) Re-balancing is cheap
b) Balanced trees allow efficient retrieval
c) Faster O(log M)
d) Solves the prefix problem
84) For ad hoc information retrieval system evaluation, ------------ is/are the test collections.
a) Cranfield
b) TREC
c) Only a
d) Both a and b
85) The basic formula for paid placement is
a) Pay-per-click ($) = Advertising cost ($) ÷ Ads clicked (#)
b) Pay-per-click ($) = Advertising cost ($) * Ads clicked (#)
c) Pay-per-click ($) = Advertising cost ($) * Ads clicked (#)
d) Both a and b
86) Every web page is assigned ------------ score(s).
a) 1
b) 2
c) 4
d) 3
87) ------------ maintains the file system tree and the metadata for all the files and
directories present in the system.
a) Namenode
b) Datanode
c) Mapper
d) Tracker
88) ------------ nodes can be reached from the giant SCC but cannot reach it.
a) In
b) Out
c) Gcc
d) in-out
89) The first special index for general wildcard queries is the ------------.
a) k-term index
b) Permuterm index
c) B-tree
d) Hashes
90) ------------ mainly encodes numerical and non-text attribute-value data.
a) Data centric XML
b) Text centric XML
c) Both a and b
d) User centric XML
91) Permuterm indexes are used for solving
a) Spelling Checking
b) Boolean queries
c) Phrase queries
d) Wildcard queries
92) A query such as mon* is known as a
a) Trailing wildcard query
b) Leading wildcard query
c) Both a and b
d) Mixed wildcard query
93) CLEF stands for
a) Cross Language Evaluation Forum
b) Cross lingual evaluating field
c) Cross Language Evaluating Field
d) Cross Language Evaluating Forum
94) Precision (P) is the fraction of
a) P(retrieved/relevant)
b) P(relevant/true)
c) P(relevant/retrieved)
d) P(retrieved/true)
95) Each node of the tree is an XML element and is written with an
a) Opening tag
b) Closing tag
c) Both a and b
d) Only a
96) ------------ is not one of the basic ranking models of an information retrieval system.
a) Boolean Retrieval
b) Vector Space model
c) Probabilistic model
d) Data model
97) A good ------------ page for a topic links to many authority pages for that topic.
a) Crawler
b) SEO
c) Web
d) Hub
98) ------------ is the number of documents that contain the term.
a) Term
b) Df
c) Idf
d) Inverse df
99) ------------ includes link building, increasing link popularity by submitting open
directories, search engines, link exchange, etc.
a) Off Page SEO
b) In Page SEO
c) Middle Page SEO
d) Both a and b
101) Document frequency of a term is the
a) Number of documents that contain the term
b) None of the above
c) Number of times the term appears in the document
d) Number of times the term appears in the collection
102) Boolean queries often result in
a) Too many or too few results
b) None of the above
c) Too few results
d) Too many results
103) Ranked retrieval models take as input
a) None of the given
b) Boolean queries
c) Logical queries
d) Free text queries
104) What is contiguity hypothesis in vector space classification
a) Documents from different classes don’t overlap
b) Documents in the same class form a contiguous region of space
c) All of the above.
d) Intra cluster similarity is higher than inter-cluster similarity
106) Strategic information is needed for
a) Day to day operations
b) Meet government requirements
c) Long range planning
d) Short range planning
107) Strategic information is required by
a) Middle managers
b) Line managers
c) Top managers
d) All workers
108) Tactical information is needed for
a) Day to day operations
b) Meet government requirements
c) Long range planning
d) Short range planning
109) The ------------ is a wildcard that represents one or more characters
a) Question mark
b) Asterisk
c) Exclamation mark
d) Dollar sign
110) The Search tool is best used when searching for which kind of data.
a) Simple
b) Multiple
c) Unique
d) Formatted
111) Given a document collection which has 35 relevant documents, if an IR system retrieves 10 relevant
and 13 irrelevant documents, what is the precision value of the system?
a) 0.43
b) 0.28
c) 0.33
d) 0.66
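The arithmetic behind question 111 can be checked with a short sketch. Precision divides the relevant retrieved documents by everything retrieved (10 relevant plus 13 irrelevant), while recall divides by all relevant documents in the collection (35). The function names are assumptions for illustration.

```python
def precision(relevant_retrieved, total_retrieved):
    """Fraction of the retrieved documents that are relevant."""
    return relevant_retrieved / total_retrieved


def recall(relevant_retrieved, total_relevant):
    """Fraction of the relevant documents that were retrieved."""
    return relevant_retrieved / total_relevant


# Figures from the question: 10 relevant and 13 irrelevant documents
# retrieved, 35 relevant documents in the whole collection.
p = precision(10, 10 + 13)
r = recall(10, 35)
print(round(p, 2), round(r, 2))
```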
112) If the two postings list are of length X and Y , then maximum number of operations needed for
merge is
a) Max(X, Y)
b) X+Y
c) X*Y
d) Min(X, Y)
113) A computer based information system is needed because
(i) The size of organizations has become large and data is massive
(ii) Timely decisions are to be taken based on available data
(iii) Computers are available
(iv) Difficult to get clerks to process data
a) (ii) and (iii)
b) (i) and (ii)
c) (i) and (iv)
d) (iii) and (iv)
114) Measures of similarity are as follows:
i. The lengths of the Documents.
ii. The number of terms in common.
iii. Whether the terms are common or unusual.
iv. How many times each term appears.
a) i) & ii)
b) ii) & iii)
c) iii) & iv)
d) i), ii), iii) & iv)
115) Proximity operator is a way of specifying that
a) Two terms in a query must occur close to each other in a document
b) Two terms in a query must occur in between in a document
c) Two terms in a query must occur close to each other in a document
d) None of the above
116) ------------ is the task of chopping documents into pieces.
a) Ranked
b) Wild card
c) Tokenization
d) Boolean retrieval
117) A ------------ is the class of all tokens containing the same character sequence.
a) Term
b) Token
c) Type
d) Sequence
118) The DOM represents
a) Elements
b) Attributes
c) Text
d) All of the above
119) Data-centric XML mainly encodes
a) Numerical
b) Non text attribute value data
c) Both a and b
d) None of the above
120) XML document retrieval is characterized by
a) Long text field
b) Inexact matching
c) Relevance -ranked results
d) All a, b and c
121) One disadvantage, as outlined in our text, of using a permuterm index for wild card queries is:
a) It requires complex code that is difficult to maintain
b) It has the risk of key collisions which are difficult to resolve
c) The required rotations creates a very large dictionary
d) It cannot be used to find terms that are not spelled correctly
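To see why a permuterm index enlarges the dictionary (question 121), consider how many rotations a single term produces. The sketch below assumes the usual convention of appending an end marker "$" before rotating; the function name permuterm_rotations is illustrative.

```python
def permuterm_rotations(term):
    """All rotations of term + '$'; each one becomes a dictionary key."""
    augmented = term + "$"
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]


rotations = permuterm_rotations("hello")
print(rotations)
```

A term of n characters yields n + 1 keys, so the dictionary grows by roughly the average term length plus one.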
122) Which of the following is NOT a benefit of index compression?
a) Simplified algorithm design
b) Reduction of disk space
c) Faster transfer of data from disk to memory
d) Increased Use of caching
123) Which is not an option for Filter on a text field
a) Begins With
b) Between
c) Contains
d) End With
124) Which major database object stores all data
a) Field
b) Query
c) Record
d) Table
125) Given a document containing the sentence “I left my left bag at my home” the number of tokens in
the sentence is
a) 2
b) 8
c) 6
d) 4
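The difference between tokens and types for the sentence in question 125 can be checked with a short sketch. It assumes simple whitespace tokenization with no case folding or stop-word removal.

```python
sentence = "I left my left bag at my home"

# Tokens are the individual word occurrences, repeats included.
tokens = sentence.split()

# Types are the distinct character sequences among those tokens.
types = set(tokens)

print(len(tokens), len(types))
```

The repeated words "left" and "my" each count twice as tokens but once as types.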
126) Phrase queries can be solved using N-grams.
True
False
127) When Lemmatization is applied to the term “Destruction” to which of the following form it gets
reduced?
a) Destination
b) Destruct
c) Destroy
d) Destruc
128) What is the soundex code for the term “amazing”?
a) A552
b) A252
c) A525
d) A255
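The Soundex code asked for in question 128 can be worked out with a short sketch. It follows one common formulation: keep the first letter, map the remaining letters to digits, collapse runs of identical digits, drop zeros, and pad to four characters. Other Soundex variants treat the first letter and the letters h and w slightly differently, so the exact rules here are an assumption, as are the names SOUNDEX_CODES and soundex.

```python
# Letter-to-digit table; vowels and h, w, y map to '0' and are dropped later.
SOUNDEX_CODES = {}
for letters, digit in (("aeiouhwy", "0"), ("bfpv", "1"), ("cgjkqsxz", "2"),
                       ("dt", "3"), ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit


def soundex(term):
    """Four-character phonetic code: first letter plus three digits."""
    letters = [ch for ch in term.lower() if ch in SOUNDEX_CODES]
    if not letters:
        return ""
    digits = [SOUNDEX_CODES[ch] for ch in letters[1:]]
    # Collapse each run of identical consecutive digits to a single digit.
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    # Drop the zeros, pad with trailing zeros, keep four characters.
    kept = [d for d in collapsed if d != "0"]
    return (letters[0].upper() + "".join(kept) + "000")[:4]


print(soundex("amazing"))
```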
129) A compression algorithm that results in some loss of data is called:
a) Zipf compression
b) Dictionary compression
c) Lossless compression
d) Lossy compression
130) The 30 most common words account for 30% of the tokens in written text is known as front coding.
True
False
131) An approach to retrieval in a search that is likely (but not precisely) to produce the top K
scoring documents is called:
a) Exact top K document retrieval
b) Top scoring document retrieval
c) Inexact top K document retrieval
d) Imprecise top K document retrieval
132) Recall is the fraction of non relevant documents that are retrieved.
True
False
133) In the context of web search engines the manipulation of web page content for the purpose of
appearing high up in search results for selected query terms is called:
a) Paid inclusion
b) SPAM
c) SEO
d) Link Analysis
134) Results from a search engine that are based upon the retrieval of items using a method of
term weighting such as cosine similarity is a form of
a) Sponsored Search
b) Algorithmic Search
c) Informational Search
d) Navigational Search
135) The list of web pages that a web crawler has queued up to index is called the:
a) Web Page Queue
b) Seed set
c) URL Filter
d) URL Frontier
136) In order to access a particular web site in the internet, the URL must be converted into an
IP address. Which service does this conversion?
a) HTTP
b) TNS
c) DNS
d) DHCP
137) The Search tool CANNOT be used on which major Access object
a) Forms
b) Queries
c) Reports
d) Tables
139) Which of the following is not a technique for preparing solid samples in IR spectroscopy?
a) Solids run in solution
b) Mull technique
c) Solid films
d) Thin films
140) Which of the following is the principle of Golay cell which is used as a detector in IR spectroscopy?
a) Expansion of gas upon heating
b) Increase in resistance due to an increase in temperature and vice versa
c) Temperature difference gives rise to a potential difference in the material
d) Decrease in resistance due to an increase in temperature
141) For a moderately large collection of static documents maintained on a single system the most
appropriate indexing algorithm would be:
a) Block sort-based indexing algorithm
b) Single-pass in memory indexing algorithm
c) Distributed Map-Reduce indexing algorithm
d) Dynamic indexing process employing an auxiliary index
142) Weighted zone scoring is sometimes referred to as ranked Boolean retrieval.
True
False
144) The process where multiple lists are evaluated using AND or OR operators in a Boolean retrieval
query is called an intersection operation.
True
False
145) Which of the following applications are used in IR
a) Indexing
b) Ranked retrieval
c) Web search
d) All of the above
146) The Components of IR are
a) The user-system interface
b) The matching subsystem
c) Both a and b
d) None of them.
147) The function of Information Retrieval is
a) To make necessary adjustment in the system based on feedback
b) The human- computer interface
c) Computer Vision
d) Cognitive Theory.
148) Arrange the following in sequence
a) Archie , web crawler , Google , wiseNut
b) Archie , google, wiseNut, web crawler
c) Google, Archie, web crawler, wiseNut
d) WiseNut, google, Archie, web crawler
149) Web can be characterised by
a) Search engines
b) Web directories
c) Hyperlink search
d) All of the above
150) SEO stands for
a) System effect off
b) Search engine optimization
c) Search effect optimization
d) System engine off
151) What is direct addressing
a) Distinct array position for every possible key
b) Fewer array position than keys
c) Fewer keys than array positions
d) None of the mentioned
152) What can be the technique to avoid collision ?
a) Make the hash function appear random
b) Use the chaining method
c) Use uniform hashing
d) All of the mentioned
153) What is a hash function ?
a) A function has allocated memory to keys
b) A function that computes the location of the key in the array
c) A function that creates an array
d) None of the mentioned
154) A document is ------------ if it is one that the user perceives as containing information of value with
respect to their need.
a) Query
b) Relevant
c) Adhoc
d) Irrelevant
155) An ------------ need is the topic about which the user desires to know more.
a) Information
b) Relevant
c) Statistical
d) None of the above
156) A search tree commonly used for a dictionary is the ------------.
a) Subtrees
b) B-tree
c) Interval tree
d) Web tree
157) The best-known search tree is the ------------, in which each internal node has two children.
a) Balanced tree
b) Unbalanced tree
c) Internal Node
d) Binary tree
158) ------------ is used to communicate with web servers on the internet, which enables it to download and
display the web pages.
a) Web server
b) Search service
c) Web browser
d) None of the above
159) ------------ is finding material of an unstructured nature that satisfies an information need from a large collection.
a) Adhoc query
b) Information retrieval
c) Conflation
d) Stemming
160) The core indexing step is ------------ the list so that the terms are arranged alphabetically.
a) Grouped
b) Normalized
c) Sorting
d) Recording
161) A search value can be an exact value or it can be
a) Logical operator
b) Relationship
c) Wild card character
d) Comparison operation
162) Instances of the same term are grouped and the result is split into
a) Classes
b) Columns
c) Both a and b
d) Dictionary
163) The ------------ operation is efficient so that we can quickly find the documents.
a) Intersection
b) Minus
c) Union
d) Matrix
164) ------------ model is an algebraic model for representing text documents as vectors of identifiers.
a) Index
b) Sorting
c) Relevant
d) Vector Space
165) In web search, the vocabulary size keeps ------------.
a) Constant
b) Reducing
c) Fluctuating
d) Growing
166) A ------------ function may become insufficient after several years.
a) Variant
b) Hash
c) B-tree
d) Primitive
167) Term ------------ is the number of times a term occurs in a document.
a) Relevant
b) Lists
c) Accumulate
d) Frequency
168) The different types of queries used by the user
a) Informational query
b) Transactional query
c) Navigational query
d) All of the above
169) Given two engines A and B, the size of their union may be estimated as:
a) |A∪B| = |A| + |B| + |A-B|
b) |A∪B| = |A| + |B| - |A∩B|
c) |A∪B| = |A| - |B| + |A∩B|
d) |A∪B| = |A| - |B| - |A∩B|
170) To process queries from users as quickly as possible is called
a) Speed
b) Quality
c) Interface
d) Query processor
171) The relationship between sites and pages indicated by hyperlinks gives rise to
a) Static page
b) Dynamic page
c) Web graph
d) Size of web page
172) The process that occurs in a series of time-steps in each of which a random choice is made is
a) Markov Chains
b) Rank page
c) Link
d) Transition
173) Two documents are------------if they contain some of same terms.
a) Unique
b) Equal
c) Both a and b
d) Similar
174) Shared Word Count is
a) Weighting is used
b) No weighting is used
c) Some weighting is used
d) None of them
175) NTCIR stands for
a) NII Test Collections for IR systems
b) Nil Test Collections for IR
c) Null Technique Collections for IR
d) Nil Test collaboration for IR
176) Deep expert is the capacity to deliver-------------------that is relevant to each individual inquirer
a) Same Information
b) False Information
c) Unique Information
d) True Information
177) It requires a large amount of existing data on a user in order to make accurate recommendation
a) Hot start
b) Cold start
c) Both a and b
d) None of them
178) ------------ builds systems that automatically answer questions posed by humans in a natural language
a) Query
b) Solution
c) Question Answering
d) Multiple Solution
179) The information needs to be translated into a query by the user
a) The User Task
b) Logical View
c) Logical Task
d) None
180) It contains document by document data
a) Inverted File
b) Combination File
c) Both a and b
d) Sequential File
181) It is group of documents that retrieval is performed on.
a) Term
b) Query
c) Collection
d) Posting
182) The main goal is to find the important meaning and create an internal representation
a) Query evaluation
b) Document Indexing
c) System evaluation
d) None
183) ------------ were the first to adopt Information Retrieval systems for retrieving information
a) Laboratory
b) Libraries
c) Industry
d) All of the above
184) It is the topic which the user desires to know more and is differentiated from a query.
a) Posting
b) Term
c) Documents
d) Information need
185) It serves as a witness who knows specific information on a given event.
a) Shallow expert
b) Expert
c) Deep expert
d) None
186) Collaborative filtering has following problems.
a) Cold Start
b) Scalability
c) Both a and b
d) None of them
187) Factors of Summarization are
a) Input, Purpose, Output
b) Purpose, Output, Input
c) Output, Purpose, Input
d) Input, Output, Purpose.
187) XML stands for
a) Extensible Main Language
b) Extensible Markup Language
c) Exists Markup Language
d) Extensible Markup Lingual.
188) Many documents on web are not in--------------format.
a) Multicode
b) Unicode
c) Same code
d) Different code
189) It improves the search engine ranking of a website.
a) White Hat SEO
b) Black Hat SEO
c) On page SEO
d) Off page SEO
190) Building data structures that enable searching
a) Web Process
b) Index process
c) Query process
d) None
191) The query process comprises the following sequence.
a) User interaction, Ranking, Evaluation.
b) Ranking, Evaluation, User interaction.
c) Evaluation, User interaction, Ranking
d) Evaluation, Ranking, User information.
192) An advantage of a positional index is that it reduces the asymptotic complexity of a postings
intersection operation.
a) True
b) False
193) Each document has a unique serial number known as
a) Document identifier
b) Document name
c) Document type
d) None of the above
194) A ------------ is a sequence of k characters.
a) K-gram
b) Boolean
c) Post filter
d) None of the above
195) Structure of Web has following entities:
i. Web Graph
ii. Static and Dynamic Pages
iii. Hidden web pages
iv. Size of web page
a) i) & ii)
b) i) & ii)
c) iii) & iv)
d) i),ii),iii) & iv)
196) An XML document can contain
a) Wide variety of data
b) Unique data
c) Simple data
d) Single data
197) Regular keyword queries as in unstructured information retrieval is
a) CO Topics
b) CAS Topics
c) Both a and b
d) None of them.
198) There is------------collection of Markup tags.
a) Fixed
b) Vast
c) No fixed
d) Large.
199) MapReduce consists of two pieces of code:
a) The Mapper and The Reducer
b) The index and Page rank
c) Input and Output
d) Map and Shuffle.
200) ------------ is the transformation of a string of characters into a usually shorter fixed-length value which
represents the original key.
a) Hashing
b) Indexing
c) Querying
d) Searching
[1] Data By Itself Is Not Useful Unless
(A) => It is massive
(B) => It is processed to obtain information
(C) => It is collected from diverse sources
Answer =>> It is processed to obtain information
[2] For Taking Decisions Data Must Be
(A) => Very accurate
(B) => Massive
(C) => Processed correctly
Answer =>> Processed correctly
[3] Strategic Information Is Needed For
(A) => Day to Day operations
(B) => Meet government requirements
(C) => Long range planning
Answer =>> Long range planning
[4] Strategic Information Is Required By
(A) => Middle managers
(B) => Line managers
(C) => Top managers
Answer =>> Top managers
[5] Tactical Information Is Needed For
(A) => Day to Day operations
(B) => Short range planning
(C) => Meet government requirements
Answer =>> Short range planning
[6] Tactical Information Is Required By
(A) => Middle managers
(B) => Line managers
(C) => Top managers
Answer =>> Middle managers
[7] Operational Information Is Needed For
(A) => Day to Day operations
(B) => Meet government requirements
(C) => Long range planning
Answer =>> Day to Day operations
[8] Operational Information Is Required By
(A) => Middle managers
(B) => Line managers
(C) => Top managers
Answer =>> Line managers
[9] Statutory Information Is Needed For
(A) => Day to Day operations
(B) => Meet government requirements
(C) => Long range planning
Answer =>> Meet government requirements
[10] In Motor Car Manufacturing The Following Type Of Information Is Strategic
(A) => Decision on introducing a new model
(B) => Scheduling production
(C) => Assessing competitor car
Answer =>> Decision on introducing a new model
[11] In Motor Car Manufacturing, The Following Type Of Information Is Tactical
(A) => Decision on introducing a new model
(B) => Scheduling production
(C) => Assessing competitor car
Answer =>> Assessing competitor car
[12] A Computer Based Information System Is Needed Because
(A) => The size of organizations has become large and data is massive
(B) => Computers are available
(C) => Difficult to get clerks to process data.
Answer =>> The size of organizations has become large and data is massive
[13] Organizations Are Divided Into Departments Because
(A) => It is convenient to do so
(B) => Each department can be assigned a specific functional responsibility
(C) => It provides opportunities for promotions
Answer =>> Each department can be assigned a specific functional responsibility
[14] Organizations Have Hierarchical Structures Because
(A) => It is convenient to do so
(B) => It is done by every organization
(C) => Specific responsibilities can be assigned for each level
Answer =>> Specific responsibilities can be assigned for each level
[15] Which Of The Following Functions Is Most Unlikely In An Insurance Company
(A) => Training
(B) => Giving loans
(C) => Bill of material
Answer =>> Bill of material
[16] Which Of The Following Functions Is Most Likely In A University
(A) => Admissions
(B) => Accounting
(C) => Conducting examinations
Answer =>> Conducting examinations
[17] Every Record Stored In A Master File Has A Key Field Because
(A) => It is the most important field
(B) => It acts as a unique identification of records
(C) => It is the key to the database
Answer =>> It acts as a unique identification of records
[18] The Primary Storage Medium For Storing Archival Data Is
(A) => Floppy disc
(B) => Magnetic disk
(C) => Magnetic tape
Answer =>> Magnetic tape
[19] Master Files Are Normally Stored In
(A) => A hard disk
(B) => A tape
(C) => CD-ROM
Answer =>> A hard disk
[20] Master File Is A File Containing
(A) => All master records
(B) => All records relevant to the application
(C) => A collection of data items
Answer =>> All records relevant to the application
[21] Edit Program Is Required To
(A) => Authenticate data entered by an operator
(B) => Format correctly input data
(C) => Detect errors in input data
Answer =>> Detect errors in input data
[22] Data Rejected By Edit Program Are
(A) => Corrected and re-entered
(B) => Removed from processing
(C) => Collected for later use
Answer =>> Corrected and re-entered
[23] Online Transaction Processing Is Used Because
(A) => It is efficient
(B) => Disk is used for storing files
(C) => It can handle random queries
Answer =>> It can handle random queries
[24] A Management Information System Is One Which
(A) => Is required by all managers of the organizations
(B) => Processes data to yield information of value in tactical management
(C) => Provides operational information
Answer =>> Processes data to yield information of value in tactical management
[25] Data Mining Is Used To Aid In
(A) => Operational management
(B) => Analyzing past decisions made by managers
(C) => Detecting patterns in operational data
Answer =>> Detecting patterns in operational data
[26] Data Mining Requires
(A) => Large quantities of operational data stored over a period of time
(B) => Lots of tactical data
(C) => Several tape drives to store archival data
Answer =>> Large quantities of operational data stored over a period of time
[27] Decision Support Systems Are Used For
(A) => Management decision making
(B) => Providing tactical information to management
(C) => Providing strategic information to management
Answer =>> Providing strategic information to management
[28] Decision Support Systems Are Used By
(A) => Line managers
(B) => Top-level managers
(C) => Middle level managers
Answer =>> Top-level managers
[29] Decision Support Systems Are Essential For
(A) => Day-to-Day operations of an organization
(B) => Providing statutory information
(C) => Top level strategic decision making
Answer =>> Top level strategic decision making
[30] A Data Dictionary Has A Consolidated List Of Data Contained In
(A) => Data flows
(B) => Data inputs
(C) => Data outputs
Answer =>> Data flows
[31] By Metadata We Mean
(A) => Very large data
(B) => Data about data
(C) => Data dictionary
Answer =>> Data about data
[32] A Data Dictionary Is Usually Developed
(A) => At requirement specification phase
(B) => During feasibility analysis
(C) => When DFD is developed
Answer =>> When DFD is developed
[33] A Data Dictionary Has Information About
(A) => Every data element in a data flow
(B) => Only key data element in a data flow
(C) => Only important data element in a data flow
Answer =>> Every data element in a data flow
[34] A Data Element In A Data Dictionary May Have
(A) => Only integer value
(B) => Only value
(C) => Only real value
Answer =>> Only value
[35] It Is Necessary To Carefully Design Data Input To A Computer Based System Because
(A) => It is good to be careful
(B) => The volume of data handled is large
(C) => The volume of data handled is small
Answer =>> The volume of data handled is large
[36] Error Occurs More Often When
(A) => Data is entered by users
(B) => Data is entered by operators
(C) => When data is hand written by users and entered by operators
Answer =>> When data is hand written by users and entered by operators
[37] In Online Data Entry It Is Possible To
(A) => Give immediate feedback if incorrect data is entered
(B) => Eliminate all errors
(C) => Save data entry operators time
Answer =>> Give immediate feedback if incorrect data is entered
[38] In Interactive Data Input A Menu Is Used To
(A) => Enter new data
(B) => Add/Delete data
(C) => Select one out of many alternatives often by a mouse click
Answer =>> Select one out of many alternatives often by a mouse click
[39] Data Inputs Which Requires Coding Are
(A) => Fields which specify prices
(B) => Key fields
(C) => Name field such as product name
Answer =>> Key fields
[40] By The Term ‘Meaningful Code’ We Understand That The Code
(A) => Conveys information on item being coded
(B) => Is of small length
(C) => Can add new item easi
MCQ
1) A model of information retrieval in which we can pose any query in which search terms are combined with the
operators AND, OR, and NOT:
Ad Hoc Retrieval
Ranked Retrieval Model
Boolean Information Model
Proximity Query Model
2)A data structure that maps terms back to the parts of a document in which they occur is called an (select the
best answer):
Postings list
Incidence Matrix
Dictionary
Inverted Index
The correct answer is: Inverted Index
3)A process to efficiently intersect lists to be able to quickly find documents that contain both terms is referred to
as merging postings lists.
True
False
The correct answer is 'True'.
4)The model of information retrieval in which we can pose any query in the form of a Boolean expression is called
the ranked retrieval model.
True
False
The correct answer is 'False'.
5)The number of times that a word or term occurs in a document is called the:
Proximity Operator
Vocabulary Lexicon
Term Frequency
Indexing Granularity
The correct answer is: Term Frequency
6)Stemming increases the size of the vocabulary.
True
False
The correct answer is 'False'.
7)In information retrieval, extremely common words which would appear to be of little value in helping
select documents that are excluded from the index vocabulary are called:
Stop Words
Tokens
Lemmatized Words
Stemmed Terms
The correct answer is: Stop Words
8)A crude heuristic process that chops off the ends of the words to reduce inflectional forms of words
and reduce the size of the vocabulary is called:
Lemmatization
Case Folding
True casing
Stemming
The correct answer is: Stemming
9)An advantage of a positional index is that it reduces the asymptotic complexity of a postings
intersection operation.
True
False
The correct answer is 'False'.
10)An index that includes sequences of words or terms of variable length that have been extracted
from a source document is called a:
Phrase Index
Biword index
Positional index
Inverted Index
The correct answer is: Phrase Index
11)One disadvantage, as outlined in our text, of using a permuterm index for wild card queries is:
It requires complex code that is difficult to maintain
It has the risk of key collisions which are difficult to resolve
The required rotations creates a very large dictionary
It cannot be used to find terms that are not spelled correctly
The correct answer is: The required rotations creates a very large dictionary
12)Which of the following is a technique for context sensitive spelling correction:
the Jaccard Coefficient
Soundex algorithms
k-gram indexes
Levenshtein distance
The correct answer is: Soundex algorithms
13)For a very large collection of books of classic literature the most appropriate indexing
algorithm would be:
Block sort-based indexing algorithm
Single-pass in memory indexing algorithm
Distributed Map-Reduce indexing algorithm
Dynamic indexing process employing an auxiliary index
The correct answer is: Distributed Map-Reduce indexing algorithm
14)For a large collection of documents such as the internet that experience frequent change the most
appropriate indexing algorithm would be:
Block sort-based indexing algorithm
Single-pass in memory indexing algorithm
Distributed Map-Reduce indexing algorithm
Dynamic indexing process employing an auxiliary index
The correct answer is: Dynamic indexing process employing an auxiliary index
15)Given two strings s1 and s2, the edit distance between them is sometimes known as the:
Levenshtein distance
isolated-term distance
k-gram overlap
Jaccard Coefficient
The correct answer is: Levenshtein distance
16)For a moderately large collection of static documents maintained on a single system the most
appropriate indexing algorithm would be:
Block sort-based indexing algorithm
Single-pass in memory indexing algorithm
Distributed Map-Reduce indexing algorithm
Dynamic indexing process employing an auxiliary index
The correct answer is: Single-pass in memory indexing algorithm
17)For a small collection of documents on a personal computer that don't experience any change the
most appropriate indexing algorithm would be:
Block sort-based indexing algorithm
Single-pass in memory indexing algorithm
Distributed Map-Reduce indexing algorithm
Dynamic indexing process employing an auxiliary index
The correct answer is: Block sort-based indexing algorithm
18)Hashing is a process where an item is reduced, through a mathematical process, to an integer.
True
False
The correct answer is 'True'.
19)The size of the document collection that can be indexed by single-pass in-memory indexing
algorithm is limited by the size of the disk storage the computer running the indexer process
has access to.
True
False
The correct answer is 'False'.
20)The formula used to estimate the vocabulary size of a collection is known as:
Zipf's law
Power law
Heaps' law
Compression ratio
The correct answer is: Heaps' law
21)Which of the following is NOT a benefit of index compression?
Simplified algorithm design
Reduction of disk space
Faster transfer of data from disk to memory
Increased Use of caching
The correct answer is: Simplified algorithm design
22)A compression algorithm that results in some loss of data is called:
zipf compression
dictionary compression
lossless compression
lossy compression
The correct answer is: lossy compression
23)An approach to compression that takes advantage of the redundancy in the dictionary that results
from common prefixes that come from sorted terms is called:
Front Coding
Blocked storage
Prefix Coding
Variable byte encoding
The correct answer is: Front Coding
24)A disadvantage of compression is that it reduces the transfer of data from disk to memory.
True
False
The correct answer is 'False'.
25)The rule that the 30 most common words account for 30% of the tokens in written text is known as front coding.
True
False
The correct answer is 'False'.
26)Weighted zone scoring is sometimes referred to as ranked Boolean retrieval.
True
False
The correct answer is 'True'.
27)In the bag of words model, the exact ordering of terms within the document is both significant and
relevant to processing.
True
False
The correct answer is 'False'.
28)The purpose of the inverse document frequency is to increase the weight of terms with
high collection frequency.
True
False
The correct answer is 'False'.
29)A scheme where a weight is assigned to a term based upon the number of occurrences of the
term within a document is called:
Bag of Words
Document Frequency
Term Frequency
Optimal weight
The correct answer is: Term Frequency
30)The number of documents within a collection that contain a particular term is the collection
frequency of the term.
True
False
The correct answer is 'False'.
31)A metric derived by taking the log of N divided by the document frequency where N is the total
number of documents in a collection is called:
document frequency
tf-idf weight
collection frequency
inverse document frequency
The correct answer is: inverse document frequency
32)The tf-idf weight is highest when a term t occurs many times within a small number of documents.
True
False
The correct answer is 'True'.
33)The tf-idf weight is lower when a term t occurs many times in a document or occurs in relatively few
documents.
True
False
The correct answer is 'False'.
34)A measure of similarity between two vectors which is determined by measuring the angle between
them is called:
cosine similarity
sin similarity
vector similarity
vector scoring
The correct answer is: cosine similarity
35)An index that is often supplemental to the inverted index and contains terms from only a particular
field or section of a document is called a parametric index.
True
False
The correct answer is 'True'.
36)A scheme where a weight is assigned to a term based upon the number of occurrences of the
term within a document is called:
Select one:
a. Bag of Words
b. Document Frequency
c. Term Frequency
d. Optimal weight
The correct answer is: Term Frequency
37)A group of related documents against which information retrieval is employed is called:
a. Corpus
b. Text Database
c. Index Collection
d. Repository
The correct answer is: Corpus
38)Weighted zone scoring is referred to as:
a. ranked Boolean retrieval
b. Zipf retrieval
c. Ad Hoc query retrieval
d. Jaccard retrieval
The correct answer is: ranked Boolean retrieval
39)An approach to compression that takes advantage of the redundancy in the dictionary that results
from common prefixes that come from sorted terms is called:
a. Front Coding
b. Blocked storage
c. Prefix Coding
d. Variable byte encoding
The correct answer is: Front Coding
40)True/False: Given two strings s1 and s2, the edit distance between them is sometimes known as the
Levenshtein distance.
True
False
The correct answer is 'True'.
41)True/False: Ad hoc retrieval is a model of information retrieval in which we can pose any query in
which search terms are combined with the operators AND, OR, and NOT.
Select one:
True
False
The correct answer is 'False'.
42)True/False: An advantage of compression is that it reduces the transfer of data from disk to memory.
True
False
The correct answer is 'True'.
43)True/False: The process where multiple lists are evaluated using AND or OR operators in a Boolean
retrieval query is called an intersection operation.
True
False
The correct answer is 'True'.
44)For a small collection of documents on a personal computer that don't experience any change the
most appropriate indexing algorithm would be:
Select one:
a. Block sort-based indexing algorithm
b. Single-pass in memory indexing algorithm
c. Distributed Map-Reduce indexing algorithm
d. Dynamic indexing process employing an auxiliary index
The correct answer is: Block sort-based indexing algorithm
45)True/False: The number of documents within a collection that contain a particular term is the
collection frequency of the term.
True
False
The correct answer is 'False'.
46)True/False: In the bag of words model, the exact ordering of terms within the document is
not relevant to processing.
Select one:
True
False
The correct answer is 'True'.
47)In information retrieval, extremely common words which would appear to be of little value in helping
select documents that are excluded from the index vocabulary are called:
a. Stop Words
b. Tokens
c. Lemmatized Words
d. Stemmed Terms
The correct answer is: Stop Words
48)A process that reduces the size of a vocabulary by reducing to the 'root' of words is called:
a. Stemming
b. Lemmatizing
c. Removal of stop words
d. Posting
e. pruning
The correct answer is: Stemming
49)Which of the following is NOT a benefit of index compression?
a. Simplified algorithm design
b. Reduction of disk space
c. Faster transfer of data from disk to memory
d. Increased Use of caching
The correct answer is: Simplified algorithm design
50)To evaluate the effectiveness of an IR system the output from a standard query executed against the
test IR system is compared with the known output from a:
Select one:
a. internet collection
b. reference book
c. separate IR system.
d. standard test collection
The correct answer is: standard test collection
51)The standard approach to information retrieval system evaluation revolves around the notion of:
a. Quantity of documents in the collection
b. Relevant and non relevant documents.
c. Accuracy
d. user happiness
The correct answer is: Relevant and non relevant documents
52)A web server communicates with a client (browser) using which protocol:
Select one:
a. HTML
b. HTTP
c. FTP
d. Telnet
The correct answer is: HTTP
53)The basic operation of a web browser is to pass a request to the web server. This request is an
address for a web page and is known as the:
a. UAL: Universal Address Locator
b. HTML: Hypertext Markup Language
c. URL: Uniform Resource Locator
d. HTTP: Hypertext transfer protocol
The correct answer is: URL: Uniform Resource Locator
54)A web page whose content doesn't vary from one request to another is called a:
a. Text Page
b. Dynamic Page
c. Active Server Page
d. Static Page
The correct answer is: Static Page
55)A web link within a web page that references another part of the same page is called a:
a. Out link
b. Vector
c. In link
d. Tendril
The correct answer is: In link
56)In the context of web search engines the manipulation of web page content for the purpose of
appearing high up in search results for selected query terms is called:
Select one:
a. Paid inclusion
b. SPAM
c. SEO
d. Link Analysis
The correct answer is: SPAM
57)Results from a search engine that are based upon the retrieval of items using a method of
term weighting such as cosine similarity is a form of:
a. Sponsored Search
b. Algorithmic Search
c. Informational Search
d. Navigational Search
The correct answer is: Algorithmic Search
58)A program that captures and indexes content from web pages is known as what insect:
a. Fly
b. Centipede
c. Mosquito
d. Spider
The correct answer is: Spider
59)The list of web pages that a web crawler has queued up to index is called the:
a. Web Page Queue
b. Seed set
c. URL Filter
d. URL Frontier
The correct answer is: URL Frontier
60)In order to access a particular web site in the internet, the URL must be converted into an IP
address. Which service does this conversion?
a. HTTP
b. TNS
c. DNS
d. DHCP
The correct answer is: DNS
61)For a very large collection of books of classic literature the most appropriate indexing
algorithm would be:
a. Block sort-based indexing algorithm
b. Single-pass in memory indexing algorithm
c. Distributed Map-Reduce indexing algorithm
d. Dynamic indexing process employing an auxiliary index
The correct answer is: Distributed Map-Reduce indexing algorithm
62)Which of the following is a technique for context sensitive spelling correction:
a. the Jaccard Coefficient
b. Soundex algorithms
c. k-gram indexes
d. Levenshtein distance
The correct answer is: Soundex algorithms
63)The formula used to estimate the vocabulary size of a collection is known as:
a. Zipf's law
b. Power law
c. Heaps' law
d. Compression ratio
The correct answer is: Heaps' law
THEORY
Page Rank
PageRank (PR) is an algorithm used by Google Search to rank websites in their search engine
results. PageRank was named after Larry Page, one of the founders of Google. PageRank is a
way of measuring the importance of website pages. According to Google:
PageRank works by counting the number and quality of links to a page to determine a rough
estimate of how important the website is. The underlying assumption is that more important
websites are likely to receive more links from other websites.
How it is calculated
PageRank is calculated from the number and value of incoming links to a website. Initially,
one link from a site equaled one vote for the site that it was linked to. However, later versions
of PageRank assume a probability distribution between 0 and 1, so every page starts with the
same initial value of 1/N, where N is the number of pages (for example, 0.25 in a collection of
four pages).
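The calculation described above can be sketched as a simple iterative procedure. This is only an illustrative sketch, not Google's implementation: the damping factor of 0.85, the fixed number of iterations, the function name pagerank and the four-page toy web are all assumptions made for the example. Every page starts at 1/N (0.25 here) and repeatedly passes its score along its outgoing links.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate PageRank for a small link graph.

    links maps each page to the list of pages it links to.
    damping=0.85 is a commonly quoted value, assumed here.
    """
    pages = sorted(links)
    n = len(pages)
    # Every page starts with the same score, 1/N.
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Base score every page receives regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its score evenly among its outgoing links.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: C receives links from A, B and D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this toy web, page C ends up with the highest score because it has the most incoming links, and page D, which nothing links to, keeps only the base score.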
MapReduce
MapReduce is a processing technique and a programming model for distributed computing;
Hadoop's implementation of it is written in Java. The MapReduce algorithm contains two
important tasks, namely Map and Reduce. The Map task takes a set of data and converts it into
another set of data, where individual elements are broken down into tuples (key/value pairs).
The Reduce task takes the output from a map as its input and combines those data tuples into a
smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always
performed after the map job.
Algorithm
Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the
reduce stage. Map stage − The map or mapper's job is to process the input data. Generally the
input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS).
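The three stages can be illustrated with the classic word-count example. The sketch below runs on a single machine in plain Python and does not use Hadoop; the function names map_phase, shuffle_phase, reduce_phase and word_count are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map stage: turn one input document into (key, value) pairs."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle stage: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce stage: combine the values of one key into a smaller result."""
    return key, sum(values)


def word_count(documents):
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["the cat sat", "the dog sat down"])
print(counts)
```

In a real cluster the map calls would run in parallel on the machines that hold the input blocks, and the shuffle would move the grouped pairs across the network to the reducers.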
MapReduce in Hadoop
Apache Hadoop MapReduce is a framework for processing large data
sets in parallel across a Hadoop cluster. Data analysis uses a two-step map and reduce process.
The job configuration supplies map and reduce analysis functions and the Hadoop
framework provides the scheduling, distribution, and parallelization services. By default, the
MapReduce framework gets input data from the Hadoop Distributed File System (HDFS).
Hadoop (by default, Hadoop uses the Hadoop Distributed File System, HDFS)
The Apache Hadoop software library is a framework that allows for the distributed processing of
large data sets across clusters of computers using simple programming models. It is designed to
scale up from single servers to thousands of machines, each offering local computation and
storage.
It’s the tool that actually gets data processed.
It tends to drive people slightly crazy when they work with it.
Link Analysis
Link analysis is a data analysis technique used in network theory to evaluate the
relationships or connections between network nodes. Link analysis is often used in search
engine optimization as well as in intelligence, in security analysis and in market and
medical research.
Question answering (QA)
Question answering (QA) is a computer science discipline within the fields of information
retrieval and natural language processing (NLP), which is concerned with building
systems that automatically answer questions posed by humans in a natural language.
A question answering implementation, usually a computer program, may construct its answers by
querying a structured database of knowledge or information, usually a knowledge base. More
commonly, question answering systems can pull answers from an unstructured collection of natural
language documents.
Some examples of natural language document collections used for question answering systems
include:
a local collection of reference texts
internal organization documents and web pages
compiled newswire reports
a set of Wikipedia pages
a subset of World Wide Web pages
Summarization
Text summarization is a way to condense a large amount of information into a concise form by
selecting the important information and discarding unimportant and redundant information.
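One simple way to select important information is extractive summarization by word frequency. The sketch below is only one of many possible approaches and is not taken from these notes: the tiny stop word list, the scoring rule (average frequency of the non-stop words in a sentence) and the function name summarize are assumptions made for the example.

```python
import re
from collections import Counter

# A deliberately tiny stop word list, assumed for this example only.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it", "on"}


def content_words(text):
    """Lowercase the text and keep only words that are not stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def summarize(text, max_sentences=1):
    """Keep the sentences whose words are most frequent in the whole text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    # Rank sentence positions by score, then restore the original order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)


sample = ("Search engines index web pages. "
          "Search engines rank pages by relevance. "
          "The weather was nice.")
print(summarize(sample, max_sentences=1))
```

The sentence about the weather is discarded because none of its words recur elsewhere in the text, which is the "unimportant and redundant information" the definition refers to.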
1) Information is
a) Data b) Processed data
c) Manipulated input d) Computer output
ANS: b
2) Which of the following is a characteristic of Data?
a) Numerically expressed b) Affected by various causes
c) Aggregates of facts d) All of these
Ans: d
3) Which of the following is a characteristic of information?
a) Pre-determined objectives b) Collection of data in systematic manner
c) Accuracy in data collection d) All of these
Ans: d
4) A computer based information system is needed because
i) The size of organizations has become large and data is massive
ii) Timely decisions are to be taken based on available data
iii) Computers are available
iv ) Difficult to get clerks to process data
a) (ii) and (iii) b) (i) and (ii)
c) (i) and (iv) d) (iii) and (iv)
Ans: b
5) An MIS objective can be stated as
a) Increase product sales b) Reduce marketing cost
c) Increase sale of product A by 10% in the next year d) All of the above
Ans: b
6) Information systems are organized combination of
a) People, hardware, software, computer networks and data resources b) Hardware, software
c) Computer cables d) None of these
Ans: a
7) One of the main capabilities of ‘IS’ is ______
a) Provide computer for working b) Provide fast and accurate transaction processing
c) Both of above d) None of these
Ans: b
8) IS is needed because it provides support for
a) Business processes, decision-making, and competitive advantage b) Generating reports only
c) Demonstration effect d) None of these
Ans: a
9) Main dimensions of information systems are
a) Organizations and management b) Management and technology
c) Organizations, management and technology d) None of these
Ans: c
10) Components of information systems are
a) Computer and network b) Computer and software
c) People, hardware, software, data and networks d) None of the above
Ans: c
11) The major components of a computer are
a) Memory b) I/O Devices
c) CPU d) All of the above
Ans: d
12) The Central Processing Unit
a) Is operated from the Control Panel b) Controls the Storage Unit
c) Is controlled by the input data entering the system d) Controls all input, output and processing
Ans: d
13) The CPU (Central Processing Unit) consists of
a) Input, output, and processing
b) Input, processing, and storage
c) Control Unit, Arithmetic and Logic Unit, and Primary Storage
d) Control unit, primary storage, and secondary storage
Ans: c
14) Memory is
a) Device that performs a sequence of operations specified by instructions in memory
b) The device where information is stored
c) A sequence of instruction
d) Typically characterized by interactive processing and time slicing of the CPU's time to allow quick response to
each user
Ans: b
15) Which is the component that allows the computer to permanently retain large amounts of data?
a) CPU b) Primary Memory
c) Mass Storage Device d) None of the above
Ans: c
16) Which of the following loses its contents when the computer is turned off?
a) RAM b) ROM
c) PROM d) All of the above
Ans: a
17) The fastest memory in a computer system is
a) ROM b) RAM
c) Cache d) None of these
Ans: c
18) Which of the following is a portable computer
a) Laptops b) Notebook Computer
c) PDAs d) All of the above
Ans: d
19) Why a desktop computer is called Personal Computer?
a) Because it belongs to a single person
b) Because only one person can use it at any point of time
c) Because only persons can use it, not organizations
d) Because it needs personal attention
Ans: b
20) Which of the following is System Software?
a) MS-Word b) Tally
c) MS-PowerPoint d) Operating System
Ans: d
21) Which of the following is not application Software?
a) Word Processing b) Spreadsheet
c) UNIX d) Desktop Publishing
Ans: c
22) Which of the following is not an output device?
a) Printer b) Keyboard
c) Projector d) Plotter
Ans: b
23) Mouse is which type of device?
a) Extracting device b) Pointing Device
c) Hand device d) Gaming device
Ans: b
24) The wheel on a mouse that is used for scrolling is called
a) Scroll wheel b) Wheel
c) Roller d) None of these
Ans: a
25) Some of the most basic types of output devices is/are
a) Monitors, printers b) Plotters, computer output firms
c) Audio output d) All of the above
Ans: a
26) Mouse, trackball, and joystick are the examples of
a) scanning devices b) storing devices
c) pointing devices d) Multimedia devices
Ans: c
27) The device which is used to input images into the computer is
a) Mouse b) Digital Camera
c) Joystick d) None of the above
Ans: b
28) Which topology requires a central controller or hub?
a) Mesh b) star
c) Bus d) Ring
Ans: b
29) Which topology requires a multipoint connection?
a) Mesh b) star
c) Bus d) Ring
Ans: c
UNIT - II
1) DBMS stands for
a) Data base marginal system b) Directory based memory standard
c) Data base management system d) Dual bus mask storage
Ans: c
2) A Database Management System is
a) Collection of interrelated data b) Collection of programs to access data
c) Collection of data describing one particular enterprise d) All of the above
Ans: a
3) In the relational model, cardinality is termed as:
a) Number of tuples b) Number of attributes
c) Number of tables d) Number of constraints
Ans: a
4) Architecture of the database can be viewed as
a) Two levels b) Four levels
c) Three levels d) One level
Ans: c
5) In a relational model, relations are termed as
a) Tuples b) Attributes
c) Tables d) Rows
Ans: c
6) Related fields in a database are grouped to form a
a) Data File b) Data Record
c) Menu d) Bank
Ans: b
7) The database environment has all of the following components except
a) Users b) Separate files
c) Database d) Database administrator
Ans: b
8) An advantage of the database management approach is
a) Data is dependent on programs b) Database redundancy increases
c) Data is integrated and can be accessed by multiple programs. d) None of the above.
Ans: c
9) The RDBMS terminology for a row is
a) Tuple b) Relation
c) Attribute d) Degree
Ans: a
10) ______ includes review of the existing procedures and information flow.
a) Feasibility Study b) Feasibility report
c) System Design d) System analysis
Ans: a
11) ______ refers to the collection of information pertinent to the systems project.
a) Data transfer b) Data gathering
c) Data Embedding d) Data Request
Ans: b
13) System Development process is also called as
a) System Development Life Cycle b) System Life Cycle
c) Both A and B d) System Process Cycle
Ans: a
15) Which of these sequences is correct for the systems development lifecycle?
a) Initiation, analysis, design, build b) Design, initiation, analysis, build
c) Analysis, design, initiation, build d) Analysis, initiation, design, build
Ans: a
16) Which is not a software life cycle model
a) Spiral Model b) Waterfall Model
c) Prototyping Model d) Capability Maturity Model
Ans: d
17) RAD stands for
a) Rapid Application Development b) Relative Application Development
c) Ready Application Development d) Repeated Application Development
Ans: a
18) The major goal of requirement determination phase of information system development is
a) Determine whether information is needed by an organization
b) Determine what information is needed by an organization
c) Determine how information needed by an organization can be provided
d) Determine when information is to be given
Ans: b
19) Information requirements of an organization can be determined by
a) Interviewing managers and users and arriving at the requirements based on consensus
b) Finding out what similar organizations do
c) Telling organization what they need based on your experience
d) Sending a questionnaire to all employees of the organization
Ans: a
20) A feasibility study is carried out
a) After final requirements specifications are drawn up
b) During the period when requirements specifications are drawn up
c) Before the final requirements specifications are drawn up
d) At any time
Ans: c
21) The main objective of feasibility study is
a) To assess whether it is possible to meet the requirements specifications
b) To assess if it is possible to meet the requirements specified subject to constraints of budget, human
resource and hardware
c) To assist the management in implementing the desired system
d) To remove bottlenecks in implementing the desired system
Ans: b
22) Feasibility study is carried out by
a) Managers of the organization
b) System analyst in consultation with managers of the organization
c) Users of the proposed system
d) Systems designers in consultation with the prospective users of the system
Ans: b
23) The expansion of CASE tools is:
a) Computer Assisted Self Evaluation b) Computer Aided Software Engineering
c) Computer Aided Software Environment d) Core Aids for Software Engineering Ans: b
24) CASE tools are used by industries to
a) Improve productivity of their software engineers b) Reduce time to develop applications
c) Improve documentation d) All of the above Ans: d
25) CASE tools are useful
a) Only during system design stage b) During all the phases of system life cycle
c) Only for system documentation d) Only during System analysis stage Ans: b
26) CASE tools are
a) A Set of rules to be used during system analysis and design
b) Program, packages used during system analysis and design
c) A set of tools used by analysts
d) Needed for use case development Ans: b
27) Which of the following is, NOT a key component of object oriented programming?
a) Inheritance b) Encapsulation
c) Polymorphism d) Parallelism Ans: d
28) Which of these is TRUE of the relationship between objects and classes?
a) A class is an instance of an object. b) An object is the ancestor of its subclass.
c) An object is an instance of a class. d) An object is the descendant of its super-class Ans: c
1. Distributed indexing is used in:
Select one:
a. All of the above
b. Web-scale indexing
c. Google data centres
d. Parallel tasking
Ans: a. All of the above
2. Which is a good idea for using skip pointers? Select one:
a. Fewer skips, larger skip spans
b. None
c. Depends upon the no. of comparisons needed
d. More skips, shorter skip spans
Ans: c. Depends upon the no. of comparisons needed
3. Edit distance (Levenshtein distance) is a way of:
Select one:
a. Context-sensitive spelling correction
b. Document correction
c. Isolated word correction
d. Phonetic correction
Ans: c. Isolated word correction
4. Boolean retrieval model does not provide provision for: Select one:
a. Ranked search
b. Proximity search
c. Phrase search
d. Both proximity and ranked search
Ans: d. Both proximity and ranked search
5. Permuterm indices are used for solving:
Select one:
a. None
b. Boolean queries
c. Phrase queries
d. Wildcard queries
Ans: d. Wildcard queries
6. A large repository of documents in IR is called as:
Select one:
a. Corpus
b. Database
c. Dictionary
d. Collection
Ans: a. Corpus
7. Benefits of using a hash table is:
Select one:
a. Do not need to rehash everything periodically if vocabulary keeps growing.
b. Lookup in a hash table is faster than lookup in a tree.
c. All of the above
d. No prefix search is required
Ans: b. Lookup in a hash table is faster than lookup in a tree.
8. Variable-size postings lists are used when:
Select one:
a. More seek time is desired and the corpus is static
b. Less seek time is desired and the corpus is dynamic
c. Less seek time is desired and the corpus is static
d. More seek time is desired and the corpus is dynamic
Ans: d. More seek time is desired and the corpus is dynamic
9. An alternative to equivalence classing is to do:
Select one:
a. Asymmetric expansion
b. Symmetric expansion
c. Case folding
d. Normalization
Ans: d. Normalization
10. We need external sorting algorithms to:
Select one:
a. Maximize the disk seek time.
b. Maintain constant disk seek time
c. Minimize the disk seek time.
d. None
Ans: c. Minimize the disk seek time.
11. Benefits of using B-trees:
Select one:
a. Re-balancing is cheap
b. Balanced trees allow efficient retrieval
c. Faster O(log M)
d. Solves the prefix problem.
Ans: d. Solves the prefix problem.
12. Postings list should be sorted by:
Select one:
a. Document Frequency
b. DocID
c. TermID
d. Term frequency
Ans: b. DocID
13. Key idea behind Single-pass in-memory indexing is:
Select one:
a. Don’t sort, Accumulate postings in postings lists as they occur.
b. Generate separate dictionaries for each block.
c. All of the above
d. No need to maintain term-termID mapping across blocks.
Ans: c. All of the above
14. For postings of length L, no. of skip pointers required are:
Select one:
a. Use L evenly-spaced skip pointers
b. Use L^2 evenly-spaced skip pointers.
c. Use L^1/2 evenly-spaced skip pointers
d. Use 2L evenly-spaced skip pointers.
Ans: c. Use L^1/2 evenly-spaced skip pointers
15. For query optimization while intersecting two postings list, we should:
Select one:
a. Process in the order of increasing document frequency
b. Process in any order
c. None of the above
d. Process in the order of decreasing document frequency
Ans: a. Process in the order of increasing document frequency
16. The goal of IR is to:
Select one:
a. find documents relevant to an information need
b. find documents relevant to an information need from a given document set
c. find documents relevant to an information need from a large document set
d. find documents relevant to an information need from a small document set
Ans: c. find documents relevant to an information need from a large document set
17. Best implementation approach for dynamic indexing is:
Select one:
a. Periodic re-indexing
b. Using Invalidation bit-vector for deleted docs
c. None
d. Using logarithmic merge
Ans: d. Using logarithmic merge
18. Issues in biword indexes are:
Select one:
a. Any one
b. Index blowup due to bigger dictionary
c. Both
d. False positives
Ans: c. Both
19. Any string of terms of the following form is called an extended biword:
Select one:
a. NNX*
b. NXNN
c. *NNX
d. NX*N
Ans: d. NX*N
20. Structured data allows for:
Select one:
a. Does not depend on data complexity
b. Less complex queries
c. No relationship
d. More complex queries
Ans: d. More complex queries
21. Blocked sort-based Indexing is a method of:
Select one:
a. Sorting with more disk seeks.
b. Merging with fewer disk seeks.
c. Comparing with fewer disk seeks.
d. Sorting with fewer disk seeks.
Ans: d. Sorting with fewer disk seeks.
22. Term-document incidence matrix is:
Select one:
a. Sparse
b. Depends upon the data
c. Dense
d. Cannot predict
Ans: a. Sparse
23. Lemmatization is a technique for:
Select one:
a. Ranking documents
b. Case folding
c. Normalization
d. Tokenization
Ans: c. Normalization
24. If list lengths are x and y, merge takes:
Select one:
a. O(Yn) operations
b. O(xy) operations
c. O(xn) operations
d. O(x+y) operations
Ans: d. O(x+y) operations
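The O(x+y) bound comes from the linear merge: each comparison advances at least one of the two pointers, so no posting is visited twice. A minimal sketch, assuming both postings lists are already sorted by docID (the sample lists and the name `intersect` are illustrative only):

```python
def intersect(p1, p2):
    """Intersect two docID-sorted postings lists with a linear merge."""
    answer = []
    i = j = 0
    comparisons = 0
    while i < len(p1) and j < len(p2):
        comparisons += 1
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer, comparisons


# Hypothetical sample postings lists
list_x = [1, 2, 4, 11, 31, 45, 173, 174]
list_y = [2, 31, 54, 101]
common, steps = intersect(list_x, list_y)
print(common, steps)
```

The loop stops as soon as either list is exhausted, so the number of comparisons never exceeds x + y.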
25. Unstructured data tends to refer to information on the web and is processed using: Select one:
a. Both
b. Database systems
c. IR systems
d. None
Ans: c. IR systems
Question 1
Consider the following documents:
D1. Cat in the hat
D2. The cat chased the rat
D3. The rat died
D4. The cat died
What is the space requirement for an uncompressed Boolean term-document incidence matrix of the above
documents?
Select one:
7 bytes
28 bits
28 bytes
7 bits
Feedback
The correct answer is: 28 bits
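The 28-bit figure can be checked by building the matrix directly: one bit per (term, document) cell. This is a small sketch that assumes case folding ("Cat" and "cat" are the same term) and whitespace tokenization; both are assumptions, not part of the question.

```python
docs = {
    "D1": "Cat in the hat",
    "D2": "The cat chased the rat",
    "D3": "The rat died",
    "D4": "The cat died",
}

# Distinct lower-cased terms per document
terms_in = {doc_id: set(text.lower().split()) for doc_id, text in docs.items()}
vocab = sorted(set().union(*terms_in.values()))

# One row per term, one 0/1 cell per document
matrix = {term: [int(term in terms_in[d]) for d in docs] for term in vocab}

bits = len(vocab) * len(docs)
print(vocab)
print(bits)
```

With 7 distinct terms and 4 documents, the matrix has 7 x 4 = 28 cells of one bit each.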
Question 2
Which of the following terms have the same soundex code?
Select one or more:
Brightsite
Briteside
Brightside
Feedback
Your answer is correct.
The correct answer is: Brightside, Brightsite
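The codes can be worked out with a short Soundex sketch. It assumes the common textbook variant: keep the first letter, map the remaining letters to digits (vowels and h, w, y to 0), collapse runs of identical digits, drop the zeros, and pad or truncate to four characters. Other Soundex variants treat h and w slightly differently.

```python
GROUPS = [("aeiouhwy", "0"), ("bfpv", "1"), ("cgjkqsxz", "2"),
          ("dt", "3"), ("l", "4"), ("mn", "5"), ("r", "6")]
CODES = {ch: digit for letters, digit in GROUPS for ch in letters}


def soundex(term):
    """Four-character Soundex code (textbook variant, see assumptions above)."""
    letters = [c for c in term.lower() if c in CODES]
    if not letters:
        return ""
    digits = [CODES[c] for c in letters[1:]]
    collapsed = []
    for d in digits:
        # Remove one of each pair of consecutive identical digits
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    kept = [d for d in collapsed if d != "0"]
    return (letters[0].upper() + "".join(kept) + "000")[:4]


for word in ["Brightsite", "Briteside", "Brightside"]:
    print(word, soundex(word))
```

Under these assumptions Brightsite and Brightside share a code while Briteside gets a different one, because the g in "Bright" contributes a digit that "Brite" lacks.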
Question 3
Consider an index for 100000 documents each having a length of 750 words. Assume there are 200K
distinct terms in total. What is the minimum number of bits required for representing the Doc-ID?
Select one:
8 bits
18 bits
17 bits
20 bits
Feedback
The correct answer is: 17 bits
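Only the number of documents matters here; document length and vocabulary size do not affect the docID width. A quick check, assuming docIDs are numbered 0 to N-1 and stored as fixed-width binary integers (the helper name is illustrative):

```python
def docid_bits(num_docs):
    """Smallest fixed width (in bits) that can encode docIDs 0..num_docs-1."""
    return max(1, (num_docs - 1).bit_length())


print(docid_bits(100000))
```

Since 2^16 = 65,536 is too small and 2^17 = 131,072 is large enough, 17 bits are needed.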
Question 4
Which of the following is(are) NOT true with Google Search Engine? Select one:
It offers specialized search services
It does stemming
It does stop-word removal
None of the choices
Feedback
The correct answer is: None of the choices
Question 5
A fragment from an inverted index (augmented with positional information) is given below.
Information: d1: 12; d2: 23, 32, 43; d3: 13; d5: 32, 45, 80
systems: d1: 15; d2: 34, 42; d3: 35; d5: 38
Which of the following phrase(s) has(have) possible occurrences in the above document
sequence?
Select one or more:
“Information retrieval systems”
“Information systems”
“Information theory retrieval systems”
None of the choices
Feedback
The correct answer is: “Information retrieval systems”, “Information theory retrieval
systems”
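A phrase is only possible in a document if "systems" occurs at the right offset after "Information": 1 position later for a two-word phrase, 2 for a three-word phrase, 3 for a four-word phrase. The sketch below checks each offset against the fragment; the function name `docs_with_gap` is hypothetical.

```python
index = {
    "information": {"d1": [12], "d2": [23, 32, 43], "d3": [13], "d5": [32, 45, 80]},
    "systems": {"d1": [15], "d2": [34, 42], "d3": [35], "d5": [38]},
}


def docs_with_gap(index, first, last, gap):
    """Documents where `last` occurs exactly `gap` positions after `first`."""
    hits = []
    for doc, positions in index[first].items():
        later = set(index[last].get(doc, []))
        if any(p + gap in later for p in positions):
            hits.append(doc)
    return hits


for gap in (1, 2, 3):
    print(gap, docs_with_gap(index, "information", "systems", gap))
```

No document has the two words adjacent, d2 has them two positions apart (32 and 34), and d1 has them three positions apart (12 and 15). The index cannot confirm the middle words, which is why the question asks about possible occurrences.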
Question 6
Consider the following two postings list with the skip pointers shown. How many
postings comparisons will be made while intersecting the two lists with skip pointers?
Select one:
7
8
6
9
Feedback
The correct answer is: 9
Question 7
Consider the following fragment of a positional index with the format:
word: document: (position, position, . . .); document: (position, . . .); . . .
Gates: 1: (3); 2: (6); 3: (2,17); 4: (1);
IBM: 4: (3); 7: (14);
Microsoft: 1: (1); 2: (1,21); 3: (3); 5: (16,22,51);
The /k operator, word1 /k word2 finds occurrences of word1 within k words of word2
(either on left or right side), where k is a positive integer argument. Thus k = 1 demands
that word1 be adjacent to word2.
What is the set of documents that satisfy the query Gates /2 Microsoft?
Select one:
1,3
3
1
No document satisfies the query
Feedback
The correct answer is: 1,3
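The /k operator can be evaluated straight from the positional index: a document matches when some position of word1 and some position of word2 differ by at most k. A minimal sketch over the fragment given in the question (the function name `within_k` is illustrative):

```python
index = {
    "Gates": {1: [3], 2: [6], 3: [2, 17], 4: [1]},
    "IBM": {4: [3], 7: [14]},
    "Microsoft": {1: [1], 2: [1, 21], 3: [3], 5: [16, 22, 51]},
}


def within_k(index, w1, w2, k):
    """Documents in which w1 occurs within k positions of w2 (either side)."""
    matches = []
    for doc in sorted(index[w1].keys() & index[w2].keys()):
        if any(abs(p1 - p2) <= k
               for p1 in index[w1][doc]
               for p2 in index[w2][doc]):
            matches.append(doc)
    return matches


print(within_k(index, "Gates", "Microsoft", 2))
print(within_k(index, "Gates", "Microsoft", 1))
```

In document 1 the positions are 3 and 1 (distance 2), and in document 3 they are 2 and 3 (distance 1), so both satisfy Gates /2 Microsoft; only document 3 satisfies the stricter k = 1.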
Question 8
Given the query uni*e, which of the following keys would be looked up in a permuterm
wildcard index?
Select one:
e$uni*
e$uin*
$unie*
Ie$un*
Feedback
The correct answer is: e$uni*
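For a query X*Y, the permuterm lookup key is obtained by appending the end marker $ and rotating until the * is at the end, which gives Y$X*. A short sketch, assuming a single * in the query (the helper names are hypothetical):

```python
def permuterm_key(query):
    """Rotate a single-wildcard query X*Y into the lookup key Y$X*."""
    if query.count("*") != 1:
        raise ValueError("this sketch handles exactly one '*'")
    prefix, suffix = query.split("*")
    return suffix + "$" + prefix + "*"


def rotations(term):
    """All permuterm rotations stored in the index for one vocabulary term."""
    marked = term + "$"
    return [marked[i:] + marked[:i] for i in range(len(marked))]


key = permuterm_key("uni*e")
print(key)
# A term matches if one of its rotations starts with the key minus the trailing *
print([r for r in rotations("unique") if r.startswith(key[:-1])])
```

For uni*e the key is e$uni*, a prefix search that matches the rotation e$uniqu of the term "unique".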
Question 9
If X denotes the length of string s1 and Y denotes the length of the string s2, then the edit
distance between s1 and s2 is never more than --------------------
Select one:
Min(X,Y)
None of the Choices
Max(X,Y)
X+Y
Feedback
The correct answer is: Max(X,Y)
Question 10
What is the soundex code for the term “amazing”?
Select one:
A552
A252
A525
A255
Feedback
The correct answer is: A525
Question 11
Given a document collection of 1000 documents which has 110 relevant documents for a
given query and if the IR system retrieves 30 relevant and 15 irrelevant documents, what
is the recall value of the system?
Select one:
0.03
0.27
0.33
0.66
Feedback
The correct answer is: 0.27
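Recall divides the relevant documents retrieved by all relevant documents in the collection; precision divides them by everything retrieved. A small sketch with the numbers from this question (function names are illustrative):

```python
def recall(relevant_retrieved, total_relevant):
    """Fraction of all relevant documents that were retrieved."""
    return relevant_retrieved / total_relevant


def precision(relevant_retrieved, irrelevant_retrieved):
    """Fraction of retrieved documents that are relevant."""
    return relevant_retrieved / (relevant_retrieved + irrelevant_retrieved)


print(round(recall(30, 110), 2))
print(round(precision(30, 15), 2))
```

Here recall is 30/110 and precision is 30/45. The collection size of 1000 is not needed for either measure.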
Question 12
When lemmatization is applied to the term “Destruction”, to which of the following forms
is it reduced?
Select one:
Destruct
Destroy
Destruc
Feedback
The correct answer is: Destroy
Question 13
Variable-size postings lists are used when
Select one:
Less seek time is desired and the corpus is dynamic
Less seek time is desired and the corpus is static
More seek time is desired and the corpus is dynamic
More seek time is desired and the corpus is static
Feedback
The correct answer is: More seek time is desired and the corpus is dynamic
Question 14
Inverted Index Dictionary is sorted by
Select one:
Term frequency
Document Frequency
Term/TermID
DocID
Feedback
The correct answer is: Term/TermID
Question 15
Which of the following is called an extended biword?
Select one:
NXNN
NNX*
NX*N
*NNX
Feedback
The correct answer is: NX*N
Question 16
If the two postings list are of length X and Y , then maximum number of operations needed
for merge is
Select one:
max(X,Y)
X+Y
X*Y
min(X,Y)
Feedback
The correct answer is: X+Y
Question 17
Given the Boolean query (cat OR bat) AND NOT (dog OR mat), which of the following
is the equivalent Disjunctive Normal Form of the query?
Select one:
(cat AND (NOT dog) AND (NOT mat)) OR (cat AND bat AND(NOT dog))
(cat AND (NOT dog) AND (NOT mat)) OR (bat AND (NOT mat) AND(NOT dog))
None of the Choices
(cat AND bat AND (NOT dog)) OR (cat AND bat AND (NOT mat))
Feedback
The correct answer is: (cat AND (NOT dog) AND (NOT mat)) OR (bat AND (NOT mat) AND (NOT dog))
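The equivalence follows from De Morgan's law, NOT (dog OR mat) = (NOT dog) AND (NOT mat), followed by distributing the AND over (cat OR bat). It can also be confirmed by brute force over all 16 truth assignments, as in this sketch:

```python
from itertools import product


def original(cat, bat, dog, mat):
    return (cat or bat) and not (dog or mat)


def dnf(cat, bat, dog, mat):
    return (cat and not dog and not mat) or (bat and not mat and not dog)


# Compare the two forms on every combination of term presence/absence
equivalent = all(
    original(*values) == dnf(*values)
    for values in product([False, True], repeat=4)
)
print(equivalent)
```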
Question 18
If string s1 = filosophi and s2 = philosophy, what is the minimum edit distance
between s1 and s2?
Select one:
3
5
4
2
Feedback
The correct answer is: 3
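The three edits are: substitute f with p, insert h, and substitute the final i with y. The standard dynamic-programming computation of Levenshtein distance (unit cost for insert, delete, and substitute) gives the same value; a minimal sketch:

```python
def edit_distance(s1, s2):
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # substitute, or copy if equal
            ))
        prev = curr
    return prev[-1]


print(edit_distance("filosophi", "philosophy"))
```

The result also respects the general bound from the earlier question: the distance never exceeds the length of the longer string.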
Question 19
Given a document containing the sentence “I left my left bag at my home” the number of
tokens in the sentence is
Select one:
8
6
4
Feedback
The correct answer is: 8
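Tokens count every occurrence, while types count distinct strings. A quick check, assuming simple whitespace tokenization with no case folding or stop-word removal:

```python
sentence = "I left my left bag at my home"

tokens = sentence.split()   # every occurrence counts
types = set(tokens)         # distinct tokens only

print(len(tokens), len(types))
```

"left" and "my" each appear twice, which is why the sentence has 8 tokens but only 6 types.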
Question 20
Given a document collection which has 35 relevant documents, if an IR system retrieves 10
relevant and 13 irrelevant documents, what is the precision value of the system?
Select one:
0.43
0.28
0.33
0.66
Feedback
The correct answer is: 0.43
Question 21
Consider the following documents:
Doc1: new home sales top forecasts
Doc2: home sales rise in july
Doc3: increase in home sales in july
Doc4: july new home sales rise
When the term-document incidence matrix is constructed and the query home AND (new OR
july) is executed on it, the documents retrieved will be
Select one:
Doc1
Doc1, Doc3, Doc4
Doc1, Doc4
Doc1, Doc2, Doc3, Doc4
Feedback
The correct answer is: Doc1, Doc2,Doc3,Doc4
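The query can be evaluated with set operations on the postings of each term: OR is a union and AND is an intersection. A minimal sketch, assuming whitespace tokenization of the four documents:

```python
docs = {
    "Doc1": "new home sales top forecasts",
    "Doc2": "home sales rise in july",
    "Doc3": "increase in home sales in july",
    "Doc4": "july new home sales rise",
}

# term -> set of documents containing it
postings = {}
for doc_id, text in docs.items():
    for term in text.split():
        postings.setdefault(term, set()).add(doc_id)

result = postings["home"] & (postings["new"] | postings["july"])
print(sorted(result))
```

Every document contains "home", and each one also contains "new" or "july", so all four are returned.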
Question 22
Yahoo search engine uses stemming for its Index generation
Select one:
True False
Feedback
The correct answer is 'False'.
Question 23
When stemming is used, it should be used for both indexing and query processing. Select one:
True False
Feedback
The correct answer is 'True'.
Question 24
Boolean Retrieval model maintains the term frequency. Is the statement True or False.
Select one:
True False
Feedback
The correct answer is 'False'.
Question 25
Phrase queries can be solved using N-grams.
Select one:
True False
Feedback
The correct answer is 'False'.
TYCS SEM-6th Information Retrieval (MCQ) Question Bank
1) IR Stands for______________.
a) Information Retrieval
b) Information Retired
c) Inform Retrieval
d) Information Ready
2) Each item in the list is called as______________.
a) Items
b) Posting
c) Query
d) Information
3) The term etr is called a _________-gram in a k-gram wildcard query.
a) 3
b) 4
c) 1
d) 2
4) Documents are searched by _______________ in IR.
a) id
b) docID
c) number
d) #digits
5) SEO stands for _____________ .
a) Search English Optimization
b) Search Engine Optimization
c) Search Engine Operator
d) Search Engine Operation
6) A dictionary is formed by _________________ pairs.
a) Key and Value
b) Value and Number
c) Id and Number
d) Name and code
7) An advantage of a positional index is that it reduces the asymptotic complexity of a postings intersection operation.
A) True
B) False
8) _________ can best be described as a programming model used to develop Hadoop-based applications that can
process massive amounts of data.
A) MapReduce
B) Mahout
C) Oozie
D) All of the mentioned
9) The purpose of the inverse document frequency is to increase the weight of terms with high collection frequency.
A) True
B) False
10) URL Stands for ______________________.
a) Uniform Ravar Location
b) Uniform Resource Locator
c) Uni Resource Locate
d) Uniform Reverse Locator
11) A data structure that maps terms back to the parts of a document in which they occur is called an
A) Postings list
B) Incidence Matrix
C) Dictionary
D) Inverted Index
12) The first large information retrieval research group was formed by ____________ at Cornell in 1960.
a) Gerard Salton
b) Ratan Tata
c) Ramesh Bush
d) Think Roy
13) Input, Purpose and Output are the factors of _________ .
a) Summarization
b) Question Answering
c) Page Rank
d) Personalized Search
14) A deadlock can be broken down by
a) Committing one or more transactions
b) Aborting one or more transactions
c) Rolling back one or more transactions
d) Terminating one or more transactions.
15) NLTK stands for ______________ .
a) Natural Language Toolkit
b) Natural Lang Tool
c) Natural Long Tooltip
d) Nature Language Toolkit
16) Online transaction processing is used because
a) disk is used for storing files
b) it is efficient
c) it can handle random queries.
d) Transactions occur in batches
17) The primary storage medium for storing archival data is
a) floppy disk
b) magnetic disk
c) magnetic tape
d) CD-ROM
18) Organizations have hierarchical structures because
a) it is convenient to do so
b) it is done by every organization
c) specific responsibilities can be assigned for each level
d) it provides opportunities for promotions
19) Spelling correction only depends on ___________ factor.
a) Query
b) term
c) indexpowerd
d) Postings
20) Which of the following are Boolean query operators?
a) +
b) -
c) AND, OR, NOT
d) <<<
21) A computer based information system is needed because
(i) The size of organizations has become large and data is massive
(ii) Timely decisions are to be taken based on available data
(iii) Computers are available
(iv) Difficult to get clerks to process data
a) (ii) and (iii)
b) (i) and (ii)
c) (i) and (iv)
d) (iii) and (iv)
22) Operational information is needed for
a) Day to day operations
b) Meet government requirements
c) Long range planning
d) Short range planning
23) Data by itself is not useful unless
a) It is massive
b) It is processed to obtain information
c) It is collected from diverse sources
d) It is properly stated
24) For taking decisions data must be
a) Very accurate
b) Massive
c) Processed correctly
d) Collected from diverse sources
25) CLEF stands for________
a) Cross Language Evaluation Forum
b) Cross lingual evaluating field
c) Cross Language Evaluating Field
d) Cross Language Evaluating Forum
26) Variable-size postings lists are used when
A) More seek time is desired and the corpus is static
B) Less seek time is desired and the corpus is dynamic
C) Less seek time is desired and the corpus is static
D) More seek time is desired and the corpus is dynamic
27) Best implementation approach for dynamic indexing is
A) Periodic re indexing
B) Using Invalidation bit vector for deleted docs
C) None
D) Using logarithmic merge
28) Structured data allows for
A) Does not depend on data complexity
B) Less complex queries
C) No relationship
D) More complex queries
29) Data is represented in _________________ format in an IR system.
a) Text
b) Image
c) Audio text media
d) Options a, b, c
30) Term-document incidence matrix is
A) Sparse
B) Depends upon the data
C) Dense
D) Cannot predict
31) What is the contiguity hypothesis in vector space classification?
A) Documents from different classes don't overlap
B) Documents in the same class form a contiguous region of space
C) All of the above.
D) Intra cluster similarity is higher than inter-cluster similarity
32) Tactical information is needed for
A) Day to day operations
B) Meet government requirements
C) Long range planning
D) Short range planning
33) Strategic information is required by
a) Middle managers
b) Line managers
c) Top managers
d) All workers
34) A postings list is like an array structure in IR.
a) True
b) False
35) An index that includes sequences of words or terms of variable length that have been extracted from a source
document is called a
a) Phrase Index
b) Biword index
c) Positional index
d) Inverted Index
36) A process to efficiently intersect lists to be able to quickly find documents that contain both terms is referred to as
merging postings lists.
a) True
b) False
37) The formula used to estimate the vocabulary size of a collection is known as:
a) Zipf's law
b) Power law
c) Heaps' law
d) Compression ratio
38) Weighted zone scoring is sometimes referred to as ranked Boolean retrieval.
a) True
b) False
39) In the bag of words model, the exact ordering of terms within the document is both significant and relevant to
processing.
a) True
b) False
40) The number of times that a word or term occurs in a document is called the:
a) Proximity Operator
b) Vocabulary Lexicon
c) Term Frequency
d) Indexing Granularity