ENTITY SEARCH ENGINE: A NEW SEARCH TOOL
Speaker: Tanmay Mondal, MSLIS 2013-2015
Indian Statistical Institute, Bangalore
Documentation Research and Training Centre
Seminar (1) - 2014
Overview
Present Approach
Entity Search
Benefit of Entity Search
Entity & Its Facets
Main Work of ESE
Popular Entity Search
OKKAM-Enabling a Web of Entities
Workflow of Okkam
My Library
References
Present Approach 
● Information is everywhere and is growing exponentially
● A traditional information extraction approach is to scan every
document in a collection
● On the Web, the document collection is the set of all web pages
indexed by a search engine
● Getting pin-pointed information this way is time-consuming for users
Person 
Location 
Organization 
Nationality 
Religion 
Product 
For specific Information 
Phone Number 
Email Address/URL 
Distance 
Date 
Time 
Money
Generic Number
The problem: identifying and linking or grouping different
manifestations of the same real-world object
Web of Documents, Web of Entities
Cluster the records that correspond to the same entity
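A minimal sketch of what "cluster the records that correspond to the same entity" can mean in practice. It assumes each record is a dictionary with a `name` field and that two records describe the same entity when their names match after normalisation; the function names and sample records are hypothetical, and real systems use far richer matching than this.

```python
from collections import defaultdict


def normalise(name: str) -> str:
    """Lower-case, replace punctuation with spaces, and sort the tokens
    so that word order and punctuation do not affect the comparison."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(sorted(cleaned.split()))


def cluster_records(records):
    """Group records whose normalised 'name' field is identical."""
    clusters = defaultdict(list)
    for record in records:
        clusters[normalise(record["name"])].append(record)
    return list(clusters.values())


# Hypothetical records: two manifestations of one person, one organization.
records = [
    {"name": "Ranganathan, S. R.", "source": "catalogue"},
    {"name": "S. R. Ranganathan", "source": "web page"},
    {"name": "IBM", "source": "news"},
]
clusters = cluster_records(records)
for cluster in clusters:
    print([record["name"] for record in cluster])
```

Exact matching on a normalised key is the simplest possible choice; it will miss variants such as initials versus full names.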
Entity Search 
● An entity is any object or thing that can be uniquely identified in
the world
● Entity search better matches search queries against a database containing
hundreds of millions of "entities"
● Each entity is related to many other entities
● The answer entities carry specific information; the task is to identify
the right relationships among the entities
● Semantic or faceted search on entities
Why?
● When people use retrieval systems they are often not searching for
documents or text passages
● Summarization of entities and concepts
● The named entities (persons, organizations, locations, products...) play a
central role in answering such information needs
● At least 20-30% of the queries submitted to Web search engines are simply entities
● ~71% of Web search queries contain named entities
Source: Building Taxonomy of Web Search Intents for Name Entity
Queries, by Xiaoxin Yin & Sarthak Shah
Benefit of Entity Search 
● Entities are often categorized into a taxonomy 
● The user's primary task is often to make a decision
● Results are more structured than in document-based search
● An entity is associated with the same URI across different repositories
● Entity information integration
● More understandable by humans
● Increases precision and saves time
Entity & Its Facets 
● An entity must be distinguishable from other entities. It can be anything,
including abstract things such as diseases or imaginary art.
● Type of an entity refers to a generic class into which the given entity is
classified.
● Attribute refers to a property (predicate) associated with an entity.
● Value refers to the value of an attribute (for a given entity).
● Relation links an entity to other entities and so provides more information about it.
● Example entities: Prof. S. R. Ranganathan is a person; IBM is an organization.
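The facets above (type, attribute, value, relation) can be sketched as a small data structure. This is only an illustration: the class name `Entity`, the `ex:` identifiers, and the `part_of` relation label are assumptions made for the example, not part of any system named in these slides.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    identifier: str                                  # hypothetical unique ID
    entity_type: str                                 # generic class, e.g. "Person"
    attributes: dict = field(default_factory=dict)   # attribute name -> value
    relations: list = field(default_factory=list)    # (relation name, target ID)

    def relate(self, relation: str, other: "Entity") -> None:
        """Record a named relation from this entity to another one."""
        self.relations.append((relation, other.identifier))


ranganathan = Entity("ex:ranganathan", "Person", {"name": "S. R. Ranganathan"})
ibm = Entity("ex:ibm", "Organization", {"name": "IBM"})
isi = Entity("ex:isi", "Organization", {"name": "Indian Statistical Institute"})
drtc = Entity("ex:drtc", "Organization",
              {"name": "Documentation Research and Training Centre"})
drtc.relate("part_of", isi)

print(ranganathan.entity_type, ranganathan.attributes["name"])
print(drtc.relations)
```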
Main Work of ESE 
● Entity Retrieval: entity search engines can return a ranked list of
the entities most relevant to a user query
● Entity Relationship / Fact Mining and Navigation: discovers
interesting relationships and facts about the entities associated with
users' queries
● Prominence Ranking: detects the popularity of an entity and enables
users to browse entities in different categories
● Entity Description Retrieval: retrieves the description blocks for each
entity; information about an object in a web page is generally grouped
together as an object block
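As an illustration of the entity retrieval task, the sketch below returns a ranked list of entity IDs for a keyword query. The scoring rule (number of query terms that appear among an entity's attribute values) is an assumption chosen for brevity, and the `ex:` identifiers and toy entities are hypothetical; production engines use much more elaborate ranking.

```python
def score(query: str, entity: dict) -> int:
    """Count the query terms that occur among the entity's attribute values."""
    query_terms = set(query.lower().split())
    text = " ".join(str(value) for value in entity["attributes"].values()).lower()
    return len(query_terms & set(text.split()))


def search(query: str, entities: list, top_k: int = 3) -> list:
    """Return the IDs of the top_k entities with a non-zero score, best first."""
    scored = [(score(query, entity), entity) for entity in entities]
    matches = [(s, entity) for s, entity in scored if s > 0]
    matches.sort(key=lambda pair: pair[0], reverse=True)  # stable sort keeps input order on ties
    return [entity["id"] for _, entity in matches[:top_k]]


# Hypothetical toy repository.
entities = [
    {"id": "ex:ranganathan", "attributes": {"name": "S. R. Ranganathan", "type": "person"}},
    {"id": "ex:ibm", "attributes": {"name": "IBM", "type": "organization"}},
    {"id": "ex:isi", "attributes": {"name": "Indian Statistical Institute", "type": "organization"}},
]
print(search("ibm organization", entities))
```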
Popular Entity Search 
● Product search: various products such as books, electronics, clothes, etc.
● People search: experts, friends, profiles of famous persons, etc.
● Location search: travel, addresses, businesses, government offices, etc.
Idea about entity search engine
Various ESE 
● Freebase - http://www.freebase.com/
● Sindice - http://sindice.com/
● GeneView - http://bc3.informatik.hu-berlin.de/
● Okkam - http://www.okkam.org/
● WolframAlpha - http://www.wolframalpha.com/
● Yatedo - http://www.yatedo.com/
● GeoNames - http://www.geonames.org/
● DBpedia - http://dbpedia.org/About
● EntityCube - http://entitycube.research.microsoft.com/
etc.
OKKAM-Enabling a Web of Entities 
● Any collection of data and information about any type of entity
published on the Web can be integrated into a single virtual,
decentralized, open knowledge base.
● This leads to a faster, more efficient and more precise way to
deal with the flood of information available on the Web today
Entities should not be multiplied beyond necessity
OKKAM ENS 
● OKKAM ENS (Entity Name System) is for entity search: its storage,
indexing and matching technology was built for finding an entity given
its description
● Every entity (individual, instance, “thing”) is assigned a
global identifier, ideally unique
● A repository of more than 7.5 million entities in a more structured
form
Entity identifiers should not be multiplied beyond necessity
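The idea that identifiers "should not be multiplied beyond necessity" can be sketched as a look-up-or-mint registry: an existing identifier is reused whenever the same description is seen again, and a new one is minted only otherwise. This is not the OKKAM implementation; the class `IdentifierRegistry`, the `http://example.org/entity/` prefix, and the exact-match rule on normalised descriptions are all assumptions for illustration.

```python
import uuid


class IdentifierRegistry:
    """Toy registry that reuses an identifier for a known description."""

    def __init__(self, prefix: str = "http://example.org/entity/"):
        self.prefix = prefix      # hypothetical URI prefix
        self._by_key = {}         # normalised description -> identifier

    @staticmethod
    def _key(description: dict) -> tuple:
        """Normalise a description so that case, spacing and field order do not matter."""
        return tuple(sorted(
            (name.lower(), str(value).strip().lower())
            for name, value in description.items()
        ))

    def identifier_for(self, description: dict) -> str:
        """Return the existing identifier for this description, or mint a new one."""
        key = self._key(description)
        if key not in self._by_key:
            self._by_key[key] = self.prefix + uuid.uuid4().hex
        return self._by_key[key]


registry = IdentifierRegistry()
first = registry.identifier_for({"name": "IBM", "type": "organization"})
second = registry.identifier_for({"type": "Organization", "name": "ibm "})
third = registry.identifier_for({"name": "Okkam"})
print(first == second, first == third)
```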
Project Partners 
● University of Trento, Italy (Coordinator)
● L3S Research Center, Germany 
● SAP Research, Germany 
● Expert System, Italy 
● Elsevier B.V., Netherlands 
● Europe Unlimited SA, Belgium 
● National Microelectronics Application Center (MAC), Ireland 
● Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland 
● DERI Galway, Ireland 
● University of Malaga, Spain 
● INMARK, Spain 
● Agenzia Nazionale Stampa Associata (ANSA), Italy
Sources Of Information 
● Wikipedia: provides lists of countries, cities and members of particular
domains, which are very common in search queries
● GeoNames: contains over 10 million geographical names and consists of
over 9 million unique features, of which 2.8 million are populated places,
plus 5.5 million alternate names
● OkkamDBManager: generic databases such as extranets, online shops or
publishing houses are another important information source for OKKAM
● OkkamManualEntry: new entities can also be inserted manually
Data is extracted from any unstructured source more effectively
Cogito Semantic Technology 
● Semantic analysis engine and complete semantic
network for a complete understanding of text
● Transforms unstructured information into structured
data
● Identifies the most relevant concepts
● Interprets the meaning of texts
● Precisely extracts information
● Automatically connects entities extracted from sources
Sensigrafo 
● Enables the disambiguation of terms
● Allows Cogito to understand the meaning of words and
context
● Extraction of data and metadata
● Used in product development, competitive intelligence, marketing,
finance, media & publishing, oil & gas, life sciences &
pharma, government, telecommunications and many other
activities where knowledge sharing is critical
● More than 1 million concepts and more than 4 million
relationships
Workflow of Okkam
● Storage: a scalable repository of entity profiles, in which billions of
entities are assigned an ID and a profile to distinguish one entity from
another
● Matching: requests from client applications arrive in the form of a bag
of keywords or a collection of name-value pairs (unstructured or
semi-structured queries)
● ID storage and management: stores, maintains and makes available
for reuse IDs (URIs) for anything that is named in a networked
environment
● Lifecycle management: takes care of the evolution of the
repository and of all entity profiles over time
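The matching step accepts two query forms: a bag of keywords or a collection of name-value pairs. The sketch below handles both against a toy profile store. It is an assumption-laden illustration, not Okkam's matching algorithm: the function `match`, the `ex:` identifiers, and the hit-counting rule are hypothetical.

```python
def match(query, profiles: dict) -> list:
    """Rank entity IDs by how many query parts their profile satisfies.

    query is either a list of keywords (unstructured) or a dict of
    name-value pairs (semi-structured).
    """
    results = []
    for entity_id, profile in profiles.items():
        if isinstance(query, dict):
            # Name-value pairs: the named attribute must equal the given value.
            hits = sum(
                1 for name, value in query.items()
                if str(profile.get(name, "")).lower() == str(value).lower()
            )
        else:
            # Bag of keywords: a keyword may appear in any attribute value.
            words = set(" ".join(str(v) for v in profile.values()).lower().split())
            hits = sum(1 for keyword in query if keyword.lower() in words)
        if hits:
            results.append((hits, entity_id))
    results.sort(key=lambda item: (-item[0], item[1]))  # most hits first, then ID
    return [entity_id for _, entity_id in results]


# Hypothetical profiles keyed by entity ID.
profiles = {
    "ex:trento": {"name": "University of Trento", "country": "Italy"},
    "ex:malaga": {"name": "University of Malaga", "country": "Spain"},
}
print(match(["university", "trento"], profiles))
print(match({"country": "Spain"}, profiles))
```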
Entity Query & Matching in Okkam
ISI
Wolfram|Alpha 
● Wolfram|Alpha is an engine for computing answers and 
providing knowledge 
● It generates output by doing computations from its own 
internal knowledge base, instead of searching the 
web and returning links 
● It is an online service that answers factual queries 
directly by computing the answer 
● Its goal is to make all systematic knowledge immediately computable
and accessible to everyone
5 nearest stars
How many newspapers are available in the globe
Overall Difficulties 
● The number of entities could be huge 
● Information Redundancy 
● Information Fragmentation 
● Entity Information Integration 
● A single algorithm for fine-grained
entity matching may not exist
● Store and retrieve using IR based techniques 
● Matching on very large datasets 
● Natural Language Processing
Contd... 
● Few knowledge bases are available
● Multi-domain entities
● Deduplication problem
● Some names and relationships could be incorrect, and the
information may not be up to date
● Name disambiguation is still largely unsolved
● ESEs are at an early stage
Creating knowledge bases from text and unstructured data is the
goal
My Library 
● Entities are for Use
● Each Entity has its own attributes & relation
● Every Entity has its importance
● Save the Time for finding out Entities
● Entities are growing rapidly
References 
1. Statistical Entity Extraction from Web, by Zaiqing Nie, Ji-Rong Wen
and Wei-Ying Ma, Fellow, IEEE
2. State of the art in IE, overview, comparison and analysis, by Stefan
Dumitrescu, PhD Student
3. The Entity Name System: Enabling the Web of Entities, by Heiko
Stoermer, Themis Palpanas, George Giannakopoulos, University of
Trento
4. Hybrid entity clustering using crowds and data, by Jongwuk Lee,
Hyunsouk Cho, Jin-Woo Park, Young-rok Cha, Seung-won Hwang, Zaiqing
Nie, Ji-Rong Wen
5. Supporting Entity Search: A Large-Scale Prototype Search Engine, by
Tao Cheng, Xifeng Yan, Kevin Chen-Chuan Chang
References... 
6. OKKAM: Enabling a Web of Entities, by Paolo Bouquet, Heiko
Stoermer, Daniel Giacomuzzi, University of Trento
7. Entity Data Management in OKKAM, by Themis Palpanas (University of
Trento), Junaid Chaudhry (Ajou University), Periklis Andritsos
(University of Trento), Yannis Velegrakis (University of Trento)
8. SPACE AND TIME ENTITY REPOSITORY, Human-enhanced time-aware
multimedia search, funded by EU07
See: http://issuu.com/cubrikproject/docs/issuu.cubrik.d41.unitn.wp4.v1.0
9. http://api.okkam.org/search/
10. http://www.wolframalpha.com/