1.elasticsearch Introduction Slides


Searching and Analyzing Logs with Elasticsearch

Rajesh Kumar
Overview
A little search engine history and the importance of search
Basic steps involved in indexing and searching documents
The inverted index, the heart of a search engine
An introduction to Elasticsearch and its basic building blocks
Set up and install Elasticsearch on your local machine and check cluster health
What You Need for Learning Elasticsearch
Prerequisites
Familiarity with the command line on a Mac, Linux or Windows machine
Familiarity with using RESTful APIs to perform actions
A very basic understanding of distributed computing
Install and Setup
The latest version of Elasticsearch at the time of these slides, 7.5.1, requires Java version 8
A Mac, Linux or Windows machine on which Elasticsearch can be installed
Overview
Introduction to basic concepts in Elasticsearch, download and install
Building an index, adding documents to it both individually and in bulk
Basic text analysis, including tokenization and filtering
Search queries on an index using the Query DSL
Aggregations: the faceting and analytics workhorse of Elasticsearch
A Brief History of Search
Brief History of Search

1945: Vannevar Bush first talks of the need to index records
1970s: The ARPANet network, which laid the foundation of the modern internet
1991: Tim Berners-Lee combined hypertext, TCP and DNS to imagine the WWW
1993: Primitive search engines, linear search of URLs, very basic ranking
1993: Excite improved search by using statistical analysis of word relationships
1994: Yahoo offered a directory of useful webpages, i.e. a portal
1994: Lycos provided ranking relevance, prefix matching, a huge catalog
1994: Altavista had natural language queries, inbound link checking
1996: Inktomi pioneered the paid inclusion model
1997: ask.com had natural language search, human editors for queries
1998: Google ranked pages based on how many other pages link to them
Today: Google, Bing, Baidu, Naver, Yahoo
How Does Search Work?
What Is the Objective of Search?

Find the most relevant documents for your search terms
Most Relevant Document for Search Terms

Know of the document’s existence: web crawler
Index the document for lookup: inverted index
Know how relevant the document is: scoring
Retrieve ranked by relevance: search
Search Is Not Restricted to The Web
Sites Have Their Own Search

E-commerce, Video, E-learning


The Inverted Index
An inverted index consists of a list of all the unique words that appear
in any document and, for each word, a list of the documents in which
it appears. Elasticsearch builds the inverted index from the documents
that are indexed into it.
The Inverted Index
The inverted index is created using a process called analysis:
- Tokenization
- Filtering
Documents Have Content

Stark: Winter is coming
Baratheon: Ours is the fury
Tyrell: Growing Strong

Tokenize Text into Words

The text is split into words, lowercased, and punctuation is removed:
winter
is
coming
ours
the
fury
growing
strong
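
The three steps named on this slide (split, lowercase, remove punctuation) can be sketched in a few lines of Python. This is only an illustration of the idea, not the analyzer Elasticsearch actually runs; the function name `analyze` and the regular expression are assumptions made for this example.

```python
import re

def analyze(text):
    """Split text into words, lowercase them, and drop punctuation."""
    # \w+ matches runs of letters and digits, so punctuation is discarded.
    return [token.lower() for token in re.findall(r"\w+", text)]

# The three example documents from the slides.
documents = {
    "Stark": "Winter is coming",
    "Baratheon": "Ours is the fury",
    "Tyrell": "Growing Strong",
}

tokens = {name: analyze(text) for name, text in documents.items()}
for name, words in tokens.items():
    print(name, words)
```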
For each word, count the documents it appears in and record which ones:

Word     Frequency  Documents
winter   1          Stark
is       2          Stark, Baratheon
coming   1          Stark
ours     1          Baratheon
the      1          Baratheon
fury     1          Baratheon
growing  1          Tyrell
strong   1          Tyrell
Dictionary sorted so lookup is easy

Word     Frequency  Documents
coming   1          Stark
fury     1          Baratheon
growing  1          Tyrell
is       2          Stark, Baratheon
ours     1          Baratheon
strong   1          Tyrell
the      1          Baratheon
winter   1          Stark
Postings

The list of documents recorded for each word in the dictionary is called its postings.
Search

Each query is answered by looking up its words in the sorted dictionary and reading their postings:
- winter: Stark
- fury: Baratheon
- is: Stark, Baratheon
- coming OR strong: Stark, Tyrell (the union of the two postings)
- fury AND growing: no documents (the two postings have no document in common)
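
The dictionary, postings, and the OR/AND lookups walked through above can be sketched as a small Python program. It is a toy model of the data structure, not how Lucene stores an index; all function names here are made up for the illustration.

```python
import re
from collections import defaultdict

def analyze(text):
    """Split into words, lowercase, drop punctuation."""
    return [t.lower() for t in re.findall(r"\w+", text)]

def build_inverted_index(documents):
    """Map each term to the sorted list of documents containing it."""
    postings = defaultdict(set)
    for name, text in documents.items():
        for term in analyze(text):
            postings[term].add(name)
    # Keep the dictionary sorted by term so lookup is easy.
    return {term: sorted(postings[term]) for term in sorted(postings)}

def search_or(index, *terms):
    """Documents containing at least one of the terms (union of postings)."""
    result = set()
    for term in terms:
        result |= set(index.get(term, []))
    return sorted(result)

def search_and(index, *terms):
    """Documents containing every term (intersection of postings)."""
    sets = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*sets)) if sets else []

documents = {
    "Stark": "Winter is coming",
    "Baratheon": "Ours is the fury",
    "Tyrell": "Growing Strong",
}
index = build_inverted_index(documents)

print(index["winter"])
print(search_or(index, "coming", "strong"))
print(search_and(index, "fury", "growing"))
```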


Searches Using Inverted Indices

Find all words ending with “ong”
Store each word reversed as well: strong becomes gnorts
Search for all words starting with “gno”
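
A minimal sketch of this trick, assuming the reversed words are kept in a sorted list so that a suffix query becomes a prefix scan. The helper names are hypothetical, and a real engine would use its own term dictionary rather than a Python list.

```python
from bisect import bisect_left

def build_reversed_dictionary(terms):
    """Store every term reversed, in sorted order."""
    return sorted(term[::-1] for term in terms)

def words_ending_with(reversed_terms, suffix):
    """Answer a suffix query as a prefix scan over the reversed terms."""
    prefix = suffix[::-1]
    # In a sorted list, all entries sharing a prefix are adjacent.
    start = bisect_left(reversed_terms, prefix)
    matches = []
    for reversed_term in reversed_terms[start:]:
        if not reversed_term.startswith(prefix):
            break
        matches.append(reversed_term[::-1])
    return sorted(matches)

terms = ["coming", "fury", "growing", "is", "ours", "strong", "the", "winter"]
reversed_terms = build_reversed_dictionary(terms)

print(words_ending_with(reversed_terms, "ong"))
print(words_ending_with(reversed_terms, "ing"))
```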


Split words into n-grams for substring search
yours: yo, you, our, ours, urs
Match substrings with n-grams
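
As a rough sketch of substring matching with n-grams: the slide shows grams of mixed lengths, while this example assumes fixed-length trigrams (n = 3) to keep the code short. The function names and the small word list are assumptions for the illustration.

```python
from collections import defaultdict

def ngrams(word, n=3):
    """All contiguous substrings of length n."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def build_ngram_index(words, n=3):
    """Map each n-gram to the set of words that contain it."""
    index = defaultdict(set)
    for word in words:
        for gram in ngrams(word, n):
            index[gram].add(word)
    return index

def substring_search(index, query, n=3):
    """Words containing query (query must be at least n characters long)."""
    grams = ngrams(query, n)
    if not grams:
        return []
    # A matching word must contain every n-gram of the query.
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Verify candidates, since sharing all grams does not prove adjacency.
    return sorted(word for word in candidates if query in word)

words = ["yours", "ours", "fury"]
index = build_ngram_index(words)

print(ngrams("yours"))
print(substring_search(index, "our"))
```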


Geo-hashes for geographical search
Algorithms such as Metaphone for phonetic matching
“Did you mean?” searches use a Levenshtein automaton
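
To give a feel for “Did you mean?” suggestions, the sketch below computes the Levenshtein (edit) distance with plain dynamic programming and compares the query against every dictionary term. Note the substitution: this is not a Levenshtein automaton, which reaches the same answers without scanning the whole dictionary. The function names and the `max_distance` default are assumptions.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def did_you_mean(query, dictionary, max_distance=2):
    """Dictionary terms within max_distance edits of query, closest first."""
    scored = [(levenshtein(query, term), term) for term in dictionary]
    return [term for distance, term in sorted(scored) if 0 < distance <= max_distance]

terms = ["coming", "fury", "growing", "is", "ours", "strong", "the", "winter"]
print(did_you_mean("wintr", terms))
```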
The Inverted Index
Misconceptions

A common misconception is that the inverted index is just a mapping
from words to document IDs.

It also contains more information, such as:
- The number of times the term occurred in the document,
- The length of the document, etc.

This information ultimately helps in determining the relevance of the
documents, and thus the score.
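
A small sketch of postings that carry the extra information mentioned above: the term frequency per document and each document's length. The scoring function is a deliberately naive stand-in (frequency divided by length) and is not the formula Lucene or Elasticsearch uses; all names are hypothetical.

```python
import re
from collections import Counter, defaultdict

def analyze(text):
    """Split into words, lowercase, drop punctuation."""
    return [t.lower() for t in re.findall(r"\w+", text)]

def build_index_with_stats(documents):
    """Return (postings, doc_lengths); postings map term -> {doc: frequency}."""
    postings = defaultdict(dict)
    doc_lengths = {}
    for name, text in documents.items():
        tokens = analyze(text)
        doc_lengths[name] = len(tokens)
        for term, count in Counter(tokens).items():
            postings[term][name] = count
    return dict(postings), doc_lengths

def naive_score(postings, doc_lengths, term, doc):
    """Toy relevance: term frequency normalised by document length."""
    frequency = postings.get(term, {}).get(doc, 0)
    return frequency / doc_lengths[doc]

documents = {
    "Stark": "Winter is coming",
    "Baratheon": "Ours is the fury",
    "Tyrell": "Growing Strong",
}
postings, doc_lengths = build_index_with_stats(documents)

for doc in postings["is"]:
    print(doc, naive_score(postings, doc_lengths, "is", doc))
```

With this toy score, the shorter Stark document ranks above Baratheon for the query "is", which shows why document length is worth storing.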
An inverted index is at the heart of a search engine
Implementing Search
Apache Lucene

The indexing and search library for a high performance, full-text search engine
Open source, free to use, written in Java, ported to other languages
Just like Hadoop in the distributed computing world, Lucene is the nucleus of several technologies built around it:
- Solr: a search server with distributed indexing, load balancing, replication, automated recovery, centralized configuration
- Nutch: web crawling and index parsing
- CrateDB: open source, distributed SQL database
- Elasticsearch: a distributed search and analytics engine which runs on Lucene
Introducing Elasticsearch
Elasticsearch

An open source search and analytics engine, written in Java, built on Apache Lucene
Distributed: Scales to thousands of nodes
High availability: Multiple copies of data
RESTful API: CRUD, monitoring and other operations via simple JSON-based HTTP calls
Powerful Query DSL: Express complex queries simply
Schemaless: Index data without an explicit schema
Elasticsearch

Product catalog, Inventory, Autocomplete
Video clips, Categories, Tags
Courses, Authors, Topics

Mining log data for insights
Price alerting platform
Business analytics and intelligence
Working with Elasticsearch
Elasticsearch Options
Install and Setup

Do not run Elasticsearch as the root user.

https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-install.html
Elasticsearch Ports

Elasticsearch binds to separate ports for HTTP and for the
node/transport APIs. By default:

1. 9200 is for REST (HTTP).
2. 9300 is for node-to-node communication, discovery, and the transport module.
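
Since the REST API listens on port 9200, a health check is just an HTTP GET against that port. The sketch below assumes a node on localhost with security disabled; `/_cluster/health` is the Elasticsearch cluster health endpoint, while the helper names and the abbreviated sample response are written by hand for this example.

```python
import json
from urllib.request import urlopen

def cluster_health_url(host="localhost", port=9200):
    """URL of the cluster health endpoint on the REST port."""
    return f"http://{host}:{port}/_cluster/health"

def summarize_health(body):
    """Turn a cluster health JSON response into a one-line summary."""
    health = json.loads(body)
    return (f"{health['cluster_name']}: {health['status']} "
            f"({health['number_of_nodes']} node(s))")

def fetch_cluster_health(host="localhost", port=9200, timeout=5):
    """Call a running node; raises an error if nothing listens on the port."""
    with urlopen(cluster_health_url(host, port), timeout=timeout) as response:
        return summarize_health(response.read().decode("utf-8"))

# Hand-written, abbreviated sample so the parsing can be tried without a node.
sample = '{"cluster_name": "elasticsearch", "status": "green", "number_of_nodes": 1}'
print(cluster_health_url())
print(summarize_health(sample))
```

With a node running locally, `fetch_cluster_health()` would return the same kind of summary built from the live response.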
Running Elasticsearch

Running Elasticsearch from the command line
Elasticsearch can be started from the command line as follows:
./bin/elasticsearch

Running as a daemon
To run Elasticsearch as a daemon, specify -d on the command line,
and record the process ID in a file using the -p option:
./bin/elasticsearch -d -p pid
Log messages can be found in the $ES_HOME/logs/ directory.

To shut down Elasticsearch, kill the process ID recorded in the pid file:
pkill -F pid
Basic Concepts of Elasticsearch
Near Realtime Search

Very low latency, ~1 second from the time a document is indexed
until it becomes searchable
Node

Single server
Stores your data
Performs indexing
Allows search
Has a unique id and name
Cluster

Collection of nodes
Holds the entire indexed data
Has a unique name
Nodes join a cluster using the cluster name

A cluster is identified by a unique name, which by default is "elasticsearch". This name
is important because a node can only be part of a cluster if the node is set up to join
the cluster by its name.
Document

A whole bunch of documents that need to be indexed so they can be searched
catalog, reviews
titles, description, comments
Types

Documents are divided into categories or types
Index

All of these types of documents make up an index
Index

Collection of similar documents


Identified by name
Any number of indices in a cluster
Multiple indices for groupings
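Type

Logical partitioning of documents
User defined grouping semantics
Documents with the same fields belong to one type
Note: mapping types are deprecated as of Elasticsearch 7.0, where the document APIs use the single type _doc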
Document

Basic unit of information to be indexed
Expressed in JSON
Reside within an index
Assigned to a type within an index
Within an index, you can store as many documents as you want.
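
As a sketch of what a JSON document and its index request look like: in Elasticsearch 7.x a single document is indexed with `PUT /<index>/_doc/<id>`. The index name `products`, the field names, and the helper `index_request` are invented for this illustration; the function only builds the request parts and does not send anything.

```python
import json

def index_request(index, doc_id, document):
    """Return the HTTP method, path, and JSON body for indexing one document."""
    return "PUT", f"/{index}/_doc/{doc_id}", json.dumps(document)

# A hypothetical catalog entry expressed as a JSON-serialisable dict.
product = {
    "title": "Espresso machine",
    "description": "Compact 15-bar pump espresso maker",
    "comments": ["Heats up quickly", "A bit noisy"],
}

method, path, body = index_request("products", 1, product)
print(method, path)
print(body)
```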
Documents in an Index

Too large to fit in the hard disk of one node
Too slow to serve all search requests from one node
Shards

Sharding an index: split the index across multiple nodes in the cluster
Search in parallel on multiple nodes
Replicas

High availability in case a node fails
Scale search volume/throughput by searching multiple replicas
Shards and Replicas

An index can be split into multiple shards
A shard can be replicated zero or more times
Since version 7.0, an index in Elasticsearch has 1 primary shard and 1 replica by default (earlier versions defaulted to 5 primary shards)
Sharding is important for two primary reasons:
1. It allows you to horizontally split/scale your content volume
2. It allows you to distribute and parallelize operations across
shards (potentially on multiple nodes) thus increasing
performance/throughput
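
How documents end up spread across shards can be sketched with the simplified routing rule from the Elasticsearch documentation, shard = hash(routing) % number of primary shards, where the routing value defaults to the document ID. The sketch assumes CRC32 as a stand-in hash (Elasticsearch uses its own hash function) and the helper names are hypothetical.

```python
import zlib
from collections import defaultdict

def shard_for(doc_id, num_primary_shards):
    """Pick a primary shard from a hash of the document ID."""
    # CRC32 is only a stand-in for the hash Elasticsearch uses internally.
    return zlib.crc32(str(doc_id).encode("utf-8")) % num_primary_shards

def distribute(doc_ids, num_primary_shards):
    """Group document IDs by the shard they would be routed to."""
    shards = defaultdict(list)
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_primary_shards)].append(doc_id)
    return dict(shards)

placement = distribute(range(1, 21), num_primary_shards=2)
for shard, ids in sorted(placement.items()):
    print(f"shard {shard}: {len(ids)} documents")
```

Because the shard number depends on the number of primary shards, that number cannot simply be changed on an existing index without moving documents.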
Replication is important for two primary reasons:
1. It provides high availability in case a shard/node fails. For this
reason, it is important to note that a replica shard is never
allocated on the same node as the original/primary shard that it
was copied from.
2. It allows you to scale out your search volume/throughput since
searches can be executed on all replicas in parallel.
An index with two primary shards and one replica can scale out across four nodes
Adjust the number of replicas to balance the load between nodes
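
The four-node claim follows from counting shard copies: 2 primaries plus 1 replica of each gives 4 copies, one per node. The toy allocator below illustrates that count and the rule that a replica never shares a node with its primary. It is a round-robin sketch with made-up node names, not the allocation algorithm Elasticsearch uses.

```python
def allocate(num_primaries, num_replicas, nodes):
    """Round-robin shard copies over nodes; copies of one shard never share a node."""
    copies_per_shard = 1 + num_replicas
    if copies_per_shard > len(nodes):
        # Simplification: a real cluster would leave such replicas unassigned.
        raise ValueError("not enough nodes to separate every copy of a shard")
    allocation = {node: [] for node in nodes}
    slot = 0
    for shard in range(num_primaries):
        for copy in range(copies_per_shard):
            role = "primary" if copy == 0 else "replica"
            # Consecutive slots land on distinct nodes, keeping copies apart.
            allocation[nodes[slot % len(nodes)]].append((shard, role))
            slot += 1
    return allocation

nodes = ["node-1", "node-2", "node-3", "node-4"]
allocation = allocate(num_primaries=2, num_replicas=1, nodes=nodes)
for node, copies in allocation.items():
    print(node, copies)
```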
Summary
Demo 1: Download and install Elasticsearch on your local machine
Demo 2: Configure/install a single-node Elasticsearch cluster
Demo 3: Monitor the health of your cluster using HTTP requests
Learnt a little search engine history and the ubiquitous nature of search
Understood the basic steps involved in indexing and searching documents
Learnt how the inverted index data structure works
Got a brief introduction to Elasticsearch and its building blocks
Set up and installed Elasticsearch on your local machine
