
IT17701-DATA ANALYTICS

UNIT 1 - Introduction to Big Data & Hadoop


PART II



Analysis and Reporting
• Reporting is “the process of organizing data into informational
summaries in order to monitor how different areas of a business are
performing.”
• Measuring core metrics and presenting them — whether in an email,
a slide deck, or an online dashboard — falls under this category.
• Analytics is “the process of exploring data and reports in order to
extract meaningful insights, which can be used to better understand
and improve business performance.”





Distributed Computing Challenges
• Multiple computer systems working on a single problem.
• In distributed computing, a single problem is divided into many parts,
and each part is solved by different computers. As long as the
computers are networked, they can communicate with each other to
solve the problem. If done properly, the computers perform like a single
entity.
• The ultimate goal of distributed computing is to maximize performance
by connecting users and IT resources in a cost-effective, transparent and
reliable manner. It also ensures fault tolerance and enables resource
accessibility in the event that one of the components fails.



Challenge No.1 – Heterogeneity

• Heterogeneity – “Describes a system consisting of multiple distinct components”
• In many systems, a software layer known as middleware is used to overcome
heterogeneity by hiding the differences among the underlying components.



Challenge No.2 – Openness

• Openness – “Property of each subsystem to be open for interaction with other systems”
• In an open system, key interfaces are published; once something has been
published, it cannot be taken back or reversed.
• Furthermore, in open distributed systems there is often no central authority,
as different systems may have their own intermediaries.



Challenge No.3 – Security

• The issues surrounding security are those of:
• Confidentiality – protection against disclosure to unauthorized individuals
• Integrity – protection against alteration or corruption
• Availability for the authorized – protection against interference with the
means to access the resources
• To combat these issues, cryptographic techniques such as encryption can help,
but they are not absolute. Denial-of-Service attacks can still occur, where a
server or service is bombarded with false requests, usually by botnets
(zombie computers).
Challenge No.4 – Scalability
• A system is said to be scalable if it can handle the addition of users and
resources without suffering a noticeable loss of performance or increase in
administrative complexity.
• “As the system, number of resources, or users increase, the performance of
the system is not lost and it remains effective in accomplishing its goals.”
• A number of important issues arise as a result of increasing scale, such as
increases in cost and physical resources. It is also important to avoid
performance bottlenecks, for example by using caching and replication.
Scalability has 3 dimensions:
• Size – the number of users and resources to be processed. The associated
problem is overloading.
• Geography – the distance between users and resources. The associated problem
is communication reliability.
• Administration – as the size of the distributed system increases, more of the
system needs to be controlled. The associated problem is an administrative mess.
Challenge No.5 – Fault handling

• Failures are inevitable in any system; some components may stop functioning
while others continue running normally. So naturally we need a way to:
• Detect failures – various mechanisms can be employed, such as checksums
(a small sketch follows this list).
• Mask Failures – retransmit upon failure to receive acknowledgement
• Recover from failures – if a server crashes roll back to previous state
• Build Redundancy – Redundancy is the best way to deal with failures.
It is achieved by replicating data so that if one sub system crashes
another may still be able to provide the required information.
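To make the first two points concrete, here is a minimal, stand-alone Python sketch (not tied to any particular distributed framework; the payload and message are made up for illustration) showing how a checksum detects a corrupted transmission, after which retransmission masks the failure:

import hashlib

def checksum(data: bytes) -> str:
    # Digest sent alongside the data so the receiver can verify it.
    return hashlib.sha256(data).hexdigest()

# Sender side: transmit the payload together with its checksum.
payload = b"block-0042: temperature readings 21.4, 22.0, 21.9"
message = (payload, checksum(payload))

# Receiver side: recompute the checksum and compare to detect corruption.
received_payload, received_digest = message
if checksum(received_payload) != received_digest:
    print("corruption detected -> request retransmission (masking the failure)")
else:
    print("payload received intact")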
Challenge No.6 – Concurrency

• Concurrency issues arise when several clients attempt to request a shared
resource at the same time.
• This is problematic because the outcome of operations on such shared data may
depend on the execution order, and so synchronization is required; a minimal
sketch follows.
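A minimal single-machine Python sketch of the idea (threads standing in for clients; all names are illustrative): without the lock, the interleaving of read-modify-write operations determines the final value, which is exactly the ordering problem synchronization solves.

import threading

counter = 0                # shared resource accessed by several "clients"
lock = threading.Lock()    # synchronization primitive

def client(updates: int) -> None:
    global counter
    for _ in range(updates):
        with lock:         # only one thread updates the shared value at a time
            counter += 1

threads = [threading.Thread(target=client, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With the lock the result is always 400000; without it, interleaved
# read-modify-write operations can silently lose updates.
print(counter)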



Challenge No.7 – Transparency
• A distributed system must be able to offer transparency to its users. As a user of a
distributed system you do not care whether it uses 20 or hundreds of machines, so we
hide this information and present the structure as if it were a normal centralized system.
• Access – hide differences in data representation and how a resource is accessed
• Location – hide where a resource is located
• Migration – hide that a resource may move to another location
• Relocation – hide that a resource may be moved to another location while in use
• Replication – hide that a resource may be copied in several places
• Concurrency – hide that a resource may be shared by several competing users
• Failure – hide the failure and recovery of a resource
• Persistence – hide whether a (software) resource is in memory or on disk



Big Data Analytics
• Big data analytics examines large amounts of data to uncover hidden
patterns, correlations and other insights.
• With today’s technology, it’s possible to analyze your data and get
answers from it almost immediately – an effort that’s slower and less
efficient with more traditional business intelligence solutions.



Why is big data analytics important?
• Big data analytics helps organizations harness their data and use it to identify
new opportunities. That, in turn, leads to smarter business moves, more
efficient operations, higher profits and happier customers.
• Cost reduction. Big data technologies such as Hadoop and cloud-based
analytics bring significant cost advantages when it comes to storing large
amounts of data – plus they can identify more efficient ways of doing business.
• Faster, better decision making. With the speed of Hadoop and in-memory
analytics, combined with the ability to analyze new sources of data, businesses
are able to analyze information immediately – and make decisions based on
what they’ve learned.
• New products and services. With the ability to gauge customer needs and
satisfaction through analytics comes the power to give customers what they
want. Davenport points out that with big data analytics, more companies are
creating new products to meet customers' needs.
HADOOP



Introduction to Hadoop
• History of Hadoop
• What Is Hadoop
• Hadoop Architecture
• Hadoop Services
• Hadoop Ecosystem
• Advantages of Hadoop
• Disadvantages of Hadoop
• Use of Hadoop
• References
• Conclusion
History Of Hadoop
• Hadoop was created by Doug Cutting, who had earlier created Apache Lucene (a
text-search library). Hadoop has its origins in Apache Nutch, an open-source
search engine and crawler started in 2002, which was itself part of the Lucene project.
• In January 2008, Hadoop became its own top-level project at Apache, confirming
its success. By this time, Hadoop was being used by many other companies such as
Yahoo! and Facebook.
• In April 2008, Hadoop broke a world record to become the fastest system to sort a
terabyte of data.
• In a Yahoo! test to process 1 TB of data (1,024 GB):
• Oracle – 3½ days
• Teradata – 4½ days
• Netezza – 2 hours 50 minutes
• Hadoop – 3.4 minutes
WHAT IS HADOOP
• Hadoop is an Apache product: a distributed-system framework for big data.
• Apache Hadoop is an open-source software framework for storage and
large-scale processing of data sets on clusters of commodity hardware.
• Some of the characteristics:
• Open source
• Distributed processing
• Distributed storage
• Reliable
• Economical
• Flexible
Hadoop Framework Modules
• The base Apache Hadoop framework is composed of the following
modules:
• Hadoop Common :– contains libraries and utilities needed by other
Hadoop modules
• Hadoop Distributed File System (HDFS) :– a distributed file-system that
stores data on commodity machines, providing very high aggregate
bandwidth across the cluster
• Hadoop YARN:– a resource-management platform responsible for
managing computing resources in clusters and using them for
scheduling of users' applications
• Hadoop MapReduce:– an implementation of the MapReduce
programming model for large scale data processing.
Framework Architecture



Hadoop Services
Storage
1. HDFS (Hadoop Distributed File System)
a) Horizontally unlimited scalability (no limit on the maximum number of slaves)
b) Block size = 64 MB (older versions), 128 MB (newer versions)
Processing
1. MapReduce (older model)
2. Spark (newer model)





HADOOP ECOSYSTEM
• HDFS -> Hadoop Distributed File System
• YARN -> Yet Another Resource Negotiator
• MapReduce -> Data processing using programming
• Spark -> In-memory Data Processing
• PIG, HIVE-> Data Processing Services using Query (SQL-like)
• HBase -> NoSQL Database
• Mahout, Spark MLlib -> Machine Learning
• Apache Drill -> SQL on Hadoop
• Zookeeper -> Managing Cluster
• Oozie -> Job Scheduling
• Flume, Sqoop -> Data Ingesting Services
• Solr & Lucene -> Searching & Indexing
• Ambari -> Provision, Monitor and Maintain cluster
HDFS – Hadoop Distributed File System
• Hadoop Distributed File System is the core component or you can say,
the backbone of Hadoop Ecosystem.
• HDFS is the one, which makes it possible to store different types of
large data sets (i.e. structured, unstructured and semi structured
data).
• HDFS creates a level of abstraction over the resources, from where we
can see the whole HDFS as a single unit.
• It helps us in storing our data across various nodes and maintaining
the log file about the stored data (metadata).





HDFS
• HDFS has two core components: the NameNode and the DataNodes. The NameNode is
the master node and it does not store the actual data. It contains metadata,
like a log file or a table of contents. Therefore, it requires less storage but
high computational resources.
• On the other hand, all your data is stored on the DataNodes, which therefore
require more storage resources. These DataNodes are commodity hardware (like
your laptops and desktops) in the distributed environment. That is the reason
why Hadoop solutions are very cost effective.
• You always communicate with the NameNode when writing data; it then tells the
client which DataNodes to store and replicate the data on. A small sketch of
loading a file into HDFS follows.
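As a small illustration of this division of labour, here is a hedged sketch that drives the standard hdfs dfs command-line client from Python (the file and directory names are hypothetical):

import subprocess

# Illustrative paths only; assumes the `hdfs` command-line client is installed
# and configured to reach the cluster's NameNode.
local_file = "sales.csv"
hdfs_dir = "/user/analytics/input"

# The client asks the NameNode for metadata; the file's blocks are then
# written to, and replicated across, DataNodes behind the scenes.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", local_file, hdfs_dir], check=True)
subprocess.run(["hdfs", "dfs", "-ls", hdfs_dir], check=True)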



YARN – Yet Another Resource Negotiator
• Consider YARN as the brain of the Hadoop Ecosystem. It performs all
the processing activities by allocating resources and scheduling tasks.
• It has two major components, i.e. Resource Manager and Node
Manager.
• The ResourceManager is the master node of the processing layer. It receives the
processing requests and then passes parts of the requests to the corresponding
NodeManagers, where the actual processing takes place.
• NodeManagers are installed on every DataNode and are responsible for the
execution of tasks on each DataNode.



YARN
• The ResourceManager has two components: the Scheduler and the Applications Manager.
• Scheduler: based on your application's resource requirements, the Scheduler runs
scheduling algorithms and allocates the resources.
• Applications Manager: accepts job submissions, negotiates the containers (i.e. the
DataNode environments where processes execute) for running the application-specific
ApplicationMaster, and monitors progress. ApplicationMasters are daemons that reside
on DataNodes and communicate with containers to execute tasks on each DataNode.
MAPREDUCE

• It is the core component of processing in the Hadoop ecosystem, as it provides the
logic of processing. In other words, MapReduce is a software framework that helps in
writing applications that process large data sets using distributed and parallel
algorithms inside the Hadoop environment.
• In a MapReduce program, Map() and Reduce() are two functions.
• Map function performs actions like filtering, grouping and sorting.
• Reduce function aggregates and summarizes the result produced by map
function.
• The result generated by the Map function is a key-value pair (K, V), which acts as
the input for the Reduce function. A minimal sketch in the Hadoop Streaming style follows.
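The sketch below shows the Map and Reduce roles for a word count, written as they would be for Hadoop Streaming (an assumption made here for brevity; the classic native API is Java). In practice the two functions live in separate scripts passed to the streaming jar via -mapper and -reducer, and they communicate through stdin/stdout.

import sys

def mapper() -> None:
    # Map: split each input line and emit (word, 1) pairs, one per line.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer() -> None:
    # Reduce: input arrives sorted by key, so equal words are adjacent;
    # sum the counts for each word and emit the aggregate.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.strip().rsplit("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")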



APACHE PIG

• Pig has two parts: Pig Latin, the language, and the Pig runtime, the execution
environment. You can think of them like Java and the JVM.
• It supports the Pig Latin language, which has an SQL-like command structure.
• Not everyone comes from a programming background, so Apache Pig relieves them of
writing low-level code. You might be curious to know how?
• Well, there is an interesting fact:
• 10 lines of Pig Latin = approx. 200 lines of MapReduce Java code



APACHE PIG
• But don’t be shocked when I say that at the back end of a Pig job, a MapReduce job
executes.
• The compiler internally converts Pig Latin to MapReduce. It produces a sequential
set of MapReduce jobs, and that’s an abstraction (which works like a black box).
• Pig was initially developed by Yahoo!.
• It gives a platform for building data flows for ETL (Extract, Transform and Load),
and for processing and analyzing huge data sets.
• How does Pig work?
• In Pig, the LOAD command first loads the data. Then we perform various functions on
it, such as grouping, filtering, joining and sorting. At last, we can either dump the
data on the screen or store the result back in HDFS; a small example script follows.
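Here is a hedged sketch of what such a script might look like, driven from Python only for convenience (the data file, field names and paths are made up):

import pathlib
import subprocess

# Illustrative Pig Latin script: LOAD, FILTER, GROUP, then DUMP/STORE,
# mirroring the flow described above.
script = """
logs = LOAD 'access_log.csv' USING PigStorage(',')
       AS (user:chararray, url:chararray, bytes:int);
big  = FILTER logs BY bytes > 1000;
grp  = GROUP big BY user;
hits = FOREACH grp GENERATE group AS user, COUNT(big) AS hits;
DUMP hits;          -- or: STORE hits INTO '/user/analytics/hits';
"""

pathlib.Path("hits.pig").write_text(script)
# Local mode is handy for trying a script before running it on the cluster.
subprocess.run(["pig", "-x", "local", "hits.pig"], check=True)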



APACHE HIVE

• Facebook created HIVE for people who are fluent with SQL.
• Basically, HIVE is a data warehousing component which performs
reading, writing and managing large data sets in a distributed
environment using SQL-like interface.
• HIVE + SQL = HQL



APACHE HIVE
• The query language of Hive is called Hive Query Language (HQL), which is very
similar to SQL.
• It has 2 basic components: the Hive command line and the JDBC/ODBC driver.
• The Hive command-line interface is used to execute HQL commands.
• The Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC)
drivers are used to establish connections from applications to the data store.
A minimal client sketch follows.
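As a small, hedged illustration (one of several client options, assuming HiveServer2 and the third-party PyHive package are available; host, database and table names are hypothetical), an HQL query can be issued from Python like this:

from pyhive import hive  # third-party client that talks to HiveServer2 over Thrift

conn = hive.Connection(host="hive.example.com", port=10000,
                       username="analyst", database="sales")
cursor = conn.cursor()

# HQL reads almost exactly like SQL.
cursor.execute("""
    SELECT region, SUM(amount) AS total
    FROM orders
    GROUP BY region
""")
for region, total in cursor.fetchall():
    print(region, total)

cursor.close()
conn.close()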



APACHE HIVE
• Hive is highly scalable, as it can serve both purposes: large data set processing
(batch query processing) and real-time processing (interactive query processing).
• It supports all primitive data types of SQL.
• You can use predefined functions, or write tailored user-defined functions (UDFs),
to accomplish specific needs.



APACHE MAHOUT
• Mahout provides an environment for creating machine learning
applications which are scalable.
• What is machine learning?
• Machine learning algorithms allow us to build self-learning machines that evolve by
themselves without being explicitly programmed. Based on user behaviour, data patterns
and past experience, they make important future decisions. Machine learning is a
branch of Artificial Intelligence (AI).



APACHE MAHOUT
• What does Mahout do?
• It performs collaborative filtering, clustering and classification. Some people also
consider frequent itemset mining as a Mahout function.
• Collaborative filtering: Mahout mines user behaviours, patterns and characteristics,
and based on these it predicts and makes recommendations to users. The typical use
case is an e-commerce website.
• Clustering: it organizes similar groups of data together; for example, articles can
contain blogs, news, research papers, etc.
• Classification: it means classifying and categorizing data into various
sub-departments; for example, articles can be categorized into blogs, news, essays,
research papers and other categories.
APACHE MAHOUT
• Frequent itemset mining: here Mahout checks which objects are likely to appear
together and makes suggestions if one of them is missing. For example, a cell phone
and a cover are usually bought together; so if you search for a cell phone, it will
also recommend the cover and cases.
• Mahout provides a command line to invoke various algorithms. It has a predefined
library that already contains different built-in algorithms for different use cases.



APACHE SPARK
• Apache Spark is a framework for real-time data analytics in a distributed computing
environment.
• Spark is written in Scala and was originally developed at the University of
California, Berkeley.
• It executes in-memory computations to increase the speed of data processing over
MapReduce.
• It is up to 100x faster than Hadoop for large-scale data processing, by exploiting
in-memory computations and other optimizations. Therefore, it requires more memory
and processing power than MapReduce. A minimal PySpark sketch follows.
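A minimal PySpark word-count sketch (paths and names are illustrative; assumes a Spark/PySpark installation) showing the in-memory style of processing:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

lines = spark.read.text("hdfs:///user/analytics/input/notes.txt")

# Transformations are lazy and intermediate results can be cached in memory,
# which is where Spark's speed advantage over disk-based MapReduce comes from.
words = lines.select(explode(split(col("value"), r"\s+")).alias("word"))
counts = words.groupBy("word").count().cache()

counts.orderBy(col("count").desc()).show(10)
spark.stop()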


APACHE SPARK
• Spark is packed with high-level libraries, including support for R, SQL, Python,
Scala, Java, etc. These standard libraries enable seamless integration in complex
workflows.
• It also offers various services such as MLlib, GraphX, SQL + DataFrames and
Streaming to increase its capabilities.
• This is a very common question in everyone’s mind:
• “Apache Spark: A Killer or Saviour of Apache Hadoop?” – O’Reilly
• The answer to this: it is not an apples-to-apples comparison. Apache Spark fits
best for real-time processing, whereas Hadoop was designed to store unstructured data
and execute batch processing over it. When we combine Apache Spark’s abilities, i.e.
high processing speed, advanced analytics and multiple integration support, with
Hadoop’s low-cost operation on commodity hardware, it gives the best results.
• That is the reason why Spark and Hadoop are used together by many companies for
processing and analyzing their Big Data stored in HDFS.



APACHE HBASE
• HBase is an open-source, non-relational, distributed database. In other words, it is
a NoSQL database.
• It supports all types of data, and that is why it is capable of handling anything and
everything inside a Hadoop ecosystem.
• It is modelled after Google’s BigTable, a distributed storage system designed to cope
with large data sets.
• HBase was designed to run on top of HDFS and provides BigTable-like capabilities.
• It gives us a fault-tolerant way of storing sparse data, which is common in most Big
Data use cases.
• HBase itself is written in Java, whereas HBase applications can be written against
its REST, Avro and Thrift APIs. A minimal client sketch follows.
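A hedged sketch of writing and reading a row (assuming the HBase Thrift server and the third-party happybase package; the table, column-family and row-key names are hypothetical):

import happybase  # third-party client that talks to the HBase Thrift server

connection = happybase.Connection("hbase-thrift.example.com", port=9090)
table = connection.table("customer_emails")

# HBase stores sparse key/value data grouped into column families.
table.put(b"cust-001#msg-42", {
    b"meta:subject": b"Order complaint",
    b"body:text": b"My order arrived damaged ...",
})

# Point lookups by row key stay fast even over very large tables.
row = table.row(b"cust-001#msg-42")
print(row[b"meta:subject"])

connection.close()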
APACHE HBASE
• For better understanding, let us take an example. You have billions of customer
emails and you need to find out the number of customers who have used the word
“complaint” in their emails. The request needs to be processed quickly (i.e. in real
time). So, here we are handling a large data set while retrieving a small amount of
data. HBase was designed for solving these kinds of problems.



APACHE DRILL
• As the name suggests, Apache Drill is used to drill into any kind of data. It is an
open-source application that works with distributed environments to analyze large
data sets.
• It is a replica of Google’s Dremel.
• It supports different kinds of NoSQL databases and file systems, which is a powerful
feature of Drill. For example: Azure Blob Storage, Google Cloud Storage, HBase,
MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Swift, NAS and local files.



APACHE DRILL
• So, basically, the main aim behind Apache Drill is to provide scalability so that we
can process petabytes and exabytes of data efficiently (in minutes).
• The main power of Apache Drill lies in combining a variety of data stores using a
single query.
• Apache Drill basically follows ANSI SQL.
• It has a powerful scalability factor, supporting millions of users and serving their
query requests over large-scale data.



APACHE ZOOKEEPER

• Apache ZooKeeper is the coordinator of any Hadoop job, which includes a combination
of various services in a Hadoop ecosystem.
• Apache Zookeeper coordinates with various services in a distributed
environment.
• Before ZooKeeper, it was very difficult and time consuming to coordinate between
different services in the Hadoop ecosystem. The services earlier had many problems
with interactions, such as sharing common configuration while synchronizing data.
Even when services are configured, changes in their configurations make them complex
and difficult to handle. Grouping and naming were also time-consuming.
• Due to the above problems, ZooKeeper was introduced. It saves a lot of time by
performing synchronization, configuration maintenance, grouping and naming.
• Although it is a simple service, it can be used to build powerful solutions; a
minimal client sketch follows.
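A minimal sketch of the shared-configuration idea using the third-party kazoo client (the ensemble address and znode path are made up):

from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1.example.com:2181")
zk.start()

# A znode holding shared configuration: every service in the cluster reads the
# same value, and ZooKeeper keeps the view consistent.
zk.ensure_path("/app/config")
zk.set("/app/config", b"feature_x=enabled")

value, stat = zk.get("/app/config")
print(value.decode(), "version:", stat.version)

zk.stop()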
APACHE OOZIE
• Consider Apache Oozie as a clock-and-alarm service inside the Hadoop ecosystem. For
Hadoop jobs, Oozie acts as a scheduler: it schedules Hadoop jobs and binds them
together as one logical unit of work.
• There are two kinds of Oozie jobs:
• Oozie workflow: a sequential set of actions to be executed. Think of it as a relay
race, where each athlete waits for the previous one to complete their leg.
• Oozie coordinator: Oozie jobs that are triggered when data is made available. Think
of this as the stimulus-response system in our body: just as we respond to an
external stimulus, an Oozie coordinator responds to the availability of data and
rests otherwise.
APACHE FLUME
• Ingesting data is an important part of the Hadoop ecosystem.
• Flume is a service that helps in ingesting unstructured and semi-structured data
into HDFS.
• It gives us a reliable, distributed solution that helps us in collecting,
aggregating and moving large amounts of data.
• It helps us to ingest online streaming data from various sources such as network
traffic, social media, email messages and log files into HDFS.



APACHE FLUME
• Architecture of Flume:
• A Flume agent ingests the streaming data from various data sources into HDFS. In
the usual architecture diagram, a web server represents the data source; Twitter is
among the famous sources of streaming data.



APACHE FLUME
• The Flume agent has 3 components: source, channel and sink.
• Source: accepts the data from the incoming stream and stores the data in the channel.
• Channel: acts as the local or primary storage; a channel is temporary storage
between the source of the data and the persistent data in HDFS.
• Sink: collects the data from the channel and commits (writes) the data to HDFS
permanently. An illustrative agent configuration follows.
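A hedged sketch of what such an agent's properties file might look like (written from Python here only to keep all the examples in one language; the agent name, paths and capacities are made up):

import pathlib

# Illustrative agent definition wiring the three components described above:
# an exec source tailing a log, a memory channel, and an HDFS sink.
flume_conf = """
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/webserver/access.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs:///user/analytics/weblogs/
a1.sinks.k1.channel = c1
"""

pathlib.Path("weblog-agent.conf").write_text(flume_conf)
# Typically started with something like:
#   flume-ng agent --conf-file weblog-agent.conf --name a1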



APACHE SQOOP
• Sqoop is another data-ingestion service. The major difference between Flume and
Sqoop is that:
• Flume only ingests unstructured or semi-structured data into HDFS.
• Sqoop can import as well as export structured data between RDBMSs or enterprise
data warehouses and HDFS.



APACHE SQOOP

• When we submit a Sqoop command, our main task gets divided into sub-tasks, each
handled internally by an individual Map task. Each Map task imports part of the data
into the Hadoop ecosystem; collectively, all Map tasks import the whole data set.
An example import command follows.
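A hedged example of what an import might look like, invoked from Python (the connection string, credentials, table and target directory are all made up; assumes the sqoop client and a matching JDBC driver are installed on the edge node):

import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com/sales",
    "--username", "etl_user", "-P",        # -P prompts for the password
    "--table", "orders",
    "--target-dir", "/user/analytics/orders",
    "--num-mappers", "4",                  # the import is split across 4 map tasks
], check=True)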
APACHE SQOOP
Export also works in a similar manner. When we submit our job, it is mapped into Map
tasks, which bring chunks of data from HDFS. These chunks are exported to a structured
data destination. Combining all these exported chunks of data, we receive the whole
data set at the destination, which in most cases is an RDBMS.
APACHE SOLR & LUCENE
• Apache Solr and Apache Lucene are the two services used for searching and indexing
in the Hadoop ecosystem.
• Apache Lucene is based on Java and also helps with spell checking.
• If Apache Lucene is the engine, Apache Solr is the car built around it. Solr is a
complete application built around Lucene.
• It uses the Lucene Java search library as its core for search and full indexing.



APACHE AMBARI
• Ambari is an Apache Software Foundation project which aims at making the Hadoop
ecosystem more manageable.
• It includes software for provisioning, managing and monitoring Apache Hadoop clusters.
• Ambari provides:
• Hadoop cluster provisioning:
• It gives us a step-by-step process for installing Hadoop services across a number of hosts.
• It also handles the configuration of Hadoop services over a cluster.
• Hadoop cluster management:
• It provides a central management service for starting, stopping and re-configuring
Hadoop services across the cluster.
• Hadoop cluster monitoring:
• For monitoring health and status, Ambari provides a dashboard.
• The Ambari Alert framework is an alerting service which notifies the user whenever
attention is needed, for example if a node goes down or disk space is low on a node.

Final Remarks
• The Hadoop ecosystem owes its success to the whole developer community; many big
companies such as Facebook, Google, Yahoo! and the University of California
(Berkeley) have contributed their part to increase Hadoop’s capabilities.
• Inside the Hadoop ecosystem, knowledge of one or two tools (Hadoop components) will
not help in building a solution. You need to learn a set of Hadoop components that
work together to build a solution.
• Based on the use cases, we can choose a set of services from the Hadoop ecosystem
and create a tailored solution for an organization.



Hadoop Advantages
• Unlimited data storage
• Server scaling modes: a) vertical scaling, b) horizontal scaling
• High-speed processing
• Processing of all varieties of data:
1. Structured
2. Unstructured
3. Semi-structured



Hadoop Disadvantages
• If the data volume is small, then Hadoop’s speed is poor.
• Limitation of Hadoop data storage: there is obviously a practical limit. Physically,
HDFS block IDs are Java longs, so there can be at most 2^63 of them; with a 64 MB
block size, the maximum capacity is 512 yottabytes.
• Hadoop should be used only for batch processing. A batch process is a background
process with which the user cannot interact.
• Hadoop is not used for OLTP (online transaction processing), which is interactive
with users.



References
• https://www.edureka.co/blog/hadoop-ecosystem
• https://data-flair.training/blogs/hadoop-ecosystem-components/
• https://practicalanalytics.wordpress.com/2011/11/06/explaining-hadoop-to-management-whats-the-big-data-deal/
• http://training.cloudera.com/essentials.pdf
• http://en.wikipedia.org/wiki/Apache_Hadoop
• https://developer.yahoo.com/hadoop/tutorial/module1.html
• http://hadoop.apache.org/
• http://wiki.apache.org/hadoop/FrontPage
THANK YOU

