4 Hadoop Ecosystem

Uploaded by

Vipul Khandke
Hadoop Ecosystem

CSE412 Big Data and Large Scale Computing
Mr P Balasubramanian
Hadoop Ecosystem …
Cont.

• HBase is a column-oriented NoSQL database.
• It runs on top of HDFS and can store structured, semi-structured, or unstructured data.
• It supports real-time processing and random read/write operations on the data.
• Pig was developed for analysing large datasets and removes the need to hand-write map and reduce functions.
• It consists of two components: Pig Latin and the Pig Engine.
• Pig Latin is a scripting language that is similar to SQL.
• The Pig Engine is the execution engine on which Pig Latin scripts run.
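The HBase bullets above describe a column-oriented store with random reads and writes. The following is a minimal in-memory sketch of that data model, assuming the usual addressing by row key, column family, and column qualifier. The class `ToyTable` and the sample row are hypothetical, and this is not the HBase client API.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for a column-oriented table (not the HBase API)."""

    def __init__(self, families):
        # Column families are fixed when the table is created.
        self.families = set(families)
        # row key -> "family:qualifier" -> value
        self.rows = defaultdict(dict)

    def put(self, row_key, family, qualifier, value):
        """Random write: set one cell, addressed by row key and column."""
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][f"{family}:{qualifier}"] = value

    def get(self, row_key, family, qualifier):
        """Random read: fetch one cell without scanning other rows."""
        return self.rows.get(row_key, {}).get(f"{family}:{qualifier}")


table = ToyTable(families=["info", "stats"])
table.put("user42", "info", "name", "Asha")
table.put("user42", "stats", "logins", 7)
table.put("user42", "stats", "logins", 8)  # overwrite a single cell in place
print(table.get("user42", "stats", "logins"))
```

Each cell is reached directly through its row key, which is what makes single-row reads and updates cheap compared with scanning whole files on HDFS.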

• Hive is a distributed data warehouse system originally developed by Facebook.
• It allows for easy reading, writing, and managing of data stored on HDFS.
• It has its own query language for this purpose, known as Hive Query Language (HQL).
• Hive translates HQL queries into MR jobs, so programmers can write simple HQL queries instead of MR functions.
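To make the link between an HQL query and MR functions concrete, the sketch below spells out in Python the kind of map, shuffle, and reduce work that a query such as `SELECT dept, COUNT(*) FROM employees GROUP BY dept` stands for. The `employees` table and its rows are hypothetical, and the code only illustrates the idea; it is not the plan Hive actually generates.

```python
from collections import defaultdict

# Hypothetical rows of an "employees" table: (name, dept)
employees = [
    ("Asha", "sales"),
    ("Ravi", "engineering"),
    ("Meena", "sales"),
    ("John", "engineering"),
    ("Li", "support"),
]


def map_phase(rows):
    """Emit (dept, 1) for every row, like the map side of a GROUP BY."""
    for _name, dept in rows:
        yield dept, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, like COUNT(*) per group."""
    return {key: sum(values) for key, values in grouped.items()}


counts = reduce_phase(shuffle(map_phase(employees)))
print(counts)
```

With Hive, the one-line query replaces all three hand-written functions.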
Hadoop Ecosystem …
Cont.

• Spark is written in Scala but supports applications written in Java, Python, etc.
• In-memory processing makes it faster and suited to real-time processing.
  - Spark Core is the main execution engine for Spark.
  - Spark SQL allows for querying structured data stored in DataFrames or Hive tables.
  - Spark Streaming enables Spark to handle real-time data.
  - MLlib is a scalable machine learning library for data science tasks.
  - GraphX is a graph computation engine that enables users to interactively build, transform, and reason about graph-structured data at scale.
• Kafka is a distributed streaming platform with built-in partitioning, replication, and fault tolerance.
• It can handle streaming data and allows businesses to analyze data in real time.
• Kafka is suitable for real-time data streaming use cases, such as
clickstream analysis, fraud detection, and real-time analytics.
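The partitioning and replication mentioned for Kafka can be pictured with a toy model: a topic is split into partitions, each partition is an append-only log, and each log is kept in several copies. The sketch below assumes key-based partitioning with a stable hash. `ToyTopic` and its methods are hypothetical names, `crc32` is only a stand-in hash, and none of this is the Kafka client API.

```python
import zlib


class ToyTopic:
    """Toy partitioned, replicated log (not the Kafka client API)."""

    def __init__(self, num_partitions, replication_factor):
        # Each partition holds `replication_factor` identical copies of its log.
        self.replicas = [
            [[] for _ in range(replication_factor)]
            for _ in range(num_partitions)
        ]

    def partition_for(self, key):
        # Stable hash so the same key always maps to the same partition.
        # crc32 is a stand-in; it is not the hash Kafka itself uses.
        return zlib.crc32(key.encode("utf-8")) % len(self.replicas)

    def send(self, key, value):
        """Append to every replica of the chosen partition; return (partition, offset)."""
        p = self.partition_for(key)
        for log in self.replicas[p]:
            log.append((key, value))
        return p, len(self.replicas[p][0]) - 1

    def read(self, partition, replica=0):
        """Read a partition's log from any replica, e.g. after one copy is lost."""
        return list(self.replicas[partition][replica])


topic = ToyTopic(num_partitions=3, replication_factor=2)
p1, off1 = topic.send("user42", "click:/home")
p2, off2 = topic.send("user42", "click:/cart")
print(p1 == p2, off1, off2)
print(topic.read(p1, replica=1))
```

Because both events share a key, they land in the same partition in order, and either replica can serve them, which is the basis of the fault tolerance described above.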
BML 521 Big Data Management and Processing
Mr P Balasubramanian
Apache Pig
• An abstraction over MR.
• A tool/platform used to analyse large sets of data.
• Used with Hadoop, Pig can perform all data manipulation (DML) operations.
• Pig provides a high-level language known as Pig Latin.
• Pig Latin provides various operators, with which programmers can develop their own functions for reading, writing, and processing data.
Apache Pig History
• In 2006, Apache Pig was developed as a research project at Yahoo to create and execute MR jobs on large datasets.
• In 2007, Apache Pig was open sourced via the Apache Incubator.
• In 2008, the first release of Apache Pig came out.
• In 2010, Apache Pig graduated as an Apache top-level project.
Convert to MR
• All Pig Latin scripts are internally converted to MR tasks.
• Apache Pig includes a component, the Pig Engine, that accepts Pig Latin scripts as input and converts those scripts into MapReduce jobs.
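One way to picture this conversion: operators that need all records with the same key in one place (grouping, joining, sorting) force a shuffle, so they mark where a map phase ends and a reduce phase begins. The sketch below is a deliberately simplified illustration of that idea. The function `plan_jobs` and the splitting rule are assumptions made for this example and do not reproduce Pig's actual planner.

```python
# Operators that need all records with the same key brought together,
# which in MapReduce means a shuffle followed by a reduce phase.
SHUFFLE_OPS = {"GROUP", "JOIN", "ORDER", "DISTINCT"}


def plan_jobs(operators):
    """Split a linear list of operator names into toy MapReduce jobs.

    Each job is a dict with a "map" list and a "reduce" list. A new job
    starts when a shuffle operator arrives and the current job already
    has a reduce phase.
    """
    jobs = [{"map": [], "reduce": []}]
    for op in operators:
        current = jobs[-1]
        if op in SHUFFLE_OPS:
            if current["reduce"]:
                # The current job already has a reduce phase: start a new job.
                current = {"map": [], "reduce": []}
                jobs.append(current)
            current["reduce"].append(op)
        elif current["reduce"]:
            # Work after the shuffle runs on the reduce side of the same job.
            current["reduce"].append(op)
        else:
            current["map"].append(op)
    return jobs


script = ["LOAD", "FILTER", "GROUP", "FOREACH", "ORDER", "STORE"]
for number, job in enumerate(plan_jobs(script), start=1):
    print(number, job)
```

In this toy plan the six-operator script becomes two chained jobs, which is why a short script can stand in for several hand-written MR programs.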
Easy !!!
• Programmers can perform MR tasks easily without writing complex code in Java.
• The multi-query approach reduces the length of code.
• An operation that requires 200 LoC in Java can be done with about 10 LoC in Apache Pig.
• Pig reduces development time by almost 16 times.
Pig Latin Code - Wordcount
8 lines of code in Pig Latin
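As a rough illustration of the data flow a word count follows (load, tokenize, group, count), here is a Python sketch. The comments name the Pig Latin operators that each step loosely corresponds to. This is not the 8-line script from the slide, and the input lines are hypothetical.

```python
from collections import defaultdict

# Hypothetical input lines; a real script would LOAD these from HDFS.
lines = ["big data big ideas", "data flows"]

# FOREACH ... GENERATE FLATTEN(TOKENIZE(line)): one record per word
words = [word for line in lines for word in line.split()]

# GROUP words BY word: collect identical words into one bag
groups = defaultdict(list)
for word in words:
    groups[word].append(word)

# FOREACH group GENERATE group, COUNT(words): size of each bag
word_counts = {word: len(bag) for word, bag in groups.items()}

# DUMP: print the result
for word, count in sorted(word_counts.items()):
    print(word, count)
```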

PIG VS MR
Applications of PIG
• To process huge data sources such as web logs.
• To perform data processing for search platforms.
• To process time-sensitive data loads.

Features of PIG
• Rich set of operators: Pig provides operators to perform operations like join, sort, filter, etc.
• Ease of programming: Pig Latin is similar to SQL.
• Optimization opportunities: Tasks in Pig optimize their execution automatically, so programmers can focus on the semantics of the language.
• Extensibility: Using the existing operators, users can develop their own functions to read, process, and write data.
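The filter, join, and sort operators named above can be pictured on small in-memory relations. The Python sketch below uses hypothetical `users` and `orders` data, and each comment names the Pig Latin operator the step loosely mirrors. It shows the semantics only, not how Pig executes these operators on a cluster.

```python
# Hypothetical relations, as lists of dicts
users = [
    {"id": 1, "name": "Asha", "age": 31},
    {"id": 2, "name": "Ravi", "age": 17},
    {"id": 3, "name": "Meena", "age": 45},
]
orders = [
    {"user_id": 1, "amount": 250},
    {"user_id": 3, "amount": 90},
    {"user_id": 1, "amount": 40},
]

# FILTER users BY age >= 18
adults = [u for u in users if u["age"] >= 18]

# JOIN adults BY id, orders BY user_id (inner join)
joined = [
    {**u, **o}
    for u in adults
    for o in orders
    if u["id"] == o["user_id"]
]

# ORDER joined BY amount DESC
ranked = sorted(joined, key=lambda row: row["amount"], reverse=True)

for row in ranked:
    print(row["name"], row["amount"])
```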
Features of PIG
• UDFs: Pig provides the facility to create user-defined functions in other programming languages and invoke or embed them in Pig scripts.
• Handles all kinds of data: both structured and unstructured.
• It stores the results in HDFS.
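At its core, a UDF is an ordinary function applied to one or more fields of each record. The sketch below shows what the body of such a function might look like in Python. `normalize_word` and the sample records are hypothetical, and the steps needed to register the function with Pig and declare its output schema are omitted.

```python
def normalize_word(word):
    """Hypothetical UDF body: trim whitespace and lowercase one field."""
    if word is None:
        return None  # pass missing values through unchanged
    return word.strip().lower()


# Stand-in for applying the function to one field of every record,
# roughly what FOREACH ... GENERATE udf(field) does in a script.
records = [("  Hadoop ",), ("PIG",), (None,)]
cleaned = [(normalize_word(fields[0]),) for fields in records]
print(cleaned)
```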
PIG Architecture and Components
Comparison

Thank You
