BIG DATA ANALYTICS

UNIT – V : FRAMEWORKS

Frameworks: Applications on Big Data Using Pig and Hive – Data processing operators in Pig – Hive services – HiveQL – Querying Data in Hive – fundamentals of HBase and ZooKeeper – IBM InfoSphere BigInsights and Streams.

1. Applications on Big Data Using Pig and Hive

1. Pig :
Pig is a high-level platform or tool which is used to process large datasets.
It provides a high level of abstraction for processing over MapReduce.
It provides a high-level scripting language, known as Pig Latin, which is used to develop the data analysis code.
Applications :
1. For exploring large datasets, Pig scripting is used.
2. Provides support for ad-hoc queries across large data sets.
3. In the prototyping of processing algorithms for large data sets.
4. Required to process time-sensitive data loads.
5. For collecting large amounts of data in the form of search logs and web crawls.
6. Used where analytical insights are needed using sampling.

2. Hive :
Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
It resides on top of Hadoop to summarize Big Data and makes querying and analyzing easy.
It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Benefits :
1. Ease of use
2. Accelerated initial insertion of data
3. Superior scalability, flexibility, and cost-efficiency
4. Streamlined security
5. Low overhead
6. Exceptional working capacity

3. HBase :
HBase is a column-oriented non-relational database management system that runs on top of the Hadoop Distributed File System (HDFS).
HBase provides a fault-tolerant way of storing sparse data sets, which are common in many big data use cases.
HBase supports writing applications in Apache Avro, REST and Thrift.
Applications :
1. Medical
2. Sports
3. Web
4. Oil and petroleum
5. e-commerce

PIG
1. Introduction to PIG :
2. Pig is a high-level platform or tool which is used to process large datasets.
3. It provides a high level of abstraction for processing over MapReduce.
4. It provides a high-level scripting language, known as Pig Latin, which is used to develop the data analysis code.
5. Pig Latin and Pig Engine are the two main components of the Apache Pig tool.
6. The result of Pig is always stored in HDFS.
7. One limitation of MapReduce is that the development cycle is very long. Writing the mapper and reducer, compiling and packaging the code, submitting the job, and retrieving the output is a time-consuming task.
8. Apache Pig reduces the development time by using the multi-query approach.
9. Pig is beneficial for programmers who are not from Java backgrounds.
10. 200 lines of Java code can be written in only 10 lines using the Pig Latin language.
11. Programmers who have SQL knowledge need less effort to learn Pig Latin.

Execution Modes of Pig :
Apache Pig scripts can be executed in three ways :
Interactive Mode (Grunt shell) :
1. You can run Apache Pig in interactive mode using the Grunt shell.
2. In this shell, you can enter the Pig Latin statements and get the output (using the Dump operator).
Batch Mode (Script) :
3. You can run Apache Pig in Batch mode by writing the Pig Latin script in a single file with the .pig extension.
Embedded Mode (UDF) :
4. Apache Pig provides the provision of defining our own functions (User Defined Functions) in programming languages such as Java and using them in our script.

Comparison of Pig with Databases :

PIG | SQL
Pig Latin is a procedural language. | SQL is a declarative language.
In Apache Pig, the schema is optional. We can store data without designing a schema (values are stored as $01, $02, etc.). | Schema is mandatory in SQL.
The data model in Apache Pig is nested relational. | The data model used in SQL is flat relational.
Apache Pig provides limited opportunity for query optimization. | There is more opportunity for query optimization in SQL.

Grunt :
1. Grunt shell is a shell command.
2. The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts.
3. A Pig script can be executed with the Grunt shell, which is a native shell provided by Apache Pig to execute Pig queries.
4. We can invoke shell commands using sh and fs.
5. Syntax of the sh command :
grunt> sh ls
Syntax of the fs command :
grunt> fs -ls

Pig Latin :
Pig Latin is a data flow language used by Apache Pig to analyze the data in Hadoop.
It is a textual language that abstracts the programming from the Java MapReduce idiom into a notation.
The Pig Latin statements are used to process the data. Each statement is an operator that accepts a relation as an input and generates another relation as an output.
· A statement can span multiple lines.
· Each statement must end with a semicolon.
· It may include expressions and schemas.
· By default, these statements are processed using multi-query execution.

User-Defined Functions :
Apache Pig provides extensive support for User Defined Functions (UDFs).
Using these UDFs, we can define our own functions and use them; a small Java sketch follows.
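For illustration only (not part of the original notes), here is a minimal sketch of a Pig filter UDF written in Java; the package name, class name, and the age-based condition are assumptions.

package myudfs;

import java.io.IOException;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

// A filter UDF returns a Boolean, so it can be used directly in a FILTER statement.
public class IsAdult extends FilterFunc {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return false;                  // treat missing values as "not an adult"
        }
        int age = (Integer) input.get(0);  // the single argument passed to the UDF
        return age >= 18;
    }
}

After packaging the class into a jar (assumed here to be myudfs.jar), it could be registered and used from the Grunt shell, for example:
grunt> REGISTER myudfs.jar;
grunt> adults = FILTER student_details BY myudfs.IsAdult(age);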
The UDF support is provided in six programming languages :
· Java
· Jython
· Python
· JavaScript
· Ruby
· Groovy
For writing UDFs, complete support is provided in Java, and limited support is provided in all the remaining languages.
Using Java, you can write UDFs involving all parts of the processing, like data load/store, column transformation, and aggregation.
Since Apache Pig has been written in Java, UDFs written in the Java language work more efficiently than those written in other languages.
Types of UDFs in Java :
Filter Functions :
· The filter functions are used as conditions in filter statements.
· These functions accept a Pig value as input and return a Boolean value.
Eval Functions :
· The Eval functions are used in FOREACH-GENERATE statements.
· These functions accept a Pig value as input and return a Pig result.
Algebraic Functions :
· The Algebraic functions act on inner bags in a FOREACH-GENERATE statement.
· These functions are used to perform full MapReduce operations on an inner bag.

Apache Hive Architecture :
The figure referenced here (not reproduced) shows the architecture of Apache Hive and its major components.
The major components of Apache Hive are :
1. Hive Client
2. Hive Services
3. Processing and Resource Management
4. Distributed Storage
HIVE CLIENT :
Hive supports applications written in any language like Python, Java, C++, Ruby, etc., using JDBC, ODBC, and Thrift drivers, for performing queries on Hive.
Hence, one can easily write a Hive client application in any language of one's own choice.
Hive clients are categorized into three types :
1. Thrift Clients : The Hive server is based on Apache Thrift, so it can serve requests from a Thrift client.
2. JDBC client : Hive allows Java applications to connect to it using the JDBC driver. The JDBC driver uses Thrift to communicate with the Hive Server.
3. ODBC client : The Hive ODBC driver allows applications based on the ODBC protocol to connect to Hive. Similar to the JDBC driver, the ODBC driver uses Thrift to communicate with the Hive Server.
HIVE SERVICE :
To perform all queries, Hive provides various services like HiveServer2, Beeline, etc.
The various services offered by Hive are :
1. Beeline
2. Hive Server 2
3. Hive Driver
4. Hive Compiler
5. Optimizer
6. Execution Engine
7. Metastore
8. HCatalog
9. WebHCat
PROCESSING AND RESOURCE MANAGEMENT :
Hive internally uses the MapReduce framework as a de facto engine for executing the queries.
MapReduce is a software framework for writing applications that process massive amounts of data in parallel on large clusters of commodity hardware.
A MapReduce job works by splitting data into chunks, which are processed by map and reduce tasks.
DISTRIBUTED STORAGE :
Hive is built on top of Hadoop, so it uses the underlying Hadoop Distributed File System for distributed storage.
Hive Shell :
The Hive shell is the primary way to interact with Hive.
It is a default service in Hive.
It is also called the CLI (command line interface).
The Hive shell is similar to the MySQL shell.
Hive users can run HQL queries in the Hive shell.
In the Hive shell, the up and down arrow keys are used to scroll through previous commands.
HiveQL is case-insensitive (except for string comparisons).
The Tab key will autocomplete (provide suggestions while you type) Hive keywords and functions.
The Hive shell can run in two modes :
Non-Interactive mode :
Non-interactive mode means running a prepared script of Hive commands rather than typing them at the prompt.
The Hive shell can run in non-interactive mode with the -f option.
Example :
$ hive -f script.q, where script.q is a file.
Interactive mode :
Hive can work in interactive mode by directly typing the command "hive" in the terminal.
Example :
$ hive
hive> show databases;
Hive Services :
The following are the services provided by Hive :
· Hive CLI : The Hive CLI (Command Line Interface) is a shell where we can execute Hive queries and commands.
· Hive Web User Interface : The Hive Web UI is just an alternative to the Hive CLI. It provides a web-based GUI for executing Hive queries and commands.
· Hive metastore : It is a central repository that stores all the structure information of the various tables and partitions in the warehouse. It also includes the metadata of each column and its type information, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.
· Hive Server : It is referred to as the Apache Thrift Server. It accepts requests from different clients and provides them to the Hive Driver (a minimal JDBC client sketch follows this list).
· Hive Driver : It receives queries from different sources like the web UI, CLI, Thrift, and the JDBC/ODBC driver. It transfers the queries to the compiler.
· Hive Compiler : The purpose of the compiler is to parse the query and perform semantic analysis on the different query blocks and expressions. It converts HiveQL statements into MapReduce jobs.
· Hive Execution Engine : The optimizer generates the logical plan in the form of a DAG of map-reduce tasks and HDFS tasks. In the end, the execution engine executes the incoming tasks in the order of their dependencies.
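As an aside that is not part of the original notes, here is a minimal sketch of the JDBC path described above: a Java client talks to HiveServer2, which hands the statement to the Hive Driver and Compiler. The connection URL, credentials, and query are assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (shipped with Hive).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Host, port, database, and credentials below are assumed values.
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");

        // The statement is compiled and executed by the Hive services described above.
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery("SHOW TABLES");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        rs.close();
        stmt.close();
        con.close();
    }
}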
MetaStore :
Hive metastore (HMS) is a service that stores Apache Hive and other metadata in a backend RDBMS, such as MySQL or PostgreSQL.
Impala, Spark, Hive, and other services share the metastore.
The connections to and from HMS include HiveServer, Ranger, and the NameNode, which represents HDFS.
Beeline, Hue, JDBC, and Impala shell clients make requests through Thrift or JDBC to HiveServer.
The HiveServer instance reads/writes data to HMS.
By default, redundant HMS instances operate in active/active mode.
The physical data resides in a backend RDBMS, one for HMS.
All connections are routed to a single RDBMS service at any given time.
HMS talks to the NameNode over Thrift and functions as a client to HDFS.
HMS connects directly to Ranger and the NameNode (HDFS), and so does HiveServer.
One or more HMS instances on the backend can talk to other services, such as Ranger.

Comparison with Traditional Database :

RDBMS | HIVE
It is used to maintain the database. | It is used to maintain a data warehouse.
It uses SQL (Structured Query Language). | It uses HQL (Hive Query Language).
Schema is fixed in an RDBMS. | Schema varies in Hive.
Normalized data is stored. | Both normalized and de-normalized data is stored.
Tables in an RDBMS are sparse. | The tables in Hive are dense.
It does not support partitioning. | It supports automatic partitioning.
No partition method is used. | The sharding method is used for partitioning.

HiveQL :
Even though it is based on SQL, HiveQL does not strictly follow the full SQL-92 standard.
HiveQL offers extensions not in SQL, including multitable inserts and create table as select.
HiveQL lacked support for transactions and materialized views, and offered only limited subquery support.
Support for insert, update, and delete with full ACID functionality was made available with release 0.14.
Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce, Tez, or Spark jobs, which are submitted to Hadoop for execution.
Example :
DROP TABLE IF EXISTS docs;
CREATE TABLE docs (line STRING);
This checks whether the table docs exists and drops it if it does, then creates a new table called docs with a single column of type STRING called line.
LOAD DATA INPATH 'input_file' OVERWRITE INTO TABLE docs;
This loads the specified file or directory (in this case "input_file") into the table.
OVERWRITE specifies that the target table to which the data is being loaded is to be re-written; otherwise, the data would be appended.
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\s')) AS word FROM docs) temp
GROUP BY word
ORDER BY word;
The query CREATE TABLE word_counts AS SELECT word, count(1) AS count creates a table called word_counts with two columns: word and count.
This query draws its input from the inner query (SELECT explode(split(line, '\s')) AS word FROM docs) temp.
This inner query serves to split the input words into different rows of a temporary table aliased as temp.
The GROUP BY word clause groups the results based on their keys.
This results in the count column holding the number of occurrences for each word of the word column.
The ORDER BY word clause sorts the words alphabetically.

Tables :
Here are the types of tables in Apache Hive :
Managed Tables :
In a managed table, both the table data and the table schema are managed by Hive.
The data will be located in a folder named after the table within the Hive data warehouse, which is essentially just a file location in HDFS.
By managed or controlled we mean that if you drop (delete) a managed table, then Hive will delete both the schema (the description of the table) and the data files associated with the table.
The default location is /user/hive/warehouse.
The syntax for Managed Tables :
CREATE TABLE IF NOT EXISTS stocks (exchange STRING,
symbol STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;

External Tables :
An external table is one where only the table schema is controlled by Hive.
In most cases, the user will set up the folder location within HDFS and copy the data file(s) there.
This location is included as part of the table definition statement.
When an external table is deleted, Hive will only delete the schema associated with the table.
The data files are not affected.
Syntax for External Tables :
CREATE EXTERNAL TABLE IF NOT EXISTS stocks (exchange STRING,
symbol STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';

Querying Data :
A query is a request for data or information from a database table or a combination of tables.
This data may be generated as results returned by Structured Query Language (SQL) or as pictorials, graphs or complex results, e.g., trend analyses from data-mining tools.
One of several different query languages may be used to perform a range of simple to complex database queries.
SQL, the most well-known and widely used query language, is familiar to most database administrators (DBAs).

User-Defined Functions :
In Hive, users can define their own functions to meet certain client requirements.
These are known as UDFs in Hive.
User-Defined Functions are written in Java for specific modules.
Some UDFs are specifically designed for the reusability of code in application frameworks.
The developer develops these functions in Java and integrates the UDFs with Hive.
During query execution, the developer can directly use the code, and the UDFs will return outputs according to the user-defined tasks, as in the sketch below.
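The following minimal sketch is not from the original notes; it shows what such a Java UDF can look like, written against Hive's simple (one value in, one value out) UDF interface discussed just below. The package name, class name, jar path, and example query are assumptions.

package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A simple-API Hive UDF: it accepts a single input value and produces a single output value.
public final class Lower extends UDF {
    public Text evaluate(final Text s) {
        if (s == null) {
            return null;                              // pass NULLs through unchanged
        }
        return new Text(s.toString().toLowerCase());  // the user-defined transformation
    }
}

Once compiled into a jar, it could be registered and called from HiveQL, for example:
ADD JAR /path/to/my-udfs.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.Lower';
SELECT my_lower(line) FROM docs;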
It will provide high performance in terms of coding and execution.
The general type of UDF will accept a single input value and produce a single output value.
We can use two different interfaces for writing Apache Hive User-Defined Functions :
1. Simple API
2. Complex API

Sorting And Aggregating :
Sorting data in Hive can be achieved by use of a standard ORDER BY clause, but there is a catch.
ORDER BY produces a result that is totally sorted, as expected, but to do so it sets the number of reducers to one, making it very inefficient for large datasets.
When a globally sorted result is not required, and in many cases it is not, you can use Hive's nonstandard extension SORT BY instead.
SORT BY produces a sorted file per reducer.
If you want to control which reducer a particular row goes to, typically so you can perform some subsequent aggregation, you use Hive's DISTRIBUTE BY clause.
Example :
· To sort the weather dataset by year and temperature, in such a way as to ensure that all the rows for a given year end up in the same reducer partition :
hive> FROM records2
    > SELECT year, temperature
    > DISTRIBUTE BY year
    > SORT BY year ASC, temperature DESC;
· Output :
1949 111
1949 78
1950 22
1950 0
1950 -11

MapReduce Scripts in Hive / Hive Scripts :
Similar to any other scripting language, Hive scripts are used to execute a set of Hive commands collectively.
Hive scripting helps us to reduce the time and effort invested in writing and executing the individual commands manually.
Hive scripting is supported in Hive 0.10.0 and higher versions.

Joins and SubQueries :
JOINS :
Join queries can be performed on two tables present in Hive.
Joins are of 4 types; these are :
· Inner join : The records common to both tables will be retrieved by this inner join.
· Left outer join : Returns all the rows from the left table even if there are no matches in the right table.
· Right outer join : Returns all the rows from the right table even if there are no matches in the left table.
· Full outer join : It combines records of both tables based on the JOIN condition given in the query. It returns all the records from both tables and fills in NULL values for the columns missing matching values on either side.
SUBQUERIES :
A query present within a query is known as a subquery.
The main query will depend on the values returned by the subqueries.
Subqueries can be classified into two types :
· Subqueries in the FROM clause
· Subqueries in the WHERE clause
When to use :
· To get a particular value combined from two column values from different tables.
· Dependency of one table's values on other tables.
· Comparative checking of one column's values against other tables.
Syntax :
Subquery in FROM clause :
SELECT <column names 1, 2...n> FROM (SubQuery) <TableName_Main>
Subquery in WHERE clause :
SELECT <column names 1, 2...n> FROM <TableName_Main> WHERE col1 IN (SubQuery);

HBASE
HBase Concepts :
HBase is a distributed column-oriented database built on top of the Hadoop file system.
It is an open-source project and is horizontally scalable.
HBase is a data model that is similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data.
It leverages the fault tolerance provided by the Hadoop File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop File System.
One can store the data in HDFS either directly or through HBase.
Data consumers read/access the data in HDFS randomly using HBase.
HBase sits on top of the Hadoop File System and provides read and write access.

HBase Vs RDBMS :

RDBMS | HBase
It requires SQL (structured query language). | No SQL is required.
It has a fixed schema. | No fixed schema.
It is row-oriented. | It is column-oriented.
It is not scalable. | It is scalable.
It is static in nature. | It is dynamic in nature.
Slower retrieval of data. | Faster retrieval of data.
It follows the ACID (Atomicity, Consistency, Isolation and Durability) property. | It follows the CAP (Consistency, Availability, Partition-tolerance) theorem.
It can handle structured data. | It can handle structured, unstructured as well as semi-structured data.
It cannot handle sparse data. | It can handle sparse data.

Schema Design :
An HBase table can scale to billions of rows and any number of columns based on your requirements.
This table allows you to store terabytes of data in it.
The HBase table supports high read and write throughput at low latency.
A single value in each row is indexed; this value is known as the row key.
The HBase schema design is very different compared to the relational database schema design.
Some of the general concepts that should be followed while designing a schema in HBase (a small client sketch follows this list) :
· Row key : Each table in HBase is indexed on the row key. There are no secondary indices available on the HBase table.
· Atomicity : Avoid designing a table that requires atomicity across all rows. All operations on HBase rows are atomic at the row level.
· Even distribution : Reads and writes should be uniformly distributed across all nodes available in the cluster. Design the row key in such a way that related entities are stored in adjacent rows, to increase read efficacy.
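For illustration only (not part of the original notes), here is a minimal Java sketch of row-key-based access with the HBase client API; the table name, column family, qualifier, and row key are assumptions, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRowKeyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();      // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("students"))) {

            // Write one cell: row key "21_001" (age + id, keeping related rows adjacent),
            // column family "info", qualifier "city".
            Put put = new Put(Bytes.toBytes("21_001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Hyderabad"));
            table.put(put);

            // Random read by row key, the only indexed access path in HBase.
            Result result = table.get(new Get(Bytes.toBytes("21_001")));
            byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
            System.out.println("city = " + Bytes.toString(city));
        }
    }
}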
Zookeeper :
ZooKeeper is a distributed coordination service that also helps to manage a large set of hosts.
Managing and coordinating a service, especially in a distributed environment, is a complicated process, and ZooKeeper solves this problem thanks to its simple architecture and API.
ZooKeeper allows developers to focus on core application logic.
For instance, to track the status of distributed data, Apache HBase uses ZooKeeper.
ZooKeeper can also support a large Hadoop cluster easily.
To retrieve information, each client machine communicates with one of the servers.
It keeps an eye on synchronization as well as coordination across the cluster.
Some of the best Apache ZooKeeper features are :
· Simplicity : It coordinates with the help of a shared hierarchical namespace.
· Reliability : The system keeps performing even if more than one node fails.
· Speed : In cases where reads are more common, it runs with a read-to-write ratio of 10:1.
· Scalability : The performance can be enhanced by deploying more machines.

IBM Big Data Strategy :
IBM, a US-based computer hardware and software manufacturer, had implemented a Big Data strategy, where the company offered solutions to store, manage, and analyze the huge amounts of data generated daily and equipped large and small companies to make informed business decisions.
The company believed that its Big Data and analytics products and services would help its clients become more competitive and drive growth.
Issues :
· Understand the concept of Big Data and its importance to large, medium, and small companies in the current industry scenario.
· Understand the need for implementing a Big Data strategy and the various issues and challenges associated with this.
· Analyze the Big Data strategy of IBM.
· Explore ways in which IBM's Big Data strategy could be improved further.

Introduction to InfoSphere :
InfoSphere Information Server provides a single platform for data integration and governance.
The components in the suite combine to create a unified foundation for enterprise information architectures, capable of scaling to meet any information volume requirements.
You can use the suite to deliver business results faster while maintaining data quality and integrity throughout your information landscape.
InfoSphere Information Server helps your business and IT personnel collaborate to understand the meaning, structure, and content of information across a wide variety of sources.
By using InfoSphere Information Server, your business can access and use information in new ways to drive innovation, increase operational efficiency, and lower risk.

BigInsights :
BigInsights is a software platform for discovering, analyzing, and visualizing data from disparate sources.
The flexible platform is built on an Apache Hadoop open-source framework that runs in parallel on commonly available, low-cost hardware.

Big Sheets :
BigSheets is a browser-based analytic tool included in the InfoSphere BigInsights Console that you use to break large amounts of unstructured data into consumable, situation-specific business contexts.
These deep insights help you to filter and manipulate data from sheets even further.

Intro to Big SQL :
IBM Big SQL is a high performance massively parallel processing (MPP) SQL engine for Hadoop that makes querying enterprise data from across the organization an easy and secure experience.
A Big SQL query can quickly access a variety of data sources including HDFS, RDBMS, NoSQL databases, object stores, and WebHDFS by using a single database connection or single query for best-in-class analytic capabilities.
Big SQL provides tools to help you manage your system and your databases, and you can use popular analytic tools to visualize your data.
Big SQL's robust engine executes complex queries for relational data and Hadoop data.
Big SQL provides an advanced SQL compiler and a cost-based optimizer for efficient query execution.
Combining these with a massively parallel processing (MPP) engine helps distribute query execution across nodes in a cluster.

Data processing operators in Pig

GROUP operator
The GROUP operator is used to group the data in one or more relations. It collects the data having the same key.
Syntax
Given below is the syntax of the group operator.
grunt> Group_data = GROUP Relation_name BY age;
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Apache Pig with the relation name student_details as shown below.
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
Now, let us group the records/tuples in the relation by age as shown below.
grunt> group_data = GROUP student_details by age;
Verification
Verify the relation group_data using the DUMP operator as shown below.
grunt> Dump group_data;
Output
Then you will get output displaying the contents of the relation named group_data as shown below. Here you can observe that the resulting schema has two columns −
· One is age, by which we have grouped the relation.
· The other is a bag, which contains the group of tuples, i.e., the student records with the respective age.

(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)})

You can see the schema of the table after grouping the data using the describe command as shown below.
grunt> Describe group_data;
group_data: {group: int,student_details: {(id: int,firstname: chararray,lastname: chararray,age: int,phone: chararray,city: chararray)}}
In the same way, you can get the sample illustration of the schema using the illustrate command as shown below.
$ Illustrate group_data;
It will produce the following output −
---------------------------------------------------------------------------------------------
|group_data| group:int | student_details:bag{:tuple(id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,city:chararray)}|
---------------------------------------------------------------------------------------------
|          | 21        | {(4, Preethi, Agarwal, 21, 9848022330, Pune), (1, Rajiv, Reddy, 21, 9848022337, Hyderabad)}|
|          | 22        | {(2,siddarth,Battacharya,22,9848022338,Kolkata),(003,Rajesh,Khanna,22,9848022339,Delhi)}|
---------------------------------------------------------------------------------------------

Grouping by Multiple Columns
Let us group the relation by age and city as shown below.
grunt> group_multiple = GROUP student_details by (age, city);
You can verify the content of the relation named group_multiple using the Dump operator as shown below.
grunt> Dump group_multiple;
((21,Pune),{(4,Preethi,Agarwal,21,9848022330,Pune)})
((21,Hyderabad),{(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
((22,Delhi),{(3,Rajesh,Khanna,22,9848022339,Delhi)})
((22,Kolkata),{(2,siddarth,Battacharya,22,9848022338,Kolkata)})
((23,Chennai),{(6,Archana,Mishra,23,9848022335,Chennai)})
((23,Bhuwaneshwar),{(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
((24,Chennai),{(8,Bharathi,Nambiayar,24,9848022333,Chennai)})
((24,trivendram),{(7,Komal,Nayak,24,9848022334,trivendram)})

Group All
You can group a relation by all the columns as shown below.
grunt> group_all = GROUP student_details All;
Now, verify the content of the relation group_all as shown below.
grunt> Dump group_all;
(all,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram),
(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar),
(4,Preethi,Agarwal,21,9848022330,Pune),(3,Rajesh,Khanna,22,9848022339,Delhi),
(2,siddarth,Battacharya,22,9848022338,Kolkata),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
COGROUP operator
The COGROUP operator works more or less in the same way as the GROUP operator. The only difference between the two operators is that the group operator is normally used with one relation, while the cogroup operator is used in statements involving two or more relations.

Grouping Two Relations using Cogroup
Assume that we have two files namely student_details.txt and employee_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

employee_details.txt
001,Robin,22,newyork
002,BOB,23,Kolkata
003,Maya,23,Tokyo
004,Sara,25,London
005,David,23,Bhuwaneshwar
006,Maggy,22,Chennai
And we have loaded these files into Pig with the relation names student_details and employee_details respectively, as shown below.
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
grunt> employee_details = LOAD 'hdfs://localhost:9000/pig_data/employee_details.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, city:chararray);
Now, let us group the records/tuples of the relations student_details and employee_details with the key age, as shown below.
grunt> cogroup_data = COGROUP student_details by age, employee_details by age;
Verification
Verify the relation cogroup_data using the DUMP operator as shown below.
grunt> Dump cogroup_data;
Output
It will produce the following output, displaying the contents of the relation named cogroup_data as shown below.
(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)},{ })
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)},{(6,Maggy,22,Chennai),(1,Robin,22,newyork)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)},{(5,David,23,Bhuwaneshwar),(3,Maya,23,Tokyo),(2,BOB,23,Kolkata)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)},{ })
(25,{ },{(4,Sara,25,London)})
The cogroup operator groups the tuples from each relation according to age, where each group depicts a particular age value.
For example, if we consider the 1st tuple of the result, it is grouped by age 21. And it contains two bags −
· the first bag holds all the tuples from the first relation (student_details in this case) having age 21, and
· the second bag contains all the tuples from the second relation (employee_details in this case) having age 21.
In case a relation does not have tuples having the age value 21, it returns an empty bag.

JOIN Operator
The JOIN operator is used to combine records from two or more relations. While performing a join operation, we declare one (or a group of) tuple(s) from each relation as keys. When these keys match, the two particular tuples are matched, else the records are dropped. Joins can be of the following types −
· Self-join
· Inner-join
· Outer-join − left join, right join, and full join
This chapter explains with examples how to use the join operator in Pig Latin. Assume that we have two files namely customers.txt and orders.txt in the /pig_data/ directory of HDFS as shown below.
customers.txt
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00
orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
And we have loaded these two files into Pig with the relations customers and orders as shown below.
grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
   as (oid:int, date:chararray, customer_id:int, amount:int);
Let us now perform various Join operations on these two relations.

Self - join
Self-join is used to join a table with itself as if the table were two relations, temporarily renaming at least one relation.
Generally, in Apache Pig, to perform self-join, we will load the same data multiple times, under different aliases (names).
Therefore let us load the contents of the file customers.txt as two tables as shown below.
grunt> customers1 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);
grunt> customers2 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);
Syntax
Given below is the syntax of performing the self-join operation using the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key;
Example
Let us perform the self-join operation on the relation customers, by joining the two relations customers1 and customers2 as shown below.
grunt> customers3 = JOIN customers1 BY id, customers2 BY id;
Verification
Verify the relation customers3 using the DUMP operator as shown below.
grunt> Dump customers3;
Output
It will produce the following output, displaying the contents of the relation customers.
(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000)
(2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500)
(3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000)
(4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mumbai,6500)
(5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500)
(6,Komal,22,MP,4500,6,Komal,22,MP,4500)
(7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)

Inner Join
Inner Join is used quite frequently; it is also referred to as equijoin. An inner join returns rows when there is a match in both tables.
It creates a new relation by combining column values of two relations (say A and B) based upon the join-predicate. The query compares each row of A with each row of B to find all pairs of rows which satisfy the join-predicate. When the join-predicate is satisfied, the column values for each matched pair of rows of A and B are combined into a result row.
Syntax
Here is the syntax of performing the inner join operation using the JOIN operator.
grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;
Example
Let us perform the inner join operation on the two relations customers and orders as shown below.
grunt> coustomer_orders = JOIN customers BY id, orders BY customer_id;
Verification
Verify the relation coustomer_orders using the DUMP operator as shown below.
grunt> Dump coustomer_orders;
Output
You will get the following output, which shows the contents of the relation named coustomer_orders.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
Note −
Outer Join : Unlike an inner join, an outer join returns all the rows from at least one of the relations. An outer join operation is carried out in three ways −
· Left outer join
· Right outer join
· Full outer join
Left Outer Join
The left outer join operation returns all rows from the left table, even if there are no matches in the right relation.
Syntax
Given below is the syntax of performing the left outer join operation using the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;
Example
Let us perform the left outer join operation on the two relations customers and orders as shown below.
grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;
Verification
Verify the relation outer_left using the DUMP operator as shown below.
grunt> Dump outer_left;
Output
It will produce the following output, displaying the contents of the relation outer_left.
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)

Right Outer Join
The right outer join operation returns all rows from the right table, even if there are no matches in the left table.
Syntax
Given below is the syntax of performing the right outer join operation using the JOIN operator.
grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;
Example
Let us perform the right outer join operation on the two relations customers and orders as shown below.
grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;
Verification
Verify the relation outer_right using the DUMP operator as shown below.
grunt> Dump outer_right;
Output
It will produce the following output, displaying the contents of the relation outer_right.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)

Full Outer Join
The full outer join operation returns rows when there is a match in one of the relations.
Syntax
Given below is the syntax of performing the full outer join using the JOIN operator.
grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
Example
Let us perform the full outer join operation on the two relations customers and orders as shown below.
grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
Verification
Verify the relation outer_full using the DUMP operator as shown below.
grunt> Dump outer_full;
Output
It will produce the following output, displaying the contents of the relation outer_full.
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)

Using Multiple Keys
We can perform a JOIN operation using multiple keys.
Syntax
Here is how you can perform a JOIN operation on two tables using multiple keys.
grunt> Relation3_name = JOIN Relation1_name BY (key1, key2), Relation2_name BY (key1, key2);
Assume that we have two files namely employee.txt and employee_contact.txt in the /pig_data/ directory of HDFS as shown below.
employee.txt
001,Rajiv,Reddy,21,programmer,003
002,siddarth,Battacharya,22,programmer,003
003,Rajesh,Khanna,22,programmer,003
004,Preethi,Agarwal,21,programmer,003
005,Trupthi,Mohanthy,23,programmer,003
006,Archana,Mishra,23,programmer,003
007,Komal,Nayak,24,teamlead,002
008,Bharathi,Nambiayar,24,manager,001

employee_contact.txt
001,9848022337,[email protected],Hyderabad,003
002,9848022338,[email protected],Kolkata,003
003,9848022339,[email protected],Delhi,003
004,9848022330,[email protected],Pune,003
005,9848022336,[email protected],Bhuwaneshwar,003
006,9848022335,[email protected],Chennai,003
007,9848022334,[email protected],trivendram,002
008,9848022333,[email protected],Chennai,001

And we have loaded these two files into Pig with the relations employee and employee_contact as shown below.
grunt> employee = LOAD 'hdfs://localhost:9000/pig_data/employee.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, designation:chararray, jobid:int);
grunt> employee_contact = LOAD 'hdfs://localhost:9000/pig_data/employee_contact.txt' USING PigStorage(',')
   as (id:int, phone:chararray, email:chararray, city:chararray, jobid:int);
Now, let us join the contents of these two relations using the JOIN operator as shown below.
grunt> emp = JOIN employee BY (id,jobid), employee_contact BY (id,jobid);
Verification
Verify the relation emp using the DUMP operator as shown below.
grunt> Dump emp;
Output
It will produce the following output, displaying the contents of the relation named emp as shown below.
(1,Rajiv,Reddy,21,programmer,113,1,9848022337,[email protected],Hyderabad,113)
(2,siddarth,Battacharya,22,programmer,113,2,9848022338,[email protected],Kolkata,113)
(3,Rajesh,Khanna,22,programmer,113,3,9848022339,[email protected],Delhi,113)
(4,Preethi,Agarwal,21,programmer,113,4,9848022330,[email protected],Pune,113)
(5,Trupthi,Mohanthy,23,programmer,113,5,9848022336,[email protected],Bhuwaneshwar,113)
(6,Archana,Mishra,23,programmer,113,6,9848022335,[email protected],Chennai,113)
(7,Komal,Nayak,24,teamlead,112,7,9848022334,[email protected],trivendram,112)
(8,Bharathi,Nambiayar,24,manager,111,8,9848022333,[email protected],Chennai,111)

CROSS operator
The CROSS operator computes the cross-product of two or more relations. This section explains with an example how to use the cross operator in Pig Latin.
Syntax
Given below is the syntax of the CROSS operator.
grunt> Relation3_name = CROSS Relation1_name, Relation2_name;
Example
Assume that we have two files namely customers.txt and orders.txt in the /pig_data/ directory of HDFS as shown below.
customers.txt
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00
orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
And we have loaded these two files into Pig with the relations customers and orders as shown below.
grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, address:chararray, salary:int);
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
   as (oid:int, date:chararray, customer_id:int, amount:int);
Let us now get the cross-product of these two relations using the cross operator as shown below.
grunt> cross_data = CROSS customers, orders;
Verification
Verify the relation cross_data using the DUMP operator as shown below.
grunt> Dump cross_data;
Output
It will produce the following output, displaying the contents of the relation cross_data.
(7,Muffy,24,Indore,10000,103,2008-05-20 00:00:00,4,2060)
(7,Muffy,24,Indore,10000,101,2009-11-20 00:00:00,2,1560)
(7,Muffy,24,Indore,10000,100,2009-10-08 00:00:00,3,1500)
(7,Muffy,24,Indore,10000,102,2009-10-08 00:00:00,3,3000)
(6,Komal,22,MP,4500,103,2008-05-20 00:00:00,4,2060)
(6,Komal,22,MP,4500,101,2009-11-20 00:00:00,2,1560)
(6,Komal,22,MP,4500,100,2009-10-08 00:00:00,3,1500)
(6,Komal,22,MP,4500,102,2009-10-08 00:00:00,3,3000)
(5,Hardik,27,Bhopal,8500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,101,2009-11-20 00:00:00,2,1560)
(5,Hardik,27,Bhopal,8500,100,2009-10-08 00:00:00,3,1500)
(5,Hardik,27,Bhopal,8500,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(4,Chaitali,25,Mumbai,6500,101,2009-11-20 00:00:00,2,1560)
(4,Chaitali,25,Mumbai,6500,100,2009-10-08 00:00:00,3,1500)
(4,Chaitali,25,Mumbai,6500,102,2009-10-08 00:00:00,3,3000)
(3,kaushik,23,Kota,2000,103,2008-05-20 00:00:00,4,2060)
(3,kaushik,23,Kota,2000,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(2,Khilan,25,Delhi,1500,103,2008-05-20 00:00:00,4,2060)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(2,Khilan,25,Delhi,1500,100,2009-10-08 00:00:00,3,1500)
(2,Khilan,25,Delhi,1500,102,2009-10-08 00:00:00,3,3000)
(1,Ramesh,32,Ahmedabad,2000,103,2008-05-20 00:00:00,4,2060)
(1,Ramesh,32,Ahmedabad,2000,101,2009-11-20 00:00:00,2,1560)
(1,Ramesh,32,Ahmedabad,2000,100,2009-10-08 00:00:00,3,1500)
(1,Ramesh,32,Ahmedabad,2000,102,2009-10-08 00:00:00,3,3000)

UNION operator
The UNION operator of Pig Latin is used to merge the content of two relations. To perform a UNION operation on two relations, their columns and domains must be identical.
Syntax
Given below is the syntax of the UNION operator.
grunt> Relation_name3 = UNION Relation_name1, Relation_name2;
Example
Assume that we have two files namely student_data1.txt and student_data2.txt in the /pig_data/ directory of HDFS as shown below.
Student_data1.txt
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
Student_data2.txt
7,Komal,Nayak,9848022334,trivendram
8,Bharathi,Nambiayar,9848022333,Chennai
And we have loaded these two files into Pig with the relations student1 and student2 as shown below.
grunt> student1 = LOAD 'hdfs://localhost:9000/pig_data/student_data1.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
grunt> student2 = LOAD 'hdfs://localhost:9000/pig_data/student_data2.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
Let us now merge the contents of these two relations using the UNION operator as shown below.
grunt> student = UNION student1, student2;
Verification
Verify the relation student using the DUMP operator as shown below.
grunt> Dump student;
Output
It will display the following output, displaying the contents of the relation student.
(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)
(7,Komal,Nayak,9848022334,trivendram)
(8,Bharathi,Nambiayar,9848022333,Chennai)

SPLIT operator
The SPLIT operator is used to split a relation into two or more relations.
Syntax
Given below is the syntax of the SPLIT operator.
grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation3_name IF (condition2);
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.
student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
Let us now split the relation into two, one listing the employees of age less than 23, and the other listing the employees having the age between 22 and 25.
SPLIT student_details into student_details1 if age<23, student_details2 if (age>22 and age<25);
Verification
Verify the relations student_details1 and student_details2 using the DUMP operator as shown below.
grunt> Dump student_details1;
grunt> Dump student_details2;
Output
It will produce the following output, displaying the contents of the relations student_details1 and student_details2 respectively.
grunt> Dump student_details1;
(1,Rajiv,Reddy,21,9848022337,Hyderabad)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(4,Preethi,Agarwal,21,9848022330,Pune)
grunt> Dump student_details2;
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,23,9848022335,Chennai)
(7,Komal,Nayak,24,9848022334,trivendram)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)

FILTER operator
The FILTER operator is used to select the required tuples from a relation based on a condition.
Syntax
Given below is the syntax of the FILTER operator.
grunt> Relation2_name = FILTER Relation1_name BY (condition);
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
Let us now use the Filter operator to get the details of the students who belong to the city Chennai.
filter_data = FILTER student_details BY city == 'Chennai';
Verification
Verify the relation filter_data using the DUMP operator as shown below.
grunt> Dump filter_data;
Output
It will produce the following output, displaying the contents of the relation filter_data as follows.
(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)

DISTINCT operator
The DISTINCT operator is used to remove redundant (duplicate) tuples from a relation.
Syntax
Given below is the syntax of the DISTINCT operator.
grunt> Relation_name2 = DISTINCT Relation_name1;
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
006,Archana,Mishra,9848022335,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
Let us now remove the redundant (duplicate) tuples from the relation named student_details using the DISTINCT operator, and store it as another relation named distinct_data as shown below.
grunt> distinct_data = DISTINCT student_details;
Verification
Verify the relation distinct_data using the DUMP operator as shown below.
grunt> Dump distinct_data;
Output
It will produce the following output, displaying the contents of the relation distinct_data as follows.
(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)

FOREACH operator
The FOREACH operator is used to generate specified data transformations based on the column data.
Syntax
Given below is the syntax of the FOREACH operator.
grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
Let us now get the id, age, and city values of each student from the relation student_details and store them into another relation named foreach_data using the foreach operator as shown below.
grunt> foreach_data = FOREACH student_details GENERATE id,age,city;
Verification
Verify the relation foreach_data using the DUMP operator as shown below.
grunt> Dump foreach_data;
Output
It will produce the following output, displaying the contents of the relation foreach_data.
(1,21,Hyderabad)
(2,22,Kolkata)
(3,22,Delhi)
(4,21,Pune)
(5,23,Bhuwaneshwar)
(6,23,Chennai)
(7,24,trivendram)
(8,24,Chennai)

Order By
The ORDER BY operator is used to display the contents of a relation in a sorted order based on one or more fields.
Syntax
Given below is the syntax of the ORDER BY operator.
grunt> Relation_name2 = ORDER Relation_name1 BY (ASC|DESC);
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
Let us now sort the relation in a descending order based on the age of the student and store it into another relation named order_by_data using the ORDER BY operator as shown below.
grunt> order_by_data = ORDER student_details BY age DESC;
Verification
Verify the relation order_by_data using the DUMP operator as shown below.
grunt> Dump order_by_data;
Output
It will produce the following output, displaying the contents of the relation order_by_data.
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
(7,Komal,Nayak,24,9848022334,trivendram)
(6,Archana,Mishra,23,9848022335,Chennai)
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(4,Preethi,Agarwal,21,9848022330,Pune)
(1,Rajiv,Reddy,21,9848022337,Hyderabad)

LIMIT operator
The LIMIT operator is used to get a limited number of tuples from a relation.
Syntax
Given below is the syntax of the LIMIT operator.
grunt> Result = LIMIT Relation_name required number of tuples;
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
Now, let us get the first four tuples of the relation student_details and store them into another relation named limit_data using the LIMIT operator as shown below.
grunt> limit_data = LIMIT student_details 4;
Verification
Verify the relation limit_data using the DUMP operator as shown below.
grunt> Dump limit_data;
Output
It will produce the following output, displaying the contents of the relation limit_data as follows.
(1,Rajiv,Reddy,21,9848022337,Hyderabad)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(4,Preethi,Agarwal,21,9848022330,Pune)