Apache Pig.
-- Jigar Parekh.
472062.
What is Pig:
• Pig is an open-source, high-level dataflow
system.
• It provides a simple language for queries and
data manipulation called Pig Latin.
• Internally, Pig Latin is compiled into Map-Reduce
jobs that run on Hadoop.
• Similar to an SQL query, the user specifies
the “What” and leaves the “How” to the
underlying processing engine.
Pig in Hadoop Eco System:
Pig sits on top of the Map-Reduce layer.
Pig v/s Map-Reduce:
• MR jobs are written in a compiled language (Java); Pig Latin is a scripting language.
• MR requires Java knowledge; Pig does not, except perhaps to write your own UDFs.
• MR involves lots of hand coding; Pig provides predefined SQL-like functions and lets you extend existing UDFs.
• Users are much more comfortable using MR when dealing with totally unstructured data (images, videos, etc.), which Pig handles poorly.
Who is using Pig:
• 70% of production jobs at Yahoo (tens of
thousands per day)
• Yahoo, Twitter, LinkedIn, Ebay, AOL,…
• Used to
– Process web logs
– Build user behavior models
– Build maps of the web
– Do research on large data sets
Accessing Pig:
• There are two modes in which we can access
Pig:
1) Local Mode: To run Pig in local mode, you
need access to a single machine.
2) Hadoop (Map-Reduce) Mode : To run Pig in
hadoop (map-reduce) mode, you need access to
a Hadoop cluster and HDFS installation.
Ways to Run Pig:
• Grunt Shell: Enter Pig commands manually using Pig’s interactive shell,
Grunt.
e.g: $ pig -x <local or mapreduce>
grunt>
• Script File: Place Pig commands in a script file and run the script.
e.g: $ pig -x <local or mapreduce> my_script.pig
• Embedded Program: Embed Pig commands in a host language and run
the program.
e.g: $ java -cp pig.jar:. Idlocal
$ java -cp pig.jar:.:$HADOOPDIR idhadoop
Note: The '-x mapreduce' flag is optional, since Hadoop mode is the default.
Example: $ pig -x mapreduce is the same as $ pig, and
$ pig -x mapreduce my_script.pig is the same as $ pig my_script.pig.
Data Types:
Simple Types:
  int        Signed 32-bit integer                              10
  long       Signed 64-bit integer                              Data: 10L or 10l; Display: 10L
  float      32-bit floating point                              Data: 10.5F, 10.5f, 10.5e2f or 10.5E2F; Display: 10.5F or 1050.0F
  double     64-bit floating point                              Data: 10.5, 10.5e2 or 10.5E2; Display: 10.5 or 1050.0
  chararray  Character array (string) in Unicode UTF-8 format   hello world
  bytearray  Byte array (blob)
  boolean    Boolean                                            true/false (case insensitive)
Complex Types:
  tuple      An ordered set of fields.                          (19,2)
  bag        A collection of tuples.                            {(19,2), (18,1)}
  map        A set of key-value pairs.                          [name#John,phone#5551212]
Pig Execution:
• Pig scripts/commands follow the pattern as
given below:
Load (Text, CSV, JSON, Hive table) -> Transform (Filter, Group, Sort) -> Store (Dump, Store into HDFS, Hive)
Loading Data in Pig:
• A = LOAD 'student' ;
• file_load = LOAD '/usr/tmp/student.txt' ;
• Z = LOAD 'student' USING PigStorage() AS (name : chararray, age : int, gpa : float);
• A = LOAD 'data' AS (f1 : int, f2 : int, B: bag {T : tuple (t1 : int, t2 : int)});
-- A / file_load / Z here are called Relations.
-- LOAD is the keyword used to load data from HDFS into a Relation for processing / transformation.
-- 'student' is the name of the file or directory, in single quotes. We can give the full path name, or
file_name* to load all files with similar names.
-- USING is a keyword.
-- PigStorage() / TextLoader() / JsonLoader() / HCatLoader(): we need to use the appropriate function in
order for Pig to understand the incoming data. These are case-sensitive.
PigStorage() defaults to TAB-separated data. If the separator differs, we need to specify it between the
parentheses, for example: PigStorage(','), PigStorage('\t')
-- AS is a keyword.
-- (name : chararray, ..) is called the Schema.
Accessing the Relation
• Once the data is loaded into a Relation, there are two ways we can
access the data.
(1) Positional
(2) Schema names.
In the first example (A), the columns need to be accessed by position, since
no schema is defined. The notation starts with $0 for the first column, $1
for the second column, and so on.
In the next example (Z), the schema is defined in terms of column names, so
we can use either the $0, $1 notation or the column names themselves.
grunt> DESCRIBE A;
-- Produces no schema output, since A is schema-less.
grunt> DESCRIBE Z;
Z: {name : chararray, age : int, gpa : float}
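The two access styles can be sketched side by side (output columns are illustrative; the field names come from the Z example above):

```pig
-- Positional access on the schema-less relation A:
A = LOAD 'student';
pos_cols = FOREACH A GENERATE $0, $2;      -- first and third columns

-- Named access on Z, which has a schema:
Z = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
named_cols = FOREACH Z GENERATE name, gpa; -- same result as: GENERATE $0, $2;
```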
Data Transformation in Pig:
• Arithmetic Operators. [+, -, *, /, %, ? :]
• Relational Operators. [filter, group, order,
distinct, load, store, etc]
• Diagnostic Operators. [dump, describe, etc]
• Eval Functions. [count, max, min, concat, etc]
• Math Functions. [round, abs, floor, ceil, etc]
• String Functions. [lower, upper, substring,
trim, etc]
Relational Operators in Pig:
• The ones which are important to know are:
1) FILTER,
2) GROUP BY/ COGROUP BY,
3) LIMIT,
4) ORDER BY,
5) JOIN,
6) DISTINCT,
7) FOREACH GENERATE.
Relational Operator Examples:
-- Filter : It is similar to WHERE clause in SQL.
grunt> A = LOAD 'data' AS (f1 : int, f2 : int, f3 : int);
grunt> X = FILTER A BY f3 == 3;
grunt> Y = FILTER A BY (f1 == 8) OR (f2 == 10);
-- Group / CoGroup : GROUP is used for grouping the tuples in a single relation,
whereas COGROUP is used when we need to group two or more relations.
grunt> A = load 'student' AS (name: chararray, age: int, gpa: float);
grunt> B = GROUP A BY age;
grunt> A = LOAD 'data1' AS (owner : chararray, pet : chararray);
grunt> DUMP A;
(Alice,turtle)
(Alice,goldfish)
(Alice,cat)
(Bob,dog)
(Bob,cat)
grunt> B = LOAD 'data2' AS (friend1 : chararray, friend2 : chararray);
grunt> DUMP B;
(Cindy,Alice)
(Mark,Alice)
(Paul,Bob)
(Paul,Jane)
grunt> X = COGROUP A BY owner, B BY friend2;
Output:
(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
(Jane,{},{(Paul,Jane)})
-- Join : Essentially GROUP and JOIN operators perform similar functions. GROUP
creates a nested set of output tuples while JOIN creates a flat set of output tuples.
• Types of Joins : Inner, Outer (Left, Right, Full), Replicated, Merge, Skewed.
• Examples:
grunt> A = LOAD 'data1' AS (a1:int,a2:int,a3:int); grunt> DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
grunt> B = LOAD 'data2' AS (b1:int,b2:int);
grunt> DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)
grunt> X = JOIN A BY a1, B BY b1;
grunt> DUMP X;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9) …
• FOREACH .. GENERATE: Generates data transformations based on
columns of data.
• Generally it follows after Join, Group, Filter operators or Load, if
you want to work with only a select few columns.
• Example:
grunt> A = LOAD 'data' AS (f1:int,f2:int,f3:int);
grunt> DUMP A;
grunt> Y = FOREACH A GENERATE *; -- Y contains the Relation A as is, with all cols.
grunt> B = GROUP A BY f1;
grunt> DUMP B;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})
grunt> X = FOREACH B GENERATE group, COUNT(A) AS total;
grunt> DUMP X;
(1,1)
(4,2)
(7,1)
(8,2)
Here 'group' is the first col of the
grouped output and is named
implicitly by Pig. It points to the
values 1, 4, 7 and 8.
-- Limit: Limits the number of output tuples. If the
specified number of output tuples is equal to or
exceeds the number of tuples in the relation, all
tuples in the relation are returned.
Example: grunt> X = LIMIT A 3;
grunt> DUMP X;
(1,2,3)
(4,3,3)
(7,2,5)
Note: For Top N analysis, use ORDER BY (asc or desc)
and then Limit the output.
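The Top-N pattern described in the note might look like this (reusing the a1..a3 schema from the earlier examples):

```pig
-- Top 3 tuples by a3, descending:
A = LOAD 'data' AS (a1:int, a2:int, a3:int);
sorted = ORDER A BY a3 DESC;
top3   = LIMIT sorted 3;
DUMP top3;
```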
-- Distinct : Removes duplicate tuples in a relation.
grunt> A = LOAD 'data' AS (a1:int,a2:int,a3:int);
grunt> DUMP A;
(8,3,4)
(1,2,3)
(4,3,3)
(4,3,3)
(1,2,3)
grunt> X = DISTINCT A;
grunt> DUMP X;
(1,2,3)
(4,3,3)
(8,3,4)
-- Order By : Sorts a relation based on one or more fields. ORDER BY is NOT stable; if multiple
records have the same ORDER BY key, the order in which these records are returned is not
defined and is not guaranteed to be the same from one run to the next.
grunt> DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
grunt> X = ORDER A BY a3 DESC;
grunt> DUMP X;
(7,2,5)
(8,3,4)
(1,2,3)
(4,3,3)
(8,4,3)
(4,2,1)
Arithmetic Operators in Pig:
• We have the standard arithmetic operators which
Pig uses. They are:
1) Addition (+)
2) Subtraction (-)
3) Multiplication (*)
4) Division (/)
5) Modulo (%)
6) Bincond (? :) [(condition ? value_if_true : value_if_false)]
7) Case (CASE WHEN THEN ELSE END)
• Examples:
grunt> X = FOREACH A GENERATE f1, f2, f1+f2 AS f4;
grunt> X = FOREACH A GENERATE f2, (f2==1 ? 1: f3);
grunt> X = FOREACH A GENERATE f2,
( CASE WHEN f2 % 2 == 0 THEN 'even'
WHEN f2 % 2 == 1 THEN 'odd'
END );
• The above CASE statement can be written as :
grunt> X = FOREACH A GENERATE f2,
( CASE f2 % 2 WHEN 0 THEN 'even'
WHEN 1 THEN 'odd'
END );
Math Functions in Pig:
• Abs : Returns the absolute value of an expression.
Example: abs(int a), abs(float b)
• Ceil : Returns the value of the expression rounded up to
the nearest integer.
Example: ceil(4.6), ceil(1.0), ceil(-2.4)
• Floor : Returns the value of the expression rounded down
to the nearest integer.
Example: floor(4.6), floor(1.0), floor(-2.4)
• Round : Returns the value of an expression rounded to an
integer.
Example: round(4.6), round(1.0), round(-2.4)
• SQRT : Returns the positive square root of an expression.
Example: SQRT(5)
String Functions in Pig:
• Lower / Upper : Converts all characters in a string to lower /
upper case.
• LTRIM / RTRIM / TRIM : Returns a copy of a string with
leading / trailing / or both, white space removed.
• SUBSTRING : Returns a substring from a given string.
Syntax : SUBSTRING(string, startIndex, stopIndex)
Example : SUBSTRING('ABCDEF',1,4) => 'BCD'. The start
index is 0-based, and the stop index should be one past the last
char we want.
• REPLACE : Replaces existing characters in a string with new
characters.
Syntax : REPLACE(string, 'oldChar', 'newChar');
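A small sketch combining the string functions above (the input file and the field name raw are hypothetical):

```pig
-- One chararray column of raw names.
names = LOAD 'names' AS (raw:chararray);
clean = FOREACH names GENERATE
          TRIM(raw)               AS trimmed,   -- leading/trailing whitespace removed
          UPPER(raw)              AS upper_raw, -- all characters upper-cased
          SUBSTRING(raw, 0, 3)    AS prefix,    -- characters at index 0, 1 and 2
          REPLACE(raw, 'Mr', 'Dr') AS retitled; -- 'Mr' replaced with 'Dr'
```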
Eval Functions in Pig:
• Usually the Eval functions operate on the 'bag' datatype, so we need to
GROUP BY before applying the functions.
• Count / Count_Star : Computes the number of elements in a bag. The
COUNT function ignores nulls. If you want to include NULL values in the
count computation, use COUNT_STAR. The output datatype will always be
of type Long.
Example : DUMP B;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})
X = FOREACH B GENERATE COUNT(A);
DUMP X;
(1L)
(2L)
(1L)
(2L)
• Min / Max : Computes the minimum / maximum of the numeric values or chararrays in a single-column bag. In
the below example the single-column is GPA.
Example :
A = LOAD 'student' AS (name:chararray, session:chararray, gpa:float);
DUMP A;
(John,fl,3.9F)
(John,wt,3.7F)
(John,sp,4.0F)
(John,sm,3.8F)
(Mary,fl,3.8F)
(Mary,wt,3.9F)
(Mary,sp,4.0F)
(Mary,sm,4.0F)
B = GROUP A BY name;
DUMP B;
(John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)})
(Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)})
X = FOREACH B GENERATE group, MAX(A.gpa);
DUMP X;
(John,4.0F)
(Mary,4.0F)
C = FOREACH B GENERATE A.name, AVG(A.gpa);
DUMP C;
({(John),(John),(John),(John)},3.850000023841858)
({(Mary),(Mary),(Mary),(Mary)},3.925000011920929)
Storing Data from Pig :
• Store functions determine how the data comes out of Pig.
• PigStorage() :
1. Stores data in UTF-8 format.
2. PigStorage is the default function for the STORE operator
and works with both simple and complex data types.
3. PigStorage supports structured text files (in human-
readable UTF-8 format).
4. The default field delimiter is tab ('\t'). You can also specify
other characters as delimiters, within single quotes.
Example : STORE X INTO 'output' USING PigStorage('*');
• HCatStorer() :
1. HCatStorer is used with Pig scripts to write data to HCatalog-
managed tables (read: Hive).
2. To bring in the appropriate jars for working with HCatalog, simply
include the following flag / parameters when running Pig from
the shell:
pig -useHCatalog
3. The fully qualified package name is:
org.apache.hive.hcatalog.pig.HCatStorer
Example :
STORE processed_data INTO 'tablename' USING
org.apache.hive.hcatalog.pig.HCatStorer();
A = LOAD 'tablename' USING
org.apache.hive.hcatalog.pig.HCatLoader();
Link :
https://cwiki.apache.org/confluence/display/Hive/HCatalog+LoadStore
User Defined Functions ( UDF ) :
• If a requirement cannot be fulfilled by the already existing
operators / functions, then the user has the option of writing
their own.
• Pig provides extensive support for user defined functions
(UDFs) as a way to specify custom processing.
• Pig UDFs can currently be implemented in three languages:
Java, Python, and JavaScript.
• You can customize all parts of the processing including data
load/store, column transformation, and aggregation.
• Pig also provides support for Piggy Bank, a repository for Java
UDFs. Through Piggy Bank you can access Java UDFs written
by other users, and also contribute Java UDFs that you
have written.
• Please explore the Piggy Bank option before writing your own
function, as someone might already have coded it.
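As a hedged sketch, a Python (Jython) UDF might look like the following. The script name and function are hypothetical; the outputSchema decorator is supplied by Pig's Jython runtime when the script is registered, so a no-op fallback is defined here only so the file can also be imported outside Pig:

```python
# myudfs.py -- a hypothetical Pig UDF script.
try:
    outputSchema  # injected by Pig's Jython runtime at registration time
except NameError:
    # Fallback so this file can be imported outside Pig; it just returns
    # the function unchanged.
    def outputSchema(schema):
        def wrap(f):
            return f
        return wrap

@outputSchema('len:int')
def str_len(s):
    """Return the length of a chararray, or 0 for null input."""
    if s is None:
        return 0
    return len(s)

# In a Pig script, this UDF would be registered and called like:
#   REGISTER 'myudfs.py' USING jython AS myudfs;
#   B = FOREACH A GENERATE myudfs.str_len(name);
```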
Pig Example:
Word Count in Pig:
lines = LOAD '/user/hadoop/HDFS_File.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
(a,2)
(is,2)
(This,1)
(class,1)
(hadoop,2)
(bigdata,1)
(technology,1)
TOKENIZE:
({(This),(is),(a),(hadoop),(class)})
({(hadoop),(is),(a),(bigdata),(technology)})
Flatten :
(This)
(is)
(a)
(hadoop)
(class) ….
Summary :
• Pig is an open-source high-level language.
• It sits above Map-Reduce to simplify coding.
• Three main blocks of processing data :
– Load
– Transform
– Store.
• Pig can Load and Store from different sources
like DFS, Hive, etc.
• User can write UDFs to extend the functionality.
References :
• Pig Manual :
https://pig.apache.org/docs/r0.7.0/index.html
• Books :
– Programming Pig by O'Reilly
Thank You!