Pig- piggy bank in Big Data Analytics.ppt.pptx

PIG - COMPLEX DATA TYPES - PIGGY BANK -
USER-DEFINED FUNCTIONS - PARAMETER SUBSTITUTION
Presented by,
M. Varshini,
M.Sc. (Computer Science),
Department of Computer Science,
Nadar Saraswathi College of Arts
and Science (Autonomous).

TABLE OF CONTENTS:
 PIG
 COMPLEX DATA TYPES
 PIGGY BANK
 USER DEFINED FUNCTIONS
 PARAMETER SUBSTITUTION

PIG:
 In big data, Pig is a high-level platform for processing
and analyzing large datasets within the
Hadoop ecosystem.
 It uses a scripting language called Pig Latin to abstract
away the complexity of low-level MapReduce jobs,
allowing developers to write data analysis programs
more efficiently.
 Pig is particularly useful for tasks like ETL (extract,
transform, load), research on raw data, and iterative
processing.

COMPLEX DATA TYPES:
 Apache Pig is widely used in Big Data Analytics for
processing huge volumes of data on Hadoop.
 Pig Latin provides features for working with complex
or nested data structures, which flat relational tables
do not represent directly.
 Pig supports three complex data types:
TUPLE
BAG
MAP

1. TUPLE:
 A tuple is an ordered set of fields. It resembles a row in
a spreadsheet or a record in a table.
Characteristics:
 Fields can be of any type: simple (int, chararray) or
complex.
 Order matters.
 Fields are stored by position. Field names come from
an optional schema, and a field can always be
referenced by position ($0, $1, ...).
Syntax:
(id, name, age)

2. BAG:
 A bag is an unordered collection of tuples.
Think of it as a folder containing multiple documents
(tuples).
Characteristics:
 Duplicate tuples allowed.
 Bags can be nested.
 Ideal for representing 1-to-many relationships.
Syntax:
{ (tuple1), (tuple2), (tuple3) }

3. MAP:
 A map stores data as key-value pairs, written
key#value.
Characteristics:
 Keys must be strings (chararray).
 Values can be any type (int, tuple, bag).
 Values are looked up by key with the # operator,
e.g. details#'city'.
Syntax:
['key1'#value1, 'key2'#value2]
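
As a rough analogy (this is Python, not Pig Latin), the three complex types correspond to familiar structures: a tuple to a Python tuple, a bag to a list of tuples, and a map to a dict with string keys. The sketch below uses a hypothetical helper, to_pig_text, to print such values in the bracket notation shown above. The sample customer record is invented for illustration.

```python
def to_pig_text(value):
    """Render a Python value in Pig-style notation (illustrative helper, not a Pig API)."""
    if isinstance(value, tuple):   # tuple -> (f1,f2,...)
        return "(" + ",".join(to_pig_text(v) for v in value) + ")"
    if isinstance(value, list):    # bag -> {(t1),(t2),...}
        return "{" + ",".join(to_pig_text(v) for v in value) + "}"
    if isinstance(value, dict):    # map -> [key#value,...]
        return "[" + ",".join(f"{k}#{to_pig_text(v)}" for k, v in value.items()) + "]"
    return str(value)              # simple types are printed as-is


# Hypothetical record: (id, name, bag of purchases, map of details)
customer = (
    101,
    "Asha",
    [("book", 250), ("pen", 20)],
    {"city": "Madurai", "tier": "gold"},
)

print(to_pig_text(customer))
# (101,Asha,{(book,250),(pen,20)},[city#Madurai,tier#gold])
```

The nesting in the output shows why these types suit 1-to-many data: one customer tuple carries a whole bag of purchases and a map of attributes.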

Real-World Query Examples:
 Find total spending by each customer (SUM over a bag
of purchases)
 Get the city from a map (lookup with the # operator)
 Count the number of purchases (COUNT over a bag)
 Filter customers with high-value transactions (FILTER)
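
The four queries listed above can be sketched in Python over nested data, as an analogy for what Pig does with bags and maps. The customer records, the record layout, and the high-value threshold are all assumptions made up for this example.

```python
# Hypothetical sample data: each record is (name, details_map, bag_of_purchases).
customers = [
    ("Asha",  {"city": "Madurai"}, [("book", 250), ("pen", 20)]),
    ("Ravi",  {"city": "Chennai"}, [("laptop", 45000)]),
    ("Meena", {"city": "Madurai"}, [("bag", 900), ("shoes", 1500), ("pen", 20)]),
]

# 1. Total spending by each customer (comparable to SUM over a bag).
total_spending = {name: sum(amount for _, amount in purchases)
                  for name, _, purchases in customers}

# 2. Get the city from the map (comparable to details#'city').
cities = {name: details["city"] for name, details, _ in customers}

# 3. Count the number of purchases (comparable to COUNT over a bag).
purchase_counts = {name: len(purchases) for name, _, purchases in customers}

# 4. Keep customers with at least one high-value transaction.
HIGH_VALUE = 1000  # assumed threshold, chosen only for this example
high_value_customers = [name for name, _, purchases in customers
                        if any(amount >= HIGH_VALUE for _, amount in purchases)]

print(total_spending)        # {'Asha': 270, 'Ravi': 45000, 'Meena': 2420}
print(cities)                # {'Asha': 'Madurai', 'Ravi': 'Chennai', 'Meena': 'Madurai'}
print(purchase_counts)       # {'Asha': 2, 'Ravi': 1, 'Meena': 3}
print(high_value_customers)  # ['Ravi', 'Meena']
```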

PIGGY BANK:
What is Piggy Bank in Pig?
 Piggy Bank is a repository of user-defined functions
(UDFs) contributed by the Apache Pig community.
 These functions extend Pig’s built-in capabilities and
help you perform tasks that are not available in core Pig.
 Think of Piggy Bank as a public library of ready-made
tools for Pig scripts.

Why Piggy Bank is Useful:
Pig has built-in operators and functions for filtering, grouping,
joining, etc. But sometimes you need extra functionality, such as:
 Advanced string processing
 Mathematical computations
 Data format conversions
 Custom load/store functions
 Specialized evaluation functions
Piggy Bank provides many such extra UDFs, so you don’t have
to write your own from scratch.

Where is Piggy Bank Found?
 Piggy Bank is part of the Apache Pig source
code repository, usually under:
contrib/piggybank/java
 It contains Java classes that can be packaged into a
jar and registered in your Pig scripts.

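How to Use Piggy Bank UDFs:
 1. Build the Piggy Bank jar (running ant in
contrib/piggybank/java produces piggybank.jar).
 2. Register the jar in your Pig script with a
REGISTER statement.
 3. Call a Piggy Bank UDF by its fully qualified class
name, or give it a short alias with DEFINE.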

USER-DEFINED FUNCTIONS:
 Pig provides many built-in functions, but sometimes you
need custom processing logic. For that, Pig allows you
to write User-Defined Functions (UDFs).
What is a UDF in Pig?
 A UDF is custom code written by the user to extend
Pig's abilities.
It is commonly used when:
 Built-in functions are not enough.
 You need custom computation, transformation, or
filtering.

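How to Use a UDF in Pig?
 1. Write the UDF (in Java, a class that extends
EvalFunc and implements exec).
 2. Package the UDF into a JAR.
 3. Register the JAR in the Pig script with REGISTER.
 4. Call the UDF in Pig Latin by its class name or by a
DEFINE alias.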
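
The steps above describe a Java UDF. As an alternative sketch, Pig can also run UDFs written in Python through its Jython engine, where the .py file is registered directly instead of a JAR. The function name initials and its schema are hypothetical. The fallback decorator is an assumption that lets the file run on its own: it only applies when no outputSchema decorator is already in scope.

```python
# Assumption: under Pig's Jython engine the outputSchema decorator is already
# in scope. When run standalone it is not, so a stand-in records the schema.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorate(func):
            func.outputSchema = schema  # remember the declared Pig schema
            return func
        return decorate


@outputSchema("initials:chararray")
def initials(full_name):
    """Return the upper-case initials of a name, or None for null or blank input."""
    if full_name is None:
        return None
    parts = full_name.split()
    if not parts:
        return None
    return "".join(part[0].upper() for part in parts)


print(initials("apache pig latin"))  # APL
```

If this file were saved under the hypothetical name my_udfs.py, a Pig script could register it with REGISTER 'my_udfs.py' USING jython AS my_udfs; and then call my_udfs.initials(name) inside a FOREACH ... GENERATE.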

Types of UDFs in Pig:
 Eval Functions
 Filter Functions
 Load/Store Functions
 Accumulator Functions
 Algebraic Functions
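
An algebraic function is one whose result can be computed in stages (initial, intermediate, final), so partial results can be combined early instead of sending every record to one place. The Python sketch below shows the idea for a simple count. The three function names and the sample splits are hypothetical, and this is an analogy rather than Pig's Java interface.

```python
def initial(record):
    """Initial stage: turn one input tuple into a partial count of 1."""
    return 1


def intermediate(partial_counts):
    """Intermediate stage: merge partial counts (this may run more than once)."""
    return sum(partial_counts)


def final(partial_counts):
    """Final stage: merge the remaining partial counts into the answer."""
    return sum(partial_counts)


# Hypothetical input, divided into two splits processed independently.
split_a = [("Asha", 250), ("Asha", 20)]
split_b = [("Ravi", 45000), ("Meena", 900), ("Meena", 1500)]

partial_a = intermediate([initial(r) for r in split_a])  # 2
partial_b = intermediate([initial(r) for r in split_b])  # 3
total = final([partial_a, partial_b])

print(total)  # 5, the same as counting all records in one pass
```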

Advantages of Using UDFs in Pig
 Extend Pig’s default capabilities.
 Reuse custom business logic.
 Easy integration with the Java ecosystem.
 UDFs can be written in several languages, including
Java, Python (Jython), JavaScript, Ruby (JRuby), and
Groovy.

PARAMETER SUBSTITUTION:
 Parameter substitution in Apache Pig lets the same
Pig Latin script run with different values (such as input
paths or dates) without editing it: Pig replaces each
parameter with its value before the script is executed.
 Parameters are referenced within the script using a
dollar sign prefix (e.g., $PARAM_NAME), and their
values are supplied when the script is run.
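
The substitution step can be pictured as a text replacement over the script. The Python sketch below is a simplified model, not Pig's actual preprocessor. The helper name substitute and the sample script line are hypothetical, and TEAMID is borrowed from the command-line example in this deck.

```python
import re


def substitute(script_text, params):
    """Replace each $NAME in script_text with params[NAME]; fail on undefined names."""
    def lookup(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"Undefined parameter: {name}")
        return params[name]

    # Names must start with a letter or underscore, so positional
    # references such as $0 are left untouched.
    return re.sub(r"\$([A-Za-z_][A-Za-z0-9_]*)", lookup, script_text)


line = "team = FILTER players BY team_id == '$TEAMID';"
print(substitute(line, {"TEAMID": "BOS"}))
# team = FILTER players BY team_id == 'BOS';
```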

Methods of Parameter Substitution:
 Command-line arguments:
 Use the -param option (short form -p) followed
by param_name=param_value.
 Example: pig -x mapreduce -p TEAMID=BOS script.pig
 Parameter files:
 Create a file
containing param_name=param_value pairs, one per
line.
 Use the -param_file option followed by the file name.
 Example: pig -x mapreduce -param_file params.txt script.pig

Precedence of Parameter Values:
 When multiple sources define the same parameter, Pig
resolves the value based on the following precedence
(from lowest to highest):
 %default statements within the script.
 Values provided in -param_file (later files, or later
entries within a file, take precedence).
 Values provided using the -param option on the
command line (later -param options for the same
parameter take precedence).
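
The precedence order above amounts to merging the sources from lowest to highest, with each later value overwriting the earlier one. The Python sketch below models that rule only. The helper name resolve_params and every value except TEAMID=BOS are assumptions made for illustration.

```python
def resolve_params(defaults, param_files, cli_params):
    """Merge parameter sources from lowest to highest precedence.

    defaults    -- dict of %default values from the script
    param_files -- list of files, each a list of (name, value) pairs, in order
    cli_params  -- list of (name, value) pairs from -param, in order
    """
    resolved = dict(defaults)            # lowest precedence
    for entries in param_files:          # later files override earlier ones
        for name, value in entries:      # later entries override earlier ones
            resolved[name] = value
    for name, value in cli_params:       # highest precedence; the last one wins
        resolved[name] = value
    return resolved


params = resolve_params(
    defaults={"TEAMID": "NYY", "YEAR": "2010"},
    param_files=[[("TEAMID", "CHC"), ("LIMIT", "10")]],
    cli_params=[("TEAMID", "SEA"), ("TEAMID", "BOS")],
)
print(params)  # {'TEAMID': 'BOS', 'YEAR': '2010', 'LIMIT': '10'}
```

Here TEAMID is defined in all three sources and ends up as BOS, the last -param value, while YEAR keeps its %default because nothing overrides it.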
