PIG - COMPLEX DATA TYPES - PIGGY BANK - USER DEFINED FUNCTIONS - PARAMETER SUBSTITUTION
Presented by,
M. Varshini,
M.Sc (Computer Science),
Department of Computer Science,
Nadar Saraswathi College of Arts
and Science (Autonomous).
TABLE OF CONTENTS:
 PIG
 COMPLEX DATA TYPES
 PIGGY BANK
 USER DEFINED FUNCTIONS
 PARAMETER SUBSTITUTION
PIG:
 In big data, Pig is a high-level platform for processing
and analyzing large datasets within the
Hadoop ecosystem.
 It uses a scripting language called Pig Latin to abstract
away the complexity of low-level MapReduce jobs,
allowing developers to write data analysis programs
more efficiently.
 Pig is particularly useful for tasks like ETL (extract,
transform, load), research on raw data, and iterative
processing.
COMPLEX DATA TYPES:
 Apache Pig is widely used in Big Data Analytics for
processing huge volumes of data on Hadoop.
 Pig Latin provides powerful features to work with
complex or nested data structures—something
relational databases usually struggle with.
 Pig supports three complex data types:
TUPLE
BAG
MAP
1. TUPLE:
 A tuple is an ordered set of fields. It resembles a row in
a spreadsheet or a record in a table.
Characteristics:
 Fields can be of any type: simple (int, chararray) or
complex.
 Order matters.
 Fields are addressed by position ($0, $1, ...), or by
name when a schema is declared.
Syntax:
(id, name, age)
2. BAG:
 A bag is an unordered collection of tuples.
Think of it as a folder containing multiple documents
(tuples).
Characteristics:
 Duplicate tuples allowed.
 Bags can be nested.
 Ideal for representing 1-to-many relationships.
Syntax:
{ (tuple1), (tuple2), (tuple3) }
3. MAP:
 A map stores data as key-value pairs, written key#value.
Characteristics:
 Keys must be of type chararray (string).
 Values can be any type (int, tuple, bag).
 Fast lookup using keys.
Syntax:
['key1'#value1, 'key2'#value2]
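As a rough analogy only (Python is not Pig Latin), the three complex types can be pictured with Python built-ins: a tuple as a Python tuple, a bag as a list of tuples, and a map as a dict with string keys. All sample values below are hypothetical, and a Python list is ordered whereas a Pig bag is not.

```python
# Illustrative analogy only: Python stand-ins for Pig's complex types.
# All sample values are hypothetical.

# Tuple: an ordered set of fields, accessed by position.
customer = (101, "Asha", 29)

# Bag: a collection of tuples; duplicate tuples are allowed.
# (A Pig bag is unordered; the Python list order carries no meaning here.)
purchases = [("pen", 20), ("book", 150), ("pen", 20)]

# Map: string keys mapped to values of any type, including a bag.
details = {"city": "Madurai", "orders": purchases}

# Complex types nest: a tuple whose fields include a bag and a map.
record = (customer[0], purchases, details)


def describe(rec):
    """Return (id, number of purchases, city) for one nested record."""
    cust_id, bag, info = rec
    return (cust_id, len(bag), info["city"])


print(describe(record))
```

The nested record mirrors the 1-to-many case described above: one customer id alongside a bag of that customer's purchases.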
Real-World Query Examples:
 Find total spending by each customer (group, then sum over the bag)
 Get city from map (look up the 'city' key)
 Count number of purchases (count the tuples in the bag)
 Filter customers with high-value transactions
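The four queries listed above can be sketched in plain Python to show what each one computes. This is a model of the logic, not Pig itself; the relation names, field names, and data are hypothetical, and the Pig Latin shown in the comments is only a rough equivalent under those assumed names.

```python
from collections import defaultdict

# Hypothetical purchase records: (customer, item, amount).
purchases = [
    ("asha", "book", 450.0),
    ("ravi", "phone", 12000.0),
    ("asha", "pen", 50.0),
    ("meena", "bag", 900.0),
    ("ravi", "cover", 300.0),
]

# Hypothetical map of profile details per customer.
profiles = {
    "asha": {"city": "Madurai"},
    "ravi": {"city": "Chennai"},
    "meena": {"city": "Theni"},
}


def total_spending(rows):
    """Roughly: g = GROUP purchases BY customer;
    FOREACH g GENERATE group, SUM(purchases.amount);"""
    totals = defaultdict(float)
    for customer, _item, amount in rows:
        totals[customer] += amount
    return dict(totals)


def city_of(customer):
    """Roughly: FOREACH profiles GENERATE info#'city';"""
    return profiles[customer]["city"]


def purchase_counts(rows):
    """Roughly: FOREACH g GENERATE group, COUNT(purchases);"""
    counts = defaultdict(int)
    for customer, _item, _amount in rows:
        counts[customer] += 1
    return dict(counts)


def high_value_customers(rows, threshold):
    """Roughly: FILTER purchases BY amount > threshold;"""
    return sorted({c for c, _item, amount in rows if amount > threshold})


print(total_spending(purchases))
print(city_of("asha"))
print(purchase_counts(purchases))
print(high_value_customers(purchases, 1000.0))
```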
PIGGY BANK:
What is Piggy Bank in Pig?
 Piggy Bank is a repository of user-defined functions
(UDFs) contributed by the Apache Pig community.
 These functions extend Pig’s built-in capabilities and
help you perform tasks that are not available in core Pig.
 Think of Piggy Bank as a public library of ready-made
tools for Pig scripts.
Why Piggy Bank is Useful:
Pig has built-in operators and functions for filtering, grouping, joining, etc.
But sometimes you need extra functionality, such as:
 Advanced string processing
 Mathematical computations
 Data format conversions
 Custom load/store functions
 Specialized evaluation functions
Piggy Bank provides many such extra UDFs, so you don’t have
to write your own from scratch.
Where is Piggy Bank Found?
 Piggy Bank is part of the Apache Pig source
code repository, usually under:
contrib/piggybank/java
 It contains Java classes that can be packaged and
added into your Pig scripts.
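How to Use Piggy Bank UDFs:
 1. Build the Piggy Bank jar
 2. Register the jar in your Pig script with the REGISTER statement
 3. Call a Piggy Bank UDF by its fully qualified class name, or give it a
short alias with DEFINE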
USER-DEFINED FUNCTIONS:
 Pig provides many built-in functions, but sometimes you
need custom processing logic. For that, Pig allows you
to write User-Defined Functions (UDFs).
What is a UDF in Pig?
 A UDF is custom code written by the user to extend
Pig's abilities.
It is commonly used when:
 Built-in functions are not enough.
 You need custom computation, transformation, or
filtering.
How to Use a UDF in Pig?
 1. Write the UDF
 2. Package the UDF into a JAR
 3. Register the JAR in the Pig script
 4. Use the UDF in Pig Latin
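As one possible illustration of step 1, here is a minimal sketch of an eval UDF written in Python, assuming Pig's Jython support is used (a script UDF is registered directly, so the JAR packaging step applies to Java UDFs rather than to this file). The function name normalize_name and its schema are hypothetical. The fallback decorator exists only so the file also runs as ordinary Python outside Pig.

```python
# Sketch of a Python (Jython) eval UDF. When Pig loads a script with
# REGISTER ... USING jython, it is expected to supply the outputSchema
# decorator itself; the fallback below is an assumption-free stand-in so
# this file also runs as plain Python for testing.
try:
    outputSchema  # provided by Pig's Jython script engine
except NameError:
    def outputSchema(schema):
        """Stand-in decorator: records the schema string on the function."""
        def wrap(func):
            func.output_schema = schema
            return func
        return wrap


@outputSchema("name:chararray")
def normalize_name(name):
    """Collapse extra whitespace and title-case a name; pass nulls through."""
    if name is None:
        return None
    return " ".join(name.split()).title()


if __name__ == "__main__":
    print(normalize_name("  asha   devi "))
```

In a Pig script, a file like this would typically be registered with a line such as REGISTER 'udfs.py' USING jython AS udfs; and then called as udfs.normalize_name(name). The file name and alias here are hypothetical.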
Types of UDFs in Pig:
 Eval Functions
 Filter Functions
 Load/Store Functions
 Accumulator Functions
 Algebraic Functions
Advantages of Using UDFs in Pig
 Extend Pig’s default capabilities.
 Reuse custom business logic.
 Easy integration with Java ecosystem.
 Supports multiple languages: UDFs can be written in Java and in
scripting languages such as Python (Jython).
PARAMETER SUBSTITUTION:
 Parameter substitution in Apache Pig lets the same Pig
Latin script run with different values supplied at launch,
enhancing flexibility and reusability.
 Parameters are referenced within the script using a dollar
sign prefix (e.g., $PARAM_NAME) and their values are
provided when the script is executed. Pig substitutes the
values into the script text before running it.
Methods of Parameter Substitution:
 Command-line arguments:
 Use the -param option (short form -p) followed by
param_name=param_value.
 Example: pig -x mapreduce -p TEAMID=BOS script.pig
 Parameter files:
 Create a file containing param_name=param_value pairs,
one per line.
 Use the -param_file option followed by the file name.
 Example: pig -x mapreduce -param_file params.txt script.pig
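To make the mechanism concrete, the sketch below models what substitution does to the script text. It is a simplified Python model, not Pig's actual preprocessor: it assumes parameter names start with a letter or underscore, and that lines starting with # in a parameter file are comments. The script line and parameter values are hypothetical.

```python
import re

# Matches $NAME where NAME starts with a letter or underscore, so positional
# field references such as $0 are left untouched.
_PARAM = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")


def parse_param_file(text):
    """Parse name=value lines; blank lines and '#' comment lines are skipped."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        params[name.strip()] = value.strip()
    return params


def substitute(script, params):
    """Replace each $NAME in the script text; fail on undefined parameters."""
    def repl(match):
        name = match.group(1)
        if name not in params:
            raise KeyError("undefined parameter: " + name)
        return params[name]
    return _PARAM.sub(repl, script)


params = parse_param_file("# team to report on\nTEAMID=BOS\n")
script = "batters = FILTER players BY team == '$TEAMID' AND $0 > 1;"
print(substitute(script, params))
```

Raising an error for an undefined parameter is a design choice in this model: it surfaces a missing value before any processing would start.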
Precedence of Parameter Values:
 When multiple sources define the same parameter, Pig
resolves the value based on the following precedence
(from lowest to highest):
 %default statements within the script.
 Values provided in -param_file. (Later files or later
entries within a file take precedence.)
 Values provided using the -param option on the
command line. (Later -param options for the same
parameter take precedence.)
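The precedence order above amounts to merging the sources from lowest to highest, with later values overwriting earlier ones. A minimal Python sketch of that rule follows; the function name and sample values are hypothetical, and it models only the three sources listed here.

```python
def resolve_params(defaults, param_files, cli_params):
    """Merge parameter sources from lowest to highest precedence.

    defaults:    dict of values from %default statements
    param_files: list of dicts, one per -param_file, in command-line order
    cli_params:  list of (name, value) pairs from -param, in command-line order
    """
    resolved = dict(defaults)           # lowest precedence
    for file_params in param_files:     # later files override earlier ones
        resolved.update(file_params)
    for name, value in cli_params:      # later -param options override earlier
        resolved[name] = value
    return resolved


result = resolve_params(
    defaults={"TEAMID": "NYY", "YEAR": "2000"},
    param_files=[{"TEAMID": "CHC"}, {"TEAMID": "SEA"}],
    cli_params=[("TEAMID", "LAD"), ("TEAMID", "BOS")],
)
print(result)
```

Here TEAMID ends up as BOS (the last -param wins), while YEAR keeps its %default value because no other source defines it.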
THANK YOU!
