UNIT III
Introduction to Pig: Key Features of pig, The Anatomy of Pig, Pig on Hadoop, Pig
Philosophy, Pig Latin Overview, Data Types in Pig, Running Pig, Execution Modes of Pig,
Relational Operators.
Introduction to HIVE: HIVE features, HIVE architecture, HIVE datatypes, HIVE File
Formats, HIVE Query Language.
WHAT IS PIG?
Apache Pig is a platform for data analysis. It is an alternative to MapReduce programming. Pig
was developed as a research project at Yahoo.
Key Features of Pig
1. It provides an engine for executing data flows (how your data should flow). Pig processes data
in parallel on the Hadoop cluster.
2. It provides a language called "Pig Latin" to express data flows.
3. Pig Latin contains operators for many of the traditional data operations such as join, filter,
sort, etc.
4. It allows users to develop their own functions (User Defined Functions) for reading,
processing, and writing data.
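As a sketch of how these features come together in practice, the following hypothetical Pig Latin data flow loads two files, applies traditional data operations in parallel on the cluster, and stores the result (the file names and field names are assumptions for illustration):

    -- Load two hypothetical input files from HDFS, declaring a schema for each
    users  = LOAD 'users.txt'  USING PigStorage(',') AS (id:int, name:chararray, age:int);
    orders = LOAD 'orders.txt' USING PigStorage(',') AS (oid:int, uid:int, amount:double);

    -- Traditional data operations expressed as Pig Latin operators
    adults  = FILTER users BY age >= 18;           -- filter
    joined  = JOIN adults BY id, orders BY uid;    -- join
    ordered = ORDER joined BY amount DESC;         -- sort

    -- Write the result back to HDFS
    STORE ordered INTO 'output' USING PigStorage(',');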
The Anatomy of PIG
The main components of Pig are as follows:
1. Data flow language (Pig Latin).
2. Interactive shell (Grunt), where you can type Pig Latin statements.
3. Pig interpreter and execution engine.
PIG on Hadoop
Pig runs on Hadoop. Pig uses both Hadoop Distributed File System and MapReduce
Programming. By default, Pig reads input files from HDFS. Pig stores the intermediate data (data
produced by MapReduce jobs) and the output in HDFS. However, Pig can also read input from
and place output to other sources.
Pig supports the following:
1. HDFS commands.
2. UNIX shell commands.
3. Relational operators.
4. Positional parameters.
5. Common mathematical functions.
6. Custom functions.
7. Complex data structures.
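A hypothetical Grunt session illustrating a few of these, where fs runs an HDFS command, sh runs a UNIX shell command, and $0 and $2 are positional references to fields (the paths and field positions are assumptions):

    grunt> fs -ls /user/data
    grunt> sh date
    grunt> A = LOAD '/user/data/input.txt';
    grunt> B = FOREACH A GENERATE $0, $2;    -- positional parameters
    grunt> DUMP B;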
Pig Philosophy
Figure 10.2 describes the Pig philosophy.
1. Pigs Eat Anything: Pig can process different kinds of data such as structured and unstructured
data.
2. Pigs Live Anywhere: Pig not only processes files in HDFS, it also processes files in other
sources such as files in the local file system.
3. Pigs are Domestic Animals: Pig allows you to develop user-defined functions, which can be
included in a script for complex operations.
4. Pigs Fly: Pig processes data quickly.
Pig Latin Overview
Pig Latin is the data flow language of Pig. A Pig Latin program is a series of statements; each statement loads data, applies a transformation to a relation, or stores a result.
Data Types in Pig
1 Simple Data Types
Table 10.3 describes the simple data types supported in Pig: int, long, float, double, chararray, boolean, and bytearray. In Pig, a field whose type is not specified is treated as an array of bytes, known as bytearray.
Null: In Pig Latin, NULL denotes a value that is unknown or non-existent.
2 Complex Data Types
Table 10.4 describes the complex data types in Pig: tuple (an ordered set of fields), bag (a collection of tuples), and map (a set of key-value pairs).
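A sketch of a LOAD statement declaring both simple and complex types (the file and field names are hypothetical):

    student = LOAD 'student.txt' AS (
        id:int,                                   -- simple types
        name:chararray,
        gpa:double,
        contact:map[],                            -- map: key-value pairs
        address:tuple(city:chararray, zip:int),   -- tuple: ordered set of fields
        courses:bag{t:(course:chararray)}         -- bag: collection of tuples
    );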
Running Pig
You can run Pig in two ways:
1. Interactive Mode: Pig Latin statements are typed one at a time at the Grunt shell prompt and are executed immediately.
2. Batch Mode: Pig Latin statements are placed in a script file, and the whole script is submitted to Pig for execution.
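For example, the same statements can be typed interactively at the Grunt prompt or saved in a script file (here a hypothetical myscript.pig) and run in batch with the pig command:

    -- Interactive mode: typed at the Grunt prompt
    grunt> A = LOAD 'input.txt';
    grunt> DUMP A;

    -- Batch mode: the same statements saved in myscript.pig, then run as:
    --   pig myscript.pig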
Execution Modes of Pig
Pig has two execution modes:
1. Local Mode: Pig runs in a single JVM and accesses the local file system. This mode is suitable for small datasets and for testing; it is invoked with pig -x local.
2. MapReduce Mode: The default mode. Pig translates Pig Latin statements into MapReduce jobs and runs them on the Hadoop cluster, reading input from and writing output to HDFS. It is invoked with pig or pig -x mapreduce.
Relational Operators
Pig Latin provides relational operators to transform data. The commonly used operators are LOAD, STORE, DUMP, FILTER, FOREACH...GENERATE, GROUP, JOIN, ORDER BY, DISTINCT, UNION, SPLIT, and LIMIT.
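A short sketch chaining several of these operators (the file, field, and relation names are hypothetical):

    emp       = LOAD 'emp.txt' USING PigStorage(',')
                AS (name:chararray, dept:chararray, salary:double);
    well_paid = FILTER emp BY salary > 50000;          -- FILTER
    by_dept   = GROUP well_paid BY dept;               -- GROUP
    avg_sal   = FOREACH by_dept
                GENERATE group, AVG(well_paid.salary); -- FOREACH...GENERATE
    top_dept  = ORDER avg_sal BY $1 DESC;              -- ORDER BY
    few       = LIMIT top_dept 5;                      -- LIMIT
    DUMP few;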
Introduction to HIVE
Hive is a data warehousing tool that sits on top of Hadoop (refer to Figure 9.1). Hive is used to
process structured data in Hadoop. The three main tasks performed by Apache Hive are:
1. Summarization
2. Querying
3. Analysis
Facebook initially created Hive to manage its ever-growing volumes of log data. Later, the
Apache Software Foundation developed it as open source, and it came to be known as Apache
Hive.
Hive makes use of the following:
1. HDFS for Storage.
2. MapReduce for execution.
3. An RDBMS for storing metadata/schemas.
Hive provides HQL (Hive Query Language), also known as HiveQL, which is similar to SQL. Hive compiles HiveQL queries into MapReduce jobs and then runs them on the Hadoop cluster. It is designed to support Online Analytical Processing (OLAP).
HIVE Features
1. It is similar to SQL.
2. HQL is easy to code.
3. Hive supports rich data types such as structs, lists and maps.
4. Hive supports SQL filters, group-by and order-by clauses.
5. Custom types and custom functions can be defined.
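A hypothetical HiveQL query illustrating SQL-style filters, group-by, and order-by (the table and column names are assumptions):

    SELECT dept, COUNT(*) AS emp_count, AVG(salary) AS avg_salary
    FROM employees
    WHERE salary > 40000          -- filter
    GROUP BY dept                 -- group-by
    ORDER BY avg_salary DESC;     -- order-by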
HIVE Architecture
Hive Architecture is depicted in Figure 9.7. The various parts are as follows:
1. Hive Command-Line Interface (Hive CLI): The most commonly used interface to interact
with Hive.
2. Hive Web Interface: A simple graphical user interface to interact with Hive and to execute
queries.
3. Hive Server: An optional server that can be used to submit Hive jobs from a remote client.
4. JDBC/ODBC: Jobs can be submitted from a JDBC Client. One can write a Java code to
connect to Hive and submit jobs on it.
5. Driver: Hive queries are sent to the driver for compilation, optimization and execution.
6. Metastore: Hive table definitions and mappings to the data are stored in a Metastore. A
Metastore consists of the following:
• Metastore service: Offers an interface to Hive.
• Database: Stores data definitions, mappings to the data, and so on.
The metadata stored in the metastore includes IDs of databases, tables, and indexes, the time of
creation of a table, the input format used for a table, the output format used for a table, and so
on. The metastore is updated whenever a table is created or deleted in Hive. There are three
kinds of metastore:
1. Embedded Metastore: This metastore is mainly used for unit tests. Here, only one process is
allowed to connect to the metastore at a time. This is the default metastore for Hive and uses the
Apache Derby database. In this mode, both the database and the metastore service run embedded
in the main Hive server process. Figure 9.8 shows an embedded metastore.
2. Local Metastore: Metadata can be stored in an external RDBMS such as MySQL. A local
metastore allows multiple connections at a time. In this mode, the Hive metastore service runs in
the main Hive server process, but the metastore database runs in a separate process and can be
on a separate host. Figure 9.9 shows a local metastore.
3. Remote Metastore: In this mode, the Hive driver and the metastore interface run in different
JVMs (which can be on different machines as well), as in Figure 9.10. This way the database can
be fire-walled from the Hive users, and the database credentials are completely isolated from the
users of Hive.
HIVE Datatypes
Hive supports primitive data types such as TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, BOOLEAN, STRING, and TIMESTAMP, as well as collection (complex) data types: ARRAY, MAP, STRUCT, and UNIONTYPE.
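A sketch of a table declaration using both primitive and complex types (the table and column names are hypothetical):

    CREATE TABLE employee (
        name    STRING,
        salary  DOUBLE,
        skills  ARRAY<STRING>,                  -- list of values
        phones  MAP<STRING, STRING>,            -- key-value pairs
        address STRUCT<city:STRING, zip:INT>    -- named fields
    );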
HIVE File Formats
Hive supports several file formats for storing table data, including TEXTFILE (the default), SEQUENCEFILE, RCFILE, and ORC. The format is specified with the STORED AS clause when a table is created.
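For example, a hypothetical table stored in the ORC columnar format instead of the default TEXTFILE:

    CREATE TABLE sales (id INT, amount DOUBLE)
    STORED AS ORC;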
HIVE Query Language
The Hive query language provides basic SQL-like operations. Here are a few of the tasks that
HQL can do easily:
1. Create and manage tables and partitions.
2. Support various Relational, Arithmetic, and Logical Operators.
3. Evaluate functions.
4. Download the contents of a table to a local directory, or store the results of queries in an HDFS directory.
1 DDL (Data Definition Language) Statements
These statements are used to build and modify the tables and other objects in the database. The
DDL commands are as follows:
1. Create/Drop/Alter Database
2. Create/Drop/Truncate Table
3. Alter Table/Partition/Column
4. Create/Drop/Alter View
5. Create/Drop/Alter Index
6. Show
7. Describe
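A sketch of several of these DDL statements in HiveQL (the database, table, and column names are hypothetical):

    CREATE DATABASE IF NOT EXISTS college;        -- Create Database
    USE college;

    CREATE TABLE student (id INT, name STRING, gpa DOUBLE)   -- Create Table
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    ALTER TABLE student ADD COLUMNS (dept STRING);   -- Alter Table/Column
    SHOW TABLES;                                     -- Show
    DESCRIBE student;                                -- Describe
    DROP TABLE student;                              -- Drop Table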
2 DML (Data Manipulation Language) Statements
These statements are used to retrieve, store, modify, delete, and update data in the database. The
DML commands are as follows:
1. Loading files into tables.
2. Inserting data into Hive tables from queries.
Note: Hive 0.14 supports update, delete, and transaction operations.
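A sketch of the two DML operations (the path and table names are hypothetical, and the target tables are assumed to already exist):

    -- 1. Loading a file into a table
    LOAD DATA LOCAL INPATH '/tmp/student.csv' INTO TABLE student;

    -- 2. Inserting data into a Hive table from a query
    INSERT INTO TABLE toppers
    SELECT id, name FROM student WHERE gpa > 9.0;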