Big Data - Unit 5 - Frameworks - Mini Xerox - Easy Read
5. Optimizer
6. Execution Engine
7. Metastore
8. HCatalog
9. WebHCat

PROCESSING AND RESOURCE MANAGEMENT :
Hive internally uses a MapReduce framework as a de facto engine for executing queries.
MapReduce is a software framework for writing applications that process massive amounts of data in parallel on large clusters of commodity hardware.
A MapReduce job works by splitting data into chunks, which are processed by map and reduce tasks.

DISTRIBUTED STORAGE :
Hive is built on top of Hadoop, so it uses the underlying Hadoop Distributed File System (HDFS) for distributed storage.

Hive Shell :
The Hive shell is the primary way to interact with Hive.
It is the default service in Hive.
It is also called the CLI (command line interface).
The Hive shell is similar to the MySQL shell.
Hive users can run HQL queries in the Hive shell.
In the Hive shell, the up and down arrow keys scroll through previous commands.
HiveQL is case-insensitive (except for string comparisons).
The Tab key autocompletes (suggests completions as you type) Hive keywords and functions.
The Hive shell can run in two modes:

Non-interactive mode :
In non-interactive mode, the shell runs a script of Hive commands instead of an interactive session.
The Hive shell runs in non-interactive mode with the -f option.
Example: $hive -f script.q, where script.q is a file of Hive commands.

Interactive mode :
Hive works in interactive mode when the command "hive" is typed directly in the terminal.
Example:
$hive
hive> show databases;

Hive Services :
The following are the services provided by Hive :
· Hive CLI: The Hive CLI (Command Line Interface) is a shell where we can execute Hive queries and commands.
· Hive Web User Interface: The Hive Web UI is an alternative to the Hive CLI. It provides a web-based GUI for executing Hive queries and commands.
· Hive metastore: A central repository that stores the structure information of the various tables and partitions in the warehouse. It also includes metadata about each column and its type, the serializers and deserializers used to read and write the data, and the corresponding HDFS files where the data is stored.
· Hive Server: Also referred to as the Apache Thrift Server. It accepts requests from different clients and passes them to the Hive Driver.
· Hive Driver: Receives queries from different sources such as the Web UI, CLI, Thrift, and the JDBC/ODBC driver, and transfers the queries to the compiler.
· Hive Compiler: Parses the query and performs semantic analysis on the different query blocks and expressions. It converts HiveQL statements into MapReduce jobs.
· Hive Execution Engine: The optimizer generates the logical plan in the form of a DAG of map-reduce tasks and HDFS tasks. In the end, the execution engine executes the incoming tasks in the order of their dependencies.
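The plan that the compiler and execution engine work from can be inspected from the shell with the EXPLAIN statement. A minimal sketch; the table name docs simply refers to the one-column example table created later in these notes, and any existing table works:

EXPLAIN SELECT count(*) FROM docs;

The output lists the stages (map-reduce tasks and file system operations) that Hive will run and the order in which they depend on each other.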
MetaStore :
The Hive metastore (HMS) is a service that stores Apache Hive and other metadata in a backend RDBMS, such as MySQL or PostgreSQL.
Impala, Spark, Hive, and other services share the metastore.
The connections to and from HMS include HiveServer, Ranger, and the NameNode, which represents HDFS.
Beeline, Hue, JDBC, and Impala shell clients make requests through Thrift or JDBC to HiveServer.
The HiveServer instance reads/writes data to HMS.
By default, redundant HMS instances operate in active/active mode.
The physical data resides in a backend RDBMS, one for HMS.
All connections are routed to a single RDBMS service at any given time.
HMS talks to the NameNode over Thrift and functions as a client to HDFS.
HMS connects directly to Ranger and the NameNode (HDFS), and so does HiveServer.
One or more HMS instances on the backend can talk to other services, such as Ranger.
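The metadata that HMS keeps for a table can be viewed from the Hive shell. A small illustration; docs is the example table defined later in these notes, and any table name works:

DESCRIBE FORMATTED docs;

This prints the column types, the serializer/deserializer in use, and the HDFS location of the table's data, all of which come from the metastore.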
Comparison with Traditional Database :

RDBMS: It is used to maintain the database.
HIVE: It is used to maintain a data warehouse.

RDBMS: It uses SQL (Structured Query Language).
HIVE: It uses HQL (Hive Query Language).

RDBMS: The schema is fixed.
HIVE: The schema varies.

RDBMS: Normalized data is stored.
HIVE: Both normalized and de-normalized data are stored.

RDBMS: Tables in an RDBMS are sparse.
HIVE: Tables in Hive are dense.

Even though it is based on SQL, HiveQL does not strictly follow the full SQL-92 standard.
HiveQL offers extensions not in SQL, including multi-table inserts and CREATE TABLE AS SELECT.
HiveQL lacked support for transactions and materialized views, and offered only limited subquery support.
Support for insert, update, and delete with full ACID functionality was made available with release 0.14.
Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce, Tez, or Spark jobs, which are submitted to Hadoop for execution.

Example :
DROP TABLE IF EXISTS docs;
CREATE TABLE docs (line STRING);

Checks if the table docs exists and drops it if it does. Creates a new table called docs with a single column of type STRING called line.

LOAD DATA INPATH 'input_file' OVERWRITE INTO TABLE docs;

Loads the specified file or directory (in this case "input_file") into the table. OVERWRITE specifies that the target table to which the data is being loaded is to be re-written; otherwise, the data would be appended.

CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\s')) AS word FROM docs) temp
GROUP BY word
ORDER BY word;
This query serves to split the input words into different rows of a temporary table aliased as temp.
The GROUP BY word groups the results based on their keys.
This results in the count column holding the number of occurrences for each word of the word column.
The ORDER BY word sorts the words alphabetically.

Tables :
Here are the types of tables in Apache Hive:

Managed Tables :
In a managed table, both the table data and the table schema are managed by Hive.
The data will be located in a folder named after the table within the Hive data warehouse, which is essentially just a file location in HDFS.
By managed or controlled we mean that if you drop (delete) a managed table, then Hive will delete both the schema (the description of the table) and the data files associated with the table.
The default location is /user/hive/warehouse.
The syntax for Managed Tables :
CREATE TABLE IF NOT EXISTS stocks
(exchange STRING,
symbol STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;

External Tables :
An external table is one where only the table schema is controlled by Hive.
In most cases, the user will set up the folder location within HDFS and copy the data file(s) there.
This location is included as part of the table definition statement.
When an external table is deleted, Hive will only delete the schema associated with the table.
The data files are not affected.
Syntax for External Tables :
CREATE EXTERNAL TABLE IF NOT EXISTS stocks
(exchange STRING,
symbol STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';
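A short illustration of the difference in DROP behaviour described above; the paths follow the defaults mentioned in these notes:

-- managed table: dropping it removes the schema and the data files under /user/hive/warehouse/stocks
DROP TABLE IF EXISTS stocks;
-- external table: dropping it removes only the schema; the files under /data/stocks remain in HDFS
DROP TABLE IF EXISTS stocks;

The statement is the same in both cases; whether the data files survive depends only on how the table was created.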
Querying Data :
A query is a request for data or information from a database table or a combination of tables.
This data may be generated as results returned by Structured Query Language (SQL) or as pictorials, graphs or complex results, e.g., trend analyses from data-mining tools.
One of several different query languages may be used to perform a range of simple to complex database queries.
SQL, the most well-known and widely used query language, is familiar to most database administrators (DBAs).
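In Hive these requests are written in HiveQL. A minimal sketch against the stocks table defined earlier, using its column names:

SELECT symbol, price_open, price_adj_close
FROM stocks
LIMIT 10;

LIMIT 10 only caps the number of returned rows; it is used here just to keep the illustration small.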
User-Defined Functions :
In Hive, users can define their own functions to meet certain client requirements.
These are known as UDFs in Hive.
User-Defined Functions are written in Java for specific modules.
Some UDFs are specifically designed for the reusability of code in application frameworks.
The developer develops these functions in Java and integrates those UDFs with Hive.
During query execution, the developer can directly use the code, and the UDFs are applied to the data just like built-in functions.
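Once such a Java class is packaged into a jar, it is registered and invoked from HiveQL roughly as follows; the jar path, the function name my_upper, and the class name com.example.udf.MyUpper are placeholders for illustration, not names from these notes:

ADD JAR /path/to/hive-udfs.jar;
CREATE TEMPORARY FUNCTION my_upper AS 'com.example.udf.MyUpper';
SELECT my_upper(line) FROM docs;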
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)},{ })
(25,{ },{(4,Sara,25,London)})

The COGROUP operator groups the tuples from each relation according to age, where each group depicts a particular age value.
For example, if we consider the 1st tuple of the result, it is grouped by age 21. And it contains two bags −
the first bag holds all the tuples from the first relation (student_details in this case) having age 21, and
the second bag contains all the tuples from the second relation (employee_details in this case) having age 21.
In case a relation doesn't have tuples having the age value 21, it returns an empty bag.
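The statement that produces a grouped result of this shape is not reproduced in these notes. A sketch, assuming the relations student_details and employee_details referred to above are already loaded; cogroup_data is an illustrative alias:

grunt> cogroup_data = COGROUP student_details BY age, employee_details BY age;
grunt> Dump cogroup_data;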
JOIN Operator
The JOIN operator is used to combine records from two or more relations. While performing a join operation, we declare one (or a group of) tuple(s) from each relation as keys. When these keys match, the two particular tuples are matched, else the records are dropped. Joins can be of the following types −
Self-join
Inner-join
Outer-join − left join, right join, and full join
This chapter explains with examples how to use the JOIN operator in Pig Latin.
Assume that we have two files namely customers.txt and orders.txt in the /pig_data/ directory of HDFS as shown below.

customers.txt
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00

orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060

And we have loaded these two files into Pig with the relations customers and orders as shown below.

grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);

grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',')
as (oid:int, date:chararray, customer_id:int, amount:int);

Let us now perform various Join operations on these two relations.
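The Inner-join listed above is the simplest of these. A sketch on the two relations just loaded; customer_orders is only an illustrative alias:

grunt> customer_orders = JOIN customers BY id, orders BY customer_id;
grunt> Dump customer_orders;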
Self - join
Self-join is used to join a table with itself as if the table were two relations, temporarily renaming at least one relation.
Generally, in Apache Pig, to perform self-join, we will load the same data multiple times, under different aliases (names).
Therefore let us load the contents of the file customers.txt as two tables as shown below.

grunt> customers1 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);

grunt> customers2 = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);
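These notes break off before the join itself; following the same pattern, it would be written with an illustrative alias customers3:

grunt> customers3 = JOIN customers1 BY id, customers2 BY id;
grunt> Dump customers3;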
Left Outer Join
The left outer join operation returns all rows from the left table, even if there are no matches in the right relation.
Syntax
Given below is the syntax of performing left outer join operation using the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;
Example
Let us perform left outer join operation on the two relations customers and orders as shown below.
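The statement itself did not make it into these notes; following the generic syntax above, it would look like this (the alias outer_left is illustrative):

grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;
grunt> Dump outer_left;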
Right Outer Join
The right outer join operation returns all rows from the right table, even if there are no matches in the left table.
Syntax
Given below is the syntax of performing right outer join operation using the JOIN operator.
grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;
Example
Let us perform right outer join operation on the two relations customers and orders as shown below.
grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;
Verification
002,siddarth,Battacharya,22,programmer,003
003,Rajesh,Khanna,22,programmer,003
004,Preethi,Agarwal,21,programmer,003
005,Trupthi,Mohanthy,23,programmer,003
006,Archana,Mishra,23,programmer,003
007,Komal,Nayak,24,teamlead,002
008,Bharathi,Nambiayar,24,manager,001
(3,kaushik,23,Kota,2000,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(2,Khilan,25,Delhi,1500,103,2008-05-20 00:00:00,4,2060)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(2,Khilan,25,Delhi,1500,100,2009-10-08 00:00:00,3,1500)
(2,Khilan,25,Delhi,1500,102,2009-10-08 00:00:00,3,3000)
(1,Ramesh,32,Ahmedabad,2000,103,2008-05-20 00:00:00,4,2060)
(1,Ramesh,32,Ahmedabad,2000,101,2009-11-20 00:00:00,2,1560)
(1,Ramesh,32,Ahmedabad,2000,100,2009-10-08 00:00:00,3,1500)
(1,Ramesh,32,Ahmedabad,2000,102,2009-10-08 00:00:00,3,3000)

Full Outer Join
Example
Let us perform full outer join operation on the two relations customers and orders as shown below.
grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
Verification
(7,Muffy,24,Indore,10000,103,2008-05-20 00:00:00,4,2060)
(7,Muffy,24,Indore,10000,101,2009-11-20 00:00:00,2,1560)
(7,Muffy,24,Indore,10000,100,2009-10-08 00:00:00,3,1500)
(7,Muffy,24,Indore,10000,102,2009-10-08 00:00:00,3,3000)
(6,Komal,22,MP,4500,103,2008-05-20 00:00:00,4,2060)
(6,Komal,22,MP,4500,101,2009-11-20 00:00:00,2,1560)
(6,Komal,22,MP,4500,100,2009-10-08 00:00:00,3,1500)
(6,Komal,22,MP,4500,102,2009-10-08 00:00:00,3,3000)
(5,Hardik,27,Bhopal,8500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,101,2009-11-20 00:00:00,2,1560)
(5,Hardik,27,Bhopal,8500,100,2009-10-08 00:00:00,3,1500)
(5,Hardik,27,Bhopal,8500,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(4,Chaitali,25,Mumbai,6500,101,2009-11-20 00:00:00,2,1560)
(4,Chaitali,25,Mumbai,6500,100,2009-10-08 00:00:00,3,1500)
(4,Chaitali,25,Mumbai,6500,102,2009-10-08 00:00:00,3,3000)
(3,kaushik,23,Kota,2000,103,2008-05-20 00:00:00,4,2060)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(2,Khilan,25,Delhi,1500,100,2009-10-08 00:00:00,3,1500)
(2,Khilan,25,Delhi,1500,102,2009-10-08 00:00:00,3,3000)
(1,Ramesh,32,Ahmedabad,2000,103,2008-05-20 00:00:00,4,2060)
(1,Ramesh,32,Ahmedabad,2000,101,2009-11-20 00:00:00,2,1560)
(1,Ramesh,32,Ahmedabad,2000,100,2009-10-08 00:00:00,3,1500)
(1,Ramesh,32,Ahmedabad,2000,102,2009-10-08 00:00:00,3,3000)

UNION operator
The UNION operator of Pig Latin is used to merge the content of two relations. To perform UNION operation on two relations, their columns and domains must be identical.
Syntax
Given below is the syntax of the UNION operator.
grunt> Relation_name3 = UNION Relation_name1, Relation_name2;
Example
Assume that we have two files namely student_data1.txt and student_data2.txt in the /pig_data/ directory of HDFS as shown below.

Student_data1.txt
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

Student_data2.txt
7,Komal,Nayak,9848022334,trivendram
8,Bharathi,Nambiayar,9848022333,Chennai

And we have loaded these two files into Pig with the relations student1 and student2 as shown below.

grunt> student1 = LOAD 'hdfs://localhost:9000/pig_data/student_data1.txt' USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

grunt> student2 = LOAD 'hdfs://localhost:9000/pig_data/student_data2.txt' USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

Let us now merge the contents of these two relations using the UNION operator as shown below.

grunt> student = UNION student1, student2;

Verification
Verify the relation student using the DUMP operator as shown below.

grunt> Dump student;

Output
It will display the following output, displaying the contents of the relation student.

(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)
(7,Komal,Nayak,9848022334,trivendram)
(8,Bharathi,Nambiayar,9848022333,Chennai)

SPLIT operator
The SPLIT operator is used to split a relation into two or more relations.
Syntax
Given below is the syntax of the SPLIT operator.
grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation3_name IF (condition2);
Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Pig with the relation name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
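The notes stop at the load statement. A sketch of how SPLIT could then be applied, using the age column declared above; the aliases and the age conditions are illustrative only:

grunt> SPLIT student_details INTO student_details1 IF age < 23, student_details2 IF age >= 23;
grunt> Dump student_details1;
grunt> Dump student_details2;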