Unit 2.2 Hive

Hive - Introduction

The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig, and Hive that support the Hadoop modules.
Sqoop: It is used to import and export data between HDFS and RDBMS.
Pig: It is a procedural language platform used to develop scripts for MapReduce operations.
Hive: It is a platform used to develop SQL-like scripts to perform MapReduce operations.
Hive - Introduction
There are various ways to execute MapReduce
operations:
1. The traditional approach using Java MapReduce
program for structured, semi-structured, and
unstructured data.
2. The scripting approach for MapReduce to process structured and semi-structured data using Pig.
3. The Hive Query Language (HiveQL or HQL) for
MapReduce to process structured data using Hive.
What is Hive
Hive is a data warehouse infrastructure tool to
process structured data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and makes
querying and analyzing easy.
Initially Hive was developed by Facebook; later, the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive. It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
What is Hive
Hive is not
A relational database
A design for OnLine Transaction Processing
(OLTP)
A language for real-time queries and row-level
updates
Features of Hive
It stores the schema in a database and the processed data in HDFS.
It is designed for OLAP.
It provides an SQL-like query language called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.
Architecture of Hive
Working of Hive
1. Execute Query: The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
2. Get Plan: The driver takes the help of the query compiler, which parses the query to check the syntax and build the query plan, i.e. the requirements of the query.
3. Get Metadata: The compiler sends a metadata request to the Metastore (any database).
4. Send Metadata: The Metastore sends the metadata as a response to the compiler.
5. Send Plan: The compiler checks the requirements and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
Working of Hive
6. Execute Plan: The driver sends the execution plan to the execution engine.
7. Execute Job: Internally, the execution of the job is a MapReduce job. The execution engine sends the job to the JobTracker, which resides in the Name node, and it assigns this job to the TaskTracker, which resides in the Data node. Here, the query executes as a MapReduce job.
7.1 Metadata Ops: Meanwhile, during execution, the execution engine can execute metadata operations with the Metastore.
8. Fetch Result: The execution engine receives the results from the Data nodes.
9. Send Results: The execution engine sends those resultant values to the driver.
10. Send Results: The driver sends the results to the Hive interfaces.
Hive - Data Types

All the data types in Hive are classified into four types, given as follows:
I. Column Types
II. Literals
III. Null Values
IV. Complex Types
Hive - Data Types -- Column Types
Column types are used as column data types of Hive. They are as follows:
A. Integral Types:
1. Integer type data can be specified using the integral data type INT.
2. When the data range exceeds the range of INT, you need to use BIGINT, and if the data range is smaller than INT, you use SMALLINT.
3. TINYINT is smaller than SMALLINT.
Hive - Data Types -- Column Types
B. String Types:
String type data can be specified using single quotes (' ') or double quotes (" "). Hive contains two string data types: VARCHAR and CHAR.
Data Type  Length
VARCHAR    1 to 65535
CHAR       255
Hive - Data Types -- Column Types
C. Timestamp:
It supports the traditional UNIX timestamp with optional nanosecond precision.
It supports the java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" and the format "yyyy-mm-dd hh:mm:ss.ffffffffff".
D. Dates:
DATE values are described in year/month/day format in the form YYYY-MM-DD.
Hive - Data Types -- Column Types
E. Decimals:
The DECIMAL type in Hive is the same as the BigDecimal format of Java.
It is used for representing immutable arbitrary-precision numbers. The syntax and example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
Hive - Data Types -- Column Types
F. Union Types:
Union is a collection of heterogeneous data types. You can create an instance using create union. The syntax and example are as follows:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
1. {0:1}
2. {1:2.0}
3. {2:["three","four"]}
4. {3:{"a":5,"b":"five"}}
5. {2:["six","seven"]}
6. {3:{"a":8,"b":"eight"}}
7. {0:9}
8. {1:10.0}
Hive - Data Types -- Literals
The following literals are used in Hive:
A. Floating Point Types:
Floating point types are nothing but numbers with decimal points. Generally, this type of data is composed of the DOUBLE data type.
B. Decimal Type:
Decimal type data is nothing but a floating point value with a higher range than the DOUBLE data type. The range of the decimal type is approximately -10^308 to 10^308.
Hive - Data Types -- Null Values
Missing values are represented by the special value NULL.
Hive - Data Types -- Complex Types
1. Arrays: Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
2. Maps: Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>
3. Structs: Structs in Hive are similar to using complex data with comments.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
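For illustration, a minimal sketch of a table that combines all three complex types (the table and column names are hypothetical, not from the original slides):

CREATE TABLE employee_complex (
  name STRING,
  phone_numbers ARRAY<STRING>,              -- a list of phone numbers
  skills MAP<STRING, INT>,                  -- skill name mapped to years of experience
  address STRUCT<city:STRING, zip:STRING>   -- a grouped address record
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':';

-- Complex fields are then read with [], map keys, and dot notation, e.g.:
-- SELECT phone_numbers[0], skills['hive'], address.city FROM employee_complex;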
Hive DDL Operations
Hive DDL (Data Definition Language) is used to define or change the structure of databases, tables, indexes, and so on. The most commonly used DDL statements are CREATE, DROP, ALTER, SHOW, and so on.

The following is the list of DDL statements that are supported in Apache Hive:
1. CREATE
2. DROP
3. TRUNCATE
4. ALTER
5. SHOW
6. DESCRIBE
7. USE
Hive DDL Operations - Create Database
Hive is a database technology that can define databases and tables to analyze structured data. The theme of structured data analysis is to store the data in a tabular manner and pass queries to analyze it.
Note: Hive contains a default database named default.
Create Database Statement:
Create Database is a statement used to create a database in Hive. A database in Hive is a namespace or a collection of tables. The syntax for this statement is as follows:
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
Here, IF NOT EXISTS is an optional clause, which suppresses the error if a database with the same name already exists. We can use SCHEMA in place of DATABASE in this command.
Hive DDL Operations - Create Database
The following query is executed to create a database named userdb:

hive> CREATE DATABASE IF NOT EXISTS userdb;
or
hive> CREATE SCHEMA userdb;

The following query is used to verify the list of databases:

hive> SHOW DATABASES;
default
userdb
Hive DDL Operations - Create Table
The conventions for creating a table in Hive are quite similar to creating a table using SQL.

Example
Let us assume you need to create a table named employee using the CREATE TABLE statement. The following table lists the fields and their data types in the employee table:
Hive DDL Operations - Create Table
The following query creates a table named employee using the above data.
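The query itself appears only as an image in the original slides; a plausible sketch, assuming the employee fields referenced later in this unit (eid, name, salary) plus a designation column, is:

hive> CREATE TABLE IF NOT EXISTS employee (
        eid INT,
        name STRING,
        salary STRING,
        designation STRING)
      COMMENT 'Employee details'
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE;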
Hive DDL Operations
This section explains how to alter the attributes of a table, such as changing its table name, changing column names, adding columns, and deleting or replacing columns.
1. Alter Table Statement: It is used to alter a table in Hive.
Syntax
The statement takes any of the following syntaxes, based on what attributes we wish to modify in a table:
1. ALTER TABLE name RENAME TO new_name
2. ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
3. ALTER TABLE name DROP [COLUMN] column_name
4. ALTER TABLE name CHANGE column_name new_name new_type
5. ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])
Hive DDL Operations
2. Rename To Statement:
The following query renames the table from employee to emp.
hive> ALTER TABLE employee RENAME TO emp;
3. Change Statement:
The following table contains the fields of the employee table and shows the fields to be changed (in bold).
Hive DDL Operations
The following queries rename the column name and the column data type using the above data:
hive> ALTER TABLE employee CHANGE name ename String;
hive> ALTER TABLE employee CHANGE salary salary Double;
4. Add Columns Statement:
The following query adds a column named dept to the employee table.
hive> ALTER TABLE employee ADD COLUMNS (dept STRING COMMENT 'Department name');
5. Replace Statement:
The following query deletes all the columns from the employee table and replaces them with empid and name columns:
hive> ALTER TABLE employee REPLACE COLUMNS (empid INT, name STRING);
HQL – Data Manipulation
This section describes how to drop a table in Hive. When you drop a managed table (stored in the Metastore warehouse), Hive removes both the table data and its metadata. When you drop an external table (whose data lives outside the warehouse, e.g. in the local file system or elsewhere in HDFS), only the metadata is removed and the underlying data files are left intact.
Drop Table Statement:
The syntax is as follows:
DROP TABLE [IF EXISTS] table_name;
The following query drops a table named employee:
hive> DROP TABLE IF EXISTS employee;
On successful execution of the query, you get to see the following response:
OK
Time taken: 5.3 seconds
hive>
HiveQL - Load Data Statement
1. Generally, after creating a table in SQL, we insert data using the INSERT statement. But in Hive, we can insert data using the LOAD DATA statement.
2. While inserting data into Hive, it is better to use LOAD DATA to store bulk records.
3. There are two ways to load data:
one is from the local file system, and
the second is from the Hadoop file system.
HiveQL - Load Data Statement
Syntax
The syntax for load data is as follows:
LOAD DATA [LOCAL] INPATH 'filepath'
[OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)]
1. LOCAL is an identifier to specify the local path. It is optional.
2. OVERWRITE is optional; it overwrites the existing data in the table.
3. PARTITION is optional.
HiveQL - Load Data Statement
Example:
We will insert the following data into the table. It is a text file named sample.txt in the /home/user directory.
The following query loads the given text into the table.
hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt' OVERWRITE INTO TABLE employee;
On successful load, you get to see the following response:
OK
Time taken: 15.905 seconds
hive>
Hive DML Operations
Apache Hive DML (Data Manipulation Language) is used to insert, update, delete, and fetch data from Hive tables. Using DML commands we can load files into Apache Hive tables, write data into the file system from Hive queries, perform merge operations on tables, and so on.
The following list of DML statements is supported by Apache Hive:
1. LOAD
2. SELECT
3. INSERT
4. DELETE
5. UPDATE
6. EXPORT
7. IMPORT
Hive DML Operations - 1. Load Command
The load command is used to move data files into Hive tables. Load operations are pure copy/move operations.
1. During the LOAD operation, if the LOCAL keyword is mentioned, then the LOAD command will check for the file path in the local filesystem.
2. During the LOAD operation, if the LOCAL keyword is not mentioned, then Hive will need the absolute URI of the file, such as hdfs://namenode:9000/user/hive/project/data1.
3. During the LOAD operation, if the OVERWRITE keyword is mentioned, then the contents of the target table/partition will be deleted and replaced by the files referred to by the file path.
4. During the LOAD operation, if the OVERWRITE keyword is not mentioned, then the files referred to by the file path will be appended to the table.
Load Table Syntax:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
Hive DML Operations - 1. Load Command
Load Table Statement:
LOAD DATA LOCAL INPATH '/home/cloudduggu/hive/examples/files/ml-00k/u.data' OVERWRITE INTO TABLE cloudduggudb.userdata;
Hive DML Operations - 2. Select Command
The Select statement projects the records from the table.
Select Command Syntax:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT [offset,] rows]
Select Command Statement:
SELECT * FROM cloudduggudb.userdata WHERE userid=389;
Hive DML Operations - 3. Insert Into Command
The Insert Into command appends data from one table to another table.

Insert Into Syntax:

INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;

Insert Into Statement:

INSERT INTO cloudduggudb.employee_bkp SELECT * FROM cloudduggudb.employee_detail;
Hive DML Operations - 4. Insert Overwrite Command
The Insert Overwrite command overwrites the existing content of the table.
In this example we will use both tables from the INSERT INTO section and overwrite the content of "cloudduggudb.employee_bkp" with "cloudduggudb.employee_detail".
Insert Overwrite Syntax:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
Insert Overwrite Statement:
INSERT OVERWRITE TABLE cloudduggudb.employee_bkp SELECT a.* FROM cloudduggudb.employee_detail a;
Hive DML Operations - 5. Insert Values Command
By using the Insert Values command we can manually insert records into an existing table.
We will use the "cloudduggudb.employee_bkp" table and insert 2 records into it.
Insert Values Syntax:
INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...];
Insert Values Statement:
INSERT INTO cloudduggudb.employee_bkp VALUES (1207,'Mahesh',70000,'Manager'), (1208,'Raj',70000,'Executive');
Hive DML Operations - 6. Delete Command
1. The Delete command is used to delete data from the table. If we supply a where clause, then it will delete only the matching records.
2. To perform Delete/Update operations in Apache Hive we need to follow the below points while creating the table; otherwise delete/update statements will fail with error 10297.
Note: In Apache Hive, we can perform a DELETE statement only on those tables which follow the ACID property.

Before performing create, delete, or update operations on such a table we should enable the ACID properties using the below parameters at the Hive prompt:
hive> set hive.support.concurrency=true;
hive> set hive.enforce.bucketing=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> set hive.compactor.initiator.on=true;
hive> set hive.compactor.worker.threads=1;
hive> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
Hive DML Operations - 6. Delete Command
The file format should be ORC, which can be defined with TBLPROPERTIES('transactional'='true').
The table should be created with CLUSTERED BY followed by buckets.
Now we will enable the ACID properties and create a table. After table creation, we will insert data.
Create Table Statement:
CREATE TABLE acidexample (key int, value int)
PARTITIONED BY (load_date date)
CLUSTERED BY(key) INTO 3 BUCKETS
STORED AS ORC TBLPROPERTIES ('transactional'='true');
Insert Table Statement:
INSERT INTO acidexample partition (load_date='2016-03-01') VALUES (1, 1);
INSERT INTO acidexample partition (load_date='2016-03-02') VALUES (2, 2);
INSERT INTO acidexample partition (load_date='2016-03-03') VALUES (3, 3);
Hive DML Operations - 6. Delete Command
Delete Command Syntax:
DELETE FROM tablename [WHERE expression];
Delete Command with where Statement:
DELETE FROM acidexample WHERE key = 1;

Delete Statement:
DELETE FROM acidexample;
Hive DML Operations - 7. Update Command
The Update command updates the existing records if a where clause is supplied; otherwise it will update all rows of the table. We can't perform the update command on partitioning and bucketing columns.
Note: In Apache Hive, we can perform an UPDATE statement only on those tables which follow the ACID property.
Update Command Syntax:
UPDATE tablename SET column = value [, column = value ...] [WHERE expression];
Update Statement:
UPDATE acidexample SET value=2 WHERE key=3;
Hive DML Operations - 8. Export Command
The Apache Hive EXPORT command is used when we need to export a table's data and metadata to some other location.
To perform this activity we have created a directory "hive_export_location" in HDFS under /data/hive_export_location and are exporting the table "acidexample".

Export Command Syntax:

EXPORT TABLE tablename [PARTITION (part_column="value"[, ...])]
TO 'export_target_path' [FOR replication('eventid')]
Export Statement:
export table acidexample to '/data/hive_export_location';
Hive DML Operations - 9. Import Command
1. The Apache Hive IMPORT command imports the data from a specific location into Hive tables.
2. To perform this activity we will copy the "acidexample" table data from the HDFS location "/data/hive_export_location" into the "cloudduggudb" database.
Import Command Syntax:
IMPORT [[EXTERNAL] TABLE new_or_original_tablename [PARTITION (part_column="value"[, ...])]]
FROM 'source_path'
[LOCATION 'import_target_path']
Import Statement:
import table acidexample from '/data/hive_export_location';
Hive Joins
Apache Hive JOINs are used to combine columns from one (self-join) or more tables by using values common to each. Using joins we can fetch corresponding records from two or more tables. It is almost similar to SQL joins.
Apache Hive provides four types of joins, which are mentioned below:
1. Inner Join
2. Left Outer Join
3. Right Outer Join
4. Full Outer Join
Hive Joins
The following graph is the representation of Apache Hive Joins using
table A and table B.
Hive Joins
Hive Join Syntax
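The syntax slide is an image in the original; the general shape of a Hive join clause, as a sketch, is:

SELECT t1.col1, t2.col2
FROM table1 t1
[INNER | LEFT OUTER | RIGHT OUTER | FULL OUTER] JOIN table2 t2
ON (t1.join_key = t2.join_key);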
Hive Joins - a. Inner Join
Basically, to combine and retrieve the records from multiple tables we use the Hive JOIN clause.
In HiveQL, a plain JOIN is the same as an INNER JOIN in SQL.
The JOIN condition is usually raised using the primary keys and foreign keys of the tables.
The below query executes a JOIN on the CUSTOMERS and ORDERS tables and then retrieves the records:
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
Hive Joins - b. Left Outer Join
A HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if there are no matches in the right table.
To be more specific, even if the ON clause matches 0 (zero) records in the right table, this Hive JOIN still returns a row in the result, although with NULL in each column from the right table.
In other words, it returns all the values from the left table, plus the matched values from the right table, or NULL in case of no matching JOIN predicate.
The below query shows a LEFT OUTER JOIN between the CUSTOMERS and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
LEFT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
Hive Joins - c. Right Outer Join
A HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even if there are no matches in the left table.
To be more specific, even if the ON clause matches 0 (zero) records in the left table, this Hive JOIN still returns a row in the result, although with NULL in each column from the left table.
In other words, it returns all the values from the right table, plus the matched values from the left table, or NULL in case of no matching join predicate.
The below query shows a RIGHT OUTER JOIN between the CUSTOMERS and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c RIGHT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
Hive Joins - d. Full Outer Join
The major purpose of the HiveQL FULL OUTER JOIN is to combine the records of both the left and the right tables that fulfill the Hive JOIN condition. The joined table contains all the records from both tables, filling in NULL values for missing matches on either side.
The below query shows a FULL OUTER JOIN between the CUSTOMERS and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
FULL OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
Hive Partitioning
Apache Hive organizes tables into partitions. Partitioning is a way of dividing a table into related parts based on the values of particular columns like date, city, and department.
Each table in Hive can have one or more partition keys to identify a particular partition. Using partitions, it is easy to run queries on slices of the data.
Why is Partitioning Important?
In the current century, we know that a huge amount of data, in the range of petabytes, is getting stored in HDFS. Due to this, it becomes very difficult for Hadoop users to query this huge amount of data.
Hive was introduced to lower this burden of data querying. Apache Hive converts SQL queries into MapReduce jobs and then submits them to the Hadoop cluster. When we submit a SQL query, Hive reads the entire dataset.
Hive Partitioning
So, it becomes inefficient to run MapReduce jobs over a large table. This is resolved by creating partitions in tables. Apache Hive makes the job of implementing partitions very easy, creating them with its automatic partition scheme at the time of table creation.
In the partitioning method, all the table data is divided into multiple partitions. Each partition corresponds to a specific value (or values) of the partition column(s) and is kept as a sub-directory inside the table's directory in HDFS.
Therefore, when querying a particular table, only the appropriate partition that contains the query value is read. This decreases the I/O time required by the query and hence increases performance.
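For instance, a sketch of this partition pruning, assuming the USER_DATA table created later in this section (partitioned by DATE_DT and COUNTRY):

-- Only the files under the COUNTRY='IN' partition directories are scanned;
-- all other partitions are skipped entirely.
SELECT USER_ID, USER_NAME
FROM USER_DATA
WHERE COUNTRY = 'IN';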
Hive Partitioning
Types of Partitioning:
There are the following two types of Apache Hive Partitioning.
1. Static Partitioning
2. Dynamic Partitioning
Hive Partitioning - Static Partitioning
Hive Static Partitioning
1. Inserting input data files individually into a partition table is a static partition.
2. Usually, when loading files (big files) into Hive tables, static partitions are preferred.
3. A static partition saves time in loading data compared to a dynamic partition.
4. You "statically" add a partition to the table and move the file into the partition of the table.
5. We can alter the partition in a static partition.
6. You can get the partition column value from the filename, day of date, etc. without reading the whole big file.
7. If you want to use a static partition in Hive, you should set the property hive.mapred.mode=strict. This property is set in hive-site.xml.
8. Static partitioning is in strict mode.
9. You should use a where clause to use limit in a static partition.
10. You can perform static partitioning on Hive managed tables or external tables.
Hive Partitioning - Static Partitioning
Let us see static partitioning with the below example.
To perform this example, we have created a table "USER_DATA" with DATE_DT and COUNTRY as partition columns. We will load data into "USER_DATA".
Create Table Syntax:
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [column_constraint_specification] [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)];
Create Table Statement:
CREATE TABLE USER_DATA (USER_ID INT, USER_NAME STRING, SITE_DATA STRING)
PARTITIONED BY (DATE_DT STRING, COUNTRY STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
Hive Partitioning - Static Partitioning
Command Output:
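The output screenshot is not reproduced here; the static load itself would look something like this sketch (the file path and partition values are hypothetical):

-- The partition values are supplied explicitly ("statically") in the statement.
LOAD DATA LOCAL INPATH '/home/user/user_data_in.txt'
INTO TABLE USER_DATA
PARTITION (DATE_DT='2021-01-01', COUNTRY='IN');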
Hive Partitioning - Dynamic Partitioning
2. Dynamic Partitioning
Dynamic partitioning is known for loading a partition table with a single insert statement.
It loads data from a non-partitioned table and takes more time than a static partition.
We can use dynamic partitioning when we have large data already stored in a table.
It can be used in Hive in non-strict mode by setting the parameter hive.exec.dynamic.partition.mode=nonstrict.
A dynamic partition can't be altered.
Hive Partitioning - Dynamic Partitioning
Hive Dynamic Partitioning
1. A single insert into a partition table is known as a dynamic partition.
2. Usually, a dynamic partition loads the data from a non-partitioned table.
3. A dynamic partition takes more time in loading data compared to a static partition.
4. When you have large data stored in a table, the dynamic partition is suitable.
5. If you want to partition on a number of columns but you don't know how many, the dynamic partition is also suitable.
6. With a dynamic partition, there is no required where clause to use limit.
7. We can't perform alter on a dynamic partition.
8. You can perform dynamic partitioning on Hive external tables and managed tables.
9. If you want to use a dynamic partition in Hive, the mode must be non-strict.
10. The Hive dynamic partition properties you should enable are shown in the sketch below.
Hive Partitioning - Dynamic Partitioning
Let us see dynamic partitioning with the below example.
To perform this example, we have created two tables, "USER_DATA_DYN" and "USER_LOG_DATA". Table "USER_DATA_DYN" will be a partition table with DATE_DT and COUNTRY as partition columns, and table "USER_LOG_DATA" will be a non-partition table. We will insert data into "USER_DATA_DYN" using the non-partition table "USER_LOG_DATA".
Create Table Syntax:
Hive Partitioning - Dynamic Partitioning
Create Table Statement: Let us create the table "USER_DATA_DYN".

Command Output:
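The create statement and its output appear as images in the original; a sketch of the dynamic insert, assuming USER_LOG_DATA carries the same columns plus the two partition values, is:

-- No partition values are hard-coded; Hive derives each row's DATE_DT and
-- COUNTRY partition from the last two columns of the SELECT list.
INSERT INTO TABLE USER_DATA_DYN PARTITION (DATE_DT, COUNTRY)
SELECT USER_ID, USER_NAME, SITE_DATA, DATE_DT, COUNTRY
FROM USER_LOG_DATA;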
Hive Partitioning
Hive Partitioning – Advantages and Disadvantages
a) Hive Partitioning Advantages
1. Partitioning in Hive distributes the execution load horizontally.
2. With partitions, queries over a low volume of data execute faster. For example, searching the population of Vatican City returns very fast instead of searching the entire world population.
3. There is no need to search the entire table for a single record; only the relevant partition is scanned.
b) Hive Partitioning Disadvantages
1. There is the possibility of too many small partitions being created, i.e. too many directories.
2. Partitioning is effective for low-volume data. But some queries, like GROUP BY on a high volume of data, still take a long time to execute. For example, grouping the population of China will take a long time compared to grouping the population of Vatican City.
Hive Bucketing
1. Using bucketing, Apache Hive provides another technique to organize a table's data in a more manageable way.
2. Bucketing is a concept of breaking data down into ranges, which are called buckets.
3. Bucketing gives one more structure to the data so that it can be used for more efficient queries.
4. The range for a bucket is determined by the hash value of one or more columns in the dataset.
5. These columns are called `bucketing` or `clustered by` columns.
6. When we load data into a bucketed table, it stores the data in separate buckets as files. Hive uses a hashing algorithm to generate a number in the range of 1 to n buckets and stores each row in a particular bucket.
7. Bucketing in Apache Hive is useful when we deal with large datasets that may need to be segregated into clusters for more efficient management, and to be able to perform join queries with other large datasets.
Hive Bucketing
To perform this example, we will create a bucketed table "USER_LOG_BUCKET" which will have a partition column DATE_DT and four buckets. We have mentioned the bucketing column in the CLUSTERED BY (USER_ID) clause in the create table statement. We will insert data into this table "USER_LOG_BUCKET" from "USER_LOG_DATA" (this table was created in the partitioning section), as in the sketch below.
Hive Bucketing
Advantages of Apache Hive Bucketing
Bucketing is a partitioning technique that helps to avoid data shuffling and sorting at query time by applying those transformations at write time.
The basic idea of bucketing is to partition the user's data and store it in a sorted format based on the user's SQL, while still allowing users to read the data as usual.
Bucketing helps in performing fast join operations.
With bucketing, the data in each bucket is stored in a sorted format.
Hive Optimization Techniques – Hive Performance
Several types of Hive query optimization techniques are available to improve the performance of our Hive queries.
In this article on Hive optimization techniques, we will learn how to optimize Hive queries so that they execute faster on our cluster. The types of optimization techniques for queries are: the execution engine, usage of a suitable file format, Hive partitioning, bucketing in Apache Hive, vectorization in Hive, cost-based optimization in Hive, and Hive indexing.
Why Hive Optimization Techniques?
We know that Hive is a query language similar to SQL, built on the Hadoop ecosystem and used to run queries on petabytes of data. So, there are several Hive optimization techniques to improve its performance, which we can implement when we run our Hive queries.
Hive Optimization Techniques – Hive Performance
Types of Query Optimization Techniques in Hive
Hive Optimization Techniques – Hive Performance
1. Partition Tables
Apache Hive partitioned tables are used to improve the performance of queries.
Partitioning allows the user to store data in separate subdirectories under the table location. When a user submits a query against the partition key, performance improves because Hive fetches only the specific rows instead of scanning all rows. However, choosing a partition key is a very challenging task for the user, because the partition key should be a low-cardinality attribute.
Hive Optimization Techniques – Hive Performance
2. De-normalizing Data:
Normalization is a process used to model a table's data using certain rules to reduce data redundancy and improve data integrity.
In real terms, if we normalize the data set, it means we join multiple tables to create a relation and fetch the data. But from a performance perspective, joins are expensive and difficult operations to perform, and they are one of the major reasons for performance issues.
So we should avoid highly normalized table structures to maintain good performance.
Hive Optimization Techniques – Hive Performance
3. Map/Reduce Output Compression:
If compression is used, then it reduces the intermediate data volume, and due to this, the internal data transfer between mappers and reducers over the network is reduced.
We can apply compression to mapper and reducer output individually. We should note that gzip-compressed files are not splittable, which means this should be applied with caution. Ideally, the compressed file size should not be larger than a few hundred MB; otherwise it can create imbalanced jobs.
We can set mapper output compression using set mapred.compress.map.output=true.
We can set job output compression using set mapred.output.compress=true.
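At the Hive prompt these settings would be applied as below (a minimal sketch; the codec choice is an illustrative assumption, not from the original):

set mapred.compress.map.output=true;   -- compress intermediate map output
set mapred.output.compress=true;       -- compress final job output
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;  -- example codec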
Hive Optimization Techniques – Hive Performance
4. Bucketing:
By using bucketing, the performance of queries is improved if the bucket key and join keys are similar.
Bucketing distributes the data into different buckets based on the hash of the bucket key; I/O is also reduced if the query uses a join on the same keys (columns).
Before writing data to a bucketed table it is important to set the bucketing flag (SET hive.enforce.bucketing=true;), and for best join performance we can set (SET hive.optimize.bucketmapjoin=true;) so that it hints Hive to do a bucket-level join during the map-stage join.
Hive Optimization Techniques – Hive Performance
5. Input Format Selection:
Choosing the correct input file format in Apache Hive is critical for performance.
If we take JSON or text files as the input format, it is not a good choice for a high-volume production system, because these readable formats take a lot of space and create overhead during processing.
We can resolve such issues by choosing a correct input format, such as the columnar input formats (RCFile, ORC).
There are some other binary format files out there, such as Avro, sequence files, Thrift, and ProtoBuf, which can help with other issues.
Hive Optimization Techniques – Hive Performance
6. Vectorization:
With the help of vectorization, we can process a batch of rows together instead of processing one row at a time.
We can enable vectorization by setting the configuration parameter hive.vectorized.execution.enabled=true.
7. Tez Execution Engine:
Using Tez as the execution engine, performance is improved for Apache Hive queries.
Tez provides an expressive dataflow-definition API with which we can describe the Directed Acyclic Graph (DAG) of the computation that we want to run.
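Switching engines is a one-line setting (a minimal sketch; on older Hive versions the default engine is mr):

set hive.execution.engine=tez;  -- run subsequent queries on Tez instead of plain MapReduce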
Hive Optimization Techniques – Hive Performance
8. Indexing:
An index is used to improve performance drastically.
When we define an index on a table, a separate index table gets created, which is used during query processing.
Without an index, if a user runs a query, Hive performs a full row scan, which is a costly and time-taking job and takes a lot of system resources.
But with an index, the framework will check the index table and jump to the specific data instead of searching all rows.
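A sketch of the classic index DDL (note that this feature was removed in Hive 3.0; the table and column names are hypothetical):

-- Build a compact index on the salary column of the employee table.
CREATE INDEX idx_salary ON TABLE employee (salary)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

-- Populate (or refresh) the index table.
ALTER INDEX idx_salary ON employee REBUILD;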
Hive Join Strategies
1. Hive joins are executed by MapReduce jobs through different execution engines, for example Tez, Spark, or MapReduce.
2. Joins, even of multiple tables, can be achieved by one job only.
3. Since its first release, many optimizations have been added to Hive, giving users various options for query improvements of joins.
Hive Join Strategies - MapReduce Joins
1. Joins with MapReduce can be achieved in two ways: either during the map phase (map-side) or during the reduce phase (reduce-side).