Unit 2.2 Hive
Hive DDL Operations - Create Table
Example
Let us assume you need to create a table named employee using the CREATE
TABLE statement, with fields such as eid (int), name (String), and
salary (Float); the same fields appear in the ALTER TABLE examples later
in this section.
The following query creates the employee table.
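A minimal sketch of such a statement, assuming the three fields above; the
salary type, the COMMENT, and the tab-delimited text-file layout are
illustrative assumptions, not taken from the original slide:
hive> CREATE TABLE IF NOT EXISTS employee (
        eid int,
        name String,
        salary Float)
      COMMENT 'Employee details'
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE;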
Hive DDL Operations
This section explains how to alter the attributes of a table such as
changing its table name, changing column names, adding columns, and
deleting or replacing columns.
1. Alter Table Statement: It is used to alter a table in Hive.
Syntax
The statement takes any of the following syntaxes based on what
attributes we wish to modify in a table.
1. ALTER TABLE name RENAME TO new_name
2. ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
3. ALTER TABLE name DROP [COLUMN] column_name
4. ALTER TABLE name CHANGE column_name new_name new_type
5. ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])
Hive DDL Operations
2. Rename To Statement:
The following query renames the table from employee to emp.
hive> ALTER TABLE employee RENAME TO emp;
3. Change Statement:
The fields of the employee table to be changed are the name column (which
becomes ename) and the data type of the salary column (which becomes
Double).
Hive DDL Operations
The following queries rename the name column and change the salary
column's data type:
hive> ALTER TABLE employee CHANGE name ename String;
hive> ALTER TABLE employee CHANGE salary salary Double;
4. Add Columns Statement:
The following query adds a column named dept to the employee table.
hive> ALTER TABLE employee ADD COLUMNS (dept STRING COMMENT 'Department name');
5. Replace Statement:
The following query deletes all the columns from the employee table and
replaces them with the empid and ename columns:
hive> ALTER TABLE employee REPLACE COLUMNS (empid Int, ename String);
HQL – Data Manipulation
This section describes how to drop a table in Hive. When you drop a
managed table, Hive removes both the table data and its metadata from the
Metastore and the warehouse. When you drop an external table (whose data
lives outside the warehouse, for example in the local file system or
elsewhere in HDFS), Hive removes only the metadata and leaves the
underlying data files intact.
Drop Table Statement:
The syntax is as follows:
DROP TABLE [IF EXISTS] table_name;
The following query drops a table named employee:
hive> DROP TABLE IF EXISTS employee;
On successful execution of the query, you see the following response:
OK
Time taken: 5.3 seconds
hive>
HiveQL-Load Data Statement
1. Generally, after creating a table in SQL, we insert data using the
INSERT statement. In Hive, we can insert data using the LOAD DATA
statement.
2. While inserting data into Hive, it is better to use LOAD DATA to store
bulk records.
3. There are two ways to load data: one is from the local file system,
and the other is from the Hadoop file system (HDFS).
HiveQL-Load Data Statement
Syntax
The syntax for load data is as follows:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE]
INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)]
1. LOCAL is an optional identifier that specifies the local file-system path.
2. OVERWRITE is optional; it overwrites the existing data in the table.
3. PARTITION is optional.
HiveQL-Load Data Statement
Example:
We will insert the following data into the table. It is a text file named
sample.txt in the /home/user directory.
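For illustration, sample.txt might contain tab-separated records matching
the employee table's fields; these sample rows are assumptions, not the
original slide's data:
1201	Gopal	45000.0
1202	Manisha	40000.0
1203	Kaleel	38000.0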
The following query loads the given text into the table.
hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt'
      OVERWRITE INTO TABLE employee;
On successful execution, you see the following response:
OK
Time taken: 15.905 seconds
hive>
Hive DML Operations
Apache Hive DML (Data Manipulation Language) is used to insert, update,
delete, and fetch data from Hive tables. Using DML commands we can load
files into Apache Hive tables, write data into the file system from Hive
queries, perform merge operations on tables, and so on.
The following DML statements are supported by Apache Hive:
1. LOAD
2. SELECT
3. INSERT
4. DELETE
5. UPDATE
6. EXPORT
7. IMPORT
Hive DML Operations - 1. Load Command
The load command is used to move data files into Hive tables. Load operations
are pure copy/move operations.
1. During the LOAD operation, if the LOCAL keyword is mentioned, the LOAD
command looks for the file path in the local filesystem.
2. If the LOCAL keyword is not mentioned, Hive needs the absolute URI of
the file, such as hdfs://namenode:9000/user/hive/project/data1.
3. If the OVERWRITE keyword is mentioned, the contents of the target
table/partition are deleted and replaced by the files referred to by the
file path.
4. If the OVERWRITE keyword is not mentioned, the files referred to by
the file path are appended to the table.
Load Table Syntax:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE]
INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)]
Hive DML Operations - 1. Load Command
Load Table Statement:
LOAD DATA LOCAL INPATH
'/home/cloudduggu/hive/examples/files/ml-00k/u.data'
OVERWRITE INTO TABLE cloudduggudb.userdata;
Hive DML Operations - 2. Select Command
The Select statement projects records from the table.
Select Command Syntax:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT [offset,] rows]
Select Command Statement:
SELECT * FROM cloudduggudb.userdata WHERE userid=389;
Hive DML Operations - 3. Insert Into Command
The Insert Into command appends data from one table to another table.
Before performing insert, update, or delete operations on a table, we
should enable the ACID properties using the parameters below at the Hive
prompt.
hive> set hive.support.concurrency=true;
hive> set hive.enforce.bucketing=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> set hive.compactor.initiator.on=true;
hive> set hive.compactor.worker.threads=1;
hive> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
Hive DML Operations - 6. Delete Command
The file format should be ORC, which can be defined with
TBLPROPERTIES('transactional'='true').
The table should be created with CLUSTERED BY followed by buckets.
Now we will enable the ACID properties and create a table. After table
creation, we will insert data.
Create Table Statement:
CREATE TABLE acidexample (key int, value int)
PARTITIONED BY (load_date date)
CLUSTERED BY (key) INTO 3 BUCKETS
STORED AS ORC TBLPROPERTIES ('transactional'='true');
Insert Table Statement:
INSERT INTO acidexample PARTITION (load_date='2016-03-01') VALUES (1, 1);
INSERT INTO acidexample PARTITION (load_date='2016-03-02') VALUES (2, 2);
INSERT INTO acidexample PARTITION (load_date='2016-03-03') VALUES (3, 3);
Hive DML Operations - 6. Delete Command
Delete Command Syntax:
DELETE FROM tablename [WHERE expression];
Delete Command with WHERE Statement:
DELETE FROM acidexample WHERE key = 1;
Delete Statement:
DELETE FROM acidexample;
Hive DML Operations - 7. Update Command
The Update command updates existing records. If a WHERE clause is
supplied, only the matching rows are updated; otherwise all rows in the
table are updated. We cannot use the Update command on partitioning and
bucketing columns.
Note: In Apache Hive, we can perform an UPDATE statement only on tables
that satisfy the ACID properties.
Update Command Syntax:
UPDATE tablename SET column = value [, column = value ...] [WHERE expression];
Update Statement:
UPDATE acidexample SET value=2 WHERE key=3;
Hive DML Operations - 8. Export Command
The Apache Hive EXPORT command is used when we need to export a table's
data and metadata to another location.
To perform this activity, we have created a directory
/data/hive_export_location in HDFS, and we will export the table
"acidexample" to it, as sketched below.
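A sketch of the corresponding statement, assuming the directory named
above as the target path (the original slide's exact command is not
shown):
hive> EXPORT TABLE acidexample TO '/data/hive_export_location';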
Hive Partitioning
Hive Partitioning – Advantages and Disadvantages
a) Hive Partitioning Advantages
1. Partitioning in Hive distributes the execution load horizontally.
2. In a partition, queries over a low volume of data execute faster. For
example, searching the population of Vatican City returns results very
fast instead of searching the population of the entire world.
3. There is no need to scan the entire table to find a single record.
b) Hive Partitioning Disadvantages
1. There is the possibility of creating too many small partitions, and
hence too many directories.
2. A partition is effective for low-volume data, but some queries, such
as a GROUP BY on a high volume of data, still take a long time to
execute. For example, grouping the population of China will take a long
time compared to grouping the population of Vatican City.
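The bucketing example below refers to a partitioned table USER_LOG_DATA
with partition column DATE_DT, created in this section. A minimal sketch
of such a table; the USER_ID and LOG_MSG columns and the file path are
illustrative assumptions:
hive> CREATE TABLE USER_LOG_DATA (
        USER_ID INT,
        LOG_MSG STRING)
      PARTITIONED BY (DATE_DT STRING);
hive> LOAD DATA LOCAL INPATH '/home/user/user_log_2016_03_01.txt'
      INTO TABLE USER_LOG_DATA PARTITION (DATE_DT='2016-03-01');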
Hive Bucketing
1. Using Bucketing, Apache Hive provides another technique to organize
tables’ data in a more manageable way.
2. Bucketing is a concept of breaking data down into ranges which are called
buckets.
3. Bucketing gives one more structure to the data so that it can be used for
more efficient queries.
4. The range for a bucket is determined by the hash value of one or more
columns in the dataset.
5. These columns are called `bucketing` or `clustered by` columns.
6. When we load data into a bucketed table, Hive stores each bucket's
data as a separate file. Hive applies a hashing algorithm to the
bucketing column and uses the result to assign each row to one of the n
buckets.
7. Bucketing in Apache Hive is useful when we deal with large datasets that
may need to be segregated into clusters for more efficient management
and to be able to perform join queries with other large datasets.
Hive Bucketing
To perform this example, we will create a bucketed table
"USER_LOG_BUCKET" with a partition column DATE_DT and four buckets. We
specify the bucketing column in the CLUSTERED BY (USER_ID) clause of the
create table statement. We will then insert data into "USER_LOG_BUCKET"
from "USER_LOG_DATA" (the table created in the Partitioning section), as
sketched below.
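A sketch of the two statements under the assumptions above; the LOG_MSG
column, the ORC storage format, and the dynamic-partition insert are
illustrative assumptions:
hive> CREATE TABLE USER_LOG_BUCKET (
        USER_ID INT,
        LOG_MSG STRING)
      PARTITIONED BY (DATE_DT STRING)
      CLUSTERED BY (USER_ID) INTO 4 BUCKETS
      STORED AS ORC;
hive> SET hive.enforce.bucketing=true;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> INSERT INTO TABLE USER_LOG_BUCKET PARTITION (DATE_DT)
      SELECT USER_ID, LOG_MSG, DATE_DT FROM USER_LOG_DATA;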
Hive Bucketing
Advantages of Apache Hive Bucketing
Bucketing is a data-organization technique that helps to avoid data
shuffling and sorting during query execution.
The basic idea of bucketing is to divide the users' data into a fixed
number of buckets and store each bucket in sorted order, so that queries
can read only the buckets they need.
Bucketing helps in performing fast join operations.
With bucketing, the data in each bucket is stored in sorted format.
Hive Optimization Techniques – Hive Performance
Several Hive query optimization and performance tuning techniques are
available to improve the performance of our Hive queries.
In this section we will learn how to optimize Hive queries so that they
execute faster on our cluster. The types of Hive optimization techniques
for queries are: execution engine, usage of a suitable file format, Hive
partitioning, bucketing in Apache Hive, vectorization in Hive, cost-based
optimization in Hive, and Hive indexing.
Hive Optimization Techniques
Hive is a query language similar to SQL, built on the Hadoop ecosystem
and used to run queries on petabytes of data. There are several Hive
optimization techniques to improve its performance, which we can apply
when we run our Hive queries.
Hive Optimization Techniques – Hive Performance
Types of Query Optimization Techniques in Hive
Hive Optimization Techniques – Hive Performance
1. Partition Tables
Apache Hive partitioned tables are used to improve query performance.
Partitioning allows the user to store data in separate subdirectories
under the table location. When a user submits a query that filters on the
partition key, performance improves because Hive fetches only the
specific rows instead of scanning all rows. However, it is a very
challenging task for the user to choose a partition key, because the
partition column should be a low-cardinality attribute.
Hive Optimization Techniques – Hive Performance
2. De-normalizing Data:
Normalization is a process used to model a table's data using certain
rules to reduce data redundancy and improve data integrity.
In practical terms, if we normalize the data set, we must join multiple
tables to create a relation and fetch the data. From a performance
perspective, joins are expensive and difficult operations and one of the
major causes of performance issues.
So we should avoid highly normalized table structures to maintain good
performance.
Hive Optimization Techniques – Hive Performance
3. Map/Reduce Output Compression:
If compression is used, it reduces the intermediate data volume, and
because of this the internal data transfer between mappers and reducers
over the network is reduced.
We can apply compression to mapper and reducer output individually. Note
that gzip-compressed files are not splittable, which means compression
should be applied with caution; ideally, a compressed file should not be
larger than a few hundred MB, otherwise it can create imbalanced jobs.
We can set mapper output compression using "set
mapred.compress.map.output=true" and job output compression using "set
mapred.output.compress=true".
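For instance, at the Hive prompt (the Snappy codec choice is an
illustrative assumption):
hive> SET mapred.compress.map.output=true;
hive> SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> SET mapred.output.compress=true;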
Hive Optimization Techniques – Hive Performance
4. Bucketing:
Bucketing improves query performance when the bucket key and the join
keys are the same.
Bucketing distributes the data into different buckets based on the hash
of the bucket key, and I/O is reduced when a query joins on the same
key (column).
Before writing data to a bucketed table it is important to set the
bucketing flag (SET hive.enforce.bucketing=true;), and for the best join
performance we can set (SET hive.optimize.bucketmapjoin=true;), which
hints Hive to perform a bucket-level join during the map-stage join.
Hive Optimization Techniques – Hive Performance
5. Input Format Selection:
Choosing the correct input format in Apache Hive is critical for
performance.
JSON or plain text files are not a good choice for a high-volume
production system, because these readable formats take a lot of space and
create overhead during processing.
We can resolve such issues by choosing a suitable input format, such as
the columnar formats RCFile and ORC, as sketched below.
There are some other binary formats, such as Avro, sequence files,
Thrift, and ProtoBuf, which help with other issues.
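For example, a table can be declared with a columnar format at creation
time; the table and column names here are illustrative assumptions:
hive> CREATE TABLE user_log_orc (USER_ID INT, LOG_MSG STRING)
      STORED AS ORC;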
Hive Optimization Techniques – Hive Performance
6. Vectorization:
With the help of vectorization, Hive processes a batch of rows together
instead of processing one row at a time.
We can enable vectorization by setting the configuration parameter
hive.vectorized.execution.enabled=true.
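At the Hive prompt, for example (the reduce-side flag is an additional,
commonly paired setting, not mentioned on the original slide):
hive> SET hive.vectorized.execution.enabled=true;
hive> SET hive.vectorized.execution.reduce.enabled=true;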
7. Tez Execution Engine:
Using Tez as the execution engine improves the performance of Apache Hive
queries.
Tez provides an expressive dataflow-definition API with which we can
describe the directed acyclic graph (DAG) of the computation we want to
run.
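Switching engines is a one-line setting at the Hive prompt:
hive> SET hive.execution.engine=tez;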
Hive Optimization Techniques – Hive Performance
8. Indexing:
An index can be used to improve performance drastically.
When we define an index on a table, a separate index table is created,
which is used during query processing.
Without an index, a query performs a full scan of all rows, which is a
costly and time-consuming job that takes a lot of system resources.
With an index, the framework checks the index table and jumps to the
specific data instead of searching all rows.
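A sketch of defining a compact index; the index and column names are
illustrative assumptions (note that the indexing feature was removed in
Hive 3.0):
hive> CREATE INDEX employee_eid_idx
      ON TABLE employee (eid)
      AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
      WITH DEFERRED REBUILD;
hive> ALTER INDEX employee_eid_idx ON employee REBUILD;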
Hive Join Strategies
1. Hive joins are executed as distributed jobs by different execution
engines, for example Tez, Spark, or MapReduce.
2. Even joins of multiple tables can be achieved by a single job.
3. Since its first release, many optimizations have been added to Hive,
giving users various options for improving join queries.
Hive Join Strategies - MapReduce Joins
1. Joins with MapReduce can be achieved in two ways: either during the
map phase (map-side) or during the reduce phase (reduce-side), as
sketched below.
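A map-side join can be requested with a hint, or Hive can convert a join
automatically when one side is small enough; the table names here are
illustrative assumptions:
hive> SET hive.auto.convert.join=true;
hive> SELECT /*+ MAPJOIN(d) */ e.ename, d.dname
      FROM emp e JOIN dept d ON (e.deptno = d.deptno);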