HIVE
• Hive – What is Hive?
• Hive Architecture
• Hive Data Types
• Hive File Format
• Hive Query Language (HQL)
• RCFile Implementation
• SerDe
• User-Defined Functions (UDF)
What is Hive
• Hive is a data warehouse infrastructure tool to process structured data
in Hadoop.
• It resides on top of Hadoop to summarize Big Data, and makes
querying and analyzing easy.
• Hive was initially developed by Facebook; later, the Apache Software
Foundation took it up and developed it further as open source under
the name Apache Hive.
• It is used by different companies. For example, Amazon uses it in
Amazon Elastic MapReduce.
• Hive is not
• A relational database.
• A design for OnLine Transaction Processing (OLTP).
• A language for real-time queries and row-level updates.
• Features of Hive
• It stores the schema in a database and the processed data in HDFS.
• It is designed for OLAP.
• It provides an SQL-type query language called HiveQL or HQL (see the sketch after this list).
• It is familiar, fast, scalable, and extensible.
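A minimal sketch of HiveQL, using a hypothetical employee table:

CREATE TABLE employee (id INT, name STRING, salary DOUBLE);
SELECT name, salary FROM employee WHERE salary > 50000;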
Architecture of Hive
User Interface: Hive is data warehouse infrastructure software that creates interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HD Insight (on Windows Server).
Metastore: Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.
HiveQL Process Engine: HiveQL is similar to SQL for querying schema information in the Metastore. It is one replacement for the traditional MapReduce programming approach. Instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.
Execution Engine: The conjunction of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates results the same as MapReduce results. It uses the flavor of MapReduce.
HDFS or HBase: The Hadoop Distributed File System or HBase is the data storage technique used to store data in the file system.
Working of Hive
• The following diagram depicts the workflow between Hive and
Hadoop.
The following table describes how Hive
interacts with the Hadoop framework:
Step No.  Operation
1    Execute Query: The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
2    Get Plan: The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan or the requirement of the query.
3    Get Metadata: The compiler sends a metadata request to the Metastore.
4    Send Metadata: The Metastore sends the metadata as a response to the compiler.
5    Send Plan: The compiler checks the requirement and resends the plan to the driver. At this point, the parsing and compiling of the query is complete.
6    Execute Plan: The driver sends the execution plan to the execution engine.
7    Execute Job: Internally, the execution of the job is a MapReduce job. The execution engine sends the job to the JobTracker, which is on the Name node, and it assigns this job to the TaskTracker, which is on a Data node. Here, the query executes the MapReduce job.
7.1  Metadata Ops: Meanwhile, during execution, the execution engine can execute metadata operations with the Metastore.
8    Fetch Result: The execution engine receives the results from the Data nodes.
9    Send Results: The execution engine sends those resultant values to the driver.
10   Send Results: The driver sends the results to the Hive interfaces.
Hive Data Types
• The following are the data types in Hive that are involved in table creation.
• All the data types in Hive are classified into four types, given as follows:
• Column Types
• Literals
• Null Values
• Complex Types
• Column Types
• Column types are used as the column data types of Hive tables.
• They are as follows:
• Integral Types
• Integer data can be specified using the integral data types, the default being INT.
• When the data range exceeds the range of INT, you need to use BIGINT, and if the data range is
smaller than INT, you use SMALLINT. TINYINT is smaller than SMALLINT.
The following table depicts various INT
data types:
Type Postfix Example
TINYINT Y 10Y
SMALLINT S 10S
INT - 10
BIGINT L 10L
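A minimal sketch showing the integral column types and their literal postfixes (the table and column names are hypothetical):

CREATE TABLE int_demo (
  tiny_col  TINYINT,
  small_col SMALLINT,
  int_col   INT,
  big_col   BIGINT
);
SELECT 10Y, 10S, 10, 10L FROM int_demo;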
String Types
• String type data can be specified using single quotes (' ') or double quotes (" ").
• It contains two data types: VARCHAR and CHAR (see the sketch below).
• Hive follows C-style escape characters.
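A minimal sketch of the string types in a column definition (the table and column names are hypothetical):

CREATE TABLE string_demo (
  name  STRING,
  code  VARCHAR(20),
  grade CHAR(2)
);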
Union Types
• A union is a collection of heterogeneous data types. You can create an instance using CREATE UNION.
The syntax and an example follow:
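A minimal sketch of the UNIONTYPE syntax producing the tagged values shown below; the leading number in each value is the zero-based tag of the member type it holds (0 = INT, 1 = DOUBLE, 2 = ARRAY&lt;STRING&gt;, 3 = STRUCT). The table and column names are hypothetical:

CREATE TABLE union_demo (
  col1 UNIONTYPE<INT, DOUBLE, ARRAY<STRING>, STRUCT<a:INT, b:STRING>>
);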
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
Literals
• The following literals are used in Hive:
• Floating Point Types
• Floating point types are numbers with decimal points. Generally, this
type of data is represented by the DOUBLE data type.
• Decimal Type
• Decimal type data is floating point data with a higher range than the
DOUBLE data type. The range of the decimal type is approximately -10^-308 to 10^308.
Null Value
• Missing values are represented by the special value NULL.
Complex Types
The Hive complex data types are as follows:
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps
Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>
Structs
Structs in Hive are similar to using complex data with a comment attached to each field.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
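A minimal sketch combining all three complex types in one table (the table and column names are hypothetical):

CREATE TABLE employee_details (
  name    STRING,
  phones  ARRAY<STRING>,
  skills  MAP<STRING, INT>,
  address STRUCT<street:STRING, city:STRING, zip:INT>
);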
Apache Hive Different File Formats
• TextFile
• SequenceFile
• RCFile
• AVRO
• ORC
• Parquet
Hive Text File Format
• The Hive text file format is the default storage format.
• You can use the text format to interchange data with other client applications.
• The text file format is very common in most applications.
• Data is stored in lines, with each line being a record.
• Each line is terminated by a newline character (\n).
• The text format is a simple plain file format.
• You can apply compression (BZIP2) to text files to reduce storage space.
• Create a text file table by adding the storage option ‘STORED AS TEXTFILE’ at the end of a Hive CREATE
TABLE command.
• Examples
• Below is the Hive CREATE TABLE command with storage format specification:
Create table textfile_table (column_specs) stored as textfile;
Hive Sequence File Format
• Sequence files are Hadoop flat files that store values in binary
key-value pairs.
• Sequence files are in binary format, and they are splittable.
• The main advantage of using sequence files is to merge two or more files
into one file.
• Create a sequence file by adding the storage option ‘STORED AS
SEQUENCEFILE’ at the end of a Hive CREATE TABLE command.
• Example
• Create table sequencefile_table (column_specs) stored as sequencefile;
Hive RC File Format
• RCFile is another Hive file format, which offers high row-level
compression rates.
• If you have a requirement to process multiple rows at a time, you can use the
RCFile format.
• RCFiles are very similar to the sequence file format.
• This file format also stores the data as key-value pairs.
• Create an RCFile by specifying the ‘STORED AS RCFILE’ option at the end of a
CREATE TABLE command:
• Example
• Create table RCfile_table (column_specs) stored as rcfile;
Hive AVRO File Format
• Avro is an open source project that provides data serialization and data
exchange services for Hadoop.
• You can exchange data between the Hadoop ecosystem and programs written in
any programming language.
• Avro is one of the popular file formats in Big Data Hadoop based
applications.
• Create an Avro file by specifying the ‘STORED AS AVRO’ option at the end
of a CREATE TABLE command.
• Example
• Create table avro_table (column_specs) stored as avro;
Hive ORC File Format
• ORC stands for the Optimized Row Columnar file format.
• The ORC file format provides a highly efficient way to store data in Hive
tables.
• This file format was designed to overcome limitations of the other
Hive file formats.
• Using ORC files improves performance when Hive is reading, writing,
and processing data from large tables.
• Create an ORC file by specifying the ‘STORED AS ORC’ option at the end of a CREATE
TABLE command.
• Examples
• Create table orc_table (column_specs) stored as orc;
Hive Parquet File Format
• Parquet is a column-oriented binary file format.
• Parquet is highly efficient for large-scale queries.
• Parquet is especially good for queries that scan particular columns within a
particular table.
• Parquet tables use Snappy or gzip compression; currently, Snappy is the
default.
• Create a Parquet file by specifying the ‘STORED AS PARQUET’ option at the end of a
CREATE TABLE command.
• Example:
• Create table parquet_table (column_specs) stored as parquet;
Hibernate Query Language (HQL)
• HQL is an object-oriented query language, similar to SQL, but instead of operating on
tables and columns, HQL works with persistent objects and their properties.
• HQL queries are translated by Hibernate into conventional SQL queries, which in
turn perform actions on the database.
• Use HQL whenever possible to avoid database portability hassles, and to take
advantage of Hibernate's SQL generation and caching strategies.
• Keywords like SELECT, FROM, and WHERE are not case sensitive, but
properties like table and column names are case sensitive in HQL.
FROM Clause
• You will use the FROM clause if you want to load complete persistent
objects into memory.
• Following is the simple syntax of using the FROM clause −
String hql = "FROM Employee";
Query query = session.createQuery(hql);
List results = query.list();
• If you need to fully qualify a class name in HQL, just specify the
package and class name as follows −
String hql = "FROM com.hibernatebook.criteria.Employee";
Query query = session.createQuery(hql);
List results = query.list();
AS Clause
• The AS clause can be used to assign aliases to the classes in your HQL queries,
especially when you have long queries.
• For instance, our previous simple example would be the following −
String hql = "FROM Employee AS E";
Query query = session.createQuery(hql);
List results = query.list();
• The AS keyword is optional and you can also specify the alias directly after the
class name, as follows −
String hql = "FROM Employee E";
Query query = session.createQuery(hql);
List results = query.list();
SELECT Clause
• The SELECT clause provides more control over the result set than the FROM
clause.
• If you want to obtain a few properties of objects instead of the complete object,
use the SELECT clause.
• Following is the syntax of using the SELECT clause to get just the first_name field
of the Employee object −
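A minimal sketch, reusing the Employee class from the earlier examples:

String hql = "SELECT E.firstName FROM Employee E";
Query query = session.createQuery(hql);
List results = query.list();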
• Employee.firstName is a property of Employee object rather than a field of the EMPLOYEE table.
WHERE Clause
• If you want to narrow the specific objects that are returned from
storage, you use the WHERE clause.
GROUP BY Clause
• This clause lets Hibernate pull information from the database and
group it based on the value of an attribute and, typically, use the result to
include an aggregate value.
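A minimal sketch combining WHERE and GROUP BY, reusing the hypothetical Employee class from the earlier examples:

String hql = "SELECT SUM(E.salary), E.firstName FROM Employee E " +
             "WHERE E.salary > 1000 GROUP BY E.firstName";
Query query = session.createQuery(hql);
List results = query.list();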
UPDATE Clause
• The UPDATE clause can be used to update one or more properties of one or more objects, as in the following example −
String hql = "UPDATE Employee set salary = :salary " + "WHERE id = :employee_id";
Query query = session.createQuery(hql);
query.setParameter("salary", 1000);
query.setParameter("employee_id", 10);
int result = query.executeUpdate();
System.out.println("Rows affected: " + result);
DELETE Clause
• The DELETE clause can be used to delete one or more objects.
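A minimal sketch, following the parameterized style of the UPDATE example above:

String hql = "DELETE FROM Employee WHERE id = :employee_id";
Query query = session.createQuery(hql);
query.setParameter("employee_id", 10);
int result = query.executeUpdate();
System.out.println("Rows deleted: " + result);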
Aggregate Methods
• The DISTINCT keyword only counts the unique values in the row set.
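User-Defined Function (UDF)
A minimal sketch of the SquareUDF class described below, using Hive's classic UDF API; the exact class body is an assumption:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;

// A simple Hive UDF that returns the square of its integer input.
public class SquareUDF extends UDF {
    public IntWritable evaluate(IntWritable input) {
        if (input == null) {
            return null; // propagate NULLs, per the usual Hive convention
        }
        int v = input.get();
        return new IntWritable(v * v);
    }
}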
This class defines a UDF named SquareUDF that takes an integer as input and returns the square
of that integer as output.
Next, compile the Java class into a JAR file. Make sure to include any dependencies required by
your UDF.
After compiling, you can add the JAR file containing your UDF to the Hive classpath using the
ADD JAR command:
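A minimal sketch of registering and invoking the UDF (the JAR path and function name are hypothetical):

ADD JAR /path/to/square-udf.jar;
CREATE TEMPORARY FUNCTION square AS 'SquareUDF';
SELECT square(4);  -- returns 16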