
UNIT-3 HIVE

HIVE
• Hive – What is Hive?
• Hive Architecture
• Hive Data Types
• Hive File Format
• Hive Query Language (HQL)
• RCFile Implementation
• SerDe
• User-Defined Functions (UDF)
What is Hive
• Hive is a data warehouse infrastructure tool to process structured data
in Hadoop.
• It resides on top of Hadoop to summarize Big Data, and makes
querying and analyzing easy.
• Initially Hive was developed by Facebook, later the Apache Software
Foundation took it up and developed it further as an open source under
the name Apache Hive.
• It is used by different companies. For example, Amazon uses it in
Amazon Elastic MapReduce.
• Hive is not
• A relational database.
• A design for OnLine Transaction Processing (OLTP).
• A language for real-time queries and row-level updates.
• Features of Hive
• It stores the schema in a database and the processed data in HDFS.
• It is designed for OLAP.
• It provides SQL type language for querying called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
Architecture of Hive

• The following component diagram depicts the architecture of Hive:


• This component diagram contains different units.
• The following table describes each unit:
• User Interface: Hive is a data warehouse infrastructure software that can create interaction between the user and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HD Insight (on Windows Server).

• Meta Store: Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and HDFS mapping.

• HiveQL Process Engine: HiveQL is similar to SQL for querying schema information in the Metastore. It is one replacement for the traditional approach of writing MapReduce programs. Instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.

• Execution Engine: The conjunction part of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates results the same as MapReduce results. It uses the flavor of MapReduce.

• HDFS or HBase: The Hadoop Distributed File System or HBase are the data storage techniques used to store data into the file system.
Working of Hive
• The following diagram depicts the workflow between Hive and
Hadoop.
The following table defines how Hive interacts with the Hadoop framework:

• Step 1, Execute Query: The Hive interface, such as the command line or Web UI, sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
• Step 2, Get Plan: The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan, or the requirement of the query.
• Step 3, Get Metadata: The compiler sends a metadata request to the Metastore (any database).
• Step 4, Send Metadata: The Metastore sends the metadata as a response to the compiler.
• Step 5, Send Plan: The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
• Step 6, Execute Plan: The driver sends the execute plan to the execution engine.
• Step 7, Execute Job: Internally, the execution of the job is a MapReduce job. The execution engine sends the job to the JobTracker, which resides on the NameNode, and the JobTracker assigns this job to the TaskTracker, which resides on the DataNode. Here, the query executes the MapReduce job.
• Step 7.1, Metadata Ops: Meanwhile, during execution, the execution engine can execute metadata operations with the Metastore.
• Step 8, Fetch Result: The execution engine receives the results from the DataNodes.
• Step 9, Send Results: The execution engine sends those resultant values to the driver.
• Step 10, Send Results: The driver sends the results to the Hive interfaces.
Hive Data Modelling
Hive Data Types
HIVE DATA TYPES
• The different data types in Hive are involved in table creation.
• All the data types in Hive are classified into four types, given as follows:
• Column Types
• Literals
• Null Values
• Complex Types
• Column Types
• Column types are used as the column data types of Hive tables.
• They are as follows:
• Integral Types
• Integer type data can be specified using the integral data types, INT.
• When the data range exceeds the range of INT, you need to use BIGINT, and if the data range is
smaller than INT, you use SMALLINT. TINYINT is smaller than SMALLINT.
The following table depicts various INT
data types:
Type Postfix Example

TINYINT Y 10Y

SMALLINT S 10S

INT - 10

BIGINT L 10L
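A short illustration of the postfixes in use; the table and column names below are illustrative, not from the original, and SELECT without a FROM clause assumes Hive 0.13 or later:

create table readings_demo (
  device_id   BIGINT,
  reading     INT,
  room_no     SMALLINT,
  status_code TINYINT
);

-- the postfix letters mark the type of an integer literal explicitly
select 10Y as tiny_val, 10S as small_val, 10 as int_val, 10L as big_val;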
String Types

• String type data can be specified using single quotes (' ') or double quotes (" ").
• It contains two data types: VARCHAR and CHAR.
• Hive follows C-style escape characters.

Data Type Length

VARCHAR 1 to 65535
CHAR 255
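A small sketch of the two types in a table definition (the table and column names are assumptions, not from the original): VARCHAR takes a maximum length, while CHAR is fixed-length and padded with trailing spaces.

create table customers_demo (
  name         VARCHAR(50),  -- values longer than 50 characters are truncated
  country_code CHAR(2)       -- always stored as exactly 2 characters
);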
Timestamp
• It supports the traditional UNIX timestamp with optional nanosecond precision.
• It supports the java.sql.Timestamp format “YYYY-MM-DD HH:MM:SS.fffffffff” and the format “yyyy-mm-dd hh:mm:ss.ffffffffff”.
Dates
• DATE values are described in year/month/day format in the form YYYY-MM-DD.
Decimals
• The DECIMAL type in Hive is the same as the BigDecimal format of Java. It is used for representing immutable arbitrary-precision decimal values.
• The syntax and an example are as follows:
• DECIMAL(precision, scale)
• decimal(10,0)
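A combined sketch of the DATE, TIMESTAMP, and DECIMAL types; the table and values are illustrative, and INSERT ... VALUES assumes Hive 0.14 or later:

create table orders_demo (
  order_id   INT,
  order_date DATE,           -- 'YYYY-MM-DD'
  created_at TIMESTAMP,      -- 'YYYY-MM-DD HH:MM:SS.fffffffff'
  amount     DECIMAL(10,2)   -- 10 digits of precision, 2 digits after the decimal point
);

insert into orders_demo values (1, '2024-01-15', '2024-01-15 10:30:00.123', 2500.75);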
Union Types

• Union is a collection of heterogeneous data types. You can create an instance using the create_union UDF.
The syntax and an example are as follows:

UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>

{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
Literals
• The following literals are used in Hive:
• Floating Point Types
• Floating point types are nothing but numbers with decimal points. Generally, this type of data is composed of the DOUBLE data type.
• Decimal Type
• Decimal type data is nothing but a floating point value with a higher range than the DOUBLE data type. The range of the decimal type is approximately -10^-308 to 10^308.
Null Value
• Missing values are represented by the special value NULL.
Complex Types
The Hive complex data types are as follows:
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps
Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>
Structs
Structs in Hive are similar to using complex data with comments.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
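A sketch showing how the three complex types can appear in a table definition and how their elements are accessed; the names and delimiters are illustrative, not from the original:

create table employee_profile_demo (
  name    STRING,
  skills  ARRAY<STRING>,
  phones  MAP<STRING, STRING>,
  address STRUCT<city:STRING, zip:INT>
)
row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by ':';

-- element access: index for arrays, key for maps, dot for structs
select name, skills[0], phones['home'], address.city from employee_profile_demo;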
Apache Hive Different File Formats
• TextFile
• SequenceFile
• RCFile
• AVRO
• ORC
• Parquet
Hive Text File Format
• The Hive text file format is the default storage format.
• You can use the text format to interchange data with other client applications.
• The text file format is very common in most applications.
• Data is stored in lines, with each line being a record.
• Each line is terminated by a newline character (\n).
• The text format is a simple plain file format.
• You can use compression (e.g. BZIP2) on the text file to reduce storage space.
• Create a TEXT file by adding the storage option ‘STORED AS TEXTFILE’ at the end of a Hive CREATE TABLE command.
• Examples
• Below is the Hive CREATE TABLE command with the storage format specification:
Create table textfile_table (column_specs) stored as textfile;
Hive Sequence File Format
• Sequence files are Hadoop flat files which store values in binary key-value pairs.
• The sequence files are in binary format and these files are splittable.
• The main advantage of using sequence files is that they can merge two or more files into one file.
• Create a sequence file by adding the storage option ‘STORED AS SEQUENCEFILE’ at the end of a Hive CREATE TABLE command.

• Example
• Create table sequencefile_table (column_specs) stored as sequencefile;
Hive RC File Format
• This is another form of Hive file format which offers high row-level compression rates.
• If you have a requirement to process multiple rows at a time, then you can use the RCFile format.
• The RCFile format is very similar to the sequence file format.
• This file format also stores the data as key-value pairs.
• Create an RCFile by specifying the ‘STORED AS RCFILE’ option at the end of a CREATE TABLE command:
• Example
• Create table RCfile_table (column_specs) stored as rcfile;
Hive AVRO File Format
• AVRO is an open source project that provides data serialization and data exchange services for Hadoop.
• You can exchange data between the Hadoop ecosystem and programs written in any programming language.
• Avro is one of the popular file formats in Big Data Hadoop based applications.
• Create an AVRO file by specifying the ‘STORED AS AVRO’ option at the end of a CREATE TABLE command.
• Example
• Create table avro_table (column_specs) stored as avro;
Hive ORC File Format
• ORC stands for Optimized Row Columnar file format.
• The ORC file format provides a highly efficient way to store data in Hive tables.
• This file format was actually designed to overcome limitations of the other Hive file formats.
• The use of ORC files improves performance when Hive is reading, writing, and processing data from large tables.
• Create an ORC file by specifying the ‘STORED AS ORC’ option at the end of a CREATE TABLE command.
• Examples
• Create table orc_table (column_specs) stored as orc;
Hive Parquet File Format
• Parquet is a column-oriented binary file format.
• Parquet is highly efficient for large-scale queries.
• Parquet is especially good for queries scanning particular columns within a particular table.
• Parquet tables support Snappy and gzip compression; currently Snappy is the default.
• Create a Parquet file by specifying the ‘STORED AS PARQUET’ option at the end of a CREATE TABLE command.
• Example:
• Create table parquet_table (column_specs) stored as parquet;
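To make the STORED AS clauses above concrete, the following sketch (the table names and columns are assumptions, not from the original) loads a plain text table and copies its rows into Parquet and ORC tables:

-- source table in the default text format
create table sales_text (id int, amount double)
row format delimited fields terminated by ','
stored as textfile;

-- copy the same rows into a Parquet table using CREATE TABLE AS SELECT
create table sales_parquet stored as parquet as
select * from sales_text;

-- or create an ORC table first and fill it with INSERT ... SELECT
create table sales_orc (id int, amount double) stored as orc;
insert into table sales_orc select * from sales_text;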
Hibernate Query Language (HQL)
• HQL is an object-oriented query language, similar to SQL, but instead of operating on tables and columns, HQL works with persistent objects and their properties.
• HQL queries are translated by Hibernate into conventional SQL queries, which in turn perform actions on the database.
• Use HQL whenever possible to avoid database portability hassles, and to take advantage of Hibernate's SQL generation and caching strategies.
• Keywords like SELECT, FROM, and WHERE, etc., are not case sensitive, but properties like table and column names are case sensitive in HQL.
FROM Clause
• You will use the FROM clause if you want to load complete persistent objects into memory.
• Following is the simple syntax of using FROM clause −
String hql = "FROM Employee";
Query query = session.createQuery(hql);
List results = query.list();
• If you need to fully qualify a class name in HQL, just specify the
package and class name as follows −
String hql = "FROM com.hibernatebook.criteria.Employee";
Query query = session.createQuery(hql);
List results = query.list();
AS Clause
• The AS clause can be used to assign aliases to the classes in your HQL queries, especially when you have long queries.
• For instance, our previous simple example would be the following −
String hql = "FROM Employee AS E";
Query query = session.createQuery(hql);
List results = query.list();

• The AS keyword is optional and you can also specify the alias directly after the
class name, as follows −
String hql = "FROM Employee E";
Query query = session.createQuery(hql);
List results = query.list();
SELECT Clause
• The SELECT clause provides more control over the result set than the FROM clause.
• If you want to obtain a few properties of objects instead of the complete object, use the SELECT clause.
• Syntax of using the SELECT clause to get just the first_name field of the Employee object −

String hql = "SELECT E.firstName FROM Employee E";


Query query = session.createQuery(hql);
List results = query.list();

• Employee.firstName is a property of Employee object rather than a field of the EMPLOYEE table.
WHERE Clause
• If you want to narrow the specific objects that are returned from
storage, you use the WHERE clause.

• syntax of using WHERE clause −


String hql = "FROM Employee E WHERE E.id = 10";
Query query = session.createQuery(hql);
List results = query.list();
GROUP BY Clause

• This clause lets Hibernate pull information from the database and
group it based on a value of an attribute and, typically, use the result to
include an aggregate value.

• syntax of using GROUP BY clause −


String hql = "SELECT SUM(E.salary), E.firtName FROM Employee E " +
"GROUP BY E.firstName";
Query query = session.createQuery(hql);
List results = query.list();
Using Named Parameters
• Hibernate supports named parameters in its HQL queries.
• This makes writing HQL queries that accept input from the user easy and you
do not have to defend against SQL injection attacks.
• Following is the simple syntax of using named parameters −
String hql = "FROM Employee E WHERE E.id = :employee_id";
Query query = session.createQuery(hql);
query.setParameter("employee_id",10);
List results = query.list();
UPDATE Clause
Bulk updates are new to HQL with Hibernate 3, and deletes work differently in Hibernate 3 than they did in Hibernate 2.
The Query interface now contains a method called executeUpdate() for executing HQL UPDATE or DELETE statements.

The UPDATE clause can be used to update one or more properties of one or more objects.

Following is the simple syntax of using UPDATE clause −

String hql = "UPDATE Employee set salary = :salary " + "WHERE id = :employee_id";
Query query = session.createQuery(hql);
query.setParameter("salary", 1000);
query.setParameter("employee_id", 10);
int result = query.executeUpdate();
System.out.println("Rows affected: " + result);
DELETE Clause
• The DELETE clause can be used to delete one or more objects.

syntax of using DELETE clause −


String hql = "DELETE FROM Employee " + "WHERE id = employee_id";
Query query = session.createQuery(hql);
query.setParameter("employee_id", 10);
int result = query.executeUpdate();
System.out.println("Rows affected: " + result);
INSERT Clause
HQL supports the INSERT INTO clause only where records can be inserted from one object into another object.

syntax of using INSERT INTO clause −


String hql = "INSERT INTO Employee(firstName, lastName, salary)" +
"SELECT firstName, lastName, salary FROM old_employee";
Query query = session.createQuery(hql);
int result = query.executeUpdate();
System.out.println("Rows affected: " + result);
Aggregate Methods
• HQL supports a range of aggregate methods, similar to SQL.
• They work the same way in HQL as in SQL and following is the list
of the available functions −
Sr.No. Functions & Description
1 avg(property name)
The average of a property's value
2 count(property name or *)
The number of times a property occurs in the results
3 max(property name)
The maximum value of the property values
4 min(property name)
The minimum value of the property values
5 sum(property name)
The sum total of the property values
String hql = "SELECT COUNT(E.id) FROM Employee E";

String hql = "SELECT COUNT(*) FROM Employee E";

• The distinct keyword only counts the unique values in the row set.

• The following query will return only unique count −


String hql = "SELECT count(distinct E.firstName) FROM Employee E";
Query query = session.createQuery(hql);
List results = query.list();
RCFILE IMPLEMENTATION
• RCFILE (Record Columnar File) is a data placement structure that determines how to store relational tables on computer clusters.
• create table employee_rc (name string, salary int, deptno int, DOJ date) row format delimited fields terminated by ',' stored as RCFILE;
• insert into table employee_rc select * from employee;
• create table employee_rc (name string, salary int, deptno int, DOJ date) row format delimited fields terminated by ',' stored as RCFILE location '/data/in/employee_rc';
SERDE
• SerDe is an acronym for Serializer/Deserializer in Hive (Hive SerDe).
• It handles both serialization and deserialization in Hive.
• It also interprets the results of serialization as individual fields for processing.
• In addition, a SerDe allows Hive to read in data from a table and write it back out to HDFS in any custom format.
• Anyone can write their own SerDe for their own data formats.
• HDFS files –> InputFileFormat –> <key, value> –> Deserializer –> Row object
• Row object –> Serializer –> <key, value> –> OutputFileFormat –> HDFS files
• It is very important to note that the “key” part is ignored when reading, and is always a constant when writing. However, the row object is stored into the “value”.
User-Defined Functions (UDFs)
• In Hive, User-Defined Functions (UDFs) allow you to extend the functionality of Hive
by writing your custom functions in Java, Python, or other supported languages.
These functions can then be used in Hive queries to perform custom operations that
are not natively supported by HiveQL.
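The Java class referred to below is not reproduced in the text; a minimal sketch of such a class, assuming the classic org.apache.hadoop.hive.ql.exec.UDF base class and the package name used in the CREATE TEMPORARY FUNCTION statement further down, could look like this:

package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;

public class SquareUDF extends UDF {
    // Hive calls evaluate() once per input value; a null input yields a NULL result.
    public IntWritable evaluate(IntWritable input) {
        if (input == null) {
            return null;
        }
        int value = input.get();
        return new IntWritable(value * value);
    }
}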

This class defines a UDF named SquareUDF that takes an integer as input and returns the square
of that integer as output.
Next, compile the Java class into a JAR file. Make sure to include any dependencies required by
your UDF.
After compiling, you can add the JAR file containing your UDF to the Hive classpath using the
ADD JAR command:

ADD JAR /path/to/your/udf.jar;


Once the JAR file is added, you can use the UDF in your Hive queries:

-- Register the UDF


CREATE TEMPORARY FUNCTION square AS 'com.example.hive.udf.SquareUDF';

-- Use the UDF in a query


SELECT square(5) AS result;
HIVE CRUD OPERATION
create an internal table
create table demo.employee (Id int, Name string , Salary float)
row format delimited fields terminated by ',' ;
create table if not exists demo.employee (Id int, Name string , Salary float)
DESCRIBE demo.employee
create an External table
The external table allows us to create and access a table and its data externally.
The external keyword is used to specify the external table, whereas the location keyword
is used to determine the location of loaded data.
As the table is external, the data is not present in the Hive directory. Therefore, if we try to
drop the table, the metadata of the table will be deleted, but the data still exists.
hdfs dfs -mkdir /HiveDirectory
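The CREATE EXTERNAL TABLE statement itself is not shown; a sketch reusing the same columns as demo.employee and the directory created above (the table name employee_ext is an assumption) would be:

create external table if not exists demo.employee_ext (Id int, Name string, Salary float)
row format delimited fields terminated by ','
location '/HiveDirectory';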

