
Basic HDFS and Hive Commands
By Dr. R. Satya Krishna Sharma
Some Important Commands of HDFS
 ls
◦ Description: Lists the contents of a directory in HDFS.
◦ Syntax: hadoop fs -ls <path>
 mkdir
◦ Description: Creates a new directory in HDFS.
◦ Syntax: hadoop fs -mkdir <path>
 copyFromLocal
◦ Description: Copies files or directories from the local file system to HDFS.
◦ Syntax: hadoop fs -copyFromLocal <local-source> <hdfs-destination>
 copyToLocal
◦ Description: Copies files or directories from HDFS to the local file system.
◦ Syntax: hadoop fs -copyToLocal <hdfs-source> <local-destination>
 rm
◦ Description: Deletes files or directories in HDFS.
◦ Syntax: hadoop fs -rm <path>
 mv
◦ Description: Moves files or directories within HDFS.
◦ Syntax: hadoop fs -mv <source> <destination>
 cat
◦ Description: Displays the contents of a file in HDFS.
◦ Syntax: hadoop fs -cat <file-path>
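For instance, a short session with these commands might look like the following (the directory /user/hive/input and the file sales.csv are hypothetical names used only for illustration):

hadoop fs -mkdir /user/hive/input
hadoop fs -copyFromLocal sales.csv /user/hive/input/
hadoop fs -ls /user/hive/input
hadoop fs -cat /user/hive/input/sales.csv
hadoop fs -mv /user/hive/input/sales.csv /user/hive/input/sales_2024.csv
hadoop fs -rm /user/hive/input/sales_2024.csv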
Some Important Commands of HDFS (continued)
 chown
◦ Description: Changes the owner of a file or directory in HDFS.
◦ Syntax: hadoop fs -chown [-R] <owner>[:<group>] <path>
 get
◦ Description: Copies files or directories from HDFS to the local file system.
◦ Syntax: hadoop fs -get <hdfs-source> <local-destination>
 put
◦ Description: Copies files or directories from the local file system to HDFS.
◦ Syntax: hadoop fs -put <local-source> <hdfs-destination>
 chmod
◦ Description: Changes the permissions of files or directories in HDFS.
◦ Syntax: hadoop fs -chmod [mode] <path>
 appendToFile
◦ Description: Appends data from a local file to an existing file in HDFS (a standalone command, not an option of copyFromLocal).
◦ Syntax: hadoop fs -appendToFile <local-source> <hdfs-destination>
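A quick illustration of these commands (the user hiveuser, the group hadoop, and the file names are hypothetical):

hadoop fs -put report.txt /user/hive/input/
hadoop fs -get /user/hive/input/report.txt ./backup/
hadoop fs -chown -R hiveuser:hadoop /user/hive/input
hadoop fs -chmod 750 /user/hive/input
hadoop fs -appendToFile extra.txt /user/hive/input/report.txt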
 Create Database:
Syntax: CREATE DATABASE [IF NOT EXISTS] database_name;
Description: Creates a new database.
 Use Database:
Syntax: USE database_name;
Description: Sets the current database context.
 Create Table:
◦ Syntax: CREATE TABLE [IF NOT EXISTS] table_name
◦ ( column1 data_type,
◦ column2 data_type, ... )
◦ [COMMENT 'table_comment']
◦ [PARTITIONED BY (col_name data_type, ...)]
◦ [ROW FORMAT row_format]
◦ [STORED AS file_format]
◦ [LOCATION 'hdfs_path'];
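For example, the statements below create a hypothetical database and a managed, partitioned table inside it (the names retail_db and sales are illustrative only):

CREATE DATABASE IF NOT EXISTS retail_db
COMMENT 'Demo database for the examples in this deck';

USE retail_db;

CREATE TABLE IF NOT EXISTS sales (
  product_id INT,
  sale_date STRING,
  amount DOUBLE
)
COMMENT 'Raw sales records'
PARTITIONED BY (region STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;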
 Show Databases:
Syntax: SHOW DATABASES;
Description: Lists all databases.
 Show Tables:
Syntax: SHOW TABLES;
Description: Lists all tables in the current database.
 Describe Table:
Syntax: DESCRIBE [EXTENDED] table_name;
Description: Displays the schema of a table.
 Select Query:
◦ Syntax: SELECT [ALL | DISTINCT] column1, column2, ...
FROM table_name WHERE condition;
◦ Description: Retrieves data from a table based on the specified
conditions.
 Insert Into Table:
◦ Syntax: INSERT INTO TABLE table_name [PARTITION
(partition_key = 'value', ...)] VALUES (value1, value2, ...);
◦ Description: Inserts data into a table.
 Alter Table:
◦ Syntax (each variant is a separate statement):
ALTER TABLE table_name ADD COLUMNS (new_column data_type [COMMENT 'column_comment'], ...);
ALTER TABLE table_name CHANGE column_name new_column data_type [COMMENT 'new_column_comment'];
ALTER TABLE table_name REPLACE COLUMNS (col_name data_type, ...);
ALTER TABLE table_name RENAME TO new_table_name;
◦ Description: Modifies an existing table. Note that Hive has no DROP COLUMN clause; columns are removed by redefining the column list with REPLACE COLUMNS. A combined worked example follows below.
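A minimal sketch that exercises these commands, continuing with the hypothetical retail_db and sales table created above:

SHOW DATABASES;
SHOW TABLES;
DESCRIBE EXTENDED sales;

INSERT INTO TABLE sales PARTITION (region = 'North')
VALUES (1, '2024-01-01', 100.0);

SELECT DISTINCT product_id, amount
FROM sales
WHERE amount > 50.0;

ALTER TABLE sales ADD COLUMNS (channel STRING COMMENT 'online or in-store');
ALTER TABLE sales RENAME TO sales_raw;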
Partitioning
Partitioning involves dividing a table into smaller, more manageable
parts based on one or more columns. Each partition represents a subset
of the data, and these partitions are stored separately. This is useful
when you often query data based on certain criteria, as it allows Hive to
skip irrelevant partitions during query execution.
 Example of Partitioning:
◦ Let's say you have a table named sales with the following columns: product_id, date,
amount, and region. You can partition this table by the region column.
CREATE TABLE sales_partitioned (
  product_id INT,
  date STRING,
  amount DOUBLE
)
PARTITIONED BY (region STRING);

Now, when you insert data into this table, Hive will automatically create separate directories for each region in the Hadoop Distributed File System (HDFS).

INSERT INTO TABLE sales_partitioned PARTITION (region='North')
VALUES (1, '2024-01-01', 100.0);

INSERT INTO TABLE sales_partitioned PARTITION (region='South')
VALUES (2, '2024-01-02', 150.0);
 The data will be stored in HDFS like this:
◦ /user/hive/warehouse/sales_partitioned/region=North/
   - 0001
◦ /user/hive/warehouse/sales_partitioned/region=South/
   - 0002
 When querying this table, if you filter by region, Hive
will only scan the relevant partition, leading to improved
performance.
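For example, the following query reads only the region=North directory; the region=South partition is skipped entirely (partition pruning):

SELECT product_id, amount
FROM sales_partitioned
WHERE region = 'North';

Depending on the Hive version, EXPLAIN DEPENDENCY can be prefixed to the query to confirm which partitions will actually be read.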
Bucketing in Hive:

Bucketing involves dividing data within each partition into a fixed number
of buckets. This helps distribute data evenly and can be beneficial when
performing join operations, as it reduces the amount of data that needs
to be shuffled and processed.
CREATE TABLE sales_bucketed (
  product_id INT,
  date STRING,
  amount DOUBLE
)
PARTITIONED BY (region STRING)
CLUSTERED BY (product_id) INTO 4 BUCKETS;
In this example, the sales_bucketed table is bucketed by the product_id column
into 4 buckets. When inserting data, Hive will distribute the data evenly across
these buckets.
INSERT INTO TABLE sales_bucketed PARTITION (region='North')
VALUES (1, '2024-01-01', 100.0);

INSERT INTO TABLE sales_bucketed PARTITION (region='South')
VALUES (2, '2024-01-02', 150.0);

The data will be stored in HDFS like this:

/user/hive/warehouse/sales_bucketed/region=North/
  - bucket_00000
  - bucket_00001
  - bucket_00002
  - bucket_00003
/user/hive/warehouse/sales_bucketed/region=South/
  - bucket_00000
  - bucket_00001
  - bucket_00002
  - bucket_00003
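A brief sketch of how a bucketed table is typically used, reusing the sales_bucketed table above (on Hive versions before 2.0 the hive.enforce.bucketing setting must be enabled so that inserts actually honor the bucket definition):

-- Older Hive versions: make inserts respect CLUSTERED BY ... INTO n BUCKETS
SET hive.enforce.bucketing = true;

-- Read only one of the four buckets, e.g. for sampling or debugging
SELECT product_id, amount
FROM sales_bucketed TABLESAMPLE (BUCKET 1 OUT OF 4 ON product_id)
WHERE region = 'North';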
Data Models in Hive
 Table:
◦ The basic building block in Hive is the table. Tables in Hive define the structure of the
data and how it is stored. They are similar to tables in relational databases and can be
partitioned for better performance.
 Partitioning:
◦ Hive allows you to partition data in a table based on one or more columns. This is
particularly useful when dealing with large datasets, as it helps optimize queries by
reducing the amount of data that needs to be scanned.
 Bucketing:
◦ Bucketing is another technique in Hive for organizing data. It involves dividing data
into buckets based on a hash function applied to one or more columns. Bucketing can
improve query performance by reducing the number of files that need to be read.
 External Tables:
◦ Hive supports external tables, where the data is stored outside of the Hive warehouse
directory. This is useful when you want to manage data that is generated or updated by
processes outside of Hive.
 SerDe (Serializer/Deserializer):
◦ Hive uses a SerDe to process data when reading and writing it. A SerDe lets Hive work with various data formats such as JSON, XML, Avro, etc., by defining how rows are serialized and deserialized (a combined external-table/SerDe sketch follows this list).
 Data Modeling with HiveQL:
◦ Hive uses HiveQL, a SQL-like language, for querying data. Through HiveQL, users
can define and manipulate the data model, including creating tables, altering their
structures, and performing various transformations
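As a combined illustration of external tables and SerDes, the sketch below defines an external table over JSON files. The path /data/raw/events and the column names are hypothetical, and depending on the installation the hive-hcatalog-core jar may need to be on the classpath for the JsonSerDe class:

CREATE EXTERNAL TABLE IF NOT EXISTS events_json (
  event_id INT,
  event_type STRING,
  payload STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/data/raw/events';

-- Dropping an external table removes only the metadata;
-- the files under /data/raw/events are left untouched.
DROP TABLE events_json;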
Joins with Hive
 CREATE TABLE employees (
  emp_id INT,
  emp_name STRING,
  dept_id INT
);
INSERT INTO employees VALUES
  (1, 'John', 101),
  (2, 'Alice', 102),
  (3, 'Bob', 101),
  (4, 'Charlie', 103);
 CREATE TABLE departments (
  dept_id INT,
  dept_name STRING
);
INSERT INTO departments VALUES
  (101, 'HR'),
  (102, 'Finance'),
  (104, 'Marketing');
 Left outer join:
SELECT e.emp_id, e.emp_name, e.dept_id, d.dept_name
FROM employees e
LEFT OUTER JOIN departments d ON e.dept_id = d.dept_id;

| emp_id | emp_name | dept_id | dept_name |
|      1 | John     |     101 | HR        |
|      2 | Alice    |     102 | Finance   |
|      3 | Bob      |     101 | HR        |
|      4 | Charlie  |     103 | NULL      |
 Inner join:
SELECT e.emp_id, e.emp_name, e.dept_id, d.dept_name
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id;

| emp_id | emp_name | dept_id | dept_name |
|      1 | John     |     101 | HR        |
|      2 | Alice    |     102 | Finance   |
|      3 | Bob      |     101 | HR        |
A right outer join is the mirror of a left outer join: it returns every row from the right table (departments), filling the employee columns with NULL where there is no match.
SELECT e.emp_id, e.emp_name, e.dept_id, d.dept_name
FROM employees e
RIGHT OUTER JOIN departments d ON e.dept_id = d.dept_id;

| emp_id | emp_name | dept_id | dept_name |
|      1 | John     |     101 | HR        |
|      3 | Bob      |     101 | HR        |
|      2 | Alice    |     102 | Finance   |
|   NULL | NULL     |     104 | Marketing |
 A LEFT JOIN followed by a filter that removes NULL department keys behaves like an INNER JOIN:
SELECT e.emp_id, e.emp_name, e.dept_id, d.dept_name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.dept_id
WHERE d.dept_id IS NOT NULL;

| emp_id | emp_name | dept_id | dept_name |
|      1 | John     |     101 | HR        |
|      2 | Alice    |     102 | Finance   |
|      3 | Bob      |     101 | HR        |