WELCOME
TOOLS: SQOOP
Sub-topics
Introduction
Sqoop definition
Architecture of Sqoop
Working of Sqoop
Sqoop import
Sqoop export
Features of Sqoop
Advantages of Sqoop
Disadvantages of Sqoop
INTRODUCTION
When Big Data storage and analysis tools such
as MapReduce, Hive, HBase, Cassandra, and Pig
of the Hadoop ecosystem came into the picture,
they required a tool to interact with
relational database servers for importing
and exporting the data residing in them.
Sqoop occupies a place in the Hadoop
ecosystem to provide this interaction
between relational database servers and
Hadoop's HDFS.
SQOOP DEFINITION
Sqoop: “SQL to Hadoop and
Hadoop to SQL”.
A tool to transfer data between
Hadoop and relational databases such as
Teradata, MySQL, PostgreSQL, Oracle,
and Netezza.
It is provided by the Apache
Software Foundation.
ARCHITECTURE OF SQOOP
WORKING OF SQOOP
SQOOP IMPORT
The import tool imports individual
tables from RDBMS to HDFS.
Each row in a table is treated as a
record in HDFS.
All records are stored either as text
data in text files or as binary data in
Avro data files or SequenceFiles.
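As a rough illustration of the import described above, the following Python sketch assembles the argument list for a `sqoop import` of a single table. The helper name `build_import_command` and every connection value (JDBC URL, user, password file, table, target directory) are hypothetical placeholders, not taken from the slides. The sketch only builds and prints the command; it does not run Sqoop.

```python
import shlex

# Output format names (hypothetical keys) mapped to Sqoop's file-format flags.
FORMAT_FLAGS = {
    "text": "--as-textfile",
    "avro": "--as-avrodatafile",
    "sequence": "--as-sequencefile",
}


def build_import_command(jdbc_url, username, password_file, table,
                         target_dir, file_format="text", num_mappers=4):
    """Return the argv list for importing one RDBMS table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--table", table,
        "--target-dir", target_dir,
        FORMAT_FLAGS[file_format],
        "--num-mappers", str(num_mappers),
    ]


# All values below are made-up examples.
cmd = build_import_command(
    jdbc_url="jdbc:mysql://db.example.com/sales",
    username="etl_user",
    password_file="/user/etl/.db_password",
    table="employees",
    target_dir="/data/raw/employees",
    file_format="avro",
)
print(shlex.join(cmd))
```

The printed line could be pasted into a shell on a machine where Sqoop is installed, after swapping in real connection details.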
SQOOP EXPORT
The export tool exports a set of files
from HDFS back to an RDBMS.
The files given as input to Sqoop
contain records, which become rows
in the target table.
They are read and parsed into a
set of records using a
user-specified delimiter.
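To make the parsing step concrete, here is a minimal Python sketch of splitting delimited lines into records, one per table row. It is a simplified illustration, not Sqoop's own parser: it ignores escaping, enclosing characters, null handling and type conversion. The function name `parse_records` and the sample data are hypothetical.

```python
def parse_records(lines, delimiter=","):
    """Split each non-empty line into a tuple of fields (one tuple per row)."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if line:  # skip blank lines
            records.append(tuple(line.split(delimiter)))
    return records


# Made-up sample of what a text file in HDFS might contain.
sample = ["1,Alice,Sales\n", "2,Bob,Marketing\n"]
for record in parse_records(sample):
    print(record)
```

In Sqoop itself, the input field delimiter for an export is set with the `--input-fields-terminated-by` option, and the source directory with `--export-dir`.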
FEATURES OF SQOOP
o Full Load.
o Incremental Load.
o Parallel import/export.
o Import results of SQL query.
o Compression.
o Connectors for all major relational
databases.
o Kerberos Security Integration.
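The idea behind the incremental load feature listed above can be sketched in Python as follows. It is a conceptual model only, assuming an append-style load keyed on an increasing column. Sqoop does the filtering on the database side, and its related options are `--incremental append`, `--check-column` and `--last-value`. The function name and sample rows are hypothetical.

```python
def incremental_append(rows, check_column, last_value):
    """Return rows whose check column exceeds last_value, plus the new high-water mark."""
    new_rows = [row for row in rows if row[check_column] > last_value]
    new_last_value = max(
        (row[check_column] for row in new_rows), default=last_value
    )
    return new_rows, new_last_value


# Made-up source table; the row with id 1 was loaded in an earlier run.
table = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Carol"},
]
new_rows, last_value = incremental_append(table, "id", last_value=1)
print(new_rows, last_value)
```

The returned high-water mark is what the next run would pass as its last value, so each run transfers only rows added since the previous one.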
ADVANTAGES OF SQOOP
Allows the transfer of data to and from a
variety of structured data stores such as
Postgres, Oracle, and Teradata.
Sqoop can execute the data transfer in
parallel, so execution can be quick and
cost-effective.
Helps to integrate sequential data
from a mainframe.
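The parallel transfer mentioned above relies on dividing the work among several map tasks. The Python sketch below shows one simplified way to divide a numeric key range into contiguous slices, one per mapper. It assumes an integer key with `min_id <= max_id` and at least one mapper. It is an illustration of the idea, not Sqoop's actual splitting code, and the function name is hypothetical.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive range [min_id, max_id] into contiguous inclusive slices."""
    total = max_id - min_id + 1
    count = min(num_mappers, total)  # never create empty slices
    base, extra = divmod(total, count)
    ranges = []
    start = min_id
    for i in range(count):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first slices
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


# Each tuple is the key range one mapper would read.
for low, high in split_ranges(1, 100, 4):
    print(f"WHERE id >= {low} AND id <= {high}")
```

Each mapper then reads only its own slice, so the slices can be transferred at the same time.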
DISADVANTAGES OF SQOOP
It uses a JDBC connection to connect
to RDBMS-based data stores, and
this can be inefficient and
slow.
For performing analysis, it executes
various MapReduce jobs and, at times,
this can be time-consuming when
there are a lot of joins if the data is in a
denormalized form.
THANK YOU