SQOOP
 SQOOP is a data ingestion tool.
 SQOOP is designed to transfer data between HDFS and an RDBMS such as MySQL, Oracle, etc.
 It can also export data from HDFS back to the RDBMS.
 Simple: the user specifies the “what” and leaves the “how” to the underlying processing engine.
 Rapid development.
 No Java programming is required.
 Originally developed by Cloudera.
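• As a minimal sketch (reusing the sample MySQL database that appears later in this deck), a single command is enough to pull a whole table into HDFS:
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table film --username sqoop --password sqoop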
Why SQOOP
• Data is already available in RDBMS systems worldwide.
• Nightly processing has been done on RDBMS for years.
• The need is to move certain data from the RDBMS to Hadoop for processing.
• Transferring data with hand-written scripts is inefficient and time-consuming.
• Traditional databases already have reporting and data-visualization applications configured against them.
SQOOP Under the hood
• The dataset being transferred is sliced up into different partitions.
• A map-only job is launched, with individual mappers responsible for transferring a slice of the dataset.
• Each record is handled in a type-safe manner, since SQOOP uses the database metadata to determine the data types.
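• For example (a sketch using the sample database from the practice session), asking for four mappers makes each mapper transfer its own slice of the table; the result typically appears in HDFS as files named part-m-00000 through part-m-00003:
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table film --username sqoop --password sqoop -m 4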
How SQOOP Import works
• Step-1
 SQOOP introspects the database to gather the necessary metadata for the data being imported.
• Step-2
 SQOOP submits a map-only Hadoop job to the cluster, which performs the data transfer using the metadata captured in Step-1.
• The imported data is saved in an HDFS directory named after the table being imported.
• By default these files contain comma-delimited fields, with a newline separating records.
 The user can override this format by specifying the field separator and record terminator characters, as in the sketch below.
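• A sketch of overriding the defaults against the sample sakila database (the target directory name here is only illustrative): --fields-terminated-by changes the field separator and --lines-terminated-by changes the record terminator.
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table film --username sqoop --password sqoop --fields-terminated-by '|' --lines-terminated-by '\n' --target-dir /user/cloudera/test/film_pipe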
How SQOOP Export works
• Step-1
 SQOOP introspects the database to gather the necessary metadata for the data being exported.
• Step-2
 Transfer the data:
 SQOOP divides the input dataset into splits.
 Sqoop uses individual map tasks to push the splits to the database.
 Each map task performs this transfer over many transactions in order to ensure optimal throughput and minimal resource utilization.
The target table must already exist in the database. Sqoop
performs a set of INSERT INTO operations, without regard for
existing content. If Sqoop attempts to insert rows which
violate constraints in the database (for example, a particular
primary key value already exists), then the export fails.
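• A sketch of an export, assuming the test table already exists in the sample sakila database and that /user/cloudera/actor holds previously imported data (both names are taken from the practice session later in this deck):
 sqoop export --connect jdbc:mysql://192.168.45.1:3306/sakila --table test --username sqoop --password sqoop --export-dir /user/cloudera/actor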
Importing Data into Hive
• --hive-import
Appending this option to the SQOOP import command makes SQOOP populate the Hive metastore with the appropriate metadata for the table and invoke the necessary commands to load the table or partition.
• With a Hive import, SQOOP converts the data from the native data types of the external datastore into the corresponding Hive types.
• SQOOP automatically chooses the native delimiter set used by Hive.
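• A sketch, reusing the language table import from the practice session; adding --hive-import is the only change compared to a plain HDFS import:
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table language --username sqoop --password sqoop -m 1 --hive-import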
Importing Data into HBase
• SQOOP can populate data into a specific column family of an HBase table.
• The HBase table and column-family settings are required in order to import data into HBase.
• Data imported into HBase is converted to its string representation and inserted as UTF-8 bytes.
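• A sketch, reusing the actor import from the practice session: --hbase-table and --column-family name the destination, and --hbase-row-key selects the column used as the HBase row key.
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table actor --username sqoop --password sqoop --columns 'actor_id,first_name,last_name' --hbase-table ActorInfo --column-family ActorName --hbase-row-key actor_id -m 1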
Connecting to a Database Server
• The connect string is similar to a URL, and is communicated to
Sqoop with the --connect argument.
• This describes the server and database to connect to; it may also
specify the port.
• You can use the --username and --password or -P parameters to
supply a username and a password to the database.
• For example:
• sqoop import --connect jdbc:mysql://IPAddress:port/DBName --table tableName --username sqoop --password sqoop
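• To keep the password off the command line, -P prompts for it interactively, for example:
 sqoop import --connect jdbc:mysql://IPAddress:port/DBName --table tableName --username sqoop -P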
Controlling Parallelism
• Sqoop imports data in parallel from most database sources. You can
specify the number of map tasks (parallel processes) to use to
perform the import by using the -m or --num-mappers argument.
• NOTE: Do not increase the degree of parallelism beyond what your database can reasonably support. For example, connecting 100 concurrent clients to your database may increase the load on the database server to a point where performance suffers as a result.
• Sqoop uses a splitting column to split the workload. By default,
Sqoop will identify the primary key column (if present) in a table
and use it as the splitting column.
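• A sketch against the sample film table: -m 8 (equivalently --num-mappers 8) requests eight parallel map tasks, and --split-by names the column used to partition the work.
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table film --username sqoop --password sqoop --target-dir '/user/cloudera/test/film' --split-by film_id -m 8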
How Parallel import works
• The low and high values for the splitting column are retrieved from
the database, and the map tasks operate on evenly-sized
components of the total range. By default, four tasks are used. For
example, if you had a table with a primary key column of id whose
minimum value was 0 and maximum value was 1000, and Sqoop
was directed to use 4 tasks, Sqoop would run four processes which
each execute SQL statements of the form SELECT * FROM
sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250),
(250, 500), (500, 750), and (750, 1001) in the different tasks.
• NOTE: Sqoop cannot currently split on a multi-column primary key. If your table has no index column, or has a multi-column key, then you must manually choose a splitting column.
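• A sketch of choosing the splitting column manually (here film_id from the sample film table); the same --split-by option is what you would use when the primary key cannot be used for splitting:
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table film --username sqoop --password sqoop --split-by film_id --target-dir /user/cloudera/test/film -m 4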
Incremental Imports
• Sqoop provides an incremental import mode which can be used to
retrieve only rows newer than some previously-imported set of rows.
• Sqoop supports two types of incremental imports: 1) append and 2) lastmodified.
• 1)You should specify append mode when importing a table where new
rows are continually being added with increasing row id values. You
specify the column containing the row’s id with --check-column. Sqoop
imports rows where the check column has a value greater than the one
specified with --last-value.
• 2)Lastmodified mode should be used when rows of the source table may
be updated, and each such update will set the value of a last-modified
column to the current timestamp. Rows where the check column holds a
timestamp more recent than the timestamp specified with --last-value are
imported.
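• Two sketches against the sample actor table (the append example reuses values from the practice session; the lastmodified check column, timestamp, and target directories are illustrative assumptions):
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table actor --username sqoop --password sqoop --check-column actor_id --incremental append --last-value 180 --target-dir /user/cloudera/test/actor_inc
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table actor --username sqoop --password sqoop --check-column last_update --incremental lastmodified --last-value '2016-01-01 00:00:00' --target-dir /user/cloudera/test/actor_upd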
Install SQOOP
• To install SQOOP:
• Download sqoop-*.tar.gz
• tar -xvf sqoop-*.*.tar.gz
• export HADOOP_HOME=/some/path/hadoop-dir
• Add the vendor-specific JDBC jar to $SQOOP_HOME/lib
• Change to the Sqoop bin folder
• ./sqoop help
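• Put together as a rough shell sketch (the paths are placeholders to adjust for your layout, and the MySQL connector jar stands in for whichever vendor-specific JDBC driver you need):
 tar -xvf sqoop-*.*.tar.gz
 export HADOOP_HOME=/some/path/hadoop-dir
 export SQOOP_HOME=/some/path/sqoop-dir
 cp mysql-connector-java-*.jar $SQOOP_HOME/lib
 cd $SQOOP_HOME/bin && ./sqoop help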
Practice Session
SQOOP Commands
• A basic import of a table
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table film --username sqoop --password sqoop
• Load sample data to a target directory
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table film --username sqoop --password sqoop --target-dir '/user/cloudera/test/film' -m 1
• Load sample data with output directory and package
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table film --username sqoop --password sqoop --package-name org.sandeep.sample --outdir '/home/cloudera/sandeep/test1' --target-dir '/user/cloudera/test/film' -m 1
• Controlling the import parallelism (8 parallel tasks):
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table film --username sqoop --password sqoop --target-dir '/user/cloudera/test/film' --split-by film_id -m 8
SQOOP Commands
• Incremental import (append mode)
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table actor --username sqoop --password sqoop --check-column actor_id --incremental append --last-value 180 --target-dir /user/cloudera/test/film3
• Save the target file in tab-separated format
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table film --username sqoop --password sqoop --target-dir '/user/cloudera/test/film' --fields-terminated-by '\t' -m 1
• Selecting specific columns from the actor table
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table actor --username sqoop --password sqoop --columns 'actor_id,first_name,last_name' --target-dir /user/cloudera/test/actor1
• Query usage to import with a condition
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --query 'select * from film where film_id < 91 and $CONDITIONS' --username sqoop --password sqoop --target-dir '/user/cloudera/test/film2' --split-by film_id -m 2
SQOOP Commands
• Storing data in SequenceFiles
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table film --username sqoop --password sqoop --as-sequencefile --target-dir /user/cloudera/test/f
• Importing data to Hive:
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table language --username sqoop --password sqoop -m 1 --hive-import
• Import only the schema to a Hive table
 sqoop create-hive-table --connect jdbc:mysql://192.168.45.1:3306/sakila --table actor --username sqoop --password sqoop --fields-terminated-by ','
• Importing data to HBase:
 sqoop import --connect jdbc:mysql://192.168.45.1:3306/sakila --table actor --username sqoop --password sqoop --columns 'actor_id,first_name,last_name' --hbase-table ActorInfo --column-family ActorName --hbase-row-key actor_id -m 1
SQOOP Commands
• Import all tables
 sqoop import-all-tables --connect jdbc:mysql://192.168.45.1:3306/sakila --username sqoop --password sqoop
• SQOOP EXPORT
 sqoop export --connect jdbc:mysql://192.168.45.1:3306/sakila --table test --username sqoop --password sqoop --export-dir /user/cloudera/actor
• SQOOP Version:
 $ sqoop version
• List tables present in a database
 sqoop list-tables --connect jdbc:mysql://192.168.45.1:3306/sakila --username sqoop --password sqoop
SQOOP JOBS
Creating saved jobs is done with the --create action. This operation
requires a -- followed by a tool name and its arguments. The tool and
its arguments will form the basis of the saved job.
• Step-1 (Create a job)
 sqoop job --create myjob -- export --connect jdbc:mysql://192.168.45.1:3306/sakila --table test --username sqoop --password sqoop --export-dir /user/cloudera/actor
• Step-2 (view the list of available jobs)
 sqoop job --list
• Step-3 (verify the job details)
 sqoop job --show myjob
• Step-4 (execute the job)
 sqoop job --exec myjob
Saved jobs and passwords
• Sqoop does not store passwords in the metastore, as the metastore is not a secure resource.
• Hence, if you create a job that requires a password, you will be prompted for that password each time you execute the job.
• You can enable passwords in the metastore by
setting sqoop.metastore.client.record.password to true in
the configuration.
• Note: set sqoop.metastore.client.record.password to true if
you are executing saved jobs via Oozie because Sqoop
cannot prompt the user to enter passwords while being
executed as Oozie tasks.
Sqoop-eval
• The eval tool allows users to quickly run simple SQL
queries against a database; results are printed to the
console. This allows users to preview their import
queries to ensure they import the data they expect.
• sqoop eval --connect jdbc:mysql://192.168.45.1:3306/sakila --query 'select * from film limit 10' --username sqoop --password sqoop
• sqoop eval --connect jdbc:mysql://192.168.45.1:3306/sakila --query "insert into test values(200,'test','test','2006-01-01 00:00:00')" --username sqoop --password sqoop
Sqoop-codegen
• The codegen tool generates Java code; it does not perform a full import.
• The tool can be used to regenerate the code if the generated Java source file is lost.
• sqoop codegen --connect jdbc:mysql://192.168.45.1:3306/sakila --table film --username sqoop --password sqoop
Thank You
• Questions?
• Feedback?
Write to me: explorehadoop@gmail.com