1. Apache Sqoop Basic Commands
Perform/execute the following set of Apache Sqoop basic commands:
• Connecting a Database Server
• Selecting the Data to Import
• Free-form Query Imports
• Controlling Parallelism
• Controlling Imports
AIM
To execute fundamental Apache Sqoop operations for connecting to a
database, selecting specific data for import, performing free-form query
imports, controlling parallel execution, and managing data import processes
efficiently in a Hadoop environment.
ALGORITHM
Step 1: Connect to the Database Server
1. Open the terminal or command prompt.
2. Use the sqoop import command.
3. Specify the database connection details using --connect, --username,
and --password.
4. Define the table name using --table.
Step 2: Select Specific Data for Import
5. Use the --columns option to specify the required columns.
6. Execute the command to import only the selected columns.
Step 3: Perform Free-form Query Imports
7. Use the --query option to execute a SQL query.
8. Ensure the query contains the WHERE $CONDITIONS token so that Sqoop can split the work for parallel execution (when the query is enclosed in double quotes on the command line, escape it as \$CONDITIONS).
Step 4: Control Parallel Execution (Parallelism)
9. Use the --num-mappers option to define the number of parallel tasks.
10. Execute the command for parallel data import.
Step 5: Control Data Imports
11. Use the --where clause to filter data.
12. Specify the target directory using --target-dir.
13. Use --delete-target-dir to remove existing data before importing.
Source code:
• hostname
• hdfs dfs -ls
• service cloudera-scm-server status
• su
• service cloudera-scm-server status
• mysql -u root -pcloudera
• show databases;
• use retail_db;
• show tables;
• select * from departments;
• hostname -f
Connect to the Database Server
• sqoop list-databases --connect jdbc:mysql://quickstart:3306/ --password cloudera --username root;
• sqoop list-tables --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root;
Select Specific Data for Import
• sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root --table departments --columns "department_id,department_name";
• hadoop fs -ls /user/cloudera
• hadoop fs -cat /user/cloudera/departments/part*
Perform Free-form Query Imports
• sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root --table departments --target-dir /user/cloudera/dept1;
• hadoop fs -cat /user/cloudera/dept1/part*
• sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root --table departments -m 3 --where "department_id>4" --target-dir /user/cloudera/dept2;
• hadoop fs -cat /user/cloudera/dept2/part*
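The two imports above still rely on --table; a true free-form query import, as described in Step 3 of the algorithm, can be sketched as follows (the query, split column, and target directory are illustrative; --split-by is needed because more than one mapper is used, and $CONDITIONS is escaped as \$CONDITIONS inside the double-quoted query):
• sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root --query "SELECT department_id, department_name FROM departments WHERE \$CONDITIONS" --split-by department_id --target-dir /user/cloudera/dept_query;
• hadoop fs -cat /user/cloudera/dept_query/part*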
Control Parallel Execution
• sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root --table departments --num-mappers 4
Control Data Imports
• sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root --table departments --target-dir /user/cloudera/dept1;
• hadoop fs -cat /user/cloudera/dept1/part*
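Re-running the identical import into /user/cloudera/dept1 fails once the directory already exists; the --where and --delete-target-dir options from Step 5 of the algorithm handle row filtering and cleanup. An illustrative variant (the filter value is an assumption):
• sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root --table departments --where "department_id>3" --delete-target-dir --target-dir /user/cloudera/dept1;
• hadoop fs -cat /user/cloudera/dept1/part*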
2. Apache Sqoop Basic Commands
Perform/execute the following set of Apache Sqoop basic commands:
• Controlling Mappers
• File Formats
• Large Objects
• Importing Data Into Hive
• Import all tables
• Sqoop Export
AIM
To execute Apache Sqoop commands for controlling mappers, handling file
formats, processing large objects (LOBs), importing data into Hive, importing
all tables from a database, and exporting data from Hadoop to a relational
database.
ALGORITHM
Controlling Mappers
1. Open the terminal and ensure Sqoop is installed and configured.
2. Use the sqoop import command with the --num-mappers option.
3. Define the database connection details (--connect, --username, --
password).
4. Specify the table name.
5. Set the number of mappers (e.g., --num-mappers 2).
6. Execute the command to control parallel execution.
Handling File Formats
7. Use --as-textfile, --as-avrodatafile, or --as-parquetfile to specify the file format.
8. Define the target directory using --target-dir.
9. Execute the command to store data in the specified format.
Handling Large Objects (LOBs)
10. Note that --direct mode (available for supported databases) does not handle BLOB/CLOB columns, so LOB imports use the regular JDBC path.
11. Use the --inline-lob-limit option to define the maximum size of a LOB that is stored inline with the rest of the data (e.g., --inline-lob-limit 10485760 for 10 MB); larger objects are written to separate files.
12. Execute the command to import large objects.
Importing Data Into Hive
13. Use the --hive-import option to load data into Hive.
14. Define the Hive database and table using --hive-database and --hive-table.
15. Execute the command to import data into Hive.
Importing All Tables
16. Use the sqoop import-all-tables command.
17. Specify the database connection details (--connect, --username, --password).
18. Define the target directory using --warehouse-dir.
19. Execute the command to import all tables into Hadoop.
Sqoop Export
20. Use the sqoop export command.
21. Define the database connection details (--connect, --username, --password).
22. Specify the table name and HDFS source directory (--export-dir).
23. Execute the command to transfer data from HDFS to the database.
Source Code:
1. Login to MySQL
a. mysql -u root -pcloudera
b. CREATE DATABASE IF NOT EXISTS retail_db;
   USE retail_db;
   CREATE TABLE customers (customer_id INT PRIMARY KEY, first_name VARCHAR(50), last_name VARCHAR(50), email VARCHAR(100));
   INSERT INTO customers VALUES (1, 'John', 'Doe', '[email protected]');
   INSERT INTO customers VALUES (2, 'Jane', 'Smith', '[email protected]');
c. exit;
2. Import Data from MySQL to HDFS (With Mapper Control)
a. sqoop import --connect jdbc:mysql://quickstart.cloudera/retail_db --username root --password cloudera --table customers --num-mappers 2 --target-dir /user/cloudera/customers_data
3. Import Data in Different File Formats
a. sqoop import --connect jdbc:mysql://quickstart.cloudera/retail_db --username root --password cloudera --table customers --as-avrodatafile --target-dir /user/cloudera/customers_avro
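The algorithm also mentions text and Parquet output; the same import can request those formats by swapping the flag (the target directories below are illustrative placeholders):
b. sqoop import --connect jdbc:mysql://quickstart.cloudera/retail_db --username root --password cloudera --table customers --as-parquetfile --target-dir /user/cloudera/customers_parquet
c. sqoop import --connect jdbc:mysql://quickstart.cloudera/retail_db --username root --password cloudera --table customers --as-textfile --target-dir /user/cloudera/customers_text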
4. Handling Large Objects (BLOBs/CLOBs)
a. sqoop import --connect jdbc:mysql://quickstart.cloudera/retail_db --username root --password cloudera --table customers --split-by customer_id --target-dir /user/cloudera/customers_large
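The command above relies on Sqoop's default LOB handling; the --inline-lob-limit option from the algorithm can be added to control how large an object may be before it is spilled to a separate file. A sketch with an assumed 10 MB limit and an illustrative target directory:
b. sqoop import --connect jdbc:mysql://quickstart.cloudera/retail_db --username root --password cloudera --table customers --split-by customer_id --inline-lob-limit 10485760 --target-dir /user/cloudera/customers_lob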
5. Import Data into Hive
a. sqoop import --connect jdbc:mysql://quickstart.cloudera/retail_db --username root --password cloudera --table customers --hive-import --hive-database retail_hive --hive-table customers
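Note: the Hive import above assumes the retail_hive database already exists. If it does not, it can be created beforehand (a one-off step, assuming the hive CLI is available on the node):
b. hive -e "CREATE DATABASE IF NOT EXISTS retail_hive;"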
6. Import All Tables from MySQL to Hive
a. sqoop import-all-tables --connect jdbc:mysql://quickstart.cloudera/retail_db --username root --password cloudera --hive-import --hive-database retail_hive
7. Export Data from HDFS to MySQL
Create a Table in MySQL to Store Exported Data
CREATE TABLE customers_export (customer_id INT PRIMARY KEY, first_name VARCHAR(50), last_name VARCHAR(50), email VARCHAR(100));
8. Run Sqoop Export
a. sqoop export --connect jdbc:mysql://quickstart.cloudera/retail_db --username root --password cloudera --table customers_export --export-dir /user/cloudera/customers_data --input-fields-terminated-by ','
9. Verify Data
a. hdfs dfs -ls /user/cloudera/
b. hdfs dfs -cat /user/cloudera/customers_data/part-m-00000
c. In the Hive shell (hive):
   USE retail_hive; SHOW TABLES; SELECT * FROM customers LIMIT 5;
d. mysql -u root -pcloudera
USE retail_db;
SELECT * FROM customers_export;
3. Perform a word count job for a given input file using Spark SQL.
Aim
To implement a Word Count program using PySpark SQL to process a text
file, tokenize words, and count their occurrences efficiently.
Algorithm
Step 1: Import necessary libraries from pyspark.sql.
Step 2: Create a SparkSession using
SparkSession.builder.appName("WordCountSQL").getOrCreate().
Step 3: This initializes a Spark environment to process data.
Step 4: Use spark.read.text("sample.txt") to read the text file into a
DataFrame.
Step 5: The DataFrame has a single column named "value", where each row
contains a line from the file.
Step 6: Use split(col("value"), " ") to split each line into words based on
spaces.
Step 7: Use explode() to flatten the list of words, creating a row for each
word.
Step 8: Use .groupBy("word").count() to count the occurrences of each word
in the dataset.
Step 9: Use .show() to display the word counts.
Source Code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, col
# Step 1: Initialize Spark Session
spark = SparkSession.builder.appName("WordCountSQL").getOrCreate()
# Step 2: Load the Text File into DataFrame
df = spark.read.text("sample.txt") # Change to your file path
# Step 3: Process Data using SQL Functions
word_counts = (
    df.select(explode(split(col("value"), " ")).alias("word"))  # Split each line on spaces and flatten into one word per row
    .groupBy("word")
    .count()  # Count occurrences of each word
)
# Step 4: Show Results
word_counts.show()
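One way to run the script, assuming it is saved as wordcount_sql.py (a hypothetical file name) and sample.txt exists at the path the script reads:
spark-submit wordcount_sql.py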
Result: The Word Count program using PySpark SQL was executed successfully; the input text file was read, the words were tokenized, and their occurrences were counted.