Lab Experiments 1, 2 & 4

The document outlines basic commands for Apache Sqoop, including connecting to a database, importing specific data, controlling parallelism, and managing imports. It also covers advanced operations such as handling file formats, large objects, importing data into Hive, and exporting data back to a relational database. Additionally, it includes a PySpark SQL implementation of a Word Count program that processes a text file and counts word occurrences.

1. Apache Sqoop Basic Commands

Perform / execute the following sets of Apache Sqoop basic commands:

• Connecting to a Database Server
• Selecting the Data to Import
• Free-form Query Imports
• Controlling Parallelism
• Controlling Imports

AIM

To execute fundamental Apache Sqoop operations for connecting to a database, selecting specific data for import, performing free-form query imports, controlling parallel execution, and managing data import processes efficiently in a Hadoop environment.

ALGORITHM

Step 1: Connect to the Database Server

1. Open the terminal or command prompt.
2. Use the sqoop import command.
3. Specify the database connection details using --connect, --username, and --password.
4. Define the table name using --table.
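
A generic template combining these options (host, database, credentials, and table name are placeholders, not values from this lab) looks like:

• sqoop import --connect jdbc:mysql://<host>:3306/<database> --username <user> --password <password> --table <table>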

Step 2: Select Specific Data for Import

5. Use the --columns option to specify the required columns.
6. Execute the command to import only the selected columns.

Step 3: Perform Free-form Query Imports

7. Use the --query option to execute a SQL query.
8. Ensure the query contains WHERE $CONDITIONS for parallel execution.

Step 4: Control Parallel Execution (Parallelism)

9. Use the --num-mappers option to define the number of parallel tasks.
10. Execute the command for parallel data import.

Step 5: Control Data Imports


11. Use the --where clause to filter data.
12. Specify the target directory using --target-dir.
13. Use --delete-target-dir to remove existing data before importing.

Source code:

• hostname
• hdfs dfs -ls
• service cloudera-scm-server status
• su
• service cloudera-scm-server status
• mysql -u root -pcloudera
• show databases;
• use retail_db;
• show tables;
• select * from departments;
• hostname -f

Connect to the Database Server

• sqoop list-databases --connect jdbc:mysql://quickstart:3306/ --password cloudera --username root;
• sqoop list-tables --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root;

Select Specific Data for Import

• sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root --table departments --columns "department_id,department_name";
• hadoop fs -ls /user/cloudera
• hadoop fs -cat /user/cloudera/departments/part*

Perform Free-form Query Imports

• sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root --table departments --target-dir /user/cloudera/dept1;
• hadoop fs -cat /user/cloudera/dept1/part*
• sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root --table departments -m 3 --where "department_id>4" --target-dir /user/cloudera/dept2;
• hadoop fs -cat /user/cloudera/dept2/part*
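
Neither command above actually uses --query; a true free-form query import is sketched below (the query text and target directory are illustrative, not from the original lab sheet). When the query is wrapped in double quotes, $CONDITIONS must be escaped as \$CONDITIONS, and --split-by is required when using more than one mapper:

• sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root --query "SELECT department_id, department_name FROM departments WHERE \$CONDITIONS" --split-by department_id --target-dir /user/cloudera/dept_query
• hadoop fs -cat /user/cloudera/dept_query/part*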

Control Parallel Execution

• sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root --table departments --num-mappers 4
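
If the default target directory /user/cloudera/departments still exists from the earlier import, this command will typically fail because the output directory already exists. A hedged variant that avoids this (the directory name is illustrative) is:

• sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root --table departments --num-mappers 4 --target-dir /user/cloudera/dept_parallel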

Control Data Imports

• sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root --table departments --target-dir /user/cloudera/dept1;
• hadoop fs -cat /user/cloudera/dept1/part*
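
Steps 11-13 of the algorithm also call for --where and --delete-target-dir, which the command above does not use; a hedged variant combining them (the filter value is illustrative) is:

• sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --password cloudera --username root --table departments --where "department_id > 3" --target-dir /user/cloudera/dept1 --delete-target-dir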

2. Apache Sqoop Basic Commands

Perform / execute the following sets of Apache Sqoop basic commands:

• Controlling Mappers
• File Formats
• Large Objects
• Importing Data Into Hive
• Import all tables
• Sqoop Export

AIM

To execute Apache Sqoop commands for controlling mappers, handling file formats, processing large objects (LOBs), importing data into Hive, importing all tables from a database, and exporting data from Hadoop to a relational database.

ALGORITHM

Controlling Mappers
1. Open the terminal and ensure Sqoop is installed and configured.
2. Use the sqoop import command with the --num-mappers option.
3. Define the database connection details (--connect, --username, --
password).
4. Specify the table name.
5. Set the number of mappers (e.g., --num-mappers 2).
6. Execute the command to control parallel execution.

Handling File Formats

7. Use either --as-textfile, --as-avrodatafile, or --as-parquetfile to specify the file format.
8. Define the target directory using --target-dir.
9. Execute the command to store data in the specified format.
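
The Source Code section below demonstrates only the Avro format; a hedged sketch of the Parquet variant (the target directory name is illustrative) would be:

• sqoop import --connect jdbc:mysql://quickstart.cloudera/retail_db --username root --password cloudera --table customers --as-parquetfile --target-dir /user/cloudera/customers_parquet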

Handling Large Objects (LOBs)

10. Use the --direct mode to speed up the import (for supported databases).
11. Use the --inline-lob-limit option to define the maximum inline LOB size (e.g., --inline-lob-limit 10485760 for 10 MB).
12. Execute the command to import large objects.
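
A hedged sketch combining these options with the lab's customers table (the target directory is illustrative, and the LOB limit only matters for tables that actually contain BLOB/CLOB columns) is:

• sqoop import --connect jdbc:mysql://quickstart.cloudera/retail_db --username root --password cloudera --table customers --split-by customer_id --inline-lob-limit 10485760 --target-dir /user/cloudera/customers_lob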

Importing Data Into Hive

13. Use the --hive-import option to load data into Hive.
14. Define the Hive database and table using --hive-database and --hive-table.
15. Execute the command to import data into Hive.

Importing All Tables

16. Use the sqoop import-all-tables command.
17. Specify the database connection details (--connect, --username, --password).
18. Define the target directory using --warehouse-dir.
19. Execute the command to import all tables into Hadoop.
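
The Source Code section below imports all tables directly into Hive; a hedged variant that lands them under an HDFS warehouse directory instead (the directory name is illustrative) is:

• sqoop import-all-tables --connect jdbc:mysql://quickstart.cloudera/retail_db --username root --password cloudera --warehouse-dir /user/cloudera/retail_all
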
Sqoop Export

20. Use the sqoop export command.
21. Define the database connection details (--connect, --username, --password).
22. Specify the table name and HDFS source directory (--export-dir).
23. Execute the command to transfer data from HDFS to the database.

Source Code:

1. Login to MySQL
a. mysql -u root -pcloudera
b. CREATE DATABASE retail_db;
   USE retail_db;
   CREATE TABLE customers (customer_id INT PRIMARY KEY, first_name VARCHAR(50), last_name VARCHAR(50), email VARCHAR(100));
   INSERT INTO customers VALUES (1, 'John', 'Doe', '[email protected]');
   INSERT INTO customers VALUES (2, 'Jane', 'Smith', '[email protected]');
c. exit;
2. Import Data from MySQL to HDFS (With Mapper Control)
a. sqoop import --connect jdbc:mysql://quickstart.cloudera/retail_db --username root --password cloudera --table customers --num-mappers 2 --target-dir /user/cloudera/customers_data
3. Import Data in Different File Formats
a. sqoop import --connect jdbc:mysql://quickstart.cloudera/retail_db --username root --password cloudera --table customers --as-avrodatafile --target-dir /user/cloudera/customers_avro
4. Handling Large Objects (BLOBs/CLOBs)
a. sqoop import --connect jdbc:mysql://quickstart.cloudera/retail_db --username root --password cloudera --table customers --split-by customer_id --target-dir /user/cloudera/customers_large
5. Import Data into Hive
a. sqoop import --connect jdbc:mysql://quickstart.cloudera/retail_db --username root --password cloudera --table customers --hive-import --hive-database retail_hive --hive-table customers
6. Import All Tables from MySQL to Hive
a. sqoop import-all-tables --connect jdbc:mysql://quickstart.cloudera/retail_db --username root --password cloudera --hive-import --hive-database retail_hive
7. Export Data from HDFS to MySQL
Create a table in MySQL to store the exported data:
a. CREATE TABLE customers_export (customer_id INT PRIMARY KEY, first_name VARCHAR(50), last_name VARCHAR(50), email VARCHAR(100));
8. Run Sqoop Export
a. sqoop export --connect jdbc:mysql://quickstart.cloudera/retail_db --username root --password cloudera --table customers_export --export-dir /user/cloudera/customers_data --input-fields-terminated-by ','
9. Verify Data
a. hdfs dfs -ls /user/cloudera/
b. hdfs dfs -cat /user/cloudera/customers_data/part-m-00000
c. In the Hive shell: USE retail_hive; SHOW TABLES; SELECT * FROM customers LIMIT 5;
d. mysql -u root -pcloudera
   USE retail_db;
   SELECT * FROM customers_export;
3. Perform a word count job for a given input file using Spark SQL.

Aim

To implement a Word Count program using PySpark SQL to process a text file, tokenize words, and count their occurrences efficiently.

Algorithm

Step 1: Import necessary libraries from pyspark.sql.

Step 2: Create a SparkSession using SparkSession.builder.appName("WordCountSQL").getOrCreate().

Step 3: This initializes a Spark environment to process data.

Step 4: Use spark.read.text("sample.txt") to read the text file into a DataFrame.

Step 5: The DataFrame has a single column named "value", where each row
contains a line from the file.

Step 6: Use split(col("value"), " ") to split each line into words based on
spaces.

Step 7: Use explode() to flatten the list of words, creating a row for each
word.

Step 8: Use .groupBy("word").count() to count the occurrences of each word in the dataset.

Step 9: Use .show() to display the word counts.

Source Code:

from pyspark.sql import SparkSession

from pyspark.sql.functions import explode, split, col

# Step 1: Initialize Spark Session

spark = SparkSession.builder.appName("WordCountSQL").getOrCreate()

# Step 2: Load the Text File into DataFrame

df = spark.read.text("sample.txt") # Change to your file path

# Step 3: Process Data using SQL Functions

word_counts = (
    df.select(explode(split(col("value"), " ")).alias("word"))  # Split each line and flatten into one word per row
    .groupBy("word")
    .count()  # Count occurrences of each word
)

# Step 4: Show Results

word_counts.show()
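
Since the experiment title calls for Spark SQL specifically, an equivalent variant is sketched below (a minimal sketch reusing the same spark session and df from the code above): it registers the lines as a temporary view and expresses the count as an actual SQL query.

# Alternative: express the word count as a SQL query over a temporary view
df.createOrReplaceTempView("lines")

word_counts_sql = spark.sql(
    "SELECT word, COUNT(*) AS word_count "
    "FROM (SELECT explode(split(value, ' ')) AS word FROM lines) AS words "
    "GROUP BY word"
)

word_counts_sql.show()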

Result: The Word Count program using PySpark SQL was executed successfully; it processed the text file, tokenized the words, and counted their occurrences.
