
BIG DATA-SPARK

Lab Manual

B. Tech III Year - II Semester


Department of Computer Science and
Engineering

VIGNAN’S INSTITUTE OF MANAGEMENT AND TECHNOLOGY FOR WOMEN
VISION & MISSION OF THE INSTITUTE & DEPARTMENT

Vision & Mission of the College

Vision of the College:


To empower students with professional education using creative & innovative technical
practices of global competence and research aptitude to become competitive engineers with
ethical values and entrepreneurial skills

Mission of the college:


To impart value based professional education through creative and innovative teaching-
learning process to face the global challenges of the new era technology.
To inculcate research aptitude and to bring out creativity in students by imparting
engineering knowledge imbibing interpersonal skills to promote innovation, research and
entrepreneurship.

Vision & Mission of the Department

Vision of the Department of CSE:


To emerge as a center of excellence for computer science and engineering by imparting education oriented towards social, moral and ethical values through advanced pedagogical techniques, and to produce technologically competent professionals of global standards, capable of solving the challenges of the time through innovative and creative solutions.

Mission of the Department of CSE:

To encourage inquiry-driven, advanced knowledge building among students and to impart foundational knowledge of computer science and its applications in all spheres, using state-of-the-art facilities and software industry-institute interaction.
To advance department-industry collaboration through interaction with professional societies via seminars, workshops, guest lectures and student internship programs.
To nurture students with leadership qualities and communication skills, and to instil the qualities needed to work as a team member and a leader for economic and technological development in cutting-edge technologies in the national and global arena.

Program Educational Objectives (PEOs):
PEO1: To provide graduates the foundational and essential knowledge in mathematics, science,
computer science and engineering and interdisciplinary engineering to emerge as
technocrats.

PEO2: To inculcate the capabilities to analyse, design and develop innovative solutions of
computer support systems for the benefit of society, through diligence and teamwork.

PEO3: To drive graduates towards employment, pursuing higher studies, or becoming entrepreneurs.

Program Outcomes (POs):

PO1: Engineering Knowledge: Apply the knowledge of mathematics, science, engineering


fundamentals and an engineering specialization to the solution of complex engineering problems

PO2: Problem Analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of mathematics,
natural sciences and Engineering sciences.

PO3: Design/Development of Solutions: Design solutions for complex engineering problems and
design system components or processes that meet the specified needs with appropriate
consideration for the public health and safety, and the cultural, societal, and environmental
considerations.

PO4: Conduct Investigations of Complex Problems: Use research-based knowledge and


research methods including design of experiments, analysis and interpretation of data, and
synthesis of the information to provide valid conclusions.

PO5: Modern Tool Usage: Create, select and apply appropriate techniques, resources and modern
engineering and IT tools including prediction and modeling to complex engineering activities with
an understanding of the limitations.

PO6: The Engineer and Society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the
professional engineering practice.


PO7: Environment and Sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts and demonstrate the knowledge of, and need for,
sustainable development.

PO8: Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.

PO9: Individual and Team Work: Function effectively as an individual and as a member
or leader in diverse teams and in multidisciplinary settings.

PO10: Communication: Communicate effectively on complex engineering activities with the


engineering community and with society at large, such as being able to comprehend and write
effective reports and design documentation, make effective presentations and give and receive
clear instructions.

PO11: Project Management and Finance: Demonstrate knowledge and understanding of the
engineering management principles and apply these to one's own work, as a member and leader in
a team to manage projects and in multidisciplinary environments.

PO12: Life-Long Learning: Recognize the need for and have the preparation and ability to
engage in independent and lifelong learning in the broadest context of technological change.

Program Specific Outcomes (PSOs):

PSO1: Foundation on Software Development:


Ability to grasp the software development life cycle of software systems and possess
competent skills and knowledge of software design process

PSO2: Industrial Skills Ability:


Ability to interpret fundamental concepts and methodology of computer systems so that
students can understand the functionality of hardware and software aspects of computer
systems

PSO3: Ethical and Social Responsibility:


Communicate effectively in both verbal and written form, have knowledge of professional
and ethical responsibilities, show an understanding of the impact of engineering solutions
on society, and be aware of contemporary issues.

INDEX

S.No. Topic

1. Study of Big Data Analytics and Hadoop Architecture
(i) Know the concept of Big Data architecture
(ii) Know the concept of Hadoop architecture
2. Loading a dataset into HDFS for Spark analysis; installation of Hadoop and cluster management
(i) Installing a Hadoop single-node cluster in an Ubuntu environment
(ii) Knowing the difference between single-node clusters and multi-node clusters
(iii) Accessing the Web UI and the port numbers
(iv) Installing and accessing environments such as Hive and Sqoop
3. File management tasks & basic Linux commands
(i) Creating a directory in HDFS
(ii) Moving forth and back between directories
(iii) Listing directory contents
(iv) Uploading and downloading a file in HDFS
(v) Checking the contents of a file
(vi) Copying and moving files
(vii) Copying and moving files between the local and HDFS environments
(viii) Removing files and paths
(ix) Displaying the first few lines of a file
(x) Displaying the aggregate length of a file
(xi) Checking the permissions of a file
(xii) Zipping and unzipping files (with and without preserving permissions) and pasting them to a location
(xiii) Copy and paste commands
4. Map-Reduce
(i) Definition of Map-Reduce
(ii) Its stages and terminologies
(iii) Word-count program to understand Map-Reduce (Mapper phase, Reducer phase, Driver code)
5. Implementing Matrix Multiplication with Hadoop MapReduce
6. Compute Average Salary and Total Salary by Gender for an Enterprise
7. (i) Creating Hive tables (external and internal)
(ii) Loading data into external Hive tables from SQL tables (or) structured CSV files using Sqoop
(iii) Performing operations like filtering and updating
(iv) Performing joins (inner, outer, etc.)
(v) Writing user-defined functions on Hive tables
8. Create SQL tables for employees: an Employee table (id, designation) and a Salary table (salary, dept id). Create external tables in Hive with a similar schema, move the data to Hive using Sqoop and load the contents into the tables, filter into a new table, and write a UDF to encrypt the table with the AES algorithm; decrypt it with the key to show the contents
9. (i) PySpark definition (Apache PySpark) and the difference between PySpark, Scala and pandas
(ii) PySpark files and class methods
(iii) get(file name)
(iv) getRootDirectory()
10. PySpark RDDs
(i) What are RDDs?
(ii) Ways to create an RDD
(iii) Parallelized collections
(iv) External datasets
(v) Existing RDDs
(vi) Spark RDD operations (count(), foreach(), collect(), join(), cache())
11. Perform PySpark transformations
(i) map and flatMap
(ii) Removing words that are not necessary to analyze the text
(iii) groupBy
(iv) Calculating how many times each word occurs in the corpus
(v) Performing a task (say, counting the words 'spark' and 'apache' in rdd3) separately on each partition and collecting the output from each partition
(vi) Unions of RDDs
(vii) Joining two pair RDDs based on their keys
12. PySpark SparkConf - attributes and applications
(i) What is PySpark SparkConf()?
(ii) Using SparkConf, create a Spark session to read details from a CSV into a DataFrame and later move that CSV to another location

INSTRUCTIONS TO STUDENTS

 All students must observe the Dress Code while in the laboratory.
 Food, drinks and smoking are NOT allowed.
 All bags must be left at the indicated place.
 The lab timetable must be strictly followed.
 Be PUNCTUAL for your laboratory session.
 Noise must be kept to a minimum.
 The workspace must be kept clean and tidy at all times.
 All students are liable for any damage to the accessories caused by their own negligence.
 Students are strictly PROHIBITED from taking any items out of the laboratory.
 Report any malfunction of the accessories to the Lab Supervisor immediately.

Before leaving the lab


 Shut down all the systems properly.
 Place the chairs properly.
 Please check the laboratory notice board regularly for updates.
List of Experiments
Exp. No. Experiment Name
1 Study of Big Data Analytics and Hadoop Architecture
(i) Know the concept of Big Data architecture
(ii) Know the concept of Hadoop architecture
2 Loading a dataset into HDFS for Spark analysis;
Installation of Hadoop and cluster management
(i) Installing a Hadoop single-node cluster in an Ubuntu environment
(ii) Knowing the difference between single-node clusters and multi-node clusters
(iii) Accessing WEB-UI and the port number
(iv) Installing and accessing the environments such as hive
and sqoop
3 File management tasks & Basic linux commands
(i) Creating a directory in HDFS
(ii) Moving forth and back to directories
(iii) Listing directory contents
(iv) Uploading and downloading a file in HDFS
(v) Checking the contents of the file
(vi) Copying and moving files
(vii) Copying and moving files between local to HDFS
environment
(viii) Removing files and paths
(ix) Displaying few lines of a file
(x) Display the aggregate length of a file
(xi) Checking the permissions of a file
(xii) Zipping and unzipping the files with & without
permission pasting it to a location
(xiii) Copy, Paste commands
4 Map-reducing
(i) Definition of Map-reduce
(ii) Its stages and terminologies
(iii) Word-count program to understand map-reduce
(Mapper phase, Reducer phase, Driver
code)
5 Implementing Matrix-Multiplication with Hadoop Map-
reduce
6 Compute Average Salary and Total Salary by Gender for an
Enterprise.
7 (i) Creating hive tables (External and internal)
(ii) Loading data into external Hive tables from SQL tables (or) structured CSV files using Sqoop
(iii) Performing operations like filtering and updating
(iv) Performing joins (inner, outer, etc.)
(v) Writing user-defined functions on Hive tables
8 Create SQL tables for employees: an Employee table (id, designation) and a Salary table (salary, dept id). Create external tables in Hive with a similar schema, move the data to Hive using Sqoop and load the contents into the tables, filter into a new table, and write a UDF to encrypt the table with the AES algorithm; decrypt it with the key to show the contents
9 (i) PySpark definition (Apache PySpark) and the difference between PySpark, Scala and pandas
(ii) PySpark files and class methods
(iii) get(file name)
(iv) getRootDirectory()
10 PySpark RDDs
(i) What are RDDs?
(ii) Ways to create an RDD
(iii) Parallelized collections
(iv) External datasets
(v) Existing RDDs
(vi) Spark RDD operations (count(), foreach(), collect(), join(), cache())
11 Perform PySpark transformations
(i) map and flatMap
(ii) Removing words that are not necessary to analyze the text
(iii) groupBy
(iv) Calculating how many times each word occurs in the corpus
(v) Performing a task (say, counting the words 'spark' and 'apache' in rdd3) separately on each partition and collecting the output from each partition
(vi) Unions of RDDs
(vii) Joining two pair RDDs based on their keys
12 PySpark SparkConf - attributes and applications
(i) What is PySpark SparkConf()?
(ii) Using SparkConf, create a Spark session to read details from a CSV into a DataFrame and later move that CSV to another location
Syllabus
BIG DATA-SPARK LAB
B.TECH III YEAR I SEM
WEEK-1
1 Study of Big Data Analytics and Hadoop Architecture
(i) Know the concept of Big Data architecture
(ii) Know the concept of Hadoop architecture
2 Loading a dataset into HDFS for Spark analysis;
Installation of Hadoop and cluster management
(i) Installing a Hadoop single-node cluster in an Ubuntu environment
(ii) Knowing the difference between single-node clusters and multi-node clusters
(iii) Accessing the Web UI and the port numbers
(iv) Installing and accessing environments such as Hive and Sqoop
WEEK-2
3 File management tasks & Basic linux commands

(i) Creating a directory in HDFS

(ii) Moving forth and back to directories

(iii) Listing directory contents

(iv) Uploading and downloading a file in HDFS

(v) Checking the contents of the file

(vi) Copying and moving files

(vii) Copying and moving files between local to HDFS environment

(viii) Removing files and paths

(ix) Displaying few lines of a file

(x) Display the aggregate length of a file

(xi) Checking the permissions of a file

(xii) Zipping and unzipping the files with & without permission pasting it to a location

(xiii) Copy, Paste commands


4 Map-reducing

(i) Definition of Map-reduce

(ii) Its stages and terminologies

(iii) Word-count program to understand map-reduce (Mapper phase, Reducer phase, Driver code)

WEEK-3
5 Implementing Matrix-Multiplication with Hadoop Map-reduce
6 Compute Average Salary and Total Salary by Gender for an Enterprise.
WEEK-4
7 (i) Creating hive tables (External and internal)
(ii) Loading data into external Hive tables from SQL tables (or) structured CSV files using Sqoop
(iii) Performing operations like filtering and updating
(iv) Performing joins (inner, outer, etc.)
(v) Writing user-defined functions on Hive tables
8 Create SQL tables for employees: an Employee table (id, designation) and a Salary table (salary, dept id). Create external tables in Hive with a similar schema, move the data to Hive using Sqoop and load the contents into the tables, filter into a new table, and write a UDF to encrypt the table with the AES algorithm; decrypt it with the key to show the contents.

WEEK-5
9 (i) PySpark definition (Apache PySpark) and the difference between PySpark, Scala and pandas
(ii) PySpark files and class methods
(iii) get(file name)
(iv) getRootDirectory()
10 PySpark RDDs
(i) What are RDDs?
(ii) Ways to create an RDD
(iii) Parallelized collections
(iv) External datasets
(v) Existing RDDs
(vi) Spark RDD operations (count(), foreach(), collect(), join(), cache())
WEEK-6
11 Perform PySpark transformations
(i) map and flatMap
(ii) Removing words that are not necessary to analyze the text
(iii) groupBy
(iv) Calculating how many times each word occurs in the corpus
(v) Performing a task (say, counting the words 'spark' and 'apache' in rdd3) separately on each partition and collecting the output from each partition
(vi) Unions of RDDs
(vii) Joining two pair RDDs based on their keys
12 PySpark SparkConf - attributes and applications
(i) What is PySpark SparkConf()?
(ii) Using SparkConf, create a Spark session to read details from a CSV into a DataFrame and later move that CSV to another location

TEXT BOOKS:
1. Spark in Action, Marko Bonaci and Petar Zecevic, Manning.
2. PySpark SQL Recipes: With HiveQL, Dataframe and Graphframes, Raju Kumar
Mishra and Sundar Rajan Raman, Apress Media.
Course Objectives:
 The main objective of the course is to process Big Data with advanced architectures such as
Spark, and to handle streaming data in Spark.

Course Outcomes:
 Develop MapReduce programs to analyze large datasets using Hadoop and Spark
 Write Hive queries to analyze large datasets
 Outline the Spark ecosystem and its components
 Perform the filter, count, distinct, map and flatMap RDD operations in Spark
 Build queries using Spark SQL
 Apply Spark joins on sample data sets
 Make use of Sqoop to import and export data between Hadoop and a database

WEEK-1
1. Study of Big Data Analytics and Hadoop Architecture
(i) Know the concept of Big Data architecture
(ii) Know the concept of Hadoop architecture
Big Data architecture:
Big Data architecture refers to the design and structure used to store, process, and analyze
large volumes of data. These architectures are built to handle a variety of data types
(structured, semi-structured, unstructured), as well as the large scale and speed of modern
data flows. The core components of Big Data architecture typically include the following
layers:
1. Data Source Layer
This layer refers to the origin of the data, which could come from a variety of sources:
 External data sources: Social media, IoT devices, third-party services, etc.
 Internal data sources: Databases, data warehouses, etc.
 Data streams: Real-time data from sensors, logs, etc.
2. Data Ingestion Layer
Data ingestion is the process of collecting and transporting data from various sources to the
storage layer. The two main types of ingestion are:
 Batch processing: Data is collected over a fixed period (e.g., every hour, daily).
 Real-time/streaming processing: Data is collected in real-time or near real-time.
Tools used for data ingestion include:
 Apache Kafka: A distributed streaming platform.
 Apache Flume: A service for collecting and moving large amounts of log data.
 AWS Kinesis: A platform for real-time streaming data on AWS.
3. Data Storage Layer
This is where all the data is stored. Big Data storage should support both structured and
unstructured data. It needs to be scalable, reliable, and highly available. Some common types
of data storage in Big Data systems include:
 HDFS (Hadoop Distributed File System): A scalable, distributed file system.
 NoSQL databases: MongoDB, Cassandra, HBase for non-relational data.
 Data Lakes: A central repository for storing raw data in its native format (e.g., AWS
S3, Azure Blob Storage).
4. Data Processing Layer
This layer processes the stored data and transforms it into valuable insights. It can be divided
into two major approaches:
 Batch Processing: Processing data in large, scheduled intervals (e.g., Hadoop
MapReduce).
 Stream Processing: Processing data in real-time as it flows in (e.g., Apache Flink,
Apache Storm, Spark Streaming).
Some key processing tools:
 Apache Spark: A fast and general-purpose cluster-computing system.
 Apache Hadoop: A framework for distributed storage and processing.
 Flink and Storm: Used for real-time data stream processing.
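As a small illustration of the batch side of this layer, the PySpark sketch below reads a file and runs a simple aggregation. It is only a sketch: the input path /tmp/events.json and the column event_type are assumed example names, not part of any particular dataset.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("processing-layer-demo").getOrCreate()

# Batch processing: read a (hypothetical) events file and count events per type.
events = spark.read.json("/tmp/events.json")              # assumed input location
summary = events.groupBy("event_type").agg(F.count("*").alias("events"))
summary.show()

spark.stop()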
5. Data Analytics Layer
Once data is processed, it is often analyzed to extract insights. The analytics layer provides
tools for complex analysis, including:
 Machine Learning (ML): Building predictive models and patterns using algorithms.
 Data Mining: Discovering hidden patterns and trends in data.
 Business Intelligence (BI): Tools like Tableau, Power BI for reporting and
visualization.
Popular tools used for analytics:
 Apache Hive: A data warehouse built on top of Hadoop for querying and analyzing
large datasets.
 Apache Impala: A high-performance SQL engine for big data.
 Python libraries (Pandas, scikit-learn): For data manipulation and machine
learning.
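As a small illustration of this layer, the pandas sketch below computes summary statistics from a hypothetical CSV produced by the processing layer; the file name processed_orders.csv and the columns region and amount are assumptions for illustration.

import pandas as pd

# Hypothetical output of the processing layer: one row per order.
df = pd.read_csv("processed_orders.csv")                  # assumed file name

# Simple descriptive analytics: revenue per region and overall statistics.
revenue_by_region = df.groupby("region")["amount"].sum()
print(revenue_by_region)
print(df["amount"].describe())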
6. Data Presentation Layer
This layer presents the insights derived from the analytics layer. It often involves dashboards,
reports, and visualizations. Users, stakeholders, or systems will interact with this layer to
make data-driven decisions. Tools include:
 BI tools: Tableau, Power BI, QlikView.
 Custom web interfaces: To display reports, graphs, and analysis.
7. Security and Governance Layer
Given the large volumes and sensitivity of data, security and governance are critical. This
layer ensures data privacy, access control, and regulatory compliance.
 Authentication/Authorization: Ensuring only authorized users can access specific
data.
 Data Encryption: To protect sensitive data at rest and in transit.
 Data Lineage: Tracking the origin and movement of data to ensure trustworthiness.
 Compliance: Adhering to regulations such as GDPR, HIPAA, etc.
8. Orchestration and Management Layer
Big Data systems require complex management for coordination, scheduling, and monitoring.
 Apache Airflow: An open-source platform to programmatically author, schedule, and
monitor workflows.
 Kubernetes: For managing containerized applications and ensuring scalability and
reliability.
Key Technologies in Big Data Architecture:
 Hadoop Ecosystem: For storage and processing (HDFS, YARN, MapReduce, Pig,
Hive, etc.).
 Apache Kafka: For real-time streaming.
 Apache Spark: For fast in-memory data processing.
 NoSQL Databases: MongoDB, Cassandra, HBase.
 Cloud Platforms: AWS, Azure, Google Cloud provide tools for storage, processing,
and management.
Example of Big Data Architecture
Data Sources
    |
    v
Data Ingestion (Batch/Streaming)
    |
    v
Data Storage (HDFS, NoSQL, Data Lakes)
    |
    v
Data Processing (Batch/Stream)
    |
    v
Data Analytics (ML, BI, Analysis)
    |
    v
Data Presentation (Dashboards, Reports, Visualization)

Security & Governance applies across all of the layers above.
This high-level overview demonstrates the flow of data through the architecture from
collection to processing and presentation.
(ii) know the concept of Hadoop architecture

Hadoop Architecture Overview

Hadoop is an open-source framework for processing and storing large datasets in a


distributed computing environment. It is designed to scale from a single server to thousands
of machines, each offering local computation and storage. Understanding Hadoop
architecture is essential for working with Hadoop-based systems. Below is a detailed
overview of the Hadoop architecture, its components, and how they work together.

Key Components of Hadoop Architecture

The architecture of Hadoop primarily revolves around three main components:

1. Hadoop Distributed File System (HDFS)


2. MapReduce
3. YARN (Yet Another Resource Negotiator)

These components work together to provide a distributed system that can store and process
large volumes of data.

1. Hadoop Distributed File System (HDFS)

HDFS is the storage layer of Hadoop. It is designed to store vast amounts of data across
multiple machines in a distributed environment.

 Block-based storage: HDFS stores data in blocks (typically 128MB or 256MB by


default). Each file is divided into blocks, which are then distributed across multiple
nodes.
 Fault tolerance: HDFS ensures fault tolerance by replicating blocks. The default
replication factor is 3 (each block is copied three times across the cluster). If one node
fails, the data can still be accessed from another replica.
 NameNode: The NameNode is the master node in HDFS that manages the metadata
(such as block locations) for the files. It does not store the data itself but keeps track
of where the blocks are stored across the cluster.
 DataNode: DataNodes are the worker nodes that store the actual data in the form of
blocks. Each DataNode is responsible for serving the blocks on request and
performing block-level operations (like block creation, deletion, and replication).

HDFS Architecture Diagram:

+-------------------+
|      Client       |
+-------------------+
   |              |
   | metadata     | read/write blocks
   v              v
+-----------+   +-----------+   +-----------+
| NameNode  |   | DataNode  |   | DataNode  |
| (master)  |   | (worker)  |   | (worker)  |
+-----------+   +-----------+   +-----------+

2. MapReduce

MapReduce is the processing layer of Hadoop. It is a programming model used for


processing large data sets in parallel across a distributed cluster.

 Map phase: In the Map phase, the input data is divided into chunks (called splits),
and each chunk is processed by a mapper. The mapper processes the data and
generates a set of intermediate key-value pairs.
 Shuffle and Sort: After the Map phase, the intermediate key-value pairs are shuffled
and sorted. The system groups the data by key and prepares it for the Reduce phase.

Reduce phase: In the Reduce phase, the system applies the reduce function to the sorted
intermediate data, aggregating or transforming the data in some way. The results are written
to the output files.

MapReduce Architecture Diagram:

+-------------+
| Input | ----> [Map] ----> [Shuffle & Sort] ----> [Reduce] ----> Output
+-------------+
 JobTracker: The JobTracker is the master daemon in the classic (MRv1) MapReduce framework. It
is responsible for scheduling and monitoring jobs, dividing the work into tasks, and
allocating tasks to TaskTrackers.
 TaskTracker: TaskTrackers are worker daemons that run on the cluster nodes and
execute tasks assigned by the JobTracker. Each TaskTracker handles both Map and
Reduce tasks.

In Hadoop 2.x and later, these MRv1 daemons are replaced by YARN's ResourceManager, NodeManager and per-job ApplicationMaster, described next.
3. YARN (Yet Another Resource Negotiator)

YARN is the resource management layer of Hadoop, responsible for managing resources
across the cluster and scheduling the execution of tasks.

 ResourceManager (RM): The ResourceManager is the master daemon in YARN,


which manages the allocation of resources (memory, CPU) to the various applications
running on the cluster. It makes sure that resources are allocated based on job
requirements and cluster availability.
 NodeManager (NM): The NodeManager runs on each node in the cluster. It is
responsible for managing resources on the individual node and monitoring the status
of the node.
 ApplicationMaster (AM): The ApplicationMaster is a per-application entity that
manages the lifecycle of a job. It negotiates resources with the ResourceManager and
monitors the progress of its application (MapReduce job or Spark job).

YARN Architecture Diagram:

+-----------------------+
| ResourceManager | <-----> [Resource Allocation]
+-----------------------+
|
+-----------------------------+
| NodeManager | <-----> [Resource Monitoring]
+-----------------------------+
|
+---------------------------+
| ApplicationMaster (AM) | <-----> [Job Coordination]
+---------------------------+
|
+-----------------------+
| Application | <-----> [MapReduce/Spark Job]
+-----------------------+
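
To relate this to the Spark experiments later in this manual: a PySpark application can request its resources from YARN instead of running locally. A minimal sketch (assuming Spark is installed and HADOOP_CONF_DIR points at the cluster configuration; the application name and memory setting are arbitrary examples):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("yarn-demo")
    .master("yarn")                        # request executors from the ResourceManager
    .config("spark.executor.memory", "1g")
    .getOrCreate()
)

print(spark.sparkContext.master)           # should print 'yarn'
spark.stop()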

Hadoop Ecosystem Components

Apart from the core components (HDFS, MapReduce, and YARN), Hadoop has a rich
ecosystem that includes several tools and frameworks for different use cases. Some of the key
components include:

 Hive: A data warehouse system that facilitates querying and managing large datasets
in HDFS using SQL-like queries.
 Pig: A platform for analyzing large datasets, providing a high-level language called
Pig Latin for processing and transforming data.
 HBase: A NoSQL database for real-time read/write access to large datasets stored in
HDFS.
 Sqoop: A tool for transferring data between Hadoop and relational databases.
 Flume: A service for collecting and aggregating log data and other types of streaming
data.
 Oozie: A workflow scheduler for managing Hadoop jobs.
 Zookeeper: A service for coordinating distributed applications in the Hadoop
ecosystem.
 Mahout: A machine learning library for scalable machine learning algorithms.

Hadoop Architecture Diagram (Complete)


+------------------+
| Client Node |
+------------------+
|
+------------------+--------+----------+------------------+
| HDFS (Storage Layer) | YARN (Resource Manager)
+--------------------------------+ +----------------------------------+
| NameNode (Master) | | ResourceManager (Master) |
| DataNode (Worker) | | NodeManager (Worker) |
+--------------------------------+ +----------------------------------+
| |
+------------------+ +------------------------+
| MapReduce Layer | | Application Master |
+------------------+ +------------------------+

Key Characteristics of Hadoop Architecture

1. Scalability: Hadoop is designed to scale horizontally. As your data grows, you can
add more nodes to the cluster.
2. Fault Tolerance: Through replication and data distribution, Hadoop ensures that the
data is not lost even when individual nodes fail.
3. Cost Efficiency: Hadoop runs on commodity hardware, meaning you can build large-
scale clusters with low-cost machines.

4. Data Locality: Hadoop tries to move computation to where the data is stored to minimize
network congestion and speed up processing.
2. Loading a dataset into HDFS for Spark analysis; installation of Hadoop and cluster
management

(i) Installing a Hadoop single-node cluster in an Ubuntu environment

(ii) Knowing the difference between single-node clusters and multi-node clusters

(iii) Accessing WEB-UI and the port number

(iv) Installing and accessing the environments such as hive and sqoop

Installing Hadoop Single Node Cluster in Ubuntu Environment

Prerequisites:

 A fresh Ubuntu system or a virtual machine running Ubuntu.


 Java should be installed (Hadoop requires Java 8 or later).
 A user with sudo privileges.

Step-by-Step Installation:

1. Install Java (JDK):

Hadoop requires Java to be installed. Install Java 8 or a compatible version.

sudo apt update


sudo apt install openjdk-8-jdk

Verify the Java installation:

java -version
2. Install Hadoop:
3. Download the Hadoop binaries from the official Apache website. You can download a
stable version using wget:
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
4. Extract the downloaded tar file:
tar -xzvf hadoop-3.3.1.tar.gz
5. Move it to the /opt directory:
sudo mv hadoop-3.3.1 /opt/hadoop
6. Set Environment Variables:

Add Hadoop-related environment variables to the .bashrc file:

nano ~/.bashrc
Add the following lines at the end of the file:

export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop

After saving and closing, apply the changes:

source ~/.bashrc
7. Configure Hadoop:

In the Hadoop configuration directory, you'll need to edit several XML files to set up
the cluster.

o core-site.xml:

Edit the core configuration to set the HDFS URI.

nano $HADOOP_HOME/etc/hadoop/core-site.xml

Add the following configuration:

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
o hdfs-site.xml:

Configure HDFS directories and replication:

nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Add:

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///opt/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///opt/hadoop/hdfs/datanode</value>
</property>
</configuration>
o mapred-site.xml:
Set up the MapReduce framework:

nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Add:

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
o yarn-site.xml:

Configure YARN settings:

nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Add:

<configuration>
<property>
<name>yarn.resourcemanager.address</name>
<value>localhost:8032</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
8. Format HDFS:

Before starting Hadoop, format the HDFS:

hdfs namenode -format


9. Start Hadoop Daemons:

Start the Hadoop daemons (NameNode, DataNode, ResourceManager,


NodeManager):

start-dfs.sh
start-yarn.sh
10. Verify the Installation:
o Check whether the HDFS daemons (NameNode, DataNode) are running:
jps
o Check whether the ResourceManager and NodeManager are running as well:
jps
o You can also check the Hadoop Web UI to view the status of your cluster.
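
With the daemons running, a dataset can be copied into HDFS and then read back from Spark for analysis, which is the goal of this experiment. A minimal sketch (the directory /data and the file sales.csv are assumed example names):

# First copy a local CSV into HDFS:
#   hdfs dfs -mkdir -p /data
#   hdfs dfs -put sales.csv /data/
# Then read it from PySpark (fs.defaultFS was set to hdfs://localhost:9000 above).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load-dataset-demo").getOrCreate()

df = spark.read.csv("hdfs://localhost:9000/data/sales.csv",
                    header=True, inferSchema=True)
df.printSchema()
print(df.count())                          # number of records loaded from HDFS
spark.stop()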

(ii) Differences Between Single-Node and Multi-Node Clusters


Single-Node Cluster:

 A single-node cluster is a Hadoop setup where all the Hadoop services (NameNode,
DataNode, ResourceManager, and NodeManager) run on one machine (localhost).
 It is simpler to set up and useful for development and testing purposes.
 Limited scalability and no distributed computation capability in the true sense of a
multi-node cluster.

Multi-Node Cluster:

 A multi-node cluster involves multiple machines, where one node acts as the master
(NameNode, ResourceManager) and others as slaves (DataNode, NodeManager).
 It offers the true power of distributed computing and storage, enabling scalability and
fault tolerance.
 It requires more complex configuration, network setup, and hardware resources.
 It is used in production environments where large-scale data processing is required.

(iii) Accessing WEB-UI and the Port Number

Hadoop provides a Web UI to monitor the cluster's health and performance. The following
are the key ports:

 NameNode Web UI: http://localhost:9870 (Hadoop 3.x; http://localhost:50070 on Hadoop 2.x) – For monitoring HDFS status.


 ResourceManager Web UI: http://localhost:8088 – For monitoring the YARN resource
manager.
 JobHistory Server: http://localhost:19888 – For tracking MapReduce job history.

Make sure these ports are open and accessible.

(iv) Installing and Accessing Environments such as Hive and Sqoop

Hive Installation:

1. Install Hive:

You can download the latest stable version of Apache Hive from the Apache website
or install it via apt if available.

sudo apt-get install hive


2. Configure Hive:

Hive requires a metastore (typically MySQL or Derby). You can configure it by


editing hive-site.xml:
nano $HIVE_HOME/conf/hive-site.xml
3. Access Hive:

After installation and configuration, start Hive:

hive

This opens the Hive CLI where you can execute Hive queries.

Sqoop Installation:

1. Install Sqoop:

Download and install Sqoop, which is used for transferring data between relational
databases and Hadoop.

sudo apt-get install sqoop


2. Configure Sqoop:

Set up database connection configurations in Sqoop by editing the sqoop-site.xml file.

3. Access Sqoop:

To use Sqoop to import or export data, you can run commands like:

sqoop import --connect jdbc:mysql://localhost/database --table tablename --username user --password pass

WEEK-2

3. File management tasks & Basic linux commands

(i) Creating a directory in HDFS

(ii) Moving forth and back to directories

(iii) Listing directory contents

(iv) Uploading and downloading a file in HDFS

(v) Checking the contents of the file

(vi) Copying and moving files

(vii) Copying and moving files between local to HDFS environment

(viii) Removing files and paths

(ix) Displaying few lines of a file


(x) Display the aggregate length of a file

(xi) Checking the permissions of a file

(xii) Zipping and unzipping the files with & without permission pasting it to a location

(xiii) Copy, Paste commands

Here’s a breakdown of file management tasks and basic Linux commands, particularly
focused on HDFS (Hadoop Distributed File System) operations:

(i) Creating a directory in HDFS:

To create a directory in HDFS, you can use the hadoop fs -mkdir command.

hadoop fs -mkdir /path/to/your/directory

This will create a directory at the specified path in HDFS.

(ii) Moving forth and back to directories:

You can navigate directories in the Linux file system using the cd command.

 To move to a directory:
 cd /path/to/directory
 To move back to the previous directory:

 cd -
 To move up one directory level:

 cd ..

HDFS has no notion of a current working directory, so there is no hadoop fs -cd command; you
list contents with hadoop fs -ls and always refer to files and directories by their full HDFS paths.

(iii) Listing directory contents:

To list contents of a directory, whether in HDFS or local, you use the ls command.

 In HDFS:
 hadoop fs -ls /path/to/directory
 In Local File System:

 ls /path/to/directory
(iv) Uploading and downloading a file in HDFS:

To upload a file to HDFS:


hadoop fs -put /local/path/to/file /hdfs/path/to/directory

To download a file from HDFS:

hadoop fs -get /hdfs/path/to/file /local/path/to/directory


(v) Checking the contents of the file:

You can check the contents of a file using the cat command.

 In HDFS:
 hadoop fs -cat /path/to/file
 In Local File System:

 cat /path/to/file
(vi) Copying and moving files:

 Copying files:
o To copy a file within HDFS:

o hadoop fs -cp /hdfs/source/path /hdfs/destination/path


o To copy a file from local to HDFS:

o hadoop fs -copyFromLocal /local/source/path /hdfs/destination/path


o To copy a file from HDFS to local:

o hadoop fs -copyToLocal /hdfs/source/path /local/destination/path


 Moving files:

o To move a file within HDFS:

o hadoop fs -mv /hdfs/source/path /hdfs/destination/path


o To move a file from local to HDFS:

o hadoop fs -moveFromLocal /local/source/path /hdfs/destination/path


o To move a file from HDFS to local (note: -moveToLocal prints "Not implemented yet" in
current Hadoop releases, so in practice use -copyToLocal followed by -rm):

o hadoop fs -moveToLocal /hdfs/source/path /local/destination/path


(vii) Copying and moving files between local and HDFS environment:

 Copying a file from local to HDFS:


 hadoop fs -copyFromLocal /local/path/to/file /hdfs/path/to/destination
 Copying a file from HDFS to local:

 hadoop fs -copyToLocal /hdfs/path/to/file /local/path/to/destination


 Moving a file from local to HDFS:

 hadoop fs -moveFromLocal /local/path/to/file /hdfs/path/to/destination


 Moving a file from HDFS to local:

 hadoop fs -moveToLocal /hdfs/path/to/file /local/path/to/destination


(viii) Removing files and paths:

To remove files and directories, you can use the -rm and -r options for directories.

 Remove a file in HDFS:


 hadoop fs -rm /hdfs/path/to/file
 Remove a directory in HDFS:

 hadoop fs -rm -r /hdfs/path/to/directory


 Remove a file locally:

 rm /local/path/to/file
 Remove a directory locally:

 rm -r /local/path/to/directory
(ix) Displaying few lines of a file:

To display the first few lines of a file:

 In HDFS:
 hadoop fs -head /path/to/file
 In Local File System:

 head /path/to/file
(x) Display the aggregate length of a file:

You can get the file size using the -du (disk usage) command.

 In HDFS:
 hadoop fs -du -s /path/to/file
 In Local File System:

 du -sh /path/to/file

This will display the total size of the file.

(xi) Checking the permissions of a file:

You can check the permissions of a file using the -ls command, which will show the file
permissions.

 In HDFS:
 hadoop fs -ls /path/to/file
 In Local File System:

 ls -l /path/to/file

This will display the permissions, owner, and group of the file or directory.
(xii) Zipping and unzipping files (with and without preserving permissions) and pasting them to a
location:

You can zip and unzip files using the zip and unzip commands.

 Zipping a file:
 zip filename.zip /path/to/file
 Unzipping a file:

 unzip filename.zip -d /path/to/extract

To maintain permissions while transferring a file, use the -p option in cp or rsync for
preserving permissions.

Example with rsync:

rsync -av /path/to/source /path/to/destination


(xiii) Copy, Paste Commands:

 Copy Command (For local filesystem):


 cp /source/path /destination/path

For HDFS:

hadoop fs -cp /source/hdfs/path /destination/hdfs/path

 Paste Command (To paste a file after copying it): This is generally done by using cp
or mv as mentioned above. There's no specific "paste" command, but the operation is
performed through these commands when moving or copying data.

4. Map-reducing

(i) Definition of Map-reduce

(ii) Its stages and terminologies

(iii) Word-count program to understand map-reduce (Mapper phase, Reducer phase, Driver code)

(i) Definition of Map-Reduce:

Map-Reduce is a programming model and processing technique used to process and generate large
datasets. It allows the parallel processing of data by dividing it into small chunks and distributing it
across multiple nodes in a cluster. The main concept involves two key operations: Map and Reduce.

 Map: The map function processes input data and produces a set of intermediate key-value
pairs.
 Reduce: The reduce function takes the intermediate key-value pairs, processes them, and
merges them to produce the final result.

Map-Reduce is widely used in distributed systems like Hadoop for large-scale data processing tasks.

(ii) Stages and Terminologies in Map-Reduce:

The Map-Reduce process is split into two main stages: the Map stage and the Reduce stage, but
several other intermediate processes and terminologies come into play.

1. Map Stage:
o The input data is divided into chunks (usually files or records).

o The Mapper function processes each chunk and outputs intermediate key-value pairs.
o The intermediate output is sorted and grouped by key (called the shuffle phase).
2. Shuffle and Sort:
o After the map phase, the intermediate key-value pairs are shuffled and sorted to
ensure that all values corresponding to the same key are grouped together. This step
happens automatically in Map-Reduce frameworks like Hadoop.
3. Reduce Stage:
o The Reducer function processes each group of intermediate key-value pairs and
merges them to produce a final output. It can aggregate, summarize, or process data
in any other way required by the user.
4. Output:
o After the reduce phase, the final output is written to a file or a database.

Key Terminologies in Map-Reduce:

 Mapper: The function or process that reads input data, processes it, and outputs key-value
pairs.
 Reducer: The function that processes the grouped key-value pairs from the mapper and
performs the final aggregation or computation.
 Key-Value Pair: The fundamental unit of data in Map-Reduce, where each record is
represented as a key paired with a value.
 Shuffle: The process of redistributing the data across reducers based on keys, ensuring that all
values for the same key are sent to the same reducer.
 Input Split: The unit of work or chunk of data that is sent to a mapper.
 Output: The final result after processing in the reduce phase, usually saved to disk or a
storage system.

(iii) Word-Count Program to Understand Map-Reduce:

Here is a simple example of a Word-Count program to demonstrate the Map-Reduce process. We will
break it into three main parts:

1. Mapper Phase:
The mapper reads input text and emits key-value pairs, where the key is a word, and the value is 1
(representing a single occurrence of the word).

Mapper code (in Python or any suitable language):

import sys

# Mapper function
def mapper():
    for line in sys.stdin:
        words = line.split()
        for word in words:
            # Emit word with value 1
            print(f"{word}\t1")

if __name__ == "__main__":
    mapper()

In this code:

 The input is a line of text.


 The line is split into words.
 For each word, a key-value pair is emitted, where the key is the word, and the value is 1.

2. Shuffle and Sort:

After the map phase, the framework automatically groups and sorts the emitted key-value pairs. For
instance, all instances of the word "hello" will be grouped together so that they can be passed to the
same reducer.

Example of shuffled data:

hello 1
hello 1
world 1
world 1
data 1

3. Reducer Phase:

The reducer processes the grouped key-value pairs. It aggregates the values by summing them to get
the total count for each word.

Reducer code:

import sys

# Reducer function
def reducer():
    current_word = None
    current_count = 0
    for line in sys.stdin:
        word, count = line.strip().split('\t')
        count = int(count)

        if word == current_word:
            current_count += count
        else:
            if current_word:
                print(f"{current_word}\t{current_count}")
            current_word = word
            current_count = count

    # Output the last word
    if current_word:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    reducer()

In this code:

 The reducer receives grouped key-value pairs.


 It aggregates the count of each word and prints the final result.

4. Driver Code:

The driver code sets up the map and reduce operations and coordinates the execution of the map and
reduce phases in the framework. In Hadoop, this would be handled by a job configuration, but for
simplicity, this can be managed manually in a basic script.

Example Driver Code (in a Hadoop or basic setup):

# Pseudo code to explain the execution


# 1. The input text is passed to the Mapper.
# 2. Mapper emits key-value pairs.
# 3. Intermediate data is shuffled and sorted by keys.
# 4. The Reducer takes the sorted data, aggregates it, and outputs the result.

# In Hadoop, you would configure a Job with Mapper and Reducer.

Final Output:

After the map and reduce phases, the output would look like this:

data 1
hello 2
world 2

This shows the word count for each word in the input text.

In a distributed setup like Hadoop:

 The mapper would be executed on different nodes processing chunks of data in parallel.
 The reducer would then aggregate the results from all the mappers.
This basic example gives you a good understanding of how Map-Reduce works to process large
datasets by distributing the work and aggregating results efficiently.
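
For comparison, and since later experiments in this manual use PySpark, the same word count can be written as a short Spark job. This is only a sketch; the input path is an assumed example.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs://localhost:9000/input/words.txt")    # assumed input path
counts = (lines.flatMap(lambda line: line.split())              # like the Mapper: emit words
               .map(lambda word: (word, 1))                     # (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))                # like the Reducer: sum counts

for word, count in counts.collect():
    print(word, count)

spark.stop()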

WEEK-3

5. Implement a program in MapReduce for Matrix Multiplication.

Prerequisites:
1. Hadoop Installed (Single Node or Cluster)
2. Java Development Kit (JDK 8 or above)
3. HDFS Setup
4. Basic understanding of MapReduce

Theory
Matrix Multiplication Formula

Given two matrices A (m × n) and B (n × p), their product is the matrix C (m × p), where each
element is C(i, j) = Σ A(i, k) × B(k, j), summed over k = 1 … n.

 MapReduce parallelizes the computation by dividing matrix elements across nodes.


 Mapper emits partial products for each (i, j).
 Reducer sums up partial products to compute the final result.

Implementation Steps

Step 1: Input Format

Each line represents a matrix element in the format:

MatrixName Row Column Value

Example (Matrix A and B):

A 0 0 1
A 0 1 2
A 1 0 3
A 1 1 4
B 0 0 5
B 0 1 6
B 1 0 7
B 1 1 8

Step 2: Mapper Code (MatrixMapper.java)

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class MatrixMapper extends Mapper<Object, Text, Text, IntWritable> {

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {

        // Input line format: MatrixName Row Column Value
        String[] tokens = value.toString().split(" ");
        String matrixName = tokens[0];
        int row = Integer.parseInt(tokens[1]);
        int col = Integer.parseInt(tokens[2]);
        int val = Integer.parseInt(tokens[3]);

        if (matrixName.equals("A")) {
            for (int k = 0; k < 2; k++) {          // assume B has 2 columns
                context.write(new Text(row + "," + k), new IntWritable(val));
            }
        } else {
            for (int i = 0; i < 2; i++) {          // assume A has 2 rows
                context.write(new Text(i + "," + col), new IntWritable(val));
            }
        }
    }
}

Step 3: Reducer Code (MatrixReducer.java)

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

import java.util.ArrayList;

public class MatrixReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {

        // Collect every partial value for this output cell (i, j).
        ArrayList<Integer> elements = new ArrayList<>();
        for (IntWritable val : values) {
            elements.add(val.get());
        }

        // Simplified pairing: assumes the contributions from A and B arrive as
        // consecutive pairs. A robust implementation would tag each value with
        // its source matrix and index in the mapper output.
        int sum = 0;
        for (int i = 0; i < elements.size(); i += 2) {
            sum += elements.get(i) * elements.get(i + 1);
        }
        context.write(key, new IntWritable(sum));
    }
}

Step 4: Driver Code (MatrixMultiplication.java)

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MatrixMultiplication {

public static void main(String[] args) throws Exception {

Configuration conf = new Configuration();

Job job = Job.getInstance(conf, "Matrix Multiplication");

job.setJarByClass(MatrixMultiplication.class);

job.setMapperClass(MatrixMapper.class);

job.setReducerClass(MatrixReducer.class);
job.setMapOutputKeyClass(Text.class);

job.setMapOutputValueClass(IntWritable.class);

job.setOutputKeyClass(Text.class);

job.setOutputValueClass(IntWritable.class);

FileInputFormat.addInputPath(job, new Path(args[0]));

FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Running the Experiment

Step 1: Compile the Java Files

hadoop com.sun.tools.javac.Main *.java

jar cf matrix.jar *.class

Step 2: Upload Input File to HDFS

hdfs dfs -mkdir /input

hdfs dfs -put matrix.txt /input

Step 3: Run the Hadoop Job

hadoop jar matrix.jar MatrixMultiplication /input /output

Step 4: View Output

hdfs dfs -cat /output/part-r-00000

Expected Output
0,0 19

0,1 22

1,0 43

1,1 50

6. Compute Average Salary and Total Salary by Gender for an Enterprise.

Objective:

To compute average salary and total salary by gender from a dataset using Hadoop
MapReduce framework.

Step 1: Mapper Code (SalaryGenderMapper.java)

import java.io.IOException;

import org.apache.hadoop.io.*;

import org.apache.hadoop.mapreduce.Mapper;

public class SalaryGenderMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        // Expected CSV columns: EmployeeID,Name,Gender,Department,Salary
        String[] parts = value.toString().split(",");
        if (parts.length == 5 && !parts[0].equals("EmployeeID")) {   // skip the header row
            String gender = parts[2];
            double salary = Double.parseDouble(parts[4]);
            context.write(new Text(gender), new DoubleWritable(salary));
        }
    }
}

Step 2: Reducer Code (SalaryGenderReducer.java)

import java.io.IOException;

import org.apache.hadoop.io.*;

import org.apache.hadoop.mapreduce.Reducer;

public class SalaryGenderReducer extends Reducer<Text, DoubleWritable, Text, Text> {

    public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {

        double totalSalary = 0;
        int count = 0;
        for (DoubleWritable val : values) {
            totalSalary += val.get();
            count++;
        }
        double avgSalary = totalSalary / count;
        context.write(key, new Text("Total Salary: " + totalSalary + ", Average Salary: " + avgSalary));
    }
}

Step 3: Driver Code (SalaryGenderDriver.java)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SalaryGenderDriver {


public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Salary by Gender");

job.setJarByClass(SalaryGenderDriver.class);
job.setMapperClass(SalaryGenderMapper.class);
job.setReducerClass(SalaryGenderReducer.class);

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(DoubleWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

FileInputFormat.addInputPath(job, new Path(args[0])); // Input path


FileOutputFormat.setOutputPath(job, new Path(args[1])); // Output path

System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Sample Output:
F Total Salary: 140000.0, Average Salary: 70000.0
M Total Salary: 120000.0, Average Salary: 60000.0
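
Since this is a Spark lab, the same computation can also be expressed with the PySpark DataFrame API. A sketch, assuming the employee CSV has a header row with the columns EmployeeID,Name,Gender,Department,Salary used by the mapper above, and an assumed HDFS path:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("salary-by-gender").getOrCreate()

emp = spark.read.csv("hdfs://localhost:9000/input/employees.csv",   # assumed path
                     header=True, inferSchema=True)

result = emp.groupBy("Gender").agg(
    F.sum("Salary").alias("total_salary"),
    F.avg("Salary").alias("average_salary"))

result.show()
spark.stop()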
WEEK-4
7. (i) Creating Hive tables (external and internal)
(ii) Loading data into external Hive tables from SQL tables (or) structured CSV files using Sqoop
(iii) Performing operations like filtering and updating
(iv) Performing joins (inner, outer, etc.)
(v) Writing user-defined functions on Hive tables

Prerequisites

 Access to the command line/terminal with a sudo user.


 Apache Hadoop installed and running.
 Apache Hive installed and running.
 Access to the file system where the data is stored (HDFS).
(i) Creating Hive Tables (External and Internal)

Internal Table:
CREATE TABLE employee (
id INT,
name STRING,
gender STRING,
department STRING,
salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
External Table:

CREATE EXTERNAL TABLE employee_ext (


id INT,
name STRING,
gender STRING,
department STRING,
salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/external/employee';
(ii) Loading data into external Hive tables from SQL tables (or) structured CSV files using Sqoop

From SQL Table to Hive (using Sqoop):


sqoop import \
--connect jdbc:mysql://localhost:3306/employees \
--username root \
--password yourpassword \
--table employee \
--hive-import \
--create-hive-table \
--hive-table default.employee \
--fields-terminated-by ',' \
--target-dir /user/hive/warehouse/employee

(iii) Performing operations like filtering and updating


Filter:
SELECT * FROM employee WHERE gender = 'F' AND salary > 50000;

Update Record (Hive 0.14+ with ACID enabled):


UPDATE employee SET salary = 75000 WHERE id = 103;
Ensure the table is transactional (ACID UPDATE/DELETE requires a managed table stored as ORC):
ALTER TABLE employee SET TBLPROPERTIES ('transactional'='true');
(iv) Joins in Hive
Inner Join:
SELECT e.name, d.dept_name
FROM employee e
JOIN department d
ON e.department = d.dept_id;
Left Outer Join:
SELECT e.name, d.dept_name
FROM employee e
LEFT OUTER JOIN department d
ON e.department = d.dept_id;
Right Outer Join:
SELECT e.name, d.dept_name
FROM employee e
RIGHT OUTER JOIN department d
ON e.department = d.dept_id;
(v) Writing User defined function on hive tables
Java UDF Example: Convert Name to Uppercase
1. Java Code (UpperCaseUDF.java):

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class UpperCaseUDF extends UDF {


public Text evaluate(Text input) {
if (input == null) return null;
return new Text(input.toString().toUpperCase());
}
}

2. Compile and Create JAR:


javac -classpath `hadoop classpath`:. UpperCaseUDF.java
jar -cf upperudf.jar UpperCaseUDF.class

3. Register and Use in Hive:


ADD JAR /path/to/upperudf.jar;
CREATE TEMPORARY FUNCTION to_upper AS 'UpperCaseUDF';
SELECT to_upper(name) FROM employee;

8. Create SQL tables for employees: an Employee table (id, designation) and a Salary table
(salary, dept id). Create external tables in Hive with a similar schema, move the data to Hive
using Sqoop and load the contents into the tables, filter into a new table, and write a UDF to
encrypt the table with the AES algorithm; decrypt it with the key to show the contents.

Step 1: Create SQL Tables (MySQL Example)

-- Employee Table
CREATE TABLE employee (
id INT PRIMARY KEY,
name VARCHAR(50),
designation VARCHAR(50)
);

-- Salary Table
CREATE TABLE salary (
id INT,
salary FLOAT,
dept_id INT,
FOREIGN KEY (id) REFERENCES employee(id)
);

-- Sample Inserts
INSERT INTO employee VALUES (1, 'Alice', 'Manager'), (2, 'Bob', 'Analyst');
INSERT INTO salary VALUES (1, 80000, 101), (2, 50000, 102);

Step 2: Create Hive External Tables


-- External Table for Employee
CREATE EXTERNAL TABLE employee_ext (
id INT,
name STRING,
designation STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/external/employee_ext';

-- External Table for Salary


CREATE EXTERNAL TABLE salary_ext (
id INT,
salary FLOAT,
dept_id INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/external/salary_ext';
Step 3: Import Data from MySQL to Hive using Sqoop
# Import employee table
sqoop import \
--connect jdbc:mysql://localhost:3306/company \
--username root \
--password yourpassword \
--table employee \
--hive-import \
--hive-table default.employee_ext \
--fields-terminated-by ','
# Import salary table
sqoop import \
--connect jdbc:mysql://localhost:3306/company \
--username root \
--password yourpassword \
--table salary \
--hive-import \
--hive-table default.salary_ext \
--fields-terminated-by ','

Step 4: Filter a New Table from Hive


-- Filter employees with salary > 60000
CREATE TABLE high_salary_employees AS
SELECT e.id, e.name, e.designation, s.salary, s.dept_id
FROM employee_ext e
JOIN salary_ext s ON e.id = s.id
WHERE s.salary > 60000;

Step 5: Write Hive UDF to Encrypt using AES


AES Encrypt UDF (Java)
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.util.Base64;

public class AESEncryptUDF extends UDF {


private static final String ALGORITHM = "AES"; // Cipher.getInstance("AES") defaults to AES/ECB/PKCS5Padding
private static final String KEY = "1234567890123456"; // 16-byte (128-bit) demo key; never hard-code keys in production
public Text evaluate(Text input) {
try {
SecretKeySpec keySpec = new SecretKeySpec(KEY.getBytes(), ALGORITHM);
Cipher cipher = Cipher.getInstance(ALGORITHM);
cipher.init(Cipher.ENCRYPT_MODE, keySpec);
byte[] encrypted = cipher.doFinal(input.toString().getBytes());
return new Text(Base64.getEncoder().encodeToString(encrypted));
} catch (Exception e) {
return null;
}
}
}

AES Decrypt UDF (Java)

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.util.Base64;

public class AESDecryptUDF extends UDF {
private static final String ALGORITHM = "AES";
private static final String KEY = "1234567890123456"; // Same 16-byte key used by AESEncryptUDF

public Text evaluate(Text input) {


try {
SecretKeySpec keySpec = new SecretKeySpec(KEY.getBytes(), ALGORITHM);
Cipher cipher = Cipher.getInstance(ALGORITHM);
cipher.init(Cipher.DECRYPT_MODE, keySpec);
byte[] decoded = Base64.getDecoder().decode(input.toString());
return new Text(new String(cipher.doFinal(decoded)));
} catch (Exception e) {
return null;
}
}
}
Step 6: Compile and Use UDFs in Hive
# Compile (hive-exec must be on the classpath for the UDF base class; assumes Hive under $HIVE_HOME)
javac -classpath "$(hadoop classpath):$HIVE_HOME/lib/*:." AESEncryptUDF.java AESDecryptUDF.java
jar -cf aesudf.jar AESEncryptUDF.class AESDecryptUDF.class

Load and Use in Hive


ADD JAR /path/to/aesudf.jar;

CREATE TEMPORARY FUNCTION aes_encrypt AS 'AESEncryptUDF';


CREATE TEMPORARY FUNCTION aes_decrypt AS 'AESDecryptUDF';

-- Encrypt Name Field


SELECT id, aes_encrypt(name), designation FROM high_salary_employees;

-- Decrypt it back
SELECT id, aes_decrypt(aes_encrypt(name)), designation FROM high_salary_employees;

WEEK-5
9. (i) PySpark definition (Apache PySpark) and differences between PySpark, Scala, and
pandas
(ii) PySpark files and class methods
(iii) get(filename)
(iv) getRootDirectory()

(i) PySpark definition (Apache PySpark) and differences between PySpark, Scala, and pandas

What is PySpark?
PySpark is the Python API for Apache Spark, an open-source, distributed computing system
used for big data processing and analytics. It allows you to harness the power of Spark with
Python programming.
Key Features:
 Supports RDDs, DataFrames, SQL, and Streaming.
 Scales across multiple machines or nodes.
 Integrates with Hadoop, Hive, HDFS, Cassandra, etc.
Comparison: PySpark vs Scala vs Pandas

Aspect          PySpark                          Scala (native Spark)             Pandas
Language        Python API over Spark            Spark's native JVM language      Pure Python library
Scale           Distributed across a cluster     Distributed across a cluster     Single machine, in-memory
Performance     Slight overhead over Scala       Generally the fastest            Fast only for small/medium data
Typical use     Big data analytics in Python     Production Spark pipelines       Prototyping and small datasets

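To make the comparison concrete, the sketch below computes the same grouped average in pandas and in PySpark. The file name people.csv and its gender/salary columns are assumptions for illustration; the pandas version runs on a single machine, while the PySpark version can be distributed.

# pandas: single-machine, in-memory
import pandas as pd
pdf = pd.read_csv("people.csv")
print(pdf.groupby("gender")["salary"].mean())

# PySpark: same logic, but executed by Spark (possibly on a cluster)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("CompareDemo").getOrCreate()
sdf = spark.read.csv("people.csv", header=True, inferSchema=True)
sdf.groupBy("gender").avg("salary").show()
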
(ii) PySpark Files and Class Methods

Common PySpark Files:


 .py files for writing Spark applications
 .json, .csv, .parquet files for reading data
 .jar files for Java/Scala library dependencies
Core Classes & Methods:
1. SparkContext

from pyspark import SparkContext


sc = SparkContext("local", "AppName")
sc.textFile("data.txt").collect()
2. SparkSession (used in DataFrame APIs)

from pyspark.sql import SparkSession


spark = SparkSession.builder.appName("App").getOrCreate()
df = spark.read.csv("data.csv", header=True, inferSchema=True)
3. RDD Operations

rdd = sc.parallelize([1, 2, 3, 4])


rdd.map(lambda x: x * 2).collect()
4. DataFrame Methods
df.show()
df.filter(df["age"] > 30).select("column_name").show()  # filter on age first, then project the required column
df.groupBy("gender").agg({"salary": "avg"}).show()

(iii) get(filename) Function in PySpark

PySpark does not expose a top-level get() helper (the closest built-in is SparkFiles.get(), sketched below); to read a file into a DataFrame you typically write:

df = spark.read.csv("filename.csv", header=True, inferSchema=True)

df.show()

If you want to wrap this in a custom function:

def get(filename):
    return spark.read.csv(filename, header=True, inferSchema=True)

df = get("data.csv")

df.show()
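
The experiment item get(filename) most likely refers to pyspark.SparkFiles.get(), which returns the local path of a file distributed to the executors with SparkContext.addFile(). A minimal sketch, assuming an active SparkContext sc and that data.txt exists in the working directory:

from pyspark import SparkFiles

sc.addFile("data.txt")                   # ship the file to every executor
local_path = SparkFiles.get("data.txt")  # absolute local path of the distributed copy
print("Resolved path:", local_path)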

(iv) Get Root Directory in PySpark

To get the root directory of the working project in PySpark (Python):

import os

root_dir = os.getcwd()

print("Root Directory:", root_dir)

To get the Spark application ID (which names the application's working directories on the cluster):

spark = SparkSession.builder.appName("App").getOrCreate()

print(spark.sparkContext.applicationId)
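
Similarly, getRootDirectory() in the experiment title is likely pyspark.SparkFiles.getRootDirectory(), which returns the directory where files added through SparkContext.addFile() are placed; a short sketch:

from pyspark import SparkFiles

# Directory that holds files distributed with sc.addFile()
print("SparkFiles root directory:", SparkFiles.getRootDirectory())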

10. PySpark - RDDs
(i) What is an RDD?

(ii) Ways to create an RDD

(iii) Parallelized collections

(iv) External dataset

(v) Existing RDDs

(vi) Spark RDD operations (count(), foreach(), collect(), join(), cache())

(i) What is an RDD?

An RDD (Resilient Distributed Dataset) is the fundamental data structure of Apache Spark.

It represents an immutable distributed collection of objects that can be processed in parallel.

RDDs are fault-tolerant, distributed, and allow in-memory computations.
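
A small sketch, assuming an active SparkContext sc, illustrating that RDDs are immutable and that transformations are lazy (nothing executes until an action is called):

rdd = sc.parallelize([1, 2, 3, 4, 5])
doubled = rdd.map(lambda x: x * 2)   # transformation only: no job runs yet
print(doubled.collect())             # action triggers execution -> [2, 4, 6, 8, 10]
print(rdd.collect())                 # original RDD is unchanged -> [1, 2, 3, 4, 5]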

(ii) Ways to Create RDD

1. From a Python Collection:

rdd1 = sc.parallelize([1, 2, 3, 4, 5])

2. From an External Dataset:

rdd2 = sc.textFile("data.txt")

3. From an Existing RDD:

rdd3 = rdd1.map(lambda x: x * 2)

(iii) Parallelized Collections

Parallelized collections are useful for testing or small datasets:

data = ['apple', 'banana', 'grape']

rdd = sc.parallelize(data)

rdd.collect() # ['apple', 'banana', 'grape']

(iv) External Dataset

You can read data from files using textFile or wholeTextFiles:

rdd_file = sc.textFile("hdfs:///user/hadoop/input.txt")

rdd_file.take(5)
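
wholeTextFiles, mentioned above, differs from textFile in that it returns one record per file as a (filename, content) pair; a short sketch (the directory path is an assumption):

pairs = sc.wholeTextFiles("hdfs:///user/hadoop/input_dir/")
for path, content in pairs.take(2):
    print(path, len(content))   # file path and the size of its contents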

(v) Existing RDDs

RDDs can be transformed into new RDDs:


rdd_orig = sc.parallelize([1, 2, 3])

rdd_squared = rdd_orig.map(lambda x: x ** 2)

print(rdd_squared.collect()) # [1, 4, 9]

(vi) Spark RDD Operations

Common transformations: map, filter, flatMap, union, join, distinct

Common actions: collect, count, foreach, take, reduce

# Example operations:

rdd = sc.parallelize([('a', 1), ('b', 2), ('a', 3)])

rdd.count() # 3

rdd.foreach(lambda x: print(x)) # runs on the executors; output appears on the driver console only in local mode

rdd.collect() # [('a', 1), ('b', 2), ('a', 3)]

# join:

rdd1 = sc.parallelize([('a', 1), ('b', 2)])

rdd2 = sc.parallelize([('a', 4)])

rdd1.join(rdd2).collect() # [('a', (1, 4))]

# cache:

rdd_cached = rdd.cache()

rdd_cached.collect()
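
The transformations and actions listed above that the snippet does not cover (filter, distinct, take, reduce) follow the same pattern; a brief sketch:

nums = sc.parallelize([1, 2, 2, 3, 4, 4])
nums.filter(lambda x: x % 2 == 0).collect()  # [2, 2, 4, 4]
nums.distinct().collect()                    # [1, 2, 3, 4] (order may vary)
nums.take(3)                                 # first 3 elements
nums.reduce(lambda a, b: a + b)              # 16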

WEEK-6

11. Perform PySpark transformations

(i) map and flatMap
(ii) Remove the words that are not necessary for analysing this text (stop words)
(iii) groupBy
(iv) What if we want to calculate how many times each word occurs in the corpus?
(v) How do I perform a task (say, counting the words 'spark' and 'apache' in rdd3) separately on
each partition and get the output of the task performed in these partitions?
(vi) Unions of RDDs
(vii) Join two pair RDDs based on their key

(i) map and flatMap


# map(): Applies the function to each element and returns a new RDD.
rdd = sc.parallelize([1, 2, 3])
mapped_rdd = rdd.map(lambda x: x * 2)
mapped_rdd.collect() # [2, 4, 6]

# flatMap(): Flattens the result after applying the function.


lines = sc.parallelize(["hello world", "spark apache"])
words = lines.flatMap(lambda line: line.split(" "))
words.collect() # ['hello', 'world', 'spark', 'apache']
(ii) Remove unnecessary words (Stop Words Removal)
text = sc.parallelize(["Apache Spark is fast", "Spark is powerful"])
stopwords = ['is', 'a', 'the', 'and']
filtered = text.flatMap(lambda line: line.split()) \
.filter(lambda word: word.lower() not in stopwords)
filtered.collect() # ['Apache', 'Spark', 'fast', 'Spark', 'powerful']
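
For a larger stop-word list, a broadcast variable avoids re-sending the list with every task; a sketch using the same text and stopwords as above:

stop_bc = sc.broadcast(set(stopwords))        # shipped once to each executor
filtered_bc = text.flatMap(lambda line: line.split()) \
                  .filter(lambda word: word.lower() not in stop_bc.value)
filtered_bc.collect()                         # same result as the filter above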
(iii) groupBy
rdd = sc.parallelize(['apple', 'banana', 'grape', 'apricot'])
grouped = rdd.groupBy(lambda word: word[0])
[(k, list(v)) for k, v in grouped.collect()] # [('a', ['apple', 'apricot']), ...]
(iv) Word Count in a Corpus
rdd = sc.textFile("data.txt")
words = rdd.flatMap(lambda line: line.split())
word_pairs = words.map(lambda word: (word.lower(), 1))
word_counts = word_pairs.reduceByKey(lambda a, b: a + b)
word_counts.collect() # [('spark', 3), ('apache', 2), ...]
(v) Task on Each Partition (e.g., count 'spark' and 'apache')
def partition_task(iterator):
    spark_count = 0
    apache_count = 0
    for word in iterator:
        if word == 'spark': spark_count += 1
        if word == 'apache': apache_count += 1
    yield ('spark', spark_count), ('apache', apache_count)

rdd = sc.parallelize(['spark', 'apache', 'spark', 'hadoop', 'apache'], 2)


result = rdd.mapPartitions(partition_task)
result.collect() # Each partition's spark/apache count
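
To also see which partition produced which counts, mapPartitionsWithIndex passes the partition index to the function; a sketch reusing the same rdd:

def indexed_task(index, iterator):
    words = list(iterator)
    yield (index, ('spark', words.count('spark')), ('apache', words.count('apache')))

rdd.mapPartitionsWithIndex(indexed_task).collect()
# e.g. [(0, ('spark', 1), ('apache', 1)), (1, ('spark', 1), ('apache', 1))]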
(vi) Unions of RDD
rdd1 = sc.parallelize([1, 2, 3])
rdd2 = sc.parallelize([4, 5, 6])
rdd_union = rdd1.union(rdd2)
rdd_union.collect() # [1, 2, 3, 4, 5, 6]
(vii) Join Two RDDs by Key
rdd1 = sc.parallelize([(1, 'A'), (2, 'B')])
rdd2 = sc.parallelize([(1, 'X'), (2, 'Y')])
joined = rdd1.join(rdd2)
joined.collect() # [(1, ('A', 'X')), (2, ('B', 'Y'))]
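
When keys do not match on both sides, outer joins behave like their Hive counterparts from Week 4; a short sketch of leftOuterJoin (the sample data is an assumption):

emp = sc.parallelize([(1, 'A'), (2, 'B'), (3, 'C')])
dept = sc.parallelize([(1, 'X'), (2, 'Y')])
emp.leftOuterJoin(dept).collect()
# [(1, ('A', 'X')), (2, ('B', 'Y')), (3, ('C', None))]  (unmatched key keeps None; order may vary)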

12. PySpark SparkConf - attributes and applications

(i) What is PySpark SparkConf()?
(ii) Using SparkConf, create a SparkSession to read the details in a CSV file into a
DataFrame and later move that CSV to another location

(i) What is PySpark SparkConf()


SparkConf is a configuration class in PySpark used to set up various parameters of the Spark
Application.
It is passed to SparkContext or SparkSession to configure the behavior of the application.
# Example:
from pyspark import SparkConf
conf = SparkConf().setAppName('MyApp').setMaster('local')
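
Beyond the application name and master, SparkConf exposes attribute-style setters and getters for arbitrary Spark properties; a sketch (the property values chosen here are illustrative assumptions):

from pyspark import SparkConf

conf = (SparkConf()
        .setAppName('MyApp')
        .setMaster('local[2]')
        .set('spark.executor.memory', '2g')        # generic key/value attribute
        .set('spark.sql.shuffle.partitions', '8'))

print(conf.get('spark.executor.memory'))           # '2g'
print(conf.getAll())                               # list of (key, value) pairs
print(conf.toDebugString())                        # human-readable dump of all settings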
(ii) Using SparkConf to Create SparkSession and Move CSV
# Step 1: Create SparkConf and SparkSession
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().setAppName('CSVApp').setMaster('local')
spark = SparkSession.builder.config(conf=conf).getOrCreate()

# Step 2: Read CSV File into DataFrame


df = spark.read.csv('input.csv', header=True, inferSchema=True)
df.show()

# Step 3: Write CSV to New Location


df.write.csv('output_folder/new_output.csv', header=True, mode='overwrite')
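
Note that df.write.csv() treats the given path as a directory and writes one part-*.csv file per partition; if a single output file is needed (for example, to move the CSV as one file), the DataFrame can be coalesced first. A sketch, assuming the same df:

# Collapse to one partition so the output directory contains a single part file
df.coalesce(1).write.csv('output_single', header=True, mode='overwrite')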
