Big Data-Spark Lab Manual
COMPUTER SCIENCE & ENGINEERING
DEPT OF CSE
VISION & MISSION OF THE INSTITUTE & DEPARTMENT
Program Educational Objectives (PEOs):
PEO1: To provide graduates the foundational and essential knowledge in mathematics, science,
computer science and engineering and interdisciplinary engineering to emerge as
technocrats.
PEO2: To inculcate the capabilities to analyse, design and develop innovative solutions of
computer support systems for the benefit of society, through diligence and teamwork.
PO2: Problem Analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of mathematics,
natural sciences and Engineering sciences.
PO3: Design/Development of Solutions: Design solutions for complex engineering problems and
design system components or processes that meet the specified needs with appropriate
consideration for the public health and safety, and the cultural, societal, and environmental
considerations.
PO5: Modern Tool Usage: Create, select and apply appropriate techniques, resources and modern
engineering and IT tools including prediction and modeling to complex engineering activities with
an understanding of the limitations.
PO6: The Engineer and Society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the
professional engineering practice.
PO7: Environment and Sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for,
sustainable development.
PO8: Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
PO9: Individual and Team Work: Function effectively as an individual and as a member
or leader in diverse teams and in multidisciplinary settings.
PO11: Project Management and Finance: Demonstrate knowledge and understanding of the
engineering management principles and apply these to one's own work, as a member and leader in
a team to manage projects and in multidisciplinary environments.
PO12: Life-Long Learning: Recognize the need for and have the preparation and ability to
engage in independent and lifelong learning in the broadest context of technological change.
INDEX
S.NO. TOPIC
INSTRUCTIONS TO STUDENTS
All students must observe the Dress Code while in the laboratory.
Foods, drinks and smoking are NOT allowed.
All bags must be left at the indicated place.
The lab timetable must be strictly followed.
Be PUNCTUAL for your laboratory session.
Noise must be kept to a minimum.
Workspace must be kept clean and tidy at all times.
All students are liable for any damage to the accessories due to their own
negligence.
Students are strictly PROHIBITED from taking out any items from the laboratory.
Report any malfunction of the accessories to the Lab Supervisor immediately.
(xii) Zipping and unzipping files with and without permissions, and pasting them to a location
WEEK-3
5 Implementing Matrix Multiplication with Hadoop MapReduce
6 Compute Average Salary and Total Salary by Gender for an Enterprise.
WEEK-4
7 (i) Creating Hive tables (external and internal)
(ii) Loading data into external Hive tables from SQL tables (or) structured CSV files using
Sqoop
(iii) Performing operations like filtering and updating
(iv) Performing joins (inner, outer, etc.)
(v) Writing user-defined functions on Hive tables
8 Create SQL tables for employees: an Employee table (id, designation) and a Salary table
(salary, dept id). Create external tables in Hive with a similar schema to the above tables, move the data
to Hive using Sqoop and load the contents into the tables, filter into a new table, and write a UDF to
encrypt the table with the AES algorithm; decrypt it with the key to show the contents.
WEEK-5
9 (i) PySpark definition (Apache PySpark) and the difference between PySpark, Scala, and
pandas
(ii) PySpark files and class methods
(iii) get(filename)
(iv) get root directory()
10 PySpark - RDDs
(i) What are RDDs?
(ii) Ways to create RDDs
(iii) Parallelized collections
(iv) External datasets
(v) Existing RDDs
(vi) Spark RDD operations (count(), foreach(), collect(), join(), cache())
WEEK-6
11 Perform PySpark transformations
(i) map and flatMap
(ii) Remove the words that are not necessary to analyze the text
(iii) groupBy
(iv) Calculate how many times each word occurs in the corpus
(v) Perform a task (say, count the words 'spark' and 'apache' in rdd3)
separately on each partition and get the output of the task performed on these partitions
(vi) Union of RDDs
(vii) Join two pair RDDs based upon their key
12 PySpark SparkConf - attributes and applications
(i) What is PySpark SparkConf()?
(ii) Using SparkConf, create a SparkSession to read details from a CSV into a DataFrame
and later move that CSV to another location
TEXT BOOKS:
1. Spark in Action, Marko Bonaci and Petar Zecevic, Manning.
2. PySpark SQL Recipes: With HiveQL, Dataframe and Graphframes, Raju Kumar
Mishra and Sundar Rajan Raman, Apress Media.
Course Objectives:
The main objective of the course is to process Big Data with advanced architectures
like Spark and to handle streaming data in Spark.
Course Outcomes:
Develop MapReduce programs to analyze large datasets using Hadoop and Spark.
Write Hive queries to analyze large datasets.
Outline the Spark ecosystem and its components.
Perform the filter, count, distinct, map, and flatMap RDD operations in Spark.
Build queries using Spark SQL.
Apply Spark joins on sample data sets.
Make use of Sqoop to import and export data between Hadoop and a database and vice-versa.
WEEK-1
1. Study of Big Data Analytics and Hadoop Architecture
(i) know the concept of big data architecture
(ii) know the concept of Hadoop architecture
Big Data architecture:
Big Data architecture refers to the design and structure used to store, process, and analyze
large volumes of data. These architectures are built to handle a variety of data types
(structured, semi-structured, unstructured), as well as the large scale and speed of modern
data flows. The core components of Big Data architecture typically include the following
layers:
1. Data Source Layer
This layer refers to the origin of the data, which could come from a variety of sources:
External data sources: Social media, IoT devices, third-party services, etc.
Internal data sources: Databases, data warehouses, etc.
Data streams: Real-time data from sensors, logs, etc.
2. Data Ingestion Layer
Data ingestion is the process of collecting and transporting data from various sources to the
storage layer. The two main types of ingestion are:
Batch processing: Data is collected over a fixed period (e.g., every hour, daily).
Real-time/streaming processing: Data is collected in real-time or near real-time.
Tools used for data ingestion include:
Apache Kafka: A distributed streaming platform.
Apache Flume: A service for collecting and moving large amounts of log data.
AWS Kinesis: A platform for real-time streaming data on AWS.
3. Data Storage Layer
This is where all the data is stored. Big Data storage should support both structured and
unstructured data. It needs to be scalable, reliable, and highly available. Some common types
of data storage in Big Data systems include:
HDFS (Hadoop Distributed File System): A scalable, distributed file system.
NoSQL databases: MongoDB, Cassandra, HBase for non-relational data.
Data Lakes: A central repository for storing raw data in its native format (e.g., AWS
S3, Azure Blob Storage).
4. Data Processing Layer
This layer processes the stored data and transforms it into valuable insights. It can be divided
into two major approaches:
Batch Processing: Processing data in large, scheduled intervals (e.g., Hadoop
MapReduce).
Stream Processing: Processing data in real-time as it flows in (e.g., Apache Flink,
Apache Storm, Spark Streaming).
Some key processing tools:
Apache Spark: A fast and general-purpose cluster-computing system.
Apache Hadoop: A framework for distributed storage and processing.
Flink and Storm: Used for real-time data stream processing.
5. Data Analytics Layer
Once data is processed, it is often analyzed to extract insights. The analytics layer provides
tools for complex analysis, including:
Machine Learning (ML): Building predictive models and patterns using algorithms.
Data Mining: Discovering hidden patterns and trends in data.
Business Intelligence (BI): Tools like Tableau, Power BI for reporting and
visualization.
Popular tools used for analytics:
Apache Hive: A data warehouse built on top of Hadoop for querying and analyzing
large datasets.
Apache Impala: A high-performance SQL engine for big data.
Python libraries (Pandas, scikit-learn): For data manipulation and machine
learning.
6. Data Presentation Layer
This layer presents the insights derived from the analytics layer. It often involves dashboards,
reports, and visualizations. Users, stakeholders, or systems will interact with this layer to
make data-driven decisions. Tools include:
BI tools: Tableau, Power BI, QlikView.
Custom web interfaces: To display reports, graphs, and analysis.
7. Security and Governance Layer
Given the large volumes and sensitivity of data, security and governance are critical. This
layer ensures data privacy, access control, and regulatory compliance.
Authentication/Authorization: Ensuring only authorized users can access specific
data.
Data Encryption: To protect sensitive data at rest and in transit.
Data Lineage: Tracking the origin and movement of data to ensure trustworthiness.
Compliance: Adhering to regulations such as GDPR, HIPAA, etc.
8. Orchestration and Management Layer
Big Data systems require complex management for coordination, scheduling, and monitoring.
Apache Airflow: An open-source platform to programmatically author, schedule, and
monitor workflows.
Kubernetes: For managing containerized applications and ensuring scalability and
reliability.
Key Technologies in Big Data Architecture:
Hadoop Ecosystem: For storage and processing (HDFS, YARN, MapReduce, Pig,
Hive, etc.).
Apache Kafka: For real-time streaming.
Apache Spark: For fast in-memory data processing.
NoSQL Databases: MongoDB, Cassandra, HBase.
Cloud Platforms: AWS, Azure, Google Cloud provide tools for storage, processing,
and management.
Example of Big Data Architecture
+---------------------+
| Data Sources |
+---------------------+
|
v
+---------------------+ +--------------------+
| Data Ingestion | ---> | Data Storage |
| (Batch/Streaming) | | (HDFS, NoSQL, |
+---------------------+ | Data Lakes) |
| +--------------------+
v |
+--------------------+ v
| Data Processing | +---------------------+
| (Batch/Stream) | ---> | Data Analytics |
+--------------------+ | (ML, BI, Analysis) |
| +---------------------+
v |
+---------------------+ v
| Data Presentation | <--------> +-----------------+
| (Dashboards, Reports| | Security & |
| Visualization) | | Governance |
This high-level overview demonstrates the flow of data through the architecture from
collection to processing and presentation.
(ii) know the concept of Hadoop architecture
Hadoop's architecture is built around three core components: HDFS, MapReduce, and YARN.
These components work together to provide a distributed system that can store and process
large volumes of data.
1. HDFS (Hadoop Distributed File System)
HDFS is the storage layer of Hadoop. It is designed to store vast amounts of data across
multiple machines in a distributed environment.
+-------------------+     +-------------------+
|      Client       |     |      Client       |
+-------------------+     +-------------------+
          \                     /
           +-------------------+
           |     NameNode      |   (master: stores file system metadata)
           +-------------------+
          /                     \
+-------------------+     +-------------------+
|     DataNode      |     |     DataNode      |   (workers: store data blocks)
+-------------------+     +-------------------+
2. MapReduce
Map phase: In the Map phase, the input data is divided into chunks (called splits),
and each chunk is processed by a mapper. The mapper processes the data and
generates a set of intermediate key-value pairs.
Shuffle and Sort: After the Map phase, the intermediate key-value pairs are shuffled
and sorted. The system groups the data by key and prepares it for the Reduce phase.
Reduce phase: In the Reduce phase, the system applies the reduce function to the sorted
intermediate data, aggregating or transforming the data in some way. The results are written
to the output files.
+-------------+
| Input | ----> [Map] ----> [Shuffle & Sort] ----> [Reduce] ----> Output
+-------------+
JobTracker: The JobTracker is the master daemon in the MapReduce framework. It
is responsible for scheduling and monitoring jobs, dividing the work into tasks, and
allocating tasks to TaskTrackers.
TaskTracker: TaskTrackers are worker daemons that run on the cluster nodes and
execute tasks assigned by the JobTracker. Each TaskTracker handles both Map and
Reduce tasks.
3. YARN (Yet Another Resource Negotiator)
YARN is the resource management layer of Hadoop, responsible for managing resources
across the cluster and scheduling the execution of tasks.
+-----------------------+
| ResourceManager | <-----> [Resource Allocation]
+-----------------------+
|
+-----------------------------+
| NodeManager | <-----> [Resource Monitoring]
+-----------------------------+
|
+---------------------------+
| ApplicationMaster (AM) | <-----> [Job Coordination]
+---------------------------+
|
+-----------------------+
| Application | <-----> [MapReduce/Spark Job]
+-----------------------+
Apart from the core components (HDFS, MapReduce, and YARN), Hadoop has a rich
ecosystem that includes several tools and frameworks for different use cases. Some of the key
components include:
Hive: A data warehouse system that facilitates querying and managing large datasets
in HDFS using SQL-like queries.
Pig: A platform for analyzing large datasets, providing a high-level language called
Pig Latin for processing and transforming data.
HBase: A NoSQL database for real-time read/write access to large datasets stored in
HDFS.
Sqoop: A tool for transferring data between Hadoop and relational databases.
Flume: A service for collecting and aggregating log data and other types of streaming
data.
Oozie: A workflow scheduler for managing Hadoop jobs.
Zookeeper: A service for coordinating distributed applications in the Hadoop
ecosystem.
Mahout: A machine learning library for scalable machine learning algorithms.
1. Scalability: Hadoop is designed to scale horizontally. As your data grows, you can
add more nodes to the cluster.
2. Fault Tolerance: Through replication and data distribution, Hadoop ensures that the
data is not lost even when individual nodes fail.
3. Cost Efficiency: Hadoop runs on commodity hardware, meaning you can build large-
scale clusters with low-cost machines.
Data Locality: Hadoop tries to move computation to where the data is stored to minimize
network congestion and speed up processing.
2. Loading a Dataset into HDFS for Spark Analysis; Installation of Hadoop and Cluster
Management
(ii) Knowing the difference between single-node clusters and multi-node clusters
(iv) Installing and accessing environments such as Hive and Sqoop
Prerequisites:
Step-by-Step Installation:
1. Verify the Java installation:
java -version
2. Install Hadoop:
o First, download Hadoop binaries from the official Apache website. You can
download a stable version using wget:
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
o Extract the downloaded tar file and move it to /opt/hadoop:
tar -xzf hadoop-3.3.1.tar.gz
sudo mv hadoop-3.3.1 /opt/hadoop
o Set the Hadoop environment variables by editing ~/.bashrc:
nano ~/.bashrc
Add the following lines at the end of the file:
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
source ~/.bashrc
7. Configure Hadoop:
In the Hadoop configuration directory, you'll need to edit several XML files to set up
the cluster.
o core-site.xml:
nano $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
o hdfs-site.xml:
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Add:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///opt/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///opt/hadoop/hdfs/datanode</value>
</property>
</configuration>
o mapred-site.xml:
Set up the MapReduce framework:
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Add:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
o yarn-site.xml:
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
Add:
<configuration>
<property>
<name>yarn.resourcemanager.address</name>
<value>localhost:8032</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
8. Format HDFS:
hdfs namenode -format
9. Start HDFS and YARN:
start-dfs.sh
start-yarn.sh
10. Verify the Installation:
o Check if the HDFS is running properly:
o jps
o Check if the ResourceManager and NodeManager are running as well:
o jps
o You can also check the Hadoop Web UI to view the status of your cluster.
Single-Node Cluster:
A single-node cluster is a Hadoop setup where all the Hadoop services (NameNode,
DataNode, ResourceManager, and NodeManager) run on one machine (localhost).
It is simpler to set up and useful for development and testing purposes.
Limited scalability and no distributed computation capability in the true sense of a
multi-node cluster.
Multi-Node Cluster:
A multi-node cluster involves multiple machines, where one node acts as the master
(NameNode, ResourceManager) and others as slaves (DataNode, NodeManager).
It offers the true power of distributed computing and storage, enabling scalability and
fault tolerance.
It requires more complex configuration, network setup, and hardware resources.
It is used in production environments where large-scale data processing is required.
Hadoop provides a Web UI to monitor the cluster's health and performance. For Hadoop 3.x, the
key ports are the NameNode web UI at http://localhost:9870 and the ResourceManager web UI at
http://localhost:8088.
Hive Installation:
1. Install Hive:
You can download the latest stable version of Apache Hive from the Apache website
or install it via apt if available.
hive
This opens the Hive CLI where you can execute Hive queries.
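For example, once the CLI is open you can run a few basic HiveQL statements (the database name
below is only an illustration):
SHOW DATABASES;
CREATE DATABASE IF NOT EXISTS lab_db;
USE lab_db;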
Sqoop Installation:
1. Install Sqoop:
Download and install Sqoop, which is used for transferring data between relational
databases and Hadoop.
3. Access Sqoop:
To use Sqoop to import or export data, you can run commands like:
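A minimal sketch of an import from MySQL into HDFS; the connection string, credentials, table
name, and target directory are placeholders for your setup:
sqoop import \
  --connect jdbc:mysql://localhost/employees_db \
  --username dbuser -P \
  --table employee \
  --target-dir /user/hadoop/employee \
  -m 1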
WEEK-2
(xii) Zipping and unzipping files with and without permissions, and pasting them to a location
Here’s a breakdown of file management tasks and basic Linux commands, particularly
focused on HDFS (Hadoop Distributed File System) operations:
To create a directory in HDFS, you can use the hadoop fs -mkdir command.
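For example (the directory path is only an illustration):
hadoop fs -mkdir -p /user/hadoop/lab_data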
You can navigate directories in the Linux file system using the cd command.
To move to a directory:
cd /path/to/directory
To move back to the previous directory:
cd -
To move up one directory level:
cd ..
For HDFS directories, you use the hadoop fs -ls command to list the contents. HDFS has no cd
command or notion of a current directory, so HDFS paths are always given in full.
To list contents of a directory, whether in HDFS or local, you use the ls command.
In HDFS:
hadoop fs -ls /path/to/directory
In Local File System:
ls /path/to/directory
(iv) Uploading and downloading a file in HDFS:
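A minimal sketch, assuming a local file /local/path/to/file and the HDFS directory /user/hadoop:
Upload (local to HDFS):
hadoop fs -put /local/path/to/file /user/hadoop/
Download (HDFS to local):
hadoop fs -get /user/hadoop/file /local/path/to/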
You can check the contents of a file using the cat command.
In HDFS:
hadoop fs -cat /path/to/file
In Local File System:
cat /path/to/file
(vi) Copying and moving files:
Copying files:
o To copy a file within HDFS:
hadoop fs -cp /path/to/source /path/to/destination
o To copy a file in the local file system:
cp /local/path/to/source /local/path/to/destination
Moving files:
o To move a file within HDFS:
hadoop fs -mv /path/to/source /path/to/destination
o To move a file in the local file system:
mv /local/path/to/source /local/path/to/destination
To remove files and directories, you use the -rm command, with -r for directories.
Remove a file in HDFS:
hadoop fs -rm /path/to/file
Remove a directory in HDFS:
hadoop fs -rm -r /path/to/directory
Remove a file locally:
rm /local/path/to/file
Remove a directory locally:
rm -r /local/path/to/directory
(ix) Displaying few lines of a file:
In HDFS:
hadoop fs -head /path/to/file
In Local File System:
head /path/to/file
(x) Display the aggregate length of a file:
You can get the file size using the -du (disk usage) command.
In HDFS:
hadoop fs -du -s /path/to/file
In Local File System:
du -sh /path/to/file
You can check the permissions of a file using the -ls command, which will show the file
permissions.
In HDFS:
hadoop fs -ls /path/to/file
In Local File System:
ls -l /path/to/file
This will display the permissions, owner, and group of the file or directory.
(xii) Zipping and unzipping files with and without permissions, and pasting them to a
location:
You can zip and unzip files using the zip and unzip commands.
Zipping a file:
zip filename.zip /path/to/file
Unzipping a file:
unzip filename.zip
To maintain permissions while transferring a file, use the -p option in cp or rsync for
preserving permissions.
For HDFS, the -p option of hadoop fs -cp likewise preserves the file's permissions and timestamps:
hadoop fs -cp -p /path/to/source /path/to/destination
Paste Command (To paste a file after copying it): This is generally done by using cp
or mv as mentioned above. There's no specific "paste" command, but the operation is
performed through these commands when moving or copying data.
4. Map-Reduce
Map-Reduce is a programming model and processing technique used to process and generate large
datasets. It allows the parallel processing of data by dividing it into small chunks and distributing it
across multiple nodes in a cluster. The main concept involves two key operations: Map and Reduce.
Map: The map function processes input data and produces a set of intermediate key-value
pairs.
Reduce: The reduce function takes the intermediate key-value pairs, processes them, and
merges them to produce the final result.
Map-Reduce is widely used in distributed systems like Hadoop for large-scale data processing tasks.
The Map-Reduce process is split into two main stages: the Map stage and the Reduce stage, but
several other intermediate processes and terminologies come into play.
1. Map Stage:
o The input data is divided into chunks (usually files or records).
o The Mapper function processes each chunk and outputs intermediate key-value pairs.
o The intermediate output is sorted and grouped by key (called the shuffle phase).
2. Shuffle and Sort:
o After the map phase, the intermediate key-value pairs are shuffled and sorted to
ensure that all values corresponding to the same key are grouped together. This step
happens automatically in Map-Reduce frameworks like Hadoop.
3. Reduce Stage:
o The Reducer function processes each group of intermediate key-value pairs and
merges them to produce a final output. It can aggregate, summarize, or process data
in any other way required by the user.
4. Output:
o After the reduce phase, the final output is written to a file or a database.
Mapper: The function or process that reads input data, processes it, and outputs key-value
pairs.
Reducer: The function that processes the grouped key-value pairs from the mapper and
performs the final aggregation or computation.
Key-Value Pair: The fundamental unit of data in Map-Reduce, where each record is
represented as a key paired with a value.
Shuffle: The process of redistributing the data across reducers based on keys, ensuring that all
values for the same key are sent to the same reducer.
Input Split: The unit of work or chunk of data that is sent to a mapper.
Output: The final result after processing in the reduce phase, usually saved to disk or a
storage system.
Here is a simple example of a Word-Count program to demonstrate the Map-Reduce process. We will
break it into four main parts:
1. Mapper Phase:
The mapper reads input text and emits key-value pairs, where the key is a word, and the value is 1
(representing a single occurrence of the word).
import sys

# Mapper function
def mapper():
    for line in sys.stdin:
        words = line.split()
        for word in words:
            # Emit word with value 1
            print(f"{word}\t1")

if __name__ == "__main__":
    mapper()
In this code, the mapper reads lines from standard input, splits each line into words, and emits
each word with the value 1.
2. Shuffle and Sort Phase:
After the map phase, the framework automatically groups and sorts the emitted key-value pairs. For
instance, all instances of the word "hello" will be grouped together so that they can be passed to the
same reducer.
hello 1
hello 1
world 1
world 1
data 1
3. Reducer Phase:
The reducer processes the grouped key-value pairs. It aggregates the values by summing them to get
the total count for each word.
Reducer code:
import sys

# Reducer function
def reducer():
    current_word = None
    current_count = 0
    for line in sys.stdin:
        word, count = line.strip().split('\t')
        count = int(count)
        if word == current_word:
            current_count += count
        else:
            if current_word:
                print(f"{current_word}\t{current_count}")
            current_word = word
            current_count = count
    # Emit the count for the last word
    if current_word:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    reducer()
In this code, the reducer reads the sorted key-value pairs from standard input, sums the counts for
consecutive occurrences of the same word, and prints each word with its total count.
4. Driver Code:
The driver code sets up the map and reduce operations and coordinates the execution of the map and
reduce phases in the framework. In Hadoop, this would be handled by a job configuration, but for
simplicity, this can be managed manually in a basic script.
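As a sketch of the driver step, the two scripts can be run with Hadoop Streaming; the jar path and
the HDFS input/output paths below are placeholders for your installation:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -files mapper.py,reducer.py \
  -mapper "python3 mapper.py" \
  -reducer "python3 reducer.py" \
  -input /user/hadoop/input.txt \
  -output /user/hadoop/wordcount_output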
Final Output:
After the map and reduce phases, the output would look like this:
data 1
hello 2
world 2
This shows the word count for each word in the input text.
The mapper would be executed on different nodes processing chunks of data in parallel.
The reducer would then aggregate the results from all the mappers.
This basic example gives you a good understanding of how Map-Reduce works to process large
datasets by distributing the work and aggregating results efficiently.
WEEK-3
Prerequisites:
1. Hadoop Installed (Single Node or Cluster)
2. Java Development Kit (JDK 8 or above)
3. HDFS Setup
4. Basic understanding of MapReduce
Theory
Matrix Multiplication Formula
Given two matrices A (m × n) and B (n × p), their multiplication results in matrix C (m × p), where
C[i][k] = Σ (j = 1 to n) A[i][j] × B[j][k].
Implementation Steps
Each input record holds one matrix element in the form: matrix name, row index, column index, value.
Sample input (A and B are 2 × 2 matrices):
A 0 0 1
A 0 1 2
A 1 0 3
A 1 1 4
B 0 0 5
B 0 1 6
B 1 0 7
B 1 1 8
A consolidated sketch of the mapper, reducer, and driver (the matrix dimensions are hard-coded to
2 × 2 to match the sample input above, and the map output value class is Text so that each record
can carry a "matrix,index,value" triple):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;

public class MatrixMultiplication {
    // Mapper: reads lines of the form "matrixName row col value" (e.g. "A 0 0 1").
    public static class MatrixMapper extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().trim().split("\\s+");
            String matrixName = parts[0];
            int row = Integer.parseInt(parts[1]);
            int col = Integer.parseInt(parts[2]);
            int element = Integer.parseInt(parts[3]);
            if (matrixName.equals("A")) {
                for (int k = 0; k < 2; k++) {        // B has 2 columns
                    context.write(new Text(row + "," + k), new Text("A," + col + "," + element));
                }
            } else {
                for (int i = 0; i < 2; i++) {        // A has 2 rows
                    context.write(new Text(i + "," + col), new Text("B," + row + "," + element));
                }
            }
        }
    }

    // Reducer: for each output cell (i,k), multiplies matching A and B entries and sums them.
    public static class MatrixReducer extends Reducer<Text, Text, Text, IntWritable> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int[] a = new int[2];
            int[] b = new int[2];
            for (Text val : values) {
                String[] parts = val.toString().split(",");
                int index = Integer.parseInt(parts[1]);
                int element = Integer.parseInt(parts[2]);
                if (parts[0].equals("A")) { a[index] = element; } else { b[index] = element; }
            }
            int sum = 0;
            for (int j = 0; j < 2; j++) {
                sum += a[j] * b[j];
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "matrix multiplication");
        job.setJarByClass(MatrixMultiplication.class);
        job.setMapperClass(MatrixMapper.class);
        job.setReducerClass(MatrixReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Expected Output
0,0 19
0,1 22
1,0 43
1,1 50
Objective:
To compute average salary and total salary by gender from a dataset using Hadoop
MapReduce framework.
A consolidated sketch of the mapper, reducer, and driver classes, assuming input records of the
form "id,name,gender,salary":
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;

public class SalaryGenderMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    public void map(LongWritable key, Text value, Context context) throws IOException,
            InterruptedException {
        // Split the CSV record and emit (gender, salary)
        String[] fields = value.toString().split(",");
        if (fields.length >= 4) {
            context.write(new Text(fields[2].trim()),
                    new DoubleWritable(Double.parseDouble(fields[3].trim())));
        }
    }
}

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Reducer;

public class SalaryGenderReducer extends Reducer<Text, DoubleWritable, Text, Text> {
    public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double totalSalary = 0;
        int count = 0;
        for (DoubleWritable val : values) {
            totalSalary += val.get();
            count++;
        }
        double avgSalary = totalSalary / count;
        context.write(key, new Text("Total Salary: " + totalSalary + ", Average Salary: " + avgSalary));
    }
}

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SalaryGenderDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "salary by gender");
        job.setJarByClass(SalaryGenderDriver.class);
        job.setMapperClass(SalaryGenderMapper.class);
        job.setReducerClass(SalaryGenderReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(DoubleWritable.class);   // map output is (gender, salary)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);                 // reducer emits a formatted summary
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Sample Output:
F Total Salary: 140000.0, Average Salary: 70000.0
M Total Salary: 120000.0, Average Salary: 60000.0
WEEK-4
7. (i) Creating Hive tables (external and internal)
(ii) Loading data into external Hive tables from SQL tables (or) structured CSV files using
Sqoop
(iii) Performing operations like filtering and updating
(iv) Performing joins (inner, outer, etc.)
(v) Writing user-defined functions on Hive tables
Prerequisites
Internal Table:
CREATE TABLE employee (
id INT,
name STRING,
gender STRING,
department STRING,
salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
External Table:
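A minimal sketch, assuming the data files are kept at the hypothetical HDFS location
/user/hive/external/employee:
CREATE EXTERNAL TABLE employee_ext (
id INT,
name STRING,
gender STRING,
department STRING,
salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/external/employee';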
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
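The imports above belong to a Hive UDF class (sub-item v). A minimal sketch of a UDF that
converts a column to upper case; the class and function names are only illustrative:
public class UpperCaseUDF extends UDF {
    public Text evaluate(Text input) {
        // Return null for null input, otherwise the upper-cased string
        if (input == null) {
            return null;
        }
        return new Text(input.toString().toUpperCase());
    }
}
After packaging the class into a jar, it can be registered and used from Hive:
ADD JAR /path/to/udf.jar;
CREATE TEMPORARY FUNCTION to_upper AS 'UpperCaseUDF';
SELECT to_upper(name) FROM employee;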
8. Create SQL tables for employees: an Employee table (id, designation) and a Salary table
(salary, dept id). Create external tables in Hive with a similar schema to the above tables, move the data
to Hive using Sqoop and load the contents into the tables, filter into a new table, and write a UDF to
encrypt the table with the AES algorithm; decrypt it with the key to show the contents.
-- Employee Table
CREATE TABLE employee (
id INT PRIMARY KEY,
name VARCHAR(50),
designation VARCHAR(50)
);
-- Salary Table
CREATE TABLE salary (
id INT,
salary FLOAT,
dept_id INT,
FOREIGN KEY (id) REFERENCES employee(id)
);
-- Sample Inserts
INSERT INTO employee VALUES (1, 'Alice', 'Manager'), (2, 'Bob', 'Analyst');
INSERT INTO salary VALUES (1, 80000, 101), (2, 50000, 102);
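To move the tables into Hive with Sqoop, an import of the following shape can be used; the
connection string, credentials, and database name are placeholders (the salary table is imported
the same way):
sqoop import \
  --connect jdbc:mysql://localhost/enterprise_db \
  --username dbuser -P \
  --table employee \
  --hive-import --hive-table employee \
  -m 1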
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.util.Base64;
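A minimal sketch of the encryption UDF built from the imports above, assuming a hard-coded
128-bit key and Base64-encoded output; the matching aes_decrypt UDF simply reverses the steps
with Cipher.DECRYPT_MODE:
public class AESEncryptUDF extends UDF {
    private static final String KEY = "0123456789abcdef";   // hypothetical 16-byte key

    public Text evaluate(Text input) {
        if (input == null) return null;
        try {
            // Build an AES key from the shared secret and encrypt the column value
            SecretKeySpec keySpec = new SecretKeySpec(KEY.getBytes("UTF-8"), "AES");
            Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, keySpec);
            byte[] encrypted = cipher.doFinal(input.toString().getBytes("UTF-8"));
            return new Text(Base64.getEncoder().encodeToString(encrypted));
        } catch (Exception e) {
            throw new RuntimeException("AES encryption failed", e);
        }
    }
}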
-- Decrypt it back
SELECT id, aes_decrypt(aes_encrypt(name)), designation FROM high_salary_employees;
WEEK-5
9. (i) PySpark definition (Apache PySpark) and the difference between PySpark, Scala, and
pandas
(ii) PySpark files and class methods
(iii) get(filename)
(iv) get root directory()
(i) PySpark definition (Apache PySpark) and the difference between PySpark, Scala, and pandas
What is PySpark?
PySpark is the Python API for Apache Spark, an open-source, distributed computing system
used for big data processing and analytics. It allows you to harness the power of Spark with
Python programming.
Key Features:
Supports RDDs, DataFrames, SQL, and Streaming.
Scales across multiple machines or nodes.
Integrates with Hadoop, Hive, HDFS, Cassandra, etc.
Comparison: PySpark vs Scala vs pandas
PySpark is the Python API for Spark and runs computations in a distributed fashion across a
cluster; Scala is Spark's native language, running on the JVM and typically offering the best
performance and earliest access to new Spark features; pandas is a single-machine, in-memory
Python library best suited to datasets that fit on one node.
(ii) PySpark files and class methods; (iii) get(filename)
While get() is not a native PySpark method, here is what you usually do to read files:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("FileApp").getOrCreate()

# A small helper that mimics a get(filename) method for reading a CSV into a DataFrame
def get(filename):
    return spark.read.csv(filename, header=True, inferSchema=True)

df = get("data.csv")
df.show()
(iv) get root directory()
import os
from pyspark.sql import SparkSession

root_dir = os.getcwd()                      # root (current working) directory of the driver
print(root_dir)

spark = SparkSession.builder.appName("App").getOrCreate()
print(spark.sparkContext.applicationId)     # ID of the running Spark application
10. PySpark - RDDs
(i) What are RDDs?
An RDD (Resilient Distributed Dataset) is the fundamental data structure of Apache Spark: an
immutable, fault-tolerant collection of elements that is partitioned across the nodes of the
cluster and can be operated on in parallel.
from pyspark import SparkContext
sc = SparkContext("local", "RDDApp")

# Ways to create RDDs
data = [1, 2, 3]
rdd1 = sc.parallelize(data)              # from a parallelized collection
rdd2 = sc.textFile("data.txt")           # from an external dataset
rdd3 = rdd1.map(lambda x: x * 2)         # from an existing RDD

rdd = sc.parallelize(data)
rdd_file = sc.textFile("hdfs:///user/hadoop/input.txt")
rdd_file.take(5)

rdd_squared = rdd1.map(lambda x: x ** 2)
print(rdd_squared.collect())             # [1, 4, 9]

# Example operations:
rdd.count()                              # 3
rdd.foreach(lambda x: print(x))

# join: combine two pair RDDs on their key
pairs1 = sc.parallelize([("a", 1), ("b", 2)])
pairs2 = sc.parallelize([("a", 3), ("b", 4)])
print(pairs1.join(pairs2).collect())     # [('a', (1, 3)), ('b', (2, 4))]

# cache: keep the RDD in memory for reuse
rdd_cached = rdd.cache()
rdd_cached.collect()
WEEK-6
12. PySpark SparkConf - Attributes and applications
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().setAppName('CSVApp').setMaster('local')
spark = SparkSession.builder.config(conf=conf).getOrCreate()
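A minimal sketch of the rest of the experiment, assuming a hypothetical input file details.csv and
output location /tmp/details_copy:
# Read the CSV into a DataFrame using the session built from SparkConf
df = spark.read.csv("details.csv", header=True, inferSchema=True)
df.show()

# Write the DataFrame back out as CSV to another location (effectively moving the data)
df.write.mode("overwrite").csv("/tmp/details_copy", header=True)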