BDA Practical File

L. D. College of Engineering, Ahmedabad
Opp. Gujarat University, Navrangpura, Ahmedabad - 380015

LAB PRACTICALS

Branch: Computer Engineering
Term: 2022-23
INDEX

Sr. No.   CO     AIM                                                                                         Date
1         CO-2   Make a single node cluster in Hadoop.                                                       28-07-2022
2         CO-2   Run Word count program in Hadoop with 250 MB size of Data Set.
3         CO-2   Understand the Logs generated by MapReduce program.
4         CO-2   Run two different Data sets/Different size of Datasets on Hadoop and Compare the Logs.
5         CO-2   Develop Map Reduce Application to Sort a given file and do aggregation on some parameters.
6         CO-2   Download any two Big Data Sets from an authenticated website.
7         CO-3   Explore Spark and Implement Word count application using Spark.
8         CO-5   Creating the HDFS tables and loading them in Hive and learn joining of tables in Hive.
9         CO-3   Implementation of Matrix algorithms in Spark SQL programming.
10        CO-3   Create A Data Pipeline Based on Messaging Using PySpark and Hive - Covid-19 Analysis.
11        CO-3   Explore NoSQL database like MongoDB and perform basic CRUD operation.
12        CO-1   Case study based on the concept of Big Data Analytics.
Practical – 1
AIM: Make a single node cluster in Hadoop.
• To create a cluster in Hadoop, there are two different types of Hadoop installations:
1. Single node cluster Hadoop (one DataNode running, with the NameNode, DataNode,
ResourceManager, and NodeManager all set up on a single machine.)
2. Multi node cluster Hadoop (more than one DataNode running, with each DataNode
running on a different machine.)
• For this practical, a single node cluster Hadoop installation has been used.
• Installation of a single node cluster is the same as that of a multi node cluster. After installing
Hadoop, several configurations need to be done. They are as follows:
Step-2: Edit the hdfs-site.xml file (contains configuration settings of HDFS entities) under the
“hadoop-3.3.0\etc\hadoop” path and edit the property mentioned below inside the configuration
tag:
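A minimal sketch of such a property for a single node cluster (the value shown is the usual single node setting, given only as an illustration):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>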
Step-5: Edit the hadoop-env.sh file under the “hadoop-3.3.0\etc\hadoop” path and add the Java
path as mentioned below:
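For example (the JDK path is machine-specific and shown only as an illustration):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64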
Step-6: From the Hadoop directory, run the command bin/hadoop namenode -format (this command
formats HDFS via the NameNode and is executed only the first time. Formatting the file
system means initializing the directory specified by the dfs.name.dir variable.)
Step-7: Open the http://localhost:9870 URL in a browser (this helps to monitor all of the
information about nodes, memory management, resource utilization, etc.)
Practical – 2
AIM: Run Word count program in Hadoop with 250 MB size of Data Set.
• Pre-requisites:
▪ Java Installation - Check whether Java is installed or not.
▪ Hadoop Installation - Check whether Hadoop is installed or not.
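These can be verified from a terminal, for example:

java -version
hadoop version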
• In this practical, we find the frequency of each word that exists in the given text file.
File: WC_Mapper.java
package com.wordcount;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class WC_Mapper extends MapReduceBase implements
Mapper<LongWritable,Text,Text,IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value,OutputCollector<Text,IntWritable> output,
Reporter reporter) throws IOException{
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()){
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}
File: WC_Reducer.java
package com.wordcount;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class WC_Reducer extends MapReduceBase implements Reducer<Text,IntWritable,Text,IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text,IntWritable> output,
Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();   // sum the counts emitted for this word
}
output.collect(key, new IntWritable(sum));
}
}
File: WC_Runner.java
package com.wordcount;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class WC_Runner {
public static void main(String[] args) throws IOException{
JobConf conf = new JobConf(WC_Runner.class);
conf.setJobName("WordCount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(WC_Mapper.class);
conf.setCombinerClass(WC_Reducer.class);
conf.setReducerClass(WC_Reducer.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
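To run the job, the three classes are packaged into a jar and submitted to Hadoop; for example (the jar name and HDFS paths are illustrative):

hadoop jar wordcount.jar com.wordcount.WC_Runner /wordcount/input /wordcount/output
hadoop fs -cat /wordcount/output/part-00000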
Practical – 3
AIM: Understand the Logs generated by MapReduce program.
MapReduce service:
• Service instance log that contains details related to the MapReduce framework and
startup on the service side. Each service instance has a separate log, in addition to
separate logs for each task attempt.
$HADOOP_HOME/logs/user.pmr.service.application.index.log
MapReduce task:
• Task log that contains details relating to user-defined MapReduce code on the
service side. Each task records its task log messages (syslog), standard output
(stdout) and errors (stderr) to separate files in this directory. Knowing the job and
task IDs can be very useful when debugging MapReduce jobs.
- $HADOOP_HOME/logs/tasklogs/application/job_ID/task_taskID_att_attemptID/syslog
- $HADOOP_HOME/logs/tasklogs/application/job_ID/task_taskID_att_attemptID/stdout
- $HADOOP_HOME/logs/tasklogs/application/job_ID/task_taskID_att_attemptID/stderr
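On a plain Hadoop/YARN installation, the aggregated logs of a finished job can also be fetched from the command line, for example:

yarn logs -applicationId <application ID>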
Shuffle daemon:
• MRSS service log that captures messages, events, and errors related to the
MapReduce host.
$HADOOP_HOME/logs/mrss.hostname.log
MapReduce API:
• API log that captures details relating to API calls from EGO.
api.hostname.log (located in the directory where the job was submitted)
MapReduce client:
• MapReduce client messages are not recorded in a log file. You can only view them on the client
console.
Practical – 4
AIM: Run two different Data sets/Different size of Datasets on Hadoop and Compare the Logs.
First Dataset:
Program: Word Count
Size: 1MB
Conclusion:
• Running the same MapReduce job on two datasets of different sizes shows that the time
taken to run the job increases as the dataset gets bigger.
Practical – 5
AIM: Develop Map Reduce Application to Sort a given file and do aggregation on some parameters.
• First job (sum of donations by city):
• Map:
▪ Input: DonationWritable “full row” objects from the SequenceFile.
▪ Output: (city, total) pairs.
• Reduce:
▪ Sums the donation totals of each city and emits (city, sumtotal) pairs.
• Second job (order by descending sum):
• Map:
▪ Input: (city, sumtotal) pairs with summed total per city.
▪ Output: (sumtotal, city) inversed pair.
• Reduce:
▪ Identity reducer. Does not reduce anything, but the shuffling will sort
on keys for us.
• Code:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import data.writable.DonationWritable;

public class DonationsSumByCity {

    public static class CityDonationMapper extends Mapper<Text, DonationWritable, Text, FloatWritable> {

        private Text city = new Text();
        private FloatWritable total = new FloatWritable();

        @Override
        public void map(Text key, DonationWritable donation, Context context)
                throws IOException, InterruptedException {
            // Ignore rows where the donor is a teacher
            if ("t".equals(donation.donor_is_teacher)) {
                return;
            }
            // Transform city name to uppercase and write the (string, float) pair
            // (the DonationWritable field names used below are assumed)
            city.set(donation.donor_city.toUpperCase());
            total.set(donation.total);
            context.write(city, total);
        }
    }

    // Reducer (also used as combiner) that sums the float values of each city
    public static class FloatSumReducer extends Reducer<Text, FloatWritable, Text, FloatWritable> {

        private FloatWritable result = new FloatWritable();

        @Override
        public void reduce(Text key, Iterable<FloatWritable> values, Context context)
                throws IOException, InterruptedException {
            float sum = 0;
            for (FloatWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {

        Job job = Job.getInstance(new Configuration(), "Donations sum by city");
        job.setJarByClass(DonationsSumByCity.class);

        // Mapper configuration
        job.setMapperClass(CityDonationMapper.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FloatWritable.class);

        // Reducer configuration (use the reducer as combiner also, useful in cases of aggregation)
        job.setCombinerClass(FloatSumReducer.class);
        job.setReducerClass(FloatSumReducer.class);
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FloatWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
A couple of things to notice here regarding the Sequence File as the input:
• job.setInputFormatClass(SequenceFileInputFormat.class)
o Tell the job that we are reading a Sequence File.
• ... extends Mapper<Text, DonationWritable, Text, FloatWritable>
o The first two generic type parameters of the Mapper class should be the
input Key and Value types of Sequence File.
• map(Text key, DonationWritable donation, Context context)
o The parameters of the map method are directly the Writable objects. If we
were using the CSV input we would have a Text object as the second
parameter containing the CSV line, which we would have to split on
commas to obtain values.
Using a Combiner
• Since we are doing an aggregation task here, using our Reducer as a Combiner
by calling job.setCombinerClass(FloatSumReducer.class) improves performance. It
will start reducing the Mapper’s output during the map phase, which will result in
less data being shuffled and sent to the Reducer.
• Code:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
// Mapper that inverts each (city, sumtotal) text pair into a (sumtotal, city) pair (implementation assumed)
public static class InvertKeyValueMapper extends Mapper<Text, Text, FloatWritable, Text> {
    private FloatWritable total = new FloatWritable();
    @Override
    public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
        total.set(Float.parseFloat(value.toString()));
        context.write(total, key);
    }
}

// Comparator that sorts the FloatWritable keys (the summed totals) in descending order
// (class declaration and constructor assumed; the compare method is from the original listing)
public static class DescendingFloatComparator extends WritableComparator {
    protected DescendingFloatComparator() {
        super(FloatWritable.class, true);
    }
    @SuppressWarnings("rawtypes") @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        FloatWritable key1 = (FloatWritable) w1;
        FloatWritable key2 = (FloatWritable) w2;
        return -1 * key1.compareTo(key2);
    }
}

public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "Order By Sum Desc");
    job.setJarByClass(DonationsSumByCity.class);
    job.setMapperClass(InvertKeyValueMapper.class);
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    job.setMapOutputKeyClass(FloatWritable.class);
    job.setMapOutputValueClass(Text.class);
    // Identity reduce; the descending comparator makes the shuffle sort by sum in decreasing order
    // (the configuration below is assumed to mirror the first job)
    job.setSortComparatorClass(DescendingFloatComparator.class);
    job.setNumReduceTasks(1);
    job.setOutputKeyClass(FloatWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Here are the terminal commands for executing and viewing the outputs for these 2
MapReduce jobs:
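(The commands below are an illustrative sketch; the jar name, driver class names and HDFS paths will differ in practice.)

hadoop jar donations.jar DonationsSumByCity donations/sequencefile donations/output/sumbycity
hadoop fs -cat donations/output/sumbycity/part-r-00000 | head

hadoop jar donations.jar OrderBySumDesc donations/output/sumbycity donations/output/ordered
hadoop fs -cat donations/output/ordered/part-r-00000 | head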
• As expected, the output of the first job is a plain text list of <city,sum> ordered by city name.
The second job generates a list of <sum,city> sorted by descending sum.
• Execution times:
Practical – 6
AIM: Download any two Big Data Sets from an authenticated website.
1. Yelp Dataset:
• Website: https://www.yelp.com/dataset
2. Kaggle:
• Website: https://www.kaggle.com/datasets
Practical – 7
AIM: Explore Spark and Implement Word count application using Spark.
Apache Spark:
• Apache Spark is a lightning-fast cluster computing technology, designed for fast
computation. It is based on Hadoop MapReduce and it extends the MapReduce model to
efficiently use it for more types of computations, which includes interactive queries and
stream processing. The main feature of Spark is its in-memory cluster computing that
increases the processing speed of an application.
• Spark is designed to cover a wide range of workloads such as batch applications, iterative
algorithms, interactive queries and streaming. Apart from supporting all these workloads in a
single system, it reduces the management burden of maintaining separate tools.
3. Now, run the command below to open Spark in Scala mode.
$ spark-shell
• Now, we can read the generated result by using the following command.
5. Here, we split the existing data into individual words by using the following command.
scala> val splitdata = data.flatMap(line => line.split(" "));
Now, we can read the generated result by using the following command.
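The remaining word-count steps in a typical spark-shell session are sketched below (the input file name and variable names are illustrative):

scala> val data = sc.textFile("sfile.txt")
scala> val splitdata = data.flatMap(line => line.split(" "))
scala> val mapdata = splitdata.map(word => (word, 1))
scala> val reducedata = mapdata.reduceByKey(_ + _)
scala> reducedata.collect.foreach(println)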
Practical – 8
AIM: Creating the HDFS tables and loading them in Hive and learn joining of tables in Hive.
Move the text file from the local file system into the newly created HDFS folder called javachain:
javachain~hadoop]$ hadoop fs -put ~/Desktop/student.txt javachain/
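A Hive table can then be created and loaded from this HDFS path; a minimal sketch, assuming a simple comma-separated (id, name) layout for student.txt:

hive> CREATE TABLE student (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA INPATH 'javachain/student.txt' INTO TABLE student;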
JOIN
• The following query executes JOIN on the CUSTOMER and ORDER tables, and retrieves the
records:
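A sketch of such a query, assuming the tables are named CUSTOMERS and ORDERS with ID, NAME, AGE and OID, CUSTOMER_ID, AMOUNT columns (names assumed):

hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
    > FROM CUSTOMERS c JOIN ORDERS o
    > ON (c.ID = o.CUSTOMER_ID);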
• On successful execution of the query, you get to see the following response:
• A LEFT JOIN returns all the values from the left table, plus the matched values from the right
table, or NULL in case of no matching JOIN predicate.
• The following query demonstrates LEFT OUTER JOIN between CUSTOMER and ORDER
tables:
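Under the same assumed schema, a LEFT OUTER JOIN looks like:

hive> SELECT c.ID, c.NAME, o.AMOUNT
    > FROM CUSTOMERS c LEFT OUTER JOIN ORDERS o
    > ON (c.ID = o.CUSTOMER_ID);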
• On successful execution of the query, you get to see the following response :
• The following query demonstrates RIGHT OUTER JOIN between the CUSTOMER and
ORDER tables.
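And a RIGHT OUTER JOIN under the same assumed schema:

hive> SELECT c.ID, c.NAME, o.AMOUNT
    > FROM CUSTOMERS c RIGHT OUTER JOIN ORDERS o
    > ON (c.ID = o.CUSTOMER_ID);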
• On successful execution of the query, you get to see the following response:
Practical – 9
AIM: Implementation of Matrix algorithms in Spark SQL programming.
• Code:
Matrix_multiply.py
import sys
from pyspark import SparkConf, SparkContext

def permutation(row):
    # Flattened outer product of the row with itself
    # (the loop structure is assumed; only the append line survives in the original listing)
    rowPermutation = []
    for element in row:
        for e in range(len(row)):
            rowPermutation.append(float(element) * float(row[e]))
    return rowPermutation

def main():
    input = sys.argv[1]
    output = sys.argv[2]
    conf = SparkConf().setAppName('Matrix Multiplication')
    sc = SparkContext(conf=conf)
    assert sc.version >= '1.5.1'
    # Build the outer product of every row with itself and add the flattened
    # results element-wise, giving A-transpose * A (this step is assumed)
    rows = sc.textFile(input).map(lambda line: line.split())
    result = rows.map(permutation).reduce(lambda a, b: [x + y for x, y in zip(a, b)])
    # outputResult = sc.parallelize(result).coalesce(1)
    # outputResult.saveAsTextFile(output)

if __name__ == "__main__":
    main()
matrix_multiply_sparse.py
import sys
import operator
from pyspark import SparkConf, SparkContext
from scipy.sparse import csr_matrix

def createCSRMatrix(input):
    # Build a 1x100 sparse row; the "index:value" token format is assumed
    row = []
    col = []
    data = []
    for token in input:
        if ':' not in token:
            continue
        index, value = token.split(':')
        row.append(0)
        col.append(int(index))
        data.append(float(value))
    return csr_matrix((data, (row, col)), shape=(1, 100))

def multiplyMatrix(csrMatrix):
    csrTransponse = csrMatrix.transpose(copy=True)
    return (csrTransponse * csrMatrix)

def formatOutput(indexValuePairs):
    # Render each (column, value) pair as "column:value" (output format assumed)
    return ' '.join(str(c) + ':' + str(v) for c, v in indexValuePairs)

def main():
    input = sys.argv[1]
    output = sys.argv[2]
    conf = SparkConf().setAppName('Sparse Matrix Multiplication')
    sc = SparkContext(conf=conf)
    assert sc.version >= '1.5.1'
    sparseMatrix = sc.textFile(input).map(lambda row: row.split(' ')) \
        .map(createCSRMatrix).map(multiplyMatrix).reduce(operator.add)
    outputFile = open(output, 'w')
    # One output line per row of the product matrix
    # (the loop over rows is assumed from the indptr arithmetic in the original listing)
    for row in range(sparseMatrix.shape[0]):
        col = sparseMatrix.indices[sparseMatrix.indptr[row]:sparseMatrix.indptr[row + 1]]
        data = sparseMatrix.data[sparseMatrix.indptr[row]:sparseMatrix.indptr[row + 1]]
        indexValuePairs = zip(col, data)
        formattedOutput = formatOutput(indexValuePairs)
        outputFile.write(formattedOutput + '\n')
    outputFile.close()

if __name__ == "__main__":
    main()
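Both scripts are submitted with spark-submit; for example (the input and output paths are illustrative):

spark-submit --master local[*] matrix_multiply_sparse.py matrix_input matrix_output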
Practical – 10
AIM: Create A Data Pipeline Based on Messaging Using PySpark and Hive -Covid-19 Analysis.
Building data pipeline for Covid-19 data analysis using Bigdata technologies and Tableau
• The purpose is to collect the real-time streaming data from the COVID-19 open API every
5 minutes into the ecosystem using NiFi, process it, and store it in the data lake on
AWS.
• Data processing includes parsing the data from complex JSON format to csv format then
publishing to Kafka for persistent delivery of messages into PySpark for further
processing.
• The processed data is then fed into an output Kafka topic which is in turn consumed by NiFi
and stored in HDFS.
• A Hive external table is created on top of the processed data in HDFS, and the process is
orchestrated using Airflow to run at every time interval. Finally, KPIs are visualised in
Tableau.
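As an illustration of the Kafka-to-PySpark step described above, a minimal Structured Streaming sketch might look like this (the broker address, topic names, schema and checkpoint path are all assumptions, not the project's actual code):

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, IntegerType

spark = SparkSession.builder.appName("covid19-pipeline").enableHiveSupport().getOrCreate()

# Assumed schema of the parsed COVID-19 records
schema = (StructType()
          .add("state", StringType())
          .add("confirmed", IntegerType())
          .add("recovered", IntegerType())
          .add("deaths", IntegerType()))

# Read the raw JSON messages that NiFi publishes to the input Kafka topic
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "covid19_raw")
       .load())

parsed = raw.select(from_json(col("value").cast("string"), schema).alias("r")).select("r.*")

# Write the processed records to the output Kafka topic that NiFi consumes and lands in HDFS
query = (parsed.selectExpr("to_json(struct(*)) AS value")
         .writeStream.format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "covid19_processed")
         .option("checkpointLocation", "/tmp/covid19_checkpoint")
         .start())
query.awaitTermination()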
• Data Architecture:
• Tools used:
1. NiFi: nifi-1.10.0
2. Hadoop: hadoop-2.7.3
3. Hive: apache-hive-2.1.0
4. Spark: spark-2.4.5
5. Zookeeper: zookeeper-2.3.5
6. Kafka: kafka_2.11-2.4.0
7. Airflow: airflow-1.8.1
8. Tableau
Practical – 11
AIM: Explore NoSQL database like MongoDB and perform basic CRUD operation.
1. Create Operation
2. Update Operations
3. Read Operations
4. Delete Operation
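For example, on a hypothetical students collection in the mongo shell (collection name and fields are illustrative):

> use college
> db.students.insertOne({ name: "Raj", branch: "CE", sem: 7 })       // Create
> db.students.find({ branch: "CE" }).pretty()                        // Read
> db.students.updateOne({ name: "Raj" }, { $set: { sem: 8 } })       // Update
> db.students.deleteOne({ name: "Raj" })                             // Delete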
Practical – 12
AIM: Case study based on the concept of Big Data Analytics. Prepare a presentation in a group of 4 and submit the PPTs.
1. Walmart:
• Walmart is the largest retailer in the world and the world’s largest company by revenue,
with more than 2 million employees and 20000 stores in 28 countries.
• It started making use of big data analytics much before the term Big Data came into the
picture. Walmart uses Data Mining to discover patterns that can be used to provide
product recommendations to the user, based on which products were bought together.
• Walmart by applying effective Data Mining has increased its conversion rate of
customers. It has been speeding along big data analysis to provide best-in-class e-
commerce technologies with a motive to deliver superior customer experience.
• The main objective of holding big data at Walmart is to optimize the shopping
experience of customers when they are in a Walmart store.
• Big data solutions at Walmart are developed with the intent of redesigning global
websites and building innovative applications to customize the shopping experience for
customers whilst increasing logistics efficiency.
• Hadoop and NoSQL technologies are used to provide internal customers with access to
real-time data collected from different sources and centralized for effective use.
2. Uber:
• Uber is the first choice for people around the world when they think of moving people
and making deliveries.
• It uses the personal data of the user to closely monitor which features of the service are
most used, to analyze usage patterns and to determine where the services should be
more focused.
• Uber focuses on the supply and demand of the services, due to which the prices of the
services provided change. Therefore, one of Uber’s biggest uses of data is surge
pricing.
• For instance, if you are running late for an appointment and you book a cab in a
crowded place then you must be ready to pay twice the amount.
• For example, on New Year’s Eve, the price of driving one mile can go from 200 to
1000. In the short term, surge pricing affects the rate of demand, while long-term use
could be the key to retaining or losing customers. Machine learning algorithms are
used to determine where the demand is strong.
3. Netflix:
4. eBay:
• A big technical challenge for eBay as a data-intensive business is to exploit a system that
can rapidly analyze and act on data as it arrives (streaming data).
• There are many rapidly evolving methods to support streaming data analysis. eBay is
working with several tools including Apache Spark, Storm and Kafka.
• It allows the company’s data analysts to search for information tags that have been
associated with the data (metadata) and make it consumable to as many people as
possible with the right level of security and permissions (data governance).
• The company has been at the forefront of using big data solutions and actively
contributes its knowledge back to the open-source community.
5. Procter & Gamble:
• Procter & Gamble, whose products we all use 2-3 times a day, is a 179-year-old
company. The company has recognized the potential of Big Data and put it to
use in business units around the globe.
• P&G has put a strong emphasis on using big data to make better, smarter, real-time
business decisions. The Global Business Services organization has developed tools,
systems, and processes to provide managers with direct access to the latest data and
advanced analytics.
• Therefore P&G, despite being one of the oldest companies, still holds a great share in the
market even with many emerging competitors.