BDA Practical File

L. D. College of Engineering, Ahmedabad
Opp. Gujarat University, Navrangpura, Ahmedabad - 380015

LAB PRACTICALS

Branch: Computer Engineering
Term: 2022-23
INDEX

Sr. No.   CO     AIM                                                                                         Date
1         CO-2   Make a single node cluster in Hadoop.                                                       28-07-2022
2         CO-2   Run Word count program in Hadoop with 250 MB size of Data Set.
3         CO-2   Understand the Logs generated by MapReduce program.
4         CO-2   Run two different Data sets/Different size of Datasets on Hadoop and Compare the Logs.
5         CO-2   Develop Map Reduce Application to Sort a given file and do aggregation on some parameters.
6         CO-2   Download any two Big Data Sets from an authenticated website.
7         CO-3   Explore Spark and Implement Word count application using Spark.
8         CO-5   Creating the HDFS tables and loading them in Hive and learn joining of tables in Hive.
9         CO-3   Implementation of Matrix algorithms in Spark SQL programming.
10        CO-3   Create A Data Pipeline Based on Messaging Using PySpark and Hive - Covid-19 Analysis.
11        CO-3   Explore NoSQL database like MongoDB and perform basic CRUD operation.
12        CO-1   Case study based on the concept of Big Data Analytics.
Practical – 1
AIM: Make a single node cluster in Hadoop.
• To create a cluster in Hadoop, there are two different types of Hadoop installations:
1. Single node cluster Hadoop (one DataNode running, with the NameNode, DataNode,
ResourceManager, and NodeManager all set up on a single machine.)
2. Multi node cluster Hadoop (more than one DataNode running, with each DataNode
running on a different machine.)
• For this practical, a single node cluster Hadoop installation has been used.
• Installation of a single node cluster is the same as that of a multi node cluster. After installing
Hadoop, several configurations need to be done. They are as follows:
Step-2: Edit the hdfs-site.xml file (contains configuration settings of HDFS entities) under the
“hadoop-3.3.0\etc\hadoop” path and edit the property mentioned below inside the configuration
tag:
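A minimal sketch of such a property for a single node cluster (the value shown is the usual single node setting, given only as an illustration):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>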
Step-5: Edit the hadoop-env.sh file under the “hadoop-3.3.0\etc\hadoop” path and add the Java
path as mentioned below:
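For example (the JDK path is machine-specific and shown only as an illustration):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64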
Step-6: From the Hadoop directory, run the command bin/hadoop namenode -format (this command
formats HDFS via the NameNode and is executed only the first time. Formatting the file
system means initializing the directory specified by the dfs.name.dir variable.)
Step-7: Open the http://localhost:9870 URL in a browser (this helps to monitor all of the
information about nodes, memory management, resource utilization, etc.)
Practical – 2
AIM: Run Word count program in Hadoop with 250 MB size of Data Set.
• Pre-requisites:
▪ Java Installation - Check whether Java is installed or not.
▪ Hadoop Installation - Check whether Hadoop is installed or not.
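These can be verified from a terminal, for example:

java -version
hadoop version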
• In this practical, we find the frequency of each word that exists in the given text file.
File: WC_Mapper.java
package com.wordcount;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class WC_Mapper extends MapReduceBase implements
Mapper<LongWritable,Text,Text,IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value,OutputCollector<Text,IntWritable> output,
Reporter reporter) throws IOException{
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()){
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}
File: WC_Reducer.java
package com.wordcount;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class WC_Reducer extends MapReduceBase implements Reducer<Text,IntWritable,Text,IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text,IntWritable> output,
Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();   // sum the counts emitted for this word
}
output.collect(key, new IntWritable(sum));
}
}
File: WC_Runner.java
package com.wordcount;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class WC_Runner {
public static void main(String[] args) throws IOException{
JobConf conf = new JobConf(WC_Runner.class);
conf.setJobName("WordCount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(WC_Mapper.class);
conf.setCombinerClass(WC_Reducer.class);
conf.setReducerClass(WC_Reducer.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
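To run the job, the three classes are packaged into a jar and submitted to Hadoop; for example (the jar name and HDFS paths are illustrative):

hadoop jar wordcount.jar com.wordcount.WC_Runner /wordcount/input /wordcount/output
hadoop fs -cat /wordcount/output/part-00000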
Practical – 3
AIM: Understand the Logs generated by MapReduce program.
MapReduce service:
• Service instance log that contains details related to the MapReduce framework and
startup on the service side. Each service instance has a separate log, in addition to
separate logs for each task attempt.
$HADOOP_HOME/logs/user.pmr.service.application.index.log
MapReduce task:
• Task log that contains details relating to user-defined MapReduce code on the
service side. Each task records its task log messages (syslog), standard output
(stdout) and errors (stderr) to separate files in this directory. Knowing the job and
task IDs can be very useful when debugging MapReduce jobs.
- $HADOOP_HOME/logs/tasklogs/application/job_ID/task_taskID_att_attemptID/syslog
- $HADOOP_HOME/logs/tasklogs/application/job_ID/task_taskID_att_attemptID/stdout
- $HADOOP_HOME/logs/tasklogs/application/job_ID/task_taskID_att_attemptID/stderr
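On a plain Hadoop/YARN installation, the aggregated logs of a finished job can also be fetched from the command line, for example:

yarn logs -applicationId <application ID>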
Shuffle daemon:
• MRSS service log that captures messages, events, and errors related to the
MapReduce host.
$HADOOP_HOME/logs/mrss.hostname.log
MapReduce API:
• API log that captures details relating to API calls from EGO.
api.hostname.log (located in the directory where the job was submitted)
MapReduce client:
• MapReduce client messages are not recorded in a log file. You can only view them on the client
console.
Practical – 4
AIM: Run two different Data sets/Different size of Datasets on Hadoop and Compare the Logs.
First Dataset:
Program: Word Count
Size: 1MB
Conclusion:
• Running the same MapReduce job on two datasets of different sizes shows that the time
taken to run the job increases as the dataset gets bigger.
Practical – 5
AIM: Develop Map Reduce Application to Sort a given file and do aggregation on some parameters.
• First job (sum of donations by city):
• Map:
▪ Input: DonationWritable “full row” objects from the SequenceFile.
▪ Output: (city, total) pairs.
• Reduce:
▪ Sums the donation totals of each city and emits (city, sumtotal) pairs.
• Second job (order by descending sum):
• Map:
▪ Input: (city, sumtotal) pairs with summed total per city.
▪ Output: (sumtotal, city) inversed pair.
• Reduce:
▪ Identity reducer. Does not reduce anything, but the shuffling will sort
on keys for us.
• Code:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import data.writable.DonationWritable;

public class DonationsSumByCity {

    public static class CityDonationMapper extends Mapper<Text, DonationWritable, Text, FloatWritable> {

        private Text city = new Text();
        private FloatWritable total = new FloatWritable();

        @Override
        public void map(Text key, DonationWritable donation, Context context)
                throws IOException, InterruptedException {
            // Ignore rows where the donor is a teacher
            if ("t".equals(donation.donor_is_teacher)) {
                return;
            }
            // Transform city name to uppercase and write the (string, float) pair
            // (the DonationWritable field names used below are assumed)
            city.set(donation.donor_city.toUpperCase());
            total.set(donation.total);
            context.write(city, total);
        }
    }

    // Reducer (also used as combiner) that sums the float values of each city
    public static class FloatSumReducer extends Reducer<Text, FloatWritable, Text, FloatWritable> {

        private FloatWritable result = new FloatWritable();

        @Override
        public void reduce(Text key, Iterable<FloatWritable> values, Context context)
                throws IOException, InterruptedException {
            float sum = 0;
            for (FloatWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {

        Job job = Job.getInstance(new Configuration(), "Donations sum by city");
        job.setJarByClass(DonationsSumByCity.class);

        // Mapper configuration
        job.setMapperClass(CityDonationMapper.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FloatWritable.class);

        // Reducer configuration (use the reducer as combiner also, useful in cases of aggregation)
        job.setCombinerClass(FloatSumReducer.class);
        job.setReducerClass(FloatSumReducer.class);
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FloatWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
A couple of things to notice here regarding the Sequence File as the input:
• job.setInputFormatClass(SequenceFileInputFormat.class)
o Tell the job that we are reading a Sequence File.
• ... extends Mapper<Text, DonationWritable, Text, FloatWritable>
o The first two generic type parameters of the Mapper class should be the
input Key and Value types of Sequence File.
• map(Text key, DonationWritable donation, Context context)
o The parameters of the map method are directly the Writable objects. If we
were using the CSV input we would have a Text object as the second
parameter containing the CSV line, which we would have to split on
commas to obtain values.
Using a Combiner
• Since we are doing an aggregation task here, using our Reducer as a Combiner
by calling job.setCombinerClass(FloatSumReducer.class) improves performance. It
will start reducing the Mapper’s output during the map phase, which will result in
less data being shuffled and sent to the Reducer.
• Code:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
// Mapper that inverts each (city, sumtotal) text pair into a (sumtotal, city) pair (implementation assumed)
public static class InvertKeyValueMapper extends Mapper<Text, Text, FloatWritable, Text> {
    private FloatWritable total = new FloatWritable();
    @Override
    public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
        total.set(Float.parseFloat(value.toString()));
        context.write(total, key);
    }
}

// Comparator that sorts the FloatWritable keys (the summed totals) in descending order
// (class declaration and constructor assumed; the compare method is from the original listing)
public static class DescendingFloatComparator extends WritableComparator {
    protected DescendingFloatComparator() {
        super(FloatWritable.class, true);
    }
    @SuppressWarnings("rawtypes") @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        FloatWritable key1 = (FloatWritable) w1;
        FloatWritable key2 = (FloatWritable) w2;
        return -1 * key1.compareTo(key2);
    }
}

public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "Order By Sum Desc");
    job.setJarByClass(DonationsSumByCity.class);
    job.setMapperClass(InvertKeyValueMapper.class);
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    job.setMapOutputKeyClass(FloatWritable.class);
    job.setMapOutputValueClass(Text.class);
    // Identity reduce; the descending comparator makes the shuffle sort by sum in decreasing order
    // (the configuration below is assumed to mirror the first job)
    job.setSortComparatorClass(DescendingFloatComparator.class);
    job.setNumReduceTasks(1);
    job.setOutputKeyClass(FloatWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Here are the terminal commands for executing and viewing the outputs for these 2
MapReduce jobs:
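(The commands below are an illustrative sketch; the jar name, driver class names and HDFS paths will differ in practice.)

hadoop jar donations.jar DonationsSumByCity donations/sequencefile donations/output/sumbycity
hadoop fs -cat donations/output/sumbycity/part-r-00000 | head

hadoop jar donations.jar OrderBySumDesc donations/output/sumbycity donations/output/ordered
hadoop fs -cat donations/output/ordered/part-r-00000 | head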
• As expected, the output of the first job is a plain text list of <city,sum> ordered by city name.
The second job generates a list of <sum,city> sorted by descending sum.
• Execution times:
Practical – 6
AIM: Download any two Big Data Sets from an authenticated website.
1. Yelp Dataset:
• Website: https://www.yelp.com/dataset
2. Kaggle:
• Website: https://www.kaggle.com/datasets
Practical – 7
AIM: Explore Spark and Implement Word count application using Spark.
Apache Spark:
• Apache Spark is a lightning-fast cluster computing technology, designed for fast
computation. It is based on Hadoop MapReduce and it extends the MapReduce model to
efficiently use it for more types of computations, which includes interactive queries and
stream processing. The main feature of Spark is its in-memory cluster computing that
increases the processing speed of an application.
• Spark is designed to cover a wide range of workloads such as batch applications, iterative
algorithms, interactive queries and streaming. Apart from supporting all these workloads in a
single system, it reduces the management burden of maintaining separate tools.
3. Now, run the command below to open Spark in Scala mode.
$ spark-shell
• Now, we can read the generated result by using the following command.
5. Here, we split the existing data into individual words by using the following command.
scala> val splitdata = data.flatMap(line => line.split(" "));
Now, we can read the generated result by using the following command.
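The remaining word-count steps in a typical spark-shell session are sketched below (the input file name and variable names are illustrative):

scala> val data = sc.textFile("sfile.txt")
scala> val splitdata = data.flatMap(line => line.split(" "))
scala> val mapdata = splitdata.map(word => (word, 1))
scala> val reducedata = mapdata.reduceByKey(_ + _)
scala> reducedata.collect.foreach(println)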
Practical – 8
AIM: Creating the HDFS tables and loading them in Hive and learn joining of tables in Hive.
Move the text file from the local file system into the newly created HDFS folder called javachain:
javachain~hadoop]$ hadoop fs -put ~/Desktop/student.txt javachain/
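A Hive table can then be created and loaded from this HDFS path; a minimal sketch, assuming a simple comma-separated (id, name) layout for student.txt:

hive> CREATE TABLE student (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA INPATH 'javachain/student.txt' INTO TABLE student;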
JOIN
• The following query executes JOIN on the CUSTOMER and ORDER tables, and retrieves the
records:
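A sketch of such a query, assuming the tables are named CUSTOMERS and ORDERS with ID, NAME, AGE and OID, CUSTOMER_ID, AMOUNT columns (names assumed):

hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
    > FROM CUSTOMERS c JOIN ORDERS o
    > ON (c.ID = o.CUSTOMER_ID);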
• On successful execution of the query, you get to see the following response:
• A LEFT JOIN returns all the values from the left table, plus the matched values from the right
table, or NULL in case of no matching JOIN predicate.
• The following query demonstrates LEFT OUTER JOIN between CUSTOMER and ORDER
tables:
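Under the same assumed schema, a LEFT OUTER JOIN looks like:

hive> SELECT c.ID, c.NAME, o.AMOUNT
    > FROM CUSTOMERS c LEFT OUTER JOIN ORDERS o
    > ON (c.ID = o.CUSTOMER_ID);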
• On successful execution of the query, you get to see the following response :
• The following query demonstrates RIGHT OUTER JOIN between the CUSTOMER and
ORDER tables.
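And a RIGHT OUTER JOIN under the same assumed schema:

hive> SELECT c.ID, c.NAME, o.AMOUNT
    > FROM CUSTOMERS c RIGHT OUTER JOIN ORDERS o
    > ON (c.ID = o.CUSTOMER_ID);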
• On successful execution of the query, you get to see the following response:
Practical – 9
AIM: Implementation of Matrix algorithms in Spark SQL programming.
• Code:
Matrix_multiply.py
import sys
from pyspark import SparkConf, SparkContext

def permutation(row):
    # Flattened outer product of the row with itself
    # (the loop structure is assumed; only the append line survives in the original listing)
    rowPermutation = []
    for element in row:
        for e in range(len(row)):
            rowPermutation.append(float(element) * float(row[e]))
    return rowPermutation

def main():
    input = sys.argv[1]
    output = sys.argv[2]
    conf = SparkConf().setAppName('Matrix Multiplication')
    sc = SparkContext(conf=conf)
    assert sc.version >= '1.5.1'
    # Build the outer product of every row with itself and add the flattened
    # results element-wise, giving A-transpose * A (this step is assumed)
    rows = sc.textFile(input).map(lambda line: line.split())
    result = rows.map(permutation).reduce(lambda a, b: [x + y for x, y in zip(a, b)])
    # outputResult = sc.parallelize(result).coalesce(1)
    # outputResult.saveAsTextFile(output)

if __name__ == "__main__":
    main()
matrix_multiply_sparse.py
import sys
import operator
from pyspark import SparkConf, SparkContext
from scipy.sparse import csr_matrix

def createCSRMatrix(input):
    # Build a 1x100 sparse row; the "index:value" token format is assumed
    row = []
    col = []
    data = []
    for token in input:
        if ':' not in token:
            continue
        index, value = token.split(':')
        row.append(0)
        col.append(int(index))
        data.append(float(value))
    return csr_matrix((data, (row, col)), shape=(1, 100))

def multiplyMatrix(csrMatrix):
    csrTransponse = csrMatrix.transpose(copy=True)
    return (csrTransponse * csrMatrix)

def formatOutput(indexValuePairs):
    # Render each (column, value) pair as "column:value" (output format assumed)
    return ' '.join(str(c) + ':' + str(v) for c, v in indexValuePairs)

def main():
    input = sys.argv[1]
    output = sys.argv[2]
    conf = SparkConf().setAppName('Sparse Matrix Multiplication')
    sc = SparkContext(conf=conf)
    assert sc.version >= '1.5.1'
    sparseMatrix = sc.textFile(input).map(lambda row: row.split(' ')) \
        .map(createCSRMatrix).map(multiplyMatrix).reduce(operator.add)
    outputFile = open(output, 'w')
    # One output line per row of the product matrix
    # (the loop over rows is assumed from the indptr arithmetic in the original listing)
    for row in range(sparseMatrix.shape[0]):
        col = sparseMatrix.indices[sparseMatrix.indptr[row]:sparseMatrix.indptr[row + 1]]
        data = sparseMatrix.data[sparseMatrix.indptr[row]:sparseMatrix.indptr[row + 1]]
        indexValuePairs = zip(col, data)
        formattedOutput = formatOutput(indexValuePairs)
        outputFile.write(formattedOutput + '\n')
    outputFile.close()

if __name__ == "__main__":
    main()
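Both scripts are submitted with spark-submit; for example (the input and output paths are illustrative):

spark-submit --master local[*] matrix_multiply_sparse.py matrix_input matrix_output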
Practical – 10
AIM: Create A Data Pipeline Based on Messaging Using PySpark and Hive -Covid-19 Analysis.
Building data pipeline for Covid-19 data analysis using Bigdata technologies and Tableau
• The purpose is to collect the real-time streaming data from the COVID-19 open API every
5 minutes into the ecosystem using NiFi, process it, and store it in the data lake on
AWS.
• Data processing includes parsing the data from complex JSON format to csv format then
publishing to Kafka for persistent delivery of messages into PySpark for further
processing.
• The processed data is then fed into an output Kafka topic which is in turn consumed by NiFi
and stored in HDFS.
• A Hive external table is created on top of the processed data in HDFS, and the process is
orchestrated using Airflow to run at every time interval. Finally, KPIs are visualised in
Tableau.
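As an illustration of the Kafka-to-PySpark step described above, a minimal Structured Streaming sketch might look like this (the broker address, topic names, schema and checkpoint path are all assumptions, not the project's actual code):

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, IntegerType

spark = SparkSession.builder.appName("covid19-pipeline").enableHiveSupport().getOrCreate()

# Assumed schema of the parsed COVID-19 records
schema = (StructType()
          .add("state", StringType())
          .add("confirmed", IntegerType())
          .add("recovered", IntegerType())
          .add("deaths", IntegerType()))

# Read the raw JSON messages that NiFi publishes to the input Kafka topic
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "covid19_raw")
       .load())

parsed = raw.select(from_json(col("value").cast("string"), schema).alias("r")).select("r.*")

# Write the processed records to the output Kafka topic that NiFi consumes and lands in HDFS
query = (parsed.selectExpr("to_json(struct(*)) AS value")
         .writeStream.format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "covid19_processed")
         .option("checkpointLocation", "/tmp/covid19_checkpoint")
         .start())
query.awaitTermination()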
• Data Architecture:
• Tools used:
1. NiFi: nifi-1.10.0
2. Hadoop: hadoop-2.7.3
3. Hive: apache-hive-2.1.0
4. Spark: spark-2.4.5
5. Zookeeper: zookeeper-2.3.5
6. Kafka: kafka_2.11-2.4.0
7. Airflow: airflow-1.8.1
8. Tableau
Practical – 11
AIM: Explore NoSQL database like MongoDB and perform basic CRUD operation.
1. Create Operation
2. Update Operations
3. Read Operations
4. Delete Operation
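For example, on a hypothetical students collection in the mongo shell (collection name and fields are illustrative):

> use college
> db.students.insertOne({ name: "Raj", branch: "CE", sem: 7 })       // Create
> db.students.find({ branch: "CE" }).pretty()                        // Read
> db.students.updateOne({ name: "Raj" }, { $set: { sem: 8 } })       // Update
> db.students.deleteOne({ name: "Raj" })                             // Delete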
Practical – 12
AIM: Case study based on the concept of Big Data Analytics. Prepare a presentation in a group of 4 and submit the PPTs.
1. Walmart:
• Walmart is the largest retailer in the world and the world’s largest company by revenue,
with more than 2 million employees and 20000 stores in 28 countries.
• It started making use of big data analytics much before the term Big Data came into the
picture. Walmart uses Data Mining to discover patterns that can be used to provide
product recommendations to the user, based on which products were bought together.
• Walmart by applying effective Data Mining has increased its conversion rate of
customers. It has been speeding along big data analysis to provide best-in-class e-
commerce technologies with a motive to deliver superior customer experience.
• The main objective of holding big data at Walmart is to optimize the shopping
experience of customers when they are in a Walmart store.
• Big data solutions at Walmart are developed with the intent of redesigning global
websites and building innovative applications to customize the shopping experience for
customers whilst increasing logistics efficiency.
• Hadoop and NoSQL technologies are used to provide internal customers with access to
real-time data collected from different sources and centralized for effective use.
2. Uber:
• Uber is the first choice for people around the world when they think of moving people
and making deliveries.
• It uses the personal data of the user to closely monitor which features of the service are
most used, to analyze usage patterns and to determine where the services should be
more focused.
• Uber focuses on the supply and demand of the services, due to which the prices of the
services provided change. Therefore, one of Uber’s biggest uses of data is surge
pricing.
• For instance, if you are running late for an appointment and you book a cab in a
crowded place then you must be ready to pay twice the amount.
• For example, on New Year’s Eve, the price of driving one mile can go from 200 to
1000. In the short term, surge pricing affects the rate of demand, while long-term use
could be the key to retaining or losing customers. Machine learning algorithms are
used to determine where the demand is strong.
3. Netflix:
4. eBay:
• A big technical challenge for eBay as a data-intensive business is to exploit a system that
can rapidly analyze and act on data as it arrives (streaming data).
• There are many rapidly evolving methods to support streaming data analysis. eBay is
working with several tools including Apache Spark, Storm and Kafka.
• It allows the company’s data analysts to search for information tags that have been
associated with the data (metadata) and make it consumable to as many people as
possible with the right level of security and permissions (data governance).
• The company has been at the forefront of using big data solutions and actively
contributes its knowledge back to the open-source community.
5. Procter & Gamble:
• Procter & Gamble, whose products we all use 2-3 times a day, is a 179-year-old
company. The company has recognized the potential of Big Data and put it to
use in business units around the globe.
• P&G has put a strong emphasis on using big data to make better, smarter, real-time
business decisions. The Global Business Services organization has developed tools,
systems, and processes to provide managers with direct access to the latest data and
advanced analytics.
• Therefore P&G, despite being one of the oldest companies, still holds a great share in the
market even with many emerging competitors.