BIG DATA ANALYTICS
LABORATORY MANUAL
(R22A0590)
B. TECH III YEAR-II SEM
(2024-25)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
MALLA REDDY COLLEGE OF ENGINEERING & TECHNOLOGY
(Autonomous Institution – UGC, Govt. of India)
(Affiliated to JNTUH, Hyderabad, Approved by AICTE - Accredited by NBA & NAAC - 'A' Grade - ISO 9001:2015 Certified)
Maisammaguda, Dhulapally (Post Via. Hakimpet), Secunderabad – 500100, Telangana State, INDIA.
DEPARTMENT OF COMPUTER SCIENCE AND
ENGINEERING
Vision
To impart quality education and instill a high standard of discipline, making the
students technologically superior and ethically strong, thereby improving
the quality of life of the human race.
Mission
To achieve and impart holistic technical education using the best infrastructure and
outstanding technical and teaching expertise, to develop the students into competent and
confident engineers.
Evolving a centre of excellence through creative and innovative teaching-learning
practices for promoting academic achievement to produce internationally accepted,
competitive and world-class professionals.
PROGRAMME EDUCATIONAL OBJECTIVES (PEOs)
PEO1 – ANALYTICAL SKILLS
To facilitate the graduates with the ability to visualize, gather information, articulate, analyze,
solve complex problems, and make decisions. These are essential to address the challenges of
complex and computation-intensive problems, increasing their productivity.
PEO2 – TECHNICAL SKILLS
To facilitate the graduates with the technical skills that prepare them for immediate employment
and to pursue certification, providing a deeper understanding of the technology in advanced areas of
computer science and related fields, thus encouraging them to pursue higher education and research
based on their interest.
PEO3 – SOFT SKILLS
To facilitate the graduates with the soft skills that include fulfilling the mission, setting goals,
showing self-confidence by communicating effectively, having a positive attitude, getting involved
in team-work, being a leader, and managing their career and their life.
PEO4 – PROFESSIONAL ETHICS
To facilitate the graduates with the knowledge of professional and ethical responsibilities by
paying attention to grooming, being conservative with style, following dress codes, safety codes,
and adapting themselves to technological advancements.
PROGRAM SPECIFIC OUTCOMES (PSOs)
After the completion of the course, B. Tech Computer Science and Engineering, the graduates will
have the following Program Specific Outcomes:
1. Fundamentals and critical knowledge of the Computer System: Able to understand the
working principles of the computer system and its components; apply this knowledge to build,
assess, and analyze the software and hardware aspects of it.
2. The comprehensive and applicative knowledge of Software Development: Comprehensive
skills in programming languages, software process models and methodologies, and the ability to plan,
develop, test, analyze, and manage software and hardware intensive systems on
heterogeneous platforms, individually or working in teams.
3. Applications of Computing Domain & Research: Able to use the professional, managerial,
interdisciplinary skill set, and domain specific tools in development processes, identify the
research gaps, and provide innovative solutions to them.
PROGRAM OUTCOMES (POs)
Engineering Graduates should possess the following:
1. Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering
problems.
2. Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of mathematics,
natural sciences, and engineering sciences.
3. Design / development of solutions: Design solutions for complex engineering problems and
design system components or processes that meet the specified needs with appropriate
consideration for the public health and safety, and the cultural, societal, and environmental
considerations.
4. Conduct investigations of complex problems: Use research-based knowledge and research
methods including design of experiments, analysis and interpretation of data, and synthesis of the
information to provide valid conclusions.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modeling to complex engineering activities
with an understanding of the limitations.
6. The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to
the professional engineering practice.
7. Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for
sustainable development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
9. Individual and team work: Function effectively as an individual, and as a member or leader in
diverse teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and write
effective reports and design documentation, make effective presentations, and give and receive
clear instructions.
11. Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one's own work, as a member and
leader in a team, to manage projects and in multidisciplinary environments.
12. Life-long learning: Recognize the need for, and have the preparation and ability to engage
in independent and life-long learning in the broadest context of technological change.
MALLA REDDY COLLEGE OF ENGINEERING & TECHNOLOGY
Maisammaguda, Dhulapally Post, Via Hakimpet, Secunderabad – 500100
GENERAL LABORATORY INSTRUCTIONS
1. Students are advised to come to the laboratory at least 5 minutes before the
starting time; those who come after 5 minutes will not be allowed into the lab.
2. Plan your task properly well before the commencement; come prepared to the lab
with the synopsis / program / experiment details.
3. Students should enter the laboratory with:
a. Laboratory observation notes with all the details (Problem statement, Aim,
Algorithm, Procedure, Program, Expected Output, etc.,) filled in for the lab
session.
b. Laboratory Record updated up to the last session's experiments and other
materials (if any) needed in the lab.
c. Proper Dress code and Identity card.
4. Sign in the laboratory login register, write the TIME-IN, and occupy the computer
system allotted to you by the faculty.
5. Execute your task in the laboratory, and record the results / output in the lab
observation note book, and get certified by the concerned faculty.
6. All the students should be polite and cooperative with the laboratory staff, and must
maintain discipline and decency in the laboratory.
7. Computer labs are established with sophisticated and high end branded systems, which
should be utilized properly.
8. Students / Faculty must keep their mobile phones in SWITCHED OFF mode during the
lab sessions. Misuse of the equipment, misbehavior with the staff and systems, etc.
will attract severe punishment.
9. Students must take the permission of the faculty in case of any urgency to go out;
anybody found loitering outside the lab / class without permission during
working hours will be treated seriously and punished appropriately.
10. Students should LOG OFF / SHUT DOWN the computer system before they leave
the lab after completing the task (experiment) in all aspects. They must ensure the
system / seat is left in proper condition.
HEAD OF THE DEPARTMENT PRINCIPAL
MALLA REDDY COLLEGE OF ENGINEERING AND TECHNOLOGY
B.TECH - III YEAR - II SEM - CSE L/T/P/C
-/-/3/1.5
(R22A0590) BIG DATA ANALYTICS LAB
COURSE OBJECTIVES:
The objectives of this course are:
1. To implement MapReduce programs for processing big data.
2. To realize storage of big data using MongoDB.
3. To analyze big data using machine learning techniques such as Decision tree
classification and clustering.
List of Experiments
WEEK 1 & 2: Install, configure and run Python, NumPy and Pandas.
WEEK 3: Install, configure and run Hadoop and HDFS.
WEEK 4: Visualize data using basic plotting techniques in Python.
WEEK 5 & 6: Implement NoSQL database operations: CRUD operations, arrays using MongoDB.
WEEK 7: Implement functions: Count – Sort – Limit – Skip – Aggregate using MongoDB.
WEEK 8: Implement word count / frequency programs using MapReduce.
WEEK 9: Implement a MapReduce program that processes a dataset.
WEEK 10: Implement clustering techniques using Spark.
WEEK 11: Implement an application that stores big data in MongoDB / Pig using Hadoop / R.
BIG DATA ANALYTICS LAB
Table of Contents
S.No Name of the Experiment
1. Install, configure and run Python, NumPy and Pandas.
2. Install, configure and run Hadoop and HDFS.
3. Visualize data using basic plotting techniques in Python.
4. Implement NoSQL database operations: CRUD operations, arrays using MongoDB.
5. Implement functions: Count – Sort – Limit – Skip – Aggregate using MongoDB.
6. Implement word count / frequency programs using MapReduce.
7. Implement a MapReduce program that processes a dataset.
8. Implement clustering techniques using Spark.
9. Implement an application that stores big data in MongoDB / Pig using Hadoop / R.
EXPERIMENT: 1&2
Install, Configure and Run Python, NumPy and Pandas
PROGRAM:
AIM: To install, configure and run Python, NumPy and Pandas.
How to Install Anaconda on Windows?
Anaconda is open-source software that contains Jupyter, Spyder, etc., which are used for large data
processing, data analytics and heavy scientific computing. Anaconda works for the R and Python
programming languages. Spyder (a sub-application of Anaconda) is used for Python. OpenCV for Python
will work in Spyder. Package versions are managed by the package management system called
conda.
To begin working with Anaconda, one must get it installed first. Follow the below instructions to
Download and install Anaconda on your system:
Download and install Anaconda:
Head over to anaconda.com and install the latest version of Anaconda. Make sure to download the
"Python 3.7 Version" for the appropriate architecture.
Begin with the installation process:
Getting Started
Getting through the License Agreement:
Select Installation Type: Select "Just Me" if you want the software to be used by a single user.
Choose Installation Location:
Advanced Installation Option:
Getting through the Installation Process:
Recommendation to Install PyCharm:
Finishing up the Installation:
Working with Anaconda:
Once the installation process is done, Anaconda can be used to perform multiple operations. To
begin using Anaconda, search for Anaconda Navigator from the Start Menu in Windows.
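Before loading any data, it can help to confirm that Python, NumPy and Pandas are actually available in the Anaconda environment. A minimal sketch (run in a Jupyter notebook or from the Anaconda Prompt):

import sys
import numpy
import pandas

# Print the interpreter and library versions to confirm the installation.
print("Python :", sys.version.split()[0])
print("NumPy  :", numpy.__version__)
print("Pandas :", pandas.__version__)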
#import pandas in jupyter notebook
import pandas
#loading the dataset, which is a CSV file
dataset = pandas.read_csv("crime.csv")
#displaying the data
dataset
import pandas as pd
dataset1 = pd.read_csv("crime.csv")
dataset1
dataset1.head()
dataset1.tail()
dataset1.head(10)
dataset1.tail(10)
type(dataset1)
pandas.core.frame.DataFrame
#to find any null values in the last 5 rows
dataset1.isnull().tail()
#to make sure that no null values exist
dataset1.notnull().tail()
#displays the number of null values in each column
dataset1.isnull().sum()
#helps to find null values with respect to ROBBERY column
dataset1[dataset1.Robbery.isnull()]
dataset1.shape
#helps to find how many times each value in a particular column is repeated
dataset1['Robbery'].value_counts()
#consolidated value counts for all the columns in the dataset
for col in dataset1.columns:
    display(dataset1[col].value_counts())
#helps to find number of rows in the dataset
dataset_length = len(dataset1)
dataset_length
#helps to find number of columns in the dataset
dataset_col=len(dataset1.columns)
dataset_col
#helps to find the summary of numerical columns
dataset1.describe()
#helps to describe individual column
dataset1.Murder.describe()
#skewness of each numeric column
dataset1.skew()
#variance of each numeric column
dataset1.var()
#kurtosis of each numeric column
dataset1.kurtosis()
#data types of all columns
print(dataset1.dtypes)
NUMPY
NumPy is the core library for scientific and numerical computing in Python. It provides a
high-performance multidimensional array object and tools for working with arrays.
NumPy's main object is the multidimensional array: a table of elements (usually numbers), all of
the same type, indexed by positive integers.
In NumPy, dimensions are called axes.
NumPy is fast, convenient and occupies less memory when compared to a Python list.
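As a quick illustration of the memory point (a small sketch, not a rigorous benchmark): the same one million integers occupy far less space as a NumPy array than as a Python list.

import sys
import numpy as np

values = list(range(1_000_000))
arr = np.arange(1_000_000)

# Approximate sizes: the list stores pointers to separate int objects,
# while the array stores raw fixed-size elements contiguously.
print("list container:", sys.getsizeof(values), "bytes (plus the int objects)")
print("ndarray data  :", arr.nbytes, "bytes")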
import numpy
arr = numpy.array([1, 2, 3, 4, 5])
print(arr)
NumPy is usually imported under the np alias.
import numpy as np
Now the NumPy package can be referred to as np instead of numpy.
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
Checking NumPy Version
The version string is stored under the __version__ attribute.
import numpy as np
print(np.__version__)
Create a NumPy ndarray Object
NumPy is used to work with arrays. The array object in NumPy is called ndarray.
We can create a NumPy ndarray object by using the array() function.
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
print(type(arr))
type(): This built-in Python function tells us the type of the object passed to it. As in the above
code, it shows that arr is of numpy.ndarray type.
To create an ndarray, we can pass a list, tuple or any array-like object into the array() method, and it
will be converted into an ndarray:
Use a tuple to create a NumPy array:
import numpy as np
arr = np.array((1, 2, 3, 4, 5))
print(arr)
Dimensions in Arrays
A dimension in arrays is one level of array depth (nested arrays).
Nested arrays are arrays that have arrays as their elements.
0-D Arrays
0-D arrays, or scalars, are the elements in an array. Each value in an array is a 0-D array.
#Create a 0-D array with value 42
import numpy as np
arr = np.array(42)
print(arr)
1-D Arrays
These are the most common and basic arrays.
#Create a 1-D array containing the values 1,2,3,4,5:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
2-D Arrays
An array that has 1-D arrays as its elements is called a 2-D array.
These are often used to represent matrices or 2nd-order tensors.
#Create a 2-D array containing two arrays with the values 1,2,3 and 4,5,6:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)
3-D arrays
An array that has 2-D arrays (matrices) as its elements is called a 3-D array.
These are often used to represent a 3rd-order tensor.
#Create a 3-D array with two 2-D arrays, both containing two arrays with the values 1,2,3 and 4,5,6:
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr)
Check Number of Dimensions?
NumPy arrays provide the ndim attribute, which returns an integer that tells us how many dimensions
the array has.
#Check how many dimensions the arrays have:
import numpy as np
a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(a.ndim)
print(b.ndim)
print(c.ndim)
print(d.ndim)
#Create an array with 5 dimensions and verify that it has 5 dimensions:
import numpy as np
arr = np.array([1, 2, 3, 4], ndmin=5)
print(arr)
print('number of dimensions :', arr.ndim)
NumPy Array Indexing
Access Array Elements
Array indexing is the same as accessing an array element.
You can access an array element by referring to its index number.
The indexes in NumPy arrays start with 0, meaning that the first element has index 0, and the second
has index 1 etc.
#Get the first element from the following array:
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[0])
#Get the second element from the following array.
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[1])
#Get third and fourth elements from the following array and add them.
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[2] + arr[3])
Access 2-D Arrays
To access elements from 2-D arrays we can use comma separated integers representing the dimension and
the index of the element.
Think of 2-D arrays like a table with rows and columns, where the dimension represents the row and the
index represents the column.
#Access the element on the first row, second column:
import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('2nd element on 1st row: ', arr[0, 1])
#Access the element on the 2nd row, 5th column:
import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('5th element on 2nd row: ', arr[1, 4])
OUTPUT:
Record Notes

Signature of the Faculty
EXPERIMENT: 3
Install, Configure and Run Hadoop and HDFS
PROGRAM:
AIM: To install, configure and run Hadoop and HDFS.
HADOOP INSTALLATION ON WINDOWS
1. Prerequisites
Hardware Requirement
* RAM — Min. 8GB, if you have SSD in your system then 4GB RAM would also work.
* CPU — Min. Quad core, with at least 1.80GHz
2. JRE 1.8 — Offline installer for JRE
3. Java Development Kit — 1.8
4. A software for un-zipping like 7-Zip or WinRAR
* I will be using 64-bit Windows for the process; please check and download the version (x86 or x64) supported
by your system for all the software.
5. Download Hadoop zip
* I am using Hadoop-2.9.2; you can use any other STABLE version of Hadoop.
Once we have downloaded all the above software, we can proceed with the next steps in installing
Hadoop.
2. Unzip and Install Hadoop
After downloading Hadoop, we need to unzip the hadoop-2.9.2.tar.gz file.
Once extracted, we would get a new file, hadoop-2.9.2.tar.
Now, once again we need to extract this tar file.
Now we can organize our Hadoop installation: we can create a folder and move the final extracted
files into it. For example:
Please note, while creating folders, DO NOT ADD SPACES IN BETWEEN THE FOLDER
NAME (it can cause issues later).
I have placed my Hadoop in the D: drive; you can use C: or any other drive also.
3. Setting Up Environment Variables
Another important step in setting up a work environment is to set your system's environment
variables.
To edit environment variables, go to Control Panel > System > click on the "Advanced system
settings" link.
Alternatively, we can right click on the This PC icon, click on Properties and click on the
"Advanced system settings" link.
Or, the easiest way is to search for Environment Variables in the search bar, and there you go...
Setting JAVA_HOME
Open Environment Variables and click on "New" under "User Variables".
On clicking "New", we get the below screen.
Now as shown, add JAVA_HOME in the variable name and the path of Java (JDK) in the variable value.
Click OK and we are half done with setting JAVA_HOME.
Setting HADOOP_HOME
Open Environment Variables and click on "New" under "User Variables".
On clicking "New", we get the below screen.
Now as shown, add HADOOP_HOME in the variable name and the path of the Hadoop folder in the
variable value.
Click OK and we are half done with setting HADOOP_HOME.
Note: if you want the path to be set for all users, you need to select "New" from System Variables.
Setting Path Variable
Last step in setting Environment variable is setting Path in System Variable.
Select the Path variable in the system variables and click on "Edit".
Now we need to add these paths to the Path variable one by one:
* %JAVA_HOME%\bin
* %HADOOP_HOME%\bin
* %HADOOP_HOME%\sbin
Click OK and OK. & we are done with Setting Environment Variables.
Verify the Paths
Now we need to verify that what we have done is correct and is reflected.
Open a NEW Command Window.
Run the following commands:
echo %JAVA_HOME%
echo %HADOOP_HOME%
echo %PATH%
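The same check can be scripted. A small sketch in Python (run it from a freshly opened prompt so the new variables are picked up):

import os

# Print the variables set above; "<not set>" signals a typo in the setup.
for var in ("JAVA_HOME", "HADOOP_HOME"):
    print(var, "=", os.environ.get(var, "<not set>"))

# Confirm the Hadoop bin folders made it onto PATH.
print("hadoop on PATH:", any("hadoop" in p.lower()
                             for p in os.environ.get("PATH", "").split(os.pathsep)))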
4. Editing Hadoop files
Once we have configured the environment variables, the next step is to configure Hadoop. It has 3 parts:
Creating Folders
We need to create a folder data in the Hadoop directory, and 2 sub-folders namenode and datanode.
Create the DATA folder in the Hadoop directory.
Once the DATA folder is created, we need to create 2 new folders, namely namenode and datanode,
inside the data folder.
These folders are important because the files on HDFS reside inside the datanode.
Editing Configuration Files
Now we need to edit the following config files in Hadoop for configuring it
(we can find these files in Hadoop -> etc -> hadoop):
* core-site.xml
* hdfs-site.xml
* mapred-site.xml
* yarn-site.xml
* hadoop-env.cmd
Editing core-site.xml
Right click on the file, select edit and paste the following content within the <configuration>
</configuration> tags.
Note: the below part already has the configuration tag; we need to copy only the part inside it.
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Editing hdfs-site.xml
Right click on the file, select edit and paste the following content within the
<configuration></configuration> tags.
Note: the below part already has the configuration tag; we need to copy only the part inside it.
Also replace PATH~1 and PATH~2 with the paths of the namenode and datanode folders that we created
recently (step 4.1).
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop\data\datanode</value>
</property>
</configuration>
Editing mapred-site.xml
Right click on the file, select edit and paste the following content within the <configuration>
</configuration> tags.
Note: the below part already has the configuration tag; we need to copy only the part inside it.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Editing yarn-site.xml
Right click on the file, select edit and paste the following content within the <configuration>
</configuration> tags.
Note: the below part already has the configuration tag; we need to copy only the part inside it.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.auxservices.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
Verifying hadoop-env.cmd
Right click on the file, select edit and check if the JAVA_HOME is set correctly or not.
We can replace the JAVA_HOME variable in the file with the actual JAVA_HOME that we
configured in the System Variable.
set JAVA_HOME=%JAVA_HOME%
OR
set JAVA_HOME="C:\Program Files\Java\jdk1.8.0_221"
Replacing bin
The last step in configuring Hadoop is to download and replace the bin folder.
* Go to this GitHub Repo and download the bin folder as a zip.
* Extract the zip and copy all the files present under the bin folder to %HADOOP_HOME%\bin
Note: if you are using a different version of Hadoop, then please search for its respective bin folder
and download it.
5. Testing Setup
Congratulations!
We are done with setting up Hadoop on our system.
Now we need to check if everything works smoothly...
Formatting Namenode
Before starting Hadoop we need to format the namenode. For this we need to start a NEW Command
Prompt and run the below command:
hadoop namenode -format
Note: this command formats all the data in the namenode. So, it is advisable to use it only at the start, and
not every time while starting the Hadoop cluster, to avoid data loss.
Launching Hadoop
Now we need to start a new Command Prompt (remember to run it as administrator to avoid
permission issues) and execute the below command:
start-all.cmd
This will open 4 new cmd windows running 4 different daemons of Hadoop:
* Namenode
* Datanode
* Resourcemanager
* Nodemanager
Note: we can verify that all the daemons are up and running using the jps command in a new cmd window.
6. Running Hadoop (Verifying Web UIs)
Namenode
Open localhost:50070 in a browser tab to verify the namenode health.
Resourcemanager
Open localhost:8088 in a browser tab to check the resourcemanager details.
Datanode
Open localhost:50075 in a browser tab to check out the datanode.
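These checks can also be scripted. A minimal sketch in Python that polls the three web UIs on localhost (the port numbers are the Hadoop 2.x defaults used above; Hadoop 3.x moved the NameNode UI to 9870):

import urllib.request

# Poll each daemon's web UI; any HTTP status means the daemon is up.
for name, url in [("NameNode", "http://localhost:50070"),
                  ("ResourceManager", "http://localhost:8088"),
                  ("DataNode", "http://localhost:50075")]:
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(name, "-> HTTP", resp.status)
    except OSError as exc:
        print(name, "-> not reachable:", exc)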
OUTPUT:
Record Notes

Signature of the Faculty
EXPERIMENT: 4
Visualize Data Using Basic Plotting Techniques In Python.
PROGRAM:
AIM: To visualize data using basic plotting techniques in Python.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

crime = pd.read_csv('crime.csv')
crime
plt.plot(crime.Murder, crime.Assault);
sns.scatterplot(x=crime.Murder, y=crime.Assault);
sns.scatterplot(x=crime.Murder, y=crime.Assault, hue=crime.Murder, s=100);
plt.figure(figsize=(12,6))
plt.title('Murder Vs Assault')
sns.scatterplot(x=crime.Murder, y=crime.Assault, hue=crime.Murder, s=100);
plt.title('Histogram for Robbery')
plt.hist(crime.Robbery);
# crime_bar is not defined earlier; a reasonable assumption is the crime data indexed by Year
crime_bar = crime.set_index('Year')
plt.bar(crime_bar.index, crime_bar.Robbery);
sns.barplot(x='Robbery', y='Year', data=crime);
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
data=pd.read_csv('crime.csv')
x=data.Population
y=data.CarTheft
plt.scatter(x,y)
plt.xlabel('Population')
plt.ylabel('CarTheft')
plt.title('Population Vs CarTheft')
plt.show();
OUTPUT:
Record Notes

Signature of the Faculty
EXPERIMENT: 5&6
Implement NoSQL Database Operations: CRUD Operations, Arrays Using MongoDB.
PROGRAM:
AIM: To perform CRUD and array operations using the MongoDB NoSQL database.
TITLE: Basic CRUD operations in MongoDB.
CRUD operations refer to the basic Create (Insert), Read, Update and Delete operations.
Inserting a document into a collection (Create)
➢ The command db.collection.insert() will perform an insert operation of a document into a collection.
➢ Let us insert a document into a student collection. You must be connected to a database
before doing any insert. It is done as follows:
db.student.insert({
regNo: "3014",
name: "Test Student",
course: { courseName: "MCA", duration: "3 Years" },
address: {
city: "Bangalore",
state: "KA",
country: "India" } })
An entry has been made into the collection called student.
Querying a document from a collection (Read)
To retrieve (Select) the inserted document, run the below command. The find() command will
retrieve all the documents of the given collection.
db.collection_name.find()
➢ If a record is to be retrieved based on some criteria, the find() method should be called passing
parameters, then the record will be retrieved based on the attributes specified.
db.collection_name.find({"fieldname":"value"})
➢ For example: let us retrieve the record from the student collection where the attribute regNo is
3014, and the query for the same is as shown below:
db.student.find({"regNo":"3014"})
Updating a document in a collection (Update)
In order to update specific field values of a document in a collection, run the below query:
db.collection_name.update()
➢ The update() method specified above will take the field name and the new value as arguments to update
a document.
➢ Let us update the attribute name of the collection student for the document with regNo 3014.
db.student.update(
{ "regNo": "3014" },
{ $set: { "name": "Viraj" } }
)
Removing an entry from the collection (Delete)
➢ Let us now look into deleting an entry from a collection. In order to delete an entry from a
collection, run the command as shown below: db.collection_name.remove({"fieldname":"value"})
➢ For example: db.student.remove({"regNo":"3014"})
Note that after running the remove() method, the entry has been deleted from the student collection.
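The same CRUD flow can also be driven from Python with the PyMongo driver. A minimal sketch, assuming a MongoDB server on localhost and a database named school (the database name is an assumption):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
student = client["school"]["student"]

# Create
student.insert_one({"regNo": "3014", "name": "Test Student",
                    "course": {"courseName": "MCA", "duration": "3 Years"}})
# Read
print(student.find_one({"regNo": "3014"}))
# Update
student.update_one({"regNo": "3014"}, {"$set": {"name": "Viraj"}})
# Delete
student.delete_one({"regNo": "3014"})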
Working with Arrays in MongoDB
1. Introduction
In a MongoDB database, data is stored in collections and a collection has documents. A document
has fields and values, as in JSON. The field types include scalar types (string, number, date, etc.)
and composite types (arrays and objects). In this article we will look at an example of using the array
field type.
The example is an application where users create blog posts and write comments for the posts. The
relationship between the posts and comments is One-to-Many; i.e., a post can have many comments.
We will consider a collection of blog posts with their comments. That is a post document will also
store the related comments. In MongoDB's document model, a 1:N relationship data can be stored
within a collection; this is a de-normalized form of data. The related data is stored together and can
be accessed (and updated) together. The comments are stored as an array; an array of comment
objects.
A sample document of the blog posts with comments:
{
"_id" : ObjectId("5ec55af811ac5e2e2aafb2b9"),
"name" : "Working with Arrays",
"user" : "Database Rebel",
"desc" : "Maintaining an arrayof objects in a document",
"content" : "some content ...",
"created" : ISODate("2020-05-20T16:28:55.468Z"),
"updated" : ISODate("2020-05-20T16:28:55.468Z"),
"tags" : [ "mongodb", "arrays" ],
"comments" : [
{
"user" : "DB Learner",
"content" : "Nice post.",
"updated" : ISODate("2020-05-20T16:35:57.461Z")
}
]
}
In an application, a blog post is created, and comments are added, queried, modified or deleted by users.
In the example, we will write code to create a blog post document, and do some CRUD operations
with comments for the post.
2. Create and Query a Document
Let's create a blog post document. We will use a database called blogs and a collection called
posts. The code is written in the mongo shell (an interactive JavaScript interface to MongoDB). The mongo
shell is started from the command line and is connected to the MongoDB server. From the shell:
use blogs
NEW_POST =
{
name: "Working with Arrays",
user: "Database Rebel",
desc: "Maintaining an array of objects in a document",
content: "some content...",
created: ISODate(),
updated: ISODate(),
tags: [ "mongodb", "arrays" ]
}
db.posts.insertOne(NEW_POST)
Returns a result { "acknowledged" : true, "insertedId" : ObjectId("5ec55af811ac5e2e2aafb2b9") }
indicating that a new document is created. This is a common acknowledgement when you perform a
write operation. When a document is inserted into a collection for the first time, the collection gets
created (if it doesn't exist already). The insertOne method inserts a document into the collection.
Now, let's query the collection:
db.posts.findOne()
{
"_id" : ObjectId("5ec55af811ac5e2e2aafb2b9"),
"name" : "Working with Arrays",
"user" : "Database Rebel",
"desc" : "Maintaining an arrayof objects in a document",
"content" : "some content...",
"created" : ISODate("2020-05-20T16:28:55.468Z"),
"updated" : ISODate("2020-05-20T16:28:55.468Z"),
"tags" : [
"mongodb",
"arrays"
]
}
The findOne method retrieves one matching document from the collection. Note the scalar fields
name (string type) and created (date type), and the array field tags. In the newly inserted document
there are no comments, yet.
3. Add an Array Element
Let's add a comment for this post, by a user "DB Learner":
NEW_COMMENT = {
user: "DB Learner",
text: "Nice post, can I know more about the arrays in MongoDB?",
updated: ISODate()
}
db.posts.updateOne(
{ _id : ObjectId("5ec55af811ac5e2e2aafb2b9") },
{ $push: { comments: NEW_COMMENT } }
)
Returns: { "acknowledged" : true, "matchedCount" : 1, "modifiedCount" : 1 }
The updateOne method updates a document's fields based upon the specified condition. $push is an
array update operator which adds an element to an array. If the array doesn't exist, it creates an array
field and then adds the element.
Let's query the collection and confirm the new comment visually, using the findOne method:
{
"_id" : ObjectId("5ec55af811ac5e2e2aafb2b9"),
"name" : "Working with Arrays",
...
"comments" : [
{
"user" : "DB Learner",
"text" : "Nice post, can I know more about the arrays in MongoDB?",
"updated" : ISODate("2020-05-20T16:35:57.461Z")
}
]
}
Note the comments array field has comment objects as elements. Let's add one more comment using
the same $push update operator. This new comment (by user "Database Rebel") is appended to the
comments array:
"comments" : [
{
"user" : "DB Learner",
"text" : "Nice post, can I know more about the arrays in MongoDB?",
"updated" : ISODate("2020-05-20T16:35:57.461Z")
},
{
"user" : "Database Rebel",
"text" : "Thank you, please look for updates",
"updated" : ISODate("2020-05-20T16:48:25.506Z")
}
]
4. Update an Array Element
Let's update the comment posted by "Database Rebel" with a modified text field:
NEW_CONTENT = "Thank you, please look for updates - updated the post"
db.posts.updateOne(
{ _id : ObjectId("5ec55af811ac5e2e2aafb2b9"), "comments.user": "Database Rebel" },
{ $set: { "comments.$.text": NEW_CONTENT } }
)
The $set update operator is used to change a field's value. The positional $ operator identifies an
element in an array to update without explicitly specifying the position of the element in the array.
The first matching element is updated. The updated comment object:
"comments" : [
{
"user" : "Database Rebel",
"text" : "Thank you, please look for updates - updated",
"updated" : ISODate("2020-05-20T16:48:25.506Z")
}
]
5. Delete an Array Element
The user changed his mind and wanted to delete the comment, and then add a new one.
db.posts.updateOne(
{ _id" : ObjectId("5ec55af811ac5e2e2aafb2b9") },
{ $pull: { comments: { user: "Database Rebel" } } }
)
The $pull update operator removes elements from an array which match the specified condition - in
this case { comments: { user: "Database Rebel" } }.
A new comment is added to the array after the above delete operation, with the following text:
"Thank you for your comment. I have updated the post with CRUD operations on an array field".
6. Add a New Field to all Objects in the Array
Let's add a new field likes for all the comments in the array.
db.posts.updateOne(
{ "_id : ObjectId("5ec55af811ac5e2e2aafb2b9") },
{ $set: { "comments.$[].likes": 0 } }
)
The all positional operator $[] specifies that the update operator $set should modify all elements in
the specified array field. After the update, all comment objects have the likes field, for example:
{
"user" : "DB Learner",
"text" : "Nice post, can I know more about the arrays in MongoDB?",
"updated" : ISODate("2020-05-20T16:35:57.461Z"),
"likes" : 0
}
7. Update a Specific Array Element Based on a Condition
First, let's add another new comment using the $push update operator:
NEW_COMMENT = {
user: "DB Learner",
text: "Thanks for the updates!",
updated: ISODate()
}
Note the likes field is missing in the input document. We will update this particular comment in the
comments array with the condition that the likes field is missing.
db.posts.updateOne(
{ "_id" : ObjectId("5ec55af811ac5e2e2aafb2b9") },
{ $inc: { "comments.$[ele].likes": 1 } },
{ arrayFilters: [ { "ele.user": "DB Learner", "ele.likes": { $exists: false } } ] }
)
The likes field is updated using the $inc update operator (this increments a field's value or, if the field
does not exist, adds the field and then increments it). The filtered positional operator $[<identifier>]
identifies the array elements that match the arrayFilters conditions for an update operation.
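For reference, the same array updates can be issued from Python via PyMongo. A sketch, assuming the blogs.posts document created earlier exists on a local server:

from pymongo import MongoClient

posts = MongoClient("mongodb://localhost:27017/")["blogs"]["posts"]
query = {"name": "Working with Arrays"}

# $push: append a comment object to the comments array.
posts.update_one(query, {"$push": {"comments": {"user": "DB Learner",
                                                "text": "Nice post."}}})
# Positional $: modify the first comment matching the condition.
posts.update_one({**query, "comments.user": "DB Learner"},
                 {"$set": {"comments.$.likes": 1}})
# $pull: remove all comments by that user.
posts.update_one(query, {"$pull": {"comments": {"user": "DB Learner"}}})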
OUTPUT:
Record Notes

Signature of the Faculty
EXPERIMENT: 7
Implement Functions: Count – Sort – Limit – Skip – Aggregate Using MongoDB.
PROGRAM:
AIM: To implement the count, sort, limit, skip and aggregate functions using MongoDB.
1. COUNT
How do you get the number of Debit and Credit transactions? One way to
do it is by using the count() function as below:
> db.transactions.count({cr_dr : "D"});
or
> db.transactions.find({cr_dr : "D"}).length();
But what if you do not know the possible values of cr_dr upfront? Here the
Aggregation framework comes into play. See the below aggregate query.
> db.transactions.aggregate( [
{
$group :{
_id: '$cr_dr', // group by type of transaction
// add 1 to the count for each document of this type of transaction
count : {$sum : 1}
}
}
]
);
And the result is
{
"_id" : "C",
"count" : 3
}
{
"_id" : "D",
"count" : 5
}
2. SORT
Definition
$sort
Sorts all input documents and returns them to the pipeline in sorted order.
The $sort stage has the following prototype form:
{ $sort: { <field1>: <sort order>, <field2>: <sort order> ... }}
$sort takes a document that specifies the field(s) to sort by and the respective
sort order. <sort order> can have one of the following values:
1 – Sort ascending.
-1 – Sort descending.
{ $meta: "textScore" } – Sort by the computed textScore metadata in descending order. See
Text Score Metadata Sort for an example.
If sorting on multiple fields, sort order is evaluated from left to right. For
example, in the form above, documents are first sorted by <field1>. Then
documents with the same <field1> values are further sorted by <field2>.
Behavior
Limits
You can sort on a maximum of 32 keys.
Sort Consistency
MongoDB does not store documents in a collection in a particular order.
When sorting on a field which contains duplicate values, documents
containing those values may be returned in any order.
If consistent sort order is desired, include at least one field in your sort
that contains unique values. The easiest way to guarantee this is to
include the _id field in your sort query.
Consider the following restaurant collection:
db.restaurants.insertMany( [
{ "_id" : 1, "name" : "Central Park Cafe", "borough" : "Manhattan"},
{ "_id" : 2, "name" : "Rock A Feller Bar and Grill", "borough" :
"Queens"},
{ "_id" : 3, "name" : "Empire State Pub", "borough" : "Brooklyn"},
{ "_id" : 4, "name" : "Stan's Pizzaria", "borough" : "Manhattan"},
{ "_id" : 5, "name" : "Jane's Deli", "borough" : "Brooklyn"},
])
The following command uses the $sort stage to sort on the borough field:
db.restaurants.aggregate(
[
{ $sort : { borough : 1 } }
]
)
In this example, sort order may be inconsistent, since the borough field
contains duplicate values for both Manhattan and Brooklyn. Documents
are returned in alphabetical order by borough, but the order of those
documents with duplicate values for borough might not be the same
across multiple executions of the same sort. For example, here are the
results from two different executions of the above command:
{"_id" : 3, "name" :"Empire State Pub", "borough" : "Brooklyn" }
{"_id" : 5, "name" : "Jane's Deli", "borough" : "Brooklyn" }
{"_id" : 1, "name" :"Central Park Cafe", "borough" : "Manhattan" }
{"_id": 4,"name" : "Stan's Pizzaria", "borough" :"Manhattan" }
{ "_id" : 2, "name" : "Rock A Feller Bar and Grill", "borough" : "Queens"
}
{"_id" : 5, "name" : "Jane's Deli", "borough" : "Brooklyn" }
{"_id" : 3, "name" : "Empire State Pub", "borough" : "Brooklyn" }
{"_id": 4,"name" : "Stan's Pizzaria", "borough" :"Manhattan" }
{"_id" : 1, "name" : "Central Park Cafe", "borough" :"Manhattan" }
{ "_id" : 2, "name" : "Rock A Feller Bar and Grill", "borough" : "Queens"
}
While the values for borough are still sorted in alphabetical order, the
order of the documents containing duplicate values for borough (i.e.
Manhattan and Brooklyn) is not the same.
To achieve a consistent sort, add a field which contains exclusively
unique values to the sort. The following command uses the $sort stage
to sort on both the borough field and the _id field:
db.restaurants.aggregate(
[
{ $sort : { borough : 1, _id: 1 } }
]
)
Since the _id field is always guaranteed to contain exclusively unique
values, the returned sort order will always be the same across multiple
executions of the same sort.
Examples
Ascending/Descending Sort
For the field or fields to sort by, set the sort order to 1 or -1 to specify an
ascending or descending sort respectively, as in the following example:
db.users.aggregate(
[
{ $sort : { age : -1, posts: 1 } }
]
)
This operation sorts the documents in the users collection in descending
order by the age field, and then in ascending order by the value of
the posts field.
3. LIMIT
$limit
Limits the number of documents passed to the next stage in the pipeline.
The $limit stage has the following prototype form:
{ $limit: <positive 64-bit integer> }
$limit takes a positive integer that specifies the maximum number of
documents to pass along.
For example, the following operation returns at most 5 documents from
the users collection, after sorting them in descending order of age:
db.users.aggregate(
[
{ $sort : { age : -1 } },
{ $limit : 5 }
]
)
4. SKIP
$skip
Skips over the specified number of documents that pass into the stage and
passes the remaining documents to the next stage in the pipeline.
The $skip stage has the following prototype form:
{ $skip: <positive 64-bit integer> }
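5. AGGREGATE
The stages above compose into a single aggregation pipeline. A minimal sketch in Python with PyMongo, reusing the transactions collection from the COUNT section (the server address and the database name bank are assumptions):

from pymongo import MongoClient

transactions = MongoClient("mongodb://localhost:27017/")["bank"]["transactions"]

pipeline = [
    {"$group": {"_id": "$cr_dr", "count": {"$sum": 1}}},  # aggregate: count per type
    {"$sort": {"count": -1, "_id": 1}},                   # sort, _id as a tie-breaker
    {"$skip": 0},                                         # skip the first 0 groups
    {"$limit": 2},                                        # pass along at most 2 groups
]
for doc in transactions.aggregate(pipeline):
    print(doc)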
OUTPUT:
Record Notes

Signature of the Faculty
EXPERIMENT: 8
Implement Word Count / Frequency Programs Using MapReduce.
PROGRAM:
AIM: To count word frequencies in a given text using MapReduce functions.
We use the Hadoop Streaming API to pass data between our Map and Reduce code
via STDIN (standard input) and STDOUT (standard output).
Note: make sure the files have execution permission (chmod +x /home/hduser/mapper.py and
chmod +x /home/hduser/reducer.py).
Mapper program
mapper.py
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    line = line.strip()   # remove leading and trailing whitespace
    words = line.split()  # split the line into words
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
Reducer program
#!/usr/bin/env python
"""reducer.py"""
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    line = line.strip()  # remove leading and trailing whitespace
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
Test the code (cat data | map | sort | reduce)
hduser@ubuntu:~$ echo "foo foo quux labs foo bar quux" | /home/hduser/mapper.py
foo 1
foo 1
quux 1
labs 1
foo 1
bar 1
quux 1
hduser@ubuntu:~$ echo "foo foo quux labs foo bar quux" | /home/hduser/mapper.py | sort -k1,1 |
/home/hduser/reducer.py
bar 1
foo 3
labs 1
quux 2
hduser@ubuntu:~$ cat /tmp/gutenberg/20417-8.txt | /home/hduser/mapper.py
The 1
Project 1
Gutenberg 1
EBook 1
of 1
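The shell pipeline above can also be reproduced from Python, which is handy for testing on a machine without a configured Hadoop. A sketch using only the standard library (the file names are assumptions; it requires a Unix sort on PATH):

import subprocess

# Simulate: cat input.txt | mapper.py | sort -k1,1 | reducer.py
with open("input.txt", "rb") as data:
    mapped = subprocess.run(["python", "mapper.py"], stdin=data,
                            capture_output=True, check=True)
shuffled = subprocess.run(["sort", "-k1,1"], input=mapped.stdout,
                          capture_output=True, check=True)
reduced = subprocess.run(["python", "reducer.py"], input=shuffled.stdout,
                         capture_output=True, check=True)
print(reduced.stdout.decode())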
OUTPUT:
Record Notes

Signature of the Faculty
EXPERIMENT: 9
Implement a MapReduce Program that processes a dataset.
AIM:
To process a dataset using MapReduce functions.
PROGRAM:
The Python program reads the data from a dataset (stored in the file data.csv – wine quality).
The mapped data is stored in shuffled.pkl using mapper.py.
The contents of shuffled.pkl are reduced using reducer.py.
Mapper Program
import pandas as pd
import pickle

data = pd.read_csv('data.csv')

# Slicing the data into four contiguous, non-overlapping parts
slice1 = data.iloc[0:400, :]
slice2 = data.iloc[400:800, :]
slice3 = data.iloc[800:1200, :]
slice4 = data.iloc[1200:, :]

def mapper(data):
    mapped = []
    for index, row in data.iterrows():
        mapped.append((row['quality'], row['volatile acidity']))
    return mapped

map1 = mapper(slice1)
map2 = mapper(slice2)
map3 = mapper(slice3)
map4 = mapper(slice4)

shuffled = {
    3.0: [],
    4.0: [],
    5.0: [],
    6.0: [],
    7.0: [],
    8.0: [],
}

for i in [map1, map2, map3, map4]:
    for j in i:
        shuffled[j[0]].append(j[1])

# 'wb' overwrites any previous run so reducer.py always sees fresh data
file = open('shuffled.pkl', 'wb')
pickle.dump(shuffled, file)
file.close()
print("Data has been mapped. Now, run reducer.py to reduce the contents in the shuffled.pkl file.")
Reducer Program
import pickle

file = open('shuffled.pkl', 'rb')
shuffled = pickle.load(file)

def reduce(shuffled_dict):
    reduced = {}
    for i in shuffled_dict:
        reduced[i] = sum(shuffled_dict[i]) / len(shuffled_dict[i])
    return reduced

final = reduce(shuffled)
print("Average volatile acidity in different classes of wine:")
for i in final:
    print(i, ':', final[i])
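One limitation of mapper.py above is that the quality keys 3.0 to 8.0 are hard-coded, so a dataset with any other quality value would raise a KeyError. A small sketch of a shuffle step that discovers the keys itself (the sample tuples below stand in for the real map1..map4 output):

from collections import defaultdict

mapped_parts = [[(5.0, 0.70), (6.0, 0.30)],   # stand-ins for map1..map4
                [(5.0, 0.50), (9.0, 0.45)]]   # note the previously unseen key 9.0

shuffled = defaultdict(list)
for part in mapped_parts:
    for quality, acidity in part:
        shuffled[quality].append(acidity)     # keys are created on demand

print(dict(shuffled))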
OUTPUT:
Record Notes

Signature of the Faculty
EXPERIMENT: 10
Implement Clustering Techniques Using SPARK.
AIM: To implement clustering techniques using Spark.
PROGRAM:
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("KMeansExample").getOrCreate()

# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Trains a k-means model.
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)

# Evaluate clustering by computing Within Set Sum of Squared Errors.
# (computeCost is available up to Spark 2.x; Spark 3.x uses ClusteringEvaluator instead.)
wssse = model.computeCost(dataset)
print("Within Set Sum of Squared Errors = " + str(wssse))

# Shows the result.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
OUTPUT:
Record Notes

Signature of the Faculty
EXPERIMENT: 11
Implement an Application that Stores Big Data in MongoDB / Pig Using Hadoop / R.
AIM: To design an application that stores big data in MongoDB using Hadoop / R.
PROGRAM:
R Shiny Tutorial: How to Make Interactive Web Applications in R
Introduction
In this modern technological era, various apps are available to all of us – from tracking our fitness
level and sleep to giving us the latest information about the stock markets. Apps like Robinhood, Google
Fit and Workit seem so amazingly useful because they use real-time data and statistics. As R is a
frontrunner in the field of statistical computing and programming, developers need a system to use
its power to build apps.
This is where R Shiny comes to save the day. In this R Shiny tutorial, you will come to know the
basics.
What is R Shiny?
Shiny is an R package that was developed for building interactive web applications in R. Using this,
you can create web applications utilizing native HTML and CSS code along with R Shiny code. You
can build standalone web apps on a website that will make data visualization easy. These applications
made through R Shiny can seamlessly display R objects such as tables and plots.
Let us look at some of the features of R Shiny:
Build web applications with fewer lines of code, without JavaScript.
These applications are live and are accessible to users like spreadsheets. The outputs may
alter in real-time if the users change the input.
Developers with little knowledge of web tools can also build apps using R Shiny.
You get in-built widgets to display tables, outputs of R objects and plots.
You can add live visualizations and reports to the web application using this package.
The user interfaces can be coded in R or can be prepared using HTML, CSS or JavaScript.
The default user interface is built using Bootstrap.
It comes with a WebSocket package that enables fast communication between the web server
and R.
Components of an RShiny app
A Shiny app has two primary components – a user interface object and a server function.
These are the arguments passed on to the shinyApp method. This method creates an
application object using the arguments.
Let us understand the basic parts of an R Shiny app in detail:
User interface function
This function defines the appearance of the web application. It makes the application
interactive by obtaining input from the user and displaying it on the screen. HTML and CSS
tags can be used for making the application look better. So, while building the ui.R file you
create an HTML file with R functions.
If you type fluidPage() in the R console, you will see that the method returns a tag <div
class="container-fluid"></div>.
The different input functions are:
selectInput() – This method is used for creating a dropdown HTML that has various
choices to select.
numericInput() – This method creates an input area for writing text or numbers.
radioButtons() – This provides radio buttons for the user to select an input.
Layout methods
The various layout features available in Bootstrap are implemented by R Shiny. The components are:
Panels
These are methods that group elements together into a single panel. These include:
absolutePanel()
inputPanel()
conditionalPanel()
headerPanel()
fixedPanel()
Layout functions
These organize the panels for a particular layout. These include:
fluidRow()
verticalLayout()
flowLayout()
splitLayout()
sidebarLayout()
Output methods
These methods are used for displaying R output components such as images, tables and plots. They are:
tableOutput() – This method is used for displaying an R table
plotOutput() – This method is used for displaying an R plot object
Server function
After you have created the appearance of the application and the ways to take input values from the
user, it is time to set up the server. The server functions help you to write the server-side code for the
Shiny app. You can create functions that map the user inputs to the corresponding outputs. This
function is called bythe web browser when the application is loaded.
It takes an input and output parameter, and return values are ignored. An optional session parameter
is also taken bythis method.
R Shiny tutorial: How to get started with R Shiny?
Steps to start working with the R Shiny package are as follows:
Go to the R console and type in the command – install.packages("shiny")
The package comes with 11 built-in application examples for you to understand how Shiny
works
You can start with the Hello Shiny example to understand the basic structure. Type this code to run
Hello Shiny:
library(shiny)
runExample("01_hello")
The steps to create a new Shiny app are:
Open RStudio and go to the File option
Select New Project in a directory and click on the "Shiny Web Application" option
You will get a histogram and a slider to test the changes in output with respect to the input
You will get two scripts ui.R and server.R for coding and customizing the application
Tips for Shiny app development
Test the app in the browser to see how it looks before sending it for production
Run the entire script while debugging the app
Be careful about common errors such as missing commas
OUTPUT:
Record Notes

Signature of the Faculty