
BIG DATA ANALYTICS LAB

Submitted in partial fulfillment of the requirements for the award of the degree of

Bachelor of Technology
in
Computer Science & Engineering

(Bikaner Technical University, Bikaner)

SESSION (2023-2024)

SUBMITTED TO:                          SUBMITTED BY:
Mr. Sunil Kumar Khinchi                Name: Prachi
                                       20EEACS301

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING,
ENGINEERING COLLEGE AJMER


1. Implement the following Data structures in Java
i) Linked Lists ii) Stacks iii) Queues iv) Set v) Map

DESCRIPTION:
The java.util package contains all the classes and interfaces of the Collection framework.

Methods of Collection interface


There are many methods declared in the Collection interface. They are as follows:

1. public boolean add(Object element) - inserts an element into this collection.
2. public boolean addAll(Collection c) - inserts the elements of the specified collection into the invoking collection.
3. public boolean remove(Object element) - deletes an element from this collection.
4. public boolean removeAll(Collection c) - deletes all the elements of the specified collection from the invoking collection.
5. public boolean retainAll(Collection c) - deletes all the elements of the invoking collection except those in the specified collection.
6. public int size() - returns the total number of elements in the collection.
7. public void clear() - removes all elements from the collection.
8. public boolean contains(Object element) - searches for an element in the collection.
9. public boolean containsAll(Collection c) - searches for the specified collection within this collection.
10. public Iterator iterator() - returns an iterator over the collection.
11. public Object[] toArray() - converts the collection into an array.
12. public boolean isEmpty() - checks whether the collection is empty.

SKELETON OF JAVA.UTIL.COLLECTION INTERFACE

public interface Collection<E> extends Iterable<E> {
    int size();
    boolean isEmpty();
    boolean contains(Object o);
    Iterator<E> iterator();
    Object[] toArray();
    <T> T[] toArray(T[] a);
    boolean add(E e);
    boolean remove(Object o);
    boolean containsAll(Collection<?> c);
    boolean addAll(Collection<? extends E> c);
    boolean removeAll(Collection<?> c);
    boolean retainAll(Collection<?> c);
    void clear();
    boolean equals(Object o);
    int hashCode();
}
ALGORITHM for All Collection Data Structures:-

Steps of Creation of a Collection

1. Create an object of generic type E, T, K or V.
2. Create a Model class or Plain Old Java Object (POJO) of that type.
3. Generate setters and getters.
4. Create a Collection object of type Set, List, Map or Queue.
5. Add objects to the collection:
boolean add(E e)
6. Add a collection to the collection:
boolean addAll(Collection c)
7. Remove or retain data from the collection:
removeAll(Collection c), retainAll(Collection c)
8. Iterate over the objects using Enumeration, Iterator or ListIterator:
Iterator iterator(), ListIterator listIterator()
9. Display the objects from the collection (a minimal Java sketch of these steps follows below).
10. END
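
SAMPLE PROGRAM (illustrative sketch):
The following is a minimal Java sketch of the steps above. It assumes a simplified Employee POJO
with only an id and a name; the remaining fields of employee.txt (designation, department,
salaries, rating) would be added to the POJO in the same way.

import java.util.*;

class Employee {
    private String id;
    private String name;

    Employee(String id, String name) { this.id = id; this.name = name; }

    public String getId() { return id; }
    public String getName() { return name; }
    public String toString() { return id + "," + name; }
}

public class CollectionDemo {
    public static void main(String[] args) {
        // i) Linked List of Employee objects
        List<Employee> list = new LinkedList<>();
        list.add(new Employee("e100", "james"));
        list.add(new Employee("e101", "jack"));

        // ii) Stack (LIFO) of Employee objects
        Deque<Employee> stack = new ArrayDeque<>();
        stack.push(new Employee("e102", "jane"));
        stack.push(new Employee("e104", "john"));

        // iii) Queue (FIFO) of Employee objects
        Queue<Employee> queue = new LinkedList<>();
        queue.offer(new Employee("e105", "peter"));
        queue.offer(new Employee("e106", "david"));

        // iv) Set of employee ids (duplicates are silently ignored)
        Set<String> ids = new HashSet<>();
        for (Employee e : list) ids.add(e.getId());

        // v) Map from employee id to Employee object
        Map<String, Employee> byId = new HashMap<>();
        for (Employee e : list) byId.put(e.getId(), e);

        // Iterate and display the objects stored in the collections
        Iterator<Employee> it = list.iterator();
        while (it.hasNext()) System.out.println(it.next());
        System.out.println("Stack top: " + stack.peek());
        System.out.println("Queue head: " + queue.peek());
        System.out.println("Ids: " + ids);
        System.out.println("Map: " + byId);
    }
}

In the full exercise, each line of employee.txt is split on commas and loaded into these
collections before being displayed.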

SAMPLE INPUT:
Sample Employee Data Set:
(employee.txt)
e100,james,asst.prof,cse,8000,16000,4000,8.7
e101,jack,asst.prof,cse,8350,17000,4500,9.2
e102,jane,assoc.prof,cse,15000,30000,8000,7.8
e104,john,prof,cse,30000,60000,15000,8.8
e105,peter,assoc.prof,cse,16500,33000,8600,6.9
e106,david,assoc.prof,cse,18000,36000,9500,8.3
e107,daniel,asst.prof,cse,9400,19000,5000,7.9
e108,ramu,assoc.prof,cse,17000,34000,9000,6.8
e109,rani,asst.prof,cse,10000,21500,4800,6.4
e110,murthy,prof,cse,35000,71500,15000,9.3

EXPECTED OUTPUT:-
Prints the information of each employee with all of its attributes.
2. Perform setting up and Installing Hadoop in its three operating modes: Standalone,
Pseudo distributed, Fully distributed.

DESCRIPTION:
Hadoop is written in Java, so you will need to have Java installed on your machine,
version 6 or later. Sun's JDK is the one most widely used with Hadoop, although others
have been reported to work. Hadoop runs on Unix and on Windows. Linux is the only
supported production platform, but other flavors of Unix (including Mac OS X) can be
used to run Hadoop for development. Windows is only supported as a development
platform, and additionally requires Cygwin to run. During the Cygwin installation
process, you should include the openssh package if you plan to run Hadoop in pseudo-
distributed mode.

ALGORITHM

STEPS INVOLVED IN INSTALLING HADOOP IN STANDALONE MODE:-

1. Command for installing ssh is "sudo apt-get install ssh".
2. Command for key generation is ssh-keygen -t rsa -P "".
3. Store the key into authorized_keys by using the command cat $HOME/.ssh/id_rsa.pub >>
$HOME/.ssh/authorized_keys
4. Extract the Java archive by using the command tar xvfz jdk-8u60-linux-i586.tar.gz
5. Extract Eclipse by using the command tar xvfz eclipse-jee-mars-R-linux-gtk.tar.gz
6. Extract Hadoop by using the command tar xvfz hadoop-2.7.1.tar.gz
7. Move Java to /usr/lib/jvm/ and Eclipse to /opt/. Configure the Java path in the
eclipse.ini file.
8. Export the Java path and the Hadoop path in ~/.bashrc.
9. Check whether the installation is successful by checking the Java version and the
Hadoop version.
10. Check whether the Hadoop instance in standalone mode works correctly by running the
built-in wordcount example from the Hadoop examples jar.
11. If the word count is displayed correctly in the part-r-00000 file, standalone mode has
been installed successfully.

ALGORITHM
STEPS INVOLVED IN INSTALLING HADOOP IN PSEUDO-DISTRIBUTED MODE:-

1. In order to install pseudo-distributed mode we need to configure the Hadoop
configuration files residing in the directory /home/lendi/hadoop-2.7.1/etc/hadoop.
2. First configure the hadoop-env.sh file by changing the Java path.
3. Configure core-site.xml, which contains a property tag holding a name and a value:
name fs.defaultFS and value hdfs://localhost:9000.
4. Configure hdfs-site.xml.
5. Configure yarn-site.xml.
6. Configure mapred-site.xml; before configuring it, copy mapred-site.xml.template to
mapred-site.xml.
7. Now format the name node by using the command hdfs namenode -format.
8. Run the commands start-dfs.sh and start-yarn.sh, which start the daemons:
NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager.
9. Run jps, which lists all running daemons. Create a directory in HDFS by using the
command hdfs dfs -mkdir /csedir, enter some data into lendi.txt using the command
nano lendi.txt, copy it from the local directory to HDFS using the command
hdfs dfs -copyFromLocal lendi.txt /csedir/, and run the sample wordcount jar to check
whether pseudo-distributed mode is working or not (a small Java connectivity check is
also sketched below).
10. Display the contents of the output file by using the command hdfs dfs -cat /newdir/part-r-00000.
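
Besides running the sample wordcount jar, a quick programmatic check of the pseudo-distributed
setup is to connect to HDFS from Java and list the root directory. This is a minimal sketch,
assuming the fs.defaultFS value configured above (hdfs://localhost:9000) and the Hadoop client
jars on the classpath; the class name HdfsCheck is only illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same value as the fs.defaultFS property configured in core-site.xml above
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);
        // List the HDFS root directory; this fails if the daemons are not running
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}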

FULLY DISTRIBUTED MODE INSTALLATION:

ALGORITHM

1. Stop all single-node clusters:
$ stop-all.sh
2. Decide on one node as the NameNode (Master) and the remaining nodes as DataNodes (Slaves).
3. Copy the public key to all hosts to get passwordless SSH access:
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub lendi@l5sys24
4. Configure all configuration files to name the Master and Slave nodes:
$ cd $HADOOP_HOME/etc/hadoop
$ nano core-site.xml
$ nano hdfs-site.xml
5. Add the hostnames to the slaves file and save it:
$ nano slaves
6. Configure yarn-site.xml:
$ nano yarn-site.xml
7. On the Master node, run:
$ hdfs namenode -format
$ start-dfs.sh
$ start-yarn.sh
8. Format the NameNode (done with the hdfs namenode -format command above).
9. Daemons start on the Master and Slave nodes (verify with jps).
10. END

INPUT
ubuntu@localhost> jps
OUTPUT:
NameNode, DataNode, SecondaryNameNode,
NodeManager, ResourceManager

3. Implement the following file management tasks in Hadoop:


● Adding files and directories
● Retrieving files
● Deleting files
Hint: A typical Hadoop workflow creates data files (such as log files) elsewhere and copies them
into HDFS using one of the above command line utilities.

DESCRIPTION:

HDFS is a scalable distributed filesystem designed to scale to petabytes of data while


running on top of the underlying filesystem of the operating system. HDFS keeps track of where
the data resides in a network by associating the name of its rack (or network switch) with the
dataset. This allows Hadoop to efficiently schedule tasks to those nodes that contain data, or
which are nearest to it, optimizing bandwidth utilization. Hadoop provides a set of command line
utilities that work similarly to the Linux file commands, and serve as your primary interface with
HDFS. We're going to have a look into HDFS by interacting with it from the command line. We
will take a look at the most common file management tasks in Hadoop, which include:

● Adding files and directories to HDFS


● Retrieving files from HDFS to local filesystem
● Deleting files from HDFS

ALGORITHM:-

SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM HDFS

Step-1
Adding Files and Directories to HDFS

Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data into
HDFS first. Let's create a directory and put a file in it. HDFS has a default working directory of
/user/$USER, where $USER is your login user name. This directory isn't automatically created
for you, though, so let's create it with the mkdir command. For the purpose of illustration, we
use chuck. You should substitute your user name in the example commands.

hadoop fs -mkdir /user/chuck
hadoop fs -put example.txt
hadoop fs -put example.txt /user/chuck

Step-2
Retrieving Files from HDFS
The Hadoop command get copies files from HDFS back to the local filesystem, while cat prints
their contents to standard output. To retrieve and view example.txt, we can run:
hadoop fs -get example.txt .
hadoop fs -cat example.txt

Step-3
Deleting Files from HDFS
hadoop fs -rm example.txt
● Command for creating a directory in HDFS is "hdfs dfs -mkdir /lendicse".
● Adding a directory is done through the command "hdfs dfs -put lendi_english /".

Step-4
Copying Data from the Local File System to HDFS
Command for copying from a local directory is "hdfs dfs -copyFromLocal
/home/lendi/Desktop/shakes/glossary /lendicse/"
● View the file by using the command "hdfs dfs -cat /lendi_english/glossary"
● Command for listing items in Hadoop is "hdfs dfs -ls hdfs://localhost:9000/".
● Command for deleting files is "hdfs dfs -rm -r /kartheek".
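
The same add / retrieve / delete tasks can also be performed programmatically through the HDFS
Java API. Below is a minimal sketch, assuming the pseudo-distributed setup from the previous
experiment (fs.defaultFS = hdfs://localhost:9000) and an existing local file example.txt; the
paths and the class name HdfsFileTasks are illustrative only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import java.io.InputStream;

public class HdfsFileTasks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");   // assumed pseudo-distributed setup
        FileSystem fs = FileSystem.get(conf);

        // Adding files and directories
        fs.mkdirs(new Path("/user/chuck"));
        fs.copyFromLocalFile(new Path("example.txt"), new Path("/user/chuck/example.txt"));

        // Retrieving a file: copy it back to the local filesystem, then print its contents
        fs.copyToLocalFile(new Path("/user/chuck/example.txt"), new Path("example_copy.txt"));
        try (InputStream in = fs.open(new Path("/user/chuck/example.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        // Deleting a file (the second argument controls recursive deletion)
        fs.delete(new Path("/user/chuck/example.txt"), false);

        fs.close();
    }
}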

SAMPLE INPUT:
Input can be any data of structured, unstructured or semi-structured type.

EXPECTED OUTPUT:

4. Run a basic Word Count Map Reduce program to understand Map Reduce
Paradigm.
DESCRIPTION:--
MapReduce is the heart of Hadoop. It is this programming paradigm that allows for
massive scalability across hundreds or thousands of servers in a Hadoop cluster.
The MapReduce concept is fairly simple to understand for those who are familiar with clustered
scale-out data processing solutions. The term MapReduce actually refers to two separate and
distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data
and converts it into another set of data, where individual elements are broken down into tuples
(key/value pairs). The reduce job takes the output from a map as input and combines those data
tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce
job is always performed after the map job.

ALGORITHM

MAP REDUCE PROGRAM


WordCount is a simple program which counts the number of occurrences of each word in a given
text input data set. WordCount fits very well with the MapReduce programming model making it
a great example to understand the Hadoop Map/Reduce programming style. Our implementation
consists of three main parts:
1. Mapper
2. Reducer
3. Driver

Step-1. Write a Mapper


A Mapper overrides the "map" function from the class
"org.apache.hadoop.mapreduce.Mapper", which provides <key, value> pairs as the input. A
Mapper implementation may output
<key, value> pairs using the provided Context.
Input value of the WordCount Map task will be a line of text from the input data file and the key
would be the line number <line_number, line_of_text> . Map task outputs <word, one> for each
word in the line of text.
Pseudo-code
void Map (key, value){
for each word x in value:
output.collect(x, 1);
}
Step-2. Write a Reducer
A Reducer collects the intermediate <key,value> output from multiple map tasks and assembles a
single result. Here, the WordCount program sums up the occurrences of each word and emits
pairs of the form <word, occurrence>.
Pseudo-code
void Reduce (keyword, <list of value>){
for each x in <list of value>:
sum+=x;
final_output.collect(keyword, sum);
}
Step-3. Write Driver
The Driver program configures and runs the MapReduce job. We use the main program to
perform basic configurations such as:
● Job Name : name of this Job
● Executable (Jar) Class: the main executable class. For here, WordCount.
● Mapper Class: class which overrides the "map" function. For here, Map.
● Reducer: class which overrides the "reduce" function. For here, Reduce.
● Output Key: type of output key. For here, Text.
● Output Value: type of output value. For here, IntWritable.
● File Input Path
● File Output Path
A complete runnable WordCount combining these three parts is sketched below.
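
SAMPLE PROGRAM (illustrative sketch):
A minimal version of the complete program, using the org.apache.hadoop.mapreduce API
(Hadoop 2.x); it follows the standard WordCount example shipped with Hadoop, so the class and
method names match the parts described above.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits <word, 1> for each word in the input line
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts for each word and emits <word, occurrence>
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: configures job name, mapper, reducer, key/value types and I/O paths
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

To run it, package the class into a jar and submit it with
hadoop jar wordcount.jar WordCount <HDFS input path> <HDFS output path>;
the word counts appear in the part-r-00000 file under the output directory.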

INPUT:-
Set of Data Related Shakespeare Comedies, Glossary, Poems

OUTPUT:-
5. Write a Map Reduce program that mines weather data. Weather sensors collecting data
every hour at many locations across the globe gather a large volume of log data, which is a
good candidate for analysis with MapReduce, since it is semi structured and record-
oriented.

DESCRIPTION:
Climate change has been attracting a lot of attention for a long time, and its adverse
effects are being felt in every part of the earth. There are many examples of this, such as
rising sea levels, reduced rainfall and increased humidity. The proposed system overcomes
some of the issues encountered with other techniques. In this project we use the concepts of
Big Data and Hadoop. With the proposed architecture we are able to process offline data
stored by the National Climatic Data Centre (NCDC). Through this we are able to find the
maximum and minimum temperature of a year, and to predict the future weather
forecast. Finally, we plot a graph of the obtained MAX and MIN temperatures for each month
of the particular year to visualize the temperature. Based on the previous years' data, the
weather of the coming year is predicted.

ALGORITHM:-
MAPREDUCE PROGRAM

Like WordCount, the weather-mining program fits very well with the MapReduce programming
model, making it a good example of the Hadoop Map/Reduce programming style. Our
implementation consists of three main parts:

1. Mapper
2. Reducer
3. Main program

Step-1. Write a Mapper

A Mapper overrides the "map" function from the class "org.apache.hadoop.mapreduce.Mapper",
which provides <key, value> pairs as the input. A Mapper implementation may output
<key, value> pairs using the provided Context.

Input value of the Map task will be a line of text from the weather data file and the key
would be the line number <line_number, line_of_text>. As shown in the pseudo-code below, the
Map task outputs <max_temp, one> and <min_temp, one> for the temperature readings in the
line of text.

Pseudo-code
void Map (key, value){
for each max_temp x in value:
output.collect(x, 1);
}
void Map (key, value){
for each min_temp x in value:
output.collect(x, 1);
}

Step-2 Write a Reducer

A Reducer collects the intermediate <key,value> output from multiple map tasks and assembles a
single result. Here, the program sums up the occurrences of each temperature value and emits
pairs of the form <temperature, occurrence>.

Pseudo-code

void Reduce (max_temp, <list of value>){


for each x in <list of value>:
sum+=x;
final_output.collect(max_temp, sum);
}
void Reduce (min_temp, <list of value>){
for each x in <list of value>:
sum+=x;
final_output.collect(min_temp, sum);
}

3. Write Driver

The Driver program configures and runs the MapReduce job. We use the main program to
perform basic configurations such as:

● Job Name : name of this Job
● Executable (Jar) Class: the main executable class (here, the driver class of the weather-mining job).
● Mapper Class: class which overrides the "map" function. For here, Map.
● Reducer: class which overrides the "reduce" function. For here, Reduce.
● Output Key: type of output key. For here, Text.
● Output Value: type of output value. For here, IntWritable.
● File Input Path
● File Output Path
A compact runnable sketch of these parts is given below.
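
SAMPLE PROGRAM (illustrative sketch):
A compact sketch of a max-temperature job built from these parts. The real NCDC records use a
fixed-width format; here we assume a simplified input in which every line is year,temperature
(for example 1950,22), so the parsing line is illustrative only. Rather than counting temperature
occurrences as in the pseudo-code, this variant keys records by year and lets the reducer keep the
maximum, which matches the goal stated in the description; a min-temperature job is identical
except that the reducer uses Math.min.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

    // Mapper: parses "year,temperature" and emits <year, temperature>
    public static class TempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");   // assumed simplified record layout
            if (fields.length == 2) {
                context.write(new Text(fields[0].trim()),
                              new IntWritable(Integer.parseInt(fields[1].trim())));
            }
        }
    }

    // Reducer: keeps the maximum temperature seen for each year
    public static class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable val : values) max = Math.max(max, val.get());
            context.write(key, new IntWritable(max));
        }
    }

    // Driver: configured the same way as the WordCount driver
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max temperature");
        job.setJarByClass(MaxTemperature.class);
        job.setMapperClass(TempMapper.class);
        job.setReducerClass(MaxTempReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}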
INPUT:-
Set of Weather Data over the years

OUTPUT:-

6. Implement Matrix Multiplication with Hadoop Map Reduce.


DESCRIPTION:
We can represent a matrix as a relation (table) in an RDBMS where each cell of the
matrix is represented as a record (i, j, value). It is important to understand that this is a very
inefficient representation if the matrix is dense. Say we have 5 rows and 6 columns; then we need
to store only 30 values, but with the above relation we are storing 30 row ids, 30 column ids and
30 values, in other words tripling the data. So a natural question arises: why do we need to store
the matrix in this format? In practice most matrices are sparse. In a sparse matrix not all cells
hold values, so we do not have to store those cells at all, which makes this format very efficient
for storing such matrices.

MapReduce Logic

The logic is to send the calculation of each output cell of the result matrix to a reducer. In
matrix multiplication, the first output cell (0,0) is the multiplication and summation of the
elements of row 0 of matrix A with the elements of column 0 of matrix B. To compute the value
of output cell (0,0) of the resultant matrix in a separate reducer, we use (0,0) as the output key
of the map phase, and the value must carry the values from row 0 of matrix A and column 0 of
matrix B. So in this algorithm the output of the map phase is a <key,value> pair, where the key
represents the output cell location (0,0), (0,1), etc., and the value is the list of all values the
reducer needs for the computation. For example, to calculate the value at output cell (0,0), we
collect the values from row 0 of matrix A and column 0 of matrix B in the map phase and pass
(0,0) as the key, so a single reducer can do the calculation. A short Java sketch of this strategy
is given below.
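
A short Java sketch of this one-reducer-per-output-cell strategy (simpler than the block strategy
detailed in the algorithm below). It assumes each input line has the form A,i,k,value or
B,k,j,value and that the dimensions I (rows of A) and J (columns of B) are passed through the job
configuration as matrix.I and matrix.J; these names and the input layout are illustrative only.
The driver is set up exactly like the WordCount driver of Experiment 4, with Text as both the
output key and value class.

import java.io.IOException;
import java.util.HashMap;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MatrixMultiply {

    // Mapper: each A(i,k) is needed by every output cell in row i, and each B(k,j)
    // by every output cell in column j, so both are replicated to those cell keys.
    public static class CellMapper extends Mapper<LongWritable, Text, Text, Text> {
        private int I, J;   // row count of A/C and column count of B/C

        protected void setup(Context context) {
            // Dimensions are assumed to be supplied through the job configuration
            I = context.getConfiguration().getInt("matrix.I", 0);
            J = context.getConfiguration().getInt("matrix.J", 0);
        }

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] t = value.toString().split(",");        // "A,i,k,v" or "B,k,j,v"
            if (t[0].equals("A")) {
                for (int j = 0; j < J; j++)
                    context.write(new Text(t[1] + "," + j), new Text("A," + t[2] + "," + t[3]));
            } else {
                for (int i = 0; i < I; i++)
                    context.write(new Text(i + "," + t[2]), new Text("B," + t[1] + "," + t[3]));
            }
        }
    }

    // Reducer: the key is one output cell (i,j); the values carry A(i,k) and B(k,j)
    // tagged with k, so terms with matching k are multiplied and summed.
    public static class CellReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            HashMap<Integer, Double> a = new HashMap<>();
            HashMap<Integer, Double> b = new HashMap<>();
            for (Text v : values) {
                String[] t = v.toString().split(",");
                if (t[0].equals("A")) a.put(Integer.parseInt(t[1]), Double.parseDouble(t[2]));
                else                  b.put(Integer.parseInt(t[1]), Double.parseDouble(t[2]));
            }
            double sum = 0;
            for (Integer k : a.keySet())
                if (b.containsKey(k)) sum += a.get(k) * b.get(k);
            if (sum != 0) context.write(key, new Text(Double.toString(sum)));
        }
    }
}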

ALGORITHM
We assume that the input files for A and B are streams of (key,value) pairs in sparse
matrix format, where each key is a pair of indices (i,j) and each value is the corresponding matrix
element value. The output files for matrix C=A*B are in the same format.
We have the following input parameters:

The path of the input file or directory for matrix A.


The path of the input file or directory for matrix B.
The path of the directory for the output files for matrix C.
strategy = 1, 2, 3 or 4.

1. R = the number of reducers.


2. I = the number of rows in A and C.
3. K = the number of columns in A and rows in B.
4. J = the number of columns in B and C.
5. IB = the number of rows per A block and C block.
6. KB = the number of columns per A block and rows per B block.
7. JB = the number of columns per B block and C block.
In the pseudo-code for the individual strategies below, we have intentionally avoided
factoring common code for the purposes of clarity.

Note that in all the strategies the memory footprint of both the mappers and the
reducers is flat at scale.

Note that the strategies all work reasonably well with both dense and sparse matrices. For sparse
matrices we do not emit zero elements. That said, the simple pseudo-code for multiplying the
individual blocks shown here is certainly not optimal for sparse matrices. As a learning exercise,
our focus here is on mastering the MapReduce complexities, not on optimizing the sequential
matrix multiplication algorithm for the individual blocks.

Steps
1. setup ()
2. var NIB = (I-1)/IB+1
3. var NKB = (K-1)/KB+1
4. var NJB = (J-1)/JB+1
5. map (key, value)
6. if from matrix A with key=(i,k) and value=a(i,k)
7. for 0 <= jb < NJB
8. emit (i/IB, k/KB, jb, 0), (i mod IB, k mod KB, a(i,k))
9. if from matrix B with key=(k,j) and value=b(k,j)
10. for 0 <= ib < NIB
emit (ib, k/KB, j/JB, 1), (k mod KB, j mod JB, b(k,j))
Intermediate keys (ib, kb, jb, m) sort in increasing order first by ib, then by kb, then by
jb, then by m. Note that m = 0 for A data and m = 1 for B data.

The partitioner maps intermediate key (ib, kb, jb, m) to a reducer r as follows:

11. r = ((ib*JB + jb)*KB + kb) mod R


12. These definitions for the sorting order and partitioner guarantee that each reducer
R[ib,kb,jb] receives the data it needs for blocks A[ib,kb] and B[kb,jb], with the data for
the A block immediately preceding the data for the B block.
13. var A = new matrix of dimension IBxKB
14. var B = new matrix of dimension KBxJB
15. var sib = -1
16. var skb = -1

Reduce (key, valueList)

17. if key is (ib, kb, jb, 0)


18. // Save the A block.
19. sib = ib
20. skb = kb
21. Zero matrix A
22. for each value = (i, k, v) in valueList A(i,k) = v
23. if key is (ib, kb, jb, 1)
24. if ib != sib or kb != skb return // A[ib,kb] must be zero!
25. // Build the B block.
26. Zero matrix B
27. for each value = (k, j, v) in valueList B(k,j) = v
28. // Multiply the blocks and emit the result.
29. ibase = ib*IB
30. jbase = jb*JB
31. for 0 <= i < row dimension of A
32. for 0 <= j < column dimension of B
33. sum = 0
34. for 0 <= k < column dimension of A = row dimension of B
        sum += A(i,k)*B(k,j)
35. if sum != 0 emit (ibase+i, jbase+j), sum

INPUT:-

Sets of data over different clusters, taken as the rows and columns of the two input matrices.
OUTPUT:-
