Prachi 20CS111 BDALab File
LAB FILE
Submitted in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology
SESSION (2023-2024)
DESCRIPTION:
The java.util package contains all the classes and interfaces for the Collection framework.
SAMPLE INPUT:
Sample Employee Data Set:
(employee.txt)
e100,james,asst.prof,cse,8000,16000,4000,8.7
e101,jack,asst.prof,cse,8350,17000,4500,9.2
e102,jane,assoc.prof,cse,15000,30000,8000,7.8
e104,john,prof,cse,30000,60000,15000,8.8
e105,peter,assoc.prof,cse,16500,33000,8600,6.9
e106,david,assoc.prof,cse,18000,36000,9500,8.3
e107,daniel,asst.prof,cse,9400,19000,5000,7.9
e108,ramu,assoc.prof,cse,17000,34000,9000,6.8
e109,rani,asst.prof,cse,10000,21500,4800,6.4
e110,murthy,prof,cse,35000,71500,15000,9.3
EXPECTED OUTPUT:-
Prints the information of each employee with all of its attributes.
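A minimal Java sketch of how the java.util collection classes can hold and print the sample records; the file name employee.txt and the field layout assumed below follow the sample data set and are used only for illustration.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Reads the comma-separated employee.txt records into a Map keyed by employee id
// and prints every employee with all of its attributes.
public class EmployeeReader {
    public static void main(String[] args) throws IOException {
        Map<String, String[]> employees = new LinkedHashMap<>();
        try (BufferedReader br = new BufferedReader(new FileReader("employee.txt"))) {
            String line;
            while ((line = br.readLine()) != null) {
                // assumed fields: id,name,designation,dept,basic,gross,deductions,rating
                String[] fields = line.split(",");
                employees.put(fields[0], fields);
            }
        }
        for (Map.Entry<String, String[]> e : employees.entrySet()) {
            System.out.println(String.join(" | ", e.getValue()));
        }
    }
}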
2. Perform setting up and installing Hadoop in its three operating modes: Standalone,
Pseudo-distributed, Fully distributed.
DESCRIPTION:
Hadoop is written in Java, so you will need to have Java installed on your machine,
version 6 or later. Sun's JDK is the one most widely used with Hadoop, although others
have been reported to work. Hadoop runs on Unix and on Windows. Linux is the only
supported production platform, but other flavors of Unix (including Mac OS X) can be
used to run Hadoop for development. Windows is only supported as a development
platform, and additionally requires Cygwin to run. During the Cygwin installation
process, you should include the openssh package if you plan to run Hadoop in pseudo-
distributed mode.
ALGORITHM:
STEPS INVOLVED IN INSTALLING HADOOP IN PSEUDO-DISTRIBUTED MODE:-
INPUT:-
ubuntu@localhost> jps
OUTPUT:-
DataNode, NameNode, SecondaryNameNode,
NodeManager, ResourceManager
3. Implement the following file management tasks in Hadoop: adding files and directories, retrieving files, and deleting files from HDFS.
DESCRIPTION:
ALGORITHM:-
SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM HDFS
Step-1
Adding Files and Directories to HDFS
Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data into
HDFS first. Let's create a directory and put a file in it. HDFS has a default working directory of
/user/$USER, where $USER is your login user name. This directory isn't automatically created
for you, though, so let's create it with the mkdir command. For the purpose of illustration, we
use chuck. You should substitute your user name in the example commands.
Step-2
Retrieving Files from HDFS
The Hadoop command get copies files from HDFS back to the local filesystem, while cat displays
a file's contents on the console. To view example.txt, we can run the following command:
hadoop fs -cat example.txt
Step-3
Deleting Files from HDFS
hadoop fs -rm example.txt
● Command for creating a directory in HDFS is "hdfs dfs -mkdir /lendicse".
● Adding a directory is done through the command "hdfs dfs -put lendi_english /".
Step-4
Copying Data from NFS to HDFS
The command for copying from a local directory is "hdfs dfs -copyFromLocal
/home/lendi/Desktop/shakes/glossary /lendicse/".
● View the file by using the command "hdfs dfs -cat /lendi_english/glossary".
● Command for listing of items in Hadoop is "hdfs dfs -ls hdfs://localhost:9000/".
● Command for deleting files is "hdfs dfs -rm -r /kartheek".
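The same file-management tasks can also be performed programmatically. Below is a minimal Java sketch using Hadoop's org.apache.hadoop.fs.FileSystem API to create a directory, copy a local file into HDFS, read it back, and delete it; the NameNode address and paths are illustrative assumptions taken from the commands above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Programmatic equivalents of the mkdir / put / cat / rm shell commands above.
public class HdfsFileOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");    // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        fs.mkdirs(new Path("/lendicse"));                      // hdfs dfs -mkdir /lendicse

        fs.copyFromLocalFile(new Path("/home/lendi/Desktop/shakes/glossary"),
                             new Path("/lendicse/glossary"));  // hdfs dfs -copyFromLocal ...

        try (FSDataInputStream in = fs.open(new Path("/lendicse/glossary"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);    // hdfs dfs -cat ...
        }

        fs.delete(new Path("/lendicse/glossary"), true);       // hdfs dfs -rm -r ...
        fs.close();
    }
}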
SAMPLE INPUT:
Input can be any data in structured, unstructured, or semi-structured format.
EXPECTED OUTPUT:
4. Run a basic Word Count Map Reduce program to understand the Map Reduce
Paradigm.
DESCRIPTION:-
MapReduce is the heart of Hadoop. It is this programming paradigm that allows for
massive scalability across hundreds or thousands of servers in a Hadoop cluster.
The MapReduce concept is fairly simple to understand for those who are familiar with clustered
scale-out data processing solutions. The term MapReduce actually refers to two separate and
distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data
and converts it into another set of data, where individual elements are broken down into tuples
(key/value pairs). The reduce job takes the output from a map as input and combines those data
tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce
job is always performed after the map job.
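As a concrete illustration of the two tasks described above, here is a minimal WordCount sketch using the org.apache.hadoop.mapreduce API; the class and variable names are illustrative, not prescribed by the manual.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: emits <word, 1> for every word in the input line.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce task: sums the 1s for each word and emits <word, occurrence>.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}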
ALGORITHM
INPUT:-
A set of data related to Shakespeare's comedies, a glossary, and poems.
OUTPUT:-
5. Write a Map Reduce program that mines weather data. Weather sensors collecting data
every hour at many locations across the globe gather a large volume of log data, which is a
good candidate for analysis with MapReduce, since it is semi-structured and record-oriented.
DESCRIPTION:
Climate change has been attracting a lot of attention for a long time. The adverse
effects of the changing climate are being felt in every part of the earth. There are many examples of this,
such as rising sea levels, reduced rainfall, and increasing humidity. The proposed system overcomes
some of the issues that occur with other techniques. In this project we use the concept of
Big Data and Hadoop. In the proposed architecture we are able to process offline data
stored by the National Climatic Data Center (NCDC). Through this we are able to find the
maximum and minimum temperature of a year, and to predict the future weather
forecast. Finally, we plot a graph of the obtained MAX and MIN temperatures for each month
of the particular year to visualize the temperature. Based on the previous year's data, the weather
of the coming year is predicted.
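A minimal sketch of the map and reduce tasks for finding the maximum temperature per year; it assumes, purely for illustration, that each input record is a simple comma-separated line of the form year,temperature rather than the raw NCDC record format.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: parses an assumed "year,temperature" line and emits <year, temperature>.
public class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split(",");
        if (parts.length == 2) {
            context.write(new Text(parts[0].trim()),
                          new IntWritable(Integer.parseInt(parts[1].trim())));
        }
    }
}

// Reduce task: keeps the maximum temperature seen for each year.
class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable v : values) {
            max = Math.max(max, v.get());
        }
        context.write(key, new IntWritable(max));
    }
}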
ALGORITHM:-
MAPREDUCE PROGRAM
WordCount is a simple program which counts the number of occurrences of each word in a given
text input data set. WordCount fits very well with the MapReduce programming model making it
a great example to understand the Hadoop Map/Reduce programming style. Our implementation
consists of three main parts:
1. Mapper
2. Reducer
3. Main program
The input value of the WordCount map task will be a line of text from the input data file, and the key
will be the line number: <line_number, line_of_text>. The map task outputs <word, one> for each
word in the line of text.
Pseudo-code
void Map (key, value){
    for each max_temp x in value:
        output.collect(x, 1);
}
void Map (key, value){
    for each min_temp x in value:
        output.collect(x, 1);
}
A Reducer collects the intermediate <key,value> output from multiple map tasks and assembles a
single result. Here, the WordCount program will sum up the occurrences of each word and output pairs as
<word, occurrence>.
Pseudo-code
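A matching reduce sketch, in the same style as the map pseudo-code above, which sums the counts collected for each key:

void Reduce (keyword, <list of value>){
    sum = 0;
    for each x in <list of value>:
        sum = sum + x;
    final_output.collect(keyword, sum);
}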
3. Write Driver
The Driver program configures and runs the MapReduce job. We use the main program to
perform basic configurations such as setting the job name, the data types of the keys and values,
the mapper and reducer classes, and the input and output paths.
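A minimal driver sketch along these lines, assuming the TokenizerMapper and IntSumReducer classes sketched earlier and input/output paths supplied as command-line arguments:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: configures the job name, mapper/reducer classes, key/value types,
// and input/output paths, then submits the job and waits for completion.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}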
OUTPUT:-
6. Write a Map Reduce program that implements Matrix Multiplication.
MapReduce Logic
The logic is to send the calculation part of each output cell of the result matrix to a reducer.
So, in matrix multiplication, the first cell of the output (0,0) is the multiplication and summation of
elements from row 0 of matrix A and elements from column 0 of matrix B. To do the
computation of the value in output cell (0,0) of the resultant matrix in a separate reducer, we need to
use (0,0) as the output key of the map phase, and the value should be the array of values from row 0 of matrix
A and column 0 of matrix B. So, in this algorithm, the
output from the map phase should be a <key,value> pair, where the key represents the output cell
location (0,0), (0,1), etc., and the value is the list of all values required for the reducer to do the
computation. Let us take the example of calculating the value at output cell (0,0). Here we need to
collect the values from row 0 of matrix A and column 0 of matrix B in the map phase and pass (0,0) as the
key, so that a single reducer can do the calculation.
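A minimal sketch of this one-output-cell-per-reducer logic, assuming for illustration that the matrix dimensions are small and hard-coded and that each input line has the form matrixName,row,col,value; these names and formats are assumptions, distinct from the block-based algorithm described below.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// One-step matrix multiplication: every output cell (i,k) of C = A*B is computed by one reduce call.
public class MatrixMultiplyMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final int M = 2, P = 2;   // assumed: A has M rows, B has P columns

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed input line format: matrixName,rowIndex,colIndex,value
        String[] t = value.toString().split(",");
        if (t[0].equals("A")) {
            for (int k = 0; k < P; k++) {    // A(i,j) contributes to every output cell (i,k)
                context.write(new Text(t[1] + "," + k), new Text("A," + t[2] + "," + t[3]));
            }
        } else {
            for (int i = 0; i < M; i++) {    // B(j,k) contributes to every output cell (i,k)
                context.write(new Text(i + "," + t[2]), new Text("B," + t[1] + "," + t[3]));
            }
        }
    }
}

// Reducer: pairs up A(i,j) and B(j,k) by the shared index j, multiplies, and sums.
class MatrixMultiplyReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Map<Integer, Double> aRow = new HashMap<>();
        Map<Integer, Double> bCol = new HashMap<>();
        for (Text v : values) {
            String[] t = v.toString().split(",");
            int j = Integer.parseInt(t[1]);
            double x = Double.parseDouble(t[2]);
            if (t[0].equals("A")) aRow.put(j, x); else bCol.put(j, x);
        }
        double sum = 0;
        for (Map.Entry<Integer, Double> e : aRow.entrySet()) {
            sum += e.getValue() * bCol.getOrDefault(e.getKey(), 0.0);
        }
        context.write(key, new Text(Double.toString(sum)));
    }
}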
ALGORITHM
We assume that the input files for A and B are streams of (key,value) pairs in sparse
matrix format, where each key is a pair of indices (i,j) and each value is the corresponding matrix
element value. The output files for matrix C=A*B are in the same format.
We have the following input parameters: the matrix dimensions I, K, and J (A is an I x K matrix and B is a K x J matrix), and the block sizes IB, KB, and JB used to partition the matrices into blocks.
Note that in all the strategies the memory footprint of both the mappers and the reducers is flat at scale.
Note that the strategies all work reasonably well with both dense and sparse matrices. For sparse
matrices we do not emit zero elements. That said, the simple pseudo-code for multiplying the
individual blocks shown here is certainly not optimal for sparse matrices. As a learning exercise,
our focus here is on mastering the MapReduce complexities, not on optimizing the sequential
matrix multiplication algorithm for the individual blocks.
Steps
1. setup ()
2. var NIB = (I-1)/IB+1
3. var NKB = (K-1)/KB+1
4. var NJB = (J-1)/JB+1
5. map (key, value)
6. if from matrix A with key=(i,k) and value=a(i,k)
7. for 0 <= jb < NJB
8. emit (i/IB, k/KB, jb, 0), (i mod IB, k mod KB, a(i,k))
9. if from matrix B with key=(k,j) and value=b(k,j)
10. for 0 <= ib < NIB
11. emit (ib, k/KB, j/JB, 1), (k mod KB, j mod JB, b(k,j))
Intermediate keys (ib, kb, jb, m) sort in increasing order first by ib, then by kb, then by
jb, then by m. Note that m = 0 for A data and m = 1 for B data.
The partitioner maps intermediate key (ib, kb, jb, m) to a reducer r as follows:
INPUT:-
Sets of data over different clusters are taken as rows and columns.
OUTPUT:-