Big data analytics lab-JD
SUBJECT HANDLED BY :
MS.JAIDHARNI AP/CSE
BIG DATA ANALYTICS LAB
(CCS334)
List of Experiments
1. Setting up and installing Hadoop in its three operating modes: Standalone, Pseudo-distributed and Fully distributed.
2. Hadoop implementation of file management tasks, such as adding files and directories, retrieving files and deleting files.
3. Implementation of matrix multiplication with Hadoop MapReduce.
4. Run a basic Word Count MapReduce program to understand the MapReduce paradigm.
5. Installation of Hive.
6. Installation of HBase.
7. Importing and exporting data from various databases.
EXPNO:1
Installation of Hadoop in its three operating modes
Date:
AIM:-
To set up and install Hadoop in its three operating modes:
Standalone
Pseudo-Distributed
Fully Distributed
DESCRIPTION:
Hadoop is written in Java, so you will need to have Java installed on your machine, version 6 or later. Sun's JDK is the one most widely used with Hadoop, although others have been reported to work.
Hadoop runs on Unix and on Windows. Linux is the only supported production platform, but other flavors of Unix (including Mac OS X) can be used to run Hadoop for development. Windows is only supported as a development platform, and additionally requires Cygwin to run. During the Cygwin installation process, you should include the openssh package if you plan to run Hadoop in pseudo-distributed mode.
ALGORITHM
STEPS INVOLVED IN INSTALLING HADOOP IN STANDALONE MODE:-
3. Store the public key in authorized_keys by using the command cat $HOME/.ssh/id_rsa.pub >>
$HOME/.ssh/authorized_keys
8. Export the Java path and the Hadoop path in ~/.bashrc.
9. Check whether the installation is successful by checking the Java version and the Hadoop version.
10. Check whether the Hadoop instance in standalone mode is working correctly by running an example MapReduce job (for example, the wordcount program from the Hadoop examples jar).
11. If the word count is displayed correctly in the part-r-00000 file, it means that standalone mode is installed successfully.
ALGORITHM
STEPS INVOLVED IN INSTALLING HADOOP IN PSEUDO DISTRIBUTED MODE:-
3. Configure core-site.xml, which contains property tags; each property has a name and a value.
4. Configure hdfs-site.xml.
5. Configure yarn-site.xml.
6. Configure mapred-site.xml.
7. Now format the NameNode by using the command hdfs namenode -format.
8. Type the commands start-dfs.sh and start-yarn.sh to start the daemons: NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager.
9. Run jps to view all running daemons. Create a directory in HDFS by using the command hdfs dfs -mkdir /csedir, enter some data into lendi.txt using the command nano lendi.txt, copy it from the local directory to HDFS using the command hdfs dfs -copyFromLocal lendi.txt /csedir/, and run the sample wordcount jar file (with lendi.txt as input and /newdir as output) to check whether pseudo-distributed mode is working correctly.
10. Display the contents of the output file by using the command hdfs dfs -cat /newdir/part-r-00000.
ALGORITHM
STEPS INVOLVED IN INSTALLING HADOOP IN FULLY DISTRIBUTED MODE:-
1. Stop the daemons running in the single-node (pseudo-distributed) setup:
$ stop-all.sh
2. Decide one node as the NameNode (Master) and the remaining nodes as DataNodes (Slaves).
3. Copy the public key to all three hosts to get password-less SSH access.
Edit the following configuration files on every node:
$ cd $HADOOP_HOME/etc/hadoop
$ nano core-site.xml
$ nano hdfs-site.xml
$ nano slaves
7. Do in the Master Node:
$ start-dfs.sh
$ start-yarn.sh
8. Format the NameNode:
$ hdfs namenode -format
10. END
INPUT
ubuntu@localhost> jps
OUTPUT:
NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager
Result:
Thus Hadoop was installed in its three operating modes (standalone, pseudo-distributed and fully distributed) and the installation was verified by running an example program provided with it.
EXPNO:2
Hadoop Implementation of file management tasks
Date:
AIM:-
Implement the following file management tasks in Hadoop:
Adding files and directories
Retrieving files
Deleting Files
DESCRIPTION:-
HDFS is a scalable distributed filesystem designed to scale to petabytes of data while
running on top of the underlying filesystem of the operating system. HDFS keeps track of where
the data resides in a network by associating the name of its rack (or network switch) with the
dataset. This allows Hadoop to efficiently schedule tasks to those nodes that contain data, or
which are nearest to it, optimizing bandwidth utilization. Hadoop provides a set of command line
utilities that work similarly to the Linux file commands, and serve as your primary interface with
HDFS. We're going to have a look into HDFS by interacting with it from the command line. We
will take a look at the most common file management tasks in Hadoop, which include:
Adding files and directories to HDFS
Retrieving files from HDFS to local filesystem
Deleting files from HDFS
ALGORITHM:-
SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM HDFS
Step-1
Adding Files and Directories to HDFS
Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data into
HDFS first. Let's create a directory and put a file in it. HDFS has a default working directory of
/user/$USER, where $USER is your login user name. This directory isn't automatically created
for you, though, so let's create it with the mkdir command. For the purpose of illustration, we
use chuck. You should substitute your user name in the example commands.
Step-2
Retrieving Files from HDFS
Step-3
Deleting Files from HDFS
Step-4
View the file by using the command "hdfs dfs -cat /lendi_english/glossary".
The command for listing items in HDFS is "hdfs dfs -ls hdfs://localhost:9000/".
The command for deleting files is "hdfs dfs -rm -r /kartheek".
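The same add, retrieve and delete operations can also be performed programmatically through Hadoop's FileSystem Java API. The sketch below is illustrative only: the directory and file names (/csedir, lendi.txt) are examples, and it assumes the Hadoop configuration files are on the classpath so that FileSystem.get() points at your HDFS instance.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileOps {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);          // handle to HDFS (or the local FS in standalone mode)

    // Adding a directory and a file (paths are illustrative)
    Path dir = new Path("/csedir");
    fs.mkdirs(dir);
    fs.copyFromLocalFile(new Path("lendi.txt"), new Path("/csedir/lendi.txt"));

    // Retrieving a file from HDFS to the local filesystem
    fs.copyToLocalFile(new Path("/csedir/lendi.txt"), new Path("lendi_copy.txt"));

    // Listing the contents of the directory
    for (FileStatus status : fs.listStatus(dir)) {
      System.out.println(status.getPath());
    }

    // Deleting the file (set the boolean flag to true for recursive directory deletes)
    fs.delete(new Path("/csedir/lendi.txt"), false);

    fs.close();
  }
}

Compile it against the Hadoop client jars and run it in the same way as the WordCount program of Experiment 4.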
SAMPLE INPUT:
Any data in structured, unstructured or semi-structured format.
EXPECTED OUTPUT:
Result:
Thus the file management tasks (adding, retrieving and deleting files and directories) were implemented in Hadoop successfully.
EXPNO:3
Implementation of Matrix Multiplication with hadoop
Date:
AIM:-
Write a Map Reduce Program that implements Matrix Multiplication.
DESCRIPTION:
We can represent a matrix as a relation (table) in an RDBMS, where each cell in the matrix is
represented as a record (i, j, value). As an example, consider a matrix and its relational
representation. It is important to understand that this relation is very inefficient if the matrix is
dense. Say we have 5 rows and 6 columns; then we need to store only 30 values. But in the above
relation we are storing 30 row_ids, 30 col_ids and 30 values, in other words we are tripling the
data. So a natural question arises: why do we need to store data in this format? In practice most
matrices are sparse. In sparse matrices not all cells have values, so we do not have to store those
cells in the database, which makes this format very efficient for storing such matrices.
MapReduceLogic:
The logic is to send the calculation of each output cell of the result matrix to one reducer.
In matrix multiplication, the first cell of the output, (0,0), is the sum of products of the elements
from row 0 of matrix A and the elements from column 0 of matrix B. To compute the value of
output cell (0,0) of the resultant matrix in a separate reducer, we need to use (0,0) as the output
key of the map phase, and the value should carry the values from row 0 of matrix A and column 0
of matrix B. So in this algorithm the output of the map phase is a <key, value> pair where the key
represents the output cell location, (0,0), (0,1), etc., and the value is the list of all values required
by the reducer to do the computation. For example, to calculate the value at output cell (0,0) we
collect the values from row 0 of matrix A and column 0 of matrix B in the map phase and pass
(0,0) as the key, so that a single reducer can do the calculation for that cell.
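A minimal Java sketch of this map/reduce logic is given below. It assumes each input line has the form matrixName,i,j,value (for example A,0,1,3.5) and that the dimensions of A (m x n) and B (n x p) are passed through the job configuration; the input format and the property names m, n, p are illustrative assumptions, not fixed by the algorithm.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MatrixMultiply {
  // Mapper: emit each element of A and B to every output cell that needs it.
  public static class MatMapper extends Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value, Context con)
        throws IOException, InterruptedException {
      Configuration conf = con.getConfiguration();
      int m = conf.getInt("m", 0);                 // rows of A
      int p = conf.getInt("p", 0);                 // columns of B
      String[] rec = value.toString().split(",");  // e.g. "A,0,1,3.5"
      if (rec[0].equals("A")) {
        int i = Integer.parseInt(rec[1]);
        int k = Integer.parseInt(rec[2]);
        for (int j = 0; j < p; j++)                // A(i,k) is needed by every output cell (i,j)
          con.write(new Text(i + "," + j), new Text("A," + k + "," + rec[3]));
      } else {
        int k = Integer.parseInt(rec[1]);
        int j = Integer.parseInt(rec[2]);
        for (int i = 0; i < m; i++)                // B(k,j) is needed by every output cell (i,j)
          con.write(new Text(i + "," + j), new Text("B," + k + "," + rec[3]));
      }
    }
  }

  // Reducer: key = output cell (i,j); multiply matching A(i,k) and B(k,j) values and sum.
  public static class MatReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context con)
        throws IOException, InterruptedException {
      int n = con.getConfiguration().getInt("n", 0);   // columns of A = rows of B
      double[] aRow = new double[n];
      double[] bCol = new double[n];
      for (Text v : values) {
        String[] rec = v.toString().split(",");        // "A,k,value" or "B,k,value"
        if (rec[0].equals("A")) aRow[Integer.parseInt(rec[1])] = Double.parseDouble(rec[2]);
        else                    bCol[Integer.parseInt(rec[1])] = Double.parseDouble(rec[2]);
      }
      double sum = 0;
      for (int k = 0; k < n; k++) sum += aRow[k] * bCol[k];
      con.write(key, new Text(String.valueOf(sum)));
    }
  }
}

A driver similar to the one in the Word Count program of Experiment 4 would set m, n and p in the Configuration and register these two classes with the Job.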
ALGORITHM
We assume that the input files for A and B are streams of (key,value) pairs in sparse
matrix format, where each key is a pair of indices (i,j) and each value is the corresponding matrix
element value. The output files for matrix C=A*B are in the same format.
Steps
1. setup ()
2. var NIB = (I-1)/IB+1
3. var NKB = (K-1)/KB+1
4. var NJB = (J-1)/JB+1
5. map (key, value)
6. if from matrix A with key=(i,k) and value=a(i,k)
7. for 0 <= jb < NJB
8. emit (i/IB, k/KB, jb, 0), (i mod IB, k mod KB, a(i,k))
9. if from matrix B with key=(k,j) and value=b(k,j)
10. for 0 <= ib < NIB
emit (ib, k/KB, j/JB, 1), (k mod KB, j mod JB, b(k,j))
Intermediate keys (ib, kb, jb, m) sort in increasing order first by ib, then by kb, then by jb,
then by m. Note that m = 0 for A data and m = 1 for B data.
The partitioner maps intermediate key (ib, kb, jb, m) to a reducer r as follows:
11. r = ((ib*JB + jb)*KB + kb) mod R
12. These definitions for the sorting order and partitioner guarantee that each reducer
R[ib,kb,jb] receives the data it needs for blocks A[ib,kb] and B[kb,jb], with the data for
the A block immediately preceding the data for the B block.
13. var A = new matrix of dimension IBxKB
14. var B = new matrix of dimension KBxJB
15. var sib = -1
16. var skb = -1
OUTPUT
Result:
Thus the MapReduce program for matrix multiplication was implemented and executed successfully.
EXPNO:4
Word count Map Reduce program
Date:
AIM: To develop a MapReduce program to calculate the frequency of a given word in a given file.
Map Function – It takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key-value pairs).
Input
Set of data:
Bus,Car,bus,car,train,car,bus,car,train,bus,TRAIN,BUS,buS,caR,CAR,car,BUS,TRAIN
Output
Convert into another set of data
(Key, Value)
(Bus,1),(Car,1),(bus,1),(car,1),(train,1),(car,1),(bus,1),(car,1),(train,1),(bus,1),(TRAIN,1),(BUS,1),(buS,1),(caR,1),(CAR,1),(car,1),(BUS,1),(TRAIN,1)
Reduce Function – Takes the output from Map as an input and combines those data tuples into a smaller set of tuples.
Example – (Reduce function in Word Count)
Input: Set of tuples (output of Map function)
(Bus,1),(Car,1),(bus,1),(car,1),(train,1),(car,1),(bus,1),(car,1),(train,1),(bus,1),(TRAIN,1),(BUS,1),(buS,1),(caR,1),(CAR,1),(car,1),(BUS,1),(TRAIN,1)
Output: Converts into a smaller set of tuples
(BUS,7),(CAR,7),(TRAIN,4)
Workflow of the Program
1. Splitting – The splitting parameter can be anything, e.g. splitting by space, comma, semicolon, or even by a new line ('\n').
2. Mapping – as explained above.
3. Intermediate splitting – the entire process runs in parallel on different clusters. In order to group them in the Reduce phase, the similar KEY data should be on the same cluster.
4. Reduce – it is nothing but mostly a group-by phase.
5. Combining – The last phase where all the data (the individual result set from each cluster) is combined together to form a result.
Now let's see the Word Count program in Java.
Make sure that Hadoop is installed on your system with the Java JDK.
Steps to follow:
Step 1. Open Eclipse > File > New > Java Project > (Name it – MRProgramsDemo) > Finish.
Step 2. Right Click > New > Package (Name it – PackageDemo) > Finish.
Step 3. Right Click on Package > New > Class (Name it – WordCount).
Step 4. Add the following reference libraries:
/usr/lib/hadoop-0.20/hadoop-core.jar
/usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar
package PackageDemo;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration c = new Configuration();
    String[] files = new GenericOptionsParser(c, args).getRemainingArgs();
    Path input = new Path(files[0]);
    Path output = new Path(files[1]);
    Job j = new Job(c, "wordcount");
    j.setJarByClass(WordCount.class);
    j.setMapperClass(MapForWordCount.class);
    j.setReducerClass(ReduceForWordCount.class);
    j.setOutputKeyClass(Text.class);
    j.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(j, input);
    FileOutputFormat.setOutputPath(j, output);
    System.exit(j.waitForCompletion(true) ? 0 : 1);
  }

  // Mapper: splits each line on commas and emits (WORD, 1) for every word.
  public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context con)
        throws IOException, InterruptedException {
      String line = value.toString();
      String[] words = line.split(",");
      for (String word : words) {
        Text outputKey = new Text(word.toUpperCase().trim());
        IntWritable outputValue = new IntWritable(1);
        con.write(outputKey, outputValue);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text word, Iterable<IntWritable> values, Context con)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      con.write(word, new IntWritable(sum));
    }
  }
}
Make the Jar File
Right Click on Project > Export > Select export destination as Jar File > Next > Finish.
To move the input file into Hadoop directly, open the terminal and enter the following command:
[training@localhost ~]$ hadoop fs -put wordcountFile wordCountFile
Run the Jar file
(hadoop jar jarfilename.jar packageName.ClassName PathToInputTextFile PathToOutputDirectory)
[training@localhost ~]$ hadoop jar MRProgramsDemo.jar PackageDemo.WordCount wordCountFile MRDir1
Open the result:
[training@localhost ~]$ hadoop fs -ls MRDir1
Found 3 items
-rw-r--r--   1 training supergroup          0 2016-02-23 03:36 /user/training/MRDir1/_SUCCESS
drwxr-xr-x   - training supergroup          0 2016-02-23 03:36 /user/training/MRDir1/_logs
-rw-r--r--   1 training supergroup         20 2016-02-23 03:36 /user/training/MRDir1/part-r-00000
[training@localhost ~]$ hadoop fs -cat MRDir1/part-r-00000
BUS 7
CAR 7
TRAIN 4
Result:
Thus the Word Count MapReduce program was implemented and executed successfully.
EXPNO:5
Date:
Installation of Hive
Downloading Hive
We use hive-0.14.0 in this tutorial. You can download it by visiting the following link:
http://apache.petsads.us/hive/hive-0.14.0/. Let us assume it gets downloaded into the /Downloads directory.
Here, we download the Hive archive named "apache-hive-0.14.0-bin.tar.gz" for this tutorial. The following
command is used to verify the download:
$ cd Downloads
$ ls
apache-hive-0.14.0-bin.tar.gz
Installing Hive
The following steps are required for installing Hive on your system. Let us assume the Hive archive is
downloaded onto the /Downloads directory.
The following commands are used to verify the download and extract the Hive archive:
$ tar zxvf apache-hive-0.14.0-bin.tar.gz
$ ls
apache-hive-0.14.0-bin apache-hive-0.14.0-bin.tar.gz
We need to copy the files as the super user ("su -"). The following commands are used to copy the files from
the extracted directory to the /usr/local/hive directory.
$ su -
passwd:
# cd /home/user/Download
# mv apache-hive-0.14.0-bin /usr/local/hive
# exit
You can set up the Hive environment by appending the following lines to ~/.bashrc file:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/Hadoop/lib/*:.
export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.
$ source ~/.bashrc
Configuring Hive
To configure Hive with Hadoop, you need to edit the hive-env.sh file, which is placed in the
$HIVE_HOME/conf directory. The following commands redirect to Hive config folder and copy the
template file:
$ cd $HIVE_HOME/conf
$ cp hive-env.sh.template hive-env.sh
Edit the hive-env.sh file by appending the following line:
export HADOOP_HOME=/usr/local/hadoop
Hive installation is completed successfully. Now you require an external database server to configure
Metastore. We use Apache Derby database.
Result:
Thus Hive was downloaded, installed and configured successfully.
EXPNO:6
Installation of HBase
Date:
Installing HBase
We can install HBase in any of three modes: Standalone mode, Pseudo-Distributed mode, and Fully
Distributed mode. The commands below download a stable HBase release and extract it:
$ cd /usr/local/
$ wget http://www.interior-dsgn.com/apache/hbase/stable/hbase-0.98.8-hadoop2-bin.tar.gz
$ tar -zxvf hbase-0.98.8-hadoop2-bin.tar.gz
Shift to super user mode and move the HBase folder to /usr/local as shown below.
$ su
$ password: enter your password here
mv hbase-0.98.8-hadoop2/* /usr/local/HBase/
Before proceeding with HBase, you have to edit the following files and configure HBase.
hbase-env.sh
Set the Java home for HBase by opening the hbase-env.sh file in the conf folder. Edit the JAVA_HOME
environment variable and change the existing path to your current JAVA_HOME value as shown below.
cd /usr/local/HBase/conf
gedit hbase-env.sh
This will open the env.sh file of HBase. Now replace the existing JAVA_HOME value with your current
value as shown below.
export JAVA_HOME=/usr/lib/jvm/java-1.7.0
hbase-site.xml
This is the main configuration file of HBase. Set the data directory to an appropriate location by opening the
HBase home folder in /usr/local/HBase. Inside the conf folder, you will find several files, open the hbase-
site.xml file as shown below.
#cd /usr/local/HBase/
#cd conf
# gedit hbase-site.xml
Inside the hbase-site.xml file, you will find the <configuration> and </configuration> tags. Within them, set
the HBase directory under the property key with the name “hbase.rootdir” as shown below.
<configuration>
<!-- Here you have to set the path where you want HBase to store its files. -->
<property>
<name>hbase.rootdir</name>
<value>file:/home/hadoop/HBase/HFiles</value>
</property>
<!-- Here you have to set the path where you want HBase to store its built-in ZooKeeper files. -->
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/hadoop/zookeeper</value>
</property>
</configuration>
With this, the HBase installation and configuration part is successfully complete. We can start HBase by using
start-hbase.sh script provided in the bin folder of HBase. For that, open HBase Home Folder and run HBase
start script as shown below.
$cd /usr/local/HBase/bin
$./start-hbase.sh
If everything goes well, when you try to run HBase start script, it will prompt you a message saying that
HBase has started.
Configuring HBase
Before proceeding with HBase, configure Hadoop and HDFS on your local system or on a remote system and
make sure they are running. Stop HBase if it is running.
hbase-site.xml
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
This property specifies the mode in which HBase should run. In the same file, change hbase.rootdir from the
local file system path to your HDFS instance address, using the hdfs:// URI syntax. Here we are running HDFS
on the localhost at port 8030.
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:8030/hbase</value>
</property>
Starting HBase
After configuration is over, browse to HBase home folder and start HBase using the following command.
$cd /usr/local/HBase
$bin/start-hbase.sh
HBase creates its directory in HDFS. To see the created directory, browse to the Hadoop bin folder and type the
following command.
$ ./bin/hadoop fs -ls /hbase
Found 7 items
drwxr-xr-x - hbase users 0 2014-06-25 18:58 /hbase/.tmp
drwxr-xr-x - hbase users 0 2014-06-25 21:49 /hbase/WALs
drwxr-xr-x - hbase users 0 2014-06-25 18:48 /hbase/corrupt
drwxr-xr-x - hbase users 0 2014-06-25 18:58 /hbase/data
-rw-r--r-- 3 hbase users 42 2014-06-25 18:41 /hbase/hbase.id
-rw-r--r-- 3 hbase users 7 2014-06-25 18:41 /hbase/hbase.version
drwxr-xr-x - hbase users 0 2014-06-25 21:49 /hbase/oldWALs
Backup masters and additional region servers can be started with the helper scripts in the bin folder. For
example, to start backup masters with port offsets 2 and 4:
$ ./bin/local-master-backup.sh 2 4
To kill a backup master, you need its process id, which will be stored in a file named "/tmp/hbase-USER-X-
master.pid". You can kill the backup master using its process id, for example:
$ cat /tmp/hbase-USER-1-master.pid | xargs kill -9
Additional region servers are started and stopped with the local-regionservers.sh script, for example:
$ ./bin/local-regionservers.sh stop 3
Starting the HBase Shell
After installing HBase successfully, you can start the HBase Shell. Given below is the sequence of steps to be
followed to start the HBase shell. Open the terminal and log in as the super user.
Browse through Hadoop home sbin folder and start Hadoop file system as shown below.
$cd $HADOOP_HOME/sbin
$start-all.sh
Start HBase
Browse through the HBase root directory bin folder and start HBase.
$cd /usr/local/HBase
$./bin/start-hbase.sh
Start the Region Server
$ ./bin/local-regionservers.sh start 3
Start the HBase Shell
$ cd bin
$ ./hbase shell
This will give you the HBase Shell Prompt as shown below.
hbase(main):001:0>
To access the web interface of HBase, type the following url in the browser.
http://localhost:60010
This interface lists your currently running Region servers, backup masters and HBase tables.
HBase Tables
Before proceeding with programming, set the classpath to the HBase libraries in the .bashrc file. Open .bashrc in
any editor as shown below.
$ gedit ~/.bashrc
Set the classpath for the HBase libraries (the lib folder in HBase) in it as shown below:
export CLASSPATH=$CLASSPATH:/usr/local/HBase/lib/*
This is to prevent the "class not found" exception while accessing HBase using the Java API.
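As an illustration of the Java API mentioned above, the sketch below creates a table, inserts one row and reads it back using the classic 0.9x-era HBase client classes (HBaseAdmin, HTable); newer HBase releases replace these with ConnectionFactory and Table. The table name, column family and values are examples only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath

    // Create a table 'emp' with one column family 'personal' (names are illustrative)
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("emp"));
    desc.addFamily(new HColumnDescriptor("personal"));
    if (!admin.tableExists("emp")) {
      admin.createTable(desc);
    }

    // Insert one row
    HTable table = new HTable(conf, "emp");
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("raju"));
    table.put(put);

    // Read the row back
    Get get = new Get(Bytes.toBytes("row1"));
    Result result = table.get(get);
    byte[] name = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
    System.out.println("name = " + Bytes.toString(name));

    table.close();
    admin.close();
  }
}

Compile and run it with the HBase lib jars (added to the classpath above) available at both compile time and run time.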
Result:
Thus HBase was installed and configured successfully.
EXPNO:7
Importing and exporting data from various database
Date:
SQL Server is a very popular relational database, and it is used across many software industries. In MS SQL
Server, two sorts of databases are available:
System databases
User databases
In this experiment, we will learn to export and import SQL Server databases using Microsoft SQL Server
Management Studio. Exporting and importing serve as a backup plan for developers.
Step 1: Open "Microsoft SQL Server Management Studio". Click on "File" > "New" and select "Database Engine Query".
Step 2: Create a new database named "college".
Query:
CREATE DATABASE college;
Output:
Step 3: Select the newly created database "college".
Query:
USE college;
Output:
Step 4: Create a table in the "college" database.
Query:
Output:
Step 5: Insert records into the table.
Query:
Output:
Exporting a SQL Server Database:
After creating a database in Microsoft SQL Server, let's see how exporting takes place.
Step 1: Open the Object Explorer, right-click on the database that you want to export, click the "Tasks"
option and select "Export Data-Tier Application".
Step 2: Click Next and, by browsing, select the destination folder in which you have to save the database file.
The file name should be the same as the database name (here "college"); click "Next" and "Finish". You
will get a dialogue box showing the result of exporting.
Importing a SQL Server Database:
Step 1: Right-click on the Databases folder, select "Import Data-Tier Application" and click "Next".
Step 2: Select the file which you have exported and change the name of the database;
here we changed the database name from "college" to "college_info". Click "Next" and a dialogue box
appears showing the result of importing.
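As an optional programmatic check that the import worked, the imported database can also be queried from Java over JDBC. The sketch below is illustrative only: it assumes the Microsoft JDBC driver (mssql-jdbc) is on the classpath, that SQL Server accepts SQL authentication on localhost:1433, and that a table named student exists in the imported database; adjust the connection details, credentials and table name to your setup.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class VerifyImport {
  public static void main(String[] args) throws Exception {
    // Connection details are placeholders; adjust host, port, database, user and password.
    String url = "jdbc:sqlserver://localhost:1433;databaseName=college_info;encrypt=false";
    try (Connection con = DriverManager.getConnection(url, "sa", "your_password");
         Statement st = con.createStatement();
         // Replace 'student' with the table you created in the "college" database.
         ResultSet rs = st.executeQuery("SELECT * FROM student")) {
      while (rs.next()) {
        System.out.println(rs.getString(1));   // print the first column of each row
      }
    }
  }
}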
Result:
Thus the SQL Server database was exported and imported successfully.