Lab Manual
Course: BE Computer Engineering
Semester: VIII
Subject: Big Data Analytics
Prepared by,
Assistant Professor,
Computer Engineering
C. K. Pithawalla College of Engineering and Technology, Surat
Lab Manual – Big Data Analytics
Index
Assignment 1
1.1 Understand and demonstrate List, Set and Map in Java
1.2 Student basic information management using various collection types
Assignment 2
Read Operations
2.3 Web based application for Student registration using PHP and MongoDB
Assignment 3
CONFIGURATION
3.2 Understand the overall programming architecture using Map Reduce API
MapReduce Implementation
Assignment 4
4.2 Creating the HDFS tables and loading them in Hive and learn joining of tables in Hive
Assignment 5
Assignment 1
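1.1 Understand and demonstrate List, Set and Map in Java.
A minimal sketch for this part, assuming only the standard java.util collection classes: a List keeps insertion order and allows duplicates, a Set rejects duplicates, and a Map stores key-value pairs. The values used are illustrative. The student-management program for part 1.2 follows it.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
public class CollectionsDemo {
    public static void main(String[] args) {
        // List: ordered, duplicates allowed
        List<String> branches = new ArrayList<>();
        branches.add("Computer");
        branches.add("Civil");
        branches.add("Computer"); // duplicate entry is kept
        System.out.println("List: " + branches);
        // Set: duplicates are ignored
        Set<String> uniqueBranches = new HashSet<>(branches);
        System.out.println("Set: " + uniqueBranches);
        // Map: unique keys mapped to values
        Map<String, String> student = new HashMap<>();
        student.put("EnrollNo", "CE001");
        student.put("Name", "Ravi");
        System.out.println("Map: " + student);
    }
}
1.2 Student basic information management using various collection types.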
/*
* To change this license header, choose License Headers in Project Properties.
* To change this template file, choose Tools | Templates
* and open the template in the editor.
*/
package setlistmapdemo;
import java.util.*;
import java.io.InputStreamReader;
import java.io.BufferedReader;
import java.io.IOException;
/**
*
* @author Yogesh Kapuriya
*/
public class SetListMapDemo {
ArrayList students;
BufferedReader ip;
public SetListMapDemo() {
this.students = new ArrayList();
this.ip = new BufferedReader(new InputStreamReader(System.in));
}
/**
* @param args the command line arguments
*/
public static void main(String[] args) throws IOException {
SetListMapDemo MainProg = new SetListMapDemo();
String enrollNo, stuName, contactNo, dob;
main_loop:
while (true) {
System.out.println("Enter your Choice");
System.out.println("1. Add Student");
System.out.println("2. Find Student");
System.out.println("3. Update Student Detail");
System.out.println("4. Delete Student");
System.out.println("5. Get Count of Students");
System.out.println("6. List all Students");
System.out.println("7. Exit");
String choice = MainProg.ip.readLine();
int ch = Integer.parseInt(choice);
switch (ch) {
case 1:
System.out.println("Enter Enrollment No :");
enrollNo = MainProg.ip.readLine();
System.out.println("Enter Name :");
stuName = MainProg.ip.readLine();
System.out.println("Enter Contact No :");
contactNo = MainProg.ip.readLine();
System.out.println("Enter DOB :");
dob = MainProg.ip.readLine();
MainProg.addStudent(enrollNo, stuName, contactNo, dob);
break;
case 2:
System.out.println("Enter Enrollment No of Student to be
Found :");
enrollNo = MainProg.ip.readLine();
HashMap s = MainProg.findStudent(enrollNo);
if (s.isEmpty()) {
System.out.println("Student not Found");
} else {
System.out.println("Student Detail : Enrollment No - "
+ s.get("EnrollNo") + ", Name : " + s.get("Name") + ", Contact No : " +
s.get("ContactNo") + ", DOB : " + s.get("DOB"));
}
break;
case 3:
System.out.println("Enter Enrollment No of Student to be
Updated :");
enrollNo = MainProg.ip.readLine();
MainProg.updateStudent(enrollNo);
break;
case 4:
System.out.println("Enter Enrollment No of Student to be
Deleted :");
enrollNo = MainProg.ip.readLine();
MainProg.deleteStudent(enrollNo);
break;
case 5:
int count = MainProg.getStudentCount();
System.out.println("Total Students : " + count);
break;
case 6:
System.out.println("List of Stundets");
MainProg.listStudents();
break;
case 7:
break main_loop;
}
}
}
/**
* Adds new Student to List of students
*
* @param enroll - Enrollment no of student
* @param name - Name of student
* @param contact - Contact no of student
* @param dob - Date of birth in dd-mm-yyyy format
*/
public void addStudent(String enroll, String name, String contact, String dob) {
HashMap student = new HashMap();
student.put("EnrollNo", enroll);
student.put("Name", name);
student.put("ContactNo", contact);
student.put("DOB", dob);
this.students.add(student);
}
/**
* Finds student with given enrollment no
*
* @param enroll
* @return Student Object or empty object
*/
public HashMap findStudent(String enroll) {
HashMap stu;
Iterator itr = students.iterator();
while (itr.hasNext()) {
stu = (HashMap) itr.next();
if (enroll.equalsIgnoreCase((String) stu.get("EnrollNo"))) {
System.out.println("Here");
return stu;
}
}
stu = new HashMap();
return stu;
}
/**
* Updates student detail
*
* @param enroll
* @throws java.io.IOException
*/
public void updateStudent(String enroll) throws IOException {
HashMap stu = this.findStudent(enroll);
if (stu.isEmpty()) {
System.out.println("Student not Found");
} else {
String stuName, contactNo, dob;
System.out.println("Student Current Detail : ");
/**
* Delete student with given enrollment no
* @param enroll
*/
public void deleteStudent(String enroll) throws IOException {
HashMap stu = this.findStudent(enroll);
if (stu.isEmpty()) {
System.out.println("Student not Found");
} else {
System.out.println("Student Detail : Enrollment No - " +
stu.get("Name") + ", Name : " + stu.get("ContactNo"));
System.out.println("Are you sure? Type - Yes or No");
String choice = this.ip.readLine();
if (choice.equalsIgnoreCase("no")) {
System.out.println("Operation cancelled !");
} else {
this.students.remove(this.students.indexOf(stu));
}
}
}
/**
* Get count of students in student list
*
* @return number of students in the list
*/
public int getStudentCount() {
return this.students.size();
}
/**
* List all students present in List
*/
public void listStudents() {
Iterator itr = students.iterator();
while (itr.hasNext()) {
HashMap s = (HashMap) itr.next();
System.out.println("Student Detail : Enrollment No - " +
s.get("EnrollNo") + ", Name : " + s.get("Name") + ", Contact No : " +
s.get("ContactNo") + ", DOB : " + s.get("DOB"));
}
}
}
Assignment 2
MongoDB is already included in the Ubuntu package repositories, but the official MongoDB repository
provides the most up-to-date version and is the recommended way of installing the software. In this step,
we will add this official repository to our server.
Ubuntu ensures the authenticity of software packages by verifying that they are signed with GPG keys,
so we first have to import the key for the official MongoDB repository.
Next, we have to add the MongoDB repository details so apt will know where to download the packages
from. After adding the repository details, we need to update the package list and then install the
mongodb-org package. This installs several packages containing the latest stable version of MongoDB
along with helpful management tools for the MongoDB server.
In order to properly launch MongoDB as a service on Ubuntu 16.04, we additionally need to create a
unit file describing the service. A unit file tells systemd how to manage a resource. The most common
unit type is a service, which determines how to start or stop the service, when it should be automatically
started at boot, and whether it is dependent on other software to run.
We'll create a unit file to manage the MongoDB service. Create a configuration file named
mongodb.service in the /etc/systemd/system directory using nano or your favorite text editor.
Paste in the following contents, then save and close the file.
[Unit]
Description=High-performance, schema-free document-oriented database
After=network.target
[Service]
User=mongodb
ExecStart=/usr/bin/mongod --quiet --config /etc/mongod.conf
[Install]
WantedBy=multi-user.target
The Unit section contains the overview (e.g. a human-readable description for MongoDB service)
as well as dependencies that must be satisfied before the service is started. In our case, MongoDB
depends on networking already being available, hence network.target here.
The Service section specifies how the service should be started. The User directive specifies that the
server will be run under the mongodb user, and the ExecStart directive defines the startup command
for the MongoDB server.
The last section, Install, tells systemd when the service should be automatically started. The
multi-user.target is a standard system startup sequence, which means the server will be
automatically started during boot.
Next, start the newly created service with systemctl. While there is no output to this command, you can
use systemctl status to check that the service has started properly.
The last step is to enable automatic starting of MongoDB when the system starts.
The MongoDB server is now configured and running, and you can manage the MongoDB service using the
systemctl command (e.g. sudo systemctl stop mongodb, sudo systemctl start mongodb).
Create Operations
Create or insert operations add new documents to a collection. If the collection does not currently exist,
insert operations will create the collection. MongoDB provides the following methods to insert
documents into a collection:
db.collection.insertOne()
db.collection.insertMany()
A combined Java example covering all four operation types is given after the Delete Operations section below.
Read Operations
Read operations retrieve documents from a collection, i.e. they query a collection for documents. MongoDB
provides the following methods to read documents from a collection:
db.collection.find()
You can specify query filters or criteria that identify the documents to return.
Update Operations
Update operations modify existing documents in a collection. MongoDB provides the following methods
to update documents of a collection:
db.collection.updateOne()
db.collection.updateMany()
db.collection.replaceOne()
You can specify criteria, or filters, that identify the documents to update. These filters use the same
syntax as read operations.
Delete Operations
Delete operations remove documents from a collection. MongoDB provides the following methods to
delete documents of a collection:
db.collection.deleteOne()
db.collection.deleteMany()
You can specify criteria, or filters, that identify the documents to remove. These filters use the same
syntax as read operations.
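The sketch below demonstrates the four operation types using the MongoDB Java driver (mongodb-driver-sync). It assumes mongod is running on localhost:27017; the database, collection and field names are illustrative.
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Updates;
import org.bson.Document;
import static com.mongodb.client.model.Filters.eq;
public class MongoCrudDemo {
    public static void main(String[] args) {
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        MongoDatabase db = client.getDatabase("college");
        MongoCollection<Document> students = db.getCollection("students");
        // Create: insert a new student document (the collection is created if missing).
        students.insertOne(new Document("enrolment_no", "CE001")
                .append("name", "Ravi Patel")
                .append("contact_no", "9876543210")
                .append("branch", "Computer"));
        // Read: find the document matching the filter.
        Document found = students.find(eq("enrolment_no", "CE001")).first();
        System.out.println("Found: " + found.toJson());
        // Update: set a new contact number on the matched document.
        students.updateOne(eq("enrolment_no", "CE001"), Updates.set("contact_no", "9123456780"));
        // Delete: remove the matched document.
        students.deleteOne(eq("enrolment_no", "CE001"));
        client.close();
    }
}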
2.3 Web based application for Student registration using PHP and MongoDB.
Fields to be managed for a student are: <enrolment_no, name, email, contact_no, date_of_birth, branch,
address, tution_fees>. Implement the following,
Assignment 3
INSTALL SSH
Generate keys
ssh localhost
password: hadoop
It gets connected.
INSTALL JAVA
(for multinode, copy the same folder to all machines using the command: sudo rcp -r
<username>@<machine_name>:/usr/lib/jvm/ /usr/lib/)
cd /usr/lib/jvm
export JAVA_HOME="/usr/lib/jvm/jdk1.7.0_67"
export PATH="$PATH:$JAVA_HOME/bin"
To test:
exec bash
echo $PATH
cd ~
INSTALLING HADOOP
cd ~
(for multinode, copy the same folder to all machines using the command: sudo rcp -r
<username>@<machine_name>:/usr/local/hadoop/ /usr/local/)
password: hadoop
Check ---
cd /usr/local/hadoop
ls
cd ~
export HADOOP_PREFIX=/usr/local/hadoop
export PATH=$PATH:$HADOOP_PREFIX/bin
To test:
exec bash
echo $PATH
cd /usr/local/hadoop/conf
Edit hadoop-env.sh and add
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-sun-amd64
Then search for the commented line
#export HADOOP_OPTS
and replace it with
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
CONFIGURATION
Especially for a multinode deployment, provide the hostname of the namenode here, not your own system name.
core-site.xml:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://coed161:10001</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
</property>
</configuration>
mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>coed161:10002</value>
</property>
</configuration>
Create the temp directory (/usr/local/hadoop/tmp) referred to above.
password: hadoop
start-all.sh
jps
Open Firefox and browse the Hadoop administration pages:
http://coed161:50070/dfshealth.jsp
Look for live nodes and browse the dfs file system
http://coed161:50030/jobtracker.jsp
stop-all.sh
Firewall check
sudo iptables -L -n
Benchmarking demo
Copy the hadoop-examples-1.2.1.jar file into a folder called binary in the home directory, where we will
store all jar files and executables.
Change the output directory each time, or remove the folder after each run, otherwise the job will fail.
To remove output directories use: hadoop fs -rmr <output_dir>
3.2 Understand the overall programming architecture using Map Reduce API
Hadoop MapReduce can be defined as a software programming framework used to process large volumes
of data (at the terabyte level) in a parallel environment of clustered nodes. The cluster consists of thousands
of nodes of commodity hardware. The processing is distributed, reliable and fault tolerant. A typical
MapReduce job is performed according to the following steps:
1. Split the data into independent chunks based on key-value pairs. This is done by the Map task in a
parallel manner.
2. The output of the Map job is sorted based on the key values.
3. The sorted output becomes the input to the Reduce job, which produces the final output and returns
the result to the client.
MapReduce Framework
The Apache Hadoop MapReduce framework is written in Java. The framework follows a master-slave
configuration. The master is known as the JobTracker and the slaves are known as TaskTrackers. The
master controls the tasks processed on the slaves (which are nothing but the nodes in a cluster), and the
computation is done on the slaves. So the compute and storage nodes are the same in a clustered
environment. The concept is to 'move the computation to the nodes where the data is stored', and this
makes the processing faster.
MapReduce Processing
The MapReduce framework model is very lightweight, so the cost of hardware is low compared with
other frameworks. At the same time, we should understand that the model works efficiently only in a
distributed environment, as the processing is done on the nodes where the data resides. Other features
like scalability, reliability and fault tolerance also work well in a distributed environment.
MapReduce Implementation
The following are the different components of the entire end-to-end implementation; a minimal
WordCount sketch combining them is given below.
1. The client program, i.e. the driver class that initiates the process.
2. The Map function that performs the split using the key-value pair.
3. The Reduce function that aggregates the processed data and sends the output back to the client.
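The sketch below is the classic WordCount program written against the org.apache.hadoop.mapreduce API; it is illustrative only, and the class name and input/output paths (taken from the command line) are assumptions.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
    // Map function: splits each input line into words and emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }
    // Reduce function: aggregates the sorted (word, [1, 1, ...]) groups into totals.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
    // Driver: the client program that configures and submits the job.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Package the class into a jar (for example in the binary folder mentioned above) and run it with hadoop jar, giving an HDFS input directory and a not-yet-existing output directory.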
Assignment 4
Installation of Hive
https://cwiki.apache.org/confluence/display/Hive/AdminManual+Installation
http://pig.apache.org/docs/r0.16.0/start.html#req
4.2 Creating the HDFS tables and loading them in Hive and learn joining of tables in
Hive
There are two types of tables in Hive: internal (managed) and external. The case study described on the
following blog covers creating a table, loading data into it, creating views and indexes, and dropping the
table, using weather data. A minimal Java (JDBC) sketch of creating, loading and joining Hive tables is
given after the link.
https://www.dezyre.com/hadoop-tutorial/apache-hive-tutorial-tables
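The sketch below issues HiveQL statements over JDBC. It assumes HiveServer2 is running on localhost:10000 and that the Hive JDBC driver is on the classpath; the table names, columns and CSV file paths are illustrative.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
public class HiveJoinDemo {
    public static void main(String[] args) throws Exception {
        // Connect to HiveServer2 (default database, no credentials assumed).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        // Create two internal (managed) tables backed by HDFS.
        stmt.execute("CREATE TABLE IF NOT EXISTS students (enrollno STRING, name STRING, branch STRING) "
                + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");
        stmt.execute("CREATE TABLE IF NOT EXISTS fees (enrollno STRING, amount DOUBLE) "
                + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");
        // Load data from local CSV files into the tables.
        stmt.execute("LOAD DATA LOCAL INPATH '/home/hadoop/students.csv' INTO TABLE students");
        stmt.execute("LOAD DATA LOCAL INPATH '/home/hadoop/fees.csv' INTO TABLE fees");
        // Join the two tables on the enrollment number.
        ResultSet rs = stmt.executeQuery("SELECT s.enrollno, s.name, f.amount "
                + "FROM students s JOIN fees f ON s.enrollno = f.enrollno");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getString(2) + "\t" + rs.getDouble(3));
        }
        con.close();
    }
}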
Assignment 5
http://spark.apache.org/docs/latest/
http://spark.apache.org/docs/latest/spark-standalone.html
Python Script
Java Program
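A minimal word-count sketch using Spark's Java API is given below; it assumes Spark 2.x or later with spark-core on the classpath, and the input and output paths are supplied on the command line.
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
public class JavaWordCount {
    public static void main(String[] args) {
        // Create the Spark context; the master URL is taken from spark-submit.
        SparkConf conf = new SparkConf().setAppName("JavaWordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Read the input file, split each line into words, and count every word.
        JavaRDD<String> lines = sc.textFile(args[0]);
        JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        JavaPairRDD<String, Integer> counts = words
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);
        // Save the (word, count) pairs and stop the context.
        counts.saveAsTextFile(args[1]);
        sc.stop();
    }
}
The job can be submitted to a standalone cluster with spark-submit --class JavaWordCount <jar> <input> <output>.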