Big Data - Hands-On Manual: The Fastest Way To Learn Big Data! - Alvaro de Castro

History

In theoretical computer science, the CAP theorem, also known as Brewer’s theorem, states
that it is impossible for a distributed computer system to simultaneously provide all three
of the following guarantees:
1 - Consistency (all nodes see the same data at the same time)
2 - Availability (a guarantee that every request receives a response about whether it
succeeded or failed)
3 - Partition tolerance (the system continues to operate despite arbitrary partitioning due
to network failures)

On its own, Hadoop is a software market that IDC predicts will be worth $813 million in 2016, and it is also driving a big data market the research firm predicts will hit more than $23 billion by 2016. Since Cloudera launched in 2008, Hadoop has spawned dozens of startups and spurred hundreds of millions of dollars in venture capital investment.

With the advent of Hadoop in 2006, big data analytics was no longer a task that only a few companies or groups could perform. Because Hadoop is an open-source framework, any company or group that needed big data analytics could use it easily and at low cost. In other words, big data analytics became a universal technology.

The core of Hadoop is the Hadoop Distributed File System (HDFS) and MapReduce. Hadoop stores data in HDFS, a distributed file system whose capacity can be expanded by adding nodes, runs MapReduce operations on the stored data, and thereby produces the required results.

There is no limit to users' needs, and the Hadoop community kept trying to overcome Hadoop's limits in functionality and performance and to develop it further. Complaints focused on the use of MapReduce, which has two main disadvantages:

1 – It is very inconvenient to use.
2 – Its processing is slow.

To resolve the inconveniences of using MapReduce, platforms such as Pig and Hive
appeared in 2008. Pig and Hive are sub-projects of Hadoop (Hadoop is also an ecosystem
of multiple platforms; a variety of products based on Hadoop have been created). Both Pig
and Hive offer a high-level language, but Pig's is procedural while Hive's is declarative and similar to SQL. With the advent of Pig and Hive, Hadoop users could conduct big data analytics more easily.

However, as Hive and Pig only address the data retrieval interface, they cannot do much to accelerate big data analytics work: internally, both still use MapReduce.

This is why HBase, a column-based NoSQL database, appeared. HBase, which enables faster input/output of key/value data, finally provided Hadoop-based systems with an environment in which data could be processed in real time.

The installation procedures described below are for Ubuntu. They refer to the latest releases available as of July 2015, and they use the username al and password a. It is recommended that the same username and password be used on all nodes of a rack.

The software described usually comes in source and binary form. Should you choose to build from source, the build tool of choice today is Maven (SBT is no longer recommended for Spark). Java applications you develop are also usually compiled with Maven, so we will install it before anything else.

1 - wget http://ftp.cixug.es/apache/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz
2 - tar -xvf apache-maven-3.3.3-bin.tar.gz
Rename the directory just created to maven
3 - sudo nano ~/.bashrc
Include the text export MAVEN_HOME=/home/al/maven
Alter the PATH to include MAVEN_HOME/bin, e.g. export PATH=$PATH:$MAVEN_HOME/bin

Hardware considerations
The whole idea of Big Data is to be able to use inexpensive, off-the-shelf PCs running different operating systems and with different hardware configurations. While this is quite exciting, it is rather utopian: what is actually recommended is to use high-end servers for the namenodes and identical commodity PCs for all the datanodes. In a multi-node cluster, the NameNode and the DataNodes are usually on different machines. There is only one NameNode in a cluster and many DataNodes; that is why the NameNode is called a single point of failure. There is a Secondary NameNode (SNN), which can live on a different machine; it does not actually act as a NameNode but stores an image of the primary NameNode at certain checkpoints and is used as a backup to restore the NameNode. In a single-node cluster (referred to as a cluster in pseudo-distributed mode), the NameNode and DataNode can be on the same machine as well.
Namenode
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all
files in the file system, and tracks where across the cluster the file data is kept. It does not
store the data of these files itself: that’s the job of the datanodes.

Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.
The NameNode is a Single Point of Failure for the HDFS Cluster. HDFS is not currently
a High Availability system. When the NameNode goes down, the file system goes offline.
There is an optional SecondaryNameNode that can be hosted on a separate machine. It
only creates checkpoints of the namespace by merging the edits file into the fsimage file
and does not provide any real redundancy.
It is essential to look after the NameNode. Here are some recommendations from production use:
1 – Use a good server with lots of RAM. The more RAM you have, the bigger the file
system, or the smaller the block size.
2 – Use ECC RAM.
3 – List more than one name node directory in the configuration, so that multiple copies of
the file system meta-data will be stored. As long as the directories are on separate disks, a
single disk failure will not corrupt the meta-data.
4 – Configure the NameNode to store one set of transaction logs on a separate disk from
the image.
5 – Configure the NameNode to store another set of transaction logs to a network mounted
disk.
6 – Do not host DataNode, JobTracker or TaskTracker services on the same system.

Datanode
A DataNode stores data in the Hadoop File System. A functional filesystem has more than
one DataNode, with data replicated across them. On startup, a DataNode connects to the
NameNode; spinning until that service comes up. It then responds to requests from the
NameNode for filesystem operations.
Client applications can talk directly to a DataNode, once the NameNode has provided the
location of the data. Similarly, MapReduce operations farmed out to TaskTracker instances
near a DataNode, talk directly to the DataNode to access the files. TaskTracker instances
can, indeed should, be deployed on the same servers that host DataNode instances, so that
MapReduce operations are performed close to the data.
DataNode instances can talk to each other, which is what they do when they are
replicating data.
There is usually no need to use RAID storage for DataNode data, because data is designed
to be replicated across multiple servers, rather than multiple disks on the same server.
An ideal configuration is for a server to have a DataNode, a TaskTracker, and enough physical disks and TaskTracker slots (roughly one TaskTracker slot per CPU). This allows every TaskTracker slot 100% of a CPU and separate disks to read and write data.
Avoid using NFS for data storage in a production system.
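Client interaction with HDFS can be illustrated with a short program against Hadoop's Java FileSystem API. This is only a hedged sketch: the NameNode address matches the single-node configuration used later in this manual, and the /demo path and class name are invented for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // NameNode address; matches the fs.default.name value configured later in this manual
    conf.set("fs.defaultFS", "hdfs://localhost:9000");
    FileSystem fs = FileSystem.get(conf);

    // Creating a file: the NameNode records the metadata, the bytes go to DataNodes
    try (FSDataOutputStream out = fs.create(new Path("/demo/hello.txt"))) {
      out.writeUTF("Hello HDFS");
    }

    // Listing a directory is a metadata-only operation answered by the NameNode
    for (FileStatus status : fs.listStatus(new Path("/demo"))) {
      System.out.println(status.getPath() + " " + status.getLen() + " bytes");
    }
    fs.close();
  }
}

In this sketch the program only ever asks the NameNode where data lives; the file contents themselves are streamed directly to and from the DataNodes, exactly as described above.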

Networking
Big data applications aren't just big; they also tend to be bursty. When a job is initiated, data begins to flow, and during these periods of high traffic, congestion is a primary concern. Congestion can lead to more than queuing delays and dropped packets: it can also trigger retransmissions, which can cripple already heavily loaded networks.
In this sense, network partitioning is crucial in setting up big data environments. In its
simplest form, partitioning can mean the separation of big data traffic from residual
network traffic so that bursty demands from applications do not impact other mission-
critical workloads. Beyond that, there is a need to handle multiple tenants running multiple
jobs for performance, compliance and/or auditing reasons.
Also, moving from 1 Gb to a 10 Gb network infrastructure is sometimes recommended.
Be aware though that the connectors and cabling of 10 Gb infrastructures are quite
expensive.
Apache's GridMix is a tool commonly used to benchmark Hadoop clusters. It submits a mix of synthetic jobs, modeling a profile mined from production loads.


Frameworks

Hadoop
Hadoop is by far the most recognizable name in the Big Data world and practically
became a synonym of Big Data. It is an open-source software framework written in Java
for distributed storage and distributed processing of very large data sets on computer
clusters built from inexpensive hardware.

Hadoop was built from Google’s MapReduce paradigm by Doug Cutting and Mike
Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his
son’s toy elephant. It was originally developed to support distribution for the Nutch search
engine project. Cutting was also the creator of Apache Lucene, the widely used text search
library.

The core of Apache Hadoop consists of a storage part (Hadoop Distributed File System
(HDFS)) and a processing part (MapReduce). Hadoop splits files into large blocks and
distributes them amongst the nodes in the cluster. To process the data, Hadoop
MapReduce transfers packaged code for nodes to process in parallel, based on the data
each node needs to process. This allows the data to be processed faster and more
efficiently than it would be in a more conventional supercomputer. The Hadoop
framework is composed of the following modules:
1 – Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
2 – Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on
commodity machines
3 – Hadoop YARN – a resource-management platform responsible for managing
computing resources in clusters and using them for scheduling of users’ applications
4 – Hadoop MapReduce – a programming model for large scale data processing.

Note that on 64-bit Windows it is possible to simply download a binary version of Hadoop. This, however, is not true for 32-bit Windows, as some of the executables are 64-bit. It is therefore necessary to compile Hadoop to run on 32-bit Windows, which in turn requires patching some of the XML files, and the patches are hard if not impossible to find (https://issues.apache.org/jira/browse/HADOOP-9922). As such, although it is theoretically possible to run Hadoop on 32-bit Windows, for all practical purposes it is not.

Before we install, there are 5 concepts we need to know:
1 – DataNode
A DataNode stores data in the Hadoop File System. A functional file system has more than
one DataNode, with the data replicated across them.

2 – NameNode
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.

3 - Secondary Namenode
The Secondary NameNode's whole purpose is to create checkpoints of the HDFS metadata. It is just a helper node for the NameNode.

4 - Jobtracker
The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least nodes in the same rack.

5 – TaskTracker
A TaskTracker is a node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker.


Hadoop Installation

In this example, we will install Hadoop on two 64-bit servers. Do NOT use 32-bit servers, as they are not supported!
1 – Install Lubuntu 14.04.3 on both servers, as that version is compatible with Cloudera, Ambari, and other systems. Do NOT install version 15, as it is NOT compatible! Call the servers node1 and node2. On both, create the username "al" with password "a", enable automatic login, and assign the IP addresses 192.168.1.10 and 192.168.1.11

2 – Make yourself root on both servers, as that makes things much easier. Later, if you want, you can undo this for more safety.
2.1 – Edit the sudoers file with the command sudo visudo and add at the end the following line to add al to the sudoers:
al ALL=(ALL:ALL) ALL
2.2 – Edit the password file with sudo nano /etc/passwd
In that file, you will see a line for the al user, something like this:
al:x:1000:1000:al:/home/al:/bin/bash
Change the user and group IDs to 0 so it reads something like
al:x:0:0:al:/home/al:/bin/bash
2.3 – Logout and login back

3 – Activate the root account on both servers (later you can deactivate it with sudo passwd -l root)
3.1 – sudo passwd root
3.2 – sudo passwd -u root

4 – Edit /etc/hosts in both servers to include their ip numbers so they see each other
192.168.1.10 node1
192.168.1.11 node2

5 - Configure SSH with the following commands
5.1 – Install SSH server in both servers
sudo apt-get install openssh-server
5.2 – Enable root access in both servers to SSH by editing the file /etc/ssh/sshd_config,
and edit the following line from
PermitRootLogin without-password to PermitRootLogin yes
5.3 – Generate at node1 a local key with no password and save it at /root/.ssh/
ssh-keygen
5.4 – Copy the public key to node2 and every host you may connect to (will prompt
password):
scp /root/.ssh/id_rsa.pub root@node2:/root/.ssh/id_rsa.pub
5.5 – Add the key to the list of authorized keys
cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
5.6 – Shell into the remote machine to do the same (will prompt password):
ssh root@node2
mkdir /root/.ssh/
cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
5.7 – Test that you can log in with no password at both servers
ssh root@node1
ssh root@node2

6 - Disable IPv6 with the command sudo nano /etc/sysctl.conf and copy the following
lines at the end of the file:
#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Apply the new settings with sudo sysctl -p, then use the command cat /proc/sys/net/ipv6/conf/all/disable_ipv6 to check that IPv6 is off:
it should say 1. If it says 0, you missed something.

7 - Install Java with the following commands:
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update && sudo apt-get upgrade
sudo apt-get install oracle-java7-installer

8 – Install Hadoop with the following commands:
wget http://ftp.cixug.es/apache/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
tar -zxf hadoop-2.7.1.tar.gz
Rename the directory created to hadoop

9 – Learn where Java is installed, in order to put it in $HOME/.bashrc, using the command
sudo update-alternatives --config java
Then update $HOME/.bashrc with the command nano ~/.bashrc and add the following
configuration at the end of the .bashrc file
export HADOOP_HOME=/home/al/hadoop
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export PATH=$PATH:$HADOOP_HOME/bin

Load the new values in your session: source ~/.bashrc

10 – Configure Hadoop
We need to modify several files in Hadoop's etc/hadoop folder (/home/al/hadoop/etc/hadoop).
10.1 - etc/hadoop/hadoop-env.sh
We only need to update JAVA_HOME using the value we obtained in step 9, changing the following line:
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
10.2 - etc/hadoop/core-site.xml:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

10.3 - etc/hadoop/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

10.4 - etc/hadoop/mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>

11 - Formatting the NameNode
Now you can start working on the node. First, format the file system:
~/hadoop/bin/hadoop namenode -format
This is done just once, at installation time, and usually never again.

12 - Starting the Hadoop Cluster
Navigate to the hadoop/sbin directory and run the ./start-all.sh script:
~/hadoop/sbin/start-all.sh
To see if all Hadoop-related processes are running, use the command jps (you should typically see daemons such as NameNode, DataNode and SecondaryNameNode listed)

13 – Configure it to start at boot time
sudo apt-get install sysv-rc-conf
sudo sysv-rc-conf hadoop on
sudo sysv-rc-conf hadoop-hdfs-namenode on
sudo sysv-rc-conf hadoop-yarn-resourcemanager on
sudo sysv-rc-conf hadoop-yarn-nodemanager on
sudo sysv-rc-conf hadoop-hdfs-datanode on

To check if hadoop is configured to start on boot
sysv-rc-conf --list

14 – Monitoring Hadoop through its web interfaces
http://localhost:50070/ – web UI of the NameNode daemon (HDFS layer)
The name node web UI shows you a cluster summary including information about
total/remaining capacity, live and dead nodes. Additionally, it allows you to browse the
HDFS namespace and view the contents of its files in the web browser. It also gives
access to the local machine’s Hadoop log files.

http://localhost:50030/ – web UI of the JobTracker daemon (MapReduce layer)
The JobTracker web UI provides information about general job statistics of the Hadoop
cluster, running/completed/failed jobs and a job history log file. It also gives access to
the local machine's Hadoop log files (the machine on which the web UI is running).

http://localhost:50060/ – web UI of the TaskTracker daemon (MapReduce layer)
The task tracker web UI shows you running and non-running tasks. It also gives access
to the local machine's Hadoop log files.

Spark
Apache Spark is an open-source cluster computing framework originally developed
by Matei Zaharia at UC Berkeley's AMPLab in 2009, and open sourced in 2010 under a
BSD license. In 2013, the project was donated to the Apache Software Foundation and
switched its license to Apache 2.0. In February 2014, Spark became an Apache Top-Level
Project.
In contrast to Hadoop's two-stage, disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster. This is because Spark runs largely in RAM, so it requires nodes with as much RAM as possible (typically over 24 GB per node). Thus, Spark stores data in memory whereas Hadoop stores data on disk. Hadoop uses replication to achieve fault tolerance, whereas Spark uses a different data storage model, resilient distributed datasets (RDDs), with a clever way of guaranteeing fault tolerance that minimizes network I/O. RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information to rebuild just that partition. This removes the need for replication to achieve fault tolerance and thus eliminates disk accesses.
Spark can run independently from Hadoop, but still requires a cluster manager and a
distributed storage system. For cluster management, Spark supports standalone (native
Spark cluster), Hadoop YARN, or Apache Mesos. For distributed storage, Spark can use
Hadoop Distributed File System (HDFS), Cassandra, OpenStack Swift, and Amazon S3.
With over 465 contributors in 2014, it is by far the most active project in the Apache
Software Foundation and is widely being considered as the next Hadoop generation.
Spark supports Java, Python, and Scala, but definitely prefers the last one. In fact, it ships with its own copy of Scala and can run Scala programs interactively from a console. Spark is much faster to install than Hadoop, requiring little more than altering the .bashrc file to get it running. However, it is harder to make applications work, and Maven is required to compile any new application.
Most of the development activity in Apache Spark is now in the built-in libraries,
including Spark SQL, Spark Streaming, MLlib and GraphX. Out of these, the most
popular are Spark Streaming and Spark SQL: about 50-60% of users use each of them
respectively.
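To make the RDD model more concrete, here is a hedged word-count sketch written against Spark's Java API in Java 7 style (matching the JDK installed in this manual). The input and output HDFS paths and the class name are assumptions made for the example.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

public class SparkWordCountSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("WordCountSketch").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Each transformation only extends the lineage of the RDD; nothing runs yet
    JavaRDD<String> lines = sc.textFile("hdfs://localhost:9000/in/LICENSE.txt");
    JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      public Iterable<String> call(String line) { return Arrays.asList(line.split(" ")); }
    });
    JavaPairRDD<String, Integer> counts = words
        .mapToPair(new PairFunction<String, String, Integer>() {
          public Tuple2<String, Integer> call(String w) { return new Tuple2<String, Integer>(w, 1); }
        })
        .reduceByKey(new Function2<Integer, Integer, Integer>() {
          public Integer call(Integer a, Integer b) { return a + b; }
        });

    // The action triggers the actual computation; lost partitions would be rebuilt from lineage
    counts.saveAsTextFile("hdfs://localhost:9000/out/wordcount");
    sc.stop();
  }
}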

Installation
There are several releases available for download: with or without support for Hadoop, and pre-built or not. Be aware that if you later decide to use YARN instead of Spark's standalone cluster manager, you will need a version that supports Hadoop. Also, building Spark is a time-consuming and error-prone process, so it is best to simply download a version pre-built for Hadoop.

1 - Make sure you have Java installed with
sudo apt-add-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer
2 – Download using wget http://ftp.cixug.es/apache/spark/spark-1.4.0/spark-1.4.0-bin-
hadoop2.6.tgz
3 – Untar with tar -zxf spark-1.4.0-bin-hadoop2.6.tgz
4 – Rename the created folder to spark
5 - Then update $HOME/.bashrc with the command nano ~/.bashrc, adding the following
configuration at the end of the file: export SPARK_HOME=/home/al/spark, and then run
source ~/.bashrc to load the new configuration
6 – The simplest way to see if it’s running is to go to the bin directory and run ./spark-shell
Then go to 127.0.0.1:4040 to check if it’s running
7 – You can then start the master from the sbin folder with the command ./start-master.sh
Then go to 127.0.0.1:8080 to check if it's running. You can close the console window and
it will continue running. Here, note that if you already have Ambari running you'll need to
change the port, as 8080 is the same one Ambari uses for its UI
8 – Check Spark’s integration with Hadoop by uploading one file to HDFS and try running
a WordCount program using Spark Java API. First upload a file on HDFS (Make sure your
Hadoop cluster is up and running) run the following commands from Hadoop’s bin
directory
./hadoop fs -mkdir /in
./hadoop fs -copyFromLocal LICENSE.txt /in
Then, to run the WordCount test program, execute the following command from Spark's bin
directory
./run-example JavaWordCount /in
It should show a number of words counted from the LICENSE.txt file

Flink
Apache Flink (German for “quick” or “nimble”) is a streaming dataflow engine that
provides data distribution, communication, and fault tolerance for distributed
computations over data streams.
Flink includes several APIs for creating applications that use the Flink engine:
1 – DataSet API for static data embedded in Java, Scala, and Python,
2 – DataStream API for unbounded streams embedded in Java and Scala, and
3 – Table API with a SQL-like expression language embedded in Java and Scala.
Flink also bundles libraries for domain-specific use cases:
1 – Machine Learning library, and
2 – Gelly, a graph processing API and library.
Despite Spark being extremely fast compared with Hadoop, it is still not considered a pure stream-processing engine, but rather a fast batch engine working on small slices of incoming data ("micro-batching"). This is where Apache Flink enters, particularly for streaming operations. However, the benefit of Spark's micro-batch model is that it offers full fault tolerance and "exactly-once" processing for the entire computation, meaning it can recover all state and results even if a node crashes. Flink (and Storm) don't provide this, requiring application developers to worry about missing data or to treat the streaming results as potentially incorrect, which makes it hard to write more complex applications.
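As a small, hedged illustration of the DataSet API listed above, here is a batch word-count sketch in Java against the Flink 0.9 API. The sample sentences, output path and class name are assumptions made for the example.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class FlinkWordCountSketch {
  public static void main(String[] args) throws Exception {
    final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    DataSet<String> text = env.fromElements("to be or not to be", "that is the question");

    DataSet<Tuple2<String, Integer>> counts = text
        .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
          public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            // Emit (word, 1) for every word in the line
            for (String word : line.toLowerCase().split("\\W+")) {
              if (word.length() > 0) out.collect(new Tuple2<String, Integer>(word, 1));
            }
          }
        })
        .groupBy(0)   // group by the word field of the tuple
        .sum(1);      // sum the per-word counts

    counts.writeAsCsv("/tmp/flink-wordcount-result", "\n", " ");
    env.execute("WordCount sketch");
  }
}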
Installation
Flink runs on Linux, Mac OS X, and Windows. You don't have to install Hadoop to use Flink, but if you plan to use Flink with data stored in Hadoop, pick the version matching your installed Hadoop version. Since we installed Hadoop 2.7, we will choose that version and build it with Maven.
1 – wget http://www.apache.org/dyn/closer.cgi/flink/flink-0.9.0/flink-0.9.0-bin-
hadoop27.tgz
2 – tar -zxf flink-0.9.0-bin-hadoop27.tgz
3 – Rename the newly created folder to flink
4 – Compile it with mvn package -DskipTests
5 – Run with cd build-target/bin and then ./start-local.sh
6 – Test it
Download test data
wget -O hamlet.txt http://www.gutenberg.org/cache/epub/1787/pg1787.txt
Start the example program
./flink run /home/al/flink/build-target/examples/flink-java-examples-0.9.0-
WordCount.jar file://`pwd`/hamlet.txt file://`pwd`/wordcount-result.txt
7 – See it running at 127.0.0.1:8081


Cluster Managers
MapReduce
MapReduce is a programming model first presented in 2004 by Jeffrey Dean and Sanjay Ghemawat from Google. It was later implemented by several vendors, Hadoop being the most common, to make it easy to write applications which process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input data-set into independent chunks which are
processed by the map tasks in a completely parallel manner. The framework sorts the
outputs of the maps, which are then input to the reduce tasks. Typically both the input and
the output of the job are stored in a file-system. The framework takes care of scheduling
tasks, monitoring them and re-executes the failed tasks.

MapReduce is available in several languages, including C, C++, Java, Ruby, Perl and
Python. Programmers can use MapReduce libraries to create tasks without dealing with
communication or coordination between nodes. Programs written in this functional style
are automatically parallelized and executed on a large cluster of commodity machines.
The run-time system takes care of the details of partitioning the input data, scheduling the
program’s execution across a set of machines, handling machine failures, and managing
the required inter-machine communication. This allows programmers without any
experience with parallel and distributed systems to easily utilize the resources of a large
distributed system.

As an analogy, you can think of map and reduce tasks as the way a census was conducted in Roman times, where the census bureau would dispatch its people to each city in the empire. Each census taker in each city would be tasked to count the number of people in that city and then return their results to the capital city. There, the results from each city would be reduced to a single count (the sum of all cities) to determine the overall population of the empire. This mapping of people to cities, in parallel, and then combining (reducing) the results is much more efficient than sending a single person to count every person in the empire in a serial fashion.
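To see what the census analogy looks like in code, here is a hedged sketch of the classic word-count job written against Hadoop's Java MapReduce API. The class names are illustrative; input and output paths are taken from the command line.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountSketch {

  // Map phase: like the census takers, count locally and emit (word, 1) pairs
  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: like the capital, sum the partial counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count sketch");
    job.setJarByClass(WordCountSketch.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}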

Not all processes can be written as Maps and Reduces. Even then, MapReduce as devised
by Google is now considered obsolete, as development has moved on to more capable and
less disk-intensive mechanisms that incorporate the same capabilities at much faster
speeds.

Yarn
Beginning with Hadoop 2, MapReduce was replaced by an improved version called
MapReduce 2.0 (MRv2) or YARN ("Yet Another Resource Negotiator"). In 2012, YARN
became a sub-project of the larger Apache Hadoop project, and started to be used by other
frameworks such as Spark.

Spark Standalone
Mesos
Apache Mesos is an open-source cluster manager that was developed at the University of
California, Berkeley. Mesos began as a research project in the UC Berkeley RAD Lab by
then PhD students Benjamin Hindman, Andy Konwinski, and Matei Zaharia, as well as
professor Ion Stoica. The students started working together on the project as part of a
course on Advanced Topics in Computer Systems taught by David Culler. The software
enables resource sharing in a fine-grained manner, improving cluster utilization. Since
being developed at UC Berkeley, it has been adopted by several large software companies,
including Twitter, Airbnb and Apple. In April 2015, it was announced that Apple's Siri service uses its own Mesos framework, called Jarvis. Mesos provides APIs in Java, Python and C++ for developing new parallel applications, and frameworks such as Hadoop and Spark can run on top of it. Mesos runs on Linux (64-bit) and Mac OS X (64-bit).
Installation
1 - Update the packages: sudo apt-get update
2 - Install the latest OpenJDK: sudo apt-get install -y openjdk-7-jdk
3 - Install autotools (Only necessary if building from git repository): sudo apt-get install -y
autoconf libtool
4 - Install other Mesos dependencies with sudo apt-get -y install build-essential python-
dev python-boto libcurl4-nss-dev libsasl2-dev maven libapr1-dev libsvn-dev
5 – Download mesos: wget http://www.apache.org/dist/mesos/0.22.1/mesos-0.22.1.tar.gz
6 – Expand the package: tar -zxf mesos-0.22.1.tar.gz
7 – Build Mesos from within its directory:
$ cd mesos
# Bootstrap (Only required if building from git repository).
$ ./bootstrap
# Configure and build.
$ mkdir build
$ cd build
$ ../configure
$ make
8 – Start and test with its included examples:
# Change into build directory.
$ cd build
# Start mesos master (Ensure work directory exists and has proper permissions).
$ ./bin/mesos-master.sh --ip=127.0.0.1 --work_dir=/var/lib/mesos
# Start mesos slave.
$ ./bin/mesos-slave.sh --master=127.0.0.1:5050
# Visit the mesos web page.
$ http://127.0.0.1:5050

# Run C++ framework (Exits after successfully running some tasks.).
$ ./src/test-framework --master=127.0.0.1:5050
# Run Java framework (Exits after successfully running some tasks.).
$ ./src/examples/java/test-framework 127.0.0.1:5050
# Run Python framework (Exits after successfully running some tasks.).
$ ./src/examples/python/test-framework 127.0.0.1:5050

Amazon EC2

Languages
In the Big Data world, the languages created for it are much more than just languages; they are really frameworks. In fact, they also enter the SQL realm, can run interactively, interact with other languages, and so on. Thus, they are no longer just languages and should not be treated as such.


Java
Python
Scala
Created by Martin Odersky in 2001 at the École Polytechnique Fédérale de Lausanne
(EPFL), after an internal release in late 2003, Scala was released publicly in early 2004 on
the Java platform and on the .NET platform in June 2004. A second version (v2.0)
followed in March 2006. The .NET support was officially dropped in 2012. On 12 May
2011, Odersky launched Typesafe Inc., a company to provide commercial support,
training, and services for Scala.

Scala runs on the JVM and is thus compatible with existing Java programs, and it also runs on Android smartphones. The idea of Scala is to bring back the simplicity that Java had in its beginnings. In fact, Scala can be used to create almost any Java application with less time and code. Also, Scala's compilation and execution model is identical to that of Java, making it compatible with Java build tools such as Ant.

Scala can also run interactively, eliminating thus the need to constantly compile the
application in order to see if it works, and thus cutting on development time. To the JVM,
Scala code and Java code are indistinguishable, the only difference being an extra runtime
library: scala-library.jar.

Finally, Spark was developed in Scala and even bundles the language, so, just as the market is increasingly switching from Hadoop to Spark, it is also moving from Java to Scala.

Installation
wget http://www.scala-lang.org/files/archive/scala-2.10.4.tgz
sudo mkdir /usr/local/src/scala
sudo tar xvf scala-2.10.4.tgz -C /usr/local/src/scala/

Pig
Pig is a high level scripting language that is used with Apache Hadoop. Pig enables data
workers to write complex data transformations without knowing Java. Pig was originally
developed at Yahoo Research around 2006 for researchers to have an ad-hoc way of
creating and executing map-reduce jobs on very large data sets. In 2007, it was moved into
the Apache Software Foundation. Like actual pigs, who eat almost anything, the Pig
programming language is designed to handle any kind of data—hence the name. Pig’s
simple SQL-like scripting language is called Pig Latin, and appeals to developers already
familiar with scripting languages and SQL. Pig is complete, so you can do all required
data manipulations in Apache Hadoop with Pig. Through the User Defined
Functions (UDF) facility in Pig, Pig can invoke code in many languages like JRuby, Jython
and Java. You can also embed Pig scripts in other languages. The result is that you can use
Pig as a component to build larger and more complex applications.

Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix. Hive provides a database query interface to Apache Hadoop.

People often ask why Pig and Hive both exist when they seem to do much of the same thing. Hive, because of its SQL-like query language, is often used as the interface to an Apache Hadoop based data warehouse, and is considered friendlier and more familiar to users who are used to using SQL for querying data. Pig fits in through its data flow strengths, where it takes on the tasks of bringing data into Apache Hadoop and working with it to get it into the form for querying. So, Pig is best suited for the data factory, and Hive for the data warehouse.

Hive has three main functions: data summarization, query and analysis. It supports queries expressed in a language called HiveQL, which automatically translates SQL-like queries into MapReduce jobs executed on Hadoop. In addition, HiveQL supports custom MapReduce scripts to be plugged into queries. Hive also enables data serialization/deserialization and increases flexibility in schema design by including a system catalog called the Hive Metastore. Hive is not designed for OLTP workloads and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of append-only data (like web logs). Hive supports text files (also called flat files), SequenceFiles (flat files consisting of binary key/value pairs) and RCFiles (Record Columnar Files, which store the columns of a table in a columnar-database way). By default, Hive stores metadata in an embedded Apache Derby database, and other client/server databases such as MySQL can optionally be used.
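Besides the interactive shell used in the installation below, Hive can also be queried programmatically. The following is a hedged sketch using the Hive JDBC driver; it assumes a HiveServer2 instance is listening on the default port 10000 (starting HiveServer2 is not covered by the installation steps below), and the table and query are only illustrations.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // Assumes a HiveServer2 instance on the default port; empty user name and password
    Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();
    stmt.execute("CREATE TABLE IF NOT EXISTS shakespeare (freq INT, word STRING) STORED AS TEXTFILE");
    // HiveQL queries like this one are translated into MapReduce jobs behind the scenes
    ResultSet rs = stmt.executeQuery("SELECT word, freq FROM shakespeare ORDER BY freq DESC LIMIT 10");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
    }
    con.close();
  }
}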

Installation
1 - Download Hive tar
wget http://apache.rediris.es/hive/hive-1.2.1/apache-hive-1.2.1-bin.tar.gz
2 - Extract the tar file
tar -xzvf apache-hive-1.2.1-bin.tar.gz
Rename the just created directory to hive
3 - Edit .bashrc to update the environment variables for the user:
sudo nano ~/.bashrc
Add the following at the end of the file:
export HIVE_HOME=/home/al/hive
export PATH=$PATH:$HIVE_HOME/bin

Activate the changes with source ~/.bashrc

4 - Create the Hive directories within HDFS (make sure Hadoop is running). 'warehouse' is the location where the tables and data related to Hive are stored; 'temp' is the temporary location where intermediate processing results are stored.
hadoop fs -mkdir -p /user/hive/warehouse
hadoop fs -mkdir /temp

5 – Set read/write permissions for the tables.
hadoop fs -chmod g+w /user/hive/warehouse
hadoop fs -chmod g+w /temp

6 – Set the Hadoop path in Hive's hive-config.sh (located in the hive/bin directory).
sudo nano ~/hive/bin/hive-config.sh
Add the line export HADOOP_HOME=/home/al/hadoop

7 - Launch Hive from its bin directory
~/hive/bin/hive
You will enter Hive's command shell, from where you can operate Hive

8 - Create sample tables
CREATE TABLE shakespeare (freq INT, word STRING) STORED AS TEXTFILE;
Note that all commands need to end with “;“
You will see the newly created file at Hadoop’s directory at
http://127.0.0.1:50070/explorer.html#/user/hive/warehouse/shakespeare

9 - To exit from Hive, type exit;


Databases
NoSQL
In the Big Data world, the usual databases that depend on tables ("relational"), such as MySQL, Oracle, SQL Server, etc., simply do not fit. This is because they scale very poorly: in the event of an alteration of the schema, and given the enormous size of the database, it takes far too long to alter them. Also, they depend on just one server, and any attempt to grow beyond that depends on vertical scaling (adding RAM to the server) or on creating a NAS with other servers in order to make them look like one.

NoSQL databases are built to allow the insertion of data without a predefined schema. That makes it easy to make significant application changes in real time, without worrying about service interruptions, which means development is faster, code integration is more reliable, and less database administrator time is needed. A NoSQL database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Such databases have existed since the late 1960s, but did not obtain the "NoSQL" moniker until their surge in popularity a few years ago, triggered by the storage needs of companies such as Facebook, Google and Amazon.com.
NoSQL vs. SQL

Types
SQL databases: One type (SQL database) with minor variations.
NoSQL databases: Many different types, including key-value stores, document databases, wide-column stores, and graph databases.

Development History
SQL databases: Developed in the 1970s to deal with the first wave of data storage applications.
NoSQL databases: Developed in the 2000s to deal with limitations of SQL databases, particularly concerning scale, replication and unstructured data storage.

Examples
SQL databases: MySQL, Postgres, Oracle Database.
NoSQL databases: MongoDB, Cassandra, HBase, Neo4j.

Data Storage Model
SQL databases: Individual records (e.g., "employees") are stored as rows in tables, with each column storing a specific piece of data about that record (e.g., "manager," "date hired," etc.), much like a spreadsheet. Separate data types are stored in separate tables, and then joined together when more complex queries are executed. For example, "offices" might be stored in one table, and "employees" in another. When a user wants to find the work address of an employee, the database engine joins the "employee" and "office" tables together to get all the information necessary.
NoSQL databases: Varies based on database type. For example, key-value stores function similarly to SQL databases, but have only two columns ("key" and "value"), with more complex information sometimes stored within the "value" columns. Document databases do away with the table-and-row model altogether, storing all relevant data together in a single "document" in JSON, XML, or another format, which can nest values hierarchically.

Schemas
SQL databases: Structure and data types are fixed in advance. To store information about a new data item, the entire database must be altered, during which time the database must be taken offline.
NoSQL databases: Typically dynamic. Records can add new information on the fly, and unlike SQL table rows, dissimilar data can be stored together as necessary. For some databases (e.g., wide-column stores), it is somewhat more challenging to add new fields dynamically.

Scaling
SQL databases: Vertically, meaning a single server must be made increasingly powerful in order to deal with increased demand. It is possible to spread SQL databases over many servers, but significant additional engineering is generally required.
NoSQL databases: Horizontally, meaning that to add capacity, a database administrator can simply add more commodity servers or cloud instances. The database automatically spreads data across servers as necessary.

Development Model
SQL databases: Mix of open source (e.g., Postgres, MySQL) and closed source (e.g., Oracle Database).
NoSQL databases: Open source.

Supports Transactions
SQL databases: Yes, updates can be configured to complete entirely or not at all.
NoSQL databases: In certain circumstances and at certain levels (e.g., document level vs. database level).

Data Manipulation
SQL databases: Specific language using Select, Insert, and Update statements, e.g. SELECT fields FROM table WHERE…
NoSQL databases: Through object-oriented APIs.

Consistency
SQL databases: Can be configured for strong consistency.
NoSQL databases: Depends on the product. Some provide strong consistency (e.g., MongoDB) whereas others offer eventual consistency (e.g., Cassandra).

Another important term to keep in mind is sharding. This is a technique for building large databases by separating a very large database into smaller, faster, more easily managed parts called data shards. The word shard means a small part of a whole.

HBase
HBase is an open source, non-relational, distributed database modeled after Google’s
BigTable and written in Java. It is developed as part of Apache Software Foundation’s
Apache Hadoop project and runs on top of HDFS. Tables in HBase can serve as the input
and output for MapReduce jobs run in Hadoop, and may be accessed through the Java API
but also through REST, Avro or Thrift gateway APIs.
HBase is not a direct replacement for a classic SQL database, although recently its
performance has improved, and it is now serving several data-driven websites, including
Facebook’s Messaging Platform.
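To illustrate the key/value style of access that HBase provides, here is a hedged sketch using the HBase 1.x Java client API. It assumes that the hbase-site.xml from the installation below is on the classpath and that a table named test with a column family cf has already been created (for example with create 'test', 'cf' in the HBase shell); the table, row and column names are assumptions made for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("test"))) {
      // Write one cell: row "row1", column family "cf", qualifier "greeting"
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("greeting"), Bytes.toBytes("Hello HBase"));
      table.put(put);
      // Read the cell back by key
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("greeting"));
      System.out.println(Bytes.toString(value));
    }
  }
}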
Installation
1 – Download with wget http://apache.rediris.es/hbase/1.1.1/hbase-1.1.1-bin.tar.gz
2 – Extract with tar -xzf hbase-1.1.1-bin.tar.gz and rename the newly created folder to hbase
3 – nano ~/hbase/conf/hbase-site.xml
Add this configuration and then save:
<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:///home/al/hbase/conf/hbase-data</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/usr/local/hbase/zookeeper</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2222</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
</configuration>

4 - nano ~/hbase/conf/hbase-env.sh
Uncomment this configuration in hbase-env.sh to allow HBase to manage its own instance
of ZooKeeper, as we have not installed it yet:
export HBASE_MANAGES_ZK=true

5. sudo nano ~/.bashrc
# Add to the end of the file
export HBASE_HOME=/home/al/hbase
export PATH=$PATH:$HBASE_HOME/bin:$HBASE_HOME/sbin

source ~/.bashrc

6. Start HBase (Hadoop must be running first!)
~/hbase/bin/start-hbase.sh

7. Check if it's running
ps -ef|grep hbase

8. Now, we can go to the HBase shell to work
~/hbase/bin/hbase shell

Cassandra
Named after the Greek mythological prophet Cassandra cursed to be able to see the future,
but never believed, Apache Cassandra is an open source distributed database system that
is designed for storing and managing large amounts of data across commodity servers.
Cassandra can serve as both a real-time operational data store for online transactional
applications and a read-intensive database for large-scale business intelligence (BI)
systems. Cassandra was initially developed at Facebook by Avinash Lakshman (one of the
authors of Amazon’s Dynamo) and Prashant Malik to power their Inbox Search feature. It
was released as an open source project on Google code in July 2008 and in March 2009 it
became an Apache project.
In Cassandra, all nodes play an identical role; there is no concept of a master node, with
all nodes communicating with each other equally. Cassandra’s built-for-scale architecture
means that it is capable of handling large amounts of data and thousands of concurrent
users or operations per second—​even across multiple data centers—​as easily as it can
manage much smaller amounts of data and user traffic. Cassandra’s architecture also
means that, unlike other master-slave or sharded systems, it has no single point of failure
and therefore is capable of offering true continuous availability and uptime — simply add
new nodes to an existing cluster without having to take it down. Large companies such as
Apple, Comcast, Instagram, Spotify, eBay, Rackspace, Netflix, use it today: the larger
production environments have PB’s of data in clusters of over 75,000 nodes. A University
of Toronto 2012 comparison test on NoSQL systems concluded that “In terms of
scalability, there is a clear winner throughout our experiments. Cassandra achieves the
highest throughput for the maximum number of nodes in all experiments” although “this
comes at the price of high write and read latencies."
Installation
1 – wget http://ftp.cixug.es/apache/cassandra/2.2.0/apache-cassandra-2.2.0-rc2-bin.tar.gz
2 – tar xvf apache-cassandra-2.2.0-rc2-bin.tar.gz
3 – rename the newly created folder to cassandra
4 – /home/al/cassandra/bin/cassandra -f
After starting the Cassandra server you will see it has started listening for thrift clients:
(…. )
INFO 12:18:29,140 Listening for thrift clients…
You can then stop the server with CTRL-C, and you’ll see
(…)
INFO 13:04:08,663 Stop listening to thrift clients
INFO 13:04:08,666 Waiting for messaging service to quiesce
INFO 13:04:08,667 MessagingService shutting down server thread.
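Once the server is running, you can talk to it programmatically. Here is a hedged sketch using the DataStax Java driver (an extra dependency, cassandra-driver-core, which the steps above do not install); the keyspace, table and data are made up for the example.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraSketch {
  public static void main(String[] args) {
    // Connect to the single local node started above
    Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
    Session session = cluster.connect();
    session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
        + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
    session.execute("CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)");
    session.execute("INSERT INTO demo.users (id, name) VALUES (1, 'al')");
    ResultSet rs = session.execute("SELECT id, name FROM demo.users");
    for (Row row : rs) {
      System.out.println(row.getInt("id") + " " + row.getString("name"));
    }
    cluster.close();
  }
}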

Alternatively, the DataStax Community Edition is a free packaged distribution of Apache Cassandra made available by DataStax. There's no faster, easier way to get started with Apache Cassandra than to download, install, and use DataStax Community Edition. It has versions for Unix and Mac OS X 10.x, Windows Server 2008 / Windows 7 or later (32- and 64-bit versions), Red Hat Enterprise Linux 5.x and 6.x, Debian 6.x, and Ubuntu 10.x, 11.x and 12.x.

Impala
Cloudera Impala is a query engine that runs on Apache Hadoop. It is an integrated part of
a Cloudera enterprise data hub. The project was announced in October 2012 as a public
beta and became generally available in May 2013. The Apache-licensed Impala project
brings scalable parallel database technology to Hadoop, enabling users to issue low-
latency SQL queries to data stored in HDFS and Apache HBase without requiring data
movement or transformation. Impala is integrated with Hadoop to use the same file and
data formats, metadata, security and resource management frameworks used by
MapReduce, Apache Hive, Apache Pig and other Hadoop software. Impala is seen as a competitor to the Stinger Project, basically an update of Hive promoted by Hortonworks. Impala adopted Hive-SQL as its interface. Hive-SQL is similar in syntax to SQL, and for this reason users can access data stored in HDFS through a very familiar method. As Hive-SQL is also used by Hive, you can access the same data through the same method. However, not all Hive-SQL statements are supported by Impala. Thus, whereas Hive supports everything Impala can run, the opposite is not necessarily true.

Hypertable
Hypertable is an open source database system inspired by publications on the design of
Google’s BigTable.
Hypertable runs on top of a distributed file system such as HDFS, is written almost
entirely in C++, as the developers believe it has significant performance advantages over
Java, and was developed as in-house software at Zvents Inc. In January 2009, Baidu,
the leading Chinese language search engine, became a project sponsor. The company
claims that in a test against HBase, Hypertable significantly outperformed HBase in all
tests except for the random read uniform test. Usually, NoSQL databases are based on a hash table design, which means that the data they manage is not kept physically ordered by any meaningful key. Hypertable keeps data physically sorted by a primary key, so it is better suited to applications that work with ranges of data, such as analytics, sorted URL lists, messaging applications, etc. Hypertable does not support SQL: instead, it supports its own query language called HQL. There is a version for 32/64-bit versions of Windows XP SP2, Windows Vista, Windows 7, 8, 8.1, Windows Server 2003 SP1, 2008, 2008 R2, 2012 and 2012 R2.

File Systems
HDFS

Tachyon
If Spark is memory-centric, Tachyon (named after a theoretical particle that moves faster than light) is even more so. It is a distributed storage system enabling reliable data sharing at memory speed across cluster frameworks such as Spark and MapReduce. It achieves high
performance by leveraging lineage information and using memory aggressively. Tachyon
caches working set files in memory, thereby avoiding going to disk to load datasets that
are frequently read. This enables different jobs/queries and frameworks to access cached
files at memory speed. In fact, it offers up to 300 times higher throughput than HDFS.

Tachyon is Hadoop compatible. Existing Spark and MapReduce programs can run on top
of it without any code change. The project is open source (Apache License 2.0) and is
deployed at multiple companies. It has more than 60 contributors from over 20
institutions, including Yahoo, Intel, and Redhat. The project is the storage layer of the
Berkeley Data Analytics Stack (BDAS) and also part of the Fedora distribution.

Installation
1 – sudo apt-get install git
2 – git clone git://github.com/amplab/tachyon.git
3 – cd tachyon
4 – mvn install
5 – Once it is built, you can start Tachyon:
cp conf/tachyon-env.sh.template conf/tachyon-env.sh
./bin/tachyon format
./bin/tachyon-start.sh local
You can then close the command window and tachyon will continue running
6 – To verify that Tachyon is running, go to http://127.0.0.1:19999 or check the log in the
folder tachyon/logs. You can also run a simple program: ./bin/tachyon runTest Basic
CACHE_THROUGH
7 – To stop it type ./bin/tachyon-stop.sh

Ceph

Search Systems

Elastic
Elasticsearch is an open source search and analytics engine created by Shay Banon back in
2010, based on Lucene and developed in Java. Elasticsearch sees some 700,000-800,000 downloads per month, has been downloaded 20 million times since the inception of the project, is now the second most popular enterprise search engine, and is used by companies such as SoundCloud, StumbleUpon, Mozilla and Klout. Eventually it became a company called Elastic (formerly Elasticsearch) and raised almost $105 million. Elastic provides a distributed, RESTful full-text search engine with a web interface and schema-free JSON documents. Elastic can run on Linux or Windows.
Installation
1 – wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.6.0.tar.gz
2 – tar zxfv elasticsearch-1.6.0.tar.gz
3 – Rename the newly create folder to elasticsearch
4 – Install the Marvel plugin which is used to manage the system
./bin/plugin -i elasticsearch/marvel/latest
5 – Run the system with ./bin/elasticsearch
Add -d if you want to run it in the background as a daemon.
6 – Test it out by opening a browser at http://localhost:9200 or opening another
terminal window and running the following: curl ‘http://localhost:9200/?pretty’
In both you should see a response like this:
{
“status”: 200,
“name”: “Shrunken Bones”,
“version”: {
“number”: “1.4.0”,
“lucene_version”: “4.10”
},
“tagline”: “You Know, for Search”
}
This means that your Elasticsearch cluster is up and running.
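Because Elastic exposes a RESTful interface, it can be used from Java with nothing more than the standard HTTP classes. The following hedged sketch indexes one document and reads it back; the index, type and document contents are assumptions made for the example.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Scanner;

public class ElasticRestSketch {
  public static void main(String[] args) throws Exception {
    // Index a JSON document at /books/book/1 through the REST API
    URL url = new URL("http://localhost:9200/books/book/1");
    HttpURLConnection put = (HttpURLConnection) url.openConnection();
    put.setRequestMethod("PUT");
    put.setDoOutput(true);
    String doc = "{\"title\": \"Big Data Hands-On Manual\", \"year\": 2015}";
    try (OutputStream os = put.getOutputStream()) {
      os.write(doc.getBytes("UTF-8"));
    }
    System.out.println("Index response code: " + put.getResponseCode());

    // Retrieve the document back with a GET on the same URL
    HttpURLConnection get = (HttpURLConnection) url.openConnection();
    try (Scanner s = new Scanner(get.getInputStream(), "UTF-8")) {
      while (s.hasNextLine()) System.out.println(s.nextLine());
    }
  }
}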

Lucene

Administration Systems

Hue

Ambari
Ambari (a city in India) is a project by Apache aimed at making Hadoop management
simpler by developing software for provisioning, managing, and monitoring Apache
Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI
backed by its RESTful APIs.

Ambari enables System Administrators to:
1 – Provision a Hadoop Cluster
Ambari provides a step-by-step wizard for installing Hadoop services across any number
of hosts.
Ambari handles configuration of Hadoop services for the cluster.
2 – Manage a Hadoop Cluster
Ambari provides central management for starting, stopping, and reconfiguring Hadoop
services across the entire cluster.
3 – Monitor a Hadoop Cluster
Ambari provides a dashboard for monitoring health and status of the Hadoop cluster.
Ambari leverages Ambari Metrics System for metrics collection.
Ambari leverages Ambari Alert Framework for system alerting and will notify you when
your attention is needed (e.g., a node goes down, remaining disk space is low, etc).

As for Application Developers and System Integrators, Ambari allows them to integrate Hadoop provisioning, management, and monitoring capabilities into their own applications with the Ambari REST APIs.

Ambari is the administration system of choice of Hortonworks.

Installation
1 - cd /etc/apt/sources.list.d
sudo wget http://public-repo-
1.hortonworks.com/ambari/ubuntu12/1.x/updates/1.7.0/ambari.list
2 – sudo apt-key adv --recv-keys --keyserver keyserver.ubuntu.com
B9733A7A07513CAD
3 – sudo apt-get update
4 – Install the server: sudo apt-get install ambari-server
5 – Setup the server: ambari-server setup
6 – Start Ambari Server: ambari-server start
7 – Deploy Cluster using Ambari Web UI: http://127.0.0.1:8080 Here, note that if you
already have Spark running you’ll need to change the port, as 8080 is the same Spark uses
in its UI. Use admin for both username and password. From here, it is identical to
Cloudera’s system as described below

ZooKeeper
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications, and each time they are re-implemented a lot of work goes into fixing the inevitable bugs and race conditions, because these kinds of services are difficult to implement correctly. One of the interesting features of ZooKeeper is that the user is able to open a new administration window should the first stop working: this provides extra assurance that the system will indeed be managed. ZooKeeper is used by companies including Rackspace, Yahoo!, Odnoklassniki and eBay, as well as by open source enterprise search systems like Solr.
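To give an idea of how applications use ZooKeeper, here is a hedged sketch using its Java client to store and read back a small piece of configuration. It assumes a ZooKeeper server reachable at localhost:2181 (not set up in this manual, since HBase above manages its own embedded instance), and the znode path and data are illustrative.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperSketch {
  public static void main(String[] args) throws Exception {
    // Connect to the ensemble; the watcher receives session and znode events
    ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
      public void process(WatchedEvent event) { System.out.println("Event: " + event.getType()); }
    });
    // Store a small piece of configuration under a znode, then read it back
    if (zk.exists("/demo-config", false) == null) {
      zk.create("/demo-config", "replication=3".getBytes("UTF-8"),
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }
    byte[] data = zk.getData("/demo-config", false, null);
    System.out.println(new String(data, "UTF-8"));
    zk.close();
  }
}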

Message Brokers
Traditionally, Big Data has been a very batch-oriented environment, in which you first collect a lot of data and then process it, waiting however long it takes for the processing to finish. On the internet, and in other real-time scenarios such as a production assembly line that requires immediate attention, this simply does not work. To this end, "message brokers" were created, which basically ingest data from several sources into the big data system in real time. Two major ones are used: Flume and Kafka. There is significant overlap in the functions of both, so there are some considerations when evaluating the two systems.
1 – Kafka is very much a general-purpose system. You can have many producers and
many consumers sharing multiple topics. In contrast, Flume is a special-purpose
tool designed to send data to HDFS and HBase. It has specific optimizations for
HDFS and it integrates with Hadoop’s security. As a result, Cloudera recommends
using Kafka if the data will be consumed by multiple applications.
2 – Flume has many built-in sources and sinks. Kafka, however, has a significantly
smaller producer and consumer ecosystem, and it is not as well supported by the
Kafka community. Thus, use Kafka if you are prepared to write your own producers and
consumers, or use Flume if its existing sources and sinks match your requirements and
you prefer a system that can be set up without any development.
3 – Flume can process data in-flight using interceptors. These can be very useful for
data masking or filtering. Kafka requires an external stream-processing system for
that.
4 – Both Kafka and Flume are reliable systems that, with proper configuration, can
guarantee zero data loss. However, Flume does not replicate events. As a result,
even when using the reliable file channel, if a node with a Flume agent crashes, you
will lose access to the events in the channel until you recover the disks. Use Kafka
if you need an ingest pipeline with very high availability.
5 – Flume and Kafka can work quite well together. If your design requires streaming
data from Kafka to Hadoop, using a Flume agent with a Kafka source to read the data
makes sense: you don’t have to implement your own consumer, you get all the
benefits of Flume’s integration with HDFS and HBase, you have Cloudera Manager
monitoring the consumer, and you can even add an interceptor and do some stream
processing on the way (a configuration sketch of this setup is shown below).
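A minimal sketch of the Kafka-to-HDFS pattern described above, using the Kafka source that ships
with Flume 1.6. The agent name (agent1), the topic name (test) and the HDFS path are assumptions
chosen for illustration; adjust them to your environment:
agent1.sources = kafka-source
agent1.channels = mem-channel
agent1.sinks = hdfs-sink
# Kafka source: reads messages from the "test" topic via ZooKeeper
agent1.sources.kafka-source.type = org.apache.flume.source.kafka.KafkaSource
agent1.sources.kafka-source.zookeeperConnect = localhost:2181
agent1.sources.kafka-source.topic = test
agent1.sources.kafka-source.channels = mem-channel
# In-memory channel between source and sink
agent1.channels.mem-channel.type = memory
# HDFS sink: writes the events into a directory in HDFS
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.hdfs.path = /flume/kafka-events
agent1.sinks.hdfs-sink.channel = mem-channel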

Flume
In a nutshell, Apache Flume is a tool to “suck” real-time data into HDFS. To this end, it has a
simple and flexible architecture based on streaming data flows and is robust and fault
tolerant with tunable reliability mechanisms for failover and recovery.
The following components make up Apache Flume:
1 – Event: A singular unit of data that is transported by Flume (typically a single log entry)
2 – Source: The entity through which data enters into Flume. Sources either actively poll
for data or passively wait for data to be delivered to them. A variety of sources allow data
to be collected, such as log4j logs and syslogs.
3 – Sink: The entity that delivers the data to the destination. A variety of sinks allow data
to be streamed to a range of destinations. One example is the HDFS sink that writes events
to HDFS.
4 – Channel: The conduit between the Source and the Sink. Sources ingest events into the
channel and the sinks drain the channel.
5 – Agent: Any physical Java virtual machine running Flume. It is a collection of sources,
sinks and channels.
6 – Client: The entity that produces and transmits the Event to the Source operating within
the Agent.
Installation
1 – Download
wget http://apache.rediris.es/flume/1.6.0/apache-flume-1.6.0-bin.tar.gz
2 – Extract
tar -xzf apache-flume-1.6.0-bin.tar.gz
3 - Starting an agent
First make sure the JAVA_HOME variable is set, as Flume needs it.
An agent is started using a shell script called flume-ng which is located in the bin
directory of the Flume distribution. You need to specify the agent name, the config
directory, and the config file on the command line:
$ bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template
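As a sketch of what the configuration file passed with -f might contain, the following minimal
agent (named a1 here purely for illustration, saved for example as conf/example.conf) listens on a
netcat source and simply logs each event, wiring together the source, channel and sink components
described above:
# Name the components of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Source: listen for lines of text on localhost:44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Channel: buffer events in memory
a1.channels.c1.type = memory
# Sink: write events to the Flume log
a1.sinks.k1.type = logger
# Wire source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
You would then start it with bin/flume-ng agent -n a1 -c conf -f conf/example.conf and send test
lines with telnet localhost 44444.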



Kafka
Kafka is a message broker which is currently replacing older brokers based on standards such as JMS and AMQP
because of its higher throughput, reliability and replication. Kafka works in combination
with Apache Storm, Apache HBase and Apache Spark for real-time analysis and rendering
of streaming data. For example, Kafka can message geospatial data from a fleet of long-
haul trucks or sensor data from heating and cooling equipment in office buildings. Thus,
Kafka brokers massive message streams for low-latency analysis in Enterprise Apache
Hadoop. To better understand this, Kafka has some basic messaging terminology:
1 – Kafka maintains feeds of messages in categories called topics.
2 – Processes that publish messages to a Kafka topic are called producers.
3 – Processes that subscribe to topics and process the feed of published messages are called
consumers.
4 – Kafka is run as a cluster comprised of one or more servers, each of which is called a
broker.
So, at a high level, producers send messages over the network to the Kafka cluster, which
in turn serves them up to consumers. Communication between the clients and the servers is done
over the regular TCP protocol. Kafka was originally developed by LinkedIn, open sourced in early
2011, and left the Apache Incubator on 23 October 2012. In November 2014, several engineers who
built Kafka at LinkedIn created a new company named Confluent with a focus on Kafka.

Installation and initial use
1 - Download it with wget http://ftp.cixug.es/apache/kafka/0.8.2.1/kafka_2.11-0.8.2.1.tgz
2 – Expand it with tar -xzf kafka_2.11-0.8.2.1.tgz
3 – Start the server. Kafka uses ZooKeeper so you need to first start a ZooKeeper server if
you don’t already have one. You can use the convenience script packaged with kafka to
get a quick-and-dirty single-node ZooKeeper instance with bin/zookeeper-server-start.sh
config/zookeeper.properties
Now start the Kafka server with bin/kafka-server-start.sh config/server.properties
4 – Create a topic named “test” with a single partition and only one replica:
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

We can now see that topic if we run the list topic command:
bin/kafka-topics.sh --list --zookeeper localhost:2181
test

5 – Send some messages. For that, Kafka comes with a command line client that will take
input from a file or from standard input and send it out as messages to the Kafka cluster.
By default each line will be sent as a separate message. Run the producer and then type a
few messages into the console to send to the server.
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
This is a message
This is another message

6 – Start a consumer. For that, Kafka also has a command line consumer that will dump
out messages to standard output.
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
This is a message
This is another message

If you have each of the above commands running in a different terminal then you should
now be able to type messages into the producer terminal and see them appear in the
consumer terminal.
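To check how the topic created above is laid out across the cluster (its partitions, leader and
replicas), the same kafka-topics.sh script also accepts a describe option; on this single-broker
setup it should simply report one partition with one replica:
bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test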

Storm
Storm is a distributed realtime computation system. Unlike Kafka and Flume, whose designs are
unique and thus rather confusing for the uninitiated, Storm employs a framework quite similar to
Hadoop, and thus its learning curve is much lighter. Storm can be used with any programming
language. It was originally created by Nathan Marz. The initial release was on 17 September 2011,
and the project was open sourced after its creator’s company, BackType, was acquired by Twitter.
The fact that it was developed in Clojure, a variant of Lisp commonly used for Artificial
Intelligence, is often seen as a good indication of its power and possibilities, and that is why
it is being increasingly used by leading companies such as Alibaba.com, Spotify, The Weather
Channel, etc.
A Storm cluster is superficially similar to a Hadoop cluster. Whereas in Hadoop one runs
MapReduce “Jobs”, on Storm one runs “Topologies”. “Jobs” and “Topologies” themselves
are very different — one key difference is that a MapReduce job eventually finishes
whereas a topology processes messages forever or until it is killed. There are two kinds of
nodes on a Storm cluster: the master node and the worker nodes. The master node runs a
daemon called “Nimbus” that is similar to Hadoop’s JobTracker. Nimbus is responsible
for distributing code around the cluster, assigning tasks to machines, and monitoring for
failures. Each worker node runs a daemon called the “Supervisor”. The supervisor listens
for work assigned to its machine and starts and stops worker processes as necessary based
on what Nimbus has assigned to it. Each worker process executes a subset of a topology; a
running topology consists of many worker processes spread across many machines. All
coordination between Nimbus and the Supervisors is done through a Zookeeper cluster.
Additionally, the Nimbus daemon and Supervisor daemons are fail-fast and stateless; all
state is kept in Zookeeper or on local disk. This means one can kill -9 Nimbus or the
Supervisors and they’ll start back up like nothing happened: this design leads to Storm
clusters being incredibly stable.
Installation
1 – Download Storm and unzip it
wget http://apache.rediris.es/storm/apache-storm-0.9.5/apache-storm-0.9.5.tar.gz
tar -xvf apache-storm-0.9.5.tar.gz
2 – Configure Storm
Uncomment/add the following to conf/storm.yaml:
storm.zookeeper.servers:
  - "127.0.0.1"
nimbus.host: "127.0.0.1"
storm.local.dir: "/home/username/storm/datadir/storm"
supervisor.slots.ports:
  - 6700
3 – Start ZooKeeper (you have to install it first)
/bin/zkServer.sh start
4 – Start nimbus
/bin/storm nimbus
5 – Start supervisor
/bin/storm supervisor
6 – Start the UI
/bin/storm ui
You can then go to 127.0.0.1:8080. Please note Spark also uses port 8080, so if you are
running it you will have to alter the port number in either Spark or Storm.
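Once the daemons are running, a topology is submitted to Nimbus with the storm jar command, run
from the Storm installation directory. The jar name, main class and topology name below are
placeholders for whatever topology you have built; storm list and storm kill are the standard
commands for inspecting and stopping running topologies:
bin/storm jar mytopology.jar com.example.MyTopology my-topology-name
bin/storm list
bin/storm kill my-topology-name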

Sqoop
Sqoop is basically an import/export system. It is a connectivity tool for moving data from
non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop.
It allows users to specify the target location inside of Hadoop and instruct Sqoop to move
data from Oracle, Teradata or other relational databases to the target. Sqoop provides a
pluggable mechanism for optimal connectivity to external systems. The Sqoop extension
API provides a convenient framework for building new connectors which can be dropped
into Sqoop installations to provide connectivity to various systems. Sqoop itself comes
bundled with various connectors that can be used for popular database and data
warehousing systems. Sqoop basically allows us to bring any data that isn’t yet part of our Big
Data framework into it. Thus, it may be said that Sqoop is the very first step for any company
moving into Big Data, as it effectively imports all the data the company has into the Big Data
environment.
Installation
1 – wget http://ftp.cixug.es/apache/sqoop/1.99.6/sqoop-1.99.6-bin-hadoop200.tar.gz
2 – tar xvf sqoop-1.99.6-bin-hadoop200.tar.gz
3 – Rename the newly created folder to sqoop
4 – Start the server with ./bin/sqoop.sh server start
Similarly, you can stop the server with ./bin/sqoop.sh server stop
5 - Client installation
The client does not need extra installation or configuration steps. Just copy the Sqoop
distribution artifact to the target machine and unzip it in the desired location. You can start
the client with the following command:

bin/sqoop.sh client
From here, you can issue commands.
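Once inside the client shell, the first commands are typically pointing the client at the server
and listing what is available. The host, port and webapp values below are the Sqoop 2 defaults and
may need adjusting for your setup:
set server --host localhost --port 12000 --webapp sqoop
show version --all
show connector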

Splunk
Splunk is a proprietary system which captures, indexes and correlates real-time data in a
searchable repository from which it can generate graphs, reports, alerts, dashboards and
visualizations. The system makes machine data accessible across an organization by
identifying data patterns, providing metrics, diagnosing problems and providing
intelligence for business operations. It is by far the most used application for data mining,
with over 9,000 customers. The company (also called Splunk) is based in San Francisco,
has over 1,700 employees, and was started in 2003 by co-founders Michael Baum, Rob Das
and Erik Swan. The name “Splunk” is a reference to exploring caves, as in spelunking.
Thus, the name was used to reference deep dives into the emerging field of big data.

Splunk is available for Windows, Linux, Solaris, FreeBSD, AIX, and Mac OS X, and offers its
main software in two license types: an Enterprise license designed for companies and
large organizations, and a freeware license designed for personal use. The freeware
version is limited to 500 MB of data a day and lacks some features of the Enterprise
edition. With a price tag of US$5,175 for 1 GB/day plus annual support fees for a
perpetual license, or alternatively a yearly fee of US$2,070, Splunk is expensive and
not open source. Thus, it faces competition from a rising wave of open source competitors.
One of the most prominent, Graylog, has unveiled its formal 1.0 release. Graylog is
written in Java and uses a few key open source technologies: Elasticsearch, MongoDB,
and Apache Kafka. Another alternative is the combined use of three open source projects:
Elasticsearch, Kibana, and Fluentd.
Installation
1 – wget http://download.splunk.com/products/splunk/releases/6.2.4/splunk/linux/splunk-6.2.4-271043-Linux-x86_64.tgz
(it may be necessary to register first)
2 – tar xfv splunk-6.2.4-271043-Linux-x86_64.tgz
3 – Rename the newly created folder to splunk
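4 – Start the server. Assuming the folder was renamed to splunk as in step 3, the server is
started with the splunk binary; the --accept-license flag simply skips the interactive license
prompt:
./splunk/bin/splunk start --accept-license
The web interface should then be reachable at http://127.0.0.1:8000, and the server can later be
stopped with ./splunk/bin/splunk stop.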

Machine learning
Machine learning is a discipline of artificial intelligence focused on enabling machines to
learn without being explicitly programmed, and it is commonly used to improve future
performance based on previous outcomes. The developed algorithms form the basis of
various applications such as:
Vision processing
Language processing
Forecasting (e.g., stock market trends)
Pattern recognition
Games
Data mining
Expert systems
Robotics

There are several ways to implement machine learning techniques; however, the most
commonly used ones are supervised and unsupervised learning.

Supervised Learning
Supervised learning deals with learning a function from available training data. A
supervised learning algorithm analyzes the training data and produces an inferred
function, which can be used for mapping new examples. Common examples of supervised
learning include:

Classifying e-mails as spam
Labeling webpages based on their content
Voice recognition
There are many supervised learning algorithms such as neural networks, Support Vector
Machines (SVMs), and Naive Bayes classifiers. Mahout implements a Naive Bayes
classifier.

Unsupervised Learning
Unsupervised learning makes sense of unlabeled data without having any predefined
dataset for its training. It is an extremely powerful tool for analyzing
available data and looking for patterns and trends. It is most commonly used for clustering
similar input into logical groups. Common approaches to unsupervised learning include:
K-means
Self-organizing maps
Hierarchical clustering

Mahout


In India, a mahout is a person who rides an elephant. Since Hadoop’s logo is a toy
elephant and Apache Mahout rides on top of it, hence the name. Apache Mahout is a
library of machine-learning algorithms, implemented on top of Hadoop and using the
MapReduce paradigm. Once big data is stored on the Hadoop Distributed File System
(HDFS), Mahout provides the data science tools to automatically find meaningful patterns
in those big data sets. The Apache Mahout project aims to make it faster and easier to turn
big data into big information.

Mahout provides an implementation of various machine learning algorithms, some in local
mode and some in distributed mode (for use with Hadoop). Each algorithm in the Mahout
library can be invoked using the Mahout command line (see the sketch after the installation
steps below). Mahout supports four main data science use cases:
1 – Collaborative filtering – mines user behavior and makes product recommendations
(e.g. Amazon recommendations)
2 – Clustering – takes items in a particular class (such as web pages or newspaper
articles) and organizes them into naturally occurring groups, such that items belonging to
the same group are similar to each other
3 – Classification – learns from existing categorizations and then assigns unclassified
items to the best category
4 – Frequent itemset mining – analyzes items in a group (e.g. items in a shopping cart or
terms in a query session) and then identifies which items typically appear together

Starting from version 1, Mahout supports Scala programming and can run on top of
Spark, which means much faster processing than with Hadoop MapReduce.

Installation
Mahout can be downloaded or built. Should you decide to build it, you need Maven 3.3.3 to
do that, and the usual apt-get upgrade will not upgrade beyond version 3.0.5, so you need to
run the following commands:
sudo apt-add-repository ppa:andrei-pozolotin/maven3
sudo apt-get update
sudo apt-get install maven3
Also, Mahout does not compile with Java JDK version 8; you need to use version 7. Install
it with apt-get install oracle-java7-installer and activate that version with the command
export JAVA_HOME=/usr/lib/jvm/java-7-oracle. Then compile it with mvn clean install
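As a rough sketch of how an algorithm is invoked from the Mahout command line, a typical
clustering workflow converts raw text files into sequence files, vectorizes them, and then runs
k-means. The directory names (docs, seqfiles, vectors, clusters-init, clusters) and the parameter
values are placeholders chosen for this example:
mahout seqdirectory -i docs -o seqfiles
mahout seq2sparse -i seqfiles -o vectors
mahout kmeans -i vectors/tfidf-vectors -c clusters-init -o clusters -k 10 -x 20 -cl
Here -k sets the number of clusters, -x the maximum number of iterations, and -cl asks Mahout to
assign each document to a cluster after the centroids converge.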

Companies
As the Big Data ecosystem grew, it became harder to keep track of new releases and
technologies and to make them all work together. This created a business opportunity for an
offering that would bring all those technologies into a single package with a simplified
installation procedure and an integrated administration system. To this end, a number of
companies sprang up with different offerings and approaches.

Cloudera
Installation
Perform the following steps in ALL members of the cluster

1 – Download and install Ubuntu 14 and then sudo apt-get install kubuntu-desktop to
make it look like Windows 7, or download Linux Mint “Rebecca” Cinnamon. Use the
same root user and password on all nodes.

2 – If using Mint then change the OS to Ubuntu with the following commands:
2.1 - sudo mv /etc/lsb-release /etc/lsb-release.original
2.2 - sudo cp /etc/upstream-release/lsb-release /etc/lsb-release
2.3 - At some point after your install is done, you can restore the original with:
sudo mv /etc/lsb-release.original /etc/lsb-release

3 – Grant passwordless root with the following commands:
3.1 - sudo visudo
3.2 - Add this line at the end (replace al with your own user name)
al ALL=(ALL) NOPASSWD: ALL

4 – Disable the firewall with the command sudo ufw disable, otherwise the web UI will
not work, the manager will not receive heartbeats, etc.

5 – Install SSH Server in all nodes using the following commands otherwise they won’t be
found
5.1 - sudo apt-get update
5.2 - sudo apt-get install openssh-server
5.3 - Generate public and private SSH keys: ssh-keygen
5.4 - Copy the SSH public key (~/.ssh/id_rsa.pub) to the root account on your target
hosts; the private key (~/.ssh/id_rsa) stays on the source machine.
5.5 - Add the SSH Public Key to the authorized_keys file on your target hosts: cat
id_rsa.pub >> authorized_keys
5.6 - Depending on your version of SSH, you may need to set permissions on the .ssh
directory (to 700) and the authorized_keys file in that directory (to 600) on the target
hosts: chmod 700 ~/.ssh chmod 600 ~/.ssh/authorized_keys
5.7 - Make sure you can connect to each host in the cluster using SSH, without having to
enter a password: ssh root@<remote.target.host> where <remote.target.host> has the
value of each host name in the cluster
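On most Linux distributions, steps 5.4 through 5.6 can also be done with a single helper command,
which copies the key into authorized_keys and sets the permissions for you (replace the host name
as appropriate):
ssh-copy-id root@<remote.target.host>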

6 – Install the NTP service with the commands
sudo apt-get update && sudo apt-get install ntp
sudo service ntp start

7 – Make sure you have a fixed IP and edit /etc/hosts file with the FQDN of all the
members of the cluster.

8 – Download and install Cloudera Manager
The Manager:
Installs the package repositories for Cloudera Manager and the Oracle Java Development
Kit (JDK)
Installs the Oracle JDK 1.6 if it is not already installed. Cloudera Manager also supports
JDK 1.7 if it is already installed on the cluster hosts.
Installs the Cloudera Manager Server
Installs and configures an embedded PostgreSQL database for use by the Cloudera
Manager server

8.1 - Download the installer:
wget http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin
8.2 - Change cloudera-manager-installer.bin to have executable permission.
chmod u+x cloudera-manager-installer.bin
8.3 - Run the Cloudera Manager Server installer.
sudo ./cloudera-manager-installer.bin
If the installation fails, use the following commands:
sudo rm -Rf /usr/share/cmf /var/lib/cloudera* /var/cache/yum/cloudera*
wget https://launchpad.net/~fossfreedom/+archive/packagefixes/+files/banish404_0.1-4_all.deb
sudo dpkg -i banish404_0.1-4_all.deb
sudo banish404
To start the Cloudera Manager Server type sudo service cloudera-scm-server start
To stop the Cloudera Manager Server type sudo service cloudera-scm-server stop
To restart the Cloudera Manager Server type sudo service cloudera-scm-server restart

9 – Open browser with 127.0.0.1:7180 using username and password “admin”. The
Cloudera Enterprise Data Hub 5.4.1 system will install Hadoop, Hbase, ZooKeeper,
Oozie, Hive, Hue, Flume, Impala, Sentry, Sqoop, Cloudera Search, and Spark

10 – Search all hosts in the cloud using 192.168.1.[1-255] (or whatever the subnetwork is)
and select the ones found

11 – Choose parcels over packages, to allow automatic updating of all the nodes.
Do NOT choose Single User Mode, as this complicates administration.
Type the root user name and password.

HortonWorks
The company offers 2 major products, namely the HDP – Hortonworks Data Platform –
and the Sandbox.

1 - Sandbox
The Sandbox is a pre-installed complete Hadoop system which comes packaged to run
under a virtualization system.

2 - HDP - HortonWorks Data Platform
The Hortonworks Data Platform consists of the essential set of Apache Hadoop projects
including MapReduce, Hadoop Distributed File System (HDFS), HCatalog, Pig, Hive,
HBase, Zookeeper and Ambari. Currently, it is the only platform which supports
Windows.

The Unix version uses Ambari as the administration system and can be installed in these
OS:
Red Hat Enterprise Linux (RHEL) v6.x
Red Hat Enterprise Linux (RHEL) v5.x (deprecated)
CentOS v6.x
CentOS v5.x (deprecated)
Oracle Linux v6.x
Oracle Linux v5.x (deprecated)
SUSE Linux Enterprise Server (SLES) v11, SP1 and SP3
Ubuntu Precise v12.04

HDP – Windows
For Windows, only Windows 2008 R2 64 bit and 2012 64 bit are supported.
1 - Install the prerequisites: Java, Python, and the Microsoft Visual C++ runtime. Windows Server
2012 already has the up-to-date .NET runtime, so you can skip that step. Make sure to
install Java somewhere without a space in the path – “Program Files” will not work.
You will also need to set up JAVA_HOME, which Hadoop requires.
2 - Ensure HDP can find Python – by updating the PATH System Environment variable.
3 - Open a Powershell prompt in Administrator (“Run as Administrator”) mode, and
execute the MSI through this command: msiexec /i "hdp-2.0.6.0.winpkg.msi"
4 - The HDP Setup window appears pre-populated with the host name of the server, as
well as default installation parameters. Now, complete the form with your parameters:
4.1 - Set the Hadoop User Password. This enables you to log in as the administrative user
and perform administrative actions. This must match your local Windows Server password
requirements. We recommend a strong password. Note the password you set – we’ll use
this later.
4.2 - Check ‘Delete Existing HDP Data’. This ensures that HDFS will be formatted and
ready to use after you install.
4.3 - Check ‘Install HDP Additional Components’. Select this check box to install
Zookeeper, Flume, and HBase as HDP services deployed to the single node server.
4.4 - Set the Hive and Oozie database credentials. Set ‘hive’ for all Hive Metastore entries,
and ‘oozie’ for all Oozie Metastore entries.
4.5 - Select DERBY, and not MSSQL, as the DB Flavor in the dropdown selection. This
will setup HDP to use an embedded Derby database, which is ideal for the evaluation
single node scenario.
5 - When you have finished setting the installation parameters, click ‘Install’ to install
HDP.
6 - Once the install is successful, you will start the HDP services on the single node. Open
a command prompt, and navigate to the HDP install directory and type
start_local_hdp_services
7 - Validate the install by running the full suite of smoke tests. It’s easiest to run the smoke
tests as the HDP super user: ‘hadoop’. In a command prompt, switch to using the ‘hadoop’
user: runas /user:hadoop cmd
8 - When prompted, enter the password you set up during install. Run the provided
smoke tests as the hadoop user to verify that the HDP 2.0 services work as expected, with
the command Run-SmokeTests hadoop. This will fire up a MapReduce job on your freshly
set up cluster. If it fails the first time, try running it again with the same command
Run-SmokeTests hadoop.

SyncFusion

MapR
MapR is a San Jose, California-based enterprise software company that develops and sells
Apache Hadoop-derived software. The company contributes to Apache Hadoop projects
like HBase, Pig (programming language), Apache Hive, and Apache ZooKeeper. MapR
was selected by Amazon to provide an upgraded version of Amazon’s Elastic MapReduce
(EMR) service. MapR has also been selected by Google as a technology partner, and MapR
was able to break the minute-sort speed record on Google’s compute platform. With over
300 employees, MapR’s investors include Google Capital, Lightspeed Venture Partners,
Mayfield Fund, NEA, Qualcomm Ventures and Redpoint Ventures. MapR is used by more
than 700 customers, and companies such as Amazon, Cisco, Google, Teradata and HP are
members of its ecosystem.
Its software is available in three editions.
1 – M3 Standard Edition combines a complete distribution for Hadoop with an advanced
cluster management console and a multi-tenant environment.
2 – M5 Enterprise Edition combines M3 features with enterprise-grade high availability,
disaster recovery and consistent snapshots.
3 – M7 Enterprise Database Edition combines M5 features with the best in-Hadoop
database, MapR-DB, to run both online and analytical processing on one platform.

Installation
1 - Downloading
wget http://package.mapr.com/releases/installer/mapr-setup.sh -P /tmp

2 – Installing
sudo bash /tmp/mapr-setup.sh
