Big Data Analytics
Data sets grow rapidly - in part because they are increasingly gathered from cheap and numerous
information-sensing mobile devices, aerial (remote sensing), software logs, cameras,
microphones, radio-frequency identification (RFID) readers and wireless sensor networks.
Big data is a term that refers to data sets, or combinations of data sets, whose size (volume),
complexity (variability), and rate of growth (velocity) make them difficult to capture, manage,
process or analyze with conventional technologies and tools, such as relational databases and
desktop statistics or visualization packages, within the time necessary to make them useful.
While the size used to determine whether a particular data set is considered big data is not firmly
defined and continues to change over time, most analysts and practitioners currently refer to data
sets ranging from 30-50 terabytes (a terabyte is 10^12 bytes, or 1,000 gigabytes) to multiple
petabytes (a petabyte is 10^15 bytes, or 1,000 terabytes) as big data.
The complex nature of big data is primarily driven by the unstructured nature of much of the data
that is generated by modern technologies, such as web logs, radio-frequency identification (RFID),
sensors embedded in devices, machinery, vehicles, Internet searches, social networks such as
Facebook, portable computers, smart phones and other cell phones, GPS devices, and call center
records. In most cases, in order to effectively utilize big data, it must be combined with
structured data (typically from a relational database) from a more conventional business
application, such as Enterprise Resource Planning (ERP) or Customer Relationship Management
(CRM).
Similar to the complexity, or variability, aspect of big data, its rate of growth, or velocity aspect,
is largely due to the ubiquitous nature of modern on-line, real-time data capture devices, systems,
and networks. It is expected that the rate of growth of big data will continue to increase for the
foreseeable future.
Specific new big data technologies and tools have been and continue to be developed. Much of
the new big data technology relies heavily on massively parallel processing (MPP) databases,
which can concurrently distribute the processing of very large sets of data across many servers.
As another example, specific database query tools have been developed for working with the
massive amounts of unstructured data that are being generated in big data environments.
On that note, below are some facts about big data that demonstrate its importance in today's
business environment.
1. According to the 2015 IDG Enterprise Big Data Research study, businesses will spend an
average of $7.4 million on data-related initiatives in 2016.
2. According to McKinsey, a retailer using Big Data to its fullest potential could increase its
operating margin by more than 60%.
3. Bad data or poor quality data costs organizations as much as 10-20% of their revenue.
4. A 10% increase in data accessibility by a Fortune 1000 company would give that
company approximately $65 million more in annual net income.
5. Big Data will drive $48.6 billion in annual spending by 2019.
6. Data production will be 44 times greater in 2020 than it was in 2009. Individuals create
more than 70% of the digital universe. But enterprises are responsible for storing and
managing 80% of it.
7. It is estimated that Walmart collects more than 2.5 petabytes of data every hour from its
customer transactions. A petabyte is one quadrillion bytes, or the equivalent of about 20
million filing cabinets worth of text.
8. By 2020 one third of all data will be stored, or will have passed through the cloud, and
we will have created 35 zettabytes worth of data.
9. A 2015 report by Cap Gemini found that 56% of companies expect to increase their
spending on big data in the next three years.
10. There will be a shortage of talent necessary for organizations to take advantage of Big
Data. By 2018, the United States alone could face a shortage of 140,000 to 190,000
skilled workers with deep analytical skills as well as 1.5 million managers and analysts
with the know-how to use Big Data analytics to make effective decisions.
You can see from the above statistics that Big Data really is a big deal. However, don't just
collect data for the sake of collecting it. If you do, you'll end up with a massive digital
graveyard. To get the most out of your data, you need to have a clear plan in place for how you
will manage and use your data, along with a goal for what you want to accomplish with it.
Remember, unlike wine, data doesn't get better with age. So if you have invested resources to
collect, store, and report data, then you need to put your data to work to get the most value out of
it.
All of the above statistics emphasize the underlying fact that organizations need to have
processes, systems, and tools in place to help them turn raw data into useful and actionable
information. Do you have the right tools in place to get the most out of your channel data? Can
you turn your channel data into channel intelligence?
1.1.4 What Comes Under Big Data?
Big data involves the data produced by different devices and applications. Given below are some
of the fields that come under the umbrella of Big Data.
Black Box Data : A component of helicopters, airplanes, jets, etc. It captures the voices of
the flight crew, recordings from microphones and earphones, and the performance
information of the aircraft.
Social Media Data : Social media sites such as Facebook and Twitter hold information and
the views posted by millions of people across the globe.
Stock Exchange Data : Stock exchange data holds information about the buy and sell
decisions that customers make on the shares of different companies.
Power Grid Data : Power grid data holds information about the power consumed by a
particular node with respect to a base station.
Transport Data : Transport data includes the model, capacity, distance and availability of a
vehicle.
Search Engine Data : Search engines retrieve large amounts of data from many different databases.
Thus Big Data includes huge volume, high velocity, and an extensible variety of data. Such data
is typically of three types:
Structured data : Relational data.
Semi Structured data : XML data.
Unstructured data : Word, PDF, Text, Media Logs.
The term big data doesn't just refer to the enormous amounts of data available today; it also
refers to the whole process of gathering, storing and analyzing that data. Importantly, this
process is being used to make the world a better place.
Since big data as we know it today is so new, there's not a whole lot of past to examine, but what
there is shows just how much big data has evolved and improved in such a short period of time,
and it hints at the changes that will come in the future. Importantly, big data is now starting to
move past being simply a buzzword that's understood by only a select few. It's become more
mainstream, and those who are actually implementing big data are finding great success.
In the past, big data was a big business tool. Not only were the big businesses the ones with the
huge amounts of information, but they were also the ones who had sufficient capital to get big
data up and running in the first place. Big data is still an enigma to many people. It's a relatively
new term that was only coined during the latter part of the last decade. While it may still be
ambiguous to many people, since its inception it has become increasingly clear what big data is
and why it's important to so many different companies. It used to be that in order to use big data
technology, a complex and costly on-premise infrastructure had to be installed. Along with that
expensive hardware came the responsibility to assemble an expert team to run and maintain the
system and make sense of the information. It wasn't easy, and it wasn't friendly to small businesses.
Big data in the cloud changed all of that. It turned out to be the perfect solution for many
companies. It doesn't require any on-premise infrastructure, which greatly reduces the startup
costs. It also doesn't require as many data gurus on the team, because of how much can be done
by the cloud provider itself. Big data in the cloud has been one of the key components in big
data's quick ascent in the business and technology world. Big data in the cloud is also vital
because of the growing amount of information each day. It's extremely hard to scale your
infrastructure when you've got an on-premise setup to meet your information needs. You have to
install more hardware for more data, or waste space and money on unused hardware when the
data is less than expected. That problem doesn't exist with big data in the cloud. Companies can
scale up and down as their needs require, without significant financial cost.
Big data has also evolved in its use since its inception. Today, we see it being used in the
military to reduce injuries, in the NBA to monitor every movement on the floor during a game,
in healthcare to prevent heart disease and cancer, and in music to help artists go big. We're seeing
that it has no limits. It's fundamentally changing the way we do things. There's so much
advancement that's coming to fruition because of it. With the increased availability and
affordability, the changes are only going to increase.
The increase in big data also means that companies are beginning to realize how important it is
to have excellent data analysts and data scientists. Companies are also beginning to implement
executive positions like chief data officer and chief data analyst. The ripple effect is being felt in
education, where universities and colleges are scrambling to provide learning for tomorrow's
data specialists. There's an enormous demand for data-literate people that's continually on the
rise.
It hasn't been around for long, but big data has been constantly evolving, and that will only
continue. With an increase in technology and data, consumers can expect to see enormous
differences across a broad spectrum of industries. Big data is here to stay. As it continues to
grow and improve, those who adopt big data to discover the next competitive advantage are
going to find success ahead of their non-big-data counterparts.
The explosion of the Internet, social media, technology devices and apps is creating a tsunami of
data. Extremely large sets of data can be collected and analyzed to reveal patterns, trends and
associations related to human behavior and interactions. Big data is being used to better
understand consumer habits, target marketing campaigns, improve operational efficiency, lower
costs, and reduce risk. International Data Corporation (IDC), a global provider of market
intelligence and information technology advisory services, estimates that the global big data and
analytics market will reach $125 billion in 2015.
The challenge for businesses is how to make the best use of this wealth of information. Some
experts break down big data into three subcategories:
Smart data: Information is useful and actionable if it can be organized and segmented
according to a company's needs. Smart data can be combined from multiple sources and
customized to address particular business challenges.
Identity data: Profile data on consumers can be combined with their social media data,
purchasing habits and other behavioral analytics to help companies target their marketing
campaigns much more precisely.
People data: Gleaned largely from social media data sets, people data helps companies to
better understand their customers as individuals and develop programs that address and
anticipate their needs. It seeks to create a shared community of customers with mutual likes,
ideas and sentiments.
Big data sets are so large that traditional processing methods often are inadequate. Big data
challenges include analysis, capture, management, search, sharing, storage, transfer, visualization
and privacy protection. As companies work through these data processing and management
issues, the focus is shifting to the areas of data strategy and data governance.
Volume: refers to the quantity of data gathered by a company, which must be put to use to gain
important knowledge. Enterprises are awash with ever-growing data of all types, easily
amassing terabytes and even petabytes of information (e.g., turning 12 terabytes of Tweets per day
into improved product sentiment analysis, or converting 350 billion annual meter readings to
better predict power consumption). Moreover, Demchenko, Grosso, de Laat and Membrey stated
that volume is the most important and distinctive feature of Big Data, imposing specific
requirements to all traditional technologies and tools currently used.
Velocity: refers to the speed at which Big Data must be processed. Some activities are very
important and need immediate responses, which is why fast processing maximizes efficiency.
For time-sensitive processes such as fraud detection, Big Data flows must be analyzed and used as
they stream into the organizations in order to maximize the value of the information (e.g.
scrutinize 5 million trade events created each day to identify potential fraud; analyze 500 million
daily call detail records in real-time to predict customer churn faster).
Variety: refers to the types of data that Big Data can comprise. Big data can consist of any type
of data, structured or unstructured, such as text, sensor data, audio, video, click streams, log files
and so on. Analyzing combined data types brings new problems and situations, such as
monitoring hundreds of live video feeds from surveillance cameras to target points of interest, or
exploiting the 80% data growth in images, video and documents to improve customer satisfaction.
Value: refers to the added value that the collected data can bring to the intended process, activity
or predictive analysis/hypothesis. Data value depends on the events or processes the data
represent, which may be stochastic, probabilistic, regular or random. Depending on this,
requirements may be imposed to collect all data, to store it for a longer period (for some possible
event of interest), and so on. In this respect data value is closely related to data volume and variety.
Veracity: refers to the degree to which a leader trusts the information used to make a decision.
Finding the right correlations in Big Data is therefore very important for the future of the
business. However, since one in three business leaders do not trust the information used to reach
decisions, generating trust in Big Data presents a huge challenge as the number and type of sources grows.
Security infrastructure
As Big Data analysis becomes part of workflow, it becomes vital to secure that data. For
example, a healthcare company probably wants to use Big Data applications to determine
changes in demographics or shifts in patient needs. This data about patients needs to be
protected, both to meet compliance requirements and to protect patient privacy. The company
needs to consider who is allowed to see the data and when they may see it. The company also
needs to be able to verify the identity of users, as well as protect the identity of patients. These
types of security requirements must be part of the Big Data fabric from the outset, and not an
afterthought.
Performance matters
Data architecture also must work to perform in concert with supporting infrastructure of
organization or company. For instance, the company might be interested in running models to
determine whether it is safe to drill for oil in an offshore area, provided with real-time data of
temperature, salinity, sediment resuspension, and many other biological, chemical, and physical
properties of the water column. It might take days to run this model using a traditional server
configuration. However, using a distributed computing model, a days-long task may take
minutes. Performance might also determine the kind of database that the company would use. Under
certain circumstances, stakeholders may want to understand how two very distinct data elements
are related, or the relationship between social network activity and growth in sales. This is not
the typical query the company could ask of a structured, relational database. A graph database
might be a better choice, as it is tailored to separate the nodes (or entities) from their
properties (the information that defines each entity), and the edges (the relationships between
nodes and properties). Using the right database may also improve performance. Typically, a graph
database may be used in scientific and technical applications.
Prioritizing correlations
Data analysts use big data to tease out correlations: when one variable is linked to another.
However, not all of these correlations are substantial or meaningful. More specifically, just because
two variables are correlated or linked doesn't mean that a causative relationship exists between
them (i.e., correlation does not imply causation). For instance, between 2000 and 2009, the
number of divorces in the U.S. state of Maine and the per capita consumption of margarine both
decreased similarly. However, margarine and divorce have little to do with each other. A good
consultant will help you figure out which correlations mean something to your business and
which correlations mean little to your business.
Security
As with many technological endeavors, big data analytics is prone to data breach. The
information that you provide a third party could get leaked to customers or competitors.
Transferability
Because much of the data you need analyzed lies behind a firewall or on a private cloud, it takes
technical know-how to efficiently get this data to an analytics team. Furthermore, it may be
difficult to consistently transfer data to specialists for repeat analysis.
Apache Hadoop is the most important framework for working with Big Data. Hadoop's biggest
strength is scalability: it scales seamlessly from working on a single node to thousands of nodes.
Because Big Data spans so many domains, we are able to manage data from videos, text,
transactional data, sensor information, statistical data, social media conversations, search engine
queries, e-commerce data, financial information, weather data, news updates, forum discussions,
executive reports, and so on.
Doug Cutting and his team members developed an open source project known as HADOOP,
which allows you to handle very large amounts of data. Hadoop runs applications on the basis of
MapReduce, where the data is processed in parallel, and it can carry out complete statistical
analysis on large amounts of data.
Hadoop is a framework based on Java programming. It is designed to scale from a single
server to thousands of machines, each offering local computation and storage. It supports
large collections of data sets in a distributed computing environment.
The Apache Hadoop software library is a framework that allows the distributed processing of huge
data sets across clusters of computers using simple programming models.
Why Hadoop?
It simplifies dealing with Big Data. This answer immediately resonates with people; it is clear
and succinct, but it is not complete. The Hadoop framework has built-in power and flexibility to
do what you could not do before. In fact, Cloudera presentations at the latest O'Reilly Strata
conference mentioned that MapReduce was initially used at Google and Facebook not primarily
for its scalability, but for what it allowed you to do with the data.
In 2010, the average size of Cloudera's customers' clusters was 30 machines. In 2011 it was 70.
When people start using Hadoop, they do it for many reasons, all concentrated around the new
ways of dealing with the data. What gives them the security to go ahead is the knowledge that
Hadoop solutions are massively scalable, as has been proved by Hadoop running in the world's
largest computer centers and at the largest companies.
As you will discover, the Hadoop framework organizes the data and the computations, and then
runs your code. At times, it makes sense to run your solution, expressed in a MapReduce
paradigm, even on a single machine.
But of course, Hadoop really shines when you have not one, but rather tens, hundreds, or
thousands of computers. If your data or computations are significant enough (and whose aren't
these days?), then you need more than one machine to do the number crunching. If you try to
organize the work yourself, you will soon discover that you have to coordinate the work of many
computers, handle failures, retries, and collect the results together, and so on. Enter Hadoop to
solve all these problems for you. Now that you have a hammer, everything becomes a nail:
people will often reformulate their problem in MapReduce terms, rather than create a new
custom computation platform.
No less important than Hadoop itself are its many friends. The Hadoop Distributed File System
(HDFS) provides unlimited file space available from any Hadoop node. HBase is a high-
performance, unlimited-size database working on top of Hadoop. If you need the power of
familiar SQL over your large data sets, Hive provides you with an answer, while Pig offers a
higher-level data flow language. While Hadoop can be
projects (including ZooKeeper, about which we will hear later on) will make projects possible
and simplify them by providing tried-and-proven frameworks for every aspect of dealing with
large data sets.
Hadoop clusters scale horizontally- More storage and compute power can be achieved by adding
more nodes to a Hadoop cluster. This eliminates the need to buy more and more powerful and
expensive hardware.
Hadoop can handle unstructured / semi-structured data- Hadoop doesn't enforce a 'schema' on the
data it stores. It can handle arbitrary text and binary data. So Hadoop can 'digest' any
unstructured data easily.
Hadoop clusters provide storage and computing - We saw how having separate storage and
processing clusters is not the best fit for Big Data. Hadoop clusters provide storage and
distributed computing all in one.
One study by Cloudera suggested that enterprises usually spend around $25,000 to $50,000 per
terabyte per year. With Hadoop, this cost drops to a few thousand dollars per terabyte per year.
As hardware gets cheaper and cheaper, this cost continues to drop.
One example would be website click logs. Because the volume of these logs can be very high,
not many organizations captured these. Now with Hadoop it is possible to capture and store the
logs.
For example, take click logs from a website. A few years ago, these logs were stored for only a
brief period of time, to calculate statistics like popular pages. Now with Hadoop, it is viable to
store these click logs for much longer periods of time.
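As a minimal, hypothetical sketch of what capturing such logs into Hadoop looks like, the snippet below uses the Hadoop FileSystem Java API to copy a local click-log file into HDFS and list what is stored there; the paths and file names are invented for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClickLogLoader {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);       // connects to the NameNode named in fs.defaultFS

    // Copy a local click-log file into HDFS; the NameNode records the metadata,
    // while the blocks themselves are written to DataNodes.
    fs.copyFromLocalFile(new Path("/var/log/web/clicks-2016-01-01.log"),
                         new Path("/data/raw/clicks-2016-01-01.log"));

    // List what has accumulated under /data/raw so far.
    for (FileStatus status : fs.listStatus(new Path("/data/raw"))) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }
    fs.close();
  }
}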
Of course, writing custom MapReduce code is not the only way to analyze data in Hadoop.
Higher-level abstractions over MapReduce are also available. For example, a tool named Pig
accepts an English-like data flow language and translates it into MapReduce jobs. Another tool,
Hive, accepts SQL-like queries and runs them using MapReduce.
Business intelligence (BI) tools can provide an even higher level of analysis.
Hadoop History
Hadoop was created by Doug Cutting, who had created Apache Lucene (a text search
library), and it has its origins in Apache Nutch (an open source search engine), itself a part of the
Apache Lucene project. Apache Nutch was started in 2002 as a working crawler and
search system, but the Nutch architecture would not scale to the billions of pages on the web.
In 2003 Google published a paper describing the architecture of the Google
File System (GFS), which solved the storage needs for the very large files generated as part of
the web crawl and indexing process.
In 2004, based on the GFS architecture, Nutch began an open source implementation called the
Nutch Distributed Filesystem (NDFS). In 2004 Google also published the MapReduce paper, and in
2005 the Nutch developers had a working MapReduce implementation in the Nutch project. Most of
the Nutch algorithms had been ported to run using MapReduce and NDFS.
In February 2006 these components moved out of Nutch to form an independent subproject of
Lucene called Hadoop. At around the same time, Doug Cutting joined Yahoo!, which provided a
dedicated team and the resources to turn Hadoop into a system that ran at web scale. This was
demonstrated in February 2008 when Yahoo! announced that its production search index was
being generated by a 10,000-core Hadoop cluster.
In January 2008, Hadoop was made its own top-level project at Apache, confirming its
success and its diverse, active community. By this time, Hadoop was being used by many other
companies besides Yahoo!, such as Last.fm, Facebook, and the New York Times.
In April 2008, Hadoop broke a world record to become the fastest system to sort a
terabyte of data. Running on a 910-node cluster, Hadoop sorted one terabyte in 209 seconds (just
under 3.5 minutes), beating the previous year's winner of 297 seconds.
Hadoop in Facebook:
There are many data-driven companies using Hadoop at great scale; here we discuss the
implementation of Hadoop in a few of them, such as Facebook, Yahoo, IBM and health care
organizations. Messaging has been one of Facebook's most popular features since its inception.
Other Facebook features, such as the Like button and status updates, are backed by MySQL
databases, but applications such as the Facebook messaging system run on top of HBase, which
is Hadoop's NoSQL database framework.
Facebook's data warehousing solution lies in Hive, which is built on top of HDFS.
The reporting needs of Facebook are also met by using Hive.
After 2011, with the increase in the magnitude of data and to improve efficiency,
Facebook started implementing Apache Corona, which works very much like the YARN framework.
In Apache Corona, a new scheduling framework is used which separates cluster
resource management from job coordination.
Hadoop in yahoo:
When it comes to the size of a Hadoop cluster, Yahoo beats all, with 42,000
nodes in about 20 YARN (aka MapReduce 2.0) clusters and 600 petabytes of data on HDFS,
serving the company's mobile, search, advertising, personalization, media, and communication
efforts.
Yahoo uses Hadoop to block around 20.5 billion messages before they enter its email
servers. Yahoo's spam detection abilities have increased manifold since it started using Hadoop.
In the ever-growing Hadoop family, Yahoo has been one of the major contributors.
Yahoo has been the pioneer of many new technologies which have embedded
themselves into the Hadoop ecosystem.
A few notable technologies which Yahoo has been using, apart from MapReduce and HDFS,
are Apache Tez and Spark.
One of the main vehicles of Yahoo's Hadoop chariot is Pig, which started at Yahoo and
still tops the chart, as 50-60 percent of jobs are processed using Pig scripts.
Hadoop in checking re-occurrence of heart attack: UC Irvine Health in the USA, while
discharging heart patients, equips them with a wireless scale so that the weight they measure at
home can be transferred automatically and wirelessly to a Hadoop cluster established in the
hospital, where an algorithm running on Hadoop determines the chance of a recurring heart
attack by analyzing the risk factors associated with the received weight data.
Analyzing call data records: To reduce the rate of call drops and improve sound quality, the
call details pouring into a telecom company's database in real time have to be analyzed with
maximum precision.
Telecom companies have been using tools like Flume to ingest millions of call
records per second into Hadoop, and then using Apache Storm to process them in real time to
identify troubling patterns.
Timely servicing of equipment: Replacing the equipment on a telecom company's transmission
towers is much more costly than repairing it.
To determine an optimum schedule for maintenance (not too early, not too late),
companies have been using Hadoop for storing unstructured, sensor and streaming data.
Machine learning algorithms are applied to these data to reduce maintenance costs and
to carry out timely repair of the equipment before problems occur.
Anti-money-laundering practice: Before Hadoop, finance companies used to follow an
approach of selectively storing data, discarding historical data due to storage limitations.
As a result, the sample data available for analytics was not sufficient to give foolproof
results that could be used to check money laundering.
Now companies use the Hadoop framework for its greater storage and processing
abilities to determine the sources of black money and keep it out of the system.
Companies are now able to manage millions of customer names and their transactions
in real time, and the rate of detecting suspicious transactions has increased drastically after
implementing the Hadoop ecosystem.
Hadoop in Banks
Many banks across the world have been using the Hadoop platform to collect and analyze
all the data pertaining to their customers, such as daily transactional data, data coming from
interactions at multiple customer touch points like call centers, home value data and merchant
records.
All of this data can be analyzed by banks to segregate customers into one or more
segments based on their needs in terms of banking products and services, and to tailor their
sales, promotion and marketing accordingly.
Using the Big Data Hadoop architecture, many credit-card-issuing banks have been
implementing fraud detection systems which detect suspicious activity by analyzing a
customer's past history, spending patterns and trends, and have been disabling the cards of the
suspects.
Client:
It is neither master nor slave; rather, its role is to load data into the cluster, submit
MapReduce jobs describing how the data should be processed, and then retrieve the data to see
the response after job completion.
Masters:
The Masters consist of three components: NameNode, Secondary NameNode and JobTracker.
NameNode:
NameNode does NOT store the files themselves but only the files' metadata. In a later section we
will see that it is actually the DataNodes which store the files.
NameNode oversees the health of the DataNodes and coordinates access to the data stored
in them. The NameNode keeps track of all file-system-related information, such as:
Which section of a file is saved in which part of the cluster
Last access time for the files
User permissions, i.e. which users have access to a file
JobTracker:
JobTracker coordinates the parallel processing of data using MapReduce.
Secondary NameNode:
The job of the Secondary NameNode is to contact the NameNode periodically, after a certain time
interval (by default, one hour).
The NameNode keeps the filesystem metadata in RAM and records changes in an edit log on disk,
but it does not merge that edit log back into its filesystem image while it is running. What the
Secondary NameNode does is contact the NameNode every hour, pull a copy of the metadata (the
filesystem image and edit log), merge this information into a clean, compacted image, and send it
back to the NameNode, while keeping a copy for itself. Hence the Secondary NameNode is not a
backup; rather, it does the job of housekeeping.
In case of NameNode failure, the saved metadata can help rebuild it.
Slaves:
Slave nodes make up the majority of machines in a Hadoop cluster and are responsible for:
Storing the data
Processing the computations
Each slave runs both a DataNode and a TaskTracker daemon, which communicate with their
masters. The TaskTracker daemon is a slave to the JobTracker, and the DataNode daemon is a
slave to the NameNode.
Advantages of Hadoop:
1. Scalable
Hadoop is a highly scalable storage platform, because it can store and distribute very large data
sets across hundreds of inexpensive servers that operate in parallel. Unlike traditional relational
database systems (RDBMS) that can't scale to process large amounts of data, Hadoop enables
businesses to run applications on thousands of nodes involving many thousands of terabytes of
data.
2. Cost effective
Hadoop also offers a cost-effective storage solution for businesses' exploding data sets. The
problem with traditional relational database management systems is that it is extremely cost
prohibitive to scale to such a degree in order to process such massive volumes of data. In an
effort to reduce costs, many companies in the past would have had to down-sample data and
classify it based on certain assumptions as to which data was the most valuable. The raw data
would be deleted, as it would be too cost-prohibitive to keep. While this approach may have
worked in the short term, this meant that when business priorities changed, the complete raw
data set was not available, as it was too expensive to store.
3. Flexible
Hadoop enables businesses to easily access new data sources and tap into different types of data
(both structured and unstructured) to generate value from that data. This means businesses can
use Hadoop to derive valuable business insights from data sources such as social media and email
conversations. Hadoop can be used for a wide variety of purposes, such as log processing,
recommendation systems, data warehousing, market campaign analysis and fraud detection.
4. Fast
Hadoop's unique storage method is based on a distributed file system that basically maps data
wherever it is located on a cluster. The tools for data processing are often on the same servers
where the data is located, resulting in much faster data processing. If you're dealing with large
volumes of unstructured data, Hadoop is able to efficiently process terabytes of data in just
minutes, and petabytes in hours.
5. Resilient to failure
A key advantage of using Hadoop is its fault tolerance. When data is sent to an individual node,
that data is also replicated to other nodes in the cluster, which means that in the event of failure,
there is another copy available for use.
6. The Hadoop framework allows the user to quickly write and test distributed systems. It is
efficient, and it automatically distributes the data and work across the machines and, in turn,
utilizes the underlying parallelism of the CPU cores.
7. Hadoop does not rely on hardware to provide fault tolerance and high availability (FTHA);
rather, the Hadoop library itself has been designed to detect and handle failures at the application
layer.
8. Servers can be added or removed from the cluster dynamically and Hadoop continues to
operate without interruption.
9. Another big advantage of Hadoop is that, apart from being open source, it is compatible with
all platforms, since it is Java based.
Disadvantages of Hadoop:
As the backbone of so many implementations, Hadoop is almost synonymous with big data.
1. Security Concerns
Just managing a complex application such as Hadoop can be challenging. A simple example can
be seen in the Hadoop security model, which is disabled by default due to sheer complexity. If
whoever is managing the platform lacks the know-how to enable it, your data could be at huge risk.
Hadoop also lacks encryption at the storage and network levels, a major concern for government
agencies and others that prefer to keep their data under wraps.
2. Vulnerable By Nature
Speaking of security, the very makeup of Hadoop makes running it a risky proposition. The
framework is written almost entirely in Java, one of the most widely used yet controversial
programming languages in existence. Java has been heavily exploited by cybercriminals and as a
result, implicated in numerous security breaches.
3. General Limitations
One widely cited article introduces Apache Flume, MillWheel, and Google's own Cloud Dataflow
as possible solutions. What each of these platforms has in common is the ability to improve the
efficiency and reliability of data collection, aggregation, and integration. The main point the article
stresses is that companies could be missing out on big benefits by using Hadoop alone.
Chapter 2
Hadoop Design, Architecture & MapReduce
1. Hadoop is not a database. HBase or Impala may be considered databases, but Hadoop is
just a file system (HDFS) with built-in redundancy and parallelism.
2. Traditional databases/RDBMS have ACID properties - Atomicity, Consistency, Isolation
and Durability. You get none of these out of the box with Hadoop. So if you have to, for
example, write code to take money from one bank account and put it into another one, you
have to (painfully) code all the scenarios, like what happens if money is taken out but a
failure occurs before it is moved into the other account.
3. Hadoop offers massive scale in processing power and storage at a very low comparable
cost to an RDBMS.
4. Hadoop offers tremendous parallel processing capabilities. You can run jobs in parallel
to crunch large volumes of data.
5. Some people argue that traditional databases do not work well with unstructured data,
but it's not as simple as that. I have come across many applications built using a traditional
RDBMS that use a lot of unstructured data, video files, or PDFs and work well.
6. Typically an RDBMS will manage a large chunk of the data in its cache for faster
processing, while at the same time maintaining read consistency across sessions. I would
argue Hadoop does a better job at using the memory cache to process the data, but without
offering other guarantees like read consistency.
7. Hive SQL is almost always an order of magnitude slower than SQL you can run in
traditional databases. So if you are thinking SQL in Hive is faster than in a database, you
are in for a sad disappointment. It will not scale at all for complex analytics.
8. Hadoop is very good for parallel processing problems - like finding a set of keywords in a
large set of documents (this operation can be parallelized). However typically RDBMS
implementations will be faster for comparable data sets.
It is a fact that data has exploded in the past few years, and with volumes going through the roof,
traditional databases, which were developed from the premise of a single CPU and a RAM cache,
will no longer be able to support the requirements that business has. In all fairness, maybe
businesses will also start accepting that they can live with partially or reasonably consistent
reports instead of completely consistent (but old) reports. This will be an evolution, and both
Hadoop and the RDBMS will have to evolve to address it.
2.3 Hadoop Ecosystem Overview
Big Data has been the buzzword circulating in the IT industry since 2008. The amount of data
being generated by social networks, manufacturing, retail, stocks, telecom, insurance, banking,
and health care industries is way beyond our imagination.
Before the advent of Hadoop, storage and processing of big data was a big challenge.
But now that Hadoop is available, companies have realized the business impact of Big Data and
how understanding this data will drive growth. For example:
Banking sectors have a better chance to understand loyal customers, loan defaulters
and fraudulent transactions.
Retail sectors now have enough data to forecast demand.
Manufacturing sectors need not depend on costly mechanisms for quality testing;
capturing sensor data and analyzing it will reveal many patterns.
E-commerce and social networks can personalize pages based on customer interests.
Stock markets generate humongous amounts of data; correlating it from time to time
will reveal beautiful insights.
Hadoop is the straight answer for processing Big Data. The Hadoop ecosystem is a combination of
technologies which have proficient advantages in solving business problems.
The Hadoop ecosystem comprises services like HDFS and MapReduce for storing and processing
large data sets. In addition to these services, the ecosystem provides several tools to
perform different types of data modeling operations; for example, it includes Hive for querying and
fetching the data that's stored in HDFS.
Similarly, the ecosystem includes Pig, a data flow language used to implement some MapReduce
jobs. For data migration and job scheduling, we use further tools in the Hadoop ecosystem.
In order to handle large data sets, Hadoop has a distributed framework which can scale out to
thousands of nodes. Hadoop adopts Parallel Distributed Approach to process huge amount of
data. The two main components of Apache Hadoop are HDFS (Hadoop Distributed File System)
and Map Reduce (MR). The basic principle of Hadoop is to write once and read many times.
The Hadoop ecosystem includes both official Apache open source projects and a wide range of
commercial tools and solutions. Some of the best-known open source examples include Spark,
Hive, Pig, Oozie and Sqoop. Commercial Hadoop offerings are even more diverse and include
platforms and packaged distributions from vendors such as Cloudera, Hortonworks, and MapR,
plus a variety of tools for specific Hadoop development, production, and maintenance tasks.
Most of the solutions available in the Hadoop ecosystem are intended to supplement one or two
of Hadoop's four core elements (HDFS, MapReduce, YARN, and Common). However, the
commercially available framework solutions provide more comprehensive functionality. The
sections below provide a closer look at some of the more prominent components of the Hadoop
ecosystem, starting with the Apache projects.
Components of HDFS:
i. NameNode: It is also known as the Master node. The NameNode does not store the actual data or
dataset. It stores metadata, i.e. the number of blocks, their locations, on which rack and on which
DataNode the data is stored, and other details. Its namespace consists of files and directories.
Tasks of NameNode
Manages the file system namespace.
Regulates clients' access to files.
Executes file system operations such as naming, closing, and opening files and directories.
ii. DataNode: It is also known as the Slave. An HDFS DataNode is responsible for storing actual data in
HDFS. A DataNode performs read and write operations as per the requests of the clients. Each block
replica on a DataNode consists of two files on the local file system: the first file holds the data and
the second records the block's metadata, which includes checksums for the data. At startup,
each DataNode connects to its corresponding NameNode and performs a handshake, which verifies
the namespace ID and the software version of the DataNode. If a mismatch is found, the DataNode
shuts down automatically.
Tasks of DataNode
The DataNode performs operations such as block replica creation, deletion and replication
according to the instructions of the NameNode.
The DataNode manages the data storage of the system.
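As a small, hedged illustration of this division of labor (the NameNode serving metadata, the DataNodes holding the blocks), the sketch below asks HDFS which hosts store the blocks of a file; the file path is an assumption for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);              // talks to the NameNode
    FileStatus status = fs.getFileStatus(new Path("/data/raw/clicks-2016-01-01.log")); // hypothetical file

    // The NameNode answers this from its metadata: which DataNodes hold each block of the file.
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("offset " + block.getOffset()
          + " length " + block.getLength()
          + " hosts " + String.join(",", block.getHosts()));
    }
    fs.close();
  }
}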
b. MapReduce
Hadoop MapReduce is the core component of Hadoop that provides data processing.
MapReduce is a software framework for easily writing applications that process the vast amounts
of structured and unstructured data stored in the Hadoop Distributed File System.
Hadoop MapReduce programs are parallel in nature, and thus are very useful for
performing large-scale data analysis using multiple machines in the cluster. This parallel
processing improves the speed and reliability of the cluster.
Working of MapReduce
MapReduce works by breaking the processing into two phases:
Map phase
Reduce phase
Each phase has key-value pairs as input and output. In addition, the programmer specifies two
functions: the map function and the reduce function.
The map function takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs).
The reduce function takes the output from the map as its input and combines those data tuples
based on the key, modifying the value of the key accordingly.
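To make the two phases concrete before looking at Hadoop's own API, here is a tiny, non-Hadoop sketch in plain Java of the same idea: the flatMap step plays the role of the map function (splitting documents into words), and the grouping-and-counting step plays the role of the shuffle and reduce. The input strings are invented for illustration.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class MiniMapReduce {
  public static void main(String[] args) {
    List<String> documents = Arrays.asList("big data", "big hadoop", "data data");

    // "Map": split each document into words, conceptually emitting (word, 1) pairs.
    // "Shuffle" and "Reduce": group the pairs by word and sum the 1s for each key.
    Map<String, Long> wordCounts = documents.stream()
        .flatMap(doc -> Arrays.stream(doc.split("\\s+")))
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

    System.out.println(wordCounts);   // e.g. {big=2, data=3, hadoop=1} (order may vary)
  }
}

Hadoop applies the same pattern, but spreads the map and reduce work across many machines, as the sections below describe.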
Features of MapReduce
i. Simplicity: MapReduce jobs are easy to run. Applications can be written in any language, such
as Java, C++, and Python.
ii. Speed: By means of parallel processing, problems that take days to solve can be solved in hours
or minutes with MapReduce.
iii. Fault tolerance: MapReduce takes care of failures. If one copy of the data is unavailable, another
machine has a copy of the same key pair which can be used to solve the same subtask.
c. YARN
YARN provides resource management. YARN is called the operating system of Hadoop, as
it is responsible for managing and monitoring workloads. It allows multiple data processing
engines, such as real-time streaming and batch processing, to handle data stored on a single
platform.
YARN has been projected as a data operating system for Hadoop 2. Its main features are:
Flexibility: Enables other purpose-built data processing models beyond MapReduce
(batch), such as interactive and streaming. Due to this feature of YARN, other
applications can also be run along with MapReduce programs in Hadoop 2.
Efficiency: As many applications can be run on the same cluster, the efficiency of Hadoop
increases without much effect on quality of service.
Shared: Provides a stable, reliable, secure foundation and shared operational services
across multiple workloads. Additional programming models such as graph processing
and iterative modeling are now possible for data processing.
d. Hive
Hive is an open source data warehouse system for querying and analyzing large datasets stored
in Hadoop files. Hive performs three main functions: data summarization, query, and analysis.
Hive uses a language called HiveQL (HQL), which is similar to SQL. Hive automatically
translates SQL-like queries into MapReduce jobs which execute on Hadoop.
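A hedged sketch of what this looks like from an application: the usual way to send HiveQL to a running HiveServer2 instance is over JDBC (with the hive-jdbc driver on the classpath). The host, port, credentials and table name below are assumptions for illustration; Hive turns the query into jobs that run on the cluster behind the scenes.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC endpoint; host, port, database and credentials are assumptions.
    String url = "jdbc:hive2://localhost:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "hadoop", "");
         Statement stmt = conn.createStatement()) {
      // HiveQL looks like SQL; Hive compiles it into distributed jobs over data in HDFS.
      ResultSet rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS hits FROM click_logs GROUP BY page");
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}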
e. Pig
Pig is a high-level data flow platform that runs on top of Hadoop and HDFS. The language used in
Pig is called Pig Latin, and it is broadly similar to SQL. It is used to load the data, apply the
required filters and dump the data in the required format. For program execution, Pig requires a
Java runtime environment.
f. HBase
It is a distributed database that was designed to store structured data in tables that could have
billions of rows and millions of columns. HBase is a scalable, distributed NoSQL database built
on top of HDFS. HBase provides real-time access to read or write data in HDFS.
Components of HBase
i. HBase Master: It is not part of the actual data storage, but it negotiates load balancing across all
RegionServers.
ii. RegionServer: It is the worker node which handles read, write, update and delete requests
from clients. The RegionServer process runs on every node in the Hadoop cluster, alongside the
HDFS DataNode.
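As a hedged sketch of the real-time read/write access described above, the snippet below uses the standard HBase Java client to write and then read a single cell; the table name, row key and column names are invented for illustration, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGetExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("messages"))) {

      // Write one cell: row key = user id + timestamp, column family "msg", qualifier "body".
      Put put = new Put(Bytes.toBytes("user42#20160101"));
      put.addColumn(Bytes.toBytes("msg"), Bytes.toBytes("body"), Bytes.toBytes("hello"));
      table.put(put);

      // Read it back immediately, served by the RegionServer holding that row.
      Get get = new Get(Bytes.toBytes("user42#20160101"));
      Result result = table.get(get);
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("msg"), Bytes.toBytes("body"))));
    }
  }
}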
g. HCatalog
HCatalog is a table and storage management layer for Hadoop. HCatalog enables different
components available in Hadoop, like MapReduce, Hive and Pig, to easily read and write data from
the cluster. HCatalog is a key component of Hive that enables the user to store their data in any
format and structure.
By default, HCatalog supports RCFile, CSV, JSON, SequenceFile and ORC file formats.
Benefits of HCatalog:
Enables notifications of data availability.
With the table abstraction, HCatalog frees the user from the overhead of data storage.
Provides visibility for data cleaning and archiving tools.
h. Avro
Avro is a popular data serialization system. Avro is an open source project that provides
data serialization and data exchange services for Hadoop. These services can be used together or
independently; using Avro, big data programs written in different languages can exchange data.
Using the serialization service, programs can serialize data into files or messages. Avro stores the
data definition and the data together in one message or file, making it easy for programs to
dynamically understand the information stored in an Avro file or message.
Avro schema: Avro relies on schemas for serialization/deserialization. Avro requires a schema
when data is written or read. When Avro data is stored in a file, its schema is stored with it, so
that the file may be processed later by any program.
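A small, hedged sketch of the idea that the schema travels with the data: the code below defines a schema in JSON, writes one record to an Avro data file, and reads the file back without any generated classes, because the schema comes from the file header. The schema fields and file name are assumptions for illustration.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroRoundTrip {
  public static void main(String[] args) throws Exception {
    // Schema defined as JSON; it will be embedded in the data file itself.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"age\",\"type\":\"int\"}]}");

    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "Alice");
    user.put("age", 30);

    File file = new File("users.avro");
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, file);   // the schema is written into the file header
      writer.append(user);
    }

    // A reader needs no prior knowledge of the schema: it is read back from the file.
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
      for (GenericRecord rec : reader) {
        System.out.println(rec.get("name") + " is " + rec.get("age"));
      }
    }
  }
}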
Dynamic typing: This refers to serialization and deserialization without code generation. It
complements the code generation that is available in Avro for statically typed languages as an
optional optimization.
Avro provides:
Rich data structures.
Remote procedure call.
Compact, fast, binary data format.
Container file, to store persistent data.
i. Thrift
It is a software framework for scalable cross-language services development. Thrift is an
interface definition language used for RPC communication. Hadoop does a lot of RPC calls so
there is a possibility of using Apache Thrift for performance or other reasons.
j. Apache Drill
The main purpose of Drill is large-scale data processing, including structured and semi-
structured data. It is a low-latency distributed query engine that is designed to scale to several
thousands of nodes and query petabytes of data. Drill is the first distributed SQL query
engine that has a schema-free model.
k. Apache Mahout
Mahout is an open source framework that is primarily used for creating scalable machine learning
algorithms and data mining libraries. Once data is stored in HDFS, Mahout provides data science
tools to automatically find meaningful patterns in those big data sets.
l. Apache Sqoop
Sqoop is used for importing data from external sources into related Hadoop components like
HDFS, HBase or Hive. It is also used for exporting data from Hadoop to other external sources.
Sqoop works with relational databases such as Teradata, Netezza, Oracle and MySQL.
m. Apache Flume
Flume is used for efficiently collecting, aggregating and moving large amounts of data from their
origin into HDFS. It is a fault-tolerant and reliable mechanism. Flume was created to allow data
to flow from a source into the Hadoop environment. It uses a simple, extensible data model that
allows for online analytic applications. Using Flume, we can get data from multiple servers into
Hadoop immediately.
n. Ambari
Ambari is a management platform for provisioning, managing, monitoring and securing Apache
Hadoop clusters. Hadoop management gets simpler, as Ambari provides a consistent, secure
platform for operational control.
Features of Ambari:
Simplified installation, configuration, and management: Ambari easily and efficiently
creates and manages clusters at scale.
Centralized security setup: Ambari reduces the complexity of administering and configuring
cluster security across the entire platform.
Highly extensible and customizable: Ambari is highly extensible for bringing custom
services under management.
Full visibility into cluster health: Ambari ensures that the cluster is healthy and available
with a holistic approach to monitoring.
o. Zookeeper
Apache ZooKeeper is a centralized service for maintaining configuration information, naming,
providing distributed synchronization, and providing group services. ZooKeeper is used to
manage and coordinate a large cluster of machines.
Features of ZooKeeper:
Fast: ZooKeeper is fast with workloads where reads of the data are more common than writes.
The ideal read/write ratio is about 10:1.
Ordered: ZooKeeper maintains a record of all transactions, which can be used for high-level
abstractions such as synchronization primitives.
p. Oozie
It is a workflow scheduler system for managing Apache Hadoop jobs. Oozie combines multiple
jobs sequentially into one logical unit of work. The Oozie framework is fully integrated with the
Apache Hadoop stack, with YARN as its architectural center, and supports Hadoop jobs for Apache
MapReduce, Pig, Hive, and Sqoop.
In Oozie, users are permitted to create Directed Acyclic Graphs of workflows, which can run in
parallel and sequentially in Hadoop. Oozie is scalable and can manage the timely execution of
thousands of workflows in a Hadoop cluster. Oozie is also very flexible: one can easily
start, stop, suspend and rerun jobs, and it is even possible to skip a specific failed node or rerun it in
Oozie.
A MapReduce program is composed of a Map() procedure (method) that performs filtering and
sorting (such as sorting students by first name into queues, one queue for each name) and a
Reduce() method that performs a summary operation (such as counting the number of students in
each queue, yielding name frequencies). The "MapReduce System" (also called "infrastructure"
or "framework") orchestrates the processing by marshalling the distributed servers, running the
various tasks in parallel, managing all communications and data transfers between the various
parts of the system, and providing for redundancy and fault tolerance.
The model is a specialization of the split-apply-combine strategy for data analysis. It is inspired
by the map and reduce functions commonly used in functional programming, although their
purpose in the MapReduce framework is not the same as in their original forms. The key
contributions of the MapReduce framework are not the actual map and reduce functions (which,
for example, resemble the 1995 Message Passing Interface standard's reduce and scatter
operations), but the scalability and fault-tolerance achieved for a variety of applications by
optimizing the execution engine. As such, a single-threaded implementation of MapReduce will
usually not be faster than a traditional (non-MapReduce) implementation; any gains are usually
only seen with multi-threaded implementations. The use of this model is beneficial only when
the optimized distributed shuffle operation (which reduces network communication cost) and
fault tolerance features of the MapReduce framework come into play. Optimizing the
communication cost is essential to a good MapReduce algorithm.
MapReduce libraries have been written in many programming languages, with different levels of
optimization. A popular open-source implementation that has support for distributed shuffles is
part of Apache Hadoop. The name MapReduce originally referred to the proprietary Google
technology, but has since been genericized. By 2014, Google was no longer using MapReduce as
their primary Big Data processing model, and development on Apache Mahout had moved on to
more capable and less disk-oriented mechanisms that incorporated full map and reduce
capabilities.
MapReduce is a framework for processing parallelizable problems across large datasets using a
large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the
same local network and use similar hardware) or a grid (if the nodes are shared across
geographically and administratively distributed systems, and use more heterogeneous hardware).
Processing can occur on data stored either in a filesystem (unstructured) or in a database
(structured). MapReduce can take advantage of the locality of data, processing it near the place it
is stored in order to minimize communication overhead.
"Map" step: Each worker node applies the "map()" function to the local data, and writes
the output to a temporary storage. A master node ensures that only one copy of redundant
input data is processed.
"Shuffle" step: Worker nodes redistribute data based on the output keys (produced by the
"map()" function), such that all data belonging to one key is located on the same worker
node.
"Reduce" step: Worker nodes now process each group of output data, per key, in parallel.
MapReduce allows for distributed processing of the map and reduction operations. Provided that
each mapping operation is independent of the others, all maps can be performed in parallel,
though in practice this is limited by the number of independent data sources and/or the number of
CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase, provided
that all outputs of the map operation that share the same key are presented to the same reducer at
the same time, or that the reduction function is associative. While this process can often appear
inefficient compared to algorithms that are more sequential (because multiple instances of the
reduction process must be run rather than one), MapReduce can be applied to significantly larger
datasets than "commodity" servers can handle; a large server farm can use MapReduce to sort a
petabyte of data in only a few hours. The parallelism also offers some possibility of recovering
from partial failure of servers or storage during the operation: if one mapper or reducer fails, the
work can be rescheduled assuming the input data is still available.
1. Prepare the Map() input: the "MapReduce system" designates Map processors, assigns
the input key value K1 that each processor would work on, and provides that processor
with all the input data associated with that key value.
2. Run the user-provided Map() code: Map() is run exactly once for each K1 key value,
generating output organized by key values K2.
3. "Shuffle" the Map output to the Reduce processors: the MapReduce system designates
Reduce processors, assigns the K2 key value each processor should work on, and
provides that processor with all the Map-generated data associated with that key value.
4. Run the user-provided Reduce() code: Reduce() is run exactly once for each K2 key
value produced by the Map step.
5. Produce the final output: the MapReduce system collects all the Reduce output, and
sorts it by K2 to produce the final outcome.
These five steps can be logically thought of as running in sequence each step starts only after
the previous step is completed although in practice they can be interleaved as long as the final
result is not affected.
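To make the five steps concrete, the following is a minimal, single-process Python sketch of this control flow. It is only an illustration: the function and variable names are invented here and do not come from any MapReduce library, and a real system would distribute the K1 batches and K2 groups across many machines.
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy, single-process illustration of the five logical MapReduce steps."""
    # Steps 1-2: for every (K1, value) input, run the user-provided Map() once per K1.
    intermediate = []
    for k1, v1 in inputs.items():
        intermediate.extend(map_fn(k1, v1))          # yields (K2, V2) pairs

    # Step 3: "shuffle" - group all intermediate values by their K2 key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # Step 4: run the user-provided Reduce() exactly once per K2 key.
    reduced = {k2: reduce_fn(k2, values) for k2, values in groups.items()}

    # Step 5: collect the output and sort it by K2.
    return sorted(reduced.items())

# Example usage with trivial map/reduce functions:
data = {1: [3, 5], 2: [5, 7]}                        # K1 -> batch of numbers
out = run_mapreduce(
    data,
    map_fn=lambda k1, values: [(v % 2, v) for v in values],   # K2 = parity of the value
    reduce_fn=lambda k2, values: sum(values),
)
print(out)   # [(1, 20)] -- all four numbers are odd, so there is a single K2 group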
In many situations, the input data might already be distributed ("sharded") among many different
servers, in which case step 1 could sometimes be greatly simplified by assigning Map servers
that would process the locally present input data. Similarly, step 3 could sometimes be sped up
by assigning Reduce processors that are as close as possible to the Map-generated data they need
to process.
Map(k1,v1) → list(k2,v2)
The Map function is applied in parallel to every pair (keyed by k1) in the input dataset. This
produces a list of pairs (keyed by k2) for each call. After that, the MapReduce framework
collects all pairs with the same key (k2) from all lists and groups them together, creating one
group for each key.
The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain:
Reduce(k2, list(v2)) → list(v3)
Examples
The prototypical MapReduce example counts the appearance of each word in a set of documents:
Here, each document is split into words, and each word is counted by the map function, using the
word as the result key. The framework puts together all the pairs with the same key and feeds
them to the same call to reduce. Thus, this function just needs to sum all of its input values to
find the total appearances of that word.
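The following Python sketch makes the word-count example concrete. The function names (word_count_map, word_count_reduce) are illustrative only, not code from any particular MapReduce library; the driver at the bottom performs the shuffle by hand.
from collections import defaultdict

def word_count_map(document_id, text):
    # Emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in text.split()]

def word_count_reduce(word, counts):
    # All counts for one word arrive together; summing them gives the total.
    return word, sum(counts)

# Minimal driver: shuffle the map output by key, then reduce each group.
documents = {"doc1": "big data big ideas", "doc2": "big clusters"}
groups = defaultdict(list)
for doc_id, text in documents.items():
    for word, count in word_count_map(doc_id, text):
        groups[word].append(count)

print([word_count_reduce(w, c) for w, c in sorted(groups.items())])
# [('big', 3), ('clusters', 1), ('data', 1), ('ideas', 1)]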
As another example, imagine that for a database of 1.1 billion people, one would like to compute the average number of social contacts a person has according to age. In SQL, such a query could be expressed as:
SELECT age, AVG(contacts) FROM social.person GROUP BY age ORDER BY age
Using MapReduce, the K1 key values could be the integers 1 through 1100, each representing a batch of 1 million records, and the K2 key value could be a person's age in years. The computation could be achieved with the following functions:
function Map is
input: integer K1 between 1 and 1100, representing a batch of 1 million social.person records
for each social.person record in the K1 batch do
let Y be the person's age
let N be the number of contacts the person has
produce one output record (Y,(N,1))
repeat
end function
function Reduce is
input: age (in years) Y
for each input record (Y,(N,C)) do
Accumulate in S the sum of N*C
Accumulate in Cnew the sum of C
repeat
let A be S/Cnew
produce one output record (Y,(A,Cnew))
end function
The MapReduce System would line up the 1100 Map processors, and would provide each with
its corresponding 1 million input records. The Map step would produce 1.1 billion (Y,(N,1))
records, with Y values ranging between, say, 8 and 103. The MapReduce system would then shuffle the key/value pairs by age (because the average is needed per age) and line up 96 Reduce processors, one per distinct age value, providing each with its millions of corresponding input records. The Reduce step would result in the much reduced set of only 96 output records (Y,A), which would be put in the final result file, sorted by Y.
The count carried in each record matters whenever the output of one reduction may be reduced again. If the count were not carried along, the computed average would be wrong. For example, suppose three partial files hold records for 10-year-old people: file #1 contains three records with 9 contacts each, file #2 contains two records with 9 contacts each, and file #3 contains one record with 10 contacts.
If we reduce files #1 and #2 without counts, we get a new file with an average of 9 contacts for a 10-year-old person ((9+9+9+9+9)/5):
-- reduce step #1: age, average of contacts
10, 9
If we then reduce that result with file #3, we have lost the count of how many records have already been seen, so we end up with an average of 9.5 contacts for a 10-year-old person ((9+10)/2), which is wrong. The correct answer is 9.166 = 55 / 6 = (9*3+9*2+10*1)/(3+2+1). Carrying (sum, count) pairs through every reduction avoids this error.
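A short Python sketch of this point, using the same numbers (the helper name reduce_partial is illustrative): carrying (sum, count) pairs lets partial results be reduced again without losing information.
def reduce_partial(records):
    """Combine (sum, count) pairs; the result can itself be reduced again."""
    total = sum(s for s, _ in records)
    count = sum(c for _, c in records)
    return (total, count)

file1 = [(9, 1)] * 3          # three 10-year-olds with 9 contacts each
file2 = [(9, 1)] * 2          # two more with 9 contacts each
file3 = [(10, 1)]             # one with 10 contacts

# Wrong: averaging the averages loses the counts.
avg12 = (9 + 9 + 9 + 9 + 9) / 5            # 9.0
wrong = (avg12 + 10) / 2                   # 9.5

# Right: reduce (sum, count) pairs, then divide once at the end.
partial = reduce_partial(file1 + file2)            # (45, 5)
total, count = reduce_partial([partial] + file3)   # (55, 6)
print(wrong, total / count)                        # 9.5 versus 9.166...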
2.4.3 Dataflow
The frozen part of the MapReduce framework is a large distributed sort. The hot spots, which the
application defines, are:
an input reader
a Map function
a partition function
a compare function
a Reduce function
an output writer
A common input reader, for example, reads a directory full of text files and returns each line as a record.
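A minimal sketch of such an input reader in Python (purely illustrative; real frameworks supply their own readers and record formats) might look like this:
from pathlib import Path

def text_dir_reader(directory):
    """Read every .txt file in a directory and yield one record per line."""
    for path in sorted(Path(directory).glob("*.txt")):
        with path.open(encoding="utf-8") as handle:
            for line_number, line in enumerate(handle):
                # Key: (file name, line number); value: the line itself.
                yield (path.name, line_number), line.rstrip("\n")

# Usage (assuming a local folder named "input_docs" containing .txt files):
# for key, record in text_dir_reader("input_docs"):
#     print(key, record)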
2.4.4 Implementation
Many different implementations of the MapReduce interface are possible. The right choice
depends on the environment. For example, one implementation may be suitable for a small
shared-memory machine, another for a large NUMA multi-processor, and yet another for an
even larger collection of networked machines. This section describes an implementation targeted
to the computing environment in wide use at Google: large clusters of commodity PCs connected
together with switched Ethernet. In our environment:
(1) Machines are typically dual-processor x86 processors running Linux, with 2-4 GB of
memory per machine.
(2) Commodity networking hardware is used, typically either 100 megabits/second or 1
gigabit/second at the machine level, but averaging considerably less in overall bisection
bandwidth.
(3) A cluster consists of hundreds or thousands of machines, and therefore machine failures are
common.
(4) Storage is provided by inexpensive IDE disks attached directly to individual machines. A
distributed file system developed in-house is used to manage the data stored on these disks. The
file system uses replication to provide availability and reliability on top of unreliable hardware.
(5) Users submit jobs to a scheduling system. Each job consists of a set of tasks, and is mapped
by the scheduler to a set of available machines within a cluster.
Execution Overview
After successful completion, the output of the mapreduce execution is available in the R output
files (one per reduce task, with file names as specified by the user). Typically, users do not need
to combine these R output files into one file; they often pass these files as input to another
MapReduce call, or use them from another distributed application that is able to deal with input
that is partitioned into multiple files.
Worker Failure
The master pings every worker periodically. If no response is received from a worker
in a certain amount of time, the master marks the worker as failed. Any map tasks completed by
the worker are reset back to their initial idle state, and therefore become eligible for scheduling
on other workers. Similarly, any map task or reduce task in progress on a failed worker is also
reset to idle and becomes eligible for rescheduling.
Completed map tasks are re-executed on a failure because their output is stored on the
local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not
need to be re-executed since their output is stored in a global file system.
When a map task is executed first by worker A and then later executed by worker B
(because A failed), all workers executing reduce tasks are notified of the reexecution. Any
reduce task that has not already read the data from worker A will read the data from worker B.
MapReduce is resilient to large-scale worker failures. For example, during one
MapReduce operation, network maintenance on a running cluster was causing groups of 80
machines at a time to become unreachable for several minutes. The MapReduce master simply re-
executed the work done by the unreachable worker machines, and continued to make forward
progress, eventually completing the MapReduce operation.
Master Failure
It is easy to make the master write periodic checkpoints of the master data structures
described above. If the master task dies, a new copy can be started from the last checkpointed
state. However, given that there is only a single master, its failure is unlikely; therefore our
current implementation aborts the MapReduce computation if the master fails. Clients can check
for this condition and retry the MapReduce operation if they desire.
When designing a MapReduce algorithm, the author needs to choose a good tradeoff between the
computation and the communication costs. Communication cost often dominates the
computation cost, and many MapReduce implementations are designed to write all
communication to distributed storage for crash recovery.
For processes that complete quickly, and where the data fits into main memory of a single
machine or a small cluster, using a MapReduce framework usually is not effective. Since these
frameworks are designed to recover from the loss of whole nodes during the computation, they
write interim results to distributed storage. This crash recovery is expensive, and only pays off
when the computation involves many computers and a long runtime of the computation. A task
that completes in seconds can just be restarted in the case of an error, and the likelihood of at
least one machine failing grows quickly with the cluster size. On such problems, implementations that keep all data in memory and simply restart a computation on node failures, or (when the data is small enough) non-distributed solutions, will often be faster than a MapReduce system.
The reduce operations operate much the same way. Because of their inferior properties with
regard to parallel operations, the master node attempts to schedule reduce operations on the same
node, or in the same rack as the node holding the data being operated on. This property is
desirable as it conserves bandwidth across the backbone network of the datacenter.
Implementations are not necessarily highly reliable. For example, in older versions of Hadoop
the NameNode was a single point of failure for the distributed filesystem. Later versions of
Hadoop have high availability with an active/passive failover for the "NameNode."
2.4.7 Uses
MapReduce is useful in a wide range of applications, including distributed pattern-based
searching, distributed sorting, web link-graph reversal, Singular Value Decomposition, web
access log stats, inverted index construction, document clustering, machine learning, and
statistical machine translation. Moreover, the MapReduce model has been adapted to several
computing environments like multi-core and many-core systems, desktop grids, multi-cluster,
volunteer computing environments, dynamic cloud environments, mobile environments, and
high-performance computing environments.
At Google, MapReduce was used to completely regenerate Google's index of the World Wide
Web. It replaced the old ad hoc programs that updated the index and ran the various analyses.
Development at Google has since moved on to technologies such as Percolator, FlumeJava and
MillWheel that offer streaming operation and updates instead of batch processing, to allow
integrating "live" search results without rebuilding the complete index.
MapReduce's stable inputs and outputs are usually stored in a distributed file system. The
transient data are usually stored on local disk and fetched remotely by the reducers.
2.4.8 Criticism
2.4.8.1 Lack of novelty
David DeWitt and Michael Stonebraker, computer scientists specializing in parallel databases
and shared-nothing architectures, have been critical of the breadth of problems that MapReduce
can be used for. They called its interface too low-level and questioned whether it really
represents the paradigm shift its proponents have claimed it is. They challenged the MapReduce
proponents' claims of novelty, citing Teradata as an example of prior art that has existed for over
two decades. They also compared MapReduce programmers to CODASYL programmers, noting
both are "writing in a low-level language performing low-level record manipulation."
MapReduce's use of input files and lack of schema support prevent the performance
improvements enabled by common database system features such as B-trees and hash
partitioning, though projects such as Pig (or PigLatin), Sawzall, Apache Hive, YSmart, HBase
and BigTable are addressing some of these problems.
Greg Jorgensen wrote an article rejecting these views. Jorgensen asserts that DeWitt and
Stonebraker's entire analysis is groundless as MapReduce was never designed nor intended to be
used as a database.
DeWitt and Stonebraker have subsequently published a detailed benchmark study in 2009
comparing performance of Hadoop's MapReduce and RDBMS approaches on several specific
problems. They concluded that relational databases offer real advantages for many kinds of data
use, especially on complex processing or where the data is used across an enterprise, but that
MapReduce may be easier for users to adopt for simple or one-time processing tasks.
Google has been granted a patent on MapReduce. However, there have been claims that this
patent should not have been granted because MapReduce is too similar to existing products. For
example, map and reduce functionality can be implemented very easily in Oracle's PL/SQL database-oriented language, or is supported transparently for developers in distributed database architectures such as the Clusterpoint XML database or the MongoDB NoSQL database.
In a MongoDB map-reduce operation, MongoDB applies the map phase to each input document (i.e., the documents in the collection that match the query condition). The map function emits key-value
pairs. For those keys that have multiple values, MongoDB applies the reduce phase, which
collects and condenses the aggregated data. MongoDB then stores the results in a collection.
Optionally, the output of the reduce function may pass through a finalize function to further
condense or process the results of the aggregation.
All map-reduce functions in MongoDB are JavaScript and run within the mongod process. Map-
reduce operations take the documents of a single collection as the input and can perform any
arbitrary sorting and limiting before beginning the map stage. mapReduce can return the results
of a map-reduce operation as a document, or may write the results to collections. The input and
the output collections may be sharded.
NOTE
For most aggregation operations, the Aggregation Pipeline provides better performance and a more coherent interface. However, map-reduce operations provide some flexibility that is not presently
available in the aggregation pipeline.
Map-Reduce Behavior
In MongoDB, the map-reduce operation can write results to a collection or return the results
inline. If you write map-reduce output to a collection, you can perform subsequent map-reduce
operations on the same input collection that replace, merge, or reduce new results with
previous results. See mapReduce and Perform Incremental Map-Reduce for details and
examples.
When returning the results of a map reduce operation inline, the result documents must be within
the BSON Document Size limit, which is currently 16 megabytes. For additional information on
limits and restrictions on map-reduce operations, see the mapReduce reference page.
MongoDB supports map-reduce operations on sharded collections. Map-reduce operations can
also output the results to a sharded collection. See Map-Reduce and Sharded Collections.
Views do not support map-reduce operations.
As an example, consider an orders collection containing documents of the following form. The first example computes the total price per customer: a map function (mapFunction1) emits each document's cust_id as the key and its price as the value, and a reduce function (reduceFunction1) sums the prices for each customer.
{
_id: ObjectId("50a8240b927d5d8b5891743c"),
cust_id: "abc123",
ord_date: new Date("Oct 04, 2012"),
status: 'A',
price: 25,
items: [ { sku: "mmm", qty: 5, price: 2.5 },
{ sku: "nnn", qty: 5, price: 2.5 } ]
}
2. Define the corresponding reduce function with two arguments, keyCustId and valuesPrices:
valuesPrices is an array whose elements are the price values emitted by the map function and grouped by keyCustId.
The function reduces the valuesPrices array to the sum of its elements.
3. Perform the map-reduce on all documents in the orders collection using the mapFunction1
map function and the reduceFunction1 reduce function.
db.orders.mapReduce(
mapFunction1,
reduceFunction1,
{ out: "map_reduce_example" }
)
This operation outputs the results to a collection named map_reduce_example. If the map_reduce_example collection already exists, the operation will replace the contents with the results of this map-reduce operation.
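The equivalent operation can also be issued from Python with PyMongo's generic command helper. This is only a hedged sketch: the JavaScript bodies below are reconstructions of the mapFunction1 and reduceFunction1 behavior described above (emit cust_id and price, then sum the prices), not code taken from this document, and it assumes a MongoDB deployment that still accepts the mapReduce command.
from pymongo import MongoClient
from bson.code import Code

client = MongoClient("mongodb://localhost:27017")   # assumed connection string
db = client["test"]                                  # assumed database name

# Reconstructed map/reduce bodies (assumptions matching the description above).
map_function_1 = Code("function() { emit(this.cust_id, this.price); }")
reduce_function_1 = Code(
    "function(keyCustId, valuesPrices) { return Array.sum(valuesPrices); }"
)

# Run the server-side mapReduce command against the orders collection.
result = db.command(
    "mapReduce",
    "orders",
    map=map_function_1,
    reduce=reduce_function_1,
    out="map_reduce_example",
)
print(result)   # the command reply describes the output collection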
Calculate Order and Total Quantity with Average Quantity Per Item
In this example, you will perform a map-reduce operation on the orders collection for all
documents that have an ord_date value greater than 01/01/2012. The operation groups by the
item.sku field, and calculates the number of orders and the total quantity ordered for each sku.
The operation concludes by calculating the average quantity per order for each sku value:
2. Define the corresponding reduce function with two arguments, keySKU and countObjVals:
countObjVals is an array whose elements are the objects mapped to the grouped keySKU values passed by the map function to the reducer function.
The function reduces the countObjVals array to a single object reducedVal that contains the count and qty fields.
In reducedVal, the count field contains the sum of the count fields from the individual array elements, and the qty field contains the sum of the qty fields from the individual array elements.
var reduceFunction2 = function(keySKU, countObjVals) {
    var reducedVal = { count: 0, qty: 0 };
    for (var idx = 0; idx < countObjVals.length; idx++) {
        reducedVal.count += countObjVals[idx].count;
        reducedVal.qty += countObjVals[idx].qty;
    }
    return reducedVal;
};
3. Define a finalize function with two arguments key and reducedVal. The function modifies the
reducedVal object to add a computed field named avg and returns the modified object:
var finalizeFunction2 = function (key, reducedVal) {
reducedVal.avg = reducedVal.qty/reducedVal.count;
return reducedVal;
};
4. Perform the map-reduce operation on the orders collection using the mapFunction2,
reduceFunction2, and finalizeFunction2 functions.
db.orders.mapReduce( mapFunction2,
reduceFunction2,
{
out: { merge: "map_reduce_example" },
query: { ord_date:
{ $gt: new Date('01/01/2012') }
},
finalize: finalizeFunction2
}
)
This operation uses the query field to select only those documents with ord_date greater than new Date('01/01/2012'). It then outputs the results to the map_reduce_example collection. If the map_reduce_example collection already exists, the operation will merge the existing contents with the results of this map-reduce operation.
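As the earlier note suggests, the same per-SKU summary can usually be expressed with the aggregation pipeline instead of map-reduce. The following PyMongo sketch is an illustrative approximation (field and collection names follow the example above; it counts order line items per SKU, which matches the order count only when each order lists a SKU at most once):
from datetime import datetime
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["test"]   # assumed connection

pipeline = [
    {"$match": {"ord_date": {"$gt": datetime(2012, 1, 1)}}},
    {"$unwind": "$items"},
    {"$group": {
        "_id": "$items.sku",
        "count": {"$sum": 1},                 # number of order line items per sku
        "qty": {"$sum": "$items.qty"},        # total quantity ordered per sku
    }},
    {"$addFields": {"avg": {"$divide": ["$qty", "$count"]}}},
]

for doc in db.orders.aggregate(pipeline):
    print(doc)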
Note: Be sure to place the generic options before the streaming options, otherwise the command will fail. For an example, see Making Archives Available to Tasks.
Hadoop Pipes is the C++ interface to Hadoop MapReduce. A pipes job is submitted with the following command:
bin/hadoop pipes \
[-input inputDir] \
[-output outputDir] \
[-jar applicationJarFile] \
[-inputformat class] \
[-map class] \
[-partitioner class] \
[-reduce class] \
[-writer class] \
[-program program url] \
[-conf configuration file] \
[-D property=value] \
[-fs local|namenode:port] \
[-jt local|jobtracker:port] \
[-files comma separated list of files] \
[-libjars comma separated list of jars] \
[-archives comma separated list of archives]
The application programs link against a thin C++ wrapper library that handles the
communication with the rest of the Hadoop system. The C++ interface is "swigable" so that
interfaces can be generated for python and other scripting languages. All of the C++ functions
and classes are in the HadoopPipes namespace. The job may consist of any combination of Java
and C++ RecordReaders, Mappers, Partitioner, Combiner, Reducer, and RecordWriter.
Hadoop Pipes has a generic Java class for handling the mapper and reducer (PipesMapRunner
and PipesReducer). They fork off the application program and communicate with it over a
socket. The communication is handled by the C++ wrapper library and the PipesMapRunner and
PipesReducer.
The application program passes in a factory object that can create the various objects needed by
the framework to the runTask function. The framework creates the Mapper or Reducer as
appropriate and calls the map or reduce method to invoke the application's code. The JobConf is
available to the application.
The Mapper and Reducer objects get all of their inputs, outputs, and context via context objects.
The advantage of using the context objects is that their interface can be extended with additional
methods without breaking clients. Although this interface is different from the current Java
interface, the plan is to migrate the Java interface in this direction.
Although the Java implementation is typed, the C++ interface for keys and values is just a byte
buffer. Since STL strings provide precisely the right functionality and are standard, they will be
used. The decision to not use stronger types was to simplify the interface.
The application can also define combiner functions. The combiner will be run locally by the
framework in the application process to avoid the round trip to the Java process and back.
Because the compare function is not available in C++, the combiner will use memcmp to sort the
inputs to the combiner. This is not as general as the Java equivalent, which uses the user's
comparator, but should cover the majority of the use cases. As the map function outputs
key/value pairs, they will be buffered. When the buffer is full, it will be sorted and passed to the
combiner. The output of the combiner will be sent to the Java process.
The application can also set a partition function to control which key is given to a particular
reduce. If a partition function is not defined, the Java one will be used. The partition function
will be called by the C++ framework before the key/value pair is sent back to Java.
The application programs can also register counters with a group and a name and also increment
the counters and get the counter values.
HDFS holds very large amounts of data and provides easy access. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to protect the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.
The Hadoop Distributed File System (HDFS) splits large data files into parts which are managed by different machines in the cluster. Each part is replicated across several machines in the cluster, so that a single machine failure does not make the data unavailable. In the Hadoop programming framework, data is record oriented: input files are broken into records in a format specific to the application logic. Subsets of these records are then processed by each process running on a machine in the cluster. Using knowledge from the DFS, the Hadoop framework schedules these processes based on the location of the records or data. The files are spread across the DFS as chunks and are processed by the process running on the node. The Hadoop framework helps prevent unwanted network transfers and strain on the network by reading data from the local disk directly into the CPU. Hadoop thus achieves high performance through data locality, moving the computation to the data.
HDFS stores file system metadata and application data separately. As in other
distributed file systems, like PVFS, Lustre and GFS, HDFS stores metadata on a dedicated
server, called the NameNode. Application data are stored on other servers called DataNodes. All
servers are fully connected and communicate with each other using TCP-based protocols.
2.8.2 Goals
1 Hardware Failure
Hardware failure is the norm rather than the exception. An HDFS instance may consist of
hundreds or thousands of server machines, each storing part of the file system's data. The fact
that there are a huge number of components and that each component has a nontrivial probability
of failure means that some component of HDFS is always non-functional. Therefore, detection of
faults and quick, automatic recovery from them is a core architectural goal of HDFS.
B. DataNodes
Each block replica on a DataNode is represented by two files in the local host's native file system. The first file contains the data itself and the second file holds the block's metadata, including checksums for the block data and the block's generation stamp. The size of the data file equals
the actual length of the block and does not require extra space to round it up to the nominal block
size as in traditional file systems. Thus, if a block is half full it needs only half of the space of the
full block on the local drive.
During startup each DataNode connects to the NameNode and performs a handshake.
The purpose of the handshake is to verify the namespace ID and the software version of the
DataNode. If either does not match that of the NameNode the DataNode automatically shuts
down.
The namespace ID is assigned to the file system instance when it is formatted. The
namespace ID is persistently stored on all nodes of the cluster. Nodes with a different namespace
ID will not be able to join the cluster, thus preserving the integrity of the file system.
The consistency of software versions is important because an incompatible version may
cause data corruption or loss, and on large clusters of thousands of machines it is easy to
overlook nodes that did not shut down properly prior to the software upgrade or were not
available during the upgrade.
A DataNode that is newly initialized and without any namespace ID is permitted to join the cluster and receive the cluster's namespace ID.
After the handshake the DataNode registers with the NameNode. DataNodes
persistently store their unique storage IDs. The storage ID is an internal identifier of the
DataNode, which makes it recognizable even if it is restarted with a different IP address or port.
The storage ID is assigned to the DataNode when it registers with the NameNode for the first
time and never changes after that.
A DataNode identifies block replicas in its possession to the NameNode by sending a
block report. A block report contains the block id, the generation stamp and the length for each
block replica the server hosts. The first block report is sent immediately after the DataNode
registration. Subsequent block reports are sent every hour and provide the NameNode with an
up-to-date view of where block replicas are located on the cluster.
During normal operation DataNodes send heartbeats to the NameNode to confirm that
the DataNode is operating and the block replicas it hosts are available. The default heartbeat
interval is three seconds. If the NameNode does not receive a heartbeat from a DataNode in ten
minutes the NameNode considers the DataNode to be out of service and the block replicas
hosted by that DataNode to be unavailable. The NameNode then schedules creation of new
replicas of those blocks on other DataNodes.
Heartbeats from a DataNode also carry information about total storage capacity,
fraction of storage in use, and the number of data transfers currently in progress. These statistics
are used for the NameNode's space allocation and load balancing decisions.
The NameNode does not directly call DataNodes. It uses replies to heartbeats to send
instructions to the DataNodes. The instructions include commands to:
replicate blocks to other nodes;
remove local block replicas;
re-register or to shut down the node;
send an immediate block report.
These commands are important for maintaining the overall system integrity and therefore it is
critical to keep heartbeats frequent even on big clusters. The NameNode can process thousands
of heartbeats per second without affecting other NameNode operations.
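The following Python sketch illustrates the NameNode-side bookkeeping described above: DataNodes are marked dead if no heartbeat arrives within a timeout (ten minutes by default), and instructions are returned as replies to heartbeats rather than pushed directly. All class and method names here are invented for illustration; this is not HDFS source code.
import time

HEARTBEAT_INTERVAL = 3          # seconds (HDFS default)
DEAD_TIMEOUT = 10 * 60          # ten minutes without a heartbeat

class HeartbeatMonitor:
    def __init__(self):
        self.last_seen = {}      # datanode id -> timestamp of last heartbeat
        self.pending = {}        # datanode id -> list of queued instructions

    def on_heartbeat(self, node_id, capacity, used, transfers):
        """Record the heartbeat and return any queued instructions as the reply."""
        self.last_seen[node_id] = time.time()
        # capacity/used/transfers would feed space allocation and balancing decisions.
        return self.pending.pop(node_id, [])

    def queue_instruction(self, node_id, command):
        # e.g. "replicate block", "remove replica", "re-register", "send block report"
        self.pending.setdefault(node_id, []).append(command)

    def dead_nodes(self):
        now = time.time()
        return [n for n, t in self.last_seen.items() if now - t > DEAD_TIMEOUT]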
C. HDFS Client
User applications access the file system using the HDFS client, a code library that
exports the HDFS file system interface.
Similar to most conventional file systems, HDFS supports operations to read, write and
delete files, and operations to create and delete directories. The user references files and
directories by paths in the namespace. The user application generally does not need to know that
file system metadata and storage are on different servers, or that blocks have multiple replicas.
When an application reads a file, the HDFS client first asks the NameNode for the list
of DataNodes that host replicas of the blocks of the file. It then contacts a DataNode directly and
requests the transfer of the desired block. When a client writes, it first asks the NameNode to
choose DataNodes to host replicas of the first block of the file. The client organizes a pipeline
from node-to-node and sends the data. When the first block is filled, the client requests new
DataNodes to be chosen to host replicas of the next block. A new pipeline is organized, and the
client sends the further bytes of the file. Each choice of DataNodes is likely to be different. The
interactions among the client, the NameNode and the DataNodes are illustrated in Fig. 1.
Figure 1. An HDFS client creates a new file by giving its path to the NameNode. For each block of the file, the
NameNode returns a list of DataNodes to host its replicas. The client then pipelines data to the chosen DataNodes,
which eventually confirm the creation of the block replicas to the NameNode
Unlike conventional file systems, HDFS provides an API that exposes the locations of a file's blocks. This allows applications like the MapReduce framework to schedule a task to where the data are located, thus improving the read performance. It also allows an application to set the replication factor of a file. By default a file's replication factor is three. For critical files or files which are accessed very often, a higher replication factor improves their tolerance against faults and increases their read bandwidth.
E. CheckpointNode
The NameNode in HDFS, in addition to its primary role serving client requests, can
alternatively execute either of two other roles, either a CheckpointNode or a BackupNode. The
role is specified at the node startup.
The CheckpointNode periodically combines the existing checkpoint and journal to
create a new checkpoint and an empty journal. The CheckpointNode usually runs on a different
host from the NameNode since it has the same memory requirements as the NameNode. It
downloads the current checkpoint and journal files from the NameNode, merges them locally,
and returns the new checkpoint back to the NameNode.
Creating periodic checkpoints is one way to protect the file system metadata. The
system can start from the most recent checkpoint if all other persistent copies of the namespace
image or journal are unavailable.
Creating a checkpoint lets the NameNode truncate the tail of the journal when the new
checkpoint is uploaded to the NameNode. HDFS clusters run for prolonged periods of time
without restarts during which the journal constantly grows. If the journal grows very large, the
probability of loss or corruption of the journal file increases. Also, a very large journal extends
the time required to restart the NameNode. For a large cluster, it takes an hour to process a week-
long journal. Good practice is to create a daily checkpoint.
F. BackupNode
A recently introduced feature of HDFS is the BackupNode. Like a CheckpointNode,
the BackupNode is capable of creating periodic checkpoints, but in addition it maintains an
inmemory, up-to-date image of the file system namespace that is always synchronized with the
state of the NameNode.
The BackupNode accepts the journal stream of namespace transactions from the active
NameNode, saves them to its own storage directories, and applies these transactions to its own
namespace image in memory. The NameNode treats the BackupNode as a journal store the same
as it treats journal files in its storage directories. If the NameNode fails, the BackupNode's image in memory and the checkpoint on disk provide a record of the latest namespace state.
The BackupNode can create a checkpoint without downloading checkpoint and journal
files from the active NameNode, since it already has an up-to-date namespace image in its
memory. This makes the checkpoint process on the BackupNode more efficient as it only needs
to save the namespace into its local storage directories.
The BackupNode can be viewed as a read-only NameNode. It contains all file system
metadata information except for block locations. It can perform all operations of the regular
NameNode that do not involve modification of the namespace or knowledge of block locations.
Use of a BackupNode provides the option of running the NameNode without persistent storage, delegating responsibility for persisting the namespace state to the BackupNode.
Consider the case where a client creates a new file, writes data to it, and then closes the file. Writing data to HDFS involves seven steps:
Step 1: The client creates the file by calling the create() method on DistributedFileSystem.
Step 2: Distributed File System makes an RPC call to the namenode to create a new file in the
filesystem's namespace, with no blocks associated with it. The namenode performs various
checks to make sure the file doesn't already exist and that the client has the right permissions to
create the file. If these checks pass, the namenode makes a record of the new file; otherwise, file
creation fails and the client is thrown an IOException. The Distributed FileSystem returns an
FSDataOutputStream for the client to start writing data to. Just as in the read case,
FSDataOutputStream wraps a DFSOutputStream, which handles communication with the
datanodes and namenode.
Step 3: As the client writes data, DFSOutputStream splits it into packets, which it writes to an internal queue, called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline, and here we'll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first datanode in the pipeline, which stores the packet and forwards it to the second datanode in the pipeline.
Step 4: Similarly, the second datanode stores the packet and forwards it to the third (and last)
datanode in the pipeline.
Step 5: DFSOutputStream also maintains an internal queue of packets that are waiting to be
acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only
when it has been acknowledged by all the datanodes in the pipeline.
Step 6: When the client has finished writing data, it calls close() on the stream.
Step 7: This action flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is complete. The namenode already knows which blocks the file is made up of (via DataStreamer asking for block allocations), so it only has to wait for blocks to be minimally replicated before returning successfully.
Fig 4 shows the six steps involved in reading a file from HDFS. Suppose a client (an HDFS client) wants to read a file from HDFS. The steps involved in reading the file are:
Step 1: First the client opens the file by calling the open() method on the FileSystem object, which for HDFS is an instance of the DistributedFileSystem class.
Step 2: DistributedFileSystem calls the Namenode, using RPC, to determine the locations of the blocks for the first few blocks of the file. For each block, the namenode returns the addresses of all the datanodes that have a copy of that block. The DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the
datanode addresses for the first few blocks in the file, then connects to the first closest datanode
for the first block in the file.
Step 4: Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream.
Step 5: When the end of the block is reached, DFSInputStream will close the connection to the
datanode, then find the best datanode for the next block. This happens transparently to the client,
which from its point of view is just reading a continuous stream.
Step 6: Blocks are read in order, with the DFSInputStream opening new connections to
datanodes as the client reads through the stream. It will also call the namenode to retrieve the
datanode locations for the next batch of blocks as needed. When the client has finished reading,
it calls close() on the FSDataInputStream.
HDFS estimates the network bandwidth between two nodes by their distance. The distance from a node to its parent node is assumed to be one. The distance between two nodes can be calculated by summing up their distances to their closest common ancestor. A shorter distance between two nodes means that they can utilize greater bandwidth to transfer data.
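A small Python sketch of this distance calculation (the /datacenter/rack/host naming scheme below is an assumption chosen for illustration, with every edge counting as one hop):
def network_distance(path_a, path_b):
    """Sum the hops from each node up to the closest common ancestor."""
    a = path_a.strip("/").split("/")
    b = path_b.strip("/").split("/")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)

print(network_distance("/d1/rack1/host1", "/d1/rack1/host1"))  # 0 (same node)
print(network_distance("/d1/rack1/host1", "/d1/rack1/host2"))  # 2 (same rack)
print(network_distance("/d1/rack1/host1", "/d1/rack2/host3"))  # 4 (different racks)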
HDFS allows an administrator to configure a script that returns a node's rack identification given a node's address. The NameNode is the central place that resolves the rack location of each DataNode. When a DataNode registers with the NameNode, the NameNode runs the configured script to decide which rack the node belongs to. If no such script is configured, the NameNode assumes that all the nodes belong to a default single rack.
The placement of replicas is critical to HDFS data reliability and read/write
performance. A good replica placement policy should improve data reliability, availability, and
network bandwidth utilization. Currently HDFS provides a configurable block placement policy
interface so that users and researchers can experiment with and test any policy that is optimal for
their applications.
The default HDFS block placement policy provides a tradeoff between minimizing the
write cost, and maximizing data reliability, availability and aggregate read bandwidth. When a
new block is created, HDFS places the first replica on the node where the writer is located, the
second and the third replicas on two different nodes in a different rack, and the rest are placed on
random nodes with restrictions that no more than one replica is placed at one node and no more
than two replicas are placed in the same rack when the number of replicas is less than twice the
number of racks. The choice to place the second and third replicas on a different rack better
distributes the block replicas for a single file across the cluster. If the first two replicas were
placed on the same rack, for any file, two-thirds of its block replicas would be on the same rack.
After all target nodes are selected, nodes are organized as a pipeline in the order of
their proximity to the first replica. Data are pushed to nodes in this order. For reading, the
NameNode first checks if the clients host is located in the cluster. If yes, block locations are
returned to the client in the order of its closeness to the reader. The block is read from
DataNodes in this preference order. (It is usual for MapReduce applications to run on cluster
nodes, but as long as a host can connect to the NameNode and DataNodes, it can execute the
HDFS client.)
This policy reduces the inter-rack and inter-node write traffic and generally improves
write performance. Because the chance of a rack failure is far less than that of a node failure, this
policy does not impact data reliability and availability guarantees. In the usual case of three
replicas, it can reduce the aggregate network bandwidth used when reading data since a block is
placed in only two unique racks rather than three.
The default HDFS replica placement policy can be summarized as follows:
1. No Datanode contains more than one replica of any block.
2. No rack contains more than two replicas of the same block, provided there are sufficient racks
on the cluster.
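A simplified Python sketch of the default placement just described (illustrative only; the real policy also checks node and rack availability, load, and the restrictions listed above, and handles replication factors other than three):
import random

def place_replicas(writer_node, nodes_by_rack):
    """Pick targets: first on the writer's node, second and third on one other rack."""
    writer_rack = next(r for r, ns in nodes_by_rack.items() if writer_node in ns)
    targets = [writer_node]

    # Second and third replicas go to two different nodes in a single remote rack.
    remote_rack = random.choice([r for r in nodes_by_rack if r != writer_rack])
    targets += random.sample(nodes_by_rack[remote_rack], 2)
    return targets

cluster = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5", "n6"]}
print(place_replicas("n2", cluster))   # e.g. ['n2', 'n5', 'n4']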
2.10.4 Balancer
The HDFS block placement strategy does not take into account DataNode disk space utilization. This is to avoid placing new data, which is more likely to be referenced, at a small subset of the DataNodes. Therefore data might not always be placed uniformly across DataNodes. Imbalance also occurs when new nodes are added to the cluster.
The balancer is a tool that balances disk space usage on an HDFS cluster. It takes a
threshold value as an input parameter, which is a fraction in the range of (0, 1). A cluster is
balanced if for each DataNode, the utilization of the node (ratio of used space at the node to total
capacity of the node) differs from the utilization of the whole cluster (ratio of used space in the
cluster to total capacity of the cluster) by no more than the threshold value.
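Expressed as a short Python check (the function and parameter names are illustrative), the balancing criterion compares each node's utilization with the cluster-wide utilization:
def is_balanced(nodes, threshold):
    """nodes: list of (used_bytes, capacity_bytes); threshold: fraction in (0, 1)."""
    cluster_used = sum(u for u, _ in nodes)
    cluster_capacity = sum(c for _, c in nodes)
    cluster_util = cluster_used / cluster_capacity
    return all(abs(u / c - cluster_util) <= threshold for u, c in nodes)

# Example: one node at 90% and one at 10% is unbalanced for a 0.1 threshold.
print(is_balanced([(90, 100), (10, 100)], threshold=0.1))   # False
print(is_balanced([(55, 100), (45, 100)], threshold=0.1))   # True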
The tool is deployed as an application program that can be run by the cluster
administrator. It iteratively moves replicas from DataNodes with higher utilization to DataNodes
with lower utilization. One key requirement for the balancer is to maintain data availability.
When choosing a replica to move and deciding its destination, the balancer guarantees that the
decision does not reduce either the number of replicas or the number of racks.
The balancer optimizes the balancing process by minimizing the inter-rack data
copying. If the balancer decides that a replica A needs to be moved to a different rack and the
destination rack happens to have a replica B of the same block, the data will be copied from
replica B instead of replica A.
A second configuration parameter limits the bandwidth consumed by rebalancing
operations. The higher the allowed bandwidth, the faster a cluster can reach the balanced state,
but with greater competition with application processes.
2.10.6 Decommissioning
The cluster administrator specifies which nodes can join the cluster by listing the host
addresses of nodes that are permitted to register and the host addresses of nodes that are not
permitted to register. The administrator can command the system to re-evaluate these include and
exclude lists. A present member of the cluster that becomes excluded is marked for
decommissioning. Once a DataNode is marked as decommissioning, it will not be selected as the
target of replica placement, but it will continue to serve read requests. The NameNode starts to
schedule replication of its blocks to other DataNodes. Once the NameNode detects that all blocks
on the decommissioning DataNode are replicated, the node enters the decommissioned state.
Then it can be safely removed from the cluster without jeopardizing any data availability.
2.11.2 Staging
A client request to create a file does not reach the NameNode immediately. In fact,
initially the HDFS client caches the file data into a temporary local file. Application writes are
transparently redirected to this temporary local file. When the local file accumulates data worth
over one HDFS block size, the client contacts the NameNode. The NameNode inserts the file
name into the file system hierarchy and allocates a data block for it. The NameNode responds to
the client request with the identity of the DataNode and the destination data block. Then the
client flushes the block of data from the local temporary file to the specified DataNode. When a
file is closed, the remaining un-flushed data in the temporary local file is transferred to the
DataNode. The client then tells the NameNode that the file is closed. At this point, the
NameNode commits the file creation operation into a persistent store. If the NameNode dies
before the file is closed, the file is lost.
The above approach has been adopted after careful consideration of target applications
that run on HDFS. These applications need streaming writes to files. If a client writes to a remote
file directly without any client side buffering, the network speed and the congestion in the
network impacts throughput considerably. This approach is not without precedent. Earlier
distributed file systems, e.g. AFS, have used client side caching to improve performance. A
POSIX requirement has been relaxed to achieve higher performance of data uploads.
2.12.1 FS Shell
HDFS allows user data to be organized in the form of files and directories. It provides a
command line interface called FS shell that lets a user interact with the data in HDFS. The syntax
of this command set is similar to other shells (e.g. bash, csh) that users are already familiar with.
The FS shell supports commands for common actions such as creating directories, listing files, and copying data between the local file system and HDFS; several examples appear below.
FS shell is targeted at applications that need a scripting language to interact with the stored data.
2.12.2 DFSAdmin
The DFSAdmin command set is used for administering an HDFS cluster. These commands are used only by an HDFS administrator, for actions such as reporting cluster status or managing safe mode.
After formatting HDFS, start the distributed file system. The following command will start the namenode as well as the datanodes as a cluster.
$ start-dfs.sh
Step 2 (inserting data into HDFS): Transfer and store a data file from the local system to the Hadoop file system using the put command.
$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input
Step 1 (retrieving data from HDFS): View the data in HDFS using the cat command.
$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile
Step 2: Copy the file from HDFS to the local file system using the get command.
$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/
While not considered a scheduler per se, Hadoop also supports the scheme of
provisioning virtual cluster environments (within physical clusters). This concept is labeled
Hadoop On Demand (HOD). The HOD approach utilizes the Torque resource manager for node
allocation to size the virtual cluster. Within the virtual environment, the HOD system prepares
the configuration files in an automated manner, and initializes the system based on the nodes that
comprise the virtual cluster. Once initialized, the HOD virtual cluster can be used in a rather
independent manner. A certain level of elasticity is built into HOD, as the system adapts to
changing workload conditions. To illustrate, HOD automatically de-allocates nodes from the
virtual cluster after detecting no active jobs over a certain time period. This shrinking behavior
allows for the most efficient usage of the overall physical cluster assets. HOD is considered a
valuable option for deploying Hadoop clusters within a cloud infrastructure.
To briefly illustrate the methodology, 2-dimensional points with x and y axes are
considered. The target space is decomposed into 2^n x 2^n small cells, where the constant n
determines the granularity of the decomposition. As the k-nearest neighbor points for a data
point are normally located in close proximity, the assumption made is that most of the kNN
objects are located in the nearby cells. Hence, the approach is based on classifying the data points into the corresponding cells and computing candidate kNN points for each point. This
process can easily be parallelized, and hence is suited for a MapReduce framework. It has to be
pointed out though that the approach may not be able to determine the kNN points in a single
processing cycle and therefore, additional processing steps may be necessary (the crux of the
issue is that data points in other nearby cells may belong to the k-nearest neighbors). To illustrate
this problem, Figure 7 depicts an akNN processing scenario for k = 2. Figure 7 shows that the
query can locate 2 NN points for A by just considering the inside (boundary) of cell 0. In other
words, the circle centered around A already covers the target 2 NN objects without having to go
beyond the boundary of cell 0.
On the other hand, the circle for B overlaps with cells 1, 2, and 3, respectively. In such
a scenario, there is a possibility to locate the 2 NN objects in 4 different cells. Ergo, in some
circumstances, it may not be feasible to draw the circle for a point just based on the cell
boundary. For point C, there is only 1 point available in cell 1, and hence cell 1 violates the k = 2
requirement. Therefore, it is a necessary design choice for this study to prohibit scenarios where
cells contain fewer than k points. This is accomplished by first identifying the number of points within each cell, and second by merging any cell with fewer than k points with a neighboring cell to ensure that the number of points in the merged cell is at least k. At that point, the boundary circle can be drawn.
The challenge with this approach, though, is that an additional counting cycle is required prior to the NN computation. The benefit of the approach is that during the counting phase, cells with
no data points can be identified and hence, can be eliminated from the actual processing cycle.
The entire processing cycle encompasses 4 steps that are discussed below. The input dataset
reflects a set of records that are formatted in a [id, x, y] fashion (the parameters n and k are
initially specified by the user).
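As a rough Python sketch of the grid decomposition used to classify points into cells (the coordinate range and cell-id convention are assumptions chosen for illustration, with the target space normalized to the unit square and 2^n cells per axis):
def cell_id(x, y, n):
    """Map a point in [0, 1) x [0, 1) to its cell in a 2^n x 2^n grid."""
    cells_per_axis = 2 ** n
    col = min(int(x * cells_per_axis), cells_per_axis - 1)
    row = min(int(y * cells_per_axis), cells_per_axis - 1)
    return row * cells_per_axis + col

# Tag each [id, x, y] record with its cell id, as the early map phase would.
points = [("p1", 0.12, 0.80), ("p2", 0.13, 0.81), ("p3", 0.90, 0.05)]
tagged = [(cell_id(x, y, n=2), pid, (x, y)) for pid, x, y in points]
print(tagged)   # p1 and p2 fall into the same cell, p3 into a different one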
MapReduce S2 Step: akNN Computation. In the 2nd step, the input records for each
cell are collected and candidate kNN points (for each point in the cell region) are computed. The
Map function receives the original data points and computes the corresponding cell id. The
output records are formatted as [cell_id, id, coordinates], where the id represents the point id, and
the coordinates reflect the actual coordinates of the point. The reduce function receives the
records corresponding to a cell, formatted as [id, coordinates]. The Reduce function calculates
the distance for each 2-point combo and computes the kNN points for each point in the cell. The
output records are formatted as [id, coordinates, cell_id, kNN_list], where the id is used as the key, and kNN_list is the list of kNN points for point id, formatted as [(o1, d1), ..., (ok, dk)], where oi represents the i-th NN point and di the corresponding distance.
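A brief Python sketch of the per-cell candidate kNN computation described for the S2 Reduce function (brute force within one cell; the helper name and record layout are illustrative):
from math import dist   # Python 3.8+: Euclidean distance between two points

def candidate_knn(cell_points, k):
    """cell_points: list of (id, (x, y)) in one cell -> {id: [(other_id, distance), ...]}."""
    result = {}
    for pid, coords in cell_points:
        neighbors = [
            (other_id, dist(coords, other_coords))
            for other_id, other_coords in cell_points
            if other_id != pid
        ]
        neighbors.sort(key=lambda pair: pair[1])
        result[pid] = neighbors[:k]
    return result

cell = [("a", (0.1, 0.1)), ("b", (0.2, 0.1)), ("c", (0.9, 0.9))]
print(candidate_knn(cell, k=2))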
MapReduce S3 Step: Update kNN Points. In the third step, the actual boundary circles are determined. In this step, depending on the data set, additional kNN point processing scenarios may apply. The Map function receives the result of the MapReduce S2 step and computes the boundary circle for each point. Two different scenarios are possible here:
1. The boundary circle does not overlap with other cells: In this scenario, no further
processing is required, and the output (key value pairs) is formatted as [cell_id, id,
coordinates, kNN_list, true]. The cell_id reflects the key while the Boolean true denotes
the fact that the processing cycle is completed. In other words, in the corresponding
Reduce function, no further processing is necessary and only the record format is
converted.
2. The boundary circle overlaps with other cells: In this scenario, there is a possibility that
neighboring cells may contain kNN points, and hence an additional (check) process is
required. To accomplish that, the [key, value] pairs are formatted as [cell_idx, id,
coordinates, kNN_list, false]. The key cell_idx reflects the cell id of one of the
overlapped cells while the Boolean false indicates a non-completed process. In scenarios
where n cells overlap, n corresponding [key, value] pairs are used.
The Shuffle operation submits records with the same cell ids to the corresponding
node, and the records are used as input into the Reduce function. As different types of records
may exist as input, it is necessary to first classify the records and second to update the kNN
points in scenarios where additional checks are required. The output records are formatted as [id,
coordinates, cell_id, kNN_list], where the id field represents the key.
MapReduce S4 Step: This final processing step may be necessary as multiple update
scenarios of kNN points per point are possible. In this step, if necessary, these kNN lists are
fused, and a final kNN (result) list is generated. The Map function receives the results of the S3
step and outputs the kNN list. The output format of the records is [id, kNN_list], where the id
represents the key. The Reduce function receives the records with the same keys and generates
the integrated list formatted as [id, kNN_list]. If multiple cells per point scenarios are possible,
the Shuffle process clusters the n outputs and submits the data to the Reduce function.
Ultimately, the kNN points per reference point are determined, and the processing cycle
completes.
As Figure 8 outlines, the average total execution time decreases as the number of Reduce tasks is
increased (total execution time equals the sum of steps S1, S2 and, if necessary, S3 and S4). With 16 Reduce tasks, the 8,000,000 data points are processed in approximately 262
seconds on the Hadoop cluster described above. The largest performance increase (delta of
approximately 505 seconds) was measured while increasing the number of Reduce tasks from 2
to 4. The study showed that with the data set at hand, only approximately 29.3% of the data
points require executing the MapReduce S3 step discussed above. In other words, due to the
rather high data point density in the target space (8,000,000 reference points), over 70% of the
data points only require executing steps S1 and S2. Further, the study revealed that as the number of worker threads increases, the execution time for step S1 increases. This is because the input size per Reduce task diminishes, and as the data processing cost is rather low, increasing the number of Reduce tasks actually adds overhead to the aggregate
processing cycle. For the 2nd set of benchmark runs, the number of Map and Reduce tasks was
set to 8 and 24, respectively, while the number of k Nearest Neighbors was scaled from 4 to 32.
As Figure 9 depicts, the average total execution time increases as k is scaled up. The study
disclosed that while scaling k, the processing cost for the MapReduce steps S3 and S4 increases
significantly. The cost increase is mainly due to the increase of (larger size) intermediate records
that have to be processed, as well as due to the increased number of data points that require steps
S3 and S4 to be processed. To illustrate, increasing k from 16 to 32 resulted in an average record
size increase of approximately 416 bytes, while at the same time an additional 10.2% of the data
points required executing steps S3 and S4. From an average total execution time perspective, the
delta between the k=16 and k=32 runs was approximately 534 seconds, while the delta for the k=8 and k=16 runs was only approximately 169 seconds. To summarize, by processing the akNN computation in parallel with the MapReduce framework, it is possible to reduce the total execution time rather considerably (see Figure 8).
Support
There are a couple of different ways that Ubuntu Server Edition is supported:
commercial support and community support. The main commercial support (and development
funding) is available from Canonical, Ltd. They supply reasonably priced support contracts on a
per desktop or per server basis. For more information see the Ubuntu Advantage page.
Community support is also provided by dedicated individuals and companies that wish
to make Ubuntu the best distribution possible. Support is provided through multiple mailing lists,
IRC channels, forums, blogs, wikis, etc. The large amount of information available can be
overwhelming, but a good search engine query can usually provide an answer to your questions.
The Server Edition provides a common base for all sorts of server applications. It is a minimalist
design providing a platform for the desired services, such as file/print services, web hosting,
email hosting, etc.
Kernel Differences:
Ubuntu 10.10 and earlier actually had different kernels for the server and desktop editions. Ubuntu no longer has separate -server and -generic kernel flavors. These have
been merged into a single -generic kernel flavor to help reduce the maintenance burden over the
life of the release.
Note: When running a 64-bit version of Ubuntu on 64-bit processors you are not limited by
memory addressing space.
To see all kernel configuration options you can look through /boot/config-4.4.0-server. Also,
Linux Kernel in a Nutshell is a great resource on the options available.
3.2.1.3 Backing Up
Before installing Ubuntu Server Edition you should make sure all data on the system is backed
up. See Backups for backup options.
If this is not the first time an operating system has been installed on your computer, it is likely
you will need to re-partition your disk to make room for Ubuntu.
Any time you partition your disk, you should be prepared to lose everything on the disk in case you make a mistake or something goes wrong during partitioning. The programs used in
installation are quite reliable, most have seen years of use, but they also perform destructive
actions.
1. Download and burn the appropriate ISO file from the Ubuntu web site.
4. From the main boot menu there are some additional options to install Ubuntu Server
Edition. You can install a basic Ubuntu Server, check the CD-ROM for defects, check the
system's RAM, boot from first hard disk, or rescue a broken system. The rest of this
section will cover the basic Ubuntu Server install.
5. The installer asks which language it should use. Afterwards, you are asked to select your
location.
6. Next, the installation process begins by asking for your keyboard layout. You can ask the
installer to attempt auto-detecting it, or you can select it manually from a list.
7. The installer then discovers your hardware configuration, and configures the network
settings using DHCP. If you do not wish to use DHCP, choose "Go Back" at the next screen, and you will have the option to "Configure the network manually".
9. A new user is set up; this user will have root access through the sudo utility.
10. After the user settings have been completed, you will be asked if you want to encrypt
your home directory.
11. Next, the installer asks for the system's Time Zone.
12. You can then choose from several options to configure the hard drive layout. Afterwards
you are asked which disk to install to. You may get confirmation prompts before
rewriting the partition table or setting up LVM depending on disk layout. If you choose
LVM, you will be asked for the size of the root logical volume. For advanced disk
options see Advanced Installation.
13. The Ubuntu base system is then installed.
14. The next step in the installation process is to decide how you want to update the system.
There are three options:
i. No automatic updates: this requires an administrator to log into the machine and
manually install updates.
ii. Install security updates automatically: this will install the unattended-upgrades
package, which will install security updates without the intervention of an
administrator. For more details see Automatic Updates.
iii. Manage the system with Landscape: Landscape is a paid service provided by
Canonical to help manage your Ubuntu machines. See the Landscape site for
details.
15. You now have the option to install, or not install, several package tasks. See Package
Tasks for details. Also, there is an option to launch aptitude to choose specific packages
to install. For more information see Aptitude.
16. Finally, the last step before rebooting is to set the clock to UTC.
Note: If at any point during installation you are not satisfied with the default setting, use the "Go
Back" function at any prompt to be brought to a detailed installation menu that will allow you to
modify the default settings.
At some point during the installation process you may want to read the help screen provided by
the installation system. To do this, press F1.
Package Tasks
During the Server Edition installation you have the option of installing additional packages from
the CD. The packages are grouped by the type of service they provide.
1. DNS server: Selects the BIND DNS server and its documentation.
3. Mail server: This task selects a variety of packages useful for a general purpose mail
server system.
5. PostgreSQL database: This task selects client and server packages for the PostgreSQL
database.
6. Print server: This task sets up your system to be a print server.
7. Samba File server: This task sets up your system to be a Samba file server, which is
especially suitable in networks with both Windows and Linux systems.
9. Virtual Machine host: Includes packages needed to run KVM virtual machines.
10. Manually select packages: Executes aptitude allowing you to individually select
packages.
Installing the package groups is accomplished using the tasksel utility. One of the important
differences between Ubuntu (or Debian) and other GNU/Linux distributions is that, when
installed, a package is also configured to reasonable defaults, prompting you for additional
required information where necessary. Likewise, when installing a task, the packages are not only
installed, but also configured to provide a fully integrated service.
Once the installation process has finished you can view a list of available tasks by
entering the following from a terminal prompt:
tasksel --list-tasks
Note: The output will list tasks from other Ubuntu based distributions such as Kubuntu and
Edubuntu. Note that you can also invoke the tasksel command by itself, which will bring up a
menu of the different tasks available.
You can view a list of which packages are installed with each task using the --task-packages
option. For example, to list the packages installed with the DNS Server task enter the following:
tasksel --task-packages dns-server
If you did not install one of the tasks during the installation process, but for example you decide
to make your new LAMP server a DNS server as well, simply insert the installation CD and from
a terminal:
sudo tasksel install dns-server
3.2.3 Upgrading
There are several ways to upgrade from one Ubuntu release to another. This section
gives an overview of the recommended upgrade method: do-release-upgrade.
do-release-upgrade
The recommended way to upgrade a Server Edition installation is to use the do-release-
upgrade utility. Part of the update-manager-core package, it does not have any graphical
dependencies and is installed by default.
Debian based systems can also be upgraded by using apt dist-upgrade. However, using
do-release-upgrade is recommended because it has the ability to handle system configuration
changes sometimes needed between releases.
To upgrade to a newer release, from a terminal prompt enter:
do-release-upgrade
For further stability of an LTS release, there is a slight change in behaviour if you are currently
running an LTS version. LTS systems are only automatically considered for an upgrade to the
next LTS via do-release-upgrade once the first point release is available. For example, 14.04 will
only offer the upgrade once 16.04.1 is released. If you want to upgrade earlier, for example on a
subset of machines to evaluate the LTS upgrade for your setup, you must pass the -d switch, the
same argument used to upgrade to a development release.
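For example, to start an early LTS-to-LTS upgrade on a test machine before the first point release is available, the command would be (a minimal illustration of the -d switch described above):
sudo do-release-upgrade -d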
Partitioning
Follow the installation steps until you get to the Partition disks step, then:
1. Select Manual as the partition method.
2. Select the first hard drive, and agree to "Create a new empty partition table on this
device?".
Repeat this step for each drive you wish to be part of the RAID array.
3. Select the "FREE SPACE" on the first drive then select "Create a new partition".
4. Next, select the Size of the partition. This partition will be the swap partition, and a
general rule for swap size is twice that of RAM. Enter the partition size, then choose
Primary, then Beginning.
Note: A swap partition size of twice the available RAM capacity may not always be
desirable, especially on systems with large amounts of RAM. Calculating the swap
partition size for servers is highly dependent on how the system is going to be used.
5. Select the "Use as:" line at the top. By default this is "Ext4 journaling file system",
change that to "physical volume for RAID" then "Done setting up partition".
6. For the / partition once again select "Free Space" on the first drive then "Create a new
partition".
7. Use the rest of the free space on the drive and choose Continue, then Primary.
8. As with the swap partition, select the "Use as:" line at the top, changing it to "physical
volume for RAID". Also select the "Bootable flag:" line to change the value to "on".
Then choose "Done setting up partition".
9. Repeat steps three through eight for the other disk and partitions.
RAID Configuration
With the partitions set up, the arrays are ready to be configured:
1. Back in the main "Partition Disks" page, select "Configure Software RAID" at the top.
2. Select "yes" to write the changes to disk.
3. Choose "Create MD device".
4. For this example, select "RAID1", but if you are using a different setup choose the
appropriate type (RAID0 RAID1 RAID5).
Note: In order to use RAID5 you need at least three drives. Using RAID0 or RAID1 only
two drives are required.
5. Enter the number of active devices, "2", or the number of hard drives you have, for the
array. Then select "Continue".
6. Next, enter the number of spare devices "0" by default, then choose "Continue".
7. Choose which partitions to use. Generally they will be sda1, sdb1, sdc1, etc. The numbers
will usually match and the different letters correspond to different hard drives.
For the swap partition choose sda1 and sdb1. Select "Continue" to go to the next step.
8. Repeat steps three through seven for the / partition choosing sda2 and sdb2.
9. Once done select "Finish".
Formatting
There should now be a list of hard drives and RAID devices. The next step is to format and set
the mount point for the RAID devices. Treat the RAID device as a local hard drive, format and
mount accordingly.
1. Select "#1" under the "RAID1 device #0" partition.
2. Choose "Use as:". Then select "swap area", then "Done setting up partition".
3. Next, select "#1" under the "RAID1 device #1" partition.
4. Choose "Use as:". Then select "Ext4 journaling file system".
5. Then select the "Mount point" and choose "/ - the root file system". Change any of the
other options as appropriate, then select "Done setting up partition".
6. Finally, select "Finish partitioning and write changes to disk".
If you choose to place the root partition on a RAID array, the installer will then ask if
you would like to boot in a degraded state. See Degraded RAID for further details.
The installation process will then continue normally.
Degraded RAID
At some point in the life of the computer a disk failure event may occur. When this
happens, using Software RAID, the operating system will place the array into what is known as a
degraded state.
If the array has become degraded, then due to the chance of data corruption Ubuntu
Server Edition will, by default, boot to initramfs after thirty seconds. Once the initramfs has
booted, there is a fifteen-second prompt giving you the option to go ahead and boot the system or
attempt manual recovery. Booting to the initramfs prompt may or may not be the desired
behavior, especially if the machine is in a remote location. Booting to a degraded array can be
configured several ways:
1. The dpkg-reconfigure utility can be used to configure the default behavior, and during the
process you will be queried about additional settings related to the array, such as
monitoring, email alerts, etc. To reconfigure mdadm enter the following:
sudo dpkg-reconfigure mdadm
Once the system has booted you can either repair the array (see RAID Maintenance for
details) or, in the case of major hardware failure, copy important data to another machine.
RAID Maintenance
The mdadm utility can be used to view the status of an array, add disks to an array, remove disks,
etc:
1. To view the status of an array, from a terminal prompt enter:
sudo mdadm -D /dev/md0
The -D tells mdadm to display detailed information about the /dev/md0 device. Replace
/dev/md0 with the appropriate RAID device.
2. To view the status of a disk in an array:
sudo mdadm -E /dev/sda1
The output is very similar to that of the mdadm -D command; adjust /dev/sda1 for each disk.
Sometimes a disk can change to a faulty state even though there is nothing physically
wrong with the drive. It is usually worthwhile to remove the drive from the array then re-add it.
This will cause the drive to re-sync with the array. If the drive will not sync with the array, it is a
good indication of hardware failure.
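For example, a disk that has been marked faulty can be removed from and re-added to the array with commands along the following lines (the array and partition names are illustrative; adjust them to your system):
# mark the disk faulty first if it is not already
sudo mdadm /dev/md0 --fail /dev/sda1
# remove it from the array, then re-add it to trigger a re-sync
sudo mdadm /dev/md0 --remove /dev/sda1
sudo mdadm /dev/md0 --add /dev/sda1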
The /proc/mdstat file also contains useful information about the system's RAID
devices:
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sda1[0] sdb1[1]
10016384 blocks [2/2] [UU]
The following command is great for watching the status of a syncing drive:
watch -n1 cat /proc/mdstat
Press Ctrl+c to stop the watch command.
If you do need to replace a faulty drive, after the drive has been replaced and synced, grub will
need to be installed. To install grub on the new drive, enter the following:
sudo grub-install /dev/md0
Replace /dev/md0 with the appropriate array device name.
Overview
A side effect of LVM's power and flexibility is a greater degree of complication. Before diving
into the LVM installation process, it is best to get familiar with some terms.
1. Physical Volume (PV): physical hard disk, disk partition or software RAID partition
formatted as LVM PV.
2. Volume Group (VG): is made from one or more physical volumes. A VG can be
extended by adding more PVs. A VG is like a virtual disk drive, from which one or more
logical volumes are carved.
3. Logical Volume (LV): is similar to a partition in a non-LVM system. An LV is formatted
with the desired file system (EXT3, XFS, JFS, etc.); it is then available for mounting and
data storage.
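To make these terms concrete, the following is a minimal sketch of creating a PV, VG, and LV from the command line after installation (the device /dev/sdb1, the volume group name vg01, and the sizes are assumptions for illustration only):
sudo pvcreate /dev/sdb1              # initialize the partition as a physical volume
sudo vgcreate vg01 /dev/sdb1         # create a volume group containing that PV
sudo lvcreate -n srv -L 10G vg01     # carve out a 10GB logical volume named srv
sudo mkfs.ext4 /dev/vg01/srv         # format the LV with a file system
sudo mount /dev/vg01/srv /srv        # mount it for data storage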
Installation
As an example, this section covers installing Ubuntu Server Edition with /srv mounted
on an LVM volume. During the initial install only one Physical Volume (PV) will be part of the
Volume Group (VG). Another PV will be added after install to demonstrate how a VG can be
extended.
There are several installation options for LVM: "Guided - use the entire disk and setup
LVM", which also allows you to assign a portion of the available space to LVM; "Guided -
use entire disk and setup encrypted LVM"; or manually set up the partitions and configure LVM. At
this time the only way to configure a system with both LVM and standard partitions, during
installation, is to use the Manual approach.
1. Follow the installation steps until you get to the Partition disks step, then:
2. At the "Partition Disks screen choose "Manual".
3. Select the hard disk and on the next screen choose "yes" to "Create a new empty partition
table on this device".
4. Next, create standard /boot, swap, and / partitions with whichever filesystem you prefer.
5. For the LVM /srv, create a new Logical partition. Then change "Use as" to "physical
volume for LVM" then "Done setting up the partition".
6. Now select "Configure the Logical Volume Manager" at the top, and choose "Yes" to
write the changes to disk.
7. For the "LVM configuration action" on the next screen, choose "Create volume group".
Enter a name for the VG such as vg01, or something more descriptive. After entering a
name, select the partition configured for LVM, and choose "Continue".
8. Back at the "LVM configuration action" screen, select "Create logical volume". Select
the newly created volume group, and enter a name for the new LV, for example srv since
that is the intended mount point. Then choose a size, which may be the full partition
because it can always be extended later. Choose "Finish" and you should be back at the
main "Partition Disks" screen.
9. Now add a filesystem to the new LVM. Select the partition under "LVM VG vg01, LV
srv", or whatever name you have chosen, then choose "Use as:". Set up a file system as
normal, selecting /srv as the mount point. Once done, select "Done setting up the
partition".
Finally, select "Finish partitioning and write changes to disk". Then confirm the changes
and continue with the rest of the installation.
Note: Make sure you don't already have an existing /dev/sdb before issuing the commands
below. You could lose some data if you issue those commands on a non-empty disk.
3. Use vgdisplay to find out the free physical extents - Free PE / size (the size you can
allocate). We will assume a free size of 511 PE (equivalent to 2GB with a PE size of
4MB) and we will use the whole free space available. Use your own PE and/or free
space.
The Logical Volume (LV) can now be extended by different methods; we will only see
how to use the PE to extend the LV:
sudo lvextend /dev/vg01/srv -l +511
The -l option allows the LV to be extended using PEs. The -L option allows the LV to be
extended using megabytes, gigabytes, terabytes, etc.
4. Even though you are supposed to be able to expand an ext3 or ext4 filesystem without
unmounting it first, it may be a good practice to unmount it anyway and check the
filesystem, so that you don't mess up the day you want to reduce a logical volume (in that
case unmounting first is compulsory).
The following commands are for an EXT3 or EXT4 filesystem. If you are using another
filesystem there may be other utilities available.
sudo umount /srv
sudo e2fsck -f /dev/vg01/srv
The -f option of e2fsck forces checking even if the system seems clean.
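After the check completes, the file system can be grown to fill the extended LV and the volume remounted; a minimal sketch, assuming the EXT4 volume used above:
sudo resize2fs /dev/vg01/srv     # grow the file system to the new LV size
sudo mount /dev/vg01/srv /srv    # remount the volume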
3. You will be prompted to Enter an IP address to scan for iSCSI targets with a description
of the format for the address. Enter the IP address for the location of your iSCSI target
and navigate to <continue> then hit ENTER
4. If authentication is required in order to access the iSCSI device, provide the username in
the next field. Otherwise leave it blank.
5. If your system is able to connect to the iSCSI provider, you should see a list of available
iSCSI targets where the operating system can be installed. The list should be similar to
the following :
Select the iSCSI targets you wish to use.
iSCSI targets on 192.168.1.29:3260:
[ ] iqn.2016-03.TrustyS-iscsitarget:storage.sys0
<Go Back> <Continue>
6. Select the iSCSI target that you want to use with the space bar. Use the arrow keys to
navigate to the target that you want to select.
If the connection to the iSCSI target is successful, you will be prompted with the [!!]
Partition disks installation menu. The rest of the procedure is identical to any normal installation
on attached disks. Once the installation is completed, you will be asked to reboot.
4. You will be prompted to Enter an IP address to scan for iSCSI targets, with a description
of the format for the address. Enter the IP address and navigate to <continue>, then hit
ENTER.
5. If authentication is required in order to access the iSCSI device, provide the username in
the next field or leave it blank.
6. If your system is able to connect to the iSCSI provider, you should see a list of available
iSCSI targets where the operating system can be installed. The list should be similar to
the following :
Select the iSCSI targets you wish to use.
iSCSI targets on 192.168.1.29:3260:
[ ] iqn.2016-03.TrustyS-iscsitarget:storage.sys0
<Go Back> <Continue>
7. Select the iSCSI target that you want to use with the space bar. Use the arrow keys to
navigate to the target that you want to select
9. If successful, you will come back to the menu asking you to Log into iSCSI targets.
Navigate to Finish and hit ENTER
The newly connected iSCSI disk will appear in the overview section as a device
prefixed with SCSI. This is the disk that you should select as your installation disk. Once
identified, you can choose any of the partitioning methods.
Note: Depending on your system configuration, there may be other SCSI disks attached to the
system. Be very careful to identify the proper device before proceeding with the installation.
Otherwise, irreversible data loss may result from performing an installation on the wrong disk.
If the procedure is successful, you should see the Grub menu appear on the screen.
3.3 Hadoop Installation and Deployment
Now that we have been introduced to Hadoop and learned about its core components, HDFS and
YARN and their related processes, as well as different deployment modes for Hadoop, let's look
at the different options for getting a functioning Hadoop cluster up and running.
Many other projects in the open source and Hadoop ecosystem have compatibility issues
with non-Linux platforms.
That said, there are options for installing Hadoop on Windows, should this be your platform of
choice. We will use Linux for all of our exercises and examples in this book, but consult the
documentation for your preferred Hadoop distribution for Windows installation and support
information if required.
If you are using Linux, choose a distribution you are comfortable with. All major distributions
are supported (Red Hat, Centos, Ubuntu, SLES, etc.). You can even mix distributions if
appropriate; for instance, master nodes running Red Hat and slave nodes running Ubuntu.
Master Nodes
A Hadoop cluster relies on its master nodes, which host the NameNode and
ResourceManager, to operate, although you can implement high availability for each subsystem.
Failure and failover of these components is not desired. Furthermore, the master node processes,
particularly the NameNode, require a large amount of memory to operate efficiently, as you will
see when we dive into the internals of HDFS. Therefore, when specifying hardware requirements, the
following guidelines can be used for medium to large-scale production Hadoop implementations:
This is only a guide, and as technology moves on quickly, these recommendations will change as
well. The bottom line is that you need carrier class hardware with as much CPU and memory
capacity as you can get!
Slave Nodes
Slave nodes do the actual work in Hadoop, both for processing and storage, so they will benefit
from more CPU and memory (physical memory, not virtual memory). That said, slave nodes are
designed with the expectation of failure, which is one of the reasons blocks are replicated in
HDFS. Slave nodes can also be scaled out linearly. For instance, you can simply add more nodes
to add more aggregate storage or processing capacity to the cluster, which you cannot do with
master nodes. With this in mind, economic scalability is the objective when it comes to slave
nodes. The following is a guide for slave nodes for a well-subscribed, computationally intensive
Hadoop cluster; for instance, a cluster hosting machine learning and in memory processing using
Spark.
Note: JBOD
JBOD is an acronym for just a bunch of disks, meaning directly attached storage that is not in a
RAID configuration, where each disk operates independently of the other disks. RAID is not
recommended for block storage on slave nodes as the access speed is limited by the slowest disk
in the array, unlike JBOD where the average speed can be greater than that of the slowest disk.
JBOD has been proven to outperform RAID 0 for block storage by 30% to 50% in benchmarks
conducted at Yahoo!.
Caution: Storing Too Much Data on Any Slave Node May Cause Issues
As slave nodes typically host the blocks in a Hadoop filesystem, and as storage costs,
particularly for JBOD configurations, are relatively inexpensive, it may be tempting to allocate
excess block storage capacity to each slave node. However, as you will learn in the next hour on
HDFS, you need to consider the network impact of a failed node, which will trigger re-
replication of all blocks that were stored on the slave node.
Slave nodes are designed to be deployed on commodity-class hardware. While they
still need ample processing power in the form of CPU cores and memory, as they will be
executing computational and data transformation tasks, they don't require the same degree of
fault tolerance that master nodes do.
Networking Considerations
Fully distributed Hadoop clusters are very chatty, with control messages, status updates
and heartbeats, block reports, data shuffling, and block replication, and there is often heavy
network utilization between nodes of the cluster. If you are deploying Hadoop on-premise, you
should always deploy Hadoop clusters on private subnets with dedicated switches. If you are
using multiple racks for your Hadoop cluster (you will learn more about this in Hour 21,
Understanding Advanced HDFS), you should consider redundant core and top of rack
switches.
Hostname resolution is essential between nodes of a Hadoop cluster, so both forward
and reverse DNS lookups must work correctly between each node (master-slave and slave-slave)
for Hadoop to function. Either DNS or hosts files can be used for resolution. IPv6 should also
be disabled on all hosts in the cluster.
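If hosts files are used, every node needs an identical set of entries mapping the address of each node to its hostname; a minimal sketch for a three-node cluster (the addresses and hostnames are illustrative):
# /etc/hosts (identical on every node)
10.0.0.10   hadoopmaster
10.0.0.11   hadoopslave1
10.0.0.12   hadoopslave2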
Time synchronization between nodes of the cluster is essential as well, as some
components, such as Kerberos, which is discussed in Hour 22, Securing Hadoop, rely on this
being the case. It is recommended you use ntp (Network Time Protocol) to keep clocks
synchronized between all nodes.
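On a Red Hat based host such as the RHEL 7.2 machine used in the exercise below, ntp can be installed and enabled along these lines (package and service names may differ on other distributions):
sudo yum install ntp
sudo systemctl enable ntpd
sudo systemctl start ntpd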
Other ecosystem projects will have their specific prerequisites; for instance, Apache Spark
requires Scala and Python as well, so you should always refer to the documentation for these
specific projects.
Non-Commercial Hadoop
Let's deploy a Hadoop cluster using the latest Apache release now.
In this exercise we will install a pseudo-distributed mode Hadoop cluster using the latest Hadoop
release downloaded from hadoop.apache.org.
As this is a test cluster the following specifications will be used in our example:
Red Hat Enterprise Linux 7.2 (The installation steps would be similar using other Linux
distributions such as Ubuntu)
2 CPU cores
8GB RAM
30GB HDD
hostname: hadoopnode0
3. Reboot
5. Install Java. We will install the OpenJDK, which will install both a JDK and JRE:
$ sudo yum install java-1.7.0-openjdk-devel
a. Test that Java has been successfully installed by running the following command:
$ java -version
If Java has been installed correctly you should see output similar to the following:
java version "1.7.0_101"
OpenJDK Runtime Environment (rhel-2.6.6.1.el7_2-x86_64..)
OpenJDK 64-Bit Server VM (build 24.95-b01, mixed mode)
Note that depending upon which operating system you are deploying on, you may
have a version of Java and a JDK installed already. In these cases it may not be
necessary to install the JDK, or you may need to set up alternatives so you do not
have conflicting Java versions.
6. Locate the installation path for Java, and set the JAVA_HOME environment variable:
$ export JAVA_HOME=/usr/lib/jvm/REPLACE_WITH_YOUR_PATH/
7. Download Hadoop from your nearest Apache download mirror. You can obtain the link
by selecting the binary option for the version of your choice at
http://hadoop.apache.org/releases.html. We will use Hadoop version 2.7.2 for our
example.
$ wget http://REPLACE_WITH_YOUR_MIRROR/hadoop-2.7.2.tar.gz
8. Unpack the Hadoop release, move it into a system directory, and set an environment
variable pointing to the Hadoop home directory:
$ tar -xvf hadoop-2.7.2.tar.gz
$ mv hadoop-2.7.2 hadoop
$ sudo mv hadoop/ /usr/share/
$ export HADOOP_HOME=/usr/share/hadoop
10. Create a mapred-site.xml file (I will discuss this later) in the Hadoop configuration
directory:
$ sudo cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template
$HADOOP_HOME/etc/hadoop/mapred-site.xml
12. Create a symbolic link between the Hadoop configuration directory and the /etc/hadoop
/conf directory created in Step 10:
$ sudo ln -s $HADOOP_HOME/etc/hadoop/* /etc/hadoop/conf/
16. Run the built-in Pi Estimator example included with the Hadoop release.
$ cd $HADOOP_HOME
$ sudo -u hdfs bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-
2.7.2.jar pi 16 1000
As we have not started any daemons or initialized HDFS, this program runs in
LocalJobRunner mode (recall that I discussed this in Hour 2, Understanding the Hadoop
Cluster Architecture). If this runs correctly you should see output similar to the
following:
...
Job Finished in 2.571 seconds
Estimated value of Pi is 3.14250000000000000000
Now let's configure a pseudo-distributed mode Hadoop cluster from your installation.
17. Use the vi editor to update the core-site.xml file, which contains important information
about the cluster, specifically the location of the namenode:
$ sudo vi /etc/hadoop/conf/core-site.xml
# add the following config between the <configuration>
# and </configuration> tags:
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoopnode0:8020</value>
</property>
Note that the value for the fs.defaultFS configuration parameter needs to be set to
hdfs://HOSTNAME:8020, where the HOSTNAME is the name of the NameNode host,
which happens to be the localhost in this case.
18. Adapt the instructions in Step 17 to similarly update the hdfs-site.xml file, which contains
information specific to HDFS, including the replication factor, which is set to 1 in this
case as it is a pseudo-distributed mode cluster:
sudo vi /etc/hadoop/conf/hdfs-site.xml
# add the following config between the <configuration>
# and </configuration> tags:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
19. Adapt the instructions in Step 17 to similarly update the yarn-site.xml file, which
contains information specific to YARN. Importantly, this configuration file contains the
address of the resourcemanager for the cluster; in this case it happens to be the
localhost, as we are using pseudo-distributed mode:
$ sudo vi /etc/hadoop/conf/yarn-site.xml
# add the following config between the <configuration>
# and </configuration> tags:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoopnode0</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
20. Adapt the instructions in Step 17 to similarly update the mapred-site.xml file, which
contains information specific to running MapReduce applications using YARN:
$ sudo vi /etc/hadoop/conf/mapred-site.xml
# add the following config between the <configuration>
# and </configuration> tags:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
24. Use the jps command included with the Java JDK to see the Java processes that are
running:
$ sudo jps
26. Now run the same Pi Estimator example you ran in Step 16. This will now run in pseudo-
distributed mode:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar pi 16
1000
The output you will see in the console will be similar to that in Step 16. Open a browser
and go to localhost:8088. You will see the YARN ResourceManager Web UI (which I
discuss in Hour 6, Understanding Data Processing in Hadoop) (Figure 3.1):
Congratulations! You have just set up your first pseudo-distributed mode Hadoop cluster.
The three major pure play Hadoop vendors are:
Cloudera
Hortonworks
MapR
Importantly, enterprise support agreements and subscriptions can be purchased from the various
Hadoop vendors for their distributions. Each vendor also supplies a suite of management utilities
to help you deploy and manage Hadoop clusters. Let's look at each of the three major pure play
Hadoop vendors and their respective distributions.
3.3.3.1 Cloudera
Cloudera was the first mover in the commercial Hadoop space, establishing their first
commercial release in 2008. Cloudera provides a Hadoop distribution called CDH (Cloudera
Distribution of Hadoop), which includes the Hadoop core and many ecosystem projects. CDH is
entirely open source.
Cloudera also provides a management utility called Cloudera Manager (which is not
open source). Cloudera Manager provides a management console and framework enabling the
deployment, management, and monitoring of Hadoop clusters, and which makes many
administrative tasks such as setting up high availability or security much easier. The mix of open
source and proprietary software is often referred to as open core. A screenshot showing Cloudera
Manager is pictured in Figure 3.2.
The Quickstart VM is a great way to get started with the Cloudera Hadoop offering. To find out
more, go to http://www.cloudera.com/downloads.html.
3.3.3.2 Hortonworks
Hortonworks provides a pure open source Hadoop distribution and is a founding member
of the Open Data Platform initiative (ODPi) discussed in Hour 1. Hortonworks delivers a
distribution called HDP (Hortonworks Data Platform), which is a complete Hadoop stack
including the Hadoop core and selected ecosystem components. Hortonworks uses the Apache
Ambari project to provide its deployment configuration management and cluster monitoring
facilities. A screenshot of Ambari is shown in Figure 3.4.
The simplest method to deploy a Hortonworks Hadoop cluster would involve the following
steps:
1. Install Ambari using the Hortonworks installer on a selected host.
2. Add hosts to the cluster using Ambari.
3. Deploy Hadoop services (such as HDFS and YARN) using Ambari.
Hortonworks provides a fully functional, pseudo-distributed HDP cluster with the complete
Hortonworks application stack in a virtual machine called the Hortonworks Sandbox. The
Hortonworks Sandbox is available for VirtualBox, VMware, and KVM. The Sandbox virtual
machine includes several demo applications and learning tools to use to explore Hadoop and its
various projects and components. The Hortonworks Sandbox welcome screen is shown in Figure
3.5.
Figure 3.5 Hortonworks Sandbox
3.3.3.3 MapR
MapR delivers a Hadoop-derived software platform that implements an API-compatible
distributed filesystem called MapRFS (the MapR distributed Filesystem). MapRFS has been
designed to maximize performance and provide read-write capabilities not offered by native
HDFS. MapR delivers three versions of their offering called the Converged Data Platform.
These include:
M3 or Converged Community Edition (free version)
M5 or Converged Enterprise Edition (supported version)
M7 (M5 version that includes MapR's custom HBase-derived data store)
Like the other distributions, MapR has a demo offering called the MapR Sandbox, which is
available for VirtualBox or VMware. It is pictured in Figure 3.6.
Figure 3.6 MapR Sandbox VM.
MapRs management offering is called the MapR Control System (MCS), which offers a central
console to configure, monitor and manage MapR clusters. It is shown in Figure 3.7.
Figure 3.7 MapR Control System (MCS).
3.3.4.2 S3
Simple Storage Service (S3) is Amazon's cloud-based object store. An object store
manages data (such as files) as objects. These objects exist in buckets. Buckets are logical, user-
created containers with properties and permissions. S3 provides APIs for users to create and
manage buckets as well as to create, read, and delete objects from buckets.
The S3 bucket namespace is global, meaning that any buckets created must have a
globally unique name. The AWS console or APIs will let you know if you are trying to create a
bucket with a name that already exists.
S3 objects, like files in HDFS, are immutable, meaning they are write once, read many.
When an S3 object is created and uploaded, an ETag is created, which is effectively a signature
for the object. This can be used to ensure integrity when the object is accessed (downloaded) in
the future.
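As an illustration, these bucket and object operations can be exercised from the AWS command line interface; a minimal sketch (the bucket and key names below are placeholders, and the bucket name must be globally unique as noted above):
aws s3 mb s3://example-analytics-bucket-1234                                  # create a bucket
aws s3 cp weather.csv s3://example-analytics-bucket-1234/data/weather.csv    # upload an object
aws s3api head-object --bucket example-analytics-bucket-1234 --key data/weather.csv   # object metadata, including the ETag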
There are also public buckets in S3 containing public data sets. These are datasets
provided for informational or educational purposes, but they can be used for data operations such
as processing with Hadoop. Public datasets, many of which are in the tens or hundreds of
terabytes, are available, and topics range from historical weather data to census data, and from
astronomical data to genetic data.
Then click Create Cluster on the EMR welcome page as shown in Figure 3.9, and simply follow
the dialog prompts.
Note: If you already have a Java JDK installed on your system, then you need not install it again.
To install it:
user@ubuntu:~$ sudo apt-get install sun-java6-jdk
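The dedicated Hadoop user and group referred to next can be created with commands along the following lines (a sketch of the usual Debian/Ubuntu approach; the names hduser1 and hadoop_group match those used in the rest of this section):
user@ubuntu:~$ sudo addgroup hadoop_group
user@ubuntu:~$ sudo adduser --ingroup hadoop_group hduser1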
This will add the user hduser1 and the group hadoop_group to the local machine. Add hduser1 to
the sudo group
user@ubuntu:~$ sudo adduser hduser1 sudo
3. Configuring SSH
The Hadoop control scripts rely on SSH to perform cluster-wide operations. For example, there is
a script for stopping and starting all the daemons in the cluster. To work seamlessly, SSH needs
to be set up to allow password-less login for the Hadoop user from machines in the cluster. The
simplest way to achieve this is to generate a public/private key pair, which will be shared across
the cluster.
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine.
For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for
the hduser1 user we created earlier.
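The key pair is generated as the hduser1 user; a sketch of the usual two commands (an empty passphrase is used so that Hadoop can log in without prompting):
user@ubuntu:~$ su - hduser1
hduser1@ubuntu:~$ ssh-keygen -t rsa -P ""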
The second line will create an RSA key pair with an empty password.
You have to enable SSH access to your local machine with this newly created key which is done
by the following command.
hduser1@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
The final step is to test the SSH setup by connecting to the local machine with the hduser1 user.
This step is also needed to save your local machine's host key fingerprint to the hduser1 user's
known_hosts file.
hduser@ubuntu:~$ ssh localhost
3.4.2 Installation
3.4.2.1 Main Installation
Now, I will start by switching to the hduser1 user:
user@ubuntu:~$ su - hduser1
hadoop-env.sh
In the file conf/hadoop-env.sh, change the JAVA_HOME line, which by default reads:
#export JAVA_HOME=/usr/lib/j2sdk1.5-sun
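Uncomment this line and point it at the Java installation on your machine; for example (the path below assumes an OpenJDK install and must be adjusted to your system):
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64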
conf/*-site.xml
Now we create the directory and set the required ownerships and permissions
hduser@ubuntu:~$ sudo mkdir -p /app/hadoop/tmp
hduser@ubuntu:~$ sudo chown hduser:hadoop /app/hadoop/tmp
hduser@ubuntu:~$ sudo chmod 750 /app/hadoop/tmp
The last line restricts access to the /app/hadoop/tmp directory: full permissions for the owner and read/execute permissions for the hadoop group.
Error: If you forget to set the required ownerships and permissions, you will see a
java.io.IO Exception when you try to format the name node.
In file conf/core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
In file conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
In file conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
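Before starting the single-node cluster for the first time, the HDFS file system is formatted and the daemons are started; a sketch of the usual commands for this Hadoop version, run from the Hadoop installation directory:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop namenode -format
hduser@ubuntu:/usr/local/hadoop$ bin/start-all.sh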
This will start up a Namenode, Datanode, Jobtracker and a Tasktracker on the machine.
hduser@ubuntu:/usr/local/hadoop$ jps
3.4.2.5 Errors
1. If by chance your datanode is not starting, then you have to erase the contents of the
folder /app/hadoop/tmp. The command that can be used is:
hduser@ubuntu:~$ sudo rm -Rf /app/hadoop/tmp/*
2. You can also check with netstat if Hadoop is listening on the configured ports.
The command that can be used is:
hduser@ubuntu:~$ sudo netstat -plten | grep java
Note:
The masters and slaves files should contain localhost.
In /etc/hosts, the IP of the system should be given with the alias localhost.
Set the Java home path in hadoop-env.sh as well as in .bashrc.
3.5.2 Prerequisites
Configure single-node clusters first; here we have used two single-node clusters. Shut down
each single-node cluster with the following command:
user@ubuntu:~$ bin/stop-all.sh
3.5.3 Networking
The easiest way is to put both machines in the same network with regard to hardware and
software configuration.
Update /etc/hosts on both machines, mapping an alias to the IP address of each machine.
Here we are creating a cluster of two machines: one is the master and the other is slave1.
Open the hosts file for editing on each machine:
hduser@master:~$ sudo gedit /etc/hosts
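The file should contain entries along the following lines on both machines (the addresses are examples; use the actual IP addresses of your hosts):
192.168.0.1    master
192.168.0.2    slave1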
Add the hduser@master public SSH key using the following command
hduser@master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave1
Connect with user hduser from the master to the user account hduser on the slave.
3.5.5.1 Configuration
conf/masters
The machine on which bin/start-dfs.sh is running will become the primary NameNode. This file
should be updated on all the nodes. Open the masters file in the conf directory
hduser@master/slave:~$ cd /usr/local/hadoop/conf
hduser@master/slave:/usr/local/hadoop/conf$ sudo gedit masters
conf/slaves
This file should be updated on all the nodes as master is also a slave. Open the slaves file in the
conf directory
hduser@master/slave:~/usr/local/hadoop/conf$ sudo gedit slaves
Change the fs.default.name parameter (in conf/core-site.xml), which specifies the NameNode
(the HDFS master) host and port.
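For this two-node example the value points at the master host; the snippet below assumes the alias master and keeps port 54310 from the single-node configuration shown earlier:
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
</property>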
conf/mapred-site.xml
Open this file in the conf directory
hduser@master:~$ cd /usr/local/hadoop/conf
hduser@master:/usr/local/hadoop/conf$ sudo gedit mapred-site.xml
conf/hdfs-site.xml
Open this file in the conf directory
hduser@master:~$ cd /usr/local/hadoop/conf
hduser@master:/usr/local/hadoop/conf$ sudo gedit hdfs-site.xml
Change the dfs.replication parameter (in conf/hdfs-site.xml) which specifies the default block
replication. We have two nodes available, so we set dfs.replication to 2.
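The corresponding entries in the two files would look as follows for this two-node cluster (port 54311 is carried over from the single-node configuration shown earlier):
In conf/mapred-site.xml:
<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
</property>
In conf/hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>2</value>
</property>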
The cluster is then started from the master node, typically by running bin/start-dfs.sh followed by
bin/start-mapred.sh (or bin/start-all.sh). As a result:
The NameNode daemon is started on master, and DataNode daemons are started on all
slaves (here: master and slave1).
The JobTracker is started on master, and TaskTracker daemons are started on all slaves
(here: master and slave1).
2. Before you start the cluster, clear the tmp directory on all the nodes (master and slaves) using
the following command:
hduser@master:~$ rm -Rf /app/hadoop/tmp/*
3. Configuration of /etc/hosts , masters and slaves files on both the masters and the slaves
nodes should be the same.
This command deletes the junk files which get stored in the tmp folder of Hadoop.
hduser@master:~$ sudo rm -Rf /app/hadoop/tmp/*
3.7 Q&A
Q. Why do master nodes normally require a higher degree of fault tolerance than slave
nodes?
A. Slave nodes are designed to be implemented on commodity hardware with the expectation
of failure, and this enables slave nodes to scale economically. The fault tolerance and
resiliency built into HDFS and YARN enables the system to recover seamlessly from a failed
slave node. Master nodes are different; they are intended to be always on. Although there are
high availability implementation options for master nodes, failover is not desirable. Therefore,
more local fault tolerance, such as RAID disks, dual power supplies, etc., is preferred for
master nodes.
Q. What does JBOD stand for, and what is its relevance for Hadoop?
A. JBOD is an acronym for Just a Bunch of Disks, which means spinning disks that operate
independently of one another, in contrast to RAID, where disks operate as an array. JBOD is
recommended for slave nodes, which are responsible for HDFS block storage. This is because
the average speed of all disks on a slave node is greater than the speed of the slowest disk. By
comparison, RAID read and write speeds are limited by the speed of the slowest disk in the
array.
Q. What are the advantages of using a commercial Hadoop distribution rather than deploying
the Apache Hadoop release directly?
A. Commercial distributions contain a stack of core and ecosystem components that are
tested with one another and certified for the respective distribution. The commercial vendors
typically include a management application, which is very useful for managing multi-node
Hadoop clusters at scale. The commercial vendors also offer enterprise support as an option as
well.
Quiz
1. True or False: A Java Runtime Environment (JRE) is required on hosts that run Hadoop
services.
2. Which Amazon Web Services offering provides Hadoop as a managed service?
A. EC2
B. EMR
C. S3
D. DynamoDB
3. Slave nodes are designed to be deployed on ______-class hardware.
4. The open source Hadoop cluster management utility used by Hortonworks is called ______.
Answers
1. True. Hadoop services and processes are written in Java, are compiled to Java bytecode, and
run in Java Virtual Machines (JVMs), and therefore require a JRE to operate.
2. B. Elastic MapReduce (EMR).
3. Commodity.
4. Ambari.