UNIT-IV PDF
UNIT IV:
Advanced Topics: Introduction, Apache Hadoop, Using Hadoop Map
Reduce for Batch Data Analysis.
IEEE 802.15.4: The IEEE 802 committee family of protocols, The
physical layer, The Media Access Control layer, Uses of 802.15.4, The
Future of 802.15.4: 802.15.4e and 802.15.4g.
Introduction:
The definition of a powerful person has changed in this world. A powerful person is now one
who has access to data. This is because data is growing at a tremendous rate: if we consider all
the data in the world today to be 100%, then about 90% of it was produced in the last two to
four years. Now, when a child is born, she faces the flash of a camera even before she sees her
mother. All these pictures and videos are nothing but data. Similarly, there is data from emails,
various smartphone applications, statistics, and so on. All this data has enormous power to
influence incidents and trends. It is used not only by companies to influence their consumers
but also by politicians to influence elections. This huge data is referred to as Big Data. In such a
world, where data is being produced at such an exponential rate, it needs to be maintained,
analyzed, and tackled. This is where Hadoop comes in.
Hadoop is an open-source framework of tools distributed under the Apache License. It is
used to store, manage, and process data for various big data applications running on clustered
systems. In previous years, Big Data was defined by the "3Vs", but now there are "5Vs" of Big
Data, which are also termed the characteristics of Big Data: Volume, Velocity, Variety,
Veracity, and Value.
Variety: Variety refers to the different forms in which data arrives.
Structured Data: Relational data, which is stored in the form of rows and columns.
Unstructured Data: Texts, pictures, videos, etc. are examples of unstructured data, which
cannot be stored in the form of rows and columns.
Semi-Structured Data: Log files are an example of this type of data.
Veracity: The term veracity refers to inconsistent or incomplete data, which results in
the generation of doubtful or uncertain information. Often data inconsistency arises because of
the volume of data: data in bulk can create confusion, whereas too little data can convey only
half or incomplete information.
Value: After taking the other four Vs into account, there comes one more V, which stands for
Value. Bulk data having no value is of no good to a company unless it is turned into something
useful. Data in itself is of no use or importance; it needs to be converted into something
valuable in order to extract information. Hence, Value can be considered the most important of
the 5Vs.
Big data is a collection of large datasets that cannot be processed using traditional computing
techniques. It is not a single technique or a tool; rather, it has become a complete subject, which
involves various tools, techniques, and frameworks.
Big data involves the data produced by different devices and applications. Given below are
some of the fields that come under the umbrella of Big Data.
Black Box Data − The black box is a component of helicopters, airplanes, jets, etc. It
captures the voices of the flight crew, recordings of microphones and earphones, and the
performance information of the aircraft.
Social Media Data − Social media such as Facebook and Twitter hold information and
the views posted by millions of people across the globe.
Stock Exchange Data − Stock exchange data holds information about the 'buy' and
'sell' decisions made by customers on the shares of different companies.
Power Grid Data − Power grid data holds information about the power consumed by a
particular node with respect to a base station.
Transport Data − Transport data includes the model, capacity, distance, and availability of a
vehicle.
Search Engine Data − Search engines retrieve large amounts of data from different databases.
Thus Big Data includes huge volume, high velocity, and extensible variety of data. The data in
it will be of three types.
Benefits of Big Data:
Using the information kept in social networks like Facebook, marketing agencies
are learning about the response to their campaigns, promotions, and other advertising
media.
Using information in social media, such as the preferences and product perception of
their consumers, product companies and retail organizations are planning their
production.
Using data regarding the previous medical history of patients, hospitals are providing
better and quicker service.
The major challenges associated with big data are as follows:
Capturing data
Curation
Storage
Searching
Sharing
Transfer
Analysis
Presentation
To fulfill the above challenges, organizations normally take the help of enterprise servers.
Traditional Approach
In this approach, an enterprise has a computer to store and process big data. For storage,
programmers take the help of their choice of database vendor, such as Oracle or
IBM. In this approach, the user interacts with the application, which in turn handles
data storage and analysis.
Limitation
This approach works fine for applications that process less voluminous data, which can be
accommodated by standard database servers, or up to the limit of the processor that is
processing the data. But when it comes to dealing with huge amounts of scalable data,
processing all of it through a single database server becomes a bottleneck.
Google’s Solution
Google solved this problem using an algorithm called MapReduce. This algorithm divides the
task into small parts, assigns those parts to many computers, and collects the results from them;
when integrated, these form the result dataset.
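The divide-assign-collect idea can be sketched in a few lines of plain Python (an illustrative single-machine sketch of the concept, not Hadoop's actual API; the function names are invented for illustration):

```python
from collections import defaultdict

# Illustrative sketch of the MapReduce idea: the input is split into
# chunks, each chunk is "mapped" independently (on a real cluster, on a
# different machine), the intermediate pairs are grouped by key, and a
# "reduce" step combines each group into the final result dataset.

def map_phase(chunk):
    # Emit (word, 1) pairs for one chunk of the input.
    return [(word, 1) for word in chunk.split()]

def shuffle(mapped):
    # Group the intermediate pairs by key, as the framework would.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Combine the values for each key into the final result.
    return {key: sum(values) for key, values in groups.items()}

# Pretend each chunk lives on a different machine.
chunks = ["big data is big", "data needs hadoop"]
mapped = [map_phase(c) for c in chunks]
result = reduce_phase(shuffle(mapped))
print(result["big"], result["data"])  # → 2 2
```

In a real cluster, each `map_phase` call would run on a separate node and the shuffle would move intermediate pairs across the network; the structure of the computation is the same.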
Hadoop
Using the solution provided by Google, Doug Cutting and his team developed an Open Source
Project called HADOOP.
Hadoop runs applications using the MapReduce algorithm, where the data is processed in
parallel on different nodes. In short, Hadoop is used to develop applications that can perform
complete statistical analysis on huge amounts of data.
Evolution of Hadoop: Hadoop was designed by Doug Cutting and Michael Cafarella in 2005.
The design of Hadoop is inspired by Google. Hadoop stores huge amounts of data through a
system called the Hadoop Distributed File System (HDFS) and processes this data with the
MapReduce technology. The designs of HDFS and MapReduce are inspired by the Google
File System (GFS) and Google's MapReduce. In the year 2000, Google suddenly overtook all
existing search engines and became the most popular and profitable search engine. The success
of Google was attributed to its unique Google File System and MapReduce. No one except
Google knew about these until, in the year 2003, Google released a paper on GFS. But it was
not enough to understand the overall working of Google, so in 2004 Google released the
remaining papers. The two enthusiasts Doug Cutting and Michael Cafarella studied those
papers and, in the year 2005, designed what is called Hadoop. Doug's son had a toy elephant
whose name was Hadoop, and thus Doug and Michael gave their new creation the name
"Hadoop", and hence the symbol of a toy elephant.
1. Hadoop Distributed File System: On a local PC, the default block size on a hard disk
is 4 KB. When we install Hadoop, HDFS changes the default block size to 64 MB, since it
is used to store huge data; the block size can also be changed to 128 MB. HDFS works
with a Data Node and a Name Node: the Name Node is a master service that keeps the
metadata describing which commodity hardware the data resides on, while the Data
Node stores the actual data. Since the block size is 64 MB, the storage required for
metadata is reduced, making HDFS more efficient. Also, Hadoop stores three copies of
every dataset at three different locations. This ensures that Hadoop is not prone to a
single point of failure.
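The metadata saving from large blocks can be illustrated with some back-of-the-envelope arithmetic (a sketch; the 1 TB file size is chosen for illustration, and the block sizes and replication factor follow the description above):

```python
# Back-of-the-envelope comparison of block counts for a 1 TB file.
# The Name Node keeps one metadata entry per block, so fewer, larger
# blocks mean far less metadata to track.

FILE_SIZE = 1 * 1024**4        # 1 TB
SMALL_BLOCK = 4 * 1024         # 4 KB, typical local-disk block size
HDFS_BLOCK = 64 * 1024**2      # 64 MB, the HDFS default described above

small_blocks = FILE_SIZE // SMALL_BLOCK
hdfs_blocks = FILE_SIZE // HDFS_BLOCK

print(small_blocks)   # 268435456 blocks at 4 KB
print(hdfs_blocks)    # 16384 blocks at 64 MB

# With a replication factor of 3, each block is stored on
# three different Data Nodes:
replicas = hdfs_blocks * 3
print(replicas)       # 49152 physical block copies
```

So moving from 4 KB to 64 MB blocks cuts the number of metadata entries for the same file by a factor of 16384, which is what keeps the Name Node's memory footprint manageable.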
2. Map Reduce: In the simplest terms, MapReduce breaks a query into multiple parts,
and each part then processes the data concurrently. This parallel execution helps
execute a query faster and makes Hadoop a suitable and optimal choice for dealing
with Big Data.
3. YARN: Yet Another Resource Negotiator works like an operating system for Hadoop;
just as operating systems are resource managers, YARN manages the resources of
Hadoop so that Hadoop can serve big data in a better way.
IOT UNIT-4
IEEE 802.15.4
• The Institute of Electrical and Electronics Engineers (IEEE) committee 802 defines physical
and data link technologies.
• The IEEE decomposes the OSI link layer into two sublayers:
1. The media access control layer (MAC), which sits on top of the physical layer (PHY) and
implements the methods used to access the network: carrier-sense multiple access with
collision detection (CSMA/CD), used by Ethernet, and carrier-sense multiple access with
collision avoidance (CSMA/CA), used by IEEE wireless protocols.
2. The logical link control layer (LLC), which formats the data frames sent over the
communication channel through the MAC and PHY layers.
• IEEE 802.2 defines a frame format that is independent of the MAC and PHY
layers, and presents a uniform interface to the upper layers.
(Figure: the link layer comprises the LLC sublayer, which passes data frames down to the MAC
sublayer; CSMA/CD is the access method used by Ethernet, and CSMA/CA is used by wireless
protocols.)
• The MLME (MAC layer management entity) contains the configuration and state
parameters for the MAC layer, such as:
– the 64-bit IEEE address and 16-bit short address for the node
• Two alternative topology models can be used, each with its corresponding data-
transfer method:
– The star topology: data transfers are possible only between the PAN
coordinator and the devices.
– The peer-to-peer topology: data transfers can occur between any two
devices.
– The beacon-enabled access method (or slotted CSMA/CA). When this mode is
selected, the PAN coordinator periodically broadcasts a superframe, composed of a
starting and ending beacon frame, 15 time slots, and an optional inactive period during
which the coordinator may enter a low-power mode. The superframe is shown in the
figure "MAC layer access control methods for 802.15.4".
– The nonbeacon-enabled access method (or unslotted CSMA/CA).
– The first time slots define the contention access period (CAP).
– The last N (N ≤ 7) time slots form the optional contention-free period (CFP), for use by nodes
requiring deterministic network access or guaranteed bandwidth.
– The beacon frame starts with the general MAC layer frame control field, then includes the
source PAN ID and a list of addresses for which the coordinator has pending data, and provides
the superframe settings parameters.
– Devices wishing to send data to a coordinator first listen for the superframe beacon, then
synchronize to the superframe and transmit their data either during the CAP using CSMA/CA,
or during the CFP.
✓ The nonbeacon-enabled access method (unslotted CSMA/CA) is the mode used by ZigBee
and 6LoWPAN. All nodes access the network using CSMA/CA.
✓ The coordinator provides a beacon only when requested by a node, and sets
the beacon order (BO) parameter to 15 to indicate use of the nonbeacon-
enabled access method.
✓ Nodes (including the coordinator) request a beacon during the active scan
procedure, which also identifies whether networks are located in the vicinity and
what their PAN IDs are.
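The effect of the beacon order can be made concrete with the standard's timing formulas (a sketch assuming the 2.4 GHz PHY, where aBaseSuperframeDuration is 960 symbols and one symbol lasts 16 µs):

```python
# Beacon interval (BI) in 802.15.4, per the standard's formula:
#   BI = aBaseSuperframeDuration * 2**BO  (in symbols), for 0 <= BO <= 14
# BO = 15 signals the nonbeacon-enabled mode (no periodic beacons).

A_BASE_SUPERFRAME_DURATION = 960   # symbols
SYMBOL_US = 16                     # microseconds per symbol (2.4 GHz PHY)

def beacon_interval_ms(bo):
    if bo == 15:
        return None  # nonbeacon-enabled mode: no periodic beacon
    return A_BASE_SUPERFRAME_DURATION * (2 ** bo) * SYMBOL_US / 1000.0

print(beacon_interval_ms(0))    # 15.36 ms, the shortest beacon interval
print(beacon_interval_ms(14))   # 251658.24 ms, roughly 4.2 minutes
print(beacon_interval_ms(15))   # None: nonbeacon-enabled access method
```

This is why setting BO to 15, as described above, is the conventional way of saying "no superframe at all" rather than just "a very long one".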
Association:
✓ The association request specifies the PAN ID that the node wishes to join,
and a set of capability flags encoded in one octet.
802.15.4 Addresses:
1. EUI-64: Each 802.15.4 node is required to have a unique 64-bit address, called the extended
unique identifier (EUI-64).
2. 16-bit short address: Since longer addresses increase the packet size, and therefore require
more transmission time and more energy, devices can also request a 16-bit short address from
the PAN coordinator.
• The special 16-bit address FFFF is used as the MAC broadcast address. The MAC layer of all
devices will transmit packets addressed to FFFF to the upper layers.
– The type of data contained in the payload field is determined from the first 3 bits of the frame
control field:
1. Data frames contain network layer data directly in the payload part of the MAC frame.
2. The Ack frame format is specific: it contains only a sequence number and a frame check
sequence, and omits the address and data fields.
3. The payload of command frames begins with a command identifier (Figure 1.10), followed
by a command-specific payload.
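Extracting the frame type from the first 3 bits of the frame control field is a one-line mask (a sketch; the type codes follow the standard — 0 beacon, 1 data, 2 acknowledgment, 3 MAC command — and 0x8841 is shown as a typical frame control value for a data frame with 16-bit addressing):

```python
# The 802.15.4 frame type lives in the 3 least-significant bits of the
# 16-bit frame control field.

FRAME_TYPES = {0: "beacon", 1: "data", 2: "ack", 3: "command"}

def frame_type(frame_control: int) -> str:
    # Mask off everything except the 3 low-order frame-type bits.
    return FRAME_TYPES.get(frame_control & 0b111, "reserved")

print(frame_type(0x8841))   # data
print(frame_type(0x0002))   # ack
print(frame_type(0x0000))   # beacon
```

A receiver uses this value to decide how to interpret the payload, per the three frame formats listed above.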
Security in 802.15.4:
– 802.15.4 facilitates the use of symmetric key cryptography in order to provide data
confidentiality, data authenticity, and replay protection. It is possible to use a specific
key for each pair of devices (a link key), or a common key for a group of devices.
Uses of 802.15.4:
• 802.15.4 provides all the MAC and PHY level mechanisms required by higher-level
protocols to exchange packets securely and form a network.
• It does not provide a fragmentation and reassembly mechanism, so applications need to be
careful when sending unsecured packets larger than 108 bytes.
• Bandwidth is also very limited, and much less than the PHY level bitrate of 250
kbit/s; packets cannot be sent continuously.
• ZigBee and 6LoWPAN introduce segmentation mechanisms that overcome the issue of
small and hard-to-predict application payload sizes at the MAC layer.
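Since 802.15.4 itself offers no fragmentation, a higher layer has to split oversized payloads and reassemble them at the receiver. A minimal illustrative sketch (a generic splitter with a made-up 2-byte per-fragment header budget; this is NOT the real 6LoWPAN FRAG1/FRAGN format):

```python
# Illustrative higher-layer fragmentation over a MAC with a small
# maximum payload. The 108-byte figure comes from the text above; the
# 2-byte header reservation is invented for illustration.

MAX_MAC_PAYLOAD = 108

def fragment(payload: bytes, mtu: int = MAX_MAC_PAYLOAD):
    # Reserve 2 bytes per fragment for a (tag, offset)-style header.
    chunk = mtu - 2
    return [payload[i:i + chunk] for i in range(0, len(payload), chunk)]

def reassemble(fragments):
    # In-order delivery is assumed here; a real protocol uses the
    # header to reorder and detect losses.
    return b"".join(fragments)

data = bytes(300)          # a 300-byte application payload
frags = fragment(data)
print(len(frags))          # 3 fragments of at most 106 bytes each
assert reassemble(frags) == data
```

This is essentially the job that the ZigBee and 6LoWPAN segmentation mechanisms mentioned above perform, with standardized headers and loss handling.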
The Future of 802.15.4: 802.15.4e and 802.15.4g
• The need for more modulation options, notably in the sub-GHz space
• 802.15.4e
– Regarding sensor network performance and memory buffers, it is generally considered that in
a 1000-node network
– Coordinated sampled listening (CSL): the idea is that the receiver is switched on periodically
(about 5 ms) but with a very low duty cycle.
– On the transmission side, this requires senders to use preambles longer than the receive
periodicity of the target.
– CSL is the mode of choice if the receive latency needs to be on the order of one second or less.
– The RIT (receiver-initiated transmission) strategy is a simple power-saving strategy that is
employed by many existing wireless technologies:
– the application layer of the receiving node periodically polls a server in the network for
pending data;
– the receiver broadcasts a data request frame and listens for a short amount of time;
– the receiver can also be turned on for a brief period after sending data.
– Channel hopping adds frequency diversity to other diversity methods and will improve the
resilience of 802.15.4 networks to transient spectrum pollution.
– In a multimode network, there are situations in which finding a common usable channel across
all nodes is challenging.
802.15.4g: this amendment, targeting smart utility networks, addresses the need for more
modulation options, notably in the sub-GHz space.
Hadoop MapReduce
Introduction to Hadoop Framework:
• Apache top level project, open-source implementation of frameworks for reliable, scalable,
distributed computing and data storage.
• It is a flexible and highly-available architecture for large scale computation and data processing
on a network of commodity hardware.
• Hadoop offers a software platform that was originally developed by a Yahoo! group. The
package enables users to write and run applications over vast amounts of distributed data.
• Users can easily scale Hadoop to store and process petabytes of data in the web space. Hadoop
is economical in that it comes with an open source version of MapReduce that minimizes
overhead in task spawning and massive data communication.
• It is efficient, as it processes data with a high degree of parallelism across a large number of
commodity nodes, and it is reliable in that it automatically keeps multiple data copies to
facilitate redeployment of computing tasks upon failures.
Hadoop:
• An open-source software framework that supports data-intensive distributed applications,
licensed under the Apache v2 license.
• A software platform that lets one easily write and run applications that process vast amounts
of data. It includes MapReduce and HDFS.
• Goals / Requirements:
• Abstract and facilitate the storage and processing of large and/or rapidly growing data sets
• Fault-tolerance
Hadoop's Architecture:
• Distributed, with some centralization
• Main nodes of cluster are where most of the computational power and storage of the system
lies
• Main nodes run TaskTracker to accept and reply to MapReduce tasks, and also DataNode to
store needed blocks as closely as possible
• Central control node runs NameNode to keep track of HDFS directories & files, and JobTracker
to dispatch compute tasks to TaskTracker
• Written in Java, also supports Python and Ruby
MapReduce Engine:
• JobTracker & TaskTracker
• JobTracker splits up data into smaller tasks ("Map") and sends them to the TaskTracker
process in each node
• TaskTracker reports back to the JobTracker node on job progress, sends data ("Reduce"), or
requests new jobs
• None of these components is necessarily limited to using HDFS
• Many other distributed file systems with quite different architectures work
• Many other software packages besides Hadoop's MapReduce platform make use of HDFS
• Hadoop is in use at many organizations that handle big data:
• Yahoo
• Facebook
• Amazon
• Netflix, etc.
MapReduce:
• Hadoop implements Google’s MapReduce, using HDFS
• MapReduce divides applications into many small blocks of work.
• HDFS creates multiple replicas of data blocks for reliability, placing them on
compute nodes around the cluster.
• MapReduce can then process the data where it is located.
• MapReduce is sort/merge-based distributed computing
• The underlying system takes care of the partitioning of the input data, scheduling the
program’s execution across several machines, handling machine failures, and
managing required inter-machine communication. (This is the key for Hadoop’s
success)
• The run time partitions the input and provides it to different Map instances;
• MapReduce Usage
✓ Log processing
✓ Web search indexing
✓ Ad-hoc queries
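The classic word-count job can be sketched in the Hadoop Streaming style, where the mapper emits (key, value) pairs, the framework sorts them by key, and the reducer aggregates each sorted group (a local simulation in plain Python, not the actual Streaming command line):

```python
import itertools

# Word count in the Hadoop Streaming style: the mapper turns each input
# line into (word, 1) pairs, the framework sorts the pairs by key, and
# the reducer sums the counts for each word. The sort-and-group step is
# simulated locally here instead of being done by the cluster.

def mapper(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(sorted_pairs):
    # Pairs arrive sorted by key, so each word forms one contiguous group.
    for word, group in itertools.groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

lines = ["Hadoop runs MapReduce", "MapReduce runs on Hadoop"]
pairs = sorted(mapper(lines))       # the "shuffle and sort" step
counts = dict(reducer(pairs))
print(counts["hadoop"], counts["mapreduce"])  # → 2 2
```

With Hadoop Streaming, the same mapper and reducer logic would read lines from stdin and write tab-separated pairs to stdout, and the JobTracker/TaskTracker machinery described above would handle the partitioning, sorting, and scheduling.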
• JobClient
Submits the job
• JobTracker
Manages and schedules the job, splits the job into tasks
• TaskTracker
Starts and monitors the task execution
• Child
The process that really executes the task
Protocols:
JobSubmissionProtocol
JobClient <-------------> JobTracker
InterTrackerProtocol
TaskTracker <------------> JobTracker
TaskUmbilicalProtocol
TaskTracker <-------------> Child
JobTracker implements both protocols and works as the server in both IPCs.
TaskTracker implements the TaskUmbilicalProtocol; the Child gets task
information and reports task status through it.
HDFS:
The Hadoop Distributed File System (HDFS) is a distributed file system
designed to run on commodity hardware. It has many similarities with existing
distributed file systems. However, the differences from other distributed file
systems are significant.
http://hadoop.apache.org/core/.
HDFS Architecture:
• Block Server
• Block Report
• Block Placement
• Replication Strategy
• Data Correctness
• File Creation
• File Access