BHARATI VIDYAPEETH DEEMED UNIVERSITY
COLLEGE OF ENGINEERING
Seminar Presentation On
Big Data Analytics: Challenges within
New Data, Metadata Management and
Analysis Platforms
Under the guidance of
Dr. Debnath Bhattacharyya
By
Research scholar : Mr. Ashish Nandkumar Patil
Department of Computer Engineering.
Seat No: 1011-149
Date of Presentation : 02/05/2016
Prerequisites
• Data Vs Information Vs Metadata
• History of Database System
• Database Languages
• Data Models-
Classical DBMS-Hierarchical, Network, Relational
New Directions- Extended Relational (ORDBMS),
Object-Oriented, Distributed DB
• Data Warehouse & Data Mining
• DBMS (Conventional & Advanced)
Contents
• Introduction to Big Data
• Literature Survey
• Applications of Big Data
• Open Source Hadoop – case study
• Hadoop Framework Architecture
• Open Source Hadoop Components
• Challenges, Research Areas & Topics
• Big Data Datasets
• Conclusion
• References
Introduction
• Big Data: a collection of data sets so large and
complex that their size is beyond the ability of commonly
used software tools to capture, curate, manage,
and process the data within a tolerable elapsed
time
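• The trend toward larger data sets: analysing a single
large set of related data yields more information than
analysing separate smaller sets of the same total size
• Correlations found this way help to spot business
trends, prevent diseases, and combat crime
• Scientists, business executives, practitioners of
medicine, advertising, and governments all face
difficulties with large data sets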
Introduction (Cont..)
META Group (now Gartner) analyst Doug Laney: the
three-dimensional "3Vs" model of Big Data
• Volume (amount of data) … too big
• Velocity (speed of data in and out) … too fast
• Variety (range of data types and sources) … too hard
Additionally, a new V:
"Veracity": the unreliability inherent in some sources of data
Big Data: Complex type of Data
Data Growth
Developed economies increasingly use data-intensive
technologies: there are 4.6 billion mobile-phone
subscriptions worldwide, and between 1 billion
and 2 billion people access the internet.
Between 1990 and 2005, more than 1 billion
people worldwide entered the middle class. As
more people become literate, information
growth follows.
The world's effective capacity to exchange
information through telecommunication networks
was 281 petabytes in 1986, 471 petabytes in 1993,
2.2 exabytes in 2000, and 65 exabytes in 2007;
predictions put the amount of internet traffic at
667 exabytes annually by 2014.
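The telecommunication capacity figures above imply a compound annual growth rate that can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming decimal units (1 exabyte = 1000 petabytes); the helper name `cagr` is illustrative only. The 667-exabyte figure is left out because it measures annual internet traffic, not exchange capacity.

```python
# Telecommunication capacity figures quoted above, converted to petabytes.
# Assumption: 1 exabyte = 1000 petabytes (decimal units).
capacity_pb = {1986: 281, 1993: 471, 2000: 2200, 2007: 65000}

def cagr(start_value, end_value, years):
    """Compound annual growth rate between two observations."""
    return (end_value / start_value) ** (1.0 / years) - 1.0

observed_years = sorted(capacity_pb)
for first, last in zip(observed_years, observed_years[1:]):
    rate = cagr(capacity_pb[first], capacity_pb[last], last - first)
    print(f"{first}-{last}: {rate:.1%} per year")
```

Under these assumptions the annual growth rate rises from roughly 8% (1986 to 1993) to roughly 25% (1993 to 2000) and roughly 62% (2000 to 2007), which suggests the growth itself is accelerating.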
Literature Survey
This survey reviews data analytics studies, from traditional data
analysis to recent big data analysis. It focuses on
performance-oriented and result-oriented issues in big data
analytics frameworks and platforms, and introduces data and
big data mining algorithms, which comprise clustering,
classification, and frequent pattern mining technologies.
It aims to find solutions that welcome the new age of big data.
[1] Chun-Wei Tsai, Chin-Feng Lai, Han-Chieh Chao and Athanasios V.
Vasilakos, “Big data analytics: a survey”, Journal of Big
Data (2015), a Springer open journal.
The paper identifies the inefficiency of Hadoop when
executing binary-input applications. It introduces Bi-Hadoop,
an extension to Hadoop that better supports such applications
and integrates an easy-to-use user interface. It implements a
binary-input-aware scheduler and a transparent caching mechanism,
which improve the Hadoop framework. It also outlines future
directions for developing scheduling algorithms that improve
more general task sharing patterns.
[2] Xiao Yu and Bo Hong, Dept. of Electrical and Computer Engineering,
Georgia Institute of Technology, “Bi-Hadoop: Extending Hadoop to
Improve Support for Binary-Input Applications”, IEEE/ACM
International Symposium on Cluster, Cloud and Grid Computing,
IEEE Computer Society, 2013.
Applications of Big Data
•Transformations due to Big Data
•TwitterHealth- flu epidemics.
•NASA Center for Climate Simulation (NCCS): stores 32
petabytes of climate observations and simulations on
the Discover supercomputing cluster
•Google Trends: future orientation index related to GDP,
based on search data.
•Facebook: handles 50 billion photos from its user
base.
•FICO: Falcon Credit Card Fraud Detection System
protects 2.1 billion active accounts world-wide.
Case Study- Hadoop
•Apache Hadoop
An Apache open-source software framework for reliable,
scalable, distributed computing on massive amounts of
data
Hides underlying system details and complexities
from user
Developed in Java
• Core Sub Projects
MapReduce
Hadoop Distributed File System : HDFS
•Supported by several Hadoop-related projects
HBase, Hive, ZooKeeper, Avro, etc.
• Meant for heterogeneous commodity hardware
Design principles of Hadoop
•Scalable
– New nodes can be added on the fly
• Performance & reliability
– Adaptive MapReduce, Compression,
– Indexing, Flexible Scheduler
•Affordable
– Massively parallel computing on
commodity servers
•Flexible
– Hadoop is schema-less, and can absorb any
type of data
•Fault Tolerant
– Through MapReduce software framework
Hadoop Framework Architecture
Two Key Aspects of Hadoop
• Hadoop Distributed File System = HDFS
Where Hadoop stores data
A file system that spans all the nodes in a
Hadoop cluster
It links together the file systems on many
local nodes to make them into one big file
system
• MapReduce framework
How Hadoop understands and assigns work to
the nodes
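To make the idea of one file system spanning many nodes concrete, the sketch below splits a file into fixed-size blocks and records which nodes hold each block. It is a toy model, not HDFS code: the function name `place_blocks`, the 128 MB block size, the replication factor of 3, and the round-robin placement are all assumptions chosen for illustration, and real HDFS uses a rack-aware placement policy instead.

```python
def place_blocks(file_size_mb, nodes, block_size_mb=128, replication=3):
    """Toy placement: split a file into fixed-size blocks and store each
    block on `replication` distinct nodes, chosen round-robin.

    All parameter values are illustrative assumptions, not HDFS internals.
    """
    num_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    copies = min(replication, len(nodes))
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [nodes[(block_id + i) % len(nodes)]
                               for i in range(copies)]
    return placement

# A 300 MB file on a hypothetical four-node cluster needs three blocks.
layout = place_blocks(300, ["node1", "node2", "node3", "node4"])
for block_id, holders in layout.items():
    print(block_id, holders)
```

The mapping from blocks to nodes is the kind of metadata a central directory has to keep so that clients see a single file while the bytes live on many local file systems.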
MapReduce
Take a large problem and divide it into sub-problems
– Break data set down into small chunks
Perform the same function on all sub-problems
Combine the output from all sub-problems
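The divide, apply, and combine steps above can be sketched with a word count, the usual introductory MapReduce example. This is a single-process illustration in plain Python, not the Hadoop API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group intermediate values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values collected for one key."""
    return key, sum(values)

def word_count(chunks):
    """Run the same map function on every chunk, then combine the outputs."""
    intermediate = []
    for chunk in chunks:
        intermediate.extend(map_phase(chunk))
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())

# The data set is already broken into two small chunks.
chunks = ["big data big", "data analytics"]
print(word_count(chunks))  # {'big': 2, 'data': 2, 'analytics': 1}
```

In a real cluster each call to `map_phase` could run on a different node, which is why the map function must not depend on any other chunk.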
MapReduce co-locating with HDFS
MapReduce Processing
•User runs a program on client computer
• Program submits a job to Hadoop. Job contains:
– Input data
– MapReduce program
– Configuration information
• Job sent to JobTracker
•JobTracker communicates with NameNode and assigns parts of a
job to TaskTrackers
(TaskTracker is run on each DataNode)
– A task is a single MAP or REDUCE operation over a piece
of data
– Hadoop divides the input to a MapReduce job into
fixed-size splits
• The JobTracker knows (from NameNode) which nodes contain
the data, and which other machines are nearby.
• The TaskTracker does the processing and sends heartbeats to
the JobTracker.
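The locality-aware assignment described above can be sketched as a small greedy scheduler. This is a simplified model under stated assumptions, not the JobTracker's actual algorithm: the function name `assign_tasks`, the slot counts, and the node names are hypothetical, and a real JobTracker hands out tasks in response to TaskTracker heartbeats and also considers nearby (same-rack) nodes.

```python
def assign_tasks(split_locations, free_slots):
    """Assign each input split to a node, preferring a node that holds the data.

    split_locations: dict of split_id -> list of nodes storing that split
                     (the information the NameNode would supply)
    free_slots:      dict of node -> number of tasks the node can still accept
    Returns a dict of split_id -> (node, is_data_local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is left untouched
    assignment = {}
    for split_id, holders in split_locations.items():
        local = [node for node in holders if slots.get(node, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # No data-local slot is free: fall back to any node with capacity.
            candidates = [n for n, count in slots.items() if count > 0]
            if not candidates:
                raise RuntimeError("no free task slots")
            node, is_local = candidates[0], False
        slots[node] -= 1
        assignment[split_id] = (node, is_local)
    return assignment

splits = {"s0": ["n1", "n2"], "s1": ["n1"], "s2": ["n3"]}
slots = {"n1": 1, "n2": 1, "n3": 0, "n4": 1}
print(assign_tasks(splits, slots))
```

In this run only `s0` is processed where its data lives; `s1` and `s2` fall back to remote nodes, which shows why a scheduler wants to know the data locations before assigning work.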
Open Source Hadoop Components
Challenges Within
• New meta-data/data management platforms
• Analysis platforms
• Techniques for data pre-processing and
addressing imbalanced data sets
• Handling huge streaming data
• Enhancement of traditional iteration-based
mining techniques over new frameworks
• Advanced machine learning algorithms
• Soft computing techniques for efficient
processing
• Privacy Management
Techniques for Data Preprocessing
and Addressing Imbalanced Data Sets
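One common way to address an imbalanced data set is random oversampling, which duplicates minority-class samples until the classes are the same size. The sketch below is an illustration of that general technique only; the slide does not say which technique was presented, and the function name `random_oversample` and the sample data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Draw extra copies with replacement from the same class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

# Hypothetical data: five "normal" records and one "fraud" record.
samples = [10, 11, 12, 13, 14, 99]
labels = ["normal"] * 5 + ["fraud"]
balanced_samples, balanced_labels = random_oversample(samples, labels)
print(Counter(balanced_labels))
```

Oversampling keeps every original record but can encourage overfitting to the duplicated samples, so it is usually compared against undersampling or cost-sensitive learning.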
Conclusion
Big data has increased the demand not
only for data and information management
specialists but also for data analysts. Because of its 4Vs
(Volume, Velocity, Variety, Veracity),
big data is a major issue in the
new digital age. To address this issue, there is a
need to build suitable and optimized
platforms for the management of new data and
metadata and for information analysis, which would
improve the process of big data analytics.
References
[1] Chun-Wei Tsai, Chin-Feng Lai, Han-Chieh Chao and Athanasios V. Vasilakos,
“Big data analytics: a survey”, Journal of Big Data (2015), a
Springer open journal.
[2] Sara del Río, Victoria López, José Manuel Benítez, Francisco Herrera, “On
the use of MapReduce for imbalanced big data using Random Forest”,
Information Sciences 285 (2014) 112–137, Elsevier Inc.
[3] Jesús Maillo, Isaac Triguero, Francisco Herrera, “A MapReduce-based
k-Nearest Neighbor Approach for Big Data Classification”, IEEE Computer
Society, 2015, DOI 10.1109/Trustcom-BigDataSe-ISPA.2015.577
[4] Sergio Ramírez-Gallego, Salvador García, Héctor Mouriño-Talín, David
Martínez-Rego, “Distributed Entropy Minimization Discretizer for Big Data
Analysis under Apache Spark”, IEEE Computer Society, 2015, DOI
10.1109/Trustcom-BigDataSe-ISPA.2015.559
[5] Daniel Peralta, Sara del Río, Sergio Ramírez-Gallego, Isaac Triguero,
José M. Benítez, Francisco Herrera, “Evolutionary Feature Selection for Big Data
Classification: A MapReduce Approach”
[6] M. B. Chandak, “Role of big-data in classification and novel class detection
in data streams”, Journal of Big Data (2016) 3:5, a Springer open journal, DOI
10.1186/s40537-016-0040-9
[7] A. Silberschatz, H. Korth, S. Sudarshan, “Database System Concepts”, 4th
Edition, McGraw-Hill.
[8] J. Han and M. Kamber, “Data Mining: Concepts and Techniques”, 2nd
Edition, Morgan Kaufmann, 2006.