
Big Data Analysis

Lec. 4
Dr. Mona Abbass
Content
❑What is Big Data Analytics
❑Types of Big Data Analytics
❑Tools for Big Data Analysis
What is Big Data Analytics
❑ Big Data requires tools and methods that can be applied to analyze
and extract patterns from large-scale data.

❑ Big Data Analytics refers to the process of collecting, organizing, and analyzing large data sets to discover patterns and other useful information.

❑ Big data analytics is a set of technologies and techniques that require new forms of integration to uncover hidden value in datasets that are more diverse, more complex, and of a far larger scale than the usual ones.

❑ It mainly focuses on solving new problems, or old problems in better and more effective ways.
Types of Big Data Analytics
❑Descriptive Analytics
❑Diagnostic Analytics
❑Predictive Analytics
❑Prescriptive Analytics
Descriptive Analytics
❑It consists of asking the question: What is happening?

❑It is a preliminary stage of data processing that summarizes a set of historical data.

❑Descriptive analytics mines historical data for trends and patterns, providing the foundation on which the later stages estimate what might happen in the future.
Diagnostic Analytics
❑It consists of asking the question: Why did it happen?

❑Diagnostic analytics looks for the root cause of a problem.

❑It is used to determine why something happened.

❑ This type attempts to find and understand the causes of events and behaviors.
Predictive Analytics
❑It consists of asking the question: What is likely to happen?

❑It uses past data in order to predict the future.

❑It is all about forecasting.

❑Predictive analytics uses many techniques, such as data mining and artificial intelligence, to analyze current data and build scenarios of what might happen.
Prescriptive Analytics
❑It consists of asking the question: What should be done?

❑It is dedicated to finding the right action to be taken.

❑Descriptive analytics provides historical data, and predictive analytics helps forecast what might happen.

❑Prescriptive analytics uses both of these to find the best solution.
Tools for Big Data Analysis
❑Advances in computing architecture are required to handle both the data storage requirements and the heavy server processing needed to analyze large volumes and varieties of data economically.

❑Big data analytics has wide application in various fields, including astronomy, healthcare, and telecommunications.

❑ Despite its advantages, big data analytics has its own limitations and challenges.
Tools for Big Data Analysis
❑Tools that are used to collect data encompass various digital devices (for example, mobile devices, cameras, and smart watches) and applications that generate enormous data in the form of logs, text, voice, images, and video.

❑ To process these data, researchers are developing new techniques for better representation of unstructured data, which makes it possible in a big data context to gain useful insights that may not have been envisioned earlier.
Not Only Structured Query Language (NoSQL)
❑Relational Database Management System (RDBMS) is the
traditional method of managing structured data.

❑ RDBMS uses a relational database and schema for storage and retrieval of data.

❑A Data warehouse is used to store and retrieve large datasets.

❑ Structured Query Language (SQL) is the most commonly used database query language.
Not Only Structured Query Language (NoSQL)
❑The data is stored in a data warehouse using either a dimensional approach or a normalized approach.

❑In the dimensional approach, data are divided into a fact table and dimension tables that support the fact table.

❑In the normalized approach, data are divided into entities, creating several tables in a relational database.
Not Only Structured Query Language (NoSQL)
❑Due to the Atomicity, Consistency, Isolation, and Durability (ACID) constraints, scaling an RDBMS to very large volumes of data is not feasible.
❑Atomicity: This refers to the fact that a transaction is treated
as a unit of operation. Consequently, it dictates that either all the
actions related to a transaction are completed or none of them is
carried out.
❑Consistency: Referring to its correctness, this property deals
with maintaining consistent data in a database system.
Not Only Structured Query Language (NoSQL)
❑Due to the Atomicity, Consistency, Isolation, and Durability (ACID) constraints, scaling an RDBMS to very large volumes of data is not feasible.
❑Isolation: According to this property, each transaction should
see a consistent database at all times. Consequently, no other
transaction can read or modify data that is being modified by
another transaction.
❑Durability: This property ensures that once a transaction
commits, its results are permanent and cannot be erased
from the database.
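To make atomicity and durability concrete, here is a minimal sketch using Python's built-in sqlite3 module (the accounts table and values are hypothetical, not part of the lecture): either both updates commit together, or a failure rolls both back.

import sqlite3

# Minimal ACID transaction sketch (hypothetical accounts table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    # Atomicity: both updates form one unit of operation.
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
    conn.commit()      # Durability: once committed, the result is permanent.
except sqlite3.Error:
    conn.rollback()    # On failure, neither update is applied.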
Not Only Structured Query Language (NoSQL)
❑RDBMS is incapable of handling semi-structured and
unstructured data.

❑These limitations of RDBMS led to the concept of NoSQL.

❑NoSQL stores and manages unstructured data. These databases are also known as “schema-free” databases, since they enable quick upgrades to the structure of the data without table rewrites.
Not Only Structured Query Language (NoSQL)
❑NoSQL supports document stores, key-value stores, and graph databases.
▪ A document store is a computer program and data storage system designed for storing, retrieving, and managing document-oriented information, also known as semi-structured data.
▪ Key-value databases are a collection of key-value pairs that are
stored as individual records and do not have a predefined data
structure.
▪ A graph database, also referred to as a semantic database, is a
software application designed to store, query and modify network
graphs. A network graph is a visual construct that consists of
nodes and edges. Each node represents an entity (such as a
person) and each edge represents a connection or relationship
between two nodes.
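As a rough illustration (the field names and values below are hypothetical, not from the lecture), the same information can be represented in each of these three models using plain Python structures:

# Document store: each record is a self-describing, schema-free document.
document = {
    "_id": "u42",
    "name": "Alice",
    "orders": [{"item": "laptop", "qty": 1}],   # nested structure, no fixed schema
}

# Key-value store: an opaque value addressed only by its key.
key_value = {"user:u42": '{"name": "Alice"}'}

# Graph model: nodes are entities, edges are relationships between them.
nodes = {"u42": {"type": "person", "name": "Alice"},
         "p7": {"type": "product", "name": "laptop"}}
edges = [("u42", "BOUGHT", "p7")]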
Not Only Structured Query Language (NoSQL)
❑NoSQL uses a looser consistency model than traditional databases.

❑Data management and data storage functions are separated in a NoSQL database.

❑It allows data to be scaled easily.

❑A few examples of NoSQL databases are HBase, MongoDB, and Dynamo.
Tools for Big Data Analysis
❑Different commercial and open-source software tools are available for analysis.
❑The most frequently used software are:
▪ Apache Hadoop
▪ Apache Spark
▪ Apache HBase
Apache Hadoop
❑Hadoop is used by companies with very large volumes of data
to process.
❑Among them are web giants such as Facebook, Twitter,
LinkedIn, and Amazon.
❑Apache Hadoop is an open-source software framework for big data.
❑It has two basic parts:
▪ The first is HDFS, the Hadoop Distributed File System.
▪ The other is the programming model, called MapReduce.
Hadoop Distributed File System
(HDFS)
❑HDFS is the fault-tolerant, scalable distributed storage system for a
Hadoop cluster.

❑Clustering is an unsupervised machine learning method of identifying and grouping similar data points in large datasets without concern for a specific outcome.
(Clustering, sometimes called cluster analysis, is usually used to organize data into structures that are more easily understood and manipulated. The goal of clustering is to reduce the amount of data by categorizing or grouping similar data items together.)
❑A Hadoop cluster, by contrast, simply refers to a group of machines working together to store and process data.
Fault Tolerance
❑A system's ability to continue operating uninterrupted
despite the failure of one or more of its components.
❑Fault-tolerant systems use backup components that
automatically take the place of failed components, ensuring
no loss of service.
Hadoop Distributed File System
(HDFS)
❑Data in the Hadoop cluster is broken down into pieces by HDFS and distributed across different servers in the cluster.

❑ Each server stores only a small chunk of the whole data set.
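The toy sketch below (conceptual only, not real HDFS code; the block size and server names are assumptions) illustrates the idea of splitting a file into fixed-size blocks and spreading them across servers:

# Conceptual sketch: split data into blocks and place them on servers.
BLOCK_SIZE = 128 * 1024 * 1024           # HDFS commonly uses 128 MB blocks
servers = ["node1", "node2", "node3"]    # hypothetical cluster nodes

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Yield fixed-size chunks of the input data."""
    for offset in range(0, len(data), block_size):
        yield data[offset:offset + block_size]

def place_blocks(data: bytes):
    """Assign each block to a server round-robin (real HDFS also replicates blocks)."""
    return {i: servers[i % len(servers)]
            for i, _ in enumerate(split_into_blocks(data))}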


Hadoop MapReduce
❑MapReduce is a software framework for distributed
processing of vast amounts of data in a reliable, fault-tolerant
manner.
❑The two distinct phases of MapReduce are:
▪ Map Phase: In the Map phase, the workload is divided into smaller sub-workloads. The tasks are assigned to mappers, each of which processes a unit block of data to produce a sorted list of (key, value) pairs. This list, the output of the mapper, is passed to the next phase; this transfer is known as shuffling.
▪ Reduce Phase: In the Reduce phase, the shuffled input is analyzed and merged to produce the final output, which is written to HDFS in the cluster.
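As a sketch of the programming model only (plain Python, not the Hadoop API), the classic word-count example below walks through the map, shuffle, and reduce steps described above:

from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (key, value) pair for every word in a line."""
    return [(word, 1) for word in line.split()]

def shuffle(mapped_pairs):
    """Shuffle: group all values by key before the reduce step."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: merge all values for a key into the final result."""
    return key, sum(values)

lines = ["big data needs big tools", "big data analytics"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)   # {'big': 3, 'data': 2, 'needs': 1, 'tools': 1, 'analytics': 1}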
Apache Spark
❑Based on the official website of Apache Spark, Spark “is a
unified analytics engine for large-scale data processing”.
❑Apache Spark is “a cluster computing framework for large-
scale data processing.”
❑Spark provides high-level tools including:
▪ Spark SQL for SQL and structured data processing,
▪ MLlib for machine learning,
▪ GraphX for graph processing, and
▪ Structured Streaming for incremental computation and
streaming processing.
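A minimal PySpark sketch (assuming the pyspark package is installed; the file path and column names are hypothetical) showing Spark SQL over structured data:

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("demo").getOrCreate()

# Read a structured dataset; "sales.csv" and its columns are hypothetical.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Spark SQL: register the DataFrame as a view and query it.
df.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()

spark.stop()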
Apache HBase
❑HBase is a distributed, column-oriented database built on top of HDFS, suitable for applications that store data at large scale and require high-throughput random access returning a small subset of the data.

❑However, keep in mind that if the application requires complex SQL queries, transactions, ACID compliance, and multiple indexes on a table, HBase may not be a good choice.
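For illustration, here is a small sketch using the third-party happybase Python client (an assumption, not part of the lecture; the host, table, and column family names are placeholders):

import happybase

# Connect to an HBase Thrift server (placeholder host name).
connection = happybase.Connection("hbase-host")
table = connection.table("web_pages")      # placeholder table name

# Random-access write and read by row key; 'cf' is a placeholder column family.
table.put(b"row-001", {b"cf:url": b"http://example.com", b"cf:status": b"200"})
row = table.row(b"row-001")
print(row[b"cf:url"])

connection.close()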
Software Tools for Handling Big Data
❑There are many tools that help achieve these goals and help data scientists process and analyze data.
❑ Many new languages, frameworks, and data storage technologies have emerged that support the handling of big data, such as:
▪R
▪ Python
▪ Scala
▪ Apache Spark
▪ Apache Hive
▪ Apache Pig
▪ Amazon Elastic Compute Cloud (EC2)
R
❑ is an open-source statistical computing language that provides
a wide variety of statistical and graphical techniques to derive
insights from the data.
❑It has an effective data handling and storage facility and
supports vector operations with a suite of operators for faster
processing.
❑ It has all the features of a standard programming language
and supports conditional arguments, loops, and user-defined
functions.
❑ R is supported by a huge number of packages through
Comprehensive R Archive Network (CRAN).
R
❑It is available on Windows, Linux, and Mac platforms.
❑It has strong documentation for each package.
❑It has a strong support for data mining and machine learning
algorithms along with a good support for reading and writing
in distributed environment, which makes it appropriate for
handling big data.
❑R Studio is an Integrated Development Environment that is
developed for programming in R language.
Python
❑programming language, which is open source and is supported
by Windows, Linux and Mac platforms.
❑It hosts thousands of packages from third-party or community
contributed modules.
❑NumPy and scikit-learn are among the popular packages for machine learning and data mining, supporting data preprocessing, computation, and modeling.
❑NumPy is the base package for scientific computing.
Python
❑It adds support for large, multi-dimensional arrays and
matrices with Python.
❑scikit-learn supports classification, regression, clustering, feature selection, preprocessing, and model selection algorithms.
❑Python has strong support for graph analysis through the NetworkX library, and for text analytics and natural language processing through NLTK.
❑Python is very user-friendly and great for quick-and-dirty analysis of a problem.
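A small sketch (assuming numpy and scikit-learn are installed; the data is synthetic) of the kind of preprocessing and modeling these packages support:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Synthetic two-dimensional data: two loose groups of points.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Preprocessing: scale features to zero mean and unit variance.
scaled = StandardScaler().fit_transform(data)

# Modeling: group similar points with k-means clustering.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(scaled)
print(labels[:10])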
Scala
❑an object-oriented language whose name is short for “Scalable Language”.
❑Everything in Scala is an object, and every operation is a method call, just as in any object-oriented language.
❑ It requires a Java Virtual Machine (JVM) environment.
❑Spark, an in-memory cluster computing framework is written in
Scala.
(In-memory computing means using a type of middleware software
that allows one to store data in RAM, across a cluster of computers,
and process it in parallel.)

❑ Scala is becoming a popular programming tool for handling big data problems.
Apache Spark
❑is an in-memory cluster computing technology designed for fast computation, implemented in Scala.
❑It uses Hadoop only for storage, as it has its own cluster management capability.
❑It comes with 80 high-level operators for interactive querying. In-memory computation is supported by its Resilient Distributed Dataset (RDD) framework, which partitions the data into smaller chunks across different machines for faster computation.
Apache Spark
❑It also supports Map and Reduce for data processing.
❑ It supports SQL, data streaming, graph processing algorithms
and machine learning algorithms.
❑ Though Spark can be accessed from Python, Java, and R, it has particularly strong support for Scala.
❑ It supports deep learning.
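A sketch of the RDD interface through PySpark (assuming pyspark is installed; the numbers are arbitrary), showing data partitioned across workers and processed with map and reduce:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Distribute a collection across the cluster as an RDD (here, 4 partitions).
rdd = sc.parallelize(range(1, 1001), numSlices=4)

# Map: square each element; Reduce: sum the squares across partitions.
total = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total)

spark.stop()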
Apache Hive
❑is an open-source platform that provides facilities for querying and managing large datasets residing in distributed storage (for example, HDFS).

❑ Its query language is similar to SQL and is called HiveQL.

❑It uses MapReduce for processing the queries.
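One way to submit HiveQL from Python is the third-party PyHive package (an assumption, not part of the lecture; the host and table name below are placeholders):

from pyhive import hive

# Connect to a HiveServer2 instance (placeholder host and default port).
conn = hive.connect(host="hive-host", port=10000, username="analyst")
cursor = conn.cursor()

# HiveQL looks like SQL; the "page_views" table is a placeholder.
cursor.execute("SELECT country, COUNT(*) FROM page_views GROUP BY country")
for country, views in cursor.fetchall():
    print(country, views)

cursor.close()
conn.close()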


Apache Pig
❑is a platform that allows analysts to analyze large data sets.

❑It provides a high-level programming language, called Pig Latin, for creating MapReduce programs, and requires Hadoop for data storage.

❑Pig Latin code can be extended with User-Defined Functions that can be written in Java, Python, and a few other languages.
Amazon Elastic Compute Cloud (EC2)
❑is a web service that provides compute capacity over the cloud.

❑ It gives full control of the computing resources and allows developers to run their computations in the desired computing environment.

❑ It is one of the most successful cloud computing platforms.

❑ It works on a pay-as-you-go pricing model.
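A minimal sketch using the boto3 AWS SDK (assuming boto3 is installed and AWS credentials are configured; the AMI ID, key pair name, and instance type are placeholders):

import boto3

# Launch a single on-demand instance in the default region.
ec2 = boto3.resource("ec2")
instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI ID
    InstanceType="t2.micro",           # small pay-as-you-go instance type
    KeyName="my-key-pair",             # placeholder key pair name
    MinCount=1,
    MaxCount=1,
)
print(instances[0].id)   # the instance is billed only while it runs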


Questions
1. What are the types of Big Data Analytics?
2. What are the tools for Big Data Analysis?
Thanks
Dr. Mona Abbass
E-mail [email protected]
