
Big Data Analysis

Lec. 4
Dr. Mona Abbass
Content
❑What is Big Data Analytics
❑Types of Big Data Analytics
❑Tools for Big Data Analysis
What is Big Data Analytics
❑ Big Data requires tools and methods that can be applied to analyze
and extract patterns from large-scale data.

❑ Big Data Analytics refers to the process of collecting, organizing, and analyzing large data sets to discover patterns and other useful information.

❑ Big data analytics is a set of technologies and techniques that require new forms of integration to uncover hidden value in datasets that are more diverse, more complex, and of a far larger scale than the usual ones.

❑ It mainly focuses on solving new problems, or old problems in better and more effective ways.
Types of Big Data Analytics
❑Descriptive Analytics
❑Diagnostic Analytics
❑Predictive Analytics
❑Prescriptive Analytics
Descriptive Analytics
❑It consists of asking the question: What is happening?

❑It is a preliminary stage of data processing that summarizes a set of historical data.

❑Descriptive analytics mines historical data for trends and patterns, providing the foundation on which the later stages estimate what might happen in the future.
Diagnostic Analytics
❑It consists of asking the question: Why did it happen?

❑Diagnostic analytics looks for the root cause of a problem.

❑It is used to determine why something happened.

❑ This type attempts to find and understand the causes of events and behaviors.
Predictive Analytics
❑It consists of asking the question: What is likely to happen?

❑It uses past data in order to predict the future.

❑It is all about forecasting.

❑Predictive analytics uses many techniques, such as data mining and artificial intelligence, to analyze current data and build scenarios of what might happen.
Prescriptive Analytics
❑It consists of asking the question: What should be done?

❑It is dedicated to finding the right action to be taken.

❑Descriptive analytics provides historical data, and predictive analytics helps forecast what might happen.

❑Prescriptive analytics uses both of these to find the best solution.
Tools for Big Data Analysis
❑Advances in computing architecture are required to handle both the data storage requirements and the heavy server processing needed to analyze large volumes and varieties of data economically.

❑Big data analytics has wide application in various fields, including astronomy, healthcare, and telecommunications.

❑ Despite its advantages, big data analytics has its own limitations and challenges.
Tools for Big Data Analysis
❑Tools that are used to collect data encompass various digital devices (for example, mobile devices, cameras, and smart watches) and applications that generate enormous data in the form of logs, text, voice, images, and video.

❑ To process these data, researchers are developing new techniques for better representation of unstructured data, which makes it possible in a big data context to gain useful insights that may not have been envisioned earlier.
Not Only Structured Query Language (NoSQL)
❑Relational Database Management System (RDBMS) is the
traditional method of managing structured data.

❑ RDBMS uses a relational database and schema for storage and retrieval of data.

❑A Data warehouse is used to store and retrieve large datasets.

❑ Structured Query Language (SQL) is the most commonly used database query language.
Not Only Structured Query Language (NoSQL)
❑The data is stored in a data warehouse using either a dimensional approach or a normalized approach.

❑In the dimensional approach, data are divided into a fact table and dimension tables that support the fact table.

❑In the normalized approach, data are divided into entities, creating several tables in a relational database.
Not Only Structured Query Language (NoSQL)
❑Due to the Atomicity, Consistency, Isolation, and Durability (ACID) constraints, scaling an RDBMS to very large volumes of data is not feasible.
❑Atomicity: This refers to the fact that a transaction is treated
as a unit of operation. Consequently, it dictates that either all the
actions related to a transaction are completed or none of them is
carried out.
❑Consistency: Referring to its correctness, this property deals
with maintaining consistent data in a database system.
Not Only Structured Query Language (NoSQL)
❑Due to the Atomicity, Consistency, Isolation, and Durability (ACID) constraints, scaling an RDBMS to very large volumes of data is not feasible.
❑Isolation: According to this property, each transaction should
see a consistent database at all times. Consequently, no other
transaction can read or modify data that is being modified by
another transaction.
❑Durability: This property ensures that once a transaction
commits, its results are permanent and cannot be erased
from the database.
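To make atomicity and durability concrete, here is a minimal sketch using Python's built-in sqlite3 module (the accounts table and values are hypothetical, not part of the lecture): either both updates commit together, or a failure rolls both back.

import sqlite3

# Minimal ACID transaction sketch (hypothetical accounts table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    # Atomicity: both updates form one unit of operation.
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
    conn.commit()      # Durability: once committed, the result is permanent.
except sqlite3.Error:
    conn.rollback()    # On failure, neither update is applied.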
Not Only Structured Query Language (NoSQL)
❑RDBMS is incapable of handling semi-structured and
unstructured data.

❑These limitations of RDBMS led to the concept of NoSQL.

❑NoSQL stores and manages unstructured data. These databases are also known as “schema-free” databases, since they enable quick upgrades to the structure of the data without table rewrites.
Not Only Structured Query Language (NoSQL)
❑NoSQL supports document stores, key-value stores, and graph databases.
▪ A document store is a computer program and data storage system designed for storing, retrieving, and managing document-oriented information, also known as semi-structured data.
▪ Key-value databases are a collection of key-value pairs that are
stored as individual records and do not have a predefined data
structure.
▪ A graph database, also referred to as a semantic database, is a
software application designed to store, query and modify network
graphs. A network graph is a visual construct that consists of
nodes and edges. Each node represents an entity (such as a
person) and each edge represents a connection or relationship
between two nodes.
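As a rough illustration (the field names and values below are hypothetical, not from the lecture), the same information can be represented in each of these three models using plain Python structures:

# Document store: each record is a self-describing, schema-free document.
document = {
    "_id": "u42",
    "name": "Alice",
    "orders": [{"item": "laptop", "qty": 1}],   # nested structure, no fixed schema
}

# Key-value store: an opaque value addressed only by its key.
key_value = {"user:u42": '{"name": "Alice"}'}

# Graph model: nodes are entities, edges are relationships between them.
nodes = {"u42": {"type": "person", "name": "Alice"},
         "p7": {"type": "product", "name": "laptop"}}
edges = [("u42", "BOUGHT", "p7")]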
Not Only Structured Query Language (NoSQL)
❑NoSQL uses a looser consistency model than traditional databases.

❑Data management and data storage functions are separated in a NoSQL database.

❑It allows data to be scaled easily.

❑A few examples of NoSQL databases are HBase, MongoDB, and Dynamo.
Tools for Big Data Analysis
❑Different commercial and open-source software tools are available for analysis.
❑The most frequently used software are:
▪ Apache Hadoop
▪ Apache Spark
▪ Apache HBase
Apache Hadoop
❑Hadoop is used by companies with very large volumes of data
to process.
❑Among them are web giants such as Facebook, Twitter,
LinkedIn, and Amazon.
❑Apache Hadoop is an open-source software framework for big data.
❑It has two basic parts:
▪ The first is HDFS, the Hadoop Distributed File System.
▪ The other is the programming model, called MapReduce.
Hadoop Distributed File System
(HDFS)
❑HDFS is the fault-tolerant, scalable distributed storage system for a
Hadoop cluster.

❑Clustering is an unsupervised machine learning method of identifying and grouping similar data points in large datasets without concern for a specific outcome.
(Clustering, sometimes called cluster analysis, is usually used to organize data into structures that are more easily understood and manipulated. The goal of clustering is to reduce the amount of data by categorizing or grouping similar data items together.)
❑A Hadoop cluster, by contrast, simply refers to a group of machines working together to store and process data.
Fault Tolerance
❑A system's ability to continue operating uninterrupted
despite the failure of one or more of its components.
❑Fault-tolerant systems use backup components that
automatically take the place of failed components, ensuring
no loss of service.
Hadoop Distributed File System
(HDFS)
❑Data in the Hadoop cluster is broken down into pieces by HDFS and distributed across different servers in the cluster.

❑ Each server stores only a small chunk of the whole data set.
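The toy sketch below (conceptual only, not real HDFS code; the block size and server names are assumptions) illustrates the idea of splitting a file into fixed-size blocks and spreading them across servers:

# Conceptual sketch: split data into blocks and place them on servers.
BLOCK_SIZE = 128 * 1024 * 1024           # HDFS commonly uses 128 MB blocks
servers = ["node1", "node2", "node3"]    # hypothetical cluster nodes

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Yield fixed-size chunks of the input data."""
    for offset in range(0, len(data), block_size):
        yield data[offset:offset + block_size]

def place_blocks(data: bytes):
    """Assign each block to a server round-robin (real HDFS also replicates blocks)."""
    return {i: servers[i % len(servers)]
            for i, _ in enumerate(split_into_blocks(data))}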


Hadoop MapReduce
❑MapReduce is a software framework for distributed
processing of vast amounts of data in a reliable, fault-tolerant
manner.
❑The two distinct phases of MapReduce are:
▪ Map Phase: In the Map phase, the workload is divided into smaller sub-workloads. The tasks are assigned to mappers, each of which processes a unit block of data to produce a sorted list of (key, value) pairs. This list, the output of the mapper, is passed to the next phase; this transfer is known as shuffling.
▪ Reduce Phase: In the Reduce phase, the shuffled input is analyzed and merged to produce the final output, which is written to HDFS in the cluster.
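As a sketch of the programming model only (plain Python, not the Hadoop API), the classic word-count example below walks through the map, shuffle, and reduce steps described above:

from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (key, value) pair for every word in a line."""
    return [(word, 1) for word in line.split()]

def shuffle(mapped_pairs):
    """Shuffle: group all values by key before the reduce step."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: merge all values for a key into the final result."""
    return key, sum(values)

lines = ["big data needs big tools", "big data analytics"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)   # {'big': 3, 'data': 2, 'needs': 1, 'tools': 1, 'analytics': 1}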
Apache Spark
❑Based on the official website of Apache Spark, Spark “is a
unified analytics engine for large-scale data processing”.
❑Apache Spark is “a cluster computing framework for large-
scale data processing.”
❑Spark provides high-level tools including:
▪ Spark SQL for SQL and structured data processing,
▪ MLlib for machine learning,
▪ GraphX for graph processing, and
▪ Structured Streaming for incremental computation and
streaming processing.
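A minimal PySpark sketch (assuming the pyspark package is installed; the file path and column names are hypothetical) showing Spark SQL over structured data:

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("demo").getOrCreate()

# Read a structured dataset; "sales.csv" and its columns are hypothetical.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Spark SQL: register the DataFrame as a view and query it.
df.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()

spark.stop()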
Apache HBase
❑HBase is a distributed, column-oriented database built on top of HDFS, suitable for applications that store data at large scale and require high-throughput random access returning a small subset of the data.

❑However, keep in mind that if the application requires complex SQL queries, transactions, ACID compliance, and multiple indexes on a table, HBase may not be a good choice.
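For illustration, here is a small sketch using the third-party happybase Python client (an assumption, not part of the lecture; the host, table, and column family names are placeholders):

import happybase

# Connect to an HBase Thrift server (placeholder host name).
connection = happybase.Connection("hbase-host")
table = connection.table("web_pages")      # placeholder table name

# Random-access write and read by row key; 'cf' is a placeholder column family.
table.put(b"row-001", {b"cf:url": b"http://example.com", b"cf:status": b"200"})
row = table.row(b"row-001")
print(row[b"cf:url"])

connection.close()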
Software Tools for Handling Big Data
❑There are many tools that help achieve these goals and help data scientists process and analyze data.
❑ Many new languages, frameworks, and data storage technologies have emerged that support the handling of big data, such as:
▪R
▪ Python
▪ Scala
▪ Apache Spark
▪ Apache Hive
▪ Apache Pig
▪ Amazon Elastic Compute Cloud (EC2)
R
❑ is an open-source statistical computing language that provides
a wide variety of statistical and graphical techniques to derive
insights from the data.
❑It has an effective data handling and storage facility and
supports vector operations with a suite of operators for faster
processing.
❑ It has all the features of a standard programming language
and supports conditional arguments, loops, and user-defined
functions.
❑ R is supported by a huge number of packages through
Comprehensive R Archive Network (CRAN).
R
❑It is available on Windows, Linux, and Mac platforms.
❑It has strong documentation for each package.
❑It has a strong support for data mining and machine learning
algorithms along with a good support for reading and writing
in distributed environment, which makes it appropriate for
handling big data.
❑R Studio is an Integrated Development Environment that is
developed for programming in R language.
Python
❑programming language, which is open source and is supported
by Windows, Linux and Mac platforms.
❑It hosts thousands of packages from third-party or community
contributed modules.
❑NumPy and scikit-learn are among the popular packages for machine learning and data mining, supporting data preprocessing, computation, and modeling.
❑NumPy is the base package for scientific computing.
Python
❑It adds support for large, multi-dimensional arrays and
matrices with Python.
❑scikit-learn supports classification, regression, clustering, feature selection, preprocessing, and model selection algorithms.
❑Python has strong support for graph analysis through the NetworkX library, and for text analytics and natural language processing through NLTK.
❑Python is very user-friendly and great for quick-and-dirty analysis of a problem.
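A small sketch (assuming numpy and scikit-learn are installed; the data is synthetic) of the kind of preprocessing and modeling these packages support:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Synthetic two-dimensional data: two loose groups of points.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Preprocessing: scale features to zero mean and unit variance.
scaled = StandardScaler().fit_transform(data)

# Modeling: group similar points with k-means clustering.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(scaled)
print(labels[:10])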
Scala
❑an object-oriented language whose name is short for “Scalable Language”.
❑Everything in Scala is an object, and every operation is a method call, just as in any object-oriented language.
❑ It requires a Java Virtual Machine (JVM) environment.
❑Spark, an in-memory cluster computing framework is written in
Scala.
(In-memory computing means using a type of middleware software
that allows one to store data in RAM, across a cluster of computers,
and process it in parallel.)

❑ Scala is becoming a popular programming tool for handling big data problems.
Apache Spark
❑is an in-memory cluster computing technology designed for fast computation, implemented in Scala.
❑It uses Hadoop only for storage, as it has its own cluster management capability.
❑It comes with 80 high-level operators for interactive querying. In-memory computation is supported by its Resilient Distributed Dataset (RDD) framework, which partitions the data into smaller chunks across different machines for faster computation.
Apache Spark
❑It also supports Map and Reduce for data processing.
❑ It supports SQL, data streaming, graph processing algorithms
and machine learning algorithms.
❑ Though Spark can be accessed from Python, Java, and R, it has particularly strong support for Scala.
❑ It supports deep learning.
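A sketch of the RDD interface through PySpark (assuming pyspark is installed; the numbers are arbitrary), showing data partitioned across workers and processed with map and reduce:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Distribute a collection across the cluster as an RDD (here, 4 partitions).
rdd = sc.parallelize(range(1, 1001), numSlices=4)

# Map: square each element; Reduce: sum the squares across partitions.
total = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total)

spark.stop()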
Apache Hive
❑is an open-source platform that provides facilities for querying and managing large datasets residing in distributed storage (for example, HDFS).

❑ Its query language is similar to SQL and is called HiveQL.

❑It uses MapReduce for processing the queries.
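One way to submit HiveQL from Python is the third-party PyHive package (an assumption, not part of the lecture; the host and table name below are placeholders):

from pyhive import hive

# Connect to a HiveServer2 instance (placeholder host and default port).
conn = hive.connect(host="hive-host", port=10000, username="analyst")
cursor = conn.cursor()

# HiveQL looks like SQL; the "page_views" table is a placeholder.
cursor.execute("SELECT country, COUNT(*) FROM page_views GROUP BY country")
for country, views in cursor.fetchall():
    print(country, views)

cursor.close()
conn.close()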


Apache Pig
❑is a platform that allows analysts to analyze large data sets.

❑It provides a high-level programming language, called Pig Latin, for creating MapReduce programs, and requires Hadoop for data storage.

❑Pig Latin code can be extended with User-Defined Functions that can be written in Java, Python, and a few other languages.
Amazon Elastic Compute Cloud (EC2)
❑is a web service that provides compute capacity over the cloud.

❑ It gives full control of the computing resources and allows developers to run their computations in the desired computing environment.

❑ It is one of the most successful cloud computing platforms.

❑ It works on a pay-as-you-go pricing model.
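A minimal sketch using the boto3 AWS SDK (assuming boto3 is installed and AWS credentials are configured; the AMI ID, key pair name, and instance type are placeholders):

import boto3

# Launch a single on-demand instance in the default region.
ec2 = boto3.resource("ec2")
instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI ID
    InstanceType="t2.micro",           # small pay-as-you-go instance type
    KeyName="my-key-pair",             # placeholder key pair name
    MinCount=1,
    MaxCount=1,
)
print(instances[0].id)   # the instance is billed only while it runs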


Questions
1. What are the types of Big Data Analytics?
2. What are the tools for Big Data Analysis?
Thanks
Dr. Mona Abbass
E-mail [email protected]
