1 Introduction

The document provides an overview of big data, its history, and key technologies such as Hadoop and Apache Spark. It discusses the evolution of big data solutions, highlighting the challenges faced by early search engines and the development of Hadoop and Spark for processing large datasets. Additionally, it covers the characteristics of big data, including the 'Three Vs' (Volume, Velocity, Variety) and their extensions, as well as the advantages of using Spark over Hadoop.

Uploaded by

Sandeep Regalla


CSC 735 – Data Analytics
INTRODUCTION

1
Brief History of Big Data
In the early 2000s, search engine providers faced Internet-scale problems
Google and Yahoo! worked on possible solutions
In 2003, Google released a whitepaper titled "The Google File System" (GFS)

2
Google File System (GFS)

3
Brief History of Big Data (cont.)
In 2004, Google released another whitepaper, titled "MapReduce:
Simplified Data Processing on Large Clusters."
These white papers inspired Doug Cutting and Mike Cafarella to
develop Hadoop

4
What is Hadoop?
It is an open-source big data platform from Apache
Processes large datasets across a cluster of commodity computers
It is written in Java
Scalable
Fault-tolerant

5
Hadoop Architecture
1. Hadoop Common: Java files and libraries necessary to start Hadoop
and for supporting the other Hadoop modules
2. Hadoop Distributed File System (HDFS): distributed storage system
with ideas similar to GFS
3. Hadoop YARN: A framework for job scheduling and cluster resource
management
4. Hadoop MapReduce: A framework for parallel processing of large
data sets.
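
The MapReduce model named in item 4 can be sketched in a few lines. This is a minimal single-machine illustration in plain Python, not Hadoop's Java API: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical, and a real Hadoop job would run the map and reduce steps in parallel across the cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    docs = ["big data is big", "spark and hadoop process big data"]
    print(word_count(docs))
```

Because each call to map_phase depends only on its own document, the map step could be distributed across machines, which is the property the framework relies on.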

6
Hadoop Architecture

7
Big Data
Big data is essential for many organizations
Big data is growing exponentially

8
Units of Storage
Unit                Equivalent
1 kilobyte (KB)     1,024 bytes
1 megabyte (MB)     1,048,576 bytes
1 gigabyte (GB)     1,073,741,824 bytes
1 terabyte (TB)     1,099,511,627,776 bytes
1 petabyte (PB)     1,024 TB
1 exabyte (EB)      1,024 PB
1 zettabyte (ZB)    1,024 EB
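
Each unit in the table is 1,024 times the previous one, so every value is a power of 1,024. The short Python sketch below reproduces the table and converts a raw byte count into the largest fitting unit. The helper names bytes_in and humanize are hypothetical and used only for illustration.

```python
UNITS = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB"]


def bytes_in(unit):
    """Return the number of bytes in one unit, using 1 KB = 1,024 bytes."""
    return 1024 ** (UNITS.index(unit) + 1)


def humanize(num_bytes):
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(num_bytes)
    unit = "bytes"
    for name in UNITS:
        if value < 1024:
            break
        value /= 1024
        unit = name
    return f"{value:.2f} {unit}"


if __name__ == "__main__":
    for name in UNITS:
        print(f"1 {name} = {bytes_in(name):,} bytes")
    # 2.5 quintillion bytes expressed in the table's units
    print(humanize(2.5e18))
```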

9
Rate of Data Growth
In 2017:
2.5 quintillion bytes of data are created each day
90% of the data was generated in the last two years
Google processes more than 40,000 searches every second (about 3.5 billion searches per day)
77% of searches are conducted on Google
About 5 billion searches a day are made worldwide across all search engines
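
The search figures above can be cross-checked with simple arithmetic. The sketch below is only a sanity check of the slide's numbers, assuming a constant rate over a 24-hour day. The variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def per_day(rate_per_second):
    """Convert a per-second rate into a per-day total."""
    return rate_per_second * SECONDS_PER_DAY


# 40,000 searches per second on Google
google_per_day = per_day(40_000)

# If Google handles 77% of all searches, the implied total is:
implied_total_per_day = google_per_day / 0.77

if __name__ == "__main__":
    print(f"Google searches per day: {google_per_day:,}")
    print(f"Implied searches per day, all engines: {implied_total_per_day:,.0f}")
```

The first value comes to 3,456,000,000, which matches the quoted 3.5 billion after rounding. The implied total is roughly 4.5 billion, in line with the rounded figure of about 5 billion.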

10
Rate of Data Growth (cont.)
Every minute (2017):
◦ Facebook users post 510,000 comments
◦ Twitter users send 456,000 tweets
◦ Instagram users post 46,740 photos
◦ YouTube users watch 4,146,600 videos
◦ Snapchat users share 527,760 photos

11
Definition of Big Data
People define it in different ways
◦ one definition relates to the volume of data
◦ another definition relates to the richness of data
◦ another definition is data that is “too big” by traditional standards

12
Characteristics of Big Data
Three Vs of big data
◦ Volume
◦ Velocity
◦ Variety

4th V
◦ Veracity

5th V
◦ Value

15
What is Apache Spark?
Unified computing engine and set of libraries for parallel data
processing on computer clusters
Open source engine for big data
It supports multiple languages: Scala, Python, Java, and R
Has APIs for multiple analytics tasks such as: SQL, streaming, machine
learning
It can run on a laptop and on a cluster of thousands of computers

16
Brief History
Research project at UC Berkeley AMPlab in 2009
Motivation
Initially batch applications
Then allowed interactive analysis and SQL queries
More APIs added over time: MLlib, Streaming, GraphX
In 2013, the project was contributed to the Apache Software Foundation
as a vendor-independent open-source project
Databricks
Spark 1.0 released in 2014 and Spark 2.0 in 2016

24
Hadoop MapReduce vs Spark
Spark is faster
◦ Hadoop MapReduce stores data on disk between steps
◦ Spark keeps as much data in memory as possible
Spark provides much more functionality
◦ Hadoop MapReduce only offers Map & Reduce
◦ Spark supports most functional programming operations
Spark is easier to use
◦ REPL & interactive environment
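
The difference in functionality can be illustrated with a small plain-Python sketch. It uses neither the Hadoop nor the Spark API, and the function names are hypothetical. It only contrasts a computation forced into map and reduce alone with the same computation written using a separate filter step, the kind of additional functional operator the slide refers to.

```python
from functools import reduce


def map_reduce_only(values):
    """Sum the squares of the even values using only map and reduce."""
    # With no filter step, the filtering is folded into the map function:
    # odd values are mapped to 0 so they do not affect the sum.
    mapped = map(lambda v: v * v if v % 2 == 0 else 0, values)
    return reduce(lambda a, b: a + b, mapped, 0)


def functional_pipeline(values):
    """Same result with a separate filter, map, and sum."""
    evens = filter(lambda v: v % 2 == 0, values)
    squares = map(lambda v: v * v, evens)
    return sum(squares)


if __name__ == "__main__":
    readings = [3, 8, 15, 4, 23, 42, 7]
    print(map_reduce_only(readings))
    print(functional_pipeline(readings))
```

Both functions return the same total. The second reads as a sequence of small named steps, which is the style a richer operator set encourages.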

27
Why Scala?
A language such as R or MATLAB does not scale
With Scala, it's easier to scale your problem to large datasets
Distributed computations in Spark are simple to write in Scala

28
Running Spark
You can download and install Spark on your computer
◦ All you need is Java installed and available on your system PATH
◦ YouTube video: Installing Apache Spark and Scala on Windows
Databricks Community Edition: free cloud environment for learning Spark
◦ Create an account

29
Launching Spark’s Interactive Consoles
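Launching the Python console
◦ From Spark’s home directory, run (Windows syntax; on Linux or macOS use ./bin/pyspark)

.\bin\pyspark

Launching the Scala console
◦ From Spark’s home directory, run (Windows syntax; on Linux or macOS use ./bin/spark-shell)

.\bin\spark-shell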

30
Using Databricks Community Edition
Hassle-free environment for using Spark
Has all the data used by our book
Provides a notebook experience for using Spark
Basic overview
Book’s GitHub page
