Getting Started
with Apache Spark
Welcome and Housekeeping
● You should have received instructions on how to participate in the training session
● If you have questions, use the Q&A window in GoToWebinar
● The slides and a recording of the session will be made available to you after the event
About Your Instructor
Doug Bateman is Director of Training and Education at Databricks. Prior to this role, he was Director of Training at NewCircle.
Apache Spark - Genesis and Open Source
Spark was originally created in the AMPLab at UC Berkeley. The original creators went on to found Databricks.
Spark was created to bring data processing and machine learning together.
Spark was donated to the Apache Software Foundation to create the Apache Spark open source project.
VISION: Accelerate innovation by unifying data science, engineering, and business
SOLUTION: Unified Analytics Platform
WHO WE ARE:
● Original creators of Apache Spark
● 2,000+ global companies use our platform across the big data & machine learning lifecycle
Introducing Delta Lake
A New Standard for Building Data Lakes
● Open format based on Parquet
● With transactions
● Apache Spark APIs (sketch below)
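A minimal sketch of writing and reading a Delta table, assuming a SparkSession named spark and the delta-spark package available on the cluster (the path is illustrative):

# Writes Parquet data files plus a transaction log, giving transactional writes.
df = spark.range(5)
df.write.format("delta").mode("overwrite").save("/tmp/demo_delta")

# Read it back with the ordinary Spark DataFrame API.
spark.read.format("delta").load("/tmp/demo_delta").show()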
Apache Spark - A Unified Analytics Engine
Apache Spark
“Unified analytics engine for big data
processing, with built-in modules for
streaming, SQL, machine learning and
graph processing”
● Research project at UC Berkeley in 2009
● APIs: Scala, Java, Python, R, and SQL
● Built by more than 1,200 developers from more than 200
companies
HOW TO PROCESS LOTS OF DATA?
The M&M analogy: split the candies across many people, have each person count their share in parallel, then add up the subtotals.
Spark Cluster
One Driver and many Executor JVMs
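A minimal PySpark sketch of this split, run from the driver (the app name and numbers are illustrative):

from pyspark.sql import SparkSession

# The SparkSession (and its SparkContext) live in the driver JVM.
spark = SparkSession.builder.appName("GettingStarted").getOrCreate()
sc = spark.sparkContext

# The driver splits the data into partitions; each partition is processed
# by a task running in one of the executor JVMs.
total = sc.parallelize(range(1_000_000), 8).sum()
print(total)  # 499999500000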
Spark APIs
● RDD
● DataFrame
● Dataset
RDD
Resilient: Fault-tolerant
Distributed: Computed across multiple nodes
Dataset: Collection of partitioned data
● Immutable once constructed
● Lineage information is tracked, so lost partitions can be recomputed
● Operations are applied to the elements of the collection in parallel (see the sketch below)
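A small sketch of these properties, assuming a live SparkContext named sc:

rdd = sc.parallelize([1, 2, 3, 4, 5, 6], 3)   # data split across 3 partitions

# Transformations return a new RDD; the original is never modified.
doubled = rdd.map(lambda x: x * 2)

# Each RDD remembers the lineage of transformations that produced it,
# so a lost partition can be recomputed from its parent data.
print(doubled.toDebugString())

# The work runs on the partitions in parallel across the cluster.
print(doubled.collect())   # [2, 4, 6, 8, 10, 12]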
Transformations and Actions
Transformations: Filter, Sample, Union
Actions: Count, Take, Collect
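A quick sketch of the difference, again assuming a live SparkContext sc: transformations are lazy and only describe a new RDD, while actions trigger execution and return results to the driver.

rdd = sc.parallelize(range(10))

# Transformations: nothing runs yet, Spark just records the plan.
evens  = rdd.filter(lambda x: x % 2 == 0)
sample = evens.sample(withReplacement=False, fraction=0.5)
both   = evens.union(sample)

# Actions: these launch jobs and return results to the driver.
print(evens.count())     # 5
print(evens.take(3))     # [0, 2, 4]
print(evens.collect())   # [0, 2, 4, 6, 8]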
DataFrame
Data organized into named columns (built on RDDs)
Improved performance via the Catalyst and Tungsten optimizations (example below)
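A minimal DataFrame sketch, assuming a live SparkSession named spark:

# A DataFrame is a distributed collection of rows with named columns.
df = spark.createDataFrame([("Jim", 20), ("Anne", 31)], ["name", "age"])
df.printSchema()
df.filter(df.age > 25).show()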
Datasets
DataFrame vs. Dataset
DATAFRAMES
Why Switch to DataFrames?
● User-friendly API
dataRDD = sc.parallelize([("Jim", 20), ("Anne", 31), ("Jim", 30)])
# RDD: average age per name
(dataRDD.map(lambda kv: (kv[0], (kv[1], 1)))
        .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
        .map(lambda kv: (kv[0], kv[1][0] / kv[1][1])))
# DataFrame: the same computation
from pyspark.sql.functions import avg
dataDF = dataRDD.toDF(["name", "age"])
dataDF.groupBy("name").agg(avg("age"))
Why Switch to DataFrames?
● User-friendly API
Benefits:
■ SQL/DataFrame queries
■ Tungsten and Catalyst
optimizations
■ Uniform APIs across languages
Why Switch to DataFrames?
DataFrame operations are a wrapper that builds a logical plan, which Catalyst optimizes into a physical plan before execution
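To see the plan, reuse dataDF and avg from the earlier example; explain(True) prints the logical plans Catalyst builds and the physical plan it selects:

# Nothing is executed here; explain() only prints the parsed, analyzed,
# and optimized logical plans plus the chosen physical plan.
dataDF.groupBy("name").agg(avg("age")).explain(True)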
Catalyst: Under the Hood
Still Not Convinced?
Structured APIs in Spark
WHY SWITCH FROM
MAPREDUCE TO SPARK?
Spark vs. MapReduce
When to Use Spark
● Scale out: Model or data too large to
process on a single machine
● Speed up: Benefit from faster results
Spark References
● Databricks
● Apache Spark ML Programming Guide
● Scala API Docs
● Python API Docs
● Spark Key Terms
Questions?
Further Training Options: http://bit.ly/DBTrng
● Live Onsite Training
● Live Online
● Self-Paced
Meet one of our Spark experts: http://bit.ly/ContactUsDB