An Introduction to
Apache Spark
Anastasios Skarlatidis
@anskarl
Software Engineer/Researcher
IIT, NCSR "Demokritos"
Outline
• Part I: Getting to know Spark
• Part II: Basic programming
• Part III: Spark under the hood
• Part IV: Advanced features
Part I:
Getting to know Spark
Spark in a Nutshell
• General cluster computing platform:
• Distributed in-memory computational framework.
• SQL, Machine Learning, Stream Processing, etc.
• Easy to use, powerful, high-level API:
• Scala, Java, Python and R.
Unified Stack
• Spark SQL
• Spark Streaming (real-time processing)
• MLlib (Machine Learning)
• GraphX (graph processing)
All built on Spark Core, which runs on the Standalone Scheduler, YARN or Mesos.
High Performance
• In-memory cluster computing.
• Ideal for iterative algorithms.
• Faster than Hadoop:
• 10x on disk.
• 100x in memory.
Brief History
• Originally developed in 2009, UC Berkeley AMP Lab.
• Open-sourced in 2010.
• As of 2014, Spark is a top-level Apache project.
• Fastest open-source engine for sorting 100 TB:
• Won the 2014 Daytona GraySort contest.
• Throughput: 4.27 TB/min
Who uses Spark,
and for what?
A. Data Scientists:
• Analyze and model data.
• Data transformations and prototyping.
• Statistics and Machine Learning.
B. Software Engineers:
• Implement production data processing systems.
• Require a reasonable API for distributed processing.
• Reliable, high performance, easy to monitor platform.
Resilient Distributed Dataset
RDD is an immutable and partitioned collection:
• Resilient: it can be recreated when data in memory is lost.
• Distributed: stored in memory across the cluster.
• Dataset: data that comes from file or created
programmatically.
(Diagram: an RDD divided into partitions across the cluster.)
Resilient Distributed Datasets
• Feels like coding with typical Scala collections.
• An RDD can be built:
1. directly from a data source (e.g., text file, HDFS, etc.),
2. or by applying a transformation to other RDD(s).
• Main features:
• RDDs are computed lazily (see the sketch below).
• Automatically rebuilt on failure.
• Persistence for reuse (RAM and/or disk).
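A minimal sketch of lazy evaluation (the file name is hypothetical): transformations only build a plan, and nothing is computed until an action is called.

val lines  = sc.textFile("server.log")          // transformation: no data is read yet
val errors = lines.filter(_.contains("ERROR"))  // transformation: still nothing runs
val n      = errors.count()                     // action: triggers reading and filtering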
Part II:
Basic programming
Spark Shell
$ cd spark
$ ./bin/spark-shell
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.2.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
scala>
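Once the prompt appears, the SparkContext is already available as sc; a quick sanity check (a minimal sketch, the exact res number will vary):

scala> sc.parallelize(1 to 10).reduce(_ + _)
res0: Int = 55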
Standalone Applications
Sbt:
    "org.apache.spark" %% "spark-core" % "1.2.1"

Maven:
    groupId: org.apache.spark
    artifactId: spark-core_2.10
    version: 1.2.1
Initiate Spark Context
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp extends App {
  val conf = new SparkConf().setAppName("Hello Spark")
  val sc = new SparkContext(conf)
}
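To run such an application on a cluster, the packaged jar is typically submitted with spark-submit; a sketch, where the jar path and master URL are assumptions:

$ sbt package
$ ./bin/spark-submit \
    --class SimpleApp \
    --master spark://master:7077 \
    target/scala-2.10/hello-spark_2.10-0.1.jar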
Rich, High-level API
map, filter, sort, groupBy, union, join, …
reduce, count, fold, reduceByKey, groupByKey, cogroup, zip, …
sample, take, first, partitionBy, mapWith, pipe, save, …
Loading and Saving
• File Systems: Local FS, Amazon S3 and HDFS.
• Supported formats: Text files, JSON, Hadoop sequence files, Parquet files, protocol buffers and object files.
• Structured data with Spark SQL: Hive, JSON, JDBC,
Cassandra, HBase and ElasticSearch.
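A couple of these loaders/savers in action (a minimal sketch; the paths and pair types are assumptions):

// Read a Hadoop SequenceFile of (word, count) pairs from a hypothetical path
val seq = sc.sequenceFile[String, Int]("hdfs://master:port/data/counts.seq")

// Save as Java-serialized objects and read them back
seq.saveAsObjectFile("hdfs://master:port/data/counts.obj")
val restored = sc.objectFile[(String, Int)]("hdfs://master:port/data/counts.obj")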
Create RDDs
// sc: SparkContext instance

// Scala List to RDD
val rdd0 = sc.parallelize(List(1, 2, 3, 4))

// Load lines of a text file
val rdd1 = sc.textFile("path/to/filename.txt")

// Load a file from HDFS
val rdd2 = sc.hadoopFile("hdfs://master:port/path")

// Load lines of a compressed text file
val rdd3 = sc.textFile("file:///path/to/compressedText.gz")

// Load lines of multiple files
val rdd4 = sc.textFile("s3n://log-files/2014/*.log")
RDD Operations
1. Transformations: define new RDDs based on the current one(s), e.g., filter, map, groupBy, etc. (RDD → new RDD)
2. Actions: return values to the driver, e.g., count, sum, collect, etc. (RDD → value)
Transformations (I): basics
val nums = sc.parallelize(List(1, 2, 3))

// Pass each element through a function
val squares = nums.map(x => x * x)      // {1, 4, 9}

// Keep elements passing a predicate
val even = squares.filter(_ % 2 == 0)   // {4}

// Map each element to zero or more others
val mn = nums.flatMap(x => 1 to x)      // {1, 1, 2, 1, 2, 3}
Transformations (I): illustrated
(Diagram: nums is a ParallelCollectionRDD; nums.map(x => x * x) yields squares (MappedRDD);
squares.filter(_ % 2 == 0) yields even (FilteredRDD); nums.flatMap(x => 1 to x) yields mn (FlatMappedRDD).)
Transformations (II): key - value
//          Key    Value
val pets = sc.parallelize(List(("cat", 1), ("dog", 1), ("cat", 2)))

pets.filter{case (k, v) => k == "cat"}
// {(cat,1), (cat,2)}

pets.map{case (k, v) => (k, v + 1)}
// {(cat,2), (dog,2), (cat,3)}

pets.mapValues(v => v + 1)
// {(cat,2), (dog,2), (cat,3)}
Transformations (II): key - value
//          Key    Value
val pets = sc.parallelize(List(("cat", 1), ("dog", 1), ("cat", 2)))

// Aggregation
pets.reduceByKey((l, r) => l + r)  // {(cat,3), (dog,1)}

// Grouping
pets.groupByKey()  // {(cat, Seq(1, 2)), (dog, Seq(1))}

// Sorting
pets.sortByKey()   // {(cat, 1), (cat, 2), (dog, 1)}
Transformations (III): key - value
// RDD[(URL, page_name)] tuples
val names = sc.textFile("names.txt").map(…)…

// RDD[(URL, visit_counts)] tuples
val visits = sc.textFile("counts.txt").map(…)…

// RDD[(URL, (visit_counts, page_name))] tuples
val joined = visits.join(names)
Basics: Actions
val nums = sc.parallelize(List(1, 2, 3))

// Count number of elements
nums.count()   // = 3

// Merge with an associative function
nums.reduce((l, r) => l + r)   // = 6

// Write elements to a text file
nums.saveAsTextFile("path/to/filename.txt")
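A few more common actions (a minimal sketch):

nums.collect()   // Array(1, 2, 3): fetch all elements to the driver
nums.take(2)     // Array(1, 2)
nums.first()     // 1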
Workflow
(Diagram: data → transformations → action → result)
Part III:
Spark under the hood
Units of Execution Model
1. Job: the work required to compute an RDD.
2. Each job is divided into stages.
3. Task:
• unit of work within a stage,
• corresponds to one RDD partition.
(Diagram: a Job consists of Stage 0, Stage 1, …; each stage consists of Task 0, Task 1, ….)
Execution Model
(Diagram: the Driver Program, holding the SparkContext, sends Tasks to Executors running on Worker Nodes.)

val lines = sc.textFile("README.md")
val countedLines = lines.count()
Example: word count
val lines = sc.textFile("hamlet.txt")
val counts = lines.flatMap(_.split(" "))   // (a)
                  .map(word => (word, 1))  // (b)
                  .reduceByKey(_ + _)      // (c)

Input lines: "to be or", "not to be"
(a) "to" "be" "or"                "not" "to" "be"
(b) ("to",1) ("be",1) ("or",1)    ("not",1) ("to",1) ("be",1)
(c) ("be",2) ("not",1)            ("or",1) ("to",2)
Visualize an RDD
12: val lines = sc.textFile("hamlet.txt")      // HadoopRDD[0], MappedRDD[1]
13:
14: val counts = lines.flatMap(_.split(" "))   // FlatMappedRDD[2]
15:                   .map(word => (word, 1))  // MappedRDD[3]
16:                   .reduceByKey(_ + _)      // ShuffledRDD[4]
17:
18: counts.toDebugString
res0: String =
(2) ShuffledRDD[4] at reduceByKey at <console>:16 []
 +-(2) MappedRDD[3] at map at <console>:15 []
    |  FlatMappedRDD[2] at flatMap at <console>:14 []
    |  hamlet.txt MappedRDD[1] at textFile at <console>:12 []
    |  hamlet.txt HadoopRDD[0] at textFile at <console>:12 []
Lineage Graph
val lines = sc.textFile("hamlet.txt")      // MappedRDD[1], HadoopRDD[0]
val counts = lines.flatMap(_.split(" "))   // FlatMappedRDD[2]
                  .map(word => (word, 1))  // MappedRDD[3]
                  .reduceByKey(_ + _)      // ShuffledRDD[4]

Lineage: HadoopRDD[0] → MappedRDD[1] → FlatMappedRDD[2] → MappedRDD[3] → ShuffledRDD[4]
Lineage Graph
val lines = sc.textFile("hamlet.txt")      // MappedRDD[1], HadoopRDD[0]
val counts = lines.flatMap(_.split(" "))   // FlatMappedRDD[2]
                  .map(word => (word, 1))  // MappedRDD[3]
                  .reduceByKey(_ + _)      // ShuffledRDD[4]

(Diagram: the same lineage, HadoopRDD → MappedRDD → FlatMappedRDD → MappedRDD → ShuffledRDD, drawn once per partition.)
Execution Plan
val lines = sc.textFile("hamlet.txt")      // MappedRDD[1], HadoopRDD[0]
val counts = lines.flatMap(_.split(" "))   // FlatMappedRDD[2]
                  .map(word => (word, 1))  // MappedRDD[3]
                  .reduceByKey(_ + _)      // ShuffledRDD[4]

Stage 1: HadoopRDD → MappedRDD → FlatMappedRDD → MappedRDD (pipelined)
Stage 2: ShuffledRDD
Part IV:
Advanced Features
Persistence
• When we use the same RDD multiple times:
• Spark will recompute the RDD.
• Expensive for iterative algorithms.
• Spark can persist RDDs, avoiding recomputations.
Levels of persistence
val result = input.map(expensiveComputation)
result.persist(LEVEL)

LEVEL                     Space consumption   CPU time   In memory   On disk
MEMORY_ONLY (default)     High                Low        Y           N
MEMORY_ONLY_SER           Low                 High       Y           N
MEMORY_AND_DISK           High                Medium     Some        Some
MEMORY_AND_DISK_SER       Low                 High       Some        Some
DISK_ONLY                 Low                 High       N           Y
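For example, a sketch of caching an expensive RDD in memory with spill to disk, reusing it across two actions, and then releasing it (the file name is an assumption):

import org.apache.spark.storage.StorageLevel

val lengths = sc.textFile("input.txt").map(line => line.length)
lengths.persist(StorageLevel.MEMORY_AND_DISK)

val total = lengths.reduce(_ + _)   // first action: computes and caches the partitions
val n     = lengths.count()         // second action: reuses the cached partitions
lengths.unpersist()                 // manually remove from the cache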
Persistence Behaviour
• Each node will store its computed partition.
• In case of a failure, Spark recomputes the
missing partitions.
• Least Recently Used (LRU) cache eviction policy:
• Memory-only levels: evicted partitions are recomputed when needed.
• Memory-and-disk levels: evicted partitions are written to disk.
• Manually remove from cache: unpersist()
Shared Variables
1. Accumulators: aggregate values from worker
nodes back to the driver program.
2. Broadcast variables: distribute values to all
worker nodes.
Accumulator Example
val input = sc.textFile("input.txt")

// driver only: initialize the accumulators
val sum = sc.accumulator(0)
val count = sc.accumulator(0)

input
  .filter(line => line.size > 0)
  .flatMap(line => line.split(" "))
  .map(word => word.size)
  .foreach{ size =>
    sum += size   // increment accumulator
    count += 1    // increment accumulator
  }

val average = sum.value.toDouble / count.value
Accumulators and Fault Tolerance
• Safe: updates inside actions are applied only once.
• Unsafe: updates inside transformations may be applied more than once!
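A sketch of the difference, reusing input from the previous slide (the blankLines counter is an assumption):

val blankLines = sc.accumulator(0)

// Safe: foreach is an action, so each update is applied exactly once
input.foreach{ line => if (line.isEmpty) blankLines += 1 }

// Unsafe: map is a transformation; if its partitions are recomputed
// (e.g., after a failure or cache eviction), updates may be applied again
val sizes = input.map{ line =>
  if (line.isEmpty) blankLines += 1
  line.size
}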
Broadcast Variables
• Closures and the variables they use are sent separately to each task.
• We may want to share some variable (e.g., a Map) across tasks/operations.
• This can be done efficiently with broadcast variables.
Example without
broadcast variables
// RDD[(String, String)]
val names = …  // load (URL, page name) tuples

// RDD[(String, Int)]
val visits = …  // load (URL, visit counts) tuples

// Map[String, String]
val pageMap = names.collect.toMap

val joined = visits.map{
  case (url, counts) =>
    (url, (pageMap(url), counts))
}

// pageMap is sent along with every task
Example with
broadcast variables
// RDD[(String, String)]
val names = …  // load (URL, page name) tuples

// RDD[(String, Int)]
val visits = …  // load (URL, visit counts) tuples

// Map[String, String]
val pageMap = names.collect.toMap

// Broadcast variable
val bcMap = sc.broadcast(pageMap)

val joined = visits.map{
  case (url, counts) =>
    (url, (bcMap.value(url), counts))
}

// pageMap is sent to each node only once
Appendix
Staging
(Diagram, in three steps: a lineage with map, filter, groupBy and join operations; the effect of caching; and the resulting division into Stage 1, Stage 2 and Stage 3.)