The How and Why of Fast Data
Analytics with Apache Spark
with Justin Pihony 

@JustinPihony
Today’s agenda:
▪ Concerns
▪ Why Spark?
▪ Spark basics
▪ Common pitfalls
▪ We can help!
2
Target Audience
3
Concerns
4
▪ Am I too small?
▪ Will switching be too costly?
▪ Can I utilize my current infrastructure?
▪ Will I be able to find developers?
▪ Are there enough resources available?
Why Spark?
5
grep?
Why Spark?
6
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("wordcount")
    val sc = new SparkContext(conf)
    // Read the input, split it into words, count each word, and write the results
    sc.textFile(args(0))
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(args(1))
  }
}
7
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
  // Mapper: emit (word, 1) for every token in the input
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum the counts for each word
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
Tiny Code vs. Big Code
Why Spark?
8
▪ Readable
▪ Expressive
▪ Fast
▪ Testable
▪ Interactive
▪ Fault tolerant
▪ Unifies big data
9
The MapReduce Explosion
10
“Spark will kill MapReduce,
but save Hadoop.”
- http://insidebigdata.com/2015/12/08/big-data-industry-predictions-2016/
Big Data Unified API
13
▪ Spark Core
▪ Spark SQL
▪ Spark Streaming
▪ MLlib (machine learning)
▪ GraphX (graph)
▪ DataFrames
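All of these libraries share one engine and one programming model, so a single job can move between them freely. Below is a minimal sketch of that unification, assuming the Spark 1.x-era API (SQLContext, registerTempTable) that predates Spark 2.0; the input path, column names, and table name are illustrative.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object UnifiedSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("unified-api"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Spark Core: build an RDD of (word, count) pairs
    val counts = sc.textFile(args(0))
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Spark SQL / DataFrames: the same data, queried declaratively
    val df = counts.toDF("word", "total")
    df.registerTempTable("word_counts")
    sqlContext.sql("SELECT word, total FROM word_counts ORDER BY total DESC LIMIT 10").show()
  }
}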
Who Is Using Spark?
14
Yahoo!
Spark Mechanics
15
[Diagram: a Driver program coordinating three Worker nodes]
Spark Mechanics
16
[Diagram: the Driver, hosting the Spark Context, distributing work to three Worker nodes]
Spark Context
17
Task creator
Scheduler
Data locality
Fault tolerance
RDD
18
▪ Resilient Distributed Dataset
▪ Transformations
- map
- filter
- …
▪ Actions
- collect
- count
- reduce
- …
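The split matters because transformations are lazy: they only record lineage, and nothing runs until an action asks for a result. A minimal sketch, assuming an existing SparkContext named sc; the log path is illustrative.

val lines = sc.textFile("hdfs:///logs/app.log")   // illustrative path

// Transformations: lazily describe the computation, nothing executes yet
val errors = lines
  .filter(_.contains("ERROR"))
  .map(_.toLowerCase)

// Actions: trigger execution on the cluster and return (or save) a result
val total  = errors.count()
val sample = errors.take(10)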
Expressive and Interactive
19
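The interactivity comes from the spark-shell REPL, where a SparkContext named sc is created at startup so transformations and actions can be explored line by line. A hedged sketch of such a session (results omitted):

scala> val readme = sc.textFile("README.md")
scala> readme.filter(_.contains("Spark")).count()   // action: runs immediately and prints a count
scala> readme.flatMap(_.split(" ")).countByValue().toSeq.sortBy(p => -p._2).take(5)   // top five words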
Built-in UI
20
Common Pitfalls
▪ Functional
▪ Out of memory
▪ Debugging
▪ …
21
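The out-of-memory pitfall, in particular, often comes from pulling an entire dataset back to the driver rather than keeping work on the cluster. A hedged sketch of the anti-pattern and a safer alternative, assuming an existing SparkContext named sc; the paths are illustrative.

val logs = sc.textFile("hdfs:///logs/*")

// Pitfall: collect() ships every partition to the driver JVM and can exhaust its heap
// val everything = logs.collect()

// Safer: aggregate or filter on the cluster and bring back only small results
val errorCount = logs.filter(_.contains("ERROR")).count()
logs.filter(_.contains("ERROR")).saveAsTextFile("hdfs:///logs/errors-only")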
Concerns
22
▪ Am I too small?
▪ Will switching from MapReduce be too costly?
▪ Can I utilize my current infrastructure?
▪ Will I be able to find developers?
▪ Are there enough resources available?
Q & A
23
EXPERT SUPPORT
Why Contact Typesafe for Your Apache Spark Project?
Ignite your Spark project with 24/7 production SLA, unlimited expert support and on-site training:
• Full application lifecycle support for Spark Core, Spark SQL & Spark Streaming
• Deployment to Standalone, EC2, Mesos clusters
• Expert support from dedicated Spark team
• Optional 10-day “getting started” services package
Typesafe is a partner with Databricks, Mesosphere and IBM.
Learn more about on-site training | CONTACT US
©Typesafe 2016 – All Rights Reserved
