Apache Spark Fundamentals: Getting Started
Justin Pihony
DEVELOPER SUPPORT MANAGER @ LIGHTBEND
@JustinPihony
Why?
grep?
http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
Big Data
Big Code
// Hadoop MapReduce word count (imports from java.io, java.util, and org.apache.hadoop.* omitted, as on the slide)
public class WordCount {
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    // input/output paths were not set on the slide, but the job needs them to run
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
Tiny Code
The same word count in Spark is a handful of lines of Scala:
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("wordcount")
    val sc = new SparkContext(sparkConf)

    sc.textFile(args(0))
      .flatMap(_.split(" "))
      .map(word => (word, 1))   // pair each word with a count of 1
      .reduceByKey(_ + _)       // sum the counts per word, still distributed
      .saveAsTextFile(args(1))
  }
}
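To run it, package the class into a jar and hand it to spark-submit. A minimal sketch, assuming a jar named wordcount.jar and local input/output paths (--class and --master are standard spark-submit options):

spark-submit \
  --class WordCount \
  --master "local[*]" \
  wordcount.jar input.txt counts/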
Why Spark?
§ Readability
§ Expressiveness
§ Speed
§ Testability
§ Interactivity (see the shell sketch below)
§ Fault tolerance
§ A unified big data stack
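The interactivity point deserves a concrete picture: Spark ships with a REPL, spark-shell, where a SparkContext is already available as sc. A minimal sketch, assuming a README.md file exists in the working directory:

// Inside spark-shell; sc is the pre-built SparkContext.
val lines = sc.textFile("README.md")       // assumed local file, any text file works
lines.filter(_.contains("Spark")).count()  // runs the job, result prints at the prompt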
Course Overview
[Diagram: the Spark stack — DataFrames/Datasets on top of Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph), all built on Spark Core]
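As a taste of the higher layers covered later in the course, here is a minimal DataFrame sketch using the Spark SQL entry point (Spark 2.x style); the file name people.json and the age column are assumptions for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("overview").getOrCreate()
val people = spark.read.json("people.json")  // Spark SQL reads structured data into a DataFrame
people.filter(people("age") > 21).show()     // the same core engine executes the query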
The History of Spark
BSD Open Source
https://spark.apache.org/releases/spark-release-MAJOR-MINOR-REVISION.html
Stability
https://github.com/apache/spark/pull/6841
Who Is Using Spark?
Yahoo!
Spark Languages
Course Notes
Spark Logistics
Alpha Component
Resources
§ https://amplab.cs.berkeley.edu/for-big-data-moores-law-means-better-decisions/
§ https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html
§ http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
§ https://spark.apache.org
- /documentation.html
- /docs/latest/
- /community.html
- /examples.html
§ Learning Spark: Lightning-Fast Big Data Analysis by Holden Karau, Andy Konwinski,
Patrick Wendell, Matei Zaharia
§ https://github.com/apache/spark
Summary
§ Why Spark?
§ MapReduce Explosion
§ Spark’s History
§ Installation
§ Hello Big Data!
§ Additional Resources