Apache Spark Fundamentals: Getting Started

Spark is an open-source cluster computing framework that provides a fast, general computation engine for big data. It addresses the limitations of MapReduce through fast performance, ease of use, and support for interactive queries. This document gives an overview of Spark: its history and origins in MapReduce, how it addresses the explosion of MapReduce programs, and its core APIs and libraries for SQL, streaming, machine learning, and more. It also covers Spark's stability, adoption, supported programming languages, and resources for learning more.

Apache Spark Fundamentals

GETTING STARTED

Justin Pihony
DEVELOPER SUPPORT MANAGER @ LIGHTBEND

@JustinPihony
Why?
grep?
http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
Big Data
Big Data
Big Code
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}
Big Data
Big Code Tiny Code
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setAppName("wordcount")
    val sc = new SparkContext(sparkConf)

    // countByValue collects the counts back to the driver as a local
    // Map[String, Long], so re-parallelize before writing out as text.
    val counts = sc.textFile(args(0))
      .flatMap(_.split(" "))
      .countByValue()
    sc.parallelize(counts.toSeq).saveAsTextFile(args(1))
  }
}
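The Spark word count above boils down to "split each line into words, then count how often each word occurs." As a sanity check, that same split-and-count logic can be sketched locally in plain Java using only JDK streams, with no Spark or cluster involved; the class and method names here are illustrative, not part of the course:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LocalWordCount {
    // Mirrors flatMap(_.split(" ")) followed by countByValue:
    // split every line on spaces, then group identical words and count them.
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = countWords(List.of("big data big code", "tiny code"));
        System.out.println(counts.get("big")); // 2
    }
}
```

The point of the comparison stands either way: the whole shuffle-and-sum machinery of the MapReduce version reduces to a couple of declarative transformations.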
Why Spark?

Readability
Expressiveness
Fast
Testability
Interactive
Fault Tolerant
Unify Big Data
Course Overview

§ Basics of Spark
§ Core API
§ Cluster Managers
§ Spark Maintenance
§ Libraries
  - SQL
  - Streaming
  - MLlib/GraphX
§ Troubleshooting / Optimization
§ Future of Spark
Section Overview

§ Basics of Spark
  - Hadoop
  - History of Spark
  - Installation
  - Big Data’s Hello World
  - Course Prep
§ Core API
§ Cluster Managers
§ Spark Maintenance
§ Libraries
  - SQL
  - Streaming
  - MLlib/GraphX
§ Troubleshooting / Optimization
§ Future of Spark
The MapReduce Explosion
A Unified Platform for Big Data

DataFrames/Datasets
Spark SQL | Spark Streaming | MLlib (machine learning) | GraphX (graph)
Spark Core
The History of Spark
- 2004: MapReduce
- 2009: Spark project begins at UC Berkeley
- 2010: Open sourced under the BSD license; first Spark paper
- 2013: Donated to the Apache Software Foundation; Databricks founded
- 2014: Becomes an Apache top-level project
- 2016: Apache Spark 2.x
Stability

https://spark.apache.org/releases/spark-release-MAJOR-MINOR-REVISION.html
https://github.com/apache/spark/pull/6841
Who Is Using Spark?

Yahoo!
Spark Languages
Big Data
Course Notes

Spark Logistics

Watch for these API maturity labels in the Spark documentation:

§ Experimental
§ Developer API
§ Alpha Component
Resources

§ https://amplab.cs.berkeley.edu/for-big-data-moores-law-means-better-decisions/
§ https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html
§ http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
§ https://spark.apache.org
  - /documentation.html
  - /docs/latest/
  - /community.html
  - /examples.html
§ Learning Spark: Lightning-Fast Big Data Analysis by Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia
§ https://github.com/apache/spark
Summary

§ Why
§ MapReduce Explosion
§ Spark’s History
§ Installation
§ Hello Big Data!
§ Additional Resources
