Unit Testing of Spark Applications
Himanshu Gupta
Sr. Software Consultant
Knoldus Software LLP
Agenda
● What is Spark?
● What is Unit Testing?
● Why do we need Unit Testing?
● Unit Testing of Spark Applications
● Demo
What is Spark?
● Distributed compute engine for
large-scale data processing.
● Up to 100x faster than Hadoop MapReduce (in memory).
● Provides APIs in Python, Scala, Java
and R (Spark 1.4)
● Combines SQL, streaming and
complex analytics.
● Runs on Hadoop, Mesos, or
in the cloud.
src: http://spark.apache.org/
What is Unit Testing?
● Unit Testing is a software testing method by which individual units of source code are tested to determine whether they are fit for use.
● Unit tests ensure that code meets its design specifications and behaves as intended.
● Its goal is to isolate each part of the program and show that the individual parts are correct.
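As a minimal illustration of the idea, here is a tiny unit and a test for it in plain Scala (no test framework; `MathUtils` and `add` are hypothetical names chosen for this sketch):

```scala
// Unit under test: a small, isolated piece of logic.
object MathUtils {
  def add(a: Int, b: Int): Int = a + b
}

// A unit test exercises the unit in isolation and checks its behaviour.
object MathUtilsTest {
  def main(args: Array[String]): Unit = {
    assert(MathUtils.add(2, 3) == 5, "add should sum its arguments")
    assert(MathUtils.add(-1, 1) == 0, "add should handle negative numbers")
    println("All tests passed")
  }
}
```

A real project would express these checks with a framework such as ScalaTest, as shown later in the talk, but the principle is the same: one unit, tested in isolation.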
src: https://en.wikipedia.org/wiki/Unit_testing
Why do we need Unit Testing?
● Find problems early
- Finds bugs or missing parts of the specification early in the development cycle.
● Facilitates change
- Helps in refactoring and upgrading without the fear of breaking existing functionality.
● Simplifies integration
- Makes Integration Tests easier to write.
● Documentation
- Provides a living documentation of the system.
● Design
- Can act as a formal design of the project.
src: https://en.wikipedia.org/wiki/Unit_testing
Unit Testing of Spark Applications
Unit to Test
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class WordCount {
  def get(url: String, sc: SparkContext): RDD[(String, Int)] = {
    val lines = sc.textFile(url)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
  }
}
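To make the expected behaviour concrete before testing it with Spark, the same word-count logic can be sketched on a plain Scala collection (a local analogue only; `LocalWordCount` is a name invented for this sketch, and `groupBy` plays the role of `reduceByKey`):

```scala
// Word count over an in-memory Seq: the same transformation the RDD
// version performs, but without a SparkContext.
object LocalWordCount {
  def get(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))   // split each line into words
      .map((_, 1))             // pair each word with a count of 1
      .groupBy(_._1)           // group pairs by word (reduceByKey analogue)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
}
```

Keeping the core transformation this simple is what makes the Spark unit easy to test: the test only has to verify the shape and contents of the resulting pairs.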
Method 1
import org.scalatest.{ BeforeAndAfterAll, FunSuite }
import org.apache.spark.{ SparkConf, SparkContext }

class WordCountTest extends FunSuite with BeforeAndAfterAll {

  private var sparkConf: SparkConf = _
  private var sc: SparkContext = _

  private val wordCount = new WordCount

  override def beforeAll() {
    sparkConf = new SparkConf().setAppName("unit-testing").setMaster("local")
    sc = new SparkContext(sparkConf)
  }

  test("get word count rdd") {
    val result = wordCount.get("file.txt", sc)
    assert(result.take(10).length === 10)
  }

  override def afterAll() {
    sc.stop()
  }
}
Cons of Method 1
● Explicit management of SparkContext creation and destruction.
● Developers have to write more boilerplate code for testing.
● Code duplication, as the before/after steps have to be repeated in every test suite.
Method 2 (Better Way)
Spark Testing Base
A spark package containing base classes to use when writing
tests with Spark.
How? Add it as a test dependency in sbt:
libraryDependencies += "com.holdenkarau" %% "spark-testing-base" % "1.6.1_0.3.2" % "test"
Method 2 (Better Way) contd...
Example 1
import org.scalatest.FunSuite
import com.holdenkarau.spark.testing.SharedSparkContext

class WordCountTest extends FunSuite with SharedSparkContext {

  private val wordCount = new WordCount

  test("get word count rdd") {
    val result = wordCount.get("file.txt", sc)
    assert(result.take(10).length === 10)
  }
}
Method 2 (Better Way) contd...
Example 2
import org.scalatest.FunSuite
import com.holdenkarau.spark.testing.{ RDDComparisons, SharedSparkContext }

class WordCountTest extends FunSuite with SharedSparkContext {

  private val wordCount = new WordCount

  test("get word count rdd with comparison") {
    val expected =
      sc.textFile("file.txt")
        .flatMap(_.split(" "))
        .map((_, 1))
        .reduceByKey(_ + _)
    val result = wordCount.get("file.txt", sc)
    assert(RDDComparisons.compare(expected, result).isEmpty)
  }
}
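Since RDDs carry no ordering guarantee, comparing two RDDs conceptually means comparing them as multisets: equal if and only if every element occurs the same number of times in both. A local sketch of that idea in plain Scala (not the library's implementation; `MultisetCompare` and `firstDifference` are names invented here):

```scala
// Compare two unordered collections as multisets.
object MultisetCompare {
  // Count occurrences of each distinct element.
  def counts[T](xs: Seq[T]): Map[T, Int] =
    xs.groupBy(identity).map { case (k, v) => (k, v.size) }

  // Returns None when the multisets are equal, or Some(element)
  // for an element whose count differs between the two inputs.
  def firstDifference[T](a: Seq[T], b: Seq[T]): Option[T] = {
    val (ca, cb) = (counts(a), counts(b))
    (ca.keySet ++ cb.keySet).find(k => ca.getOrElse(k, 0) != cb.getOrElse(k, 0))
  }
}
```

This is why `compare(...).isEmpty` reads as "no difference found": an empty result means the two datasets matched element-for-element, regardless of partitioning or order.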
Pros of Method 2
● Succinct code.
● Rich Test API.
● Supports Scala, Java and Python.
● Provides API for testing Streaming applications too.
● Has in-built RDD comparators.
● Supports both Local & Cluster mode testing.
When to use What?

Method 1:
● For small-scale Spark applications.
● When the extended capabilities of spark-testing-base are not required.
● For sample applications.

Method 2:
● For large-scale Spark applications.
● When cluster-mode or performance testing is required.
● For production applications.
Demo
Questions & Option[A]
References
● https://github.com/holdenk/spark-testing-base
● Effective Testing for Spark Programs (Strata NY 2015)
● Testing Spark: Best Practices
Thank you