Introduction to Apache Spark and Machine Learning
Ezekiel Awoyemi
Data Engineer
Andela
What is Apache Spark?
• It is an open-source cluster computing framework built around speed, ease of use, and sophisticated analytics, offered as an alternative to other big data processing technologies such as MapReduce and Storm.
Spark-stack
What is Big Data and where does it come from?
• Ad impressions
• Fast-forward, pause, and rewind events on videos
• Transactions
• Social networks
• Telecommunication networks
Data Science
• Data science aims to derive knowledge from big data, efficiently and intelligently
• Nowcasting (estimating what is happening now): for example, Google Flu Trends, February 2010
• Forecasting (predicting what will happen): for example, Princeton University's epidemiological modelling of online social network dynamics
Database/Data Science

ELEMENTS      DATABASE                        DATA SCIENCE
PRIORITIES    Consistency, error recovery,    Speed, availability,
              auditability                    query richness
DATA VALUE    Precious                        Cheap
DATA VOLUME   Modest                          Massive
STRUCTURE     Strong (schema)                 Weak or none (text)
EXAMPLES      Bank records, medical records,  Online clicks, GPS logs,
              census, personal records        tweets, etc.
              Querying the past               Querying the future
Spark Program Lifecycle
• Create RDDs from external data, or parallelize a collection in your driver program
• Lazily transform them into new RDDs
• cache() some RDDs for reuse
• Perform actions to execute the parallel computation and produce results
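The four lifecycle steps can be sketched roughly as follows, using PySpark's RDD API. This is a minimal sketch, not code from the talk: the function names and the sample data are made up for illustration, and `sc` is assumed to be an existing SparkContext.

```python
def square(x):
    """Transformation function applied to every element."""
    return x * x


def is_even(x):
    """Predicate used to keep only even values."""
    return x % 2 == 0


def lifecycle_demo(sc):
    """Walk through the four lifecycle steps; `sc` is assumed to be a SparkContext."""
    # 1. Create an RDD by parallelizing a collection from the driver program.
    numbers = sc.parallelize(range(1, 11))

    # 2. Lazily transform it into new RDDs; nothing is computed at this point.
    even_squares = numbers.map(square).filter(is_even)

    # 3. Mark the RDD for caching so that later actions can reuse it.
    even_squares.cache()

    # 4. Actions trigger the parallel computation and return results to the driver.
    return even_squares.count(), even_squares.collect()
```

With PySpark installed, calling `lifecycle_demo(SparkContext("local[*]", "lifecycle"))` should return `(5, [4, 16, 36, 64, 100])`.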
Machine Learning
• Machine learning can be used to solve supervised classification problems: give a machine labeled examples and it learns from them
• Collaborative filtering is a technique commonly used for recommender systems
• Other approaches include Naive Bayes algorithms, etc.
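To make "give machines examples and they will learn" concrete, here is a tiny Naive Bayes text classifier in plain Python. It is only an illustrative sketch: it does not use Spark, the training sentences are invented, and whitespace tokenization with add-one smoothing is an assumption chosen for brevity.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count labels and per-label words from (label, text) training pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for label, text in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def classify(model, text):
    """Return the label with the highest log-probability for `text`."""
    label_counts, word_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # Start from the log prior of the label.
        score = math.log(count / total_examples)
        # Add-one (Laplace) smoothing avoids zero probabilities.
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in text.lower().split():
            if word in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy data, for illustration only.
EXAMPLES = [
    ("spam", "win money now"),
    ("spam", "free money offer"),
    ("ham", "meeting at noon"),
    ("ham", "lunch at noon tomorrow"),
]
MODEL = train_naive_bayes(EXAMPLES)
```

Here `classify(MODEL, "free money")` returns `"spam"` and `classify(MODEL, "meeting at noon")` returns `"ham"`. A real spam filter would need far more data and a better tokenizer.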
Examples of machine learning
• Classifying email as spam
• Self-driving cars
• Recommending new songs, movies, etc.
Coding Example 1.
• The text file contains the complete works of William Shakespeare
• Count the number of lines in the file
• Print the first line (the first item in the RDD)
• Count how many lines contain the word “come”
• Split the lines into words and count the number of words in the file
• Count how many times the word “come” now appears
• Print the first item in the word RDD
• Print the (word, count) pairs
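One possible way to write these steps with PySpark's RDD API is sketched below. The tokenizer, the function names, and the file path passed in are assumptions made for illustration; the talk's actual code is not in the slides. `sc` is assumed to be an existing SparkContext.

```python
import re


def split_words(line):
    """Lowercase a line and split it into words (a simple, assumed tokenizer)."""
    return re.findall(r"[a-z']+", line.lower())


def analyze_text(sc, path, word="come"):
    """Run the exercise steps on the text file at `path` (a hypothetical location)."""
    # Load the file as an RDD of lines and keep it cached for reuse.
    lines = sc.textFile(path).cache()
    num_lines = lines.count()
    first_line = lines.first()
    # Match whole words only, so "welcome" does not count as "come".
    lines_with_word = lines.filter(lambda line: word in split_words(line)).count()

    # Flatten the lines into a single RDD of words.
    words = lines.flatMap(split_words).cache()
    num_words = words.count()
    word_occurrences = words.filter(lambda w: w == word).count()
    first_word = words.first()

    # Classic word count: build (word, 1) pairs and sum the counts per word.
    pairs = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
    top_pairs = pairs.takeOrdered(5, key=lambda pair: -pair[1])

    return {
        "num_lines": num_lines,
        "first_line": first_line,
        "lines_with_word": lines_with_word,
        "num_words": num_words,
        "word_occurrences": word_occurrences,
        "first_word": first_word,
        "top_pairs": top_pairs,
    }
```

With PySpark installed, this could be called as `analyze_text(sc, "shakespeare.txt")`, where the file name is a placeholder for wherever the text is stored.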
Coding Example 2.
• We have 1,000,209 ratings from 6,040 users on 3,706 movies, collected by MovieLens
• Use a small set of the movies that have received the most ratings from users in the MovieLens dataset
• Get a fellow to rate those movies from 1 (poor) to 5 (best), or 0 if not seen
• Make movie recommendations
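A rough sketch of this exercise using collaborative filtering is shown below, assuming the RDD-based `pyspark.mllib.recommendation.ALS` and the MovieLens 1M `UserID::MovieID::Rating::Timestamp` line format. The helper names, the user ID 0 for the new rater, and the hyperparameters (rank, iterations, lambda) are illustrative assumptions, not values taken from the talk.

```python
def parse_rating(line):
    """Parse one `UserID::MovieID::Rating::Timestamp` line into (user, movie, rating)."""
    user, movie, rating, _timestamp = line.strip().split("::")
    return int(user), int(movie), float(rating)


def to_training_rows(my_ratings, new_user_id=0):
    """Turn {movie_id: rating} into rows, dropping 0 ratings ("not seen")."""
    return [(new_user_id, movie, float(rating))
            for movie, rating in my_ratings.items() if rating > 0]


def recommend(sc, ratings_path, my_ratings, new_user_id=0, top_n=10):
    """Train ALS on MovieLens plus the new user's ratings and return top movies.

    `sc` is assumed to be a SparkContext; `ratings_path` is a hypothetical path.
    """
    # Imported here so the helpers above work without PySpark installed.
    from pyspark.mllib.recommendation import ALS

    seen = to_training_rows(my_ratings, new_user_id)
    if not seen:
        raise ValueError("the new user must rate at least one movie")

    # Combine the existing ratings with the new user's ratings.
    ratings = sc.textFile(ratings_path).map(parse_rating)
    training = ratings.union(sc.parallelize(seen)).cache()

    # Hyperparameters are illustrative guesses, not tuned values.
    model = ALS.train(training, rank=8, iterations=10, lambda_=0.1)

    # Ask for extra candidates, then drop movies the user has already rated.
    rated = {movie for _user, movie, _rating in seen}
    candidates = model.recommendProducts(new_user_id, top_n + len(rated))
    unseen = [(r.product, r.rating) for r in candidates if r.product not in rated]
    return unseen[:top_n]
```

The returned list holds (movie ID, predicted rating) pairs. Mapping IDs back to titles would need the dataset's movies file, which is left out here.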
Thank You