This presentation is an introduction to Apache Spark. It covers the basic API and some advanced features, and describes how Spark physically executes its jobs.
The document summarizes Spark SQL, which is a Spark module for structured data processing. It introduces key concepts like RDDs, DataFrames, and interacting with data sources. The architecture of Spark SQL is explained, including how it works with different languages and data sources through its schema RDD abstraction. Features of Spark SQL are covered such as its integration with Spark programs, unified data access, compatibility with Hive, and standard connectivity.
This document provides an overview of a talk on Apache Spark. It introduces the speaker and their background. It acknowledges inspiration from a previous Spark training. It then outlines the structure of the talk, which will include: a brief history of big data; a tour of Spark including its advantages over MapReduce; and explanations of Spark concepts like RDDs, transformations, and actions. The document serves to introduce the topics that will be covered in the talk.
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ... (Simplilearn)
This presentation about Apache Spark covers all the basics that a beginner needs to know to get started with Spark. It covers the history of Apache Spark, what Spark is, and the difference between Hadoop and Spark. You will learn the different components in Spark and how Spark works with the help of its architecture. You will understand the different cluster managers on which Spark can run. Finally, you will see the various applications of Spark and a use case on Conviva. Now, let's get started with what Apache Spark is.
Below topics are explained in this Spark presentation:
1. History of Spark
2. What is Spark
3. Hadoop vs Spark
4. Components of Apache Spark
5. Spark architecture
6. Applications of Spark
7. Spark use case
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
Simplilearn’s Apache Spark and Scala certification training is designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache Spark and Scala skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming, and Spark Shell scripting
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn?
By completing this Apache Spark and Scala course you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Who should take this Scala course?
1. Professionals aspiring for a career in the field of real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
Apache Spark is a cluster computing framework designed for fast, general-purpose processing of large datasets. It uses in-memory computing to improve processing speeds. Spark operations include transformations that create new datasets and actions that return values. The Spark stack includes Resilient Distributed Datasets (RDDs) for fault-tolerant data sharing across a cluster. Spark Streaming processes live data streams using a discretized stream model.
This document provides an overview of key mathematical concepts relevant to machine learning, including linear algebra (vectors, matrices, tensors), linear models and hyperplanes, dot and outer products, probability and statistics (distributions, samples vs populations), and resampling methods. It also discusses solving systems of linear equations and the statistical analysis of training data distributions.
Optimizing Spark jobs through a true understanding of Spark Core. Learn: What is a partition? What is the difference between read/shuffle/write partitions? How to increase parallelism and decrease output files? Where does shuffle data go between stages? What is the "right" size for your Spark partitions and files? Why does a job slow down with only a few tasks left and never finish? Why doesn't adding nodes decrease my compute time?
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure like YARN/Mesos. This talk will cover a basic introduction to Apache Spark and its various components, like MLlib, Shark, and GraphX, with a few examples.
The document discusses the Internet Protocol (IP) which is the cornerstone of the TCP/IP architecture and allows all computers on the Internet to communicate. There are two main versions of IP - IPv4, the currently used version, and IPv6 which is intended to replace IPv4 and includes improvements like longer addresses. IP addresses are 32-bit for IPv4 and 128-bit for IPv6. Strategies like private addressing and Classless Inter-Domain Routing (CIDR) help conserve the limited number of available IP addresses.
Spark is an open source cluster computing framework for large-scale data processing. It provides high-level APIs and runs on Hadoop clusters. Spark components include Spark Core for execution, Spark SQL for SQL queries, Spark Streaming for real-time data, and MLlib for machine learning. The core abstraction in Spark is the resilient distributed dataset (RDD), which allows data to be partitioned across nodes for parallel processing. A word count example demonstrates how to use transformations like flatMap and reduceByKey to count word frequencies from an input file in Spark.
This document provides an overview of Apache Spark, including how it compares to Hadoop, the Spark ecosystem, Resilient Distributed Datasets (RDDs), transformations and actions on RDDs, the directed acyclic graph (DAG) scheduler, Spark Streaming, and the DataFrames API. Key points covered include Spark's faster performance versus Hadoop through its use of memory instead of disk, the RDD abstraction for distributed collections, common RDD operations, and Spark's capabilities for real-time streaming data processing and SQL queries on structured data.
Spark is an open-source distributed computing framework used for processing large datasets. It allows for in-memory cluster computing, which enhances processing speed. Spark core components include Resilient Distributed Datasets (RDDs) and a directed acyclic graph (DAG) that represents the lineage of transformations and actions on RDDs. Spark Streaming is an extension that allows for processing of live data streams with low latency.
We will see an overview of Spark in Big Data. We will start with an introduction to Apache Spark programming, then move on to Spark's history and why Spark is needed. Afterward, we will cover the fundamentals of Spark's components, its core abstraction, and Spark RDDs. For more detailed insights, we will also cover Spark features, Spark limitations, and Spark use cases.
Apache Spark in Depth: Core Concepts, Architecture & Internals (Anton Kirillov)
The slides cover Spark core concepts such as RDDs, the DAG, the execution workflow, how stages of tasks are formed, and the shuffle implementation, and also describe the architecture and main components of the Spark driver. The workshop part covers Spark execution modes and provides a link to a GitHub repo that contains example Spark applications and a dockerized Hadoop environment to experiment with.
The document provides an overview of Apache Spark internals and Resilient Distributed Datasets (RDDs). It discusses:
- RDDs are Spark's fundamental data structure - they are immutable distributed collections that allow transformations like map and filter to be applied.
- RDDs track their lineage or dependency graph to support fault tolerance. Transformations create new RDDs while actions trigger computation.
- Operations on RDDs include narrow transformations like map that don't require data shuffling, and wide transformations like join that do require shuffling.
- The RDD abstraction allows Spark's scheduler to optimize execution through techniques like pipelining and cache reuse.
This session covers how to work with the PySpark interface to develop Spark applications, from loading and ingesting data to applying transformations on it. The session covers how to work with different data sources, apply transformations, and follow Python best practices when developing Spark apps. The demo covers integrating Apache Spark apps, in-memory processing capabilities, working with notebooks, and integrating analytics tools into Spark applications.
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab (CloudxLab)
Big Data with Hadoop & Spark Training: http://bit.ly/2L4rPmM
This CloudxLab Basics of RDD tutorial helps you understand the basics of RDDs in detail. Below are the topics covered in this tutorial:
1) What is RDD - Resilient Distributed Datasets
2) Creating RDD in Scala
3) RDD Operations - Transformations & Actions
4) RDD Transformations - map() & filter()
5) RDD Actions - take() & saveAsTextFile()
6) Lazy Evaluation & Instant Evaluation
7) Lineage Graph
8) flatMap and Union
9) Scala Transformations - Union
10) Scala Actions - saveAsTextFile(), collect(), take() and count()
11) More Actions - reduce()
12) Can We Use reduce() for Computing Average?
13) Solving Problems with Spark
14) Compute Average and Standard Deviation with Spark
15) Pick Random Samples From a Dataset using Spark
This document provides an overview of Apache Spark, including its goal of providing a fast and general engine for large-scale data processing. It discusses Spark's programming model, components like RDDs and DAGs, and how to initialize and deploy Spark on a cluster. Key aspects covered include RDDs as the fundamental data structure in Spark, transformations and actions, and storage levels for caching data in memory or disk.
Introduction to Apache Spark. With an emphasis on the RDD API, Spark SQL (DataFrame and Dataset API) and Spark Streaming.
Presented at the Desert Code Camp:
http://oct2016.desertcodecamp.com/sessions/all
This document provides an introduction to the Pig analytics platform for Hadoop. It begins with an overview of big data and Hadoop, then discusses the basics of Pig including its data model, language called Pig Latin, and components. Key points made are that Pig provides a high-level language for expressing data analysis processes, compiles queries into MapReduce programs for execution, and allows for easier programming than lower-level systems like Java MapReduce. The document also compares Pig to SQL and Hive, and demonstrates visualizing Pig jobs with the Twitter Ambrose tool.
Spark is an open-source cluster computing framework that allows processing of large datasets in parallel. It supports multiple languages and provides advanced analytics capabilities. Spark SQL was built to overcome limitations of Apache Hive by running on Spark and providing a unified data access layer, SQL support, and better performance on medium and small datasets. Spark SQL uses DataFrames and a SQLContext to allow SQL queries on different data sources like JSON, Hive tables, and Parquet files. It provides a scalable architecture and integrates with Spark's RDD API.
How to use Parquet as a basis for ETL and analytics (Julien Le Dem)
Parquet is a columnar format designed to be extremely efficient and interoperable across the Hadoop ecosystem. Its integration in most of the Hadoop processing frameworks (Impala, Hive, Pig, Cascading, Crunch, Scalding, Spark, …) and serialization models (Thrift, Avro, Protocol Buffers, …) makes it easy to use in existing ETL and processing pipelines, while giving flexibility of choice on the query engine (whether in Java or C++). In this talk, we will describe how one can use Parquet with a wide variety of data analysis tools like Spark, Impala, Pig, Hive, and Cascading to create powerful, efficient data analysis pipelines. Data management is simplified as the format is self-describing and handles schema evolution. Support for nested structures enables more natural modeling of data for Hadoop compared to flat representations that create the need for often costly joins.
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E... (Edureka!)
This Edureka Spark SQL Tutorial will help you understand how Apache Spark offers SQL power in real-time. This tutorial also demonstrates a use case on Stock Market Analysis using Spark SQL. Below are the topics covered in this tutorial:
1) Limitations of Apache Hive
2) Spark SQL Advantages Over Hive
3) Spark SQL Success Story
4) Spark SQL Features
5) Architecture of Spark SQL
6) Spark SQL Libraries
7) Querying Using Spark SQL
8) Demo: Stock Market Analysis With Spark SQL
Apache Spark is a fast, general engine for large-scale data processing. It provides a unified analytics engine for batch, interactive, and stream processing using an in-memory abstraction called resilient distributed datasets (RDDs). Spark's speed comes from its ability to run computations directly on data stored in cluster memory and optimize performance through caching. It also integrates well with other big data technologies like HDFS, Hive, and HBase. Many large companies are using Spark for its speed, ease of use, and support for multiple workloads and languages.
This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.
The document discusses enabling diverse workload scheduling in YARN. It covers several topics including node labeling, resource preemption, reservation systems, pluggable scheduler behavior, and Docker container support in YARN. The presenters are Wangda Tan and Craig Welch from Hortonworks who have experience with big data systems like Hadoop, YARN, and OpenMPI. They aim to discuss how these features can help different types of workloads like batch, interactive, and real-time jobs run together more happily in YARN.
Simplifying Big Data Analytics with Apache Spark (Databricks)
Apache Spark is a fast and general-purpose cluster computing system for large-scale data processing. It improves on MapReduce by allowing data to be kept in memory across jobs, enabling faster iterative jobs. Spark consists of a core engine along with libraries for SQL, streaming, machine learning, and graph processing. The document discusses new APIs in Spark including DataFrames, which provide a tabular interface like in R/Python, and data sources, which allow plugging external data systems into Spark. These changes aim to make Spark easier for data scientists to use at scale.
Apache Spark 2.0: Faster, Easier, and Smarter (Databricks)
In this webcast, Reynold Xin from Databricks will be speaking about Apache Spark's new 2.0 major release.
The major themes for Spark 2.0 are:
- Unified APIs: Emphasis on building up higher level APIs including the merging of DataFrame and Dataset APIs
- Structured Streaming: Simplify streaming by building continuous applications on top of DataFrames, allowing us to unify streaming, interactive, and batch queries.
- Tungsten Phase 2: Speed up Apache Spark by 10X
ApacheCon NA 2015 Spark / Solr Integration (thelabdude)
Apache Solr has been adopted by all major Hadoop platform vendors because of its ability to scale horizontally to meet even the most demanding big data search problems. Apache Spark has emerged as the leading platform for real-time big data analytics and machine learning. In this presentation, Timothy Potter presents several common use cases for integrating Solr and Spark.
Specifically, Tim covers how to populate Solr from a Spark streaming job as well as how to expose the results of any Solr query as an RDD. The Solr RDD makes efficient use of deep paging cursors and SolrCloud sharding to maximize parallel computation in Spark. After covering basic use cases, Tim digs a little deeper to show how to use MLLib to enrich documents before indexing in Solr, such as sentiment analysis (logistic regression), language detection, and topic modeling (LDA), and document classification.
Are you a Java developer interested in big data processing who has never had the chance to work with Apache Spark? My presentation aims to help you get familiar with Spark concepts and start developing your own distributed processing application.
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks (Data Con LA)
Spark and the Berkeley Data Analytics Stack (BDAS) represent a unified, distributed, and parallel high-performance big data processing and analytics platform. Written in Scala, Spark supports multiple languages including Python, Java, Scala, and even R. Commonly seen as the successor to Hadoop, Spark is fully compatible with Hadoop including UDFs, SerDes, file formats, and compression algorithms. The high-level Spark libraries include stream processing, machine learning, graph processing, approximating, sampling - and every combination therein. The most active big data open source project in existence, Spark boasts ~500 contributors and 10,000 commits to date. Spark recently broke the Daytona GraySort 100 TB record with almost 3 times the throughput, 1/3rd less time, and 1/10th of the resources!
This document summarizes Chris Fregly's presentation on how Apache Spark beat Hadoop at sorting 100 TB of data. Key points include:
- Spark set a new record in the Daytona GraySort benchmark by sorting 100 TB of data in 23 minutes using 250,000 partitions on EC2.
- Optimizations that contributed to Spark's win included using CPU cache locality with (Key, Pointer) pairs, an optimized sorting algorithm, reducing network overhead with Netty, and reducing OS resources with a sort-based shuffle.
- The sort-based shuffle merges mapper outputs into a single file per partition to minimize disk seeks during the shuffle.
This document discusses Resilient Distributed Datasets (RDD), a fault-tolerant abstraction in Apache Spark for cluster computing. RDDs allow data to be reused across computations and support transformations like map, filter, and join. RDDs can be created from stable storage or other RDDs, and Spark computes them lazily for efficiency. The document provides examples of how RDDs can express algorithms like MapReduce, SQL queries, and graph processing. Benchmarks show Spark is 20x faster than Hadoop for iterative algorithms due to RDDs enabling data reuse in memory across jobs.
This document provides an overview and introduction to Apache Spark. It discusses what Spark is, how it was developed, why it is useful for big data processing, and how its core components like RDDs, transformations, and actions work. The document also demonstrates examples of using Spark through its interactive shell and shows how to run Spark jobs locally and on a cluster.
This document discusses new directions for Apache Spark in 2015, including improved interfaces for data science, external data sources, and machine learning pipelines. It also summarizes Spark's growth in 2014 with over 500 contributors, 370,000 lines of code, and 500 production deployments. The author proposes that Spark will become a unified engine for all data sources, workloads, and environments.
This document provides a history and market overview of Apache Spark. It discusses the motivation for distributed data processing due to increasing data volumes, velocities and varieties. It then covers brief histories of Google File System, MapReduce, BigTable, and other technologies. Hadoop and MapReduce are explained. Apache Spark is introduced as a faster alternative to MapReduce that keeps data in memory. Competitors like Flink, Tez and Storm are also mentioned.
We are a company driven by inquisitive data scientists, having developed a pragmatic and interdisciplinary approach, which has evolved over the decades working with over 100 clients across multiple industries. Combining several Data Science techniques from statistics, machine learning, deep learning, decision science, cognitive science, and business intelligence, with our ecosystem of technology platforms, we have produced unprecedented solutions. Welcome to the Data Science Analytics team that can do it all, from architecture to algorithms.
Our practice delivers data driven solutions, including Descriptive Analytics, Diagnostic Analytics, Predictive Analytics, and Prescriptive Analytics. We employ a number of technologies in the area of Big Data and Advanced Analytics such as DataStax (Cassandra), Databricks (Spark), Cloudera, Hortonworks, MapR, R, SAS, Matlab, SPSS and Advanced Data Visualizations.
This presentation is designed for Spark Enthusiasts to get started and details of the course are below.
1. Introduction to Apache Spark
2. Functional Programming + Scala
3. Spark Core
4. Spark SQL + Parquet
5. Advanced Libraries
6. Tips & Tricks
7. Where do I go from here?
Spark Under the Hood - Meetup @ Data Science London (Databricks)
The document summarizes a meetup on Apache Spark hosted by Data Science London. It introduces the speakers - Sameer Farooqui, Doug Bateman, and Jon Bates - and their backgrounds in data science and Spark training. The agenda includes talks on a power plant predictive modeling demo using Spark and different approaches to parallelizing machine learning algorithms in Spark like model, divide and conquer, and data parallelism. It also provides overviews of Spark's machine learning library MLlib and common algorithms. The goal is for attendees to learn about Spark's unified engine and how to apply different machine learning techniques at scale.
Spark's distributed programming model uses resilient distributed datasets (RDDs) and a directed acyclic graph (DAG) approach. RDDs support transformations like map, filter, and actions like collect. Transformations are lazy and form the DAG, while actions execute the DAG. RDDs support caching, partitioning, and sharing state through broadcasts and accumulators. The programming model aims to optimize the DAG through operations like predicate pushdown and partition coalescing.
Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark (Chris Fregly)
1. The document discusses various techniques for generating high-quality recommendations using Apache Spark including parallelism, performance optimizations, real-time streaming, and machine learning algorithms.
2. It demonstrates Spark's high-level libraries like Spark Streaming, Spark SQL, GraphX, and MLlib for tasks such as generating recommendations, computing page rank, and training word embedding models.
3. The goals of the talk are to show how to build a recommendation engine in Spark that can perform personalized recommendations using techniques like collaborative filtering, content-based filtering, and similarity joins.
Jump Start with Apache Spark 2.0 on Databricks (Anyscale)
This document provides an agenda for a 3+ hour workshop on Apache Spark 2.x on Databricks. It includes introductions to Databricks, Spark fundamentals and architecture, new features in Spark 2.0 like unified APIs, and workshops on DataFrames/Datasets, Spark SQL, and structured streaming concepts. The agenda covers lunch and breaks and is divided into hour and half hour segments.
http://bit.ly/1BTaXZP – This presentation was given by Marco Vasquez, Data Scientist at MapR, at the Houston Hadoop Meetup
These are my slides from the ebiznext workshop: Introduction to Apache Spark.
Please download code sources from https://github.com/MohamedHedi/SparkSamples
Apache Spark is a cluster computing framework that allows for fast, easy, and general processing of large datasets. It extends the MapReduce model to support iterative algorithms and interactive queries. Spark uses Resilient Distributed Datasets (RDDs), which allow data to be distributed across a cluster and cached in memory for faster processing. RDDs support transformations like map, filter, and reduce and actions like count and collect. This functional programming approach allows Spark to efficiently handle iterative algorithms and interactive data analysis.
This document provides an introduction to Apache Spark, including its architecture and programming model. Spark is a cluster computing framework that provides fast, in-memory processing of large datasets across multiple cores and nodes. It improves upon Hadoop MapReduce by allowing iterative algorithms and interactive querying of datasets through its use of resilient distributed datasets (RDDs) that can be cached in memory. RDDs act as immutable distributed collections that can be manipulated using transformations and actions to implement parallel operations.
This document provides an overview of Apache Spark, including its core concepts, transformations and actions, persistence, parallelism, and examples. Spark is introduced as a fast and general engine for large-scale data processing, with advantages like in-memory computing, fault tolerance, and rich APIs. Key concepts covered include its resilient distributed datasets (RDDs) and lazy evaluation approach. The document also discusses Spark SQL, streaming, and integration with other tools.
Spark is a fast and general engine for large-scale data processing. It runs programs up to 100x faster than Hadoop in memory, and 10x faster on disk. Spark supports Scala, Java, Python and can run on standalone, YARN, or Mesos clusters. It provides high-level APIs for SQL, streaming, machine learning, and graph processing.
Apache Spark is a fast, general-purpose, and easy-to-use cluster computing system for large-scale data processing. It provides APIs in Scala, Java, Python, and R. Spark is versatile and can run on YARN/HDFS, standalone, or Mesos. It leverages in-memory computing to be faster than Hadoop MapReduce. Resilient Distributed Datasets (RDDs) are Spark's abstraction for distributed data. RDDs support transformations like map and filter, which are lazily evaluated, and actions like count and collect, which trigger computation. Caching RDDs in memory improves performance of subsequent jobs on the same data.
Apache Spark - Sneha Challa - Google Pittsburgh - Aug 25th (Sneha Challa)
The document is a presentation about Apache Spark given on August 25th, 2015 in Pittsburgh by Sneha Challa. It introduces Spark as a fast and general cluster computing engine for large-scale data processing. It discusses Spark's Resilient Distributed Datasets (RDDs) and transformations/actions. It provides examples of Spark APIs like map, reduce, and explains running Spark on standalone, Mesos, YARN, or EC2 clusters. It also covers Spark libraries like MLlib and running machine learning algorithms like k-means clustering and logistic regression.
A lecture on Apache Spark, the well-known open source cluster computing framework. The course consisted of three parts: a) installing the environment through Docker, b) an introduction to Spark as well as advanced features, and c) hands-on training on three (out of five) of its APIs, namely Core, SQL / DataFrames, and MLlib.
Knoldus organized a Meetup on 1 April 2015. In this Meetup, we introduced Spark with Scala. Apache Spark is a fast and general engine for large-scale data processing. Spark is used at a wide range of organizations to process large datasets.
Introduction to Apache Spark. With an emphasis on the RDD API, Spark SQL (DataFrame and Dataset API) and Spark Streaming.
Presented at the Desert Code Camp:
http://oct2016.desertcodecamp.com/sessions/all
Boston Apache Spark User Group (the Spahk group) - Introduction to Spark - 15... (spinningmatt)
This document provides an introduction to Apache Spark, including:
- A brief history of Spark, which started at UC Berkeley in 2009 and was donated to the Apache Foundation in 2013.
- An overview of what Spark is - an open-source, efficient, and productive cluster computing system that is interoperable with Hadoop.
- Descriptions of Spark's core abstractions including Resilient Distributed Datasets (RDDs), transformations, actions, and how it allows loading and saving data.
- Mentions of Spark's machine learning, SQL, streaming, and graph processing capabilities through projects like MLlib, Spark SQL, Spark Streaming, and GraphX.
Ten tools for ten big data areas 03_Apache Spark (Will Du)
Apache Spark is a fast and general-purpose cluster computing system for large-scale data processing. It provides functions for distributed processing of large datasets across clusters using a concept called resilient distributed datasets (RDDs). RDDs allow in-memory cluster computing to improve performance. Spark also supports streaming, SQL, machine learning, and graph processing.
This document discusses Spark, an open-source cluster computing framework. It begins with an introduction to distributed computing problems related to processing large datasets. It then provides an overview of Spark, including its core abstraction of resilient distributed datasets (RDDs) and how Spark builds on the MapReduce model. The rest of the document demonstrates Spark concepts like transformations and actions on RDDs and the use of key-value pairs. It also discusses SparkSQL and shows examples of finding the most retweeted tweet using core Spark and SparkSQL.
This document provides an introduction and overview of Apache Spark. It discusses why in-memory computing is important for speed, compares Spark and Ignite, describes what Spark is and how it works using Resilient Distributed Datasets (RDDs) and a directed acyclic graph (DAG) model. It also provides examples of Spark operations on RDDs and shows a word count example in Java, Scala and Python.
This document provides an overview of Apache Spark, an open-source cluster computing framework. It discusses Spark's history and community growth. Key aspects covered include Resilient Distributed Datasets (RDDs) which allow transformations like map and filter, fault tolerance through lineage tracking, and caching data in memory or disk. Example applications demonstrated include log mining, machine learning algorithms, and Spark's libraries for SQL, streaming, and machine learning.
Spark real world use cases and optimizations (Gal Marder)
This document provides an overview of Spark, its core abstraction of resilient distributed datasets (RDDs), and common transformations and actions. It discusses how Spark partitions and distributes data across a cluster, its lazy evaluation model, and the concept of dependencies between RDDs. Common use cases like word counting, bucketing user data, finding top results, and analytics reporting are demonstrated. Key topics covered include avoiding expensive shuffle operations, choosing optimal aggregation methods, and potentially caching data in memory.
This document provides an introduction to Apache Spark presented by Vincent Poncet of IBM. It discusses how Spark is a fast, general-purpose cluster computing system for large-scale data processing. It is faster than MapReduce, supports a wide range of workloads, and is easier to use with APIs in Scala, Python, and Java. The document also provides an overview of Spark's execution model and its core API called resilient distributed datasets (RDDs).
This document provides an overview of Apache Spark and machine learning using Spark. It introduces the speaker and objectives. It then covers Spark concepts including its architecture, RDDs, transformations and actions. It demonstrates working with RDDs and DataFrames. Finally, it discusses machine learning libraries available in Spark like MLib and how Spark can be used for supervised machine learning tasks.
4. Spark in a Nutshell
• General cluster computing platform:
• Distributed in-memory computational framework.
• SQL, Machine Learning, Stream Processing, etc.
• Easy to use, powerful, high-level API:
• Scala, Java, Python and R.
6. High Performance
• In-memory cluster computing.
• Ideal for iterative algorithms.
• Faster than Hadoop:
• 10x on disk.
• 100x in memory.
7. Brief History
• Originally developed in 2009, UC Berkeley AMP Lab.
• Open-sourced in 2010.
• As of 2014, Spark is a top-level Apache project.
• Fastest open-source engine for sorting 100 TB:
• Won the 2014 Daytona GraySort contest.
• Throughput: 4.27 TB/min
8. Who uses Spark,
and for what?
A. Data Scientists:
• Analyze and model data.
• Data transformations and prototyping.
• Statistics and Machine Learning.
B. Software Engineers:
• Implement production data processing systems.
• Require a reasonable API for distributed processing.
• Reliable, high performance, easy to monitor platform.
9. Resilient Distributed Dataset
RDD is an immutable and partitioned collection:
• Resilient: it can be recreated when data in memory is lost.
• Distributed: stored in memory across the cluster.
• Dataset: data that comes from a file or is created programmatically.
(Diagram: an RDD and its partitions)
10. Resilient Distributed Datasets
• Feels like coding using typical Scala collections.
• RDD can be built:
1. Directly from a data source (e.g., text file, HDFS, etc.),
2. or by applying a transformation to other RDD(s).
• Main features:
• RDDs are computed lazily (see the sketch below).
• Automatically rebuilt on failure.
• Persistence for reuse (RAM and/or disk).
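A minimal sketch of the lazy evaluation noted above ("server.log" is a hypothetical file name):
val lines = sc.textFile("server.log")           // nothing is read yet
val errors = lines.filter(_.contains("ERROR"))  // still no work is done
errors.count()                                  // the action triggers the whole chain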
12. Spark Shell
$ cd spark
$ ./bin/spark-shell
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.2.1
      /_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
scala>
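A minimal first command at the prompt above; the result shown is the expected value, not captured output:
scala> sc.parallelize(1 to 100).filter(_ % 3 == 0).count()
res0: Long = 33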
14. Initiate Spark Context
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp extends App {
  val conf = new SparkConf().setAppName("Hello Spark")
  val sc = new SparkContext(conf)
}
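To run SimpleApp outside the shell, the packaged jar is handed to spark-submit; a minimal sketch, assuming an sbt build (the jar name below is hypothetical):
$ sbt package
$ ./bin/spark-submit --class SimpleApp --master local[4] target/scala-2.10/simple-app_2.10-1.0.jar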
15. Rich, High-level API
map, filter, sort, groupBy, union, join, …
reduce, count, fold, reduceByKey, groupByKey, cogroup, zip, …
sample, take, first, partitionBy, mapWith, pipe, save, …
16. Rich, High-level API
map, filter, sort, groupBy, union, join, …
reduce, count, fold, reduceByKey, groupByKey, cogroup, zip, …
sample, take, first, partitionBy, mapWith, pipe, save, …
17. Loading and Saving
• File Systems: Local FS, Amazon S3 and HDFS.
• Supported formats: Text files, JSON, Hadoop sequence files,
Parquet files, protocol buffers, and object files.
• Structured data with Spark SQL: Hive, JSON, JDBC,
Cassandra, HBase and ElasticSearch.
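As a small illustration of the structured-data path, a minimal Spark SQL sketch for the Spark 1.x API used in this deck ("events.json" is a hypothetical file):
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val events = sqlContext.jsonFile("events.json")   // schema is inferred from the JSON records
events.registerTempTable("events")
sqlContext.sql("SELECT * FROM events LIMIT 10").collect()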
18. Create RDDs
// sc: SparkContext instance
// Scala List to RDD
val rdd0 = sc.parallelize(List(1, 2, 3, 4))
// Load lines of a text file
val rdd1 = sc.textFile("path/to/filename.txt")
// Load a file from HDFS
val rdd2 = sc.hadoopFile("hdfs://master:port/path")
// Load lines of a compressed text file
val rdd3 = sc.textFile("file:///path/to/compressedText.gz")
// Load lines of multiple files
val rdd4 = sc.textFile("s3n://log-files/2014/*.log")
19. RDD Operations
1. Transformations: define new RDDs based on the current one(s),
e.g., filter, map, reduceByKey, groupBy, etc. (RDD → new RDD)
2. Actions: return values to the driver, e.g., count, sum, collect, etc. (RDD → value)
20. Transformations (I): basics
val nums = sc.parallelize(List(1, 2, 3))
// Pass each element through a function
val squares = nums.map(x => x * x)       // {1, 4, 9}
// Keep elements passing a predicate
val even = squares.filter(_ % 2 == 0)    // {4}
// Map each element to zero or more others
val mn = nums.flatMap(x => 1 to x)       // {1, 1, 2, 1, 2, 3}
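A few more of the transformations listed earlier (union, groupBy, sample), as a minimal sketch:
val a = sc.parallelize(List(1, 2, 3, 4))
val b = sc.parallelize(List(3, 4, 5))
a.union(b)            // {1, 2, 3, 4, 3, 4, 5}
a.groupBy(_ % 2)      // {(0, [2, 4]), (1, [1, 3])}
a.sample(false, 0.5)  // random subset, sampled without replacement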
24. Transformations (III): key - value
// RDD[(URL, page_name)] tuples
val names = sc.textFile("names.txt").map(…)…
// RDD[(URL, visit_counts)] tuples
val visits = sc.textFile("counts.txt").map(…)…
// RDD[(URL, (visit counts, page name))]
val joined = visits.join(names)
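reduceByKey is the usual way to aggregate values per key; a minimal sketch on pairs built in place:
val pageVisits = sc.parallelize(List(("a.html", 1), ("b.html", 1), ("a.html", 1)))
pageVisits.reduceByKey(_ + _).collect()  // Array((a.html,2), (b.html,1)) - order may vary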
25. Basics: Actions
val nums = sc.parallelize(List(1, 2, 3))
// Count number of elements
nums.count()                         // = 3
// Merge with an associative function
nums.reduce((l, r) => l + r)         // = 6
// Write elements to a text file
nums.saveAsTextFile("path/to/filename.txt")
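A few more common actions, continuing with the nums RDD above; the results in the comments are the expected values:
nums.collect()       // Array(1, 2, 3) - copies the whole RDD to the driver
nums.take(2)         // Array(1, 2)
nums.first()         // 1
nums.fold(0)(_ + _)  // 6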
28. Units of Execution Model
1. Job: work required to compute an RDD.
2. Each job is divided into stages.
3. Task:
• Unit of work within a stage.
• Corresponds to one RDD partition.
(Diagram: a Job is split into Stage 0, Stage 1, …, and each stage into Task 0, Task 1, …)
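A minimal sketch of how one action maps to a job with stages; the shuffle introduced by reduceByKey starts a new stage ("pages.txt" is a hypothetical file):
val counts = sc.textFile("pages.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)          // shuffle dependency => stage boundary
counts.count()                 // the action submits one job
println(counts.toDebugString)  // lineage; indentation marks the shuffle boundaries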
36. Persistence
• When we use the same RDD multiple times:
• Spark will recompute the RDD.
• Expensive for iterative algorithms.
• Spark can persist RDDs, avoiding recomputations.
37. Levels of persistence
val result = input.map(expensiveComputation)
result.persist(LEVEL)

LEVEL                  | Space Consumption | CPU time | In memory | On disk
MEMORY_ONLY (default)  | High              | Low      | Y         | N
MEMORY_ONLY_SER        | Low               | High     | Y         | N
MEMORY_AND_DISK        | High              | Medium   | Some      | Some
MEMORY_AND_DISK_SER    | Low               | High     | Some      | Some
DISK_ONLY              | Low               | High     | N         | Y
38. Persistence Behaviour
• Each node will store its computed partition.
• In case of a failure, Spark recomputes the
missing partitions.
• Least Recently Used cache policy:
• Memory-only: recompute partitions.
• Memory-and-disk: recompute and write to disk.
• Manually remove from cache: unpersist()
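A minimal sketch of persisting an RDD that several actions reuse ("numbers.txt" is a hypothetical file with one number per line):
import org.apache.spark.storage.StorageLevel
val nums = sc.textFile("numbers.txt").map(_.toDouble)
nums.persist(StorageLevel.MEMORY_AND_DISK)  // computed once, reused below
val count = nums.count()        // first action reads the file and fills the cache
val sum = nums.reduce(_ + _)    // second action is served from the cache
val mean = sum / count
nums.unpersist()                // manually remove from the cache when done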
39. Shared Variables
1. Accumulators: aggregate values from worker
nodes back to the driver program.
2. Broadcast variables: distribute values to all
worker nodes.
40. Accumulator Example
val input = sc.textFile("input.txt")
// driver only: initialize the accumulators
val sum = sc.accumulator(0)
val count = sc.accumulator(0)
input
  .filter(line => line.size > 0)
  .flatMap(line => line.split(" "))
  .map(word => word.size)
  .foreach { size =>
    sum += size    // increment accumulator
    count += 1     // increment accumulator
  }
val average = sum.value.toDouble / count.value
41. Accumulators and Fault Tolerance
• Safe: Updates inside actions will only be applied once.
• Unsafe: Updates inside transformations may be applied more than once!!!
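A minimal sketch of the distinction above ("input.txt" is a hypothetical file):
val lines = sc.textFile("input.txt")
val blanks = sc.accumulator(0)
// Safe: the update runs inside an action, so it is applied exactly once.
lines.foreach { line => if (line.isEmpty) blanks += 1 }
// Unsafe: an update inside a transformation may run again if the stage is
// recomputed (failure, cache eviction), so the count could be inflated:
// val tagged = lines.map { line => if (line.isEmpty) blanks += 1; line }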
42. Broadcast Variables
• Closures and the variables they use are sent separately to each task.
• We may want to share some variable (e.g., a Map) across tasks/operations.
• This can be done efficiently with broadcast variables.
43. Example without broadcast variables
// RDD[(String, String)]
val names = …   // load (URL, page name) tuples
// RDD[(String, Int)]
val visits = …   // load (URL, visit counts) tuples
// Map[String, String]
val pageMap = names.collect.toMap
val joined = visits.map { case (url, counts) =>
  (url, (pageMap(url), counts))
}
// pageMap is sent along with every task
44. Example with broadcast variables
// RDD[(String, String)]
val names = …   // load (URL, page name) tuples
// RDD[(String, Int)]
val visits = …   // load (URL, visit counts) tuples
// Map[String, String]
val pageMap = names.collect.toMap
val bcMap = sc.broadcast(pageMap)   // broadcast variable
val joined = visits.map { case (url, counts) =>
  (url, (bcMap.value(url), counts))
}
// pageMap is sent only to each node once