Introduction to PySpark | Distributed Computing with Apache Spark
Datasets are becoming huge. In fact, data is growing faster than processing speeds, so algorithms that involve large amounts of data and heavy computation are often run on a distributed computing system. A distributed computing system consists of nodes (networked computers) that run processes in parallel and communicate with each other when necessary.
MapReduce - The programming model used for distributed computing is known as MapReduce. The MapReduce model involves two stages, Map and Reduce.
- Map - The mapper processes each line of the input data (supplied as a file) and produces key-value pairs.
Input data → Mapper → list([key, value])
- Reduce - The reducer processes the key-value pairs produced by the Map stage. After the pairs are grouped by key, the reducer combines all the values belonging to one key and outputs a new set of key-value pairs.
list([key, value]) → group by key → list([key, list(values)]) → Reducer → list([key, value])
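To make the model concrete, here is a toy pure-Python sketch of a word-count job in the MapReduce style; the function and variable names are ours, invented just for this illustration, and no Spark is involved yet.
Python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map: one input line -> a list of (key, value) pairs
    return [(word, 1) for word in line.split()]

def reducer(key, values):
    # Reduce: all values for one key -> a single (key, value) pair
    return (key, sum(values))

input_lines = ["spark is fast", "spark is distributed"]

# Map stage
pairs = [pair for line in input_lines for pair in mapper(line)]

# Shuffle stage: group the pairs by key (sorting brings equal keys together)
pairs.sort(key=itemgetter(0))
grouped = [(k, [v for _, v in g]) for k, g in groupby(pairs, key=itemgetter(0))]

# Reduce stage
print([reducer(k, vs) for k, vs in grouped])
# [('distributed', 1), ('fast', 1), ('is', 2), ('spark', 2)]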
Spark - Spark (an open-source big-data processing engine from Apache) is a cluster computing system. It is faster than comparable cluster computing systems (such as Hadoop MapReduce) and provides high-level APIs in Python, Scala, and Java. Parallel jobs are easy to write in Spark. We will cover PySpark (Python + Apache Spark), because this makes the learning curve flatter. To install Spark on a Linux system, follow this. To run Spark on a multi-node cluster, follow this. We will see how to create RDDs, the fundamental data structure of Spark.
RDDs (Resilient Distributed Datasets) - RDDs are immutable collections of objects. Since we are using PySpark, these objects can be of multiple Python types. This will become clearer below.
SparkContext - To create a standalone Spark application, we first define a SparkContext -
Python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("Test")
# setMaster("local") - run everything on a single local machine
sc = SparkContext(conf=conf)
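As a quick aside, once sc exists, an RDD can also be created directly from a local Python collection with parallelize; this also illustrates the earlier point that the objects in an RDD can be of multiple Python types (the sample list below is made up for illustration).
Python
# distribute a local Python list across the workers as an RDD
data = [1, "two", 3.0, (4, "four")]
rdd = sc.parallelize(data)
print(rdd.collect())  # [1, 'two', 3.0, (4, 'four')]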
RDD transformations - Now that a SparkContext object has been created, we can create RDDs and apply some transformations to them.
Python
# create an RDD called lines from 'file_name.txt',
# splitting it into a minimum of 2 partitions
lines = sc.textFile("file_name.txt", 2)

# collect() returns the contents of the whole RDD
print(lines.collect())
One major advantage of using Spark is that it does not load the dataset into memory right away; lines is a pointer to the 'file_name.txt' file, and the file is not read until an action is invoked.
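To see this lazy evaluation in action, here is a small sketch (the filter condition is an arbitrary example of ours): transformations such as filter merely record what should be computed, and the file is only read when an action such as count is called.
Python
# transformation: nothing is executed or read yet
spark_lines = lines.filter(lambda line: "spark" in line)

# action: only now is file_name.txt actually read and the filter applied
print(spark_lines.count())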
A simple PySpark app to count the out-degree of each vertex of a given directed graph -
Python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("Test")
# setMaster("local") - run everything on a single local machine
sc = SparkContext(conf=conf)

def conv(line):
    # "1 2" -> (1, [2]): key = source vertex, value = one-element neighbour list
    line = line.split()
    return (int(line[0]), [int(line[1])])

lines = sc.textFile('graph.txt')
edges = lines.map(conv)

# concatenate the neighbour lists belonging to each vertex,
# then map each list to its length, i.e. the vertex's out-degree
Adj_list = edges.reduceByKey(lambda x, y: x + y)
degrees = Adj_list.mapValues(len)
print(degrees.collect())
Understanding the above code -
- Our text file is in the following format (each line represents an edge of a directed graph):
1 2
1 3
2 3
3 4
. .
. .
. .
- Large datasets may contain millions of nodes and edges.
- The first few lines set up the SparkContext, and we create the lines RDD from the input file.
- Then we transform the lines RDD into the edges RDD. The function conv acts on each line, and key-value pairs of the form (1, [2]), (1, [3]), (2, [3]), (3, [4]), ... are stored in the edges RDD.
- After this, reduceByKey concatenates the neighbour lists belonging to each key, so Adj_list holds each vertex's full neighbour list; mapValues(len) then replaces each list with its length, producing the out-degrees (1, 2), (2, 1), (3, 1), ... in the degrees RDD (see the sketch below for a quick in-memory check).
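For a quick sanity check without creating graph.txt on disk, the same pipeline can be run on an in-memory copy of the sample edges via parallelize (a test sketch of ours, reusing conv from the program above).
Python
sample_edges = ["1 2", "1 3", "2 3", "3 4"]
edges = sc.parallelize(sample_edges).map(conv)
degrees = edges.reduceByKey(lambda x, y: x + y).mapValues(len)
print(sorted(degrees.collect()))  # [(1, 2), (2, 1), (3, 1)]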
Running the code -
The above code can be run with the following commands -
$ cd /home/arik/Downloads/spark-1.6.0/
$ ./bin/spark-submit degree.py
- Replace the path in the first command with your own Spark installation path.
We will see more on how to run MapReduce tasks on a cluster of machines using Spark, and we will also go through other MapReduce tasks.