Install Spark on Ubuntu: A Step-by-Step Guide

This document provides a step-by-step guide on how to install Apache Spark on an Ubuntu system, including prerequisites, installation of required packages, and configuration of environment variables. It details the process of downloading Spark, setting up a master and slave server, and testing the Spark shell with both Scala and Python. The tutorial concludes with basic commands for starting and stopping the Spark master and worker processes.

Uploaded by

eswarannihil

How to Install Spark on Ubuntu

Prerequisites

• An Ubuntu system.
• Access to a terminal or command line.
• A user with sudo or root permissions.

Install Packages Required for Spark

Before downloading and setting up Spark, you need to install the necessary
dependencies. This step installs the following packages:
• Scala
• Git

Open a terminal window and run the following command to install both
packages at once. Spark also needs a JDK; if you installed Hadoop first, the
JDK is already present and you do not need to install it again:

$ sudo apt install scala git -y

Once the process completes, verify the installed dependencies by running
these commands:

$ java -version; javac -version; scala -version; git --version

The output prints the versions if the installation completed successfully for all
packages.
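
The same check can be scripted. The following is a minimal sketch, assuming a POSIX shell; the function name missing_tools is hypothetical and not part of Spark or Ubuntu. It only reports whether each command can be found on the PATH, not which version is installed.

```shell
#!/bin/sh
# Print the name of every command given as an argument that is NOT on the PATH.
# missing_tools is a hypothetical helper name used for illustration.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the command.
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Usage: report any of the tutorial's dependencies that are still missing.
missing=$(missing_tools java javac scala git)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All dependencies found."
fi
```

If anything is listed as missing, install it before continuing with the download step.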
Download and Set Up Spark on Ubuntu
Now, you need to download the version of Spark you want from the project
website. We will go for Spark 3.0.1 with Hadoop 2.7, which matches the archive
name used in the commands in this section, as it is the latest version
at the time of writing this article.

Or, you can go to [Link] to choose the version you want.

Use the wget command and the direct link to download the Spark archive:

$ wget[Link]
[Link]

When the download completes, you will see the saved message.
Now, extract the saved archive using the tar command:

$ tar xvf spark-*

Let the process complete. The output shows the files that are being unpacked
from the archive.

Finally, move the unpacked directory spark-3.0.1-bin-hadoop2.7 to
the /opt/spark directory.
Use the mv command to do so:

$ sudo mv spark-3.0.1-bin-hadoop2.7 /opt/spark

The terminal returns no response if it successfully moves the directory. If you
mistype the name, you will get a message similar to:

mv: cannot stat 'spark-3.0.1-bin-hadoop2.7': No such file or directory

Configure Spark Environment

Before starting a master server, you need to configure environment variables.

You can add the export paths by editing the .profile file in your home
directory with the editor of your choice, such as nano or vim.

For example, to use nano, enter:

$ nano ~/.profile

When the profile loads, scroll to the bottom of the file.

Then, add these three lines:

export SPARK_HOME=/opt/spark

export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

export PYSPARK_PYTHON=/usr/bin/python3

Exit and save changes when prompted.

When you finish adding the paths, load the .profile file in the command line by
typing:

$ source ~/.profile
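
If you prefer not to edit the file by hand, the three export lines can be appended from a script. This is only a sketch under stated assumptions: add_line_once and the PROFILE_FILE variable are hypothetical names, and by default the script writes to a scratch file rather than your real profile.

```shell
#!/bin/sh
# Append a line to a file only if that exact line is not already present.
# add_line_once is a hypothetical helper name used for illustration.
add_line_once() {
    file=$1
    line=$2
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string.
    # "--" keeps a line that starts with "-" from being parsed as an option.
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# The target defaults to a scratch file so the sketch is safe to run;
# set PROFILE_FILE="$HOME/.profile" to modify your real profile.
profile=${PROFILE_FILE:-$(mktemp)}
add_line_once "$profile" 'export SPARK_HOME=/opt/spark'
add_line_once "$profile" 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin'
add_line_once "$profile" 'export PYSPARK_PYTHON=/usr/bin/python3'
```

Because each line is added only once, running the script again does not duplicate the exports.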
Start Standalone Spark Master Server
Now that you have completed configuring your environment for Spark, you
can start a master server.

In the terminal, type:

$ [Link]

To view the Spark Web user interface, open a web browser and enter the
localhost IP address on port 8080.

[Link]

The page shows your Spark URL, status information for workers, hardware
resource utilization, etc.

The Spark Master’s Web UI is served on port 8080 of your machine, which you
can address by its hostname. In our case, this is yixi-virtualbox:8080. So,
there are three possible ways to load Spark Master’s Web UI:
1. [Link]:8080
2. localhost:8080
3. yixi-virtualbox:8080
Start Spark Slave Server (Start a Worker Process)
In this single-server, standalone setup, we will start one slave server along
with the master server.

To do so, run the following command in this format:

[Link] spark://master:port

The master in the command can be an IP or hostname.

In our case it is yixi-virtualbox:

$ [Link] spark://yixi-virtualbox:7077
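
The spark://master:port format can be assembled in a small helper, which avoids typos when the same URL is passed to several commands. This is a minimal sketch: spark_master_url is a hypothetical function name, and the default port 7077 is simply the one used in this guide.

```shell
#!/bin/sh
# Build a standalone master URL of the form spark://host:port.
# spark_master_url is a hypothetical helper; 7077 is the port used in this guide.
spark_master_url() {
    # Fall back to this machine's hostname when no host is given.
    host=${1:-$(hostname)}
    port=${2:-7077}
    echo "spark://${host}:${port}"
}

spark_master_url yixi-virtualbox     # prints spark://yixi-virtualbox:7077
spark_master_url 192.168.1.10 7078   # prints spark://192.168.1.10:7078
```

The printed value is what you pass as the last argument when starting a worker.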

Now that a worker is up and running, if you reload Spark Master’s Web UI,
you should see it in the list of workers.

Specify Resource Allocation for Workers

The default setting when starting a worker on a machine is to use all available
CPU cores. You can specify the number of cores by passing the -c flag to
the start-slave command.
For example, to start a worker and assign only one CPU core to it, enter this
command:

$ [Link] -c 1 spark://yixi-virtualbox:7077

Reload Spark Master’s Web UI to confirm the worker’s configuration.

Similarly, you can assign a specific amount of memory when starting a
worker. The default setting is to use whatever amount of RAM your machine
has, minus 1GB.
To start a worker and assign it a specific amount of memory, add the -m
option and a number. For gigabytes, use G and for megabytes, use M.

For example, to start a worker with 512MB of memory, enter this command:

$ [Link] -m 512M spark://yixi-virtualbox:7077

Reload the Spark Master Web UI to view the worker’s status and confirm the
configuration.
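
To see roughly what the default memory allocation would be on your machine, you can apply the "total RAM minus 1GB" rule yourself. This is an approximate sketch, assuming a Linux system that exposes /proc/meminfo; default_worker_mem_mb is a hypothetical helper name, and the value the Web UI reports may differ slightly.

```shell
#!/bin/sh
# Estimate the default worker memory described in this section: total RAM minus 1 GB.
# default_worker_mem_mb is a hypothetical helper; it takes total RAM in megabytes.
default_worker_mem_mb() {
    total_mb=$1
    echo $(( total_mb - 1024 ))
}

# On Linux, MemTotal in /proc/meminfo is reported in kB.
if [ -r /proc/meminfo ]; then
    total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    total_kb=${total_kb:-0}
    echo "Estimated default worker memory: $(default_worker_mem_mb $(( total_kb / 1024 )))M"
fi
```

Comparing this estimate with the figure on the Web UI is a quick way to confirm whether a -m value took effect.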

Test Spark Shell

After you finish the configuration and start the master and slave server, test if
the Spark shell works.

Load the shell by entering:

$ spark-shell
You should get a screen with notifications and Spark information. Scala is the
default interface, so that shell loads when you run spark-shell.

The output ends with the Spark version banner and a scala> prompt.

Type :q and press Enter to exit Scala.

Test Python in Spark

If you do not want to use the default Scala interface, you can switch to Python.

Make sure you quit Scala and then run this command:

$ pyspark

The resulting output looks similar to the previous one. Towards the bottom,
you will see the version of Python.
To exit this shell, type quit() and hit Enter.

Basic Commands to Start and Stop Master Server and Workers
Below are the basic commands for starting and stopping the Apache Spark
master server and workers. Since this setup is only for one machine, the
scripts you run default to the localhost.

To start a master server instance on the current machine, run the command
we used earlier in the guide:

$ [Link]

To stop the master instance started by executing the script above, run:

$ [Link]

To stop a running worker process, enter this command:

$ [Link]

The Spark Master page, in this case, shows the worker status as DEAD.
Conclusion
This tutorial showed you how to install Spark on an Ubuntu machine, as
well as the necessary dependencies.

The setup in this guide enables you to perform basic tests before you start
configuring a Spark cluster and performing advanced actions.

