Install Spark on Ubuntu: A Step-by-Step Guide

This document provides a step-by-step guide on how to install Apache Spark on an Ubuntu system, including prerequisites, installation of required packages, and configuration of environment variables. It details the process of downloading Spark, setting up a master and slave server, and testing the Spark shell with both Scala and Python. The tutorial concludes with basic commands for starting and stopping the Spark master and worker processes.

Uploaded by eswarannihil
Copyright © All Rights Reserved

How to Install Spark on Ubuntu

Prerequisites

• An Ubuntu system.
• Access to a terminal or command line.
• A user with sudo or root permissions.

Install Packages Required for Spark


Before downloading and setting up Spark, you need to install the necessary
dependencies. This step installs the following packages:
• Scala
• Git

Open a terminal window and run the following command to install both
packages at once (if you installed Hadoop first, you do not need to install the
JDK this time):

$ sudo apt install scala git -y

Once the process completes, verify the installed dependencies by running
these commands:

$ java -version; javac -version; scala -version; git --version

The output prints the versions if the installation completed successfully for all
packages.
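
The same check can be wrapped in a small loop that reports which tools are missing instead of printing every version. This is only a minimal sketch: check_tools is a made-up helper name, and the tool list simply mirrors the commands used above.

```shell
# check_tools is a hypothetical helper, not part of Spark or Ubuntu.
# It prints one line per tool and returns non-zero if any tool is missing.
check_tools() {
    local missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# The tools this guide relies on.
check_tools java javac scala git || echo "Install the missing packages before continuing."
```

A non-zero return value makes the helper usable in scripts that should stop before the download step when a dependency is absent.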

Download and Set Up Spark on Ubuntu

Now, you need to download the version of Spark you want from the Apache Spark
website. We will use Spark 3.0.1 with Hadoop 2.7, as it is the latest version
at the time of writing this article.

Or, you can go to [Link] to choose the version you want.

Use the wget command and the direct link to download the Spark archive:

$ wget [Link]
[Link]

When the download completes, you will see the saved message.
Now, extract the saved archive using the tar command:

$ tar xvf spark-*

Let the process complete. The output shows the files that are being unpacked
from the archive.

Finally, move the unpacked directory spark-3.0.1-bin-hadoop2.7 to
the /opt/spark directory.

Use the mv command to do so:

$ sudo mv spark-3.0.1-bin-hadoop2.7 /opt/spark

The terminal returns no response if it successfully moves the directory. If you
mistype the name, you will get a message similar to:

mv: cannot stat 'spark-3.0.1-bin-hadoop2.7': No such file or directory
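
One way to avoid mistyping the directory name is to read it from the archive itself. The sketch below assumes GNU tar as shipped with Ubuntu and assumes the downloaded archive ends in .tgz; archive_top_dir is a hypothetical helper name.

```shell
# archive_top_dir is a hypothetical helper: it prints the top-level
# directory stored in a tar archive without extracting anything.
archive_top_dir() {
    # tar tf lists the archive members; awk keeps the first path component
    # of the first member.
    tar tf "$1" | awk -F/ 'NR == 1 { print $1 }'
}

# Assumption: the Spark archive in the current directory matches spark-*.tgz.
for archive in spark-*.tgz; do
    if [ -f "$archive" ]; then
        echo "$archive unpacks to: $(archive_top_dir "$archive")"
    fi
done
```

The printed name is the one to pass to the mv command.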

Configure Spark Environment


Before starting a master server, you need to configure environment variables.

You can add the export paths by editing the .profile file in your home
directory with the editor of your choice, such as nano or vim.

For example, to use nano, enter:

$ nano ~/.profile

When the profile loads, scroll to the bottom of the file.

Then, add these three lines:

export SPARK_HOME=/opt/spark

export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

export PYSPARK_PYTHON=/usr/bin/python3

Exit and save changes when prompted.

When you finish adding the paths, load the .profile file in the command line by
typing:

$ source ~/.profile
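
To confirm that the new settings are active in the current shell, you can print the variables and check that both Spark directories are on the PATH. This is a sketch under the assumption that Spark lives in /opt/spark as in this guide; path_has is a hypothetical helper name.

```shell
# Show the variables set in ~/.profile (or a placeholder if they are unset).
echo "SPARK_HOME=${SPARK_HOME:-<not set>}"
echo "PYSPARK_PYTHON=${PYSPARK_PYTHON:-<not set>}"

# path_has is a hypothetical helper: it succeeds if directory $1 is an
# exact entry in PATH.
path_has() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Fall back to /opt/spark, the location used in this guide.
for dir in "${SPARK_HOME:-/opt/spark}/bin" "${SPARK_HOME:-/opt/spark}/sbin"; do
    if path_has "$dir"; then
        echo "on PATH: $dir"
    else
        echo "not on PATH: $dir"
    fi
done
```

If either directory is reported as missing, re-check the export lines and run the source command again.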

Start Standalone Spark Master Server

Now that you have completed configuring your environment for Spark, you
can start a master server.

In the terminal, type:

$ [Link]

To view the Spark Web user interface, open a web browser and enter the
localhost IP address on port 8080.

[Link]

The page shows your Spark URL, status information for workers, hardware
resource utilization, etc.

The URL for Spark Master is the name of your device on port 8080. In our
case, this is yixi-virtualbox:8080. So, there are three possible ways to load
Spark Master’s Web UI:
1. [Link]:8080
2. localhost:8080
3. yixi-virtualbox:8080

Start Spark Slave Server (Start a Worker Process)

In this single-server, standalone setup, we will start one slave server along
with the master server.

To do so, run the following command in this format:

[Link] spark://master:port

The master in the command can be an IP or hostname.

In our case it is yixi-virtualbox:

$ [Link] spark://yixi-virtualbox:7077
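
If you are unsure of your machine's hostname, the master URL can be assembled from it. A minimal sketch: master_url is a hypothetical helper, uname -n is used to read the hostname, and 7077 is the port taken from the example above.

```shell
# master_url is a hypothetical helper that prints a spark:// URL.
# $1: hostname or IP (defaults to this machine's name from uname -n)
# $2: master port (defaults to 7077, the port used in this guide)
master_url() {
    local host="${1:-$(uname -n)}"
    local port="${2:-7077}"
    echo "spark://${host}:${port}"
}

master_url                        # uses the local hostname and port 7077
master_url yixi-virtualbox 7077   # prints spark://yixi-virtualbox:7077
```

The Spark URL shown at the top of the master's Web UI remains the authoritative value; compare the output against it before starting a worker.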

Now that a worker is up and running, if you reload Spark Master’s Web UI,
you should see it in the list of workers.

Specify Resource Allocation for Workers


The default setting when starting a worker on a machine is to use all available
CPU cores. You can specify the number of cores by passing the -c flag to
the start-slave command.
For example, to start a worker and assign only one CPU core to it, enter this
command:

$ [Link] -c 1 spark://yixi-virtualbox:7077

Reload Spark Master’s Web UI to confirm the worker’s configuration.

Similarly, you can assign a specific amount of memory when starting a
worker. The default setting is to use whatever amount of RAM your machine
has, minus 1GB.

To start a worker and assign it a specific amount of memory, add the -m
option and a number. For gigabytes, use G and for megabytes, use M.

For example, to start a worker with 512MB of memory, enter this command:

$ [Link] -m 512M spark://yixi-virtualbox:7077

Reload the Spark Master Web UI to view the worker’s status and confirm the
configuration.
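
The -c and -m options can also be given together. The sketch below only assembles and validates the option string; it does not start a worker. worker_args is a hypothetical helper name, and the validation rule (a whole number followed by G or M) is an assumption based on the description above.

```shell
# worker_args is a hypothetical helper, not part of Spark. It checks the
# memory value and prints the worker options on one line.
worker_args() {
    local cores="$1" memory="$2" master="$3"
    local amount="${memory%[GM]}"    # strip one trailing G or M

    # Reject values without a G or M suffix.
    if [ "$amount" = "$memory" ]; then
        echo "memory must end in G or M, for example 512M or 2G" >&2
        return 1
    fi
    # Reject empty or non-numeric amounts.
    case "$amount" in
        ''|*[!0-9]*)
            echo "memory amount must be a whole number" >&2
            return 1
            ;;
    esac

    echo "-c ${cores} -m ${memory} ${master}"
}

# One core and 512MB, using the master URL from this guide.
worker_args 1 512M spark://yixi-virtualbox:7077
```

The printed options can be appended to the worker start command shown earlier.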

Test Spark Shell


After you finish the configuration and start the master and slave server, test if
the Spark shell works.

Load the shell by entering:

$ spark-shell

You should get a screen with notifications and Spark information. Scala is the
default interface, so that shell loads when you run spark-shell.

The output ends with a banner showing the Spark version, followed by the
scala> prompt.

Type :q and press Enter to exit Scala.

Test Python in Spark


If you do not want to use the default Scala interface, you can switch to Python.

Make sure you quit Scala and then run this command:

$ pyspark

The resulting output looks similar to the previous one. Towards the bottom,
you will see the version of Python.
To exit this shell, type quit() and hit Enter.

Basic Commands to Start and Stop Master Server and Workers

Below are the basic commands for starting and stopping the Apache Spark
master server and workers. Since this setup is only for one machine, the
scripts you run default to the localhost.

To start a master server instance on the current machine, run the command
we used earlier in the guide:

$ [Link]

To stop the master instance started by executing the script above, run:

$ [Link]

To stop a running worker process, enter this command:

$ [Link]

The Spark Master page, in this case, shows the worker status as DEAD.
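
The start and stop scripts live in Spark's sbin directory, which the PATH setting from the environment step makes available. To see which of them your installation provides, you can list them. This is a sketch: list_spark_scripts is a hypothetical helper, and /opt/spark is assumed as the fallback location because it is the one used in this guide.

```shell
# list_spark_scripts is a hypothetical helper that prints the names of the
# start-* and stop-* scripts in a Spark sbin directory.
# $1: sbin directory (defaults to $SPARK_HOME/sbin, or /opt/spark/sbin)
list_spark_scripts() {
    local sbin="${1:-${SPARK_HOME:-/opt/spark}/sbin}"
    if [ ! -d "$sbin" ]; then
        echo "no such directory: $sbin" >&2
        return 1
    fi
    for script in "$sbin"/start-* "$sbin"/stop-*; do
        if [ -f "$script" ]; then
            basename "$script"
        fi
    done
}

list_spark_scripts || echo "Check that SPARK_HOME points at your Spark installation."
```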

Conclusion

This tutorial showed you how to install Spark on an Ubuntu machine, as
well as the necessary dependencies.

The setup in this guide enables you to perform basic tests before you start
configuring a Spark cluster and performing advanced actions.

