Week 01 - Environment Setup
In this activity, we will learn how to set up the environment and make it ready for processing big
data. Big data processing employs many open source software/libraries to form a software stack.
It's a challenge to ensure the compatibility of different libraries/versions from various sources. In
this unit, we will use container technology (Docker) to manage the environment, and a pre-built and
optimised image has been provided with the following software:
1. Python (3.10) as a programming language
Python 3.11 and above are not compatible with some dependencies at the time of writing
(July 2023).
2. Jupyter Lab/Jupyter Notebook as an IDE for Python development.
The Docker image has both Jupyter Lab and Jupyter Notebook pre-installed. Jupyter Lab is
generally recommended, since it is the newer interface and provides many useful features
such as built-in file management, a CSV viewer and a terminal emulator. However, Jupyter Lab
has a few known bugs with real-time visualisation (e.g. plotting stream data in real time),
due to backward compatibility issues.
3. Apache Spark (3.4.0) as a big data processing and analysis tool
4. Apache Kafka as the tool for streaming
5. (Optional, commonly used libraries) scikit-learn/NumPy/Matplotlib (latest versions as of
July 2023)
We will install Docker Desktop first, then go through the steps to use Jupyter notebooks with it.
As this is a big data unit, we will only learn the basic Docker commands needed to manage our environment.
Week/Session 1: Docker Installation and Environment Setup
Minimum Requirements:
A Windows 10/11, Linux or macOS laptop/desktop that is less than 5 years old, with a minimum
of 8GB RAM and a solid state drive (SSD).
For Windows/Linux laptops: avoid "U" series CPUs; get a standard or "H"/"HK" version if possible.
For Mac, an 8th gen Intel CPU or M1/M2 is recommended.
2) Find “Docker” in your applications and start it, then click on the Docker icon
and select “Preferences”.
3) In the “General” tab, change the file sharing implementation to “VirtioFS”. This option
improves IO performance on bind mounts.
4) In the “Resources” tab, change the CPUs, Memory and Virtual disk limits
to suit your laptop's specification. We recommend allocating 8GB RAM to
Docker if possible.
c) Windows 10/11
1) Prerequisites: Windows Subsystem for Linux (WSL 2) is required. Please
follow the instructions from Microsoft: Install WSL | Microsoft Learn
2) Click on “Start”, find “Microsoft Store”, then search for and install Ubuntu 22.04 (LTS). This
provides a base Linux distribution for Docker.
3) Double-click on the .exe file and follow the instructions; all default settings
work fine.
4) (Optional) Please refer to steps 3 and 4 above for performance tuning.
d) Linux
Docker Engine is recommended instead of Docker Desktop. Please follow the instructions
for your distribution.
Ubuntu: Install Docker Engine on Ubuntu | Docker Documentation
Debian: Install Docker Engine on Debian | Docker Documentation
CentOS: Install Docker Engine on CentOS | Docker Documentation
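Once Docker is installed on any of the platforms above, a quick check along the following lines can confirm that the command-line client works and show the CPUs and memory the engine can use. This is a minimal sketch: check_docker is a hypothetical helper name, and on Linux the docker commands may need sudo depending on how Docker Engine was set up.

```shell
# check_docker is a hypothetical helper name, not a Docker command.
check_docker() {
    if command -v docker >/dev/null 2>&1; then
        # Version of the installed command-line client
        docker --version
        # CPUs and memory (in bytes) available to the Docker engine;
        # this call fails if the engine (or Docker Desktop) is not running
        docker info --format 'CPUs: {{.NCPU}}  Memory (bytes): {{.MemTotal}}'
    else
        echo "docker not found on PATH"
    fi
}

check_docker
```

If the reported CPUs or memory look too low on Docker Desktop, revisit the resource limits described in the performance tuning steps.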
Parameters:
1) -v (shown in blue in the original handout): the local folder to be bound to the home directory
inside the container. Note that a Windows path needs to include the drive letter and use
backslashes (\). For example:
docker run -v D:\5202_docker\labs:/home/student -p 8888:8888 -p 4040:4040 monashfit/fit5202-pyspark:latest jupyter notebook
You can change the host folder (the red part, D:\5202_docker\labs in this example) to your own
folder. Make sure this folder exists before running the command; all your Jupyter notebook files
will be stored in this folder.
2) -p (8888 and 4040): Port mapping from host to container. 8888 is the default port for
Jupyter notebook and 4040 is the default port for the Spark UI. The number to the left of
the colon is the host port; the number to the right is the container port.
3) “jupyter notebook”: execute jupyter notebook inside the container. This can also be
“jupyter lab” if you prefer.
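On macOS or Linux the same command uses forward slashes in the host path. The sketch below only prints the command so that it can be reviewed before running. build_run_cmd is a hypothetical helper, and the folder 5202_docker/labs under the home directory and the host port 8889 are assumed examples (a different host port can help when 8888 is already in use on your machine).

```shell
# build_run_cmd HOST_DIR [HOST_PORT]
# Prints (does not run) a docker run command. Hypothetical helper, not part of Docker.
build_run_cmd() {
    host_dir=$1            # existing folder on your machine, bound to /home/student
    host_port=${2:-8888}   # host port for Jupyter; the container side stays 8888
    printf 'docker run -v "%s:/home/student" -p %s:8888 -p 4040:4040 monashfit/fit5202-pyspark:latest jupyter notebook\n' \
        "$host_dir" "$host_port"
}

# Assumed example values: a labs folder in the home directory, Jupyter on host port 8889.
build_run_cmd "$HOME/5202_docker/labs" 8889
```

Create the folder first (for example with mkdir -p), then paste the printed line into a terminal. If you map Jupyter to a host port other than 8888, use that host port in the browser URL.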
The “docker run” command is only required the first time you start a container. When you
finish using it, you can press Ctrl+C or use “docker stop [container id]” to stop it, and
“docker start [container id]” to start it again later on.
Please copy and paste the URL, including the token, into your preferred web browser.
Now that you have a working environment, please continue exploring the Jupyter notebooks from
Moodle.
● What is a shorthand?
When you execute the “docker ps” command, you will see a long container ID. It is required
by some commands (e.g. docker stop [container id]). To avoid typing the long ID every time,
you can use just its first 1-2 characters instead of the whole ID, as long as they
identify a single container unambiguously.
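The uniqueness rule can be illustrated without touching Docker. The sketch below counts how many IDs in a list start with a given prefix; a prefix works as a shorthand only when exactly one ID matches. count_matches is a hypothetical helper and the container IDs are made up.

```shell
# count_matches PREFIX ID...
# Prints how many of the given IDs start with PREFIX (hypothetical helper).
count_matches() {
    prefix=$1
    shift
    n=0
    for id in "$@"; do
        case $id in
            "$prefix"*) n=$((n + 1)) ;;
        esac
    done
    echo "$n"
}

# Made-up container IDs, standing in for what "docker ps" would list.
count_matches 3f 3f9a1c2b7d4e a81b0c3d5e6f   # prints 1: "3f" is unique, so "docker stop 3f" is unambiguous
count_matches 3  3f9a1c2b7d4e 34bb0c3d5e6f   # prints 2: "3" is ambiguous, so a longer prefix is needed
```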
● Do we have to run the “docker run …” command every time?
No. To stop a container, the command is “docker stop [container id]”; it will also be
stopped automatically when you shut down your computer. The next time you need your
container, run “docker start [container id]”.
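A restarted container runs in the background, so the Jupyter URL is not shown in the terminal. The sketch below starts an existing container and then prints its most recent log lines, assuming Jupyter writes its URL and token to the container log at startup. resume_container is a hypothetical helper name, and the 3-second pause is an arbitrary guess at how long Jupyter needs to start.

```shell
# resume_container CONTAINER
# Starts an existing (stopped) container and shows its latest log lines.
# Hypothetical helper; CONTAINER is an ID or unique ID prefix from "docker ps -a".
resume_container() {
    container=$1
    if ! command -v docker >/dev/null 2>&1; then
        echo "docker not found on PATH"
        return 1
    fi
    docker start "$container"            # start the same container again (in the background)
    sleep 3                              # assumed pause while Jupyter starts
    docker logs --tail 20 "$container"   # last 20 log lines of the container
}

# Usage (3f is a made-up ID prefix):
#   docker ps -a          # list all containers, including stopped ones
#   resume_container 3f
#   docker stop 3f        # stop it again when you have finished
```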
Both containers use the UTC timezone by default. To specify a timezone, an additional
environment variable can be used, for example: -e TZ=Australia/Melbourne
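The effect of the TZ variable can be tried in any shell before using it with Docker. The sketch below shows the current time under two settings and then prints (without running) a docker run command that includes the -e option. The host folder in that command is an assumed example. Since -e is an option of docker run, it applies when a container is created, not to a container that already exists.

```shell
# TZ controls how local time is displayed by programs such as date.
TZ=UTC date
TZ=Australia/Melbourne date   # relies on the system's time zone database being installed

# The same variable passed to a new container (printed here, not executed).
# The host folder is an assumed example; replace it with your own.
echo 'docker run -e TZ=Australia/Melbourne -v "$HOME/5202_docker/labs:/home/student" -p 8888:8888 -p 4040:4040 monashfit/fit5202-pyspark:latest jupyter notebook'
```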
Note: Use Jupyter Notebook instead of Jupyter Lab for real-time
Matplotlib visualisation.