
FIT5202 - Data Processing For Big Data

Lab 01 Activity: Setting up your environment

In this activity, we will learn how to set up the environment and make it ready for processing big
data. Big data processing employs many open-source software packages and libraries that together
form a software stack, and it is a challenge to ensure the compatibility of libraries and versions from
various sources. In this unit, we will use container technology (Docker) to manage the environment,
and a pre-built and optimised image has been provided with the following software:
1. Python (3.10) as the programming language
Python 3.11 and above are not compatible with some dependencies as of the time of
writing this document (July 2023)
2. Jupyter Lab/Jupyter Notebook as an IDE for Python development.
The Docker image has both Jupyter Lab and Jupyter Notebook pre-installed. Generally, Jupyter
Lab is recommended since it is the newer successor to Jupyter Notebook and provides many
useful features like built-in file management, a CSV viewer, a terminal emulator, etc. However,
a few known bugs exist in Jupyter Lab when you try to use real-time visualisation (e.g.
plotting stream data in real time), due to backward compatibility issues.
3. Apache Spark (3.4.0) as a big data processing and analysis tool
4. Apache Kafka as the tool for streaming
5. (Optional, commonly used libraries) scikit-learn/NumPy/Matplotlib (latest versions as of
July 2023)
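
The versions above are the ones the unit targets; once the container is running (see Step 4 below), you can confirm what the image actually ships from a notebook cell. A minimal sketch, assuming the listed packages are importable inside the image:

import sys
import pyspark

print(sys.version)          # expect a 3.10.x Python
print(pyspark.__version__)  # expect 3.4.0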

We will install Docker Desktop first, then go through the steps to use Jupyter notebooks with it.
As this is a big data unit, we will only learn the basic Docker commands needed to manage our environment.

Week/Session 1: Docker Installation and Environment Setup


Step 0: System Requirements
Step 1: Installing Docker Desktop
a) Download Docker Desktop
b) Installation and Tuning for MacOS
c) Windows 10/11
d) Linux
Step 2: Running Docker Image
Step 3: Access Jupyter Notebook in Browser
Step 4: Using Jupyter Notebook

Howtos and FAQs


● How to view running/stopped containers?
● What is a shorthand?
● Do we have to run the “docker run …” command every time?
● How do I get the token after stop/start?
● How to delete containers and clean up?
● My token doesn’t work.

Week/Session 9: Zookeeper and Kafka in Docker

Week/Session 1: Docker Installation and Environment Setup

Step 0: System Requirements


Big data processing and machine learning are very demanding on computing resources. Having a
desktop/laptop with a faster CPU and more RAM will speed up some lab activities and machine
learning training.

Minimum Requirement:
A Windows 10/11, Linux or MacOS laptop/desktop that is less than 5 years old, with a minimum
of 8GB RAM and a solid state drive (SSD).

Recommendations (optional, not a hard requirement):


A computer less than 3 years old with 16GB RAM and a 256GB NVMe SSD.

For Windows/Linux laptops: avoid "U"-series CPUs; get a standard or "H"/"HK" variant if possible.
For Mac, an 8th-generation Intel CPU or Apple M1/M2 is recommended.

Step 1: Installing Docker Desktop

a) Download Docker Desktop


Go to https://www.docker.com/ and download the version appropriate for your operating system.
Note: Macs have two different versions, one for Intel CPUs and the other for Apple M1/M2; please
download the one that matches your machine.

b) Installation and Tuning for MacOS


1) Double click on the downloaded .dmg file and drag Docker to your
“Applications”.

2) Find “Docker” in your Applications and start it, then click on the Docker icon
and select “Preferences”.

3) In the “General” tab, change the file sharing implementation to “VirtioFS”. This option
improves I/O performance on bind mounts.

4) In the “Resources” tab, please change the CPUs, Memory and Virtual disk limits
depending on your laptop specification. We recommend allocating 8GB of RAM to
Docker if possible.

c) Windows 10/11
1) Prerequisites: Windows Subsystem for Linux (WSL 2) is required. Please
follow the instructions from Microsoft: Install WSL | Microsoft Learn
2) Click on “Start”, find “Microsoft Store” and search for Ubuntu 22.04 (LTS). This
installs a base Linux distribution for Docker.
3) Double-click on the downloaded Docker Desktop .exe file and follow the instructions;
all default settings work fine.
4) (Optional) Please refer to steps 3 and 4 in the MacOS section above for performance tuning.
d) Linux
Docker Engine is recommended instead of Docker Desktop. Please follow the instructions
for your distribution:
Ubuntu: Install Docker Engine on Ubuntu | Docker Documentation
Debian: Install Docker Engine on Debian | Docker Documentation
CentOS: Install Docker Engine on CentOS | Docker Documentation

Step 2: Running Docker Image


After installing Docker, start “Docker Desktop”, open a terminal/Windows command
line, and run the following command (works on all platforms).
Change the local folder path (/home/jay/fit5202/labs in the example below) to a folder of your
own; this is where the lab notebook files you download from Moodle (see Step 4) will be stored.

docker run -v /home/jay/fit5202/labs:/home/student -p 8888:8888 -p 4040:4040 monashfit/fit5202-pyspark jupyter notebook

The parameters of this command are explained below.

Parameters:
1) -v: Binds your local folder to the home directory (/home/student) inside the
container. Please note that a Windows path needs to include the drive letter and use
backslashes (\). For example:
docker run -v D:\5202_docker\labs:/home/student -p 8888:8888 -p 4040:4040
monashfit/fit5202-pyspark:latest jupyter notebook
You can change the local folder part (D:\5202_docker\labs here) to your own folder. Make sure
this folder exists before running the command; all your Jupyter notebook files will be stored in this folder.
2) -p (8888 and 4040): Port mapping from host to container. 8888 is the default port for
Jupyter Notebook and 4040 is the default port for the Spark UI. The port number to the
left of the colon is the host port; the one on the right is the container port.
3) “jupyter notebook”: executes Jupyter Notebook inside the container. This can also be
“jupyter lab” if you prefer.
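
To see the port mapping in action, you can ask Spark for the address of its UI once a session is running. This is a minimal sketch, not part of the lab notebooks; it assumes you run it in a notebook cell inside the container:

from pyspark.sql import SparkSession

# Start (or reuse) a SparkSession inside the container.
spark = SparkSession.builder.appName("port-mapping-check").getOrCreate()

# This reports the Spark UI address as seen from inside the container (port 4040);
# the -p 4040:4040 mapping makes the same UI reachable from your host browser
# at http://localhost:4040
print(spark.sparkContext.uiWebUrl)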

The “docker run” command is only required the first time you start a container. When you
finish using it, you can press Ctrl+C or use “docker stop [container id]” to stop it, and
“docker start [container id]” to start it again later on.

To remove a container, use “docker rm [container id]”.

Step 3: Access Jupyter Notebook in Browser


If the above steps are successful, Jupyter will print startup output on your terminal, ending with
one or more URLs that include an access token.

Please copy and paste the URL including the token into your preferred web browser.

Step 4: Using Jupyter Notebook


Download the Jupyter notebook files from Moodle, and either:
1) Upload the files through the Jupyter interface (use the “Upload” button in the file browser, or drag and drop in Jupyter Lab),
Or 2) Copy the files into the local folder you bound with -v on your laptop.

Now that you have a working environment, please continue exploring the Jupyter notebooks from
Moodle.
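
Once the notebook files are in place, a quick way to confirm that Spark runs end to end is to execute a small job in a new notebook cell. This is a minimal sketch with made-up data; the output name fit5202_check.csv is just an illustrative choice, and it will appear in the host folder you bound with -v:

from pyspark.sql import SparkSession

# Start a local SparkSession inside the container.
spark = SparkSession.builder.appName("lab01-check").getOrCreate()

# A tiny DataFrame, just to confirm Spark executes jobs.
df = spark.createDataFrame([(1, "spark"), (2, "kafka")], ["id", "tool"])
df.show()

# Anything written under /home/student lands in the folder you bound with -v.
df.write.mode("overwrite").csv("/home/student/fit5202_check.csv", header=True)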

Howtos and FAQs

● How to view running/stopped containers?


To view running containers, you can use “docker ps”.
To view all containers including stopped ones, the command is “docker ps -a”.

● What is a shorthand?
When you execute the “docker ps” command, you will see a long container ID, which some
commands require (e.g. docker stop [container id]). Since we don’t want to type the long ID
every time, you can use just the first 1-2 characters instead of the whole ID, as long as
Docker can uniquely identify the container from them.

● Do we have to run the “docker run …” command every time?
No. To stop a container, the command is “docker stop [container id]”; alternatively, it will be
stopped automatically when you shut down your computer. The next time you need your
container, run “docker start [container id]”.

● How do I get the token after stop/start?


For security reasons, Jupyter will generate a new token if the session times out (e.g. you stop
a container and start it again after more than 30 minutes). To view the new token, the
command is “docker logs [container id]”; you can also follow the log output live with
“docker logs -f [container id]” or attach to the running container with “docker attach [container id]”.

● How to delete containers and clean up?


You can use “docker rm [container id]” to remove containers that are no longer required. This
won’t delete your volumes/data in case you still need them. To perform a broader cleanup of
stopped containers, unused networks and dangling images, the command is “docker system prune”.

● My token doesn’t work.


1. Make sure you copy the whole token and nothing else. Sometimes you may copy a
whitespace/line-break character without noticing.
2. Check for a port conflict. If you have another local Jupyter Notebook/PySpark instance
running, it may already be using port 8888 or 4040. Solution: change the host side of the
port mapping in the “docker run” command to another free port, e.g. “-p 9999:8888”,
and then open http://localhost:9999 instead.

Week/Session 9: Zookeeper and Kafka in Docker


Based on the principle of separating responsibilities, we will need two more containers for Spark data
streaming.
1. Zookeeper
Command: docker run -d -p 2181:2181 monashfit/fit5202-zookeeper
According to the ZooKeeper documentation: “ZooKeeper is a centralised service for
maintaining configuration information, naming, providing distributed
synchronisation, and providing group services. All of these kinds of services are
used in some form or another by distributed applications.” Kafka uses ZooKeeper to
store some of its critical data; hence, ZooKeeper needs to be started first.
2. Kafka
docker run -d -e KAFKA_ZOOKEEPER_CONNECT=host_ip:2181 -e KAFKA_ADVERTISED_HOST_NAME=host_ip -p 9092:9092 fit5202/kafka
Replace host_ip (in both places) with your machine's IP address.

Both containers will use the UTC timezone by default. To specify a timezone, an additional
environment variable can be used, for example: -e TZ=Australia/Melbourne
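
Once both containers are up, you can verify from a notebook that the Kafka broker is reachable by producing and then consuming a test message. This is a minimal sketch using the kafka-python package (assumed to be available in the image; install it with pip if it is not), with a made-up host IP and a throwaway topic name:

from kafka import KafkaProducer, KafkaConsumer

host_ip = "192.168.1.10"  # hypothetical example; replace with your own host IP

# Send a single test message (the topic is auto-created on most default broker configs).
producer = KafkaProducer(bootstrap_servers=f"{host_ip}:9092")
producer.send("week9-test", b"hello kafka")
producer.flush()

# Read it back to confirm the broker is reachable from the notebook.
consumer = KafkaConsumer(
    "week9-test",
    bootstrap_servers=f"{host_ip}:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # give up after 5 seconds if nothing arrives
)
for message in consumer:
    print(message.value)
    break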

Note: Use Jupyter Notebook instead of Jupyter Lab for real-time matplotlib visualisation.
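
For reference, the kind of real-time visualisation this note refers to usually follows the pattern below: redrawing a matplotlib figure in the cell output as new values arrive. This is only an illustrative sketch with synthetic data, not part of the lab notebooks:

import time
import matplotlib.pyplot as plt
from IPython.display import clear_output

xs, ys = [], []
for step in range(10):
    xs.append(step)
    ys.append(step * step)   # stand-in for a value arriving from a stream
    clear_output(wait=True)  # redraw the cell output in place
    plt.plot(xs, ys)
    plt.xlabel("step")
    plt.ylabel("value")
    plt.show()
    time.sleep(1)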
