Docker REPORT
Docker is a tool designed to make it easier to create, deploy, and run applications by
using containers. Containers allow developers to package an application with all its
dependencies (code, libraries, environment variables, etc.) so that it works consistently across
different environments. This is especially useful when moving applications from one machine
to another, like from a developer's local environment to a production server.
In web scraping, Docker helps package your scraping environment (Python libraries, scraping
tools, browsers like ChromeDriver for Selenium, etc.) into a single container. This eliminates
issues with software dependencies across different machines. Note that a container is a
lightweight, standalone executable package that includes everything needed to run an
application (e.g., source code, libraries, settings). For example, web scraping tools like
Selenium or BeautifulSoup often require specific dependencies (e.g., browser drivers, Python
packages). Docker makes sure everything is bundled correctly so our scraping code will run
smoothly wherever the container is executed.
Here’s how I started building a simple web scraping tool with Docker:
1. I created a project folder for the scraper:
mkdir webscraper
2. I moved into that folder:
cd webscraper
3. I created a Python script (scraper.py) inside my project folder with the following code:
import requests
from bs4 import BeautifulSoup
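The rest of the script is short: it requests a page, parses the HTML with BeautifulSoup, and prints the page title. A minimal sketch of that continuation, with https://example.com standing in only as a placeholder URL, is:
url = "https://example.com"                          # placeholder URL for illustration
response = requests.get(url, timeout=10)             # fetch the page
soup = BeautifulSoup(response.text, "html.parser")   # parse the HTML
print(soup.title.string)                             # print the contents of the <title> tag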
4. I created a requirements.txt file in the same folder, listing the Python packages the scraper needs:
requests
beautifulsoup4
5. I created a Dockerfile with the instructions to build a Docker image for my scraper:
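FROM python:3.8-slim
WORKDIR /app
COPY . /app
RUN pip install --no-cache-dir -r requirements.txt
CMD ["python", "scraper.py"]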
· FROM python:3.8-slim: This line tells Docker to use the official Python image
as the base for my container. The 3.8-slim version is a lightweight version of
Python 3.8, which is smaller in size and includes just enough libraries to run
Python applications. Using a slim image makes the container smaller, faster to
build, and more efficient.
· WORKDIR /app: This sets the working directory inside the container to /app.
Every subsequent command (like copying files or installing dependencies) will
happen within this directory.
· COPY . /app: This copies all files and folders from my local machine’s current
directory into the /app directory inside the container, allowing Docker to see
my scraper.py script, requirements.txt, and any other necessary files.
· RUN pip install --no-cache-dir -r requirements.txt: This installs the Python
packages listed in requirements.txt inside the container using pip. The --no-
cache-dir option ensures that pip doesn’t save cache files for installed
packages, keeping the container small and efficient.
· CMD ["python", "scraper.py"]: This specifies the command to run when the
container starts, telling Docker to run the scraper.py Python script.
6. I built the Docker image by running the following command in my project directory:
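docker build -t webscraper .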
This command tells Docker to create an image named webscraper based on the
Dockerfile in the current directory.
· -t webscraper: This flag assigns the name webscraper to the image. I can use
any name here, but webscraper is just an example.
· . (dot): The dot at the end tells Docker to look for the Dockerfile in the current
directory.
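7. I ran the container:
docker run webscraper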
This executes the scraper.py file inside the container and outputs the title of the web
page. Docker creates a new container using the webscraper image and then executes
the command specified in the Dockerfile.
8. After I am done, I can stop and remove any running containers (optional):
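docker ps -a
docker stop <container-id>
docker rm <container-id>
Here, docker ps -a lists all containers (running and exited) so I can find the ID to pass to docker stop and docker rm.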
If I have many containers running or exited, it’s a good idea to remove them once
they’re no longer needed to free up system resources.
Using Docker for this scraping setup has several benefits:
· Consistency: The same code will run identically on any machine where Docker is
installed, eliminating issues caused by different environments.
· Dependency Management: All the necessary dependencies are packaged inside the
Docker container.
· Portability: You can easily share the Docker image with others, and they can run the
same code on their machines without any setup issues.
· Isolation: The containerized environment isolates the scraping tool from your host
machine, ensuring that any issues inside the container won’t affect the host.
To extract the data itself, we use tools like BeautifulSoup (for static HTML), Selenium (for dynamic, JavaScript-rendered content), or APIs (if available). However, it’s important to always check the website’s terms
of service to make sure we’re not violating any rules.
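For pages that load their content with JavaScript, BeautifulSoup alone is not enough and Selenium is the usual choice. A minimal headless sketch, assuming Chrome and its driver are installed inside the image and again using https://example.com only as a placeholder URL, could look like this:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")                # run Chrome without a visible window
options.add_argument("--no-sandbox")              # commonly needed when running as root in a container
options.add_argument("--disable-dev-shm-usage")   # avoid /dev/shm size issues in containers
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")                 # placeholder URL
print(driver.title)                               # print the page title, like the requests version
driver.quit()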