Docker REPORT
Docker is a tool designed to make it easier to create, deploy, and run applications by
using containers. Containers allow developers to package an application with all its
dependencies (code, libraries, environment variables, etc.) so that it works consistently across
different environments. This is especially useful when moving applications from one machine
to another, like from a developer's local environment to a production server.
In web scraping, Docker helps package your scraping environment (Python libraries, scraping
tools, browsers like ChromeDriver for Selenium, etc.) into a single container. This eliminates
issues with software dependencies across different machines. Note that a container is a
lightweight, standalone executable package that includes everything needed to run an
application (e.g., source code, libraries, settings). For example, web scraping tools like
Selenium or BeautifulSoup often require specific dependencies (e.g., browser drivers, Python
packages). Docker makes sure everything is bundled correctly so our scraping code will run
smoothly wherever the container is executed.
Here’s how I started building a simple web scraping tool with Docker:
1. I created a project folder for the scraper:
mkdir webscraper
2. I moved into that folder:
cd webscraper
3. I created a Python script (scraper.py) inside my project folder with the following code:
import requests
from bs4 import BeautifulSoup
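The rest of the script is short: it requests a page, parses the HTML with BeautifulSoup, and prints the page title. A minimal sketch of that continuation, with https://example.com standing in only as a placeholder URL, is:
url = "https://example.com"                          # placeholder URL for illustration
response = requests.get(url, timeout=10)             # fetch the page
soup = BeautifulSoup(response.text, "html.parser")   # parse the HTML
print(soup.title.string)                             # print the contents of the <title> tag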
4. I created a requirements.txt file in the same folder, listing the Python packages the scraper needs:
requests
beautifulsoup4
5. I created a Dockerfile with the instructions to build a Docker image for my scraper:
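FROM python:3.8-slim
WORKDIR /app
COPY . /app
RUN pip install --no-cache-dir -r requirements.txt
CMD ["python", "scraper.py"]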
· FROM python:3.8-slim: This line tells Docker to use the official Python image
as the base for my container. The 3.8-slim version is a lightweight version of
Python 3.8, which is smaller in size and includes just enough libraries to run
Python applications. Using a slim image makes the container smaller, faster to
build, and more efficient.
· WORKDIR /app: This sets the working directory inside the container to /app.
Every subsequent command (like copying files or installing dependencies) will
happen within this directory.
· COPY . /app: This copies all files and folders from my local machine’s current
directory into the /app directory inside the container, allowing Docker to see
my scraper.py script, requirements.txt, and any other necessary files.
· RUN pip install --no-cache-dir -r requirements.txt: This installs the Python
packages listed in requirements.txt inside the container using pip. The --no-
cache-dir option ensures that pip doesn’t save cache files for installed
packages, keeping the container small and efficient.
· CMD ["python", "scraper.py"]: This specifies the command to run when the
container starts, telling Docker to run the scraper.py Python script.
6. I built the Docker image by running the following command in my project directory:
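docker build -t webscraper .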
This command tells Docker to create an image named webscraper based on the
Dockerfile in the current directory.
· -t webscraper: This flag assigns the name webscraper to the image. I can use
any name here, but webscraper is just an example.
· . (dot): The dot at the end tells Docker to look for the Dockerfile in the current
directory.
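7. I ran the container:
docker run webscraper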
This executes the scraper.py file inside the container and outputs the title of the web
page. Docker creates a new container using the webscraper image and then executes
the command specified in the Dockerfile.
8. After I am done, I can stop and remove any running containers (optional):
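docker ps -a
docker stop <container-id>
docker rm <container-id>
Here, docker ps -a lists all containers (running and exited) so I can find the ID to pass to docker stop and docker rm.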
If I have many containers running or exited, it’s a good idea to remove them once
they’re no longer needed to free up system resources.
Using Docker for this scraping setup has several benefits:
· Consistency: The same code will run identically on any machine where Docker is
installed, eliminating issues caused by different environments.
· Dependency Management: All the necessary dependencies are packaged inside the
Docker container.
· Portability: You can easily share the Docker image with others, and they can run the
same code on their machines without any setup issues.
· Isolation: The containerized environment isolates the scraping tool from your host
machine, ensuring that any issues inside the container won’t affect the host.
To extract the data itself, we use tools like BeautifulSoup (for static HTML), Selenium (for dynamic, JavaScript-rendered content), or APIs (if available). However, it’s important to always check the website’s terms
of service to make sure we’re not violating any rules.
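For pages that load their content with JavaScript, BeautifulSoup alone is not enough and Selenium is the usual choice. A minimal headless sketch, assuming Chrome and its driver are installed inside the image and again using https://example.com only as a placeholder URL, could look like this:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")                # run Chrome without a visible window
options.add_argument("--no-sandbox")              # commonly needed when running as root in a container
options.add_argument("--disable-dev-shm-usage")   # avoid /dev/shm size issues in containers
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")                 # placeholder URL
print(driver.title)                               # print the page title, like the requests version
driver.quit()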