
Fun with Python

Developing mobile apps, automating tasks, and analyzing cryptocurrency trends

Hubert Piotrowski
www.bpbonline.com
First Edition 2025

Copyright © BPB Publications, India

ISBN: 978-93-65893-816

All Rights Reserved. No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, or stored in a database or retrieval system, without the prior written permission of the publisher, with the exception of the program listings, which may be entered, stored, and executed in a computer system, but cannot be reproduced by means of publication, photocopy, recording, or any electronic or mechanical means.

LIMITS OF LIABILITY AND DISCLAIMER OF WARRANTY


The information contained in this book is true and correct to the best of the author’s and publisher’s knowledge. The author has made every effort to ensure the accuracy of this publication, but the publisher cannot be held responsible for any loss or damage arising from any information in this book.

All trademarks referred to in the book are acknowledged as properties of their respective owners but
BPB Publications cannot guarantee the accuracy of this information.

www.bpbonline.com

Dedicated to

My beautiful wife Agnieszka and


My lovely daughters Dominika and Klara
thank you for your patience

About the Author

Hubert Piotrowski, M.Sc. Computer Science, is a seasoned technologist and speaker with over 20 years of experience in software and hardware
domains. His areas of expertise range from software development and
automation to data lakes, embedded devices, large-scale web services, cloud
technologies, and systems architecture. He has shared his insights at PyCon
Singapore, PyCon Poland, and through various webinars.
Acknowledgement

There are a few people I would like to thank for the continued and ongoing support they have given me during the writing of this book. First and foremost, I would like to thank my family, who have been a great support and, at the same time, have shown so much patience and understanding. Without their support, finishing this book could not have been accomplished.
I am also grateful to the companies, contractors, and colleagues I have
worked with in the past. Their collaboration has enriched my journey and
contributed greatly to my experience.
My greetings and gratitude also go to the team at BPB Publications for being supportive and understanding, and for giving me ample time to finish this challenging book.
Preface

This book explores various aspects of Python programming, guiding you through how to effectively use Python in contemporary projects.
It takes a practical approach to Python programming. We are going to talk about real projects that can be used and extended for daily use. We will learn not only how to use Python libraries but also how to build useful applications. All of this we are going to learn together as fun with Python.
The book is divided into 16 chapters. These chapters cover different types of tasks and a variety of challenges and topics. The details are listed below.
Chapter 1: Python 101 - In this chapter, we will demonstrate how Python
works through simple examples, gradually increasing in complexity to help
you build a solid foundation for the upcoming chapters. This fundamental
knowledge will be essential as you progress. The chapter will cover the
following topics:
Simple scripts, AKA hello world
Exceptions
Classes, objects, polymorphism
Functional programming
Organizing files and simple modules
Chapter 2: Setting up Python Environment 102 - In this chapter, we will learn how to organize a Python workbench - get a clean project file structure, work with different versions of Python on the same computer, and use the right version in each project. Topics to be covered:
Working with external libraries
Building your own libraries
Libraries chaos under control
Python environments
Clean code
Hello world in a clean fashion
Chapter 3: Designing a Conversational Chatbot - In this chapter, you
will learn how to create your own interactive chatbot. You’ve probably
encountered those helpful web widgets on websites that allow you to chat
with customer support—many times, you’re actually talking to an AI. Here,
we’ll guide you through building a service that can power such a chatbot.
Topics to be covered include:
Brief introduction to chatbots and AI behind the scenes
Writing server API
Building AI logic for chatbot
Prototyping client side
Let's play - this is where you will learn how to connect the frontend part of the chatbot with its backend logic
Chapter 4: Developing App to Analyze Financial Expenses - Everybody has bills to pay. When you grow up and start a family, your personal expenses can go beyond expectations. Why not build a simple tool that can help you track your home budget? We will learn here how to build an efficient personal budget calculator and how to estimate future expenses based on what you have already spent and collected. The following topics will be covered in this chapter:
Understanding how to calculate an expenses budget
Estimating future expenses based on past income and spending
Building a behavior-driven estimator
Statistics
Chapter 5: Building Non-blocking Web Crawler - In this chapter, you will learn how to fetch data from a website. You will be able not only to get web pages but also to extract their content, parse it, and use the collected data. Imagine a website is blocking your web spider - no problem: after this chapter, you will know how to deal with it and write very efficient web crawlers. This chapter will cover the following topics:
Web crawler basics
Efficient data scraping
Parsing HTML and extracting data
Building and using proxy services
Chapter 6: Create Your Own Virus Detection System - Malware and viruses are pretty bad things when it comes to the internet. Why not build your own scanner that can analyze your OS and find any traces of malicious software? In this chapter, you will learn how to write a simple yet powerful virus scanner. Topics to be covered:
Understanding the basics of what malicious software is
Writing a file structure reader with recursive algorithms
Analyzing files and extracting potential threats
When speed is important - parallel processing
Updating and using the virus database
Chapter 7: Create Your Own Crypto Trading Platform - In this chapter,
we will learn how to build our own personal crypto broker application.
Using Python, we'll explore how to interact with cryptocurrency exchange
markets and automate the buying and selling of your valuable crypto assets
to work in your favor. In this section, we’ll take the journey of a crypto
investor who doesn't just blindly buy and sell cryptocurrencies. Instead, we
aim to make smart decisions and act at the right moment. This chapter will
cover the following topics:
Brief introduction to the crypto market
Building a client for the crypto market
Trends analyzer
Integrating with a crypto wallet
Purchase and sell
Chapter 8: Construct Your Own High-tech Loudspeaker - In this
chapter, we analyze and learn how to build our own smart speaker system.
Have you ever heard of devices like Alexa, for instance? Here we will teach you how to write advanced bots that, based on voice recognition, can interact with the user. After this chapter, you will know how to write interaction scenarios and use them for smart devices. Topics to be covered:
Understanding voice recognition
Writing simple voice recognition scripts
Building a chatbot with voice recognition
Organizing and building chat scenarios
Chapter 9: Make a Music and Video Downloader - In the previous chapters, we showed you how to write web crawlers, and you learned how to use crawled data. This chapter will teach you how to use that knowledge to build not a web crawler but software that can help you download your favorite music or video clips from YouTube to your local PC, so you can watch them offline whenever you want. This chapter will cover the following topics:
Understanding the API concept
Building a YouTube API client
Organizing downloaded data
Support for different formats and resolutions
Building a batch data downloader
Chapter 10: Make A Program to Safeguard Websites - In this chapter, you will learn how to write your own internet gateway in Python. By having our own fully programmable gateway, we can start controlling web content. All the internet traffic will be routed through it. You will be able to detect malicious website queries and filter them out completely so our gateway clients can stay safe. This chapter will cover the following topics:
Understanding packet routing policies
Writing your own DNS server
Writing a DHCP server
Packet inspection software
Filtering web content
Challenges with encrypted websites
Chapter 11: Centralizing All Calendars - Imagine that we have multiple email accounts and, with them, many calendars. Following all the meetings and scheduled tasks, knowing when you shall meet someone or when you are booked, can be a big challenge. In this chapter, we will learn how to build a smart tool that can follow all the calendars submitted to it and show us, in a single endpoint, all the useful information about our schedule for every day of our busy life. Topics to be covered:
Building a subscriber tool for web calendars
Pull vs push information from/to calendars
Creating a local DB-driven centralized calendar
Creating an endpoint for a desktop calendar that can subscribe to your centralized calendar
Error reporting and controlling individual calendars overlapping with each other
Chapter 12: Developing a Method for Monitoring Websites - In this
chapter, you'll learn how to create a simple but effective tool for monitoring
your websites. You'll discover how to track the availability of specified
sites, monitor uptime, and receive alerts for critical moments when your key
services become inaccessible. Topics covered include:
Brief introduction to TCP/UDP packets
Understanding how monitoring works
The concept of monitoring probes
Building a reporting center
Designing an alarm system
Chapter 13: Making a Low-cost, Fully-automated Shopping App - In
this chapter, we will build a personal bot that monitors online stores for
products you may be interested in purchasing. The bot will track price
fluctuations and availability, alerting you when the time is right to buy. You
will also learn the fundamentals of enhancing this tool to enable it to
automatically purchase items on your behalf. The chapter will cover the
following topics:
Connecting to the eBay bidding service, placing a bid, and hunting for the best price
Writing plugins for popular webstores to find and buy products at the best price
Tracking prices and generating alerts when the best price is available
Chapter 14: Python Goes Mobile - In this chapter, you'll learn how to use
Python on mobile devices (smartphones) and run your Python programs on
these unique platforms. We'll focus on writing small, efficient code that
works well on mobile systems. The aim of this chapter is to teach you how
to develop mobile applications using Python. The topics covered include:
Brief introduction to mobile applications - their concept and limitations
Overview of Python libraries for mobile devices
A calculator in Python for iOS and Android
Chapter 15: QR Generator and Reader - In this chapter, you will learn how to use Python for image processing. We will learn how to generate our own QR codes with Python and how to include simple and complex structures in them as well. Then we will learn how to process these QR codes and read them using Python. This chapter will cover the following topics:
Introduction to barcodes and QR codes
Building a simple barcode generator
Building a simple QR code generator
Embedding vCards into barcodes
Adding images to QR codes
Uploading and processing QR codes
Chapter 16: App to Keep Track of Digital Currencies - In this chapter, we will create a tool to analyze cryptocurrency trends, alerting you when specific trends are detected. You will learn how to parse and analyze data streams, create graphs, and use trend analysis to extract insights from large datasets - all with Python. Topics to be covered:
Building a data stream analyzer
Storage for data results - a time-driven DB
Socket connections
An analysis tool for trends
Learning how to draw graphs with Python
Building alarm logic
Code Bundle and Coloured Images
Please follow the link to download the
Code Bundle and the Coloured Images of the book:

https://rebrand.ly/81aee6
The code bundle for the book is also hosted on GitHub at
https://github.com/bpbpublications/Fun-with-Python. In case there’s an
update to the code, it will be updated on the existing GitHub repository.
We have code bundles from our rich catalogue of books and videos
available at https://github.com/bpbpublications. Check them out!

Errata
We take immense pride in our work at BPB Publications and follow best practices to ensure the accuracy of our content and provide an engaging reading experience to our subscribers. Our readers are our mirrors, and we use their inputs to reflect and improve upon human errors, if any, that may have occurred during the publishing processes involved. To let us maintain the quality and help us reach out to any readers who might be having difficulties due to any unforeseen errors, please write to us at:
[email protected]
Your support, suggestions, and feedback are highly appreciated by the BPB Publications’ Family.

Did you know that BPB offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.bpbonline.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.bpbonline.com, you can also read a collection of free technical articles, sign up for a
range of free newsletters, and receive exclusive discounts and offers on BPB books and eBooks.
Piracy
If you come across any illegal copies of our works in any form on the internet, we would be
grateful if you would provide us with the location address or website name. Please contact us at
[email protected] with a link to the material.

If you are interested in becoming an author


If there is a topic that you have expertise in, and you are interested in either writing or
contributing to a book, please visit www.bpbonline.com. We have worked with thousands of
developers and tech professionals, just like you, to help them share their insights with the global
tech community. You can make a general application, apply for a specific hot topic that we are
recruiting an author for, or submit your own idea.

Reviews
Please leave a review. Once you have read and used this book, why not leave a review on the site
that you purchased it from? Potential readers can then see and use your unbiased opinion to make
purchase decisions. We at BPB can understand what you think about our products, and our
authors can see your feedback on their book. Thank you!
For more information about BPB, please visit www.bpbonline.com.

Join our book’s Discord space


Join the book’s Discord Workspace for the latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:
https://discord.bpbonline.com

Table of Contents

1. Python 101
Introduction
Structure
Objectives
Installing Python
Using Python
Editor
Hello world
Basics
Loops
Iterators vs generators
Functions
Classes
Modules and packages
Error handling
Code style
Conclusion

2. Setting up Python Environment 102


Introduction
Structure
Objectives
Clean and proper Python workbench
Python in Linux
Python in Windows
Controlling projects
Libraries under control
Clean code
Flake8
Pylint
IDE
Pre-commit
Build your own library
Conclusion

3. Designing a Conversational Chatbot


Introduction
Structure
Objectives
Client-server architecture
Chatbot basics
Training
Chat
Application
Frontend
Conclusion

4. Developing App to Analyze Financial Expenses


Introduction
Structure
Objectives
Excel
Export
Import
Analyze expenses
Estimating future expenses based on past income and spending
Building a behavior-driven estimator
Conclusion

5. Building Non-blocking Web Crawler


Introduction
Structure
Objectives
Working with text
Working with HTML
Basic example
Simple crawler
Parallel processing
Improvements
Limit parallel processing
Proxy
Conclusion

6. Create Your Own Virus Detection System


Introduction
Structure
Objectives
Building files and directories scanner
Calculating hashing keys
Introducing viruses
Use and update viruses DB
Building map of suspicious files
Parallel processing
Conclusion

7. Create Your Own Crypto Trading Platform


Introduction
Structure
Objectives
Brief introduction to crypto market
Currencies
Building client for crypto market
Trends analyzer
Integrating with crypto wallet
Purchase and sell
Conclusion

8. Construct Your Own High-tech Loudspeaker


Introduction
Structure
Objectives
Building software that supports speech to text
Recording
Response
Building interaction scenarios
Connecting to third party service like music players
Building physical devices
Conclusion

9. Make a Music and Video Downloader


Introduction
Structure
Objectives
Download manager
Organizing downloaded data
Building YouTube API client
Support for different formats and resolutions
Conclusion

10. Make A Program to Safeguard Websites


Introduction
Structure
Objectives
Understanding packet routing policies
Writing a DHCP server
Packet inspection software
Challenges with encrypted websites
Filtering Web Content
Conclusion

11. Centralizing All Calendars


Introduction
Structure
Objectives
Building subscriber tool for web calendars
Google
Office 365
iCal
Calendar parser
Subscribe locally
Synchronize with external calendar
Conclusion
12. Developing a Method for Monitoring Websites
Introduction
Structure
Objectives
TCP/UDP
Port scanner
Advanced scanner
Reporting
Conclusion

13. Making a Low-cost, Fully-automated Shopping App


Introduction
Structure
Objectives
eBay client
Writing plugins to find and buy best product price
Automated price tracker
Tracking multiple items
Historical values
Auto purchase product
Conclusion

14. Python Goes Mobile


Introduction
Structure
Objectives
Basics
Python GUI
GUI
Toga
Kivy
Compiler
Calculator
Calculation logic
Callbacks
Android
Alternative UI
Android
iOS
Conclusion

15. QR Generator and Reader


Introduction
Structure
Objectives
Barcode generator
Barcode reader
QR code generator
QR code reader
Conclusion

16. App to Keep Track of Digital Currencies


Introduction
Structure
Objectives
Data stream
Storing stream
Reading data
Data visualization
Data estimate
Alarms
Conclusion

Index
CHAPTER 1
Python 101

Introduction
Many technical universities during the early 2000s used software called MATLAB1 to simulate some use cases for the Arithmetic Logic Unit (ALU) in a CPU. Some calculations and simulations at that time were time-consuming and a bit complex to achieve with pure MATLAB. It was then that many of us were introduced to Python, which could work well with MATLAB and replace it in many ways for calculations and data manipulation in academic settings.
Many were surprised by the syntax, speed, and ease of use. For those of us with a background primarily in assembler, Pascal, and raw C, the new language called Python, with its peculiar syntax, was very different from what was known in those times.
Wide use of Python came a few years later, because many of us were “distracted” by PHP during the web boom. Then the web world heard about web frameworks like Django and about APIs, where Python started to become more visible and more mature. Backend services started to use the Twisted framework2, and all those callbacks and async programming for the web were really something.
In this chapter, we will explain how Python works with simple examples, gradually getting more complicated, so that the reader can be better prepared for the following chapters requiring this fundamental knowledge. If you are already familiar with Python but think you need to refresh your knowledge or would like to learn how to write clean code – please join us in this fun.

Structure
In this chapter, we will discuss the following topics:
Basic syntax of Python code
Understanding of how to build basic structures
Basics of object-oriented programming
How to build packages, modules, and classes

Objectives
After reading this chapter, you should have a solid foundation in Python programming. You will also learn how to write clean code. Let us have some fun with Python together!

Installing Python
Python is an open-source language, which means you can go to https://python.org and download the entire language with all its tools in the form of compilable code. If you have some experience with C code and dependent libraries, you can try installing Python from source. It is a long process and requires a bit of knowledge, but it is worth doing. This will allow you to narrow down the Python stack to your personal needs.
In this book, we will focus on installing Python using the precompiled installers prepared by the Python team. In this case, you should go to the Python website and download Python 3.103. In the bottom section of the page, you can find an appropriate installer for your operating system.
In the following chapters, we will use Windows, macOS, and the Linux system Ubuntu to demonstrate the installation process and Python use cases.

Using Python
This book will teach you how to run Python programs, but you must have basic knowledge of using the Command Line Interface (CLI). If you are familiar with using the CLI in Windows and Unix-based systems (macOS or Linux), you can skip this paragraph and jump to the next one.
There is a vast history related to the CLI and how it evolved - if you are interested to know more, you can check out various sources online.

Figure 1.1: Example of CLI interface in Linux OS


If you want to see how to use the CLI in Windows, please go to Chapter 2, Setting up Python Environment, where we will show how to organize your workbench to work with Python in the most convenient way.
There are some key differences if we compare Python for macOS or Linux vs. Windows. For example, using system-level events, threads, and multiprocessing communication is significantly different when running a Python program. The key difference comes from the fact that those operating systems use different system kernels and libraries. From a Python developer’s perspective, you will run your program the same way – that is the beauty of Python.
When you become more experienced in Python development, you will learn that you have to use different techniques in different OSes when running low-level operations. For the needs of this book, we will learn the high-level part without going too deep and too complex into OS-level nuances.
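If you ever need to branch on the operating system at this high level, the standard library already tells you where you are. A quick check using standard calls (a small sketch, nothing book-specific):

import platform
import sys

print(sys.platform)        # e.g. 'linux', 'darwin' (macOS), or 'win32'
print(platform.system())   # e.g. 'Linux', 'Darwin', or 'Windows'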

Editor
To make our code look good, remember that we do not use curly brackets and semicolons to identify logical blocks, but indentation instead - it must be consistent to avoid potential syntax errors. A single indentation equals 4 spaces, 2 indentations equal 8 spaces, and so on. There are other suggestions delivered by the Python community, for instance, how many blank lines must be between function definitions. More information regarding code formatting can be found on the official Python website4. In this book, for easier reference, we will be using Visual Studio Code5 - this IDE is free and open-source and has lots of great features that are very useful for everyday work for every developer.

Figure 1.2: Example use of Visual Studio Code with Python
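To make the indentation rule concrete, here is a tiny sketch (our own example) showing how the 4-space levels delimit logical blocks:

def greet(name):
    # first level of indentation: 4 spaces
    if name:
        # second level: 8 spaces
        print("Hello,", name)

greet("world")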

Hello world
Most universities or programming training courses will tell you that the C language is the father of all programming languages, which is true, no doubt! For instance, Python itself and its modules are written in C. We bring C up here to show you how much easier Python syntax is compared to C and other languages.
1. #include <stdio.h>
2. int main() {
3. // printf() displays the string inside quotation
4. printf("Hello, World!");
5. return 0;
6. }
Code 1.1
Now, let us try to see what hello world looks like in Python:
1. print("Hello world")
Code 1.2
The first time you see this "hello world" microprogram, you think to yourself: where are the curly brackets and semicolons? Well - that is the real beauty of the Python language. As you might have already noticed, semicolons are completely dropped from Python syntax. What replaces curly brackets, then? As you probably can tell, it is indentation. We will talk a bit more about this in Chapter 2, Setting up Python Environment, regarding clean code syntax and tools to help you with this.
The following example is again hello world, but wrapped in a function with the message as an argument.
1. def hello_world(message):
2. print(message)
3.
4. hello_world("Hello world")
Code 1.3
To run this program, open Visual Studio Code (VSC) and create a new file called hello_world.py, then copy and paste the example from Code 1.3 and save it to your home directory. Now, open the CLI and go to your home folder where you saved the file. In the next steps, we will assume that you have successfully installed Python on your operating system.
1. $ python3 hello_world.py
2. Hello world
Code 1.4
As you can see, the output of that program goes directly to the CLI stream called stdout. This is how Python programs work - they run in a Python interpreter and can redirect their output to stdout or a log file - we will talk about this in the upcoming chapters.
The default Python interpreter in the CLI can be started with the command python and hitting Enter. This starts the Python shell, and from then on, everything you type in that console is Python code.
1. Python 3.10.8 (main, Oct 21 2022, 22:22:30)
2. Type "help", "copyright", "credits" or "license" for more information.
3. >>>
Code 1.5
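For instance, a quick sanity check typed directly into that shell:

>>> 2 + 2
4
>>> print("Hello world")
Hello world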

Basics
So far, we have learned how to organize your Python craftsman’s desk. Now, let us learn a few basic Python concepts that we will need in the following chapters of this book. To learn more about specific programming concepts, we strongly suggest spending some time reading the official Python docs.6

Loops
When you need to repeat some block of code, you should, of course, not copy and paste the same thing multiple times. There is a way to repeat code blocks based on a condition or until some condition is reached. As an example, we can use the Fibonacci series7.
1. >>> a, b = 0, 1
2. >>> while a < 10:
3. … print(a)
4. … a, b = b, a+b
5. …
6. 0
7. 1
8. 1
9. 2
10. 3
11. 5
12. 8
Code 1.6
Above is an example of repeating a block of code (lines 3-4) while the value of variable a is lower than 10. If you follow carefully how the value of a is being increased, you can see a small imperfection in the above code – it prints the value 1 twice on the screen. Why do we say that – let us analyze this together.
1st iteration → prints a = 0, then a, b = 1, 0 + 1, so a = 1, b = 1
2nd iteration → prints a = 1, then a, b = 1, 1 + 1, so a = 1, b = 2
3rd iteration → prints a = 1 again, then a, b = 2, 1 + 2, so a = 2, b = 3, and so on.
Why do we get this doubled output when running the while loop? You can probably see the pattern – we should print the values of variables a and b after calculating them. In that case, we will print the correct values. Please also notice one trick in Python – what we would call a one-liner → a, b = b, a + b. What is happening here? Python has magical syntax: the whole right-hand side is evaluated first, using the old values of a and b, and only then are both assignments made at once, so no temporary variable is needed.
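A minimal sketch of what that one-liner replaces – the classic multi-step version with a temporary variable:

a, b = 0, 1
# the equivalent of a, b = b, a + b written out step by step:
tmp = a      # remember the old value of a
a = b
b = tmp + b  # old a + old b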
1. def hello_world(message):
2. print(message)
3.
4. for x in range(10):
5. hello_world(f"repeat {x}")
Code 1.7
To see another example of a simple loop, we can mix the hello world function that we learned to write with an actual looping technique. As you can see, we used the range function as a sequence generator and looped with the for statement. Please notice that each time the loop makes a turn (line no. 5), we call the hello_world function with the value saved under variable x.
That loop is the simplest example of what programming is about - you are building blocks of logical statements that are executed upon specific programmed conditions.
In our case, instead of writing the printed statement repeat 0...9 ten times, we simplified it to code that gives us the same result but is much easier to control. Why? Imagine that you want to print that line like in the previous example, but not 10 times - 1000 times! After which repetition of copy-paste would you give up?
1. repeat 0
2. repeat 1
3. repeat 2
4. repeat 3
5. repeat 4
6. repeat 5
7. repeat 6
8. repeat 7
9. repeat 8
10. repeat 9
Code 1.8
Let us now modify this simple example: print a line only for every 2nd value, using a new variable set to the current iterator value multiplied by 2, and then print its content. To help us, we can use an if statement - Code 1.9 demonstrates this with a for loop and a conditional.
1. def hello_world(message):
2. print(message)
3.
4. for x in range(10):
5. if x % 2 == 0:
6. y=x*2
7. hello_world(f"value {y}")
Code 1.9
Here, you can check the result of running the above code, with a conditional used inside the loop.
1. value 0
2. value 4
3. value 8
4. value 12
5. value 16
Code 1.10
Now think for a minute. Do you know what will happen if we modify line 4 to make it look like for x in range(11) – will it change anything in the output of the running code? The answer is – yes. You will have an additional line printed with the value of 20.
Why is that? The range function generates an iterable sequence from 0 to 9 (10 elements, as a generator). In Python, the for loop starts iteration from element 0, so with the default iterator over 10 elements (line 4), we finish when x = 9. When we change it to range(11), the last pass of the loop happens when x = 10. In that case, we enter line 5, and on reaching line 6, we do a simple assignment to the variable → y = 10 * 2.
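You can verify this behavior of range directly in the Python shell:

>>> list(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(range(11))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]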

Iterators vs generators
When we talk about loops, we must explain how iterators and generators work and where they are useful. Comparing iterators and generators can be a bit confusing. A simple explanation is that an iterator is an object that you consume values from, one at a time, while a generator is a function that produces those values on demand. In simple words, check the following example:
1. my_numbers = [1, 2, 3, 4, 5]
2. data = iter(my_numbers)
3. print(next(data))
4. print(next(data))
5. print(next(data))
6. print(next(data))
7. print(next(data))
Code 1.11
You can see that in line 1, we created an array with 5 elements in it. In line 2, we convert this array to an iterator with the iter method. From lines 3-7, you can see that we call the method next, which allows us to fetch the values of the object one at a time. We can say that we created an object that exposes a mechanism to iterate over its values. The iterator goes in pairs with generators.
1. 1
2. 2
3. 3
4. 4
5. 5
Code 1.12
A generator is a function that returns a generator object, which yields a sequence of values instead of a single value. An example of a simple generator is shown below – please notice that the function does not use the return keyword. Instead, we use yield, which explicitly informs Python that this line returns a generator value.
1. def my_numbers():
2. for i in range(1, 6):
3. yield i
4.
5. obj = my_numbers()
6. print(next(obj))
7. print(next(obj))
8. print(next(obj))
9. print(next(obj))
10. print(next(obj))
Code 1.13
When you follow lines 5-10 carefully, you will notice similarities with the previous example, but there is no iter method being called. The reason is that our function my_numbers is itself a generator, so calling it already gives us an object we can iterate.
1. class AsciiIterator:
2.
3. def __iter__(self):
4. self.current_value = 65
5. return self
6.
7. def __next__(self):
8. if self.current_value > 90:
9. raise StopIteration
10. tmp_value = self.current_value
11. self.current_value += 1
12. return chr(tmp_value)
13.
14. obj = AsciiIterator()
15. my_iterator = iter(obj)
16.
17. for letter in my_iterator:
18. print(letter, end = ",")
Code 1.14
In the example from the preceding code, you can see a more precise and advanced iterator. What is happening here: we declared a class in line 1 with 2 important methods – lines 3 and 7. __iter__ is the one called when the iterator is initialized – that is, when the iter function is called. Once that part is done, every time the loop in the main code (lines 17-18) asks for the next element, Python calls __next__. That part of the code keeps increasing the internal value of current_value and returning it as a character, and line 18 prints it on the screen.
1. A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,
Code 1.15
Let us try to do something similar with a generator and compare. Noticeably, to achieve the same result – especially the part with the loop – we built much simpler code using the generator approach (lines 1-3) compared to the iterator one.
1. def ascii_iterator():
2. for i in range(65, 91):
3. yield chr(i)
4.
5. my_letters = ascii_iterator()
6.
7. for letter in my_letters:
8. print(letter, end=",")
Code 1.16
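Worth mentioning: for simple cases like this, Python also offers generator expressions, a compact inline syntax that produces the same kind of generator object without a def. A minimal sketch:

# a generator expression equivalent to the ascii_iterator function:
my_letters = (chr(i) for i in range(65, 91))

for letter in my_letters:
    print(letter, end=",")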

Functions
So far, we have learned how to write the hello world function. Let us try to dive a little deeper into function definitions and understand how Python deals with different styles of functions.
1. def foo(arg1, arg2):
2. return arg1 + arg2
3.
4. foo(1,2)
Code 1.17
Let us analyze for a minute how a typical Python function is organized. As highlighted in the following figure (item 1), we called our test function foo. Of course, the argument values (items A and B) are flexible, and you can call your function with any argument values you want.

Figure 1.3: Example of function and its arguments


1. The function name should not reuse any of the registered Python function names - please read the official Python documentation regarding standard function and keyword names8. For instance, there is a built-in function called sum – whenever you register your own function with the same name, for example def sum(arg1, arg2), this is going to shadow the default Python sum(*args). Each time in your Python code when you then call sum, thinking you are using the standard Python sum function, you will not be calling the original Python function but the one you defined – your custom function (see the sketch after this list). That can lead to many unexpected issues and errors in your code. Our advice: when editing, an IDE such as VSC already highlights built-in names - that should be a helpful indicator that the name you are trying to use is already registered in Python.
2. Without diving too deep into details and very complex explanations, let us make it clear for easier coding: a function name cannot have spaces in it, so using hello world is not correct. Please use hello_world.
3. The same applies to using special characters beyond standard ASCII or
non-alphanumeric characters.
4. Naming convention - there are some standards on this topic. A professor at my university who taught us the Pascal programming language used to say, “what is beautiful regarding standards – you have so many of them.” What we mean by this is that some experienced developers will tell you to name functions MyFunction, others myFunction or my_function. To make it easier and cleaner, we describe in detail in the next chapter what standards we will use in this book.
5. The function’s name should be explicit and tell you exactly what is happening inside it. Try to avoid any abbreviations – use full names. Do not fear using long names – a good Python IDE, such as VSC, can deal with them well and show you your functions in a file tree; by browsing through them, you can understand much more easily what these functions do just by looking at the name.
6. A good habit of every experienced developer is to use docstrings – this part of the function is where you should descriptively tell other developers what is happening in this function and what is expected to happen in case of an exception. Docstrings these days are parsed by IDEs, and by pointing the mouse cursor at a function in the code, you can see the mentioned docstring in a preview. This helps with understanding the code that you are reading.
7. Finally, use English words from the standard English dictionary.
Developers from the Python community will expect others to do the
same.
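To illustrate the shadowing problem from point 1, here is a short sketch you can try in the shell:

print(sum([1, 2, 3]))    # 6 – the built-in sum

def sum(arg1, arg2):     # from here on, this shadows the built-in
    return arg1 + arg2

print(sum(1, 2))         # 3 – our custom function is called instead
# print(sum([1, 2, 3]))  # would now fail: missing required argument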
If you are interested in understanding more of this topic – you can read the official Python coding and style guides9. We will see in the next chapter how to organize your coding desk so that it is much easier for you to deal with all these principles in an automated manner.
In the previous example, you could see a basic function definition - it expects 2 arguments as input and returns the arithmetic sum of the two. As you can see, Python does not force you to define the types of these arguments. So calling the function like foo(1, “lalala”) is possible, but it will raise an exception because Python does not know how to add these given arguments when reaching line no 2. It may sound obvious, but we must make it clear: in Python, it is your responsibility as a developer to protect your code against unexpected argument types in functions – especially the ones publicly exposed.
What we mean here is: imagine writing a library that you want to publish on GitHub for other developers. You should, as much as possible, protect the public methods of your library against unexpected and uncontrolled crashes.
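As a minimal sketch of such defensive coding (the function name and error message are ours, purely illustrative), a public function can validate its arguments before using them:

def add_numbers(arg1, arg2):
    """Return the sum of two numeric arguments, rejecting anything else."""
    if not isinstance(arg1, (int, float)) or not isinstance(arg2, (int, float)):
        raise TypeError("add_numbers expects two numbers")
    return arg1 + arg2

print(add_numbers(1, 2))      # 3
# add_numbers(1, "lalala")    # raises TypeError with a clear message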
We got a little sidetracked here, so let us get back to the function definition itself. We know how to write a function with defined arguments. Now let us focus on a case where the number of arguments is not strictly defined. In Python, you can do this in a special way, as in the following example. You can see that only one argument is given in the function definition, but this one comes with an asterisk.
1. def foo(*args):
2. return sum(args)
3.
4. foo(1,2,3,4,5,6,7,8)
Code 1.18
What does this give us? We can call the function with as many arguments as we want – which gives us lots of flexibility, especially since those arguments are internally collected into a tuple. Calling foo(1,2,3) internally assigns something like (1,2,3) to the variable args, so when inside your code you want access to the 2nd parameter of the function call, you can easily do this by using args[1] in your code.
1. def foo(arg1, arg2, **kwargs):
2. return arg1 + arg2 + sum([v for v in kwargs.values()])
3.
4. foo(2, arg2=5, v1=10, v2=15, v3=5)

1. def foo(arg1, arg2, *args, **kwargs):
2. result = arg1 + arg2 + sum(args)
3. if kwargs and kwargs.get('multi'):
4. return result * kwargs['multi']
5. return result
6.
7. foo(2, 5, 1, 2 ,3, multi=10)
8. foo(2, 5, 1, 2 ,3, ignore_arg=123)
Code 1.19
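For completeness, the double-asterisk form collects any extra keyword arguments into a plain dictionary. A minimal sketch of our own:

def show(**kwargs):
    # kwargs is an ordinary dict of the extra keyword arguments
    for name, value in kwargs.items():
        print(name, "=", value)

show(a=1, b=2)   # prints: a = 1  then  b = 2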
All looks clear so far, we hope. Now, we will show you one trick in Python that can improve the readability of your code, especially if you want to expose parts of your code to the public. In the following example, you can see the above code improved with type hints. The Python runtime itself does not require those to be used, but when exposing public methods, hints can be understood by IDEs and presented to other developers in a very clean and understandable fashion. You can read more about it in the official Python docs10.
1. def foo(arg1: int, arg2: int, *args: int, **kwargs: int) -> int:
2. result = arg1 + arg2 + sum(args)
3. if kwargs and kwargs.get('multi'):
4. return result * kwargs['multi']
5. return result
6.
7. value_1 = foo(2, 5, 1, 2 ,3, multi=10)
8. value_2 = foo(2, 5, 1, 2 ,3, ignore_arg=123)
9.
10. print(value_1)
11. print(value_2)
Code 1.20
Now, once we have the basics explained, let us show you how a function can use variables. In Python, like in other mature languages, you have a few options to define and control the scope of your variables:
local
nonlocal
global
1. my_variable = "started with value"
2. def scoped_function():
3. def run_global():
4. global my_variable
5. my_variable = "global eggs"
6.
7. def run_local():
8. my_variable = "i love eggs"
9.
10. def run_nonlocal():
11. nonlocal my_variable
12. my_variable = "nonlocal eggs"
13.
14. my_variable = "test value"
15. run_local()
16. print(f"After local assignment: {my_variable}")
17. run_nonlocal()
18. print(f"After nonlocal assignment: {my_variable}")
19. run_global()
20. print(f"After global assignment: {my_variable}")
21.
22. print(f"Started with value: {my_variable}")
23. scoped_function()
24. print("In global scope:", my_variable)
Code 1.21
In this example, we have shown you a few aspects of using variables and how to protect them against accidental overwriting. In example Code 1.21, we also managed to overwrite the global variable. How does it work? Let us analyze it. Line 1 – you can see a variable defined in the main root scope. Then, in line 23, we call the function scoped_function, which, as the first thing, creates in line 14 a local variable – with the same name as in the root context, but this one is scoped locally. We then run a function (line 15), which again reuses the same variable name, but scoped to that function’s body (line 8), so any use or changes there access the variable my_variable at that function’s level.
1. Started with value: started with value
2. After local assignment: test value
3. After nonlocal assignment: nonlocal eggs
4. After global assignment: nonlocal eggs
5. In global scope: global eggs
Code 1.22
Line 11 tells Python that whatever happens next in the function body regarding the variable my_variable will occur at the scope of the scoped_function level. Something similar is stated in line 4, but there the global statement tells Python that we are referring to the variable my_variable in the global context.
As we have seen, the global context of the variable is used with that statement. We need to clarify something: global does not only mean within the current Python file being executed. It means accessing the variable in its module’s global namespace, regardless of where it is initially used. Check the following example to understand it better. Let us create 3 files that declare and use the same variable.
Code file func1.py:
1. def foo():
2. SOME_VAR = "x"
Code 1.23
Code file func2.py:
1. SOME_VAR = "abc"
2.
3. def foo2():
4. global SOME_VAR
5. print(SOME_VAR)
Code 1.24
Code file func3.py:
1. from func1 import foo
2. from func2 import foo2, SOME_VAR
3.
4.
5. def foo3():
6. global SOME_VAR
7. SOME_VAR = "here i am"
8.
9. foo()
10. foo2()
11. foo3()
12. foo2()
13. foo()
14.
15. print(SOME_VAR)
Code 1.25
To run the preceding example, we just execute python3 func3.py. We will see the result presented below. You can see that the global keyword can be useful when working with modules and packages, but it can also be a curse. You should be careful when using it.
1. abc
2. abc
3. here i am
Code 1.26
We have to be careful when defining variables that are used across different modules and can be imported globally. If in some module we import a variable from another module’s context (like in example Code 1.25 with the variable SOME_VAR), then access it with a global statement and overwrite its content, our application can start behaving unexpectedly – in some other part of the code we use that variable as module-scoped, but we have overwritten it as global in the current module, or vice versa. Debugging such an issue can give you a lot of unnecessary stress. So again – be careful with module context variables versus global variables.
We suggest avoiding shared global variables that are updated across modules as much as possible, unless you have no other choice. Treat them as read-only; where you need to share values across modules, return them from function calls instead.
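A minimal sketch of the safer pattern suggested above – returning new values from functions instead of mutating a shared global:

def updated_label(current):
    # return a new value instead of rebinding a shared global
    return current + " (updated)"

label = "started with value"
label = updated_label(label)
print(label)   # started with value (updated)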
Classes
In the previous sections, we have seen how to write clean and proper function definitions and how to deal with local and private variables. We also showed you how to use global variables and share them across different modules. That was needed to understand the basics of operating with variables on multiple scoping levels.
1. class Point:
2. x: int
3. y: int
Code 1.27
Above, you can see an example of a simple class definition. We can create an object from it in a pretty simple way, as shown in Code 1.28. In its line 1, we create an instance of the class. In the next lines, we assign values to its internal attributes. By extending this concept, we can start using these attributes (class variables) in a more sophisticated way, which we will understand later in the chapter.
1. p = Point()
2. p.x = 20
3. p.y = 15
Code 1.28
Here, let us stop for a minute and clarify a few things before we continue our fun with classes and their world. Python classes and their namespaces do not overlap each other. The same goes for functions. You can define the same class name or function name in two different namespaces and use these classes; there is no relation between them. Check example Code 1.29.
1. from super.awesome.module import caller
2. from another.awesome.module import stuff
3.
4. p1 = stuff.Point()
5. p2 = caller.Point()
Code 1.29
We have used some pseudocode here to demonstrate how Python allows you to encapsulate class or function definitions at the namespace level. The namespaces in this case are the imports in lines 1-2.
Going back to classes – maybe it is obvious, but we would like to emphasize it again: a class definition must be executed before it can take any real effect – the same as with functions. When class execution happens, Python creates a local namespace; thus, all references to it are encapsulated. Class instances do not share their attributes and method definitions, as the following small example shows.
1. class Point:
2. x: int
3. y: int
4.
5. p1 = Point()
6. p1.x = 10
7. p1.y = 5
8.
9. p2 = Point()
10. p2.x = 10
11. p2.y = 5
12.
13. print(p1 == p2) # False
14. print(isinstance(p1, Point)) # True
15. print(isinstance(p2, Point)) # True
Code 1.30
Please notice the preceding example, where we tried to show you how namespaces work in classes. To be precise, it demonstrates that instances of classes are fully encapsulated and do not share any namespace – lines 13-15. In the earlier example, Code 1.29, we saw that even when you import the same-named function or class from 2 different places, Python keeps a private namespace for each, as long as they are accessed through their modules.
Now, we will see the wrong import technique and when namespaces can
overlap with each other.
1. from super.awesome.module.caller import Point
2. from another.awesome.module.stuff import Point
3.
4. p1 = Point()
5. p2 = Point()
6.
7. print(p1 == p2) # False
8. print(isinstance(p1, Point)) # True
9. print(isinstance(p2, Point)) # True
Code 1.31
Lines 1-2 look almost identical to the imports in Code 1.29. The key difference is that we import Point directly instead of indirectly via a module, like caller.Point as we did in Code 1.29. The import done in line 2 overwrites the imported name Point from line 1. Thus, as a result, when we initialize the class instances (lines 4-5), we refer to the Point class imported in line 2.
In conclusion, be extra careful when you import modules, classes, and functions. A little more explanation of how to properly import packages is given further in the next subchapter - Modules and Packages.
As mentioned earlier, a class has a definition and its instances. When a class is initialized (its instance is created), Python calls a special auto-executed method named the constructor. The constructor can take the attributes you want to assign in the class instance, or contain some piece of code you would like to auto-execute as part of construction. Here is an example of how to use this technique.
In the following example, the class’s constructor forces the developer to initialize the class with 2 mandatory parameters.
1. class Foo:
2.
3. def __init__(self, arg1, arg2):
4. self._argument_1 = arg1
5. self._argument_2 = arg2
6. self.print_me()
7.
8. def print_me(self):
9. """Example printing of constructor args"""
10. print(f"argument 1: {self._argument_1}")
11. print(f"argument 2: {self._argument_2}")
Code 1.32
Once given, the initializer assigns them to class attributes (lines 4-5) and
calls the internal method print_me.
1. f = Foo(10, 15)
2. argument 1: 10
3. argument 2: 15
4.
5. f.print_me()
6. argument 1: 10
7. argument 2: 15
Code 1.33
With that approach, you can control the logic of a class instance when it is created. It is, for sure, a powerful way of driving logic inside the encapsulated class. Here, let us explain something about class attributes being addressed by the constructor. Below, we see a modified version of the above example with a slightly different constructor.
There are two main ways of initializing class attributes via the constructor – something we call lazy initialization, presented in Code 1.32, and non-lazy (explicit) initialization. The difference is in the use. Imagine you have a constructor where you specify some attributes, and you would like to drive logic based on their values – that is what we do in Code 1.34. You can notice in lines 9-10 that if the condition is not fulfilled, the class instance keeps the default value defined in line 4. This way of defining class attributes is the non-lazy initializer.
1. class Foo:
2.
3. argument_1 = 5
4. argument_2 = 6
5.
6. def __init__(self, arg1, arg2):
7. self.print_me()
8. self.argument_1 = arg1
9. if arg2 > self.argument_2:
10. self.argument_2 = arg2
11. self.print_me()
12.
13. def print_me(self):
14. """Example printing of constructor args"""
15. print(f"argument 1: {self.argument_1}")
16. print(f"argument 2: {self.argument_2}")
Code 1.34
We suggest using non-lazy initializers since they are safer than lazy initializers. Imagine you, by mistake, do not initialize some class attribute in the constructor because of a bug in your code, and then you try to use that attribute in other methods – this will lead to a fatal exception. With a non-lazy initializer, you, as a developer, are certain that initial values are defined and will not lead to fatal exceptions when you try to access them.
We showed how to assign attributes in a class via the constructor, and now we will look at a few things you should be aware of as a developer. In the earlier examples, we initialized instance attributes via the constructor and were able to assign new values to them from outside. This is a good technique if public access to class attributes is what you want as a developer. But what to do if you want to hide these attributes from public access and only allow the user to change them via the constructor or a public method?
In the following example, notice that we use a single underscore as a prefix for the attribute name. This technique is called a protected attribute and is also applicable to class methods, that is, def _print_me(self).
1. class Foo:
2.
3. def __init__(self, arg1, arg2):
4. self._argument_1 = arg1
5. self._argument_2 = arg2
6. self.print_me()
7.
8. def print_me(self):
9. """Example printing of constructor args"""
10. print(f"argument 1: {self._argument_1}")
11. print(f"argument 2: {self._argument_2}")
Code 1.35
What does it give us? Theoretically, marking the class attributes as protected should prevent developers from accessing them publicly. In practice, let us try something like the following example, Code 1.36 – you will quickly discover that the command dir prints out all the possible attributes of the class instance, even those that are protected – so, knowing that, a developer can call protected methods directly and bypass the class API.
It is fair to say that protected attributes can be accessed like “normal” ones without any special tricks, but…
1. f = Foo(5,7)
2. dir(f)
Code 1.36
By Python community standards, if you run the dir function on some class instance and see any methods or attributes marked as protected – please do not access them directly. The reason is that such a method or attribute can be changed in the next release of the library/class that you are currently using. So there is a high risk of breaking compatibility in your code when you upgrade related libraries, if you use some of their protected attributes directly.
You already know how to “protect” your attributes from public access, but we have discovered that even when protected, they can still be exposed. How can we really protect a variable or method? Let us look at private attributes.
1. class Foo:
2.
3. def __init__(self, arg1, arg2):
4. self.__argument_1 = arg1
5. self.__argument_2 = arg2
6. self.__print_me()
7.
8. def __print_me(self):
9. """Example printing of constructor args"""
10. print(f"argument 1: {self.__argument_1}")
11. print(f"argument 2: {self.__argument_2}")
Code 1.37
A private attribute or method allows you to hide your logic in an encapsulated class instance. This is a good way of building software or libraries where you expose to the public only those parts that should be accessible and hide the ones that should be protected against accidental overwriting or damaging your library’s logic.
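Worth knowing: Python implements private attributes through name mangling, so they are hidden rather than truly inaccessible. A short sketch, assuming the Foo class from Code 1.37:

f = Foo(5, 7)
# __argument_1 is stored under a mangled name that includes the class name:
print(f._Foo__argument_1)   # 5 – mangling discourages access, it does not forbid it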
Starting with the constructor, we showed you how to deal with class attributes. What if we want to run some logic when the class instance is being destroyed? The Python VM takes care of running a garbage collector and removing unused objects to free up memory. Still, there are cases when a developer wants to delete a class instance on demand. For example, it takes too much memory, and we want to free it up before processing other things with a heavy memory footprint. This is how you can do it:
1. class Foo:
2.
3. def __init__(self, arg1, arg2):
4. self.__argument_1 = arg1
5. self.__argument_2 = arg2
6. self.__print_me()
7.
8. def __print_me(self):
9. """Example printing of constructor args"""
10. print(f"argument 1: {self.__argument_1}")
11. print(f"argument 2: {self.__argument_2}")
12.
13. def __del__(self):
14. self.__argument_1 = None
15. self.__argument_2 = None
16. print('bye bye!')
Code 1.38
In lines 13-16, we assign empty values to the attributes instantiated in the class constructor and print some text to inform the developer that the destructor has been executed properly. Lines 4-6 in the following example show what happens when running line 2 – del is the actual way of deleting a class, variable, or attribute in Python, which in the class instance case calls the destructor from the class body.
1. f = Foo(1, 5)
2. del(f)
3.
4. argument 1: 1
5. argument 2: 5
6. bye bye!
Code 1.39
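One caveat worth knowing: in CPython, del only removes a reference; __del__ runs when the last reference to the instance disappears. A short sketch, assuming the Foo class from Code 1.38:

f = Foo(1, 5)
g = f        # a second reference to the same instance
del f        # no 'bye bye!' yet – g still references the object
del g        # the reference count drops to zero, now __del__ runs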

Modules and packages


Python has built-in packages and modules11 you can use in your project. Packages are a group of modules organized in a structure, and a module itself is just a file with Python code, which can contain functions, classes, variables, and so on.
If you want to organize your packages and modules, you can follow Code 1.40. Packages are a way of structuring Python modules and their file tree. For importing them, we use dotted notation, that is, from my_module.sub.sub.sub import some_class. Below is an example of a pseudo package and how it is organized.
1. sound/ Top-level package
2. __init__.py Initialize the sound package
3. formats/ Subpackage for file format conversions
4. __init__.py
5. wavread.py
6. wavwrite.py
7. aiffread.py
8. aiffwrite.py
9. auread.py
10. auwrite.py
11. ...
12. effects/ Subpackage for sound effects
13. __init__.py
14. echo.py
15. surround.py
16. reverse.py
17. ...
18. filters/ Subpackage for filters
19. __init__.py
20. equalizer.py
21. vocoder.py
22. karaoke.py
23. ...
Code 1.40
Please notice that every subdirectory has __init__.py in it, which indicates to Python that the folder contains Python files Python can search for. That is its basic meaning; what is important to notice is that this file is also the public interface of the package.
Let us assume you want to publicly expose from the effects package only the echo module – the rest should be hidden – check the example below.
1. __all__ = ['echo']
2.
3. SOME_VARIABLE = 'abc 123'
4.
5. def some_function(arg: int) -> str:
6. return str(arg)
Code 1.41
In line 1, we explicitly exposed the echo module, as we wanted. Next, in line 3, we defined a variable, and in lines 5-6 a function; both live directly in the effects package. When a developer imports with from sound.effects import *, Python will import only the names listed in __all__ – here, the echo module – while the variable from line 3 and the function from line 5 stay reachable through explicit imports. And nothing stops a developer from importing explicitly, like from sound.effects.surround import function_foo. In that case, the public interface in __init__ does not hide that part, but at least for imports with *, it allows you to control what is being imported. Where can this technique help?
Protect namespaces from accidentally overwriting variables, functions, or classes
Hide unnecessary modules from accidental imports
A cleaner namespace for module imports
Expose public logic semi-automatically when importing a module – see the sketch below
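To see the star-import behavior in action, here is a minimal sketch assuming the sound package from Code 1.40 with the __init__.py from Code 1.41:
1. from sound.effects import * # brings in only what __all__ lists
2.
3. print(echo) # the echo module was imported
4. # print(surround) # NameError – hidden from the star import
5.
6. # explicit imports of the package-level helpers still work
7. from sound.effects import some_function
8. print(some_function(42)) # prints: 42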
So far, we talked about absolute module imports, but how about package-internal module imports – let us say the surround module wants to import the function foo from the filters/vocoder module?
1. from sound.filters.vocoder import foo
2. from ..filters.vocoder import foo
Code 1.42
Please notice the main difference between these two import techniques. The first is a direct (absolute) import starting from the package and descending into its modules. The second is an intra-package (relative) import that navigates the file structure with dot notation, that is,
1. from . import echo
2. from .. import formats
3. from ..filters.vocoder import foo
Code 1.43
Now that we have learned how to do absolute and intra-package imports, and what packages and modules are, we can explain how Python knows where to look for these modules.
Python has a few runtime environment variables12 you can use to your advantage. First, when importing, Python reads the environment variable PYTHONPATH to know where to search for packages. When it is empty (the default), Python looks in the installation directory of its dependencies (check the PYTHONHOME environment variable) and the site-packages there. Then, it checks the local file structure, that is, the current working directory.
If you need to change the library path for importing, you can overwrite PYTHONPATH, run your Python script from one directory, and import stuff from another. That is very useful when working on a package you plan to release as an installation candidate – you do not want to reinstall it every time you make a change you want to test.
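You can inspect the search path yourself; a minimal sketch (the /home/user/mylibs path is purely illustrative):
1. import sys
2.
3. # the directories Python searches, in order, when resolving imports;
4. # entries coming from PYTHONPATH appear near the front of the list
5. print(sys.path)
Saving it as, say, show_path.py and running PYTHONPATH=/home/user/mylibs python show_path.py would print the extra directory near the front of the list.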
Error handling
As a very mature language, Python has an advanced system for catching and raising exceptions. If you do not know what an exception is in programming, we will briefly introduce the concept. Let us understand it through the following example – a simple function that divides a by b, where b is converted to float (line 2).
1. def foo(a, b):
2. print(a/float(b))
3.
4. print(foo(1, 2))
5. print(foo(1, 0))
Code 1.44
Let us run this program and see what happens. As you can see in the following example, it crashed badly. This is because in line 5 (Code 1.44) we tried to divide by zero, which led our code to crash.
1. 0.5
2. None
3.
4. ---
5.
6. ZeroDivisionError Traceback (most recent call last)
7. Cell In [31], line 5
8. 2 print(a/float(b))
9. 4 print(foo(1, 2))
10. ----> 5 print(foo(1, 0))
11.
12. Cell In [31], line 2, in foo(a, b)
13. 1 def foo(a, b):
14. ----> 2 print(a/float(b))
15.
16. ZeroDivisionError: float division by zero
Code 1.45
Our example code crashed with the exception (line 16) ZeroDivisionError. Since we know our code crashed (lines 10-16), and Python even points us to where the issue is, how can you, as a developer, protect your code against such unexpected events? The answer is simple – we can catch exceptions with try/except blocks. Below, we modified the same code to guard against fatal crashes and unexpected input.
1. def foo(a, b):
2. try:
3. return (a/float(b))
4. except ZeroDivisionError:
5. print("We don't know how to devide by zero")
6. except Exception as e:
7. print(f"Something unexpected happened, details: {e}")
8.
9. print(foo(1, 2))
10. print(foo(1, 0))
11. print(foo(1, "lalala"))
Code 1.46
We added returning the proper value of the division (line 3) when everything is correct. In the case of dividing by zero, we catch the exception and print a proper message (lines 4-5), and when something unexpected happens (line 11 – where the 2nd argument of the function call is a string instead of a number), we catch this corner case and print a proper message too.
The preceding code helps you react to unexpected situations, especially lines 6-7, which catch all exceptions other than the divide-by-zero case. Still, the question is – is there a better way of validating arguments before using them, or maybe raising an exception before something unexpected happens?
The following code is a bit cleaner. Check lines 2-3 – we use assertions to verify that both arguments are numbers – and lines 5-6, where we raise an exception when the 2nd argument is negative.
1. def foo(a, b):
2. assert isinstance(a, (int, float)), "Argument 2 must be number"
3. assert isinstance(b, (int, float)), "Argument b must be number"
4.
5. if b < 0:
6. raise Exception("Sorry but 2nd argument
must be greater than zero")
7.
8. return (a/float(b))
Code 1.47
We run the above code with foo(1, 2), which gives the proper and expected result (lines 1-2, Code 1.48). When we run foo(1, 0), the argument passes the assertions (lines 2-3, Code 1.47) and the negativity check (line 5, Code 1.47), so execution reaches the division itself, and Python prints the traceback of the resulting exception with an indicator of exactly where it was raised (lines 4-16, Code 1.48).
1. In [1]: foo(1,2)
2. Out[1]: 0.5
3.
4. In [2]: foo(1,0)
5. ---------------------------------------------------------------------------
6. ZeroDivisionError
Traceback (most recent call last)
7. <ipython-input-12-157e21edc363> in <module>
8. ----> 1 foo(1,0)
9.
10. <ipython-input-9-a157a0d2288f> in foo(a, b)
11. 6 raise Exception
("Sorry but 2nd argument must be greater than zero")
12. 7
13. ----> 8 return (a/float(b))
14. 9
15.
16. ZeroDivisionError: float division by zero
Code 1.48
Next, let us run a case where argument b is a string, which is not an expected way of calling the function – this case will be caught by the assert statement.
1. In [1]: print(foo(1, "lalala"))
2. ---------------------------------------------------------------------------
3. AssertionError
Traceback (most recent call last)
4. <ipython-input-13-812be92c8833> in <module>
5. ----> 1 print(foo(1, "lalala"))
6.
7. <ipython-input-9-a157a0d2288f> in foo(a, b)
8. 1 def foo(a, b):
9. 2 assert isinstance(a, (int, float)),
"Argument 2 must be number"
10. ----> 3 assert isinstance(b, (int, float)),
"Argument b must be number"
11. 4
12. 5 if b < 0:
13.
14. AssertionError: Argument b must be number
Code 1.49
Pretty cool, right? With assertions, you as a developer can check argument types, values, instance types, and much more. If something does not match your expectations, the assertion stops the code's execution by raising an exception.
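One caveat worth remembering: assertions can be stripped at runtime (Python removes them when run with the -O optimization flag), so they should guard against developer mistakes rather than validate real input. A minimal sketch of the more defensive style:
1. def foo(a, b):
2.     # unlike an assert, this check survives `python -O`
3.     if not isinstance(b, (int, float)):
4.         raise TypeError("Argument b must be a number")
5.     return a / float(b)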
Code style
The previous section talked about coding standards and where to find them. Remembering all the standards and how and when to apply them in your code can be pretty challenging, especially if you have to refactor some legacy code and you are not sure whether a coding standard used by another developer was correct. But there is a way through this chaos – a few things worth remembering. Please also check Chapter 2, Setup Python Environment, where we describe how to organize Python and its tools on your work machine – some of the tools mentioned there can simplify the whole process of writing proper and clean syntax.
Variables
Now, we will share a few simple rules for naming your variables properly
for cleaner understanding and correct notation according to pep8.
1. Name your variables with underscores (snake case notation) and as explicitly as possible. Do not use any:
a. Hungarian notation, for example arru8NumberList = [1,2,3]
b. Camel case notation, for example CamelCase = 5
Please use snake case – lower case with underscores and standard English dictionary words. For example, my_variable_name = 5
2. Do not use names from the standard Python library as your own variable names. For instance, sum = 5 ← this will overwrite the built-in Python sum function, and everywhere in your code where you try to call the sum function, it will fail with a fatal exception. So be careful with these.
3. Do not make names too long – variable names are hard to read if they are longer than the working space in your IDE.
4. If you plan to keep the variable read-only, use capital letters for its name, that is, MY_CONTENT_FOR_NAME = "John".
Functions
As we have already learned the fundamentals of how to construct a function body, we will now dive a little deeper into the details of how to write functions in a clean manner. First, let us start with an example of bad code.
1. def FoO(ARgINt8):
2. myTmpV = ARgINt8 * 20
3. return OtherFoO2(myTmpV/2)
4. def OtherFoO2(ARgINt8):
5. return ARgINt8*0.1
6. if __name__ == "__main__":
7. print(FoO(4))
Code 1.50
The preceding code is bad syntax because it is hard to read. As you can see in the function definitions, we broke the main rule of using snake case names. Please notice how difficult it is to follow such code. Another problem is that there is no space between the function bodies and their uses. The author also did not add any comments to the function definitions, so whoever uses this code can have a hard time reading it. The code also uses Hungarian notation – for example, in the function arguments.
Of course, in the above context we used a trivial example of simple code, but imagine something complex written in such a way – following such code can give you a hard time as a developer. How do we fix this? Let us take a closer look at the same code written cleanly and properly.
1. def simple_calculator(in_multiplier):
2. """This is pretty simple function that deliver some math."""
3. my_temp_var = in_multiplier * 20
4. return moving_comma(my_temp_var/2)
5.
6.
7. def moving_comma(data):
8. return data*0.1
9.
10.
11. if __name__ == "__main__":
12. print(simple_calculator(4))
Code 1.51
You probably already noticed, by comparing line numbers, that the above example is a bit longer – but does it matter? Yes, it does. First of all, adding those two extra blank lines after a function definition is correct according to pep8 standards and makes your code cleaner – it gives some light to so much text. That aspect is super important, especially if you have to read and analyze lots of code. Trust me on this – such a small detail can make a big difference.
So why did we say it matters? Well, besides what we mentioned above, adding those spaces does not impact Python's effectiveness. Remember, Python compiles your code on the fly – so adding 1 or 5 extra lines or spaces for better readability does not make a serious difference to Python, but it does to you. Cleaner = easier to follow.
Another thing worth mentioning is what we highlighted before: use variable names that are descriptive rather than crazy acronyms. Trust me on this – reading even your own code after a long time is much easier if variable and function names tell you what they store or return.
Classes
Writing a class cleanly follows pretty much the same rules we already learned in the previous subchapters about variables and function definitions. Now we need to re-apply these rules to the class definition. Let us take a closer look at an example class definition and its use.
1. import pickle
2. from collections.abc import Callable
3. from typing import NewType
4.
5. MyObject = NewType('UserId', Callable[[], str])
6.
7.
8. class Serializer(object):
9.
10. def __init__(self, compression=False, compression_level=6, use_zlib:
bool=False,
11. pickle_protocol=pickle.HIGHEST_PROTOCOL):
12. """
13. Initializer, expected arguments:
14. - compression - True, means zip compression is going to be used
15. - compression_level - compression level
16. - use_zlib - True, means using zlib library
17. """
18. self.comp = compression
19. self.comp_level = compression_level
20. self.use_zlib = use_zlib
21. self.pickle_protocol = pickle_protocol
or pickle.HIGHEST_PROTOCOL
22. if self.comp:
23. if self.use_zlib and zlib is None:
24. raise ConfigurationError('use_zlib specified, but zlib module '
25. 'not found.')
26. elif gzip is None:
27. raise ConfigurationError
('gzip module required to enable '
28. 'compression.')
29.
30. def _serialize(self, data: str) -> str:
31. """Serialize given data to pickle reprezentation"""
32. return pickle.dumps(data, self.pickle_protocol)
33.
34. def _deserialize(self, data: str) -> Callable[[], str]:
35. """Deserialize pickled object to its original state"""
36. return pickle.loads(data)
37.
38. def serialize(self, data: Callable[[], str]):
39. data = self._serialize(data)
40. if self.comp:
41. if self.use_zlib:
42. data = zlib.compress(data, self.comp_level)
43. else:
44. data = gzip_compress(data, self.comp_level)
45. return data
46.
47. def deserialize(self, data: MyObject) -> str:
48. if self.comp:
49. if not is_compressed(data):
50. logger.warning('compression enabled but message data does n
ot '
51. 'appear to be compressed.')
52. elif self.use_zlib:
53. data = zlib.decompress(data)
54. else:
55. data = gzip_decompress(data)
56. return self._deserialize(data)
Code 1.52
Let us analyze what is happening in the above source code. First, please notice the class statement – as we already learned, in Python it begins the class definition, and everything defined in its body must follow some basic rules. The following example shows how the class is used:
1. data = {"key1": "some value"}
2. s = Serializer()
3. serialized_data = s.serialize(data)
4. s.deserialize(serialized_data)
Code 1.53
Packages
We have already mentioned packages earlier in this chapter:
They have their own namespace, so you must be aware of how namespaces work
Functions, variables, or classes with the same names but coming from different namespaces must be imported in such a way that they do not overwrite each other
Now let us focus on how to import components from different packages in the cleanest way. There are a few simple rules to follow (see the sketch after the example below):
Import in alphabetical order
Organize your imports in 3 groups:
Python system imports
3rd party modules
The imports from your project
Import only those things that you need in your current working file – do not overcomplicate imports
Never import with a star, like line 2 in the following example:
1. from typing import Dict, List
2. from some.package import *
Code 1.54
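For contrast, a minimal sketch of an import section organized by those rules (the module and package names are illustrative):
1. # group 1: Python system imports
2. import json
3. import os
4.
5. # group 2: 3rd party modules
6. import requests
7.
8. # group 3: imports from your project
9. from my_project.utils import some_helper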
Conclusion
In this chapter, we learned the basics of Python. We also saw how to write clean code in a fashion that follows community standards. We went through some important topics like classes, exceptions, and modules that we are going to use when programming some mini projects in the next chapters of this book.
In the next chapter, before we get some hands-on code, we will learn how to properly organize a workbench for our Python stack. We will see what kind of tools professionals use and how they can work for us.
1. https://www.mathworks.com/products/matlab.html
2. https://twisted.org/
3. https://www.python.org/downloads/release/python-3107 – release notes and link to manual
4. https://peps.python.org/pep-0008/ – link to the PEP specification about proper code formatting
5. https://code.visualstudio.com – download Visual Studio Code
6. https://docs.python.org – Python full documentation
7. https://en.wikipedia.org/wiki/Fibonacci_number
8. https://docs.python.org/3/tutorial/stdlib.html – Python standard library documentation
9. https://peps.python.org/pep-0008/#function-and-variable-names – how to write proper and clean functions based on the official Python coding guideline
10. https://docs.python.org/3/library/typing.html – official Python manual for using typing and data types
11. https://docs.python.org/3/library/index.html – Python standard modules
12. https://docs.python.org/3.10/using/cmdline.html#envvar
CHAPTER 2
Setting up Python Environment
Introduction
Python has come a long way since its beginning. Starting as a scripting yet very powerful language, it became a very advanced programming ecosystem that can run on almost any operating system. This power and flexibility bring some challenges for every developer – how to write clean code that is structured, follows strong standards, and lets the developer extend the language's capabilities by adding libraries and Python extensions.
In this chapter, we will learn some basics regarding Python. We will also go through the challenges and features of Python and learn how to organize our development workbench with all the necessary tools to start effective coding.
Structure
In this chapter, we will cover the following topics:
Clean and proper Python workbench
Python in Linux
Python in Windows
Controlling projects
Packages – working with external libraries – libraries under control
Organized clean code
Validate code quality – pylint and flake8
Working with integrated development environment (IDE)
Auto applying code quality fixes – pre-commit
Build your own library
Objectives
In this chapter, we will learn how to make our Python code its best. We will learn how to install and use 3rd party libraries in the most organized way. Next, we will dive into the topic of building our own deployable libraries for Python.
In the end, we will wrap everything up using automation that will help us simplify the process of software development. At the same time, we will briefly see how to integrate a software version control system like Git1 with Python quality automation tools.
Clean and proper Python workbench
“Indeed, the ratio of time spent reading versus writing is well over 10 to 1.
We are constantly reading old code as part of the effort to write new code. ...
[Therefore,] making it easy to read makes it easier to write.” ― Robert C.
Martin, Clean Code: A Handbook of Agile Software Craftsmanship
In Chapter 1, Python 101, we concentrated on how to write code in Python
and do it cleanly. This chapter will teach us how to prepare an ideal
workbench and organize work to make working with Python fun.
This chapter is divided into two main scopes: Linux and Windows. The reason is that even though Python is a cross-platform programming language, it has some operating-system-specific caveats in how you use certain modules.
The main module we will use to control Python projects is virtualenv2. It is not a silver bullet for all package-management issues, and it has some concerns regarding operating systems, libraries, or CPU architecture.
Python in Linux
We will learn how to organize the craftsman's desk so that you can manage multiple projects on the same computer and use different Python versions if needed. Let us start with the basics; the modern line of the Python language is version 3+, for example, 3.10, which we installed for this chapter.
You can still find some projects that use the Python 2.x line, albeit it has officially been announced as discontinued3. To control which Python instance you use for which project, we strongly suggest you start using pyenv4. As we can read on the project's GitHub page:
"Pyenv lets you easily switch between multiple versions of Python. It is simple, unobtrusive, and follows the UNIX tradition of single-purpose tools that do one thing well."
To simplify, we are installing pyenv – a versioning system that allows us to install different Python versions under the same roof. Please notice that the following only applies to Unix-based systems (macOS and Linux). For Windows, we will take care of multiple versions of Python in a slightly different way. Jump to the next section if you are using Windows OS.
1. $ git clone https://github.com/pyenv/pyenv.git ~/.pyenv
2. $ cd ~/.pyenv && src/configure && make -C src
Code 2.1
1. $ echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
2. $ echo 'command -
v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >>
~/.bashrc
3. $ echo 'eval "$(pyenv init -)"' >> ~/.bashrc
Code 2.2
With the preceding code, you clone the pyenv repository and compile it on your local system. We assume you use the default bash shell on a Linux system; in that case, you must tell your shell how to autoload the pyenv stack.
After installing pyenv, let us install Python 3.7 and 3.10. Having two different Python versions under the same system will give us flexibility. Once this part is done, you will see three installed Python versions: first on top is the one installed system-wide in your operating system; the 2nd and 3rd are those we just installed with pyenv, under each of which we ran our hello world program.
1. $ pyenv install 3.10.4
2. $ pyenv install 3.7.4
Code 2.3
1. $ pyenv versions
2. * system (set by /home/darkman66/.pyenv/version)
3. 3.7.4
4. 3.10.4
5. $
Code 2.4
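To actually switch between the installed versions, pyenv offers global and per-directory selection; a quick sketch using the versions installed above:
1. $ pyenv global 3.10.4 # default Python for your user
2. $ pyenv local 3.7.4 # pin this directory (writes a .python-version file)
3. $ python --version
4. Python 3.7.4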
Python in Windows
Installation of Python on Windows is not difficult; you just download the installer from the Python website5 and install it with the installation wizard. That is one way of installing Python in the Windows ecosystem; the other approach is installing different Python versions using PowerShell6 and pyenv7.
If you do not want to install Bash or PowerShell and want to use a native CLI for Windows, we suggest using cmder8. It is a CLI tool with many great features, including git support. Once you have installed Python, you can initialize virtualenv:
1. virtualenv stuff1
2.
3. Using base prefix 'c:\\users\\hub\\appdata\\local\\programs\\python\\pyth
on37'
4. New python executable in C:\Users\hub\Desktop\cmder\stuff1\Scripts\p
ython.exe
5. Installing setuptools, pip, wheel...
6. done.
Code 2.5
Once we have initialized virtualenv (line 1), we can start using it and installing packages. Refer to the following example:
1. λ C:\Users\hub\Desktop\cmder\stuff1\Scripts\activate.bat
2.
3. C:\Users\hub\Desktop\cmder
4. (stuff1) λ
Code 2.6
Controlling projects
Python has so many flexible ways of controlling projects and their dependencies that we could probably write a separate chapter about it. Here, we will share one practical way to control projects and their dependencies.
To make this little mess a bit cleaner and keep all projects and their
dependencies clean, we will use a Python module called
virtualenvwrapper9.
By using this module, you will be able to:
Keep all your project-related dependencies in a single clean place.
Track projects and their libraries.
Easily destroy and recreate virtual envs.
Experiment with different Python libraries.
Installation of virtualenvwrapper is simple; it will be installed as a system-wide accessible module. That means you will have access to all the virtualenvwrapper commands at the level of your system.
1. pip install virtualenvwrapper
2. echo 'export VIRTUALENVWRAPPER_PYTHON=/usr/local/bin/pyth
on' >> ~/.bashrc
3. echo 'export WORKON_HOME=$HOME/.virtualenvs' >> ~/.bashrc
4. echo 'export VIRTUALENVWRAPPER_VIRTUALENV=/usr/local/bin
/virtualenv' >> ~/.bashrc
5. source /usr/local/bin/virtualenvwrapper.sh
Code 2.7
You can use the preceding tool whenever you open a bash terminal. Windows users will need to install the bash system extension to use this tool; you can visit the Microsoft blog10, where you will find more details regarding the installation process:
Let us create two virtual environments, as above, for our hello world program. We do this to demonstrate the feasibility of using two different Python versions on the same machine. It is important to remember that each Python version is compiled on the local machine, so if you have any issues installing it with pyenv, ensure all the necessary system libraries (see the apt command below) are installed before continuing the pyenv installation.
After successfully installing pyenv and virtualenvwrapper, we can list all the virtualenvs available in our system by typing lsvirtualenv.
1. $ sudo apt install -y wget build-essential libreadline-dev \
2. libncursesw5-dev libssl-dev libsqlite3-dev tk-dev libgdbm-dev \
3. libc6-dev libbz2-dev libffi-dev zlib1g-dev
Code 2.9
To start working in the selected virtualenv, type workon hello1, and that is it. All your Python binaries and libraries will point to the location of the virtualenv you selected.
An alternative approach is to use virtualenv with pyenv directly. For example, install Python 3.7.4 and initialize a virtualenv.
1. $ pyenv install 3.7.4
2. $ pyenv virtualenv 3.7.4 your_venv_name
3. $ pyenv activate your_venv_name
4. $ pyenv version
Code 2.10
Libraries under control
Once you have successfully managed to work with virtualenv, it is time to start installing some packages. For the needs of this exercise, let us assume we have a project that requires the following packages:
1. backports.shutil-get-terminal-size==1.0.0
2. blinker==1.4
3. bugsnag
Code 2.11
To install the preceding packages, you could run pip install <package name> for each individually. There are a few issues with this approach. Installing lots of packages one by one is cumbersome, time-consuming, and ineffective. It can also lead to many side effects – for instance, imagine we install matplotlib11, which requires numpy12. Then we install the pandas13 package, which also requires numpy. In that case, we can accidentally overwrite the version of numpy we installed with matplotlib a minute ago. Even if you spend a lot of time installing packages this way, it is extremely hard to control which package versions get installed.
Thankfully, pip14 has a way to install packages from a file. It means you can create a file with a list of packages you want to install, or, if you have already installed some packages, you can create such a file automatically.
1. $ pip freeze > requirements.txt
Code 2.12
That will create a requirements.txt file listing all the currently installed 3rd party libraries. To install them, you simply run the command below. The requirements file can specify library versions, sub-packages, or even the Python version itself, and there is a way to specify that you want to install some library from its source.
1. $ pip install -r requirements.txt
Code 2.13
The following requirements example shows how many use cases you can
cover with such a requirement file.
1. # you can use comments in requirements file
2. pytest
3. pytest-cov
4. beautifulsoup4
5.
6. # The syntax supported here is the same as
that of requirement specifiers.
7. docopt==0.6.1 # you can specify specific version of library
8. requests[security] >= 2.8.1, == 2.8.* ; python_version < "2.7" # only w
hen Python version is lower than 2.7
9.
10. # you can install from external URL as zip file
11. urllib3 @ https://github.com/urllib3/urllib3/archive
/refs/tags/1.26.8.zip
12.
13. # you can refer to other requirements file(s)
14. -r some-other-requirements.txt
15.
16. # It is possible to refer to specific local distribution paths.
17. ./downloads/requests-1.0.2.whl
18.
19. # It is possible to refer to URLs and hash/branch or tag version
20. -e git+https://github.com/psf/[email protected]#egg=requests
Code 2.14
The file with the required libraries can be part of your source control, making it easy to track changes over time. Organizing requirements the way we showed in Code 2.14 gives us, as developers, better confidence that the 3rd party libraries our project needs are organized in a clean and explicit list. Refer to the following example of requirements.txt, and you will see that there can be a few caveats to the above approach:
1. django >= 4.1.3
2. pillow <= 9.3.0
Code 2.15
In the following code, we are installing the Python packages that we defined in Code 2.15. To install these packages, we created the requirements.txt file with the content listed in Code 2.15 and ran pip install -r requirements.txt; in the following example, we can see the output of the installation process.
1. Collecting django>=4.1.3
2. Downloading Django-4.1.3-py3-none-any.whl (8.1 MB)
3. ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.1/8.1
MB
447.2 kB/s eta 0:00:00
4. Collecting pillow<=9.3.0
5. Downloading Pillow-9.3.0-cp38-cp38-
manylinux_2_28_x86_64.whl (3.3 MB)
6. ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.3/3.3
MB
318.4 kB/s eta 0:00:00
7. Collecting sqlparse>=0.2.2
8. Downloading sqlparse-0.4.3-py3-none-any.whl (42 kB)
9. ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.8/42.
8 kB
271.2 kB/s eta 0:00:00
10. Collecting backports.zoneinfo
11. Downloading backports.zoneinfo-0.2.1-cp38-cp38-many
linux1_x86_64.whl (74 kB)
12. ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74.0/74.
0 kB
258.0 kB/s eta 0:00:00
13. Collecting asgiref<4,>=3.5.2
14. Downloading asgiref-3.5.2-py3-none-any.whl (22 kB)
15. Installing collected packages: sqlparse, pillow, backports.zoneinfo, asgir
ef, django
16. Successfully installed asgiref-3.5.2 backports.zoneinfo-0.2.1 django-
4.1.3 pillow-9.3.0 sqlparse-0.4.3
17.
18. [notice] A new release of pip available: 22.2.2 -> 22.3.1
19. [notice] To update, run: pip install --upgrade pip
Code 2.16
Installing those two packages from the source list goes smoothly, so where is the problem? The issue is that we only specified the two main packages we want to install. Once installation starts, pip begins installing dependent libraries, which in itself is not a bad thing. What is worth noticing here is that there is one potential point of failure in this flow (Code 2.16). When we specify, let us say, the Django package in our requirements (Code 2.15, line 1), pip starts installing the libraries that Django has specified in its own requirements file (setup.py)15 – let us say sqlparse in a specific version. Next, pip goes to the next line in the requirements file (pillow in this case) and starts installing that library, which most likely has its own requirements file, and… what if one of those requirements is also the sqlparse library? In this case, pip checks – do we already have an installed sqlparse that is at least the same or a higher version as the one specified by the pillow project – and if so, it continues to the next requirement.
Everything is installed, and we are happy developers now. We can start working on the project, but it crashes when we try to start it.
We visit the Django project website and check the package requirements in the installation file. We see that what Django installs has no direct impact on our project and is not the source of our problem. Next, we visit the Pillow project website and check what required packages it installs. We find that the library maintainer made a mistake and did not specify in the requirements which version of the sqlparse library the project needs – and the one we installed with Django is incompatible in this case.
You can imagine how annoying such a process is. The above is one example, but there can be many more – for instance, when pip, while installing a package and its dependencies, downgrades a library that was already installed by a previous installation. In many cases, this leads to a situation where everything appears to work, albeit not stably, or does not work at all. One way or the other, this is not a desired situation.
We suggest using pip-tools to make this chaos of dependencies and sub-dependencies16 easier to control. Installation is trivial; just run the following code.
1. $ pip install pip-tools
Code 2.17
Next, we can take our requirements file and rename it to requirements.in. We can still keep the requirements defined there – minimum library versions, the Python version, installation from source, and so on. The difference between pip-tools and pip itself is that we now run pip-tools to analyze our requirements.in and prepare the final requirements.txt containing all the requirements resolved to concrete libraries and versions. If there are any mutually exclusive requirements or version conflicts, pip-tools will detect that and raise an alert about it.
1. $ pip-compile -v --output-file=requirements.txt requirements.in
Code 2.18
Once you run the above command, you get a clean and validated
requirements.txt file that you can use to install requirements as usual with
the pip command.
1. #
2. # This file is autogenerated by pip-compile with python 3.9
3. # To update, run:
4. #
5. # pip-compile --output-file=requirements.txt req.in
6. #
7. asgiref==3.5.2
8. # via django
9. django==4.1.3
10. # via -r req.in
11. pillow==9.3.0
12. # via -r req.in
13. sqlparse==0.4.3
14. # via django
Code 2.19
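pip-tools also ships a companion command, pip-sync, which makes the environment match the compiled file exactly – it installs what is pinned and uninstalls anything that is not listed:
1. $ pip-sync requirements.txt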
Clean code
As explained in Chapter 1, Python 101, Python is a bit unique in that its syntax uses indentation instead of clear markers for each block's beginning and end. That said, you need proper tools to format and check your code so it is always consistent. For instance, you should not have indentation blocks where one starts with four spaces (Figure 2.1, lines 16-18) and another goes with three spaces (Figure 2.1, line 20), as shown in the following figure:
Figure 2.1: Example of lines indentation
Notice we used inconsistent indentation and different ways of writing function doc strings. This is just an example of wrong syntax, and there can be more the deeper you go into the code. To make your life easier as a Python developer, and to become a good coder, you should install a few tools on your local system that will help you by working in the background and auto-formatting wrong code.
There are plenty of these tools, but these are our recommendations:
flake817
pylint18
black19
pre-commit20
Flake8
Install the preceding tools using pip and use them to control your code. Suppose we have a project with terrible syntax, like the following example, that we would like to clean. Let us see how you can do it on a few levels to understand what works better.
1. # -*- coding: utf-8 -*-
2. from pprint import pprint
3. def łąka(p):
4. pprint(p)
5.
6. łąka("some message")
Code 2.20
When you run flake8, you will immediately see what is wrong with your file. Please notice that you can run the flake8 command on an individual file or on an entire folder containing the whole project. Let us check how to execute the flake8 command in the following example.
1. $ flake8 my_app.py
Code 2.21
It is probably obvious that running the flake8 command only makes sense on Python files; other files can either confuse flake8 or give you very misleading results.
1. my_app.py:7:1: E305 expected 2 blank lines after
class or function definition,found 1
2. my_app.py:8:1: W391 blank line at end of file
Code 2.22
Our tool is very clear in indicating where the problems in our source file are. Once you fix them, you can re-run flake8 until all is clear. There are cases, though, where you must use some syntax that should not be analyzed by flake8; there are a few ways to do it. Let us check the following example.
1. $ flake8 --select E113,W505 my_app.py
Code 2.23
To extend the predefined list of ignored errors rather than overwrite it (as selecting specific checks in Code 2.23 does), we can run the flake8 command as in the following example:
1. $ flake8 --extend-ignore E113,W505 my_app.py
Code 2.24
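A third option works at the level of a single line: flake8 honors noqa comments, either blanket or scoped to specific error codes, as in this minimal sketch:
1. import os, sys # noqa: E401 - silence only "multiple imports on one line"
2. x=1 # noqa - silence every flake8 warning on this line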
Passing what we want to validate or ignore as command line parameters can be a terrible idea, especially if you want to repeat the same presets for multiple projects and share them with others. That is why flake8 supports saving presets in a config file. The configuration21 can be saved in:
the top-level user directory
the project directory
With supported formats: setup.cfg, tox.ini22, or .flake823. Flake8 reads config from files using the Python config parser module24.
1. [flake8]
2. ignore = D203
3. exclude =
4. .git,
5. __pycache__
6. max-complexity = 10
7. max-line-length = 120
Code 2.24
Pylint
Pylint is comparable to flake8 and helps keep coding standards based on pep825. It has many more cool features, like detecting errors and repeated blocks of code that can lead to antipatterns. Pylint can also help refactor code and draw UML diagrams representing your code26. We execute pylint on our example code as follows.
1. $ pylint my_app.py
Code 2.25
The output of running pylint is shown in the following example:
1. ******\******* Module my_app
2. my_app.py:8:0: C0305: Trailing newlines (trailing-newlines)
3. my_app.py:1:0: C0114: Missing module docstring (missing-module-
docstring)
4. my_app.py:4:0: C0116: Missing function or method docstring (missing-
function-docstring)
5. my_app.py:4:11: C0103: Argument name "p" doesn't conform to snake_
case naming style (invalid-name)
6. my_app.py:4:0: C2401: Function name "łąka" contains a non-
ASCII character, consider renaming it. (non-ascii-name)
7.
8. ---
9.
10. Your code has been rated at 0.00/10
Code 2.26
As you can notice in the above output, pylint dives much deeper into the code when analyzing source files. It not only checks the syntax recommended by pep8, but also does shallow security checks and code analysis (lines 4-6), and at the end (line 10), it prints out an overall score for your code quality.
Pylint's analysis is very valuable because this tool inspects pieces of your code and catches many potential issues. For example, it checks circular references, unused variables, unimported modules, dividing by zero, etc.
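Like flake8, pylint can be silenced selectively with inline comments when a warning reflects a deliberate choice; a minimal sketch using our example function:
1. def łąka(p): # pylint: disable=non-ascii-name, invalid-name
2.     """Keep the non-ASCII name on purpose; pylint is told to accept it."""
3.     print(p)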
IDE
Since we agreed that, for the needs of this course, we will be using VSC27 for coding and managing projects, you can also integrate the tools we introduced to improve the quality of your code.
Figure 2.2: Installing pylint plugin for VSC
It is trivial to install pylint for Visual Studio Code. On the left side of the menu, choose Extensions and search for the word pylint. You can choose any of the plugins shown above, albeit Linter is more advanced and can check plus auto-fix your syntax while typing, which for some developers is a valuable thing to have.
The IDE status bar shows valuable information when you configure it with Python. Our personal recommendation is to always work in virtual environments; in this case, you can install the pylint module not only for the IDE but also for each separate environment.
Figure 2.3: Visual Studio Code status bar with much useful information for everyday development
After installing flake8, the IDE will check your syntax as you type and highlight all possible problems, as shown in the following figure:
Figure 2.4: Installing additional quality checker for VSC – flake8
It is important not to mix tools in a single IDE and not install flake8 and
pylint simultaneously because it can lead to many unwanted issues.
Pre-commit
If you have worked with any source control system before, you know Git28. This distributed open-source version control system supports many plugins, git flow29, and pre- and post-commit systems called hooks. The idea behind hooks is that you can automatically run shell scripts or even entire standalone applications before or after the actual commit.
Figure 2.5: Example code life cycle with commits and branching out when a pre-commit script is in use
The preceding figure shows the life cycle of hooks being executed upon each commit. As you can see, we can execute a script each time you commit code to your local branch (which can also be remote). Why did we mention git hooks? Because you can use them to auto-analyze the quality of your code.
1. $ ~/work/fun-with-python/ ll .git
2. total 72
3. 18B Nov 19 15:55 COMMIT_EDITMSG
4. 96B Dec 9 23:01 FETCH_HEAD
5. 21B Nov 19 15:55 HEAD
6. 41B Nov 22 17:45 ORIG_HEAD
7. 310B Oct 16 20:46 config
8. 73B Oct 16 20:46 description
9. 372B Oct 29 22:05 fork-settings
10. 480B Oct 16 20:46 hooks
11. 1.9K Nov 22 21:58 index
12. 96B Oct 16 20:46 info
13. 128B Oct 16 20:46 logs
14. 1.6K Dec 9 23:01 objects
15. 112B Oct 16 20:46 packed-refs
16. 160B Oct 16 20:46 refs
Code 2.27
We can see in code example 2.27 that when we list the contents of the .git directory in our project, there is a sub-folder called hooks. Let us take a look at what is inside in the following example. You will quickly notice that this subfolder has many action files (hooks) that the git version control system calls depending on the action you are performing – for instance, the pre-commit file is always called upon every single commit you perform. That happens regardless of where the action takes place – whether we perform the commit action (committing code changes to the git repository) from a CLI or a GUI client, the git subsystem is going to execute the pre-commit script30.
1. $ ~/work/fun-with-python/ ll .git/hooks
2. total 120
3. 478B Oct 16 20:46 applypatch-msg.sample
4. 896B Oct 16 20:46 commit-msg.sample
5. 4.6K Oct 16 20:46 fsmonitor-watchman.sample
6. 189B Oct 16 20:46 post-update.sample
7. 424B Oct 16 20:46 pre-applypatch.sample
8. 1.6K Oct 16 20:46 pre-commit.sample
9. 416B Oct 16 20:46 pre-merge-commit.sample
10. 1.3K Oct 16 20:46 pre-push.sample
11. 4.8K Oct 16 20:46 pre-rebase.sample
12. 544B Oct 16 20:46 pre-receive.sample
13. 1.5K Oct 16 20:46 prepare-commit-msg.sample
14. 2.7K Oct 16 20:46 push-to-checkout.sample
15. 3.6K Oct 16 20:46 update.sample
Code 2.28
As you can see in the preceding listing, we can create a pre-commit file and make it execute pylint on all the files in the repository. Such a script looks like the following example:
1. #!/bin/sh
2.
3. set -e
4.
5. pylint --rcfile=./config.rc 2>&1
Code 2.29
In line 3, we tell the operating system shell (that is, bash) that if pylint detects any issues in line 5, it should immediately stop our pre-commit script; as a side effect, Git will abort committing the changes into the code repository.
We additionally passed pylint an option pointing at a config file. It is fully optional, but you can keep it and adjust the config file to your needs; there you can add all the error or warning messages you want to ignore during pylint checks.
This solution has a few challenges:
Pylint will analyze all files – the execution time of such a script on big projects can make each commit very inefficient.
Analyzing all files will also include non-Python files.
Not having pylint installed will also crash the commit process, which is a challenge when someone only wants to edit a non-Python file and yet must install Python with pylint
Let us try to improve the same script by adding a few additional checks so it
can address the preceding issues and keep the same quality of code checks:
1. #!/bin/sh
2.
3. set -e
4.
5. FILES_CHANGED=$(git diff --name-only --diff-
filter=ACM origin/main | grep "\.py" || true)
6.
7. if [ -n "$FILES_CHANGED" ]; then
8. echo $FILES_CHANGED | xargs pylint --rcfile=./config.rc 2>&1
9. fi
Code 2.30
As we can notice in Code 2.30, we added line 5 (compared to Code 2.29), where we check, using a git command, which files were changed – that is, the files the current git commit added or updated in the working repository. Having a list of those files, we can run pylint against just them, which:
There is one catch – please pay attention to the grep expression (Code 2.30, line 5). We are trying to find all the files that end with the .py extension in any of the current working folders – for example, some_directory/my_file.py. That is the base concept, but it has a logical bug. The regexp will match all files that have .py anywhere in their name – for example file1.py, but also file.py_some_name.txt. We can easily see that the extension has nothing to do with it: the regular expression matches any string that has .py in it, which means it will also catch files like .pyc, .pyi, etc. Now we can see where the logical bug is. How do we fix it? We have to update Code 2.30 with the following fixes:
1. #!/bin/sh
2.
3. set -e
4.
5. FILES_CHANGED=$(git diff --name-only --diff-
filter=ACM origin/main | grep -E "\.py$" || true)
6.
7. if [ -n "$FILES_CHANGED" ]; then
8. echo $FILES_CHANGED | xargs pylint --rcfile=./config.rc 2>&1
9. fi
Code 2.31
What if you want to run many code validators or similar checks, but you want to logically split these actions across multiple scripts, where not all of them are shell scripts? It can be done by calling subscripts from the main pre-commit script, but that can be hard to track. The Python community always comes to the rescue with its massive library of projects and solutions – in this case, a project called pre-commit31. It helps organize code checkers and cleaners.
We install this module by simply running the following code.
1. $ pip install -U pre-commit
Code 2.32
Once we have it installed in our Python libraries, we need to install it as a git pre-commit hook. Let us do this by running the command below.
1. $ pre-commit install
Code 2.33
The module injects itself into the standard git pre-commit hook, giving you more superpowers, which you can use by writing a pre-commit configuration.
Let us assume we have a project where we installed the pre-commit module and have the pylint script (as shown in Code 2.31) installed under config/pylint.sh, and a similar one installed under config/flake8.sh that does the same thing but executes flake8 instead. Let us check what the pre-commit configuration file looks like in the following example:
1. default_stages: [commit, push]
2. repos:
3.
4. - repo: https://github.com/pre-commit/pre-commit-hooks
5. rev: v4.3.0
6. hooks:
7. - id: trailing-whitespace
8. - id: end-of-file-fixer
9. - id: check-json
10. - id: check-yaml
11. - id: debug-statements
12. - id: check-merge-conflict
13. - id: detect-private-key
14. - id: end-of-file-fixer
15. - id: pretty-format-json
16. args: [--autofix]
17. - id: no-commit-to-branch
18. args: [--branch, master]
19. - repo: https://github.com/ambv/black
20. rev: 22.10.0
21. hooks:
22. - id: black
23. args: [--line-length=120]
24. - repo: local
25. hooks:
26. - id: pylint
27. name: Pylint
28. stages: [push]
29. description: Run pylint
30. entry: ./config/pylint.sh
31. language: script
32. types: [python]
33. pass_filenames: false
34. - id: flake8
35. name: Check flake8
36. stages: [push]
37. description: Run flake8
38. entry: ./config/flake8.sh
39. language: script
40. args: [local]
41. types: [python]
42. pass_filenames: false
Code 2.34
This config file, .pre-commit-config.yaml, is located in the main root of your project folder. To learn all the configuration parameters and their meaning, you can check the pre-commit project website.
We have a hooks section: this is where you define the scripts you want to run. You can also name these scripts, so when the pre-commit scripts start, you know exactly when each of them gets executed.
In lines 19-23, we intentionally added the module called black32 and configured the line length it should use when auto-formatting code. Black is a fantastic code formatter that can automagically fix most of the styling issues we mentioned in Chapter 1, Python 101.
Additionally, we enabled, via the config, a few other very helpful features of pre-commit. Check lines 6-18; some of these can even help you autoformat JSON and YAML files, which, as most of us know, are so painful to read if they are not formatted well. Committing now produces output like the following:
1. trim trailing whitespace..................................Passed
2. fix end of files..........................................Passed
3. check json............................(no files to check)Skipped
4. check yaml............................(no files to check)Skipped
5. debug statements (python).................................Passed
6. check for merge conflicts.................................Passed
7. detect private key........................................Passed
8. fix end of files..........................................Passed
9. pretty format json....................(no files to check)Skipped
10. don't commit to branch....................................Passed
11. black.....................................................Passed
12. Run pylint................................................Passed
13. Check flake8..............................................Passed
14. [bugfix/some-branch-name ee9cdc99] test commit
15. 1 file changed, 1 insertion(+)
Code 2.35
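You do not have to wait for an actual commit to exercise the hooks; pre-commit can also be run by hand, which is handy when adopting it on an existing code base:
1. $ pre-commit run --all-files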
Build your own library
So far in this chapter, we have been using libraries that we installed with the pip tool. In this subchapter, we will learn how to build our own library and publish it to a pip repository.
Before building our own library, we need to create a file structure like in the following figure:
Figure 2.6: Example of directory and file structure for our example library
The main thing to notice is that we created the package simple_calculator as we learned in the previous chapter. Our simple calculator module (my_calculator.py) is going to look like the following:
1. class MyCalculator:
2.
3. def __init__(self, x):
4. self.x = x
5.
6. def add_me(self, y):
7. return self.x + y
8.
9. def substract_me(self, y):
10. return self.x - y
11.
12. def divide_me(self, y):
13. if y != 0:
14. return self.x / float(y)
15. raise Exception('Hey hey, 2nd argument can not be 0')
16.
17. def multiple_me(self, y):
18. return self.x * y
Code 2.36
We created a very simple calculator with a few helper methods in it. The class constructor gets one argument that is used in the calculator operations. Let us update the main.py file to demonstrate how to use the add_me method. Let us analyze the following example:
1. from simple_calculator.my_calculator import MyCalculator
2.
3. m = MyCalculator(5)
4. print("result", m.add_me(5))
Code 2.37
We have a first example using the newly created calculator module (Code 2.37, line 1). To make this cleaner, let us update the __init__.py file with the following code:
1. from .my_calculator import MyCalculator
Code 2.38
This kind of import makes coding cleaner and defines the publicly accessible modules of our package in a stricter way. With this change, we must also update the main.py file as in the following example:
1. from simple_calculator import MyCalculator
2.
3. m = MyCalculator(5)
4. print("result", m.add_me(5))
Code 2.39
The next step is to add the requirements that are going to be installed as part of our module. To do so, we have to update the requirements.txt file as in the following example.
1. pytest~=7.4
Code 2.40
We will install the pytest33 module as part of our package. This is not strictly required by the module-building standards, albeit it is a highly recommended practice to include tests with our module so the community can accept it more easily when reviewing it.
To organize tests in a clean way, we keep all of them in one single folder called tests. Let us create such a folder with files as listed in the following figure:
Figure 2.7: Example of test file
Having created the tests folder with the two files shown in Figure 2.7, let us run the tests. How to run them is shown in the following code example.
1. $ pytest tests
2. ============================= test session starts
========================
3. platform darwin -- Python 3.10.12, pytest-7.4.0, pluggy-1.2.0
4. rootdir: (…)chapter_2/simple_calculator
5. plugins: anyio-3.7.0
6. collected 0 items
7.
8. ========================== no tests ran in 0.01s
=========================
Code 2.40
We can notice that running tests from the tests folder did not produce any test results, since there are no tests in it yet. That is obviously expected, since we created empty files. Let us add some basic tests. In the following example, we mainly test the basic functionality of our package. Let us update the test_basics.py file:
1. import pytest
2. from simple_calculator import MyCalculator
3.
4. def test_adding():
5. m = MyCalculator(5)
6. assert m.add_me(5) == 10
7.
8. def test_substract():
9. m = MyCalculator(5)
10. assert m.substract_me(5) == 0
11.
12. def test_divide():
13. m = MyCalculator(5)
14. assert m.divide_me(2) == 2.5
15.
16. def test_divide_with_zero():
17. m = MyCalculator(5)
18. with pytest.raises(Exception) as exc:
19. m.divide_me(0)
20. assert str(exc.value) == 'Hey hey, 2nd argument can not be 0'
21.
22. def test_multiple_me():
23. m = MyCalculator(3)
24. assert m.multiple_me(3) == 9
Code 2.41
Before we dive into running the tests themselves, let us quickly analyze what is happening in Code 2.41. We created a few tests (functions) that exercise all the basic operations our module provides. We have tests for:
Add (line 4), comparing the result with the expected value (line 6)
Subtract (line 8)
Divide (lines 12 and 16), where in lines 18-20 we divide by zero, which our module does not accept; we check that the module properly handles this case by raising an exception (line 18), and then we check the exception message (line 20)
Multiply (line 22)
Knowing what happens in the tests – and noting that the values used for the calculations and comparisons are just examples – we can run the testing framework to verify that the tests pass.
1. $ pytest tests
2. ========================== test session starts
===========================
3. platform darwin -- Python 3.10.12, pytest-7.4.0, pluggy-1.2.0
4. rootdir: (...)/chapter_2/simple_calculator
5. plugins: anyio-3.7.0
6. collected 5 items
7.
8. tests/test_basics.py ..... [100%]
9.
10. ========================= 5 passed in 0.01s
==============================
Code 2.42
This time, running pytest with our tests shows that 5 tests (line 6) were detected, as expected, and 100% of them passed (lines 8-10). One of the benefits of providing tests as part of a library we share with other developers is to prove that our library works; the tests also show how to use it, which can be valuable as documentation.
The next step in creating our custom library is to update our empty setup.py file with content that can be properly recognized by pip or setuptools34.
1. from setuptools import setup, find_packages
2.
3. # for encoding
4. from codecs import open
5. from os import path
6.
7. WORKING_DIR = path.abspath(path.dirname(__file__))
8.
9. with open(path.join(WORKING_DIR, 'README.md'), encoding='utf-
8') as f:
10. readme_docs = f.read()
11.
12. # This call to setup() does all the work
13. setup(
14. name="simple_calculator",
15. version="0.1.0",
16. description="Example simple calculator",
17. long_description=readme_docs,
18. long_description_content_type="text/markdown",
19. url="https://fun-with-python-example-
simple-calculator.readthedocs.io/en/latest/",
20. author="Hubert Piotrowski",
21. author_email="[email protected]",
22. license="MIT",
23. classifiers=[
24. "Intended Audience :: Developers",
25. "License :: OSI Approved :: MIT License",
26. "Programming Language :: Python",
27. "Programming Language :: Python :: 3",
28. "Programming Language :: Python :: 3.6",
29. "Programming Language :: Python :: 3.7",
30. "Programming Language :: Python :: 3.8",
31. "Programming Language :: Python :: 3.9",
32. "Operating System :: OS Independent"
33. ],
34. packages=["simple_calculator"],
35. include_package_data=True,
36. install_requires=["pytest"]
37. )
Code 2.43
We created a standard setup file which is going to be used to generate a distribution package for Python. What is interesting to notice in this definition file is the requirements that the package has (line 36). That means that when a user installs the package, pip will install the listed additional packages as its dependencies.
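As a side note, pytest is only needed to run the tests, not to use the calculator itself. Setuptools also supports declaring such test-only dependencies separately via extras_require; a minimal sketch, assuming the same package layout as in Code 2.43, could look like this:

from setuptools import setup

setup(
    name="simple_calculator",
    version="0.1.0",
    packages=["simple_calculator"],
    # test-only dependencies; contributors install them explicitly with:
    #   pip install ".[test]"
    extras_require={"test": ["pytest"]},
)

With this variant, a plain pip install pulls in no extra packages, while anyone who wants to run the test suite installs the test extra.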
We can use this setup configuration to build our package locally and to create a distributable version of our library called a wheel35. Before we can start building, we have to install a helper module that will allow us to build wheels, as in the following example.
1. $ pip install build
Code 2.44
Once we have this installed, we can build the distributable version of our simple_calculator library as in the following example.
1. $ python -m build
2.
3. running egg_info
4. writing simple_calculator.egg-info/PKG-INFO
5. writing dependency_links to simple_calculator.egg-
info/dependency_links.txt
6. writing requirements to simple_calculator.egg-info/requires.txt
7. writing top-level names to simple_calculator.egg-info/top_level.txt
8. reading manifest file 'simple_calculator.egg-info/SOURCES.txt'
9. adding license file 'LICENSE'
10. writing manifest file 'simple_calculator.egg-info/SOURCES.txt'
11. * Building sdist...
12. (...)
13. removing build/bdist.macosx-13-arm64/wheel
14. Successfully built simple_calculator-0.1.0.tar.gz and simple_calculator-
0.1.0-py3-none-any.whl
Code 2.45
As you can see, the build command managed to generate 2 files – a compressed version of our library (tar.gz) and a distributable wheel version (whl), both located under the newly, automatically created folder dist.
To be able to install the library we have created, we can use a tool that we are already familiar with – pip. Let us run the following example:
1. $ pip install dist/simple_calculator-0.1.0-py3-none-any.whl
Code 2.46
Running pip is not only going to install our module into the current Python global namespace or virtualenv (depending on how we run the pip command), but it will also make sure that the dependencies required by our module are met. In our case we required pytest to be installed (Code 2.43, line 36), so pip is going to install it automatically.
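To double-check that the installation worked, we can import the freshly installed package from any directory; for example, reusing the API exercised by our tests in Code 2.41:

$ python -c "from simple_calculator import MyCalculator; print(MyCalculator(5).add_me(5))"
10

The printed value matches the expectation from test_adding, which confirms that pip installed the wheel correctly.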

Conclusion
In this chapter, we learned fundamental principles of how to organize a Python environment to make it as efficient as possible. Next, we went deeper into the technicalities of how to run a Python project on the most popular operating systems. We also learned how to prepare our work as a custom library which can be shared with other developers and become part of something big.
In the next chapter, we are going to build our first small yet powerful Python project. We will also design and build a real working example, where you develop, step by step, something that actually works and shows you how much fun you can have with Python.

1. https://git-scm.com
2. Virtualenv - https://docs.python.org/3/tutorial/venv.html
3. https://www.python.org/doc/sunset-python-2/
4. Pyenv - https://github.com/pyenv/pyenv
5. https://www.python.org/downloads/windows/
6. https://learn.microsoft.com/en-
us/powershell/scripting/install/installing-powershell-on-windows?
view=powershell-7.3
7. Pyenv for Windows - https://pyenv-win.github.io/pyenv-win/
8. Native CLI for Windows - https://cmder.app
9. Virtualenvwrapper -
https://virtualenvwrapper.readthedocs.io/en/latest/
10. Microsoft blog - https://devblogs.microsoft.com/commandline/bash-
on-ubuntu-on-windows-download-now
11. https://matplotlib.org
12. https://numpy.org
13. https://pandas.pydata.org
14. Python package manager - https://pypi.org/project/pip
15. https://docs.python.org/3/distutils/setupscript.html
16. Install pip tools https://pypi.org/project/pip-tools/
17. Checking code syntax - https://flake8.pycqa.org/en/latest/
18. Checking code syntax - https://pylint.org
19. Auto formater - https://pypi.org/project/black/
20. Amazing git pre commit hooks - https://pre-commit.com
21. https://flake8.pycqa.org/en/3.9.2/user/options.html#options-list
22. We will talk about tox in next pages.
23. Please notice dot as prefix for filename.
24. https://docs.python.org/3/library/configparser.html
25. https://pep8.org
26. https://pylint.pycqa.org/en/latest/pyreverse.html
27. Visual Studio Code - https://code.visualstudio.com
28. https://git-scm.com
29. https://www.atlassian.com/git/tutorials/comparing-
workflows/gitflow-workflow
30. https://git-scm.com/book/en/v2/Customizing-Git-Git-Hooks
31. https://pre-commit.com
32. https://pypi.org/project/black/
33. https://docs.pytest.org/en/7.4.x/
34. https://setuptools.pypa.io/en/latest/
35. https://pythonwheels.com
Join our book’s Discord space
Join the book's Discord Workspace for Latest updates, Offers, Tech
happenings around the world, New Release and Sessions with the Authors:
https://discord.bpbonline.com

CHAPTER 3
Designing a Conversational
Chatbot

Introduction
In the previous chapters, we have learned many things about Python: its syntax, development tools, and how to control and deliver the best code of the finest quality. At the same time, we learned how to integrate Python with very useful tools that can help us as developers deliver code more easily and efficiently. In this chapter, we will start our web service project with some basics and fundamentals, and then we will move towards more complex topics.

Structure
In this chapter, we will discuss the following topics:
Client-server architecture
Chatbot basics
Training
Chat
Application
Frontend
Objectives
By the end of this chapter, you will have learned some fundamentals of client-server applications and how to write such an application using HTTP standards. When you finish this chapter, you will know how to build a client-server application, use Python to expose it as a browser-based HTTP service, and write an asynchronous web service.

Client-server architecture
We will use the most popular architecture in the web world, called client-server. As shown in the following figure, the client always sends requests for resources, assets, or pages, or asks the server to perform a specific task, and waits for the server to finish. Once the server completes processing the request, it responds back to the client with the result of the processed request.

Figure 3.1: Example HTTP messaging flow


Before starting to code our chatbot, we must quickly learn a few things. We assume you have some basic knowledge of HTTP and TCP/IP, so we will focus on the aspects needed for this chapter.
In the HTTP format,1 we have headers that the browser or client application sends to the server. On the server side, those received headers are processed, and the server can perform some required checks. For instance, as shown in the following figure, the client sends in its headers which server (resource) path it wants to access and which assets or communication protocol (HTTP version) the request is about.
One of the headers also tells the server what response language (in this case English) the client expects to receive. This means that if the server supports responses (i.e., HTML) in many languages and the customer requests English, the returned content is in the English language.
Figure 3.2 shows what the transaction is going to look like:

Figure 3.2: Example HTTP call with explanation of its sections

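To make this more tangible, a minimal HTTP/1.1 GET request carrying such headers could look like the following sketch (example.com is just a placeholder host):

GET /index.html HTTP/1.1
Host: example.com
Accept-Language: en

The first line carries the method, the resource path, and the protocol version, while the Accept-Language header expresses the language preference described above.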
Knowing how headers and the path with the requested method work,2 we can start building a simple web service. A request (the question the user asked) will be sent to the server, and the server, based on the request, will respond back. Let us try to build such an example using the Twisted3 framework:
1. from twisted.web import server, resource
2. from twisted.internet import reactor
3.
4.
5. class MyServer(resource.Resource):
6. isLeaf = True
7.
8. def render_GET(self, request):
9. uri_path = request.uri.decode('utf-8')
10. print(f"Received request for '{uri_path}'")
11. return f"Hello, world! {uri_path}".encode('utf8')
12.
13. service = server.Site(MyServer())
14. reactor.listenTCP(8083, service)
15. reactor.run()
Code 3.1
Using the Twisted framework, we wrote a simple web server that works in echo mode. That means that when we send a request to it, it will always respond with a "hello world" message that additionally contains the full request path we sent (Code 3.2, line 2).
To see how it works in action, check the following example – notice the test_path part. That is the requested server resource path (line 1), and it is repeated in the response message (line 2).
1. ~/ curl http://localhost:8083/test_path
2. Hello, world! /test_path
Code 3.2
You can see that the simple service can process HTTP GET calls. How about if we want to send some parameters to it and, based on those, prepare responses that depend on the given values? Since this is a GET call, we can process parameters in two ways:
By processing the request resource path.
By reading and checking the value of the query string.
Let us modify the main class MyServer based on the preceding requirements so that we can show you how to process incoming requests in Twisted and parse query parameters – let us check the following example called server_2.py:
1. from urllib import parse
2. from twisted.web import server, resource
3. from twisted.internet import reactor
4.
5. """
6. Web server with support for resource path and query string
7. """
8.
9. class MyServer(resource.Resource):
10. isLeaf = True
11.
12. def main_view(self):
13. return "main view"
14.
15. def hello_view(self, **kwargs):
16. if kwargs and kwargs.get('a') and kwargs.get('b'):
17. total = int(kwargs['a']) + int(kwargs['b'])
18. return f"Total sum: {total}"
19. return 'hello to you too'
20.
21. def convert_query_string(self, resource):
22. """Convert query strin to Python dictionary»»»
23. parsed_data = parse.urlparse(resource).query
24. return dict(parse.parse_qsl(parsed_data, keep_blank_values=True))
25.
26. def path_finder(self, request):
27. resource = request.uri.decode('utf-8')
28. query_kwargs = self.convert_query_string(resource)
29. parsed_data = parse.urlparse(resource)
30. resource_path = parsed_data.path
31. result = f"Sorry do not know you {resource}"
32.
33. if resource_path == '/':
34. result = self.main_view()
35. elif resource_path == '/hello':
36. result = self.hello_view(**query_kwargs)
37.
38. return result
39.
40. def render_GET(self, request):
41. output = self.path_finder(request)
42. if output:
43. return output.encode('utf8')
44. return b"Something went wrong"
45.
46. if __name__ == '__main__':
47. service = server.Site(MyServer())
48. reactor.listenTCP(8083, service)
49. reactor.run()
Code 3.3
You can see that we put the decision point for choosing the sub-method based on the resource path in the path_finder method (lines 33-36). Note that we are using the Python urllib4 module at the beginning, so do not forget the import:
1. from urllib import parse
Code 3.4

To be able to extract the query string5 from the full URI and convert it to a standard Python dictionary, we created the method convert_query_string (Code 3.3, lines 21-24). Notice that it will preserve blank values in the query string, that is, value=. In this case, passing a query string like, for instance, "?q1=some-value&q2=" is going to be converted to a dictionary like {"q1": "some-value", "q2": ""}.
This approach will be useful at a later stage of our chapter, when we will have to identify parameters given in the URL query string even when there are empty values.
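As a quick illustration of that behavior, the following sketch can be run on its own; it reuses the same urllib calls as convert_query_string:

from urllib import parse

query = parse.urlparse("/hello?q1=some-value&q2=").query
# keep_blank_values=True preserves the empty q2 parameter
print(dict(parse.parse_qsl(query, keep_blank_values=True)))
# prints: {'q1': 'some-value', 'q2': ''}

Without keep_blank_values=True, the empty q2 parameter would be dropped from the resulting dictionary entirely.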
To demonstrate how we can use our code with different use cases, please check the following example:
1. ~/ curl "http://localhost:8083/"
2. main view
3.
4. ~/ curl "http://localhost:8083/hello"
5. hello to you too
6.
7. ~/ curl "http://localhost:8083/hello?a=4&b=3"
8. Total sum: 7
9.
10. ~/ curl "http://localhost:8083/sd"
11. Sorry do not know you /sd
Code 3.5
We managed to cover a few cases with the above examples. The program is naïve and does not cover corner cases (lines 8-9), as we only check whether the arguments are empty. If they are not numbers, it will lead to a crash. That is expected, but we can easily fix this by improving those lines, as in the following example file server_2.1.py:
1. from twisted.internet import reactor
2. from twisted.web import server
3.
4. from server_2 import MyServer
5.
6.
7. class MyServer2(MyServer):
8.
9. def hello_view(self, **kwargs):
10. if kwargs and kwargs.get('a') and kwargs.get('b'):
11. try:
12. total = int(kwargs['a']) + int(kwargs['b'])
13. return f"Total sum: {total}"
14. except ValueError:
15. return "One of the arguments is not a number"
16. return 'hello to you too'
17.
18.
19. def start_service():
20. service = server.Site(MyServer2())
21. reactor.listenTCP(8083, service)
22. reactor.run()
23.
24.
25. if __name__ == '__main__':
26. start_service()
Code 3.6
We packed the server startup part (lines 19-22) into an individual function, which we can reuse later without rewriting the code. Additionally, we also managed to share the existing code from the previous example (line 7) – we imported the previous example (Code 3.3 being imported in Code 3.6, line 4) and used inheritance, overriding only the method that we wanted to change (fix).
This technique allows you to reuse code and is a proper way of writing code in object-oriented programming. We also need to note lines 25-26 (Code 3.6), where we check whether our script is being run directly, like python server_2.1.py; only then does the server start. This guard helps with inheritance (notice that the previous example used the same check): because the server only starts when the script is run directly, the class can be imported elsewhere without accidentally starting the web server.
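As a small sketch of that benefit, assuming the file is saved under an importable name such as server_2_1.py (Python module names cannot contain dots), another script could reuse the class without starting the server:

# importing does NOT start the server, thanks to the __main__ guard
from server_2_1 import MyServer2

srv = MyServer2()
# the non-numeric argument is now handled gracefully
print(srv.hello_view(a="4", b="x"))  # One of the arguments is not a number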

Chatbot basics
A chatbot is software that interacts with humans in a chat, backed by server-side software often based on artificial intelligence (AI) or expert systems. Regardless of which underlying system we choose, we can distinguish two main kinds of chatbots:
Rules-based service: The chatbot is taught pre-defined rules and answers questions from that list.
Self-learning chatbot: This variant is more flexible because it can learn independently, albeit it is technologically more demanding. It requires AI or machine learning models instead of fixed predefined rules.
Modern chatbots use AI that can utilize natural language processing systems in real time. Further, they can analyze language with its variants, inflections, mistakes, or even dialects. They are powerful tools. For example, try to call any customer support hotline. In many cases, the first line of direct help will be a chatbot with a voice recognition system that will guide you smoothly, analyzing and understanding your reason for calling. As a result of the chat, if possible, the officer on call will not even be needed; or, if there is eventually a need to connect you with a real person, that agent will receive a prefilled form with all the necessary details to help you quickly and efficiently.
We all know that chatbots can be highly sophisticated AI-driven applications. However, we will try to concentrate on a simpler yet still powerful example of a chatbot in Python. This will give us a chance to learn how chatbots may work and how to dive into the AI world.

Training
To build a chatbot, we will use the popular library chatterbot. It has most of the functionalities we need, such as machine learning training models. As we can see on its GitHub page,6 "Chatterbot is a machine-learning based conversational dialog engine built in Python, making generating responses based on collections of known conversations possible. The language-independent design of chatterbot allows it to be trained to speak any language."
To install it, simply run the following pip command:
1. pip install chatterbot
Code 3.7
The above version has a small bug affecting our example: it will try to load the non-existing language en (English). To fix this, we managed to fork the above project and patch it for this chapter's needs. To install it, we just run the following commands; essentially, we need the spacy module, a Python library supporting advanced natural language processing.
1. $ pip install spacy==3.4.4 pyyaml==5.4.1
2. $ git clone [email protected]:bpbpublications/Fun-with-Python.git
3. $ cd fun-with-Python/chapter_3/ChatterBot/
4. $ python setup.py install
Code 3.8
After installing the above modules, we must download the Spacy language
pack. Run the following command:
1. python -m spacy download en
Code 3.9
Once it is installed, we will have to build the training tool that helps our chatbot learn the basic phrases and sentences it can expect in a conversation with a user. To build such a tool, we are going to use the chatterbot language training corpus7. To use that library most conveniently, we will build a command line script that can be run at any time, without the need for a server.
First, we must create a folder which is going to contain config files with
basic phrases that we want to teach our chatbot:
1. mkdir -p ~/chatterbot_corpus/data/english
2. vim ~/chatterbot_corpus/data/english/conversations.yml
Code 3.10
Content of the conversations file: let us use something simple, like the following. This is going to be needed for basic chatbot training. At a later stage we can create some more complex examples:
1. categories:
2.
3. - conversations
4. conversations:
5. - - Good morning, how are you?
6. - I am doing well, how about you?
7. - I'm also good.
8. - That's good to hear.
9. - Yes it is.
10. - - Hello
11. - Hi
12. - How are you doing?
13. - I am doing well.
14. - That is good to hear
15. - Yes it is.
16. - Can I help you with anything?
17. - Yes, I have a question.
18. - What is your question?
19. - Could I borrow a cup of sugar?
20. - I'm sorry, but I don't have any.
21. - Thank you anyway
22. - No problem
23. - - How are you doing?
24. - I am doing well, how about you?
25. - I am also good.
26. - That's good.
27. - - Have you heard the news?
28. - What good news?
29. - - What is your favorite book?
30. - I can't read.
31. - So what's your favorite color?
32. - Blue
Code 3.11
We generated the above example conversation file. You can notice that in the yaml file, the first level of indentation identifies questions that a user may potentially ask. The second level of indentation in the configuration file (i.e., lines 30-32) holds the responses that the chatbot can reply with. Let us see how to use it. First, let us create a file called chatbot.py:
1. from chatterbot import ChatBot
2.
3.
4. def chatbot():
5. return ChatBot('Trainer')
Code 3.12
The instance of the chatbot is wrapped in a separate function that we can call from any point of our code base, and it will always initialize the same kind of chatbot instance. Now, create a file called trainer.py.
1. from chatbot import chatbot
2. from chatterbot.trainers import ChatterBotCorpusTrainer
3.
4. _chatbot = chatbot()
5. trainer = ChatterBotCorpusTrainer(_chatbot)
6. trainer.train("chatterbot.corpus.english")
Code 3.13
Execute the above example with python trainer.py. You will notice that in the same directory as your script, a file called db.sqlite3 has been created. This is the sqlite database that contains all the verbs, words, phrases, and so on that our chatbot managed to learn after running the training script.

Chat
So far, we have analyzed and learned how to use Python modules to train a chat model and how to write the interactive scenarios we are going to use to interact with the user. Now, we will learn how to use these trained materials to perform a simple chat.
1. pip install ipython
Code 3.14
Install IPython; it will be useful for testing our model and conversational scenarios. Once we have it, we can test a basic conversation:
1. In [1]: from chatbot import chatbot
2.
3. In [2]: c=chatbot()
4.
5. In [3]: c.get_response('hi')
6. Out[3]: <Statement text:How are you doing?>
7.
8. In [4]: print(c.get_response('hi'))
9. How are you doing?
10.
11. In [5]: print(c.get_response('how are you?'))
12. I am doing well.
Code 3.15
We imported the chatbot function from the previously created file chatbot.py and initialized it (In [2]). Next, you can see in the following lines that we are using the method get_response. This is the entry point of the chatbot API, where we can ask a question and get the response. You should notice that the responses given are based on the yaml config file that we created before.
Application
In the previous introduction to building a simple web server in client-server architecture, we learned the basics of a web server and how to serve resources based on path and query params. We used the Twisted framework for the beginning examples (Code 3.1 - 3.6) to show you more details of HTTP and how you can process and analyze HTTP requests. For building the web application which is our chatbot, we are not going to use Twisted, since in our opinion it is too low level, especially if we compare it with other web-service-oriented frameworks. The choice made lets us use Django.8
First let us install it. As usual, to install a new package we will use pip:
1. pip install django==3.2.16
Code 3.16
At the time of writing this chapter, there are two main long-term support (LTS) lines of the Django project - 3.2 and 4.2. Version 4.2 is under ongoing active development and is actively getting new features, enhancements, and bug fixes. For the purposes of this book, we are going to stay with the recommended stable 3.2 version.
We will start our project, which we will call chatter for an easier naming convention. To do so, we will use Django commands:
Code 3.17
Once the project is properly initialized, we should see a file structure like in the following example:
1. .
2. ├── chatter
3. │ ├── __init__.py
4. │ ├── asgi.py
5. │ ├── settings.py
6. │ ├── urls.py
7. │ └── wsgi.py
8. ├── db.sqlite3
9. └── manage.py
Code 3.18
django-admin automatically created the main file for managing your project, called manage.py. We will use it at many stages, such as running the service, DB migrations, translations, and many more tasks that are beyond the scope of this mini project. First let us start our web server:
1. $ python manage.py runserver
2.
3. Watching for file changes with StatReloader
4. Performing system checks...
5.
6. System check identified no issues (0 silenced).
7.
8. You have 18 unapplied migration(s). Your project may not work
properly until you apply the migrations for app(s): admin,
auth, contenttypes, sessions.
9. Run 'python manage.py migrate' to apply them.
10. December 30, 2022 - 09:44:56
11. Django version 3.2.16, using settings 'chatter.settings'
12. Starting development server at http://127.0.0.1:8000/
13. Quit the server with CONTROL-C.
Code 3.19
Once the application is running, you should be able to access the app server,
as shown below. Welcome to the Django web service!
Figure 3.3: Main hello world screen accessible once the webserver is started.
We created a project in Django called chatter. Now, it is time to create the actual application. What is the difference between a project and an app? A project is a group of applications; you can think of it as a web server that can host many applications. An app is an application which does something particular; for instance, our chatbot can be such an app, and a chatbot admin dashboard would be another.
So, let us create an app called chat. Before that, we need to apply the database migrations that Django asked us to do. Database migrations9 are a way of controlling the history of changes to database tables, triggers, indexes, columns, data types, and so on. In one simple sentence – migrations allow us to control the history of any changes we may want to apply to the database. Instead of manually comparing, as a developer, what database tables we currently have, what types of columns and data types they use, and so on, we can use a very simple yet powerful engine for managing and tracking database changes - migrations. The short sketch below illustrates the kind of model change the migration engine tracks.
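As a hypothetical illustration (the Message model below is not part of our project), any change to a Django model definition is what the migration engine records:

# chat/models.py -- hypothetical model used only to illustrate migrations
from django.db import models

class Message(models.Model):
    text = models.TextField()
    created_at = models.DateTimeField(auto_now_add=True)

Adding, renaming, or removing a field here and running python manage.py makemigrations generates a new migration file describing exactly that change, which python manage.py migrate then applies to the database.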
Migrations are part of the Django framework. Any kind of Django model change can be tracked by the migrations system, and it will always be reflected in the database schema. It is always possible to apply a migration forward or roll back any changes that we do not want in the database. Concluding, let us apply those mentioned migrations in the following example:
1. $ python manage.py migrate
2.
3. Operations to perform:
4. Apply all migrations: admin, auth, contenttypes, sessions
5. Running migrations:
6. Applying contenttypes.0001_initial... OK
7. Applying auth.0001_initial... OK
8. Applying admin.0001_initial... OK
9. Applying admin.0002_logentry_remove_auto_add... OK
10. Applying admin.0003_logentry_add_action_flag_choices... OK
11. Applying contenttypes.0002_remove_content_type_name... OK
12. Applying auth.0002_alter_permission_name_max_length... OK
13. Applying auth.0003_alter_user_email_max_length... OK
14. Applying auth.0004_alter_user_username_opts... OK
15. Applying auth.0005_alter_user_last_login_null... OK
16. Applying auth.0006_require_contenttypes_0002... OK
17. Applying auth.0007_alter_validators_add_error_messages... OK
18. Applying auth.0008_alter_user_username_max_length... OK
19. Applying auth.0009_alter_user_last_name_max_length... OK
20. Applying auth.0010_alter_group_name_max_length... OK
21. Applying auth.0011_update_proxy_permissions... OK
22. Applying auth.0012_alter_user_first_name_max_length... OK
23. Applying sessions.0001_initial... OK
Code 3.20
Just a small explanation of what is happening above: the command in line 1 triggered the migrations. Line 4 is where Django shows which applications' migrations are being applied. Once we have finished running the migrations, we can finally create our first Django application:
1. $ python manage.py startapp chat
Code 3.21
The app created via the above command will also get its folder and file structure automatically created by Django. Next, we must create a folder to store templates for our application. While in the main folder where the manage.py file is located, execute the following command:
1. mkdir -p chat/templates/chat
Code 3.22
In the same directory where we have our manage.py file (to simplify the description, we will call this location the root folder), create the chatbot.py main file.
1. from chatterbot import ChatBot
2.
3.
4. def chatbot():
5. return ChatBot(
6. 'Trainer',
7. storage_adapter='chatterbot.storage.SQLStorageAdapter',
8. database_uri='sqlite:///chatbot.sqlite3'
9. )
Code 3.23
We created the same file as in the training example, but in this case, we explicitly told the ChatBot constructor what kind of data storage we want to use (SQLite) and where it is located (database_uri).
For training, we will use trainer.py in the root folder with the following content:
1. from chatbot import chatbot
2. from chatterbot.trainers import ChatterBotCorpusTrainer
3.
4.
5. _chatbot = chatbot()
6. trainer = ChatterBotCorpusTrainer(_chatbot)
7. trainer.train("chatterbot.corpus.english")
Code 3.24
The remaining config shown in the training section stays the same. Now, it is time to inform our Django project about our new app. We are going to update chatter/settings.py in the root folder and update the applications list:
1. INSTALLED_APPS = [
2. 'django.contrib.admin',
3. 'chat',
4. 'django.contrib.auth',
5. 'django.contrib.contenttypes',
6. 'django.contrib.sessions',
7. 'django.contrib.messages',
8. 'django.contrib.staticfiles'
9. ]
Code 3.25
Django should now be able to see our app, albeit before we can start using it, we have to fix the routing. Edit chatter/urls.py in the root folder so it looks like the following:
1. """chatter URL Configuration"""
2. from django.contrib import admin
3. from django.urls import path, include
4. from chat.views import main_view
5.
6. urlpatterns = [
7. path('admin/', admin.site.urls),
8. path('chat/', include('chat.urls')),
9. path('', main_view, name='main_view'),
10. ]
Code 3.26
Line 7 points to the admin URLs definition. You can open the admin page at http://localhost:8000/admin/ with a login and password that must be created by executing the following command:
1. $ python manage.py createsuperuser
2.
3. Username (leave blank to use 'foo'): admin
4. Email address: [email protected]
5. Password:
6. Password (again):
7. The password is too similar to the username.
8. This password is too short. It must contain at least 8 characters.
9. This password is too common.
10. Bypass password validation and create user anyway? [y/N]: y
11. Superuser created successfully.
Code 3.27
When the admin account is created, you can access the admin page with the credentials you used above. You can use it later as an extension of this chapter.

Figure 3.4: Admin section available when running example hello world app
The admin is accessible and working. It is time to create the main view for the URL path defined in line 9 (urls.py). Please notice that the mentioned view (Code 3.28) is imported in line 4, so we create that view in the file chat/views.py.
1. from django.http import HttpResponse
2.
3.
4. def main_view(request):
5. return HttpResponse("hello world")
Code 3.28
After accessing the main page http://localhost:8000/ you should see the hello world message that we return in line 5.

Figure 3.5: Example hello world web page


With this accomplished, we can start creating the endpoint that will receive the user's message and return the chatbot response. In the same file, we are going to add a new view called chat_query.
1. from django.http import HttpResponse
2. from chatbot import chatbot
3.
4.
5. def chat_query(request):
6. user_message = request.GET['message']
7. response = chatbot().get_response(user_message)
8. response_data = response.text
9. return HttpResponse(response_data)
10.
11.
12. def main_view(request):
13. return HttpResponse("hello world")
Code 3.29
The next step is to update the definition of the URL router, so our chat app can point to the newly added controller view. It is simple syntax, as in the following example (chatter/urls.py).
1. """chatter URL Configuration"""
2. from django.contrib import admin
3. from django.urls import path, include
4. from chat.views import main_view
5.
6. urlpatterns = [
7. path('admin/', admin.site.urls),
8. path('chat/', include('chat.urls')),
9. path('', main_view, name='main_view'),
10. ]
Code 3.30
How does it work? In our main URLs definition file (Code 3.30) from the root folder (chatter/urls.py), in line 8 we define the import of URLs from the chat application. In that case, for every path /chat/*, Django will look for route definitions in the chat/urls.py file, which looks like the following example.
1. from django.urls import path
2.
3. from . import views
4.
5. urlpatterns = [
6. path('query', views.chat_query, name='chat_query'),
7. ]
Code 3.31
In that case, we defined the path /chat/query to return the result of the chat_query view. We assume you already managed to run trainer.py, so our AI model is trained. Once you have done this, you should see the DB file chatbot.sqlite3 in the main root directory. All set, let us make a test call to our new endpoint:
1. curl -v -L "http://localhost:8000/chat/query?message=hi"
Code 3.32
Our chatbot app should be able to respond to the above query (the message parameter) with a previously trained response. For example, you can expect to see something like this:
1. * Trying 127.0.0.1:8000...
2. * Connected to localhost (127.0.0.1) port 8000 (#0)
3. > GET /chat/query?message=hi HTTP/1.1
4. > Host: localhost:8000
5. > User-Agent: curl/7.85.0
6. > Accept: */*
7. >
8. * Mark bundle as not supporting multiuse
9. < HTTP/1.1 200 OK
10. < Date: Sun, 01 Jan 2023 18:00:26 GMT
11. < Server: WSGIServer/0.2 CPython/3.7.9
12. < Content-Type: text/html; charset=utf-8
13. < X-Frame-Options: DENY
14. < Content-Length: 12
15. < X-Content-Type-Options: nosniff
16. < Referrer-Policy: same-origin
17. <
18. * Connection #0 to host localhost left intact
19. How are you?
Code 3.33
Before we go to the next section, note that we passed the user's message as a query string in the URL we call to get a response from the chatbot. That is something we are going to change in the next section. The reason is simple: a query string is not very suitable for such calls; for instance, the length of a query string is limited, and more complex queries, for example with non-ASCII content, can lead to issues.

Frontend
In the previous sections we concentrated on the server side of our chatbot application. This section will mainly focus on the frontend side, the part of the app you access in the web browser.
First let us make sure that our chat app can render HTML properly. We will create the main template file chat/templates/chat/main.html with this simple content:
1. <!doctype html>
2. <html lang="en" class="scroll-smooth">
3. <head>
4. <meta charset="utf-8">
5. <meta name="viewport" content="width=device-width, initial-
scale=1">
6. <title>Chatbot demo</title>
7. </head>
8. <body>
9. <p>hello world!</p>
10. </body>
11. </html>
Code 3.34
This is a trivial hello world HTML example. To display it, we must update the main controller in the chat application. Edit the main view file chat/views.py and update the function as in the following example:
1. from django.shortcuts import render
2.
3.
4. def main_view(request):
5. context = {}
6. return render(request, 'chat/main.html', context)
Code 3.35
You can see in line 6 that we return HTML generated from the template file. This simple render function takes three arguments:
The first is the request object10,
The second is the path to the template file, and
The last is a dictionary with all the data structures for the template
Our frontend part will display a small box with all the chat history and a vertical scrollbar if there are more messages than the box can display. At the bottom of the box, we will add an input field where the user can write a message, and a button to send it. Overall, it is going to look like the following figure:

Figure 3.6: Prototype of our chatbot application


We will have to improve the HTML template prepared earlier (hello world, Code 3.34) by adding the bootstrap library11 for better CSS and a better look and feel of the page itself. The code to display our chat box:
1. <!doctype html>
2. <html lang="en" class="scroll-smooth">
3. <head>
4. <meta charset="utf-8">
5. <meta name="viewport" content="width=device-width, initial-
scale=1">
6. <title>Chatbot demo</title>
7. <link href="https://cdn.jsdelivr.net/npm/[email protected]/dist/css/boo
tstrap.min.css" rel="stylesheet" integrity="sha384-
rbsA2VBKQhggwzxH7pPCaAqO46MgnOM80zW1RWuH61DGLwZJ
EdK2Kadq2F9CUG65"
crossorigin="anonymous" />
8. </head>
9. <body>
10. <div class="container text-center">
11. <div class="row">
12. <div class="col">chatbox example</div>
13. </div>
14. <div class="row">
15. {% csrf_token %}
16. <div class="col"></div>
17. <div class="col"></div>
18. <div class="col" style="height:90%">
19. <div class="card" style="width: 18rem;">
20. <div class="info-div" id="chat_content">...</div>
21. <div class="card-body">
22. <p class="card-text">
23. <div class="input-group mb-3">
24. <input type="text" class="form-
control" placeholder="Your message" aria-label="Your message" aria-
describedby="basic-addon2" id="message">
25. <button class="input-group-text" id="basic-addon2">send</button>
26. </div>
27. </p>
28. </div>
29. </div>
30. </div>
31. </div>
32. </div>
33. </body>
34. </html>
Code 3.36
In line 7 we loaded the bootstrap library from an external CDN resource. It is a simple and convenient way of using additional JavaScript or CSS libraries. In lines 10-32, using a bunch of HTML elements with bootstrap CSS, we managed to build a chat window that fulfils our requirements.
In line 15, we added the special Django CSRF12 (cross site request forgery protection) tag that will inject a one-time token into the page's source. Each time the user reloads the page, the CSRF token value gets refreshed as well – both on the server and the website side. We need to use that token value in the payload that the browser will send each time the user presses the send button.
Notice that in the HTML we gave IDs to elements: the chat box itself (chat_content) and the input text (message). We will use these elements later in our JavaScript code. We will read the values of those elements to send them in the request payload to the server and process the responses. Next, as a result, we will update the chat box with the proper content.
To trigger any JavaScript action we could use raw JavaScript, but working with the browser XHR and native JS code can be a pain, especially if we talk about compatibility between different browsers. To make it easier, let us use the JS library called jQuery13.
Figure 3.7: Chatbot application work-flow diagram
In the following code, we updated the head section (from Code 3.36) in the HTML source, where we have additionally added the jQuery library loader:
1. <head>
2. <meta charset="utf-8">
3. <meta name="viewport" content="width=device-width, initial-
scale=1">
4. <title>Chatbot demo</title>
5. <link href="https://cdn.jsdelivr.net/npm/[email protected]/dist/css/boo
tstrap.min.css" rel="stylesheet" integrity="sha384-
<....same as before>" crossorigin="anonymous" />
6. <script src="https://code.jquery.com/jquery-
3.6.3.min.js" crossorigin="anonymous"></script>
7. </head>
Code 3.37
Since all the main libraries are loaded into our HTML template, it is time to write some JavaScript code to send and receive chat messages. Let us update line 25 in the HTML chat window section by adding an onclick action to the button:
1. <button class="input-group-text" id="basic-addon2" onClick="sendMessage();">send</button>
Code 3.38
That will allow the user to click the send button and fire the JS function called sendMessage. To make it work, we should add an empty sendMessage in the template header section:
1. <script>
2. function sendMessage() {
3. }
4. </script>
Code 3.39
Clicking the send button does not do anything yet, since the body of the above function is empty, but at least it does not trigger any kind of JS error. Now, we will send an XHR POST message to the backend server with the content of the input text field (the user's chat message). To do so, we will use the jQuery post method14. The updated sendMessage is going to look like this:
1. message = $("#message").val()
2. $("#message").val('')
3. $.post(
4. "chat/query",
5. {'message': message},
6. function( data ) { }
7. )
Code 3.40
jQuery syntax can appear different when compared to Python code, so let us analyze what is actually happening in Code 3.40. In lines 1-2 we get the value of the chat input text and clear it immediately. Next, in lines 3-7 we send its content to the URL chat/query, and as the third argument we pass an anonymous function whose data argument is the actual response from the server.
To receive the POST message in our chat view on the backend server, we must tweak it a little bit. Let us update chat/views.py to make it look like this:
1. from django.shortcuts import render
2. from django.http import JsonResponse
3. from chatbot import chatbot
4.
5.
6. def chat_query(request):
7. user_message = request.POST['message']
8. response = chatbot().get_response(user_message)
9. response_data = response.text
10. return JsonResponse({"message": response_data})
11.
12.
13. def main_view(request):
14. context = {}
15. return render(request, 'chat/main.html', context)
Code 3.41
In line 7 we are reading the incoming user message from request.POST instead of GET, as it was before. Additionally, we updated line 10 so that the view function returns a JSON structure that our frontend function can understand.
Now, let us try to type the word "hi" in the chat input text and click send. In the backend server console log, you can see a 403 forbidden access error message. You may think - what does it mean? Do you remember the CSRF token that we mentioned before – that special Django tag (Code 3.36, line 15)? We must send it as a part of our request payload. If we do not send that token, the server will reject the request with a 403 error. Let us update the above function as in the following example.
1. <script>
2. function sendMessage() {
3. csrf = $("input[name='csrfmiddlewaretoken']").val()
4. message = $("#message").val()
5. $("#message").val('')
6. $.post("chat/query",
7. {"csrfmiddlewaretoken": csrf, 'message': message},
8. function( data ) { }
9. );
10. }
11. </script>
Code 3.42
Now there is no more 403 error in the server console, and we are getting a response, but it is not displayed in the chat window. The reason is that the anonymous function that should process the response and update the chat window is empty. Let us show you how to fix this in the following example.
1. <script>
2. function sendMessage() {
3. csrf = $("input[name='csrfmiddlewaretoken']").val()
4. send_time = new Date()
5. send_time = send_time.toLocaleTimeString()
6. message = $("#message").val()
7. $("#message").val('')
8. $.post("chat/query",
9. {"csrfmiddlewaretoken": csrf, 'message': message},
10. function( data ) {
11. chat_content = $("#chat_content").html()
12. response_time = new Date()
13. response_time = response_time.toLocaleTimeString()
14. response_message = `[${response_time}] ` +data.message
15. request_message = `[${send_time}] ` + message
16. $("#chat_content").html( chat_content + '<div style="backgroun
d: #eeeeee">' + request_message + '</div>
<div>' + response_message + '</div>');
17. }
18. );
19. }
20. </script>
Code 3.43
We can see in the following figure how the chatbot application is going to look when running in the browser.
Figure 3.8: Example of chatbot application.
Finally, sending a message is working. So, let us exercise our example some more. Once there are more messages in the chat window than fit inside it, the browser scrollbar does not appear. To fix this issue we need to add a CSS style. We need to update the header section (Code 3.37) in the template source with the following CSS code.
1. <style>
2. .info-div {
3. width: 100%;
4. height: 400px !important;
5. max-height: 400px !important;
6. overflow-y: auto;
7. text-align: left
8. }
9. </style>
Code 3.44
Now, we have a working example. The outstanding task for you is to update conversations.yml with your own conversations and rerun the training script. If you want to start from scratch, you can always drop the chatbot.sqlite3 file and run the trainer script.
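For example, a full retraining from scratch can be as simple as:

$ rm chatbot.sqlite3
$ python trainer.py

After that, the next call to the chatbot will use only the freshly trained data.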

Conclusion
In this chapter, we learned how to use simple AI models with Python and how to train them to build a chatbot. We also learned how to build a client-server architecture application, which we managed to convert into a web application with the use of interactive JavaScript and HTML.
In the next chapter, we will learn how to use Python for analyzing and managing our home expenses. After reading it, we will see how we can use Python to predict future expenses and manage our home budget based on our income and how much money we spend.

1. https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol
2. https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods
3. https://twisted.org
4. https://docs.python.org/3/library/urllib.html
5. https://en.wikipedia.org/wiki/Query_string
6. https://github.com/gunthercox/ChatterBot
7. https://github.com/gunthercox/chatterbot-corpus
8. https://www.djangoproject.com
9. https://docs.djangoproject.com/en/4.2/topics/migrations/
10. https://docs.djangoproject.com/en/4.1/ref/request-response/
11. https://getbootstrap.com
12. https://docs.djangoproject.com/en/5.1/ref/csrf/
13. https://jquery.com
14. https://api.jquery.com/jQuery.post/

Join our book’s Discord space


Join the book's Discord Workspace for Latest updates, Offers, Tech
happenings around the world, New Release and Sessions with the Authors:
https://discord.bpbonline.com
CHAPTER 4
Developing App to Analyze
Financial Expenses

Introduction
It is well-established that a balanced and organized home budget is crucial.
Traditionally, we managed this manually with math. Now, programming
languages provide a more streamlined approach, automating calculations and
offering real-time insights.

Structure
In this chapter, we will discuss the following topics:
Excel
Import
Export
Analyze expenses
Estimate future expenses based on income and outcome from the past
Building behavioral driver estimator
Statistics
Objectives
In this chapter, you will learn how to use Python to manage your finances.
You will see how to import and export data to and from Excel files, which
can help you keep track of your income and expenses. You will also learn
how to collect and organize data in a way that allows you to calculate
estimates and analyze your home budget. By the end of this chapter, you will
have the skills to use Python as a powerful tool for financial planning and
decision making.

Excel
Excel is a popular spreadsheet application that can store, manipulate, and analyze data in various formats. Python is a powerful programming language that can perform various tasks with data, such as cleaning, processing, and visualizing. In this section, we will learn how to import and export data to and from Excel files using Python and the CSV1 format.

Export
Let us check the following code to see how we can convert data to CSV format.
1. import csv
2.
3. data = [
4. {"name": "icecream", "amount": 15, "comment": ""},
5. {"name": "water", "amount": 3.2, "comment": "it was hot day"},
6. {"name": "bread", "amount": 1.3, "comment": "my favorite one"},
7. ]
8.
9. output_filename = "output_file.csv"
10. headers = ("name", "amount", "comment")
11.
12. with open(output_filename, 'w') as csv_file:
13. csv_writer = csv.writer(csv_file)
14. for item in data:
15. csv_writer.writerow([item.get(key) for key in headers])
Code 4.1
After running, our code will create an output file (line 12) that is going to look like the following example.
1. icecream,15,
2. water,3.2,it was hot day
3. bread,1.3,my favorite one
Code 4.2
What we must notice is that Code 4.1 is not efficient, because it iterates over the data list and creates a new list for each item by calling the get method on the item dictionary. This can cause performance issues with lots of items, as it consumes more memory and time than necessary. The optimized version of the same thing is going to look like the following code.
1. import csv
2.
3. data = [
4. ("icecream", 15, ""),
5. ("water", 3.2, "it was hot day"),
6. ("bread", 1.3, "my favorite one"),
7. ]
8.
9. output_filename = "output_file.csv"
10. headers = [("name", "amount", "comment")]
11.
12. with open(output_filename, "w") as csv_file:
13. csv_writer = csv.writer(csv_file)
14. csv_writer.writerows(headers)
15. csv_writer.writerows(data)
Code 4.3
Code 4.3 uses the writerows method (lines 14-15) of the csv_writer object to write multiple rows at once to a CSV file. The first call takes a list of headers, which are the column names for the CSV file. The second call takes the list of data, which contains the rows of values for each column. Code 4.3 does not loop over the data but writes it all in one go. This requires that the data is already in a suitable format for the CSV file, such as a list of lists or a list of tuples.
Another thing that we must address is the fact that CSV files very often require some types of data to be wrapped in quotes. One reason why quoting data in a CSV file makes sense is that it can prevent the comma character, which is used as a delimiter, from being interpreted as part of the data; for example, for a value like Smith, John, we need quotes to avoid confusing the parser.
Another reason why quoting data in a CSV file makes sense is that it can preserve whitespace characters, such as spaces, tabs, or newlines, that are part of the data.
Let us check the following example of how to achieve this.
1. import csv
2.
3. data = [
4. ("icecream", 15, ""),
5. ("water", 3.2, "it was hot day, no"),
6. ("bread", 1.3, "my favorite one"),
7. ]
8.
9. output_filename = "output_file.csv"
10. headers = [("name", "amount", "comment")]
11.
12. with open(output_filename, "w") as csv_file:
13. csv_writer = csv.writer(csv_file, delimiter=',',quotechar='"')
14. csv_writer.writerows(headers)
15. csv_writer.writerows(data)
Code 4.4
Another way to generate a CSV file is to use the pandas DataFrame.to_csv method, which takes a file name or a file object as an argument and writes the data frame to a .csv file. For example, if we have a data frame called df, we can write it to a .csv file. Let us install pandas with the following command:
1. $ pip install pandas
Code 4.5
Once we have this module installed, let us check the following code to see how we can use pandas to make CSV files.
1. import pandas as pd
2. import numpy as np
3.
4. data = [
5. ("icecream", 15, ""),
6. ("water", 3.2, "it was hot day, no"),
7. ("bread", 1.3, "my favorite one"),
8. ]
9.
10. output_filename = "output_file.csv"
11. headers = ("name", "amount", "comment")
12.
13. df = pd.DataFrame(np.array(data), columns=headers)
14. df.to_csv(output_filename, index=False)
Code 4.6
Once we execute Code 4.6, we are going to get results (line 14) the same as in the previous examples (Code 4.3 and Code 4.4).
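If we also want pandas to quote values, similar to what we did in Code 4.4, to_csv accepts the quoting constants from the csv module. A minimal sketch:

import csv

import pandas as pd

df = pd.DataFrame(
    [("water", 3.2, "it was hot day, no")],
    columns=("name", "amount", "comment"),
)
# QUOTE_NONNUMERIC wraps every non-numeric field in quotes
df.to_csv("output_file.csv", index=False, quoting=csv.QUOTE_NONNUMERIC)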

Import
So far, we have learned how to export existing data to an external format that can be understood by Excel. To import data back from Excel to Python we are going to use pandas; we need to use the pd.read_excel() function. Let us check the following example of how to import data from Excel. First, let us create a spreadsheet generated by importing the CSV file that we created by running Code 4.6. It should look like the following figure.
Figure 4.1: Example CSV output file imported in Excel
Now, we are going to import the data back to pandas from Excel. To do this,
we need to use the following code:
1. import pandas as pd
2.
3. # specify the path of the Excel file
4. excel_file = "example.xlsx"
5. # read the Excel file into a pandas DataFrame
6. df = pd.read_excel(excel_file)
7. print(df.head())
Code 4.7
The preceding code uses the read_excel function to read the Excel file into a pandas DataFrame. The function takes the path of the Excel file as an argument and returns a DataFrame object that contains the data from the spreadsheet. We can then print the first rows of the DataFrame using the head method to check if the data was imported correctly.

Analyze expenses
After reading the Excel file, we can start analyzing the expenses data in the data frame. We are going to prepare an example report that shows the total expenses by category, with income included. Let us check the following figure showing what the example spreadsheet is going to look like.

Figure 4.2: Example spreadsheet containing monthly expenses


Before we can analyze the expenses data, we need to import it into pandas, a popular Python library for data analysis. Additionally, we have to install the following libraries.
To import the data into Python, we need to have a sample file (Figure 4.2) that contains the expenses. We assume that our example is going to be saved in Excel format. Let us check the following example of how to import such a data file.
1. import click
2. import pandas as pd
3.
4. class Expenser:
5. def load_and_convert(self, file_path):
6. self.df = pd.read_excel(file_path, sheet_name=None, header=None
, names=('Type', 'Value'))
7. print(self.df['expenses'])
8.
9. @click.command()
10. @click.option("--file", type=str, help="Data file", required=True)
11. def main(file):
12. exp = Expenser()
13. exp.load_and_convert(file)
14.
15. if __name__ == '__main__':
16. main()
Code 4.9
We used the already familiar click2 library for supporting command line parameters. One reason why using Python click is so pleasant is that it allows us to create user-friendly and consistent command line interfaces with minimal code. We use a custom parameter file (line 10) that points to the Excel file we load into pandas as a data frame (line 6). Next, we print the content of the expenses spreadsheet (line 7).
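As a bonus, click generates a help screen for us; running the script with --help should print output along these lines:

$ python load_expenses.py --help
Usage: load_expenses.py [OPTIONS]

Options:
  --file TEXT  Data file  [required]
  --help       Show this message and exit.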
After loading the data, we are going to draw a graph representing how much money we get and how much we spend on which group of things. Let us check the following code:
1. import click
2. import pandas as pd
3. import matplotlib.pyplot as plt
4.
5. class Expenser:
6. def load_and_convert(self, file_path):
7. self.df = pd.read_excel(file_path, sheet_name=None, header=None
, names=("Type", "Value"))
8. print(self.df["expenses"])
9.
10. def draw(self):
11. fig, ax = plt.subplots()
12.
13. ax.bar(self.df["expenses"]["Type"], self.df["expenses"]["Value"])
14.
15. ax.set_ylabel('amount ($)')
16. ax.set_title("Results of expenses")
17. ax.legend(title='Expenses')
18.
19. plt.show()
20.
21.
22. @click.command()
23. @click.option("--file", type=str, help="Data file", required=True)
24. def main(file):
25. exp = Expenser()
26. exp.load_and_convert(file)
27. exp.draw()
28.
29.
30. if __name__ == "__main__":
31. main()
Code 4.10
To be able to execute Code 4.10, we need to run the following command.
1. $ python load_expenses.py --file espense-example.xlsx
Code 4.11
Code 4.11 runs the Python script from Code 4.10, called load_expenses.py. The script takes an argument --file, which specifies the name of an Excel file that contains expense data. The script reads the Excel file and loads the data into a pandas DataFrame, which is a data structure for storing tabular data in Python. It then prints the content of the expenses sheet (line 8). Finally, as the result of running our script, we draw a bar chart (lines 10-19), where we use the matplotlib3 module for a graphical representation of our data, shown in the following figure:

Figure 4.3: Example expenses data presented as bar chart


We can see that after importing the data from Excel and drawing a bar chart (Figure 4.3), where we tried to present which parts of the expenses consume most of our budget, we show so much data that the chart is very fuzzy and hard to read. Let us change Code 4.10 in the way presented in the following example.
1. def load_and_convert(self, file_path):
2. self.df = pd.read_excel(file_path, sheet_name=None, header=None, n
ames=("Type", "Value"))
3. self.df["expenses"] = self.df["expenses"].sort_values("Value", ascendi
ng=False)
Code 4.12
In Code 4.12, once we have loaded the data into a DataFrame, we sort it by value (line 3). The order of sorting is descending (parameter ascending=False), so that we can apply the following code change.
1. ax.bar(self.df["expenses"]["Type"][:5], self.df["expenses"]["Value"][:5])
Code 4.13
In Code 4.13, when we draw the bars with values (Code 4.10, line 13), we
change the call so that, once the data is sorted, we show only the top five
biggest expenses in our budget. As a result of this change, we get a much
cleaner figure, shown in the following example:
Figure 4.4: Sorted expenses by biggest chunk of budget spend
We can see in Figure 4.4 that applying the Python slice syntax "[:5]"
(Code 4.13) lets us slice out only the first five elements from the expenses
list and then draw the bar chart with them.
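The same slice syntax works on any Python sequence, not only on pandas objects; a quick illustration:
1. items = [10, 20, 30, 40, 50, 60, 70]
2. print(items[:5])   # first five elements: [10, 20, 30, 40, 50]
3. print(items[-2:])  # last two elements: [60, 70]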
Estimate future expenses based on income and outcome
from the past
Estimating any kind of future data (interpolation) has a special dedicated
field of study – it is quite a complex matter. Thankfully, for Python there
are many projects that help us developers use artificial intelligence4. The
ones that we are going to use here are scipy5 and scikit-learn6. Before we
continue, we assume that everything mentioned before is already installed
(pandas and so on). Let us install the additional packages:
1. $ pip install numpy scipy scikit-learn quandl
Code 4.14
Once these modules are installed, we need to clarify some important things.
The reason why we use SciPy and scikit-learn is that the SciPy module helps
us with interpolating data in a low-level way, whereas scikit-learn
introduces an AI layer on top of it, adding prediction to the interpolation
theory of SciPy7.
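As a quick taste of the low-level side, SciPy can interpolate between known data points directly. The following is a minimal sketch using scipy.interpolate.interp1d with made-up sample points:
1. import numpy as np
2. from scipy.interpolate import interp1d
3.
4. x = np.array([0, 1, 2, 3])
5. y = np.array([0.0, 2.0, 1.0, 3.0])
6. f = interp1d(x, y, kind="linear")  # build a linear interpolator
7. print(f(1.5))  # halfway between y=2.0 and y=1.0 -> 1.5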
Let us check the following example of how we can fetch sample data with Python:
1. import quandl
2.
3. df = quandl.get("WIKI/GOOGL")
4. df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']]
Code 4.15
We used quandl8, which helps us fetch financial data about Google and the
stock market – it is just a simple example so we can learn how such data
can later be used to predict the stock exchange's future. We can see that in
line 3 we use the quandl API to fetch data about Google stock. Once it is
fetched, we have it converted to a pandas DataFrame. Let us check what we
can do with such data.
Python scikit-learn9 is a free and open-source machine learning library that
provides a range of supervised and unsupervised learning algorithms, as well
as tools for data preprocessing, model selection, evaluation, and feature
extraction. After installing the module with Code 4.14, we can check the
following example of how to use the loaded data (Code 4.15) to prepare our
data stack for interpolation:
1. import random
2. from datetime import datetime
3.
4. import quandl, math
5. import numpy as np
6. import pandas as pd
7. import matplotlib.pyplot as plt
8. from sklearn import preprocessing, svm
9. from sklearn.model_selection import train_test_split
10. from sklearn.linear_model import LinearRegression
11. from matplotlib import style
12.
13. style.use("ggplot")
14.
15. df = quandl.get("WIKI/GOOGL")
16. df = df[["Adj. Open", "Adj. High", "Adj. Low",
"Adj. Close", "Adj. Volume"]]
17. df = df[["Adj. Close", "Adj. Volume"]]
18.
19. forecast_col = "Adj. Close"
20. df.fillna(value=-99999, inplace=True)
21. forecast_size = int(math.ceil(0.02 * len(df)))
22. print("Forecast size: {forecast_size}")
23.
24. df["label"] = df[forecast_col].shift(-forecast_size)
25.
26. x = np.array(df.drop(["label"], axis=1))
27. x = preprocessing.scale(x)
28. x_lately = x[-forecast_size:]
29. x = x[:-forecast_size]
30.
31. df.dropna(inplace=True)
32.
33. y = np.array(df["label"])
34. x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
35. clf = LinearRegression(n_jobs=-1)
36. clf.fit(x_train, y_train)
37. confidence = clf.score(x_test, y_test)
38.
39. forecast_set = clf.predict(x_lately)
40. df["Forecast"] = np.nan
41. last_date = df.iloc[-1].name
42. last_unix = last_date.timestamp()
43. one_day = 24 * 60 * 60 # 1 day in seconds
44. next_unix = last_unix + one_day
45.
46. for i in forecast_set:
47. next_date = datetime.fromtimestamp(next_unix)
48. next_unix += one_day
49. df.loc[next_date] = [np.nan for _ in range(len(df.columns) - 1)] + [i]
50.
51. df["Adj. Close"].plot()
52. df["Forecast"].plot()
53. plt.legend(loc=4)
54. plt.ylabel("Value")
55. plt.xlabel("Date")
56. plt.show()
Code 4.16
There is a lot going on in this code, so let us analyze it block by step.
In Code 4.16, the goal is to create a forecast for the closing price of a
stock based on historical data. In the first part (lines 15-17) we load the
stock exchange data as in Code 4.15, but in the current use case (line 17)
we drop the columns that are not needed in our example – we only keep those
that we will use for estimations. In line 19, we select the Adj. Close
column, which contains the daily closing prices of the stock and will be
used as the target variable for the forecast. The next part is to define the
size of the estimation (for how many days ahead we want to estimate) – in
line 21, we say we want an estimation window that is 2% of the total number
of days in the historical data retrieved from the external API (line 15).
Label and feature preparation comes next (lines 24-29): we shift the target
column so that each row is paired with the closing price forecast_size days
ahead, convert the features to a NumPy array, scale them, and set aside the
most recent rows (x_lately) that we will later predict for. In the later
part we train our model: we split the data into training and test sets, fit
a linear regression, and compute a confidence score on the test set
(lines 33-39).
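To make the label construction in line 24 more tangible, here is a small, self-contained sketch of what shift(-n) does: it moves each value n rows up, so every row gets paired with the value observed n days later, which is what the model learns to predict. The tiny frame below is made up for illustration:
1. import pandas as pd
2.
3. df = pd.DataFrame({"Close": [10, 11, 12, 13, 14]})
4. df["label"] = df["Close"].shift(-2)  # label = Close price 2 rows (days) ahead
5. print(df)
6. # row 0 gets 12.0, row 1 gets 13.0, the last 2 rows get NaN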
Once we have the model and predictions prepared, we plot the historical and
forecasted data on a graph (lines 51-52). Notice that we had to use a loop
(lines 46-49) that goes over the forecast set (line 46) and injects the
predicted values into the missing future dates by going day by day
(line 49).
The code also adds a legend and axis labels to the chart (lines 53-56). Such
plots can help us understand the behavior and patterns of the time series
data and the forecast.
After analyzing our forecasting code, we can check in the following figure
what our interpolation is going to look like:
Figure 4.5: Example forecast of Google stock market using Python scikit-learn
After warming up, we can start changing our code (Code 4.16) so that it
predicts data regarding our expenses. We do not want to concentrate in this
subchapter on preparing data with many real-life details involved – we want
to learn how to interpolate data, so we are going to create a script that
helps us prepare sample data.
Before we can continue, we have to install a Python library as shown in the
following code:
1. pip install xlsxwriter
Code 4.17
Once module from Code 4.17 is installed we can build a code that is going to
produce XLS file that contains a simulated expenses grouped by expense
type. Let us check the following code to see how we can achieve this. Let us
create a file like shown below called seed_example_data.py.
1. import pandas as pd
2. import numpy as np
3.
4.
5. EXPENSES = (
6. "Bank Fees",
7. "Clothing",
8. "Consumables",
9. "Entertainment",
10. "Hotels",
11. "Interest Payments",
12. "Meals",
13. "Memberships",
14. "Pension Plan Contributions" "Rent",
15. "Service Fees",
16. "Travel Fares",
17. "Utilities",
18. "Cleaning Supplies",
19. "Communication Charges",
20. "Energy",
21. "Food",
22. "Insurance",
23. "Maintenance",
24. "Medical Costs",
25. "Office Supplies",
26. "Professional Service Fees",
27. "Repair Costs",
28. "Taxes",
29. "Tuition",
30. "Vehicle Lease",
31. )
32.
33.
34. class Seeder:
35. def generate(self):
36. with pd.ExcelWriter("expenses_seed_example.xlsx", engine="xlsx
writer") as writer:
37. for sheet_name in EXPENSES:
38. dates = pd.date_range(start="2020-01-01", end="2021-01-
01")
39. data = {
40. "date": dates,
41. "amount": pd.Series(np.random.choice(np.random.randint(
100, size=150), size=dates.size)),
42. }
43. df = pd.DataFrame(data, columns=["date", "amount"])
44. df.to_excel(writer, sheet_name=sheet_name)
45. if __name__ == "__main__":
46.
47.
48. s = Seeder()
49. s.generate()
Code 4.18
In this code we create a list of example expenses (lines 5-31) which we
later use to create separate sheets (line 37) in a single output file
(line 36). In every loop over an example sheet name (expense type) we
generate a date range (line 38). Next, we create a data series with random
values (line 41) representing our example amounts of daily expenses.
Once we have created the DataFrame, which is a combination of the dates with
the data series (lines 38-43), we save it as an Excel sheet.
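The value generation in line 41 may look dense, so here is a small sketch of what it does: np.random.randint(100, size=150) draws a pool of 150 random integers below 100, and np.random.choice then samples from that pool once per day in the date range:
1. import numpy as np
2. import pandas as pd
3.
4. dates = pd.date_range(start="2020-01-01", end="2020-01-10")
5. pool = np.random.randint(100, size=150)  # 150 random ints in [0, 100)
6. amounts = pd.Series(np.random.choice(pool, size=dates.size))
7. print(len(amounts))  # one amount per day: 10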
As a result of running Code 4.18, we should get a seeded example file called
expenses_seed_example.xlsx. That file is going to be used in the following
part of the chapter.
After seeding the expenses file with sample data, it is time to start
building a script that is going to help us analyze our expenses.
First, let us build a class that can load our sample data. Let us check the
following example code:
1. import click
2. import pandas as pd
3.
4.
5. class Interpolation:
6. def __init__(self, source_file):
7. self.df = pd.read_excel(source_file, sheet_name=None, header=None, names=("Date", "Value"))
8.
9.
10. @click.command()
11. @click.option("--
source", type=str, help="Source file to load", required=True)
12. def main(source):
13. i = Interpolation(source)
14.
15.
16. if __name__ == "__main__":
17. main()
Code 4.19
As shown in lines 5-7, we load our source file into a pandas DataFrame,
naming the columns Date and Value to match the seeded data. Once it is
loaded, we can run the following command and proceed to analyze what we have
loaded:
1. python create_estimates.py --source expenses_seed_example.xlsx
Code 4.20
1. As the next step, once the data is in the system, we are going to
aggregate it weekly in the way presented in the following code:
1. def aggregate(self):
2.     for expense in EXPENSES:
3.         self.df[expense]["Date"] = pd.to_datetime(self.df[expense]["Date"][1:]) - pd.to_timedelta(7, unit="d")
4.         self.df[expense] = self.df[expense].groupby([pd.Grouper(key="Date", freq="W")])["Value"].sum()
Code 4.21
The aggregate method shifts the dates (line 3) one week back relative to the
time range we created in the seeding script (Code 4.18); note that the [1:]
slice skips the first row, which holds the header text read in as data
because of header=None. Once we have this, we group the dates, originally
prepared as daily expenses, so that the expenses end up grouped weekly
(groupby), and the amounts of these expenses get summed as well –
["Value"].sum() (line 4).
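If pd.Grouper is new to you, the following toy example shows how daily rows collapse into weekly sums; the two-column frame here is made up for illustration:
1. import pandas as pd
2.
3. df = pd.DataFrame({
4.     "Date": pd.date_range("2020-01-01", periods=14),
5.     "Value": [1] * 14,
6. })
7. weekly = df.groupby([pd.Grouper(key="Date", freq="W")])["Value"].sum()
8. print(weekly)  # three week-ending buckets summing 5, 7 and 2 days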
To be able to use this method, we have to import the EXPENSES variable from
Code 4.18 – we called that file seed_example_data.py. Let us check the
following code to see how to achieve this:
1. from seed_example_data import EXPENSES
2.
3. def main(source):
4. i = Interpolation(source)
5. i.aggregate()
Code 4.22
In Code 4.22 we extended the main function with an explicit call to the
aggregate method and the import of the EXPENSES variable that we needed.
2. The next step is to add a method that draws which kinds of expenses
are burning most of our home budget. Let us check the following code to
see how we can deliver this requirement:
1. import matplotlib.pyplot as plt
2.
3. def plot(self):
4. for expense in EXPENSES:
5. self.df[expense].plot(label=expense, legend=True)
6. plt.legend(loc=4)
7. plt.xlabel("Date")
8. plt.ylabel("Amount")
9. plt.show()
Code 4.23
In this newly introduced plot method we are again looping over the expense
types (line 4), and for each DataFrame we plot its values as a line on the
graph. At the end of the method, we call the show method, which draws our
data as shown in Figure 4.6.
Figure 4.6: Example of expenses grouped weekly per expense type
Figure 4.6 shows a line chart of the weekly expenses grouped by expense
type. The chart has a legend and labels for both axes. However, it is hard
to read because the lines overlap and cross each other frequently, making it
difficult to compare the trends and values of the different expense types.
Moreover, the colors of the lines are not very distinct.
Building behavioral driver estimator
In the following example, we are going to simplify our expenses even
further by grouping them in such a way that we can find a trend in what
seems to be the most dominating expense in our home budget.
1. def aggregate(self):
2.     values = []
3.     for expense in EXPENSES:
4.         self.df[expense]["Date"] = pd.to_datetime(self.df[expense]["Date"][1:]) - pd.to_timedelta(7, unit="d")
5.         self.df[expense] = self.df[expense].groupby([pd.Grouper(key="Date", freq="W")])["Value"].sum()
6.         values.append((expense, self.df[expense].sum()))
7.
8.     self.df_most = pd.DataFrame(values, columns=("Type", "Value"), index=EXPENSES)
9.     self.df_most = self.df_most.sort_values("Value", ascending=False)
10.
11. def plot(self):
12.     ax = self.df_most[:5].plot.bar(rot=0)
13.     for container in ax.containers:
14.         ax.bar_label(container)
15.     plt.xlabel("Type")
16.     plt.ylabel("Amount")
17.     plt.show()
Code 4.24
With this simple aggregation (lines 8-9), after aggregating the data weekly
we sum all the expenses by expense type and finally sort them by descending
value. With this approach we know what the top five (line 12) most valuable
expense types in our budget data are.
The next step is to plot a bar chart to see our "candidates", as shown in
the following figure:
Figure 4.7: Bar chart of the top 5 biggest expenses in home budget
Now that we know what is mostly burning our budget and how to find it, let
us try to apply some interpolation to our script so we can see how this
trend can change over time:
1. def get_most_expenses(self):
2. return list(self.df_most[:5].index)
3.
4. def predict(self):
5. for key in self.get_most_expenses():
6. df = self.df_origin[key]
7. df = df.drop(df.columns[0], axis=1)
8. self.prepare_estimate(df)
9.
10. def prepare_estimate(self, df):
11. forecast_col = "Value"
12. # the rest of the code 4.16
13. df[forecast_col].plot()
14. df["Forecast"].plot()
15.
16. def plot(self):
17. plt.legend(loc=4)
18. plt.ylabel("Value")
19. plt.xlabel("Date")
20. plt.show()
Code 4.25
Code 4.25 extends our class defined in the previous examples. We can see
that we have added a method for extracting the top five most expensive
expense types. Next, once we have them, we use them in the predict method
(lines 4-5), which calls prepare_estimate (line 8). Note that predict reads
from self.df_origin; since aggregate overwrites self.df with weekly sums, we
assume the constructor also keeps an untouched copy of the loaded data under
that name. It is easy to notice that prepare_estimate is the same method we
already built in Code 4.16. In this case the column name that we use for
interpolation is different, albeit the method body is the same.
Let us put all the calls together in the main function body, as shown in the
following example:
1. @click.command()
2. @click.option("--
source", type=str, help="Source file to load", required=True)
3. def main(source):
4. i = Interpolation(source)
5. i.aggregate()
6. i.get_most_expenses()
7. i.predict()
8. i.plot()
Code 4.26
Let us take a look at how the (linear) interpolation is going to look when
we generate data with random values.
Figure 4.8: Top 5 biggest expenses over time with interpolation involved
We can see a lot of colorful data. The reason is the fact that we used
linear interpolation, which assumes that the data follows a gradual trend.
When we created the example data with the seed script (Code 4.18), we
created values that keep jumping up and down very aggressively. Once we use
real data, with its more gradual data sets, the linear estimator will work
very well here.
Conclusion
In this chapter, we learned how we can analyze data prepared in Excel. We
came to understand how Python can be used for reading and writing very broad
data sets. Once we had the data sets, we learned how to use Python for
manipulating data and drawing pretty complex graphs that represent the
analyzed data.
In the next chapter, we are going to learn how we can use Python for
crawling websites and extracting content out of them. We will also learn how
to make this effective and easy to learn.
1. https://en.wikipedia.org/wiki/Comma-separated_values
2. https://pypi.org/project/click/
3. https://matplotlib.org
4. https://en.wikipedia.org/wiki/Artificial_intelligence
5. https://scipy.org
6. https://scikit-learn.org/stable/index.html
7. https://docs.scipy.org/doc/scipy/tutorial/interpolate/smoothing_splines.html
8. https://pypi.org/project/Quandl/
9. https://scikit-learn.org/stable/
Join our book’s Discord space
Join the book's Discord Workspace for Latest updates, Offers, Tech
happenings around the world, New Release and Sessions with the Authors:
https://discord.bpbonline.com
OceanofPDF.com
CHAPTER 5
Building Non-blocking Web Crawler
Introduction
Every web service served over the HTTP(S) protocol can be reached on a very
low level. What we mean by this is that by using Python and a few libraries,
we can fetch any website with all its assets and save it locally for offline
use. We do not need any browser to do so, and the whole process can be fully
automated.
As you can imagine, sometimes what is more important in the internet world
is the information we may want to extract from the web: not the beautiful
assets of websites, but their data. In its plain state, data can be the most
valuable asset, and sometimes it may be the images themselves. Whichever
part of the information you want to extract because of its importance to
you, in this chapter we are going to learn how to extract those very
important assets and fetch them from remote website resources.
Structure
In this chapter, we will learn how to work with websites, starting with
simple examples where we learn how to analyze and parse plain text. Next, we
are going to learn how this works compared to HTML. We will discuss the
following topics:
Parsing HTML and extracting data
Efficient data scraping
Using proxy services
Objectives
By the end of this chapter, you will know how to build your own efficient
web crawler, what kind of challenges it brings, and how to solve them. Let
us start coding!
Working with text
To write an effective web crawler, you should already know how to work with
text in Python. We assume you have some knowledge already about string
functions and string concatenation. Let us start with a basic sample text, a
Wikipedia article about Python1, that we want to download and convert to
plain text:
1. curl -L "https://en.wikipedia.org/wiki/Python_(programming_language)" | html2text > output.txt
Code 5.1
This is one way of fetching an article and converting it to raw text, albeit
you need the html2text script on your computer. Easier, and giving the same
effect, is downloading the HTML source and using Python for the conversion.
Refer to the following code:
1. $ curl -L "https://en.wikipedia.org/wiki/Python_(programming_language)" -o source.html
2.
3. $ pip install html2text
Code 5.2
Now convert and save the output to output.txt file:
1. import html2text
2.
3. with open('source.html', 'r') as f_read:
4.     data = html2text.html2text(f_read.read())
5.
6. with open('output.txt', 'w') as f_write:
7.     f_write.write(data)
Code 5.3
Once the content is downloaded as plain text, we would like to calculate how
many times the word python appears in the downloaded text. Here is the
simplest way to do it, with just basic string operations. For the following
examples, we will be using ipython2:
1. with open('output.txt', 'r') as f:
2. data = f.read()
3.
4. len(data.split('python')) - 1
Code 5.4
Verification of the result can be done with the following code. We are using
standard CLI tools: we print out the content of the output.txt file, then
redirect the output to the grep tool, where we try to find the word python.
In the end we use the wc command, which helps us count the number of lines
that grep produced. The special character | is called a pipe, and it is used
in POSIX systems like Linux to redirect the output of one CLI command to
another as its input. Let us check the following example more closely to see
how we managed to achieve this:
1. $ cat output.txt | grep python | wc -l
Code 5.5
We subtracted 1 after calling the len function (Code 5.4, line 4) because
the array generated by the split function (Code 5.4, line 4) will always
have one extra element, which we should not count. Check it yourself with
the following example:
1. len("a b c".split("b"))
Code 5.6
The next exercise will extract all possible dates from our source text.
String functions are great for many cases, but they cannot help much with
extracting what we need here. We will use a heavier gun, regular
expressions, and build a simple regexp to extract all possible combinations
of what we want:
1. import re
2.
3. DATES = re.compile(r"[A-Z]{1}[a-z]{3,}\s+[0-9]+, [0-9]+")
4. result = DATES.findall(data)
5.
6. print(result)
7. ['September 7, 2022']
Code 5.7
There is a lot going on in there, correct? Let us explain a little bit. In
line 1, we imported the regular expression module from the standard Python
library. In line 3, we compiled the regular expression for faster execution,
which is very helpful with large texts you want to process. Notice below why
the regexp looks the way it does. We strongly suggest reading the Wikipedia
article3 on regular expressions to understand better how they work and how
powerful they are. After extracting all possible combinations from the text,
we found one occurrence of a full date in the source text (Code 5.7,
line 4). The regular expression we defined (Code 5.7, line 3) helps us
extract a date string of the form <full month name> <day>, <year>. If we
want to find other variations of dates in the text, we would have to update
our regular expression.
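For instance, a date written as 17 November 2021 would not match our pattern. A sketch of an additional pattern that would catch that day-first variant:
1. import re
2.
3. # matches e.g. "17 November 2021" (day, full month name, year)
4. DATES_DAY_FIRST = re.compile(r"[0-9]{1,2}\s+[A-Z][a-z]{3,}\s+[0-9]{4}")
5. print(DATES_DAY_FIRST.findall("Released on 17 November 2021."))
6. # ['17 November 2021']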
Figure 5.1: Example of extracting date string
Figure 5.1 shows how extracting a date pattern (November 17, 2021 in this
case) works when the regular expression is in use. Each part of the date
pattern is described in the regexp language. Let us see another example of
how we can extract data and use it to create new data structures. The
requirement is this: we want to create a CSV4 file with all book and article
references found in our source text. Refer to the following code:
1. import re
2. import csv
3.
4. with open('output.txt', 'r') as f:
5.
6. data = f.read()
7. REF = re.compile(r"[0-9]+\..*?\[\"(.*?)\"\]\((http.*?)\)[\.\s]+
(.*?)\.", re.M)
8.
9. results = REF.findall(data)
10. with open('reference_links.csv', 'w') as csvfile:
11. csv_writer = csv.writer(csvfile)
12. for item in results:
13. csv_writer.writerow(item)
Code 5.8
In the above example, we extracted all the articles, with the corresponding
link and an additional comment, into 3 columns in a CSV file. The Python
structure is going to look like the following example:
1. [('Core Security', 'https://www.coresecurity.com/', '_Core Security_'),
2. ('What is Sugar?', 'http://sugarlabs.org/go/Sugar', 'Sugar Labs'),]
Code 5.9
The above code is not greatly efficient: in line 12, we are doing a loop
over the results. That is only suggested if you need to manipulate the data
before adding a row to the CSV file. In our case this is not required, so we
can save the results more efficiently. Writing the output file will then
look like the following example:
1. with open('reference_links.csv', 'w') as csvfile:
2. csv_writer = csv.writer(csvfile)
3. csv_writer.writerows(results)
Code 5.10
We hope you now have some basics of extracting data from plain text and
manipulating the results. With that lesson behind us, let us try to work a
little bit with the structure the internet talks in.
Working with HTML
In the previous section, we explained how to deal with plain text and build
useful regular expressions to extract the parts of text you want to process.
Now let us continue our exercise with the same kind of requirements but with
a different type of source data – HTML. First, let us fetch the source for
our example:
1. $ curl -L "https://en.wikipedia.org/wiki/Python_(programming_language)" -o source.html
Code 5.11
Having the source, we can start doing the same exercise: calculate how many
times the word Python is in the source text. We can do it the same way as
before, so there is no need to show the same code and technique. But what if
we want to extract those URLs to a CSV file, as we did before? Well, we
could try to use the Python module html2text, although we are not going to
do so, since we want to learn how to extract data from raw HTML pages, and
converting them to plain text first would defeat that purpose:
1. import re
2. import csv
3.
4. with open('source.html', 'r') as f_read:
5. html_page = f_read.read()
6.
7. cleaned_up_source = html_page.replace('\n', '')
8.
9. REF = re.compile(r'<li.*?<a href="#cite_ref-AutoNT.*?
class="external text"\s+href="(.*?)\".*?>"(.*?)"</a>.*?<i>(.*?)
<\/i>', re.I)
10. CONTENT = re.compile(r'>References</(.*?)>Sources</', re.M)
11.
12. content = CONTENT.findall(cleaned_up_source)
13. if content:
14. results = REF.findall(content[0])
15. with open('reference_links.csv', 'w') as csvfile:
16. csv_writer = csv.writer(csvfile)
17. csv_writer.writerows(results)
Code 5.12
As you can see, we build the regular expressions around the source. Why do
we first apply line 10 and then, once we have this block of content, apply
the actual regexp to it instead of directly to the whole source? We should
be precise and extract the correct content from the valid block we want to
focus on. Notice that the line 10 regexp is generic, so if we applied the
other regexp to the whole source content, we could dig out not only the part
that we need but also some additional results that could mislead the final
goal with wrong values. When working with text, extracting and parsing data
is always a matter of cutting out small chunks, parsing out the information
that you need, then cutting out even smaller chunks and parsing deeper.
Generally speaking, granularity is the key to success.
Notice that extracting data from HTML with regular expressions is not a bad
idea, but it has some issues; for instance, the complexity of the regexp
itself can grow if you would like to apply some twisted logic in it. Another
thing is that reading a regexp after a while is a bit hard if you do not
simplify it enough (notice our advice regarding splitting parsing logic into
smaller blocks). There is a way to make it easier and cleaner, for sure. Let
us show you the library that has been designed to help with this kind of
text processing challenge, Beautiful Soup5:
1. $ pip install beautifulsoup4
Code 5.13
After successful installation it is time to write some code that extracts
the same kind of data as the previous example (Code 5.12) did with regexp.
The following example, as we can agree, is more readable and easier to debug
later in the future:
1. from bs4 import BeautifulSoup
2. import csv
3. import re
4.
5.
6. with open('source.html', 'r') as f_read:
7. html_page = f_read.read()
8.
9. soup = BeautifulSoup(html_page, "html.parser")
10. cite_notes = soup.find_all('li', id=re.compile("cite_note-[0-9]+"))
11.
12. with open('reference_links.csv', 'w') as csvfile:
13. csv_writer = csv.writer(csvfile)
14. for item in cite_notes:
15. _link = item.find_all('a')[1]
16. _link_href = _link.attrs['href']
17. _link_text = _link.text.strip('"')
18. csv_writer.writerow([_link_href, _link_text])
Code 5.14
Libraries like BeautifulSoup are very intuitive to use. They are based on
object-oriented programming patterns and offer the full concepts of this
paradigm: methods, attributes, and much more. In our example you can see
that the effect (the CSV file) is pretty much the same as in the previous
examples (Code 5.12); we only intentionally skipped the third column in our
file.
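To see that object-oriented style in isolation, here is a tiny sketch run on an inline HTML snippet rather than on our downloaded source:
1. from bs4 import BeautifulSoup
2.
3. html = '<li id="cite_note-1"><a href="https://example.com">"Example"</a></li>'
4. soup = BeautifulSoup(html, "html.parser")
5. link = soup.find("a")        # the first anchor tag, as an object
6. print(link.attrs["href"])    # https://example.com
7. print(link.text.strip('"'))  # Example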
Basic example
As a first example of a web crawler, we will build a crawling project that
fetches some documents from the python.org website. To build such a crawler,
we must design a simple queueing mechanism to store a list of URLs that the
crawler should visit and fetch the content of. Once the content is
downloaded, we need a module that analyzes the downloaded content and
extracts the elements we need for further processing.
Figure 5.2: Concept of fetching web resources with support of a URLs queue
Let us start with building a simple queue system that stores URLs First In,
First Out (FIFO)6. In the following example (Code 5.15) we created a file
called crawler.py. We can see that for the concept of our simple queue we
used the Python module queue7. There is an additional attribute in its
instantiation (Code 5.15, line 6) that we intentionally skipped. That
attribute can be, for example, queue.Queue(10), where 10 is the size of the
queue. We do not need to limit the queue size here, since we want to parse
as many URLs as we can, so narrowing down the queue size would be an
obstacle in this use case.
You can also notice that we do not load the queue with any content at this
stage. We only created the method _get_element for pulling elements out of
the queue. Refer to the following code:
1. import queue
2.
3.
4. class Crawler:
5. def __init__(self):
6. self.urls_queue = queue.Queue()
7.
8. def _get_element(self):
9. return self.urls_queue.get()
10.
11. def process(self):
12. """Main method to start crawler and process"""
13. pass
Code 5.15
The next improvement that we want to make in our code is to add
functionality allowing us to load content, which is a list of URLs to
process, into our queuing system. We do not want to hardcode any list of
URLs in our code but rather something more dynamic. Let us use a CSV file
with 2 columns, which will look like this:
Column 1: Stores all URLs that we want to crawl.
Column 2: Number of retries if an error is detected.
How do we load this file into our queue? Again, we could use some sort of
hardcoded file name in our code, but this approach is not recommended since
we want to process any given file; so we should be able to parametrize our
crawler.py. We will use the module click8, which allows us to build advanced
command line tools. In the following example we are adding command line
argument support. Please remember to import the click module at the top of
Code 5.15:
1. import click
2. import os
3.
4.
5. @click.command()
6. @click.option("--source", help="CSV full file path", required=True)
7. def main(source):
8. """Main entry point for processing URLs and start crawling."""
9. assert os.path.exists(source), f"Given file {source} does not exist"
10. c = Crawler()
11. c.process()
12.
13. if __name__ == '__main__':
14. main()
Code 5.16
We have imported two new modules: os and click. As for click, it is imported
for obvious reasons; we wanted simple command line tool support to run our
script like in the following example, Code 5.17. As for the os module, we
have imported it to validate whether the given source file path is valid.
Refer to the following code:
1. $ python crawler.py --source=skdjhfdksjh
2. Traceback (most recent call last):
3. File "crawler.py", line 26, in <module>
4. main()
5. File "/Users/hubertpiotrowski/.virtualenvs/fun1/lib/python3.7/site-
packages/click/core.py", line 1130, in __call__
6. return self.main(*args, **kwargs)
7. File "/Users/hubertpiotrowski/.virtualenvs/fun1/lib/python3.7/site-
packages/click/core.py", line 1055, in main
8. rv = self.invoke(ctx)
9. File "/Users/hubertpiotrowski/.virtualenvs/fun1/lib/python3.7/site-
packages/click/core.py", line 1404, in invoke
10. return ctx.invoke(self.callback, **ctx.params)
11. File "/Users/hubertpiotrowski/.virtualenvs/fun1/lib/python3.7/site-
packages/click/core.py", line 760, in invoke
12. return __callback(*args, **kwargs)
13. File "crawler.py", line 21, in main
14. assert os.path.exists(source), f"Given file {source} does not exist"
15. AssertionError: Given file skdjhfdksjh does not exist
Code 5.17
As you can see in Code 5.16, we added an assertion in line 9 for when the
given file path does not exist. In Code 5.17 we tested the assertion, and as
you can see, it is working as expected. Now let us create a valid CSV file
which looks like the following example: 4 URLs to load (1st column) and the
number of retries (2nd column):
1. https://www.python.org,1
2. https://www.python.org/community/forums/,2
3. https://www.reddit.com/r/learnpython/,2
4. https://en.wikipedia.org/wiki/Elvis_Presley,3
Code 5.18
In the following code example, we have modified the main method so it can
read a given existing CSV file (let us call it urls.csv for clarity in this
code example) and load its content into our processing queue:
1. import click
2. import csv
3. import os
4. import queue
5.
6. class Crawler:
7. def __init__(self):
8. self.urls_queue = queue.Queue()
9.
10. def load_content(self, file_path):
11. with open(file_path, 'r') as f:
12. reader = csv.reader(f)
13. for row in reader:
14. self.urls_queue.put(row)
15.
16. click.echo(f"After loaiding CSV content queue size id: {self.urls_q
ueue.qsize()}")
17.
18. def _get_element(self):
19. return self.urls_queue.get()
20.
21. def process(self):
22. """Main method to start crawler and process"""
23. pass
24.
25.
26. @click.command()
27. @click.option("--source", help="CSV full file path", required=True)
28. def main(source):
29. """Main entry point for processing URLs and start crawling."""
30. assert os.path.exists(source), f"Given file {source} does not exist"
31. c = Crawler()
32. c.load_content(source)
33. c.process()
34.
35.
36. if __name__ == '__main__':
37. main()
Code 5.19
We added a new method, load_content, which loads the CSV file and pushes its
content to the queue, which we will be processing in the later-stage method
called process. For the time being it does not do anything, but in the
following example let us try to add some logic there which is going to:
Consume the queue in FIFO order
Validate the response content
Save the content of the crawled website to an external file
Before we can start building this part of the code (Code 5.19) to fetch
website content for us, we must install the requests9 package. It is a
simple-to-use HTTP library with a very elegant API. We could use the
standard Python library or some framework like Twisted10, which we used in
the examples of Chapter 3, Designing a Conversational Chatbot. In this case
we would like to keep things clear and easy to follow given the more complex
scenarios that our code is going to cover:
1. pip install requests -U
Once requests is installed, we must modify our process method to resemble
the example in Code 5.20:
1. import requests
2.
3.
4. def process(self):
5. """Main method to start crawler and process"""
6. while self.urls_queue.qsize() > 0:
7. url, _ = self.urls_queue.get()
8. response = requests.get(url)
9. if response.status_code == 200:
10. f_name = sha256(url.encode('utf-8')).hexdigest()
11. output_file = f"/tmp/{f_name}.html"
12. with open(output_file, "w") as f:
13. f.write(response.text)
14. click.echo(f"URL: {url} [saved] under {output_file}")
Code 5.20
You probably noticed that we updated the process method with a mechanism
that pulls elements from the queue until it is empty (lines 6-7). There is
slightly unusual syntax in line 7: we pull an element from the queue that
comes as a list of 2 elements, the URL and the number of retries to fetch
its content. Since for the time being we ignore the number of retries, we
immediately assign that value to a void variable (the underscore symbol).
For each element being pulled, we call the external resource (line 8) and
check the response code11 (line 9). Only valid responses are processed in
the further code. As we proceed, we calculate SHA-256 over the resource URL
and use it as the file name (line 10) to save the downloaded content
(lines 11-14); remember to also import sha256 from hashlib at the top of the
file. Looks simple and clean.
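The hashing trick deserves a quick illustration: SHA-256 turns any URL into a fixed-length hexadecimal string that is safe to use as a file name, and the same URL always maps to the same name:
1. from hashlib import sha256
2.
3. url = "https://www.python.org"
4. f_name = sha256(url.encode("utf-8")).hexdigest()
5. print(f"/tmp/{f_name}.html")  # a deterministic, filesystem-safe path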
Now, how do we improve our code to retry when the response fails? Let us
analyze the following example, Code 5.21:
1. import time
2.
3. SLEEP_TIME = 1
4.
5. def process(self):
6. """Main method to start crawler and process"""
7.
8. while self.urls_queue.qsize() > 0:
9. url, number_of_retries = self.urls_queue.get()
10.
11. for try_item in range(int(number_of_retries)):
12. click.echo(f"Number of retries: {try_item+1}/{number_of_retrie
s}")
13. response = requests.get(url)
14. if response.status_code == 200:
15. f_name = sha256(url.encode('utf-8')).hexdigest()
16. output_file = f"/tmp/{f_name}.html"
17. with open(output_file, "w") as f:
18. f.write(response.text)
19. click.echo(f"URL: {url} [saved] under {output_file}")
20. break
21. else:
22. click.echo(f"Fetching resource failed with status code {respon
se.status_code} sleeping {SLEEP_TIME}s before retry")
23. time.sleep(SLEEP_TIME)
Code 5.21
Simple crawler
After a few modifications, as you can see in Code 5.21, we are using the
number of retries (line 11), and when fetching the content fails
(lines 21-23) we sleep before the next try. Notice that we used the global
constant SLEEP_TIME (Code 5.21, line 3). We have written the value this way
because it is a global constant, which needs to be capitalized. Refer to
Chapter 1, Python 101 and Chapter 2, Setting up Python Environment, for more
details about the syntax; in this case we decided to use a constant since it
is a read-only value which we may want to use all around the code, and it
will always indicate the same sleep time for code retries. Let us test it
and add the following line to the source CSV file:
1. http://dummy.non-existing.url.com,5
Now, run the code; the result is shown in the following example:
1. Traceback (most recent call last):
2. File "/Users/hp/.virtualenvs/fun1/lib/python3.7/site-
packages/urllib3/connection.py", line 175, in _new_conn
3. (self._dns_host, self.port), self.timeout, **extra_kw
4. File "/Users/hp/.virtualenvs/fun1/lib/python3.7/site-
packages/urllib3/util/connection.py", line 72, in create_connection
5. for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREA
M):
6. File "/Library/Frameworks/Python.
framework/Versions/3.7/lib/python3.7/socket.py", line 752, in getaddrin
fo
7. for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
8. socket.gaierror: [Errno 8] nodename nor servname provided, or not kno
wn
9.
10. During handling of the above exception, another exception occurred:
11.
12. (...)
13.
14. requests.exceptions.ConnectionError: HTTPConnectionPool
(host='dummy.non-existing.url.com', port=80): Max retries exceeded
with url: / (Caused by NewConnectionError('<urllib3.connection.
HTTPConnection object at 0x7fd3705ba2d0>:
Failed to establish a new connection:
[Errno 8] nodename nor servname provided, or not known'))
Code 5.22
What just happened in Code 5.22? Our improved code from Code 5.21 should
react to this and retry when the URL cannot be fetched. Please read
Code 5.22 carefully: we crashed with an exception that explicitly tells us
as developers that the requests library has raised "nodename nor servname
provided, or not known". Setting aside the precise error message (Code 5.22,
line 14), we have to notice an important thing – in our improved example
(Code 5.21) we check whether the response code from the server is different
from status code 200 (Code 5.21, line 14) and then retry when it is (for
instance, a 50312 error). That kind of logic is only reached in these cases:
The URL is valid and the server is responding, but with an error code.
Any kind of response code coming from the server that is different from
200, for instance 404, resource not found.
In the case presented in Code 5.22, the exception NewConnectionError is
raised, and checking with an if-else statement will not work here. We should
pack these kinds of cases into a try-except block and properly process these
conditions. So, to improve Code 5.21 we must do something like the following
example:
1. def process(self):
2. """Main method to start crawler and process"""
3. url_success = 0
4. url_fails = 0
5. while self.urls_queue.qsize() > 0:
6. url, number_of_retries = self.urls_queue.get()
7.
8. for try_item in range(int(number_of_retries)):
9. click.echo(f"Number of retries: {try_item+1}/{number_of_retrie
s}")
10. is_ok = False
11. try:
12. response = requests.get(url)
13. if response.status_code == 200:
14. f_name = sha256(url.encode('utf-8')).hexdigest()
15. output_file = f"/tmp/{f_name}.html"
16. with open(output_file, "w") as f:
17. f.write(response.text)
18. click.echo(f"URL: {url} [saved] under {output_file}")
19. url_success += 1
20. is_ok = True
21. break
22. except Exception as e:
23. click.echo(f"We failed to fetch URL {url} with exeception: {e
}")
24.
25. click.echo(f"Fetching resource failed with status code {response.
status_code} sleeping {SLEEP_TIME}s before retry")
26. time.sleep(SLEEP_TIME)
27. if not is_ok:
28. url_fails += 1
29. click.echo(f"We fetched {url_success} URLs with {url_fails} fails")
Code 5.23
We managed to slightly change our main process method, without too much
revolution, so that it is possible to properly catch exceptions when
fetching the given URL. You can probably notice in line 22 that we catch a
generic exception without being too specific, so all possible failures
trigger a retry. Say our code fails because of a logical error that has
nothing to do with reaching the destination URL: it will still retry until
the retries run out. At least we do log this, so as developers we may spot
the problem and fix it by investigating the logs later. This approach
ensures that no fatal exception stops our code from being executed properly.
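If we wanted to be stricter, one alternative is to catch only the network-level errors that requests groups under requests.exceptions.RequestException, letting genuine logic errors crash loudly. A sketch:
1. import requests
2.
3. try:
4.     response = requests.get("http://dummy.non-existing.url.com")
5. except requests.exceptions.RequestException as e:
6.     # connection errors, timeouts, invalid URLs and similar network failures
7.     print(f"Network-level failure, worth retrying: {e}")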
In line 29, additional logging informs us how many URLs from the given list
in the CSV we managed to fetch properly and how many times we failed. As you
can see, we added helper variables in lines 3-4 that count these values,
which we use at the end to print the stats.
We now know how to build a simple crawler, which is a very good starting
point, but what about extracting links and crawling the entire page? For
that we need to refactor our process method so it can extract the necessary
links to subpages and fetch them properly.
1. import re
2. from typing import Optional
3. from urllib.parse import urlparse
4.
5. def fetch_url(self, url: str, number_of_retries: int) -> Optional[str]:
6. click.echo(f"Fetching {url}")
7. for try_item in range(int(number_of_retries)):
8. try:
9. response = requests.get(url)
10. if response.status_code == 200:
11. f_name = sha256(url.encode('utf-8')).hexdigest()
12. output_file = f"/tmp/{f_name}.html"
13. with open(output_file, "w") as f:
14. f.write(response.text)
15. click.echo(f"URL: {url} [saved] under {output_file}")
16. return response.text
17. except Exception as e:
18. click.echo(f"We failed to fetch URL {url} with exeception: {e}")
19.
20. click.echo(f"Fetching resource failed with status code {response.st
atus_code} sleeping {SLEEP_TIME}s before retry")
21. time.sleep(SLEEP_TIME)
Code 5.24
We refactored our code from Code 5.23 to split the main process method into
smaller logical blocks – methods where fetching and saving the result of a
downloaded URL is now isolated from reading and processing the CSV file.
Please notice the additional imports added at the head of the changes. Now
let us look at the main process method and what is happening with it:
1. def process(self):
2. """Main method to start crawler and process"""
3. url_success = 0
4. url_fails = 0
5. while self.urls_queue.qsize() > 0:
6. url, number_of_retries = self.urls_queue.get()
7. base_url = urlparse(url)
8. content = self.fetch_url(url, number_of_retries)
9. results = LINK.findall(content)
10. click.echo(f"Found {len(results)} links")
11. for parsed_url in results:
12. if parsed_url.startswith('/'):
13. parsed_url = f"{base_url.scheme}://{base_url.netloc}
{parsed_url}"
14. if not parsed_url.startswith('http'):
15. continue
16. content = self.fetch_url(parsed_url, number_of_retries)
Code 5.25
As mentioned, we refactored the main method as shown in Code 5.25.
Noticeably, we are processing the downloaded page and extracting a list of
URLs (lines 9-16, Code 5.25) to fetch all the URLs found in the source page.
We added additional checks (lines 12-15) so that for each URL parsed from
the source:
If it is not correct, we continue and do not fetch any page (lines 14-15).
If it does not start with http/https, we use the main domain and scheme to
build a proper URL (lines 12-13).
The regular expression needed, added just after the other static definitions
at the top of the source file, is presented in the following example:
1. LINK = re.compile(r"<a.*?href=[\"'](.*?)[\"']", re.I)
Code 5.26
So far, we have built a web crawler that can process a given CSV file
containing a list of URLs we want to process. It fetches the content of each
individual URL, extracts all the URLs from it, and fetches the content of
those as well. We managed to build a simple yet powerful retry mechanism for
cases when a given URL resource cannot be fetched or the server keeps
rejecting access to the requested resource. Now, you have probably noticed
the bottleneck of our solution: we can only process one URL at a time, so
there are a few issues with it:
The processing time for all the possible URLs is very long.
We are not utilizing the resources of our computer and internet connection
properly.
Any failure slows down the whole process.
Parallel processing
In the Simple crawler section we built a simple crawler that got us into the
web crawling world, albeit at the same time we faced some limitations of the
solution presented there. In this section, we will refactor our crawler to
work in a multiprocessing and parallel paradigm.
The natural choice, it sounds like, would be multiprocessing13 or
multithreading14. We could use those libraries and write some parallel
processing with them. Since Python 3.x has a good implementation of an async
library, we will use async instead of multithreading. The reason is simple:
in Python there is the Global Interpreter Lock (GIL15), which in some cases
(especially for web crawling) can slow computation. Asynchronous programming
sounds like a better and more natural choice for web sockets and accessing
asynchronous resources, which is what the entire web is.
Let us start refactoring our main module, with the main function that starts
the crawling process. The following example will use a new file called
async_webcrawler.py. Before we can proceed with the following example, we
must install the async replacement for the requests library, called httpx16:
1. pip install httpx -U
Code 5.27
Once we have this module installed, it is time to do some refactoring.
Before doing so, we should analyze for a minute what the async world is. We
already briefly answered why using asynchronous code is better than
synchronous functions here. Since we use web resources, which are 100% async
and give no guarantee when and if they will respond, we should use async
programming. The main essence of async programming is, simplifying the
answer, that Python does not wait for a blocking resource to finish, which
would block the execution of the rest of the code, only continuing
computation after the blocking operation. Instead, Python puts the
information regarding the blocking resource aside (this is called a
coroutine). Once the resource is reached, it informs the Python core library
about the waiting process being finished, and then, based on the coroutine
(callbacks), Python decides what to do next.
To give you a better picture of a blocking resource, imagine a printer. You
can print a single character at a time; let us say it is a mechanical
limitation of such a resource. If you have blocking code, your code will let
the printer finish printing the first character before it can form the next
one. The same goes for other operations, such as updating the printing
progress bar in your operating system, and so on. By having async,
non-blocking code, you can control printing with coroutines, together with
other operations, without worrying too much about the blocking resource.
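Before refactoring, here is the smallest possible asyncio program, just to fix the vocabulary: async def defines a coroutine, await suspends it without blocking the event loop, and asyncio.run starts the loop:
1. import asyncio
2.
3. async def greet():
4.     print("taking a non-blocking nap...")
5.     await asyncio.sleep(1)  # suspends this coroutine; the loop stays free
6.     print("done")
7.
8. asyncio.run(greet())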
For the following example we could use the external library Twisted17,
which, as has been mentioned a few times, is the oldest and most mature
framework that supports event-driven and async programming. In this case, to
be more minimalistic and have cleaner code, we are going to use the built-in
Python functionality called asyncio18.
So, knowing all of this, let us start by refactoring the method that starts
our app, called main. As you can see in the following example, Code 5.28,
starting an async program differs. We must start an event loop that will be
controlling the mentioned coroutines:
1. import asyncio
2. import httpx
3.
4. async def main(source):
5. """Main entry point for processing URLs and start crawling."""
6. assert os.path.exists(source), f"Given file {source} does not exist"
7. c = Crawler()
8. await c.load_content(source)
9. await c.process()
10.
11.
12. @click.command()
13. @click.option("--source", help="CSV full file path", required=True)
14. def run(source):
15. asyncio.run(main(source))
16.
17.
18. if __name__ == '__main__':
19. run()
Code 5.28
We added two new major imports, removing the import of the requests module,
which is no longer needed, in favor of the asyncio library. A major change
is also how we had to refactor the main function. Right now, it is an
asynchronous function. Notice the async statement in front of def: it tells
Python that the function body will be asynchronous, and whichever part of
the code calls it must wait for its execution in a special way (as a
coroutine).
Additionally, from now on our run function starts the Python asyncio event
loop, which controls coroutines and async assets. That same function takes
the click CLI arguments, which are transferred to the main function, since
that one must be called in an async way.
With the following example, let us check how the entire crawler got
refactored and how to make async requests with the httpx library:
1. import asyncio
2. import click
3. import csv
4. import httpx
5. import os
6. import re
7. from hashlib import sha256
8. from typing import Optional
9. from urllib.parse import urlparse
10.
11. SLEEP_TIME = 1
12. LINK = re.compile(r”<a.*?href=[\”’](.*?)[\”’]”, re.I)
Code 5.29
In Code 5.29 we cleaned up the imports to drop all those that are no longer
necessary. You can notice that we dropped the import of the queue module –
why is that? The reason is that we must use an async queue, which is part of
asyncio. Because of this we must refactor the __init__ method as in the
following example:
1. class Crawler:
2. def __init__(self):
3. self.urls_queue = asyncio.Queue()
Code 5.30
You can see that initializing the queue looks very similar to the previous
examples (Code 5.19), but in this case it is an async queue, so the methods
it offers are different; we therefore also must refactor the method that
gets elements from the queue:
1. async def _get_element(self):
2. return await self.urls_queue.get()
Code 5.31
In line 2 we added await in pair with the return statement. That await is
needed to inform Python that this part of the code will return a coroutine
and we do not want to block on it.
The following example shows how we refactored the fetch_url method so it can
use the httpx module and work in the async pattern:
1. async def fetch_url(self, url: str, number_of_retries: int) -> Optional[str]:
2. click.echo(f"Fetching {url}")
3. for try_item in range(int(number_of_retries)):
4. try:
5. async with httpx.AsyncClient() as client:
6. response = await client.get(url, follow_redirects=True)
7. if response.status_code == 200:
8. f_name = sha256(url.encode('utf-8')).hexdigest()
9. output_file = f"/tmp/{f_name}.html"
10. with open(output_file, "w") as f:
11. f.write(response.text)
12. click.echo(f"URL: {url} [saved] under {output_file}")
13. return response.text
14. except Exception as e:
15. click.echo(f"We failed to fetch URL {url} with exeception: {e}")
16.
17. click.echo(f"Fetching resource failed with status code {response.st
atus_code} sleeping {SLEEP_TIME}s before retry")
18. await asyncio.sleep(SLEEP_TIME)
Code 5.32
In Code 5.32, lines 5-6, we changed requests to the httpx lib. Notice that
the whole block works in a context manager19 and, simultaneously, awaits the
external fetch within it. You can also notice something very important: in
line 18, when we put the code to sleep before retrying to fetch the given
resource URL, we do not use time.sleep. We avoid that sleeping statement
because it would block the event loop for the async code, which would put to
sleep everything that should be async, turning it into synchronous
programming. Refer to the following code:
1. async def process(self):
2. """Main method to start crawler and process"""
3. url_success = 0
4. url_fails = 0
5. while True:
6. url, number_of_retries = await self.urls_queue.get()
7. base_url = urlparse(url)
8. content = await self.fetch_url(url, number_of_retries)
9. results = LINK.findall(content)
10. click.echo(f"Found {len(results)} links")
11. for parsed_url in results:
12. if parsed_url.startswith('/'):
13. parsed_url = f"{base_url.scheme}://{base_url.netloc}
{parsed_url}"
14. if not parsed_url.startswith('http'):
15. continue
16. content = await self.fetch_url(parsed_url, number_of_retries)
17. if self.urls_queue.empty():
18. break
19. click.echo("Processing finished, exiting...")
Code 5.33
In another async method, process, we had to flip the logic of fetching
elements from the queue, if you compare it with Code 5.25. As you noticed in
that example, we were fetching elements from the queue while it was not
empty. In the async version we do this differently: we continue the fetching
loop until the queue is empty (lines 17-18). We changed this because of the
nature of the async queue and its public methods.
The rest of the body of the process function looks normal. We did not have
to change much except the parts blocking the code, which we converted to
coroutines.
After refactoring all the previously blocking code and making it async, we
must answer a very important question: did we manage to make processing
faster than synchronous processing? The answer is not that straightforward.
On one hand, if we compare apples to apples, yes, it is more efficient,
since we do not block while processing web resources (lines 5-18). But it is
not as efficient as it can be; we still do not apply any parallel
processing. To see how to do that, let us look at the code snippet in the
following example:
1. import random
2. import asyncio
3.
4.
5. async def func(func_number: int) -> None:
6. for i in range(1, 6):
7. sleep_time = random.randint(1, 5)
8. print(f"Func {func_number} go {i}/5, taking nap {sleep_time}s")
9. await asyncio.sleep(sleep_time)
10.
11.
12. async def call_tests():
13. await asyncio.gather(func(1), func(2), func(3))
14.
15. asyncio.run(call_tests())
Code 5.34
In Code 5.34 we wrote a simple function that takes one integer argument,
prints its status on the screen, and then goes to sleep for a random number
of seconds (line 9). As you have probably noticed, with the asyncio library
we do not send the function to any kind of thread. Instead, we start it as a
coroutine delegated to the background, and in line 13 we wait until all the
started async functions are finished and can return their results (None in
this case).
Knowing how parallel processing can be achieved in asynchronous programming
in Python, let us try to refactor our code:
1. async def process(self):
2. """Main method to start crawler and process"""
3. url_success = 0
4. url_fails = 0
5. while True:
6. url, number_of_retries = await self.urls_queue.get()
7. base_url = urlparse(url)
8. content = await self.fetch_url(url, number_of_retries)
9. results = LINK.findall(content)
10. click.echo(f"Found {len(results)} links")
11. calls = []
12. for parsed_url in results:
13. if parsed_url.startswith('/'):
14. parsed_url = f"{base_url.scheme}://{base_url.netloc}
{parsed_url}"
15. if not parsed_url.startswith('http'):
16. continue
17. calls.append(self.fetch_url(parsed_url, number_of_retries))
18. await asyncio.gather(*calls)
19. if self.urls_queue.empty():
20. break
21. click.echo("Processing finished, exiting...")
Code 5.35
In Code 5.40 we kept the main essence of our method the same way as it
was except that part that calls to fetch more sub-URLs after extracting them
out from main page (lines 14-17). Finally, we have parallel processing of
fetching and crawling URLs, but there is still thing to improve.
Notice that in line 12 we are still looping over the list of URLs loaded from the CSV file, and in line 18 we wait until downloading the content of the parsed URL addresses is finished. This is again not an optimal approach – why? Because we wait until fetching the content of a long list is finished. A list crawled from, for instance, a Wikipedia page can be massive, so downloading hundreds of pages can take a while before we can continue with the next URL from the CSV file.
Let us take a closer look at the refactored process method so we can see how it is going to work in fully parallel mode:
1. async def process(self):
2. """Main method to start crawler and process"""
3. calls = []
4. while True:
5. url, number_of_retries = await self.urls_queue.get()
6. calls.append(self.process_item(url, number_of_retries))
7. if self.urls_queue.empty():
8. break
9. await asyncio.gather(*calls)
10. click.echo("Processing finished, exiting...")
Code 5.41
In Code 5.41 we have refactored our main process method. As you can see, we removed the entire block that extracts all the URLs found in the page content and starts crawling them. Instead, we call a separate async method (lines 6-8) and gather the results (wait for them). That is way more efficient. In the following example, Code 5.42, you can see how we created the new method process_item:
1. async def process_item(self, url: str, number_of_retries: int):
2. base_url = urlparse(url)
3. content = await self.fetch_url(url, number_of_retries)
4. results = LINK.findall(content)
5. click.echo(f"Found {len(results)} links")
6. calls = []
7. extracted_urls = filter(lambda x: x.startswith('/') or x.startswith('http'),
results)
8. for parsed_url in extracted_urls:
9. if parsed_url.startswith('/'):
10. parsed_url = f"{base_url.scheme}://{base_url.netloc}
{parsed_url}"
11. calls.append(self.fetch_url(parsed_url, number_of_retries))
12. return await asyncio.gather(*calls)
Code 5.42
We do not change much in the body of the fetching and parsing block except for some code optimization. You should notice that we applied the filter function combined with a lambda (line 7) to filter out all the invalid URL strings that the regexp managed to catch. What is also worth noticing is that the filter function returns a lazy iterable object which, as already mentioned in Chapter 1, Python 101 and Chapter 2, Setting up Python Environment, improves memory utilization.
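As a quick illustration of that laziness, here is a minimal, self-contained sketch with made-up URL strings:
1. # illustrative only: filter() returns a lazy iterator, items are produced on demand
2. urls = ["/wiki/Python", "https://example.com", "ftp://host", "not-a-url"]
3. valid = filter(lambda x: x.startswith('/') or x.startswith('http'), urls)
4. print(next(valid))  # '/wiki/Python' – the remaining items have not been evaluated yet
5. print(list(valid))  # ['https://example.com'] – the rest is consumed on demand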
So far, we have learned how to optimize the crawler to fetch as many pages as possible. We have been building a crawler that can fetch HTML; adding a few extra regexps to get images is a good exercise to perform in the next example, since fetching binary files (for example, PNG) is a bit different from plain text. Refer to the following code:
1. IMAGES = re.compile(r"<img.*?src=[\"'](.*?)[\"']", re.I)
2.
3. class Crawler:
4. def __init__(self, call_levels: int):
5. self.urls_queue = asyncio.Queue()
6. self.__call_levels = call_levels
7.
8.
9. async def main(source:str, level: int):
10. """Main entry point for processing URLs and start crawling."""
11. assert os.path.exists(source),
f"Given file {source} does not exist"
12. c = Crawler(level)
13. await c.load_content(source)
14. await c.process()
15.
16.
17. @click.command()
18. @click.option("--source", help="CSV full file path", required=True)
19. @click.option("--level", help="Crawling depth level",
type=int, required=False, default=5)
20. def run(source, level):
21. asyncio.run(main(source, level))
Code 5.43
In Code 5.43 we have added a new regexp (line 1) for extracting image URLs. Next, we changed the constructor of the Crawler class (line 4) and added a new parameter, call_levels. We will use it in a later refactoring when we apply a technique called recursion and want to limit how many levels of recursion we go down. You can also see that in lines 17-21 we have support for this option in the main method, so executing the script is going to look like the following example:
1. python async_webcrawler_3.py --source=urls.csv --level=3
Code 5.44
We called this refactored file async_webcrawler_3.py, and in Code 5.44 we have an example of how to use it, loading the source URLs from a CSV file (urls.csv) and with 3 levels of recursion. Now let us see how we can do recursion in the following example:
1. async def process_item(self, url: str, number_of_retries: int, call_level: i
nt=1) -> asyncio.gather:
2. base_url = urlparse(url)
3. content = await self.fetch_url(url, number_of_retries)
4. results = LINK.findall(content)
5. parsed_images = IMAGES.findall(content)
6. click.echo(f"Found {len(results)} links [level: {call_level}]")
7. click.echo(f"Found {len(parsed_images)} images [level: {call_level}]
")
8. calls = []
9.
10. extracted_urls = filter(lambda x: x.startswith('/') or x.startswith('http'),
results)
11. parsed_images = filter(lambda x: x.startswith('/') or x.startswith('http')
, parsed_images)
12.
13. for parsed_url in parsed_images:
14. if parsed_url.startswith('/'):
15. parsed_url = f"{base_url.scheme}://{base_url.netloc}
{parsed_url}"
16. calls.append(self.fetch_url(parsed_url, number_of_retries))
17.
18. for parsed_url in extracted_urls:
19. if parsed_url.startswith('/'):
20. parsed_url = f"{base_url.scheme}://{base_url.netloc}
{parsed_url}"
21. if call_level < self.__call_levels:
22. calls.append(self.process_item(parsed_url, number_of_retries, c
all_level+1))
23. else:
24. calls.append(self.fetch_url(parsed_url, number_of_retries))
25.
26. return await asyncio.gather(*calls)
Code 5.45
We changed the way we run process_item method. We start it as before but
additionally we added new argument call_level. What we do in the body of
this method is:
Extract all found images URLs and fetch them – lines 13-16.
Extract all HTML links and fetch them in lines 21, if recurrency limit
has not been reached yet, we call same method that we are in
process_item, that is technique called recurrency. If recurrency limit is
reached (line 24), call regular fetch_url (with no recurrency in this
context). Please note in line 22 we increased recurrency level when we
call recurrent method to control how deep we went to and stop recurrent
calls if the max limit has been reached.
In Code 5.47, we will see how the fetch_url method was refactored. The main difference is how we get the downloaded content's body from the httpx response. Refer to the following code:
1. async def fetch_url(self, url: str, number_of_retries: int) -
> Optional[str]:
2. click.echo(f"Fetching {url}")
3. for try_item in range(int(number_of_retries)):
4. try:
5. async with httpx.AsyncClient() as client:
6. response = await client.get(url, follow_redirects=True)
7. content_type = response.headers.get('Content-Type').split(';')[0]
8. extension = content_type.split('/')[-1].lower()
9. if response.status_code == 200:
10. f_name = sha256(url.encode('utf-8')).hexdigest()
11. output_file = f"/tmp/{f_name}.{extension}"
12. with open(output_file, "wb") as f:
13. data = response.content
14. f.write(data)
15. click.echo(f"URL: {url} [saved] under {output_file}")
16. return data.decode('utf-8')
17. except Exception as e:
18. click.echo(f"We failed to fetch URL {url} with exception: {e}")
19.
20. click.echo(f"Fetching resource failed with status code {response.st
atus_code} sleeping {SLEEP_TIME}s before retry")
21. if response.status_code != 404:
22. await asyncio.sleep(SLEEP_TIME)
Code 5.47
In Code 5.47 we optimized how we derive the file extension from the server response metadata (line 8). It is not ideal, since some content types can be application/octet-stream or binary/octet+stream, and this can lead to weird file extensions in the file name we save under (line 11). For the needs of this exercise, we will keep it this way so as not to make things more complex.
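As a side note, one way to make this mapping more robust would be Python's standard mimetypes module; the following is only a sketch (the .bin fallback is our own arbitrary choice, not part of the crawler above):
1. import mimetypes
2.
3. content_type = "application/octet-stream"  # example value taken from a response header
4. # guess_extension() may return None for unknown content types, hence the fallback
5. extension = mimetypes.guess_extension(content_type) or ".bin"
6. print(extension)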
Going further with Code 5.47, you can see that we also refactored how we get the body of the response. Previously (Code 5.21, line 14) we had been fetching only text, so that technique was fine. Since we also want to fetch images, we now get the content differently (Code 5.47, line 13), and to save it properly we open the output file as a binary stream (line 12).
Improvements
In the previous sections we have been building a simple yet powerful web crawler. Now we will see how to introduce a few improvements to our Proof of Concept (POC). The first real-life issue that our web crawler may face is the performance of the local machine (computation). To address it, we will introduce a technique that limits the number of coroutines running simultaneously. As a result of introducing such a handbrake, we will limit the number of parallel downloads, which helps us limit the consumed resources.
Limit parallel processing
Let us check the following example; but before we look at the technique, we need to install the prerequisite library20:
Code 5.48
Once this library is installed, we can continue refactoring the fetching functions in recursive mode. First, we need to change the main method run so that it accepts an extra argument. Refer to the following code:
1. @click.command()
2. @click.option("--source", help="CSV full file path", required=True)
3. @click.option("--level", help="Crawling depth level",
type=int, required=False, default=5)
4. @click.option("--pool", help="Crawling pool size",
type=int, required=False, default=-1)
5. def run(source, level, pool):
6. asyncio.run(main(source, level, pool))
Code 5.49
We added a new pool option that controls whether a shared coroutine pool should be used. Let us look at the following example to see how to utilize this value. We added a new attribute that will drive the concurrency pool (None means no pool in use).
1. class Crawler:
2. def __init__(self, call_levels: int, concurrency: int=None):
3. self.urls_queue = asyncio.Queue()
4. self.__call_levels = call_levels
5. self.__concurrency = concurrency
Code 5.50
In the following example, we refactored the process_item method in such a way that it uses the pool or not, depending on the value passed to the constructor:
1. async def process_item(self, url: str, number_of_retries: int, call_level: i
nt=1) -> asyncio.gather:
2. base_url = urlparse(url)
3. content = await self.fetch_url(url, number_of_retries)
4. results = LINK.findall(content)
5. parsed_images = IMAGES.findall(content)
6. click.echo(f"Found {len(results)} links [level: {call_level}]")
7. click.echo(f"Found {len(parsed_images)} images [level: {call_level}]
")
8. calls = []
9.
10. extracted_urls = filter(lambda x: x.startswith('/') or x.startswith('http'),
results)
11. parsed_images = filter(lambda x: x.startswith('/') or x.startswith('http')
, parsed_images)
12.
13. for parsed_url in parsed_images:
14. if parsed_url.startswith('/'):
15. parsed_url = f"{base_url.scheme}://{base_url.netloc}
{parsed_url}"
16. calls.append(self.fetch_url(parsed_url, number_of_retries))
17. if self.__concurrency:
18. async with AioPool(size=self.__concurrency) as pool:
19. for parsed_url in extracted_urls:
20. if parsed_url.startswith('/'):
21. parsed_url = f"{base_url.scheme}://{base_url.netloc}
{parsed_url}"
22. if call_level < self.__call_levels:
23. await pool.spawn(self.process_item(parsed_url, number_of
_retries, call_level+1))
24. else:
25. await pool.spawn(self.fetch_url(parsed_url, number_of_retr
ies))
26. else:
27. for parsed_url in extracted_urls:
28. if parsed_url.startswith('/'):
29. parsed_url = f"{base_url.scheme}://{base_url.netloc}
{parsed_url}"
30. if call_level < self.__call_levels:
31. calls.append(self.process_item(parsed_url, number_of_retries,
call_level+1))
32. else:
33. calls.append(self.fetch_url(parsed_url, number_of_retries))
34.
35. return await asyncio.gather(*calls)
Code 5.51
You can notice that in Code 5.51 we are using two different techniques for processing the queue and starting the crawler, based on whether a pool is needed. If we specify a pool size, we use the AioPool class from the asyncio-pool library (lines 18-25); it needs to be imported first with from asyncio_pool import AioPool. We wait for the started tasks as in the previous examples in the section, Parallel processing, but we can limit the number of coroutines running simultaneously.
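As a side note, a similar cap can be achieved with the standard library alone using asyncio.Semaphore. The following is a minimal sketch, not part of our crawler; the sleep stands in for a real download coroutine:
1. import asyncio
2.
3. async def limited_fetch(semaphore, url):
4.     async with semaphore:  # at most 10 coroutines enter this block at once
5.         await asyncio.sleep(1)  # placeholder for the real fetch_url call
6.         return url
7.
8. async def main():
9.     semaphore = asyncio.Semaphore(10)
10.     urls = [f"https://example.com/{i}" for i in range(100)]
11.     return await asyncio.gather(*(limited_fetch(semaphore, u) for u in urls))
12.
13. asyncio.run(main())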
To start our main app with limited resources and a limited level of recursion, just run the following command. You should immediately notice how many fewer parallel fetching log messages appear on the screen.
1. python async_webcrawler_3.py --source=urls.csv --level=1 --pool=10
Proxy
In some cases, when you run our crawler, you will notice that some websites are smart. They quickly detect that somebody is crawling them, and since that may be something they do not like, they will block the traffic and you will start getting lots of errors, for instance the 403 response code, access denied.
This is because it is not natural behavior for a single IP address (which is how they see your laptop) to send so many requests simultaneously, asking for so many resources in parallel. There is a way to help us: we can send all the traffic to such a website via a proxy21, as shown in the following figure:
Figure 5.3: Example of how to use proxy service when sending request to website
To be able to use such a proxy service, we have two options. First, we can use an existing proxy solution; a list of free proxy services is mentioned in the page linked from the Wikipedia article21. Another option is to start cloud services independently and install a proxy service22. One way or the other, the way we are going to use it is presented in the following example, Code 5.52:
1. PROXIES = {
2. "http://": "http://proxy.foo.com:8030",
3. "https://": "http://proxy.foo.com:8031",
4. }
Code 5.52
In Code 5.52 we defined a fixed list of proxy servers; for the HTTP and HTTPS protocols we will use different proxy services23. Refer to the following code:
1. async def fetch_url(self, url: str, number_of_retries: int) -
> Optional[str]:
2. click.echo(f"Fetching {url}")
3. for try_item in range(int(number_of_retries)):
4. try:
5. async with httpx.AsyncClient(proxies=PROXIES) as client:
6. response = await client.get(url, follow_redirects=True)
7. content_type = response.headers.get('Content-Type').split(';')[0]
8. extension = content_type.split('/')[-1].lower()
9. if response.status_code == 200:
10. f_name = sha256(url.encode('utf-8')).hexdigest()
11. output_file = f"/tmp/{f_name}.{extension}"
12. with open(output_file, "wb") as f:
13. data = response.content
14. f.write(data)
15. click.echo(f"URL: {url} [saved] under {output_file}")
16. return data.decode('utf-8')
17. except Exception as e:
18. click.echo(f"We failed to fetch URL {url} with exception: {e}")
19.
20. click.echo(f"Fetching resource failed with status code {response.st
atus_code} sleeping {SLEEP_TIME}s before retry")
21. if response.status_code != 404:
22. await asyncio.sleep(SLEEP_TIME)
Code 5.53
Pretty simple, right? But this solution has one limitation: we define a single proxy service per protocol. We want random proxies to be used for each external call. To achieve this, let us use the following example:
1. import random
2.
3. _PROXIES_HTTP = ("http://proxy.foo.com:8030", "http://proxy2.foo.c
om", "http://proxy3.foo.com")
4. _PROXIES_HTTPS = ("https://http-proxy1.foo.com", "https://http-
proxy2.foo.com","https://http-proxy3.foo.com")
5.
6. async def fetch_url(self, url: str, number_of_retries: int) -
> Optional[str]:
7. my_proxies = {
8. "http://": random.choice(_PROXIES_HTTP),
9. "https://": random.choice(_PROXIES_HTTPS),
10. }
11. click.echo(f"Fetching {url}")
12. for try_item in range(int(number_of_retries)):
13. try:
14. async with httpx.AsyncClient(proxies=my_proxies) as client:
15. ...
Code 5.54
In Code 5.54 we pick a random proxy from the predefined lists for each call. That way there is a lower probability that the destination server will notice that all subsequent requests are coming from the same place and block us.
Conclusion
In the world of web crawlers (sometimes called web spiders), what is most important, in my opinion, is that they are very agnostic and can crawl any URL. Based on the given parameters, that is, the level of recursion or the types of files to extract, they can fetch the requested resources without any hardcoded logic. Being agnostic and efficient matters, and parallel processing is a must-have.
To make our crawler even more efficient, we should also implement timeout support. We do not want to get stuck on a request to a resource that cannot be reached at the moment or is a dead end, since we have to crawl with high efficiency.
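For instance, httpx lets us configure timeouts directly on the client; a minimal sketch (the 10 s and 5 s values are arbitrary choices):
1. import httpx
2.
3. # fail after 5 s when connecting and after 10 s for the whole request
4. TIMEOUT = httpx.Timeout(10.0, connect=5.0)
5.
6. async def fetch_with_timeout(url: str) -> str:
7.     async with httpx.AsyncClient(timeout=TIMEOUT, follow_redirects=True) as client:
8.         response = await client.get(url)
9.         return response.text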
Another code path worth implementing is support for HTTP/224. The library that we use in our exercises (httpx) can easily support such requests25. As you can see, there is still a lot of room for improvement.
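Enabling HTTP/2 in httpx is, in principle, a one-flag change, assuming the optional dependency is installed with pip install httpx[http2]; a minimal sketch:
1. import httpx
2.
3. async def fetch_http2(url: str) -> str:
4.     # http2=True requires the extra 'h2' dependency (pip install httpx[http2])
5.     async with httpx.AsyncClient(http2=True) as client:
6.         response = await client.get(url)
7.         return response.text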
In the next chapter, we will learn how to use Python as a tool to help us build an effective virus scanner. We will also learn how Python can work with low-level operating system requirements.
1. https://en.wikipedia.org/wiki/Python_(programming_language)
2. https://ipython.org/
3. https://en.wikipedia.org/wiki/Regular_expression
4. https://docs.python.org/3/library/csv.html
5. https://beautiful-soup-4.readthedocs.io/en/latest/
6. https://en.wikipedia.org/wiki/FIFO_(computing_and_electronics)
7. https://docs.python.org/3/library/queue.html
8. https://pypi.org/project/click/
9. https://docs.python-requests.org/en/latest/index.html
10. https://twisted.org
11. https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
12. https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/503
13. https://docs.python.org/3/library/multiprocessing.html
14. https://docs.python.org/3/library/threading.html
15. https://wiki.python.org/moin/GlobalInterpreterLock
16. https://www.python-httpx.org
17. https://twisted.org
18. https://docs.python.org/3/library/asyncio.html
19. https://docs.python.org/3/library/contextlib.html
20. https://pypi.org/project/asyncio-pool/
21. https://en.wikipedia.org/wiki/Proxy_server
22. https://github.com/anapeksha/python-proxy-server
23. https://www.python-httpx.org/advanced/#http-proxying
24. https://en.wikipedia.org/wiki/HTTP/2
25. https://www.python-httpx.org/http2/
CHAPTER 6
Create Your Own Virus Detection System
Introduction
Computer viruses can certainly be a very big problem for every computer owner. They can spread rapidly in a computer system and affect the entire back office of an enormous organization just as easily as a single laptop at home.
In this chapter, we will learn how to write a simple yet very powerful virus scanner using Python. We will go step by step, from understanding how viruses can be detected to excluding them from use by the operating system – the so-called quarantine.
Structure
In this chapter, we will discuss the following topics:
Building files and directories scanner
Calculating hashing keys
Introducing viruses
Use and update viruses DB
Building map of suspicious files
Parallel processing
Objectives
After reading this chapter, you should know how to build your own virus scanner and how to get your tool working in your local system. We will also learn how to effectively get the latest virus definitions and use them.
Building files and directories scanner
In this section, we will learn how to scan the local file system efficiently. Every virus scanner should be able to scan files and directories as quickly as possible, without consuming too many operating system resources in the process. Let us start with the basic example shown in Code 6.1:
1. import click
2. import os
3. from pprint import pprint
4.
5.
6. def extra_list(file_path, extract_type):
7. data = []
8. assert extract_type in ('isfile', 'isdir')
9. for f in os.listdir(file_path):
10. absolute_file_path = os.path.join(file_path, f)
11. if getattr(os.path, extract_type)(absolute_file_path):
12. data.append(absolute_file_path)
13. return data
14.
15.
16. def scanner(file_path):
17. files = extra_list(file_path, 'isfile')
18. dirs = extra_list(file_path, 'isdir')
19. for f in files:
20. yield f
21. for d in dirs:
22. yield from scanner(d)
23.
24.
25. @click.command()
26. @click.option("--fpath", help="Path to start scanning", required=True)
27. def main(fpath):
28. pprint(list(scanner(fpath)))
29.
30.
31. if __name__ == '__main__':
32. main()
Code 6.1
We created a file called recurrency_listing.py and used the already well-known click1 library from the previous Chapter 5, Building Non-blocking Web Crawler. We run Code 6.1 as shown in the following Code 6.2. This is going to scan the folder /var/tmp and create a list of all the found files, keeping the absolute path that points to each of them.
You can notice that we created the function extra_list which lists, in each folder, all accessible elements that can be either a file or a folder (line 8). It then calls the correct method to check whether each entry is a file or a folder (line 11). Next, an interesting technique is used in the function scanner, where we scan and list the main folder and its subfolders recursively. It is worth checking how we used recursion for building a list of all found files (lines 20 and 22). We used the yield syntax to build a generator, which was already covered in Chapter 2, Setting up Python Environment.
The reason for using a generator instead of building a list is simple: in line 28, at the moment we convert the generator to a list, we have what we want – a list of all files discovered in a given folder (/var/tmp in this example, Code 6.2). Memory is the main reason for using a generator: building the whole list at once, as we do in line 28, is not memory efficient.
Let us check the following example to see how we can change the mentioned inefficient part of the code into something that performs better (memory-wise) and still leads to the same result.
1. def check_for_virus(file_path):
2. """pseudo code for scanning file if it contains virus"""
3. pass
4.
5. @click.command()
6. @click.option("--fpath", help="Path to start scanning", required=True)
7. def main(fpath):
8. for file_path_to_scan in scanner(fpath):
9. check_for_virus(file_path_to_scan)
Code 6.3
We introduced the pseudo function check_for_virus, which is called in the loop (lines 8-9). Since we use a generator in this example, such an on-demand call is more efficient than generating the whole list at once.
Since Python provides functionality for scanning the file system out of the box, we can use the function os.walk to replace our example using recursive calls with something even more efficient. Let us check the following example, Code 6.4:
1. import click
2. import os
3.
4. def scanner(file_path):
5. for (root, dirs, files) in os.walk(file_path, topdown=True):
6. for f in files:
7. yield os.path.join(root, f)
8.
9.
10. @click.command()
11. @click.option("--fpath", help="Path to start scanning", required=True)
12. def main(fpath):
13. print(list(scanner(fpath)))
14.
15.
16. if __name__ == '__main__':
17. main()
Code 6.4
You may notice that this approach is also using loops and yield statement
although in this case we are using built-in Python function for scanning
given directory path. This solution is slightly better since it uses cPython
low level library.
Calculating hashing keys
We have learned how to build a simple file system scanner which can help us identify files in a given directory; it is time to apply an efficient hashing algorithm that can create a unique fingerprint for each of the found files.
This technique is called hashing. In Python, we have the library hashlib2 that delivers lots of useful hashing algorithms, although the ones that we will concentrate on are:
md5 sum3
sha2564
Calculating an md5 sum in Python is pretty simple. We just import hashlib and apply the hashing function on top of the string, as in the following example, Code 6.5:
1. from hashlib import md5
2.
3. data = b"some amazing string"
4. print(md5(data).hexdigest())
Code 6.5
Now we can compare how we calculated md5 with the following example, Code 6.6, where we calculate a sha256 hash on top of the example string.
1. from hashlib import sha256
2.
3. data = b"some amazing string"
4. print(sha256(data).hexdigest())
Code 6.6
We are using sha256 hashing; let us analyze why we use this instead of, for instance, md5. Python provides very clear and easy-to-use public interfaces via hashlib for calculating hashes, although there is one thing that you have to be aware of. It is already a known fact among developers that using md5 is not a good idea anymore because of hash collisions5. If hash collisions exist for specific strings, there is a fair probability that in a vast file system with hundreds of thousands of files we may find a few where calculating the md5 hash leads to a collision. That is an unwanted side effect, since we may mark both files as potential viruses (they have the same hashes, right?) whereas only one is the virus we have been hunting for.
If not md5, we will use sha256 to calculate hashes. A few key factors should be considered before using sha256:
The hash digest is longer
Hash collisions practically do not exist6
Calculation is about as efficient as md5
That all looks smooth and easy, but how exactly do we generate hashes (fingerprints) of the files that we listed as a result of scanning (Code 6.4)? The following example is a combination of Code 6.4 and 6.6:
1. import click
2. import os
3. from hashlib import sha256
4.
5. def scanner(file_path):
6. for (root, dirs, files) in os.walk(file_path, topdown=True):
7. for f in files:
8. yield os.path.join(root, f)
9.
10. @click.command()
11. @click.option("--fpath", help="Path to start scanning", required=True)
12. def main(fpath):
13. for file_path in scanner(fpath):
14. with open(file_path, 'rb') as f:
15. print(f"File: {file_path}, hash: {sha256(f.read()).hexdigest()}")
16.
17. main()
Code 6.7
This way of scanning and calculating is not very difficult to achieve. We run the standard scanner as we did earlier, but this time, on top of the discovered files, we calculate the sha256 hash. It will work pretty fast, except for big files, which will be slow and consume lots of memory, since we have to read the entire content of the file into memory (as bytes) and then calculate the hash.
The other concern we have here is the fact that we load the content of the file into memory in the first place, which may lead to an undesired situation – by reading the file content to calculate the hash, we actually loaded a virus into memory. To avoid this unwanted situation, we shall read the file content in a different way: we load the file content into a buffer (in blocks) and keep updating the sha256 state with each block; once we are done, we have the final hash. That is safer and more memory efficient. Let us take a look at Code 6.8 to understand how to achieve this:
1. import click
2. import os
3. from hashlib import sha256
4.
5. def scanner(file_path: str):
6. for (root, dirs, files) in os.walk(file_path, topdown=True):
7. for f in files:
8. yield os.path.join(root, f)
9.
10. def calculate_hash(file_path: str) -> str:
11. with open(file_path, "rb") as f:
12. file_hash = sha256()
13. chunk = f.read(8192)
14. while chunk:
15. file_hash.update(chunk)
16. chunk = f.read(8192)
17.
18. return file_hash.hexdigest()
19.
20. @click.command()
21. @click.option("--fpath", help="Path to start scanning", required=True)
22. def main(fpath):
23. for file_path in scanner(fpath):
24. hash_value = calculate_hash(file_path)
25. print(f"File: {file_path}, hash: {hash_value}")
26.
27. main()
Code 6.8
As you can see, Code 6.8 is not much different from Code 6.7, although in this case we read the file in 8 kB (8192-byte) blocks. Why 8k? We would have to go down to the C language and historical reasons7 – floppy disks, Intel 8086 CPUs, remote streams, serial ports; that is the simplified answer. Let us assume it is going to be an 8 kB block, and we are going to work by reading in blocks and updating the hash (lines 11-16).
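As a side note, if you run Python 3.11 or newer, the standard library already implements this chunked-reading pattern for us via hashlib.file_digest; a minimal sketch:
1. import hashlib
2.
3. def calculate_hash_311(file_path: str) -> str:
4.     # hashlib.file_digest (Python 3.11+) reads the file in chunks internally
5.     with open(file_path, "rb") as f:
6.         return hashlib.file_digest(f, "sha256").hexdigest()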
Introducing viruses
So far, we have learned how to scan local folders and build a list of discovered files. On top of that, we managed to understand how to fingerprint those files. You are probably wondering where this is leading us. Before we connect the dots, we shall understand some basics about viruses.
What is a computer virus? According to the internet8, "A computer virus is a
type of computer program that, when executed, replicates itself by
modifying other computer programs and inserting its own code into those
programs. If this replication succeeds, the affected areas are then said to be
"infected" with a computer virus, a metaphor derived from biological
viruses.”
So, we can see a virus is a computer program which can be executed on your personal computer. Since it is a file, we can calculate a hash on top of it. That means we can easily identify which files on our local filesystem are suspicious and may be a virus. These are the simplest kind of viruses, identified as single files. Sometimes they come in packages but, if we calculate hashes for them, they can be hunted easily. For more complex use cases, we would have to analyze files deeper and compare chunks of code (fingerprints) with a known virus database.
Use and update viruses DB
Once the working of a virus is known to us, we need to understand where to get this map of virus hashes. There are plenty of open-source internet databases of virus fingerprints. This is exactly what we are going to use to improve our little scanner so it can detect suspicious files.
We already mentioned earlier that using the MD5 hash is not the best idea, but if you want to modify the following example on your own, you can get MD5 hashes of viruses from a GitHub repository9. We are going to use sha256 to calculate and compare hashes, as mentioned earlier.
Let us modify Code 6.8 so that it can use a file with hash definitions, as shown in Code 6.9:
1. import click
2. import os
3. from hashlib import sha256
4.
5. BUFF_SIZE = 8192
6.
7.
8. def scanner(file_path: str):
9. for (root, dirs, files) in os.walk(file_path, topdown=True):
10. for f in files:
11. yield os.path.join(root, f)
12.
13.
14. def calculate_hash(file_path: str) -> str:
15. with open(file_path, "rb") as f:
16. file_hash = sha256()
17. chunk = f.read(BUFF_SIZE)
18. while chunk:
19. file_hash.update(chunk)
20. chunk = f.read(BUFF_SIZE)
21.
22. return file_hash.hexdigest()
23.
24.
25. @click.command()
26. @click.option("--
virus_def", help="File with virus definition", required=True)
27. @click.option("--fpath", help="Path to start scanning", required=True)
28. def main(fpath, virus_def):
29. with open(virus_def, 'rb') as f:
30. viruses_list = f.read().decode('utf-8').replace(' ', '').split('\n')
31.
32. for file_path in scanner(fpath):
33. hash_value = calculate_hash(file_path)
34. status = 'ok'
35. if hash_value in viruses_list:
36.
37. status = "virus!"
38.
39. print(f"File: {file_path}, hash: {hash_value}, status: {status}")
40. main()
Code 6.9
Now, we need to create an example file with the virus hash list. In our case, let us call such a file example_virus_sha256.bin, with content similar to Code 6.10:
1. ebf454d4b0d094cedf591c6dbe370c4796572a67139174da72559156dd2
265ed
2. 61db3163315e6b3b08a156360812ca5efff0093234201a994d6bdedaf85
afeb0
Code 6.10
Once we have the example file example_virus_sha256.bin created, it can be used with Code 6.9 by running it as in Code 6.11:
1. python code_6.9.py --fpath /var/tmp --
virus_def example_virus_sha256.bin
2.
3. <OUTPUT>
4. File: /var/tmp/cbrgbc_1.sqlite, hash:
ebf454d4b0d094cedf591c6dbe370c4796572a67139174da72559156dd2
265ed,
status: virus!
5. File: /var/tmp/cbrgbc_1.sqlite-shm, hash:
61db3163315e6b3b08a156f60812ca5efff009323420aa994d6bdedaf85a
feb0,
status: ok
6. File: /var/tmp/cbrgbc_1.sqlite-wal, hash:
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852
b855,
status: ok
Code 6.11
By running Code 6.11 we have detected that one file looks suspicious (line 4), because its hash is listed in our virus hash list (Code 6.10, line 1). In this case, we can not only detect that the file is a virus but also remove it from our computer. In Code 6.12, we modified the printing statement in such a way that we can remove the infected file.
1. import os
2.
3. VIRUSES_LIST = []
4.
5. def is_virus(file_path, hash_value):
6. status = True
7. if hash_value in VIRUSES_LIST:
8. status = False
9.
10. if status:
11. print(f"File: {file_path}, hash: {hash_value}, status: [ok]")
12. else:
13. print(f"File: {file_path}, hash: {hash_value}, status: virus! removi
ng...")
14. try:
15. os.remove(file_path)
16. except OSError:
17. print("Seem like detected file can't be remove at
the moment, in use?")
18.
19.
20. @click.command()
21. @click.option("--
virus_def", help="File with virus definition", required=True)
22. @click.option("--fpath", help="Path to start scanning", required=True)
23. def main(fpath, virus_def):
24. global VIRUSES_LIST
25. with open(virus_def, 'rb') as f:
26. VIRUSES_LIST = f.read().decode('utf-8').replace(' ', '').split('\n')
27.
28. for file_path in scanner(fpath):
29. hash_value = calculate_hash(file_path)
30. is_virus(file_path, hash_value)
Code 6.12
We slightly modified the main function (lines 23-30) by loading the virus definition file content into a global variable (lines 24-26; note the global statement, needed so the assignment updates the module-level list) and splitting the detection and removal of infected files into a separate function. This way it is cleaner and easier to analyze the given parameters: the calculated file hash and its location. Once detected as a virus, the file will be removed.
Please notice that we use OS exception catching (lines 16-17), because if we deal with some real virus, it may be up and running (cloning itself, for instance), which can stop us from removing the file. We will deal with such a case in the following examples.
While reading the virus hash list from a file, there are a few challenges. If we have multiple files like that, we cannot read all of them at once. Even if we manage to fix Code 6.12 to be able to read virus hash definitions from many files, there are still a few issues:
Each time we start our script, we have to load these files all over again
Keeping the result of loading such files in memory can lead to a significant memory footprint for our code
To solve this, we are going to use a local database which stores the virus hash definitions. In Code 6.13, we create a separate script that loads virus hashes into our DB:
1. import click
2. import os
3. import sqlite3
4.
5. DB_FILENAME = "virus.db"
6.
7.
8. class VirusDB:
9.
10. def __init__(self):
11. self.conn = sqlite3.connect(DB_FILENAME)
12.
13. def _execute(self, sql):
14. print(f"Executing: {sql}")
15. cursor = self.conn.cursor()
16. cursor.execute(sql)
17. return cursor.fetchall()
18.
19. def _commit(self, sql):
20. print(f"Insert/update: {sql}")
21. cursor = self.conn.cursor()
22. cursor.execute(sql)
23. return self.conn.commit()
24.
25. def init_table(self):
26. sql = """CREATE TABLE IF NOT EXISTS virus_db (
27. id INTEGER PRIMARY KEY AUTOINCREMENT,
28. virus_hash TEXT UNIQUE,
29. created_at TIMESTAMP DEFAULT CURRENT_TIMESTAM
P
30. )"""
31. print(self._execute(sql))
32.
33. def import_data(self, sources):
34. print(f"Importing: {sources}")
35. for source in sources:
36. assert os.path.exists(source), f"File {source} does not exist"
37. with open(source, 'r', encoding='utf-8') as f:
38. for line in f:
39. data = line.strip().strip('\n')
40. sql = f"INSERT OR IGNORE INTO virus_db (virus_hash)
values ('{data}')"
41. self._commit(sql)
42.
43.
44. @click.command()
45. @click.option("--
source", help="File with virus definition", multiple=True, type=str)
46. @click.option("--
operation", help="Operation type", required=True, type=click.Choice(['
init', 'import']))
47. def main(operation, source):
48. v = VirusDB()
49. if operation == 'init':
50. v.init_table()
51. elif operation == 'import':
52. assert source, 'We need source value'
53. v.import_data(source)
54.
55.
56. if __name__ == '__main__':
57. main()
Code 6.13
We created a script that uses the SQLite10 database engine to store all the virus hash values read from the given virus hash list files. We run our script as in Code 6.14. First, we initialize the database file.
1. $ python code_6.13.py --operation=init
2.
3. # output
4. Executing: CREATE TABLE IF NOT EXISTS virus_db (
5. id INTEGER PRIMARY KEY AUTOINCREMENT,
6. virus_hash TEXT UNIQUE,
7. created_at TIMESTAMP DEFAULT CURRENT_TIMESTAM
P
8. )
Code 6.14
We used the special SQL syntax CREATE TABLE IF NOT EXISTS, which helps avoid problems when we run the init script multiple times. It is easy to see that all data will be stored in the table virus_db. The database file that we are using is defined in Code 6.13, line 5. Once the DB is initialized, we can start loading data. Let us use the same file as in Code 6.11. Additionally, we create a second file called example_virus_sha256_2.bin with the example content in Code 6.15:
1. 1bf454d4b0d094cedf591c6dbe370c4796572a67139174da72559156dd
2265e1
2. 11db3163311e6b3b08a156360812ca5efff0093234201a994d6bdedaf85a
feb1
Code 6.15
In the following example, we run our main script (Code 6.13) with two source files containing virus definitions:
1. $ python code_6.13.py --operation=import --
source=example_virus_sha256.bin --
source=example_virus_sha256_2.bin
2.
3. # output
4. Importing: ('example_virus_sha256.bin', 'example_virus_sha256_2.bin'
)
5. Insert/update: INSERT OR IGNORE INTO virus_db (virus_hash) valu
es
('ebf454d4b0d094cedf591c6dbe370c4796572a67139174da72559156dd
2265ed')
6. Insert/update: INSERT OR IGNORE INTO virus_db (virus_hash) valu
es
('61db3163315e6b3b08a156360812ca5efff0093234201a994d6bdedaf8
5afeb0')
7. Insert/update: INSERT OR IGNORE INTO virus_db (virus_hash) valu
es
('1bf454d4b0d094cedf591c6dbe370c4796572a67139174da72559156d
d2265e1')
8. Insert/update: INSERT OR IGNORE INTO virus_db (virus_hash) valu
es
('11db3163311e6b3b08a156360812ca5efff0093234201a994d6bdedaf8
5afeb1')
Code 6.16
We uploaded all the files' content into our database (Code 6.13, lines 33-41), and now we can use these hashes to identify all potential viruses. Please notice that reading data from the database and inserting/updating data into it are slightly different. For reading, we use the fetchall method (Code 6.13, lines 14-17) on top of a cursor created from the connection. For inserting/updating, we do something similar, but then we need to commit the SQL statement via the connection to the database (Code 6.13, lines 20-23). Most of the Python database drivers work in this way11.
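As a side note, sqlite3, like most DB-API drivers, also supports parameter binding, which avoids assembling SQL strings by hand; a minimal sketch of the same insert (the hash value is a placeholder):
1. import sqlite3
2.
3. conn = sqlite3.connect("virus.db")
4. cursor = conn.cursor()
5. # the '?' placeholder lets the driver escape the value for us
6. cursor.execute("INSERT OR IGNORE INTO virus_db (virus_hash) VALUES (?)", ("abc123",))
7. conn.commit()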
In Code 6.13, we also use a feature of the click library that allows us to force the user to choose from a limited set of options for a command line parameter (Code 6.13, line 46), as in Code 6.17. We also made that parameter mandatory, so the user cannot skip it (Code 6.13, line 46). At the same time, we kept the optional parameter source (Code 6.13, line 45) to allow the user to specify the location of files with virus definitions.
1. python code_6.13.py
2.
3. # OUTPUT
4. Usage: code_6.13.py [OPTIONS]
5. Try 'code_6.13.py --help' for help.
6.
7. Error: Missing option '--operation'. Choose from:
8. init,
9. import
Code 6.17
In Code 6.18, we modify Code 6.13 in such a way that we can use our newly created database with virus hashes:
1. import click
2. import os
3. import sqlite3
4. from hashlib import sha256
5.
6. DB_FILENAME = "virus.db"
7.
8.
9. class VirusScanner:
10.
11. def __init__(self):
12. self.conn = sqlite3.connect(DB_FILENAME)
13.
14. def _execute(self, sql):
15. cursor = self.conn.cursor()
16. cursor.execute(sql)
17. return cursor.fetchall()
18.
19. def check_hash(self, hash_value) -> bool:
20. sql = f"SELECT * FROM virus_db WHERE virus_hash='{hash_value}' LIMIT 1"
21. cursor = self.conn.cursor()
22. cursor.execute(sql)
23. return True if cursor.fetchall() else False
24.
25. def is_virus(self, file_path, hash_value):
26. if not hash_value:
27. return
28. if self.check_hash(hash_value):
29. print(f"File: {file_path}, hash: {hash_value}, status: virus! rem
oving...")
30. try:
31. os.remove(file_path)
32. except OSError:
33. print("Seem like detected file can't be
remove at the moment, in use?")
34. else:
35. print(f"File: {file_path}, hash: {hash_value}, status: [ok]")
36.
37. def scanner(self, file_path: str):
38. for (root, dirs, files) in os.walk(file_path, topdown=True):
39. for f in files:
40. yield os.path.join(root, f)
41.
42. def calculate_hash(self, file_path: str) -> str:
43. if not file_path:
44. return
45. try:
46. with open(file_path, "rb") as f:
47. file_hash = sha256()
48. chunk = f.read(8192)
49. while chunk:
50. file_hash.update(chunk)
51. chunk = f.read(8192)
52.
53. return file_hash.hexdigest()
54. except OSError:
55. print(f'File {file_path} can not be opened at
the moment, skipping')
56.
57. def analyze(self, fpath):
58. for file_path in self.scanner(fpath):
59. hash_value = self.calculate_hash(file_path)
60. self.is_virus(file_path, hash_value)
61.
62.
63. @click.command()
64. @click.option("--fpath", help="Path to start scanning", required=True)
65. def main(fpath):
66. v = VirusScanner()
67. v.analyze(fpath)
68.
69.
70. if __name__ == '__main__':
71. main()
Code 6.18
In Code 6.18, we scan for files in the same way as we did in Code 6.12, with the difference that when we check whether the file being validated is on the blacklist, we use the database for this (Code 6.18, lines 19-23). We also added small improvements – when the file cannot be read for validation (Code 6.18, lines 42-55), we do not crash if the file is not present anymore or the resource is busy and in use by another process (lines 32-33).
The whole concept is based on the fact that we have the virus hash list in flat files. What if we could get such a hash definition list automatically from some internet source and not have to worry about manual downloads? In Code 6.19, we managed to create a simple snippet that fetches SHA-256 hashes from VirusBay12 and creates a flat file.
1. import requests
2.
3. MAIN_URL = 'https://beta.virusbay.io/sample/data'
4.
5. response = requests.get(MAIN_URL)
6. if response.status_code == 200:
7. data = response.json()
8. with open('virusbay.bin', 'w', encoding='utf8') as f:
9. for item in data['recent']:
10. virus_md5 = item['md5']
11. details_url = f'https://beta.virusbay.io/sample/data/{virus_md5}'
12. details_response = requests.get(details_url)
13. if details_response.status_code == 200:
14. data = details_response.json()
15. if 'sha256' in data:
16. f.write(f"{data['sha256']}\n")
Code 6.19
With this simple script, we fetch the latest virus SHA-256 hashes from VirusBay. Once we have the file virusbay.bin created, we can import its content into our virus database as in Code 6.20:
So far, we have managed to build a script that analyzes a given folder and checks whether a found file's SHA-256 hash matches the virus hash list. That approach may seem to be enough, but in real-case scenarios a virus may be smarter and hide inside, for example, ZIP files. With Code 6.33, we modify our base Code 6.18 in such a way that we can unzip compressed ZIP files and analyze the uncompressed content (files) that may be a virus.
Before we start modifying the script, we need to install the magic library13 that is going to help us analyze file types.
You might be wondering why we cannot just use the file extension, like .zip – the reason is that the file extension can be misleading about the actual file type. Each file in the file system has a digital fingerprint, called metadata, that describes the file type to the operating system. This is something like a file header. It can also be faked and broken by malicious software, but in this example, we will focus on how to read these headers and unzip files. To demonstrate why using metadata to check the file type is a better idea, let us check Code 6.22.
Let us use a PDF file as an example, rename its file extension to .txt and run Code 6.22:
1. import magic
2.
3. print(magic.from_file("test.txt"))
4.
5. # OUTPUT
6. 'PDF document, version 1.2'
Code 6.22
As you can see, this is the right way to analyze file types and properly check what kind of file we are dealing with. In Code 6.33, we will use this library to check when we are facing a ZIP file, so we can extract and analyze its content:
1. import click
2. import os, shutil
3. import magic
4. import sqlite3
5. import uuid
6. import zipfile
7. from hashlib import sha256
8.
9. DB_FILENAME = "virus.db"
10.
11.
12. class VirusScanner:
13.
14. def __init__(self):
15. self.conn = sqlite3.connect(DB_FILENAME)
16.
17. def _execute(self, sql):
18. cursor = self.conn.cursor()
19. cursor.execute(sql)
20. return cursor.fetchall()
21.
22. def check_hash(self, hash_value) -> bool:
23. sql = f"SELECT * FROM virus_db WHERE virus_hash='{hash_value}' LIMIT 1"
24. cursor = self.conn.cursor()
25. cursor.execute(sql)
26. return True if cursor.fetchall() else False
27.
28. def is_virus(self, file_path, hash_value):
29. if not hash_value:
30. return
31. if self.check_hash(hash_value):
32. print(f"File: {file_path}, hash: {hash_value},
status: virus! removing...")
33. try:
34. os.remove(file_path)
35. except OSError:
36. print("Seem like detected file can't be remove at
the moment, in use?")
37. else:
38. print(f"File: {file_path}, hash: {hash_value},
status: [ok]")
39.
40. def scanner(self, file_path: str):
41. for (root, dirs, files) in os.walk(file_path, topdown=True):
42. for f in files:
43. yield os.path.join(root, f)
44.
45. def calculate_hash(self, file_path: str) -> str:
46. if not file_path:
47. return
48. try:
49. with open(file_path, "rb") as f:
50. file_hash = sha256()
51. chunk = f.read(8192)
52. while chunk:
53. file_hash.update(chunk)
54. chunk = f.read(8192)
55.
56. return file_hash.hexdigest()
57. except OSError:
58. print(f'File {file_path} can not be opened at the moment, skippi
ng')
59.
60. def analyze_zip(self, fpath):
61. extract_dir = "/tmp/{tmp_id}/".format(tmp_id=str(uuid.uuid4()))
62. with zipfile.ZipFile(fpath, 'r') as zip_ref:
63. zip_ref.extractall(extract_dir)
64. self.analyze(extract_dir)
65. shutil.rmtree(extract_dir)
66.
67. def analyze(self, fpath):
68. for file_path in self.scanner(fpath):
69. try:
70. hash_value = self.calculate_hash(file_path)
71. self.is_virus(file_path, hash_value)
72. if 'zip' in magic.from_file(file_path).lower():
73. self.analyze_zip(file_path)
74. except OSError:
75. print(f'File {file_path} can not be opened at the moment, skip
ping')
76.
77.
78. @click.command()
79. @click.option("--fpath", help="Path to start scanning", required=True)
80. def main(fpath):
81. v = VirusScanner()
82. v.analyze(fpath)
83.
84.
85. if __name__ == '__main__':
86. main()
Code 6.33
In Code 6.33, we test whether each found file is infected with a virus (lines 70-71) and then whether it is a ZIP file (line 72). Once we know it is a ZIP archive, we unzip it into a temporary folder (lines 61-63), then scan and analyze its content (line 64). After that is done, we remove the temporary folder with its content (line 65, using shutil.rmtree, since os.remove works only on files).
Building map of suspicious files
Removing files that we suspect of being a virus is one possible solution. Most computer antivirus software will instead put suspicious files into quarantine. That means that the file cannot be read or accessed by any system process anymore while it stays in quarantine. Of course, there is also an option to remove files that stay in quarantine. Let us first see how we can put a single file into such a state that it cannot be accessed by other processes.
For Code 6.34, we assume we have a Linux or macOS (Unix-based) file system. For low-level file descriptor control we can use the fcntl Python library14, which allows us to create an exclusive lock on any kind of file.
1. import fcntl
2.
3.
4. def acquire_lock(file_name):
5. f = open(file_name, 'a+')
6. fcntl.flock(f, fcntl.LOCK_EX)
7. return f
8.
9. def release_lock(f):
10. f.close()
11.
12. file_descr = acquire_lock('/var/tmp/some-file.txt')
13.
14. release_lock(file_descr)
Code 6.34
As you can see in Code 6.34, we created two main functions: acquire_lock, where we create an exclusive lock (line 6) that will prevent other processes from reading the content of the given file (line 12), and a second function, release_lock, where we release the lock on the file descriptor.
It is important to understand that such an exclusive lock on a file only lives as long as the process that created it. Once the process is finished, the lock will be automatically removed by the operating system.
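We can also check from another process whether a file is already locked, by attempting a non-blocking lock; a minimal Unix-only sketch (the file path is just an example):
1. import fcntl
2.
3. f = open('/var/tmp/some-file.txt', 'a+')
4. try:
5.     # LOCK_NB makes flock fail immediately instead of waiting for the lock
6.     fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
7.     print("lock acquired")
8. except BlockingIOError:
9.     print("file is locked by another process")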
So far, we did exclusive locking for Unix-based systems. If we want to do something similar for both Windows and Unix OS, let us check the following example.
Before we can proceed with the example, we need to install the win32 library for Python on Windows OS. For how to use Python with virtualenv on a Windows machine, please refer back to Chapter 1, Python 101.
1. pip install pywin32
Code 6.35
The actual locking example that supports both Unix (POSIX) and Windows operating systems is shown in Code 6.36:
1. import os
2.
3.
4. def acquire_lock(file_name):
5. f = open(file_name, 'a+')
6. # Windows
7. if os.name.lower() == 'nt':
8. import win32con, win32file, pywintypes
9.
10. _overlapped = pywintypes.OVERLAPPED()
11. f_handle = win32file._get_osfhandle(f.fileno())
12. win32file.LockFileEx(f_handle, win32con.LOCKFILE_EXCLUS
IVE_LOCK, 0, 0xffff0000, _overlapped)
13. # Unix like
14. elif os.name.lower() == 'posix':
15. import fcntl
16.
17. fcntl.flock(f, fcntl.LOCK_EX)
18. return f
19.
20. def release_lock(f):
21. if f:
22. f.close()
23.
24.
25. file_descr = acquire_lock('some-file.txt')
26.
27. release_lock(file_descr)
Code 6.36
As you can see, locking a file on Windows is different from Unix-based systems. First, we must use a special Windows library that supports file descriptor operations, in our case locking, instead of using the file locking extension built into Python. Another important difference is the way we put the exclusive lock on a file (line 12), although that is an operating-system-related difference (lines 10-12).
In our example, Code 6.36, you can notice that putting a lock on a file is not so difficult, but there is an easier, OS-agnostic way of doing so. Let us install a library that is dedicated to file locking and puts a nice wrapper around the low-level file locking functionality.
Code 6.37
After installing the library from Code 6.37, let us modify our simple file locking example (Code 6.36) in such a way that we achieve the same locking technique but with the portalocker library.
1. import time
2. import portalocker
3.
4. with open('somefile', 'a+') as file:
5. portalocker.lock(file, portalocker.LockFlags.EXCLUSIVE)
6. file.seek(0)
7. time.sleep(10)
Code 6.38
Acquiring a lock with the portalocker library is simple compared to the low-level standard Python approach. Releasing the lock is also clean: since the file is opened in a context manager (line 4), Python will automatically release the lock when leaving the block. Another, similar approach to using portalocker can be seen in Code 6.39:
1. import os
2. import portalocker
3. with portalocker.Lock('some-file', 'rb+', timeout=60) as fh:
4. # we can execute some code here
5.
6. # flush and sync to filesystem if needed
7. fh.flush()
8. os.fsync(fh.fileno())
Code 6.39
Code 6.38 and 6.39, as mentioned earlier, only hold the exclusive lock while the process that created it is running. So, what should be done if we want a lock that protects infected files from being read? We need to make the process run forever. Before we apply this idea and modify Code 6.33, we have to understand how to approach this:
Run the analyzer code in the background and put all the suspicious files into quarantine.
Have another block of code creating locks.
The main body of the code will run forever.
So far, our examples have been using a single process (the script itself) with a single thread as the base coding pattern. To be able to deliver the above requirements, we need to split our code into separate threads. Let us check Code 6.40, which shows how to build simple threading:
simple threading:
1. import click
2. import logging
3. import threading
4. import time
5.
6.
7. class VirusScanner:
8.
9. def __init__(self):
10. self.threads = []
11.
12. def analyze(self, fpath):
13.
14. th = threading.Thread(target=self.analyze_path, args=(fpath,))
15. self.threads.append(th)
16. th = threading.Thread(target=self.analyze_locks)
17. self.threads.append(th)
18.
19. for x in self.threads:
20. x.start()
21.
22. for thread in self.threads:
23. thread.join()
24.
25. def analyze_path(self, fpath):
26. # code for analyzing
27. for i in range(0, 5):
28. print(f'analyze_path [{i}]')
29. time.sleep(1)
30. print(f'Finished analyzing path: {fpath}')
31.
32. def analyze_locks(self):
33. for i in range(0, 10):
34. print(f'analyze_locks [{i}]')
35. time.sleep(0.8)
36. print(f'Finished creating and analyzing locks')
37.
38.
39. @click.command()
40. @click.option("--fpath", help="Path to start scanning", required=True)
41. def main(fpath):
42. v = VirusScanner()
43. v.analyze(fpath)
44.
45.
46. if __name__ == '__main__':
47. main()
Code 6.40
In Code 6.40, we created pseudo code based on Code 6.33, in such a way that we run the file analysis and lock creation in separate threads (lines 14-17), start them one by one (lines 19-20) and then wait until all of them have finished running (lines 22-23).
In Code 6.40, the method analyze_locks runs in a short loop and then exits. In the final version, to make all the locking work correctly, we should run it in an infinite loop. Let us check Code 6.41 to see how to make it happen:
1. import time
2. import portalocker
3.
4. class VirusScanner:
5.
6. def __init__(self):
7. self.locks = {}
8. self.files_to_lock = []
9. self.files_to_remove_lock = []
10.
11. def _remove_lock(self, file_path):
12. portalocker.unlock(self.locks[file_path])
13. self.locks[file_path].close()
14. del self.locks[file_path]
15.
16. def _create_lock(self, file_path):
17. f = open(file_path, 'a+')
18. portalocker.lock(f, portalocker.LockFlags.EXCLUSIVE)
19. self.locks[file_path] = f
20.
21. def analyze_locks(self):
22. while True:
23. while self.files_to_lock:
24. self._create_lock(self.files_to_lock.pop())
25. while self.files_to_remove_lock:
26. self._remove_lock(self.files_to_remove_lock.pop())
27. time.sleep(1)
28.
29.
30. print(f'Finished creating and analyzing locks')
Code 6.41
We created additional methods _remove_lock and _create_lock mentioned
in Code 6.41, only removing or creating locks on given file path (line 16,
line 11) from the locks list. Method analyze_locks runs in infinity loop
(line 22) and keeps pulling file path (lines 23-24) to create locks and then,
we pull element from a list of file paths to remove unnecessary locks (lines
25-26).
In Code 6.41, we use three helper class attributes:
Locks: This is where we store a dictionary of all the file paths with
descriptors for which we created locks.
files_to_lock: is a list of file paths for which we must create locks.
files_to_remove_lock: has a list of the file paths for which we should
release lock.
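For illustration only, this is how the scanning thread could hand files over to the locking thread through these attributes (the file path is hypothetical):
1. scanner = VirusScanner()
2. # the analyzer thread queues a suspicious file for locking
3. scanner.files_to_lock.append('/var/tmp/suspicious.bin')
4. # later, once the file is cleared, schedule the lock for release
5. scanner.files_to_remove_lock.append('/var/tmp/suspicious.bin')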
There is a logical bug in Code 6.41. You can see that in line 17 we open the file descriptor without checking whether such a file path already exists in the collection of locked files, and then we put this new descriptor into the mentioned collection. To fix this issue, we have to modify that part of the code as shown in Code 6.42:
1. def _create_lock(self, file_path):
2. if file_path not in self.locks:
3. f = open(file_path, 'a+')
4. portalocker.lock(f, portalocker.LockFlags.EXCLUSIVE)
5. self.locks[file_path] = f
Code 6.42
In file systems world, working with exclusive locks is not straight forward
as we practiced in this sub chapter, Virus quarantine15. Python implements
locking API in a flexible way to make it more OS agnostic, for instance
implementing file locking in Linux system16 is not implemented 100% as
Kernel documentation suggests. Additionally, in Linux OS, user decision to
mount files system or which files system he decides to use, for instance
ext2, ext3 or ext4, or even nonstandard types like BtrFS makes a big
difference for file locking in Python.
So instead of creating exclusive locks, which do not really guarantee that
an infected file cannot be read by another process, we need to figure out
another way to stop viruses from cloning themselves in our beloved OS
while keeping them in quarantine.
In Code 6.45, we will use another concept for keeping files in quarantine:
encrypting the infected files. An encrypted file becomes useless to the OS,
so we can make sure it cannot execute any code and clone itself.
Before we get to the example code showing how to use encryption, we
must install a Python encryption module.
1. pip install pycryptodomex
Code 6.43
Once the cryptography package is installed, we can write a simple class
that will allow us to encrypt and decrypt files. For this, we first need a
script that creates the encryption key which will be used for encrypting files.
1. from Cryptodome.Random import get_random_bytes
2.
3. with open('mykey.pem','wb') as f:
4. f.write(get_random_bytes(16))
Code 6.44
We created an encryption key, mykey.pem, filled with random bytes; it
will be used in Code 6.45 to encrypt and decrypt files:
1. from Cryptodome import Random
2. from Cryptodome.Cipher import AES
3.
4. class FilesEncoder:
5.
6. def __init__(self):
7. with open('mykey.pem','rb') as f:
8. self.hashing_key = f.read()
9.
10. def __make_padding(self, s):
11. return s + b"\0" * (AES.block_size - len(s) % AES.block_size)
12.
13. def __encode(self, message):
14. message = self.__make_padding(message)
15. iv = Random.new().read(AES.block_size)
16. cipher = AES.new(self.hashing_key, AES.MODE_CBC, iv)
17. return iv + cipher.encrypt(message)
18.
19. def __decode(self, data):
20. iv = data[:AES.block_size]
21. cipher = AES.new(self.hashing_key, AES.MODE_CBC, iv)
22. plaintext = cipher.decrypt(data[AES.block_size:])
23. return plaintext.rstrip(b"\0")
24.
25. def encrypt_file(self, file_name):
26. with open(file_name, 'rb') as f:
27. data = f.read()
28.
29. encoded = self.__encode(data)
30. enc_filename = f"{file_name}.bin"
31. with open(enc_filename, 'wb') as fo:
32. fo.write(encoded)
33. print(f"Encrypted file {file_name} to {enc_filename}")
34.
35. def decrypt_file(self, file_name):
36. with open(file_name, 'rb') as f:
37. data = f.read()
38.
39. decoded = self.__decode(data)
40. output_file_name = file_name[:-4]
41. with open(output_file_name, 'wb') as f:
42. f.write(decoded)
43. print(f"Decrypted file {file_name} to {output_file_name}")
Code 6.45
This class allows us to both encrypt a given file (line 25) and reverse the
process (line 35). You can see that when we call the method encrypt_file,
the encryption algorithm AES is used17. This method needs the
encryption key (Code 6.44): based on it, the data gets encrypted. In this
case, we use a key that is 16 bytes long (Code 6.44, line 4), which selects
the AES-128 variant.
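If you want to confirm the sizes the class relies on, a quick check in the interpreter (with pycryptodomex installed as in Code 6.43):

from Cryptodome.Cipher import AES

print(AES.block_size)  # 16 -- CBC mode works on 16-byte blocks, hence the padding
print(AES.key_size)    # (16, 24, 32) -- valid key lengths; 16 bytes means AES-128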
Let us assume that we have a file, data.txt, which we would like to
encrypt. Check Code 6.46 to see how to do it with the FilesEncoder class.
1. enc = FilesEncoder()
2. enc.encrypt_file('data.txt')
Code 6.46
Code 6.46 will create a copy of the original file and save the encrypted
version as data.txt.bin. The body of such a file is going to look like this:
1. (fun1)  ~/work/fun-with-
python/chapter_6/ [main*] cat data.txt.bin

2.
3.  ͍18D%�Q���٬uI��ujz|��R‫�ۇ‬fX�������a�Zd�
4. DQMz@x���<���!8�-
E�D�R$��7��5Z��<I�blh�.ˉʴJ���f�-
���H� 0��\
2N:�:���H�\}M5.��ҕ�-~bF2��넩>V���*
5. �X�zҐ�]�/
�_k9���'k���:bxDS5ٓ���&m��p�3A�APZ9ćA$�0
}��
6.
�� ��P��w��u͓A�yal�V��i:^�mb
7.
@�r‫ݬ‬hg2_v�yl�����2UO��$'�#�/+>���R�~�c9
��8�N<Gå�uN�e��"
8.

�xR��%I�Q�� MRA���ђӑz���`�\����AGd<
�%���x7�YN�SP�C��#�!
�;m‫ע‬.�Yx*�K����֡��mXzhc�=�‫�إ‬K:
����!
���Y��Ҥ��z��:ħ{n�u�E��s&��uvg)n:T`�O�2
�H8xcpT
9. W��G?
<�Sq�B 38�a;�_�kS.�M�`bi�s;f��3���}�u
10. ��n»>���`�ߚ‫ݰ‬D�|R/^�q{z��^���VW���j9�#%
��l�B�F�o��m�@t�
11.
�+�5���oOe�»1��~v�~cvG��7�It����Ч�C»�
��
12. �[E
b���ʺ[!p��eb<[�#��NJR�[�o|Pbj��-
�� ,G�����[�s�,9.�›e�/�H|
��������]bpD��
13. ��h�s!\�!
ծ��ʊ�%cr�m1�� ��›mĠ�(�K‫؂‬7Q�I�����x%
As you can see, the file now contains useless scrambled data; it is not the
original virus anymore. In this state, the virus cannot be opened and
executed, so it cannot infect more files. Of course, we have a cure to
reverse the encryption process. Let us check Code 6.47:
1. enc = FilesEncoder()
2. enc.decrypt_file('data.txt.bin')
Code 6.47
Once this code is called, the encrypted file is restored to its original state
under the name data.txt.
Now, let us modify the base Code 6.33 so that instead of removing
suspicious files, we encrypt them and remove the original.
1. import click
2. import os
3. import magic
4. import sqlite3
5. import uuid
6. import zipfile
7. from encrypt_files import FilesEncoder
8. from hashlib import sha256
9. import shutil
10. DB_FILENAME = "virus.db"
11.
12.
13. class VirusScanner:
14.
15. def __init__(self):
16. self.conn = sqlite3.connect(DB_FILENAME)
17. self.encryptor = FilesEncoder()
18. self.locks = {}
19. self.files_to_lock = []
20. self.files_to_remove_lock = []
21.
22. def _execute(self, sql):
23. cursor = self.conn.cursor()
24. cursor.execute(sql)
25. return cursor.fetchall()
26.
27. def check_hash(self, hash_value) -> bool:
28. sql = f"SELECT * FROM virus_db WHERE virus_hash='{hash_value}' LIMIT 1"
29. cursor = self.conn.cursor()
30. cursor.execute(sql)
31. return True if cursor.fetchall() else False
32.
33. def is_virus(self, file_path, hash_value):
34. if not hash_value:
35. return
36. if self.check_hash(hash_value):
37. print(f"File: {file_path}, hash: {hash_value},
status: virus! removing...")
38. try:
39. self.encryptor.encrypt_file(file_path)
40. os.remove(file_path)
41. except OSError:
42. print("Seem like detected file can't be
remove at the moment, in use?")
43. else:
44. print(f"File: {file_path}, hash:
{hash_value}, status: [ok]")
45.
46. def scanner(self, file_path: str):
47. for (root, dirs, files) in os.walk(file_path, topdown=True):
48. for f in files:
49. yield os.path.join(root, f)
50.
51. def calculate_hash(self, file_path: str) -> str:
52. if not file_path:
53. return
54. try:
55. with open(file_path, "rb") as f:
56. file_hash = sha256()
57. chunk = f.read(8192)
58. while chunk:
59. file_hash.update(chunk)
60. chunk = f.read(8192)
61.
62. return file_hash.hexdigest()
63. except OSError:
64. print(f'File {file_path} can not be opened at the moment, skippi
ng')
65.
66. def analyze_zip(self, fpath):
67. extract_dir = "/tmp/{tmp_id}/".format(tmp_id=str(uuid.uuid4()))
68. with zipfile.ZipFile(fpath, 'r') as zip_ref:
69. zip_ref.extractall(extract_dir)
70. self.analyze(extract_dir)
71. shutil.rmtree(extract_dir)
72.
73. def analyze(self, fpath):
74. for file_path in self.scanner(fpath):
75. try:
76. hash_value = self.calculate_hash(file_path)
77. self.is_virus(file_path, hash_value)
78. if 'zip' in magic.from_file(file_path).lower():
79. self.analyze_zip(file_path)
80. except OSError:
81. print(f'File {file_path} can not be opened at the moment, skip
ping')
82.
83.
84. @click.command()
85. @click.option("--fpath", help="Path to start scanning", required=True)
86. def main(fpath):
87. v = VirusScanner()
88. v.analyze(fpath)
89.
90.
91. if __name__ == '__main__':
92. main()
Code 6.48
Parallel processing
So far, we have managed to build powerful tools that analyze the file
system and mark files that look suspicious or are corrupted by a virus.
Walking the file system linearly is effective when we analyze a single
main folder, but when we want to analyze an entire file system with
hundreds of thousands of files spread over thousands of directories, this
method is going to be very slow.
To address this kind of problem, we are going to update our previous file
system scanning example and introduce a more effective way of
processing directories: parallel programming. We could use threads or
processes here, although we will use the same approach that we already
learned in previous chapters – asynchronous programming. File system
operations are a very good example of where we can use this technique.
First, we must install a few Python modules, as in the following example.
1. $ pip install asyncio_pool aiofiles
Code 6.49
After installing the modules, we will build a simple example
demonstrating how to scan directories in parallel. Let us check the
following example.
1. import os
2. import asyncio
3. from aiofiles import os as asyncio_os
4.
5.
6. async def async_scan_dir(dir_path):
7.
8. dirs = []
9. dir_list = await asyncio_os.listdir(dir_path)
10. for check_path in dir_list:
11. v_path = os.path.join(dir_path, check_path)
12. is_dir = await asyncio_os.path.isdir(v_path)
13. if is_dir:
14. dirs += await async_scan_dir(v_path)
15. else:
16. dirs.append(v_path)
17.
18.
19. return dirs
20. async def get_result(dir_path="/tmp"):
21. result = await async_scan_dir(dir_path)
22.
23.
24. print(f"result: {result}")
25. asyncio.run(get_result())
Code 6.50
When we run example Code 6.50, we get as a result a list of all the files
found in the /tmp directory. As is easy to notice, we used the async file
library18 that helps with parallel scanning, and we call the function
async_scan_dir (Code 6.50, line 6) recursively to build the final list of
files picked up from the given directory.
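Note that the recursive awaits in Code 6.50 still run one after another. As an optional variant (not part of the book's listing), asyncio.gather can fan out over all subdirectories concurrently; a minimal sketch:

import os
import asyncio
from aiofiles import os as asyncio_os


async def async_scan_dir(dir_path):
    files, subdirs = [], []
    for entry in await asyncio_os.listdir(dir_path):
        v_path = os.path.join(dir_path, entry)
        if await asyncio_os.path.isdir(v_path):
            subdirs.append(v_path)
        else:
            files.append(v_path)
    # scan all subdirectories at the same time instead of one by one
    for sub_result in await asyncio.gather(*(async_scan_dir(d) for d in subdirs)):
        files += sub_result
    return files


print(asyncio.run(async_scan_dir("/tmp")))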
Let us try to refactor example Code 6.48 to make it work in the async
world. Check the following example to see how to approach this.
1. import asyncio
2. import click
3. import os
4. import magic
5. import sqlite3
6. import uuid
7. import zipfile
8. from encrypt_files import FilesEncoder
9. from aiofiles import os as asyncio_os
10. from hashlib import sha256
11. import shutil
12. DB_FILENAME = "virus.db"
13.
14.
15. class VirusScanner:
16.
17. def __init__(self):
18. self.conn = sqlite3.connect(DB_FILENAME)
19. self.encryptor = FilesEncoder()
20. self.locks = {}
21. self.files_to_lock = []
22. self.files_to_remove_lock = []
23.
24. def _execute(self, sql):
25. cursor = self.conn.cursor()
26. cursor.execute(sql)
27. return cursor.fetchall()
28.
29. def check_hash(self, hash_value) -> bool:
30. sql = f"SELECT * FROM virus_db WHERE virus_hash='{hash_value}' LIMIT 1"
31. cursor = self.conn.cursor()
32. cursor.execute(sql)
33. return True if cursor.fetchall() else False
34.
35. def is_virus(self, file_path, hash_value):
36. if not hash_value:
37. return
38. if self.check_hash(hash_value):
39. print(f"File: {file_path}, hash: {hash_value},
status: virus! removing...")
40. try:
41. self.encryptor.encrypt_file(file_path)
42. os.remove(file_path)
43. except OSError:
44. print("Seem like detected file can't be
remove at the moment, in use?")
45. else:
46. print(f"File: {file_path}, hash: {hash_value},
status: [ok]")
47.
48. def calculate_hash(self, file_path: str) -> str:
49. if not file_path:
50. return
51. try:
52. with open(file_path, "rb") as f:
53. file_hash = sha256()
54. chunk = f.read(8192)
55. while chunk:
56. file_hash.update(chunk)
57. chunk = f.read(8192)
58.
59. return file_hash.hexdigest()
60. except OSError:
61. print(f'File {file_path} can not be opened at
the moment, skipping')
62.
63. async def analyze_zip(self, fpath):
64. extract_dir = "/tmp/{tmp_id}/".format(tmp_id=str(uuid.uuid4()))
65. with zipfile.ZipFile(fpath, 'r') as zip_ref:
66. zip_ref.extractall(extract_dir)
67. await self.analyze(extract_dir)
68. shutil.rmtree(extract_dir)
69.
70. async def async_scan_dir(self, dir_path):
71. dirs = []
72. dir_list = await asyncio_os.listdir(dir_path)
73. for check_path in dir_list:
74. v_path = os.path.join(dir_path, check_path)
75. is_dir = await asyncio_os.path.isdir(v_path)
76. if is_dir:
77. dirs += await self.async_scan_dir(v_path)
78. else:
79. dirs.append(v_path)
80. return dirs
81.
82. async def analyze(self, fpath):
83. for file_path in await self.async_scan_dir(fpath):
84. try:
85. hash_value = self.calculate_hash(file_path)
86. self.is_virus(file_path, hash_value)
87. if 'zip' in magic.from_file(file_path).lower():
88. await self.analyze_zip(file_path)
89. except OSError:
90. print(f'File {file_path} can not be opened at the moment, skip
ping')
91.
92.
93. @click.command()
94. @click.option("--fpath", help="Path to start scanning", required=True)
95. def main(fpath):
96. v = VirusScanner()
97. asyncio.run(v.analyze(fpath))
98.
99.
100. if __name__ == '__main__':
101. main()
Code 6.51
We refactored the parts of the virus scanner that deal with the OS and the
file system (lines 70-80 and 63-68) so that they work as async operations.
This will help make the whole scanner more efficient.
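One part that still performs blocking I/O is calculate_hash, which reads files with the regular open. As an optional refinement (not part of the book's listing), the same aiofiles package18 also offers an async open, so the hashing could be awaited as well; a minimal sketch:

import aiofiles
from hashlib import sha256


async def async_calculate_hash(file_path: str) -> str:
    # same chunked hashing as calculate_hash, but with non-blocking reads
    file_hash = sha256()
    async with aiofiles.open(file_path, "rb") as f:
        chunk = await f.read(8192)
        while chunk:
            file_hash.update(chunk)
            chunk = await f.read(8192)
    return file_hash.hexdigest()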
Conclusion
A demonstration of how Python can be used for analyzing file system while
hunting for viruses is shown in this chapter. It is not a difficult technique to
encrypt founded infected files and keep it locally. Next, we can send it to
online services where advanced technology can analyze given files and try
to cure them.
Real viruses are obviously more complex than flat files: they can hide
inside infected files, replicate themselves in memory, and use lots of other
sophisticated techniques. Malicious software is always one step ahead of
the people writing antivirus software, so it is better to run a local antivirus
scanner frequently and be careful about the links you click.
In the next chapter, we will learn how to use Python with crypto coins:
how to analyze cryptocurrency exchange markets, and where Python can
help us with crypto wallets.
1. https://click.palletsprojects.com/en/8.1.x/
2. https://docs.python.org/3/library/hashlib.html
3. https://datatracker.ietf.org/doc/html/rfc1321
4. https://www.thesslstore.com/blog/difference-sha-1-sha-2-sha-256-
hash-algorithms/
5. https://eprint.iacr.org/2004/199.pdf
6. https://eprint.iacr.org/2011/037.pdf
7. https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Blocks
8. https://en.wikipedia.org/wiki/Computer_virus
9. https://github.com/Len-Stevens/MD5-Malware-Hashes
10. https://www.sqlite.org/index.html
11. https://docs.python.org/3/library/sqlite3.html
12. https://beta.virusbay.io/sample/browse
13. https://pypi.org/project/python-magic/
14. https://docs.python.org/3/library/fcntl.html
15. http://0pointer.de/blog/projects/locking.html
16. https://www.gnu.org/software/libc/manual/html_node/File-
Locks.html
17.
https://web.archive.org/web/20160306104007/http://research.microsoft.
com/en-us/projects/cryptanalysis/aesbc.pdf
18. https://pypi.org/project/aiofiles/
CHAPTER 7
Create Your Own Crypto Trading
Platform
Introduction
Crypto currencies have been on the market for quite a while, so it is no
secret that, after getting lots of hype, they have become a standard
payment platform in the digital world. Their exchange rates move up and
down very rapidly, so working with crypto assets can be fun and a bit of a
challenge at the same time. In this chapter, we will learn how to utilize
Python as a tool for crypto currencies.
Structure
In this chapter, we will discuss the following topics:
Brief introduction to crypto market
Building client for crypto market
Trends analyzer
Integrating with crypto wallet
Purchase and sell
Objectives
After reading this chapter, you should know how to build your own crypto
market trading platform client, manage your crypto assets and use Python to
build simple yet powerful money exchange applications.
Brief introduction to crypto market
There has been noticeable hype around crypto currencies and blockchains
in the last couple of years. Many of us have tried to buy and sell crypto
currencies. With so many crypto web applications around, it is very
popular to have a crypto wallet and track how currencies go up and down
on the exchange over time. In this chapter, we will go one level lower and
learn about crypto currencies, wallets, and how to send assets from one
wallet to another. We will also learn some basics of analyzing moving
trends for exchange rates.
It is worth reading a bit more about crypto currencies1 to understand what
we are dealing with here. It is also essential to know why we are using
Python to work with crypto in this chapter, and how much fun it is to work
with cryptography (because that is what crypto currencies and wallets
are) with our beloved Python.
Currencies
Before we can build any trading platform client, we need to collect all the
crypto currency codes that will be analyzed and tracked. We have a few
options for getting these codes:
We can add them manually to our application, but this is going to be a
very time-consuming process if we want to follow many currencies.
The other option, which we will use in the following example, is to
fetch the currency codes from an existing crypto trading website.
In the following example, we will use coinmarketcap.com.
1. import json
2. import re
3. import requests
4. from pprint import pprint
5.
6. URL = "https://coinmarketcap.com/all/views/all/"
7. JSON_DATA = re.compile(r'<script\s+id="__NEXT_DATA__"\s+type=
"application/json">(.*?)</script>')
8.
9.
10. def main():
11. raw_data = requests.get(URL).text
12. data = json.loads(JSON_DATA.findall(raw_data).pop())
13. result = {}
14. for item in json.loads(data['props']['initialState'])['cryptocurrency']
['listingLatest']['data']:
15. try:
16. result[item[30]] = item[10]
17. except (KeyError, IndexError):
18. pass
19. return result
20.
21. if __name__ == '__main__':
22. pprint(main())
Code 7.1
In this script, we use a trick: we found that the body of the
coinmarketcap.com page (line 6) contains a JavaScript section that
defines all the crypto currencies supported by the website. Following this
logic, we extract that JavaScript part (line 12) and then pull, out of the
resulting Python dictionary, the part that maps the crypto currency codes
to their corresponding names. The output of running our script is going to
look like the following code:
1. python get_crypto_codes.py
2.
3. # output
4. {'1INCH': '1inch Network',
5. 'AAVE': 'Aave',
6. 'ACH': 'Alchemy Pay',
7. 'ADA': 'Cardano',
8. 'AGIX': 'SingularityNET',
9. ...
Code 7.2
We now have all the popular crypto currencies that are on the market.
However, we cannot just print them on the screen like in example Code
7.1; we need to store them in a database. In this case, we will use a
well-known database, SQLite2, that we have already been using in
previous chapters.
Follow these steps to generate the connection to the database and
initialize its content:
1. First, let us create a script that will create the database structure. In the
following example, we also use a tool that we have already utilized
many times in previous chapters, called Click3.
2. Before we get to the main script, let us create a universal class for
managing the database and related queries. Follow this example to create
a file db.py.
1. import sqlite3
2.
3. DB_FILENAME = "crypto.db"
4.
5.
6. class DB:
7.
8. def __init__(self):
9. self.conn = sqlite3.connect(DB_FILENAME)
10.
11. def execute(self, sql):
12. print(f"Executing: {sql}")
13. cursor = self.conn.cursor()
14. cursor.execute(sql)
15. return cursor.fetchall()
16.
17. def commit(self, sql):
18. print(f"Insert/update: {sql}")
19. cursor = self.conn.cursor()
20. cursor.execute(sql)
21. return self.conn.commit()
22.
23. def init_table(self, table_name):
24. with open(f"{table_name}.sql") as f:
25. print(self.execute(f.read()))
Code 7.3
3. After creating the database driver class, we can create the SQL file
currency.sql that is going to create the table. Here, we will store all the
crypto currencies that we can extract from the mentioned website.
1. CREATE TABLE IF NOT EXISTS currency (
2. id INTEGER PRIMARY KEY AUTOINCREMENT,
3. currency_code TEXT UNIQUE,
4. currency_name TEXT,
5. created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
6. );
Code 7.4
4. Now that we have the database structure, we need a script that will
create it for us. For this, we will use the well-known click library. The
following example shows how we will approach it:
1. import click
2. from db import DB
3.
4. @click.command()
5. @click.option("--
table", help="Table type", required=True, type=click.Choice(['curren
cy']))
6. def main(table):
7. db = DB()
8. db.init_table(table)
9.
10. if __name__ == '__main__':
11. main()
Code 7.5
5. Once we have the table and the script to create the data storage, we
should modify the code from example Code 7.1 to store its results in our
database table.
6. Check the following example, Code 7.6, to see how the modified
version of Code 7.1 uses the database storage.
1. import json
2. import re
3. import requests
4. from db import DB
5.
6. URL = "https://coinmarketcap.com/all/views/all/"
7. JSON_DATA = re.compile(r'<script\s+id="__NEXT_DATA__"\s+ty
pe="application/json">(.*?)</script>')
8.
9.
10. def main():
11. db = DB()
12. raw_data = requests.get(URL).text
13. data = json.loads(JSON_DATA.findall(raw_data).pop())
14. result = {}
15. i = 0
16. for item in json.loads(data['props']['initialState'])['cryptocurrency']
['listingLatest']['data']:
17. try:
18. result[item[30]] = item[10]
19. sql = f"""INSERT INTO currency(currency_code, currency_
name) VALUES ('{item[30]}', '{item[10]}');"""
20. db.commit(sql)
21. i += 1
22. except (KeyError, IndexError):
23. pass
24. return i
25.
26. if __name__ == '__main__':
27. no_items = main()
28. print(f"Inserted {no_items} items")
Code 7.6
7. Executing the above example Code 7.6 should give us a result like the
following, with about 200 records in the DB.
1. $ python code_7.6.py
2.
3. ## result
4.
5. ...
6. Insert/update: INSERT INTO currency(currency_code, currency_na
me) VALUES ('ZEN', 'Horizen');
7. Insert/update: INSERT INTO currency(currency_code, currency_na
me) VALUES ('BTRST', 'Braintrust');
8. Insert/update: INSERT INTO currency(currency_code, currency_na
me) VALUES ('TRAC', 'OriginTrail');
9. Insert/update: INSERT INTO currency(currency_code, currency_na
me) VALUES ('RBN', 'Ribbon Finance');
10. Insert/update: INSERT INTO currency(currency_code, currency_na
me) VALUES ('HFT', 'Hashflow');
11. Insert/update: INSERT INTO currency(currency_code, currency_na
me) VALUES ('METIS', 'MetisDAO');
12. Insert/update: INSERT INTO currency(currency_code, currency_na
me) VALUES ('JOE', 'JOE');
13. Insert/update: INSERT INTO currency(currency_code, currency_na
me) VALUES ('AXL', 'Axelar');
14. Inserted 200 items
Code 7.7
8. We inserted 200 currency codes with their corresponding names. What
is going to happen if we run Code 7.6 once again? Let us see in the
following example.
1. $ python code_7.6.py
2.
3.
4. Insert/update: INSERT INTO currency(currency_code, currency_na
me) VALUES ('BTC', 'Bitcoin');
5. Traceback (most recent call last):
6. File "code_7.6.py", line 27, in <module>
7. no_items = main()
8. File "code_7.6.py", line 20, in main
9. db.commit(sql)
10. File "/Users/hubertpiotrowski/work/fun-with-
python/chapter_7/db.py", line 20, in commit
11. cursor.execute(sql)
12. sqlite3.IntegrityError: UNIQUE constraint failed: currency.currency
_code
Code 7.8
9. We crashed. This is because in the currency table we created a unique
constraint on the column currency_code that prevents us from inserting
the same code multiple times (Code 7.4). Unfortunately, we did not catch
the exception in our Code 7.6 (line 20). Hence, when we try to insert the
same code twice, the database driver raises an integrity exception, which
we should catch and continue from (Code 7.8, line 12). Let us modify our
example Code 7.6 with proper exception catching.
1. def main():
2. db = DB()
3. raw_data = requests.get(URL).text
4. data = json.loads(JSON_DATA.findall(raw_data).pop())
5. result = {}
6. i = 0
7. for item in json.loads(data['props']['initialState'])['cryptocurrency']
['listingLatest']['data']:
8. try:
9. result[item[30]] = item[10]
10. currency_code = item[30].strip().upper()
11. currency_name = item[10].strip()
12. sql = f"""INSERT INTO currency(currency_code, currency_
name) VALUES ('{currency_code}', '{currency_name}');"""
13. try:
14. db.commit(sql)
15. except sqlite3.IntegrityError:
16. print(f"Currency {currency_code} already exists, skipping
...")
17. except Exception as e:
18. print("Error: ", e)
19. i += 1
20. except (KeyError, IndexError):
21. pass
22. return i
Code 7.9
In example Code 7.9, we modified the main method in such a way that we
can now rerun the import script as many times as we have to, without
crashing either because a currency already exists (line 15) or because of
any other reason (line 17). Note that this version references
sqlite3.IntegrityError, so import sqlite3 must be added at the top of the script.
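As a side note, SQLite also offers INSERT OR IGNORE, which silently skips rows that would violate a unique constraint; a sketch of an alternative (not used in the book's code) that could replace the inner try/except inside main() from Code 7.9:

# inside main() from Code 7.9, instead of catching sqlite3.IntegrityError
sql = (f"INSERT OR IGNORE INTO currency(currency_code, currency_name) "
       f"VALUES ('{currency_code}', '{currency_name}');")
db.commit(sql)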
Building client for crypto market
This topic, as you may expect, will touch upon a few aspects of crypto
wallets. We will try to understand, with Python of course, what a crypto
wallet is and how it works in detail. Next, we will use existing wallets and
try to emulate a wallet money flow.
Before we start building any code, we should understand a bit about
crypto wallets. In a nutshell, we can say that a cryptocurrency wallet
stores the user's public and private key. What are private and public keys?
They come from cryptography4; this concept is presented in Figure 7.1.
All crypto coins are stored in a blockchain, which in its essence is a
distributed database. A crypto wallet, with its private key, allows you to
get access to those encrypted records (crypto coins), check the balance,
send coins to another party, and so on.
Figure 7.1: Private and Public key
In Figure 7.1, we can see how a message can be encrypted and how the
encrypted message can be verified. We need the public and private key
pair to be able to work with this kind of data. Your public key, together
with your wallet address, allows you to become the receiving party when
someone sends you crypto coins. The public key, as the name suggests, is
public, i.e., available to anyone, so that the other side of the transaction
(as shown in Figure 7.1) can verify your transfers. We need the private
key to access our encrypted wallet assets; by accessing the wallet, you get
access to your beloved crypto coins. In conclusion, we can say that a
private key is like the PIN on a bank card.
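To make the idea from Figure 7.1 concrete, here is a small illustration (not from the book's code) using the eth_account package that is installed together with web3; the message text is purely an example. The private key signs a message, and anyone can verify from the signature that it came from the matching public address:

from eth_account import Account
from eth_account.messages import encode_defunct

acc = Account.create()                 # key pair: acc.key is private, acc.address is public
message = encode_defunct(text="send 1 ETH to Alice")
signed = Account.sign_message(message, acc.key)

# the receiving party only needs the signature to recover the signer's address
recovered = Account.recover_message(message, signature=signed.signature)
print(recovered == acc.address)        # True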
It is worth mentioning the types of wallets here. We have two main types
of wallets: hot and cold.
A hot wallet is an app or website that is part of the blockchain exchange
24/7 and is accessible to you at any point in time from anywhere in the
world. Unfortunately, along with this convenience come security risks.
Through a publicly accessible wallet, even though it is not wide open
(remember, we need the private key and sometimes a PIN or SMS
verification), you personally can become a target for malicious hackers.
Using social engineering, hackers can get hold of your private key or gain
access to your web wallet, which can let them take full control over your
valuable crypto assets.
To be more secure with crypto wallets, we have the option of cold
wallets. By cold wallet, we are referring to all those hardware wallets that
you can purchase. Search for the phrase hardware crypto wallet in your
favorite browser and you will find plenty of options to purchase such a
wallet. A point that many mention as a disadvantage of cold wallets is
that they are physical devices; hence, once lost, you lose all your assets.
To summarize: the whole essence of any kind of wallet is the fact that you
are responsible for making sure it does not get stolen.
In this chapter, we will build a proof of concept of a software wallet to
demonstrate how an actual web or mobile wallet app really works. We
will do so using Python, so let us get started:
Ethereum, which we will reference in this exercise, like Bitcoin, uses a
cryptography algorithm called elliptic curve5. We could start by building
a script that uses the cryptography and calculates the private and public
keys step by step, but let us jump one step ahead and use a Python
cryptography library that will do the calculations for us. To be able to use
the power of a Python library that solves these very complex
cryptographic tasks, we shall install a few Python components, as
mentioned in the following steps:
We need to install a library called web36, as shown in the following example:
1. pip install web3==6.0.0
Code 7.10
It is worth noticing that installing the web3 package will, as one of its
sub-dependencies, install the Python eth library. This eth library has its
own dependency on a crypto package that some developers do not trust
because of a few CVE7 reports in the past. To address that concern, we
will use pycryptodome, which is a fork8 of pycrypto with some
improvements and security enhancements. To use this package, we must
use our public fork that has been prepared for this exercise9. We will
install the packages by creating requirements.txt with the content shown
in the following example.
1. pycryptodome==3.17
2. -e git+https://github.com/darkman66/[email protected]#egg=eth-
keyfile
3. web3==6.0.0
Code 7.11
Once we have the file with the required packages prepared, we can install
them using the standard pip command, as in the following example.
1. $ pip install -r requirements.txt
Code 7.12
As has been mentioned before, ETH keys use elliptic curve algorithms.
Let us try to create the public key of our local wallet, as shown in the
following example code.
1. from web3 import Web3
2.
3. w3 = Web3()
4. acc = w3.eth.account.create()
5. print(f"Public address of wallet: {acc.address}")
Code 7.13
Once we have the public address, we can create the private key as well
and keep it in a safe place for future use. Let us start with a simple private
key without any kind of protection.
1. from web3 import Web3
2.
3. w3 = Web3()
4. acc = w3.eth.account.create()
5. print(f"Public address of wallet: {acc.address}")
6. print(f"Private key: {acc.key.hex()}")
Code 7.14
As mentioned, when we have the private key, we should be able to open
the wallet and gain access to it. Notice that in Code 7.14 we created a
private key, which we may now use to recover the wallet, like in the
following example:
1. from web3 import Web3
2.
3. w3 = Web3()
4. acc = w3.eth.account.create()
5. private_key = acc.key.hex()
6.
7. recovered_account = w3.eth.account.from_key(private_key)
8. is_same = recovered_account.address == acc.address
9. print(f"Is address the same: {is_same}")
Code 7.15
Trends analyzer
Now that we have gone through the basics of storing and securing crypto
currencies, it is time to see where and how we can exchange crypto
currencies into a standard currency, for instance USD. Beyond having a
platform for the exchange, it is important to know how much our crypto
assets are worth and when the right time to exchange them is.
Let us start with understanding where and how to fetch the latest
exchange rates. However, before we start, we need to update our database
structure. In the following code, we create a currency exchange table that
stores the current exchange rates.
1. CREATE TABLE IF NOT EXISTS currency_exchange (
2. id INTEGER PRIMARY KEY AUTOINCREMENT,
3. currency_code TEXT UNIQUE,
4. last_price FLOAT,
5. created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
6. updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
7. );
Code 7.16
We now have a table where we are going to store the currently fetched
exchange rates. Next, we need a similar table to keep historical exchange
rate values, so we can compare them later. The mentioned table is going
to be generated like in the following example:
1. CREATE TABLE IF NOT EXISTS currency_exchange_history (
2. id INTEGER PRIMARY KEY AUTOINCREMENT,
3. currency_code TEXT,
4. last_price FLOAT,
5. created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
6. updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
7. CONSTRAINT constraint_name UNIQUE (currency_code, created_a
t)
8. );
Code 7.17
To apply the table definitions from Code 7.16 and Code 7.17 (saved as
currency_exchange.sql and currency_exchange_history.sql, since
init_table from Code 7.3 loads the file named after the table), we need to
modify Code 7.5 to make sure that all the database tables can be created.
1. import click
2. from db import DB
3.
4. @click.command()
5. @click.option("--
table", help="Table type", required=True, type=click.Choice(['currency',
'currency_exchange', 'currency_exchange_history']))
6. def main(table):
7. db = DB()
8. db.init_table(table)
9.
10. if __name__ == '__main__':
11. main()
Code 7.18
Let us run Code 7.18 to create the required tables, like in the following
example. Running Code 7.19 will create the tables in our main database,
crypto.db.
1. python code_7.18.py --table currency_exchange
2. python code_7.18.py --table currency_exchange_history
Code 7.19
We now have the main table and its historical copy. It is time to fetch
some data and fill the main table with real-time exchange rates. We have
many options for a source of crypto market exchange rates; however,
instead of building an unreliable screen scraper to fetch data from popular
websites, we can try a different approach using an API.
Before we dive into an example of using an API as the source of truth, I
want to highlight the fact that I am not recommending this example API
service, Live Coin Watch10, because it is the best on the market. Please
keep in mind that it is just an example, so if you want to use any other
data provider, you can choose another API and replace the one shown in
the following examples.
1. Let us start by registering at https://www.livecoinwatch.com.
2. There is a tab called API – please go there and create API key.
3. Next, we have to be aware of some limitations of this service account:
it is free of charge, although you can only perform a limited number of
requests per month.
Let us try to send our first request to the API service to check how many
credits we have left, like in the following example, Code 7.20. Before
moving ahead, please remember that it is necessary to have the requests11
package installed, like in the previous examples in the subsection on currencies.
1. import os
2. import requests
3. from pprint import pprint
4.
5. API_KEY = os.environ.get('API_KEY')
6. assert API_KEY, "variable API_KEY not specified"
7. URL = "https://api.livecoinwatch.com/credits"
8.
9. response = requests.post(URL, headers={"x-api-
key": API_KEY, "content-type": "application/json"}).json()
10.
11. print("Credit status:")
12. pprint(response)
Code 7.20
In Code 7.20, we used a very useful trick to read the API access key.
Instead of hardcoding the access token into our code, we read it from an
environment variable (line 5). When the developer executing our code
wants to use their own access token, they need to specify it as a runtime
environment variable, like in the following example.
1. API_KEY=111-11111-my-foo-acceess-key python code_7.20.py
Code 7.21
This is a very clean way, besides configuration files, to specify secrets in
a simple manner, in this case the API access token. When the developer
does not specify any key, we raise an exception (line 6). When everything
is correctly specified, you should see output like in the example below.
1. Credit status:
2. {'dailyCreditsLimit': 10000, 'dailyCreditsRemaining': 10000}
Code 7.22
So far, we have introduced the concept of how and from where to get
crypto exchange rates. Now, in the following example, we are going to
use our well-known click module to help us build scripts for fetching the
exchange rate data.
First, we should create a Livecoin client that will fetch data from the API
and save the results in our newly created DB.
1. import click
2. import os
3. import requests
4. from db import DB
5.
6.
7. class LiveCoinClient:
8.
9. def __init__(self):
10. self.__api_token = os.environ.get('API_KEY')
11. assert self.__api_token, "variable API_KEY not specified"
12. self._db = DB()
13.
14. def __fetch_data(self, url):
15. return requests.post(url, headers={"x-api-
key": self.__api_token, "content-type": "application/json"}).json()
16.
17. def __post_data(self, url):
18. data = {
19. "currency": "USD",
20. "sort": "rank",
21. "order": "ascending",
22. "offset": 0,
23. "limit": 500,
24. "meta": True
25. }
26. return requests.post(url, headers={"x-api-
key": self.__api_token, "content-
type": "application/json"}, json=data).json()
27.
28. def fetch_and_update_coins(self):
29. click.echo("Starting fetching livecoin updates")
30. url = 'https://api.livecoinwatch.com/coins/list'
31. data = self.__post_data(url)
32. data_to_refresh = {item['code']: item['rate'] for item in data}
33. self.refresh(data_to_refresh)
34.
35. def refresh(self, data):
36. for currency_code, currency_value in data.items():
37. click.echo(f"Updating coing: {currency_code}")
38. self.update_currency_exchange(currency_code, currency_value)
39.
40. def update_currency_exchange(self, currency_code, currency_value):
41. sql = f"SELECT * FROM currency_exchange WHERE currency_c
ode='{currency_code}'"
42. result = self._db.execute(sql)
43. if result:
44. result = result.pop()
45. self.save_history(result['currency_code'], result['last_price'])
46. sql = f"UPDATE currency_exchange SET last_price='{currency
_value}', updated_at=now() WHERE currency_code='{currency_code}
'"
47. self._db.commit(sql)
48. else:
49. sql = f"INSERT INTO currency_exchange (last_price, currency_
code) VALUES ('{currency_value}', '{currency_code}')"
50. self._db.commit(sql)
51.
52. def save_history(self, currency_code, currency_value):
53. sql = f"""INSERT INTO currency_exchange_history (currency_co
de, last_price) VALUES ('{currency_code}', '{currency_value}')"""
54. self._db.commit(sql)
Code 7.23
In Code 7.23, we created a generic class that fetches crypto exchange
rates from the third-party portal. Once the data is fetched, we save the
current values in the main table currency_exchange. The data that was
previously saved in that table gets pushed to the table
currency_exchange_history, where we keep the historical exchange rate
values.
Through this mechanism, we can keep the present data as well as past
values, which we will use to calculate and predict whether we should buy
or sell our crypto assets. To use the above functionality, we need a script
that uses our class, like in the following example.
1. import click
2. from live_coin_client import LiveCoinClient
3.
4. def main():
5. click.echo("Starting import")
6. l = LiveCoinClient()
7. l.fetch_and_update_coins()
8.
9. if __name__ == '__main__':
10. main()
Code 7.24
To execute the script, we should run the following example, Code 7.25,
with the API token set, so we can fetch all currencies.
1. API_KEY=<your api key> python update_currency_exchange.py
Code 7.25
After executing the script, we should have inserted 500 currency records
into the main table. To validate it, you can execute the following:
1. sqlite3 crypto.db
2.
3. sqlite> SELECT count(*) from currency_exchange;
4. 500
Code 7.26
When we execute line 1, we open the SQLite CLI shell, where we can
execute SQL commands directly against our crypto.db database. The
reason why we only fetch 500 records is that we have hardcoded the page
size in Code 7.23, line 23. If we want to get values for more coins, we
need to modify the __post_data method (Code 7.23, lines 17-26) in such a
way that it can fetch more data when more is available.
For instance, as an exercise, we can work with such a concept in the
following example, where we will use well-known recursion.
1. def __post_data(self, url, page_limit=500, page_offset=0):
2. click.echo(f"Page limit: {page_limit}, offset: {page_offset}")
3. data = {
4. "currency": "USD",
5. "sort": "rank",
6. "order": "ascending",
7. "offset": page_offset,
8. "limit": page_limit,
9. "meta": True
10. }
11. data = requests.post(url, headers={"x-api-
key": self.__api_token, "content-
type": "application/json"}, json=data).json()
12. if data:
13. more_data = self.__post_data(url, page_limit=page_limit, page_offset=(page_offset + page_limit))
14. if more_data:
15. data += more_data
16. return data
Code 7.27
This little change will help us get all the possible coins and related currency
exchange from API. Each call of __post_data method we make, we increase
the page number by 1 (line 13) and next, we call the same method
__post_data. We keep increasing page number and continue calling with
recurrency same method __post_data (Code 7.27, line 13) as long as new
data can be still fetched. Once the entire data is received, we keep processing
coin updates as usual.
Let us run the same script once again. This should lead to the current
exchange rate values being copied to the historical table, while the newly
downloaded values are saved in the present data table. Unfortunately, as
you can see in the following code example, this is not the case.
1. Executing: SELECT * FROM currency_exchange WHERE currency_c
ode='BTC'
2. Traceback (most recent call last):
3. File "update_currency_exchange.py", line 9, in <module>
4. main()
5. File "update_currency_exchange.py", line 6, in main
6. l.fetch_and_update_coins()
7. File "/Users/hubertpiotrowski/work/fun-with-
python/chapter_7/live_coin_client.py", line 34, in fetch_and_update_coi
ns
8. self.refresh(data_to_refresh)
9. File "/Users/hubertpiotrowski/work/fun-with-
python/chapter_7/live_coin_client.py", line 39, in refresh
10. self.update_currency_exchange(currency_code, currency_value)
11. File "/Users/hubertpiotrowski/work/fun-with-
python/chapter_7/live_coin_client.py", line 47, in update_currency_exc
hange
12. self.save_history(result['currency_code'], result['last_price'])
13. TypeError: tuple indices must be integers or slices, not str
Code 7.28
This error happens because the native Python SQLite driver returns
tuples for responses. Each DB record is a tuple instead of a dictionary,
and a dictionary is what we incorrectly assumed in our code. The fix for
this problem is simple, as illustrated in the following example code.
1. def __init__(self):
2. click.echo(f"Database: {DB_FILENAME}")
3. self.conn = sqlite3.connect(DB_FILENAME)
4. self.conn.row_factory = sqlite3.Row
Code 7.29
We need to apply these changes in the constructor of the database class
from Code 7.3. With this approach, all cursors in database queries are
going to return sqlite3.Row objects, which can be accessed by column
name like dictionaries, instead of plain tuples.
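A tiny self-contained illustration of what the row factory changes (the in-memory table is just for demonstration):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE currency_exchange (currency_code TEXT, last_price FLOAT)")
conn.execute("INSERT INTO currency_exchange VALUES ('BTC', 42000.0)")

row = conn.execute("SELECT * FROM currency_exchange").fetchone()
print(row["currency_code"], row["last_price"])  # access by column name now works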
Before we dive any deeper into the topic of analyzing the fetched data,
we can make an improvement in our updater script (Code 7.23). In the
following example, we will add support for fetching exchange rate data
for a given time range.
1. import click
2. import os
3. import requests
4. from datetime import datetime, timedelta
5. from db import DB
6.
7.
8. class LiveCoinClient:
9.
10. def __init__(self):
11. self.__api_token = os.environ.get('API_KEY')
12. assert self.__api_token, "variable API_KEY not specified"
13. self._db = DB()
14.
15. def __fetch_data(self, url):
16. return requests.post(url, headers={"x-api-
key": self.__api_token, "content-type": "application/json"}).json()
17.
18. def __post_data(self, url, page_limit=500, page_offset=0):
19. click.echo(f"Page limit: {page_limit}, offset: {page_offset}")
20. data = {
21. "currency": "USD",
22. "sort": "rank",
23. "order": "ascending",
24. "offset": page_offset,
25. "limit": page_limit,
26. "meta": True
27. }
28. data = requests.post(url, headers={"x-api-
key": self.__api_token, "content-
type": "application/json"}, json=data).json()
29. if data:
30. more_data = self.__post_data(url, page_limit=page_limit, page_
offset=((page_offset+1)+page_limit))
31. if more_data:
32. data += more_data
33. return data
34.
35. def format_time(self, dt_value):
36. return str(int(dt_value.timestamp())).replace('.', '')[:13].ljust(13, '0')
37.
38. def fetch_crypto(self, currency_code='BTC', days=1):
39. timestamp_now = datetime.now()
40. url = "https://api.livecoinwatch.com/coins/single/history"
41. payload = {
42. "currency": "USD",
43. "code": currency_code,
44. "start": self.format_time(timestamp_now-timedelta(days=days)),
45. "end": self.format_time(timestamp_now),
46. "meta": True
47. }
48. data = requests.post(url, headers={"x-api-
key": self.__api_token, "content-
type": "application/json"}, json=payload).json()
49. for item in data['history']:
50. self.update_currency_exchange(currency_code, item['rate'], item
['date'])
51.
52. def fetch_and_update_coins(self):
53. click.echo("Starting fetching livecoin updates")
54. url = 'https://api.livecoinwatch.com/coins/list'
55. data = self.__post_data(url)
56. data_to_refresh = {item['code']: item['rate'] for item in data}
57. self.refresh(data_to_refresh)
58.
59. def refresh(self, data):
60. for currency_code, currency_value in data.items():
61. click.echo(f"Updating coing: {currency_code}")
62. self.update_currency_exchange(currency_code, currency_value)
63.
64. def update_currency_exchange(self, currency_code, currency_value,
updated_value=None):
65. if not updated_value:
66. updated_value = datetime.now().timestamp()
67. sql = f"SELECT * FROM currency_exchange WHERE currency_c
ode='{currency_code}'"
68. result = self._db.execute(sql)
69. if result:
70. result = result.pop()
71. if float(result['updated_at']) <= float(updated_value):
72. self.save_history(result['currency_code'], result['last_price'], u
pdated_at=result['updated_at'])
73. sql = f"UPDATE currency_exchange SET last_price='{curren
cy_value}', updated_at=
{updated_value} WHERE currency_code='{currency_code}'"
74. self._db.commit(sql)
75. else:
76. self.save_history(currency_code, currency_value, updated_value)
77. else:
78. sql = f"INSERT INTO currency_exchange (last_price, currency_
code, updated_at,
created_at) VALUES ('{currency_value}', '{currency_code}',
{updated_value}, {updated_value})"
79. self._db.commit(sql)
80.
81. def save_history(self, currency_code, currency_value, updated_at):
82. sql = f"""INSERT INTO currency_exchange_history (currency_co
de, last_price, updated_at, created_at) VALUES ('{currency_code}', '{c
urrency_value}', '{updated_at}', '{updated_at}')"""
83. self._db.commit(sql)
Code 7.30
We have significantly modified our example Code 7.23 into the version
shown in Code 7.30. We added support for applying exchange rate
updates with an explicitly given date-time (lines 65-66), if provided; we
use the current timestamp if the argument is not given. This approach
helps us treat the given record (line 64) either as a fresh one that should
be stored as the current exchange rate (lines 69-74), refreshing and
updating the record in the DB, as an older record that goes straight into
the history table (lines 75-76), or as a completely new record, because it
does not exist in the database yet (lines 77-79).
Another change one must notice is that in lines 81-83 we now directly
use the given updated_at value while creating historical records in the database.
All these changes come together in the newly introduced method for
fetching crypto currencies. We can use the fetch_crypto method from
Code 7.30 to fetch the historical rates of any given crypto currency. As an
argument, we pass the number of days of historical data we want to go
back (Code 7.30, line 38).
We use the mentioned number of days as an input argument (line 38) and,
counting back from the present date-time, build the time range of data to
fetch (lines 41-46). A fact that needs to be highlighted is that the
livecoinwatch12 API expects to receive date-time fields as epoch13
timestamps instead of the ISO14 timestamp format.
Moreover, the API does not use the epoch timestamp as a float: the value
is always 13 digits long (milliseconds since the epoch), padded with zeros
where needed. For that reason, we created the method format_time,
which, for a given datetime object, produces an epoch timestamp with the
described layout, so that the Livecoinwatch API can understand the
timestamps we send.
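A short worked example of that logic, as a standalone copy of the method from Code 7.30 (the exact printed values depend on your local timezone):

from datetime import datetime

def format_time(dt_value):
    # same logic as in Code 7.30: whole seconds, zero-padded to 13 digits
    return str(int(dt_value.timestamp())).replace('.', '')[:13].ljust(13, '0')

dt = datetime(2023, 3, 1, 12, 0, 0)
print(int(dt.timestamp()))  # e.g. 1677672000 -> 10 digits (seconds)
print(format_time(dt))      # e.g. '1677672000000' -> 13 digits (milliseconds)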
To consume the newly updated code and the new method fetch_crypto,
we also need to modify the runner script from example Code 7.24 so that
it works as illustrated in the following example code.
1. import click
2. from live_coin_client import LiveCoinClient
3.
4. @click.command()
5. @click.option("--
coin", help="Coin to update", type=str, required=False)
6. @click.option("--
days", help="Number of days to fetch", type=int, required=False)
7. def main(days, coin):
8. click.echo("Starting import")
9. l = LiveCoinClient()
10. if coin:
11. l.fetch_crypto(currency_code=coin, days=(days if days else 10))
12. else:
13. l.fetch_and_update_coins()
14.
15. if __name__ == '__main__':
16. main()
Code 7.31
To use example Code 7.31 in a manner where we want to fetch all the
exchange rates, we should run the code like in the following example:
1. API_KEY=<your api key> python update_currency_exchange.py
Code 7.32
Now, when we want to run a single currency update with a given number
of days of history, we need to run the same script with parameters, like in
the following example.
1. API_KEY=<your api key> python update_currency_exchange.py --
coin=ETH --days=5
Code 7.33
When we run the same script with the same arguments twice or more, you
might face an sqlite3 error like in the following example:
1. sqlite3.IntegrityError: UNIQUE constraint failed: currency_exchange_h
istory.currency_code, currency_exchange_history.updated_at
Code 7.34
This error occurs because we decided in example Code 7.17 to only allow
saving historical records under a unique constraint: the combination of
updated_at and currency_code must be unique. This means that when we
run our data updating script with the same combination of the mentioned
parameters, it will eventually fetch data that we already have saved in our
database, and the SQLite driver will raise an exception.
To properly support the mentioned edge case and deliver valid exception
handling, we must wrap the commit DB query (Code 7.30, lines 81-83)
in a try/except block. This fix is accomplished in the following example.
1. import sqlite3 # remember to import this at the top of the script
2.
3. def save_history(self, currency_code, currency_value, updated_at):
4. try:
5. sql = f"""INSERT INTO currency_exchange_history (currency_co
de, last_price, updated_at, created_at) VALUES ('{currency_code}', '{c
urrency_value}', '{updated_at}', '{updated_at}')"""
6. self._db.commit(sql)
7. except sqlite3.IntegrityError:
8. click.echo("This kind of record is aleady saved in DB")
Code 7.35
As you can see, we are using a simple try/except block here to make sure
that we can save all valid records and skip repetitions.
Integrating with crypto wallet
So far, we have managed to write scripts that help us retrieve current and
historical crypto currency rates. Before we can proceed further, we need
to understand how we can handle crypto wallets from Python. As
mentioned in the previous subsection, a package that helps in managing
crypto wallets is web315. In this section, we will go one level up, from a
local wallet to connecting with an Ethereum network, which we use as an
example.
A prerequisite of all crypto currency trading platforms is that you have to
register on the trading website. Once the registration and verification
processes are finished, the system will generate your unique wallet for
you, for all the coins you want to trade on the platform.
Do not worry if you do not have any crypto assets to play with. In the
following exercise, which simulates something very similar to real
trading platforms, we will work with a locally running Ethereum network
that is going to simulate a few wallets for us. Follow these steps to install
the Ethereum network locally, with all its assets, so that you can run the
simulation we mentioned:
1. First, we need to download and install a local Ethereum network
emulator.
2. The easiest choice is to use Ganache16. Once you have it installed, start
it and, as the default option, choose quick start Ethereum.
3. Wait a few seconds and you should see a few wallets, like in the
following figure.
4. For Python, we need to install the package web317 to be able to
communicate with the blockchain.
Figure 7.2: Ganache default dashboard
5. Click on the first wallet address; on the right side it has a key icon,
click on it. You should see the private and public address of the wallet.
6. Let us choose this wallet as the simulation of the trading wallet, where
we will send our crypto assets to exchange them for USD, or the other
way back.
7. The second wallet from the list we are going to use as our main private
wallet, where we keep the ETH coins that we want to sell, or where we
receive coins when we buy.
All sorted; now it is time to connect to our locally running Ethereum
network. As mentioned before, to do so we are going to use the web3
Python library. Let us create a script called eth_client.py, like in the
following example.
1. from web3 import Web3
2.
3. w3 = Web3(Web3.HTTPProvider('http://127.0.0.1:7545'))
4. print("Are we connected", w3.is_connected())
Code 7.36
With this script, we managed to connect to a local instance of the
Ethereum network. In case you are wondering where the connection
address in line 3 came from, please check Figure 7.2: in the section
remote procedure call (RPC) server, that is the address of the Ethereum network.
Once we are connected, it is time to check the wallet balance to see how
many coins we have. As we agreed before, we are using the second wallet
from the list as our main crypto wallet. With the following script, we can
connect to the network and check the assets status of the second wallet.
1. from web3 import Web3
2.
3. w3 = Web3(Web3.HTTPProvider('http://127.0.0.1:7545'))
4. print("Are we connected", w3.is_connected())
5.
6. wallet_address = "0x8b5105D3c66617D3D3Bc45c1d9714138E4b228B
D"
7. balance = w3.eth.get_balance(wallet_address)
8. print(f"balance: {balance}")
Code 7.37
By executing Code 7.37, we should get an output like in the following
example.
1. $ python eth_client.py
2.
3. Are we connected True
4. 100000000000000000000
Code 7.38
Please notice that the value of the wallet is presented in wei. The number
is a big integer, because 1 ETH equals 10^18 wei; you can think of it as
representing pennies and dollars in one big number with no floating
point. That means that to know the actual ETH value, we apply a simple
rule: divide by 10^18, so in this case 10^20 wei / 10^18 = 100 ETH. To
see the value in a more human friendly form, we have to convert it as
shown in the following example code.
1. from web3 import Web3
2.
3. w3 = Web3(Web3.HTTPProvider('http://127.0.0.1:7545'))
4. print("Are we connected", w3.is_connected())
5.
6. wallet_address = "0x8b5105D3c66617D3D3Bc45c1d9714138E4b228B
D"
7. balance = w3.eth.get_balance(wallet_address)
8. human_balance = w3.from_wei(balance, "ether")
9. print(f"balance: {human_balance:.2f}")
Code 7.39
Converting such big numbers to a more human friendly form is a simple
operation; just take note that we had to specify the coin unit (line 8)
during the conversion. Then we print out the value of the coin using
Python's number and string formatting (line 9); the .2f part means that we
want to round to 2 decimal places.
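For quick experiments, the converters can also be called on the Web3 class itself, without connecting to any network (assuming web3 v6, as pinned earlier); the values below repeat the arithmetic from Code 7.38:

from decimal import Decimal
from web3 import Web3

wei = 100_000_000_000_000_000_000          # 10**20 wei, the balance from Code 7.38
print(Web3.from_wei(wei, "ether"))         # 100 (ETH)
print(Web3.to_wei(Decimal("0.5"), "ether"))  # 500000000000000000 (wei)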
There is one important aspect of this exercise that you must be aware of:
each time you restart your Ganache application, it will restart the
Ethereum network and recreate the wallets, with new addresses and new
assets. To keep our wallet addresses more flexible, we can store them in a
configuration file instead of keeping them hardcoded directly in the
source code. The following example is a configuration file called config.ini18
1. [network]
2. server = http://127.0.0.1:7545
3.
4. [wallet]
5. main = 0x6D4a84a4E7b7A0D1c1b68D8c18D88e4d21D67484
6. trade = 0x43319A04776dc250559eB752584FD0791Cf5688f
Code 7.40
If you use the default settings for Ganache, you can keep the network
section the same as in the example configuration in Code 7.40, although
you have to update the wallet settings accordingly.
Let us put together the code we have written so far with the config file
that we introduced. In the following example, as already mentioned, we
will use configuration driven wallet and network settings:
1. import configparser
2. from web3 import Web3
3.
4.
5. config = configparser.ConfigParser()
6. config.sections()
7. config.read('config.ini')
8.
9. eth_server = config.get('network', 'server')
10. main_wallet = config.get('wallet', 'main')
11. trade_wallet = config.get('wallet', 'trade')
12.
13. w3 = Web3(Web3.HTTPProvider(eth_server))
14. print("Are we connected", w3.is_connected())
15.
16. def check_balance(wallet_address):
17. balance = w3.eth.get_balance(wallet_address)
18. human_balance = w3.from_wei(balance, "ether")
19. print(f"balance: {human_balance:.2f}")
20.
21. if __name__ == '__main__':
22. check_balance(main_wallet)
23. check_balance(trade_wallet)
Code 7.41
By refactoring our code in example Code 7.41, we also added support to
check the balance of both wallets, our main and trading wallet. As we have
already highlighted, we will use the trading wallet as a simulator of a money
exchange platform (website) where we can send our crypto assets and do
payouts from.
In the following example, we will learn how to create a transaction in our
local blockchain that is going to send Ethereum assets from our main wallet
to the trading one. We have mentioned before that to gain access to any wallet,
we need not only its public address, but also its private key. Indeed, we need
a private key to be able to send coins from the source wallet to the destination.
In the Ganache emulator you can retrieve the private key by clicking the key
icon as illustrated in the following figure:
Figure 7.3: Configuration file for selected wallet
Once we have a private key for our main wallet, we need to take the
configuration file that we created in example Code 7.40 and add the key to
the wallet section, like in the following example:
1. [wallet]
2. main = 0x6D4a84a4E7b7A0D1c1b68D8c18D88e4d21D67484
3. trade = 0x43319A04776dc250559eB752584FD0791Cf5688f
4. main_key =
0x7af6ebc503923e08147151407f9a8173fec9270d1b6b0278e699fb1323
d31644
Code 7.42
In the following example, we are going to send 0.5 ETH to our trading
wallet. We are going to pack the functionality of sending a transaction into a
dedicated function so that we can reuse it at a later stage.
1. import click
2. import configparser
3. from decimal import Decimal
4. from web3 import Web3
5.
6.
7. config = configparser.ConfigParser()
8. config.sections()
9. config.read('config.ini')
10.
11. eth_server = config.get('network', 'server')
12. main_wallet = config.get('wallet', 'main')
13. trade_wallet = config.get('wallet', 'trade')
14. main_private_key = config.get('wallet', 'main_key')
15.
16. w3 = Web3(Web3.HTTPProvider(eth_server))
17. print("Are we connected", w3.is_connected())
18.
19.
20. def make_transfer(account_src, account_dst, private_key, amount) -
> str:
21. # we need nonce for transaction
22. nonce = w3.eth.get_transaction_count(account_src)
23. # transaction data
24. tx = {
25. 'nonce': nonce,
26. 'to': account_dst,
27. 'value': w3.to_wei(Decimal(amount), 'ether'),
28. 'gas': 21000,
29. 'maxFeePerGas': 20000000000,
30. 'maxPriorityFeePerGas': 1000000000,
31. 'chainId': w3.eth.chain_id
32. }
33. signed_tx = w3.eth.account.sign_transaction(tx, private_key)
34. tx_hash = w3.eth.send_raw_transaction(signed_tx.rawTransaction)
35. return w3.to_hex(tx_hash)
36.
37.
38. def check_balance(wallet_address):
39. balance = w3.eth.get_balance(wallet_address)
40. human_balance = w3.from_wei(balance, "ether")
41. print(f"balance: {human_balance:.2f}")
42.
43. @click.command()
44. @click.option("--amount", help="Amount of ETH to send",
required=False, type=float)
45. @click.option("--send", help="Send coins to destination wallet",
required=False, default=False, is_flag=True)
46. @click.option("--balance", help="Show wallets balance",
required=False, default=False, is_flag=True)
47. def main(balance, send, amount):
48. if balance:
49. check_balance(main_wallet)
50. check_balance(trade_wallet)
51. if send:
52. assert amount, "Amount to send is required"
53. result = make_transfer(main_wallet, trade_wallet, main_private_ke
y, amount)
54. print(f"Sending transaction id: {result}")
55.
56. if __name__ == '__main__':
57. main()
Code 7.43
We have used our well-known click19 module, which we have used many
times in previous chapters, to address the customization of scripts
and drive the flow depending on what we want to achieve. For instance, if
we want to check the wallet balance, we run the following example:
1. $ python eth_client_with_sending.py --balance
2.
3. Are we connected True
4. balance: 100.00
5. balance: 100.00
Code 7.44
In the following example, we are sending coins from our main wallet to the
trading wallet. After this, the blockchain automatically takes care of the
transaction and deducts the requested assets from the source wallet.
1. $ python eth_client_with_sending.py --send --amount 0.5
2.
3. Are we connected True
4. Sending transaction id:
0x55f6e2ba7da3d8751bddce4fb8f8fef32fd82b9db35d873fe952f3e94c4f
94d8
Code 7.45
You can notice that in Code 7.43, lines 24-32, we are preparing the transaction
data and letting the gas be calculated automatically. This is done by specifying
the maximum values for the gas fee and priority. In a nutshell, this is a fee that
we are obligated to pay when sending coins from source to destination20. Once
the transfer is done, we can check in Ganache what exactly happened for the
transaction id 0x55f6… like in the following figure:
Figure 7.4: Example crypto wallet address.
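To get a feeling for the numbers involved, we can estimate the worst-case fee of our transfer ourselves. The following is a minimal sketch assuming the same gas parameters as in Code 7.43; the fee actually charged depends on the network's base fee at the moment of the transaction.

from web3 import Web3

# gas parameters from our transaction in Code 7.43
gas_limit = 21000                # standard gas cost of a plain ETH transfer
max_fee_per_gas = 20000000000    # 20 gwei, the most we agree to pay per gas unit

# worst-case total fee in Wei, converted to ETH for readability
max_fee_wei = gas_limit * max_fee_per_gas
print(Web3.from_wei(max_fee_wei, 'ether'))   # 0.00042

The actual fee is usually lower, since the network charges the base fee plus our priority tip, capped by maxFeePerGas.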
Purchase and sell
So far, we learned how to simulate a blockchain network locally and how to
connect to it. We know how to send coins from a source wallet to a destination
with Python. In previous subchapters we also covered how and from where
we can get prices for currency exchange – both historical and live.
Now, it is time to get to the essence of our trending platform.
We will learn how to automate, with Python, the process of selling and buying
crypto currencies depending on the trends in the market. One thing we must
highlight here is that we are not learning how to sell and buy crypto coins on
the real market. Those processes are more complicated, and you can find
dozens of books regarding that topic. We are only having some fun with
Python and are using it as a tool for demonstrating how powerful it can be
when it comes to automated processes.
In the following example, we will apply a simple algorithm for calculating
trends. If the trend of the ETH exchange rate went up over the last three days,
we will sell our crypto coins (send them to the trading wallet).
For calculating the average trend of the exchange rate, we will use a
mathematical module for Python called numpy21. Here is an example of
how to install the package:
1. $ pip install numpy==1.21.6
Code 7.46
After successful installation of the numpy package we can create a simple
example to demonstrate how we are planning to calculate trends in
rates. In the following example code, we assume that the rates
for ETH are reported every full hour and that for each hour we have got the
following values.
1. import numpy as np
2.
3. def growing_avg(a, n):
4. ret = np.cumsum(a, dtype=float)
5. ret[n:] = ret[n:] - ret[:-n]
6. return ret[n - 1:] / n
7.
8. def is_growing(data):
9. return np.all(np.diff(growing_avg(np.array(data), n=4))>0)
10.
11. rates_1 = [3.0, 6.0, 10.0, 4.2, 11.0]
12. rates_2 = [6.0, 4.0, 5.0, 4.0, 3.2]
13.
14. print("Is growing trend: ", is_growing(rates_1))
15. print("Is growing trend: ", is_growing(rates_2))
Code 7.47
Let us analyze what is happening in Code 7.47. Overall, the algorithm is
simple. We have an array of numbers (lines 11 and 12) and we calculate the
arithmetic average of elements 1, 2, 3, 4, then 2, 3, 4, 5, and so on. By this
we have average values for groups of elements in the array – 4 elements per
group (function in line 3). Next, we calculate the pairwise differences of all
those calculated averages (line 9, np.diff). In the
end we check if all elements of the calculated diff are greater than zero (line 9,
np.all > 0).
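To see what the cumulative-sum trick in growing_avg actually produces, we can trace the intermediate values for rates_1 by hand; the following is a minimal sketch of that trace.

import numpy as np

rates_1 = [3.0, 6.0, 10.0, 4.2, 11.0]
n = 4

ret = np.cumsum(rates_1, dtype=float)  # [ 3.   9.  19.  23.2 34.2]
ret[n:] = ret[n:] - ret[:-n]           # drop the sum that falls out of each window
averages = ret[n - 1:] / n             # [5.8 7.8] - means of elements 1-4 and 2-5
print(averages)
print(np.diff(averages))               # [2.] - all positive, so the trend is growing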
We managed to simulate a growing and a falling trend in Code 7.47. Now it
is time to use numpy to analyze all the currency rates that we managed to
download. Please note that we are making some assumptions here. We built
our calculations on top of the historical data that we managed to retrieve
(subchapter Fetching and storing fresh currency exchange), so both the time
window and the frequency of currency exchange samples remain as they are
provided by the livecoinwatch.com API. In a nutshell, if the livecoinwatch
API provides access to 6 months of historical data – we will only have 6
months of such currency exchange markers.
1. import numpy as np
2. from db import DB
3.
4.
5. class AnalyzeRates:
6.
7. def __init__(self, currency):
8. self.currency = currency
9. self._db = DB()
10.
11. def growing_avg(self, a, n=3):
12. ret = np.cumsum(a, dtype=float)
13. ret[n:] = ret[n:] - ret[:-n]
14. return ret[n - 1:] / n
15.
16. def is_growing(self, data):
17. return np.all(np.diff(self.growing_avg(np.array(data), n=4))>0)
18.
19. def _get_last_rates(self, limit=7):
20. sql = f"SELECT last_price FROM currency_exchange_history WH
ERE currency_code='{self.currency}' ORDER BY updated_at DESC
LIMIT {limit}"
21. return [x['last_price'] for x in self._db.execute(sql)]
22.
23. def check_currency_growing(self):
24. currency_values = self._get_last_rates()
25. return self.is_growing(currency_values)
26.
27. if __name__ == "__main__":
28. checker = AnalyzeRates('ETH')
29. result = checker.check_currency_growing()
30. tendency = 'growing' if result else 'falling'
31. print(f'Currency ETH has tendency to {tendency}')
Code 7.48
We reused the database class for making a query (lines 20-21) to get the last
seven records that we fetched from the external API. Then we create an array of
values (line 21). Having this data, we can apply the logic for calculating
moving average trends that we learned in Code 7.47.
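Note that the DB class itself comes from the earlier subchapters and is simply imported here. For reference, a minimal sketch of what such a helper might look like on top of SQLite (the actual class in the book's repository may differ, and the database file name below is a hypothetical placeholder):

import sqlite3

class DB:
    # a minimal sketch of the database helper, assuming an SQLite backend
    def __init__(self, db_file='rates.db'):  # hypothetical file name
        self._conn = sqlite3.connect(db_file)
        # rows behave like dicts, so callers can do row['last_price']
        self._conn.row_factory = sqlite3.Row

    def execute(self, sql):
        return self._conn.execute(sql).fetchall()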
It is fantastic to have all this data and visibility on how the trends of the crypto
market are behaving, to make decisions about selling and buying assets. In
the following example, we will reuse the mechanism that we introduced in
example Code 7.48. Before we dive into our new example, we shall
make a small adjustment in Code 7.43 under the function check_balance.
1. def check_balance(wallet_address):
2. balance = w3.eth.get_balance(wallet_address)
3. human_balance = w3.from_wei(balance, "ether")
4. print(f"balance: {human_balance:.2f}")
5. return human_balance
Code 7.49
As you might notice, in Code 7.49 we added a return statement (line 5) at
the end of the function body. That said, we still print out the given wallet
value (line 4), but by also returning it we are able to use it in the
following example as well.
1. import click
2. import numpy as np
3. from db import DB
4. from decimal import Decimal
5. from eth_client_with_sending import make_transfer, main_wallet, trade_wallet, main_private_key, check_balance
6.
7.
8. class AnalyzeRates:
9.
10. def __init__(self, currency):
11. self.currency = currency
12. self._db = DB()
13.
14. def growing_avg(self, a, n=3):
15. ret = np.cumsum(a, dtype=float)
16. ret[n:] = ret[n:] - ret[:-n]
17. return ret[n - 1:] / n
18.
19. def is_growing(self, data):
20. return np.all(np.diff(self.growing_avg(np.array(data), n=4))>0)
21.
22. def _get_last_rates(self, limit=7):
23. sql = f"SELECT last_price FROM currency_exchange_history WH
ERE currency_code='{self.currency}' ORDER BY updated_at DESC LI
MIT {limit}"
24. return [x['last_price'] for x in self._db.execute(sql)]
25.
26. def check_currency_growing(self, percent_value_to_sell):
27. currency_values = self._get_last_rates()
28. status = self.is_growing(currency_values)
29. trend_status = 'growing' if status else 'falling'
30. click.echo(f'Currency ETH has trend of {trend_status}')
31. if status:
32. src_wallet_balance = check_balance(main_wallet)
33. click.echo(f"Source wallet value: {src_wallet_balance}")
34. value_to_send = src_wallet_balance * Decimal(percent_value_t
o_sell/100.0)
35. click.echo(f"Value from srouce wallet value to send: {value_to_s
end}")
36. transaction_id = make_transfer(main_wallet, trade_wallet, main
_private_key, value_to_send)
37. click.echo(f"Transfer finsihed, transaction id: {transaction_id}")
38.
39.
40. @click.command()
41. @click.option("--proceed", help="When true values calculated
will be send", required=False, default=False, is_flag=True)
42. @click.option("--percent", help="Pecent of assets of ETH to send",
required=True, type=float)
43. def main(percent, proceed):
44. checker = AnalyzeRates('ETH')
45. checker.check_currency_growing(percent)
46.
47.
48. if __name__ == "__main__":
49. main()
Code 7.50
In this example, based on what we have already learned so far, we check
what the moving average tendency is for a crypto currency, in this case
ETH (lines 26-31). If we detect that the average is growing, we calculate the
percentage of the value of our source wallet (line 34) that we are
going to send to the trading wallet. Once the transaction is finished, we print
out the transaction hash (Code 7.50, line 37).
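Assuming we saved the script as, say, trend_trader.py (the file name is our choice here), a sample session could look roughly like the following; the exact values and the transaction hash will of course differ on your machine.

$ python trend_trader.py --percent 10

Are we connected True
Currency ETH has trend of growing
balance: 100.00
Source wallet value: 100
Value from source wallet value to send: 10.000...
Transfer finished, transaction id: 0x...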
It is important to know the transaction ID for sending or receiving
crypto coins. Based on that string you can always verify on Etherscan22
which wallet sent how many crypto assets and to where. Naturally, all the
data in blockchains is anonymous. You can see hashes, wallet addresses,
gas and coins being transferred, but it does not disclose who made the
transfer. If you want to check your own transfers and how many coins were
sent, you can always verify it there.
When using our local emulator, Etherscan is not going to show us our
transfers, because everything that occurs in our local blockchain stays in the
local environment.
Conclusion
In this chapter, we learned how we can use Python for analyzing the crypto
market and its trends – when it is time to sell and when it is time to buy. We
did not perform the example of pulling our coins out of the trading wallet,
since the mechanics are the same as for sending assets from the main wallet to
the trading one, as we already performed; it is just a matter of switching
places – source with destination.
We have also managed to practice how to use all the code that we wrote in
all the subsections. That is the whole Python and overall programming
pattern: gradually, we have been building a prototype to check, analyze and
send crypto assets to the trading wallet. We have also managed to learn how to
use local simulations of a blockchain, which helps prototype dApps23. It also
helps you get familiar with the crypto world from a developer's point of view
without using real money, where every mistake is not free.
In the next chapter, we are going to learn how to use Python with hardware
that we might want to build. We are going to learn how to build a smart
speaker that you can interact with.
1. https://www.kaspersky.com/resource-center/definitions/what-is-
cryptocurrency
2. https://www.sqlite.org/index.html
3. https://pypi.org/project/click/
4. https://www.cloudflare.com/en-gb/learning/ssl/how-does-public-key-
encryption-work/
5. https://www.secg.org/sec2-v2.pdf?ref=hackernoon.com
6. https://pypi.org/project/web3/
7. https://github.com/pycrypto/pycrypto
8. https://docs.github.com/en/pull-requests/collaborating-with-pull-
requests/working-with-forks/fork-a-repo
9. https://github.com/darkman66/eth-keyfile
10. https://www.livecoinwatch.com/tools/api
11. https://pypi.org/project/requests/
12. https://livecoinwatch.github.io/lcw-api-docs/#coinssinglehistory
13. https://www.techtarget.com/searchdatacenter/definition/epoch
14. https://www.iso.org/iso-8601-date-and-time-format.html
15. https://web3py.readthedocs.io
16. https://trufflesuite.com/ganache/
17. https://web3py.readthedocs.io
18. https://docs.python.org/3/library/configparser.html
19. https://click.palletsprojects.com/en/8.1.x/
20. https://ethereum.org/en/developers/docs/gas/
21. https://numpy.org
22. https://etherscan.io
23. https://ethereum.org/en/dapps/
CHAPTER 8
Construct Your Own High-tech
Loudspeaker
Introduction
With galloping technological changes in computers and smart devices, we
can say that not only do they bring newer, better batteries, displays, CPUs
and other hardware parts, but they also become more powerful and
efficient. These changes also have a significant impact upon how we, as
humans, interact with computers. Over the years, we have moved from
keyboards towards touch screens. Now, through super-efficient CPUs, our
focus has shifted towards speech. We can now interact with smart devices by
using our voice to give commands and make simple interactions with them.
This technology is growing rapidly, and, in this chapter, we are going to
explore and learn how to use Python in the fascinating world of smart
speakers.
Structure
In this chapter, we will cover the following topics:
Building a software that can support speech to text
Recording
Response
Building interaction scenarios
Connecting to third party services like music players
Building physical devices
Objectives
After reading this chapter, you should know how to build your own smart
speaker. We will learn how to interact with it by scripting interaction
scenarios. You should also be able to understand how speech to text works,
what kind of challenges it may introduce, and how to overcome them as a
developer who knows how to use Python.
Building a software that can support speech to text
When using speech-to-text recognition software, we have a very wide
variety of choices, and we do not have to write the software from scratch.
Writing software for voice recognition is not a trivial matter. We would need
to build an artificial intelligence (AI) model1 and train it a lot, for it to
understand voice and different dialects and accents2.
To build a smart speaker we are going to use a few components and then
connect them together. Following are the key components required to build a
smart speaker:
A framework for recording audio samples
A trained model and framework for language recognizing
A simple model to write integrations with users based on spoken
sentences
Recording
To record audio samples, we will use system libraries and kernel drivers that
give us access to input audio devices such as microphones via a system API,
without facing the many challenges of accessing such a device at a very low
system level.
We will use a Python library called sounddevice3. This tool allows us to
record sounds from the microphone and obtain them as data arrays that we
can process and save with Scipy4.
First, we need to install the requirements to be able to start recording. Before
we install any of the above requirements, we must install Python 3.105,
which is required by the sounddevice package.
Once you have installed the required Python version, we can install
all the required libraries as shown in the following code.
1. pip install sounddevice==0.4.6
2. pip install scipy==1.10.1
Code 8.1
Once we have installed Python 3.10 and the required libraries, we can try to
make our very first test recording to figure out how microphones are
configured in our system. Let us try the following example that
demonstrates how to record sounds.
1. import sounddevice as sd
2. from scipy.io.wavfile import write
3.
4. fs = 44100
5. seconds = 3
6.
7. myrecording = sd.rec(int(seconds * fs), samplerate=fs, channels=1)
8. print('Start talking')
9. sd.wait()
10. write('output.wav', fs, myrecording)
Code 8.2
After executing the above code, you should be able to record a sound from
your default system microphone and find it saved under our application
folder in a file called output.wav.
We can see that in line 4 we define the quality (sampling frequency6) of
our recording and in line 5 we define how long we want to record.
These two factors are later used in line 7 to determine the recording
options, as well as the number of channels. The reason why we use only a
single channel here is that the most popular microphones are mono,
which means that they deliver sound in a single channel. We do not want to
force recording on stereo channels in case the microphone does not support it.
If you want, you can experiment with two channels as long as your
microphone supports it.
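If you are not sure what your microphone supports, sounddevice can list the audio devices known to the system together with their channel counts; a quick check could look like the following sketch.

import sounddevice as sd

# list every audio device the system knows about
print(sd.query_devices())

# inspect the default input device and its maximum number of input channels
default_input = sd.query_devices(kind='input')
print(default_input['max_input_channels'])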
The next thing to note is the fact that our framework (sounddevice) by default
produces recordings as a numpy data array, which means we cannot send its
output directly to a file. This is why we use the function write from the
Scipy package – line 10.
We now have the recording in a flat file. We will use this later for sure. For the
time being, we need to understand how to start recording and trigger the
recording itself based on a spoken keyword. The easiest and most efficient
way to achieve this is demonstrated in Code 8.3. Let us put our
recording mechanics into a loop and wait for the wake word, which we will
add later. For the time being we will keep listening in the loop.
1. import sounddevice as sd
2. from scipy.io.wavfile import write
3.
4. fs = 44100
5. seconds = 10
6.
7. print('I am listening... press ctrl+c to stop')
8. while True:
9. myrecording = sd.rec(int(seconds * fs), samplerate=fs, channels=1)
10. sd.wait()
11. print("Finished and again, recordign size", len(myrecording))
Code 8.3
With this simple change (line 8), we can keep registering sounds forever in
ten-second samples (line 5). By checking the following flowchart, we will
analyze why we need an infinite loop for recording:
Figure 8.1: Recording and catching wake word cycle
In Figure 8.1, we have explained why we need an infinite loop for
recording. That loop, in its basic concept, is needed to listen, record, analyze
and wait until the user activates the recording block (record spoken word).
This happens when the user speaks specific keywords. For this chapter,
let us assume that the triggering word is going to be speaker or hey speaker.
Our example Code 8.3 does not analyze speech to text, so we need to
introduce a Python library that will help us refactor our code and convert
speech to text. Before we do some simple exercises, we need to
install a Python library for analyzing sounds that is able to convert them to
text. Let us check in the following example how to install the required module.
1. pip install -U openai-whisper
Code 8.4
Once we have installed whisper7, we can write a simple example like the
following code. To make it work, we shall reuse the concept from Code 8.2.
1. import whisper
2. import sounddevice as sd
3. from scipy.io.wavfile import write
4.
5. fs = 44100
6. seconds = 5
7.
8. myrecording = sd.rec(int(seconds * fs), samplerate=fs, channels=1)
9. print('Start talking')
10. sd.wait()
11.
12. print('Write output')
13. write('output.wav', fs, myrecording)
14.
15. print('Analyze text')
16.
17. model = whisper.load_model("base")
18. result = model.transcribe("output.wav")
19. analized_text = result["text"]
20.
21. print(f"What you said: {analized_text}")
Code 8.5
In this simple code, we are recording five seconds of audio input.
Once the recording is ready, it is saved to the output file (line 13). Next, we
load the base AI model into whisper (line 17). After the model is properly
loaded and processed, the content of the output file goes to whisper (line 18).
In the end, we have the text form of our audio file.
Now, we need to add support for the wake word. For this, we have to
refactor our previous code and add detection of a key word. In this case, as
we agreed before, our magic word is "speaker" or "hey speaker". Let us
check in the following example how we can react upon the mentioned keywords.
1. import logging
2. import whisper
3. import sounddevice as sd
4. from scipy.io.wavfile import write
5.
6. FS = 44100
7. SECONDS = 5
8. RECORDING_FILE = 'output.wav'
9. LOGGING_FORMAT = '%(asctime)s %(message)s'
10.
11. logging.basicConfig(level=logging.INFO, format=LOGGING_FORMA
T)
12.
13.
14. class SmartSpeaker:
15. def __init__(self):
16. self._current_text = None
17. self.model = whisper.load_model("base.en", download_root='.')
18. logging.info("Model loaded")
19.
20. def run(self):
21. if self.record_audio():
22. self.analized_text = self.audio_to_text()
23. logging.info(f"Translated text: {self.analized_text}")
24. if self.is_keyword_in_text:
25. logging.info("Hello, I can't talk yet but I heard you")
26.
27. def record_audio(self) -> bool:
28. try:
29. myrecording = sd.rec(int(SECONDS * FS), samplerate=FS, cha
nnels=1)
30. logging.info('Start talking')
31. sd.wait()
32.
33. logging.info('Write output')
34. write('output.wav', FS, myrecording)
35. except Exception as e:
36. logging.error(f"We crashed: {e}")
37. return False
38. return True
39.
40. def audio_to_text(self) -> str:
41. logging.info('Analyze text')
42. result = self.model.transcribe("output.wav")
43. return result["text"]
44.
45. @property
46. def is_keyword_in_text(self) -> bool:
47. return 'speaker' in self.analized_text.lower() or 'hey speaker' in self.
analized_text.lower()
48.
49. if __name__ == '__main__':
50. smart_speaker = SmartSpeaker()
51. smart_speaker.run()
Code 8.6
We can see that our code got refactored more towards object-oriented
programming. We have managed to delegate logical areas of the code, such
as recording speech (Code 8.5, lines 8-10), converting to text (Code 8.5,
lines 17-19) and picking up key phrases (the wake word), into dedicated code
blocks. The functions converted into methods are in use when we can clearly
detect that recording audio was successful (Code 8.6, line 21) and we have
managed to convert speech to text (Code 8.6, line 22).
Once we detect that any of the wake words, like "speaker" or "hey speaker",
is present in the recorded spoken text (Code 8.6, line 24), we respond back
to the user. In this case, we only log the potential reply message
(Code 8.6, line 25).
Response
Our project encompasses not just understanding what the user is trying to say,
but building a whole smart speaker, thus we shall not only log what the user
said (Code 8.6, line 25). Instead, we should find a way to speak back
to the user.
To make our application talk we have many options: we could use an Amazon
AWS solution8, Google text to speech9 and many more cloud solutions. Let us
try to see how we can interact with the Google Cloud service.
First, we need to install the Python packages, as shown in the following
code.
1. $ pip install gtts playsound
Code 8.7
After installing these packages, we can write our proof-of-concept code
like in the following example:
1. from gtts import gTTS
2. import os
3. import playsound
4.
5. def text2speak(text):
6. tts = gTTS(text=text, lang='en')
7. filename = "tmp.mp3"
8. tts.save(filename)
9. playsound.playsound(filename)
10. os.remove(filename)
11.
12. text2speak('Hello there, nice to meet you!')
Code 8.8
In the above Code 8.8, we used the Google Cloud service, which manages all
the conversion from text to speech (line 6). Take note that after fetching the
converted text as audio, we must save the result to a file (lines 7-8). Once the
file is saved, we use the playsound module (line 9) to play what has
been fetched from the Google service. To keep things clean, we remove the
temporary recording (line 10) before exiting the function.
Unfortunately, this solution has one big issue: the temporary file name is
static. Suppose we call our function text2speak multiple times at the same
time (in parallel). Then we will have a race condition issue. One of the
challenges is a case where the temporary file keeps being overwritten by
multiple instances of the same function, which is going to lead to issues with
playing its content. To fix this, let us refactor the code like in the following
example:
1. import os
2. import playsound
3. from gtts import gTTS
4. from tempfile import mkstemp
5.
6. def text2speak(text):
7. tts = gTTS(text=text, lang='en')
8. filename = mkstemp()[1]
9. tts.save(filename)
10. playsound.playsound(filename)
11. os.remove(filename)
12.
13. text2speak('Hello there, nice to meet you!')
Code 8.9
In this code, we are using a Python module called tempfile10, which allows
us to create unique temporary files for storing recordings etc. Like in the
previous example (Code 8.8), once the recording has been played, we remove
the temporary file, making sure that there are no breadcrumbs left
behind.
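As a side note, mkstemp returns a tuple of an open file descriptor and a path, and we only use the path here. A minimal alternative sketch using NamedTemporaryFile from the same module, which manages the descriptor for us, could look like this:

import os
import playsound
from gtts import gTTS
from tempfile import NamedTemporaryFile

def text2speak(text):
    tts = gTTS(text=text, lang='en')
    # delete=False keeps the file on disk after the descriptor is closed
    with NamedTemporaryFile(suffix='.mp3', delete=False) as f:
        filename = f.name
    tts.save(filename)             # write the fetched MP3 to the unique path
    playsound.playsound(filename)
    os.remove(filename)            # clean up, as in the original example

text2speak('Hello there, nice to meet you!')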
Using third party solutions like Google can lead to noticeable delays from
the moment we have a text to read until we can hear it. To demonstrate this,
let us run the following example:
1. $ time python text_to_speech_google_refatored.py
2.
3. python text_to_speech_google_refatored.py 0.21s user 0.06s system 5
% cpu 4.692 total
Code 8.10
In example Code 8.10, we used the system command time, which measures
the execution time of our script. Please note the last value (total): it took
almost five seconds of system time to execute our script.
Let us try a similar example where we will use Python and the system's
speech synthesis11 12 text to speech functionality:
1. from os import system
2.
3. system('say Hello there, nice to meet you!')
Code 8.11
With the embedded system say command (available on macOS), we have a
less natural voice. It sounds more artificial than an actual human voice. This
is a disadvantage for sure, but the advantages of Code 8.11 lie in its simplicity,
no dependency on external commercial services, and execution time. You
might question why we have mentioned execution time as an advantage
against a third party commercial service. In the following code, let us
check how much time it takes to execute example Code 8.11:
1. $ time python text_to_speech.py
2.
3. python text_to_speech.py 0.23s user 0.06s system 10% cpu 2.597 total
Code 8.12
You can notice that the execution time is almost half when compared to the
Google-based example (Code 8.10). This is the factor we will use here as
leverage to choose a system solution for voice synthesis over third party
libraries. Now, it is time to refactor our Code 8.6: instead of logging a response
message, we are going to say it to the user. In the following example, we
refactored Code 8.6 with text to speech included:
1. from os import system
2.
3. class SmartSpeaker:
4.
5. def run(self):
6. if self.record_audio():
7. self.analized_text = self.audio_to_text()
8. logging.info(f"Translated text: {self.analized_text}")
9. if self.is_keyword_in_text:
10. reply_txt = "Hello, I can't talk yet but I heard you"
11. system(f'say {reply_txt}')
Code 8.13
In the method run, once we detect that the wake word phrase was said by the
user, we replaced the logging with a proper text to speech mechanism. As a
result, the user will hear the text that we defined in line 10.
Building interaction scenarios
Ideally, in the world of smart speakers, we should be able to interact with the
user in such a way that the person who talks to the smart speaker has the
feeling of talking to a person, or at least to some form of artificial intelligence.
Of course, for the needs of this book we are not going to build another
ChatGPT13 platform for smart interactions with the user, but we are going to
use something simple, yet powerful, for building interaction scenarios.
We will need a Python library that allows us to build intents and can be fed
with scenario files that descriptively drive interactions with the user. We
could, like in the previous subchapter, use a commercial service14
for building smart conversations.
In this case, however, we will use an open-source library to build
conversations. In the following example, we are going to install the Python
library to support that:
1. $ git clone [email protected]:bpbpublications/Fun-with-Python.git
2. $ cd fun-with-Python/chapter_8/neuralintents
3. $ python setup.py install
Code 8.14
Once the library is installed, we can go to the next section regarding the code
that will allow us to create interactive scenarios. Before we do so, we
should clarify something. The library neuralintents15 is a POC library that is
based on TensorFlow16. It allows us to work with natural languages and
program interactions with users based on natural human talk and grammar.
We can start with the following example where we define scenarios and
intents:
1. {"intents": [
2. {"tag": "greeting",
3. "patterns": ["Hi", "How are you", "Is anyone there?",
"Hello", "Good day", "Whats up", "Hey", "greetings"],
4. "responses": ["Hello!", "Good to see you again!",
"Hi there, how can I help?"],
5. "context_set": ""
6. },
7. {"tag": "goodbye",
8. "patterns": ["cya", "See you later", "Goodbye",
"I am Leaving", "Have a Good day", "bye", "cao", "see ya"],
9. "responses": ["Sad to see you go :
(", "Talk to you later", "Goodbye!"],
10. "context_set": ""
11. },
12. {"tag": "stocks",
13. "patterns": ["what stocks do I own?", "how are my shares?", "what
companies am I investing in?", "what am I doing in the markets?"],
14. "responses": ["You own the following shares: ABBV, AAPL,
FB, NVDA and an ETF of the S&P 500 Index!"],
15. "context_set": ""
16. }
17. ]}
Code 8.15
We have configured our neuralintents to react upon different wake words
(lines with the patterns key, for example line 3). With this approach, we can
create custom functions that perform a custom action upon a triggered word.
The great thing about TensorFlow is that we can ignore all the complex
linguistic cases of grammar and inflection. The library will try to convert all
the word forms to their original forms and find the best matching pattern.
In the following example, we will read a given word or phrase from the
command line. Our library, based on the prepared scenario patterns, will call
the custom function as mentioned.
1. import logging
2. from neuralintents import GenericAssistant
3.
4. LOGGING_FORMAT = '%(asctime)s %(message)s'
5. logging.basicConfig(level=logging.INFO, format=LOGGING_FORMA
T)
6.
7. def greetings_callback():
8. logging.info("Your greetings")
9.
10. def stocks_callback():
11. logging.info("Your stocks")
12.
13. mappings = {
14. 'greeting' : greetings_callback,
15. 'stocks' : stocks_callback
16. }
17.
18. assistant = GenericAssistant('intents.json', model_name="test_model", i
ntent_methods=mappings)
19. assistant.train_model()
20. assistant.save_model()
21.
22. while True:
23. message = input("Message: ")
24. if message == "STOP":
25. break
26. else:
27. assistant.request(message)
Code 8.16
In the following example, we demonstrate how to use our example code
from Code 8.16. As is easy to notice, the keyword STOP is used to stop
our program. You may also observe that each time we use words from
our scenario patterns (Code 8.15), if a valid pattern matches, our code
calls the custom callbacks (Code 8.16, lines 7-11 and 13-16). Notice also
that, as mentioned, different grammatical forms of words are handled by
the linguistic library, as the following run shows.
1. $ python neutral.py
2.
3.
4. Message: hi
5. 1/1 [==============================] - 0s 42ms/step
6. 2023-05-21 12:26:55,286 Your greetings
7. Message: stock value
8. 1/1 [==============================] - 0s 13ms/step
9. 2023-05-21 12:26:58,984 Your greetings
10. Message: how are my shares?
11. 1/1 [==============================] - 0s 13ms/step
12. 2023-05-21 12:27:11,358 Your stocks
13. Message: shares
14. 1/1 [==============================] - 0s 12ms/step
15. Message: how are my share?
16. 1/1 [==============================] - 0s 12ms/step
17. 2023-05-21 12:27:23,580 Your stocks
18. Message: STOP
19. $
Code 8.17
What is interesting is the fact that, as in lines 10-17, you can notice that the
library is smart enough to find the proper pattern for the singular and plural
forms of the same word. Now, it is time to reuse our callback functions to
properly support the wake word and start responding to the user by
using some simple response scenarios. Let us check in the following example
code how we can achieve that.
1. import logging
2. import random
3. import whisper
4. import sounddevice as sd
5. from datetime import datetime
6. from os import system
7. from scipy.io.wavfile import write
8. from neuralintents import GenericAssistant
9.
10. FS = 44100
11. SECONDS = 5
12. RECORDING_FILE = 'output.wav'
13. LOGGING_FORMAT = '%(asctime)s %(message)s'
14.
15. logging.basicConfig(level=logging.INFO, format=LOGGING_FORMA
T)
16.
17.
18. class SmartSpeaker:
19. def __init__(self):
20. self._current_text = None
21. self.model = whisper.load_model("base.en", download_root='.')
22. logging.info("Model loaded")
23.
24. def audio2text(self):
25. if self.record_audio():
26. self.analized_text = self.audio_to_text()
27. logging.info(f"Translated text: {self.analized_text}")
28. return self.analized_text
29.
30. def run(self, assistant):
31. self.assistant = assistant
32. analyzed_text = self.audio2text()
33. if analyzed_text and self.is_keyword_in_text:
34. self.__say("Yes, how can I help you?")
35. new_analyzed_text = self.audio2text()
36. if new_analyzed_text:
37. self.assistant.request(new_analyzed_text)
38.
39. def record_audio(self) -> bool:
40. try:
41. myrecording = sd.rec(int(SECONDS * FS), samplerate=FS, cha
nnels=1)
42. logging.info('Start talking')
43. sd.wait()
44.
45. logging.info('Write output')
46. write('output.wav', FS, myrecording)
47. except Exception as e:
48. logging.error(f"We crashed: {e}")
49. return False
50. return True
51.
52. def audio_to_text(self) -> str:
53. logging.info('Analyze text')
54. result = self.model.transcribe("output.wav")
55. return result["text"]
56.
57. def get_response(self, tag):
58. list_of_intents = self.assistant.intents["intents"]
59. for i in list_of_intents:
60. if i["tag"] == tag:
61. return random.choice(i["responses"])
62.
63. def __say(self, message):
64. system (f'say {message}')
65.
66. @property
67. def is_keyword_in_text(self) -> bool:
68. return 'speaker' in self.analized_text.lower() or 'hey speaker' in self.
analized_text.lower()
69.
70. def callback_greetings(self):
71. response = self.get_response('greetings')
72. self.__say(response)
73.
74. def callback_time(self):
75. current_time = datetime.now().strftime("%I:%M%p")
76. response = self.get_response('time')
77. response = response.format(time=current_time)
78. self.__say(response)
79.
80.
81. if __name__ == '__main__':
82. smart_speaker = SmartSpeaker()
83. mappings = {
84. 'greetings': smart_speaker.callback_greetings,
85. 'time': smart_speaker.callback_time
86. }
87.
88. assistant = GenericAssistant('intents_speaker.json', model_name="tes
t_model", intent_methods=mappings)
89. assistant.train_model()
90. assistant.save_model()
91. smart_speaker.run(assistant)
Code 8.18
As you can see in example Code 8.18, we have mostly reused the code
examples that we learned so far. In a nutshell, they include detecting the wake
word and responding to the end user that we have heard the indicator word.
In lines 24-28, we are converting recorded audio to plain text, and next, in
lines 31-33, we check if the converted text contains the wake word, which is,
as we said before, hey speaker. If a valid wake word is detected, we go
to the next phase (line 34), where we respond back to the user that we
are ready to accept the actual command.
Next, when we accept the actual command, we again try to decipher what
was said by the user, and if we can detect a proper command using
neuralintents, we respond with a random response defined in the intents JSON
file (lines 57-61).
Please check the following example to see how our program works in action:
1. (...)
2. Epoch 197/200
3. 3/3 [==============================] - 0s 463us/step - loss: 0.
1315 - accuracy: 1.0000
4. Epoch 198/200
5. 3/3 [==============================] - 0s 481us/step - loss: 0.
0915 - accuracy: 1.0000
6. Epoch 199/200
7. 3/3 [==============================] - 0s 481us/step - loss: 0.
1208 - accuracy: 1.0000
8. Epoch 200/200
9. 3/3 [==============================] - 0s 463us/step - loss: 0.
0790 - accuracy: 1.0000
10.
11. 2023-05-10 17:07:03,653 Start talking
12. 2023-05-08 17:07:08,774 Write output
13. 2023-05-08 17:07:08,777 Analyze text
14. 2023-05-08 17:07:10,145 Translated text: Hey speaker!
15. 2023-05-08 17:07:12,673 Start talking
16. 2023-05-08 17:07:17,794 Write output
17. 2023-05-08 17:07:17,797 Analyze text
18. 2023-05-08 17:07:18,333 Translated text: What time is it?
19. 1/1 [==============================] - 0s 38ms/step
Code 8.19
Another thing we must pay attention to is the fact that feeding and training
the neural system is done only once, when the application starts. You
have probably already noticed that after running our script you have files
created locally, like *.pkl and *.h5. These are the trained model and
vocabulary files for our English-language intents. In example Code 8.19, we
can see that once the application is done loading the data files, we are
interacting with our smart speaker by following the algorithm that has been
described before.
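Since training happens only at startup, we can make restarts faster by loading a previously saved model instead of retraining every time. This is a small sketch assuming the library exposes a load_model method, as the upstream neuralintents package does, and that the saved files from a previous run are still present:

import os

assistant = GenericAssistant('intents_speaker.json', model_name="test_model", intent_methods=mappings)
if os.path.exists("test_model.h5"):
    # reuse the model files (*.h5, *.pkl) produced by an earlier run
    assistant.load_model()
else:
    assistant.train_model()
    assistant.save_model()
smart_speaker.run(assistant)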
Connecting to third party services like music players
So far, we have learned how to integrate simple machine learning with voice
recognition. Now it is time to build something that connects with third
party services such as music streaming services. For the next example, we
are going to use the Spotify17 streaming service and the developer API that it
provides18.
1. First, we need to register an account on Spotify itself. Once that is done,
although not compulsory, in order to make our integration fully
work it is preferable to have a premium account. For details on how to
activate premium services on Spotify, please check their website on your
own. To make things clear, we are not promoting or selling Spotify
services in any way. We decided to use Spotify in the following
examples since it has a very well documented API that is easy to use.
2. Now that we have an account, it is time to create our first Spotify app.
Log in to the developer dashboard19 and create an application. Use any
name you prefer; let us use smart speaker in our case.
For the other necessary fields, please specify values like in the
following configuration figure:
Figure 8.2: Configuring Spotify application
Take note of the redirect URLs; these will be used by our application, so it is
important to configure them as presented in Figure 8.2. Other important
parameters, such as the Client ID and Client secret, are going to be
automatically generated by the Spotify system.
In the following example, we will test if our newly created application is
working:
1. curl -X POST "https://accounts.spotify.com/api/token" \
2. -H "Content-Type: application/x-www-form-urlencoded" \
3. -d "grant_type=client_credentials&client_id=
< your client id >&client_secret=< your client secret >"
4.
5. # response
6. {"access_token":"***","token_type":"Bearer","expires_in":3600}
Code 8.20
Sample Code 8.20 sends a request to the Spotify API to get an access
token. As you can see, the token is valid only for 3600 seconds. This
example only proves that our newly created application is ready to be
used in the next examples.
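The same check can be done from Python; here is a minimal sketch using the requests package known from the previous chapters (fill in your own client ID and secret):

import requests

# the same client credentials call as the curl example in Code 8.20
response = requests.post(
    "https://accounts.spotify.com/api/token",
    data={
        "grant_type": "client_credentials",
        "client_id": "<your client id>",
        "client_secret": "<your client secret>",
    },
)
print(response.json())  # contains access_token, token_type and expires_in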
Before we start rewriting our main application, we need to understand what
we are going to build to make this third-party integration possible. In the
following figure, you can check how we are going to organize the components
of our application into individual functional blocks:
Figure 8.3: Example flow of user interaction with smart speaker and Spotify service
As shown in Figure 8.3, we need to introduce a new API application in our
application schema. This is because Spotify only allows us to integrate with
their service via the HTTP(S) protocol, and authentication should be done via
OAuth20, for which we need a web browser. To make things cleaner and
introduce a better separation between the part of our application that handles
all the interactions with the user (the main app) and the part that is responsible
for playing music, we have a stand-alone API block for this. Before we start
building the API service, we must install a package for asynchronous21 HTTP
calls22:
1. $ pip install httpx==0.24.1
Code 8.21
Once we have installed the HTTP client library, let us take a quick look at how
this library works in the following example:
1. import httpx
2.
3. url = 'https://en.wikipedia.org/wiki/Python_(mythology)'
4.
5. response = httpx.get(url)
6.
7. print(response)
8. with open("/tmp/tmp_page.txt", 'wb') as w:
9. w.write(response.content)
Code 8.22
As you may have already noticed, the basic usage of the HTTPX library is
not very different from the requests23 package. It is worth mentioning that in
the following example we do the same thing, albeit with asynchronous calls
via asyncio:
1. import asyncio
2. import httpx
3.
4. url = 'https://en.wikipedia.org/wiki/Python_(mythology)'
5.
6. async def main():
7. async with httpx.AsyncClient() as client:
8. response = await client.get(url)
9. print(response)
10. with open("/tmp/tmp_page.txt", 'wb') as w:
11. w.write(response.content)
12.
13. asyncio.run(main())
Code 8.23
By running this example, we now know how to use an HTTP library to connect
to an external resource and fetch its content asynchronously. Now, we
need to install another asynchronous library, easy to use and very
efficient for building API systems, called FastAPI24:
1. $ pip install fastapi==0.96.0
Code 8.24
To understand how the FastAPI framework works, let us start with the
following example, where we create the first, main endpoint of our API
service in an example script api_step1.py:
1. from fastapi import FastAPI
2.
3. app = FastAPI()
4.
5.
6. @app.get("/")
7. async def main():
8. url = "http://foo.com/redirect"
9. return {"url": url}
Code 8.25
We have only one single API endpoint (line 6). However, when we try to
execute our script like in the following example to see if the API works,
nothing happens:
The reason why executing our script from example Code 8.25 is not
effective is that FastAPI is a web API framework, but the transport
layer has to be delivered by a web server. In our case, we will use the WSGI
HTTP server library called Gunicorn25. First, we have to install it like in the
following example:
1. $ pip install gunicorn==20.1.0
Code 8.27
Once we have it installed, we can run our example script from Code 8.25 by
using a proper WSGI server for serving our API. Please check the
following example for a demonstration:
1. $ gunicorn -k uvicorn.workers.UvicornWorker api_step1:app --reload -
b localhost:8888
Code 8.28
This command starts our service, so we can now test if it is working by
executing the following command:
1. $ curl -v http://localhost:8888/
2.
3. * Trying 127.0.0.1:8888...
4. * Connected to 127.0.0.1 (127.0.0.1) port 8888 (#0)
5. > GET / HTTP/1.1
6. > Host: 127.0.0.1:8888
7. > User-Agent: curl/7.88.1
8. > Accept: */*
9. >
10. < HTTP/1.1 200 OK
11. < date: Sat, 17 Jun 2023 19:24:02 GMT
12. < server: uvicorn
13. < content-length: 33
14. < content-type: application/json
15. <
16. * Connection #0 to host 127.0.0.1 left intact
17.
18. {"url":"http://foo.com/redirect"}
Code 8.29
We can see in Code 8.29 that it is finally possible to retrieve an API response.
In our case, we return a simple JSON response (line 18). As we mentioned
before, we need to build an API that can support OAuth with the Spotify
system, and as a result of successful authentication, we can get the access
token. We demonstrated a simplified version of that flow before as basic
token authentication (Code 8.20), yet not with OAuth. In the
following example, we use the full OAuth flow to get the access token.
Take note that we are using the config parser to load the Spotify credentials.
In the following example, we are going to use a credentials file
(api_config.ini); it holds the client ID and secret that we used in Code 8.20:
1. [spotify]
2. client_id = <your client ID>
3. client_secret = <your client secret>
Code 8.30
Before we proceed with the following examples, we must install a Spotify
module that is going to help us properly follow OAuth26:
1. $ pip install git+https://github.com/darkman66/spotify.py.git
Code 8.31
Once we have the configuration file, let us try to utilize it in the following
example where we load credentials and proceed with OAuth flow:
1. import configparser
2. import spotify
3. from fastapi import FastAPI
4. from fastapi.responses import RedirectResponse
5. from typing import Tuple
6.
7. config = configparser.ConfigParser()
8. config.sections()
9. config.read('api_config.ini')
10.
11. SPOTIFY_CLIENT_ID = config.get('spotify', 'client_id')
12. SPOTIFY_CLIENT_SECRET = config.get('spotify', 'client_secret')
13. REDIRECT_URI: str = 'http://localhost:8888/spotify/callback'
14. SPOTIFY_CLIENT = spotify.Client(SPOTIFY_CLIENT_ID, SPOTIF
Y_CLIENT_SECRET)
15. OAUTH2_SCOPES: Tuple[str] = ('user-modify-playback-state', 'user-
read-currently-playing', 'user-read-playback-state')
16. OAUTH2: spotify.OAuth2 = spotify.OAuth2(SPOTIFY_CLIENT.id, R
EDIRECT_URI, scopes=OAUTH2_SCOPES)
17. AUTH_TOKEN = None
18.
19. app = FastAPI()
20.
21.
22. @app.get("/")
23. async def main():
24. url = None
25. if not AUTH_TOKEN:
26. url = OAUTH2.url
27. return RedirectResponse(url, status_code=302)
28. return {"url": url}
Code 8.32
We have imported the config parser (line 1), loaded our configuration file
and used it to define the Spotify credentials as static values (lines 7-12)
taken from the configuration. Once we have the credentials sorted, we define
the callback URL used after a successful authentication on Spotify's side.
Remember, we have defined that callback URL in our Spotify application
configuration (Figure 8.2).
Next, we need to initialize the Spotify client instance (line 14) and use it in
the constructor of the OAuth2 client (line 16). What you need to notice is
line 15. We define here what scope of private data we want to get access to27.
It is important to define a proper scope of privileges, so that we get an
authentication token that allows us to access the data that we want to read.
Spotify, based on the scope, will give us access to only the kind of data that
we requested.
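The OAUTH2.url property from line 16 assembles the Spotify authorization URL out of these values. It should look roughly like the following (wrapped for readability, parameters abbreviated):

https://accounts.spotify.com/authorize?client_id=<your client id>
    &redirect_uri=http%3A%2F%2Flocalhost%3A8888%2Fspotify%2Fcallback
    &response_type=code
    &scope=user-modify-playback-state+user-read-currently-playing+...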
Another thing to remember is that we are doing authentication via OAuth.
To make the authentication flow work, we have to open the URL
http://localhost:8888/ in the browser, so we are redirected to Spotify to
accept the scope of data that our API wants to read from your Spotify
account. Once you accept it, you are going to be redirected to the defined
redirect URL (Figure 8.2).
In the following example, we are authenticating a user account against the
Spotify system, and once the authentication succeeds, we search
for a music title called drake:
1. import configparser
2. import spotify
3. from fastapi import FastAPI
4. from fastapi.responses import RedirectResponse
5. from typing import Tuple
6.
7. config = configparser.ConfigParser()
8. config.sections()
9. config.read('api_config.ini')
10.
11. SPOTIFY_CLIENT_ID = config.get('spotify', 'client_id')
12. SPOTIFY_CLIENT_SECRET = config.get('spotify', 'client_secret')
13. REDIRECT_URI: str = 'http://localhost:8888/spotify/callback'
14. SPOTIFY_CLIENT = spotify.Client(SPOTIFY_CLIENT_ID, SPOTIF
Y_CLIENT_SECRET)
15. OAUTH2_SCOPES: Tuple[str] = ('user-modify-playback-state', 'user-
read-currently-playing', 'user-read-playback-state', 'app-remote-control')
16. OAUTH2: spotify.OAuth2 = spotify.OAuth2(SPOTIFY_CLIENT.id, R
EDIRECT_URI, scopes=OAUTH2_SCOPES)
17. AUTH_TOKEN = None
18.
19. app = FastAPI()
20.
21.
22. @app.get("/")
23. async def main():
24. url = None
25. if not AUTH_TOKEN:
26. url = OAUTH2.url
27. return RedirectResponse(url, status_code=302)
28. return {"url": url}
29.
30.
31. @app.get('/spotify/callback')
32. async def spotify_callback(code: str):
33. return_url = None
34. try:
35. AUTH_TOKEN = code
36. except KeyError:
37. return {"ready": False}
38. else:
39. print(f"Authentiicaton token: {AUTH_TOKEN}")
40. async with spotify.Client(SPOTIFY_CLIENT_ID, SPOTIFY_CLI
ENT_SECRET) as client:
41. try:
42. response = await spotify.User.from_code(client, code, redirect
_uri=REDIRECT_URI)
43. user = await response
44. results = await client.search('drake')
45. print(results.tracks)
46. if results.tracks and len(results.tracks) > 0:
47. return_url = results.tracks[0].url
48. except spotify.errors.HTTPException as e:
49. print('Token expired?')
50. if 'expired' in str(e).lower() or 'invalid' in str(e).lower():
51. print('redirect-'*5)
52. return RedirectResponse('/', status_code=302)
53.
54. return {"url": return_url}
Code 8.33
To start the example from Code 8.33 we must use the Gunicorn WSGI server
like in the following example:
1. $ gunicorn -k uvicorn.workers.UvicornWorker api_step3:app --reload -
b localhost:8888
Code 8.34
Once you open our API main URL in the browser, you should see a response
like in the following example:
1. {"url":"https://open.spotify.com/track/7aRCf5cLOFN1U7kvtChY1G"}
Code 8.35
It works! This is good news. We have successfully managed to authenticate
our application and use the Spotify system to find music records for us.
Let us take a closer look at Code 8.33. You can notice that the method
spotify_callback gets an argument code. In the FastAPI framework, this kind
of function definition means that the argument is going to be read from the
query parameters. Let us check how this code is used in HTTP syntax.
1. http://localhost:8888/spotify/callback?code=<..authentication code..>
Code 8.36
1. In example Code 8.36, we have shown the URL of the callback that the
Spotify system is going to redirect the user to after successful
authentication. It is easy to notice that the URL has a query parameter
code. This is the same argument as in the function spotify_callback, as
already mentioned.
2. The next step is to create a new client instance (line 40) that we are going
to use in line 42, where we try to get the user ID by using the OAuth code
that we have as a result of the callback from Spotify.
3. In the next phase, we call the Spotify API trying to find a specific
music track. In line 44, we look for the word drake, and from the
returned results we try to get the URL of the very first record returned by
Spotify.
So far, we did everything that was described in a single flow. This is not real
API logic. What we mean is: we want to authenticate the API against Spotify
only when our app starts. Once our API is authenticated, we should be able
to search via our API for any kind of music record in the Spotify library. We
should also be able to ask, via our API, to play the requested music track.
For the mentioned requirements, let us look at the following example to see
how we can modify our Code 8.33 to serve our needs:
1. import configparser
2. import os
3. import spotify
4. from fastapi import FastAPI
5. from fastapi.responses import RedirectResponse
6. from pydantic import BaseModel
7. from typing import Tuple
8.
9. config = configparser.ConfigParser()
10. config.sections()
11. config.read("api_config.ini")
12.
13. SPOTIFY_CLIENT_ID = config.get("spotify", "client_id")
14. SPOTIFY_CLIENT_SECRET = config.get("spotify", "client_secret")
15. REDIRECT_URI: str = "http://localhost:8888/spotify/callback"
16. SPOTIFY_CLIENT = spotify.Client(SPOTIFY_CLIENT_ID, SPOTIF
Y_CLIENT_SECRET)
17. OAUTH2_SCOPES: Tuple[str] = (
18. "user-modify-playback-state",
19. "user-read-currently-playing",
20. "user-read-playback-state",
21. "app-remote-control",
22. )
23. OAUTH2: spotify.OAuth2 = spotify.OAuth2(SPOTIFY_CLIENT.id, R
EDIRECT_URI, scopes=OAUTH2_SCOPES)
24. TOKEN_FILE = '/tmp/token.dat'
25.
26. class Item(BaseModel):
27. phrase: str
28.
29. app = FastAPI()
30.
31. async def token_set(auth_code: str):
32. with open(TOKEN_FILE, 'w') as f:
33. f.write(auth_code)
34.
35. async def token():
36. if os.path.exists(TOKEN_FILE):
37. with open(TOKEN_FILE, 'r') as f:
38. return f.read().strip()
39.
40. @app.get("/")
41. async def main():
42. url = None
43. if not await token():
44. url = OAUTH2.url
45. return RedirectResponse(url, status_code=302)
46. return {"url": url}
47.
48.
49. @app.post("/search/")
50. async def spotify_search(item: Item):
51. async with spotify.Client(SPOTIFY_CLIENT_ID, SPOTIFY_CLIEN
T_SECRET) as client:
52. results = await client.search(item.phrase)
53. if results.tracks and len(results.tracks) > 0:
54. track_url = results.tracks[0].url
55. return {"track_url": track_url}
56.
57.
58. @app.get("/spotify/callback")
59. async def spotify_callback(code: str):
60. success = False
61. try:
62. await token_set(code)
63. except KeyError:
64. return {"ready": False}
65. else:
66. print(f"Authentiicaton token: {code}")
67. async with spotify.Client(SPOTIFY_CLIENT_ID, SPOTIFY_CLI
ENT_SECRET) as client:
68. try:
69. response = await spotify.User.from_code(client, code, redirect
_uri=REDIRECT_URI)
70. user = await response
71. print(f"Managed to collect user data: {user}")
72. return RedirectResponse("/", status_code=302)
73. except spotify.errors.HTTPException as e:
74. print("Token expired?")
75. if "expired" in str(e).lower() or "invalid" in str(e).lower():
76. print("redirect-" * 5)
77. return RedirectResponse("/", status_code=302)
Code 8.37
Firstly, we modified the way we store the authentication code received from the callback URL (lines 31-38). Since Gunicorn is a pre-forking WSGI server that runs multiple workers, we cannot store the authentication code in a single global variable like we did in the previous example (Code 8.29, line 35). In this case, we will use a simple yet powerful solution: we store the authentication code in a flat file (line 62). When we want to use the authentication code, we can just read it from the file (line 43).
We also have a simple mechanism to check if the returned authentication code has expired or is not valid (lines 73-77). In that case, we return the user (in the browser) to the OAuth login screen to refresh the authentication code. You might notice that we have added a new method, search (lines 50-55). That method is a POST method (line 49). It takes only a single argument for the call (line 50), named item. The argument is expected to be a deserialized JSON28 structure. For the deserialization, we use the data serialization framework Pydantic29. To visualize how to call the search method, please check the following example:
1. $ curl -v -X POST http://localhost:8888/search/ -H 'content-
type: application/json' -d '{"phrase" : "linking park"}'
Code 8.38
As you can see, the parameter -d (data) in the curl command specifies the JSON payload that we send to our API, and as a response we get something like the following example:
1. {"track_url":"https://open.spotify.com/track/60a0Rd6pjrkxjPbaKzXjfq"
}
Code 8.39
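Under the hood, FastAPI uses the Item model from Code 8.37 to validate this payload. As a rough standalone sketch (outside of FastAPI, using only Pydantic and the standard library), the deserialization boils down to something like the following:
import json
from pydantic import BaseModel

class Item(BaseModel):
    phrase: str

payload = '{"phrase": "linking park"}'
item = Item(**json.loads(payload))  # raises a ValidationError on malformed input
print(item.phrase)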
We now have the basic API functionality working, together with authentication against Spotify to search music records for a given phrase. What we want to achieve next is functionality where we can send a query to our API to find interesting music and, upon finding results, ask our API to play it. To do that properly, it is a good idea to return the track ID in our API response along with the full track URL. We will modify the spotify_search method to look like the following example:
1. @app.post("/search/")
2. async def spotify_search(item: Item):
3. async with spotify.Client(SPOTIFY_CLIENT_ID, SPOTIFY_CLIEN
T_SECRET) as client:
4. results = await client.search(item.phrase)
5. if results.tracks and len(results.tracks) > 0:
6. track_url = results.tracks[0].url
7. track_id = track_url.split('/')[-1]
8. return {
9. "track_url": track_url,
10. "ID": track_id
11. }
Code 8.40
To return the track ID, we used a simple trick (the string split method). With that, we extracted the last part of the full URL. We also keep the full URL in the returned payload, since we are going to use it later.
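For instance, applied in the Python shell to the track URL returned in Code 8.39, the extraction works like this:
>>> url = "https://open.spotify.com/track/60a0Rd6pjrkxjPbaKzXjfq"
>>> url.split('/')[-1]
'60a0Rd6pjrkxjPbaKzXjfq'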
In the following example, we are adding the play method to our API. With this method and a given track ID, we can play the requested music record:
1. @app.get("/play/{track_id}")
2. async def spotify_playback(track_id: str):
3. code = await token()
4. async with spotify.Client(SPOTIFY_CLIENT_ID, SPOTIFY_CLIE
NT_SECRET) as client:
5. response = await spotify.User.from_code(client, code, redirect_uri
=REDIRECT_URI)
6. user = await response
7. devices = await user.get_devices()
8. device_id = devices[0].id
9. p = await user.get_player()
10. play_url = f"https://open.spotify.com/track/{track_id}"
11. await p.play(play_url, device_id)
Code 8.41
To call our API so that it starts playing music records, we need to call the API as shown in the following example:
1. curl -v -X POST http://localhost:8888/play/60a0Rd6pjrkxjPbaKzXjfq -
H 'content-type: application/json'
Code 8.42
If you do not have a premium account (paid subscription), you will get an error message displaying Forbidden (status code: 403): Player command failed: Premium required. This means that with a free account we cannot play music. The other limitation is the fact that we are getting the device ID (Code 8.41, line 8) by assuming that device element zero in the returned list is the player in the browser. To make it work, you need to open the Spotify player in your default web browser30.
The third limitation of this approach is that the Python Spotify library assumes that the token (Code 8.41, line 5) is the one coming from the OAuth callback URL. That is correct, but Spotify only allows us to use that code once. Hence, the spotify_playback function cannot use this technique. Let us try to use the official Spotify API approach with the support of the HTTPX library that we installed before (Code 8.21).
1. import httpx
2.
3. @app.post("/play/{track_id}")
4. async def spotify_play(track_id: str):
5.     async with httpx.AsyncClient() as client:
6.         headers = {"Content-Type": "application/x-www-form-urlencoded"}
7.         data = f"grant_type=client_credentials&client_id={SPOTIFY_CLIENT_ID}&client_secret={SPOTIFY_CLIENT_SECRET}"
8.         response = await client.post("https://accounts.spotify.com/api/token", headers=headers, data=data)
9.         access_token = response.json()['access_token']
10.
11.         headers = {"Content-Type": "application/json", "Authorization": f"Bearer {access_token}"}
12.         data = {'context_uri': f"spotify:track:{track_id}"}
13.         response = await client.put("https://api.spotify.com/v1/me/player/play", headers=headers, json=data)
14.         print(response.content)
Code 8.43
You can see that the approach shown in Code 8.43 is a very low-level way of communicating with Spotify. We do not use a framework for this; instead we use direct HTTP access and API calls to Spotify. In lines 5-9, we get the access token by using the client ID and client secret. Once we have the token, we can start making API calls (lines 11-14).
Of course, this way of requesting playback from Spotify still does not avoid the need for a Spotify subscription plan. The simplest refactoring we can do here is to open the music record URL in the browser from our API, like in the following example:
1. import webbrowser
2.
3. @app.post("/play/{track_id}")
4. async def spotify_play(track_id: str):
5. play_url = f"https://open.spotify.com/track/{track_id}"
6. webbrowser.open(play_url)
Code 8.44
Building physical devices
So far, we have discussed using computers as the platform for voice recognition and integration. A computer is definitely big and bulky, albeit there is another way. We could use a microcomputer, the Raspberry Pi31. It is a super small, light yet very powerful device that can be operated from a power bank (battery).
1. To make our software work, we need to first install an operating system on it.
2. We are going to use the official Raspberry Pi OS32. Once we have it installed, we must install Python and all the dependencies mentioned in this chapter.
3. The next thing we will need is a USB microphone. Another piece of equipment we need is a USB speaker, since this tiny computer does not have one.
To play music with our API on a microcomputer closed in a small box with no monitor, we can use a browser trick: run the Chrome browser in headless mode. This means the browser starts as normal but does not open any windows, running purely from the command line. The following example code shows how we can achieve this:
1. from selenium import webdriver
2. from selenium.webdriver.chrome.options import Options
3.
4. options = Options()
5. options.headless = True
6.
7. driver = webdriver.Chrome(options=options)
8. driver.get('https://www.wikipedia.org')
9.
10. # close once finished
11. driver.close()
Code 8.45
We have only scratched the surface of this topic, since this chapter is only meant to show readers how to extend our smart speaker idea. There are still some challenges and places to improve our general playback approach, for instance supporting more integrations or providing configuration via a website; there are lots of options to address.
Conclusion
In this chapter, we learned how to use Python for voice recognition. Next, we covered how to analyze what is being said by converting speech to raw text; thus, we now know how to integrate such a powerful technique with third-party software. With Python, you can build unlimited voice recognition integrations with smart home solutions like smart lamps, controlling the water in gardens, cameras, and many more.
In the next chapter, we are going to learn how we can use Python to build music and video downloading software.
1. https://cloud.google.com/ai-
platform/training/docs/algorithms/xgboost
2. https://cmusphinx.github.io/wiki/tutorialam/
3. https://github.com/spatialaudio/python-sounddevice
4. https://scipy.org
5. https://www.python.org/downloads/
6. https://en.wikipedia.org/wiki/44,100_Hz
7. https://github.com/openai/whisper
8. https://aws.amazon.com/polly/
9. https://cloud.google.com/text-to-speech/
10. https://docs.python.org/3/library/tempfile.html
11. https://ss64.com/osx/say.html
12. https://manpages.ubuntu.com/manpages/trusty/man1/say.1.html
13. https://openai.com
14. https://cloud.google.com/dialogflow/es/docs/basics
15. https://pypi.org/project/neuralintents/
16. https://www.tensorflow.org
17. https://www.spotify.com
18. https://developer.spotify.com
19. https://developer.spotify.com/dashboard
20. https://en.wikipedia.org/wiki/OAuth
21. https://docs.python.org/3/library/asyncio.html
22. https://www.python-httpx.org
23. https://docs.python-requests.org/en/latest/index.html
24. https://fastapi.tiangolo.com
25. https://gunicorn.org
26. https://developer.spotify.com/documentation/web-
api/concepts/authorization
27. https://developer.spotify.com/documentation/web-
api/concepts/scopes
28. https://www.w3schools.com/whatis/whatis_json.asp
29. https://pypi.org/project/pydantic/
30. https://open.spotify.com/
31. https://www.raspberrypi.com
32. https://www.raspberrypi.com/documentation/computers/os.html
Join our book’s Discord space
Join the book's Discord Workspace for Latest updates, Offers, Tech
happenings around the world, New Release and Sessions with the Authors:
https://discord.bpbonline.com
CHAPTER 9
Make a Music and Video
Downloader
Introduction
On the internet, when we try to download files, we might face some technical challenges. We know that the connection between our computer and the server can be dropped, and we expect that the connection will be automatically re-established.
Another challenge is that some servers can throttle the connection, so we can only download certain assets at a highly limited speed (Fair Use Policy1). Speaking of limiting connection speed, there can be a case where we want to download a lot of files and need to apply a fair usage policy on our end. There can also be a need to start downloading files from a list and limit the download speed at the same time, so as not to disturb our internet connection and daily work.
In this chapter, we are going to learn how to download web resources with Python. We are not only going to learn how to download any resource; we are also about to build a YouTube video downloading tool.
Structure
In this chapter, we will discuss the following topics:
Understanding API concept
Building YouTube API client
Organizing downloaded data
Support for different formats and resolutions
Building batch data downloader
Objectives
By the end of this chapter, you will know how to build a download manager that will help us download video files from a popular video hosting platform. We will learn how an external API system works and how to take advantage of it to download assets. We will also learn how to fetch binary data from a web server. We will implement all of these skills using the Python language.
Download manager
As mentioned in the chapter objectives, we will build some tools that will allow us to download assets, like images, from a given source. To make it efficient, we will use the async programming technique. This is going to help us avoid locking pieces of code when accessing blocking content like internet assets. Before we can dive into the first example, we shall install the required libraries:
1. $ pip install asyncio==3.4.3
2. $ pip install httpx==0.24.1
3. $ pip install click==8.1.3
Code 9.1
Once we have installed the required packages, we will start writing a simple script that allows us to fetch an asset given as a command-line argument. Let us check the following example to understand how to achieve this:
1. import click
2. import asyncio
3. import httpx
4. import os
5. from urllib.parse import urlparse
6.
7.
8. async def main(url):
9. async with httpx.AsyncClient() as client:
10. response = await client.get(url, follow_redirects=True)
11.
12. if response.status_code == 200:
13. u = urlparse(url)
14. file_name = os.path.basename(u.path)
15. with open(f'/tmp/{file_name}', 'wb') as f:
16. f.write(response.content)
17.
18.
19. @click.command()
20. @click.option("--
url", help="File URL path to download ", required=True)
21. def run(url):
22. asyncio.run(main(url))
23.
24. if __name__ == '__main__':
25. run()
Code 9.2
You can see that in the example Code 9.2, we used some already known libraries from previous chapters, like click, httpx, or asyncio. In this case, we have built a simple script with command-line support. We can specify the URL of the resource that we want to download (line 20).
When a resource is downloaded, we strip out the URL path (lines 13-14) and save the resource in the /tmp folder under its original name.
This simple script will be a starting point to build a more advanced download manager.
In the next step, we will add an option to download files from a given list instead of a single provided URL. That said, in the following code we can see how we can achieve downloading files from a provided list of URLs:
1. import click
2. import asyncio
3. import httpx
4. import os
5. from urllib.parse import urlparse
6.
7.
8. async def download(url):
9. async with httpx.AsyncClient() as client:
10. print(f"Fetching: {url}")
11. response = await client.get(url, follow_redirects=True)
12. if response.status_code == 200:
13. u = urlparse(url)
14. file_name = os.path.basename(u.path)
15. with open(f"/tmp/{file_name}", "wb") as f:
16. f.write(response.content)
17.
18.
19. async def download_list(urls_file):
20. with open(urls_file, "r") as f:
21. for item in f:
22. await download(item.strip())
23.
24.
25. async def main(url=None, url_list=None):
26. if url:
27. return await download(url)
28. if url_list:
29. print("Running downloader for given list of URLs")
30. return await download_list(url_list)
31.
32.
33. @click.command()
34. @click.option("--url", help="File URL path to download")
35. @click.option("--url-list", help="File with URLs to download")
36. def run(url, url_list):
37. asyncio.run(main(url, url_list))
38.
39.
40. if __name__ == "__main__":
41. run()
Code 9.3
As you can see, we have slightly modified the starting function (lines 33-36), so we can accept an additional parameter: a file path (url-list) pointing to a list of URLs to fetch.
To demonstrate how to use this new parameter, let us create a sample file called example_files_list.txt that contains a list of URLs to fetch:
1. https://www.wikipedia.org/portal/wikipedia.org/assets/img/sprite-
8bb90067.svg
2. https://www.wikipedia.org/portal/wikipedia.org/assets/js/index-
86c7e2579d.js
3. https://www.wikipedia.org/portal/wikipedia.org/assets/img/Wikinews-
logo_sister.png
4. https://www.wikipedia.org/portal/wikipedia.org/assets/js/gt-ie9-
ce3fe8e88d.js
5. https://www.wikipedia.org/portal/wikipedia.org/assets/img/Wikipedia-
logo-v2.png
Code 9.4
Now, as shown in Code 9.3, we can use a specified file that contains a list of URLs to fetch. We will use the file from Code 9.4 and get the content of the listed URLs. Let us check in the following code how to specify the file containing the list of URLs:
1. python download_manager2.py --url-list example_files_list.txt
2.
3. Running downloader for given list of URLs
4. Fetching: https://www.wikipedia.org/portal/wikipedia.org/assets/img/spr
ite-8bb90067.svg
5. Fetching: https://www.wikipedia.org/portal/wikipedia.org/assets/js/i
ndex-86c7e2579d.js
6. Fetching: https://www.wikipedia.org/portal/wikipedia.org/assets/img/Wi
kinews-logo_sister.png
7. (...)
8. httpcore.ConnectTimeout
9.
10. The above exception was the direct cause of the following exception:
11. (...)
12. File "/Users/darkman66/.virtualenvs/fun2/lib/python3.11/site-
packages/httpx/_transports/default.py", line 77, in map_httpcore_except
ions
13. raise mapped_exc(message) from exc
14. httpx.ConnectTimeout
Code 9.5
We were running our example code until it crashed (Code 9.5, lines 9-13). The reason it happened is clearly stated in line 14: we stumbled upon a timeout issue when trying to access the website whose content we tried to download. If we check the way our script works (Code 9.3, lines 21-22), this kind of exception is undesired. If it happens, our loop (Code 9.3, line 21) will stop, and we will never finish processing the URL links from the file. This is not correct: the correct approach is to get all the files from the given list, and even if there is a crash, we should be able to recover and continue with the download.
Before implementing the fixed code, we need to install a new library, tenacity, which will help us build retry techniques for async functions.
Code 9.6
Let us consider the following example, Code 9.7, on how to fix the case when there is a crash or a networking issue. We can introduce a retry pattern for fetching external content whenever we face the mentioned timeout issues while downloading a specified web resource. Once the library is installed, we can apply the retry mechanism. Let us check Code 9.7 to understand it better.
1. import click
2. import asyncio
3. import httpx
4. import os
5. from urllib.parse import urlparse
6. from tenacity import AsyncRetrying, RetryError, stop_after_attempt
7.
8. RETRIES = 3
9.
10.
11. async def download(url):
12. """Fetch URL resource with retry"""
13. try:
14. async for attempt in AsyncRetrying(stop=stop_after_attempt(RET
RIES)):
15. with attempt:
16. click.echo(f"Fetching: {url}")
17. async with httpx.AsyncClient() as client:
18. response = await client.get(url, follow_redirects=True)
19. if response.status_code == 200:
20. u = urlparse(url)
21. file_name = os.path.basename(u.path)
22. with open(f"/tmp/{file_name}", "wb") as f:
23. f.write(response.content)
24. except RetryError:
25. click.echo(f"Failed to fetch {url} after {RETRIES} tries")
26.
27.
28.
29. async def download_list(urls_file):
30. with open(urls_file, "r") as f:
31. for item in f:
32. if item and 'http' in item:
33. await download(item.strip())
34.
35.
36. async def main(url=None, url_list=None):
37. if url:
38. return await download(url)
39. if url_list:
40. click.echo("Running downloader for given list of URLs")
41. return await download_list(url_list)
42.
43.
44. @click.command()
45. @click.option("--url", help="File URL path to download")
46. @click.option("--url-list", help="File with URLs to download")
47. def run(url, url_list):
48. asyncio.run(main(url, url_list))
49.
50.
51. if __name__ == "__main__":
52. run()
Code 9.7
As you might have noticed, to solve the mentioned problems regarding asset timeouts and temporarily inaccessible web resources, we first added the try/except block (lines 13-25) to catch exceptions when there is an issue with fetching a network resource.
Next, we added in line 14 a piece of code that will help us retry when there is an exception. The number of retries that we will perform is defined as a static variable (line 8).
The rest of the code (lines 16-23) is the same as before (Code 9.3, lines 9-16), except that we have wrapped it all in the retry context (Code 9.7, line 15). This is a common technique to deal with failing code by doing micro restarts.
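To see the retry machinery in isolation, here is a minimal standalone sketch of the same tenacity pattern; the flaky coroutine is hypothetical and merely simulates an unreliable network call:
import asyncio
import random
from tenacity import AsyncRetrying, RetryError, stop_after_attempt

async def flaky():
    # hypothetical stand-in for a network call that fails most of the time
    if random.random() < 0.7:
        raise ConnectionError("simulated network issue")
    return "ok"

async def main():
    try:
        async for attempt in AsyncRetrying(stop=stop_after_attempt(3)):
            with attempt:
                print(await flaky())
    except RetryError:
        print("still failing after 3 attempts")

asyncio.run(main())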
Let us now understand how to address a bottleneck in our code where the performance of fetching resources is limited.
So far, what we have learned will help us download network resources from a given list; albeit it has a common issue which is, at the same time, a very strong disadvantage of our solution. We fetch those assets literally one by one: if one resource takes longer, the next one will wait to be fetched until the blocking one finishes, for example a slow one, or a problematic resource where we had to retry multiple times.
To address this bottleneck in our code, we must start fetching network resources in parallel. Let us look at the following code to understand how we can modify the existing Code 9.7 to use the parallel downloads approach.
We have modified our already known method download_list from the example Code 9.7 in such a way that in Code 9.8, in the loop (lines 5-6), we create a list of all the download calls to be made (coroutines).
1. async def download_list(urls_file):
2. calls_to_make = []
3. with open(urls_file, "r") as f:
4. for item in f:
5. if item and "http" in item:
6. calls_to_make.append(download(item.strip()))
7. click.echo(f"Number of URLs to fetch {len(calls_to_make)}")
8. await asyncio.gather(*calls_to_make)
Code 9.8
Next, the asyncio engine makes all these async calls and waits for them to be finished (line 8).
This way of solving concurrency is going to work very efficiently for us; albeit we need to be aware of one issue. When we feed our script many URLs to parse and fetch, we speed up the whole processing by running all the download actions as concurrent coroutines, but there is a problem with processing many hundreds of URLs. Our whole application runs on a single CPU core, so if we spawn a lot of concurrent coroutines, we expose our script to a situation where the OS and CPU cannot process all the requested URLs in real time. This is related to the limitations of the system kernel and the number of simultaneous sockets that the OS can process.
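Before reaching for a dedicated pooling package, it is worth noting that the standard library alone can cap the number of in-flight coroutines. The following minimal sketch (with a hypothetical fetch placeholder instead of our real download function) uses asyncio.Semaphore to allow at most two downloads at a time:
import asyncio

async def fetch(sem, url):
    async with sem:  # only two coroutines may pass this point at once
        print(f"fetching {url}")
        await asyncio.sleep(1)  # simulated network I/O

async def main(urls):
    sem = asyncio.Semaphore(2)  # cap concurrency at 2
    await asyncio.gather(*(fetch(sem, u) for u in urls))

asyncio.run(main([f"http://example.com/{i}" for i in range(6)]))
The pooling library we install below wraps the same idea in a more convenient interface.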
The other so-called bottleneck that we will face is the limitation websites place on the number of parallel requests that a web server accepts from the same source IP address. In other words, modern websites apply protection against DDOS attacks2, which means that they do not tolerate the kind of aggressive content fetching our script can produce. Let us learn how to address this problem by using the following steps:
In the following example, let us try to address the main concern, that is, limiting the number of concurrent coroutines that our application is running. To improve the code from example Code 9.8, we will install asyncio pooling3.
1. $ pip install asyncio_pool
Code 9.9
Once we have installed the needed pooling library, we can improve Code 9.8 as shown in the following example:
1. import click
2. import asyncio
3. import httpx
4. import os
5. from asyncio_pool import AioPool
6. from urllib.parse import urlparse
7. from tenacity import AsyncRetrying, RetryError, stop_after_attempt
8.
9. RETRIES = 3
10. CONCURRENCY_SIZE=2
11.
12.
13. async def download(url):
14. """Fetch URL resource with retry"""
15. try:
16. async for attempt in AsyncRetrying(stop=stop_after_attempt(RET
RIES)):
17. with attempt:
18. click.echo(f"Fetching: {url}")
19. async with httpx.AsyncClient() as client:
20. response = await client.get(url, follow_redirects=True)
21. if response.status_code == 200:
22. u = urlparse(url)
23. file_name = os.path.basename(u.path)
24. with open(f"/tmp/{file_name}", "wb") as f:
25. f.write(response.content)
26. except RetryError:
27. click.echo(f"Failed to fetch {url} after {RETRIES} tries")
28.
29.
30. async def download_list(urls_file):
31. calls = []
32. with open(urls_file, "r") as f:
33. async with AioPool(size=CONCURRENCY_SIZE) as pool:
34. for item in f:
35. if item and "http" in item:
36. result = await pool.spawn(download(item.strip()))
37. calls.append(result)
38. click.echo(f"Commited {len(calls)} URLs to call")
39.
40. for call_item in calls:
41. call_item.result()
42.
43.
44. async def main(url=None, url_list=None):
45. if url:
46. return await download(url)
47. if url_list:
48. click.echo("Running downloader for given list of URLs")
49. return await download_list(url_list)
50.
51.
52. @click.command()
53. @click.option("--url", help="File URL path to download")
54. @click.option("--url-list", help="File with URLs to download")
55. def run(url, url_list):
56. asyncio.run(main(url, url_list))
57.
58.
59. if __name__ == "__main__":
60. run()
Code 9.10
We can see that the modified Code 9.10 is mostly like Code 9.7, except for the essence of the change. We updated lines 33-37, where the loop (line 34) that was calling the download function linearly now uses the connection pool module (line 36).
The way it works is that the pooling system controls how many coroutines the main asyncio reactor can process simultaneously. The number of parallel download coroutines that we evaluate at the same time is declared in line 10.
Additionally, in lines 40-41, we are collecting results from calling all the download coroutines. In our case, the download function does not return any result; albeit the result collecting technique acts like a safety stop in our code, for the case where we must wait until all the routines are done and finished with processing.
The next step is to address the problem that we highlighted before, that is, the case where the destination website is using protection against DDOS.
In the following examples, we will build a web proxy service that is going to help us address the DDOS issue. Let us analyze the following figure to see how we will build our proxy network:
Figure 9.1: Concept of proxy service
As shown in Figure 9.1, we will create a small service (proxy server) that will be installed on at least two different machines. This will be an advantage, as every request we send to the website is going to be seen as coming from two different IP addresses.
As you can probably already imagine, the more IP addresses (servers to use) you have, the better it gets. The chances of being detected by the destination website get lower if we have a big pool of IP addresses to use.
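The rotation logic itself can be as simple as picking a random entry from the pool for every request; the following sketch uses hypothetical proxy addresses and shows the same pattern we will apply later in Code 9.14:
import random

# hypothetical pool of proxy servers under our control
proxies = ["http://203.0.113.10:9097", "http://198.51.100.7:9097"]

for request_no in range(4):
    proxy = random.choice(proxies)  # each request may leave via a different IP
    print(f"request {request_no} via {proxy}")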
In the following example, we will check how to build a simple proxy service by using standard Python modules45:
1. import socketserver
2. from urllib.request import urlopen
3. from http.server import SimpleHTTPRequestHandler
4.
5. PORT = 9097
6. HOST = 'localhost'
7.
8. class MyProxy(SimpleHTTPRequestHandler):
9. def do_GET(self):
10. url = self.path
11. print(f"Opening URL: {url}")
12. self.send_response(200)
13. self.end_headers()
14. self.copyfile(urlopen(url), self.wfile)
15.
16.
17. with socketserver.TCPServer((HOST, PORT), MyProxy) as server:
18. print(f"Now serving at {PORT}")
19. server.serve_forever()
Code 9.11
1. In the above example, Code 9.11, we are using the simple HTTP request handler, which allows us to catch the GET call coming from a request and make a new call to the destination server (lines 12-14).
2. We use the copyfile function (line 14) to internally copy the destination server's response into the response object that we send back to our original script. In order for it to serve as a proxy server, we must start the HTTP service (lines 17-19) and listen on the port and address defined in lines 5-6.
3. Before we can test whether our newly built proxy service is working or not, we must save example Code 9.11 as proxy_service.py and start it.
4. Now, let us try to test our proxy service in the following example by using the simple CLI tool curl6.
1. $ python proxy_service.py
2.
3. Now serving at 9097
Code 9.12
In the following example, Code 9.13, we are using the curl command to fetch HTML content from python.org. This request goes through the GET method7 to fetch results:
1. $ curl -x http://localhost:9097 http://python.org
Code 9.13
Unfortunately, there is a limitation to our proxy solution. You might have noticed that we have sent requests as GET, because we inherited from SimpleHTTPRequestHandler8 and only overrode the GET functionality (Code 9.11, lines 9-14).
Thus, it gives us an HTTP proxy service that only works with GET methods. In our case that is enough, since our service (Code 9.11) will allow us to download web resources that are accessible via GET requests.
Another thing worth highlighting is the fact that our simple proxy solution only supports the HTTP protocol, not HTTPS. The reason is simple: to build a proper proxy with SSL support, we would have to dig deeper into SSL certificates. We could also use already available projects to support this need910, although that topic is beyond this chapter.
Let us look at Code 9.14 to understand how to modify the example Code 9.10 to use our small HTTP proxy service with it:
1. import click
2. import asyncio
3. import httpx
4. import os
5. import random
6. from asyncio_pool import AioPool
7. from hashlib import sha256
8. from urllib.parse import urlparse
9. from tenacity import AsyncRetrying, RetryError, stop_after_attempt
10.
11. RETRIES = 3
12. CONCURRENCY_SIZE = 2
13.
14. class Downloader:
15.
16. def __init__(self, proxies=None):
17. self.proxies = proxies
18.
19. async def download(self, url):
20. """Fetch URL resource with retry"""
21. try:
22. async for attempt in AsyncRetrying(stop=stop_after_attempt(RE
TRIES)):
23. with attempt:
24. proxy_server = None
25. if self.proxies:
26. proxy_server = {
27. "all://": random.choice(self.proxies),
28. }
29. click.echo(f"Fetching: {url}, proxy: {proxy_server}")
30. async with httpx.AsyncClient(proxies=proxy_server) as cli
ent:
31. response = await client.get(url, follow_redirects=True)
32. if response.status_code == 200:
33. u = urlparse(url)
34. file_hash = sha256(url.encode('utf8')).hexdigest()
35. file_name = f"
{os.path.basename(u.path)}_{file_hash}"
36. with open(f"/tmp/{file_name}", "wb") as f:
37. f.write(response.content)
38. except RetryError:
39. click.echo(f"Failed to fetch {url} after {RETRIES} tries")
40.
41. async def download_list(self, urls_file):
42. calls = []
43. with open(urls_file, "r") as f:
44. async with AioPool(size=CONCURRENCY_SIZE) as pool:
45. for item in f:
46. if item and "http" in item:
47. result = await pool.spawn(self.download(item.strip()))
48. calls.append(result)
49.
50.
51. @click.command()
52. @click.option("--url", help="File URL path to download")
53. @click.option("--url-list", help="File with URLs to download")
54. @click.option("--proxy", help="List of proxy servers", multiple=True)
55. def run(url, url_list, proxy):
56. d = Downloader(proxy)
57. if url:
58. run_app = d.download(url)
59. elif url_list:
60. run_app = d.download_list(url_list)
61. if run_app:
62. asyncio.run(run_app)
63. else:
64. click.echo("No option selected")
65.
66.
67. if __name__ == "__main__":
68. run()
Code 9.14
The essential change in Code 9.14 as compared to Code 9.10 is in line 30, where we initialize the HTTP client context and pass the argument with the proxy servers list. Please notice that this list is initialized in the class constructor (lines 16-17). Another important change to notice is the body of the whole download functionality. We converted the function-driven approach (Code 9.10) to class-based, object-oriented programming. By this move, we managed to simplify the initial main entry point (lines 55-64) and encapsulate the calls in individual methods.
Let us check how to use our new script from Code 9.14, which we have saved as download_with_proxy.py, to fetch a single URL.
Our program will download the resource specified under the url parameter (Code 9.15) and save it in the /tmp folder.
We extract the resource name, the last part of the URL path (Code 9.14, lines 33-35), and calculate SHA25611 on top of the full resource URL (line 34).
These two parameters are joined together as a file name (line 35) that we use for saving the fetched resource (lines 36-37), which guarantees file name uniqueness.
1. $ python download_with_proxy.py --url https://www.wikipedia.org
2.
3. Fetching: https://www.wikipedia.org, proxy: None
Code 9.15
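To see in isolation how that unique file name is built, consider the following sketch with a hypothetical URL:
import os
from hashlib import sha256
from urllib.parse import urlparse

url = "https://example.com/assets/logo.png"  # hypothetical resource
u = urlparse(url)
file_hash = sha256(url.encode("utf8")).hexdigest()  # 64-character hex digest
file_name = f"{os.path.basename(u.path)}_{file_hash}"
print(file_name)  # logo.png_<64 hex characters>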
Now, let us see how to run the same script with a list of files (the URL resources from Code 9.4) to fetch, combined with our proxy service.
We need to start the proxy service as in Code 9.12. In the following example, we are using it.
1. $ python download_with_proxy.py --url-list example_files_list.txt --
proxy=http://localhost:9097
2.
3. Fetching: https://www.wikipedia.org/portal/wikipedia.org/assets/img/spr
ite-8bb90067.svg, proxy: {'all://': 'http://localhost:9097'}
Code 9.16
In line 3, we can see an example of the output of the running script. We can see that the URL we try to reach is served via the SSL (HTTPS) protocol.
Now, let us check what we have in the output of our proxy service (Code 9.12).
1. 127.0.0.1 - - [20/Oct/2023 22:17:21] code 501, message Unsupported m
ethod ('CONNECT')
2. 127.0.0.1 - - [20/Oct/2023 22:17:21] "CONNECT www.wikipedia.org:4
43 HTTP/1.1" 501 -
Code 9.17
We can see that our proxy service is throwing errors (code 501, line 1), which means that there is an issue with our HTTP proxy service. If you check line 2, you will see the details: on initializing the SSL connection, the CONNECT method is missing.
We will not concentrate on building a full proxy service that works properly with HTTPS connections. Instead, we can use an existing Python library that offers proxy support, called pproxy12.
1. $ pip install pproxy
Code 9.18
After installing the package, we can start the proxy service. The great feature of this package is that it allows us to start the service immediately with zero configuration. Let us check the following example:
1. $ pproxy
2.
3. Serving on :8080 by http,socks4,socks5
Code 9.19
In our example from Code 9.16, we need to specify the new proxy service address, like in the following example:
1. $ python download_with_proxy.py --url-list example_files_list.txt --
proxy=http://localhost:8080
2.
3. Fetching: https://www.wikipedia.org/portal/wikipedia.org/assets/img/spr
ite-8bb90067.svg, proxy: {'all://': 'http://localhost:8080'}
Code 9.20
This time, since we used a coherent proxy service that supports SSL connections with no issues, we can notice that fetching the resource does not show any fatal exceptions. We can say that we can now route through a proxy connection for those services that would start detecting our requests as potential DDOS attacks.
Organizing downloaded data
Downloading resources in parallel is a desired way of processing multiple resources all at once. Unfortunately, running concurrent download tasks simultaneously does not guarantee that fetching resources will be faster. Some servers have a download limit per stream, which means that if you try to download multiple files from the same server, it can limit the transfer speed per single connection. Let us check how to change the download technique using a different algorithm for big files.
In Figure 9.2, we propose splitting the download process for big files into smaller chunks, which means that a file of, for instance, 10 MB will be divided into 4 simultaneous downloads of 2.5 MB each. Every part of this download process will use 100% of the download speed assigned per connection.
Figure 9.2: Faster download for large file
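The arithmetic behind the chunk boundaries can be illustrated in the Python shell; each (start, end) pair below corresponds to one HTTP range request for the 10 MB example from Figure 9.2:
>>> file_size = 10_000_000   # the 10 MB file from Figure 9.2
>>> chunk_size = 2_500_000   # four parts of 2.5 MB each
>>> [(s, s + chunk_size - 1) for s in range(0, file_size, chunk_size)]
[(0, 2499999), (2500000, 4999999), (5000000, 7499999), (7500000, 9999999)]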
Let us check the following example on how to achieve this by downloading the Python installer, where we will apply the described process. Let us take a closer look at the next example, which we have saved as download_in_chunks.py:
1. import asyncio
2. import httpx
3. import os
4. from asyncio_pool import AioPool
5. from urllib.parse import urlparse
6.
7. URL = "https://www.python.org/ftp/python/3.12.0/Python-3.12.0.tgz"
8.
9.
10. async def get_size(url):
11. async with httpx.AsyncClient() as client:
12. response = await client.head(url)
13. size = int(response.headers["Content-Length"])
14. return size
15.
16.
17. async def download_range(url, start, end, output):
18. headers = {"Range": f"bytes={start}-{end}"}
19. async with httpx.AsyncClient() as client:
20. response = await client.get(url, follow_redirects=True, headers=he
aders)
21.
22. with open(output, "wb") as f:
23. for part in response.iter_bytes():
24. f.write(part)
25. print(f"Finished: {output}")
26.
27.
28. async def download(url, output, chunk_size=1000000):
29. file_size = await get_size(url)
30. chunks = range(0, file_size, chunk_size)
31. print(f"Planned number of chunks: {len(chunks)}")
32. calls = []
33. async with AioPool(size=int(len(chunks) / 3)) as pool:
34. for i, start in enumerate(chunks):
35. partial_output = f"{output}.part{i}"
36. result = await pool.spawn(download_range(url, start, start + chu
nk_size - 1, partial_output))
37. calls.append(result)
38.
39. with open(output, "wb") as o:
40. for i in range(len(chunks)):
41. chunk_path = f"{output}.part{i}"
42.
43. with open(chunk_path, "rb") as s:
44. o.write(s.read())
45. os.remove(chunk_path)
46.
47.
48. if __name__ == "__main__":
49. _url = urlparse(URL)
50. file_name = _url.path.split("/")[-1].strip()
51. output_path = f"/tmp/{file_name}"
52. asyncio.run(download(URL, output_path))
Code 9.21
To demonstrate how to use the flow shown in Figure 9.2, we will follow these steps:
We will use the Python 3.12 gzipped13 source file as the download target. In lines 48-52, we use the same entry point as always.
If we start the script as in the following example, the asyncio core will call the main async function (line 52) and proceed with the algorithm we coded above.
Before continuing with fetching results, we extract the filename from the URL (Code 9.21, lines 49-50). Let us see how to run the example Code 9.21 and what the output of running it looks like.
We can see in Code 9.22 that we first make a call to the python.org website, using an HTTP HEAD14 call, to check the size of the file that we are about to download (Code 9.21, line 13).
Having the response from the server, we can identify how many chunks to split the download into (Code 9.21, lines 28-31). The maximum chunk size we defined is 1000000 bytes, approximately 1 MB per chunk. This is the reason why the output (Code 9.22) is split into so many parts to download.
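We can sanity-check that chunk count in the Python shell; the file size used below is the one python.org reports for the tarball (we verify it again in Code 9.23):
>>> file_size = 27195214  # size of Python-3.12.0.tgz in bytes
>>> chunk_size = 1000000  # default from Code 9.21, line 28
>>> len(range(0, file_size, chunk_size))
28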
To verify whether our program downloads the partial files and then merges them into one properly, let us check the python.org website15 to validate what the proper file size should be.
1. $ python download_in_chunks.py
2.
3. Planned number of chunks: 28
4. Finished: /tmp/Python-3.12.0.tgz.part8
5. Finished: /tmp/Python-3.12.0.tgz.part2
6. Finished: /tmp/Python-3.12.0.tgz.part6
7. Finished: /tmp/Python-3.12.0.tgz.part1
8. Finished: /tmp/Python-3.12.0.tgz.part0
9. Finished: /tmp/Python-3.12.0.tgz.part5
10. Finished: /tmp/Python-3.12.0.tgz.part4
11. Finished: /tmp/Python-3.12.0.tgz.part7
12. Finished: /tmp/Python-3.12.0.tgz.part3
13. Finished: /tmp/Python-3.12.0.tgz.part10
14. Finished: /tmp/Python-3.12.0.tgz.part9
15. Finished: /tmp/Python-3.12.0.tgz.part12
16. Finished: /tmp/Python-3.12.0.tgz.part11
17. Finished: /tmp/Python-3.12.0.tgz.part17
18. Finished: /tmp/Python-3.12.0.tgz.part16
19. Finished: /tmp/Python-3.12.0.tgz.part15
20. Finished: /tmp/Python-3.12.0.tgz.part13
21. Finished: /tmp/Python-3.12.0.tgz.part14
22. Finished: /tmp/Python-3.12.0.tgz.part18
23. Finished: /tmp/Python-3.12.0.tgz.part19
24. Finished: /tmp/Python-3.12.0.tgz.part21
25. Finished: /tmp/Python-3.12.0.tgz.part20
26. Finished: /tmp/Python-3.12.0.tgz.part22
27. Finished: /tmp/Python-3.12.0.tgz.part24
28. Finished: /tmp/Python-3.12.0.tgz.part23
29. Finished: /tmp/Python-3.12.0.tgz.part25
30. Finished: /tmp/Python-3.12.0.tgz.part27
31. Finished: /tmp/Python-3.12.0.tgz.part26
Code 9.22
According to the website16 (Gzipped source tarball), it should be 27195214 bytes. With the following example, let us verify that what we have downloaded matches expectations by checking Code 9.23:
1. $ ls -l /tmp/Python-3.12.0.tgz
2.
3. -rw-r--r-
- 1 hubertpiotrowski wheel 27195214 Oct 22 15:21 /tmp/Python-
3.12.0.tgz
Code 9.23
It shows that the file size properly matches the expected number of bytes specified on the python.org website. That seems to prove the point that our algorithm works with no problem and the result of fetching external content is as expected; albeit, does it contain the proper data? Even though we can uncompress the downloaded file, like in the following example, the question remains: is the output file corrupted or not?
1. $ tar zxvf /tmp/Python-3.12.0.tgz
2.
3. x Python-3.12.0/
4. x Python-3.12.0/Grammar/
5. x Python-3.12.0/Grammar/python.gram
6. x Python-3.12.0/Grammar/Tokens
7. x Python-3.12.0/Parser/
8. x Python-3.12.0/Parser/tokenizer.h
9. x Python-3.12.0/Parser/pegen.c
10. x Python-3.12.0/Parser/string_parser.h
11. (...)
Code 9.24
As stated on the python.org website, every file which we can fetch from there has its md5 checksum listed. This is very useful and important information. Every person who fetches a file from the python.org website can validate that the downloaded file matches the expected verification checksum.
Let us compare the same thing by running the following code:
1. $ md5sum /tmp/Python-3.12.0.tgz
2.
3. d6eda3e1399cef5dfde7c4f319b0596c /tmp/Python-3.12.0.tgz
Code 9.25
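If we prefer to validate the checksum from Python rather than from the command line, a minimal sketch using the standard hashlib module could look like this; the expected value below is the one printed by md5sum in Code 9.25:
from hashlib import md5

expected = "d6eda3e1399cef5dfde7c4f319b0596c"  # checksum from Code 9.25

with open("/tmp/Python-3.12.0.tgz", "rb") as f:
    digest = md5(f.read()).hexdigest()

print("OK" if digest == expected else "corrupted")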
Once we know the overall concept of how to approach partial downloads, let us try to refactor our Code 9.14 and introduce parallel partial downloads into it. Before we analyze the new code, let us create a URLs example file, call it example_big_files_list.txt, and give it the following body:
1. https://www.python.org/ftp/python/3.12.0/Python-3.12.0.tgz
2. https://www.php.net/distributions/php-8.1.24.tar.gz
Code 9.26
Let us check how the modified example Code 9.14 works with big file downloads when we want to split the download into chunks. The following example is saved as download_pool_and_chunks.py, and it is going to provide better support for fetching large files:
1. import click
2. import asyncio
3. import httpx
4. import os
5. import random
6. from asyncio_pool import AioPool
7. from hashlib import sha256
8. from urllib.parse import urlparse
9. from tenacity import AsyncRetrying, RetryError, stop_after_attempt
10.
11. RETRIES = 3
12. CONCURRENCY_SIZE = 2
13.
14.
15. class Downloader:
16. def __init__(self, proxies=None, max_chunk_size=None):
17. self.proxies = proxies
18. # size in bytes, if any
19. self._max_chunk_size = int(max_chunk_size * 1024 * 1024) if ma
x_chunk_size else None
20.
21. async def get_size(self, url):
22. async with httpx.AsyncClient() as client:
23. response = await client.head(url)
24. size = int(response.headers["Content-Length"])
25. return size
26.
27. async def download_range(self, url, start, end, output):
28. headers = {"Range": f"bytes={start}-{end}"}
29. async with httpx.AsyncClient() as client:
30. response = await client.get(url, follow_redirects=True, headers=
headers)
31.
32. with open(output, "wb") as f:
33. for part in response.iter_bytes():
34. f.write(part)
35. click.echo(f"Finished: {output}")
36.
37. async def split_download(self, url, file_size, output):
38. chunks = range(0, file_size, self._max_chunk_size)
39. click.echo(f"Planned number of chunks: {len(chunks)}")
40. calls = []
41. async with AioPool(size=int(len(chunks) / 3)) as pool:
42. for i, start in enumerate(chunks):
43. partial_output = f"{output}.part{i}"
44. result = await pool.spawn(self.download_range(url, start, start
+ self._max_chunk_size - 1, partial_output))
45. calls.append(result)
46.
47. with open(output, "wb") as o:
48. for i in range(len(chunks)):
49. chunk_path = f"{output}.part{i}"
50.
51. with open(chunk_path, "rb") as s:
52. o.write(s.read())
53. os.remove(chunk_path)
54.
55. async def download(self, url):
56. """Fetch URL resource with retry"""
57. try:
58. async for attempt in AsyncRetrying(stop=stop_after_attempt(RE
TRIES)):
59. with attempt:
60. proxy_server = None
61. if self.proxies:
62. proxy_server = {
63. "all://": random.choice(self.proxies),
64. }
65. click.echo(f"Fetching: {url}, proxy: {proxy_server}")
66. if self._max_chunk_size:
67. file_size = await self.get_size(url)
68. click.echo(f"File size: {file_size}")
69. if file_size >= self._max_chunk_size:
70. _url = urlparse(url)
71. output_file_name = _url.path.split
("/")[-1].strip()
72. return await self.split_download
(url, file_size, output_file_name)
73. async with httpx.AsyncClient(proxies=proxy_server) as cli
ent:
74. response = await client.get(url, follow_redirects=True)
75. if response.status_code == 200:
76. u = urlparse(url)
77. file_hash = sha256(url.encode("utf8"))
.hexdigest()
78. file_name = f"{os.path.basename
(u.path)}_{file_hash}"
79. with open(f"/tmp/{file_name}", "wb") as f:
80. f.write(response.content)
81. except RetryError:
82. click.echo(f"Failed to fetch {url} after
{RETRIES} tries")
83.
84. async def download_list(self, urls_file):
85. calls = []
86. with open(urls_file, "r") as f:
87. async with AioPool(size=CONCURRENCY_SIZE) as pool:
88. for item in f:
89. if item and "http" in item:
90. result = await pool.spawn
(self.download(item.strip()))
91. calls.append(result)
92.
93.
94. @click.command()
95. @click.option("--url", help="File URL path to download")
96. @click.option("--url-list", help="File with URLs to download")
97. @click.option("--proxy", help="List of proxy servers", multiple=True)
98. @click.option("--
size", default=None, help="Size limit (MB) for partial downloads",
type=float)
99. def run(url, url_list, proxy, size):
100. d = Downloader(proxy, size)
101. run_app = None
102. if url:
103. run_app = d.download(url)
104. elif url_list:
105. run_app = d.download_list(url_list)
106. if run_app:
107. asyncio.run(run_app)
108. else:
109. click.echo("No option selected")
110.
111.
112. if __name__ == "__main__":
113. run()
Code 9.27
We have modified the main entry point of our primary function (Code 9.27, lines 99-109), where we can specify not only that we would like to download a single resource (line 95) but also a list of URLs via the url-list argument (line 96). We can define a proxy as well (line 97). Additionally, when we specify the URL list, we can limit the chunk size per file with the size option (line 98).
Let us check how to download two big files (Code 9.26) with the following
example:
1. $ python download_pool_and_chunks.py --url-
list example_big_files_list.txt --size 1
Code 9.28
The output of running Code 9.28 is shown in the following example. Both files that we are downloading are divided into chunks and run in a pool of connections:
1. Fetching: https://www.python.org/ftp/python/3.12.0/Python-
3.12.0.tgz, proxy: None
2. 1048576
3. Fetching: https://www.php.net/distributions/php-
8.1.24.tar.gz, proxy: None
4. 1048576
5. File size: 27195214
6. Planned number of chunks: 26
7. File size: 18692939
8. Planned number of chunks: 18
9. Finished: Python-3.12.0.tgz.part7
10. Finished: Python-3.12.0.tgz.part6
11. Finished: Python-3.12.0.tgz.part5
12. Finished: Python-3.12.0.tgz.part1
13. Finished: Python-3.12.0.tgz.part2
14. Finished: Python-3.12.0.tgz.part4
15. Finished: Python-3.12.0.tgz.part0
16. Finished: php-8.1.24.tar.gz.part4
17. Finished: php-8.1.24.tar.gz.part2
18. Finished: php-8.1.24.tar.gz.part3
19. Finished: Python-3.12.0.tgz.part18 (…)
Code 9.29
We can notice that we print out the file sizes (lines 5 and 7). Next, we decide how many parallel channels we want to use per download of the same single file (lines 6 and 8). We need to remember the specified chunk size limit (Code 9.28), which is 1 MB.
When we investigate Code 9.27 (lines 57-82), we can see that we keep the retry pattern for the case when the download faces any kind of network issue. We added new functionality (lines 66-67) where we check whether the size limit is specified; if so, we make a call to the server to retrieve the size of the resource file that we will download later (lines 69-72).
In the rest of the code, we fetch the file in chunks (lines 27-45), like we did in Code 9.21. As you might have noticed, we accidentally made a mistake in Code 9.27: when the splitting function is not in use, we save the downloaded resources into a different folder than when splitting takes place (lines 43-53). Let us fix this error with the following updated code:
1. async def download(self, url):
2. """Fetch URL resource with retry"""
3. try:
4. async for attempt in AsyncRetrying(stop=stop_after_attempt(RET
RIES)):
5. with attempt:
6. proxy_server = None
7. if self.proxies:
8. proxy_server = {
9. "all://": random.choice(self.proxies),
10. }
11. click.echo(f"Fetching: {url}, proxy: {proxy_server}")
12. if self._max_chunk_size:
13. file_size = await self.get_size(url)
14. click.echo(f"File size: {file_size}")
15. if file_size >= self._max_chunk_size:
16. _url = urlparse(url)
17. output_file_name = _url.path.split("/")[-1].strip()
18. output_file_name = f"/tmp/{output_file_name}"
19. return await self.split_download(url, file_size, output_fil
e_name)
20. async with httpx.AsyncClient(proxies=proxy_server) as client
:
21. response = await client.get(url, follow_redirects=True)
22. if response.status_code == 200:
23. u = urlparse(url)
24. file_hash = sha256(url.encode("utf8")).hexdigest()
25. file_name = f"{os.path.basename(u.path)}_{file_hash}"
26. with open(f"/tmp/{file_name}", "wb") as f:
27. f.write(response.content)
Code 9.30
We did not change much except line 18, which is the actual fix. You can notice that by adding the path as a prefix for the output filename, we save the output into the same location as all the other files that are smaller than the size limit (Code 9.27, line 98).
Building YouTube API client
We have learned how to work with network resources, download small and big files at the same time (in parallel), and limit download throughput. Now, we will learn how to use our script to download video and audio content from YouTube.
There are multiple Python packages1718 that we could use to download YouTube content from the web; albeit, we cannot directly connect them to our async system. In this case, we will build our own library instead.
Before we can start interacting with the YouTube resources, we need to create a YouTube API key19. Let us create a test application and API key in the Google developer dashboard. Once that API access is sorted, we will check if the following code is working:
1. curl https://www.googleapis.com/youtube/v3/search\?
part\=snippet\&type\=channel\&fields\=items%2Fsnippet%2FchannelId
\&q\=CafeDeAnatolia\&key\=<your API key>
Code 9.31
We are using YouTube API v320 in Code 9.31, where we are trying to get the internal channel ID mapped from a given channel name (CafeDeAnatolia). Let us check the following example to see a typical output for running such an API query:
1. {
2. "items": [
3. {
4. "snippet": {
5. "channelId": "UC1Tr6S-XLBk1NzNX1jErWMg"
6. }
7. }
8. ]
9. }
Code 9.32
We can see that Google returns a JSON response with a channel ID, and this is the internal ID that we will use to build the next part of the functionality.
Let us build a proof-of-concept code which will allow us to fetch a list of all the available videos for a given channel by following these steps:
Let us start by creating a simple class that unifies the Google API calls into one primary piece of code (called youtube_api.py). Look at example Code 9.33 to understand it better:
1. import asyncio
2. import click
3. import httpx
4. import urllib
5.
6.
7. class YouTube:
8.
9. base_url = "https://www.googleapis.com/youtube/v3/"
10.
11. def __init__(self, channel):
12. self.channel_name = channel
13.
14. @property
15. def api_key(self):
16. if not getattr(self, '_api_key', None):
17. raise Exception("Please specify API
key before making calls")
18. return self._api_key
19.
20. @api_key.setter
21. def api_key(self, value):
22. if not value:
23. raise Exception("Please specify valid API key")
24. self._api_key = value
25.
26. def _encode_query(self, query):
27. return urllib.parse.urlencode(query)
28.
29. async def get_channel_id(self):
30. query = {
31. "part": "snippet",
32. "type": "channel",
33. "fields": "items/snippet/channelId",
34. "q": self.channel_name,
35. "key": self.api_key
36. }
37. url = f"{self.base_url}search?{self._encode_query(query)}"
38. click.echo(f"Calling {url}")
39. async with httpx.AsyncClient() as client:
40. response = await client.get(url, follow_redirects=True)
41. if response.status_code == 200:
42. response_data = response.json()
43. channel_id = response_data['items'][0]['snippet']['channelId']
44. click.echo(f"got channel ID: {channel_id}")
45. return channel_id
46.
47.
48. def get_videos_list(self):
49. pass
50.
51.
52. @click.command()
53. @click.option("--channel", help="Channel name to scan")
54. @click.option("--api-key", default=None, help="YouTube API key")
55. def main(api_key, channel):
56. yt = YouTube(channel)
57. yt.api_key = api_key
58. run_app = yt.get_channel_id()
59. asyncio.run(run_app)
60.
61. if __name__ == '__main__':
62. main()
Code 9.33
We created a class, YouTube, that takes the channel name as the primary argument in the class constructor. Once the channel name is given (lines 11-12), we must pass the YouTube API key to the attribute setter (lines 20-24).
We created protection for accessing the API key class property when it is not initialized at all or is empty (lines 14-18).
Let us check the following code snippet (running it in the Python shell) to understand what our protection does when we try to make a call without setting the API key:
1. >>> import asyncio
2. >>> from youtube_api import YouTube
3. >>> yt = YouTube('some channel name')
4. >>> asyncio.run(yt.get_channel_id())
5.
6. File ~/work/fun-with-python/chapter_9/youtube_api.py:17,
in YouTube.api_key(self)
7. 14 @property
8. 15 def api_key(self):
9. 16 if not getattr(self, '_api_key', None):
10. ---> 17 raise Exception("Please specify API
key before making calls")
11. 18 return self._api_key
12.
13. Exception: Please specify API key before making calls
Code 9.34
It is worth noticing that when we try to make a YouTube API call (Code 9.34,
line 4), our class property getter (Code 9.33, lines 14-18) checks whether the
internal API key attribute has been set properly; if not, an exception is
raised (Code 9.34, line 4).
Let us check the following code to see what happens if we try to set an
invalid API key, for instance an empty one:
1. >>> from youtube_api import YouTube
2. >>> yt = YouTube('some channel name')
3. >>> yt.api_key = ""
4. ---------------------------------------------------------------------------
5. Exception                                 Traceback (most recent call last)
6. Cell In[10], line 1
7. ----> 1 yt.api_key = ""
8.
9. File ~/work/fun-with-python/chapter_9/youtube_api.py:23,
in YouTube.api_key(self, value)
10. 20 @api_key.setter
11. 21 def api_key(self, value):
12. 22 if not value:
13. ---> 23 raise Exception("Please specify valid API key")
14. 24 self._api_key = value
15.
16. Exception: Please specify valid API key
Code 9.35
We tried to set the class instance attribute (Code 9.35, line 3) with an empty
value. Our previously implemented setter (Code 9.33, line 22) checks whether a
given value is empty or None. If this happens, we raise an exception
(Code 9.33, line 23).
Knowing all of this, we can run the code and get the internal channel ID
by following the example Code 9.36:
1. $ python youtube_api.py --channel CafeDeAnatolia
--api-key <your api key>
2.
3. Calling https://www.googleapis.com/youtube/v3/search?part=
snippet&type=channel&fields=items%2Fsnippet%2FchannelId&q=
CafeDeAnatolia&key=<your api key>
4. got channel ID: UC1Tr6S-XLBk1NzNX1jErWMg
Code 9.36
We call our script with the two required arguments, that is, the channel
name that we want to scan and the API key, which is necessary for making any
calls to the YouTube API.
Once we have a channel ID, we can start scanning it.
Let us check the following code to understand how to achieve this goal:
1. async def get_videos_list(self):
2. query = {
3. "key": self.api_key,
4. "channelId": await self.channel_id,
5. "part": "snippet,id",
6. "order": "date",
7. "maxResults": 20
8. }
9. videos = []
10. url = f"{self.base_url}search?{self._encode_query(query)}"
11.
12. async with httpx.AsyncClient() as client:
13. response = await client.get(url, follow_redirects=True)
14. if response.status_code == 200:
15. response_data = response.json()
16. videos = [{v['id']['videoId']:v['snippet']
['title']} for v in response_data['items']]
17. return videos
Code 9.37
We must add Code 9.37 to our primary YouTube class (Code 9.33). The
new part (Code 9.37) is a method that assumes the internal channel ID has
already been identified (by calling the get_channel_id method); with it, we
can start scanning that channel and fetch the list of videos for future downloads.
We also need to add (line 4) an instance property, shared
across the YouTube instance, where we keep the internal channel ID.
In the following code, you can see how to implement this property:
1. @property
2. async def channel_id(self):
3. if not getattr(self, '_channel_id', None):
4. self._channel_id = await self.get_channel_id()
5. return self._channel_id
Code 9.38
We use the same dynamic instance attribute technique that we used in Code 9.33.
By checking whether the internal channel ID is empty or does not exist (Code 9.38,
line 3), we call the internal get_channel_id method (Code 9.38, line 4).
Once we have saved the result to the mentioned internal protected attribute
(line 4), we return its value (line 5). With this simple trick, no matter how
many times we use the self.channel_id property, we call the YouTube API to
fetch the internal channel ID only once. It not only saves unnecessary calls
to a third-party API but also makes our code more efficient.
Let us check the following code to understand how to make Code 9.37 work
with our Downloader class (Code 9.27). We need to add the following
method to our primary YouTube class:
1. async def download_videos(self, videos_list: List[str]):
2. # to make it compatible with Downloader
we have to send list to file
3. with tempfile.NamedTemporaryFile() as f:
4. f.write('\n'.join(videos_list).encode())
5. f.flush()
6. click.echo(f"Start downloading {len(videos_list)} elements")
7. await self._downloader.download_list(f.name)
8.
9. async def download_channel_videos(self):
10. videos = await self.get_videos_list()
11. videos = [list(v.keys()).pop() for v in videos]
12. await self.download_videos(videos)
Code 9.39
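Note that for Code 9.39 to run, the top of youtube_api.py also needs the standard library imports used here:

import tempfile
from typing import List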
This method will not yet download all the videos found on a given channel.
The reason is quite simple. Please check the example output of running the
get_videos_list method, shown in the following example:
1. [{'z5vJllYJEpc': 'Cafe De Anatolia ETHNO WORLD -
Mauritania (Best of Organica '
2. 'DJ MIX 2023)'},
3. {'mJSfvBnl98o': 'Cafe De Anatolia ETHNO WORLD -
Boho (Best of Organica Dj Mix '
4. '2023)'}]
Code 9.40
As we can notice, the result of running the mentioned method, shown in
Code 9.40, is a list of dictionaries where the key of each dictionary is
the video ID, not a downloadable URL. Let us check the following
code to see how we can retrieve such a URL.
1. curl -X POST -H "Content-Type: application/json" -d '{"context":
{"client": {"clientName": "WEB", "clientVersion":
"2.20231026.03.01"}}, "videoId": "zSd9kCvYcOg"}' \
2. https://www.youtube.com/youtubei/v1/player\?key\
=<your api key>\&prettyPrint\=false | jq
Code 9.41
We will use a completely different API version21, since version 3 does not
allow us to retrieve video details for a given video ID.
Let us check the following example where we can see the sample of
response payload when running Code 9.41:
1. {
2. ...
3. "streamingData": {
4. "expiresInSeconds": "21540",
5. "formats": [
6. {
7. "itag": 18,
8. "url": "https://rr2---sn-5hneknes.googlevideo.com/(...)",
9. "mimeType": "video/mp4; codecs=\"avc1.42001E, mp4a.40.2\"",
10. "bitrate": 656697,
11. "width": 640,
12. "height": 360,
13. "lastModified": "1698589609182367",
14. "contentLength": "137118494",
15. "quality": "medium",
16. "fps": 25,
17. "qualityLabel": "360p",
18. "projectionType": "RECTANGULAR",
19. "averageBitrate": 656671,
20. "audioQuality": "AUDIO_QUALITY_LOW",
21. "approxDurationMs": "1670466",
22. "audioSampleRate": "44100",
23. "audioChannels": 2
24. },
25. {
26. "itag": 22,
27. "url": "https://rr2---sn-5hneknes.googlevideo.com/(...)",
28. "mimeType": "video/mp4; codecs=\"avc1.64001F, mp4a.40.2\"",
29. "bitrate": 1104970,
30. "width": 1280,
31. "height": 720,
32. "lastModified": "1698592104844652",
33. "quality": "hd720",
34. "fps": 25,
35. "qualityLabel": "720p",
36. "projectionType": "RECTANGULAR",
37. "audioQuality": "AUDIO_QUALITY_MEDIUM",
38. "approxDurationMs": "1670466",
39. "audioSampleRate": "44100",
40. "audioChannels": 2
41. }
42. ],
43. }
44. }
Code 9.42
As shown in the preceding example, the keys quality and qualityLabel
explicitly describe what quality of video material we get when fetching the
content given in the url key.
Support for different formats and resolutions
Let us check the following method to understand how to convert Code 9.41
so that it can be used in Code 9.33:
1. async def get_video_download_link(self, video_id):
2. query = {"key": self.api_key, "prettyPrint": False}
3. url = "https://www.youtube.com/youtubei/v1/player"
4. url = f"{url}?{self._encode_query(query)}"
5. data = {
6. "context": {
7. "client": {
8. "clientName": "WEB",
9. "clientVersion": "2.20231026.03.01",
10. "clientScreen": "WATCH",
11. "mainAppWebInfo": {"graftUrl": f"/watch?v={video_id}"}
12. },
13. "user": {"lockedSafetyMode": False},
14. "request": {
15. "useSsl": True,
16. "internalExperimentFlags": [],
17. "consistencyTokenJars": []
18. }
19. },
20. "videoId": video_id,
21. "racyCheckOk": False,
22. "contentCheckOk": False
23.
24. }
25. click.echo(f"Fetchnig video details: {video_id}")
26. headers = {'user-agent': 'Mozilla/5.0 (platform;
rv:geckoversion) Gecko/geckotrail Firefox/firefoxversion'}
27. # the response is parsed as JSON below via result.json()
28. async with httpx.AsyncClient() as client:
29. result = await client.post(url, json=data, headers=headers)
30. for item in result.json().get('streamingData',
{}).get('formats', []):
31. click.echo(item.get('qualityLabel'))
32. if item.get('qualityLabel') == self.video_quality:
33. return item.get('url')
34.
35. async def download_videos(self, videos_list: List[str]):
36. # to make it compatible with Downloader we have to
send list to file
37. with tempfile.NamedTemporaryFile() as f:
38. f.write(b"\n".join(videos_list))
39. click.echo(f"Start downloading {len(videos_list)} elements")
40. await self._downloader.download_list(f.name)
41.
42. async def download_channel_videos(self):
43. videos = await self.get_videos_list()
44. links_to_download = []
45. for item in videos:
46. video_id, title = list(item.items())[0]
47. video_link = await self.get_video_download_link(video_id)
48. click.echo(f"Got downloadable link {title}
({video_id}): {video_link}")
49. if video_link:
50. links_to_download.append(video_link.encode())
51. await self.download_videos(links_to_download)
Code 9.43
1. The first method (lines 1-34) is the core functionality, that is, fetching
from the YouTube API the download link for the requested video format
(lines 31-33). We define a few payload attributes that are essential for this
API to work (lines 5-24).
2. In the request that we send, we also inform YouTube that the request is
coming from a browser (lines 26 and 29). Altogether, we send the request as
a POST (line 29) with JSON data.
3. Once we have the response, we check whether we can find the requested
video format in it and then return the found URL (lines 32-33).
4. Once we have the URL to download the requested video, we append it
to the list of video URLs that we will try to download later (lines 46-50).
5. When the full set of URLs to download is ready, we call the
download_videos method (line 51). This method converts the list of URLs to
a file and then sends the file path to the download functionality that we
built previously in Code 9.27. A sketch of where self.video_quality and
self._downloader could come from follows this list.
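Neither self.video_quality nor self._downloader is defined in the snippets above; a minimal sketch of how the constructor from Code 9.33 could be extended to provide them (the default quality value matches the qualityLabel strings seen in Code 9.42, and Downloader stands for the class built in Code 9.27):

def __init__(self, channel, video_quality="720p"):
    self.channel_name = channel
    # quality label to look for in the streamingData formats, e.g. "360p" or "720p"
    self.video_quality = video_quality
    # the download helper built earlier in this chapter (Code 9.27)
    self._downloader = Downloader()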
It is worth noticing that the file we use to store the list of URLs to download in
line 37 (Code 9.43) is created using the temporary file technique22. As long as
we use that file pointer inside the context block (lines 37-40), it is accessible;
depending on buffering, you may also need to call f.flush() before handing the
file name to the downloader. Once we exit the context block, the file gets
automatically deleted without leaving any leftovers.
Conclusion
In this chapter, we learned how to build a very efficient file downloading
program. This program not only allows us to download a single web
resource but also helps download many files from a given list.
We have also learned how to make downloads more efficient, which means
faster and using multiple channels for download. Next, we learned how to
work with streaming services like YouTube and how to download video
resources. It is worth noticing that Google keeps changing the public
YouTube API, and the techniques for fetching YouTube videos can change with
time, but the essential download algorithm that we have learned in this
chapter will help you modify the code and make it work with the latest
YouTube API. Good luck and keep coding!
In the next chapter, we will learn some more low-level networking with
Python.
1. https://www.airtel.in/blog/broadband/fup-internet-plan-significance/
2. https://www.cloudflare.com/en-gb/learning/ddos/what-is-a-ddos-
attack/
3. https://pypi.org/project/asyncio-pool/
4. https://docs.python.org/3/library/urllib.request.html
5. https://docs.python.org/3/library/http.server.html
6. https://curl.se/
7. https://www.w3schools.com/tags/ref_httpmethods.asp
8.
https://docs.python.org/3/library/http.server.html#http.server.SimpleHT
TPRequestHandler
9. https://mitmproxy.org
10. https://pypi.org/project/pproxy/
11. https://en.wikipedia.org/wiki/SHA-2
12. https://pypi.org/project/pproxy/
13. https://www.gnu.org/software/gzip/
14. https://developer.mozilla.org/en-US/docs/web/http/methods/head
15. https://www.python.org/downloads/release/python-3120/
16. https://www.python.org/downloads/release/python-3120/
17. https://pytube.io/en/latest/index.html
18. https://pypi.org/project/youtube_dl/
19. https://developers.google.com/youtube/registering_an_application
20. https://developers.google.com/youtube/v3/docs/search/list
21. https://developers.google.com/youtube/v1_deprecation_notice
22. https://docs.python.org/3/library/tempfile.html
CHAPTER 10
Make A Program to Safeguard Websites
Introduction
The modern internet is not only a source of unlimited knowledge sharing and
access to many free online encyclopedias. It is also a place where we can all
find the latest news and join streaming services where watching a movie takes
just one click.
With great power comes great responsibility. Sometimes unwanted internet
content should be blocked from access. Corporate policies in many companies
force employees to install firewall software to filter unwanted content. In
this chapter, we will learn how to build content-filtering software as a
centralized solution. We will look closely at how a local computer opens web
content, and how to filter unwanted websites and prevent access to them.
Structure
This chapter will cover the following topics:
Understanding packet routing policies
Write your own DNS server
Build DHCP service
Packet inspection software
Filtering web content
Challenges with encrypted websites
Objectives
In this chapter, we will address the topics highlighted above using Python
modules, some of which we will build from scratch. Additionally, we will
shed light on implementations where all the content filtering can be
controlled with configuration files.
Understanding packet routing policies
We are going to learn about TCP/IP1 packet routing. In the following
examples, we will simulate the packet flow across these layers:
Client
Router
Server
In our case, the server will simulate the destination resource that
clients are trying to access (an outside internet resource). After reading the
technical specification2 you will get a basic idea of what the transport
layer looks like and what our layer 4 packet should look like.
Let us check the following figure to see how our simulator will work.
Figure 10.1: Example packet routing application
Let us check the following example of how to build a simulator for the client
that sends requests to the server via the router.
1. import click
2. import random
3. import socket
4. import time
5. from utils import get_client_mac_address, get_client_ip
6.
7. ROUTER_HOST = "localhost"
8. ROUTER_PORT = 9340
9.
10.
11. @click.command()
12. @click.option("--client-
id", type=int, help="client ID (number)", required=True)
13. def main(client_id):
14. mac_addr = get_client_mac_address(client_id)
15. ip_addr = get_client_ip(client_id)
16.
17. click.echo(f"Using MAC address: {mac_addr}")
18. click.echo(f"Using IP address: {ip_addr}")
19.
20. router = (ROUTER_HOST, ROUTER_PORT)
21. client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
22. # time.sleep(random.randint(1, 3))
23. client.connect(router)
24.
25. click.echo('Receive data packet')
26.
27. while client:
28. received_message = client.recv(1024)
29. received_message = received_message.decode("utf-8")
30. source_mac = received_message[0:17]
31. destination_mac = received_message[17:34]
32. source_ip = received_message[34:45]
33. destination_ip = received_message[45:56]
34. message = received_message[56:]
35. print("\nMessage: " + message)
36. time.sleep(1)
37.
38. if __name__ == "__main__":
39. main()
Code 10.1
We created a script that simulates a client device (a network card sending
packets) with a specified client number, so we use a predefined MAC
address3 (line 14) as well as a predefined IP address (line 15). Then we
connect to the router (lines 20-23). We can also trigger the connection after
a random number of seconds of waiting (line 22) if needed. Let us analyze the
following example of how to run 3 clients.
1. $ python client.py --client-id 1
2. $ python client.py --client-id 2
3. $ python client.py --client-id 3
Code 10.2
We can start those clients but first, we need to start our server emulator
presented in the following code, then we start the router emulator and in the
end clients. Let us check how server emulators work in the proceeding
example.
1. import click
2. import random
3. import socket
4. import time
5. from utils import random_mac_address, random_client_ip, get_client_ip
, get_client_mac_address
6.
7.
8. def main():
9. server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
10. server.bind(("localhost", 8080))
11. server.listen(2)
12. server_ip = random_client_ip()
13. server_mac_addr = random_mac_address()
14.
15. click.echo("waiting for router connection...")
16. while True:
17. routerConnection, address = server.accept()
18. if routerConnection is not None:
19. click.echo('Router Connected Successfully....!')
20. break
21. click.echo("Start sending messages...")
22. while True:
23. for item in range(1, 4):
24. message = f"message {item} - pkt {random.randint(0, 254)}"
25. destination_ip = get_client_ip(item)
26. source_ip = server_ip
27. ip_addr_header = source_ip + destination_ip
28. source_mac = server_mac_addr
29. destination_mac = get_client_mac_address(0)
30. eth_header = source_mac + destination_mac
31. packet = eth_header + ip_addr_header + message + "\n"
32. click.echo(f"send message {item}, packet {packet}")
33. routerConnection.send(packet.encode())
34. time.sleep(1)
35.
36. if __name__ == "__main__":
37. main()
Code 10.3
In the server code (Code 10.3) we see an infinite loop (lines 22-34) that
keeps generating new semi-random response messages (line 24) for the
connected router. Once the message is generated, we set the source IP
address (the server IP address, line 26). Then we collect the client
IP address (line 25), which is semi-mocked since we do not truly collect the
IP address of a remote connection. The reason is that the clients, router, and
server run on the same machine, so all those components would see each
other connected from IP address 127.0.0.1. Thus we simulate fetching the client
IP address by getting it from a predefined list (line 25).
Once we have the basic information about the client, message, and destination
details, we build a simulated TCP/IP packet message (lines 27-31) that we
send back to the router. This simulates sending the network packets
that are normally handled by the client and server in a TCP/IP communication
channel.
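To make the slicing offsets used in Code 10.1 (lines 30-34) concrete, here is a hypothetical frame assembled by hand; it assumes the 11-character IP strings that this simulation's get_client_ip returns:

src_mac = "aa:bb:cc:dd:ee:01"   # 17 characters, offsets 0-16
dst_mac = "aa:bb:cc:dd:ee:02"   # 17 characters, offsets 17-33
src_ip = "10.65.4.100"          # 11 characters, offsets 34-44
dst_ip = "10.65.4.101"          # 11 characters, offsets 45-55
packet = src_mac + dst_mac + src_ip + dst_ip + "hello"
assert packet[34:45] == "10.65.4.100"   # matches the client's source IP slice
assert packet[56:] == "hello"           # everything after offset 56 is the message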
The router example is shown in the following code. Let us look at the
routing algorithm. This is a very simplified simulation that only
demonstrates how the router addresses incoming and outgoing packets.
1. import click
2. import random
3. import socket
4. import time
5. from utils import random_mac_address, get_client_mac_address, get_cl
ient_ip
6.
7. arp_table = {}
8.
9.
10. def start_routing(in_port, out_port):
11. router_mac_addr = random_mac_address().encode()
12. router_in = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
13. router_in.bind(("localhost", in_port))
14. router_in.listen(3)
15.
16. router_out = socket.socket(socket.AF_INET, socket.SOCK_STREA
M)
17. router_out.connect(("localhost", out_port))
18.
19. client_id = 1
20. while len(arp_table.keys()) < 3:
21. client, address = router_in.accept()
22. client_ip = get_client_ip(client_id)
23. click.echo(f"Client connected: {client_ip}")
24. arp_table[client_ip.encode()] = {"client": client, "mac_addr": get_c
lient_mac_address(client_id).encode()}
25. client_id += 1
26. click.echo(f"ARP table: {arp_table}")
27. while True:
28. pkt = b""
29. while True:
30. single_byte = router_out.recv(1)
31. if single_byte == b"\n":
32. break
33. pkt += single_byte
34.
35. src_mac = pkt[0:17]
36. dst_mac = pkt[17:34]
37. src_ip = pkt[34 : 34 + 12 + 3] # example 180.101.231.245
38. dst_ip = pkt[34 + len(src_ip) : 59]
39. message = pkt[59:]
40. click.echo(f"Source IP: {src_ip}, destination: {dst_ip}")
41. click.echo(f"Message: {message}")
42.
43. ethernet_header = router_mac_addr + arp_table[dst_ip]
["mac_addr"]
44. ip_header = src_ip + dst_ip
45. out_pkt = ethernet_header + ip_header + message
46.
47. dst_socket = arp_table[dst_ip]["client"]
48. dst_socket.send(out_pkt)
49.
50.
51. @click.command()
52. @click.option("--router-
port", default=9340, type=int, help="Incoming router port")
53. @click.option("--dst-
port", default=8080, type=int, help="Destination server port")
54. def main(dst_port, router_port):
55. start_routing(router_port, dst_port)
56.
57.
58. if __name__ == "__main__":
59. main()
Code 10.3
We can see that the router accepts incoming connections from 3 client
devices (lines 20-25). Once the connections are established, we try to find the
correct route for incoming and outgoing packets (lines 27-48). As you
probably already noticed, every message coming back from the server has
an escape character at the end of the TCP/IP frame: \n. It is only
used here to make the simulation a bit cleaner and easier to follow,
since we use this character as a stop sign. Once that stop character is
reached in the packet message, we may assume that we managed to get the whole
packet from the server, and we can start forwarding the message back to the
corresponding client (lines 47-48).
The simulator we built helps demonstrate how an internet gateway can be
built in Python. By having such a gateway, we can start doing deep
packet inspection and analyze whether an incoming client request to the router is
allowed to reach its destination. That is, if a user tries to open a website
that is blocked for them, the router could check its internal list of
blacklisted websites, refuse to forward the packet outside, and block the user
from accessing the desired content.
Figure 10.2: Deep packet inspection compared to a traditional model
All this simulation was to highlight how such a gateway can be built, even
though in the world of modern routers there are many highly complex challenges
that are not easy to address with Python. Some DPI techniques require
interaction at a very low kernel (OS) level, where Python does not have the
proper API to address these things. This can be especially challenging when
the router must perform DPI at the TLS level, where packets are encrypted.
In this chapter, we are not going to concentrate on building sophisticated
routers in Python, which of course could address the main point of this
chapter, building programs to safeguard websites. We are going to
use a different strategy. We will inspect the websites that users would like to
visit by checking their names in a service called a DNS resolver. Let us start
with the basics: assigning an IP address to the client machine.
Write DHCP server
To start building an internet gateway service, where we are going to control
which hosts have access to the internet and what kind of resources they can
access, we need to begin with a DHCP service4.
There are plenty of Python DHCP services in the vast resources of
GitHub5 or on PyPI6; unfortunately, none of them fits our needs. We need a
service that can assign IP addresses to client device requests and keep those
leases saved and organized in a database.
First, we need to install a database; in our examples, we are going to use
MariaDB7. Once we have it installed, let us create a database as in the
following example.
1. $ mysql -h localhost -uroot
2.
3. mysql> CREATE DATABASE IF NOT EXISTS dhcp_service;
4. Query OK, 1 row affected, 1 warning (0.01 sec)
5.
6. mysql>
Code 10.4
Once we have created the database, we are going to use the SQLAlchemy8
system to map database tables to Python models9. Let us start with creating a
model that will be responsible for storing all the leases that we give to
clients.
1. from datetime import datetime
2. from sqlalchemy.orm import DeclarativeBase
3. from sqlalchemy import Column, Integer, String, DateTime
4.
5.
6. class BaseTable(DeclarativeBase):
7. created_at = Column(DateTime, default=datetime.now)
8. updated_at = Column(DateTime, default=datetime.now, onupdate=da
tetime.now)
9.
10.
11. class UserLease(BaseTable):
12. __tablename__ = "user_lease"
13. id = Column(Integer, primary_key=True)
14. ip_addr = Column(String(50), nullable=False)
15. mac_address = Column(String(50))
Code 10.5
We use a base table model (lines 6-8) where we define 2 inheritable fields,
created_at and updated_at, which we will use later as indicators of when a
DB record of a table inheriting from the BaseTable model was created or
updated. Such a record gets the current timestamp in both cases.
To be able to start using SQLAlchemy, we need to initialize the main
connection engine object in the main DHCP service file. Let us check
the following example, which we have named server.py.
1. from sqlalchemy import create_engine
2.
3. engine = create_engine("mysql+pymysql://root@localhost/dhcp_service
")
Code 10.6
We use the MySQL connection driver to establish the SQLAlchemy engine. To
fully explore the potential of the next examples, we must install some
essential Python libraries:
1. SQLAlchemy==2.0.23
2. asyncio==3.4.3
3. sqlalchemy[asyncio]
4. pre-commit==3.3.1
5. pyMySQL==1.1.0
6. coloredlogs==15.0.1
Code 10.7
When these packages are installed, we have to apply a database migration10
system called Alembic11. First, we need to install it.
1. $ pip install alembic
Code 10.8
After installing it, we need to configure the migration tool. Let us check the
following example of how to do it.
1. $ alembic init --template generic ./scripts
Code 10.9
The init command creates a configuration file, alembic.ini, whose default
parameters we can mostly keep as they are. The only change we are
going to make is the database configuration. We need to find the following
statement and adjust it accordingly.
1. sqlalchemy.url = mysql+pymysql://root@localhost/dhcp_service
Code 10.10
Now it is time to put it all together and create the very first migration, where
we tell Alembic how to create a table for our DHCP model (Code 10.5,
lines 11-15). To do so, we execute the following statement.
1. $ alembic revision -m "create user lease table"
Code 10.11
This statement creates a migration file with a semi-random name; for
instance, let us assume it got called:
1. scripts/versions/4944b164709d_create_user_lease_table.py
Code 10.12
We are about to fill its content with the following code which, as we already said,
is going to create the UserLease table.
1. """
2. create user lease table
3.
4. Revision ID: 4944b164709d
5. Revises:
6. Create Date: 2023-11-19 23:04:53.414219
7.
8. """
9. from typing import Sequence, Union
10.
11. from alembic import op
12. import sqlalchemy as sa
13.
14.
15. # revision identifiers, used by Alembic.
16. revision: str = "4944b164709d"
17. down_revision: Union[str, None] = None
18. branch_labels: Union[str, Sequence[str], None] = None
19. depends_on: Union[str, Sequence[str], None] = None
20.
21.
22. def upgrade() -> None:
23. op.create_table(
24. "user_lease",
25. sa.Column("id", sa.Integer, primary_key=True),
26. sa.Column("ip_addr", sa.String(50), nullable=False),
27. sa.Column("mac_address", sa.String(50)),
28. )
29.
30.
31. def downgrade() -> None:
32. op.drop_table("user_lease")
Code 10.13
We can notice a few things in the migration file. One of them is the
auto-generated comment block at the beginning of the file (lines 1-8).
It mostly contains information useful for development, like when the
migration file was generated (line 6) and the human-readable title of the
migration (line 2) that we gave when generating the migration file
(Code 10.11).
The rest of the migration file contains the actions Alembic should perform going
forward (upgrade, lines 22-28) and when the developer decides to roll back
an applied migration (downgrade, lines 31-32).
What is important is the fact that migrations help us keep a timeline of
database changes. When we apply a migration (Code 10.13), it references the
previous migration (line 17) and carries its own migration ID (line 16).
Alembic needs these to properly perform migrations forward and backward; they
are its pointer in time.
Let us see in the following example how to apply such a migration on top of
the database.
1. alembic upgrade head
2. INFO [alembic.runtime.migration] Context impl MySQLImpl.
3. INFO [alembic.runtime.migration] Will assume non-
transactional DDL.
4. (..)
Code 10.14
We can see that when applying migrations, we instruct Alembic to bring the
database state to the migration head, which means the latest migration.
We are now ready to create the table, even though it looks like we forgot a little
detail: our table definition (Code 10.5) inherits from the base table (line
11). We must add a migration that alters the table we just created
by adding the two missing columns (created_at and updated_at). Let us create
a new migration by executing the command below.
$ alembic revision -m "add datetime columns"
Code 10.15
Now, we need to fill the body of the newly created migration like in the
following example.
1. """add datetime columns
2.
3. Revision ID: b0b4ac080f74
4. Revises: 4944b164709d
5. Create Date: 2023-11-19 23:55:45.972206
6.
7. """
8. from datetime import datetime
9. from typing import Sequence, Union
10.
11. from alembic import op
12. import sqlalchemy as sa
13.
14.
15. # revision identifiers, used by Alembic.
16. revision: str = "b0b4ac080f74"
17. down_revision: Union[str, None] = "4944b164709d"
18. branch_labels: Union[str, Sequence[str], None] = None
19. depends_on: Union[str, Sequence[str], None] = None
20.
21.
22. def upgrade() -> None:
23. op.add_column("user_lease", sa.Column("created_at", sa.DateTime,
default=datetime.now))
24. op.add_column("user_lease", sa.Column("updated_at", sa.DateTime,
default=datetime.now))
25.
26.
27. def downgrade() -> None:
28. pass
Code 10.16
As mentioned, we created the migration and filled its body with the content
shown in example 10.16, albeit only in the upgrade (lines 22-24) and
downgrade (lines 27-28) sections. We create the 2 missing columns when
upgrading the DB structure, and do nothing in the case of a downgrade. In this
demonstration example it is fine to leave it like this, although it is certainly
recommended to always properly maintain a valid database state for each
upgrade and downgrade.
We have added those missing fields; now it is time to start building the
actual service that will support DHCP connections. Let us start with a
simple example where we create helper methods for IP address arithmetic,
which we will use to find the first available IP address that has not already
been taken. The following example will be saved in the packet.py file.
All the following examples can be found in the GitHub address given in this
book. https://github.com/darkman66/dhcp-server
1. class Ip:
2. # Network byte order
3. BYTE_ORDER = "big"
4.
5. @staticmethod
6. def str_to_byte(ip: str):
7. ip_data = ip.split(".")
8. b = int.to_bytes(int(ip_data[0]), 1, Ip.BYTE_ORDER)
9. b += int.to_bytes(int(ip_data[1]), 1, Ip.BYTE_ORDER)
10. b += int.to_bytes(int(ip_data[2]), 1, Ip.BYTE_ORDER)
11. b += int.to_bytes(int(ip_data[3]), 1, Ip.BYTE_ORDER)
12. return b
13.
14. @staticmethod
15. def str_to_int(ip: str):
16. b = Ip.str_to_byte(ip)
17. return int.from_bytes(b, Ip.BYTE_ORDER)
18.
19. @staticmethod
20. def int_to_str(i):
21. if i == 0:
22. return "0.0.0.0"
23. b = int.to_bytes(i, 4, Ip.BYTE_ORDER)
24.
25. ip = str(int.from_bytes(b[:1], Ip.BYTE_ORDER)) + "."
26. ip += str(int.from_bytes(b[1:2], Ip.BYTE_ORDER)) + "."
27. ip += str(int.from_bytes(b[2:3], Ip.BYTE_ORDER)) + "."
28. ip += str(int.from_bytes(b[3:4], Ip.BYTE_ORDER))
29. return ip
30.
31. @staticmethod
32. def next_ip(ip: str):
33. int_ip = Ip.str_to_int(ip)
34. next_ip = Ip.int_to_str(int_ip + 1)
35. return next_ip
Code 10.17
In this example we use a technique called static methods, which are very
similar to class methods. However, a static method is not bound to the class
or its object; it does not initialize or change the class state, since it is
not connected to it. The reason we decided on this approach is that later in
our code we can use syntax like the following.
1. Ip.next_ip(server_ip)
Code 10.18
We use classes with static methods like Python modules, which makes it very
easy and convenient to apply that code.
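A quick sanity check of these helpers in a Python shell:

>>> from packet import Ip
>>> Ip.str_to_int("10.65.4.1")
172033025
>>> Ip.next_ip("10.65.4.1")
'10.65.4.2'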
Let us build and add a method to our main server.py file that will find and
add available IP addresses.
1. from models import UserLease
2. from sqlalchemy.orm import Session
3. async def get_or_create_lease(self, server_ip, mac_address):
4. with Session(engine) as session:
5. result = session.query(UserLease).filter(UserLease.mac_address == mac_address)
6. if result.count() > 0:
7. return result.first().ip_addr
8. ip_addr = await self.get_free_ip(server_ip, mac_address)
9. user_lease = UserLease(ip_addr=ip_addr, mac_address=mac_addre
ss)
10. session.add(user_lease)
11. session.commit()
12. return ip_addr
Code 10.19
We have an SQLAlchemy session in a context statement12 (lines 4-11) where
we query the database to check whether any IP address is already assigned
to the requesting MAC address (line 5). If no record is found,
we generate the next available IP address (line 8). Once the new IP
address is generated, we save it to the database and return it (lines 9-12), so
it can be reused for the same MAC address later.
We will now check how to find a new available IP address (line 8) with the
following code.
1. from sqlalchemy import func, desc
2.
3. async def get_free_ip(self, server_ip: str, mac: str) -> str:
4. logging.info(f"Server IP: {server_ip}")
5. with Session(engine) as session:
6. result = session.query(UserLease).order_by(desc(func.INET_ATO
N(UserLease.ip_addr))).limit(1).first()
7. if result:
8. return Ip.next_ip(result.ip_addr)
9. return Ip.next_ip(server_ip)
Code 10.20
We created a method that tries to find the first available IP
address that we can use for a lease. In line 6 we query the database for
UserLease records sorted by IP address in descending order and fetch
only a single record, the first from the top. That gives us the latest lease
in the database and its IP address (lines 7-8). Next, we
calculate the next IP address based on the returned DB record
(line 8). In case we cannot fetch any record from the database,
we calculate the next IP address as one number higher than our DHCP
server IP address (line 9).
Let us see in the following example how to start our DHCP server. We
need to notice one important thing before we continue: a DHCP service is a
socket-driven server that listens and operates via UDP13 packets.
1. class CaptiveDhcpServer:
2.
3. async def run(self, server_ip: str, netmask: str):
4. await self.get_leases()
5. udps = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
6. udps.setblocking(False)
7. udps.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR,
1)
8.
9. bound = False
10. while not bound:
11. try:
12. addr = socket.getaddrinfo("0.0.0.0", 67, socket.AF_INET, soc
ket.SOCK_DGRAM)[0][-1]
13. udps.bind(addr)
14. logging.info("Starting server on port 67")
15. bound = True
16. except Exception as e:
17. logging.error(f"Failed to bind to port {e}")
18. await asyncio.sleep(0.5)
19.
20.
21. if __name__ == "__main__":
22. coloredlogs.install(level=logging.DEBUG)
23. run_app = CaptiveDhcpServer()
24. asyncio.run(run_app.run("10.65.4.1", "255.255.0.0"))
Code 10.21
As we can see, we run our server as in all previous examples, using the
asyncio library (lines 23-24). We also added a very useful feature14 to our
code: a plugin that colorizes our log messages (line 22).
Once we start the service by calling the run method (line 24), we
initialize the socket connection (line 5). Then we set the socket as
non-blocking (line 6), so we can accept more incoming connections, and
inform the low-level OS socket layer that the socket will be reusable
(line 7). Next, we apply a simple algorithm (lines 10-18) to bind the socket
on all network interfaces in the system.
When binding the socket fails (line 16), for instance because the port is
taken or the system kernel is still holding it after the last use, we catch
the exception (lines 16-18) and wait half a second before the next attempt.
When the socket is bound successfully, we set the bound variable to True,
which stops the loop (line 15).
In the above code, we call a method (line 4) that reports the count of leases
already stored in the database. Let us investigate with the following
example what such code looks like.
1. async def get_leases(self):
2. with Session(engine) as session:
3. lease_count = session.query(UserLease).count()
4. logging.info(f"Lease count {lease_count}")
Code 10.22
In this example, we query for the total count of all the leases we have in
the database.
We now have the most important pieces connected. Let us see in the following
example how to extend the main run method from Code 10.21 so we can start
offering leases for every DHCP client request. So far, we only bind the
socket for incoming requests (Code 10.21, line 13); let us check the
following example to see how to use it.
1. while True:
2. try:
3. data, addr = udps.recvfrom(2048)
4. logging.info("Incoming data...")
5. logging.debug(data)
6.
7. request = Header.parse(data)
8. logging.debug(request)
9.
10. if isinstance(request, DhcpDiscover):
11. logging.info("Creating Offer for Discover")
12. response = DhcpOffer()
13. client_ip = await self.get_or_create_lease(server_ip, request.hea
der.chaddr)
14. logging.info(f"Found new ip: {client_ip}")
15. reply = response.answer(request, client_ip, server_ip, netmask)
16. logging.debug(response)
17.
18. self.send_broadcast_reply(reply)
19.
20. elif isinstance(request, DhcpRequest):
21. logging.info("Creating Ack for Request")
22. response = DhcpAck()
23. reply = response.answer(request, server_ip, netmask)
24. logging.info(response)
25.
26. self.send_broadcast_reply(reply)
27. await asyncio.sleep(0.1)
28. except OSError:
29. await asyncio.sleep(0.5)
30.
31. except Exception as e:
32. logging.error(f"Exception {e}")
33. await asyncio.sleep(0.5)
Code 10.23
It is easy to notice that we created another infinite loop, where we keep
waiting for incoming packets (lines 1-3). The moment we receive
incoming data, we parse it (line 7) to check what kind of request the server
received; it can be either a DHCP discover15 (line 10) or a DHCP request16 (line
20). When we receive a discover packet, we initialize the DhcpOffer
object (line 12) that represents the response packet, into which we inject the
offered available IP address (line 15). Next, we broadcast the response to the
client so the offered IP address can be assigned to the client device.
When a client device requests a confirmation from the DHCP server, we
initialize the acknowledgement response packet and reply with it (lines 22-26).
You should also notice that we need a sleep statement (line 27) to avoid
saturating CPU resources with the Python process in the case where a
packet is received but does not meet the acceptance criteria (lines 10 and
20) and shall be ignored.
Let us check the following example of what send_broadcast_reply looks
like.
1. def send_broadcast_reply(self, reply):
2. udpb = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
3. try:
4. udpb.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR
, 1)
5. udpb.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
6. udpb.setblocking(False)
7. broadcast_addr = socket.getaddrinfo("255.255.255.255", 68, socket
.AF_INET, socket.SOCK_DGRAM)[0][4]
8. logging.info(f"Broadcasting Response: {reply}")
9. udpb.sendto(reply, broadcast_addr)
10. except Exception as e:
11. logging.error(f"Failed to broadcast reply {e}")
12. finally:
13. udpb.close()
Code 10.24
We can see that the method broadcasts the response on a newly opened broadcast
socket (lines 2-6). Once we open the socket, we inform the operating system
that we will broadcast the packet to all devices on port 68 (line 7). Once all
is set, we try to send the packet (line 9). In case it fails, we log the
exception that just occurred (lines 10-11) and close the socket (line 13). We
introduced a special clause of the try-except block called finally (line 12); it
guarantees that the socket will be closed with or without any exceptions.
After building a properly functioning DHCP service, let us concentrate on
how to use it for content filtering. It cannot be used for analyzing web
content, but it can be used to block devices from getting an IP address.
Blacklisted devices will not get an IP address assigned,
which means they will not be able to route packets via the router to and from
the internet. Let us check how we can improve example 10.20 with the
following code. First, we need to introduce a new table called
WhiteListLease in the models.py file.
1. class WhiteListLease(BaseTable):
2. __tablename__ = "white_list_lease"
3. id = Column(Integer, primary_key=True)
4. mac_address = Column(String(50), unique=True)
Code 10.25
When the table definition is ready, we create a proper migration as in
example 10.15 and fill it with content as in example 10.16, updating the
column names accordingly to match the model presented in Code 10.25; a
sketch of the migration body follows.
We have decided to use a whitelist instead of a blacklist; what is the
difference? A whitelist stores all the MAC addresses that should get a
legitimately assigned IP address from our DHCP server. A blacklist stores
those MAC addresses that should be ignored when the DHCP service is getting
requests for IP assignments.
So why do we not simply store the MAC addresses that we want to ignore or
filter from accessing our lease service? The reason is simple: modern
operating systems17 can use network devices in such a way that for each new
DHCP request, when the device does not have an IP address assigned yet, they
generate a random18 MAC address and send the DHCP request on behalf
of that newly generated MAC address. This technique of masking the hardware
MAC address is used to hide the real identity of the client's device; in other
words, it exists mostly due to privacy concerns.
Knowing this, we introduced flipped logic: we will only assign IP
addresses to those devices that we know of, the ones saved in the whitelist
table. You may ask how this will work if the client has MAC randomization
enabled by default; for instance, on iOS, our user will have to disable
it to be able to use our DHCP service.
After running the migration (as in Code 10.14), we must insert the first MAC
address into our whitelist. Let us check the example below of how to do it.
1. from sqlalchemy.orm import Session
2. from models import WhiteListLease
3. from sqlalchemy import create_engine
4.
5. engine = create_engine("mysql+pymysql://root@localhost/dhcp_service
")
6. with Session(engine) as session:
7. obj = WhiteListLease(mac_address="123456789abc")
8. session.add(obj)
9. session.commit()
Code 10.26
It is noticeable that we follow the same convention used in previous code
examples: we use a context manager for the session (Code 10.26, line 6). We add the
newly created object to the session and then commit it to the database (lines
8-9). This approach is safe for all the cases where we want data automatically
rolled back when there is an error. See the following example of what happens
when we try to execute Code 10.26 twice; because we defined the MAC address
field to be unique (Code 10.25, line 4), it leads to an error.
1. (...)
2. --> 143 raise errorclass(errno, errval)
3.
4. IntegrityError: (pymysql.err.IntegrityError) (1062, "Duplicate entry '123
456789abc'
for key 'white_list_lease.mac_address'")
5. [SQL: INSERT INTO white_list_lease (mac_address, created_at,
updated_at) VALUES (%(mac_address)s, %(created_at)s, %
(updated_at)s)]
6. [parameters: {'mac_address': '123456789abc', 'created_at': datetime.date
time(2023, 12, 13, 8, 59, 46, 511293), 'updated_at': datetime.datetime(2
023, 12, 13, 8, 59, 46, 511301)}]
7. (Background on this error at: https://sqlalche.me/e/20/gkpj)
8.
9. In [10]: with Session(engine) as session:
10. obj = WhiteListLease(mac_address="123456789abc")
11. session.add(obj)
12. session.commit()
Code 10.27
MySQL raised an integrity error (lines 4-7), the result of an attempt to save
another DB record for a mac_address that already exists in the database. That
is what we wanted: we explicitly defined a uniqueness constraint in the
mac_address field definition (Code 10.25, line 4).
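If re-running the insert should not be fatal, a minimal sketch (using the same session setup as Code 10.26) catches the integrity error and rolls the transaction back:

from sqlalchemy.exc import IntegrityError

with Session(engine) as session:
    obj = WhiteListLease(mac_address="123456789abc")
    session.add(obj)
    try:
        session.commit()
    except IntegrityError:
        # duplicate mac_address: undo the pending insert and carry on
        session.rollback()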
With these foundations in place, we can go to the next part: how to use the
whitelist table when assigning IP addresses to client devices. As we said
earlier, we intend the whitelist table to store only those MAC addresses
allowed to use the DHCP service; we will extend this idea in the next
subchapter, treating the stored addresses as the ones allowed to use the
internet with no limitation. A sketch of the whitelist check itself follows.
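A minimal sketch of the whitelist check, assuming the engine and models shown above; get_or_create_lease could call it first and simply return no offer for a device that is not whitelisted:

from sqlalchemy.orm import Session
from models import WhiteListLease


def is_mac_whitelisted(mac_address: str) -> bool:
    # True only when the MAC address is present in the white_list_lease table
    with Session(engine) as session:
        query = session.query(WhiteListLease).filter(WhiteListLease.mac_address == mac_address)
        return query.count() > 0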
Once a device is configured with a proper IP address and can use our routing
in the local network, there is another mechanism involved in reaching internet
resources that we can use to control access to restricted content. Every
request to open a website works as described in the following flowchart:
Figure 10.3: Diagram of DNS service with client queries.
As shown in Figure 10.3, when a user tries to open a website and its related
resources like images, CSS, or JavaScript files, the OS sends a request to the
DNS server to check which IP address corresponds to the requested domain
name. After getting a response, it fetches the requested content directly from
the server by using the IP address instead of the domain name. This
mechanism is called a DNS19 resolver20.
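This OS-level lookup is what, for instance, socket.gethostbyname performs; a quick illustration:

import socket

# ask the configured DNS resolver which IP address serves this domain
print(socket.gethostbyname("wikipedia.org"))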
We are going to build such a service, ready to work as a DNS resolver:
whenever a client looks up the address of a destination server, we will
respond with the corresponding IP address. Let us check the following
example of how to build a simple DNS service with Python, but
before we do that, we need to install the Python modules required to make
this work.
1. $ pip install dnslib netifaces
Code 10.28
Once these modules are installed, we can continue with the following
example, where we build a basic DNS server that will work in proxy mode.
1. import coloredlogs
2. import logging
3. import time
4. from dnslib import QTYPE
5. from dnslib.proxy import ProxyResolver as BaseProxyResolver
6. from dnslib.server import DNSServer
7.
8.
9. class ProxyResolver(BaseProxyResolver):
10. def __init__(self, upstream: str):
11. super().__init__(address=upstream, port=53, timeout=10)
12.
13. def resolve(self, request, handler):
14. type_name = QTYPE[request.q.qtype]
15. logging.info(f"Query type: {type_name}")
16. return super().resolve(request, handler)
17.
18.
19. class DNSService:
20. def __init__(self, port: int, upstream: str):
21. self.port: int = DEFAULT_PORT if port is None else int(port)
22. self.upstream: str | None = upstream
23. self.udp_server: DNSServer | None = None
24. self.tcp_server: DNSServer | None = None
25.
26. def start(self):
27. logging.info(f"Listen on port {self.port}, upstream DNS server {sel
f.upstream}")
28. resolver = ProxyResolver(self.upstream)
29.
30. self.udp_server = DNSServer(resolver, port=self.port)
31. self.tcp_server = DNSServer(resolver, port=self.port, tcp=True)
32. self.tcp_server.start_thread()
33. self.udp_server.start_thread()
34.
35. def stop(self):
36. self.stop_udp()
37.
38. def stop_udp(self):
39. self.udp_server.stop()
40. self.udp_server.server.server_close()
41.
42. def stop_tcp(self):
43. self.tcp_server.stop()
44. self.tcp_server.server.server_close()
45.
46. @property
47. def is_running(self):
48. if self.udp_server and self.tcp_server:
49. return self.udp_server.isAlive() and self.tcp_server.isAlive()
50. return False
51.
52.
53. if __name__ == "__main__":
54. coloredlogs.install(level=logging.DEBUG)
55. s = DNSService(8953, '1.1.1.1')
56. s.start()
57. while s.is_running:
58. time.sleep(0.1)
Code 10.28
The main thing to notice is that we start our service listening on port 8953
(line 55) instead of the standard port 5321. This is only for the next
example, where we will test our DNS server. All the following examples will
have this line (55) changed to the following.
1. DNSService(53, '1.1.1.1')
Code 10.29
We can now check and validate that our DNS service works properly. By
executing the following statement (line 1), we check whether our resolver
actually resolves the name wikipedia.org. As we said in Chapter 1 - Python
101, we assume that our code runs in a Linux environment, so
we are going to use the dig22 command to query our DNS service.
1. $ dig @localhost -p 8953 wikipedia.org
2.
3. ; <<>> DiG 9.10.6 <<>> @localhost -p 8953 wikipedia.org
4. ; (2 servers found)
5. ;; global options: +cmd
6. ;; Got answer:
7. ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 23026
8. ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIO
NAL: 1
9.
10. ;; OPT PSEUDOSECTION:
11. ; EDNS: version: 0, flags:; udp: 512
12. ;; QUESTION SECTION:
13. ;wikipedia.org. IN A
14.
15. ;; ANSWER SECTION:
16. wikipedia.org. 471 IN A 185.15.59.224
Code 10.30
We can see that basic querying for a server IP address based on a domain
works fine. It may occur to you that we could try to combine this with the
already implemented idea of blocking specific MAC addresses from accessing
requested content. Unfortunately, at the DNS layer the query does not carry
the MAC address of the client performing it. In this case, we can build a
simple list of blocked content23 instead.
Let us use the following example, which digests one of the publicly
maintained databases of blacklisted servers24. First, we have to install a
module as in the following code.
1. $ pip install requests
Code 10.31
Once we have installed the requests module, we can start using the following
code that will help us check whether a requested DNS query is on the blacklist.
1. import os
2. import requests
3.
4.
5. def fetch_dns_blacklist():
6. url ="https://cdn.jsdelivr.net/gh/hagezi/dns-
blocklists@latest/domains/light.txt"
7. response = requests.get(url)
8. return response.text.split('\n')
9.
10. def load_local():
11. local_file = 'blacklist.txt'
12. if os.path.exists(local_file):
13. with open(local_file, 'r') as f:
14. return f.read().split('\n')
15. return []
16.
17. GLOBAL_BLACKLIST = fetch_dns_blacklist()
18. LOCAL_BLACKLIST = load_local()
19.
20. def is_name_valid(site_name):
21. return (
22. site_name not in GLOBAL_BLACKLIST and
23. site_name not in LOCAL_BLACKLIST
24. )
Code 10.32
As we can see, we load 2 types of blacklisted hosts: an official list
supported by the community (lines 5-8) and a custom one that we can create by
putting blocked hosts line by line in a flat text file (lines 10-15). Next, we
have a function (lines 20-24) that we will use in the following example.
1. def resolve(self, request, handler):
2. type_name = QTYPE[request.q.qtype]
3. q_name = str(request.q.get_qname()).strip('.')
4. logging.info(f"Query type: {type_name}, name: {q_name}")
5. if is_name_valid(q_name):
6. logging.info("Query not on the blacklist")
7. return super().resolve(request, handler)
8. # blacklisted: reply with NXDOMAIN (requires: from dnslib import RCODE)
9. reply = request.reply()
10. reply.header.rcode = RCODE.NXDOMAIN
11. return reply
Code 10.33
As is easy to notice, we call the is_name_valid function to check whether the
requested domain name is on the blacklist. If it is not, we resolve the query
as usual; otherwise we reply with an NXDOMAIN response. As a result, the user
will not be able to open the requested website, since the FQDN cannot be
translated to an IP address and the OS cannot proceed.
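To try the filter locally, we can append a hostname to the custom file read by load_local() in Code 10.32 (ads.example.com is a hypothetical entry):

# add one blocked hostname per line to the custom blacklist file
with open('blacklist.txt', 'a') as f:
    f.write('ads.example.com\n')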
Packet inspection software
This is one of the most efficient ways of controlling what kind of web
content a user has access to. In the previous subchapters we concentrated on a
fairly high level of abstraction: we controlled DNS queries and
assigned IP addresses to client devices. When we say analyzing packets,
we mean taking every single TCP/IP packet and checking what the user is
trying to send out. In this case, we talk about data processing that
inspects, in detail, the data being sent over a computer network. In other words, it
means that if a user tries to fetch any data from the internet, we will know
exactly what it is. There are some limitations that we will get into later in this
subchapter.
We will have to install a few system packages before we install the main library. The
following tools are installed on the Ubuntu25 Linux distribution; if you
are using a different distribution, please adapt the installation of these
packages accordingly.
1. $ sudo apt-get install build-essential git gettext flex bison libtool \
2. autoconf automake pkg-config libpcap-dev libjson-c-dev libnuma-
dev libpcre2-dev libmaxminddb-dev librrd-dev
Code 10.34
We are going to use deep packet inspection software called NFStream26.
It builds on a C library27 that must be compiled (this happens as part of
installing the Python module); Python in this case is a
lightweight wrapper that translates the low-level C API to Python.
Let us check the following code on how to install the Python package28.
1. $ pip install nfstream
Code 10.35
It is quite a simple process if you have installed all the prerequisite libraries
and there was no problem compiling the C extension as part of Code 10.35.
Before we start using our wrapper, we have to find out which network
interfaces are available on our Linux machine. I shall highlight something
here: the Linux machine we are about to use must be the router. That
means all the inside-outside traffic in our local network must be
masqueraded (the old name used in iptables). With that said, let us check
which network interface is our main gateway to the outside world. To
validate this, we can run the following command.
1. $ ip a
2.
3. 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state
UNKNOWN group default qlen 1000
4. link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
5. inet 127.0.0.1/8 scope host lo
6. valid_lft forever preferred_lft forever
7. inet6 ::1/128 scope host
8. valid_lft forever preferred_lft forever
9. 2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdi
sc fq_codel state UP group default qlen 1000
10. link/ether 34:17:eb:a4:6c:3a brd ff:ff:ff:ff:ff:ff
11. inet 192.168.0.104/24 brd 192.168.0.255 scope global dynamic eno1
12. valid_lft 6453sec preferred_lft 6453sec
13. inet6 fe80::3617:ebff:fea4:6c3a/64 scope link
14. valid_lft forever preferred_lft forever
15. (...)
Code 10.36
We can see that interface eno1 in Code 10.36 is the one we will
use for packet inspection. Let us proceed with the following example,
where you can see how to execute live (streamed) packet inspection
on the mentioned interface.
1. from nfstream import NFStreamer
2.
3. def main():
4. my_streamer = NFStreamer(source="eno1",
5. statistical_analysis=False,
6. idle_timeout=1
7. )
8. print('start printing')
9. for flow in my_streamer:
10. print(flow)
11. print('.')
12.
13. main()
Code 10.37
This code snippet is quite simple to follow: we initialize the packet
analyzer (line 4) for interface eno1 in streaming mode. This means that every
single packet going through this interface will be parsed and packed into an
nfstream structure like in the following example.
1. NFlow(id=30,
2. expiration_id=0,
3. src_ip=192.168.0.104,
4. src_mac=34:17:eb:a4:6c:3a,
5. src_oui=34:17:eb,
6. src_port=46181,
7. dst_ip=1.0.0.1,
8. dst_mac=78:8c:b5:ae:07:c2,
9. dst_oui=78:8c:b5,
10. dst_port=53,
11. protocol=17,
12. ip_version=4,
13. vlan_id=0,
14. (...)
15. application_name=DNS.Google,
16. application_category_name=Network,
17. requested_server_name=google.com,
18. )
Code 10.38
We can see that the structure contains a lot of important information that we
can use for tracking unwanted user activities: the destination port, the
application name, and the requested server name. This is enough to check
what kind of remote resource the user wants to access.
Unfortunately, processing interface data this way is very inefficient, as it
requires every packet to be checked and converted into a Python structure. If
we are planning to build a parser for web content, we can apply a filter to the
incoming stream and drop unwanted data. Let us look at the following
example to learn how to do it.
1. ff = "tcp port 80"
2. my_streamer = NFStreamer(source="eno1",
3. statistical_analysis=False,
4. idle_timeout=1,
5. bpf_filter=ff,
6. )
Code 10.39
We can see that we are filtering the packets to get only those that go out or
come in on the HTTP port (line 5). As a result of introducing such a
filter, our code snippet prints only HTTP traffic-related structures, like in
the following example.
1. NFlow(
2. src_ip=192.168.0.104,
3. src_mac=34:17:eb:a4:6c:3a,
4. src_oui=34:17:eb,
5. src_port=51684,
6. dst_ip=80.72.192.41,
7. dst_mac=78:8c:b5:ae:07:c2,
8. dst_oui=78:8c:b5,
9. dst_port=80,
10. protocol=6,
11. ip_version=4,
12. vlan_id=0,
13. tunnel_id=0,
14. (..)
15. application_name=HTTP,
16. application_category_name=Web,
17. requested_server_name=inet.pl,
18. user_agent=curl/7.68.0,
19. content_type=text/html
20. )
Code 10.40
We can see that the current structure has more details: which protocol was
used (line 15) and its port (line 9). Besides, nfstream will
try to extract, whenever possible, the user agent of the application (line 18)
that the user used for making the query.
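To reproduce a flow like the one above, you can generate plain HTTP traffic from a machine in the monitored network, for example with curl (the site shown is the one visible in the output of Code 10.40):
1. $ curl http://inet.pl/
The user_agent field of the resulting flow should then report curl and its version, just like in Code 10.40.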
Now that we understand how to analyze with Python what is going on with
our network interface and what kind of traffic goes through our main
router interface, we can start putting all unwanted traffic on the
blacklist. Let us check the following code on how to achieve this.
1. import os
2. from nfstream import NFStreamer
3.
4.
5. def load_blacklist_ips():
6. if os.path.exists('blacklist_ips.txt'):
7. with open('blacklist_ips.txt', 'r') as f:
8. return f.read().split('\n')
9. return []
10.
11. def load_blacklist_domains():
12. if os.path.exists('blacklist.txt'):
13. with open('blacklist.txt', 'r') as f:
14. return f.read().split('\n')
15. return []
16.
17. def dump_ips():
18. with open('blacklist_ips.txt', 'w') as f:
19. return f.write('\n'.join(BLACKLIST_IPS))
20.
21. BLACKLIST = load_blacklist_domains()
22. BLACKLIST_IPS = load_blacklist_ips()
23.
24. def filter_blacklist(frame):
25. if frame.requested_server_name.strip() in BLACKLIST and frame.dst
_ip not in BLACKLIST_IPS:
26. BLACKLIST_IPS.append(frame.dst_ip)
27. dump_ips()
28.
29. def main():
30. ff = "tcp port 80"
31. my_streamer = NFStreamer(source="eno1", statistical_analysis=Fals
e, idle_timeout=1, bpf_filter=ff,)
32. print("start printing")
33. for flow in my_streamer:
34. filter_blacklist(flow)
35.
36. main()
Code 10.41

We added two main functions: loading the list of blocked domains (lines 11-15)
and loading IP addresses detected as blocked (lines 5-9). Next, we parse the
stream of packets as in previous examples, but this time we call the
function filter_blacklist, which checks if the destination domain is on the
blacklist of domains (line 25). If it is, we check whether the destination IP
address has already been detected (line 25). When we detect a malicious
domain whose IP address has not been seen before, we update the list
of blacklisted IPs (line 26) and dump it to the file (line 27).
You may wonder why we check the IP address if the domain already exists on
the blacklist. The reason is that web servers very often sit behind load
balancers, which means that behind one domain there can be many servers with
different IP addresses; if the domain should be blacklisted, we have to
filter all of its IP addresses.
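You can observe this effect yourself by resolving a busy domain and listing every IPv4 address returned. The following is a minimal sketch using only the standard library; the domain is just an example:
1. import socket
2.
3. # Collect all IPv4 addresses currently advertised for one domain name
4. infos = socket.getaddrinfo("wikipedia.org", 443, family=socket.AF_INET)
5. print({info[4][0] for info in infos})
Depending on where and when you run it, you may see one address or several, which is exactly why we track every IP we observe for a blacklisted domain.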

Challenges with encrypted websites

In our previous examples, we have been using HTTP traffic on port 80. This
is quite easy to analyze and check since this protocol does not involve any
encryption. We can even analyze what type of content is transferred over this
protocol. Things get complicated when the user tries to open an SSL-driven
website (default port 443).
We need to modify our example so that the previously applied filter is also
applicable to HTTPS data. Let us check the following example.
1. def main():
2. ff = "tcp port 80 or 443"
3. my_streamer = NFStreamer(source="eno1", statistical_analysis=Fals
e, idle_timeout=1, bpf_filter=ff,)
4. print("start printing")
5. for flow in my_streamer:
6. filter_blacklist(flow)
Code 10.42
We updated our example to inspect traffic on ports 80 and 443
(line 2). In this case, we can check what content the user wants to open; by
content, we mean the destination server and URL the user is trying to
reach. We will not be able to check what content the user is fetching from an
external resource if this data goes over HTTPS, since it is strongly encrypted.
Understanding how this encryption can be broken and decrypted is certainly
beyond the scope of this book, so we simplify the entire process by checking
which server the user wants to reach; if it is on the blacklist, we stop
them.
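Since for encrypted flows the payload is opaque, a simplified filter can fall back to matching only the server name, which nfstream extracts from the TLS handshake (SNI). The following is a sketch of that idea, reusing the helpers from Code 10.41; it is an illustration, not a complete replacement for the HTTP logic:
1. def filter_blacklist_https(frame):
2.     # For HTTPS we cannot inspect content_type, so we rely on the
3.     # requested server name taken from the TLS handshake (SNI)
4.     if frame.dst_port == 443 and frame.requested_server_name.strip() in BLACKLIST:
5.         if frame.dst_ip not in BLACKLIST_IPS:
6.             BLACKLIST_IPS.append(frame.dst_ip)
7.             dump_ips()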
To block specific content, and not just access to the requested
server, we can check in the packet (Code 10.40, line 19) what kind of
content the user is trying to access; in this case, it is an HTML page. By
knowing which MIME types29 we want to filter from specific websites, we can
use the following code.
1. BLOCKED_MIME_TYPES = ['video/x-msvideo', 'video/mp4']
2.
3. def filter_blacklist(frame):
4. if frame.requested_server_name.strip() in BLACKLIST and frame.dst
_ip not in BLACKLIST_IPS:
5. if frame.content_type in BLOCKED_MIME_TYPES:
6. print(f'blacklisted {frame.requested_server_name}')
7. BLACKLIST_IPS.append(frame.dst_ip)
8. dump_ips()
Code 10.43

Filtering web content


So far, we have seen examples of what kind of content should be blocked, and
as a result we have a list of IP addresses to block. Now it is
time to start effectively preventing those IP addresses from being accessed.
To build a firewall filter we will use a Linux kernel module called
Netfilter30. We will install the Python package pyroute231 as in the following
example.
1. $ pip install pyroute2
Code 10.44
With it installed, we are going to use its nftables32 functionality to
actively block specific servers from being accessed. Let us see a
simple example of how to add specific IP addresses to the blocked list.
1. from pyroute2.netlink.nfnetlink.nftsocket import NFPROTO_IPV4
2. from pyroute2.nftables.main import NFTables
3.
4. def main():
5.     with NFTables(nfgen_family=NFPROTO_IPV4) as nft:
6.         nft.table("add", name="filter")
7.         my_set = nft.sets(
8.             "add", table="filter", name="test_filter", key_type="ipv4_addr",
9.             comment="my test fw filter", timeout=0)
10.
11.         nft.set_elems(
12.             "add",
13.             table="filter",
14.             set="test_filter",
15.             elements={"10.65.0.4", "10.65.0.2"},
16.         )
17. main()
Code 10.45
Creating a table for filter elements (line 6) is quite simple, and for this
example we did not use any timeout33. Next, once the filter table and its set
are created, we add the filter elements (lines 11-16). When we want a timeout,
after which the rule expires, we can check the following code.
1. from pyroute2.nftables.main import NFTSetElem
2.
3. nft.set_elems(
4. "add",
5. set=my_set,
6. elements={NFTSetElem(value="10.65.0.9", timeout=12000)},
7. )
Code 10.46
In this example, we add an element for filtering (an IP address) in line 6 with
a timeout of 12000 ms, meaning that after 12 seconds it vanishes and we can no
longer filter that IP address.
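If the nft command-line tool is installed on the router, you can verify from the shell that the table, set, and elements created from Python are visible to the kernel (the output lists all tables, sets, and their remaining element timeouts):
1. $ sudo nft list ruleset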
In the following example, we will see how we can use our blacklist filter and
how to add filtering elements as in Code 10.46.
1. import os
2. import time
3. from threading import Thread
4.
5. from pyroute2.netlink.nfnetlink.nftsocket import NFPROTO_IPV4
6. from pyroute2.nftables.main import NFTables
7. from pyroute2.nftables.main import NFTSetElem
8.
9.
10. class IPUpdater(Thread):
11.     def __init__(self, nft, filter_set):
12.         super().__init__()
13.         self.nft = nft
14.         self.filter_set = filter_set
15.         self.running = True
16.         self.filter_elements = []
17.
18.     def load_blacklist_ips(self):
19.         if os.path.exists("blacklist_ips.txt"):
20.             with open("blacklist_ips.txt", "r") as f:
21.                 return f.read().split("\n")
22.         return []
23.
24.     def run(self):
25.         while self.running:
26.             for ip in self.load_blacklist_ips():
27.                 if ip not in self.filter_elements:
28.                     print(f"Adding ip {ip} to filtered list {self.filter_set}")
29.                     self.add_element(ip)
30.                     self.filter_elements.append(ip)
31.             time.sleep(10)
32.             self.filter_elements = []
33.
34.     def add_element(self, ip_address):
35.         self.nft.set_elems(
36.             "add",
37.             set=self.filter_set,
38.             elements={NFTSetElem(value=ip_address, timeout=10000)},
39.         )
40.
41.     def stop(self):
42.         self.running = False
43.
44.
45. def main():
46.     with NFTables(nfgen_family=NFPROTO_IPV4) as nft:
47.         nft.table("add", name="filter")
48.         my_set = nft.sets(
49.             "add", table="filter", name="test_filter", key_type="ipv4_addr",
50.             comment="my test fw filter", timeout=0
51.         )
52.         print("Starting IPs analyzer")
53.         ip_updater = IPUpdater(nft, my_set)
54.         ip_updater.start()
55.
56. main()
Code 10.47
We used the already known threading technique (class IPUpdater) to keep
re-reading the blacklist of IP addresses that we keep updating with Code
10.41. Additionally, we added a check (line 27) that prevents re-adding an
IP address to the firewall filter once it already exists there, and we add
each IP filter element only for 10 seconds. Code execution is then
paused for 10 seconds (line 31), after which we clear the list (line 32). This
technique helps us purge the filter list and reload it with a fresh list of
blocked IP addresses when the code in lines 26-30 runs again. This simple
trick makes sure that whenever the blacklist of IP addresses (file
blacklist_ips.txt) gets updated, it is reflected in the firewall filter.

Conclusion
In this chapter, we learned how to design simple yet powerful routers using
Linux and the netfilter34 module. Next, we analyzed how websites are reached
by client machines using DHCP address assignment and the DNS service.
Once we understood that part, we went deeper into networking, where we
learned how to analyze network packets and implement very light yet
very efficient firewall filter rules. We also learned what kind of challenges
we may face with encrypted web traffic like HTTPS.
In the next chapter, we will learn how to use Python to manage calendars
and how to combine a few calendars into one. This will teach us to build a
very efficient calendar tool that can be used with your favorite calendar
application.

1. https://ieeexplore.ieee.org/document/9822234
2. https://docs.oracle.com/cd/E19455-01/806-0916/6ja85398m/index.html
3. https://standards.ieee.org/faqs/regauth/
4. https://www.techtarget.com/searchnetworking/definition/DHCP
5. https://github.com/search?
q=python%20dhcp%20server&type=repositories
6. https://pypi.org/search/?q=dhcp+server&o=
7. https://mariadb.org/download/?t=mariadb&p=mariadb&r=11.2.2
8. https://www.sqlalchemy.org
9. https://docs.sqlalchemy.org/en/20/orm/quickstart.html
10. https://www.aviransplace.com/post/safe-database-migration-pattern-
without-downtime-1
11. https://alembic.sqlalchemy.org/en/latest/
12. https://docs.sqlalchemy.org/en/20/orm/session_basics.html
13. https://datatracker.ietf.org/doc/html/rfc768
14. https://pypi.org/project/coloredlogs/
15. https://learn.microsoft.com/en-us/windows-
server/troubleshoot/dynamic-host-configuration-protocol-
basics#dhcpdiscover
16. https://learn.microsoft.com/en-us/windows-
server/troubleshoot/dynamic-host-configuration-protocol-
basics#dhcprequest
17. https://support.apple.com/guide/security/wi-fi-privacy-
secb9cb3140c/web
18. https://blogs.cisco.com/networking/randomized-and-changing-mac-
rcm
19. https://datatracker.ietf.org/doc/html/rfc1034
20. https://datatracker.ietf.org/doc/html/rfc1034#autoid-27
21. https://datatracker.ietf.org/doc/html/rfc1035#section-4.2
22. https://linuxize.com/post/how-to-use-dig-command-to-query-dns-in-
linux/
23. https://datatracker.ietf.org/doc/html/rfc5782
24. https://github.com/hagezi/dns-blocklists
25. https://ubuntu.com
26. https://www.nfstream.org
27. https://github.com/ntop/nDPI
28. https://github.com/nfstream/nfstream
29. https://developer.mozilla.org/en-
US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Common_types
30. https://www.netfilter.org/projects/nftables/index.html
31. https://github.com/svinota/pyroute2
32. https://www.netfilter.org/projects/nftables/manpage.html
33. https://wiki.nftables.org/wiki-nftables/index.php/Element_timeouts
34. https://www.netfilter.org/
CHAPTER 11
Centralizing All Calendars

Introduction
Being a digital nomad nowadays can be lots of fun and bring lots of benefits,
for sure. There is one thing, though, that can be very challenging for anyone
working on multiple projects: there are many agendas and calendars to cover.
You may think that a desktop calendar application can solve the problem by
configuring multiple calendars in it. That is partially true. The issue arises
when we try to switch between devices, or simply want to share with others
when we are available for a meeting among all the other meetings across all
calendars.

Structure
This chapter will cover the following topics:
Building subscriber tool for web calendars
Google
Office 365
iCal
Calendar parser
Subscribe locally
Synchronize with external calendar
Objectives
In this chapter, we will build a tool that helps us address the problem of
multiple calendars: we will be able to synchronize our busy day
across many calendars using Python. We will learn how to do this
with the two most popular platforms, Google1 and Office 3652, and we will also
see how to work with offline calendar files.

Building subscriber tool for web calendars


A local OS calendar application can subscribe to remote calendars (provided
as a web resource) or to a local calendar provided as a flat file (iCal). In
the following subchapters we are going to learn how we can integrate with the
two most popular web calendars, as well as with a local file.

Google
Before we start subscribing to the calendar, we need to configure the API and
credentials for Google services. First and foremost, we need to sign up for
Google's developer platform3. When the account is ready, we shall
create our very first project and mark it as internal; this guarantees
that we are the only authorized users of this application for the time
being.
The next step is to create OAuth credentials by following the Google guide4.
Another important step is to add the Google Calendar API to our new project.
Figure 11.1: Enabling Google calendar access
When all the setup is done, it is time to start testing it. To be able to use
Google Calendar, we have to install the following Python modules.
1. $ pip install gcsa beautiful-date
Code 11.1
Once the modules are installed, and assuming some future events
already exist in our Google calendar, we can list them by executing the
following code.
1. from gcsa.google_calendar import GoogleCalendar
2.
3.
4. gc = GoogleCalendar(credentials_path='/var/tmp/credentials.json')
5. for event in gc.get_events():
6.     print(event)
Code 11.2
Please notice that in line 4, when we initialize the calendar client, we assume
that the credentials JSON file we got from the Google console is saved under
/var/tmp/credentials.json.
After running the code, we should get output like in the following example.
Since we are not putting any limit on the query, it will print out all the
future events.
1. 2023-03-20 18:00:00+02:00 - Abc Meeting
2. 2023-04-19 17:30:00+02:00 - 123 Meeting!
3. 2023-07-31 17:30:00+01:00 - wow Meeting
Code 11.3
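If you only need a bounded window instead of every future event, the query can be limited. The following is a minimal sketch, assuming the time_min/time_max parameters of gcsa's get_events and the date syntax of the beautiful-date package we installed in Code 11.1 (the window shown is arbitrary):
1. from gcsa.google_calendar import GoogleCalendar
2. from beautiful_date import Jan, Dec
3.
4. gc = GoogleCalendar(credentials_path='/var/tmp/credentials.json')
5. # Only list events between the two dates instead of all future events
6. for event in gc.get_events(time_min=1/Jan/2024, time_max=31/Dec/2024):
7.     print(event)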
What we should also notice is that after running the script, the default
browser opens and you are asked to grant permissions to your user data in the
Google space, as shown in the following figure.
Figure 11.2: Enabling personal calendar access via OAuth authentication
This step is necessary to grant access to our calendar and retrieve the API
token. It is quite important to notice that this step fetches a token that
will expire at some point. If it does, please allow access to the calendar
again. In normal use, the Google calendar simple API5 module has built-in
functionality to refresh the access token. We will use this in the following
parts of the chapter.

Office 365
First, we shall make sure we have an account with the Microsoft Office 3656 or
Hotmail7 service. The next step is to install the Microsoft Exchange RESTful8
API module9.
1. $ pip install O365
The next step is to follow the process10 of configuring an application in the
Azure11 ecosystem. Once we have created the application and configured it by
following the mentioned GitHub guide, we need to make sure that we have added
scope parameters as shown in the following figure:

Figure 11.3: Example of properly configured scoped access parameters


Now, when everything is configured and we have valid scopes for
reading calendar events, let us check how to authenticate and get the list of
events with the following script.
1. import click
2. from O365 import Account, MSGraphProtocol
3.
4.
5. def get_events(client_id, secret_id):
6.     credentials = (client_id, secret_id)
7.     protocol = MSGraphProtocol()
8.     scopes = ['https://graph.microsoft.com/.default']
9.     account = Account(credentials, protocol=protocol)
10.     if not account.is_authenticated:
11.         if account.authenticate(scopes=scopes):
12.             print("Authenticated")
13.     schedule1 = account.schedule(resource='<some user>@hotmail.com')
14.     calendar1 = schedule1.get_default_calendar()
15.     events = calendar1.get_events(include_recurring=False)
16.     print('events:')
17.     for ev in events:
18.         print(ev)
19.
20.
21. @click.command()
22. @click.option("--client-id", type=str, help="client ID", required=True)
23. @click.option("--secret", type=str, help="client secret", required=True)
24. def main(client_id, secret):
25.     get_events(client_id, secret)
26.
27.
28. if __name__ == '__main__':
29.     main()
Code 11.4
We used the click12 module as usual to read the client ID and client secret
that we generated when configuring the new application in the Azure
system. We can also see that we use the default scope (line 8), which is
translated to the scopes we configured in Figure 11.3.
We also added a mechanism to verify and refresh the token if needed (line 10).
If required, we authenticate (lines 11-12) and fetch a new token, with which
we call the Microsoft API to get the given user's (line 13) calendar events.
Let us check in the following code how this is going to work.
1. $ python office_tester.py --client-id <retrieved client id> --secret <client secret>
2.
3. Visit the following url to give consent:
4. https://login.microsoftonline.com/common/oauth2/v2.0/authorize?response_type=code...
5. Paste the authenticated url here:
6. https://login.microsoftonline.com/common/oauth2/nativeclient?code=...
7. Authentication Flow Completed. Oauth Access Token Stored.
8. You can now use the API.
9. Authenticated
10. events:
11. Subject: <some event 1> (on: 2023-07-30 from: 16:15:00 to: 16:55:00)
12. Subject: <some event 2> (on: 2023-11-11 from: 18:00:00 to: 18:30:00)
Code 11.5
After calling the code (line 1), we can notice that the library asks us to open
a URL (line 4) in the browser. After opening it, since it is an OAuth13 flow,
you will be asked to authenticate and accept access privileges for our test
application, as shown in the following figure.

Figure 11.4: Example of accepting authentication and access privileges for test application
Only on the very first run will you see the accept permissions screen, after
which you will be redirected to a URL that you have to copy in full and paste
into our screen, then hit enter to continue. Now we will be able to fetch the
token and download all the events from your calendar (lines 11-12). When the
script is executed again, it will not ask to re-login since the token is still
valid, so lines 3-9 will not happen unless the token expires and
authentication is required again. This approach is very similar to the one
described in the subchapter about Google calendar.

iCal
Another use case may be the need to import a calendar from third-party
software. A very popular standard is iCalendar14. To simulate an iCalendar
file, we can use any modern calendar tool, for instance the system calendar
application. To be able to import and process such a file, we are going to
install the Python module15 as in the following example.
1. $ pip install icalendar coloredlogs pytz
Code 11.6
Example code for loading and parsing an iCal file is shown below. To follow
along, create a few example entries in your desktop calendar application and
export the calendar as a myevents.ics file.
1. import icalendar
2.
3.
4. ics_file = "myevents.ics"
5. with open(ics_file, 'rb') as f:
6.     calendar = icalendar.Calendar.from_ical(f.read())
7.
8. for event in calendar.walk('VEVENT'):
9.     print('-'*10)
10.     print(event.get("name"))
11.     print(event.get("SUMMARY"))
Code 11.7
We can quickly notice that an iCalendar file parses much the same as the
popular Google and Office 365 web calendars, with the difference that it is a
flat file, so we do not have to use a complex web authentication flow.

Calendar parser
In the previous subchapters we learned how to load and parse data from three
types of external calendars. Now we are going to use that knowledge to
collect data from the external calendars and merge it into one single
calendar file (ics). Let us check in the following example how to achieve
this.
1. import coloredlogs
2. import logging
3. import os
4. import pytz
5.
6. from datetime import datetime
7. from icalendar import Calendar, Event
8. from gcsa.google_calendar import GoogleCalendar
9.
10.
11. CALENDAR_FILE = "/var/tmp/calendar.ics"
12.
13. class MyCalendar:
14.     cal = None
15.     def __init__(self):
16.         self.read()
17.         if not self.cal:
18.             self.cal = Calendar()
19.             self.cal.add("prodid", "-//My calendar product//mxm.dk//")
20.             self.cal.add("version", "2.0")
21.
22.     def sync_with_google(self):
23.         pass
24.
25.     def sync_with_office365(self):
26.         pass
27.
28.     def sync_with_file(self, file_path):
29.         pass
30.
31.     def create_event(self, event_dict):
32.         event = Event()
33.         for k, v in event_dict.items():
34.             event.add(k, v)
35.         return event
36.
37.     def find_event(self, event_name, event_start):
38.         for component in self.cal.walk():
39.             if component.name.upper() == "VEVENT" and component.get('name') == event_name and component.decoded("dtstart") == event_start:
40.                 return component
41.
42.     def read(self):
43.         if os.path.exists(CALENDAR_FILE):
44.             with open(CALENDAR_FILE, 'rb') as f:
45.                 self.cal = Calendar.from_ical(f.read())
46.
47.     def save(self):
48.         with open(CALENDAR_FILE, "wb") as f:
49.             f.write(self.cal.to_ical())
50.
51.
52. if __name__ == '__main__':
53.     coloredlogs.install(level=logging.DEBUG)
54.     c = MyCalendar()
55.     c.sync_with_google()
56.     c.sync_with_office365()
57.     c.sync_with_file('some-file/path/calendar.ics')
58.     c.save()
Code 11.8
We try to load the existing calendar file (the destination calendar) in the
constructor of our calendar-syncing class MyCalendar (line 16). When this
fails, we assume that we need to create a new calendar instance and give it
some backwards-compatibility attributes (lines 17-20) so it can be properly
processed by calendar applications.
We also added a method for creating an event object (lines 31-34). To simplify
its flow, we assume that the method's argument is a dictionary, and we simply
add the dictionary entries to the event object as event attributes (line 34).
To visualize this, let us check the following example dictionary and the way
we call the create_event method.
1. record = {
2.     "summary": event.summary,
3.     "dtstart": event.start,
4.     "dtend": event.end,
5.     "dtstamp": event.created,
6.     "uid": event.event_id
7. }
8. record = self.create_event(record)
Code 11.9
We can see that the dictionary keys are the attributes of the event object, so
it is quite easy to build such a dictionary in a way that lets us easily sync
all the arguments from an external calendar.
The other important method is the one that helps us find out whether an event
that we are trying to add to or remove from our local calendar already exists
(Code 11.8, lines 37-40). Unfortunately, performance-wise this method uses the
walk method, so traversing a very massive and busy calendar will not be as
time-efficient as we would like.
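If the destination calendar grows large, one way to soften this cost is to build a lookup index once and consult it instead of walking the whole tree every time. The sketch below is only an illustration of the idea; build_event_index and find_event_indexed are hypothetical helpers, not part of Code 11.8:
1. def build_event_index(self):
2.     # Map (summary, dtstart) -> component for constant-time lookups
3.     self._index = {}
4.     for component in self.cal.walk():
5.         if component.name.upper() == "VEVENT":
6.             key = (component.get("summary"), component.decoded("dtstart"))
7.             self._index[key] = component
8.
9. def find_event_indexed(self, event_name, event_start):
10.     return self._index.get((event_name, event_start))
The index must, of course, be rebuilt (or updated) whenever components are added to the calendar.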
We also introduced (Code 11.8, lines 55-57) the calls with which we are going
to synchronize events from the external calendars.
Let us check the following example to see how we take care of events coming
from the Google service.
1. def __init__(self):
2.     self.gc = GoogleCalendar(credentials_path="google_credentials.json")
3.
4. def sync_with_google(self):
5.     for event in self.gc.get_events():
6.         component = self.find_event(event.summary, event.start)
7.         if not component:
8.             record = {
9.                 "summary": event.summary,
10.                 "dtstart": event.start,
11.                 "dtend": event.end,
12.                 "dtstamp": event.created,
13.                 "uid": event.event_id
14.             }
15.             record = self.create_event(record)
16.             logging.info("Adding calendar record to database")
17.             self.cal.add_component(record)
Code 11.10
We added to the class constructor an instance of the Google calendar service
(line 2). This will trigger Google authentication through the browser, as we
learned with Code 11.2. The next part is the actual call to fetch calendar
data (line 5), going through each event and trying to find it in our local
copy (line 6).
When we do not find an instance of the event in the local calendar, we add
such an entry (lines 7-17). Then, as shown in Code 11.8 (line 58), we save the
newly updated calendar to the local file copy.
In the next example, we will see how we can achieve the same thing with the
method supporting the Office 365 connection.
1. def sync_with_office365(self):
2.     credentials = (client_id, secret_id)
3.     protocol = MSGraphProtocol()
4.     scopes = ["https://graph.microsoft.com/.default"]
5.     account = Account(credentials, protocol=protocol)
6.     if not account.is_authenticated:
7.         if account.authenticate(scopes=scopes):
8.             print("Authenticated")
9.     schedule1 = account.schedule(resource=f"{user}@hotmail.com")
10.     calendar1 = schedule1.get_default_calendar()
11.     for event in calendar1.get_events(include_recurring=False):
12.         component = self.find_event(event.subject, event.start)
13.         if not component:
14.             record = {
15.                 "summary": event.subject,
16.                 "dtstart": event.start,
17.                 "dtend": event.end,
18.                 "dtstamp": event.created,
19.                 "uid": event.object_id,
20.             }
21.             record = self.create_event(record)
22.             logging.info("Adding calendar record to database")
23.             self.cal.add_component(record)
Code 11.11
We reused Code 11.4 to synchronize with the Office 365 service. The main
improvement we introduce here is lines 14-20, where we build the dictionary
that we use the same way as we did with Google in Code 11.10.
We can notice that nowhere in our synchronizing class (Code 11.8) do we keep
the credentials that we have to use for authentication (line 5). We are going
to modify our code so that the authentication part moves to a more reusable
class property, as in the following code.
1. @property
2. def settings(self):
3.     if not hasattr(self, '_config'):
4.         self._config = configparser.ConfigParser()
5.         self._config.read('sync.ini')
6.     return self._config
Code 11.12
We created a property that checks if the internal variable storing the
settings already exists (line 3); if so, we return its value (line 6). If such
a variable does not exist, we initialize the config parser, read the
configuration, and then return its content (lines 4-6). For reading the
authentication credentials we are going to use a Python ini16 configuration
file (sync.ini), which is shown below.
1. [google]
2. credentials_path = google_credentials.json
3.
4. [office365]
5. user = some.user
6. client_id = 28xxx-yyy-zzz
7. secret_id = some-secre-tkey-got-from-azure
Code 11.13
We added two main sections to the configuration ini file: one used by the
Office 365 connection (lines 4-7) and one for Google authentication (lines
1-2). To be able to use those credentials in our main class file, we refactor
the main class in the following way.
1. class MyCalendar:
2.
3.     def __init__(self):
4.         self.gc = GoogleCalendar(credentials_path=self.settings['google']['credentials_path'])
5.
6.     @property
7.     def account(self):
8.         credentials = (
9.             self.settings['office365']['client_id'],
10.             self.settings['office365']['secret_id']
11.         )
12.         protocol = MSGraphProtocol()
13.         scopes = ["https://graph.microsoft.com/.default"]
14.         account = Account(credentials, protocol=protocol)
15.         if not account.is_authenticated:
16.             if account.authenticate(scopes=scopes):
17.                 logging.info("Office 365 Authenticated")
18.         return account
Code 11.14
In the class constructor we left the rest of the code we already introduced
(Code 11.8) intact, albeit we updated the part of the object constructor that
creates the Google calendar instance (line 4). We use the configuration
instance in this case instead of a hardcoded path; at any point in time, we
can update the location of the credentials file without changing the code.
The other section we added to the configuration file is the office365 part,
which we use in the refactoring (Code 11.14, lines 6-18). We used the same
dynamic-property technique as in Code 11.12; in this case, we initialize and
authenticate the Office 365 client.
In the following example, we will see how to refactor our Office 365
synchronization method to use the new approach.
1. def sync_with_office365(self):
2.     user = self.settings["office365"]["user"]
3.     schedule1 = self.account.schedule(resource=f"{user}@hotmail.com")
4.     calendar1 = schedule1.get_default_calendar()
5.     for event in calendar1.get_events(include_recurring=False):
6.         component = self.find_event(event.subject, event.start)
7.         if not component:
8.             record = {
9.                 "summary": event.subject,
10.                 "dtstart": event.start,
11.                 "dtend": event.end,
12.                 "dtstamp": event.created,
13.                 "uid": event.object_id,
14.             }
15.             record = self.create_event(record)
16.             logging.info("Adding Office365 calendar record to database")
17.             self.cal.add_component(record)
Code 11.15

We now have a local copy of all the events merged from the Google and Office
365 calendars in one place. Next, let us move on to parsing a static ics file.
In this case, we have to add the lines below to the ini configuration file.
1. [ics]
2. path = myevents.ics
Code 11.16
The next step is to update the method sync_with_file, which is going to use
that configuration and parse the myevents.ics file.
1. def sync_with_file(self):
2.     with open(self.settings["ics"]["path"], "rb") as f:
3.         calendar = Calendar.from_ical(f.read())
4.     for event in calendar.walk("VEVENT"):
5.         component = self.find_event(event["SUMMARY"], event["DTSTART"])
6.         if not component:
7.             record = {
8.                 "summary": event["SUMMARY"],
9.                 "dtstart": event["DTSTART"],
10.                 "dtend": event["DTEND"],
11.                 "dtstamp": event["DTSTAMP"],
12.                 "uid": event["UID"],
13.             }
14.             record = self.create_event(record)
15.             logging.info("Adding ics calendar record to database")
16.             self.cal.add_component(record)
Code 11.17
We use the mechanics already known to us from the other methods: checking
whether the event already exists in the local calendar, and if not, creating a
dictionary with the event elements and adding it to the local calendar.
There is one more thing to improve: the method traversing the local calendar
to find an already existing event. Since we work with three different types of
calendars and there can be some discrepancies in date standards, we shall
update our find_event method as in the following example.
1. def find_event(self, event_name, event_start):
2.     for component in self.cal.walk():
3.         if (
4.             component.name.upper() == "VEVENT"
5.             and component.get("summary") == event_name
6.             and (component.get('dtstart') == event_start
7.                  or component.decoded("dtstart") == event_start)
8.         ):
9.             logging.debug("Found item")
10.             return component
Code 11.18
As mentioned, we updated this method with a fix to how we compare datetimes
when trying to find an existing event (lines 6-7). This way we are a bit more
flexible when comparing event datetimes, so we will not add an event that
already exists just because the time comparison failed.
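If more calendar sources join in, another option is to normalize every timestamp to a timezone-aware UTC datetime before comparing. A small standard-library helper along those lines might look as follows (normalize is a hypothetical helper, not used in the code above):
1. from datetime import datetime, date, time, timezone
2.
3. def normalize(value):
4.     # Promote plain dates to midnight datetimes
5.     if isinstance(value, date) and not isinstance(value, datetime):
6.         value = datetime.combine(value, time.min)
7.     # Treat naive datetimes as UTC, then convert everything to UTC
8.     if value.tzinfo is None:
9.         value = value.replace(tzinfo=timezone.utc)
10.     return value.astimezone(timezone.utc)
With such a helper, both sides of the comparison in find_event could be passed through normalize first, making the equality check independent of the source calendar's date format.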

Subscribe locally
So far, we have managed to build a tool that allows us to subscribe to
external calendars and sync their events into a local database. Now it would
be great if we could use this local calendar with a calendar application.
There is one option: we could import1718 the calendar file that our tool
generates. Unfortunately, importing a calendar from a file has one big
disadvantage: we import the file and update the local calendar application
with the events it brings, but when we run the resynchronization script and
update the ics file once again, we do not see those changes in the calendar
application.
To address this issue, we are going to build a simple yet powerful
subscription-driven service that our system calendar application may use for
synchronization. In this case, any change made in our local ics file will be
almost immediately reflected in the system calendar.
Figure 11.5: Adding remote calendar URL to local calendar application.
To be able to build such a service, as you can see in Figure 11.5, we must
write a web service that exposes a dynamic ics-standard-driven file.
For this we need to install the light web framework called Flask19.
1. $ pip install flask==3.0.0
Code 11.19
Let us create the following file and name it ics_service.py. This file is
going to be our calendar service, which we will improve in the next part of
this subchapter.
1. from flask import Flask
2.
3. app = Flask(__name__)
4.
5. @app.route("/")
6. def hello_world():
7.     return "<p>Calendar service</p>"
Code 11.20
Now, to start the service, we are going to use Flask's built-in HTTP
server, as in the following example.
1. $ flask --app ics_service run
2.  * Serving Flask app 'ics_service'
3.  * Debug mode: off
4. WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
5.  * Running on http://127.0.0.1:5000
6. Press CTRL+C to quit
Code 11.21
We can see that the service starts on localhost, listening on port 5000. If we
open the address http://127.0.0.1:5000, we will see the HTML body defined in
Code 11.20, lines 5-7.
The next part of building the service is to add a simple ics file handler that
exposes this file's content to the system calendar application. Let us check
in the following code how to approach that.
1. from flask import Flask
2. from flask import make_response
3. from manage_full import CALENDAR_FILE
4.
5. app = Flask(__name__)
6.
7. @app.route("/")
8. def hello_world():
9.     return "<p>Calendar service</p>"
10.
11. @app.route("/calendar")
12. def calendar():
13.     with open(CALENDAR_FILE, 'rb') as f:
14.         resp = make_response(f.read(), 200)
15.     resp.headers['content-type'] = 'text/calendar'
16.     return resp
Code 11.22
We can clearly see that we have added an endpoint /calendar that is
responsible for returning calendar events to the system calendar application.
Another thing worth noticing is the route declaration (lines 11-12) of the
method exposing the calendar: it is a GET method. This is quite important,
since this method will only allow us to read events from our busy calendar.
Let us add the endpoint we just created to our calendar application. For this
exercise we are going to use the Thunderbird20 application.
Once the application is installed, we need to click in menu → New → Calendar
→ On the network and fill the form as shown in the following example.

Figure 11.6: Creating new calendar


After this step, we are going to see all the calendar events that we have
synchronized so far with the external services. The next part is adding a PUT
method to our calendar web service that will allow the Thunderbird calendar
to update, create, or remove events. Let us check in the following example
how we can approach this.
1. import os
2.
3. from flask import Flask, make_response, request
4. from icalendar import Calendar
5.
6. from manage_full import CALENDAR_FILE
7.
8. app = Flask(__name__)
9.
10. @app.route("/")
11. def hello_world():
12.     return "<p>Calendar service</p>"
13.
14. def read_calendar():
15.     if not os.path.exists(CALENDAR_FILE):
16.         cal = Calendar()
17.         cal.add("prodid", "-//My calendar product//mxm.dk//")
18.         cal.add("version", "2.0")
19.         resp = make_response(cal.to_ical(), 200)
20.         resp.headers["content-type"] = "text/calendar"
21.         return resp
22.     with open(CALENDAR_FILE, "rb") as f:
23.         resp = make_response(f.read(), 200)
24.     resp.headers["content-type"] = "text/calendar"
25.     return resp
26.
27.
28. @app.route("/calendar", methods=['GET'])
29. def calendar():
30.     return read_calendar()
31.
32. @app.route("/calendar", methods=['PUT'])
33. def calendar_put():
34.     with open(CALENDAR_FILE, "wb") as f:
35.         f.write(request.data)
36.     return read_calendar()
Code 11.23
The first thing we can notice is that we unified the method for reading the
calendar file and moved it to one place (lines 14-25). The GET endpoint stays
mostly intact, besides now using the mentioned unified method.
We added a PUT method (line 33) that allows the calendar application to
manipulate events. You can notice that the calendar app sends the entire
calendar data (lines 34-35), with all the events in the calendar and not only
the one we just changed in the application. Thus, we replace the body of our
local ics file with the content incoming in the request (lines 34-35). Next,
we read the content of the updated file again (line 36) and return it to the
calendar application.
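Before wiring up a desktop client, you can exercise both endpoints from the command line; for example (the ics file name is illustrative):
1. $ curl http://127.0.0.1:5000/calendar
2. $ curl -X PUT --data-binary @myevents.ics -H "Content-Type: text/calendar" http://127.0.0.1:5000/calendar
The first call should return the current ics body, and the second should replace it with the uploaded file and echo the new content back.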
Synchronize with external calendar
So far, we have managed to synchronize event data from the external services
to the local copy and onward to the desktop application. What we are still
missing is an optional but very useful piece of functionality: pushing newly
updated events back to the original calendar.
Let us check in the following code how to keep track of events that were
added by the local calendar application.
1. from icalendar import Calendar
2. from manage_full import CALENDAR_FILE, MyCalendar
3.
4.
5. @app.route("/calendar", methods=["PUT"])
6. def calendar_put():
7.     events_to_push = []
8.     cal = MyCalendar()
9.     new_calendar = Calendar.from_ical(request.data)
10.     for component in new_calendar.walk():
11.         if component.name.upper() == "VEVENT" and not cal.find_event_by_id(component.get("uid")):
12.             events_to_push.append(component)
13.     with open(CALENDAR_FILE, "wb") as f:
14.         f.write(request.data)
15.     # now we have to push to original calendar what we have added
16.     if events_to_push:
17.         cal = MyCalendar()
18.         cal.push_events(events_to_push)
19.     return read_calendar()
Code 11.24
We modified the PUT method to receive the incoming calendar data as before
(Code 11.23), but this time we initialize a Calendar object and load the PUT
data into it (line 9). Next, we go event by event (line 10) and check whether
each element coming from the desktop calendar already exists in our local
copy (line 11). As you can notice, we are using a new method for that, called
find_event_by_id. Let us check in the following example how to find an
element by event id.
1. def find_event_by_id(self, event_id):
2.     for component in self.cal.walk():
3.         if component.name.upper() == "VEVENT" and component.get("uid") == event_id:
4.             logging.debug("Found item")
5.             return component
Code 11.25
Whenever we detect that a new event was sent by the calendar application, we
want to make sure we synchronize it with the external Google calendar. For
the sake of our demo we simplified this use case: we are going to use
bidirectional synchronization only for the main Google calendar service.
Let us check in the following example how we are going to push calendar data
to the Google service (Code 11.24, lines 17-18).
1. from gcsa.event import Event as google_event
2.
3.
4. def push_events(self, new_events):
5.     logging.info(f"Number of events to push: {len(new_events)}")
6.     for event in new_events:
7.         ev = google_event(
8.             event["SUMMARY"],
9.             start=event["DTSTART"],
10.             end=event["DTEND"],
11.         )
12.         self.gc.add_event(ev)
Code 11.26
This is quite a clean approach: we pass a list of all events that we want to
synchronize with Google, create an event object with the necessary arguments
(lines 7-11), and then push it to Google (line 12). Let us try to create an
event in the local desktop calendar application and see if this approach will
work.
1. 127.0.0.1 - - [02/Jan/2024 14:35:06] "GET /calendar HTTP/1.1" 200 -
2. ERROR:ics_service:Exception on /calendar [PUT]
3. Traceback (most recent call last):
4. (...)
5.   File "/Users/hubertpiotrowski/work/fun-with-python/chapter_11/manage_full.py", line 129, in push_events
6.     event["SUMMARY"],
7.   File "/Users/hubertpiotrowski/.virtualenvs/fun2/lib/python3.10/site-packages/icalendar/caselessdict.py", line 40, in __getitem__
8.     return super().__getitem__(key.upper())
9. KeyError: 'SUMMARY'
Code 11.27
We can see that when we created a new event in the Thunderbird calendar
application, it crashed our web service. As is easy to spot (line 9), the
reason for the crash is that when an event is first created, the calendar
application sends it with no name (no summary key).
To address this issue, we must modify push_events to remember which events we
are just creating, and only push them for synchronization when they are ready
to be saved to the external service.
1. import pickle
2.
3. SYNC_FILE = "/tmp/events_to_sync.bin"
4.
5. def push_events(self, new_events):
6.     logging.info(f"Number of events to push: {len(new_events)}")
7.     events_ids = [ev['UID'] for ev in new_events]
8.     with open(SYNC_FILE, 'wb') as f:
9.         f.write(pickle.dumps(events_ids))
Code 11.28
We do not synchronize the events to the Google calendar immediately; instead,
we dump the unique IDs (UIDs) of the events that we must synchronize to a
file (lines 7-9), so we can synchronize them later.
This approach has an advantage over synchronizing in real time: we can delay
the sync action, which helps us make sure that all the necessary changes are
already finished in the desktop application.
Let us check in the following example how to build a synchronization script
that checks when it is the right time to synchronize the newly created events:
1. import time
2. import coloredlogs
3. import logging
4. from os import path
5. from manage_full import SYNC_FILE
6. from manage_full import MyCalendar
7.
8. MAX_FILE_TTL = 3
9.
10. def check_file_ttl():
11.     if path.exists(SYNC_FILE):
12.         file_ttl = (time.time() - path.getmtime(SYNC_FILE)) / 60  # in minutes
13.         logging.debug(file_ttl)
14.         if file_ttl >= MAX_FILE_TTL:
15.             logging.info("Cleaning up and syncing calendar events to main calendar service...")
16.             cal = MyCalendar()
17.             cal.push_events_to_google()
18.     else:
19.         logging.warning(f"Sync events file {SYNC_FILE} does not exist yet, taking nap...")
20.     logging.warning("Not much to do at the moment, taking nap...")
21.     time.sleep(10)
22.
23.
24. if __name__ == "__main__":
25.     coloredlogs.install(level=logging.DEBUG)
26.     while True:
27.         check_file_ttl()
Code 11.29
We created a simple script that runs in an infinite loop (lines 26-27) and
checks whether SYNC_FILE is older than the defined TTL (lines 12-14). When it
is, all the updates coming from the desktop application are ready to be
synced, and we start pushing those events to the Google service (lines
15-17). Let us check in the following example how we process the event ids to
be synced (line 17).
1. def push_events_to_google(self):
2.     if os.path.exists(SYNC_FILE):
3.         with open(SYNC_FILE, 'rb') as f:
4.             for item in pickle.loads(f.read()):
5.                 event = self.find_event_by_id(item)
6.                 if event:
7.                     ev = google_event(
8.                         event.get("SUMMARY", "untitled event"),
9.                         start=event["DTSTART"],
10.                         end=event["DTEND"]
11.                     )
12.                     self.gc.add_event(ev)
13.                 else:
14.                     logging.warning(f"Seems like event ID {item} is missing in calendar")
15.         os.unlink(SYNC_FILE)
Code 11.30
We load all the event ids from the previously saved file (line 4), and then we
push those events that we find in the local calendar file (lines 5-12).

Conclusion
In this chapter, we learned how to use Python to synchronize calendar events
from external services to a local copy and expose that single managed
calendar to a desktop application. Next, we saw how to push newly created or
changed events from the desktop application to our local copy and then back
to the external service. This chapter surely helped us learn how to manage
our busy day with Python.
In the next chapter, we are going to learn how to use Python to build
sophisticated monitoring tools that we can use to check the availability of
external services.

1. https://developers.google.com/calendar/api/quickstart/python
2. https://learn.microsoft.com/en-us/previous-versions/office/office-365-
api/api/version-2.0/calendar-rest-operations
3. https://console.cloud.google.com
4.
https://developers.google.com/calendar/api/quickstart/python#authorize
_credentials_for_a_desktop_application
5. https://google-calendar-simple-
api.readthedocs.io/en/latest/getting_started.html
6. https://www.office.com
7. https://outlook.live.com
8.
https://en.wikipedia.org/wiki/Overview_of_RESTful_API_Description_
Languages
9. https://github.com/O365/python-o365
10. https://github.com/O365/python-o365#oauth-authentication
11.
https://entra.microsoft.com/#view/Microsoft_AAD_RegisteredApps/App
licationMenuBlade/~/Authentication/
12. https://click.palletsprojects.com/en/
13. https://learn.microsoft.com/en-us/graph/auth-v2-user?
context=graph%2Fapi%2F1.0&view=graph-rest-1.0&tabs=http
14. https://en.wikipedia.org/wiki/ICalendar
15. https://github.com/collective/icalendar
16. https://docs.python.org/3/library/configparser.html
17. https://support.apple.com/en-gb/guide/calendar/icl1023/mac
18. https://support.microsoft.com/en-us/office/import-or-subscribe-to-a-
calendar-in-outlook-com-or-outlook-on-the-web-cff1429c-5af6-41ec-
a5b4-74f2c278e98c?ui=en-us&rs=en-gb&ad=gb
19. https://flask.palletsprojects.com/en/3.0.x/quickstart/#a-minimal-
application
20. https://www.thunderbird.net/en-US/

Join our book’s Discord space


Join the book's Discord Workspace for Latest updates, Offers, Tech
happenings around the world, New Release and Sessions with the Authors:
https://discord.bpbonline.com

CHAPTER 12
Developing a Method for
Monitoring Websites

Introduction
The constant challenge for system administrators is keeping all online and
network assets consistently available. A crucial part of their everyday work
routine is having access to great monitoring tools: every single occurrence
of a service instance being faulty or inaccessible should be reported to the
system administrator.

Structure
This chapter will cover the following topics:
Brief introduction to TCP/UDP packets
Understanding how monitoring works
Concept of monitoring probes
Building reporting central
Design alarm system

Objectives
This chapter will show us how to build a simple yet efficient tool for
monitoring any kind of website. We will learn things like reporting the
availability of defined websites, reporting uptime, and catching those most
crucial moments when an important service is no longer accessible or its
access time is slow.

TCP/UDP
We have had a brief introduction to how TCP/IP packets work and how
they can be simulated with Python, but we did not talk much about the
difference between TCP and UDP. Before we look at how to support connections
and process TCP and UDP packets with Python, we need to understand the key
difference between them. TCP1, in a simplified picture, is a communication
standard in the network stack with guaranteed delivery and error-checked,
stable communication.
UDP2, instead, is slacker and does not guarantee packet delivery, since in
this standard we send a network packet without waiting for confirmation that
the destination party has received it.
Knowing this, let us try to simulate a simple Python implementation of a TCP
client and server in the following code.
1. import socket
2.
3. HOST = "localhost"
4. PORT = 62222
5.
6. with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
7. s.bind((HOST, PORT))
8. s.listen()
9. conn, addr = s.accept()
10. with conn:
11. print(f"Connected by client: {addr}")
12. while True:
13. data = conn.recv(1024)
14. if not data:
15. break
16. conn.sendall(data)
Code 12.1
We can see that we used the same technique that we learned in Chapter
10, Make a Program to Safeguard Websites, to build a TCP socket
server. We basically created a simple echo service, where we reply with the
message that was sent to the server (line 16). Let us check the following
example to see what the client side looks like.
1. import socket
2.
3. HOST = "localhost"
4. PORT = 62222
5.
6. with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
7. s.connect((HOST, PORT))
8. s.sendall(b"Hello, world")
9. data = s.recv(1024)
10.
11. print(f"Received {data!r}")
Code 12.2
We can see that we connect to the server on the same port the server is
listening on (lines 6-7). Once we are connected, we send a message (line 8)
and get the server response (line 9).
Let us check in the following example how to build a similar example for a
UDP service.
1. import socketserver
2.
3. HOST = "localhost"
4. PORT = 62222
5.
6. class MyUDPHandler(socketserver.BaseRequestHandler):
7.
8. def handle(self):
9. data = self.request[0].strip()
10. socket = self.request[1]
11. print(f"Received: {data}")
12. socket.sendto(data.upper(), self.client_address)
13.
14. if __name__ == "__main__":
15. with socketserver.UDPServer((HOST, PORT), MyUDPHandler) a
s server:
16. server.serve_forever()
Code 12.3
We can see that we used the socketserver3 package to simplify the UDP server.
We use the same approach as we did in Code 12.1, which means we respond with
the message that we received from the client (uppercased this time). Let us
look at the client side in the following example.
1. import socket
2.
3. HOST = "localhost"
4. PORT = 62222
5.
6. with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
7. s.connect((HOST, PORT))
8. s.sendall(b"Hello, world")
9. data = s.recv(1024)
10.
11. print(f"Received {data}")
Code 12.4
We managed to write a UDP client similar to the TCP one, albeit this
time we explicitly told the Python socket that we will be
connecting to a UDP service (line 6): notice socket.SOCK_DGRAM4,
which is needed for establishing a UDP connection.
These two client examples assume the server is started and listening on the
desired port. Let us check what happens if we run the client script without
the server being started.
1. $ python udp_client.py
2. Traceback (most recent call last):
3. File "udp_client.py", line 9, in <module>
4. data = s.recv(1024)
5. ConnectionRefusedError: [Errno 61] Connection refused
Code 12.5
We can see that when we try to connect to a port that has no service
listening on it, we get a fatal exception (line 5). This is something we can
use to identify that the port we are connecting to is either closed or the
server is not responding properly. Let us check in the following example how
to drive the connection in a more resilient way.
1. import socket
2.
3. HOST = "localhost"
4. PORT = 62222
5.
6. with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
7. s.settimeout(10)
8. s.connect((HOST, PORT))
9. s.settimeout(None)
10. s.sendall(b"Hello, world")
11. data = s.recv(1024)
12.
13. print(f"Received {data}")
Code 12.6
We added a timeout (line 7) and then reset it (line 9). This approach
helps when we try to connect to a service that is listening on the requested
port but is somehow faulty; that is why we reset the timeout only after
connecting (line 9).
When running the above code, it will still lead to a fatal crash, since we
connect to a closed port. Let us modify the example so we can
catch the connection issue properly.
1. import socket
2.
3. HOST = "localhost"
4. PORT = 62222
5.
6. with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
7. s.settimeout(10)
8. status = s.connect_ex((HOST, PORT))
9. if status == 0:
10. s.settimeout(None)
11. s.sendall(b"ping")
12. data = s.recv(1024)
13. print(data)
14. else:
15. print(f"Connection error code {status}")
Code 12.7
In the following example we run Code 12.7, so we can see how we
properly catch the system error when a new connection is being
established to a closed port.
1. $ python tcp_client_1.py
2.
3. Connection error code 61
Code 12.8
We can see in the output of running our code that this time there is no
exception, since we check the connection status code during the connection
attempt (line 8). Besides, you can notice that we replaced the method
connect (Code 12.6, line 8) with a more sophisticated approach: we use the
method connect_ex (Code 12.7, line 8). This time we get a
connection status code instead of an exception being raised.
We use this status code (line 9): when its value is 0, which
means the connection was successful, we send a message over the opened
socket; in the other case we print the error code (lines 14-15).
All of this is correct for a TCP connection, as you probably already
noticed. For UDP we need to tweak our approach to properly handle the
connection issue when a port is closed. Let us check the following example
for how to do it.
1. import socket
2.
3. HOST = "localhost"
4. PORT = 62222
5.
6. with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
7. s.settimeout(10)
8. try:
9. status = s.connect_ex((HOST, PORT))
10. if status == 0:
11. s.settimeout(None)
12. s.sendall(b"ping")
13. data = s.recv(1024)
14. print(data)
15. except ConnectionRefusedError:
16. status = -1
17. if status != 0:
18. print(f"Connection error code {status}")
Code 12.9
Why didn't we rely on the connection status code like in the TCP example? It
is because of the nature of UDP and how Python handles a closed port.
connect_ex will not return an error code (line 9) when the port cannot be
reached; instead, an exception that the connection is refused (port is
closed) is raised when we try to send any data to the closed port (line 12).
For this reason, we catch the ConnectionRefusedError exception and set the
status code to -1 (line 16), so in the next part of the code we can print the
error like in the TCP example (lines 17-18).
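For convenience, both patterns can be folded into one reusable check. The sketch below combines the TCP approach from Code 12.7 with the UDP approach from Code 12.9; is_port_open is our own name, not something from the examples above:
1. import socket
2.
3.
4. def is_port_open(host, port, udp=False):
5.     # Combine the TCP (Code 12.7) and UDP (Code 12.9) checks
6.     kind = socket.SOCK_DGRAM if udp else socket.SOCK_STREAM
7.     with socket.socket(socket.AF_INET, kind) as s:
8.         s.settimeout(10)
9.         try:
10.             status = s.connect_ex((host, port))
11.             if udp and status == 0:
12.                 # UDP reports refusal only once we actually send data
13.                 s.sendall(b"ping")
14.                 s.recv(1024)
15.         except ConnectionRefusedError:
16.             status = -1
17.         except socket.timeout:
18.             # No UDP answer within the timeout: state unknown, assume open
19.             status = 0
20.         return status == 0
21.
22.
23. print(is_port_open("localhost", 62222))
24. print(is_port_open("localhost", 62222, udp=True))
Note that for UDP a silent (non-echoing) service is indistinguishable from a filtered port, which is why the timeout case is treated optimistically here.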

Port scanner
In the previous subchapter we learned the basics of TCP and UDP services.
Now let us try to build a more powerful tool that is going to help us scan
all requested ports to see which of them are open or closed. First, let us
modify the TCP client script that we built before in the following way.
1. import socket
2. from pprint import pprint
3.
4. HOST = "wikipedia.org"
5. PORTS = [443, 80, 25]
6.
7. connection_results = {}
8.
9. for port in PORTS:
10. with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
11. s.settimeout(10)
12. status = s.connect_ex((HOST, port))
13. connection_results[port] = True if status == 0 else False
14.
15. pprint(connection_results)
Code 12.10
We defined the host that we are planning to scan (line 4) and the corresponding
ports (line 5). Next, we try to establish a connection just as we did in
the examples in the previous subchapter (lines 10-12). When the
connection is properly established, we mark it in the results dictionary (line 13).
In the end we print the result, and we should see something like the
example below.
1. $ time python tcp_port_scanner_linear.py
2.
3. {25: False, 80: True, 443: True}
4.
5. python tcp_port_scanner_linear.py 0.04s user 0.02s system 33% cpu 0.
176 total
Code 12.11
We can see that the Wikipedia website has ports 80 and 443 open, whereas port 25
(SMTP) is closed.
We learned how to build a simple port scanner, but there is one challenge with
this scanner that we need to address. As you can notice, when
we run the Python code in example 12.11, we put the time5 command in front of the
python command, which measures the execution time of our script (line
5). The conclusion is that running example 12.10 seems to be super quick (0.04s),
but please notice the following potential traps we may face:
For each port we check, we set up a timeout (Code 12.10, line 11)
When we wait out the 10-second maximum timeout for even one
single port, the execution time of the script exceeds 10s
When more ports are having issues, the total execution time rises further
Even without timeout issues, scanning more ports with this linear
for loop is very inefficient
Knowing all those potential issues, let us try to improve our scanner
example so it can scan ports in parallel.
1. import asyncio
2. from asyncio_pool import AioPool
3.
4. HOST = "wikipedia.org"
5. PORTS = [443, 80, 25]
6.
7. async def tcp_port_check(port) -> list:
8. try:
9. reader, writer = await asyncio.open_connection(HOST, port)
10. return [port, True]
11. except Exception as e:
12. print(e)
13. return [port, False]
14.
15. async def main():
16. async with AioPool(size=5) as pool:
17. result = await pool.map(tcp_port_check, PORTS)
18. print(dict(result))
19.
20. if __name__ == '__main__':
21. asyncio.run(main())
Code 12.12
We are scanning the same ports as in example 12.10, although in Code 12.12
we used the asyncio library to run the checks in parallel and optimize
the performance of port scanning. We again had to use the exception approach to
catch the case when a port is closed or has issues responding.
Now, let us check the following example to see how efficient our code is.
1. $ time python tcp_port_scanner.py
2.
3. [Errno 61] Connect call failed ('185.15.59.224', 25)
4. {443: True, 80: True, 25: False}
5. python tcp_port_scanner.py 0.06s user 0.03s system 51% cpu 0.180 tot
al
Code 12.13
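If you would rather avoid the third-party asyncio_pool dependency, the same pooling effect can be approximated with the standard library alone, using a semaphore to cap concurrency. A minimal sketch:

import asyncio

HOST = "wikipedia.org"
PORTS = [443, 80, 25]

async def tcp_port_check(port, sem):
    async with sem:  # at most 5 connection attempts run at once
        try:
            await asyncio.wait_for(asyncio.open_connection(HOST, port), timeout=3)
            return port, True
        except Exception:
            return port, False

async def main():
    sem = asyncio.Semaphore(5)
    results = await asyncio.gather(*(tcp_port_check(p, sem) for p in PORTS))
    print(dict(results))

asyncio.run(main())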
Advanced scanner
So far, we have built scripts for scanning requested ports on a single host – in
our case it was Wikipedia. In this subchapter we will use and
improve the knowledge we have already gathered to build something more flexible that is
able to scan many sites at once.
Let us start by building a simple service that can read a configuration file and
use the previously created script to scan multiple sites. The
following example shows how we can do this.
1. import asyncio
2. from asyncio_pool import AioPool
3. from pprint import pprint
4.
5. SITES = ["wikipedia.org", "google.com"]
6. PORTS = [443, 80, 25]
7.
8.
9. class MyScanner:
10. def __init__(self):
11. self.scanner_results = {}
12.
13. async def tcp_port_check(self, host, port):
14. try:
15. reader, writer = await asyncio.open_connection(host, port)
16. return [host, port, True]
17. except Exception as e:
18. print(e)
19. return [host, port, False]
20.
21. async def start_scanning(self):
22. calls = []
23. results = {}
24. async with AioPool(size=5) as pool:
25. for site in SITES:
26. for port in PORTS:
27. calls.append(await pool.spawn(self.tcp_port_check(site, port)))
28.
29. for r in calls:
30. result = r.result()
31. if result[0] not in self.scanner_results:
32. self.scanner_results[result[0]] = {}
33. self.scanner_results[result[0]][result[1]] = result[2]
34.
35. async def run(self):
36. await self.start_scanning()
37. pprint(self.scanner_results)
38.
39. if __name__ == "__main__":
40. scanner = MyScanner()
41. asyncio.run(scanner.run())
Code 12.14
1. We created a simple scanner application that checks the connection state (lines 14-
19). We do not have a method similar to connect_ex from the raw socket module (Code
12.10, line 12), so we had to improvise with a try-except block when we
try to establish a connection to a port.
We used the definitions of sites and ports (lines 5-6) that we are planning to scan
to verify whether the ports are open. We iterate over those sites and ports (lines
24-27) and keep calling the method that checks whether a connection to the port
can be established successfully. We have the async-pool
module6 in use to help us make sure we are not hammering the
destination host too much. We should not be too aggressive when scanning
ports in parallel on the same host, or we could accidentally be
detected as a threat and start getting wrong results – ports might start being
closed for us.
In lines 29-33, we transform the list of results into a structure like the one
shown in the following example.
1. $ python scanner.py
2. [Errno 61] Connect call failed ('185.15.59.224', 25)
3. [Errno 61] Connect call failed ('142.250.186.206', 25)
4. {'google.com': {25: False, 80: True, 443: True},
5. 'wikipedia.org': {25: False, 80: True, 443: True}}
Code 12.15
2. So far, we've been building a service that helps us scan ports on a
remote server to check whether they are open. You may wonder why we
need to scan ports to be sure that a service is operational. Well, if we
verify that, for instance, port 443 (HTTPS) is open for the wikipedia.org
website, that means the website can be opened. This kind of
availability check is crucial for sure, but the other aspect is
to check whether the server itself (regardless of open ports) is
responding – this is called ping7.
Let us install the following package and check how we can build a simple
script that sends an ICMP packet (ping) to the destination server to verify
that it's alive.
1. $ pip install ping3
Code 12.16
3. Once we have the module installed, we can check the following example, which
pings a given hostname to check how quickly it responds.
1. import click
2. from ping3 import ping
3.
4. def ping_host(host):
5. result = ping(host)
6. click.echo(f"Response time {result}s")
7.
8.
9. @click.command()
10. @click.option("--host", help="Host to ping")
11. def main(host):
12. ping_host(host)
13.
14. if __name__ == '__main__':
15. main()
Code 12.17
The result of running the script is shown below.
1. $ python ping_test.py --host wikipedia.org
2. Response time 0.10447406768798828s
Code 12.18
We can see that the ping module helps us determine the response time
from wikipedia.org. In the following part of this subchapter, we will
incorporate this simple method into our scanner application. Before we
do it, let us focus on adding another probe to our scanner.
4. So far, we've been checking on a synthetic level whether the destination host is
available. This time we need to check not only whether the port is open and what the response
time is, but also the content of the response itself. This way we will be sure
that the service is working properly. Let us check the following example to
see how we could validate whether the response received from wikipedia.org is
correct.
1. import asyncio
2. import click
3. import httpx
4.
5.
6. async def check_status(url):
7. async with httpx.AsyncClient() as client:
8. response = await client.get(url, follow_redirects=True)
9. status = response.status_code == 200 and len(response.text) >= 50
10. print(f"Site status: {status}")
11.
12.
13. @click.command()
14. @click.option("--url", help="URL to scan", required=True)
15. def main(url):
16. asyncio.run(check_status(url))
17.
18.
19. if __name__ == "__main__":
20. main()
Code 12.19
We're checking whether the response status code is 2008, which leads us to the conclusion
(line 9) that the site is operational, since it managed to respond properly.
Additionally, we check that the response body is not empty (line 9) and has at
least 50 characters.
The execution of the script is shown in the following example.
1. $ python check_site_status.py --url https://wikipedia.org
2.
3. Site status: True
Code 12.20
5. Having built the basic probes, we shall modify the main scanning
script to support configuration files instead of a hard-coded
list of sites and ports. Let us check the example below to see how to use a
configuration file in the YAML9 format. First, we need to install a Python
module.
1. $ pip install PyYAML
Code 12.21
After installing the standard YAML module, let us check how we are going to
prepare the configuration before we digest it in our refactored code.
1. sites:
2. wikipedia:
3. url: https://wikipedia.org
4. ports:
5. - 443
6. - 80
7. - 465
8. gmail:
9. url: https://gmail.com
10. ports:
11. - 443
12. - 465
13. - 587
14. vimeo:
15. url: https://vimeo.com
16. ports:
17. - 443
18. - 80
Code 12.22
So, we have a configuration ready to scan a few websites and their public ports.
Now, we have to refactor the scanner script so it can use the configuration file
and drive the scanning in a more flexible way.
1. import aioping
2. import asyncio
3. import coloredlogs
4. import click
5. import httpx
6. import logging
7. import yaml
8. from asyncio_pool import AioPool
9. from pprint import pformat
10. from urllib.parse import urlparse
11.
12.
13. class MyScanner:
14. def __init__(self, config_fpath):
15. with open(config_fpath, "rb") as f:
16. self._config = yaml.load(f.read(), Loader=yaml.Loader)
17. self.scanner_results = {}
18.
19. def hostname(self, url):
20. parsed_uri = urlparse(url)
21. return parsed_uri.netloc
22.
23. async def tcp_port_check(self, host, port):
24. try:
25. fqdn = self.hostname(host)
26. logging.debug(f"host: {fqdn}, port: {port}")
27. func = asyncio.open_connection(fqdn, port)
28. reader, writer = await asyncio.wait_for(func, timeout=3)
29. return [fqdn, port, True]
30. except Exception as e:
31. logging.error(e)
32. return [fqdn, port, False]
33.
34. async def start_scanning(self):
35. calls = []
36. results = {}
37. async with AioPool(size=5) as pool:
38. for item, items in self._config.get('sites', {}).items():
39.
40. for port in items['ports']:
41. calls.append(await pool.spawn(self.tcp_port_check(items['url'], port)))
42. calls.append(await pool.spawn(self.ping_host(items['url'])))
43. calls.append(await pool.spawn(self.check_status(items['url'])))
44.
45. for r in calls:
46. result = r.result()
47. if result[0] not in self.scanner_results:
48. self.scanner_results[result[0]] = {}
49. self.scanner_results[result[0]][result[1]] = result[2]
50.
51. async def run(self):
52. await self.start_scanning()
53. logging.debug(f"result: {pformat(self.scanner_results)}")
54.
55. async def ping_host(self, host) -> float:
56. fqdn = self.hostname(host)
57. delay = await aioping.ping(fqdn) * 1000
58. logging.debug(f"Response time {delay} ms")
59. return [fqdn, 'ping', delay]
60.
61. async def check_status(self, url) -> bool:
62. fqdn = self.hostname(url)
63. async with httpx.AsyncClient() as client:
64. response = await client.get(url, follow_redirects=True)
65. status = response.status_code == 200 and len(response.text) >= 50
66. logging.debug(f"Site status: {status}")
67. return [fqdn, 'status', status]
68.
69.
70. @click.command()
71. @click.option("--config", help="Config file path", required=True)
72. def main(config):
73. coloredlogs.install(level=logging.DEBUG)
74. scanner = MyScanner(config)
75. asyncio.run(scanner.run())
76.
77.
78. if __name__ == "__main__":
79. main()
Code 12.23
In our refactored example we added support not only for the configuration file
(lines 14-16) but also updated how this configuration is read when scanning
ports (lines 37-41). We also added a refactored method for pinging an external
host (lines 55-59). To support it, we have to install the async ping
module for Python.
1. $ pip install aioping
Code 12.24
When the module is ready, we can see that we use it together with the async
pool (line 42). When the result is ready (the value is in milliseconds), we update the
dictionary of results for the given FQDN10. The other method that we use in the
coroutine pool is check_status (line 43), which, as we wrote in one of the
previous examples (Code 12.19), checks the quality of the response from the
server.
6. Let's check in the following example what the result of running our
scanning application with the given YAML configuration (Code 12.22) looks like.
1. $ python scanner2.py --config config.yaml
2.
3. result: {'gmail.com': {443: True,
4. 465: False,
5. 587: False,
6. 'ping': 10.923624999122694,
7. 'status': True},
8. 'vimeo.com': {80: True, 443: True, 'ping':
13.128791993949562, 'status': True},
9. 'wikipedia.org': {80: True,
10. 443: True,
11. 465: False,
12. 'ping': 33.38837499904912,
13. 'status': True}}
Code 12.25
We can see that the main keys of the returned results dictionary are the FQDNs of the tested
websites, and the values are the test results of the port scans, site availability,
and ping time.
Reporting
So far, we have been building a command line tool that can scan remote
websites and check their availability. Now, it is time to turn this scanning tool
into something more visual. Let us start with the following example, where
we need to install some Python packages to be able to write our small web
application.
1. $ pip install pandas matplotlib flask
Code 12.26
Once we have the packages installed, we can rewrite a few parts of our scanning
tool from Code 12.23 to prepare data for further processing. Let us
check the following code to see how to update it.
1. import csv
2. from datetime import datetime
3. import os
4. class MyScanner:
5.
6. async def start_scanning(self):
7. calls = []
8. results = {}
9. async with AioPool(size=5) as pool:
10. for item, items in self._config.get("sites", {}).items():
11.
12. for port in items["ports"]:
13. calls.append(await pool.spawn(self.tcp_port_check(items["
url"], port)))
14. calls.append(await pool.spawn(self.ping_host(items["url"])))
15. calls.append(await pool.spawn(self.check_status(items["url"])
))
16.
17. for r in calls:
18. result = r.result()
19. if result[0] not in self.scanner_results:
20. self.scanner_results[result[0]] = {}
21. self.scanner_results[result[0]][result[1]] = result[2]
22. for site, results in self.scanner_results.items():
23. self.dump_to_csv(site, results)
24.
25. def dump_to_csv(self, item, data):
26. fname = f"/var/tmp/{item}.csv"
27. headers = sorted(data.keys(), key=str)
28. headers.insert(0, "date")
29. data["date"] = datetime.now().strftime("%Y-%m-
%d %H:%M:%s")
30. if not os.path.exists(fname):
31. with open(fname, "a") as f:
32. csv_out = csv.writer(f)
33. csv_out.writerow(headers)
34. with open(fname, "a") as f:
35. csv_out = csv.writer(f)
36. csv_out.writerow([data.get(k) for k in headers])
Code 12.27
We had to create a new method for storing scanning results in a CSV file (lines
25-36). The data is saved to a file with a header (lines 30-33), and each
following CSV line is appended to the existing output file (lines
34-36). It is worth noticing that we add a date timestamp as the
first column of each CSV row.
The next change is in the scanning method itself: once we finish scanning, we
dump the results for every site that we scanned (lines 22-23).
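To make the file layout concrete, here is an illustrative example of what the CSV for wikipedia.org might contain after one run; the values are made up and will differ on your machine:

date,443,465,80,ping,status
2024-03-23 15:22:57,True,False,True,33.38837499904912,True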
The next step is to create a folder called web with a subfolder called templates.
Inside the web folder we need to create the main application file called
web_server.py with the following content.
1. import yaml
2. from flask import Flask
3.
4. app = Flask(__name__)
5.
6. CONFIG_FILE_PATH = "../config.yaml"
7. with open(CONFIG_FILE_PATH, "rb") as f:
8. CONFIG = yaml.load(f.read(), Loader=yaml.Loader)
9.
10. @app.route("/")
11. def list_of_result():
12. return "hello"
Code 12.28
We created a basic Flask11 application; notice that, as a global variable, we read the
config.yaml file, which stores the configuration of all the services that we scan.
To start the application we need to run the command shown below.
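The exact command is not spelled out here, but a common way to start the Flask development server (assuming the application file is named web_server.py, as above) is:

$ cd web
$ flask --app web_server run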
Let us try to read data from the configuration and display which websites we
scan. To do that, we have to modify the main controller in our web
application as in the following code.
1. from flask import render_template
2.
3. @app.route("/")
4. def list_of_result():
5. return render_template("index.html", config=CONFIG, title="Main page")
Code 12.29
We are using a template to display the main page – let's build the template
index.html under the templates directory that we created before. In the
following example, we can see the body of the main template.
1. {% extends 'main.html' %}
2.
3. {% block content %}
4. <div>Monitoring results</div>
5. {% for site in config.sites %}
6. <div><a href="{{ url_for('scanning_results', site=site) }}">{{ site }}
</a></div>
7. {% endfor %}
8.
9. {% endblock %}
Code 12.30
The first thing we did (line 1) is extend (inherit from) the main template file.
Next, we loop over the list of sites that we are monitoring and
create a link to a subpage (line 6) which leads to more details.
Let us check the following figure to see how the main page is going to look:
Figure 12.1: Basic view of monitored results list
Let us check how the main.html template looks in the following example.
1. <!doctype html>
2. <html>
3. <head>
4. <title>{{title}}</title>
5. <meta charset="utf-8">
6. <meta name="description" content={{description}}>
7. </head>
8.
9. <body>
10. {% block content %}{% endblock %}
11. </body>
12. </html>
Code 12.31
As mentioned, we use inheritance in the Jinja212 template (line 10). The rest
is simple HTML. Let us notice one more thing: in example 12.30, line 6, we
generate a URL based on the scanning_results controller, so let us check the
following example to see how this controller works.
1. @app.route("/results/<site>")
2. def scanning_results(site):
3. data = CONFIG["sites"][site]
4. return render_template("site_results.html", data=data, site=site, title=
"List of scanned elements")
Code 12.32
This controller uses the template site_results.html, where we display the list of
monitored ports taken from the YAML config file. Let us check how the page is going
to look.
Figure 12.2: List of results of scanned monitored ports
The body of the template used to display that list is shown in the following example.
1. {% extends 'main.html' %}
2.
3. {% block content %}
4. <div>scanned ports</div>
5. {% for port in data.ports %}
6. <div><a href="
{{ url_for('port_scanning_results', site=site, port=port) }}">{{ port }}
</a></div>
7. {% endfor %}
8. <div><a href="
{{ url_for('port_scanning_results', site=site, port="status") }}">site stat
us</a></div>
9. <div><a href="
{{ url_for('port_scanning_results', site=site, port="ping") }}">ping</a>
</div>
10. {% endblock %}
Code 12.33
As shown in Code 12.33, we iterate over the list of ports that we scan
(lines 5-7); additionally, since they are not in the config, we add two links
(lines 8-9) to be able to see the ping and status check results.
Let us check the example below to see what the port_scanning_results controller
looks like.
1. import pandas as pd
2.
3. def get_fqdn(site):
4. url = CONFIG['sites'][site]['url']
5. parsed_uri = urlparse(url)
6. return parsed_uri.netloc
7.
8. @app.route("/port_scanning_results/<site>/<port>")
9. def port_scanning_results(site, port):
10. data = CONFIG["sites"][site]
11. fqdn = get_fqdn(site)
12. df = pd.read_csv(f"/var/tmp/{fqdn}.csv", parse_dates=
["date"], index_col="date")
13. history_items = zip((df.index.values), list(df['ping'].array))
14. return render_template("port_scanning_results.html", site=site, port=
port, title="Details", items=history_items)
Code 12.34
We created the method get_fqdn, which converts a full URL into its hostname, e.g.
vimeo.com13, as used in our YAML config file (lines 10-11). Next, we read the CSV
file where we keep all the scanning results for the requested site and convert
it to a data frame14. Before we render HTML from the template (line 14), we
extract the requested data type from the data frame and convert it to a list of results
(line 13).
Before we get into the details of the source template, let's check what the page of
historical ping scanning results is going to look like.
Figure 12.3: Example of ping responses over time
We managed to display a table of historical ping results for the site, with a
timestamp for every time we tested the ping. Additionally, we managed to display a
graph which shows the same results in a more readable way than the raw table.
Let's check what the template for Figure 12.3 looks like in the following example.
1. {% extends 'main.html' %}
2.
3. {% block content %}
4. <div>Results of scanning for port <b>{{ port }}</b></div>
5. <div><a href="{{ url_for('list_of_result') }}">go to Main</a></div>
6. <div><img src="
{{ url_for('history', site=site, scanned_item=port) }}" /></div>
7.
8. <h3>History</h3>
9. <div>
10. <table>
11. <tr>
12. <th>Date</th>
13. <th>Value</th>
14. </tr>
15. {% for item in items %}
16. <tr>
17. <td>{{ item[0] }}</td>
18. <td>{{ item[1] }}</td>
19. </tr>
20. {% endfor %}
21. </table>
22. </div>
23. {% endblock %}
Code 12.35
We create a return link (line 5) so the user can get back to the main view. Next,
we insert an image (line 6) which we generate in the history controller. In the rest
of the template, we create a table (lines 10-21) which displays all the
items checked by the main script.
Let's check in the following example how we generate the image (line 6) based on
the results.
1. import io
2. import yaml
3. import pandas as pd
4. from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
5. from flask import Flask, make_response, Response
6. from flask import render_template, send_file
7. from urllib.parse import urlparse
8. from matplotlib.figure import Figure
9.
10. app = Flask(__name__)
11.
12.
13. def create_img(x, y, title="", xlabel="Date", ylabel="Value", dpi=100):
14. fig = Figure()
15. axis = fig.add_subplot(1, 1, 1)
16. axis.plot(x, y, color="tab:red")
17. return fig
18.
19.
20. CONFIG_FILE_PATH = "../config.yaml"
21. with open(CONFIG_FILE_PATH, "rb") as f:
22. CONFIG = yaml.load(f.read(), Loader=yaml.Loader)
23.
24.
25. def get_fqdn(site):
26. url = CONFIG["sites"][site]["url"]
27. parsed_uri = urlparse(url)
28. return parsed_uri.netloc
29.
30.
31. @app.route("/")
32. def list_of_result():
33. return render_template("index.html", config=CONFIG, title="Main p
age")
34.
35.
36. @app.route("/results/<site>")
37. def scanning_results(site):
38. data = CONFIG["sites"][site]
39. return render_template("site_results.html", data=data, site=site, title=
"List of scanned elements")
40.
41.
42. @app.route("/port_scanning_results/<site>/<port>")
43. def port_scanning_results(site, port):
44. data = CONFIG["sites"][site]
45. fqdn = get_fqdn(site)
46. df = pd.read_csv(f"/var/tmp/{fqdn}.csv", parse_dates=["date"], index_col="date")
47. history_items = zip((df.index.values), list(df["ping"].array))
48. return render_template("port_scanning_results.html", site=site, port=port, title="Details", items=history_items)
49.
50.
51. @app.route("/history/<site>/<scanned_item>")
52. def history(site, scanned_item):
53. fqdn = get_fqdn(site)
54. df = pd.read_csv(f"/var/tmp/{fqdn}.csv", parse_dates=["date"], index_col="date")
55. fig = create_img(df.index, df[scanned_item], title=f"Results of scanning {scanned_item}")
56. output = io.BytesIO()
57. FigureCanvas(fig).print_png(output)
58. return Response(output.getvalue(), mimetype="image/png")
Code 12.36
We create the image with the function create_img (line 55), which is fed with the
data frame read by the pandas framework (line 54). Once matplotlib is ready
with the image, we write the output to a memory buffer (lines 56-57) and send the
response back to the browser (line 58).
Conclusion
In this chapter, we learned how to use Python to monitor external
resources in an efficient, easy, yet very powerful way. Using
asynchronous connections with a connection pool is very efficient, and it also
taught us how not to accidentally send too many requests to the
monitored website.
In the next chapter, we are going to learn how to use Python to analyze
websites and look for requested products in webstores. Once the desired item
is available, or its price is very attractive, we will learn how to use Python to go
shopping.
1. https://en.wikipedia.org/wiki/Transmission_Control_Protocol
2. https://en.wikipedia.org/wiki/User_Datagram_Protocol
3. https://docs.python.org/3/library/socketserver.html#module-socketserver
4. https://docs.python.org/3/library/socket.html#socket.SOCK_DGRAM
5. https://en.wikipedia.org/wiki/Time_%28Unix%29
6. https://pypi.org/project/asyncio-pool/
7. https://en.wikipedia.org/wiki/Ping_(networking_utility)
8. https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#successful_responses
9. https://en.wikipedia.org/wiki/YAML
10. https://en.wikipedia.org/wiki/Fully_qualified_domain_name
11. https://flask.palletsprojects.com/en/3.0.x/
12. https://jinja.palletsprojects.com/en/2.10.x/templates/#template-inheritance
13. https://vimeo.com
14. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
Join our book’s Discord space
Join the book's Discord Workspace for Latest updates, Offers, Tech
happenings around the world, New Release and Sessions with the Authors:
https://discord.bpbonline.com
CHAPTER 13
Making a Low-cost, Fully-
automated Shopping App
Introduction
When we try to buy any kind of product on the internet, what can be a bit
challenging is that we would like to get the product from a well-known
provider and an established webstore, albeit price can be the key factor here.
Sometimes a product is not available at the moment, so
checking many websites while hunting for its availability can be really
cumbersome. We have Python; let us use it as our superpower for this job.
Structure
In this chapter, we will be covering the following topics:
Connecting to the eBay bidding service, placing a bid, and hunting for the best
price
Writing plugins for popular webstores to find and buy the best product price
Tracking prices and generating an alarm when the best price is available
Objectives
Based on the knowledge gathered from previous chapters, in this chapter you
will learn how to build your personal bot which monitors web
stores for certain products you might be interested in buying – tracking when a specific
item's price goes up or down and when it becomes available. We will
also learn the basics of improving this tool so it can help you auto-
purchase products on its own.
eBay client
This platform is very well known, having been on the market for many years,
and it offers an API1 for developers2. Before we can start connecting
to the API, we need an account on this platform. This is not the account that
we normally use for bidding and purchasing goods: we need to register in the
developer program3 and wait for our request to be accepted. Once we have
access, the next step is to register our application – we can use a test name or any
name that suits you. Once the application name is given, the system
creates a few access keys for us, as shown in the following example.
Figure 13.1: Example sandbox access credentials
1. When we have the keys generated, we need to create an eBay sandbox
account – just follow the links provided on the API keys page. We are going
to use the sandbox since there is no risk of accidentally purchasing any kind
of unwanted product.
When we are ready with the sandbox API credentials and the eBay test account, we
create the API access token as the very last step.
Figure 13.2: Create authentication token.
To do so we need to click on the user tokens link (shown in Figure 13.1) and we
will be redirected to the page shown in Figure 13.2. What is important is that we
choose the Auth'n'Auth token option. After successfully creating the token, we
shall install the following Python library that is going to help us work with the
online buy-sell platform.
1. $ pip install ebaysdk pyyaml
Code 13.1
2. The next thing we shall do is create a configuration file ebay.yaml with
content like in the following example.
1. name: ebay_api_config
2. api.sandbox.ebay.com:
3. compatibility: 719
4. appid: <your app ID>
5. certid: <generated cert ID>
6. devid: <dev ID>
7. token: <generated auth'n'auth token>
Code 13.2
This is a configuration file for the eBay API, which allows you to access
various features and data of the eBay platform. We need to specify the
following parameters:
devid: Your developer ID, which you can obtain from the eBay
Developer Program website, as we mentioned before.
token: Your authentication and authorization token, which you can
generate from the eBay Token Tool, the eBay SDK, or the eBay platform,
as shown in Figures 13.1 and 13.2.
appid: Your application ID – identifies your application to the eBay
API.
certid: Your certificate ID – used to sign requests to the eBay API.
3. Once everything is set, we can send an example request to fetch some data from
the eBay platform.
1. import yaml
2. from pprint import pprint
3. from ebaysdk.finding import Connection as Finding
4. from ebaysdk.exception import ConnectionError
5.
6. try:
7. with open('ebay.yaml', 'r') as file:
8. data = yaml.safe_load(file)
9.
10. api = Finding(
11. domain="svcs.sandbox.ebay.com", appid=data['api.sandbox.eb
ay.com']['appid'], config_file="ebay.yaml"
12. )
13. response = api.execute("findItemsAdvanced", {"keywords": "watc
h"})
14. pprint(response.dict())
15. except ConnectionError as e:
16. print(e)
17. print(e.response.dict())
Code 13.3
In our example we are using the ebaysdk library to abstract the eBay API. We can see that
first we load our configuration file (lines 7-8) from the YAML file that
we specified before (Code 13.2). Once the file is loaded, we use the appid
which we configured in the YAML file. This is needed since we use the
API sandbox in this example (Code 13.3), so we have to explicitly specify the API
endpoint domain and the appid for it (line 11). The next part is to call the API
to find items on the eBay platform that match the word watch. Let us
check the essential part of the API response.
1. {'ack': 'Success',
2. 'itemSearchURL': 'https://shop.sandbox.ebay.com/i.html?_nkw=watch&_ddo=1&_ipg=100&_pgn=1',
3. 'paginationOutput': {'entriesPerPage': '100', 'pageNumber': '1',
4. 'totalEntries': '148', 'totalPages': '2'},
5. 'searchResult': {
6. '_count': '82',
7. 'item': [{'autoPay': 'false',
8. 'condition': {'conditionDisplayName': 'New',
9. 'conditionId': '1000'},
10. 'country': 'US',
11. 'galleryURL': None,
12. 'globalId': 'EBAY-US',
13. 'isMultiVariationListing': 'false',
14. 'itemId': '110555269430',
15. 'listingInfo': {
16. 'bestOfferEnabled': 'false',
17. 'buyItNowAvailable': 'false',
18. 'endTime': '2024-07-20T06:08:34.000Z',
19. 'gift': 'false',
20. 'listingType': 'FixedPrice',
21. 'startTime': '2024-06-20T06:08:34.000Z'},
22. 'location': 'USA',
23. 'primaryCategory': {
24. 'categoryId': '69527',
25. 'categoryName': 'Other Marine '
26. 'Life '
27. 'Collectibles'},
28. 'returnsAccepted': 'true',
29. 'sellingStatus': {
30. 'convertedCurrentPrice': {'_currencyId': 'USD',
31. 'value': '43.0'},
32. 'currentPrice': {'_currencyId': 'USD',
33. 'value': '43.0'},
34. 'sellingState': 'Active',
35. 'timeLeft': 'P26DT14H45M37S'},
36. 'shippingInfo': {
37. 'expeditedShipping': 'false',
38. 'handlingTime': '5',
39. 'oneDayShippingAvailable': 'false',
40. 'shipToLocations': 'Worldwide',
41. 'shippingServiceCost': {'_currencyId': 'USD',
42. 'value': '10.0'},
43. 'shippingType': 'Flat'},
44. 'title': 'Summit Watch',
45. 'topRatedListing': 'false',
46. 'viewItemURL': 'https://cgi.sandbox.ebay.com/Summit-Watch-/110555269430'},
47. #<... more result .>
48. ]},
49. 'timestamp': '2024-03-23T15:22:57.941Z',
50. 'version': '1.13.0'}
Code 13.4
We can see that the API response has an element with many entries
in it – under the key item. This part of the response contains all the eBay
listings we managed to find that were related to the word watch.
4. Once we have that, let us check how we can build code that is
going to help us fetch details regarding one single specific
listing item. Let us check the following code to see how to do this by using
an itemId that we received from Code 13.4.
1. import click
2. import yaml
3. from pprint import pprint
4. from ebaysdk.trading import Connection as Trading
5.
6. from ebaysdk.exception import ConnectionError
7.
8. @click.command()
9. @click.option("--item", type=str, help="Item id to get
information about", required=True)
10. def main(item):
11. with open('ebay.yaml', 'r') as file:
12. data = yaml.safe_load(file)
13.
14. api = Trading(
15. domain="api.sandbox.ebay.com", #open.api.ebay.com
16. https=True,
17. appid=data['api.sandbox.ebay.com']['appid'],
18. config_file="ebay.yaml",
19. )
20. response = api.execute('GetItem', {'ItemID': item, 'DetailLevel': 'ReturnAll'})
21. pprint(response.dict())
22.
23. if __name__ == '__main__':
24. main()
Code 13.5
We used the well-known click library from previous chapters (lines
10-21) to build an application that takes an item id as its argument. We can
take the value of this argument (the item id) from the result of
executing script 13.3. Once we know which eBay item we want to get
details about, we can run script 13.5. Let's check in the following code how
to run Code 13.5. Please notice we named Code 13.5 find_item2.py.
1. $ python find_item2.py --item 110555342333
Code 13.6
The item ID is something that we get from running script 13.3 – any item that is
a result of searching for the requested phrase; in our example it has been
watch.
There is another thing worth noticing: if we want to get item details
from another sandbox, in this case an item located in the UK eBay service,
we can specify siteid (line 7), and as an extra argument we can specify the API
version (line 8).
1. from ebaysdk.trading import Connection as Trading
2.
3. api = Trading(
4. domain="svcs.sandbox.ebay.com",
5. appid=data['api.sandbox.ebay.com']['appid'],
6. config_file="ebay.yaml",
7. siteid="EBAY-GB",
8. version="1.18.2",
9. )
Code 13.7
We can clearly see that any of the API wrappers – Trading (line 1, Code
13.7), finding item details, etc. – can take additional arguments (lines 7-8)
where we specify which sandbox we want to target; in line 7 we
specified the UK eBay API. In line 8 we specified which API version we want to
use. Since the eBay API supports many versions4, and backwards
compatibility can break from version to version, it is worth knowing
how to explicitly specify the version that we want to use.
5. In the next example, we are going to test how we can purchase an
item that we are interested in. Before we can make a bid, we shall
check the following code, where we get the response from running Code
13.5. We are interested in the following chunk of the response.
1. 'SellingStatus': {'BidCount': '0',
2. 'BidIncrement': {'_currencyID': 'GBP',
3. 'value': '0.0'},
4. 'ConvertedCurrentPrice': {'_currencyID': 'CAD',
5. 'value': '0.0'},
6. 'CurrentPrice': {'_currencyID': 'GBP',
7. 'value': '8.99'},
8. 'ListingStatus': 'Active',
9. 'MinimumToBid': {'_currencyID': 'GBP',
10. 'value': '8.99'},
11. 'QuantitySold': '0',
12. 'QuantitySoldByPickupInStore': '0',
13. 'ReserveMet': 'true',
14. 'SecondChanceEligible': 'false'
15. },
Code 13.8
We can see that response shown in the Code 13.8 has got few important
information that we are going to use later, under keys mentioned in example
13.8:
MinimumToBid
QuantitySold
ListingStatus
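As a minimal sketch of how these fields could be pulled out of the parsed response – assuming, as in the Trading GetItem response, that SellingStatus sits under the Item key of the response dictionary:

# Hypothetical continuation of Code 13.5, after:
# response = api.execute('GetItem', {'ItemID': item, 'DetailLevel': 'ReturnAll'})
item = response.dict()["Item"]
selling = item["SellingStatus"]

minimum_to_bid = float(selling["MinimumToBid"]["value"])
quantity_sold = int(selling["QuantitySold"])
is_active = selling["ListingStatus"] == "Active"

if is_active and quantity_sold == 0:
    print(f"Listing is active, minimum bid: {minimum_to_bid}")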
Writing plugins to find and buy best product price
Before we start bidding, we need to clean up our script structure for the time
being. Let us create a modular structure for our code, as shown in
the following example.
Figure 13.3: Organized sniper example into modular application
This new structure needs some clarification. In the clients directory we are
going to keep client modules; in the previous example we used eBay, so we
are going to store the eBay client logic there. Next, we have the configs folder –
this is where we are going to keep all the configuration files needed for APIs,
payments, etc.
In the end we have a folder called payments, where we are going to keep the logic
that will support the payment steps if we decide to auto-
purchase a requested item.
Let us check in the following example how the main application file is going to use
our newly introduced structure.
1. import clients
2.
3. def main():
4.
5. for client in clients.__all__:
6. c = client()
7. print(c)
8.
9. if __name__ == '__main__':
10. main()
Code 13.9
In the example 13.9, we can see how we are loading sub-modules from
clients folder it is clearly seen that we use technique of exporting specific
(encapsulated) list of module items that we learned in Chapter 1, Python
101. Let us check following example how __init__ file is going to look like
in the clients folder.
1. from .ebay import ClientEbay
2.
3. __all__ = [ClientEbay]
Code 13.10
As mentioned, we use a technique (line 3) that allows us to expose
only those parts of the module folder that we want. Additionally, we can loop
over the exposed modules (lines 5-7, Code 13.9).
Let us check in the following example how we implemented the content of the ebay
file that we imported (line 1).
1. import os
2. import yaml
3. from pprint import pprint
4. from ebaysdk.finding import Connection as Finding
5. from ebaysdk.exception import ConnectionError
6. from ebaysdk.shopping import Connection as Shopping
7.
8.
9. class ClientEbay:
10.
11. def __init__(self):
12. dir_path = os.path.dirname(os.path.realpath(__file__))
13. self.__config_file = os.path.join(dir_path, '..', 'configs', 'ebay.yaml')
14.
15. def __load_config(self):
16. with open(self.__config_file, "r") as file:
17. return yaml.safe_load(file)
18.
19. def find_items(self, search_phrase):
20. config = self.__load_config()
21.
22. def get_item_details(self):
23. pass
Code 13.11
We encapsulated the dedicated module logic into a class called ClientEbay (line
9), whose constructor defines where we keep the configuration. The config has
been saved under the configs folder – please check Figure 13.3. The next item to notice is the
fact that we have two public methods – find_items (line 19) and
get_item_details (line 22). These two we are about to use in the other client
modules that we are going to build in the following sections of this chapter.
Let us also check how we build the structure for the payments module; in the
following example we show how we organized the __init__ file there.
1. from .cc import PaymentCreditCard
2. from .ebay import PaymentEbay
3.
4. __all__ = [PaymentCreditCard, PaymentEbay]
Code 13.12
In example 13.12, we used the same __all__ export approach as in example
13.10, so now let's check how we fill ebay.py.
1. class PaymentEbay:
2. # we can implement PayPal payment for the item we want to purchase
3. def pay(self):
4. pass
Code 13.13
As we can see, we only sketched the template of the payment class; we
are going to fill it in later in this chapter.
1. class PaymentCreditCard:
2. # here we can implement integration with credit card payment gateways
3. def pay(self):
4. pass
Code 13.14
This is the same approach, we only prepared template for future use.
With those basics prepared, we can get back to example 13.11
and improve it by filling the public methods with real features. Let us check the
following example to see how we can send a request to find items.
1. EBAY_API = "svcs.sandbox.ebay.com"
2. SITE_ID = "EBAY-US"
3.
4. class ClientEbay:
5.
6. def find_items(self, search_phrase):
7. api = Finding(
8. domain=EBAY_API,
9. appid=self.config["api.sandbox.ebay.com"]["appid"],
10. config_file=self.__config_file,
11. siteid=SITE_ID,
12. )
13. response = api.execute("findItemsAdvanced", {
14. "keywords": search_phrase,
15. "itemFilter": [
16. {'name': 'Condition', 'value': 'New'},
17. {'name': 'currency', 'value': 'USD'},
18. {'name': 'minPrice', 'value': 300}
19. ],
20. "sortOrder": "BestPrice"
21. })
22. response = response.dict()
23. items = []
24. if int(response['searchResult']['_count']) > 0:
25. for item in response['searchResult']['item']:
26. if item['condition']['conditionDisplayName'] == 'New':
27. url = item['viewItemURL']
28. print(f"Found NEW item URL: {url}")
29. items.append({
30. 'id': item['itemId'],
31. 'currency': item['sellingStatus']['currentPrice']
['_currencyId'],
32. 'price': item['sellingStatus']['currentPrice']['value'],
33. 'img': item['galleryURL'],
34. 'url': item['viewItemURL'],
35. 'title': item['title']
36. })
37. return items
Code 13.15
We updated the find_items method with functionality where we look for items
based on a given phrase (lines 7-12). As an additional search parameter,
we added a minimum amount for the item value (lines 17-18). This is going
to make sure that when we send a query to eBay (a keyword, e.g.
iPhone), the items we get in return are worth at least 300 USD.
This allows us to avoid any search results that match
unwanted criteria, like docking stations, cases, charging cables, etc., since their
value is for sure lower than 300 dollars.
1. $ python main_app.py --phrase "iphone 10"
Code 13.16
Another thing to notice is line 16: with this parameter we force the query to
return only new items. Now we can start comparing the found results with a
similar implementation of the searching algorithm for the Amazon webstore.
Now, we want to see how we can apply the same logic of searching for items
based on a phrase and a minimum price to the Amazon webstore. We will use
the Amazon website, query for products that match our criteria, and
compare the results with the eBay example.
Let us check the following example to see how we can make a call to
fetch the Amazon results page matching our query.
1. import requests
2. from urllib.parse import urlencode
3.
4.
5. class ClientAmazon:
6.
7. headers = {
8. "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
9. "Accept-Language": "en-GB,en;q=0.9",
10. "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-",
11. }
12.
13. def get_cookies(self):
14. url = "https://www.amazon.com"
15. r = requests.get(url, headers=self.headers)
16. return r.cookies
17.
18. def find_items(self, search_phrase):
19. url = "https://www.amazon.com/s"
20.
21. params = {
22. "crid": "20LMZ9KZCOSVL",
23. "i": "aps",
24. "ref": "nb_sb_ss_recent_1_0_recent",
25. "url": "search-alias=aps",
26. "k": search_phrase,
27. "sprefix": f"{search_phrase},aps,193",
28. }
29. url += f"?{urlencode(params)}"
30. cookies = self.get_cookies()
31.
32. result = requests.get(url, headers=self.headers, cookies=cookies)
33. print(result)
34. print(result.content)
Code 13.17
We defined all the shown logic in the file located under the path
clients/amazon.py. We created the Amazon client class with the Client prefix – the
same one that we used for the eBay client class. The next noticeable thing is the headers
attribute of the class (lines 7-11), where we defined not only the browser
that we are pretending to be (line 8) but also additional headers that are essential
to send so the Amazon website does not reject our requests.
In the method find_items, we prepare the parameters that we are about to use in
the query call (lines 21-28). When all is set, we make the GET call (line 32),
using the mentioned headers and cookie data (line 30). We need to call
amazon.com first to fetch the cookie parameters, which are very
important to pass when querying for the searched element (line 32).
If you open a browser and carefully analyze the calls and results going to and
coming from amazon.com, you will notice that our code just follows what a
normal browser does.
With this sorted, we can install a Python module for extracting and
manipulating HTML content. Let us check the following example.
1. $ pip install beautifulsoup4
Code 13.18
After installing the Beautiful Soup5 module, we have to analyze the website that
we get from Amazon when fetching the search query result (Code 12.17 notwithstanding, see Code 13.17,
line 32). Let us check the following example to see how we can extract all the
items, with their corresponding details URLs, from the search
results page.
1. from bs4 import BeautifulSoup
2.
3.
4. def parse_results(self, html_data):
5. soup = BeautifulSoup(html_data, 'html.parser')
6. result = soup.find_all('div', {"class": "sg-col-inner"})
7.
8. for item in result:
9. search_result = item.find_all('div', {'class': 's-search-results'})
10. if search_result:
11. break
12.
13. found_items = []
14. for item in search_result.pop().find_all('div', {'class': 's-result-item'}):
15. data = {}
16. a_link = item.find_all('h2', {'class': 'a-size-mini'})
17. # item details
18. if a_link:
19. a_link = a_link.pop()
20. data.update({
21. 'url': 'https://www.amazon.com{}'.format(a_link.select('a')[0].attrs['href']),
22. 'title': a_link.select('a')[0].text.strip()
23. })
24. # price
25. price_box = item.find_all('div', {'data-cy': 'price-recipe'})
26. if price_box:
27. price = price_box.pop().find_all('span', {'class': 'a-offscreen'})
28. if price:
29. data.update({'price': float(price.pop().text.replace('$', '').replace(',', ''))})
30. if data:
31. found_items.append(data)
32. return found_items
Code 13.19
In this new method, which we added to the Amazon class (Code 13.17), we had to
import the newly installed Beautiful Soup module (Code 13.19, line 1). In
the new method we extract the piece of the website that stores the query results (lines 8-
11). Next, we loop over those results (line 14) and extract the link
leading to the item details (line 21) as well as the item title (line 22). Additionally,
in lines 25-29 we extract the price details from the listing HTML and update the
final structure with the value (lines 28-29).
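As a tiny self-contained illustration of the extraction pattern used above (the HTML snippet is made up, mimicking the class names Amazon uses in its result listing):

from bs4 import BeautifulSoup

html = (
    '<div class="s-result-item">'
    '<h2 class="a-size-mini"><a href="/dp/EXAMPLE">Example item</a></h2>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
link = soup.find("h2", {"class": "a-size-mini"}).select("a")[0]
print(link.attrs["href"], "-", link.text.strip())  # /dp/EXAMPLE - Example item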
Now, we shall modify our find_items method from Code 13.17 as shown in the
following.
1. def find_items(self, search_phrase):
2. url = "https://www.amazon.com/s"
3.
4. params = {
5. "crid": "20LMZ9KZCOSVL",
6. "i": "aps",
7. "ref": "nb_sb_ss_recent_1_0_recent",
8. "url": "search-alias=aps",
9. "k": search_phrase,
10. "sprefix": f"{search_phrase},aps,193",
11. }
12. url += f"?{urlencode(params)}"
13. cookies = self.get_cookies()
14.
15. result = requests.get(url, headers=self.headers, cookies=cookies)
16. return self.parse_results(result.content)
Code 13.20
We can see that our modified method fetches the results just as
demonstrated in Code 13.17, albeit in this case, once the result is obtained, we
pass it to the new parser method (line 16).
So far, we have managed to build two main clients, for Amazon and eBay, which
we use to find requested items. Now, we must introduce a main client class
that digests the results returned by those classes and presents them
in such a way that we can use them later for tracking the best price and availability
of a requested item. Let us check the following code to see how to build such
logic. We're putting the following code under clients/main.py.
1. from . import __all__
2.
3. class Main:
4.
5. def collect_results(self, phrase):
6. all_data = []
7. for client in __all__:
8. c = client()
9. result = c.find_items(phrase)
10. if result:
11. all_data.extend(result)
12. return all_data
Code 13.21
In our main file (Code 13.21) we introduced a new way of calling the
individual modules (lines 7-9) and collecting the results (lines 9-11). We
refactored Code 13.9 into a more flexible and universal
way of loading and using the created client modules (line 1).
When all is set and ready, we can call our main entry point script as in the
following example.
1. $ python main_app.py --phrase "iphone 14"
Code 13.22
Once Code 13.22 is executed, we shall get many results from both
services already defined in the clients modules (Code 13.21, line 1).
Since the code we built so far is designed to search for a desired product
in a few web services and we get lots of results, we have to clean the
results up and sort the extracted items by price.
Let us refactor Code 13.21 as in the following example:
1. from . import __all__
2.
3. class Main:
4.
5. def collect_results(self, phrase):
6. all_data = []
7. for client in __all__:
8. c = client()
9. result = c.find_items(phrase)
10. if result:
11. all_data.extend(result)
12. all_data.sort(key=lambda x: float(x.get('price', 0)))
13. return all_data
Code 13.23
We are introducing the sort6 function, which also takes care of the case when the
price of a found item is not present (line 12) by defaulting it to 0; such items
therefore end up at the front of the ascending list (and at the end when the order is
reversed). A tiny demonstration follows.
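A minimal, self-contained sketch of the default-price trick (the item dictionaries are made up):

# Items without a "price" key default to 0 in the sort key.
items = [{"price": "12.50"}, {"title": "no price listed"}, {"price": "3.99"}]
items.sort(key=lambda x: float(x.get("price", 0)))
print([i.get("price", 0) for i in items])  # [0, '3.99', '12.50']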
Once everything is sorted, we can modify the main function to accept some new
parameters that will help us filter out the cheapest or the most expensive
items found.
1. import click
2. from pprint import pprint
3. from clients.main import Main
4.
5.
6. @click.command()
7. @click.option("--
order", help="Sorting order", type=click.Choice(['asc', 'desc'], case_sens
itive=False))
8. @click.option("--
limit", type=int, help="Limit number of results", required=False)
9. @click.option("--
phrase", type=str, help="Item name to look for", required=True)
10. def main(order, limit, phrase):
11. m = Main()
12. result = m.collect_results(phrase)
13. if result:
14. if order == 'desc':
15. result.sort(key=lambda x: float(x.get('price', 0)), reverse=True)
16. if limit:
17. result = result[:limit]
18. pprint(result)
19.
20. if __name__ == "__main__":
21. main()
Code 13.24
The refactored code from example 13.24 has a new way of presenting results (line
18). Another thing to notice is the addition of new parameters to the main function –
order and limit (lines 7-8), where the sorting order is limited to the listed options7 that can
be used. We use those parameters later for sorting (lines 14-15): when descending
order is requested, we use the sort method on the result list (line 15)
and apply sorting by price with the reverse argument.
In line 17 we take a slice of the results array, so in the end we can return
the requested, limited number of items.
Let's check in the following code how we can use this new approach to get the most
expensive item found in the parsing results.
1. $ python main_app.py --phrase "iphone 14" --limit 1 --order desc
Code 13.25
This example code calls our main function (Code 13.24) and returns
one single result – the found item with the highest price.
Automated price tracker
So far, we have learned how to build a simple yet powerful application that
can help us find a desired product and sort by a requested value. Once we can
find what we're looking for, we can build more complicated functionality.
This one will check the product price in a new way – it is going to run
in the background and keep an eye on the desired product, so it can
alert us when the price changes, i.e. drops.
To be able to run a cron8-like solution with Python, we have to install a new
module, pycron9, with the following code:
1. $ pip install python-cron
Code 13.26
Once the module is installed, let us analyze the following example to see
how to run hello world code every minute.
1. import pycron
2. from datetime import datetime
3.
4.
5. @pycron.cron("*/1 * * * *")
6. async def test_call(timestamp: datetime):
7. timestamp = datetime.now()
8. print(f"Executed at {timestamp}")
9.
10.
11. if __name__ == "__main__":
12. pycron.start()
Code 13.27
We can see that we created a decorator (line 5) for an async10 function. The decorator
uses cron-like syntax11 – this way we can easily and very flexibly
create many functions that start on a configured schedule.
Another thing to notice is that in line 12 we start an async
infinite loop that automatically takes care of executing the async tasks. It is an easy
and elegant solution.
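For reference, here is a small sketch with a few standard cron expressions that could be used in the decorator (the schedules shown are assumptions for illustration):

import pycron

# "*/5 * * * *"  - every five minutes
# "0 * * * *"    - at the start of every hour
# "30 8 * * 1-5" - at 08:30 on weekdays (Mon-Fri)
@pycron.cron("0 * * * *")
async def hourly_job(timestamp):
    print(f"hourly run at {timestamp}")

if __name__ == "__main__":
    pycron.start()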
Once we know how to create a cron-driven task scheduler program, we have
to create one that allows us to track prices for the given products that we're
interested in. Let us check the following code to see how an async program can
use our previous work.
1. import click
2. import pycron
3. from datetime import datetime
4. from clients import __all__
5.
6. product_url = None
7. provider_name = None
8. choices = [x.__name__.lower()[6:] for x in __all__]
9.
10.
11. @pycron.cron("*/1 * * * *")
12. async def check_product_availability(timestamp: datetime):
13. print(f'started checking price for: {provider_name}')
14. client = provider_name()
15. client.product_details(product_url)
16.
17.
18. @click.command()
19. @click.option("--
url", help="Product URL to observe", type=str, required=True)
20. @click.option("--
provider", help="Provider", type=click.Choice(choices, case_sensitive=
False), required=True)
21. def main(url, provider):
22. global provider_name, product_url
23. product_url = url
24. provider_name = [x for x in __all__ if f"Client{provider.capitalize()}
" == x.__name__].pop()
25. print(f"Provider class: {provider_name}")
26. pycron.start()
27.
28. if __name__ == '__main__':
29. main()
Code 13.28
We created a file which helps us track a URL for a desired product
(line 19) for the provided provider name (line 20). All the possible names of
available providers are extracted from the __all__ module variable (line 8).
Once we run our code as in the following example, we can see that the
provider names match the lowercase names of the module classes (line 4)
without the Client prefix; a tiny demonstration of that expression follows.
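A minimal, self-contained sketch of how line 8 derives the CLI choices from the class names:

# Stand-ins for the real client classes exported via __all__.
class ClientEbay: pass
class ClientAmazon: pass

__all__ = [ClientEbay, ClientAmazon]

# "ClientEbay".lower() == "clientebay"; [6:] strips the "client" prefix.
choices = [x.__name__.lower()[6:] for x in __all__]
print(choices)  # ['ebay', 'amazon']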
1. python watcher.py --help
2.
3. Usage: watcher.py [OPTIONS]
4.
5. Options:
6. --url TEXT Product URL to observe [required]
7. --provider [ebay|amazon] Provider [required]
8. --help Show this message and exit.
Code 13.29
We can see that when we want to start our watcher file, we specify a provider,
for instance amazon, and then in Code 13.28 (line 24) we try to find the
matching provider class. Once the provider class is found, we assign it as the
global provider (line 24).
Next, we start the cron loop processor (line 26), which calls the price-checking
function every minute (line 11). We can see that once the provider object
is initialized (line 14), we call its method product_details (line 15). Since we
do not have this method implemented yet, let's check how to do it for the amazon
module, as in the following code:
1. def product_details(self, product_url: str):
2. cookies = self.get_cookies()
3. r = requests.get(product_url, headers=self.headers, cookies=cookies)
4. soup = BeautifulSoup(r.content, 'html.parser')
5. result = soup.find_all('span', {"class": "aok-offscreen"})
6. price = float(result[0].text.replace('$', '').replace(',', '').strip())
7. return price
Code 13.30
When making the call to fetch the details URL, first we need to get the cookies (line 2)
and then fetch the URL itself (line 3). We use the Beautiful Soup parser
(lines 3-4) and find the HTML span element that contains the price details. Next, we
clean up the found price string (line 6) and return the value. Once we
have all this set, we can build a simple price tracker by updating Code
13.28 as in the following example.
1. import os
2. import pickle
3. from hashlib import sha256
4.
5. class Cron:
6. def __init__(self, product_url):
7. self._product_url = product_url
8. self._file_hash = sha256(product_url.encode("utf-8")).hexdigest()
9. self._file_path = f"/tmp/{self._file_hash}"
10.
11. def load_price(self):
12. if os.path.exists(self._file_path):
13. with open(self._file_path, "rb") as f:
14. return pickle.load(f)["price"]
15.
16. def save_price(self, price):
17. with open(self._file_path, "wb") as f:
18. return pickle.dump({"price": price}, f)
19.
20. def is_price_drop(self, current_price: float) -> bool:
21. return self.load_price() and current_price < self.load_price()
Code 13.31
We use the technique of calculating a hash (line 8) of the URL of the product we are planning to observe. This way we can later use the hash and file path (line 9) to read (unpickle) the last saved price of the checked item (lines 11-14). There is also a block of code (lines 16-18) that we call to "remember" (save) the currently found price.
We also added a method that returns a Boolean value indicating whether the current price has dropped compared to the previously checked price (lines 20-21).
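As a quick sanity check, a hypothetical usage of the class (the URL below is made up) would behave like this:

c = Cron("https://example.com/some-product")
print(c.load_price())            # None on the very first run
c.save_price(199.99)
print(c.is_price_drop(189.99))   # True, lower than the saved price
print(c.is_price_drop(209.99))   # False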
Once we have that simple class sorted, let us check how we can include it in our Code 13.28 with the following example.
1. @pycron.cron("*/1 * * * *")
2. async def check_product_availability(timestamp: datetime):
3. print(f"started checking price for: {provider_name}")
4. client = provider_name()
5. new_price = client.product_details(product_url)
6. c = Cron(product_url)
7. if c.is_price_drop(new_price):
8. print(f"Price is dropped! It was {c.load_price()}")
9. c.save_price(new_price)
Code 13.32
We updated the method in Code 13.32 (lines 5-8) so it tracks the price change since the last run of the task. When the price has dropped, we raise an alert (line 8). Unfortunately, it is not an eye-catching alert, since the print function only writes it to the console screen. Let us check how we can improve that part and send notifications to the OS.
1. $ pip install plyer pyobjus
Code 13.33
Once the module is installed, we can apply it in the updated Code 13.32 as in the following code.
1. from plyer import notification
2.
3. def send_alert(old_price: str, new_price: str):
4. notification.notify(
5. title="Price alert",
6. message=f"Price is dropped! It was {old_price} and it is now {new_price}",
7. app_icon=None,
8. timeout=10,
9. )
Code 13.34
After creating the helper function, we can include it in our cron job that checks price changes, like in the following example.
1. c = Cron(product_url)
2. if c.is_price_drop(new_price):
3. send_alert(c.load_price(), new_price)
4. c.save_price(new_price)
Code 13.35
We can see that the piece of code that was printing out the alert message (line 3) is now calling the alert helper function (Code 13.34). After running our watcher script, when a price drop occurs, we shall see something like in the following figure.
Figure 13.4: Example alert regarding the price drop for the observed product
Tracking multiple items
Having the alerts solution in place, we can think of rewriting our main watcher. The reason is that the solution we managed to build can track only a single item at once. If we want to be informed about price drops for more products, it will be better to deliver a configuration file that we may read.
Let us check the following code to see how an example configuration file may look.
1. amazon:
2. - url 1
3. - url 2
4. - url 3
5. ebay:
6. - url 2
7. - url 3
Code 13.36
We used the YAML12 standard for the URLs configuration file. As shown, we define a dictionary key per provider and the URLs to observe as a list under each provider. A quick sanity check of the parsing is sketched below.
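Assuming PyYAML is available (pip install pyyaml; Code 13.37 relies on it as well), this minimal sketch shows how the file parses:

import yaml

with open("watcher.yaml") as f:
    config = yaml.safe_load(f)

print(config)
# {'amazon': ['url 1', 'url 2', 'url 3'], 'ebay': ['url 2', 'url 3']}

With the configuration format confirmed, let us check in the following code how we can update Code 13.28 so it can support a list of observed URLs instead of a single one.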
1. class PriceChecker:
2. def __init__(self):
3. with open("watcher.yaml", "rb") as f:
4. config = yaml.safe_load(f)
5. self._providers = {}
6. for provider in config:
7. provider_cls = [x for x in __all__ if f"Client{provider.capitalize()}" == x.__name__].pop()
8. self._providers[provider_cls] = config.get(provider)
9.
10. def check_prices(self):
11. for provider in self._providers:
12. for product_url in self._providers.get(provider):
13. client = provider()
14. print(f'Check product details: {product_url}')
15. new_price = client.product_details(product_url)
16. c = Cron(product_url)
17. if c.is_price_drop(new_price):
18. send_alert(c.load_price(), new_price)
19. c.save_price(new_price)
Code 13.37
We created a new class called PriceChecker that loads the configuration file (line 3) in its constructor (the module needs import yaml at the top). The configuration file, as we said before, contains all the names of the providers and the related URLs to track. Next, we convert the loaded config into a dictionary that keeps the same URL lists as its values, but every dictionary key is not the name of the provider, it is the provider class itself (lines 6-8). Once that is sorted, we have a method check_prices that is going to be used with the following code.
1. @pycron.cron("*/5 * * * *")
2. async def check_product_availability(timestamp: datetime):
3. p = PriceChecker()
4. p.check_prices()
Code 13.38
We updated the cron function (line 1) to be executed every 5 minutes, so we check prices in a less aggressive way. From Code 13.37 we can see that checking prices for every provider and URL works as a loop nested in a loop. This is a very inefficient way of processing any kind of time-consuming work, and in this case it even makes things worse by blocking on external website calls. To address this concern, we will change Code 13.37 to support asynchronous processing. First, we need to install the following package, which we already learned how to use in Chapter 12, Developing A Method For Monitoring Websites.
1. $ pip install asyncio-pool
Code 13.39
Once it is installed, let us refactor the mentioned Code 13.37 as follows.
1. async def get_single_product_details(self, provider, product_url):
2. client = provider()
3. print(f"Check product details: {product_url}")
4. new_price = await client.product_details(product_url)
5. c = Cron(product_url)
6. if c.is_price_drop(new_price):
7. send_alert(c.load_price(), new_price)
8. c.save_price(new_price)
9.
10. async def check_prices(self):
11. calls = []
12. async with AioPool(size=5) as pool:
13. for provider in self._providers:
14. for product_url in self._providers.get(provider):
15. calls.append(await pool.spawn(self.get_single_product_details(provider, product_url)))
16.
17. def start_processing(self):
18. asyncio.run(self.check_prices())
Code 13.40
We have updated the looping code so that we use an async coroutine pool (line 12) to limit the number of simultaneous calls that we perform. The reason we do not want to be too aggressive with the number of requests is that we are targeting live sites, and they can start blocking our requests.
We also introduced a method in lines 17-18 that starts the asyncio event loop, which makes sure that all the async coroutines from lines 10-15 are properly executed.
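To see the throttling pattern in isolation, here is a minimal, self-contained sketch using the same asyncio-pool API, with a sleep standing in for a real HTTP call:

import asyncio
from asyncio_pool import AioPool

async def fetch(i):
    # stand-in for a real HTTP request
    await asyncio.sleep(1)
    print(f"done {i}")

async def main():
    # at most 2 coroutines run at the same time
    async with AioPool(size=2) as pool:
        for i in range(6):
            await pool.spawn(fetch(i))

asyncio.run(main())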
Another thing to notice is the fact that the whole class has been changed so that it works with asynchronous calls. To be able to use their benefits, we have to change our cron function as well. Let us check the following example to see how we can achieve this.
1. @pycron.cron("*/5 * * * *")
2. async def check_product_availability(timestamp: datetime):
3. p = PriceChecker()
4. p.start_processing()
Code 13.41
We updated the main cron function so that it initializes the PriceChecker class (line 3) and calls the entry starting function in line 4. Since our cron processor already works in an async way and has its own coroutine engine, we had to start a new coroutine engine (Code 13.40, lines 17-18).
Now, to fully support asynchronous calls, we need to update the methods for checking price details in the individual client classes. First, let us check how we are going to update the class for the Amazon client (Code 13.17) with the next example. We shall first install the following module to be able to fulfil the async requirements.
1. $ pip install httpx
Code 13.42
Once the module is ready and installed, we can start refactoring the method for getting price details, like in the following example.
1. async def get_async_cookies(self):
2. url = "https://www.amazon.com"
3. async with httpx.AsyncClient() as client:
4. response = await client.get(url, follow_redirects=True, headers=self.headers)
5. if response.status_code == 200 and len(response.text) >= 50:
6. return response.cookies
7.
8. async def product_details(self, product_url: str):
9. cookies = await self.get_async_cookies()
10. async with httpx.AsyncClient() as client:
11. r = await client.get(product_url, headers=self.headers, cookies=cookies)
12. soup = BeautifulSoup(r.content, "html.parser")
13. result = soup.find_all("span", {"class": "aok-offscreen"})
14. if result:
15. price = float(result[0].text.replace("$", "").replace(",", "").strip())
16. return price
Code 13.43
We updated the Amazon client in two places: the main method for checking item price details, and the support method (lines 1-6) that calls the generic Amazon page to extract cookies from it. Later, we use those fetched cookies to be able to make valid calls. We can notice as well that the httpx13 API is similar to the requests library, so converting synchronous calls to async ones is a pretty straightforward process.
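For reference, a minimal standalone httpx sketch (the URL is just an example of ours) shows the whole async request pattern:

import asyncio
import httpx

async def main():
    async with httpx.AsyncClient() as client:
        r = await client.get("https://www.example.com")
        print(r.status_code, len(r.text))

asyncio.run(main())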
Next, we have to refactor the code of the eBay client class. To do so, we need to update the client code as in the following examples, but first we shall install the required Python module so we can use the HTTP214 protocol with eBay.
1. $ pip install 'httpx[http2]'
Code 13.44
Once the module is installed, let us look at the following example to see how we can make async calls with HTTP2 for the eBay client.
1. import asyncio
2. import httpx
3. from bs4 import BeautifulSoup
4.
5. class ClientEbay:
6. HEADERS = {
7. "User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
8. "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,
image/avif,image/webp,image/apng,*/*;q=0.8,application/
signed-exchange;v=b3;q=0.7",
9. «Accept-Language»: «en-US,en;q=0.9»,
10. "Accept-Encoding": "gzip, deflate, br",
11. }
12.
13. async def product_details(self, product_url):
14. async with httpx.AsyncClient(headers=self.HEADERS, http2=True, follow_redirects=True) as client:
15. r = await client.get(product_url)
16. s = BeautifulSoup(r.content, "html.parser")
17. divs = s.find_all('div', {"class": ["x-bin-price", ]})
18. spans = divs[0].find_all("span", {"class": ["ux-textspans",]})
19. price = spans[0].text.replace("$", "").replace(",", "").replace('US', '').strip()
20. return price
Code 13.45
We introduced async calls as in the example Code 13.43 (Amazon client). In this case we use HTTP2 (line 14). The rest of the code simply extracts HTML nodes to get the final price of the product. Once all is ready, we return the found value as we did in Code 13.43.
Historical values
When we have all the checkers ready and recording current values for the tracked products, the next step will be to show the values for those products, so we can know on demand what the current prices are if we are willing to buy a product manually.
Let us check the following code to see how to achieve that:
1. import click
2.
3. class PriceChecker:
4. def show_prices(self):
5. for provider, urls in self._providers.items():
6. print(f"Checking for prices: {provider}")
7. for product_url in urls:
8. c = Cron(product_url)
9. if c.load_price():
10. print(f"Current price: {c.load_price()}")
11. else:
12. print("Price data not found.")
13.
14. @click.command()
15. @click.option("--
price", help="Show prices", required=False, default=False, is_flag=True
)
16. @click.option("--
watch", help="Watch prices", required=False, default=False, is_flag=Tr
ue)
17. def main(price, watch):
18. if watch:
19. pycron.start()
20. elif price:
21. p = PriceChecker()
22. p.show_prices()
23.
24.
25. if __name__ == "__main__":
26. main()
Code 13.46
We introduced a new entry point in the main function where we support two options: the watch parameter (lines 18-19), which runs the code in the mode where we check and analyze the cheapest available price, and the price parameter (lines 20-22), which helps us check what the current price is after running the code with --watch. In the PriceChecker class (lines 3-12) we introduced code that loads and prints the current prices available for the products that we managed to check.
Let us run the following code and see example output of our program.
1. $ python watcher.py --watch
2.
3. Exception in thread Thread-1:
4. Traceback (most recent call last):
5. (..)
6. asyncio.run(self.check_prices())
7. raise RuntimeError(
8. RuntimeError: asyncio.run() cannot be called
from a running event loop
9. ^C
10. Aborted!
11. sys:1: RuntimeWarning: coroutine 'PriceChecker.check_prices'
was never awaited
12. RuntimeWarning: Enable tracemalloc to get
the object allocation traceback
Code 13.47
We can see that so far we have been building our watcher program around the cron job concept, but when we run it, it crashes with the error shown in Code 13.47. You may wonder what it means and why it happened. The reason is quite trivial: when we run cron jobs, we run them in a coroutine loop, as mentioned before. That has a limitation: we cannot run another coroutine loop inside an already started loop (Code 13.40, lines 17-18). To fix that limitation, we need to update our Code 13.40 with the following code.
1. from concurrent.futures import ThreadPoolExecutor
2.
3. class PriceChecker:
4. def start_processing(self):
5. try:
6. asyncio.get_running_loop()
7. # Create a separate thread
8. with ThreadPoolExecutor(1) as pool:
9. result = pool.submit(lambda: asyncio.run(self.check_prices())).result()
10. except RuntimeError:
11. result = asyncio.run(self.check_prices())
Code 13.48
Basically, the method start_processing, which is the main entry point of the price checker, tries to get the current event loop (line 6), and when this does not raise an exception, we know that we are already in the middle of a coroutine loop. Thus, we have to start a new event loop in a newly started thread (lines 8-9). This solution guarantees that there is no clash between 2 loops running in the same thread.
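A tiny sketch illustrates this detection trick on its own:

import asyncio

def where_am_i():
    try:
        asyncio.get_running_loop()
        return "inside a running event loop"
    except RuntimeError:
        return "no event loop running"

print(where_am_i())           # no event loop running

async def main():
    print(where_am_i())       # inside a running event loop

asyncio.run(main())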
What is also worth noticing are lines 10-11, where, when there is no loop running, we can start a new one in the current thread. After executing the code, we wait until all the prices are collected. Next, we can execute the following code and check the collected results.
1. $ python watcher_6.py --price
2.
3. Checking for prices: <class 'clients.amazon.ClientAmazon'>
4. Current price: 349.99
5. Current price: 389.99
6. Checking for prices: <class 'clients.ebay.ClientEbay'>
7. Current price: 289.99
Code 13.49
We print out the saved values as shown in Code 13.49. It is easy to notice from Code 13.46 (lines 3-12) that we iterate over the providers read from the config file (Code 13.37, line 3) one by one. Thus, if we want to check the cheapest price shown in Code 13.49 (line 7), we can open the config file, take the first URL specified in the eBay section, open it in a browser and check the details to see why the price is the lowest.
Auto purchase product
So far, we have learned how to write an application that can monitor and scan product prices and push alerts whenever a price drops. Getting alerted about a product price change is very valuable if you notice it on time. A further improvement will be updating our application so it can purchase the monitored item automatically.
Before we update our main code, let us learn how to store credit card details. Naturally, many webstore platforms support storing credit cards securely in their systems. In the following examples we are going to learn how to store card details in a secure way on our local computer. First, let us install the following Python package.
1. $ pip install cryptography
Code 13.50
When the module is installed, we can write a simple example that supports creating an encryption key and saving encrypted information. First, let us create a JSON file called cc.json with the following content.
1. {
2. "name": "card name",
3. "numbert": "1111111111111111",
4. "month": "12",
5. "year": "2032"
6. }
Code 13.51
This JSON is going to store our credit card details that will be used later for automatically purchasing the item that has the best price. As we said before, we must encrypt sensitive card information. To do so, first let us check how we can encrypt data with an encryption key. For the encryption key we are going to use symmetric encryption, the Fernet15 algorithm. Let us check the following example, create_key.py, to create the encryption key.
1. from cryptography.fernet import Fernet
2.
3. key = Fernet.generate_key()
4. with open('key','wb') as f:
5. f.write(key)
Code 13.52
Once the file is executed, it creates an encryption key file called key that we can use later in the following example to encrypt the previously created JSON file.
1. from cryptography.fernet import Fernet
2.
3. with open('key','rb') as f:
4. key = f.read()
5.
6. with open('cc.json','rb') as f:
7. data = f.read()
8.
9. fernet = Fernet(key)
10. encrypted = fernet.encrypt(data)
11.
12. with open('cc.data','wb') as f:
13. f.write(encrypted)
Code 13.53
In this code we can see that the credit card data loaded from the configuration file (lines 6-7) is then encrypted with the mentioned key (lines 9-10) and saved to the output file (lines 12-13). When we try to print the content of the encrypted file, it looks like the following.
1. $ cat cc.data
2.
3. gAAAAABnKiEHWk6R49V(...)wx2hxn-0g==%
Code 13.54
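To verify the round trip, a short sketch (reusing the key and cc.data files created above) decrypts the file back into the original JSON:

from cryptography.fernet import Fernet

with open('key', 'rb') as f:
    key = f.read()

with open('cc.data', 'rb') as f:
    encrypted = f.read()

fernet = Fernet(key)
print(fernet.decrypt(encrypted).decode('utf-8'))
# prints the original cc.json content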
Let us update Code 13.53 with the following example, so we can read the credit card from the command line instead of reading a flat file that we would have to store locally. Remember that we want to keep the CC details safe, which means encrypted.
1. import click
2. import json
3. from cryptography.fernet import Fernet
4.
5. def main():
6. cc_name = click.prompt("Enter a credit card name", type=str)
7. if not cc_name:
8. return
9. cc_number_default = int('1'*16)
10. cc_number = click.prompt("Enter a credit card number", type=click.IntRange(cc_number_default), default=cc_number_default)
11. if not cc_number:
12. return
13. cc_exp_month = click.prompt("Enter a credit card expiry month", type=click.IntRange(1, 12), default=1)
14. if not cc_exp_month:
15. return
16. cc_exp_year = click.prompt("Enter a credit card expiry year", type=click.IntRange(2024, 2050), default=2025)
17. if not cc_exp_year:
18. return
19.
20. data = json.dumps({
21. "name": cc_name,
22. "numbert": cc_number,
23. "month": cc_exp_month,
24. "year": cc_exp_year
25. }).encode("utf-8")
26.
27. with open('key','rb') as f:
28. key = f.read()
29.
30. fernet = Fernet(key)
31. encrypted = fernet.encrypt(data)
32.
33. with open('cc.data','wb') as f:
34. f.write(encrypted)
35.
36. if __name__ == "__main__":
37. main()
Code 13.55
Now we have a configuration file that can be created on the fly by running a script, instead of converting an unencrypted JSON file to an encrypted one. Delivering the solution this way is more secure and does not leave a trace of the real card details. Let us create the updated payment module (Figure 13.1) cc.py for our application, as in the following example.
1. import json
2. from cryptography.fernet import Fernet
3.
4.
5. class PaymentCreditCard:
6. def __init__(self, enc_key):
7. self.encryption_key = enc_key
8.
9. def load_cc_settings(self):
10. enc = Fernet(self.encryption_key)
11. with open('cc.data', 'rb') as f:
12. encoded_data = f.read()
13. data = json.loads(enc.decrypt(encoded_data))
14. return data
15.
16. def pay(self):
17. # here we can include logic how to perform payment
18. return self.load_cc_settings()
Code 13.56
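A quick usage sketch, assuming the key file created by Code 13.52 and the cc.data file sit next to the script:

with open('key', 'rb') as f:
    key = f.read()

payment = PaymentCreditCard(key)
print(payment.load_cc_settings())
# prints the decrypted credit card dictionary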
To be able to use this payment class, we have to update Code 13.37, where we have the method for sending an alert to the system notifications (send_alert). Let us check the following examples to see how we can do this. First, let us add the following code update, so we can create and remove a lock file.
1. class PriceChecker:
2. def __init__(self):
3. self.__lock_file = '/tmp/price_check.lock'
4.
5. def load_key(self):
6. with open('/tmp/key', 'rb') as f:
7. return f.read()
8.
9. def __create_lock(self):
10. with open(self.__lock_file, 'w') as f:
11. f.write('locked')
12.
13. def remove_lock(self):
14. if os.path.exists(self.__lock_file):
15. os.remove(self.__lock_file)
16.
17. def __is_purchase_is_locked(self):
18. return os.path.exists(self.__lock_file)
Code 13.57
We updated the constructor of one of our main classes (lines 2-3), where we specify the lock file path (line 3). The next methods introduce creating the lock file (lines 9-11), removing the lock (lines 13-15) and checking if the lock exists (lines 17-18). Let us check in the following code how we can use the fact that the lock file exists.
1. async def get_single_product_details(self, provider, product_url):
2. client = provider()
3. print(f"Check product details: {product_url}")
4. new_price = await client.product_details(product_url)
5. c = Cron(product_url)
6. if c.is_price_drop(new_price):
7. send_alert(c.load_price(), new_price)
8. cc = PaymentCreditCard(self.load_key())
9. client.buy(cc.pay())
10. self.__create_lock()
11. c.save_price(new_price)
Code 13.58
In the presented code, we create the lock when the actual item is being purchased (lines 8-10). This acts as a semaphore flag for the following code, signaling that the purchase has already happened and there is no sense in checking any prices further.
1. def start_processing(self):
2. if self.__is_purchase_is_locked():
3. print('We already purchased product, canceling processing...')
4. return
5. try:
6. asyncio.get_running_loop()
7. # Create a separate thread
8. with ThreadPoolExecutor(1) as pool:
9. result = pool.submit(lambda: asyncio.run(self.check_prices())).result()
10. except RuntimeError:
11. result = asyncio.run(self.check_prices())
Code 13.59
We updated the main entry method for processing price checks with a very important check (lines 2-4). When we detect that the file lock already exists, we cancel any further price checks and print a warning message on the screen. This way we can avoid accidentally purchasing the same product multiple times. Please notice that Code 13.59 is executed by cron, which means we run it every 5 minutes.
Conclusion
In this chapter, we learned how to use Python to build quite an advanced tool that can help us hunt for the best price available on multiple e-commerce platforms. We learned how to do it in a modular way, so we can add support for more websites. At the same time, we dove into the topic of auto purchase. We also analyzed how to store sensitive data in a secure way and how to utilize this data whenever it is needed.
In the following chapter, we are going to learn how to use Python to build mobile applications.
1. https://en.wikipedia.org/wiki/API
2. https://developer.ebay.com/develop/apis/restful-apis/buy-apis
3. https://developer.ebay.com
4. https://developer.ebay.com/Devzone/XML/docs/ReleaseNotes.html
5. https://beautiful-soup-4.readthedocs.io/en/latest/
6. https://docs.python.org/3/howto/sorting.html
7. https://click.palletsprojects.com/en/8.1.x/options/#choice-options
8. https://opensource.com/article/17/11/how-use-cron-linux
9. https://pypi.org/project/python-cron/
10. https://docs.python.org/3/library/asyncio.html
11. https://crontab.guru
12. https://yaml.org
13. https://www.python-httpx.org
14. https://datatracker.ietf.org/doc/html/rfc9113
15. https://cryptography.io/en/latest/fernet/
Join our book's Discord space
Join the book's Discord Workspace for Latest updates, Offers, Tech happenings around the world, New Release and Sessions with the Authors:
https://discord.bpbonline.com
CHAPTER 14
Python Goes Mobile
Introduction
In this chapter, we will show how you can use Python on mobile devices (smartphones) and how to run your own Python programs on those rather special platforms. You will learn how to write small and efficient code and use it on mobile systems. The goal of this chapter is to teach you how to write mobile applications using Python.
Structure
This chapter will cover the following topics:
Brief introduction to mobile applications - their concept and limitations
Overview of Python libraries for mobile devices
Calculator in Python for an iOS and Android
Objectives
In this chapter, we will build a simple yet powerful application that demonstrates how to deploy a fully Python-driven mobile app. We will learn how to prepare such an application from concept to an actual running MVP1. We are going to dive a bit into the topic of mobile operating systems to learn how to run a Python application on top of them.
Basics
In the mobile world, things work differently when we compare how applications run in the system space. Without going into many details, the key points to highlight are:
Applications are limited by programming language. In other words, as a developer you cannot run your application written in any language you want; you are bound by the operating system, which means:
iOS will limit you to Objective-C and Swift.
Android is the realm of Java.
Applications run in a sandbox and do not have access to some system resources.
The GUI is driven by the OS, and writing custom components is quite challenging.
Writing applications in languages other than those listed above is possible, but you will have to translate them to the native OS language.
Apple's mobile system, called iOS2, has been modified in many ways since it was introduced, although the basic core concepts stayed the same. Every application is written and compiled using the XCode3 ecosystem, which means the only way to get an application running under iOS is through Swift4 or Objective-C5 and by using the provided UI toolkit. This makes applications consistent (in their look and feel) but gives us a limitation as well. If we want to deliver an application for iOS in pure Python, we will have to convert it to a language and libraries that iOS can understand. In the following example we can see Swift example code, and we can notice how different it is from Python.
1. struct Player {
2. var name: String
3. var highScore: Int = 0
4. var history: [Int] = []
5.
6. init(_ name: String) {
7. self.name = name
8. }
9. }
10.
11. var player = Player("Tomas")
Code 14.1
So, let us check what it looks like in the Android world. Here we have a very similar situation: even if the core system itself is more flexible and more open, because it is used by many mobile device manufacturers, it is still in essence driven by Java. That means that if we want to deliver a mobile application, we need to build it using Android Studio and, in essence, deliver the core code in Java6.
We can see that we have a common path in these 2 major mobile operating systems: we need to deliver our Python application translated to native code that can be run by the mobile operating system. Let us see in the following code how example code in Java differs from Python.
1. public class Main {
2. public static void main(String[] args) {
3. System.out.println("Hello World");
4. }
5. }
Code 14.2
In the following subchapters, we will learn how we can use pure Python code.
Python GUI
So far, we explored the mobile OS options to understand the key points of how they work and what kind of limitations we can expect as developers when writing a Python application for a mobile device.
Let us learn in this subchapter how we can write Python GUI applications in the first place. When exploring mainstream libraries for Python, we can highlight these:
TK (Tkinter)7: Very basic yet powerful, and one of the oldest libraries for building Python applications with support for a user interface.
wxWidgets8: Mature, with lots of powerful widgets.
QT9: Commercial yet free-licensed GUI library, and like the previously mentioned ones, mature and powerful.
Kivy10: This seems to be the youngest player in the business of graphical interfaces, but it has a lot of great features. One of them is portability, and this is what we will choose to demonstrate how to build a mobile application.
Toga11: Quite a simple yet powerful library for building graphical interfaces.
GUI
In this section, we will briefly check a few Python GUI libraries to see where they shine and how much they differ when we want to build a desktop application.
Toga
As we said, this is quite a simple yet powerful library that is multiplatform ready and helps to build quite sophisticated graphical interfaces. Let us start by installing the library itself.
1. $ pip install toga
Code 14.3
Once we have it installed, let us build our very first hello world application.
1. import toga
2.
3. class MyApp(toga.App):
4. def startup(self):
5. self.main_window = toga.MainWindow()
6. self.main_window.content = toga.Box(children=
[toga.Label("Hello!")])
7. self.main_window.show()
8.
9. if __name__ == '__main__':
10. app = MyApp("Realistic App", "org.python.code")
11. app.main_loop()
Code 14.4
As we can notice in the example Code 14.4, we create the class MyApp (line 3), which inherits from toga.App, the base class of the Toga framework where all the necessary initializations take place that lead to generating and displaying the application window.
Kivy
Let us create our very first Kivy GUI example. This subchapter covers building a desktop application. We will learn how to build a hello application first, and then design a calculator based on the learned foundations.
First, we have to install the Python modules that will help us build such an application.
1. $ pip install Kivy Kivy-examples Kivy-Garden
Code 14.5
Once we have the module installed, we shall create our first template of the application. Let us create a simple hello world example. Let us check the following code.
1. Label:
2. id: entry
3. font_size: 24
4. multiline: False
5. text: "hello world"
6. size: 150, 44
Code 14.6
In the UI configuration file (Code 14.6), which we called kivy_example.kv, we declared that we are going to use the element Label12, which additionally has other parameters declared, like the element id (line 2) and the actual text that we are about to display.
Now we need to consume this UI element definition in our application. To do so, we shall import a few elements from the Kivy library and load the application UI. Let us check the following code to see how we can achieve this.
1. import kivy
2.
3. from kivy.app import App
4. from kivy.lang import Builder
5. kivy.require('1.9.0')
6.
7. from kivy.core.window import Window
8. from kivy.uix.gridlayout import GridLayout
9. from kivy.config import Config
10.
11. Config.set('graphics', 'resizable', 0)
12.
13. class kivExampleApp(App):
14.
15. def build(self):
16. return Builder.load_file("kivy_example.kv")
17.
18. def main():
19. calcApp = kivExampleApp()
20. return calcApp.run()
21.
22. main()
Code 14.7
We defined the configuration for the application (line 11), where we specified that our application cannot be resized: the user cannot change the size of the main window. Next, we declared the custom class kivExampleApp, which inherits from the Kivy App (line 13). The reason we inherit is that we want to load our user interface (UI) from the definition file (Code 14.6).
The method where we load the UI definition file is build (Code 14.7, line 15), where we specify all the elements of the window. That is why we load all the elements with their corresponding configuration from a file instead of putting them (code driven) in the window, which would not be easy to read. As mentioned, we load the window elements from the configuration file (line 16).
Next, we created the main function (lines 18-20), where we initialize our custom class and run the core of our desktop app with it. Let us check the following figure to see how the application is going to look in a desktop environment.
Figure 14.1: Example hello world application driven by Kivy library
After executing our code, we can see the application window with our predefined label element (Code 14.6) centered in the middle of the window, with the requested hello world text.
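For comparison, the same hello world can be built without a .kv file by defining the Label purely in code; a minimal sketch:

from kivy.app import App
from kivy.uix.label import Label

class HelloApp(App):
    def build(self):
        # same label as declared in kivy_example.kv, but code driven
        return Label(text="hello world", font_size=24)

HelloApp().run()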
Compiler
That being said, we need to look for options for delivering Python code to mobile operating systems. We could explore options like Iron Python13, which helps to write Python applications in the .NET14 environment and then compile to native code for the selected mobile device, Android or iOS. This technique is a bit complex and reaches far beyond our interest in this chapter, so we will find another way.
After researching, we can agree that the briefcase15 library addresses our need perfectly: we can compile and package our Python application. First, we must install the library and dependencies like in the following code.
1. pip install briefcase
Code 14.8
Once we have installed the main library and all the dependencies, we can create our blank hello world application. Let us check the following code to see how to do this.
1. $ briefcase new
Code 14.9
After running this command, the system is going to ask us a few questions so it can create the template hello world application.
1. Formal Name [Hello World]: <enter>
2.
3. App Name [helloworld]: <enter>
4.
5. Bundle Identifier [com.example]: <enter>
6.
7. Project Name [Hello World]: <enter>
8.
9. Description [My first application]: <enter>
10.
11. Author [Jane Developer]: <enter>
12.
13. Author's Email [[email protected]]: <enter>
14.
15. Application URL [https://example.com/helloworld]: <enter>
16.
17. What license do you want to use for this project's code?
18.
19. 1) BSD license
20. 2) MIT license
21. 3) Apache Software License
22. 4) GNU General Public License v2 (GPLv2)
23. 5) GNU General Public License v2 or later (GPLv2+)
24. 6) GNU General Public License v3 (GPLv3)
25. 7) GNU General Public License v3 or later (GPLv3+)
26. 8) Proprietary
27. 9) Other
28.
29. Project License [1]: 1
30.
31. What GUI toolkit do you want to use for this project?
32.
33. 1) Toga
34. 2) PySide6 (does not support iOS/Android deployment)
35. 3) PursuedPyBear (does not support iOS/Android deployment)
36. 4) Pygame (does not support iOS/Android deployment)
37. 5) None
38.
39. GUI Framework [1]: 1 <enter>
Code 14.10
As is noticeable, we are creating an application that is going to use Toga for the user interface; we already did some example UI in Code 14.4.
Once all is set, we can install some components that are essential to build a mobile application. In this subchapter we will focus on building and compiling the application for the iOS system.
First, we must install XCode16 from the AppStore. Once we install it, we have to open XCode and install the iPhone emulator; when this book was written, iOS 17.4 was the latest available for emulation.
When the emulator is ready and installed, we can shut down XCode and run the below code to prepare our hello world example.
1. $ briefcase create iOS
Code 14.11
Once the Python environment for our mini example is prepared, we need to compile it; that means translating the Python code that we prepared (Code 14.9) into iOS binary machine code. To do so, we have to run the following example.
1. $ briefcase build iOS
Code 14.12
In the following example you can check what valid, error-free compilation output should look like.
1. [helloworld] Updating app metadata...
2. Setting main module... done
3.
4. [helloworld] Building Xcode project...
5. Building... done
6.
7. [helloworld] Built build/helloworld/ios/xcode/build/Debug-iphonesimulator/Hello World.app
Code 14.13
When the build is ready, we can finally run it locally in the emulator. To run the compiled code in the emulator, we have to execute the following command.
1. $ briefcase run iOS
Code 14.14
After running Code 14.14 we can check what the application is going to look like in the emulator. Let us check the following screenshot.
Figure 14.2: Example hello world application run in iOS 17.4 emulator
Calculator
After having successfully compiled and started the application in the iOS emulator, it is time to prepare our calculator program. As the main step, we need to start preparing the template of the calculator program. For this we are going to use the following code example.
1. $ briefcase new
Code 14.14
It is easy to notice that we follow the same syntax as in the previous hello world code, albeit in this case, when answering questions like in the example Code 14.10, we are going to use a new name: calculator. Once the name is given to our new application, please also remember that in this case we also use the GUI library Toga (as shown in example 14.10).
When all is set, we need to update our main application code so it can draw calculator buttons in the UI. To do so, we shall modify the main application source file src/calculator/app.py, which looks like the following code.
1. import toga
2. from toga.style import Pack
3. from toga.style.pack import COLUMN, ROW
4.
5.
6. class calculator(toga.App):
7. def startup(self):
8. """Construct and show the Toga application.
9.
10. Usually, you would add your application to a
main content box.
11. We then create a main window
(with a name matching the app), and
12. show the main window.
13. """
14. main_box = toga.Box()
15.
16. self.main_window = toga.MainWindow(title=self.formal_name)
17. self.main_window.content = main_box
18. self.main_window.show()
19.
20.
21. def main():
22. return calculator()
Code 14.15
The original Code 14.15 must be modified so that we can generate calculator buttons. Let us start with the following code, where we generate a button on top of a box inside another box. Let us first check how we are going to design the UI for those requirements.
Figure 14.3: Example calculator application with button and panes
As we see in the example figure, to draw a button in the application, we need to create a pane and put the button inside it. We can of course control the position of buttons inside a pane and the size of the buttons themselves. Let us check in the following code how we can achieve this.
1. import toga
2.
3. def build():
4. c_box = toga.Box()
5. box = toga.Box()
6.
7. input_field = toga.TextInput()
8. button = toga.Button("Calculate")
9.
10. c_box.add(button)
11. c_box.add(input_field)
12.
13. box.add(c_box)
14. return box
15.
16. class main_app(toga.App):
17. def startup(self):
18. main_ui= build()
19.
20. self.main_window = toga.MainWindow(title="test application")
21. self.main_window.content = main_ui
22. self.main_window.show()
Code 14.16
We can see in the example code that generating the UI happens in a separate function (lines 3-14) that is then assigned to the main window content (line 21). We also applied the technique shown in Figure 14.3, where we append a button (line 8) and an input field (line 7) to a box pane (lines 10-11). In the end, we append the pane c_box to the main application box (line 13).
After having all the basic panes and elements prepared, we add them to the main application window (line 21).
After learning this basic technique, let us check how we can add some colors to the input field. Let us check the following content to see how this can be done.
1. from toga.style import Pack
2. from toga.style.pack import LEFT
3.
4. button = toga.Button("Calculate", style=Pack(background_color="#eeee
ff", text_align=LEFT))
Code 14.17
This example demonstrates how to add a background color to a button (line 4) and align its text to the left (instead of the default center). We use the Pack module17 to apply HTML-like attributes for drawing UI elements.
In the next example, we will see how we can attach an action to the button that is pressed by the user. Let us check the following example to see how we can achieve this.
1. import random
2. from toga.style import Pack
3. from toga.style.pack import LEFT
4.
5.
6. def calculate(widget):
7. return random.randint(1, 500)
8.
9. button = toga.Button("Calculate", style=Pack(background_color="#eeee
ff"), on_press=calculate)
Code 14.18
We define the same button, albeit it has an additional argument (line 9), on_press, which tells the toga engine which function to call once the user presses the button. As is noticeable, we are calling a function that randomly generates an integer number from 1 to 500 (lines 6-7). This part of the code can be used not only to trigger some action but to actually drive how the UI looks after an action like pressing a button. We could assign the value generated by the function calculate to some element that is part of the user interface, so the user sees what value was generated. Please notice that an application with a graphical user interface (GUI) does not have a console where we could print out any kind of message for the user to follow. Let us check in the proceeding code how we could take the mentioned action.
1. import random
2. from toga.style import Pack
3. from toga.style.pack import LEFT
4.
5.
6. def build():
7. c_box = toga.Box()
8. box = toga.Box()
9.
10. input_field = toga.TextInput()
11. def calculate(widget):
12. input_field.value = random.randint(1, 500)
13.
14. button = toga.Button("Calculate", style=Pack(background_color="#e
eeeff"), on_press=calculate)
15.
16. c_box.add(button)
17. c_box.add(input_field)
18.
19. box.add(c_box)
20. return box
Code 14.19
In the code we are using the technique that we have already learned, where the button-pressed action is called, and additionally we added the result assignment to the input field. Let us check the following figure to see how the look and feel of our application is going to change.
Figure 14.4: Example responsive application with result assignment
As demonstrated, we present a randomly generated number upon the Calculate button being pressed. We can notice as well that the library automatically translates the generated integer value to a string that can be inserted into the input field.
After learning all the necessary basics of how we can create a GUI for our iOS application, we can start building the calculator layout.
1. def calculator_ui():
2. c_box = toga.Box()
3. row1_box = toga.Box()
4. row2_box = toga.Box()
5. row3_box = toga.Box()
6. row4_box = toga.Box()
7. box = toga.Box()
8.
9. result_input = toga.TextInput(readonly=True, style=Pack(background_color="#eeeeff", flex=1))
10. f_input = toga.TextInput()
11. result_label = toga.Label("Result", style=Pack(text_align=RIGHT, fle
x=1))
12.
13. button_7 = toga.Button("7", style=Pack(flex=1))
14. button_8 = toga.Button("8", style=Pack(flex=1))
15. button_9 = toga.Button("9", style=Pack(flex=1))
16. button_x = toga.Button("x", style=Pack(flex=1))
17.
18. button_4 = toga.Button("4", style=Pack(flex=1))
19. button_5 = toga.Button("5", style=Pack(flex=1))
20. button_6 = toga.Button("6", style=Pack(flex=1))
21. button__ = toga.Button("-", style=Pack(flex=1))
22.
23. button_3 = toga.Button("3", style=Pack(flex=1))
24. button_2 = toga.Button("2", style=Pack(flex=1))
25. button_1 = toga.Button("1", style=Pack(flex=1))
26. button_plus = toga.Button("+", style=Pack(flex=1))
27.
28. button_0 = toga.Button("0", style=Pack(flex=1))
29. button_div = toga.Button("÷", style=Pack(flex=1))
30. button_equal = toga.Button("=", style=Pack(flex=1))
31.
32. c_box.add(result_label)
33. c_box.add(result_input)
34.
35. row1_box.add(button_7)
36. row1_box.add(button_8)
37. row1_box.add(button_9)
38. row1_box.add(button_x)
39. row1_box.style.update(padding=5)
40.
41. row2_box.add(button_4)
42. row2_box.add(button_5)
43. row2_box.add(button_6)
44. row2_box.add(button__)
45. row2_box.style.update(padding=5)
46.
47. row3_box.add(button_1)
48. row3_box.add(button_2)
49. row3_box.add(button_3)
50. row3_box.add(button_plus)
51. row3_box.style.update(padding=5)
52.
53. row4_box.add(button_0)
54. row4_box.add(button_div)
55. row4_box.add(button_equal)
56. row4_box.style.update(padding=5)
57.
58. box.add(c_box)
59. box.add(row1_box)
60. box.add(row2_box)
61. box.add(row3_box)
62. box.add(row4_box)
63.
64. box.style.update(direction=COLUMN, padding=10)
65. return box
Code 14.20
After executing our calculator code, we can see what our application is going to look like in development mode.
Figure 14.5: Example calculator application running in development mode
Calculation logic
In Code 14.20 we introduced a new attribute (line 9) that makes the input field read only; it helps us prevent the user from editing the input field that should only be used for presenting calculator results. Another newly introduced thing is the flex attribute, which forces a graphical element like a button to fill all the available space in a row.
In the following code, we will see a snippet of how we can add support for each individual button that the user presses.
1. class CalculatorMod:
2. def __init__(self, result_widget):
3. self.storage_1 = []
4. self.storage_2 = []
5. self.operator = None
6. self.result_widget = result_widget
7.
8. def addValue(self, widget):
9. if not self.operator and not self.result_widget.value:
10. self.storage_1.append(int(widget.text))
11. else:
12. self.storage_2.append(int(widget.text))
13.
14. def click_operator(self, widget):
15. if not self.operator:
16. self.operator = widget.text
17.
18. def calculate(self, widget):
19. result = None
20. number_1 = int(''.join([str(x) for x in self.storage_1]))
21. number_2 = int(''.join([str(x) for x in self.storage_2]))
22. if self.operator == "+":
23. result = number_1 + number_2
24. elif self.operator == "-":
25. result = number_1 - number_2
26. elif self.operator == "x":
27. result = number_1 * number_2
28. elif self.operator == "÷":
29. result = number_1 / number_2
30. self.show_result(result)
31.
32. def show_result(self, result):
33. if not result:
34. return
35. self.result_widget.value = result
36. self.storage_1 = [*str(result)]
37. self.storage_2 = []
38. self.operator = None
Code 14.21
We created a class that delivers basic support for the fundamental arithmetic operators, which are +, -, x and ÷. Here we have the method calculate (lines 18-30), which receives as an argument the button instance that was pressed in the UI (line 18) by the user. Before checking what kind of action the user has performed, we prepare the numbers for the operation (lines 20-21). Let us focus on this matter: when we start a fresh application, we initialize two arrays (lines 3-4) and an operator helper (line 5). Additionally, in the initializer of the class we pass the instance of the widget that is going to be our container for displaying the result; we created such an input field with the read only attribute (Code 14.20, line 9).
When the user clicks any button with a number, we use the callback method addValue (lines 8-12), which checks whether we are still entering the first number of a fresh calculation (lines 9-10); when an operator has already been chosen, or a previous result is presented in the read only input field, the pressed digits go to the second storage instead. Let us check the following code to see what example button syntax looks like.
1. result_input = toga.TextInput(readonly=True, style=Pack(background_color="#333333", flex=1))
2. storage = CalculatorMod(result_input)
3. button_7 = toga.Button("7", style=Pack(flex=1), on_press=storage.addValue)
Code 14.22
Callbacks
As is easy to notice, we add the callback method (line 3) to the example button 7. It is just a pointer to a method, not the actual method being called; the call happens upon the button being pressed. Another thing: since we pass a pointer to the method, we do not pass any kind of argument, yet as we can see in Code 14.21 (line 8), there is an argument being passed to the method call. This happens as part of the toga framework; the argument is the button instance that the user has pressed.
We also mentioned that we will pass the instance of the result_input read only input field, which we do in lines 1-2.
Another important group of buttons are the special-use buttons: the operators (i.e., +) and the result button (=). Let us check in the example code how these buttons use callbacks to process the triggered actions accordingly.
1. button_div = toga.Button("÷", style=Pack(flex=1), on_press=storage.cli
ck_operator)
2. button_equal = toga.Button("=", style=Pack(flex=1), on_press=storage.c
alculate)
Code 14.23
We added the same method for stretching button size (flex=1), and we added the on_press methods, accordingly click_operator and calculate. We can see from the example Code 14.21 that the click operator method checks if we already have an operator remembered in our class instance; if not, we remember the operator that the user pressed (i.e., +) and keep collecting the numbers that the user presses until the = button is pressed (Code 14.23, line 2). When that action takes place, we run the calculating procedure (Code 14.21, lines 18-30).
To calculate the numbers correctly, we applied the following logic: we add digits to an array one by one as they are pressed (Code 14.21, line 10), as long as the user does not press an operator button and we do not have a previous result already presented in the results input field (line 9). Once the user clicks an operator button, we do the same, albeit we accumulate the pressed digits into the second array (Code 14.21, line 12).
At the moment the user presses the result button (=), we turn array number 1, containing digits, into a number (line 20), and likewise array 2 (line 21). Next, depending on the operator that we remembered being pressed (i.e., +), we perform the desired arithmetic operation on those 2 numbers (line 23) and redirect the calculated result to the result input field (lines 32-38). In the end, when the result is presented, we must remember the freshly calculated result in array 1 (line 36), so we can use it for the following operations that the user may want to perform.
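To make the flow concrete, here is a small sketch that drives CalculatorMod from Code 14.21 with a stub widget instead of real Toga buttons (FakeWidget is our own hypothetical helper, not part of Toga):

class FakeWidget:
    def __init__(self, text=""):
        self.text = text    # what a Toga button exposes via .text
        self.value = ""     # what a Toga TextInput exposes via .value

result_input = FakeWidget()
calc = CalculatorMod(result_input)

# simulate the user pressing: 1, 2, +, 3, =
calc.addValue(FakeWidget("1"))
calc.addValue(FakeWidget("2"))
calc.click_operator(FakeWidget("+"))
calc.addValue(FakeWidget("3"))
calc.calculate(FakeWidget("="))
print(result_input.value)  # 15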
Android
So far, we have been dealing with the iOS calculator application. We can use the briefcase library18 to build an Android app. To start building our application, we could install Android Studio19 and run the compilation manually, which is strongly not recommended; the briefcase framework will do the heavy lifting for us.
We are about to reuse the calculator application that we have been running under iOS. To do so, we have to run the following code in the main folder of our calculator app.
1. $ briefcase package android
Code 14.24
By running this command, we start installing the Java libraries, the Android emulator and the compiler. As part of the above process, you will be asked to accept a software license. When all is ready, we shall see a build output like in the following example.
1. BUILD SUCCESSFUL in 15s
2. 49 actionable tasks: 49 executed
3. Bundling... done
4.
5. [calculator] Packaged dist/calculator-0.0.1.aab
Code 14.25
Now we are ready to run our application in the emulator. Since briefcase takes care of installing the Android emulator when running Code 14.24, the only thing we have to do as developers is run the proceeding code.
When we run the mentioned command, the first thing we are asked about is installing the necessary Java libraries (automatically, as mentioned) and how to run the calculator application, as shown in the following code.
1. Select device:
2.
3. 1) @INFO | Storing crashdata in: /tmp/android-darkman66/emu-crash-34.1.20.db, detection is enabled for process: 24567 (emulator)
4. 2) Create a new Android emulator
5.
6. > 2
Code 14.27
Unsurprisingly, we choose the emulator as the platform that is going to run our calculator application. After preparing the emulator stack, we should see our application executed in the Android emulator, as shown in the following figure.
Figure 14.6: Example of calculator application running under Android Emulator
So far, as you could notice, we have been building our application with zero use of heavy, big IDEs like Android Studio or XCode. We should appreciate how much work was involved in preparing such a flexible and powerful framework for building mobile applications with Python.
Alternative UI
We have been building the UI for the calculator using the recommended Toga framework, albeit we learned in previous subchapters how to install and use the Kivy framework. Let us reuse the examples that we worked with in the Kivy subchapter.
Android
To be able to run Code 14.7 on Android, we need to install the following buildozer20 and Cython21 libraries, so we can compile the Python code for the Java VM and pack it into an Android app.
Code 14.28
After installing the library and dependencies, we can create the build configuration.
1. $ buildozer init
Code 14.29
This command creates the spec file buildozer.spec with all the necessary settings in it to run a Python application in the Android world. We can keep the default configuration with its default content. The most important part is to move the spec file and our application code example to the same folder and name the application file main.py; this is required by buildozer.
Once all is set, we can start installing lots of dependencies and libraries by running the following code.
1. $ buildozer -v android debug
Code 14.30
We have to be patient, since this process is also going to compile a lot of binary files, which takes some time. When all is set, we should see the already known emulator window with our example Code 14.7 running in it.
iOS
With Kivy and iOS, we need to install a different set of tools to be able to convert Python into an iOS application. Let us check the following code to see how to install the libraries.
1. $ pip install Cython kivy-ios
Code 14.31
When all is set, we can start compiling all the necessary libraries22. Let us run the following code.
1. $ toolchain build python3 kivy
Code 14.32
This step will take a long, long time, so we need to be patient and wait until all the binary files are ready. Please do not stop the building process, because you may end up in a situation where you have to start from the beginning. With the freshly built files, we have to install the plyer module with the proceeding code.
1. toolchain pip install plyer
Code 14.33
When all the necessary tools are built and ready, we can start preparing the project for the actual iOS development. The first thing we have to do is to make sure that our GUI application file is called main.py (same as in example 14.30) and that it is placed in the same folder where we built the tools by running Code 14.33.
Before we can start preparing the iOS application, we should install XCode23. Let us run the following code to create the XCode project.
1. $ toolchain create MyApp .
Code 14.34
This command creates the basic XCode stack for us. It creates a new folder called MyApp-ios, inside which we need to put our main.py file.
1. $ open MyApp-ios/myapp.xcodeproj
Code 14.35
Opening the project triggers XCode to load it with its entire stack. When we want to run our application, we can see that the XCode UI has a play icon; please click it, and it will run our application in the iOS emulator.
We can already see that the complexity of how we pack and build the application differs considerably between Toga and Kivy. We do not claim that one is better than the other, yet they give quite different levels of control over the application build process.
The other thing worth noticing when comparing these GUI frameworks is that the final application size, when built for the Android or iOS stack, differs a lot. The Kivy manual online gives some tips on how to make the final application much lighter, which, for sure, is going to help us deliver a more user-friendly app to the Appstore.
Conclusion
In this chapter, we learned fundamental practices of how to build a mobile application and how to design its user interface. Next, we learned how to use Python for building mobile applications using different frameworks. We noticed that they operate differently and require a different set of skills from the developer to build and pack a market-ready application.
We did not touch the subject of how to build and deliver the final application to the corresponding app store for Android and iOS, since these procedures may change over time and be obsolete when you read this book.
In the next chapter, we are going to learn how to use Python to read and generate barcodes and use them to help organize vCards.
1. https://en.wikipedia.org/wiki/Minimum_viable_product
2. https://www.apple.com/ios/ios-17/
3. https://developer.apple.com/xcode/
4. https://en.wikipedia.org/wiki/Swift_(programming_language)
5. https://en.wikipedia.org/wiki/Objective-C
6. https://en.wikipedia.org/wiki/Java_%28programming_language%29
7. https://docs.python.org/3/library/tkinter.html#module-tkinter
8. https://wxwidgets.org
9. https://doc.qt.io
10. https://kivy.org
11. https://toga.readthedocs.io/en/stable/
12. https://kivy.org/doc/stable/api-kivy.uix.label.html
13. https://ironpython.net
14. https://dotnet.microsoft.com/en-us/
15. https://briefcase.readthedocs.io/
16. https://developer.apple.com/support/xcode/
17. https://toga.readthedocs.io/en/latest/reference/style/pack.html
18. https://briefcase.readthedocs.io/en/stable/how-to/publishing/android.html#
19. https://developer.android.com/studio
20. https://pypi.org/project/buildozer/
21. https://cython.org
22. https://github.com/kivy/kivy-ios
23. https://developer.apple.com/xcode/
CHAPTER 15
QR Generator and Reader

Introduction
In the modern world, QR and barcodes have become part of our daily lives. We scan grocery items at the self-service cash register, or we scan advertisements shown as QR codes. We can say those computer-generated codes are an international standard of our times.

Figure 15.1: Example of QR code – you may try to scan it with your phone

Structure
In this chapter, we will be covering the following topics:
Introduction to barcode and QR codes
Building a simple barcode generator
Building simple QR code generator
Embedding vCard into QR codes
Adding images into QR codes
Uploading and processing QR codes

Objectives
In this chapter, we will explore the use of QR codes, which are two-dimensional barcodes that can store various kinds of data, such as text, URLs, phone numbers, or contact information. QR codes are widely used in applications such as product identification, payment systems, marketing campaigns, and access control.
With QR codes we can encode a large amount of information into a small
space, making them easy to scan and read with a smartphone camera or a
dedicated scanner.
QR codes can also be customized with different shapes, colors, logos, or images, making them attractive and distinctive for branding purposes.
That being said, we can also mention that QR codes can be dynamic and updateable, meaning that the data stored in the QR code can change over time without changing the appearance of the code itself.
With all this in mind, we are going to learn how to use Python to generate the mentioned QR codes and how to read them as well. So, let us get started.

Barcode generator
Before we learn how to generate QR codes with Python, let us briefly
explain what barcodes are and how they work. Barcodes are optical labels
that contain information about an object, such as a product, a book, or a
ticket. They consist of patterns of lines, dots, or squares that can be scanned
by a device and decoded into readable data. Barcodes can store various
types of data, such as numbers, text, or URLs.
1. The first thing we need to do is install the Python libraries that are going to help us generate barcodes.
1. $ pip install "python-barcode[images]"
Code 15.1
2. Once we have the packages installed, we need to understand one important thing: with barcodes, there are plenty of standards1 that barcode readers can follow to read them properly. In the following example, we are going to generate a barcode message in the EAN13 standard.
1. import random
2. from barcode import EAN13
3. from barcode.writer import SVGWriter
4.
5. with open("/tmp/somefile.svg", "wb") as f:
6. EAN13(str(random.randint(111122221111, 666677779999)),
7. writer=SVGWriter()).write(f)
Code 15.2
We are importing the module random (line 1) to be able to generate a random number that is long enough (12 digits) to fulfil the EAN13 standard. We open a file handle (line 5), and next we generate the EAN13 barcode (lines 6-7). As a result of running our code, we shall get an SVG file located in /tmp/somefile.svg that is going to look like the following example figure.

Figure 15.2: Example barcode number generated by running example 15.2


3. The reason why we decided to use the SVG format is that we want a graphics file built from vectors, Scalable Vector Graphics (SVG), so it can scale to any resolution we want and will always look crisply sharp. We can see clearly that the EAN13 standard allowed us to generate a barcode SVG file that represents 12 digits. In case we try to generate anything shorter or longer, we are going to get the following error.
1. $ python barcode_example.py
2.
3. Traceback (most recent call last):
4. File "/Users/hubertpiotrowski/work/fun-with-
python/chapter_15/barcode_example.py", line 6, in <module>
5. EAN13(str(random.randint(500000, 999999)), writer=SVGWrite
r()).write(f)
6. File "/Users/hubertpiotrowski/.virtualenvs/fun3/lib/python3.10/site
-packages/barcode/ean.py", line 49, in __init__
7. raise NumberOfDigitsError(
8. barcode.errors.NumberOfDigitsError: EAN must have 12 digits, no
t 6.
Code 15.3
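By the way, the 13th digit of an EAN13 code is a check digit, which python-barcode computes and appends for us. A minimal sketch of the standard checksum calculation (the helper name ean13_check_digit is ours, not part of the library):
1. def ean13_check_digit(twelve_digits: str) -> int:
2.     # weights alternate 1, 3, 1, 3, ... over the 12 payload digits
3.     total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(twelve_digits))
4.     return (10 - total % 10) % 10
5.
6. print(ean13_check_digit("111122221111"))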
Once we have this sorted, let us try to write code that can read the formatted barcode SVG file. Before we can continue, we shall install some more Python packages. Let's continue with the following example.
1. $ pip install opencv-python pyzbar
Code 15.4

Barcode reader
Once we have installed OpenCV2, we can create image reader Python code that is going to help us decipher the barcode SVG file. The other thing that we are installing is pyzbar3, which is going to help with analyzing the image loaded with OpenCV and converting it to Python data. An important thing worth mentioning is that we must have the zbar library4 preinstalled. We can use the following example to install it.
1. # MacOS
2. $ brew install zbar
3. # Linux
4. $ sudo apt-get install libzbar0
Code 15.5
Since we exported the barcode image file as SVG, we have to install the following module to be able to convert it to PNG before we can process its content with pyzbar.
1. $ pip install cairosvg
Code 15.6
1. When the Cairo library5 Python module is installed, we can finally get the code to work. Let us check the following example to see how to read the barcode that we prepared with Code 15.2.
1. import cv2
2. import numpy as np
3. from cairosvg import svg2png
4. from io import BytesIO
5. from PIL import Image
6. from pyzbar.pyzbar import decode, ZBarSymbol
7.
8. OUTPUT_FILE = "/tmp/cv.png"
9.
10.
11. with open("/tmp/somefile.svg", "r") as f:
12. png = svg2png(file_obj=f)
13.
14. pil_img = Image.open(BytesIO(png)).convert("RGBA")
15. pil_img.save("/tmp/tmp_barcode.png")
16.
17. cv_img = cv2.cvtColor(np.array(pil_img), cv2.COLOR_RGBA2BGR
A)
18. cv2.imwrite(OUTPUT_FILE, cv_img)
19.
20. img = cv2.imread(OUTPUT_FILE)
21. detectedBarcodes = decode(img, symbols=[ZBarSymbol.EAN13])
22. barcode = detectedBarcodes[0]
23. # result
24. print(barcode)
25. print(f"Scanned code: {barcode.data}")
Code 15.7
In our example Code 15.7, we are using a few libraries to manipulate the image before we analyze the barcode. First, in lines 11-12, we load the SVG file we generated in example 15.2 and convert it to PNG, since OpenCV cannot load SVG images directly. Once we have the PNG, we convert it into an OpenCV-compatible array (line 17) and save it to the output file (line 18).
2. When the final file is ready, we read it again (line 20) and decode its content using the pyzbar decode method (line 21). Since we have a single barcode in the image, we can use the first element (line 22) from the decoded barcodes array. In the end, we print the decoded object (line 24) and the final value that we wanted to read automatically from the barcode (line 25).
So far, we have been reading a very easy example, where the barcode is just a few black stripes on a white background. In the following modifications, we are going to read a barcode from a photo of a real product.
Before we can continue, we shall install one more additional module that is going to help us manipulate images through a wrapper for the OpenCV library. Let's check the following code to install imutils6.
Code 15.8
3. When the module is installed, let's create a template for our main script. First, we are going to load the photo with OpenCV and read the image into a variable, as shown in the following example.
1. import numpy as np
2. import click
3. import imutils
4. import cv2
5. from pyzbar.pyzbar import decode, ZBarSymbol
6.
7. @click.command()
8. @click.option("--image-file", type=str, help="Full path to image", required=True)
9. def main(image_file):
10.     img = cv2.imread(image_file)
11.
12.
13. if __name__ == '__main__':
14.     main()
Code 15.8
4. We assume that we take a photo of a real product that has a barcode on it. In our case, we took a picture of a bottle of water, as shown in the photo below.

Figure 15.3: Example photo of real product with barcode on it


5. Once we have that photo in place, you can notice that in our example the product has a lot of additional text on it, with the barcode almost in the center. That makes it clear and easy to read by our barcode reader code. Sometimes, the product has a slightly blurry barcode, or it is not easy to read because it sits close to other text on the label, and so on. To address that concern, we are going to apply a few special filters to our loaded photo so we can extract the portion containing the barcode that our program can easily read. Let's check the following example.
First, after loading the image, we are going to convert it to grayscale, as in the following code.
1. gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
2. gradient_X = cv2.Sobel(gray, ddepth=cv2.CV_32F, dx=1, dy=0, ksize=
-1)
3. gradient_Y = cv2.Sobel(gray, ddepth=cv2.CV_32F, dx=0, dy=1, ksize=
-1)
Code 15.9
In line 1, we convert the loaded image (Code 15.8, line 10) to grayscale, and next (lines 2-3) we use the Scharr gradient filter7 to detect the horizontal and vertical gradients of the grayscale image.
6. Next, having the mentioned gradients, we use them in the following code.
1. gradient = cv2.subtract(gradient_X, gradient_Y)
2. gradient = cv2.convertScaleAbs(gradient)
3.
4. blurred = cv2.blur(gradient, (18, 18))
Code 15.10
In line 1, we subtract the Scharr gradient for Y from X, so as a result we get the regions of the photo that have high horizontal gradients and low vertical gradients. Next, we add a little bit of blurring (line 4), since we are only interested in the white area where the barcode is.
7. After running this code, our image is going to look like the following figure.
Figure 15.4: Original photo with applied grayscale and filters
8. Next, we need to manipulate the colors (black and white) to highlight the area with the biggest density of white, so we can extract the main field with the barcode. Let's check the following code to see how we can do so.
1. (_, thresh) = cv2.threshold(blurred, 225, 255, cv2.THRESH_BINARY)
2. main_area = cv2.getStructuringElement(cv2.MORPH_RECT, (21, 7))
3. new_area = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, main_area)
4.
5. new_area = cv2.erode(new_area, None, iterations=6)
6. new_area = cv2.dilate(new_area, None, iterations=5)
We first binarize the blurred image (line 1) so that the bright regions become pure white (the exact threshold value can be tuned for your photo). We then create (line 2) a rectangular element that is wide (21x7), so we can close the gaps between the bars of the barcode. Next, we apply (lines 3-6) morphological8 processing to close more of those stripes in the barcode; as a result, we get a more or less big white box where the barcode is. Let's check the following figure to see how this is going to look.

Figure 15.5: Example of the source photo with applied morphological operations
9. Once we have the mentioned white box, let's apply the following code, which is going to help us crop the image and keep only the part of the photo that we want to use for barcode reading.
1. contours = cv2.findContours(new_area.copy(), cv2.RETR_EXTERNA
L, cv2.CHAIN_APPROX_SIMPLE)
2. contours = imutils.grab_contours(contours)
3.
4. contours_min = sorted(contours, key=cv2.contourArea, reverse=True)
[0]
5. (X, Y, W, H) = cv2.boundingRect(contours_min)
6. rect = cv2.minAreaRect(contours_min)
7. box = cv2.cv.BoxPoints(rect) if imutils.is_cv2() else cv2.boxPoints(rect
)
8. box = np.int0(box)
9.
10. cv2.drawContours(img, [box], -1, (0, 255, 0), 3)
11. cropped_image = img[Y:Y + H, X:X + W].copy()
12. cv2.imshow("final cropped", cropped_image)
13. cv2.waitKey(0)
14.
15. detectedBarcodes = decode(cropped_image, symbols=
[ZBarSymbol.EAN13])
16. barcode = detectedBarcodes[0]
17. # final result what we found
18. print(barcode)
19. print(f"Scanned code: {barcode.data}")
Code 15.12
In the rest of our main function, we use the prepared white area (from example 15.11) and find its contours9 (lines 1-2). Then we take the largest contour found and use its bounding rectangle (lines 4-5).
10. Once we have the bounding area, we draw a green box (line 10, shown in the following figure) around the part of the image we are going to use to read the barcode value. After having the green box and bounding area, we crop the main image (line 11) and show the final result (lines 12-13). Finally, in lines 15-19, we read the barcode value from the cropped image.
Figure 15.6: Cropped image containing barcode to decode

QR code generator
So far, we have learned how to read barcodes and optimize the barcode reader. In this subchapter, we are going to learn how to build something more complex: Quick Response (QR) codes10.
Let us start with a simple example, but before we can do so, we have to install the following Python modules.
1. $ pip install pyqrcode pypng
Code 15.13
When the modules are installed, we can create example code where we generate a QR code containing a sample URL that we ask the user to open after scanning.
1. import pyqrcode
2. url = pyqrcode.create('https://www.python.org')
3. url.png('/tmp/qr.png', scale=6, module_color=[0, 0, 0, 128],
4.         background=[0xff, 0xff, 0xcc])
Code 15.14
We can see that we create a QR code object that should point to the https://www.python.org website after scanning (line 2). In the next part (lines 3-4), we save the QR code output to an external file – in this case, a PNG file. We can notice that we specify the image scale as well as the background of the generated QR code PNG file. Once the file is ready, it is going to look like the following figure.

Figure 15.7: Example QR code linking to Python main website


You can open the camera on your smartphone and try to scan this QR code – it should lead you to the Python website.
By default, the PyQRCode module in Code 15.14, line 2 will use the lowest QR code version that fits the data – here version 1, which only allows us to store a small number of characters – up to around 50.
Let's try to encode some structured data in a QR code: instead of a URL, we will encode details of a public WiFi hotspot, so the user can join it automatically after scanning.
The standard for storing this sort of data is called MeCard11 and it follows this syntax.
1. WIFI:S:<SSID>;T:<WEP|WPA|nopass>;P:<PASSWORD>;H:
<true|false|blank>;;
Code 15.15
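Note that special characters in the SSID or password (such as ';', ',', ':' or '\') should be escaped with a backslash in this format. A small hypothetical helper (the function name wifi_payload is ours) to build the payload safely could look like this:
1. def wifi_payload(ssid, password, auth="WPA", hidden=False):
2.     def esc(value):
3.         # escape the characters that are special in this syntax
4.         for ch in "\\;,:":
5.             value = value.replace(ch, "\\" + ch)
6.         return value
7.     hidden_str = "true" if hidden else "false"
8.     return f"WIFI:S:{esc(ssid)};T:{auth};P:{esc(password)};H:{hidden_str};;"
9.
10. print(wifi_payload("public-wifi-free", "somepassword123"))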
That format is readily understood by iPhone and Android based smartphones, so knowing this, we can write a simple Python snippet as shown below.
1. from pyqrcode import QRCode
2.
3. data = "WIFI:S:public-wifi-free;T:WPA;P:somepassword123;H:false;;"
4. q = QRCode(data)
5. q.png('/tmp/qr_wifi.png', scale=6)
Code 15.16
We can see that this time we use a slightly different method for generating the QR code (line 4), since it allows us to generate the code and keep it in memory. Next, we can save it to disk (line 5) or manipulate it, which we will do in the following part of this chapter. Let's check how our QR code is going to look compared to the one from example 15.14.

Figure 15.8: Example of a QR code in a higher version, able to store more characters
We can see that running our Code 15.16 is going to generate a QR code that has more pixels than the one shown in Figure 15.7. The reason is that we are trying to store more characters in our QR code, which is detected by the QRCode class (line 4), and the version number is bumped up accordingly.
Since we do not specify the error tolerance level, the Python pyqrcode module defaults to a high tolerance for errors. That means we can have up to 30% of the pixels in the code itself missing, blurry, damaged, not readable, and so on. Let's try to decrease the error level to the lowest offered level – that is 7%. Let's check the following example to see how to achieve this.
1. from pyqrcode import QRCode
2.
3. data = "WIFI:S:public-wifi-free;T:WPA;P:somepassword123;H:false;;"
4. q = QRCode(data, error='L')
5. q.png('/tmp/qr_wifi.png', scale=6)
Code 15.17
Now, the result of running our code is going to look like the following figure.
Figure 15.9: Example of QR code at the lowest error level
We can clearly see that the density of "pixels" in the final QR code is much lower compared to the result of running Code 15.16. Even though the information stored is the same, we now have an error tolerance of only up to 7%, which basically means the code must be clear and of very good quality for scanning.
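As a side experiment, pyqrcode exposes the chosen version on the QRCode object, so we can quickly compare how each error correction level (roughly L = 7%, M = 15%, Q = 25%, H = 30%) affects the size of the code for the same payload; a minimal sketch:
1. from pyqrcode import QRCode
2.
3. data = "WIFI:S:public-wifi-free;T:WPA;P:somepassword123;H:false;;"
4. for level in ("L", "M", "Q", "H"):
5.     q = QRCode(data, error=level)
6.     # a higher tolerance generally forces a higher version, i.e. a denser code
7.     print(f"error={level} -> version {q.version}")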
In the next example, we will see how to show a logo in the middle of a QR code that we generate. This should not have any side effect on the generated QR code, meaning it is still readable by smartphones, although it will certainly have an impact on the appearance of the code itself. Let's try to put the Python logo in the middle of the QR code that we generated with Code 15.16. Let us check the following code to see how we can achieve this.
1. import pyqrcode
2. from PIL import Image
3.
4. data = "WIFI:S:public-wifi-free;T:WPA;P:somepassword123;H:false;;"
5. url = pyqrcode.QRCode(data, error='H')
6. url.png('test.png',scale=10)
7. im = Image.open('test.png')
8. im = im.convert("RGBA")
9.
10. logo = Image.open('python-logo.png')
11. box = (145, 145, 235, 235)
12. im.crop(box)
13. region = logo
14. region = region.resize((box[2] - box[0], box[3] - box[1]))
15. x = int(im.size[0]/2 ) - int(region.size[0]/2)
16. y = int(im.size[1]/2) - int(region.size[1]/2)
17. im.paste(region, (x, y))
18. im.show()
Code 15.18
In Code 15.18, we use the same information that we want to encode into the QR code (lines 1-6). Once the PNG file is saved (line 6), we load that QR code image file (lines 7-8) back into memory. Next, we load the logo file (line 10), and we create a box that represents the maximum size of the logo inside our QR code (lines 11-12).
We take the logo as the region to paste (line 13) and resize it to the box size (line 14). The last thing we have to do before pasting the logo into the QR code file is to calculate (lines 15-16) where to paste it, so it lands right in the center of the final image (line 17). When all is ready, we show (line 18) our generated QR code.
In the following image, we can see the final result of running Code 15.18.

Figure 15.10: QR code with image logo inside


We said that we are pasting over a chunk of the original QR code that we generated (Code 15.18, line 6), so why is the QR code still readable, even with an image inside? The whole "trick" relies on the fact we mentioned before – the QR code can have a high level of error correction (Code 15.18, line 5). We use this fact when pasting the logo in the center of the QR code – the QR code algorithm is going to treat the logo as missing "pixels" and correct the value automatically. As long as the logo is not bigger than 30% of the whole size of the QR code, this is going to work just perfectly – of course, assuming that the rest of the QR code is not blurry or damaged.
In the next example, we are going to modify Code 15.18 in such a way that we encode a vCard12 into the QR code. Example vCard information that we can encode is shown in the following code.
1. BEGIN:VCARD
2. VERSION:4.0
3. FN:John Smith
4. N:John;Smith;;;ing. jr,M.Sc.
5. BDAY:--0102
6. GENDER:M
7. EMAIL;TYPE=work:[email protected]
8. END:VCARD
Code 15.19
Now, let us modify the mentioned code so it can encode the vCard details and keep our logo in the center. Let's take a look at the following code.
1. import pyqrcode
2. from PIL import Image
3.
4. vcard_data = """BEGIN:VCARD
5. VERSION:4.0
6. FN:John Smith
7. N:John;Smith;;;ing. jr,M.Sc.
8. BDAY:--0102
9. GENDER:M
10. EMAIL;TYPE=work:[email protected]
11. END:VCARD"""
12.
13. def generate_code(data):
14. url = pyqrcode.QRCode(data, error='H')
15. url.png('test.png',scale=10)
16. im = Image.open('test.png')
17. im = im.convert("RGBA")
18.
19. logo = Image.open('python-logo.png')
20. box = (145, 145, 235, 235)
21. im.crop(box)
22. region = logo
23. region = region.resize((box[2] - box[0], box[3] - box[1]))
24. x = int(im.size[0]/2 ) - int(region.size[0]/2)
25. y = int(im.size[1]/2) - int(region.size[1]/2)
26. im.paste(region, (x, y))
27. im.show()
28.
29. if __name__ == '__main__':
30. generate_code(vcard_data)
Code 15.20
We cleaned up the code – it is now wrapped in a function (lines 13-27), so we can pass the vCard data that is going to be encoded into the QR code as an argument. That approach is going to work, but the vCard data is hardcoded, so the question is: how can we supply any contact data that we want to be included in the QR code? Let's install a module that is going to help us create a configuration file that we are about to use later.
1. $ pip install pyyaml
Code 15.21
Once the YAML13 module is installed, we can create a configuration file that represents contact data stored in a YAML file – contact.yaml. Let's check the following example to see how such a file is going to be organized.
1. contact:
2. - address:
3. home:
4. city: amazing city
5. code: 123456
6. country: best country
7. street: seasame street
8. birthday: 1978-09-15
9. email:
10. home: [email protected]
11. work: [email protected]
12. gender: Male
13. name: John
14. surname: Smith
15. org:
16. role: CEO
17. title: upper main boss
18. name: best company ever
Code 15.22
We prepared a YAML file that contains example contact details that we can embed into the QR code. Let us check the following code to see how we can read this config file (Code 15.22) and create a vCard from it.
1. import yaml
2. from datetime import datetime
3.
4. def read_config():
5. with open('contact.yaml', 'r') as file:
6. return yaml.safe_load(file)
7.
8.
9. def create_vcard():
10. data = read_config()['contact'][0]
11. now_with_zulu = datetime.utcnow().isoformat()[:-3]+'Z'
12.
13. vcard_data = f"""BEGIN:VCARD
14. VERSION:4.0
15. FN;CHARSET=UTF-8:{data['name']} {data['surname']}
16. N;CHARSET=UTF-8:{data['surname']};{data['name']};;;
17. """
18. if data.get('gender', '').lower() == 'male':
19. vcard_data += "GENDER:M\n\r"
20. if data.get('gender', '').lower() == 'female':
21. vcard_data += "GENDER:F\n\r"
22. birth_date = data['birthday']
23.
24. vcard_data += f"BDAY:{birth_date.strftime('%Y%m%d')}\n\r"
25. home_address = data['address']['home']
26. vcard_data += f"""ADR;CHARSET=UTF-8;TYPE=HOME:;;
{home_address['street']};{home_address['city']};;
{home_address['code']};{home_address['country']}
27. TITLE;CHARSET=UTF-8:{data['org']['title']}
28. ROLE;CHARSET=UTF-8:{data['org']['role']}
29. ORG;CHARSET=UTF-8:{data['org']['name']}
30. REV:{now_with_zulu}
31. END:VCARD"""
32. return vcard_data
Code 15.23
We can see from Code 15.23 that we read the YAML config with the safe method (line 6). This way, we load data from the YAML file while dropping any risky content – such as facilities to execute arbitrary Python code that could be hidden in the YAML file.
Once we have the YAML data, we create the vCard string (lines 9-32).
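To see why this matters, here is a small hedged demonstration (the payload string is made up): yaml.safe_load refuses Python-specific tags that an unsafe loader could otherwise use to run arbitrary code.
1. import yaml
2.
3. # a tag like this would let an unsafe loader call os.system
4. risky = "!!python/object/apply:os.system ['echo pwned']"
5. try:
6.     yaml.safe_load(risky)
7. except yaml.YAMLError as exc:
8.     print(f"Rejected risky YAML: {exc}")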
To be able to create the vCard data, we have to add the above code to the main section of Code 15.20, as in the following example.
1. if __name__ == "__main__":
2. vcard = create_vcard()
3. print(vcard)
4. generate_code(vcard)
Code 15.24
When the code is executed, we shall receive QR code that is shown in the
following figure.
Figure 15.11: Example of contact data with more details included in QR code
We can see that when we include more contact details in the QR code, it gets more complicated, with a bigger density of pixels.
Since we prepared the contact details in such a way that the YAML file (Code 15.22) is a list of dictionaries, and we only take the first dictionary from it (Code 15.23, line 10), we can improve Code 15.23 so that we can generate multiple contact QR codes based on a configuration file that contains more contacts.
Let us look at the following example to see how we can achieve this.
1. import pyqrcode
2. import yaml
3. from PIL import Image
4. from datetime import datetime
5.
6. def read_config():
7. with open('contact.yaml', 'r') as file:
8. return yaml.safe_load(file)
9.
10. def create_vcard(data):
11. now_with_zulu = datetime.utcnow().isoformat()[:-3]+'Z'
12.
13. vcard_data = f"""BEGIN:VCARD
14. VERSION:4.0
15. FN;CHARSET=UTF-8:{data['name']} {data['surname']}
16. N;CHARSET=UTF-8:{data['surname']};{data['name']};;;
17. """
18. if data.get('gender', '').lower() == 'male':
19. vcard_data += "GENDER:M\n\r"
20. if data.get('gender', '').lower() == 'female':
21. vcard_data += "GENDER:F\n\r"
22.
23. birth_date = data['birthday']
24. vcard_data += f"BDAY:{birth_date.strftime('%Y%m%d')}\n\r"
25. home_address = data['address']['home']
26. vcard_data += f"""ADR;CHARSET=UTF-8;TYPE=HOME:;;
{home_address['street']};{home_address['city']};;
{home_address['code']};{home_address['country']}
27. TITLE;CHARSET=UTF-8:{data['org']['title']}
28. ROLE;CHARSET=UTF-8:{data['org']['role']}
29. ORG;CHARSET=UTF-8:{data['org']['name']}
30. REV:{now_with_zulu}
31. END:VCARD"""
32. return vcard_data
33.
34. def generate_code(data, name, surname):
35. url = pyqrcode.QRCode(data, error="H")
36. url.png("test.png", scale=10)
37. im = Image.open("test.png")
38. im = im.convert("RGBA")
39.
40. logo = Image.open("python-logo.png")
41. box = (145, 145, 235, 235)
42. im.crop(box)
43. region = logo
44. region = region.resize((box[2] - box[0], box[3] - box[1]))
45. x = int(im.size[0] / 2) - int(region.size[0] / 2)
46. y = int(im.size[1] / 2) - int(region.size[1] / 2)
47. im.paste(region, (x, y))
48. final_qr_code = f"/tmp/{name}_{surname}.png"
49. print(f"Saving QR code: {final_qr_code}")
50. im.save(final_qr_code, 'PNG')
51.
52. if __name__ == "__main__":
53. vcards = read_config()['contact']
54. for item in vcards:
55. print(f"Creating vCards for {item['name']} {item['surname']}")
56. vcard = create_vcard(item)
57. generate_code(vcard, item['name'], item['surname'])
Code 15.25
We can see that Code 15.25 is very similar to Code 15.23, but in Code 15.25 we change the way we generate the final QR code (lines 48-50). Now, instead of showing the QR code in an image window, we save it as a PNG file in the /tmp directory (line 50).
This change allows us to read many contacts from the configuration file (lines 53-54) and then generate vCards one by one (line 56) and save a QR code for each of them (line 57).

QR code reader
So far, we have learned how to generate QR codes. Now we are going to see how we can read such a code using Python. Let's check the following example to see how to read the QR code from example 15.25.
1. from PIL import Image
2. from pyzbar.pyzbar import decode
3.
4. result = decode(Image.open('/tmp/John_Smith.png'))
5. print('decoding result:')
6. print(result[0].data)
Code 15.26
We use two main modules that we have already been using when learning about barcodes and how to process them (lines 1-2). We load the local QR code image produced by running Code 15.25, which is saved under /tmp/John_Smith.png. Once the file is loaded, we decode it (line 4) and print the decoding result (line 6), which of course is vCard data. Let us check the result of running such code.
1. $ python read_qr_code.py
2.
3. decoding result:
4. b'BEGIN:VCARD\n VERSION:4.0\n FN;CHARSET=UTF-
8:John Smith\n
N;CHARSET=UTF-
8:Smith;John;;;\n GENDER:M\n\rBDAY:19780915\n\r
ADR;CHARSET=UTF-
8;TYPE=HOME:;;seasame street;amazing city;;123456;
best country\n TITLE;CHARSET=UTF-8:upper main boss\n
ROLE;CHARSET=UTF-8:CEO\n ORG;CHARSET=UTF-
8:best company
ever\n REV:2024-07-19T20:43:59.177Z\n END:VCARD'
Code 15.27
Python managed to decipher the QR code even though it had some errors in it – remember, we included the Python logo in the center of the QR code, which damages it (but stays below the 30% error rate).
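If we wanted to turn the decoded bytes back into something structured, a rough sketch (not a complete vCard parser, just a simple line splitter reusing result from Code 15.26) could look like this:
1. raw = result[0].data.decode("utf-8")
2. contact = {}
3. for line in raw.splitlines():
4.     line = line.strip()
5.     if ":" in line and not line.startswith(("BEGIN", "END", "VERSION")):
6.         key, _, value = line.partition(":")
7.         # drop parameters such as ;CHARSET=UTF-8 from the key
8.         contact[key.split(";")[0]] = value
9. print(contact.get("FN"))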

Conclusion
In this chapter, we learned how to analyze barcodes and generate them. We came to understand how Python can process images and extract barcodes from them. Next, we dove deeper into the topic of QR codes and built beautiful QR codes with a logo in them, which may be used as a very attractive way of sharing contact data.
In the next chapter, we are going to build an app to keep track of digital currencies, which can be a very useful skill, especially in times when cryptocurrencies are so popular.
1. https://en.wikipedia.org/wiki/Barcode#Barcode_verifier_standards
2. https://pypi.org/project/opencv-python/ and https://pyimagesearch.com/2014/11/24/detecting-barcodes-images-python-opencv/
3. https://pypi.org/project/pyzbar/
4. https://github.com/Polyconseil/zbarlight/
5. https://cairographics.org
6. https://github.com/PyImageSearch/imutils
7. https://docs.opencv.org/4.x/d5/d0f/tutorial_py_gradients.html
8. https://docs.opencv.org/4.x/d9/d61/tutorial_py_morphological_ops.html
9. https://docs.opencv.org/3.4/d4/d73/tutorial_py_contours_begin.html
10. https://en.wikipedia.org/wiki/QR_code
11. https://en.wikipedia.org/wiki/MeCard_(QR_code)
12. https://en.wikipedia.org/wiki/VCard
13. https://yaml.org
CHAPTER 16
App to Keep Track of Digital
Currencies

Introduction
Before we dive into the technical details of how to create your own crypto
trading platform, let's briefly review what cryptocurrencies are and why they
are so popular among traders and investors. Cryptocurrencies are digital
assets that use cryptography to secure their transactions and control their
creation. Unlike fiat currencies1, which are issued and backed by central
authorities, cryptocurrencies are decentralized and operate on peer-to-peer
networks. This means that no one can manipulate or censor their
transactions, and users have full control over their own funds.
Cryptocurrencies offer several advantages over traditional payment systems,
such as lower fees, faster processing, global accessibility, transparency,
privacy, and security. They also enable new business models and
innovations, such as smart contracts, decentralized applications, and
tokenization.
You will learn how to build a data stream analyzer that collects and
processes real-time data from various sources. You will also learn how to
design and implement a trading engine that executes orders according to
your custom strategies and rules. Finally, you will learn how to develop a
user interface that displays the data and the results of your trading activities
and allows you to adjust your settings and preferences. By the end of this
chapter, you will have a fully functional crypto trading platform that you can use for your own purposes or share with others.

Structure
In this chapter, we will discuss the following topics:
Building a data stream analyzer
Storage for data results – a time-driven DB
Analysis tool for trends
Learning how to draw graphs with Python
Building alarm logic

Objectives
After reading this chapter, you should know how to build your own crypto market trading platform client and be able to manage your crypto assets, using Python to build a simple yet powerful money exchange application.

Data stream
Before we can analyze any kind of data, we have to learn how to fetch data from an external web resource. For the following example, we are going to use the crypto.com website, which delivers a crypto exchange market with live updates. Let's check the following code to see how we can fetch example data values for bitcoin.
Before we can make any calls, we have to install the following libraries, which are going to be essential in all our examples.
1. $ pip install requests click
Code 16.1
1. Once the Python modules are installed, we can wrap up a simple example that fetches a crypto coin currency exchange rate. Let's investigate the following example.
1. import requests
2. from pprint import pprint
3.
4. url = f"https://price-api.crypto.com/price/v1/token-price/bitcoin"
5. result = requests.get(url, headers={"User-Agent": "Firefox"})
6. pprint(result.json())
Code 16.2
In the example Code 16.2, we call the crypto.com website while introducing our call as the "Firefox" browser (line 5) in the request headers. This way, the crypto website will not think that we are making the call from a command line program, but from a real browser. Next, we print the received response (line 6), and since we know it is JSON, we call the json method on the response to convert it to a Python dictionary.
Let's take a look at a chunk of the pretty long example response.
1. {'btc_marketcap': 19735953.0,
2. 'btc_price': 1,
3. 'btc_price_change_24h': 0.0,
4. 'btc_volume_24h': 1870973.8893898805,
5. 'circulating_supply': 19735953.0,
6. 'defi_tradable': True,
7. 'exchange_tradable': True,
8. 'max_supply': 21000000.0,
9. 'price_update_time': 1722868080,
10. 'prices': [68665.03455943712,
11. 68078.98719247522,
12. 66877.86280787496,
13. ...
14. 53335.02102136236],
15. 'rank': 1,
16. 'slug': 'bitcoin',
17. 'token_dominance_rate': None,
18. 'token_id': 1,
19. 'usd_marketcap': 1055828079115.311,
20. 'usd_price': 53497.69930620077,
21. 'usd_price_change_24h': -0.116307,
22. 'usd_price_change_24h_abs': 0.116307,
23. 'usd_volume_24h': 99865107623.72562 }
Code 16.3
We can clearly see that the response contains not only the current exchange value (key usd_price) but historical data as well. We are going to use this fact in a further part of this chapter.
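For instance, to pick out just the fields we rely on later (key names taken from the sample response above), we could extend Code 16.2 like this:
1. data = result.json()
2. print(f"Current USD price: {data['usd_price']}")
3. print(f"Historical price points: {len(data['prices'])}")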
2. Let's refactor example 16.2 so we can support more coins, and more dynamically.
1. import click
2. import requests
3. from pprint import pprint
4.
5. SUPPORTED_COINS = {"eth": "ethereum", "btc": "bitcoin"}
6.
7.
8. def fetch_exchange(coin_str):
9. url = f"https://price-api.crypto.com/price/v1/token-price/
{coin_str}"
10. print(f'Calling {url}')
11. result = requests.get(url, headers={"User-Agent": "Firefox"})
12. pprint(result.json())
13.
14.
15. @click.command()
16. @click.option("--
coin", type=click.Choice(SUPPORTED_COINS.keys())
, help="Coin symbol to fetch details about", required=True)
17. def main(coin):
18. if coin not in SUPPORTED_COINS:
19. raise Exception("Invalid coin")
20. fetch_exchange(SUPPORTED_COINS[coin])
21.
22. if __name__ == "__main__":
23. main()
Code 16.4
3. We updated the code to make it cleaner and easier to use. By using the click library (line 16), we limited the supported coins to only two, so we can call our script as in the following example.
1. $ python updater.py --coin eth
Code 16.5

Storing stream
To be able to store data that is time driven, we can't simply use an SQL-optimized database. For storing streams of date-based values, we need to use a time series database engine2. Of course, there are plenty of choices, including the PostgreSQL timeseries plugin3. In our case, we are going to use the open-source InfluxDB4. On its website, you can find a full description of how to install InfluxDB on your local machine. In the case of macOS, a typical installation looks as follows.
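At the time of writing, a typical Homebrew-based installation can look like this (please check the InfluxDB documentation for the exact, current commands):
1. $ brew install influxdb
2. $ brew services start influxdb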
1. Once the service is installed and running, we can access it like a regular website on our local system – it is shown in the following figure.
Figure 16.1: Welcome screen of the very first run of InfluxDB


Once you get access, you have to create the main account – please use the following parameters – username: admin, password: password1, organization name: fun, bucket name: coins. This is going to make it much easier to follow our examples if we stay consistent with the naming convention. Please notice that we have to store the API access key that is going to be created by the system once the new DB account is ready – we are going to use it later with Python.
2. Now, we need to install the Python driver for InfluxDB as follows.
1. $ pip install influxdb-client python-dotenv
Code 16.6
3. Once the module is installed, let's modify our example 16.4 in such a way that we can store all the results in the time series storage that we just configured. Before we do, we have to create a configuration file .env with content as follows.
1. API_KEY="<your API key>"
2. org=fun
3. bucket=coins
Code 16.7
When the configuration file is ready, it is time to refactor our example Code 16.4 and use the newly created .env configuration file.
1. import click
2. import time
3. import requests
4. from influxdb_client import InfluxDBClient, Point, WritePrecision
5. from influxdb_client.client.write_api import SYNCHRONOUS
6. from dotenv import dotenv_values
7.
8. SUPPORTED_COINS = {"eth": "ethereum", "btc": "bitcoin"}
9.
10. class CoinApp:
11.
12. def __init__(self, coin_str):
13. self._config = dotenv_values(".env")
14. self._coin_str = coin_str
15. self.connect()
16.
17. def fetch_exchange(self):
18. url = f"https://price-api.crypto.com/price/v1/token-
price/{self._coin_str}"
19. print(f"Calling {url}")
20. result = requests.get(url, headers={"User-Agent": "Firefox"})
21. data = result.json()
22. return data
23.
24. def seed_data(self):
25. data = self.fetch_exchange()
26. no_of_items = len(data['prices'])
27. for i, value in enumerate(data['prices']):
28. point = (
29. Point("price")
30. .tag("coin", self._coin_str)
31. .field("value", value)
32. )
33. self._write_api.write(bucket=self._config['bucket'], org=self._co
nfig['org'], record=point)
34. print(f"Write item {i+1}/{no_of_items}")
35. time.sleep(1)
36.
37. def connect(self):
38. url = "http://localhost:8086"
39. client = InfluxDBClient(url=url, token=self._config['API_KEY'], o
rg=self._config['org'])
40. self._write_api = client.write_api(write_options=SYNCHRONOU
S)
41.
42. @click.command()
43. @click.option("--
seed", help="Seed example data", required=False, default=False, is_flag
=True)
44. @click.option(
45. "--
coin", type=click.Choice(SUPPORTED_COINS.keys()), help="Coin sy
mbol to fetch details about", required=True
46. )
47. def main(seed, coin):
48. if coin not in SUPPORTED_COINS:
49. raise Exception("Invalid coin")
50. c = CoinApp(SUPPORTED_COINS[coin])
51. if seed:
52. c.seed_data()
53. else:
54. c.update_db()
55.
56.
57. if __name__ == "__main__":
58. main()
Code 16.8
We converted the simple synchronous function into a class (line 10). In its constructor (lines 12-15), we read the configuration and then initialize the InfluxDB API writer as a private attribute (lines 37-40). Once we have the database pointer, we call the crypto coin currency exchange API (lines 17-22) and use the received data in our seed method (line 25).
A time series database stores data in such a way that each time we push data into it, the pushed value is marked on a time scale, instead of us explicitly specifying a datetime pointer.
4. Knowing this, and since we receive more than 100 records (the last 24h of currency exchange data), in line 27 we iterate over every single value and push it to InfluxDB (lines 28-34). Next, we sleep for a second (line 35) and go to the next item.
This way, when we want to seed data into our database, we can simulate data that has been coming from a 3rd party system with some time breaks, and we end up with historical data stored in our DB.
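As a side note, an alternative to sleeping between writes is to attach an explicit timestamp to each point with the client's Point.time() call. A minimal hedged sketch (assuming prices holds the data['prices'] list and write_api is the writer created in Code 16.8):
1. from datetime import datetime, timedelta, timezone
2. from influxdb_client import Point, WritePrecision
3.
4. # spread the historical values one minute apart, ending at "now"
5. start = datetime.now(timezone.utc) - timedelta(minutes=len(prices))
6. for i, value in enumerate(prices):
7.     point = (
8.         Point("price")
9.         .tag("coin", "ethereum")
10.         .field("value", value)
11.         .time(start + timedelta(minutes=i), WritePrecision.S)
12.     )
13.     write_api.write(bucket="coins", org="fun", record=point)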
In the following example, we can see how we run Code 16.8 and seed data into our DB storage.
1. $ python updater_with_storage.py --coin eth --seed
Code 16.9
Since we introduced the seed parameter in script 16.8 (lines 51-54), we can also skip it – then we save real data with the real timestamp.
5. As shown in Code 16.8, line 54, we still have to provide the body of the method update_db. Let's check the following code to see how we can approach this.
1. def update_db(self):
2. data = self.fetch_exchange()
3. point = (
4. Point("price")
5. .tag("coin", self._coin_str)
6. .field("value", data['usd_price'])
7. )
8. self._write_api.write(bucket=self._config['bucket'], org=self._conf
ig['org'], record=point)
Code 16.10
We can see that the method is quite similar to the one we already introduced in Code 16.8 (lines 28-34). In this case, we can move that part into a shared method and reuse it in both methods, so they are going to look like the following example:
1. def seed_data(self):
2.     data = self.fetch_exchange()
3.     no_of_items = len(data['prices'])
4.     for i, value in enumerate(data['prices']):
5.         self._save_data(value)
6.         print(f"Write item {i+1}/{no_of_items}")
7.         time.sleep(1)
8.
9. def update_db(self):
10.     data = self.fetch_exchange()
11.     value = data['usd_price']
12.     self._save_data(value)
13.
14. def _save_data(self, value):
15.     point = (
16.         Point("price")
17.         .tag("coin", self._coin_str)
18.         .field("value", value)
19.     )
20.     self._write_api.write(bucket=self._config['bucket'], org=self._config['org'], record=point)
Code 16.11
This way, the code looks much cleaner, and we deliver code that is not repeated but reused – something every good developer shall always remember.

Reading data
Once the data is saved, we have to get it from the DB storage back into Python. To be able to do so, we have to create an InfluxDB connection as a read pointer. Let's check the following example to see how we can accomplish this.
1. from pprint import pprint
2. from influxdb_client import InfluxDBClient
3. from dotenv import dotenv_values
4.
5.
6. url = "http://localhost:8086"
7. config = dotenv_values(".env")
8. client = InfluxDBClient(url=url, token=config['API_KEY'], org=config[
'org'])
9. query_api = client.query_api()
10.
11. query = """from(bucket: "coins")
12. |> range(start: -100m)
13. |> filter(fn: (r) => r._measurement == "price")"""
14.
15. result = query_api.query(org=config['org'], query=query)
16.
17. results = []
18. for table in result:
19. for record in table.records:
20. results.append((record.get_field(), record.get_value()))
21.
22. pprint(results)
Code 16.12
We use the same dotenv module for reading the configuration from the .env file (line 7). Next, we establish the connection to InfluxDB (lines 8-9). When all is ready, we prepare the database query (lines 11-13).
It is easy to notice that the InfluxDB query language is much different from the SQL query language used in relational databases. In our example, we make a simple query where we say: from the bucket coins (line 11), we want to get all records that were inserted during the last 100 minutes (line 12) and where the measurement name is price (line 13).
In the following lines 17-20, we fetch the records from the database and convert them to a result that is going to look like the following example.
1. [...
2. ('value', 2416.515533583986),
3. ('value', 2492.635832124),
4. ('value', 2459.506617256568),
5. ('value', 2433.083800659239),
6. ('value', 2408.107898192493),
7. ('value', 2439.415079893941),
8. ('value', 2470.170705476894)]
Code 16.13
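Flux pipelines can do much more than filter. As a hedged side note, reusing query_api and config from the example above, we could downsample the series into 10-minute averages with aggregateWindow:
1. query = """from(bucket: "coins")
2.     |> range(start: -100m)
3.     |> filter(fn: (r) => r._measurement == "price")
4.     |> aggregateWindow(every: 10m, fn: mean, createEmpty: false)"""
5. result = query_api.query(org=config['org'], query=query)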

Data visualization
Once we have data fetched and saved in the local database, it would be fantastic from a usability point of view to have an option to visualize the trends and rates of currency exchange. To perform this task, we are going to use the Flask framework as a webservice that we are going to open in the browser.
1. pip install flask==2.2.3
2. pip install plotly pandas
Code 16.14
1. We installed flask and plotly; the latter is a framework for drawing visualizations and charts. The first thing we are going to do is write a simple "hello world" service. In the following example, we created the file hello_world.py.
1. from flask import Flask
2.
3. app = Flask(__name__)
4.
5. @app.route("/")
6. def hello():
7. return "Hello World!"
8.
9. if __name__ == "__main__":
10. app.run(host="localhost", port=5005)
Code 16.15
To start it we need to execute it by running following command.
1. $ python hello_world.py
2.
3. * Serving Flask app 'hello_world'
4. * Debug mode: off
5. WARNING: This is a development server. Do not use it in a
production deployment. Use a production WSGI server instead.
6. * Running on http://localhost:5005
7. Press CTRL+C to quit
Code 16.16
What is worth noticing is the fact that we specified where the server listens and on which port. Once it is up, we can open our service in any web browser just by accessing the URL http://localhost:5005.
2. Working great, right? Before we jump into the topic of drawing any kind of graphs, we need to do some HTML work on our hello world example. We already worked with the concept of MVC5 in one of the previous chapters, so we have some basic knowledge of how web frameworks use it.
In the Flask framework, the approach is lighter and more low level. The developer is the person who decides what framework to choose for each MVC component6.
3. Without diving too deep into the topic, we must make some assumptions. For the view layer, we will use the jinja27 templating framework. To be able to use it in our hello world example, we have to modify it so it looks like the following code.
1. from flask import Flask, render_template
2.
3. app = Flask(__name__)
4.
5. @app.route("/")
6. def hello():
7. return render_template("index.html")
8.
9. if __name__ == "__main__":
10. app.run(host="localhost", port=5005)
Code 16.17
Restart the server and reopen the same URL as in example 16.16 and… we get an error like the following dump from the shell.
1. ERROR in app: Exception on / [GET]
2. Traceback (most recent call last):
3.
4. (...)
5.
6. jinja2.exceptions.TemplateNotFound: index.html
7. 127.0.0.1 - - [02/Apr/2023 08:27:03] "GET / HTTP/1.1" 500 -
Code 16.18
4. That error means we tried to open the URL and execute Code 16.17, lines 6-7, where line 7 tried to load a jinja template which does not exist yet, so as a side effect jinja threw an error. Let us create the missing template to address this problem. Create a directory templates and save the following file index.html in the templates folder.
1. <html>
2. <head>
3. <title>crypto analyzer</title>
4. </head>
5. <body>
6. <p>hello</p>
7. </body>
8. </html>
Code 16.19
Since we know how to generate HTML from templates, we can now make this main template a little nicer and more handsome. For doing this, we will use the popular JavaScript library called bootstrap8.
5. Thankfully, this library comes as precompiled, ready-to-distribute files. We are going to use a content delivery network (CDN9). Let us modify the example from Code 16.19 and introduce some bootstrap sugar there.
1. <html>
2. <head>
3. <title>crypto analyzer</title>
4. <link href="https://cdn.jsdelivr.net/npm/[email protected]/dist/css/
bootstrap.min.css" rel="stylesheet" crossorigin="anonymous">
5. <script src="https://cdn.jsdelivr.net/npm/[email protected]/dist/js/bootstrap.bundle.min.js" crossorigin="anonymous"></script>
6. </head>
7. <body>
8. <p>Current data</p>
9. </body>
10. </html>
Code 16.20
6. By introducing Bootstrap into our HTML, we can define a completely new look and feel for our template. As the next step, we need to define a new method that is going to select all the current currency rates from the database and display them on the main page.
1. from pprint import pprint
2. from influxdb_client import InfluxDBClient
3. from dotenv import dotenv_values
4. from flask import Flask, render_template
5.
6.
7. url = "http://localhost:8086"
8. config = dotenv_values(".env")
9. client = InfluxDBClient(url=url, token=config['API_KEY'], org=con
fig['org'])
10. query_api = client.query_api()
11.
12.
13. app = Flask(__name__)
14. app.config["TEMPLATES_AUTO_RELOAD"] = True
15.
16. def get_data():
17. query = """from(bucket: "coins")
18. |> range(start: -200m)
19. |> filter(fn: (r) => r._measurement == "price")"""
20.
21. result = query_api.query(org=config['org'], query=query)
22. results = []
23. for table in result:
24. for record in table.records:
25. results.append((record.get_field(), record.get_value()))
26. return results
Code 16.21
By creating the method get_data, we implement the same querying technique that we already learned in Code 16.12. It returns a list of tuples with the results from InfluxDB. Once we have the data set prepared, it is time to inject it into our template. Let's take a look at the index page to see how we are going to achieve this.
1. @app.route("/")
2. def hello():
3. context = {"currencies": get_data()}
4. return render_template("./index.html", **context)
Code 16.22
As shown in line 3, we fetch results from InfluxDB by using the method from Code 16.21 and directly assign the result to the template context. Let's check how we are going to display the data in the template. First, let's inject the following syntax into Code 16.20, between lines 8 and 9, as follows.
1. {% include "currencies.html" %}
Code 16.23
This syntax informs the templating engine to include another template file, currencies.html, and inject it into the place where we used the include tag. Let's check in the following code how we display the InfluxDB data.
1. <table class="table table-striped table-hover">
2. <thead>
3. <th>ID</th>
4. <th>Code</th>
5. </thead>
6. {%for item in currencies %}
7. <tr>
8. <td>{{ item[0] }}</td>
9. <td>{{ item[1] }}</td>
10. </tr>
11. {% endfor %}
12. </table>
Code 16.24
7. After preparing all the data, we should restart our webservice and access the same localhost URL as before; this time, we are going to see all the data, like in the following figure.
Figure 16.2: Example table with sample currency exchange data
This is a very basic table which displays raw data. It is quite a non-user-friendly way of presenting time-driven data. A much better approach is going to be drawing a graph that shows the easier-to-follow trends of the currency exchange. Let's check the following example to see how we can convert the data received from InfluxDB into a form we can display as a graph.
1. import pandas as pd
2. import plotly.graph_objects as go
3. from influxdb_client import InfluxDBClient
4. from dotenv import dotenv_values
5. from dataclasses import make_dataclass
6.
7.
8. url = "http://localhost:8086"
9. config = dotenv_values(".env")
10. client = InfluxDBClient(url=url, token=config['API_KEY'], org=config[
'org'])
11. query_api = client.query_api()
12.
13. query = """from(bucket: "coins")
14. |> range(start: -100m)
15. |> filter(fn: (r) => r._measurement == "price")"""
16.
17. result = query_api.query(org=config['org'], query=query)
18.
19. Point = make_dataclass("Point", [("Date", str), ("Value", float)])
20.
21. results = []
22. for table in result:
23. for record in table.records:
24. results.append(Point(record.get_time(), record.get_value()))
25.
26. df = pd.DataFrame(results)
27. fig = go.Figure([go.Scatter(x=df['Date'], y=df['Value'])])
28. fig.show()
Code 16.25
8. We can see that in this example, we refactored Code 16.12 in such a way that, instead of printing the database results, we create a dataclass Point object (line 19), which we later fill with data coming from InfluxDB (lines 22-24). Once the list of points is ready, we create a Pandas10 data frame (line 26). With all of that set, it is time to create a plot figure (line 27), fill it with the pandas data frame, and in the end show it (line 28). When all is done properly, and we have correctly updated InfluxDB with recent currency exchange data, we shall see a figure like the one shown below.
Figure 16.3: Example crypto currency exchange shown in a human friendly graph
We managed to show all the data that we collected over time. It is easy to read the up and down trends in the crypto currency exchange. There is one problem with our approach – we do not show this as part of our web service (Code 16.21). That being said, let's take a look at how we can include the learned technique of drawing a trends graph as part of our simple website.
Before we can serve the data graph as part of the webservice, we have to install a few Python modules as follows.
1. $ pip install kaleido dash
Code 16.27
9. Once the packages are installed, we can introduce a new method that is going to display the same image content as in Figure 16.3, but as part of the webserver response. Let's check the following example.
1. from flask import Flask, render_template, Response
2.
3. @app.route("/graph")
4. def graph():
5. results = get_data()
6. df = pd.DataFrame(results)
7. fig = go.Figure([go.Scatter(x=df['Date'], y=df['Value'])])
8. img_bytes = fig.to_image(format="png")
9. return Response(img_bytes, mimetype="image/png")
Code 16.28
We added the Response function to the flask import; it helps us return raw data in the response (line 9). In this case, we have to define explicitly what kind of data we are returning – we inform the Flask framework that the data to be returned (line 9) is an image (PNG).
10. The rest of the body of the method graph is pretty much the same as what we wrote in Code 16.25, with one major difference – line 8. That line, instead of showing the image directly, dumps the image content into a variable, which we later return to the browser.
Once the image is ready, we can modify the body of our template content to show the image as part of our simple website.
1. <body>
2. <p>Current data (USD)</p>
3. <img src="/graph" />
4. </body>
Code 16.29
11. Let us restart our webserver and check how the main website is going to look. It should display a graph like in the figure that follows.
Figure 16.4: Example webserver displaying currency exchange trends over time

Data estimate
In Chapter 4, Developing App to Analyze Financial Expenses, we learned how to build a data estimator. Let us use this knowledge here to add a tool for analyzing data trends and drawing estimates. Let's check the following example to see how we can refactor Code 16.28 to add interpolation to it.
Before we can create the estimating tool, we have to install a few Python modules, as in the following example.
1. $ pip install numpy scikit-learn
Code 16.30
When all is set, we can write the estimating function, as in the following code.
1. import math
2. import numpy as np
3. from sklearn import preprocessing, svm
4. from sklearn.model_selection import train_test_split
5. from sklearn.linear_model import Ridge
6. from datetime import datetime
7.
8. def forecast_data(df):
9.     forecast_col = "Value"
10.     df.fillna(value=-99999, inplace=True)
11.     forecast_size = int(math.ceil(0.03 * len(df)))
12.     df['Date'] = df['Date'].apply(lambda x: x.timestamp())
13.     df["label"] = df[forecast_col].shift(-forecast_size)
14.
15.     x = np.array(df.drop(["label"], axis=1))
16.     print(x)
17.     x = preprocessing.power_transform(x)
18.     x_lately = x[-forecast_size:]
19.     x = x[:-forecast_size]
20.
21.     df.dropna(inplace=True)
22.
23.     y = np.array(df["label"])
24.     x_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
25.     clf = Ridge(alpha=1.0)
26.     clf.fit(x_train, y_train)
27.     confidence = clf.score(X_test, y_test)
28.
29.     forecast_set = clf.predict(x_lately)
30.     df["Forecast"] = np.nan
31.     last_date = df.iloc[-1].name
32.     last_unix = last_date
33.     one_day = 60  # 1 minute in seconds
34.     next_unix = last_unix + one_day
35.
36.     for i in forecast_set:
37.         next_date = datetime.fromtimestamp(next_unix)
38.         next_unix += one_day
39.         df.loc[next_date] = [np.nan for _ in range(len(df.columns) - 1)] + [i]
40.
41. @app.route("/graph")
42. def graph():
43.     results = get_data()
44.     df = pd.DataFrame(results)
45.     forecast_data(df)
46.     df['Date'] = pd.to_datetime(df['Date'], unit='s')
47.     fig = go.Figure([go.Scatter(x=df['Date'], y=df['Value'])])
48.     img_bytes = fig.to_image(format="png")
49.     return Response(img_bytes, mimetype="image/png")
Code 16.31
We added the new essential Python imports (lines 1-6). Next, we introduced a slightly modified interpolation method (lines 8-39) that we already learned about in Chapter 4, Developing App to Analyze Financial Expenses. What is worth noticing is that we call it (line 45) by passing the data frame as a parameter, which gets modified directly inside the method. The data frame argument in this use case works like a pointer in the C language, so there is no need to return and reassign the data frame value after the method runs. Next, we show the graph as we did in the previous examples.
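As a concrete illustration with made-up numbers: with 200 data points read from the database, forecast_size = ceil(0.03 * 200) = 6, so the model predicts six future values, which the loop then places one simulated minute (60 seconds) apart after the last known timestamp.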
Figure 16.5: Compared crypto currency exchange values without and with data interpolation
As shown, the estimating algorithm is pretty smooth, and it calculates the incoming trend based on the ups and downs in the last few data points that were read from InfluxDB.

Alarms
So far, we have learned how to fetch and store stream data in a database, as well as present the gathered results in a human-friendly way. This time, we want to see how to build a simple tool that raises an alarm when a significant change in the currency exchange rate is detected. For instance, we want to detect when there is a drop or rise of a given percentage in the received crypto currency exchange data. Let's check the following example to see how we can modify the code that we built in the subchapter on data streams.
1. def check_value(self, alarm):
2.     query = """
3.     from(bucket: "coins")
4.       |> range(start: -1000m)
5.       |> filter(fn: (r) => r._measurement == "price")
6.       |> sort(columns: ["_time"], desc: true)
7.       |> limit(n:2)
8.     """
9.     client = InfluxDBClient(url=URL, token=self._config['API_KEY'], org=self._config['org'])
10.     query_api = client.query_api()
11.     result = query_api.query(org=self._config['org'], query=query)
12.     results = []
13.     for table in result:
14.         for record in table.records:
15.             results.append(record.get_value())
16.     print(results)
17.     ratio = results[0] / results[1]  # latest value divided by the previous one
18.     trend_percentage = (ratio * 100) - 100
19.     print(f"Trends change: {trend_percentage}%")
20.     if abs(trend_percentage) > alarm:
21.         print(f"WARNING: Critical change since last time we fetched data, change: {trend_percentage}%")
Code 16.32
In this example, we query the database for all the records that were saved during the last 1000 minutes, sort them by insert time, and take only the last two values from the set. Because the query sorts in descending order, results[0] holds the latest value and results[1] the previous one. The reason we ask only for the last two elements is that we want to check the percentage of change between the currency exchange value we just inserted (check the following code) and the previous value we saved. When we run our method, we should see a result like this.
1. $ python updater_with_storage_alarms.py --coin eth --alarm 5
2.
3. Calling https://price-api.crypto.com/price/v1/token-price/ethereum
4. Trends change: -0.0791962718755741%
Code 16.33
As you can notice, in Code 16.32 (lines 20-21) we added a check whether the percentage of trend change trend_percentage breaches the given threshold alarm. We calculate the absolute value (line 20) since we want to raise the alarm both when the trend grows above the alarm level and when it falls below the alarm line.
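As a quick sanity check of the arithmetic, take hypothetical consecutive prices of 100 USD (previous) and 94 USD (latest):
latest, previous = 94.0, 100.0  # hypothetical consecutive prices
trend_percentage = (latest / previous * 100) - 100  # -6.0
alarm = 5
print(abs(trend_percentage) > alarm)  # True - a 6% drop breaches a 5% alarm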
In Code 16.33, we pass the alarm level as a parameter to our script (Code 16.8). To be able to use this parameter, we have to modify that code as shown in the following example.
1. @click.command()
2. @click.option("--alarm", help="Alarm level - percentage", required=False, default=0, type=int)
3. @click.option("--seed", help="Seed example data", required=False, default=False, is_flag=True)
4. @click.option(
5.     "--coin", type=click.Choice(SUPPORTED_COINS.keys()), help="Coin symbol to fetch details about", required=True
6. )
7. def main(alarm, seed, coin):
8.     if coin not in SUPPORTED_COINS:
9.         raise Exception("Invalid coin")
10.     c = CoinApp(SUPPORTED_COINS[coin])
11.     if seed:
12.         c.seed_data()
13.     else:
14.         c.update_db(alarm)
Code 16.34
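With these options in place, the script can be invoked in either mode, for example:
$ python updater_with_storage_alarms.py --coin eth --seed
$ python updater_with_storage_alarms.py --coin eth --alarm 5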
In this case, adding the new alarm parameter is pretty straightforward. Now, let's check the following example to see how we use it in the update_db method.
1. def update_db(self, alarm):
2.     data = self.fetch_exchange()
3.     value = data['usd_price']
4.     self._save_data(value)
5.     self.check_value(alarm)
Code 16.35
As we said for Code 16.32, we fetch only the two last results inserted into the database because, as demonstrated in Code 16.35 (lines 4-5), we call the check_value method only after inserting the freshly fetched cryptocurrency exchange value into InfluxDB.
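The updater is meant to run periodically. A minimal sketch of such a runner, assuming the CoinApp class and the SUPPORTED_COINS mapping from the earlier listings, and matching the one-minute spacing used by the forecast code:
import time

c = CoinApp(SUPPORTED_COINS["eth"])
while True:
    c.update_db(alarm=5)  # fetch, store, and check the alarm threshold
    time.sleep(60)        # one data point per minute

In production, a cron job or a systemd timer would be a more robust choice than a bare loop.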
Conclusion
In this chapter, we learned how to use a publicly accessible website that publishes cryptocurrency exchange values and statistics. We understood how to store time series data in a database that is specially designed to hold such data so that it can be accessed quickly and efficiently, and queried in a very flexible way. We also managed to write our own simple yet powerful web application that consumes the stored data and presents human-friendly cryptocurrency exchange trends.
1. https://en.wikipedia.org/wiki/Fiat_money
2. https://en.wikipedia.org/wiki/Time_series_database
3. https://github.com/timescale/timescaledb
4. https://www.influxdata.com
5. https://en.wikipedia.org/wiki/Model–view–controller
6. https://flask-diamond.readthedocs.io/en/latest/model-view-controller/
7. https://palletsprojects.com/p/jinja/
8. https://getbootstrap.com
9. https://www.jsdelivr.com
10. https://pandas.pydata.org
Join our book’s Discord space
Join the book's Discord workspace for latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:
https://discord.bpbonline.com
Index
A
Alarms 469-471
Auto Purchase 400-405
B
Barcode Generator 430
Barcode Generator, steps 430-432
Barcode Reader 432
Barcode Reader, steps 432-438
C
Calculator 416-420
Calculator, aspects
Android 424-426
Callbacks 423, 424
Calculator, configuring 422, 423
Calculator, framework
Android 426, 427
iOS 427, 428
Calendar Parser 329-333
Calendar Parser, points
External Data, synchronizing 340-344
Subscribe Locally 336-338
Chat 70, 71
Chatbot 67, 68
Chatbot, aspects
Rules-Based Service 67
Self-Learn 67
Chatterbot 68-70
ClientEbay 380
Client-Server 62
Client-Server, applications 71-78
Client-Server, architecture 62-67
Client-Server, ways 64
Compiler 412, 413
Crypto Currencies, optimizing 182, 183
Crypto Currencies, steps 183-188
Crypto Currencies Trend, analyzing 191-197
Crypto Currencies With Wallet, integrating 203-209
Crypto Market 182
Crypto Market, optimizing 211-215
Crypto Market With Client, building 188-191
D
Data Estimate 467, 468
Data Stream 450
Data Stream, steps 450-452
Data Stream, terms
Reading 457, 458
Storing 452-455
Data Visualization 458
Data Visualization, steps 458-466
DHCP Server 294
DHCP Server, architecture 294-307
Download Manager 250-260
Download Manager Data, analyzing 265-275
E
eBay Client 372
eBay Client, keys 378
eBay Client, parameters
appid 373
certid 373
Devid 373
Token 373
eBay Client, steps 372-377
Excel 88
Excel Driver, building 104-108
Excel Expenses, analyzing 92-96
Excel Outcomes, estimating 96-103
Excel, tasks
Export 88-91
Import 91
F
Flake8 44, 45
Format Resolutions, supporting 283-285
Frontend 79
Frontend, arguments 80
Frontend, configuring 80-86
G
git 50
GUI 410
GUI, applications
Kivy 410-412
Toga 410
H
Hashing 146, 147
Hashing, factors 147
Hash Key, calculating 147-149
I
IDE 46, 47
Interaction, scenarios 226-231
P
Package Inspection 312-315
Package Inspection, challenges 316, 317
Package Routing 288
Package Routing, architecture 288-294
Package Routing, layers 288
Parallel Process 126-134
Parallel Process, architecture 175-179
Parallel Process, method 135
Physical Devices, building 247
pip 68
Plugins 378
Plugins, points
Items, tracking 393-396
Price Tracker, automating 388-392
Plugins, values 398-400
Port Scanner 351
Port Scanner, issues 352, 353
Port Scanner, traps 352
Pre-Commit 47-51
Pre-Commit, challenges 50
pycrypto 190
Pylint 45, 46
Python 2
Python, fundamentals
Classes 16-20
Code Style 28-31
Error, handling 25-27
Functions 10-15
Iterators/Generators 8-10
Loops 6-8
Modules/Package 22-24
Python GUI 409
Python GUI, libraries
Kivy 409
QT 409
TK 409
wxWidgets 409
Python Library, building 54-60
Python, points
Editor 3, 4
Hello World 4, 5
Python, tools
Flake8 44, 45
IDE 46, 47
Pre-Commit 47-53
Pylint 45, 46
Python, uses 2, 3
Python, workbench
Clean Code 42, 43
Libraries, controlling 38-41
Linux 34, 35
Project, controlling 36-38
Windows 36
Q
QR Code Generator 438-444
QR Code Reader 447, 448
R
Reporting 361
Reporting, architecture 361-369
return keyword 9
S
Scanner 353
Scanner, points 354-357
Scikit-Learn 97
sounddevice 218
Speech-To-Text Recognition 218
Speech-To-Text Recognition, components 218
Speech-To-Text Recognition, points
Recording 218-222
Response 223-226
spotify_callback 241
sqlparse 41
T
TCP/UDP 346
TCP/UDP, configuring 346-350
Tempfile 225
Third Party Service, connecting 233-239
V
Viruses 149, 150
Viruses Class, attributes 168
Viruses, issues 153
Viruses Suspicious Files, building 163-166
Viruses, uses 150-163
Virus Scanner, directories 144-146
W
Web Calendars, tool
Google 322-324
iCal 328
Office 365 325-328
Web Content, filtering 317-320
Web Crawler 110-113
Web Crawler, architecture 121-125
Web Crawler, configuring 115-120
Web Crawler, resources
Parallel Process 137, 139
Proxy 139-141
Web Crawler With HTML, analyzing 113-115
Y
YouTube API, building 275-282