Fun With Python - Hubert Piotrowski
Fun with Python
Developing mobile apps, automating tasks, and analyzing cryptocurrency trends
Hubert Piotrowski
www.bpbonline.com
First Edition 2025
ISBN: 978-93-65893-816
All Rights Reserved. No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, or stored in a database or retrieval system, without the prior written permission of the publisher, with the exception of the program listings, which may be entered, stored, and executed in a computer system, but cannot be reproduced by means of publication, photocopy, recording, or any electronic or mechanical means.
All trademarks referred to in the book are acknowledged as properties of their respective owners but
BPB Publications cannot guarantee the accuracy of this information.
www.bpbonline.com
Dedicated to
Acknowledgements
There are a few people I would like to thank for the continued and ongoing support they have given me while writing this book. First and foremost, I would like to thank my family, who have been a great support and, at the same time, have shown so much patience and understanding. Without their support, finishing this book could not have been accomplished.
I am also grateful to the companies, contractors, and colleagues I have worked with in the past. Their collaboration has enriched my journey and contributed greatly to my experience.
My gratitude also goes to the team at BPB Publications for their support and understanding, and for giving me ample time to finish this challenging book.
Preface
https://rebrand.ly/81aee6
The code bundle for the book is also hosted on GitHub at
https://github.com/bpbpublications/Fun-with-Python. In case there’s an
update to the code, it will be updated on the existing GitHub repository.
We have code bundles from our rich catalogue of books and videos
available at https://github.com/bpbpublications. Check them out!
Errata
We take immense pride in our work at BPB Publications and follow best practices to ensure the accuracy of our content and provide an engaging reading experience to our subscribers. Our readers are our mirrors, and we use their inputs to reflect and improve upon human errors, if any, that may have occurred during the publishing processes involved. To help us maintain the quality and reach out to any readers who might be having difficulties due to any unforeseen errors, please write to us at:
[email protected]
Your support, suggestions, and feedback are highly appreciated by the BPB Publications' Family.
Did you know that BPB offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.bpbonline.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.bpbonline.com, you can also read a collection of free technical articles, sign up for a
range of free newsletters, and receive exclusive discounts and offers on BPB books and eBooks.
Piracy
If you come across any illegal copies of our works in any form on the internet, we would be
grateful if you would provide us with the location address or website name. Please contact us at
[email protected] with a link to the material.
Reviews
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at BPB can learn what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about BPB, please visit www.bpbonline.com.
Table of Contents
1. Python 101
Introduction
Structure
Objectives
Installing Python
Using Python
Editor
Hello world
Basics
Loops
Iterators vs generators
Functions
Classes
Modules and packages
Error handling
Code style
Conclusion
Index
CHAPTER 1
Python 101
Introduction
During the early 2000s, many technical universities used software called MATLAB1 to simulate use cases for the Arithmetic Logic Unit (ALU) in a CPU. Some calculations and simulations with pure MATLAB were time-consuming and a bit complex to achieve at that time. It was then that many of us were introduced to Python, which worked well with MATLAB and could replace it in many ways for calculations and data manipulation in academic settings.
Many were surprised by the syntax, speed, and ease of use. To those of us with a background primarily in assembler, Pascal, and raw C, the new language called Python, with its weird syntax, was very different from anything known at the time.
The wide use of Python came a few years later, because many of us were "distracted" by PHP during the web boom. Then the web world heard about web frameworks like Django and about APIs, and Python started to become more visible and more mature. Backend services started to use the twisted framework2, and all those callbacks and async programming for the web were really something.
In this chapter, we will explain how Python works with simple examples that get gradually more complicated, so that the reader is prepared for the following chapters, which require this fundamental knowledge. If you are already familiar with Python but feel you need to refresh your knowledge, or would like to learn how to write clean code – please join us in this fun.
Structure
In this chapter, we will discuss the following topics:
Basic syntax of Python code
Understanding of how to build basic structures
Basics of object-oriented programming
How to build packages, modules, and classes
Objectives
After reading this chapter, you should have a solid grasp of fundamental Python programming. You will also learn how to write clean code. Let us have some fun with Python together!
Installing Python
Python is an open-source language, which means you can go to https://python.org and download the entire language with all its tools in the form of compilable source code. If you have some experience with C code and dependent libraries, you can try installing Python from source. It is a long process and requires a bit of knowledge, but it is worth doing: it allows you to narrow the Python stack down to your personal needs.
In this book, we will focus on installing Python using the precompiled installers prepared by the Python team. In this case, you should go to the Python website and download Python 3.10 3. In the bottom section of the page, you can find an appropriate installer for your operating system.
In the following chapters, we will use Windows, macOS, and Ubuntu Linux to demonstrate the installation process and Python use cases.
Using Python
This book will teach you how to run Python programs, but you must have
basic knowledge of using the Command Line Interface (CLI). If you are
familiar with using CLI in Windows and Unix-based systems (macOS or
Linux), you can skip this paragraph and jump to the next one.
There is a vast history related to CLI and how it was evolving - if you're
interested to know more, you can check out various sources online.
Editor
To make our code look good: as already highlighted, we do not use curly brackets and semicolons to delimit logical blocks but indentation instead, and it must be consistent to avoid potential syntax errors. A single indentation equals 4 spaces, 2 indentations equal 8 spaces, and so on. There are other suggestions delivered by the Python community, for instance, how many blank lines must separate function definitions. More information regarding code formatting can be found on the official Python website4. In this book, for easier reference, we will be using Visual Studio Code5 – this IDE is free and open-source and has lots of great quirks and features that are very useful in every developer's everyday work.
Hello world
Most universities and programming training courses will tell you that the C language is the father of all programming languages, which is true, no doubt! For instance, the reference Python interpreter and many of its modules are written in C. We bring C up here to show you how much easier Python syntax is compared to C and other languages.
1. #include <stdio.h>
2. int main() {
3.     // printf() displays the string inside quotation marks
4.     printf("Hello, World!");
5.     return 0;
6. }
Code 1.1
Now, let us try to see what hello world looks like in Python:
1. print("Hello world")
Code 1.2
The first time you see this "hello world" microprogram, you think to yourself: where are the curly brackets or semicolons? Well – that is the real beauty of the Python language. As you might have already noticed, semicolons are completely dropped from Python syntax. What replaces curly brackets, then? As you can probably tell, it is indentation. We will talk a bit more about this in Chapter 2, Setup Python Environment, regarding clean code syntax and tools to help you with this.
The next example is again hello world, but wrapped in a function that takes the message as an argument.
1. def hello_world(message):
2.     print(message)
3.
4. hello_world("Hello world")
Code 1.3
To run this program, open Visual Studio Code (VSC) and create a new file called hello_world.py, then copy and paste the above example from Code 1.3 and save it to your home directory. Now, open the CLI and go to the home folder where you saved the file. In the next steps, we will assume that you have successfully installed Python on your operating system.
1. $ python3 hello_world.py
2. Hello world
Code 1.4
As you can see, the output of that program goes directly to the CLI stream called stdout. This is how Python programs work – they run in the Python interpreter and can redirect their output to stdout or a log file – we will talk about this in the upcoming chapters.
The default Python interpreter can be started from the CLI by typing the command python and hitting enter. This starts the Python shell, and from then on, everything you type in that console is Python code.
1. Python 3.10.8 (main, Oct 21 2022, 22:22:30)
2. Type "help", "copyright", "credits" or "license" for more information.
3. >>>
Code 1.5
Basics
So far, we have learned how to organize your Python craftsman's desk. Now, let us learn a few basic Python concepts that we will need in the following chapters of this book. To learn more about specific programming concepts, we strongly suggest spending some time reading the official Python docs.6
Loops
When you need to repeat some block of code, you should of course not copy and paste the same thing multiple times. There is a way to repeat code blocks while some condition holds or until some condition is reached. As an example, we can use the Fibonacci series7.
1. >>> a, b = 0, 1
2. >>> while a < 10:
3. ...     print(a)
4. ...     a, b = b, a+b
5. ...
6. 0
7. 1
8. 1
9. 2
10. 3
11. 5
12. 8
Code 1.6
Above is an example of repeating a block of code (lines 3-4) while variable a's value is lower than 10. If you follow carefully how the value of a increases, you can see a small imperfection in the above code – it prints the value 1 twice on the screen. Why do we say that? Let us analyze this together.
1st iteration → prints 0; then a, b = b, a + b gives a = 1, b = 0 + 1 = 1
2nd iteration → prints 1; then a = 1, b = 1 + 1 = 2
3rd iteration → prints 1 again; then a = 2, b = 1 + 2 = 3, and so on.
Why do we get the value 1 printed twice in the output of the while loop (example above)? You can probably see the pattern – we should print the values of variables a and b after calculating them; in that case, we would print the correct values. Please also notice one trick in Python – what I'd call a one-liner: a, b = b, a + b. What is happening here? Python has magical syntax: it evaluates the entire right-hand side first and only then assigns both results at once, so no temporary variable is needed.
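Here is a minimal sketch (not one of the book's numbered listings) of what the same update looks like when written out with an explicit temporary variable:

tmp = a        # remember the old value of a
a = b          # a takes the old value of b
b = tmp + b    # b becomes old a + old b

Python's tuple assignment collapses these three lines into one. The next listing mixes the hello_world function with an actual loop: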
1. def hello_world(message):
2.     print(message)
3.
4. for x in range(10):
5.     hello_world(f"repeat {x}")
Code 1.7
The above is another example of a simple loop: we mix the hello world function that we learned to write with the actual looping technique. As you can see, we used the range function as a sequence generator, looping with the for statement. Please notice that each time the loop makes a turn (line no. 5), we call the hello_world function with the value saved in variable x.
That looping is the simplest example of what programming is about – you build blocks of logical statements that are executed upon specific programmed conditions.
In our case, instead of writing the printed statement repeat 0 … repeat 9 ten times by hand, we simplified it to code that gives the same result but is much easier to control. Why? Imagine that you want to print that line not 10 times but 1000 times! After which repetition of copy-paste would you give up?
1. repeat 0
2. repeat 1
3. repeat 2
4. repeat 3
5. repeat 4
6. repeat 5
7. repeat 6
8. repeat 7
9. repeat 8
10. repeat 9
Code 1.8
Let us now modify this simple example: we will print only every 2nd line, and use a new variable holding the current iterator value multiplied by 2. To help us, we can use an if statement, as Code 1.9 demonstrates.
1. def hello_world(message):
2.     print(message)
3.
4. for x in range(10):
5.     if x % 2 == 0:
6.         y = x * 2
7.         hello_world(f"value {y}")
Here you can check the result of running the above code, where a conditional is used inside the loop:
1. value 0
2. value 4
3. value 8
4. value 12
5. value 16
Code 1.10
Now think for a minute. Do you know what will happen if we modify line 4
to make it look like for x in range(11) – will it change anything in the
output of the running code? The answer is – yes. You will have an additional
line printed with the value of 20.
Why is that? The range function generates an iterable sequence from 0 to 9 (10 elements, as a generator). In Python, the for loop starts iteration from element 0, so with a default iterator over 10 elements (line 4), we finish when x = 9. When we change it to range(11), we finish the last loop when x = 10. In that case, we enter line 5, and in line 6 we do a simple assignment to the variable → y = 10 * 2.
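A quick way to see the difference – sketched here outside the book's numbered listings – is to materialize both ranges into lists:

print(list(range(10)))   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(list(range(11)))   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]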
Iterators vs generators
When we talk about loops, we must explain how iterators and generators work and where they are useful. Iterators vs generators can be a bit confusing at first. A simple explanation is that an iterator is an object whose values your code consumes one at a time, while a generator is a function that produces such values. In simple words, check the following example:
1. my_numbers = [1, 2, 3, 4, 5]
2. data = iter(my_numbers)
3. print(next(data))
4. print(next(data))
5. print(next(data))
6. print(next(data))
7. print(next(data))
Code 1.11
You can see that in line 1, we created an array with 5 elements. In line 2, we convert this array to an iterator with the iter method. In lines 3–7, you can see that we call the next method, which fetches the state/value of the object one element at a time. We can say that we created an object that exposes a mechanism to iterate over its values. Iterators go in pairs with generators.
1. 1
2. 2
3. 3
4. 4
5. 5
Code 1.12
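One detail worth knowing, shown here as a small aside rather than one of the book's listings: once all five values have been consumed, one more call to next raises a StopIteration exception:

print(next(data))   # the iterator is exhausted – this raises StopIteration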
A generator is a function that returns a generator object, which eventually yields a sequence of values instead of a single value. An example of a simple generator is shown below – please notice that the function does not use the return keyword. Instead, we use yield, which explicitly tells Python that this line produces the generator's next value.
1. def my_numbers():
2.     for i in range(1, 6):
3.         yield i
4.
5. obj = my_numbers()
6. print(next(obj))
7. print(next(obj))
8. print(next(obj))
9. print(next(obj))
10. print(next(obj))
Code 1.13
When you follow lines 5-10 carefully, you will notice similarities with the previous example, but there is no iter method being called. The reason is that our function my_numbers returns a generator, and a generator is already an iterator itself, so we do not need to.
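We can verify that claim with a short sketch, assuming the my_numbers generator from Code 1.13 is already defined:

obj = my_numbers()
print(type(obj))          # <class 'generator'>
print(iter(obj) is obj)   # True – a generator is its own iterator

Generators are the lightweight option; the next listing shows the heavyweight alternative – an iterator written by hand as a class: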
1. class AsciiIterator:
2.
3.     def __iter__(self):
4.         self.current_value = 65
5.         return self
6.
7.     def __next__(self):
8.         if self.current_value > 90:
9.             raise StopIteration
10.         tmp_value = self.current_value
11.         self.current_value += 1
12.         return chr(tmp_value)
13.
14. obj = AsciiIterator()
15. my_iterator = iter(obj)
16.
17. for letter in my_iterator:
18.     print(letter, end=",")
Code 1.14
In the preceding code, you can see a more precise and advanced iterator. What is happening here: we declared a class in line 1 with 2 important methods – lines 3 and 7. __iter__ is the one called when the iterator is initialized – that is, when the iter function is invoked (line 15). After that, every time the main loop (lines 17-18) asks for the next element, Python calls __next__. That part of the code keeps increasing the internal value current_value and returning its character; line 18 then prints it on the screen.
1. A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,
Code 1.15
Let us try to do something similar with a generator and compare. Noticeably, to achieve the same result – especially the loop part – we built much simpler code with the generator approach (lines 1-3) compared to the iterator one.
1. def ascii_iterator():
2.     for i in range(65, 91):
3.         yield chr(i)
4.
5. my_letters = ascii_iterator()
6.
7. for letter in my_letters:
8.     print(letter, end=",")
Code 1.16
Functions
So far, we have learned how to write the hello world function. Let us dive a little deeper into the function definition and understand how Python deals with different styles of functions.
1. def foo(arg1, arg2):
2.     return arg1 + arg2
3.
4. foo(1, 2)
Code 1.17
Let us analyze for a minute how a typical Python function is organized. We called our test function foo, and it takes two arguments; the argument values are flexible, and you can call your function with any arguments you want.
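As a brief illustrative sketch of the call styles Python supports (the function add below is a hypothetical helper, not one of the book's listings), arguments can be passed positionally, by keyword, or left to a default value:

def add(arg1, arg2=10):
    return arg1 + arg2

print(add(1, 2))        # positional arguments → 3
print(add(1, arg2=5))   # keyword argument → 6
print(add(1))           # arg2 falls back to its default → 11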
Error handling
As a very mature language, Python has an advanced system for catching and raising exceptions. If you do not know what an exception is in programming, we will briefly introduce the concept through the following example. Notice that it is a simple function that divides a by b, where b is converted to float (line 2).
1. def foo(a, b):
2.     print(a/float(b))
3.
4. print(foo(1, 2))
5. print(foo(1, 0))
Code 1.44
Let us run this program and see what happens. You can see in the following example that it crashed badly. This is because in line 5 (Code 1.44) we tried to divide by zero, which led our code to crash.
1. 0.5
2. None
3.
4. ---
5.
6. ZeroDivisionError Traceback (most recent call last)
7. Cell In [31], line 5
8. 2 print(a/float(b))
9. 4 print(foo(1, 2))
10. ----> 5 print(foo(1, 0))
11.
12. Cell In [31], line 2, in foo(a, b)
13. 1 def foo(a, b):
14. ----> 2 print(a/float(b))
15.
16. ZeroDivisionError: float division by zero
Code 1.45
Our example code crashed with the exception (line 16) ZeroDivisionError. If we know that our code crashed (lines 10-16), and Python even points us to where the issue is, how can you, as a developer, protect your code against such unexpected events? The answer is simple – we can use try/except blocks to catch exceptions. Below, we modified the same code to protect it against fatal crashes and unexpected input.
1. def foo(a, b):
2.     try:
3.         return (a/float(b))
4.     except ZeroDivisionError:
5.         print("We don't know how to divide by zero")
6.     except Exception as e:
7.         print(f"Something unexpected happened, details: {e}")
8.
9. print(foo(1, 2))
10. print(foo(1, 0))
11. print(foo(1, "lalala"))
Code 1.46
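For reference – this output is not shown in the original listing, and the exact exception wording may vary slightly between Python versions – running the above three calls produces something along these lines:

0.5
We don't know how to divide by zero
None
Something unexpected happened, details: could not convert string to float: 'lalala'
None

Note the None lines: when an exception branch is taken, foo returns nothing, so the outer print prints None.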
We added returning the proper value of the division (line 3) when everything is correct. When there is a case of dividing by zero, we catch this exception and print a proper message (lines 4-5), and when something unexpected happens (line 11 – where the 2nd argument of the function call is a string instead of a number), we catch this corner case too and print a proper message.
The preceding code helps you react to unexpected situations, especially lines 6-7, which catch all exceptions other than the divide-by-zero case. Still, the question is – is there a better way of testing arguments before using them, or maybe raising an exception before something unexpected happens?
The following version is a bit cleaner. Check lines 2-3 – we are checking that both arguments are numbers – and line 5, where we reject negative values of the 2nd argument before the division happens.
1. def foo(a, b):
2.     assert isinstance(a, (int, float)), "Argument a must be number"
3.     assert isinstance(b, (int, float)), "Argument b must be number"
4.
5.     if b < 0:
6.         raise Exception("Sorry but 2nd argument must be greater than zero")
7.
8.     return (a/float(b))
Code 1.47
We run the above code with print(foo(1, 2)), which gives the proper and expected result (lines 1-2, Code 1.48). Running foo(1, 0) is more interesting: the guard in line 5 (Code 1.47) only rejects negative values, so zero slips through, and the division in line 8 still crashes with a ZeroDivisionError; Python prints the traceback with an indicator of exactly where it was raised (lines 4-16, Code 1.48). Running with a negative value such as foo(1, -1) would instead trigger our own raise statement and print its message.
1. In [1]: foo(1,2)
2. Out[1]: 0.5
3.
4. In [2]: foo(1,0)
5. ---------------------------------------------------------------------------
6. ZeroDivisionError                         Traceback (most recent call last)
7. <ipython-input-12-157e21edc363> in <module>
8. ----> 1 foo(1,0)
9.
10. <ipython-input-9-a157a0d2288f> in foo(a, b)
11.       6         raise Exception("Sorry but 2nd argument must be greater than zero")
12.       7
13. ----> 8     return (a/float(b))
14.       9
15.
16. ZeroDivisionError: float division by zero
Code 1.48
Next, let us run a case where argument b is a string, which is not the expected way of calling the function – this case is caught by the assert statement.
1. In [1]: print(foo(1, "lalala"))
2. ---------------------------------------------------------------------------
3. AssertionError                            Traceback (most recent call last)
4. <ipython-input-13-812be92c8833> in <module>
5. ----> 1 print(foo(1, "lalala"))
6.
7. <ipython-input-9-a157a0d2288f> in foo(a, b)
8.       1 def foo(a, b):
9.       2     assert isinstance(a, (int, float)), "Argument a must be number"
10. ----> 3     assert isinstance(b, (int, float)), "Argument b must be number"
11.       4
12.       5     if b < 0:
13.
14. AssertionError: Argument b must be number
Code 1.49
Pretty cool, right? With assertions, you as a developer can check for proper argument types, values, instance types, and much more. If something does not match your expectations, the assertion stops code execution and raises an exception.
Code style
The previous section talked about coding standards and where to find them. Remembering all the standards, and how and when to apply them in your code, can be pretty challenging – especially if you have to refactor some legacy code and are not sure whether the standard another developer followed was correct. But there is a method to this chaos: a few things worth remembering. Please also check Chapter 2, Setup Python Environment, where we describe how to organize Python and its tools on your work machine – some of the tools mentioned there can simplify the whole process of writing proper and clean syntax.
Variables
Now, we will share a few simple rules for naming your variables properly, for cleaner understanding and correct notation according to pep8.
1. Name your variables with underscores (snake case notation) and as explicitly as possible. Do not use any:
a. Hungarian notation, for example arru8NumberList = [1,2,3]
b. Camel case notation, for example CamelCase = 5
Please use snake case – lower case with underscores and standard English dictionary words. For example, my_variable_name = 5
2. Do not use names from the standard Python library as your own variable names. For instance, sum = 5 ← this shadows the default Python sum function with your variable, and everywhere in your code where you later try to call the sum function, it will fail with a TypeError (see the sketch after this list). So be careful with these.
3. Do not make it too long – a variable name is hard to read if it is longer than the working space in your IDE.
4. If you plan to keep the variable read-only, use capital letters for the name, that is, MY_CONTENT_FOR_NAME = "John".
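The shadowing problem from rule 2 is easy to demonstrate with a minimal sketch:

sum = 5          # shadows the built-in sum()
sum([1, 2, 3])   # TypeError: 'int' object is not callable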
Functions
As we have already learned the fundamentals of how to construct a function body, we will now dive a little deeper into the details of how to write functions in a clean manner. First, let us start with an example of bad code.
1. def FoO(ARgINt8):
2.     myTmpV = ARgINt8 * 20
3.     return OtherFoO2(myTmpV/2)
4. def OtherFoO2(ARgINt8):
5.     return ARgINt8*0.1
6. if __name__ == "__main__":
7.     print(FoO(5))
Code 1.50
The preceding code has bad syntax because it is hard to read. As you can see in the function definitions, we broke the main rule of using snake case names. Please notice how difficult it is to follow such code. Another problem is that there is no blank space between the function bodies and their uses. The author also did not add any comments to the function definitions, so whoever uses this code can have a hard time reading it. The code also uses Hungarian notation – for example, in the function arguments.
Of course, this is a trivial example of simple code, but imagine something complex written this way – following such code can give you a lot of headaches as a developer. How do we fix this? Let us take a closer look at the same code written cleanly and properly.
1. def simple_calculator(in_multiplier):
2.     """This is a pretty simple function that delivers some math."""
3.     my_temp_var = in_multiplier * 20
4.     return moving_comma(my_temp_var/2)
5.
6.
7. def moving_comma(data):
8.     return data*0.1
9.
10.
11. if __name__ == "__main__":
12.     print(simple_calculator(5))
Code 1.51
By comparing line numbers, you probably already noticed that the above example is a bit longer – but does it matter? Yes, it does. First of all, the two extra blank lines after each function definition are correct according to pep8 standards and make your code cleaner – they give some light to so much text. That aspect is super important, especially if you have to read and analyze lots of code. Trust me on this – such a small detail can make a big difference.
So why did we say it matters? Well, besides what we mentioned above, adding those blank lines does not impact Python's performance. Remember, Python is an interpreted language – adding 1 or 5 extra lines or spaces for better readability makes no serious difference to Python, but it does to you. Cleaner = easier to follow.
Another thing worth mentioning is what we highlighted before: use variable names that are descriptive and not crazy acronyms. Trust me on this – reading even your own code after a long time is much easier if variable and function names tell you what they store or return.
Classes
Writing a class cleanly follows pretty much the same rules we already learned in the previous subchapters about variables and function definitions. Now we need to re-apply these rules to a class definition. Let us take a closer look at an example class definition and its use.
1. import logging
2. import pickle
3. # gzip and zlib may be unavailable on some builds, hence the guarded imports
4. try:
5.     import gzip
6. except ImportError:
7.     gzip = None
8. try:
9.     import zlib
10. except ImportError:
11.     zlib = None
12. from collections.abc import Callable
13. from typing import NewType
14.
15. # ConfigurationError, is_compressed, gzip_compress and gzip_decompress
16. # are assumed to be helpers defined elsewhere in the project.
17. logger = logging.getLogger(__name__)
18.
19. MyObject = NewType('MyObject', Callable[[], str])
20.
21.
22. class Serializer(object):
23.
24.     def __init__(self, compression=False, compression_level=6,
25.                  use_zlib: bool = False,
26.                  pickle_protocol=pickle.HIGHEST_PROTOCOL):
27.         """
28.         Initializer, expected arguments:
29.         - compression - True means zip compression is going to be used
30.         - compression_level - compression level
31.         - use_zlib - True means using the zlib library
32.         """
33.         self.comp = compression
34.         self.comp_level = compression_level
35.         self.use_zlib = use_zlib
36.         self.pickle_protocol = pickle_protocol or pickle.HIGHEST_PROTOCOL
37.         if self.comp:
38.             if self.use_zlib and zlib is None:
39.                 raise ConfigurationError('use_zlib specified, but zlib module '
40.                                          'not found.')
41.             elif gzip is None:
42.                 raise ConfigurationError('gzip module required to enable '
43.                                          'compression.')
44.
45.     def _serialize(self, data: str) -> str:
46.         """Serialize given data to its pickle representation"""
47.         return pickle.dumps(data, self.pickle_protocol)
48.
49.     def _deserialize(self, data: str) -> Callable[[], str]:
50.         """Deserialize a pickled object to its original state"""
51.         return pickle.loads(data)
52.
53.     def serialize(self, data: Callable[[], str]):
54.         data = self._serialize(data)
55.         if self.comp:
56.             if self.use_zlib:
57.                 data = zlib.compress(data, self.comp_level)
58.             else:
59.                 data = gzip_compress(data, self.comp_level)
60.         return data
61.
62.     def deserialize(self, data: MyObject) -> str:
63.         if self.comp:
64.             if not is_compressed(data):
65.                 logger.warning('compression enabled but message data does not '
66.                                'appear to be compressed.')
67.             elif self.use_zlib:
68.                 data = zlib.decompress(data)
69.             else:
70.                 data = gzip_decompress(data)
71.         return self._deserialize(data)
Code 1.52
Let us try to analyze what is happening in the above source code. First, notice the class statement – as we already learned, in Python it begins the class definition. Everything defined in its body must follow some basic rules.
1. data = {"key1": "some value"}
2. s = Serializer()
3. serialized_data = s.serialize(data)
4. s.deserialize(serialized_data)
Code 1.53
Packages
We have already mentioned packages before:
They have their own namespace, so you must be aware of how they work
Functions, variables, or classes with the same names but coming from different namespaces must be imported in such a way that they do not overwrite each other
Now let us focus on how to import components from different packages in the cleanest way. There are a few simple rules to follow:
Import in alphabetical order
Organize your imports in 3 groups (a sketch follows Code 1.54):
Python system imports
3rd party modules
The imports from your project
Import only those things that you need in your current working file – do not overcomplicate imports
Never import with a star, like in the following example:
1. from typing import Dict, List
2. from some.package import *
Code 1.54
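A hedged sketch of the three import groups might look like the following (requests and my_project are placeholder names, not packages used in this book):

# Python system imports
import os
import sys

# 3rd party modules
import requests

# Imports from your project
from my_project.utils import some_helper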
Conclusion
In this chapter, we learned the basics of Python. We also saw how to write clean code in a fashion that follows community standards. We went through some important topics, like classes, exceptions, and modules, that we are going to use when programming some mini projects in the next chapters of this book.
In the next chapter, before we get hands-on with code, we will learn how to properly organize the workbench for our Python stack. We will see what kind of tools professionals use and how they can work for us.
1. https://www.mathworks.com/products/matlab.html
2. https://twisted.org/
3. https://www.python.org/downloads/release/python-3107 - release
notes and link to manual
4. https://peps.python.org/pep-0008/ - link to PEP specification about
proper code formatting
5. https://code.visualstudio.com – download Visual Studio Code
6. https://docs.python.org – Python full documentation.
7. https://en.wikipedia.org/wiki/Fibonacci_number
8. https://docs.python.org/3/tutorial/stdlib.html - Python standard
library documentation
9. https://peps.python.org/pep-0008/#function-and-variable-names –
how to write proper and clean functions based on Python official coding
guideline.
10. https://docs.python.org/3/library/typing.html - official Python
support manual for using typing and data types.
11. https://docs.python.org/3/library/index.html - Python standard
modules.
12. https://docs.python.org/3.10/using/cmdline.html#envvar
CHAPTER 2
Setting up Python Environment 102
Introduction
Python has come a long way since its beginning. Starting as a scripting yet very powerful language, it became a very advanced programming ecosystem that can run on almost any operating system. This power and flexibility bring some challenges for every developer: how to write clean code that is structured, follows strong standards, and allows the developer to extend the language's capabilities by adding libraries and Python extensions.
In this chapter, we will learn some basics regarding Python. We will also go through the challenges and features of Python and learn how to organize our development workbench with all the necessary tools to start effective coding.
Structure
In this chapter, we will cover the following topics:
Clean and proper Python workbench
Python in Linux
Python in Windows
Controlling projects
Packages – working with external libraries – libraries under control
Organized clean code
Validate code quality – pylint and flake8
Working with integrated development environment (IDE)
Auto applying code quality fixes – pre-commit
Build your own library
Objectives
In this chapter, we will learn how to make our Python setup work at its best. We will learn how to install and use 3rd party libraries in the most organized way. Next, we will dive into the topic of building our own deployable libraries for Python.
In the end, we will wrap everything up using automation that helps simplify the software development process. At the same time, we will briefly see how to integrate a software version control system like Git1 with Python quality automation tools.
Python in Linux
We will learn how to organize the craftsman's desk so that you can manage multiple projects on the same computer and use different Python versions if needed. Let us start with the basics: the modern line of the Python language is version 3+, for example 3.10, which we installed for this chapter.
You can still find some projects that use the Python 2.x line, albeit it has been officially announced as discontinued3. To control which Python instance you use for which project, we strongly suggest starting to use pyenv4. As we can read on the project's GitHub page:
"Pyenv lets you easily switch between multiple versions of Python. It is
simple, unobtrusive, and follows the UNIX tradition of single-purpose tools
that do one thing well."
To simplify, we are installing a pyenv - the versioning system that allows us
to install different Python versions under the same roof. Please notice that
the following only applies to Unix-based systems (MacOS and Linux). For
Windows, we will take care of multiple versions of Python in a bit different
way. Jump to the next section if you are using Windows OS.
1. $ git clone https://github.com/pyenv/pyenv.git ~/.pyenv
2. $ cd ~/.pyenv && src/configure && make -C src
Code 2.1
1. $ echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
2. $ echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
3. $ echo 'eval "$(pyenv init -)"' >> ~/.bashrc
Code 2.2
With the preceding code, you clone the pyenv repository and compile it on your local system. We assume that you use the default bash shell on a Linux system; in that case, you must tell your shell how to autoload the pyenv stack.
After installing pyenv, let us try to install Python 3.7 and 3.10. By having two different Python versions under the same system, we gain flexibility. Once this part is done, you will see three installed Python versions: the first at the top is the one installed system-wide in your operating system; the 2nd and 3rd are those we just installed with pyenv, and we ran our hello world program under each of them.
1. $ pyenv install 3.10.4
2. $ pyenv install 3.7.4
Code 2.3
1. $ pyenv versions
2. * system (set by /home/darkman66/.pyenv/version)
3.   3.7.4
4.   3.10.4
5. $
Code 2.4
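To confirm both interpreters work, we can switch between them and run our hello world program under each – the session below is a hypothetical sketch, assuming hello_world.py from Chapter 1 is in the current folder:

$ pyenv shell 3.7.4
$ python hello_world.py
Hello world
$ pyenv shell 3.10.4
$ python hello_world.py
Hello world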
Python in Windows
Installing Python on a Windows system is not difficult; you just download the installer from the Python website5 and run the installation wizard. That is one way of installing Python in the Windows ecosystem; the other approach is installing different Python versions using PowerShell6 and pyenv7.
If you do not want to install Bash or PowerShell and want to use a native CLI for Windows, we suggest cmder8. It is a CLI tool with many great features, including git support. Once you have installed Python, you can initialize a virtualenv:
1. virtualenv stuff1
2.
3. Using base prefix 'c:\\users\\hub\\appdata\\local\\programs\\python\\python37'
4. New python executable in C:\Users\hub\Desktop\cmder\stuff1\Scripts\python.exe
5. Installing setuptools, pip, wheel...
6. done.
Code 2.5
Once we have initialized the virtualenv (line 1), we can start using it and installing packages, as in the following example:
1. λ C:\Users\hub\Desktop\cmder\stuff1\Scripts\activate.bat
2.
3. C:\Users\hub\Desktop\cmder
4. (stuff1) λ
Code 2.6
Controlling projects
Python has so many flexible ways of controlling projects and their dependencies that we could probably write a separate chapter about it. In this chapter, we will share one way to control projects and their dependencies.
To make this little mess a bit cleaner and keep all projects and their dependencies tidy, we will use a Python module called virtualenvwrapper9.
By using this module, you will be able to:
Keep all your project-related dependencies in a single clean place
Track projects and their libraries
Destroy and recreate virtual environments easily
Experiment with different Python libraries
Installation of virtualenvwrapper is simple, and it will be installed as a system-wide accessible module. That means you will have access to all the virtualenvwrapper commands at the level of your system.
1. pip install virtualenvwrapper
2. echo 'export VIRTUALENVWRAPPER_PYTHON=/usr/local/bin/python' >> ~/.bashrc
3. echo 'export WORKON_HOME=$HOME/.virtualenvs' >> ~/.bashrc
4. echo 'export VIRTUALENVWRAPPER_VIRTUALENV=/usr/local/bin/virtualenv' >> ~/.bashrc
5. source /usr/local/bin/virtualenvwrapper.sh
Code 2.7
You can use the preceding tool whenever you open a bash terminal. For Windows users, to use this tool you will need to install the bash system extension; you can visit the Microsoft blog10, where you will find more details regarding the installation process:
Let us create two virtual environments as above for our hello world program. We do this to demonstrate the feasibility of using two different Python versions on the same machine. It is important to remember that each Python version is compiled on the local machine. So, if you have any issues installing it with pyenv, ensure all necessary libraries are installed before continuing the pyenv installation.
After successful installation of pyenv and virtualenvwrapper, we can list the virtualenvs available in our system by typing lsvirtualenv. On Ubuntu, the build libraries that pyenv needs to compile Python can be installed as follows:
1. $ sudo apt install -y wget build-essential libreadline-dev \
2. libncursesw5-dev libssl-dev libsqlite3-dev tk-dev libgdbm-dev \
3. libc6-dev libbz2-dev libffi-dev zlib1g-dev
Code 2.9
To start working in a selected virtualenv, type workon hello1, and that is it. All your Python binaries and libraries will point to the location of the virtualenv that you selected, as sketched below.
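A hypothetical session sketch (your prompt and version output will differ):

$ workon hello1
(hello1) $ python --version
Python 3.7.4
(hello1) $ deactivate
$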
Another alternative to virtualenvwrapper is to use virtualenv with pyenv directly (through the pyenv-virtualenv plugin). For example, install Python 3.7.4 and initialize a virtualenv:
1. $ pyenv install 3.7.4
2. $ pyenv virtualenv 3.7.4 your_venv_name
3. $ pyenv activate your_venv_name
4. $ pyenv version
Code 2.10
Clean code
As explained in Chapter 1, Python 101, Python is a bit unique in that its syntax uses indentation instead of explicit markers for each block's beginning and end. That said, you need proper tools to format and check your code so it is always consistent. For instance, you should not have indentation blocks where one starts with four spaces (Figure 2.1, lines 16-18) and another uses three spaces (Figure 2.1, line 20), as shown in the following figure:
Figure 2.1: Example of lines indentation
Notice we used inconsistent indentation and different ways of writing function docstrings. This is just one example of wrong syntax; there can be more when going deeper into the code. To make your life easier as a Python developer and to become a good coder, you should install a few tools in your local system which will help you by working in the background, auto-formatting wrong code.
There are plenty of these tools but these are my recommendations:
flake817
pylint18
black19
precommit20
Flake8
Install the preceding tools using pip and use them to control your code. Suppose we have a project with terrible syntax, like the following example, that we would like to clean. Let us see how to do it on a few levels to understand which works better.
1. # -*- coding: utf-8 -*-
2. from pprint import pprint
3. def łąka(p):
4.     pprint(p)
5.
6. łąka("some message")
Code 2.20
When you run flake8, you will immediately see what is wrong with your file. Please notice that you can run the flake8 command on an individual file or on the folder containing the entire project. Let us check in the following example how to execute the flake8 command.
1. $ flake8 my_app.py
Code 2.21
We guess it is obvious that running the flake8 command only makes sense on Python files; other files can either confuse flake8 or give you very misleading results.
1. my_app.py:7:1: E305 expected 2 blank lines after class or function definition, found 1
2. my_app.py:8:1: W391 blank line at end of file
Code 2.22
Our tool is very clear in indicating where the problems are in our source file. Once you fix them, you can re-run flake8 until all is clear. There are, however, cases where you must use some syntax that flake8 should not analyze, and there are a few ways to handle that. One is to restrict flake8 to a chosen set of checks with --select, as in the following example.
1. $ flake8 --select E113,W505 my_app.py
Code 2.23
To extend the predefined list of ignored errors rather than overwrite it, we can run the flake8 command with --extend-ignore, as in the following example:
1. $ flake8 --extend-ignore E113,W505 my_app.py
Code 2.24
Including or excluding what we want to validate as command line parameters can be a terrible idea, especially if you want to repeat the same presets for multiple projects and share them with others. That is why flake8 supports saving presets in a config file. The configuration21 can be saved in:
the top-level user directory
the project directory
The supported formats are setup.cfg, tox.ini22, or .flake823. Flake8 reads its config from files using the Python config parser module24.
1. [flake8]
2. ignore = D203
3. exclude =
4.     .git,
5.     __pycache__
6. max-complexity = 10
7. max-line-length = 120
Code 2.24
Pylint
Pylint is comparable to flake8 and helps keep coding standards based on pep825. It has many more cool features, like detecting errors and repetitions of blocks of code that can lead to antipatterns. Pylint can also help refactor code and draw UML diagrams representing your code26. We execute pylint on our example code as follows.
1. $ pylint my_app.py
Code 2.25
The output of running pylint is shown in the following example:
1. ************* Module my_app
2. my_app.py:8:0: C0305: Trailing newlines (trailing-newlines)
3. my_app.py:1:0: C0114: Missing module docstring (missing-module-docstring)
4. my_app.py:4:0: C0116: Missing function or method docstring (missing-function-docstring)
5. my_app.py:4:11: C0103: Argument name "p" doesn't conform to snake_case naming style (invalid-name)
6. my_app.py:4:0: C2401: Function name "łąka" contains a non-ASCII character, consider renaming it. (non-ascii-name)
7.
8. ---
9.
10. Your code has been rated at 0.00/10
Code 2.26
As you can notice in the above output, pylint dives much deeper into the code when analyzing source files. It not only checks the syntax recommended by pep8, but also does shallow security checks and code analysis (lines 4-6), and at the end (line 10), it prints an overall score for your code quality.
Pylint's analysis is very valuable because the tool inspects your code statically and catches many potential issues. For example, it checks for circular references, unused variables, unused imports, division by zero, and so on.
IDE
Since we agreed that for this course we would be using VSC27 for coding and managing projects, you can also integrate the tools we introduced, to improve the quality of your code.
Figure 2.3: Visual Studio Code status bar with many useful information for everyday development
After installing flake8, the IDE will check your syntax as you type and highlight all possible problems.
Pre-commit
If you have worked with any source control system before, you know Git28. This distributed open-source version control system supports many plugins, git flow29, and pre- and post-commit systems called hooks. The idea behind hooks is that you can automatically run shell scripts or even entire standalone applications before or after the actual commit.
Figure 2.5: Example of life of code cycle with commits and branching out when pre-commit script is
in use
The preceding figure shows the life cycle of hooks being executed upon each commit. As you can see, we can execute a script each time you commit code to your local branch (which can also be remote). Why did we mention git hooks? Because you can use them to auto-analyze the quality of your code.
1. ~/work/fun-with-python/ ll .git
2. total 72
3. 18B Nov 19 15:55 COMMIT_EDITMSG
4. 96B Dec 9 23:01 FETCH_HEAD
5. 21B Nov 19 15:55 HEAD
6. 41B Nov 22 17:45 ORIG_HEAD
7. 310B Oct 16 20:46 config
8. 73B Oct 16 20:46 description
9. 372B Oct 29 22:05 fork-settings
10. 480B Oct 16 20:46 hooks
11. 1.9K Nov 22 21:58 index
12. 96B Oct 16 20:46 info
13. 128B Oct 16 20:46 logs
14. 1.6K Dec 9 23:01 objects
15. 112B Oct 16 20:46 packed-refs
16. 160B Oct 16 20:46 refs
Code 2.27
We can see in the code example 2.27 that when we list the content of the .git directory in our project, there is a sub-folder called hooks. Let us take a look at what is inside in the following example. You can quickly notice that this subfolder has many action files (hooks) that the git version control system calls depending on the action you are performing – for instance, the pre-commit file is always called upon every single commit that you perform. That happens regardless of where the action takes place. What we mean by this is: we can perform the commit action (committing code changes to the git repository) from the CLI or a GUI client – either way, the git subsystem will execute the pre-commit script30.
1. $ ~/work/fun-with-python/ ll .git/hooks
2. total 120
3. 478B Oct 16 20:46 applypatch-msg.sample
4. 896B Oct 16 20:46 commit-msg.sample
5. 4.6K Oct 16 20:46 fsmonitor-watchman.sample
6. 189B Oct 16 20:46 post-update.sample
7. 424B Oct 16 20:46 pre-applypatch.sample
8. 1.6K Oct 16 20:46 pre-commit.sample
9. 416B Oct 16 20:46 pre-merge-commit.sample
10. 1.3K Oct 16 20:46 pre-push.sample
11. 4.8K Oct 16 20:46 pre-rebase.sample
12. 544B Oct 16 20:46 pre-receive.sample
13. 1.5K Oct 16 20:46 prepare-commit-msg.sample
14. 2.7K Oct 16 20:46 push-to-checkout.sample
15. 3.6K Oct 16 20:46 update.sample
Code 2.28
As you can see in the preceding example, we can create a file called pre-commit and make it execute pylint on all the files in the repository. Such a script looks like the following example:
1. #!/bin/sh
2.
3. set -e
4.
5. pylint --rcfile=./config.rc 2>&1
Code 2.29
In line 3 (set -e), we tell the operating system shell (that is, bash) that if pylint detects any issues in line 5, the pre-commit script stops immediately, and as a side effect, Git is stopped from committing the changes into the code repository.
We additionally passed pylint an option pointing at a config file. It is fully optional, but you can keep it and adjust the config file to your needs: in it you can list all the errors or warning messages that you want ignored during pylint checks, as sketched below.
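As a minimal sketch (the specific message names are illustrative assumptions), such a config.rc could disable selected pylint messages like this:

[MESSAGES CONTROL]
disable = missing-module-docstring,
          invalid-name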
This solution has a few challenges:
Pylint will analyze all files – for big projects, the execution time of such a script can make every commit very inefficient.
Analyzing all files will also include non-Python files.
Not having pylint installed will also crash the commit process, which is a problem when someone only wants to edit a non-Python file but must still install Python with pylint.
Let us try to improve the same script by adding a few additional checks so it
can address the preceding issues and keep the same quality of code checks:
1. #!/bin/sh
2.
3. set -e
4.
5. FILES_CHANGED=$(git diff --name-only --diff-filter=ACM origin/main | grep "\.py" || true)
6.
7. if [ -n "$FILES_CHANGED" ]; then
8.     echo $FILES_CHANGED | xargs pylint --rcfile=./config.rc 2>&1
9. fi
Code 2.30
As we can notice in Code 2.30, we added line 5 (compared to Code 2.29), where we check, using a git command, which files were changed – that is, the files that the current git commit added or updated in the working repository. Having this list of files, we can run pylint against just these, which:
Limits the number of files that pylint must analyze.
Gives better performance.
There is one catch – please pay attention to the grep expression (Code 2.30, line 5). We try to find all files that end with the .py extension in any of the current working folders – for example some_directory/my_file.py. That is the base concept, but it has a logical bug: the regular expression matches any string that contains .py, which means it will also catch files like file.py_some_name.txt, as well as .pyc, .pyi, and similar files. We can now see where the logical bug is. How do we fix it? We update Code 2.30 with the following fix:
1. #!/bin/sh
2.
3. set -e
4.
5. FILES_CHANGED=$(git diff --name-only --diff-filter=ACM origin/main | grep -E "\.py$" || true)
6.
7. if [ -n "$FILES_CHANGED" ]; then
8.     echo $FILES_CHANGED | xargs pylint --rcfile=./config.rc 2>&1
9. fi
Code 2.31
What if you want to run many code validators or similar checks, but split these actions across multiple scripts, where not all of them are shell scripts? It can be done by calling subscripts from the main pre-commit script, but that can be hard to track. The Python community always comes to the rescue with its massive library of projects and solutions – in this case, a project called pre-commit31. It helps organize code checkers and cleaners.
We install this module by simply running the following command.
1. $ pip install -U pre-commit
Code 2.32
Once we have it installed among our Python libraries, we need to install it as a git pre-commit hook. Let us do this by running the command below.
1. $ pre-commit install
Code 2.33
The module will inject itself into the standard git pre-commit hook, giving you more superpowers, which you can use by configuring the pre-commit config.
Let us assume we have a project where we installed the pre-commit module and have the pylint script (as shown in Code 2.31) installed under config/pylint.sh, plus a similar one installed under config/flake8.sh that does the same thing but executes flake8 instead. Let us check what the pre-commit configuration file looks like in the following example:
1. default_stages: [commit, push]
2. repos:
3.
4.   - repo: https://github.com/pre-commit/pre-commit-hooks
5.     rev: v4.3.0
6.     hooks:
7.       - id: trailing-whitespace
8.       - id: end-of-file-fixer
9.       - id: check-json
10.       - id: check-yaml
11.       - id: debug-statements
12.       - id: check-merge-conflict
13.       - id: detect-private-key
14.       - id: end-of-file-fixer
15.       - id: pretty-format-json
16.         args: [--autofix]
17.       - id: no-commit-to-branch
18.         args: [--branch, master]
19.   - repo: https://github.com/ambv/black
20.     rev: 22.10.0
21.     hooks:
22.       - id: black
23.         args: [--line-length=120]
24.   - repo: local
25.     hooks:
26.       - id: pylint
27.         name: Pylint
28.         stages: [push]
29.         description: Run pylint
30.         entry: ./config/pylint.sh
31.         language: script
32.         types: [python]
33.         pass_filenames: false
34.       - id: flake8
35.         name: Check flake8
36.         stages: [push]
37.         description: Run flake8
38.         entry: ./config/flake8.sh
39.         language: script
40.         args: [local]
41.         types: [python]
42.         pass_filenames: false
Code 2.34
This config file, .pre-commit-config.yaml, is located in the root of your project folder. To learn all the configuration parameters and their meanings, you can check the pre-commit project website.
We have a hooks section: this is where you define the scripts you want to run. You can also name these scripts, so when the pre-commit run starts, you know exactly when each script gets executed.
In lines 19-23, we intentionally added the black32 module and configured how long a line it should allow when auto-formatting code. black is a fantastic code formatter that can automagically fix most of the styling issues mentioned in Chapter 1, Python 101.
Additionally, we enabled a few other very helpful pre-commit features via the config. Check lines 6-18; some of these can even auto-format JSON and YAML files, which, as most of us know, are so painful to read when badly formatted.
1. trim trailing whitespace..................................Passed
2. fix end of files..........................................Passed
3. check json............................(no files to check)Skipped
4. check yaml............................(no files to check)Skipped
5. debug statements (python).................................Passed
6. check for merge conflicts.................................Passed
7. detect private key........................................Passed
8. fix end of files..........................................Passed
9. pretty format json....................(no files to check)Skipped
10. don't commit to branch....................................Passed
11. black.....................................................Passed
12. Run pylint................................................Passed
13. Check flake8..............................................Passed
14. [bugfix/some-branch-name ee9cdc99] test commit
15. 1 file changed, 1 insertion(+)
Code 2.35
Build your own library
So far in this chapter, we have been using libraries installed with the pip tool. In this subchapter, we will learn how to build our own library and publish it to the pip repository.
Before building our own library, we need to create a file structure like in the following figure:
Figure 2.6: Example of directory and file structure for our example library
The main thing to notice is that we created the package simple_calculator as we learned in the previous chapter. Our simple calculator module (my_calculator.py) looks like the following:
1. class MyCalculator:
2.
3.     def __init__(self, x):
4.         self.x = x
5.
6.     def add_me(self, y):
7.         return self.x + y
8.
9.     def substract_me(self, y):
10.         return self.x - y
11.
12.     def divide_me(self, y):
13.         if y != 0:
14.             return self.x / float(y)
15.         raise Exception('Hey hey, 2nd argument can not be 0')
16.
17.     def multiple_me(self, y):
18.         return self.x * y
Code 2.36
We created a very simple calculator with a few helper methods. The class constructor takes one argument, which is used in the calculator operations. Let us update the main.py file to demonstrate how to use the add_me method. Let us analyze the following example:
1. from simple_calculator.my_calculator import MyCalculator
2.
3. m = MyCalculator(5)
4. print("result", m.add_me(5))
Code 2.37
We now have a first example using the newly created calculator module (Code 2.37, line 1). To make this cleaner, let us update the __init__.py file with the following code:
1. from .my_calculator import MyCalculator
Code 2.38
This kind of import makes coding cleaner and defines the publicly accessible
modules of our package in a much stricter way. With this change
we must also update the main.py file as in the following example:
1. from simple_calculator import MyCalculator
2.
3. m = MyCalculator(5)
4. print("result", m.add_me(5))
Code 2.39
The next step is to add requirements that are going to be installed as part of
our module. To do so, we have to update the requirements.txt file as in the
following example.
1. pytest~=7.4
Code 2.40
We will install the pytest33 module as part of our package. Packaging standards
do not strictly require this, albeit it is a highly
recommended practice to include tests in our module so the community can
accept it more easily when reviewing it.
To organize tests in a clean way, we need to keep all of them in one single
folder called tests. Let us create such a folder with files like those listed in the
following figure:
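While the exact layout is shown in the figure, here is a minimal sketch of what one of those test files could contain; the file name tests/test_my_calculator.py and both test cases are illustrative assumptions, not part of the original listing:
# tests/test_my_calculator.py (hypothetical file name)
import pytest

from simple_calculator import MyCalculator


def test_add_me():
    assert MyCalculator(5).add_me(3) == 8


def test_divide_me_rejects_zero():
    with pytest.raises(Exception):
        MyCalculator(5).divide_me(0)
Running pytest from the project root should then discover and execute both tests.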
Conclusion
In this chapter we learned the fundamental principles of organizing a Python
environment to make it as efficient as possible. Next, we went deeper into the
technicalities of running a Python project on the most popular operating
systems. We also learned how to prepare our work as a custom library which
can be shared with other developers and become part of something big.
In the next chapter, we are going to build our first small yet powerful Python
project. We will also design and build a real working example, where you
develop step by step something that actually works and shows you how much
fun you can have with Python.
1. https://git-scm.com
2. Virtualenv - https://docs.python.org/3/tutorial/venv.html
3. https://www.python.org/doc/sunset-python-2/
4. Pyenv - https://github.com/pyenv/pyenv
5. https://www.python.org/downloads/windows/
6. https://learn.microsoft.com/en-us/powershell/scripting/install/installing-powershell-on-windows?view=powershell-7.3
7. Pyenv for Windows - https://pyenv-win.github.io/pyenv-win/
8. Native CLI for Windows - https://cmder.app
9. Virtualenvwrapper - https://virtualenvwrapper.readthedocs.io/en/latest/
10. Microsoft blog - https://devblogs.microsoft.com/commandline/bash-on-ubuntu-on-windows-download-now
11. https://matplotlib.org
12. https://numpy.org
13. https://pandas.pydata.org
14. Python package manager - https://pypi.org/project/pip
15. https://docs.python.org/3/distutils/setupscript.html
16. Install pip-tools - https://pypi.org/project/pip-tools/
17. Checking code syntax - https://flake8.pycqa.org/en/latest/
18. Checking code syntax - https://pylint.org
19. Auto formatter - https://pypi.org/project/black/
20. Amazing git pre-commit hooks - https://pre-commit.com
21. https://flake8.pycqa.org/en/3.9.2/user/options.html#options-list
22. We will talk about tox in the next pages.
23. Please notice the dot as prefix for the filename.
24. https://docs.python.org/3/library/configparser.html
25. https://pep8.org
26. https://pylint.pycqa.org/en/latest/pyreverse.html
27. Visual Studio Code - https://code.visualstudio.com
28. https://git-scm.com
29. https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow
30. https://git-scm.com/book/en/v2/Customizing-Git-Git-Hooks
31. https://pre-commit.com
32. https://pypi.org/project/black/
33. https://docs.pytest.org/en/7.4.x/
34. https://setuptools.pypa.io/en/latest/
35. https://pythonwheels.com
Join our book’s Discord space
Join the book's Discord Workspace for Latest updates, Offers, Tech
happenings around the world, New Release and Sessions with the Authors:
https://discord.bpbonline.com
CHAPTER 3
Designing a Conversational
Chatbot
Introduction
In the previous chapters, we have learned many things about Python, its
syntax, development tools, and how to control and deliver the best code with
its finest quality. At the same time, we learned how to integrate Python with
very useful tools that can help us as a developer to deliver code easier and
more efficient. In this chapter, we will start learning our web service project
with some basics and fundaments and then we will be moving towards more
complex topics.
Structure
In this chapter, we will discuss the following topics:
Client-server architecture
Chatbot basics
Training
Chat
Application
Frontend
Objectives
By the end of this chapter, you will have learned the fundamentals of client-
server applications and how to write such an application using HTTP standards.
When you are finished with this chapter, you will know how to build a client-
server application, use Python to expose it as a browser-based HTTP
service, and write an asynchronous web service.
Client-server architecture
We will use the most popular architecture in the web world, called client-
server. As shown in the following figure, the client is always sending
requests for resources, assets, or pages, or asking the server to perform a
specific task, and waits for the server to finish. Once the server completes
processing the request, it responds back to the client with the result.
By knowing how headers, the path, and the request method work2, we can
start building a simple web service. A request (the question the user asked)
will be sent to the server, and the server, based on the request, will respond back.
Let us try to build such an example by using the Twisted3 framework:
1. from twisted.web import server, resource
2. from twisted.internet import reactor
3.
4.
5. class MyServer(resource.Resource):
6. isLeaf = True
7.
8. def render_GET(self, request):
9. uri_path = request.uri.decode('utf-8')
10. print(f"Received request for '{uri_path}'")
11. return f"Hello, world! {uri_path}".encode('utf8')
12.
13. service = server.Site(MyServer())
14. reactor.listenTCP(8083, service)
15. reactor.run()
Code 3.1
Using the Twisted framework, we wrote a simple web server that works in
echo mode. That means that when we send a request to it, it will always respond
with a "Hello, world!" message, additionally containing the full request path
that we sent in the request (Code 3.2, line 2).
To see how it works in action, check the following example and notice the
test_path part. That is the requested server resource path (line 1); it is
repeated in the response message (line 2):
1. ~/ curl http://localhost:8083/test_path
2. Hello, world! /test_path
Code 3.2
You can see that the simple service can process HTTP GET calls. How about
if we want to send some parameters to it and, based on those, prepare
responses that depend on the given values? Since this is a GET call, we can
process parameters in two ways:
By processing the request resource path.
By reading and checking the value of the query string.
Let us modify the main class MyServer based on the preceding
requirements so that we can see how to process incoming requests in
Twisted and parse query parameters. Let us check the following example
called server_2.py:
1. from urllib import parse
2. from twisted.web import server, resource
3. from twisted.internet import reactor
4.
5. """
6. Web server with support for resource path and query string
7. """
8.
9. class MyServer(resource.Resource):
10. isLeaf = True
11.
12. def main_view(self):
13. return "main view"
14.
15. def hello_view(self, **kwargs):
16. if kwargs and kwargs.get('a') and kwargs.get('b'):
17. total = int(kwargs['a']) + int(kwargs['b'])
18. return f"Total sum: {total}"
19. return 'hello to you too'
20.
21. def convert_query_string(self, resource):
22. """Convert query strin to Python dictionary»»»
23. parsed_data = parse.urlparse(resource).query
24. return dict(parse.parse_qsl(parsed_data, keep_blank_values=True))
25.
26. def path_finder(self, request):
27. resource = request.uri.decode('utf-8')
28. query_kwargs = self.convert_query_string(resource)
29. parsed_data = parse.urlparse(resource)
30. resource_path = parsed_data.path
31. result = f"Sorry do not know you {resource}"
32.
33. if resource_path == '/':
34. result = self.main_view()
35. elif resource_path == '/hello':
36. result = self.hello_view(**query_kwargs)
37.
38. return result
39.
40. def render_GET(self, request):
41. output = self.path_finder(request)
42. if output:
43. return output.encode('utf8')
44. return b"Something went wrong"
45.
46. if __name__ == '__main__':
47. service = server.Site(MyServer())
48. reactor.listenTCP(8083, service)
49. reactor.run()
Code 3.3
You can see that we put the decision point that selects the sub-method based
on the resource path in lines 33-36. Note that we are using the Python urllib4
module at the beginning, so do not forget to import:
1. from urllib import parse
Code 3.4
To be able to extract the query string5 from the full URI and convert it to a
standard Python dictionary, we created the method convert_query_string (Code
3.3, lines 21-24). Notice that it will preserve blank values in the query string,
that is, value=. In this case, passing a query string like, for instance,
?q1=some-value&q2= is going to be converted to a dictionary like
{"q1": "some-value", "q2": ""}.
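To see this conversion in isolation, we can run the same standard-library calls directly; the sample path below is arbitrary:
from urllib import parse

qs = parse.urlparse('/hello?q1=some-value&q2=').query  # 'q1=some-value&q2='
print(dict(parse.parse_qsl(qs, keep_blank_values=True)))
# prints: {'q1': 'some-value', 'q2': ''}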
This approach will be useful at a later stage of this chapter, when we will have
to identify parameters given in the URL query string even when their values
are empty.
To demonstrate how we can use our code in different use cases, please
check the following example:
1. ~/ curl "http://localhost:8083/"
2. main view
3.
4. ~/ curl "http://localhost:8083/hello"
5. hello to you too
6.
7. ~/ curl "http://localhost:8083/hello?a=4&b=3"
8. Total sum: 7
9.
10. ~/ curl "http://localhost:8083/sd"
11. Sorry do not know you /sd
Code 3.5
We managed to cover a few cases with the above examples. The program is
naïve and does not cover corner cases: we only check whether the arguments
are present (Code 3.3, lines 16-17). If they are not numbers, the int() conversion
will lead to a crash. That is expected, but we can easily fix this by improving
those lines as in the following example file, server_2.1.py:
1. from twisted.internet import reactor
2. from twisted.web import server
3.
4. from server_2 import MyServer
5.
6.
7. class MyServer2(MyServer):
8.
9. def hello_view(self, **kwargs):
10. if kwargs and kwargs.get('a') and kwargs.get('b'):
11. try:
12. total = int(kwargs['a']) + int(kwargs['b'])
13. return f"Total sum: {total}"
14. except ValueError:
15. return "One of the arguments is not a number"
16. return 'hello to you too'
17.
18.
19. def start_service():
20. service = server.Site(MyServer2())
21. reactor.listenTCP(8083, service)
22. reactor.run()
23.
24.
25. if __name__ == '__main__':
26. start_service()
Code 3.6
We packed the server startup part (lines 19-22) into an individual function,
which we can reuse later without rewriting the code. Additionally, we also
managed to share the existing code from the previous example: we imported it
(Code 3.3 being imported in Code 3.6, line 4) and used inheritance (line 7),
overriding only the method that we wanted to change (fix).
This technique allows you to reuse code and is a proper way of writing code
in object-oriented programming. We also need to note lines 25-26 (Code 3.6),
where we check whether the script is being run directly, as in python
server_2.1.py; only then is the server started. This guard also helps with
inheritance: because the previous example performs the same check, importing
it does not accidentally start its web server.
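The guard itself is worth internalizing. Here is a minimal, self-contained illustration (demo.py is a hypothetical module name):
# demo.py (hypothetical module)
def start():
    print("starting the service...")


if __name__ == '__main__':
    # Executed for `python demo.py`, skipped for `import demo`
    start()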
Chatbot basics
A chatbot is software that interacts with humans in a chat, driven by server-side
software that is often based on artificial intelligence (AI) or expert systems.
Regardless of which underlying system we choose, we can distinguish two
main kinds of chatbots:
Rule-based service: The chatbot is taught a list of pre-defined rules and
answers only the questions covered by that list.
Self-learning chatbot: This variant is more flexible because it can learn
independently, albeit it is technologically more demanding. It requires AI or
machine learning models instead of fixed predefined rules.
Modern chatbots use AI that can apply natural language processing
in real time. Further, they can analyze language with its variants, inflections,
mistakes, or even dialects. They are powerful tools. For example, try to call
any customer support hotline. In many cases, the first line of direct help will
be a chatbot with a voice recognition system that will guide you smoothly
in analyzing and understanding your reason for calling. As a result of the
chat, if possible, an officer on call will not even be needed; or, if there
eventually is a need to connect you with a real person, that agent will be
prepared and will receive a prefilled form with all the necessary details to
help you quickly and efficiently.
We all know that chatbots can be highly sophisticated AI-driven
applications. However, we will concentrate on a simpler yet still
powerful example of a chatbot in Python. This will give us a chance to learn
how chatbots may work and how to dive into the AI world.
Training
To build a chatbot, we will use a popular library called chatterbot. It has most of
the functionalities we need, such as machine learning training models. As we
can see on its GitHub page,6 “Chatterbot is a machine-learning based
conversational dialog engine built in Python, making generating responses
based on collections of known conversations possible. The language-
independent design of chatterbot allows it to be trained to speak any
language.”
To install it, simply run the following pip command:
1. pip install chatterbot
Code 3.7
The released version has a small bug that affects our example: it will try to load
the non-existing language en (English). To fix this, we managed to fork the above
project and patch it for the needs of this chapter. To install it, we just run the
following commands; essentially, we also need the spacy module, a Python library
supporting advanced natural language processing:
1. $ pip install spacy==3.4.4 pyyaml==5.4.1
2. $ git clone git@github.com:bpbpublications/Fun-with-Python.git
3. $ cd Fun-with-Python/chapter_3/ChatterBot/
4. $ python setup.py install
Code 3.8
After installing the above modules, we must download the Spacy language
pack. Run the following command:
1. python -m spacy download en
Code 3.9
Once it is installed, we will have to build the training tool that helps our chatbot
learn basic phrases and sentences it can expect in a conversation
with a user. To build such a tool, we are going to use the
chatterbot language training corpus7. To use that library most conveniently,
we will build a command line script that can be run at any time without the
need for a server.
First, we must create a folder which is going to contain config files with
basic phrases that we want to teach our chatbot:
1. mkdir -p ~/chatterbot_corpus/data/english
2. vim ~/chatterbot_corpus/data/english/conversations.yml
Code 3.10
As the content of the conversations file, let us use something simple like the
following; this is going to be needed for basic chatbot training. At a later stage
we can build a more complex example:
1. categories:
2.
3. - conversations
4. conversations:
5. - - Good morning, how are you?
6. - I am doing well, how about you?
7. - I'm also good.
8. - That's good to hear.
9. - Yes it is.
10. - - Hello
11. - Hi
12. - How are you doing?
13. - I am doing well.
14. - That is good to hear
15. - Yes it is.
16. - Can I help you with anything?
17. - Yes, I have a question.
18. - What is your question?
19. - Could I borrow a cup of sugar?
20. - I'm sorry, but I don't have any.
21. - Thank you anyway
22. - No problem
23. - - How are you doing?
24. - I am doing well, how about you?
25. - I am also good.
26. - That's good.
27. - - Have you heard the news?
28. - What good news?
29. - - What is your favorite book?
30. - I can't read.
31. - So what's your favorite color?
32. - Blue
Code 3.11
We generated the above example conversation file. You can notice that the
first level of indentation in the yaml file identifies questions that a user may
potentially ask. The second layer of indentation (i.e., lines
30-32) holds the responses that the chatbot can reply with. Let us see how to
use it. First, let us create a file called chatbot.py:
1. from chatterbot import ChatBot
2.
3.
4. def chatbot():
5. return ChatBot('Trainer')
Code 3.12
The chatbot instance is wrapped in a separate function that we can call
from any point of our code base, and it will always initialize the same kind of
chatbot instance. Now, create a file called trainer.py:
1. from chatbot import chatbot
2. from chatterbot.trainers import ChatterBotCorpusTrainer
3.
4. _chatbot = chatbot()
5. trainer = ChatterBotCorpusTrainer(_chatbot)
6. trainer.train("chatterbot.corpus.english")
Code 3.13
Execute the above example with python trainer.py. You will notice that a file
called db.sqlite3 has been created in the same directory as your script. This
is the sqlite database that contains all the verbs, words, phrases, and so on that
our chatbot managed to learn while running the training script.
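Out of curiosity, the freshly created database can be inspected with the standard sqlite3 module. The sketch below only lists table names, since the exact schema depends on the chatterbot version:
import sqlite3

conn = sqlite3.connect("db.sqlite3")
# Print the tables chatterbot created while training
for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type='table'"):
    print(name)
conn.close()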
Chat
So far, we have analyzed and learned how to use Python modules to train a chat
model and how to write the interactive scenarios that we are going to use to
interact with the user. Now, we will learn how to use these trained materials to
perform a simple chat.
1. pip install ipython
Code 3.14
Install ipython; it will be useful for testing our model and conversational
scenarios. Once we have it, we can test a basic conversation:
1. In [1]: from chatbot import chatbot
2.
3. In [2]: c=chatbot()
4.
5. In [3]: c.get_response('hi')
6. Out[3]: <Statement text:How are you doing?>
7.
8. In [4]: print(c.get_response('hi'))
9. How are you doing?
10.
11. In [5]: print(c.get_response('how are you?'))
12. I am doing well.
Code 3.15
We imported the chatbot function from the previously created file chatbot.py and
initialized it in line 3. Next, you can see in the following lines that we are using
the method get_response. This is the entry point of the chatbot API, where we can
ask a question and get the response. You should be able to notice that the
responses given are based on the yaml config file that we created before.
Application
In the previous sections about building a simple web server in client-server
architecture, we learned the basics of a web server and how to
serve resources based on the path and query parameters. We used the Twisted
framework for the beginning examples (Code 3.1 - 3.6) to show you
more details of HTTP and how you can process and analyze HTTP
requests. For building the web application that hosts our chatbot, we are not
going to use Twisted, since it is too low-level in our opinion, especially if we
compare it with other web-service-oriented frameworks. Our choice is
Django.8
First let us install it. As usual, to install new package we will use pip:
1. pip install django==3.2.16
Code 3.16
At the time of writing this chapter, there are two main long-term support
(LTS) lines of the Django project: 3.2 and 4.2.
Version 4.2 is under ongoing active development and is actively getting new
features, enhancements, and bug fixes. For the purposes of this book, we are
going to stay with the recommended stable 3.2 version.
We will start our project, which we will call chatter for an easier naming
convention. To do so, we will use the Django commands:
1. django-admin startproject chatter
Code 3.17
Once the project is properly initialized, we should see a file structure like
in the following example:
1. .
2. ├── chatter
3. │ ├── __init__.py
4. │ ├── asgi.py
5. │ ├── settings.py
6. │ ├── urls.py
7. │ └── wsgi.py
8. ├── db.sqlite3
9. └── manage.py
Code 3.18
django-admin automatically created the main file for managing your project,
called manage.py. We will use it at many stages, such as running the service,
DB migrations, translations, and much more that is beyond the scope of
this mini project. First, let us start our web server:
1. $ python manage.py runserver
2.
3. Watching for file changes with StatReloader
4. Performing system checks...
5.
6. System check identified no issues (0 silenced).
7.
8. You have 18 unapplied migration(s). Your project may not work
properly until you apply the migrations for app(s): admin,
auth, contenttypes, sessions.
9. Run 'python manage.py migrate' to apply them.
10. December 30, 2022 - 09:44:56
11. Django version 3.2.16, using settings 'chatter.settings'
12. Starting development server at http://127.0.0.1:8000/
13. Quit the server with CONTROL-C.
Code 3.19
Once the application is running, you should be able to access the app server,
as shown in the following figure. Welcome to the Django web service!
Figure 3.3: Main hello world screen accessible once the webserver is started.
We created a Django project called chatter. Now, it is time to create the
actual application. What is the difference between a project and an app?
A project is a group of applications; you can think of it as a web server
that can host many applications. An app is an application which does
something particular; for instance, our chatbot can be such an app, and a
chatbot admin dashboard would be another.
So, let us create an app called chat. Before that, we need to apply the database
migrations that Django asked us to run. Database migrations9 control
the history of changes of database tables, triggers, indexes,
columns, data types, and so on. In one simple sentence: migrations allow us to
control the history of any changes we may want to apply to the database.
Instead of manually comparing, as developers, what kind of tables
we currently have in the database, what types of columns, and so
on, we can use a very simple yet powerful engine for managing and
tracking database changes.
Migrations are part of the Django framework. Any Django model change
can be tracked by the migrations system, and it will always be reflected in the
database schema. It is always possible to apply a migration forward or to roll
back any changes that we do not want in the database. Concluding, let us apply
those mentioned migrations in the following example:
1. $ python manage.py migrate
2.
3. Operations to perform:
4. Apply all migrations: admin, auth, contenttypes, sessions
5. Running migrations:
6. Applying contenttypes.0001_initial... OK
7. Applying auth.0001_initial... OK
8. Applying admin.0001_initial... OK
9. Applying admin.0002_logentry_remove_auto_add... OK
10. Applying admin.0003_logentry_add_action_flag_choices... OK
11. Applying contenttypes.0002_remove_content_type_name... OK
12. Applying auth.0002_alter_permission_name_max_length... OK
13. Applying auth.0003_alter_user_email_max_length... OK
14. Applying auth.0004_alter_user_username_opts... OK
15. Applying auth.0005_alter_user_last_login_null... OK
16. Applying auth.0006_require_contenttypes_0002... OK
17. Applying auth.0007_alter_validators_add_error_messages... OK
18. Applying auth.0008_alter_user_username_max_length... OK
19. Applying auth.0009_alter_user_last_name_max_length... OK
20. Applying auth.0010_alter_group_name_max_length... OK
21. Applying auth.0011_update_proxy_permissions... OK
22. Applying auth.0012_alter_user_first_name_max_length... OK
23. Applying sessions.0001_initial... OK
Code 3.20
Just a small explanation of what is happening above. The command in
line 1 triggered the migrations. Line 4 is where Django shows you which
applications' migrations are being applied. Once we are finished with running
migrations, we can finally create our first Django application:
1. $ python manage.py startapp chat
Code 3.21
The app created via the above command will also automatically get its folders
and file structure created by Django. Next, we must create a folder to store the
templates for our application. From the main folder where the manage.py file is
located, execute the following command:
1. mkdir -p chat/templates/chat
Code 3.22
In the same directory where we have our manage.py file (to simplify the
description, we will call this location the root folder), create the main
chatbot.py file:
1. from chatterbot import ChatBot
2.
3.
4. def chatbot():
5. return ChatBot(
6. 'Trainer',
7. storage_adapter='chatterbot.storage.SQLStorageAdapter',
8. database_uri='sqlite:///chatbot.sqlite3'
9. )
Code 3.23
We created the same file as in the training example, but in this case, we
explicitly told the ChatBot constructor what kind of data storage we want to use
(SQLite) and where it is located (database_uri).
For training, we will use trainer.py in root folder with the following
content:
1. from chatbot import chatbot
2. from chatterbot.trainers import ChatterBotCorpusTrainer
3.
4.
5. _chatbot = chatbot()
6. trainer = ChatterBotCorpusTrainer(_chatbot)
7. trainer.train("chatterbot.corpus.english")
Code 3.24
The remaining config shown in the training section stays the same. Now, it
is time to inform our Django project about our new app. We are going to
update chatter/settings.py in the root folder and extend the applications list:
1. INSTALLED_APPS = [
2. 'django.contrib.admin',
3. 'chat',
4. 'django.contrib.auth',
5. 'django.contrib.contenttypes',
6. 'django.contrib.sessions',
7. 'django.contrib.messages',
8. 'django.contrib.staticfiles'
9. ]
Code 3.25
Django should now be able to see our app, but before we can start using it,
we have to fix the routing. Edit chatter/urls.py in the root folder so it looks like
the following:
1. """chatter URL Configuration"""
2. from django.contrib import admin
3. from django.urls import path, include
4. from chat.views import main_view
5.
6. urlpatterns = [
7. path('admin/', admin.site.urls),
8. path('chat/', include('chat.urls')),
9. path('', main_view, name='main_view'),
10. ]
Code 3.26
Line 7 points to the admin URLs definition. You can open the admin page at
http://localhost:8000/admin/ and log in with a username and password, which
must first be created by executing the following command:
1. $ python manage.py createsuperuser
2.
3. Username (leave blank to use 'foo'): admin
4. Email address: [email protected]
5. Password:
6. Password (again):
7. The password is too similar to the username.
8. This password is too short. It must contain at least 8 characters.
9. This password is too common.
10. Bypass password validation and create user anyway? [y/N]: y
11. Superuser created successfully.
Code 3.27
When the admin account is created, you can access the admin with the credentials
you used above. You can use it later as an extension of this chapter.
Figure 3.4: Admin section available when running example hello world app
The admin is accessible and working. It is time to create the main view for the
URL path defined in line 9 (urls.py). Please notice that the mentioned view
(Code 3.28) is imported in line 4, so we create that view in the file chat/views.py:
1. from django.http import HttpResponse
2.
3.
4. def main_view(request):
5. return HttpResponse("hello world")
Code 3.28
After accessing the main page http://localhost:8000/ you should see the hello
world message that we return in line 5.
Frontend
In the previous sections we concentrated on the server side of our chatbot
application. This section will mainly focus on the frontend side, the part of
the app you access in the web browser.
First let us make sure that our chat app can render HTML properly. We will
update the main template file chat/templates/chat/main.html with this
simple content:
1. <!doctype html>
2. <html lang="en" class="scroll-smooth">
3. <head>
4. <meta charset="utf-8">
5. <meta name="viewport" content="width=device-width, initial-
scale=1">
6. <title>Chatbot demo</title>
7. </head>
8. <body>
9. <p>hello world!</p>
10. </body>
11. </html>
Code 3.34
This is a trivial hello world HTML example. To display it, we must update the
main controller in the chat application. Edit the main view file chat/views.py
and update the function as in the following example:
1. from django.shortcuts import render
2.
3.
4. def main_view(request):
5. context = {}
6. return render(request, 'chat/main.html', context)
Code 3.35
You can see in line 6 that we return the HTML generated from the template file.
This simple render function takes three arguments:
The first is the request object10,
The second is the path to the template file, and
The last is a dictionary with all the data structures for the template.
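For example, anything we put into that dictionary becomes available inside the template. In the sketch below, bot_name is a hypothetical context key that main.html could render as {{ bot_name }}:
from django.shortcuts import render


def main_view(request):
    # 'bot_name' is a hypothetical key, used only for illustration
    context = {"bot_name": "Trainer"}
    return render(request, 'chat/main.html', context)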
Our frontend will display a small box with all the chat history and a vertical
scrollbar if there are more messages than the box can display. At the
bottom of the box, we will add an input field where the user can write a message,
and a button to send it. Overall, it is going to look like the following figure:
Conclusion
In this chapter, we learned how to use simple AI models with Python and
how to train them to be able to build a chatbot. We also learned how to build a
client-server application, which we managed to convert into a web
application with the use of interactive JavaScript and HTML.
In the next chapter, we will learn how to use Python for analyzing and
managing our home expenses. After reading it, we will see how
we can use Python to predict future expenses and manage our home budget
based on our income and how much money we spend.
1. https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol
2. https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods
3. https://twisted.org
4. https://docs.python.org/3/library/urllib.html
5. https://en.wikipedia.org/wiki/Query_string
6. https://github.com/gunthercox/ChatterBot
7. https://github.com/gunthercox/chatterbot-corpus
8. https://www.djangoproject.com
9. https://docs.djangoproject.com/en/4.2/topics/migrations/
10. https://docs.djangoproject.com/en/4.1/ref/request-response/
11. https://getbootstrap.com
12. https://docs.djangoproject.com/en/5.1/ref/csrf/
13. https://jquery.com
14. https://api.jquery.com/jQuery.post/
Introduction
It is well-established that a balanced and organized home budget is crucial.
Traditionally, we managed this manually with math. Now, programming
languages provide a more streamlined approach, automating calculations and
offering real-time insights.
Structure
In this chapter, we will discuss the following topics:
Excel
Import
Export
Analyze expenses
Estimate future expenses based on past income and spending
Building a behavior-driven estimator
Statistics
Objectives
In this chapter, you will learn how to use Python to manage your finances.
You will see how to import and export data to and from Excel files, which
can help you keep track of your income and expenses. You will also learn
how to collect and organize data in a way that allows you to calculate
estimates and analyze your home budget. By the end of this chapter, you will
have the skills to use Python as a powerful tool for financial planning and
decision making.
Excel
Excel is a popular spreadsheet application that can store, manipulate, and
analyze data in various formats. Python is a powerful programming language
that can perform various tasks with data, such as cleaning, processing, and
visualizing it. In this section, we will learn how to import and export data to
and from Excel files using Python and the CSV1 format.
Export
Let us check the following code to see how we can convert data to the CSV format.
1. import csv
2.
3. data = [
4. {"name": "icecream", "amount": 15, "comment": ""},
5. {"name": "water", "amount": 3.2, "comment": "it was hot day"},
6. {"name": "bread", "amount": 1.3, "comment": "my favorite one"},
7. ]
8.
9. output_filename = "output_file.csv"
10. headers = ("name", "amount", "comment")
11.
12. with open(output_filename, 'w') as csv_file:
13. csv_writer = csv.writer(csv_file)
14. for item in data:
15. csv_writer.writerow([item.get(key) for key in headers])
Code 4.1
After running, our code will create the output file (line 12), which is going to look
like the following example:
1. icecream,15,
2. water,3.2,it was hot day
3. bread,1.3,my favorite one
Code 4.2
What we must notice is that Code 4.1 is not efficient, because it iterates over
the data list and creates a new list for each item by calling the get method on
the item dictionary. This can cause performance issues with lots of items, as
it consumes more memory and time than necessary. The optimized version
of the same thing is going to look like the following code:
1. import csv
2.
3. data = [
4. ("icecream", 15, ""),
5. ("water", 3.2, "it was hot day"),
6. ("bread", 1.3, "my favorite one"),
7. ]
8.
9. output_filename = "output_file.csv"
10. headers = [("name", "amount", "comment")]
11.
12. with open(output_filename, "w") as csv_file:
13. csv_writer = csv.writer(csv_file)
14. csv_writer.writerows(headers)
15. csv_writer.writerows(data)
Code 4.3
Code 4.3 uses the writerows method (Code 4.3, line 15) of the
csv_writer object to write multiple rows at once to a CSV file. The first
call writes a list of headers, which are the column names for the CSV file.
The second call writes a list of data, which are the rows of values for each
column. Code 4.3 does not loop over the data, but writes it all in one go.
This requires that the data is already in a suitable format for the CSV file,
such as a list of lists or a list of tuples.
Another thing that we must address is the fact that CSV files very often
require some types of data to be wrapped in quotes. One reason
why quoting data in a CSV file makes sense is that it can prevent the comma
character, which is used as the delimiter, from being interpreted as part of the
data; for example, a value like Smith, John needs to be quoted to avoid confusing
the parser.
Another reason why quoting data in a CSV file makes sense is that it can
preserve the whitespace characters, such as spaces, tabs, or newlines, that are
part of the data.
Let us check the following example of how to achieve this:
1. import csv
2.
3. data = [
4. ("icecream", 15, ""),
5. ("water", 3.2, "it was hot day, no"),
6. ("bread", 1.3, "my favorite one"),
7. ]
8.
9. output_filename = "output_file.csv"
10. headers = [("name", "amount", "comment")]
11.
12. with open(output_filename, "w") as csv_file:
13. csv_writer = csv.writer(csv_file, delimiter=',',quotechar='"')
14. csv_writer.writerows(headers)
15. csv_writer.writerows(data)
Code 4.4
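With the comma inside the last comment field, the csv module quotes that value automatically (quoting is minimal by default), so the output file should look roughly like this:
name,amount,comment
icecream,15,
water,3.2,"it was hot day, no"
bread,1.3,my favorite one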
Another way to generate a CSV file is with pandas, using the DataFrame.to_csv
method, which takes a file name or a file object as an argument and
writes the data frame to a .csv file. For example, if we have a data frame
called df, we can write it to a .csv file. Let us install pandas with the following
command:
1. $ pip install pandas
Code 4.5
Once the module is installed, let us check the following code to see how we can
use pandas to create CSV files:
1. import pandas as pd
2. import numpy as np
3.
4. data = [
5. ("icecream", 15, ""),
6. ("water", 3.2, "it was hot day, no"),
7. ("bread", 1.3, "my favorite one"),
8. ]
9.
10. output_filename = "output_file.csv"
11. headers = ("name", "amount", "comment")
12.
13. df = pd.DataFrame(np.array(data), columns=headers)
14. df.to_csv(output_filename, index=False)
Code 4.6
Once we execute Code 4.6, we are going to get the same results (line 14) as in the
previous examples (Code 4.3 and Code 4.4).
Import
So far, we have learned how to export existing data to an external format that
can be understood by Excel. To import data back from Excel to Python, we
are going to use pandas again; this time we need the pd.read_excel() function.
Let us check the following example of how to import data from Excel. First, let
us create a spreadsheet by importing the CSV file that we created by running
Code 4.6. It should look like the following figure.
Figure 4.1: Example CSV output file imported in Excel
Now, we are going to import the data back to pandas from Excel. To do this,
we need to use the following code:
1. import pandas as pd
2.
3. # specify the path of the Excel file
4. excel_file = "example.xlsx"
5. # read the Excel file into a pandas DataFrame
6. df = pd.read_excel(excel_file)
7. print(df.head())
Code 4.7
The preceding code uses the read_excel function to read the Excel file into a
pandas DataFrame. The function takes the path of the Excel file as an
argument and returns a DataFrame object that contains the data from the
spreadsheet. We can then print the first rows of the DataFrame using the head
method to check whether the data was imported correctly.
Analyze expenses
After reading the Excel file, we can start analyzing the expenses data in the
data frame. We are going to prepare an example report that shows the total
expenses by category, with income included. Let us check in the following figure
what such an example spreadsheet is going to look like.
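As a taste of what such a report involves, pandas can compute per-category totals in a couple of lines. Note that the column names category and amount below are assumptions about the spreadsheet layout, not the book's actual file:
import pandas as pd

df = pd.read_excel("example.xlsx")
# Hypothetical columns: 'category' and 'amount'
totals = df.groupby("category")["amount"].sum()
print(totals.sort_values(ascending=False))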
We used quandl8, which helps us fetch financial data about Google and the
stock market. It is just a simple example, so we can learn how such data
can later be used to predict the future of a stock. We can see that in
line 2 we use the quandl API to fetch data about the Google stock. Once it is
fetched, we have it converted to a pandas DataFrame. Let us check what we can
do with such data.
Python scikit-learn9 is a free and open-source machine learning library that
provides a range of supervised and unsupervised learning algorithms, as well
as tools for data preprocessing, model selection, evaluation, and feature
extraction. After installing the module with Code 4.14, we can check in the
following example how to use the loaded data (Code 4.15) to prepare our
data stack for interpolation.
1. import random
2. from datetime import datetime
3.
4. import quandl, math
5. import numpy as np
6. import pandas as pd
7. import matplotlib.pyplot as plt
8. from sklearn import preprocessing, svm
9. from sklearn.model_selection import train_test_split
10. from sklearn.linear_model import LinearRegression
11. from matplotlib import style
12.
13. style.use("ggplot")
14.
15. df = quandl.get("WIKI/GOOGL")
16. df = df[["Adj. Open", "Adj. High", "Adj. Low",
"Adj. Close", "Adj. Volume"]]
17. df = df[["Adj. Close", "Adj. Volume"]]
18.
19. forecast_col = "Adj. Close"
20. df.fillna(value=-99999, inplace=True)
21. forecast_size = int(math.ceil(0.02 * len(df)))
22. print("Forecast size: {forecast_size}")
23.
24. df["label"] = df[forecast_col].shift(-forecast_size)
25.
26. x = np.array(df.drop(["label"], axis=1))
27. x = preprocessing.scale(x)
28. x_lately = x[-forecast_size:]
29. x = x[:-forecast_size]
30.
31. df.dropna(inplace=True)
32.
33. y = np.array(df["label"])
34. x_train, X_test, y_train, y_test =
train_test_split(x, y, test_size=0.2)
35. clf = LinearRegression(n_jobs=-1)
36. clf.fit(x_train, y_train)
37. confidence = clf.score(X_test, y_test)
38.
39. forecast_set = clf.predict(x_lately)
40. df["Forecast"] = np.nan
41. last_date = df.iloc[-1].name
42. last_unix = last_date.timestamp()
43. one_day = 24 * 60 * 60 # 1 day in seconds
44. next_unix = last_unix + one_day
45.
46. for i in forecast_set:
47. next_date = datetime.fromtimestamp(next_unix)
48. next_unix += one_day
49. df.loc[next_date] = [np.nan for _ in
range(len(df.columns) - 1)] + [i]
50.
51. df["Adj. Close"].plot()
52. df["Forecast"].plot()
53. plt.legend(loc=4)
54. plt.ylabel("Value")
55. plt.xlabel("Date")
56. plt.show()
Code 4.16
There is a lot going on in this code, but let us try to analyze it step by
step. In Code 4.16, the goal is to create a forecast of a stock's closing price
based on historical data, using scikit-learn's linear regression. In the first part
(lines 15-17) we load the stock exchange data as in Code 4.15, but in the current
use case (line 17) we drop the columns that are not needed in our example; we
only keep those that we will use for the estimation. In line 19, we can see that
the Adj. Close column contains the daily closing prices of the stock, which will
be used as the target variable of the forecast. The next part is to define the size
of the estimation (for how many days we want to forecast): in line 21, we take
2% of the total number of days in the historical data retrieved from the
external API (line 15).
Next, we prepare the features and the target: we shift the target column by the
forecast size, convert the features to a NumPy array, scale them, and split off
the most recent rows that we will forecast (lines 24-29). In the later part we
split the data into training and test sets, fit the linear regression model, score
its confidence on the test set, and predict values for the held-out rows (lines 33-39).
Once we have the predictions prepared, we append them to the DataFrame. We
can notice that we had to use a loop (lines 46-49) that goes over the forecast set
(line 46) and injects the predicted values into new future dates, going day by
day (line 49).
Finally, the code uses matplotlib (lines 51-56) to plot the historical closing
prices and the forecast on one graph, together with a legend and axis labels.
This plot can help to understand the behavior and patterns of the time
series data and the forecast.
After analyzing our forecasting code, we can check in the following figure what
our interpolation is going to look like.
Figure 4.5: Example forecast of Google stock market with using Python SkLearn
After warming up, we can start changing our code (4.16) in such a way that
it predicts data regarding our expenses. In this subchapter we do not
want to concentrate on preparing data with many real-life
details involved; we want to learn how to interpolate data, thus we are going
to create a script that will help us prepare sample data.
Before we can continue, we have to install a Python library, as shown in the
following code.
1. pip install xlsxwriter
Code 4.17
Once the module from Code 4.17 is installed, we can build code that is going to
produce an XLSX file containing simulated expenses grouped by expense
type. Let us check the following code to see how we can achieve this. Let us
create a file, shown below, called seed_example_data.py.
1. import pandas as pd
2. import numpy as np
3.
4.
5. EXPENSES = (
6. "Bank Fees",
7. "Clothing",
8. "Consumables",
9. "Entertainment",
10. "Hotels",
11. "Interest Payments",
12. "Meals",
13. "Memberships",
14. "Pension Plan Contributions" "Rent",
15. "Service Fees",
16. "Travel Fares",
17. "Utilities",
18. "Cleaning Supplies",
19. "Communication Charges",
20. "Energy",
21. "Food",
22. "Insurance",
23. "Maintenance",
24. "Medical Costs",
25. "Office Supplies",
26. "Professional Service Fees",
27. "Repair Costs",
28. "Taxes",
29. "Tuition",
30. "Vehicle Lease",
31. )
32.
33.
34. class Seeder:
35. def generate(self):
36. with pd.ExcelWriter("expenses_seed_example.xlsx", engine="xlsxwriter") as writer:
37. for sheet_name in EXPENSES:
38. dates = pd.date_range(start="2020-01-01", end="2021-01-01")
39. data = {
40. "date": dates,
41. "amount": pd.Series(np.random.choice(np.random.randint(
100, size=150), size=dates.size)),
42. }
43. df = pd.DataFrame(data, columns=["date", "amount"])
44. df.to_excel(writer, sheet_name=sheet_name)
45.
46.
47. if __name__ == "__main__":
48. s = Seeder()
49. s.generate()
Code 4.18
In that code we are creating a list of example expenses (lines 5-31) that we
later use to create separate sheets (line 37) in a single output file (line 36). On
every loop over an example sheet name (expense type), we generate a date
range (line 38). Next, we create a data series with random values (line 41)
that represents our example amounts of daily expenses.
Once we have created the dataframe, which is a combination of the dates with
the data series (lines 38-43), we save it as an Excel sheet.
As a result of running Code 4.18, we should get a seeded example file called
expenses_seed_example.xlsx. That file is going to be used in the following
part of the chapter.
After seeding the expenses file with sample data, it is time to start building a
script that is going to help us analyze our expenses.
First, let us build a class that can load our sample data. Let us check in the
following example code how we can achieve this.
1. import click
2. import pandas as pd
3.
4.
5. class Interpolation:
6. def __init__(self, source_file):
7. self.df = pd.read_excel(source_file, sheet_name=None, header=None, names=("Date", "Value"))
8.
9.
10. @click.command()
11. @click.option("--
source", type=str, help="Source file to load", required=True)
12. def main(source):
13. i = Interpolation(source)
14.
15.
16. if __name__ == "__main__":
17. main()
Code 4.19
As shown in lines 5-7, we load our source file into a pandas dataframe. Once
we have it loaded, with the following command we can proceed to analyze what
we have loaded.
1. python create_estimates.py --source expenses_seed_example.xlsx
Code 4.20
1. As the next step, once the data is in the system, we are going to aggregate
it weekly, in the way presented in the following code.
1. def aggregate(self):
2. for expense in EXPENSES:
3. self.df[expense]["Date"] = pd.to_datetime(self.df[expense]
["Date"][1:]) - pd.to_timedelta(7, unit="d")
4. self.df[expense] = self.df[expense].groupby([pd.Grouper(key="
Date", freq="W")])["Value"].sum()
Code 4.21
The aggregate method (line 3) shifts the dates of the time range we created
in the seeding script (Code 4.18) back by one week. Once we have this, we
group the dates, originally prepared as daily expenses, in such a way that as
a result we have the expenses grouped weekly (groupby); the amounts of those
expenses get summed as well: ["Value"].sum() (line 4).
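To see the weekly grouping in isolation, here is a minimal, self-contained sketch of the same pandas pattern; the data is made up purely for the demonstration:
import numpy as np
import pandas as pd

dates = pd.date_range("2020-01-01", periods=28)           # four weeks of days
df = pd.DataFrame({"Date": dates, "Value": np.ones(28)})  # one unit spent per day
weekly = df.groupby(pd.Grouper(key="Date", freq="W"))["Value"].sum()
print(weekly)  # 7.0 per full week, smaller sums for the partial weeks at the edges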
To be able to use this method, we have to import the EXPENSES variable from
Code 4.18 (we called that file seed_example_data.py). Let us check in the
following code how to achieve this.
1. from seed_example_data import EXPENSES
2.
3. def main(source):
4. i = Interpolation(source)
5. i.aggregate()
Code 4.22
In Code 4.22 we extended the main function with an explicit call to the aggregate
method and the import of the mentioned EXPENSES variable that we needed.
2. The next step is to add a method that is going to draw which kinds of
expenses are burning most of our home budget. Let us check in the following
code how we can deliver this requirement.
1. import matplotlib.pyplot as plt
2.
3. def plot(self):
4. for expense in EXPENSES:
5. self.df[expense].plot(label=expense, legend=True)
6. plt.legend(loc=4)
7. plt.xlabel("Date")
8. plt.ylabel("Amount")
9. plt.show()
Code 4.23
In this newly introduced method plot, we are again looping over the expense
types (line 4), and for each dataframe we plot its values as a line on the graph.
At the end of the method, we call the show method that draws our data, as
shown in Figure 4.6.
Figure 4.6: Example of expenses grouped weekly per expense type
Figure 4.6 shows a line chart of the weekly expenses grouped by expense
type. The chart has a title, a legend, and labels for both axes. However, the
chart is hard to read because the lines overlap and cross each other
frequently, making it difficult to compare the trends and values of different
expense types. Moreover, the markers are too small, and the colors are not
very contrasting.
Conclusion
In this chapter, we learned how we can analyze data prepared in Excel. We
came to understand how Python can be used for reading and writing very
broad data sets. Once we had the data sets, we learned how to use Python to
manipulate the data and draw fairly complex graphs that represent the
analyzed data.
In the next chapter, we are going to learn how we can use Python for
crawling web sites and extracting content out of them. We will also learn
how to make this effective and easy.
1. https://en.wikipedia.org/wiki/Comma-separated_values
2. https://pypi.org/project/click/
3. https://matplotlib.org
4. https://en.wikipedia.org/wiki/Artificial_intelligence
5. https://scipy.org
6. https://scikit-learn.org/stable/index.html
7. https://docs.scipy.org/doc/scipy/tutorial/interpolate/smoothing_splines.html
8. https://pypi.org/project/Quandl/
9. https://scikit-learn.org/stable/
CHAPTER 5
Building Non-blocking Web
Crawler
Introduction
Every web service served over the HTTP(S) protocol can be reached at a
very low level. What we mean by this is that by using Python and a few
libraries, we can fetch any website with all its assets and save it locally for
offline use. We do not need any browser to do so, and the whole process can
be fully automated.
As you can imagine, sometimes what matters more in the internet world
is the information we want to extract from the web: not the beautiful
assets of a website but its data. In its plain state, the data can be the most
valuable asset, and sometimes the asset may be an image itself. Whichever
part of the information you want to extract because of its importance to you,
in this chapter we are going to learn how to extract those very important
assets and fetch them from remote website resources.
Structure
In this chapter, we will learn how to work with web sites, starting with simple
examples where we will learn how to analyze and parse plain text. Next, we
are going to learn how this works compared to HTML. We will discuss
the following topics:
Parsing HTML and extracting data
Efficient data scraping
Using proxy services
Objectives
By the end of this chapter, you will know how to build your own efficient web
crawler, what kind of challenges it brings, and how to solve them. Let us
start coding!
Basic example
As a first example of a web crawler, we will build a crawling project that
fetches some documents from the python.org website. To build such a crawler, we
must design a simple queueing mechanism to store a list of URLs that the
crawler should visit and fetch the content of. Once the content is downloaded,
we need a module that will analyze the downloaded content and extract the
elements we need for further processing.
Figure 5.2: Concept of fetching web resources with the support of a URLs queue
Let us start with building a simple queue system that stores elements First
In, First Out (FIFO)6. In the following example (Code 5.15) we created a file
called crawler.py. We can see that for the concept of our simple queue we used
the Python module queue7. There is an optional argument of its constructor
(Code 5.15, line 6) that we intentionally skipped. That argument can be, for
example, queue.Queue(10), where 10 is the maximum size of the queue. We do
not need to limit the queue size here, since we want to parse as many URLs as
we can, so narrowing down the queue size would be an obstacle in this use case.
You can also notice that we do not load the queue with any content at this stage.
We only created the method _get_element for pulling an element out of the queue.
Refer to the following code:
1. import queue
2.
3.
4. class Crawler:
5. def __init__(self):
6. self.urls_queue = queue.Queue()
7.
8. def _get_element(self):
9. return self.urls_queue.get()
10.
11. def process(self):
12. """Main method to start crawler and process"""
13. pass
Code 5.15
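As a quick aside, the FIFO behavior of queue.Queue can be verified in a few lines; this is a standalone sketch, independent of the crawler:
import queue

q = queue.Queue()
for url in ("first", "second", "third"):
    q.put(url)
# Elements come back in insertion order: first second third
print(q.get(), q.get(), q.get())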
The next improvement that we want to make in our code is to add functionality
allowing us to load content, which is a list of URLs to process, into our
queuing system. We do not want to hardcode any list of URLs in our code,
but use something more dynamic. Let us try to use a CSV file with 2 columns,
which will look like this:
Column 1: Stores all URLs that we want to crawl.
Column 2: Number of retries if an error is detected.
How do we load this file into our queue? Again, we could use some sort of
hardcoded file name in our code, but this approach is not recommended,
since we want to process any given file; so we should be able to parametrize
our file crawler.py. We will use the module click8, which allows us to build
advanced command line tools. In the following example we are adding
command line argument support. Please remember to import the click module
at the top of Code 5.15:
1. import click
2. import os
3.
4.
5. @click.command()
6. @click.option("--source", help="CSV full file path", required=True)
7. def main(source):
8. """Main entry point for processing URLs and start crawling."""
9. assert os.path.exists(source), f"Given file {source} does not exist"
10. c = Crawler()
11. c.process()
12.
13. if __name__ == '__main__':
14. main()
Code 5.16
We have imported two new modules: os and click. click is imported
for obvious reasons: we wanted to build simple command line tool support
to run our script, as in the following example, Code 5.17. The os module we
imported for validating whether the given source file path exists. Refer to
the following code:
1. $ python crawler.py --source=skdjhfdksjh
2. Traceback (most recent call last):
3. File "crawler.py", line 26, in <module>
4. main()
5. File "/Users/hubertpiotrowski/.virtualenvs/fun1/lib/python3.7/site-
packages/click/core.py", line 1130, in __call__
6. return self.main(*args, **kwargs)
7. File "/Users/hubertpiotrowski/.virtualenvs/fun1/lib/python3.7/site-
packages/click/core.py", line 1055, in main
8. rv = self.invoke(ctx)
9. File "/Users/hubertpiotrowski/.virtualenvs/fun1/lib/python3.7/site-
packages/click/core.py", line 1404, in invoke
10. return ctx.invoke(self.callback, **ctx.params)
11. File "/Users/hubertpiotrowski/.virtualenvs/fun1/lib/python3.7/site-
packages/click/core.py", line 760, in invoke
12. return __callback(*args, **kwargs)
13. File "crawler.py", line 21, in main
14. assert os.path.exists(source), f"Given file {source} does not exist"
15. AssertionError: Given file skdjhfdksjh does not exist
Code 5.17
As you can see in Code 5.16, we added an assertion in line 9 for the case when
the given file path does not exist. In Code 5.17 we tested that assertion, and as you
can see, it is working as expected. Now let us create a valid CSV file which
looks like the following example: 4 URLs to load (1st column) and the number
of retries (2nd column):
1. https://www.python.org,1
2. https://www.python.org/community/forums/,2
3. https://www.reddit.com/r/learnpython/,2
4. https://en.wikipedia.org/wiki/Elvis_Presley,3
Code 5.18
In the following code example, we modified the already created method main so
it can read the given existing CSV file (for clarity, let us call it urls.csv in this
code example) and load its content into our processing queue:
1. import click
2. import csv
3. import os
4. import queue
5.
6. class Crawler:
7. def __init__(self):
8. self.urls_queue = queue.Queue()
9.
10. def load_content(self, file_path):
11. with open(file_path, 'r') as f:
12. reader = csv.reader(f)
13. for row in reader:
14. self.urls_queue.put(row)
15.
16. click.echo(f"After loaiding CSV content queue size id: {self.urls_q
ueue.qsize()}")
17.
18. def _get_element(self):
19. return self.urls_queue.get()
20.
21. def process(self):
22. """Main method to start crawler and process"""
23. pass
24.
25.
26. @click.command()
27. @click.option("--source", help="CSV full file path", required=True)
28. def main(source):
29. """Main entry point for processing URLs and start crawling."""
30. assert os.path.exists(source), f"Given file {source} does not exist"
31. c = Crawler()
32. c.load_content(source)
33. c.process()
34.
35.
36. if __name__ == '__main__':
37. main()
Code 5.19
We added a new method, load_content, which loads the CSV file and pushes its content to the queue that we will process at a later stage in the method called process. For the time being, process does not do anything, but in the following example let us add some logic there which is going to:
Consume the queue in FIFO order (see the short sketch after this list)
Validate the response content
Save the content of each crawled website to an external file
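As a quick reminder of what FIFO consumption means, here is a minimal sketch (with made-up values); queue.Queue hands elements back in exactly the order they were put in:
1. import queue
2.
3. q = queue.Queue()
4. for item in ("first", "second", "third"):
5.     q.put(item)
6.
7. while not q.empty():
8.     print(q.get())  # prints first, second, third: the insertion order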
Before we can start building the part of the code (Code 5.17) that fetches website content for us, we must install the requests9 package. It is a simple-to-use HTTP library with a very elegant API. We could use the standard Python library, or a framework such as Twisted10, which we used in the examples in Chapter 3, Designing a Conversational Chatbot. In this case, we would like to keep things clear and easy to follow for the more complex scenarios that our code is going to cover:
1. pip install requests -U
Once requests is installed, we must modify our process method to resemble the following example.
1. import requests
2. from hashlib import sha256
3.
4. def process(self):
5. """Main method to start crawler and process"""
6. while self.urls_queue.qsize() > 0:
7. url, _ = self.urls_queue.get()
8. response = requests.get(url)
9. if response.status_code == 200:
10. f_name = sha256(url.encode('utf-8')).hexdigest()
11. output_file = f"/tmp/{f_name}.html"
12. with open(output_file, "w") as f:
13. f.write(response.text)
14. click.echo(f"URL: {url} [saved] under {output_file}")
Code 5.17
You probably noticed that we updated the process method with a mechanism that pulls elements from the queue until it is empty (lines 6-7). There is an unusual bit of syntax in line 7: the element we pull from the queue is a list of two elements, the URL and the number of retries for fetching its content. Since we ignore the number of retries for the time being, we immediately assign that value to the throwaway variable (the underscore symbol). For each element pulled, we call the external resource (line 8) and check the response code11 (line 9). Only valid responses are processed further. We then calculate a SHA-256 hash of the resource URL and use it as the file name (line 10) to save the downloaded content (lines 11-14). Looks simple and clean. Now, how do we improve our code to retry the download when a response fails? Let us analyze the following example, Code 5.18:
1. import time
2.
3. SLEEP_TIME = 1
4.
5. def process(self):
6. """Main method to start crawler and process"""
7.
8. while self.urls_queue.qsize() > 0:
9. url, number_of_retries = self.urls_queue.get()
10.
11. for try_item in range(int(number_of_retries)):
12. click.echo(f"Number of retries: {try_item+1}/{number_of_retrie
s}")
13. response = requests.get(url)
14. if response.status_code == 200:
15. f_name = sha256(url.encode('utf-8')).hexdigest()
16. output_file = f"/tmp/{f_name}.html"
17. with open(output_file, "w") as f:
18. f.write(response.text)
19. click.echo(f"URL: {url} [saved] under {output_file}")
20. break
21. else:
22. click.echo(f"Fetching resource failed with status code {respon
se.status_code} sleeping {SLEEP_TIME}s before retry")
23. time.sleep(SLEEP_TIME)
Code 5.18
Simple crawler
After a few modifications, as you can see in Code 5.18, we use the number of retries (line 11), and when fetching the content fails (lines 21-23) we sleep before the next try. Notice that we used the constant SLEEP_TIME (Code 5.18, line 3), written in capital letters because that is the naming convention for global constants. Refer to Chapter 1, Python 101 and Chapter 2, Setting up Python Environment, for more details about the syntax. We decided to use a constant here because it is a read-only value that we may want to use all around the code, and it will always indicate the same sleep time between retries. Let us test it and add the following line to the source CSV file:
1. http://dummy.non-existing.url.com,5
Now, run the code as shown in the following example:
1. Traceback (most recent call last):
2. File "/Users/hp/.virtualenvs/fun1/lib/python3.7/site-
packages/urllib3/connection.py", line 175, in _new_conn
3. (self._dns_host, self.port), self.timeout, **extra_kw
4. File "/Users/hp/.virtualenvs/fun1/lib/python3.7/site-
packages/urllib3/util/connection.py", line 72, in create_connection
5. for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREA
M):
6. File "/Library/Frameworks/Python.
framework/Versions/3.7/lib/python3.7/socket.py", line 752, in getaddrin
fo
7. for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
8. socket.gaierror: [Errno 8] nodename nor servname provided, or not kno
wn
9.
10. During handling of the above exception, another exception occurred:
11.
12. (...)
13.
14. requests.exceptions.ConnectionError: HTTPConnectionPool
(host='dummy.non-existing.url.com', port=80): Max retries exceeded
with url: / (Caused by NewConnectionError('<urllib3.connection.
HTTPConnection object at 0x7fd3705ba2d0>:
Failed to establish a new connection:
[Errno 8] nodename nor servname provided, or not known'))
Code 5.19
What just happened in Code 5.19? Our improved code from Code 5.18 was supposed to react to this and retry when the URL cannot be fetched. Read Code 5.19 carefully: we crashed with an exception that explicitly tells us, as developers, that the requests library raised the error "nodename nor servname provided, or not known". Ignoring the precise error message (Code 5.19, line 14), we have to notice an important thing: in our improved example (Code 5.18) we check whether the response code from the server differs from status code 200 (Code 5.18, line 14) and retry when it does (for instance, on a 50312 error). That kind of logic is only reached in these cases:
The URL is valid and the server is responding, but with an error code.
The server returns any response code different from 200, for instance 404, resource not found.
In the case presented in Code 5.19, however, a ConnectionError exception (caused by NewConnectionError) is raised before we ever get a response, so checking the status code with an if/else statement will not work here. We should wrap these calls in a try/except block and handle such conditions properly. So, to improve Code 5.18, we must do something like the following example:
1. def process(self):
2. """Main method to start crawler and process"""
3. url_success = 0
4. url_fails = 0
5. while self.urls_queue.qsize() > 0:
6. url, number_of_retries = self.urls_queue.get()
7.
8. for try_item in range(int(number_of_retries)):
9. click.echo(f"Number of retries: {try_item+1}/{number_of_retrie
s}")
10. is_ok = False
11. try:
12. response = requests.get(url)
13. if response.status_code == 200:
14. f_name = sha256(url.encode('utf-8')).hexdigest()
15. output_file = f"/tmp/{f_name}.html"
16. with open(output_file, "w") as f:
17. f.write(response.text)
18. click.echo(f"URL: {url} [saved] under {output_file}")
19. url_success += 1
20. is_ok = True
21. break
22. except Exception as e:
23. click.echo(f"We failed to fetch URL {url} with exception: {e}")
24.
25. click.echo(f"Fetching {url} failed, sleeping {SLEEP_TIME}s before retry")
26. time.sleep(SLEEP_TIME)
27. if not is_ok:
28. url_fails += 1
29. click.echo(f"We fetched {url_success} URLs with {url_fails} fails")
Code 5.20
We managed to change our main process method, without too much of a revolution, so that exceptions raised while fetching the given URL are caught properly. You can probably notice in line 22 that we catch the generic exception without being too specific, so all possible failures trigger a retry. If our code fails because of a logical error that has nothing to do with reaching the destination URL, it will still retry until the retries are exhausted. At least we log this, so as developers we can spot it and fix it by investigating the logs later. This approach ensures that no fatal exception stops our code from executing.
In line 29, additional logging informs us how many URLs from the list in the CSV we managed to fetch properly and how many times we failed. As you can see, in lines 3-4 we added helper variables that count these values, which we use at the end to print the stats.
We now know how to build a simple crawler, which is a very good starting point, but how about extracting the links from a page and crawling those as well? For that we need to refactor our process method so that it can extract the links to subpages and fetch them properly.
1. import re
2. from typing import Optional
3. from urllib.parse import urlparse
4.
5. def fetch_url(self, url: str, number_of_retries: int) -> Optional[str]:
6. click.echo(f"Fetching {url}")
7. for try_item in range(int(number_of_retries)):
8. try:
9. response = requests.get(url)
10. if response.status_code == 200:
11. f_name = sha256(url.encode('utf-8')).hexdigest()
12. output_file = f"/tmp/{f_name}.html"
13. with open(output_file, "w") as f:
14. f.write(response.text)
15. click.echo(f"URL: {url} [saved] under {output_file}")
16. return response.text
17. except Exception as e:
18. click.echo(f"We failed to fetch URL {url} with exception: {e}")
19.
20. click.echo(f"Fetching {url} failed, sleeping {SLEEP_TIME}s before retry")
21. time.sleep(SLEEP_TIME)
Code 5.21
We refactored our code from Code 5.20 to split the main process method into smaller logical blocks: fetching a URL and saving the downloaded result is now isolated from reading and processing the CSV file. Please notice the additional imports added at the top of the changes. Now let us look at the main process method and what is happening in it:
1. def process(self):
2. """Main method to start crawler and process"""
3. url_success = 0
4. url_fails = 0
5. while self.urls_queue.qsize() > 0:
6. url, number_of_retries = self.urls_queue.get()
7. base_url = urlparse(url)
8. content = self.fetch_url(url, number_of_retries)
9. results = LINK.findall(content)
10. click.echo(f"Found {len(results)} links")
11. for parsed_url in results:
12. if parsed_url.startswith('/'):
13. parsed_url = f"{base_url.scheme}://{base_url.netloc}
{parsed_url}"
14. if not parsed_url.startswith('http'):
15. continue
16. content = self.fetch_url(parsed_url, number_of_retries)
Code 5.22
As mentioned, we refactored the main method as shown in Code 5.22. Notably, we process the downloaded page and extract the list of URLs (lines 9-16, Code 5.22) in order to fetch every URL found in the source page. We added additional checks (lines 12-15), so for each URL parsed from the source:
If it is not correct, we continue and do not fetch any page (lines 14-15).
If it does not start with http/https, we use the main domain and scheme to build a proper absolute URL (lines 12-13).
The regular expression we need, added just after the other static definitions at the top of the source file, is presented in the following example:
1. LINK = re.compile(r"<a.*?href=[\"'](.*?)[\"']", re.I)
Code 5.23
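If you want to see how this expression behaves before wiring it into the crawler, a minimal check with a made-up HTML snippet could look like this:
1. import re
2.
3. LINK = re.compile(r"<a.*?href=[\"'](.*?)[\"']", re.I)
4.
5. sample = '<a href="/about">About</a> <A HREF="https://example.com">Ext</A>'
6. print(LINK.findall(sample))  # ['/about', 'https://example.com']
Note that findall returns only the captured group, that is, the raw href value, which is exactly what our process method expects.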
So far, we have built a web crawler that can process a given CSV file containing a list of URLs. It fetches the content of each given URL, extracts all the URLs found in it, and fetches the content of those as well. We managed to build a simple yet powerful retry mechanism for the cases when a given URL resource cannot be fetched or the server keeps rejecting access to the requested resource. Now, you have probably noticed the bottleneck of our solution: we process only one URL at a time, and there are a few issues with that:
The processing time for all the URLs is very long.
We are not utilizing the resources of our computer and internet connection properly.
Any failure slows down the whole process.
Parallel processing
In the Simple crawler section, we built a simple crawler that got us into the web crawling world, but at the same time we faced some limitations of that solution. In this section, we will refactor our crawler to work in a parallel fashion.
The natural choices seem to be multiprocessing13 or multithreading14. We could use those libraries to write some parallel processing. However, since Python 3.x ships a good asynchronous library, we will use async instead of multithreading. The reason is simple: Python has a Global Interpreter Lock (GIL15), which in some cases (especially for web crawling) can slow computation. Asynchronous programming sounds like a better and more natural choice for network sockets and for accessing an inherently asynchronous resource, which is the entire web.
Let us start by refactoring our main module, beginning with the main function that starts the crawling process. The following example will use a new file called async_webcrawler.py. Before we can proceed, we must install the async replacement for the requests library, called httpx16:
1. pip install httpx -U
Code 5.24
Once we have this module installed, it is time to do some refactoring. Before doing so, we should take a minute to analyze what the async world is. We have already briefly answered why asynchronous functions are better here than synchronous ones. Since we use a web resource, which is asynchronous by nature and gives no guarantee of when, or even whether, it will respond, async programming is the right fit. The essence of async programming, to simplify the answer, is that Python does not sit and wait for a blocking operation to finish, which would hold up the rest of the code. Instead, Python puts the information about the blocking resource aside (this is called a coroutine). Once the resource is ready, the event loop is informed that the waiting operation has finished, and based on the coroutine (and its callbacks) Python decides what to do next.
To get a better picture of a blocking resource, imagine a printer that can only print a single character at a time as a mechanical limitation. With blocking code, your program waits for the printer to finish printing the first character before it can even form the next one, and the same goes for other operations, such as updating the printing progress bar in your operating system. With async, non-blocking code, you can control printing with coroutines and carry on with other operations without worrying too much about the blocking resource.
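To make this concrete, here is a minimal, self-contained sketch (not part of our crawler; asyncio.gather is covered properly later in this chapter). Two coroutines sleep for two seconds each; because asyncio.sleep is non-blocking, the total runtime is about two seconds rather than four:
1. import asyncio
2. import time
3.
4. async def task(name: str) -> None:
5.     print(f"{name} started")
6.     await asyncio.sleep(2)  # non-blocking: the event loop keeps running other coroutines
7.     print(f"{name} finished")
8.
9. async def demo():
10.     start = time.monotonic()
11.     await asyncio.gather(task("A"), task("B"))
12.     print(f"Took {time.monotonic() - start:.1f}s")  # about 2s, not 4s
13.
14. asyncio.run(demo())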
For the following example we could use the external library Twisted17, mentioned a few times already as the oldest and most mature framework that supports event-driven and async programming. In this case, to be more minimalistic and have cleaner code, we are going to use the built-in Python functionality called asyncio18.
Knowing all of this, let us start by refactoring the method that starts our app, main. As you can see in the following example, Code 5.25, starting an async program differs: we must start an event loop that will be controlling the mentioned coroutines:
1. import asyncio
2. import httpx
3.
4. async def main(source):
5. """Main entry point for processing URLs and start crawling."""
6. assert os.path.exists(source), f"Given file {source} does not exist"
7. c = Crawler()
8. await c.load_content(source)
9. await c.process()
10.
11.
12. @click.command()
13. @click.option("--source", help="CSV full file path", required=True)
14. def run(source):
15. asyncio.run(main(source))
16.
17.
18. if __name__ == '__main__':
19. run()
Code 5.25
We added two new major imports and removed the import of the requests module, which is no longer needed. The major change is how we refactored the main function: it is now an asynchronous function. Notice the async keyword in front of def, which tells Python that the function body is asynchronous and that whichever part of the code calls it must wait for its execution in a special way (as a coroutine).
Additionally, the new run function starts the Python asyncio event loop, which controls the coroutines and async resources. That function takes the click CLI arguments and passes them on to main, since main must be called in an async way.
In the following example, let us check how the entire crawler got refactored and how to make async requests with the httpx library:
1. import asyncio
2. import click
3. import csv
4. import httpx
5. import os
6. import re
7. from hashlib import sha256
8. from typing import Optional
9. from urllib.parse import urlparse
10.
11. SLEEP_TIME = 1
12. LINK = re.compile(r"<a.*?href=[\"'](.*?)[\"']", re.I)
Code 5.34
In Code 5.34 we cleaned up the imports to drop all those that are not necessary anymore. You can notice that we also dropped the import of the queue module. Why is that? Because we must use an async queue, which is part of asyncio. Consequently, we must refactor the __init__ method as in the following example:
1. class Crawler:
2. def __init__(self):
3. self.urls_queue = asyncio.Queue()
Code 5.35
You can see that initializing the queue looks very similar to the previous examples (Code 5.17), but in this case it is an async queue, so the methods it offers are different, and we must also refactor the method that gets an element from the queue:
1. async def _get_element(self):
2. return await self.urls_queue.get()
Code 5.36
In line 2 we paired await with the return statement. The await is needed to inform Python that this call returns a coroutine result and that we do not want to block on it.
The following example shows how we refactored the fetch_url method so that it uses the httpx module and works in the async pattern:
1. async def fetch_url(self, url: str, number_of_retries: int) -> Optional[str]:
2. click.echo(f"Fetching {url}")
3. for try_item in range(int(number_of_retries)):
4. try:
5. async with httpx.AsyncClient() as client:
6. response = await client.get(url, follow_redirects=True)
7. if response.status_code == 200:
8. f_name = sha256(url.encode('utf-8')).hexdigest()
9. output_file = f"/tmp/{f_name}.html"
10. with open(output_file, "w") as f:
11. f.write(response.text)
12. click.echo(f"URL: {url} [saved] under {output_file}")
13. return response.text
14. except Exception as e:
15. click.echo(f"We failed to fetch URL {url} with exception: {e}")
16.
17. click.echo(f"Fetching {url} failed, sleeping {SLEEP_TIME}s before retry")
18. await asyncio.sleep(SLEEP_TIME)
Code 5.37
In Code 5.37, lines 5-6, we switched from requests to the httpx library. Please notice that the whole block works inside a context manager19 (async with) combined with await for the external fetch. You can also notice something very important: in line 18, when we put the code to sleep before retrying the given resource URL, we do not use time.sleep. A blocking sleep would stall the whole event loop, putting everything that should be asynchronous to sleep and effectively turning the program back into synchronous code, so we use asyncio.sleep instead. Refer to the following code:
1. async def process(self):
2. """Main method to start crawler and process"""
3. url_success = 0
4. url_fails = 0
5. while True:
6. url, number_of_retries = await self.urls_queue.get()
7. base_url = urlparse(url)
8. content = await self.fetch_url(url, number_of_retries)
9. results = LINK.findall(content)
10. click.echo(f"Found {len(results)} links")
11. for parsed_url in results:
12. if parsed_url.startswith('/'):
13. parsed_url = f"{base_url.scheme}://{base_url.netloc}
{parsed_url}"
14. if not parsed_url.startswith('http'):
15. continue
16. content = await self.fetch_url(parsed_url, number_of_retries)
17. if self.urls_queue.empty():
18. break
19. click.echo("Processing finished, exiting...")
Code 5.38
process is another async method, and we had to flip the logic of fetching elements from the queue if you compare it with Code 5.22. As you noticed in that example, we kept fetching elements from the queue while it was not empty. In the async version we do this differently: we loop indefinitely and break out once the queue becomes empty (lines 17-18). We changed this because of the nature of the async queue and the public methods it offers.
The rest of the body of the process function looks normal. We did not have to change much apart from the parts that block the code, which we converted to coroutine calls.
After refactoring all the previously blocking code and making it async, we must answer a very important question: did we manage to make processing faster than the synchronous version? The answer is not that straightforward. On the one hand, comparing apples to apples, yes, it is more efficient, since we no longer block while processing web resources (lines 5-18). But it is not as efficient as it can be: we still do not apply any concurrent processing. How do we do that? Let us look at the code snippet in the following example:
1. import random
2. import asyncio
3.
4.
5. async def func(func_number: int) -> None:
6. for i in range(1, 6):
7. sleep_time = random.randint(1, 5)
8. print(f"Func {func_number} go {i}/5, taking nap {sleep_time}s")
9. await asyncio.sleep(sleep_time)
10.
11.
12. async def call_tests():
13. await asyncio.gather(func(1), func(2), func(3))
14.
15. asyncio.run(call_tests())
Code 5.39
In Code 5.39 we wrote a simple function that takes one integer argument, prints its status on the screen, and then sleeps for a random number of seconds (line 9). As you have probably noticed, with asyncio we do not send the function to any kind of thread. Instead, we start it as a coroutine delegated to the background, and in line 13 we wait until all the started async functions have finished and returned their results (None in this case).
Knowing how concurrent processing can be achieved with asynchronous programming in Python, let us try to refactor our code:
1. async def process(self):
2. """Main method to start crawler and process"""
3. url_success = 0
4. url_fails = 0
5. while True:
6. url, number_of_retries = await self.urls_queue.get()
7. base_url = urlparse(url)
8. content = await self.fetch_url(url, number_of_retries)
9. results = LINK.findall(content)
10. click.echo(f"Found {len(results)} links")
11. calls = []
12. for parsed_url in results:
13. if parsed_url.startswith('/'):
14. parsed_url = f"{base_url.scheme}://{base_url.netloc}
{parsed_url}"
15. if not parsed_url.startswith('http'):
16. continue
17. calls.append(self.fetch_url(parsed_url, number_of_retries))
18. await asyncio.gather(*calls)
19. if self.urls_queue.empty():
20. break
21. click.echo("Processing finished, exiting...")
Code 5.40
In Code 5.40 we kept the main essence of our method as it was, except for the part that fetches the sub-URLs extracted from the main page (lines 11-18). Finally, we have concurrent fetching and crawling of URLs, but there is still something to improve.
Notice that in lines 5-6 we still take the URLs loaded from the CSV file one at a time, and in line 18 we wait until downloading the content of all the parsed URL addresses is finished. This is again not the optimal approach. Why? Because we wait until the whole list of links has been fetched: the list extracted from, for instance, a Wikipedia page can be massive, so downloading hundreds of pages can take a while before we can continue with the next URL from the CSV file.
Let us take a closer look at the refactored process method so we can see how it is going to work in fully concurrent mode:
1. async def process(self):
2. """Main method to start crawler and process"""
3. calls = []
4. while True:
5. url, number_of_retries = await self.urls_queue.get()
6. calls.append(self.process_item(url, number_of_retries))
7. if self.urls_queue.empty():
8. break
9. await asyncio.gather(*calls)
10. click.echo("Processing finished, exiting...")
Code 5.41
In Code 5.41 we refactored our main process method. As you can see, we removed the entire block that extracts the URLs found in the page content and crawls them. Instead, we collect calls to a separate async method (line 6) and gather the results (wait for them) in line 9. That is way more efficient. In the following example, Code 5.42, you can see how we created the new method process_item:
1. async def process_item(self, url: str, number_of_retries: int):
2. base_url = urlparse(url)
3. content = await self.fetch_url(url, number_of_retries)
4. results = LINK.findall(content)
5. click.echo(f"Found {len(results)} links")
6. calls = []
7. extracted_urls = filter(lambda x: x.startswith('/') or x.startswith('http'),
results)
8. for parsed_url in extracted_urls:
9. if parsed_url.startswith('/'):
10. parsed_url = f"{base_url.scheme}://{base_url.netloc}
{parsed_url}"
11. calls.append(self.fetch_url(parsed_url, number_of_retries))
12. return await asyncio.gather(*calls)
Code 5.42
We did not change much in the body of the fetching and parsing block except for some code optimization. You should notice that we applied the filter function combined with a lambda (line 7) to filter out all the invalid URL strings that the regexp managed to catch. It is also worth noticing that filter returns a lazy iterable object which, as already said in Chapter 1, Python 101 and Chapter 2, Setting up Python Environment, leads to more effective memory utilization.
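A minimal illustration of that laziness, with made-up values:
1. urls = ["/about", "https://example.com", "mailto:[email protected]", "#top"]
2.
3. # filter builds a lazy iterator; nothing is evaluated until it is consumed
4. valid = filter(lambda x: x.startswith('/') or x.startswith('http'), urls)
5.
6. print(list(valid))  # ['/about', 'https://example.com']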
So far, we have learned how to optimize the crawler to fetch as many pages as possible. The crawler we have been building fetches HTML, but adding a few extra regexps to also get images is quite a good exercise for the next example, since fetching binary files (for example, PNG) is a bit different from fetching plain text. Refer to the following code:
1. IMAGES = re.compile(r"<img.*?src=[\"'](.*?)[\"']", re.I)
2.
3. class Crawler:
4. def __init__(self, call_levels: int):
5. self.urls_queue = asyncio.Queue()
6. self.__call_levels = call_levels
7.
8.
9. async def main(source:str, level: int):
10. """Main entry point for processing URLs and start crawling."""
11. assert os.path.exists(source), f"Given file {source} does not exist"
12. c = Crawler(level)
13. await c.load_content(source)
14. await c.process()
15.
16.
17. @click.command()
18. @click.option("--source", help="CSV full file path", required=True)
19. @click.option("--level", help="Crawling depth level",
type=int, required=False, default=5)
20. def run(source, level):
21. asyncio.run(main(source, level))
Code 5.43
In Code 5.43 we added a new regexp (line 1) for extracting image URLs. Next, we changed the constructor of the Crawler class (line 4) and added a new parameter, call_levels. We will use it in a later refactoring, when we apply a technique called recursion and want to limit how many levels deep the recursion may go. You can also see that in lines 17-21 we support this option around the main method, so executing the script is going to look like the following example:
1. python async_webcrawler_3.py --source=urls.csv --level=3
Code 5.44
We called this refactored file async_webcrawler_3.py, and in Code 5.44 we have an example of how to use it, loading the source URLs from a CSV file (urls.csv) with three levels of recursion. Now let us see how we can do the recursion in the following example:
1. async def process_item(self, url: str, number_of_retries: int, call_level: int = 1) -> list:
2. base_url = urlparse(url)
3. content = await self.fetch_url(url, number_of_retries)
4. results = LINK.findall(content)
5. parsed_images = IMAGES.findall(content)
6. click.echo(f"Found {len(results)} links [level: {call_level}]")
7. click.echo(f"Found {len(parsed_images)} images [level: {call_level}]
")
8. calls = []
9.
10. extracted_urls = filter(lambda x: x.startswith('/') or x.startswith('http'),
results)
11. parsed_images = filter(lambda x: x.startswith('/') or x.startswith('http')
, parsed_images)
12.
13. for parsed_url in parsed_images:
14. if parsed_url.startswith('/'):
15. parsed_url = f"{base_url.scheme}://{base_url.netloc}
{parsed_url}"
16. calls.append(self.fetch_url(parsed_url, number_of_retries))
17.
18. for parsed_url in extracted_urls:
19. if parsed_url.startswith('/'):
20. parsed_url = f"{base_url.scheme}://{base_url.netloc}
{parsed_url}"
21. if call_level < self.__call_levels:
22. calls.append(self.process_item(parsed_url, number_of_retries, c
all_level+1))
23. else:
24. calls.append(self.fetch_url(parsed_url, number_of_retries))
25.
26. return await asyncio.gather(*calls)
Code 5.45
We changed the way we run the process_item method. We start it as before, but we added a new argument, call_level. What we do in the body of this method is:
Extract all the image URLs we found and fetch them (lines 13-16).
Extract all the HTML links (lines 18-24). If the recursion limit has not been reached yet (line 21), we call the very method we are in, process_item; that technique is called recursion. If the recursion limit is reached, we call the regular fetch_url instead (line 24), with no recursion in this context. Please note that in line 22 we increase the recursion level when making the recursive call, to keep track of how deep we have gone and to stop the recursive calls once the maximum limit is reached.
In Code 5.47, we will see how the fetch_url method was refactored. The main difference is how we get the downloaded content's body from the httpx response. Refer to the following code:
1. async def fetch_url(self, url: str, number_of_retries: int) -> Optional[str]:
2. click.echo(f"Fetching {url}")
3. for try_item in range(int(number_of_retries)):
4. try:
5. async with httpx.AsyncClient() as client:
6. response = await client.get(url, follow_redirects=True)
7. content_type = response.headers.get('Content-Type').split(';')[0]
8. extension = content_type.split('/')[-1].lower()
9. if response.status_code == 200:
10. f_name = sha256(url.encode('utf-8')).hexdigest()
11. output_file = f"/tmp/{f_name}.{extension}"
12. with open(output_file, "wb") as f:
13. data = response.content
14. f.write(data)
15. click.echo(f"URL: {url} [saved] under {output_file}")
16. return data.decode('utf-8')
17. except Exception as e:
18. click.echo(f"We failed to fetch URL {url} with exception: {e}")
19. response = None
20. click.echo(f"Fetching {url} failed, sleeping {SLEEP_TIME}s before retry")
21. if response is None or response.status_code != 404:
22. await asyncio.sleep(SLEEP_TIME)
Code 5.47
In Code 5.47 we derive the file extension from the server response metadata (lines 7-8). It is not ideal, since some content types can be application/octet-stream or binary/octet+stream, which can lead to weird file extensions when saving (line 11). For the needs of this exercise, we will keep it this way so as not to make things more complex.
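If you ever wanted to make the extension handling more robust, one possible approach (a sketch, not part of our crawler) is the standard library mimetypes module:
1. import mimetypes
2.
3. def extension_for(content_type: str) -> str:
4.     # guess_extension knows common types; fall back to .bin for unknown ones
5.     mime = content_type.split(';')[0].strip()
6.     return mimetypes.guess_extension(mime) or '.bin'
7.
8. print(extension_for('text/html; charset=utf-8'))  # .html
9. print(extension_for('application/x-unknown'))  # .bin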
Going further with Code 5.47, you can discover that we also refactored how we get the body of the response. Previously (Code 5.21, line 14) we fetched only text, so that technique was fine. Since we also want to fetch images, we now take the raw content instead (Code 5.47, line 13), and to save it properly we open the output file as a binary stream (line 12).
Improvements
In the previous sections we have been building a simple yet still powerful web crawler. In this section we will introduce a few improvements to our Proof of Concept (POC). The first real-life issue that our web crawler may face is the performance of your local machine. To address it, we will introduce a technique that limits the number of coroutines running simultaneously. By introducing such a handbrake, we limit the number of parallel downloads, which helps us limit the consumed resources.
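One simple way to build such a handbrake is an asyncio.Semaphore; the following is a minimal sketch that assumes fetch_url keeps its current signature (the asyncio-pool20 package offers a similar mechanism):
1. import asyncio
2.
3. MAX_CONCURRENT_DOWNLOADS = 10  # assumed limit, tune it to your machine
4. semaphore = asyncio.Semaphore(MAX_CONCURRENT_DOWNLOADS)
5.
6. async def fetch_url_limited(crawler, url: str, number_of_retries: int):
7.     # at most MAX_CONCURRENT_DOWNLOADS coroutines enter this block at once;
8.     # the rest wait here without blocking the event loop
9.     async with semaphore:
10.         return await crawler.fetch_url(url, number_of_retries)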
Proxy
In some cases when you run our crawler, you will notice that some websites are smart. They quickly detect that somebody is crawling them, and since that may be something they do not like, they block the traffic and you start getting lots of errors, for instance a 403 access denied response code.
This is because it is not natural behavior for a single IP address, which is how they see your laptop, to send so many requests simultaneously, asking for so many resources in parallel. There is a way to help us here: we can send all the traffic to such a website via a proxy21, as shown in the following figure:
Figure 5.3: Example of how to use proxy service when sending request to website
To be able to use such a proxy service, we have two options. First, we can use an existing proxy solution; lists of free proxy services are linked from the Wikipedia article21. Another option is to start a cloud service independently and install a proxy service22 on it.
One way or the other, the way we are going to use it is presented in the following example, Code 5.52:
1. PROXIES = {
2. "http://": "http://proxy.foo.com:8030",
3. "https://": "http://proxy.foo.com:8031",
4. }
Code 5.52
In Code 5.52 we defined a fixed mapping of proxy servers; for the HTTP and HTTPS protocols we will use different proxy services23. Refer to the following code:
1. async def fetch_url(self, url: str, number_of_retries: int) -> Optional[str]:
2. click.echo(f"Fetching {url}")
3. for try_item in range(int(number_of_retries)):
4. try:
5. async with httpx.AsyncClient(proxies=PROXIES) as client:
6. response = await client.get(url, follow_redirects=True)
7. content_type = response.headers.get('Content-Type').split(';')[0]
8. extension = content_type.split('/')[-1].lower()
9. if response.status_code == 200:
10. f_name = sha256(url.encode('utf-8')).hexdigest()
11. output_file = f"/tmp/{f_name}.{extension}"
12. with open(output_file, "wb") as f:
13. data = response.content
14. f.write(data)
15. click.echo(f"URL: {url} [saved] under {output_file}")
16. return data.decode('utf-8')
17. except Exception as e:
18. click.echo(f"We failed to fetch URL {url} with exception: {e}")
19. response = None
20. click.echo(f"Fetching {url} failed, sleeping {SLEEP_TIME}s before retry")
21. if response is None or response.status_code != 404:
22. await asyncio.sleep(SLEEP_TIME)
Code 5.53
Pretty simple, right? But this solution has one limitation: we define a single proxy service per protocol. We would like a random proxy to be used for each external call. To achieve this, let us use the following example:
1. import random
2.
3. _PROXIES_HTTP = ("http://proxy.foo.com:8030", "http://proxy2.foo.c
om", "http://proxy3.foo.com")
4. _PROXIES_HTTPS = ("https://http-proxy1.foo.com", "https://http-
proxy2.foo.com","https://http-proxy3.foo.com")
5.
6. async def fetch_url(self, url: str, number_of_retries: int) -> Optional[str]:
7. # pick a random proxy per protocol for this call
8. my_proxies = {
9. "http://": random.choice(_PROXIES_HTTP),
10. "https://": random.choice(_PROXIES_HTTPS),
11. }
12. click.echo(f"Fetching {url}")
13. for try_item in range(int(number_of_retries)):
14. try:
15. async with httpx.AsyncClient(proxies=my_proxies) as client:
16. ...
Code 5.54
In Code 5.54 we pick a random proxy from the predefined lists for each call. This way, there is less probability that the destination server will notice that all the requests are coming from the same place and block us.
Conclusion
In the world of web crawlers (sometimes called web spiders), what is most important, in my opinion, is that they are very agnostic and can crawl any URL. Based on the given parameters, that is, the level of recursion or the types of files to extract, they can fetch the requested resources without any hardcoded logic. Being agnostic and efficient, with parallel processing, is a must-have.
To make our crawler even more efficient, we should also implement timeout support. We do not want to get stuck on a request to a resource that cannot be reached at the moment or is a dead end, since we have to crawl with high efficiency.
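httpx supports timeouts out of the box; here is a minimal sketch of what such a change could look like (the budget values are assumptions to tune):
1. import httpx
2.
3. TIMEOUT = httpx.Timeout(10.0, connect=5.0)  # 10s per request, 5s to connect
4.
5. async def fetch(url: str) -> str:
6.     async with httpx.AsyncClient(timeout=TIMEOUT, follow_redirects=True) as client:
7.         response = await client.get(url)
8.         return response.text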
Another code path worth implementing is support for HTTP/224. The library that we use in our exercises (httpx) can easily support such requests25. As you can see, there is still a lot of room for improvement.
In the next chapter, we will learn how to use Python as a tool to build an effective virus scanner. We will also learn how Python can work with low-level operating system facilities.
1. https://en.wikipedia.org/wiki/Python_(programming_language)
2. https://ipython.org/
3. https://en.wikipedia.org/wiki/Regular_expression
4. https://docs.python.org/3/library/csv.html
5. https://beautiful-soup-4.readthedocs.io/en/latest/
6. https://en.wikipedia.org/wiki/FIFO_(computing_and_electronics)
7. https://docs.python.org/3/library/queue.html
8. https://pypi.org/project/click/
9. https://docs.python-requests.org/en/latest/index.html
10. https://twisted.org
11. https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
12. https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/503
13. https://docs.python.org/3/library/multiprocessing.html
14. https://docs.python.org/3/library/threading.html
15. https://wiki.python.org/moin/GlobalInterpreterLock
16. https://www.python-httpx.org
17. https://twisted.org
18. https://docs.python.org/3/library/asyncio.html
19. https://docs.python.org/3/library/contextlib.html
20. https://pypi.org/project/asyncio-pool/
21. https://en.wikipedia.org/wiki/Proxy_server
22. https://github.com/anapeksha/python-proxy-server
23. https://www.python-httpx.org/advanced/#http-proxying
24. https://en.wikipedia.org/wiki/HTTP/2
25. https://www.python-httpx.org/http2/
CHAPTER 6
Create Your Own Virus Detection
System
Introduction
Computer viruses can certainly be a very big problem for every computer owner. They can spread rapidly through a computer system and affect the entire back office of an enormous organization just as easily as a single laptop at home.
In this chapter, we will learn how to write a simple yet very powerful virus scanner using Python. We will go step by step, from understanding how viruses can be detected to excluding them from use by the operating system, the so-called quarantine.
Structure
In this chapter, we will discuss the following topics:
Building files and directories scanner
Calculating hashing keys
Introducing viruses
Use and update viruses DB
Building map of suspicious files
Parallel processing
Objectives
After reading this chapter, you should know how to build your own virus scanner and how to get your tool working on your local system. We will also learn how to effectively fetch the latest virus definitions and use them.
Introducing viruses
So far, we have learned how to scan local folders and build a list of discovered files. On top of that, we managed to understand how to fingerprint those files. You are probably wondering where this is leading us. Before we connect the dots, we should understand some basics about viruses.
What is a computer virus? According to the internet8, “A computer virus is a
type of computer program that, when executed, replicates itself by
modifying other computer programs and inserting its own code into those
programs. If this replication succeeds, the affected areas are then said to be
"infected" with a computer virus, a metaphor derived from biological
viruses.”
So, we can see that a virus is a computer program which can be executed on your personal computer. Since it exists as a file, a hash can be calculated on top of it. That means we can easily identify which files on our local filesystem are suspicious and may be a virus. These are the simplest kinds of viruses, the ones identified as single files. Sometimes they come in packages, but if we calculate hashes for them, they can be hunted easily. For more complex use cases, we would have to analyze files more deeply and compare chunks of code (fingerprints) against a known virus database.
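As a reminder, a minimal sketch of such a fingerprint, a SHA-256 hash calculated over a file's bytes, looks like this:
1. from hashlib import sha256
2.
3. def file_sha256(path: str) -> str:
4.     h = sha256()
5.     with open(path, "rb") as f:
6.         # read in chunks so large files never have to fit in memory
7.         for chunk in iter(lambda: f.read(8192), b""):
8.             h.update(chunk)
9.     return h.hexdigest()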
2.
3. <OUTPUT>
4. File: /var/tmp/cbrgbc_1.sqlite, hash:
ebf454d4b0d094cedf591c6dbe370c4796572a67139174da72559156dd2
265ed,
status: virus!
5. File: /var/tmp/cbrgbc_1.sqlite-shm, hash:
61db3163315e6b3b08a156f60812ca5efff009323420aa994d6bdedaf85a
feb0,
status: ok
6. File: /var/tmp/cbrgbc_1.sqlite-wal, hash:
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852
b855,
status: ok
Code 6.11
By running Code 6.11 we detected that one file looks suspicious (line 4), because its hash (line 4) is listed in our virus hash list (Code 6.10, line 1). In such a case, we can not only detect that the file is a virus but also remove it from our computer. In Code 6.12, we modify the printing statement in such a way that we can remove the infected file.
1. import os
2.
3. VIRUSES_LIST = []
4.
5. def is_virus(file_path, hash_value):
6. status = True
7. if hash_value in VIRUSES_LIST:
8. status = False
9.
10. if status:
11. print(f"File: {file_path}, hash: {hash_value}, status: [ok]")
12. else:
13. print(f"File: {file_path}, hash: {hash_value}, status: virus! removi
ng...")
14. try:
15. os.remove(file_path)
16. except OSError:
17. print("Seem like detected file can't be remove at
the moment, in use?")
18.
19.
20. @click.command()
21. @click.option("--virus_def", help="File with virus definition", required=True)
22. @click.option("--fpath", help="Path to start scanning", required=True)
23. def main(fpath, virus_def):
24. with open(virus_def, 'rb') as f:
25. VIRUSES_LIST.extend(f.read().decode('utf-8').replace(' ', '').split('\n'))
26.
27. for file_path in scanner(fpath):
28. hash_value = calculate_hash(file_path)
29. is_virus(file_path, hash_value)
Code 6.12
We slightly modified the main function (lines 23-29) by loading the virus definition file content into the global list (lines 24-25), and we split detecting a virus and removing the infected file into a separate function. This way it is cleaner and easier to analyze the given parameters: the file location and its calculated hash. Once a file is detected as a virus, it is removed.
Please notice that we catch the OS exception (lines 16-17) because, if we deal with a real virus, it may currently be executing, and by being up and running (cloning itself, for instance) it can stop us from removing the running file. We will deal with such a case in the following examples.
Reading the virus hash list from a file brings a few challenges. If we have multiple files like that, we cannot read all of them at once. Even if we manage to fix Code 6.12 so it can read virus hash definitions from many files, a few issues remain:
Each time we start our script, we have to load these files all over again.
Keeping the result of loading such files in memory can lead to a significant memory footprint for our code.
To solve this, we are going to use a local database that stores the virus hash definitions. In Code 6.13, we create a separate script that loads the virus hashes into our DB:
1. import click
2. import os
3. import sqlite3
4.
5. DB_FILENAME = "virus.db"
6.
7.
8. class VirusDB:
9.
10. def __init__(self):
11. self.conn = sqlite3.connect(DB_FILENAME)
12.
13. def _execute(self, sql):
14. print(f"Executing: {sql}")
15. cursor = self.conn.cursor()
16. cursor.execute(sql)
17. return cursor.fetchall()
18.
19. def _commit(self, sql):
20. print(f"Insert/update: {sql}")
21. cursor = self.conn.cursor()
22. cursor.execute(sql)
23. return self.conn.commit()
24.
25. def init_table(self):
26. sql = """CREATE TABLE IF NOT EXISTS virus_db (
27. id INTEGER PRIMARY KEY AUTOINCREMENT,
28. virus_hash TEXT UNIQUE,
29. created_at TIMESTAMP DEFAULT CURRENT_TIMESTAM
P
30. )"""
31. print(self._execute(sql))
32.
33. def import_data(self, sources):
34. print(f"Importing: {sources}")
35. for source in sources:
36. assert os.path.exists(source), f"File {source} does not exist"
37. with open(source, 'r', encoding='utf-8') as f:
38. for line in f:
39. data = line.strip().strip('\n')
40. sql = f"INSERT OR IGNORE INTO virus_db (virus_hash)
values ('{data}')"
41. self._commit(sql)
42.
43.
44. @click.command()
45. @click.option("--source", help="File with virus definition", multiple=True, type=str)
46. @click.option("--operation", help="Operation type", required=True, type=click.Choice(['init', 'import']))
47. def main(operation, source):
48. v = VirusDB()
49. if operation == 'init':
50. v.init_table()
51. elif operation == 'import':
52. assert source, 'We need source value'
53. v.import_data(source)
54.
55.
56. if __name__ == '__main__':
57. main()
Code 6.13
We created a script that uses the SQLite10 database engine to store all the virus hash values read from the given virus hash list files. We run our script as in Code 6.14. First, we initialize the database file.
1. $ python code_6.13.py --operation=init
2.
3. # output
4. Executing: CREATE TABLE IF NOT EXISTS virus_db (
5. id INTEGER PRIMARY KEY AUTOINCREMENT,
6. virus_hash TEXT UNIQUE,
7. created_at TIMESTAMP DEFAULT CURRENT_TIMESTAM
P
8. )
Code 6.14
We used the special SQL syntax CREATE TABLE IF NOT EXISTS, which helps avoid problems when we run the init script multiple times. It is easy to see that all data will be stored in the table virus_db. The database file we are using is defined in Code 6.13, line 5. Once the DB is initialized, we can start loading data. Let us use the same file as in Code 6.11. Additionally, we create a second file called example_virus_sha256_2.bin with the example content in Code 6.15:
1. 1bf454d4b0d094cedf591c6dbe370c4796572a67139174da72559156dd
2265e1
2. 11db3163311e6b3b08a156360812ca5efff0093234201a994d6bdedaf85a
feb1
Code 6.15
In the following example, we run our main script (Code 6.13) with two virus definition source files:
1. $ python code_6.13.py --operation=import --source=example_virus_sha256.bin --source=example_virus_sha256_2.bin
2.
3. # output
4. Importing: ('example_virus_sha256.bin', 'example_virus_sha256_2.bin'
)
5. Insert/update: INSERT OR IGNORE INTO virus_db (virus_hash) valu
es
('ebf454d4b0d094cedf591c6dbe370c4796572a67139174da72559156dd
2265ed')
6. Insert/update: INSERT OR IGNORE INTO virus_db (virus_hash) valu
es
('61db3163315e6b3b08a156360812ca5efff0093234201a994d6bdedaf8
5afeb0')
7. Insert/update: INSERT OR IGNORE INTO virus_db (virus_hash) valu
es
('1bf454d4b0d094cedf591c6dbe370c4796572a67139174da72559156d
d2265e1')
8. Insert/update: INSERT OR IGNORE INTO virus_db (virus_hash) valu
es
('11db3163311e6b3b08a156360812ca5efff0093234201a994d6bdedaf8
5afeb1')
Code 6.16
We uploaded all the files' content into our database (Code 6.13, lines 33-41), and now we can use these hashes to identify all potential viruses. Please notice that reading data from the database and inserting/updating it are slightly different. For reading, we use the fetchall method (Code 6.13, lines 14-17) on a cursor created from the connection. For inserting/updating, we do something similar but then need to commit the SQL statement via the connection to the database (Code 6.13, lines 20-23). Most Python database drivers work this way11.
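One thing worth flagging: our scripts build SQL with f-strings, which is vulnerable to SQL injection if a hash value ever comes from an untrusted source. A safer sketch uses the sqlite3 driver's "?" placeholders (the hash below is just the example value from Code 6.11):
1. import sqlite3
2.
3. conn = sqlite3.connect("virus.db")
4. cursor = conn.cursor()
5. some_hash = "ebf454d4b0d094cedf591c6dbe370c4796572a67139174da72559156dd2265ed"
6. # the driver escapes the bound value for us, no f-string needed
7. cursor.execute("SELECT * FROM virus_db WHERE virus_hash=? LIMIT 1", (some_hash,))
8. print(cursor.fetchall())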
In Code 6.13, we also use a feature of the click library that forces the user to choose from a limited set of options for a command-line parameter (Code 6.13, line 46), as shown in Code 6.17. We also made that parameter mandatory, so the user cannot skip it (Code 6.13, line 46). At the same time, we kept the source parameter optional (Code 6.13, line 45) to allow the user to specify the location of the files with virus definitions.
1. python code_6.13.py
2.
3. # OUTPUT
4. Usage: code_6.13.py [OPTIONS]
5. Try 'code_6.13.py --help' for help.
6.
7. Error: Missing option '--operation'. Choose from:
8. init,
9. import
Code 6.17
In Code 6.18 we modify our scanner in such a way that it uses our newly created database of virus hashes:
1. import click
2. import os
3. import sqlite3
4. from hashlib import sha256
5.
6. DB_FILENAME = "virus.db"
7.
8.
9. class VirusScanner:
10.
11. def __init__(self):
12. self.conn = sqlite3.connect(DB_FILENAME)
13.
14. def _execute(self, sql):
15. cursor = self.conn.cursor()
16. cursor.execute(sql)
17. return cursor.fetchall()
18.
19. def check_hash(self, has_value) -> bool:
20. sql = f"SELECT * FROM virus_db WHERE virus_hash='{has_va
lue}' LIMIT 1"
21. cursor = self.conn.cursor()
22. cursor.execute(sql)
23. return True if cursor.fetchall() else False
24.
25. def is_virus(self, file_path, hash_value):
26. if not hash_value:
27. return
28. if self.check_hash(hash_value):
29. print(f"File: {file_path}, hash: {hash_value}, status: virus! rem
oving...")
30. try:
31. os.remove(file_path)
32. except OSError:
33. print("Seem like detected file can't be
remove at the moment, in use?")
34. else:
35. print(f"File: {file_path}, hash: {hash_value}, status: [ok]")
36.
37. def scanner(self, file_path: str):
38. for (root, dirs, files) in os.walk(file_path, topdown=True):
39. for f in files:
40. yield os.path.join(root, f)
41.
42. def calculate_hash(self, file_path: str) -> str:
43. if not file_path:
44. return
45. try:
46. with open(file_path, "rb") as f:
47. file_hash = sha256()
48. chunk = f.read(8192)
49. while chunk:
50. file_hash.update(chunk)
51. chunk = f.read(8192)
52.
53. return file_hash.hexdigest()
54. except OSError:
55. print(f'File {file_path} can not be opened at
the moment, skipping')
56.
57. def analyze(self, fpath):
58. for file_path in self.scanner(fpath):
59. hash_value = self.calculate_hash(file_path)
60. self.is_virus(file_path, hash_value)
61.
62.
63. @click.command()
64. @click.option("--fpath", help="Path to start scanning", required=True)
65. def main(fpath):
66. v = VirusScanner()
67. v.analyze(fpath)
68.
69.
70. if __name__ == '__main__':
71. main()
Code 6.18
With Code 6.18, we scan for files in the same way as before, with the difference that when we check whether the file being validated is on the blacklist, we use the database (Code 6.18, lines 19-23). We also added small improvements: when the file cannot be read for validation (Code 6.18, lines 42-55), we do not crash if the file is no longer present or the resource is busy and in use by another process (lines 32-33).
The whole concept so far is based on the fact that we have the virus hash lists in flat files. How about getting such a hash definition list automatically from some internet source, so we do not have to worry about manual downloads?
In Code 6.19, we created a simple snippet that fetches SHA-256 hashes from VirusBay12 and creates a flat file.
1. import requests
2.
3. MAIN_URL = 'https://beta.virusbay.io/sample/data'
4.
5. response = requests.get(MAIN_URL)
6. if response.status_code == 200:
7. data = response.json()
8. with open('virusbay.bin', 'w', encoding='utf8') as f:
9. for item in data['recent']:
10. virus_md5 = item['md5']
11. details_url = f'https://beta.virusbay.io/sample/data/{virus_md5}'
12. details_response = requests.get(details_url)
13. if details_response.status_code == 200:
14. data = details_response.json()
15. if 'sha256' in data:
16. f.write(f"{data['sha256']}\n")
Code 6.19
With this simple script we fetch the latest virus SHA-256 hashes from VirusBay. Once the file virusbay.bin has been created, we can import its content into our virus database as in Code 6.20:
1. python code_6.13.py --operation=import --source=virusbay.bin
Code 6.20
So far, we have managed to build a script that analyzes a given folder and checks whether a found file's SHA-256 hash matches the virus hash list. That approach may seem to be enough, but in real-world scenarios a virus may be smarter and hide in, for example, ZIP files. With Code 6.33, we modify our base Code 6.18 in such a way that we can unzip compressed ZIP files and analyze the uncompressed content (files) that may be a virus.
Before we start modifying the script, we need to install the magic library13, which is going to help us analyze file types.
1. pip install python-magic
Code 6.21
You might be wondering why we cannot simply use the file extension, like .zip. The reason is that the file extension can be misleading about the actual file type. Each file in the file system has a digital fingerprint, called metadata, that describes the file type to the operating system. It is something like a file header. It can also be faked or broken by malicious software, but in this example we will focus on how to read these headers and unzip files. To demonstrate why using metadata to check the file type is the better idea, let us check Code 6.22.
Let us use a PDF file as an example, rename its file extension to .txt, and run Code 6.22:
1. import magic
2.
3. print(magic.from_file("test.txt"))
4.
5. # OUTPUT
6. 'PDF document, version 1.2'
Code 6.22
As you can see, this is the right way to analyze file types and properly check what kind of file we are dealing with. In Code 6.33, we will use this library to detect when we are facing a ZIP file, so we can extract and analyze its content:
1. import click
2. import os, shutil
3. import magic
4. import sqlite3
5. import uuid
6. import zipfile
7. from hashlib import sha256
8.
9. DB_FILENAME = "virus.db"
10.
11.
12. class VirusScanner:
13.
14. def __init__(self):
15. self.conn = sqlite3.connect(DB_FILENAME)
16.
17. def _execute(self, sql):
18. cursor = self.conn.cursor()
19. cursor.execute(sql)
20. return cursor.fetchall()
21.
22. def check_hash(self, has_value) -> bool:
23. sql = f"SELECT * FROM virus_db WHERE virus_hash='
{has_value}' LIMIT 1"
24. cursor = self.conn.cursor()
25. cursor.execute(sql)
26. return True if cursor.fetchall() else False
27.
28. def is_virus(self, file_path, hash_value):
29. if not hash_value:
30. return
31. if self.check_hash(hash_value):
32. print(f"File: {file_path}, hash: {hash_value},
status: virus! removing...")
33. try:
34. os.remove(file_path)
35. except OSError:
36. print("Seem like detected file can't be remove at
the moment, in use?")
37. else:
38. print(f"File: {file_path}, hash: {hash_value},
status: [ok]")
39.
40. def scanner(self, file_path: str):
41. for (root, dirs, files) in os.walk(file_path, topdown=True):
42. for f in files:
43. yield os.path.join(root, f)
44.
45. def calculate_hash(self, file_path: str) -> str:
46. if not file_path:
47. return
48. try:
49. with open(file_path, "rb") as f:
50. file_hash = sha256()
51. chunk = f.read(8192)
52. while chunk:
53. file_hash.update(chunk)
54. chunk = f.read(8192)
55.
56. return file_hash.hexdigest()
57. except OSError:
58. print(f'File {file_path} can not be opened at the moment, skippi
ng')
59.
60. def analyze_zip(self, fpath):
61. extract_dir = "/tmp/{tmp_id}/".format(tmp_id=str(uuid.uuid4()))
62. with zipfile.ZipFile(fpath, 'r') as zip_ref:
63. zip_ref.extractall(extract_dir)
64. self.analyze(extract_dir)
65. shutil.rmtree(extract_dir)
66.
67. def analyze(self, fpath):
68. for file_path in self.scanner(fpath):
69. try:
70. hash_value = self.calculate_hash(file_path)
71. self.is_virus(file_path, hash_value)
72. if 'zip' in magic.from_file(file_path).lower():
73. self.analyze_zip(file_path)
74. except OSError:
75. print(f'File {file_path} can not be opened at the moment, skip
ping')
76.
77.
78. @click.command()
79. @click.option("--fpath", help="Path to start scanning", required=True)
80. def main(fpath):
81. v = VirusScanner()
82. v.analyze(fpath)
83.
84.
85. if __name__ == '__main__':
86. main()
Code 6.33
In Code 6.33, we test whether the file we found is infected with a virus (lines 70-71) and then whether it is a ZIP file (lines 72-73). Once we know it is a ZIP archive, we unzip it into a temporary folder (lines 61-63), then scan and analyze it (line 64). After that is done, we remove the temporary folder with its content (line 65).
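Incidentally, a variant of analyze_zip (a sketch, not the listing above) could rely on tempfile.TemporaryDirectory, which removes the extraction folder and everything inside it automatically instead of cleaning up by hand:
1. import tempfile
2. import zipfile
3.
4. def analyze_zip(self, fpath):
5.     # the with-block deletes extract_dir and all extracted files on exit
6.     with tempfile.TemporaryDirectory() as extract_dir:
7.         with zipfile.ZipFile(fpath, 'r') as zip_ref:
8.             zip_ref.extractall(extract_dir)
9.         self.analyze(extract_dir)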
2.
3. ͍18D%�Q���٬uI��ujz|��R�ۇfX�������a�Zd�
4. (... the rest of the output is similarly scrambled, unreadable bytes ...)
As you can see, it is useless scrambled data, no longer the original virus. In this state, the infected file cannot be opened and executed to infect more files. Of course, we have a cure to reverse the encryption process. Let us check Code 6.47:
1. enc = FilesEncoder()
2. enc.decrypt_file('data.txt.bin')
Code 6.47
Once this code is called, the encrypted file will be restored to its original
state under the name data.txt.
Now, let us modify the base Code 6.33 so that, instead of removing
suspicious files, we encrypt them and remove the originals.
1. import click
2. import os
3. import magic
4. import sqlite3
5. import uuid
6. import zipfile
7. from enrypt_files import FilesEncoder
8. from hashlib import sha256
9. import shutil
10. DB_FILENAME = "virus.db"
11.
12.
13. class VirusScanner:
14.
15. def __init__(self):
16. self.conn = sqlite3.connect(DB_FILENAME)
17. self.encryptor = FilesEncoder()
18. self.locks = {}
19. self.files_to_lock = []
20. self.files_to_remove_lock = []
21.
22. def _execute(self, sql):
23. cursor = self.conn.cursor()
24. cursor.execute(sql)
25. return cursor.fetchall()
26.
27. def check_hash(self, hash_value) -> bool:
28. sql = f"SELECT * FROM virus_db WHERE virus_hash='{hash_value}' LIMIT 1"
29. cursor = self.conn.cursor()
30. cursor.execute(sql)
31. return True if cursor.fetchall() else False
32.
33. def is_virus(self, file_path, hash_value):
34. if not hash_value:
35. return
36. if self.check_hash(hash_value):
37. print(f"File: {file_path}, hash: {hash_value},
status: virus! removing...")
38. try:
39. self.encryptor.encrypt_file(file_path)
40. os.remove(file_path)
41. except OSError:
42. print("Seem like detected file can't be
remove at the moment, in use?")
43. else:
44. print(f"File: {file_path}, hash:
{hash_value}, status: [ok]")
45.
46. def scanner(self, file_path: str):
47. for (root, dirs, files) in os.walk(file_path, topdown=True):
48. for f in files:
49. yield os.path.join(root, f)
50.
51. def calculate_hash(self, file_path: str) -> str:
52. if not file_path:
53. return
54. try:
55. with open(file_path, "rb") as f:
56. file_hash = sha256()
57. chunk = f.read(8192)
58. while chunk:
59. file_hash.update(chunk)
60. chunk = f.read(8192)
61.
62. return file_hash.hexdigest()
63. except OSError:
64. print(f'File {file_path} cannot be opened at the moment, skipping')
65.
66. def analyze_zip(self, fpath):
67. extract_dir = "/tmp/{tmp_id}/".format(tmp_id=str(uuid.uuid4()))
68. with zipfile.ZipFile(fpath, 'r') as zip_ref:
69. zip_ref.extractall(extract_dir)
70. self.analyze(extract_dir)
71. shutil.rmtree(extract_dir)
72.
73. def analyze(self, fpath):
74. for file_path in self.scanner(fpath):
75. try:
76. hash_value = self.calculate_hash(file_path)
77. self.is_virus(file_path, hash_value)
78. if 'zip' in magic.from_file(file_path).lower():
79. self.analyze_zip(file_path)
80. except OSError:
81. print(f'File {file_path} cannot be opened at the moment, skipping')
82.
83.
84. @click.command()
85. @click.option("--fpath", help="Path to start scanning", required=True)
86. def main(fpath):
87. v = VirusScanner()
88. v.analyze(fpath)
89.
90.
91. if __name__ == '__main__':
92. main()
Code 6.48
Parallel processing
So far, we have managed to build powerful tools that analyze the file system
to check for and mark files that look suspicious or are infected with a virus.
When we analyze a single directory, walking linearly over the file system is
effective enough for a single main folder; however, when we want to analyze
an entire file system with hundreds of thousands of files spread across
thousands of directories, this method is going to be very slow.
To address this kind of problem, we are going to update our previous file
system scanning example and introduce a more effective way of processing
directories: parallel programming. We could use threads or process-driven
development in this case, although we will use the same approach that we
already learned in previous chapters – asynchronous programming. File
system operations are a very good example of where we can use this
technique. First, we must install a few Python modules (asyncio itself ships
with the standard library), like in the following example.
1. $ pip install asyncio_pool aiofiles
Code 6.49
After installing the modules, we will build a simple example demonstrating
how to use parallel scanning for directories. Let us check the following example.
1. import os
2. import asyncio
3. from aiofiles import os as asyncio_os
4.
5.
6. async def async_scan_dir(dir_path):
7.
8. dirs = []
9. dir_list = await asyncio_os.listdir(dir_path)
10. for check_path in dir_list:
11. v_path = os.path.join(dir_path, check_path)
12. is_dir = await asyncio_os.path.isdir(v_path)
13. if is_dir:
14. dirs += await async_scan_dir(v_path)
15. else:
16. dirs.append(v_path)
17.
18.
19. return dirs
20. async def get_result(dir_path="/tmp"):
21. result = await async_scan_dir(dir_path)
22.
23.
24. print(f"result: {result}")
25. asyncio.run(get_result())
Code 6.50
When we run example Code 6.50, we will get as a result a list of all the files
found in the /tmp directory. As is easy to notice, we used the async file
library (aiofiles) that helps with parallel scanning, and we call the function
async_scan_dir (Code 6.50, line 6) recursively to build the
final list of files picked up from the given directory.
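Note that awaiting each subdirectory one after another still visits the tree mostly sequentially; the real gain appears once several directory operations are in flight at the same time. The following is a small sketch of how the same recursion could be fanned out with asyncio.gather; this is our own variation on Code 6.50, not a listing from the book.
import os
import asyncio
from aiofiles import os as asyncio_os

async def async_scan_dir_parallel(dir_path):
    files, subdir_tasks = [], []
    for name in await asyncio_os.listdir(dir_path):
        v_path = os.path.join(dir_path, name)
        if await asyncio_os.path.isdir(v_path):
            # schedule the subdirectory scan instead of awaiting it immediately
            subdir_tasks.append(async_scan_dir_parallel(v_path))
        else:
            files.append(v_path)
    # run all subdirectory scans concurrently and flatten their results
    for sub_result in await asyncio.gather(*subdir_tasks):
        files += sub_result
    return files

# usage sketch:
# print(asyncio.run(async_scan_dir_parallel("/tmp")))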
Let us now refactor example Code 6.48 to make it work in the async world.
Check the following example to see how to approach this.
1. import asyncio
2. import click
3. import os
4. import magic
5. import sqlite3
6. import uuid
7. import zipfile
8. from enrypt_files import FilesEncoder
9. from aiofiles import os as asyncio_os
10. from hashlib import sha256
11. import shutil
12. DB_FILENAME = "virus.db"
13.
14.
15. class VirusScanner:
16.
17. def __init__(self):
18. self.conn = sqlite3.connect(DB_FILENAME)
19. self.encryptor = FilesEncoder()
20. self.locks = {}
21. self.files_to_lock = []
22. self.files_to_remove_lock = []
23.
24. def _execute(self, sql):
25. cursor = self.conn.cursor()
26. cursor.execute(sql)
27. return cursor.fetchall()
28.
29. def check_hash(self, hash_value) -> bool:
30. sql = f"SELECT * FROM virus_db WHERE virus_hash='{hash_value}' LIMIT 1"
31. cursor = self.conn.cursor()
32. cursor.execute(sql)
33. return True if cursor.fetchall() else False
34.
35. def is_virus(self, file_path, hash_value):
36. if not hash_value:
37. return
38. if self.check_hash(hash_value):
39. print(f"File: {file_path}, hash: {hash_value},
status: virus! removing...")
40. try:
41. self.encryptor.encrypt_file(file_path)
42. os.remove(file_path)
43. except OSError:
44. print("Seem like detected file can't be
remove at the moment, in use?")
45. else:
46. print(f"File: {file_path}, hash: {hash_value},
status: [ok]")
47.
48. def calculate_hash(self, file_path: str) -> str:
49. if not file_path:
50. return
51. try:
52. with open(file_path, "rb") as f:
53. file_hash = sha256()
54. chunk = f.read(8192)
55. while chunk:
56. file_hash.update(chunk)
57. chunk = f.read(8192)
58.
59. return file_hash.hexdigest()
60. except OSError:
61. print(f'File {file_path} cannot be opened at the moment, skipping')
62.
63. async def analyze_zip(self, fpath):
64. extract_dir = "/tmp/{tmp_id}/".format(tmp_id=str(uuid.uuid4()))
65. with zipfile.ZipFile(fpath, 'r') as zip_ref:
66. zip_ref.extractall(extract_dir)
67. await self.analyze(extract_dir)
68. shutil.rmtree(extract_dir)
69.
70. async def async_scan_dir(self, dir_path):
71. dirs = []
72. dir_list = await asyncio_os.listdir(dir_path)
73. for check_path in dir_list:
74. v_path = os.path.join(dir_path, check_path)
75. is_dir = await asyncio_os.path.isdir(v_path)
76. if is_dir:
77. dirs += await self.async_scan_dir(v_path)
78. else:
79. dirs.append(v_path)
80. return dirs
81.
82. async def analyze(self, fpath):
83. for file_path in await self.async_scan_dir(fpath):
84. try:
85. hash_value = self.calculate_hash(file_path)
86. self.is_virus(file_path, hash_value)
87. if 'zip' in magic.from_file(file_path).lower():
88. await self.analyze_zip(file_path)
89. except OSError:
90. print(f'File {file_path} cannot be opened at the moment, skipping')
91.
92.
93. @click.command()
94. @click.option("--fpath", help="Path to start scanning", required=True)
95. def main(fpath):
96. v = VirusScanner()
97. asyncio.run(v.analyze(fpath))
98.
99.
100. if __name__ == '__main__':
101. main()
Code 6.51
We refactored the parts of the virus scanner that deal with the OS and the
file system (lines 70-80 and 63-68) so that they work as async operations.
This will help make the whole code more efficient.
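One caveat worth flagging: calculate_hash still performs blocking reads, so while a large file is being hashed, the event loop cannot make progress on anything else. A hedged tweak, assuming Python 3.9 or newer for asyncio.to_thread, is to push the hashing into a worker thread; the sketch below is our suggestion, not part of the book's listing.
async def analyze(self, fpath):
    for file_path in await self.async_scan_dir(fpath):
        try:
            # run the blocking hash computation in a thread so the
            # event loop stays free for other file system operations
            hash_value = await asyncio.to_thread(self.calculate_hash, file_path)
            self.is_virus(file_path, hash_value)
            if 'zip' in magic.from_file(file_path).lower():
                await self.analyze_zip(file_path)
        except OSError:
            print(f'File {file_path} cannot be opened at the moment, skipping')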
Conclusion
This chapter demonstrated how Python can be used for analyzing the file
system while hunting for viruses. It is not a difficult technique to encrypt
the infected files we find and keep them locally. Next, we can send them to
online services where advanced technology can analyze the given files and
try to cure them.
Viruses are obviously more complex than flat files: they can hide inside
infected files, replicate themselves in memory, and use lots of other
sophisticated techniques. Malicious software is always one step ahead of the
people writing antivirus software. It is better if you run a local antivirus
scanner frequently and stay aware of the links you click.
In the next chapter, we will learn how we can use Python with crypto coins:
how we can analyze crypto currency exchange markets, and where Python
is going to help us with crypto wallets.
1. https://click.palletsprojects.com/en/8.1.x/
2. https://docs.python.org/3/library/hashlib.html
3. https://datatracker.ietf.org/doc/html/rfc1321
4. https://www.thesslstore.com/blog/difference-sha-1-sha-2-sha-256-hash-algorithms/
5. https://eprint.iacr.org/2004/199.pdf
6. https://eprint.iacr.org/2011/037.pdf
7. https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Blocks
8. https://en.wikipedia.org/wiki/Computer_virus
9. https://github.com/Len-Stevens/MD5-Malware-Hashes
10. https://www.sqlite.org/index.html
11. https://docs.python.org/3/library/sqlite3.html
12. https://beta.virusbay.io/sample/browse
13. https://pypi.org/project/python-magic/
14. https://docs.python.org/3/library/fcntl.html
15. http://0pointer.de/blog/projects/locking.html
16. https://www.gnu.org/software/libc/manual/html_node/File-Locks.html
17. https://web.archive.org/web/20160306104007/http://research.microsoft.com/en-us/projects/cryptanalysis/aesbc.pdf
18. https://pypi.org/project/aiofiles/
OceanofPDF.com
CHAPTER 7
Create Your Own Crypto Trading
Platform
Introduction
Crypto currencies have been on the market for quite a while, so it is no
secret that, after getting lots of hype, they have become a standard payment
platform in the digital world. Their exchange rates move up and down very
rapidly, so working with crypto assets can be fun and a bit of a challenge at
the same time. In this chapter, we will learn how to utilize Python as a tool
for crypto currencies.
Structure
In this chapter, we will discuss the following topics:
Brief introduction to crypto market
Building client for crypto market
Trends analyzer
Integrating with crypto wallet
Purchase and sell
Objectives
After reading this chapter, you should know how to build your own crypto
market trading platform client, manage your crypto assets and use Python to
build simple yet powerful money exchange applications.
Currencies
Before we can build any trading platform client, we need to collect all the
crypto currencies that will be analyzed and traced. We have a few options
for getting the crypto currency codes:
For instance, we can add them manually to our application, but this is
going to be a very time-consuming process if we want to follow many
currencies.
The other option, which we will use in the following example, is to fetch
the currency codes from an existing crypto trading website.
In the following example, we will use coinmarketcap.com.
1. import json
2. import re
3. import requests
4. from pprint import pprint
5.
6. URL = "https://coinmarketcap.com/all/views/all/"
7. JSON_DATA = re.compile(r'<script\s+id="__NEXT_DATA__"\s+type="application/json">(.*?)</script>')
8.
9.
10. def main():
11. raw_data = requests.get(URL).text
12. data = json.loads(JSON_DATA.findall(raw_data).pop())
13. result = {}
14. for item in json.loads(data['props']['initialState'])['cryptocurrency']['listingLatest']['data']:
15. try:
16. result[item[30]] = item[10]
17. except (IndexError, KeyError):
18. pass
19. return result
20.
21. if __name__ == '__main__':
22. pprint(main())
Code 7.1
In this script, we use a trick: we found that the body of the
coinmarketcap.com website (line 6) contains a JavaScript section that
defines all the crypto currencies supported on the website. Following this
logic, we extract that JavaScript part (line 12). Further, we extract from the
resulting Python dictionary the part that has the crypto currency codes along
with their corresponding names. The output of running our script is going to
look like the following code:
1. python get_crypto_codes.py
2.
3. # output
4. {'1INCH': '1inch Network',
5. 'AAVE': 'Aave',
6. 'ACH': 'Alchemy Pay',
7. 'ADA': 'Cardano',
8. 'AGIX': 'SingularityNET',
9. ...
Code 7.2
We now have all the popular crypto currencies that are on the market.
However, we cannot just print them on the screen like in example Code 7.1;
we need to store them in a database. In this case, we will use the well-known
database SQLite2 that we have already been using in previous chapters.
Follow these steps to generate connection to database and initialize its
content:
1. First, let us create a script that will create the database structure. In the
following example, we also use a tool that we have already utilized
many times in previous chapters, called Click3.
2. Before we get to the main script, let us create a universal class for
managing the database and related queries. Follow this example to create a
file db.py.
1. import sqlite3
2.
3. DB_FILENAME = "crypto.db"
4.
5.
6. class DB:
7.
8. def __init__(self):
9. self.conn = sqlite3.connect(DB_FILENAME)
10.
11. def execute(self, sql):
12. print(f"Executing: {sql}")
13. cursor = self.conn.cursor()
14. cursor.execute(sql)
15. return cursor.fetchall()
16.
17. def commit(self, sql):
18. print(f"Insert/update: {sql}")
19. cursor = self.conn.cursor()
20. cursor.execute(sql)
21. return self.conn.commit()
22.
23. def init_table(self, table_name):
24. with open(f"{table_name}.sql") as f:
25. print(self.execute(f.read()))
Code 7.3
3. After creating the database class driver, we can create the SQL file
currency.sql that is going to create the table. Here, we will store all the
crypto currencies that we extract from the mentioned website.
1. CREATE TABLE IF NOT EXISTS currency (
2. id INTEGER PRIMARY KEY AUTOINCREMENT,
3. currency_code TEXT UNIQUE,
4. currency_name TEXT,
5. created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
6. );
Code 7.4
4. Having the table definition, we must create a script that will initialize
the database structure. For this, we will use the well-known
click library. In the following example, you can see how we will
approach it:
1. import click
2. from db import DB
3.
4. @click.command()
5. @click.option("--
table", help="Table type", required=True, type=click.Choice(['curren
cy']))
6. def main(table):
7. db = DB()
8. db.init_table(table)
9.
10. if __name__ == '__main__':
11. main()
Code 7.5
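Assuming the script above is saved as, for example, create_table.py (the file name is our assumption; pick any name you like), it can be run as follows:
$ python create_table.py --table currency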
5. Once we have the table and the script to create data storage, we should
modify the code from example 7.1 to store the results in our database table.
6. Check the following example Code 7.6 to see how the modified version of
Code 7.1 uses the database storage.
1. import json
2. import re
3. import requests
4. from db import DB
5.
6. URL = "https://coinmarketcap.com/all/views/all/"
7. JSON_DATA = re.compile(r'<script\s+id="__NEXT_DATA__"\s+type="application/json">(.*?)</script>')
8.
9.
10. def main():
11. db = DB()
12. raw_data = requests.get(URL).text
13. data = json.loads(JSON_DATA.findall(raw_data).pop())
14. result = {}
15. i = 0
16. for item in json.loads(data['props']['initialState'])['cryptocurrency']['listingLatest']['data']:
17. try:
18. result[item[30]] = item[10]
19. sql = f"""INSERT INTO currency(currency_code, currency_
name) VALUES ('{item[30]}', '{item[10]}');"""
20. db.commit(sql)
21. i += 1
22. except (IndexError, KeyError):
23. pass
24. return i
25.
26. if __name__ == '__main__':
27. no_items = main()
28. print(f"Inserted {no_items} items")
Code 7.6
7. Executing the above example Code 7.6 should give us a result like in
the following example, with about 200 records in the DB.
1. $ python code_7.6.py
2.
3. ## result
4.
5. ...
6. Insert/update: INSERT INTO currency(currency_code, currency_name) VALUES ('ZEN', 'Horizen');
7. Insert/update: INSERT INTO currency(currency_code, currency_name) VALUES ('BTRST', 'Braintrust');
8. Insert/update: INSERT INTO currency(currency_code, currency_name) VALUES ('TRAC', 'OriginTrail');
9. Insert/update: INSERT INTO currency(currency_code, currency_name) VALUES ('RBN', 'Ribbon Finance');
10. Insert/update: INSERT INTO currency(currency_code, currency_name) VALUES ('HFT', 'Hashflow');
11. Insert/update: INSERT INTO currency(currency_code, currency_name) VALUES ('METIS', 'MetisDAO');
12. Insert/update: INSERT INTO currency(currency_code, currency_name) VALUES ('JOE', 'JOE');
13. Insert/update: INSERT INTO currency(currency_code, currency_name) VALUES ('AXL', 'Axelar');
14. Inserted 200 items
Code 7.7
8. We inserted 200 currency codes with their corresponding names. What is
going to happen if we run Code 7.6 once again? Let us see in the following
example.
1. $ python code_7.6.py
2.
3.
4. Insert/update: INSERT INTO currency(currency_code, currency_name) VALUES ('BTC', 'Bitcoin');
5. Traceback (most recent call last):
6. File "code_7.6.py", line 27, in <module>
7. no_items = main()
8. File "code_7.6.py", line 20, in main
9. db.commit(sql)
10. File "/Users/hubertpiotrowski/work/fun-with-python/chapter_7/db.py", line 20, in commit
11. cursor.execute(sql)
12. sqlite3.IntegrityError: UNIQUE constraint failed: currency.currency_code
Code 7.8
9. However, we crashed. This is because in the table currency we created a
unique constraint on the column currency_code that prevents us
from inserting the same code into the table multiple times (Code 7.4).
Unfortunately, we did not catch the exception in our Code 7.6 (line 20).
Hence, by trying to insert the same code twice, the database driver
raises an integrity exception, which we should catch so we can continue
(Code 7.8, line 12). Let us modify our example Code 7.6 with proper
exception catching.
1. def main():
2. db = DB()
3. raw_data = requests.get(URL).text
4. data = json.loads(JSON_DATA.findall(raw_data).pop())
5. result = {}
6. i = 0
7. for item in json.loads(data['props']['initialState'])['cryptocurrency']['listingLatest']['data']:
8. try:
9. result[item[30]] = item[10]
10. currency_code = item[30].strip().upper()
11. currency_name = item[10].strip()
12. sql = f"""INSERT INTO currency(currency_code, currency_
name) VALUES ('{currency_code}', '{currency_name}');"""
13. try:
14. db.commit(sql)
15. except sqlite3.IntegrityError:
16. print(f"Currency {currency_code} already exists, skipping
...")
17. except Exception as e:
18. print("Error: ", e)
19. i += 1
20. except (IndexError, KeyError):
21. pass
22. return i
Code 7.9
In example Code 7.9, we modified the main method in such a way that we
can now rerun the import script as many times as we have to, without
crashing either because a currency already exists (line 15) or because of
any other reason (line 17). Remember to add import sqlite3 at the top of the
script so that sqlite3.IntegrityError can be referenced.
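As a side note, building SQL with f-strings works here because we control the input, but sqlite3 also supports parameterized queries, which avoid quoting problems (for example, a currency name containing an apostrophe). A minimal sketch of the same insert with placeholders, reusing the DB class from Code 7.3:
from db import DB

# same insert, but letting the sqlite3 driver do the quoting via ? placeholders
db = DB()
sql = "INSERT INTO currency(currency_code, currency_name) VALUES (?, ?)"
db.conn.cursor().execute(sql, ("BTC", "Bitcoin"))
db.conn.commit()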
Trends analyzer
Now that we have gone through the basics of storing and securing crypto
currencies, it is time to see where and how we can exchange crypto
currencies into a standard currency, for instance, USD. More than a platform
for the exchange itself, it is important to know how much value our crypto
assets have and when the right time to exchange them is.
Let us start with understanding where and how to fetch the latest exchange
rates. Before we start, however, we need to update our database structure.
In the following code, we will create a currency exchange table that stores
the current exchange rates.
1. CREATE TABLE IF NOT EXISTS currency_exchange (
2. id INTEGER PRIMARY KEY AUTOINCREMENT,
3. currency_code TEXT UNIQUE,
4. last_price FLOAT,
5. created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
6. updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
7. );
Code 7.16
We now have a table where we are going to store the currently fetched
exchange rates. Next, we need a similar table to keep historical exchange
values so we can compare them in the future. The mentioned table is going
to be generated like in the following example:
1. CREATE TABLE IF NOT EXISTS currency_exchange_history (
2. id INTEGER PRIMARY KEY AUTOINCREMENT,
3. currency_code TEXT,
4. last_price FLOAT,
5. created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
6. updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
7. CONSTRAINT constraint_name UNIQUE (currency_code, created_at)
8. );
Code 7.17
To apply the table definition files from Code 7.16 and Code 7.17, we need to
modify Code 7.5 to make sure that all the database tables can be created and
used.
1. import click
2. from db import DB
3.
4. @click.command()
5. @click.option("--
table", help="Table type", required=True, type=click.Choice(['currency',
'currency_exchange', 'currency_exchange_history']))
6. def main(table):
7. db = DB()
8. db.init_table(table)
9.
10. if __name__ == '__main__':
11. main()
Code 7.18
Let us run Code 7.18 to create the required tables, like in the following
example. Running Code 7.19 will create the tables in our main database,
crypto.db.
1. python code_7.18.py --table currency_exchange
2. python code_7.18.py --table currency_exchange_history
Code 7.19
We now have the main table together with its historical copy. It is time to
fetch some data and fill the main table with real-time exchange rates. We
have many options for finding a source of crypto market exchange rates;
however, instead of building an unreliable screen scraper to fetch data from
popular websites, we can try a different approach and use an API.
Before we dive into an example of using an API as the source of truth, I
want to highlight that I am not recommending this example API service,
Live Coin Watch10, because it is the best on the market. Please keep in mind
that it is just an example, so if you want to use any other data provider, you
can choose another API and replace the one shown in the following
examples.
1. Let us start by registering at https://www.livecoinwatch.com.
2. There is a tab called API – please go there and create an API key.
3. Next, we have to be aware of some limitations of this service account. It
is free of charge, albeit you can only perform a limited number of requests
per month.
Let us try to send our first request to the API service to check how many
credits we have left, like in the following example, Code 7.20. Before
moving ahead, please remember that it is necessary to have the requests11
package installed, like in the previous examples in the subsection on
currencies.
1. import os
2. import requests
3. from pprint import pprint
4.
5. API_KEY = os.environ.get('API_KEY')
6. assert API_KEY, "variable API_KEY not specified"
7. URL = "https://api.livecoinwatch.com/credits"
8.
9. response = requests.post(URL, headers={"x-api-key": API_KEY, "content-type": "application/json"}).json()
10.
11. print("Credit status:")
12. pprint(response)
Code 7.20
In Code 7.20, we used a very useful trick to read the API access key. Instead
of hardcoding the access token into our code, we read it from an
environment variable (line 5). When the developer executing our code
wants to use their own access token, they need to specify it as a runtime
environment variable, like in the following example.
1. API_KEY=111-11111-my-foo-access-key python code_7.20.py
Code 7.21
This is a very clever way: besides configuration files, you can specify
secrets, in this case the API access token, in a simple and clean manner.
When the developer does not specify any key, we raise an exception (line
6). When everything is correctly specified, you should see output like in the
example below.
1. Credit status:
2. {'dailyCreditsLimit': 10000, 'dailyCreditsRemaining': 10000}
Code 7.22
So far, we have introduced the concept of how and from where to get crypto
currency exchange rates. Now, in the following example, we are going to
use our well-known click module, which is going to help us build scripts for
fetching crypto currency exchange data.
First, we should create a Live Coin crypto client that will fetch data from the
API and save the results in our newly created DB.
1. import click
2. import os
3. import requests
4. from db import DB
5.
6.
7. class LiveCoinClient:
8.
9. def __init__(self):
10. self.__api_token = os.environ.get('API_KEY')
11. assert self.__api_token, "variable API_KEY not specified"
12. self._db = DB()
13.
14. def __fetch_data(self, url):
15. return requests.post(url, headers={"x-api-key": self.__api_token, "content-type": "application/json"}).json()
16.
17. def __post_data(self, url):
18. data = {
19. "currency": "USD",
20. "sort": "rank",
21. "order": "ascending",
22. "offset": 0,
23. "limit": 500,
24. "meta": True
25. }
26. return requests.post(url, headers={"x-api-key": self.__api_token, "content-type": "application/json"}, json=data).json()
27.
28. def fetch_and_update_coins(self):
29. click.echo("Starting fetching livecoin updates")
30. url = 'https://api.livecoinwatch.com/coins/list'
31. data = self.__post_data(url)
32. data_to_refresh = {item['code']: item['rate'] for item in data}
33. self.refresh(data_to_refresh)
34.
35. def refresh(self, data):
36. for currency_code, currency_value in data.items():
37. click.echo(f"Updating coing: {currency_code}")
38. self.update_currency_exchange(currency_code, currency_value)
39.
40. def update_currency_exchange(self, currency_code, currency_value):
41. sql = f"SELECT * FROM currency_exchange WHERE currency_c
ode='{currency_code}'"
42. result = self._db.execute(sql)
43. if result:
44. result = result.pop()
45. self.save_history(result['currency_code'], result['last_price'])
46. sql = f"UPDATE currency_exchange SET last_price='{currency
_value}', updated_at=now() WHERE currency_code='{currency_code}
'"
47. self._db.commit(sql)
48. else:
49. sql = f"INSERT INTO currency_exchange (last_price, currency_
code) VALUES ('{currency_value}', '{currency_code}')"
50. self._db.commit(sql)
51.
52. def save_history(self, currency_code, currency_value):
53. sql = f"""INSERT INTO currency_exchange_history (currency_co
de, last_price) VALUES ('{currency_code}', '{currency_value}')"""
54. self._db.commit(sql)
Code 7.23
In Code 7.23, we created a generic class that fetches crypto currency
exchange rates from a third-party portal. Once the data is fetched, we save
the current values in the main table currency_exchange. Data that was
previously saved in that table gets pushed to the table
currency_exchange_history, where we keep the historical exchange
values.
Through this mechanism, we can keep the present data as well as the past
values, which we will use to calculate and predict whether we should buy or
sell our crypto assets. To use the above functionality, we need a script that
uses our class, like in the following example.
1. import click
2. from live_coin_client import LiveCoinClient
3.
4. def main():
5. click.echo("Starting import")
6. l = LiveCoinClient()
7. l.fetch_and_update_coins()
8.
9. if __name__ == '__main__':
10. main()
Code 7.24
To execute the script, we should run the proceeding example, Code 7.25
with API token so we can fetch all currencies.
1. API_KEY=<your api key> python update_currency_exchange.py
Code 7.25
After executing the script, we should have 500 currency records inserted into
the main table. To validate it, you can execute the following:
1. sqlite3 crypto.db
2.
3. sqlite> SELECT count(*) from currency_exchange;
4. 500
Code 7.26
When we execute line 1, we open the SQLite CLI shell, where we can
execute SQL commands directly against our crypto.db database. The reason
why we only fetch 500 records is that we hardcoded the page size in
Code 7.23, line 23. If we want to get the values of more coins, we need to
modify the __post_data method (Code 7.23, lines 17-26) in such a way
that it can keep fetching data as long as more is available.
For instance, as an exercise, we can work with such a concept in the
following example, where we will use well-known recursion.
1. def __post_data(self, url, page_limit=500, page_offset=0):
2. click.echo(f"Page limit: {page_limit}, offset: {page_offset}")
3. data = {
4. "currency": "USD",
5. "sort": "rank",
6. "order": "ascending",
7. "offset": page_offset,
8. "limit": page_limit,
9. "meta": True
10. }
11. data = requests.post(url, headers={"x-api-key": self.__api_token, "content-type": "application/json"}, json=data).json()
12. if data:
13. more_data = self.__post_data(url, page_limit=page_limit, page_offset=page_offset + page_limit)
14. if more_data:
15. data += more_data
16. return data
Code 7.27
This little change will help us get all the available coins and their exchange
rates from the API. On each call of the __post_data method, we advance
the offset by the page size (line 13) and then call the same method
__post_data again. We keep advancing the offset and recursively calling
__post_data (Code 7.27, line 13) as long as new data can still be fetched.
Once all the data is received, we process the coin updates as usual.
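Recursion is not the only option here: the same pagination can be written as a plain loop, which avoids growing the call stack on very long listings. A minimal sketch of a drop-in alternative method for the LiveCoinClient class, under the same assumptions (same endpoint and payload fields as Code 7.27):
def __post_data_iterative(self, url, page_limit=500):
    results, page_offset = [], 0
    while True:
        payload = {
            "currency": "USD", "sort": "rank", "order": "ascending",
            "offset": page_offset, "limit": page_limit, "meta": True,
        }
        page = requests.post(url, headers={"x-api-key": self.__api_token,
                             "content-type": "application/json"}, json=payload).json()
        if not page:
            return results  # an empty page means we have fetched everything
        results += page
        page_offset += page_limit  # advance the offset by the page size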
Let us run the same script once again. This should lead to a case where the
current exchange rates are copied to the historical table, and the newly
downloaded exchange rates are saved in the present data table.
Unfortunately, as you can see in the following code example, this is not the
case.
1. Executing: SELECT * FROM currency_exchange WHERE currency_code='BTC'
2. Traceback (most recent call last):
3. File "update_currency_exchange.py", line 9, in <module>
4. main()
5. File "update_currency_exchange.py", line 6, in main
6. l.fetch_and_update_coins()
7. File "/Users/hubertpiotrowski/work/fun-with-python/chapter_7/live_coin_client.py", line 34, in fetch_and_update_coins
8. self.refresh(data_to_refresh)
9. File "/Users/hubertpiotrowski/work/fun-with-python/chapter_7/live_coin_client.py", line 39, in refresh
10. self.update_currency_exchange(currency_code, currency_value)
11. File "/Users/hubertpiotrowski/work/fun-with-python/chapter_7/live_coin_client.py", line 47, in update_currency_exchange
12. self.save_history(result['currency_code'], result['last_price'])
13. TypeError: tuple indices must be integers or slices, not str
Code 7.28
This error happens because the native Python SQLite driver returns tuples
in its responses. Each DB record is a tuple instead of a dictionary, which is
what we incorrectly assumed in our code. The fix for this problem is simple,
as illustrated in the following example code.
1. def __init__(self):
2. click.echo(f"Database: {DB_FILENAME}")
3. self.conn = sqlite3.connect(DB_FILENAME)
4. self.conn.row_factory = sqlite3.Row
Code 7.29
We need to apply these changes in the constructor of the database class from
Code 7.3. With this approach, all cursors in the database queries are going to
return dictionary-like rows instead of tuples.
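With sqlite3.Row set as the row factory, each fetched row supports access by column name as well as by index; a quick illustrative snippet:
import sqlite3

conn = sqlite3.connect("crypto.db")
conn.row_factory = sqlite3.Row
row = conn.execute("SELECT * FROM currency_exchange LIMIT 1").fetchone()
if row is not None:
    # sqlite3.Row allows dictionary-style access by column name
    print(row["currency_code"], row["last_price"])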
Before we dive any deeper into the topic of analyzing the fetched data, we
can make an improvement to our updater script (Code 7.23). In the
following example, we will add support for fetching currency exchange data
for a given time range.
1. import click
2. import os
3. import requests
4. from datetime import datetime, timedelta
5. from db import DB
6.
7.
8. class LiveCoinClient:
9.
10. def __init__(self):
11. self.__api_token = os.environ.get('API_KEY')
12. assert self.__api_token, "variable API_KEY not specified"
13. self._db = DB()
14.
15. def __fetch_data(self, url):
16. return requests.post(url, headers={"x-api-key": self.__api_token, "content-type": "application/json"}).json()
17.
18. def __post_data(self, url, page_limit=500, page_offset=0):
19. click.echo(f"Page limit: {page_limit}, offset: {page_offset}")
20. data = {
21. "currency": "USD",
22. "sort": "rank",
23. "order": "ascending",
24. "offset": page_offset,
25. "limit": page_limit,
26. "meta": True
27. }
28. data = requests.post(url, headers={"x-api-key": self.__api_token, "content-type": "application/json"}, json=data).json()
29. if data:
30. more_data = self.__post_data(url, page_limit=page_limit, page_offset=page_offset + page_limit)
31. if more_data:
32. data += more_data
33. return data
34.
35. def format_time(self, dt_value):
36. return str(int(dt_value.timestamp())).replace('.', '')[:13].ljust(13, '0')
37.
38. def fetch_crypto(self, currency_code='BTC', days=1):
39. timestamp_now = datetime.now()
40. url = "https://api.livecoinwatch.com/coins/single/history"
41. payload = {
42. "currency": "USD",
43. "code": currency_code,
44. "start": self.format_time(timestamp_now-timedelta(days=days)),
45. "end": self.format_time(timestamp_now),
46. "meta": True
47. }
48. data = requests.post(url, headers={"x-api-key": self.__api_token, "content-type": "application/json"}, json=payload).json()
49. for item in data['history']:
50. self.update_currency_exchange(currency_code, item['rate'], item['date'])
51.
52. def fetch_and_update_coins(self):
53. click.echo("Starting fetching livecoin updates")
54. url = 'https://api.livecoinwatch.com/coins/list'
55. data = self.__post_data(url)
56. data_to_refresh = {item['code']: item['rate'] for item in data}
57. self.refresh(data_to_refresh)
58.
59. def refresh(self, data):
60. for currency_code, currency_value in data.items():
61. click.echo(f"Updating coing: {currency_code}")
62. self.update_currency_exchange(currency_code, currency_value)
63.
64. def update_currency_exchange(self, currency_code, currency_value, updated_value=None):
65. if not updated_value:
66. updated_value = datetime.now().timestamp()
67. sql = f"SELECT * FROM currency_exchange WHERE currency_c
ode='{currency_code}'"
68. result = self._db.execute(sql)
69. if result:
70. result = result.pop()
71. if float(result['updated_at']) <= float(updated_value):
72. self.save_history(result['currency_code'], result['last_price'], updated_at=result['updated_at'])
73. sql = f"UPDATE currency_exchange SET last_price='{currency_value}', updated_at={updated_value} WHERE currency_code='{currency_code}'"
74. self._db.commit(sql)
75. else:
76. self.save_history(result['currency_code'], result['last_price'], updated_value)
77. else:
78. sql = f"INSERT INTO currency_exchange (last_price, currency_
code, updated_at,
created_at) VALUES ('{currency_value}', '{currency_code}',
{updated_value}, {updated_value})"
79. self._db.commit(sql)
80.
81. def save_history(self, currency_code, currency_value, updated_at):
82. sql = f"""INSERT INTO currency_exchange_history (currency_co
de, last_price, updated_at, created_at) VALUES ('{currency_code}', '{c
urrency_value}', '{updated_at}', '{updated_at}')"""
83. self._db.commit(sql)
Code 7.30
We have modified our example Code 7.23 significantly into the version
shown in Code 7.30. We added support for honoring currency exchange
updates with an explicitly given date-time (lines 65-66), if provided; we use
the current timestamp when the argument is not given. This approach lets us
treat the given record (line 64) either as a fresh one that should be stored as
the current exchange rate and refreshed in the DB (lines 69-71), or as a
completely new record because it does not exist in the database yet (lines
77-79).
Another change to notice is in lines 81-83, where we directly use the given
updated_at value while creating historical records in the database.
All these changes come together in the newly introduced method
fetch_crypto, which we can use to fetch the historical rates of any given
crypto currency. As an argument, we pass the number of days of how far
back we want to go (Code 7.30, line 38) with the historical data to fetch.
We use the mentioned number of days as an input argument (line 38) and
calculate, from the present date-time, the time range of data to fetch (lines
41-46). A fact that needs to be highlighted is that the livecoinwatch12
API expects to receive the date-time fields as epoch13 timestamps instead of
the ISO14 timestamp format.
The API, moreover, does not use the epoch timestamp as a float: the value is
always 13 characters long, zero-padded where needed. For that reason, we
created the method format_time which, for a given datetime object,
produces an epoch timestamp with the described logic, so that the
Livecoinwatch API can understand the timestamps we send.
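In practice, a 13-character epoch value is simply milliseconds since the Unix epoch, so an equivalent and arguably more direct formulation of format_time would be the following (our variation, not the book's listing):
from datetime import datetime

def format_time(dt_value: datetime) -> str:
    # a 13-digit epoch timestamp is milliseconds since 1970-01-01 UTC
    return str(int(dt_value.timestamp() * 1000))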
To consume the newly updated code and the newly introduced method
fetch_crypto, we also need to modify our previously created runner script
from example Code 7.24, so that it works as illustrated in the following
example code.
1. import click
2. from live_coin_client import LiveCoinClient
3.
4. @click.command()
5. @click.option("--
coin", help="Coin to update", type=str, required=False)
6. @click.option("--
days", help="Number of days to fetch", type=int, required=False)
7. def main(days, coin):
8. click.echo("Starting import")
9. l = LiveCoinClient()
10. if coin:
11. l.fetch_crypto(currency_code=coin, days=(days if days else 10))
12. else:
13. l.fetch_and_update_coins()
14.
15. if __name__ == '__main__':
16. main()
Code 7.31
To use example Code 7.31 in a manner where we want to fetch all the
exchange rates, we should run the code like in the following example:
1. API_KEY=<your api key> python update_currency_exchange.py
Code 7.32
Now, when we want to run a single currency update with a given number of
days of history, we need to run the same script with parameters like in the
following example.
1. API_KEY=<your api key> python update_currency_exchange.py --coin=ETH --days=5
Code 7.33
When we run the same script with the same arguments twice or more, you
might face an sqlite3 error like in the following example:
1. sqlite3.IntegrityError: UNIQUE constraint failed: currency_exchange_history.currency_code, currency_exchange_history.updated_at
Code 7.34
This error occurs because we decided in example Code 7.17 that we will
only allow saving historical records under a unique constraint on the
combination of updated_at + currency_code. This means that when we run
our update script with the same combination of the mentioned parameters
(updated_at + currency_code), it will eventually fetch data that we have
already saved in our database, and the SQLite driver will raise an exception.
To properly support the mentioned edge case and deliver valid exception
handling, we must wrap the DB commit (Code 7.30, lines 81-83) in a
try/except block. This fix is accomplished in the following example.
1. import sqlite3  # remember to import this at the top of the script
2.
3. def save_history(self, currency_code, currency_value, updated_at):
4. try:
5. sql = f"""INSERT INTO currency_exchange_history (currency_co
de, last_price, updated_at, created_at) VALUES ('{currency_code}', '{c
urrency_value}', '{updated_at}', '{updated_at}')"""
6. self._db.commit(sql)
7. except sqlite3.IntegrityError:
8. click.echo("This kind of record is aleady saved in DB")
Code 7.35
As you can see, we added a simple try/except block here to make sure
that we save all valid records and skip repetitions.
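Alternatively, SQLite offers INSERT OR IGNORE, which silently skips rows that would violate a unique constraint, removing the need for the try/except entirely; a hedged sketch of save_history using it (our own variant of the book's method):
def save_history(self, currency_code, currency_value, updated_at):
    # INSERT OR IGNORE makes SQLite skip rows violating the unique
    # (currency_code, created_at) constraint instead of raising
    sql = f"""INSERT OR IGNORE INTO currency_exchange_history (currency_code, last_price, updated_at, created_at) VALUES ('{currency_code}', '{currency_value}', '{updated_at}', '{updated_at}')"""
    self._db.commit(sql)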
Conclusion
In this chapter, we learned how we can use Python for analyzing the crypto
market and its trends, and for judging when it is time to sell and when it is
time to buy. We did not perform the example of pulling our coins out of the
trading wallet, since the mechanics are the same as for sending assets from
the main wallet to the trading one, which we already performed; it is just a
matter of switching places, source with destination.
We have also managed to practice how to use all the code that we wrote
across the subsections; that is the whole Python and overall programming
pattern. Gradually, we have been building a prototype to check, analyze and
send crypto assets to a trading wallet. We have also managed to learn how to
use local simulations of a blockchain, which help prototype dApps23. It also
helps you get familiar with the crypto world from a developer's point of
view without using real money, where every mistake has a price.
In the next chapter, we are going to learn how to use Python with hardware
that we might want to build. We are going to learn how to build a smart
speaker that you can interact with.
1. https://www.kaspersky.com/resource-center/definitions/what-is-cryptocurrency
2. https://www.sqlite.org/index.html
3. https://pypi.org/project/click/
4. https://www.cloudflare.com/en-gb/learning/ssl/how-does-public-key-encryption-work/
5. https://www.secg.org/sec2-v2.pdf?ref=hackernoon.com
6. https://pypi.org/project/web3/
7. https://github.com/pycrypto/pycrypto
8. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/fork-a-repo
9. https://github.com/darkman66/eth-keyfile
10. https://www.livecoinwatch.com/tools/api
11. https://pypi.org/project/requests/
12. https://livecoinwatch.github.io/lcw-api-docs/#coinssinglehistory
13. https://www.techtarget.com/searchdatacenter/definition/epoch
14. https://www.iso.org/iso-8601-date-and-time-format.html
15. https://web3py.readthedocs.io
16. https://trufflesuite.com/ganache/
17. https://web3py.readthedocs.io
18. https://docs.python.org/3/library/configparser.html
19. https://click.palletsprojects.com/en/8.1.x/
20. https://ethereum.org/en/developers/docs/gas/
21. https://numpy.org
22. https://etherscan.io
23. https://ethereum.org/en/dapps/
OceanofPDF.com
CHAPTER 8
Construct Your Own High-tech
Loudspeaker
Introduction
With the galloping technological changes in computers and smart devices,
we can say that not only do they bring newer, better batteries, displays,
CPUs and other hardware, but they also become more powerful and
efficient. These changes also have a significant impact on how we, as
humans, interact with computers. Over the years, we have moved from
keyboards towards touch screens. Now, thanks to super-efficient CPUs, our
focus has shifted towards speech. We can now interact with smart devices by
using our voice to give commands and make simple interactions with them.
This technology is growing rapidly, and, in this chapter, we are going to
explore and learn how to use Python in the fascinating world of smart
speakers.
Structure
In this chapter, we will cover the following topics:
Building software that can support speech to text
Recording
Response
Building interactions scenarios
Connecting to third-party services like music players
Building physical devices
Objectives
After reading this chapter, you should know how to build your own smart
speaker. We will learn how to interact with it by scripting interaction
scenarios. You should also be able to understand how speech to text works,
what kind of challenges it may introduce, and how to beat them as a
developer who knows how to use Python.
Recording
To record audio samples, we will use system libraries and kernel drivers that
give us access to input audio devices, such as microphones, via the system
API, without facing the many challenges of accessing such a device at a
very low system level.
We will use a Python library called sounddevice3. This tool allows us to
record sounds from a microphone and convert them to SciPy4 data arrays.
First, we need to install the requirements to be able to start recording. Before
we install any of the above requirements, we must install Python 3.105,
which is required by the sounddevice package.
all the required libraries as shown in the following code.
1. pip install sounddevice==0.4.6
2. pip install scipy==1.10.1
Code 8.1
Once we have installed Python 3.10 and the required libraries, we can try to
make our very first test recording to figure out how microphones are
configured in our system. Let us try the following example that
demonstrates how to record sounds.
1. import sounddevice as sd
2. from scipy.io.wavfile import write
3.
4. fs = 44100
5. seconds = 3
6.
7. myrecording = sd.rec(int(seconds * fs), samplerate=fs, channels=1)
8. print('Start talking')
9. sd.wait()
10. write('output.wav', fs, myrecording)
Code 8.2
After executing the above code, you should be able to record a sound from
your default system microphone and find it saved in our application
folder in a file called output.wav.
We can see that in line 4 we define the quality (sampling frequency6) of our
recording, and in line 5 we define how long we want to record the sound.
These two factors are later used in line 7 to determine the recording
options, as well as the number of channels. The reason why we use only a
single channel here is that the most popular microphones are mono,
which means that they deliver sound on a single channel. We do not want to
force recording on stereo channels in case the microphone does not support
it. If you want, you can experiment with two channels, as long as your
microphone supports them.
The next thing to note is the fact that our framework (sounddevice) by
default generates recordings as a numpy data array, which means we cannot
send its output directly to a file. This is why we use the write function from
the SciPy package (line 10).
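To quickly verify a recording without leaving Python, sounddevice can also play the numpy array back directly; a small sketch reusing the names from Code 8.2:
import sounddevice as sd

# play the recorded numpy array back at the same sample rate
sd.play(myrecording, samplerate=fs)
sd.wait()  # block until playback has finished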
We now have the recording in a flat file; we will certainly use this later. For
the time being, we need to understand how to start recording and activate
the recording itself based on a spoken keyword. The easiest and most
efficient way to achieve this is demonstrated in Code 8.3. Let us put our
recording mechanics into a loop and wait for the wake word, which we will
add later. For the time being, we will keep listening in the loop.
1. import sounddevice as sd
2. from scipy.io.wavfile import write
3.
4. fs = 44100
5. seconds = 10
6.
7. print('I am listening... press ctrl+c to stop')
8. while True:
9. myrecording = sd.rec(int(seconds * fs), samplerate=fs, channels=1)
10. sd.wait()
11. print("Finished and again, recordign size", len(myrecording))
Code 8.3
With this simple change (line 8), we can keep registering sounds forever, in
ten-second samples (line 5). By checking the following flowchart, we will
analyze why we need an infinite loop for recording:
Figure 8.1: Recording and catching the wake word cycle
Figure 8.1 explains why we need an infinite loop for recording. That loop,
in its basic concept, is needed to listen, record, analyze and wait until the
user activates the recording block (records a spoken word). This happens
when the user speaks specific keywords. For this chapter, let us assume that
the triggering word is going to be speaker or hey speaker.
Our example Code 8.3 does not analyze speech to text, so we need to
introduce a Python library that will help us refactor our code and analyze
speech in order to convert it to text. Before we do some simple exercises, we
need to install a Python library for analyzing sounds and converting them to
text. Let us check in the following example how to install the required module.
1. pip install -U openai-whisper
Code 8.4
Once we have installed whisper7, we can write a simple example like in
the following code. To make it work, we shall reuse the concept from Code 8.2.
1. import whisper
2. import sounddevice as sd
3. from scipy.io.wavfile import write
4.
5. fs = 44100
6. seconds = 5
7.
8. myrecording = sd.rec(int(seconds * fs), samplerate=fs, channels=1)
9. print('Start talking')
10. sd.wait()
11.
12. print('Write output')
13. write('output.wav', fs, myrecording)
14.
15. print('Analyze text')
16.
17. model = whisper.load_model("base")
18. result = model.transcribe("output.wav")
19. analized_text = result["text"]
20.
21. print(f"What you said: {analized_text}")
Code 8.5
In this simple code, we record five seconds of audio input.
Once the recording is ready, it is saved to the output file (line 13). Next, we
load the base AI model into whisper (line 17). After the model is properly
loaded and processed, the content of the output file goes through whisper
(line 18). In the end, we have the converted, text form of our audio file.
Now, we need to add support for the wake word. For this, we have to
refactor our previous code and add detection of a keyword. In this case, as
we agreed before, our magic phrase is "speaker" or "hey speaker". Let us
check in the following example how we can react upon the mentioned keywords.
1. import logging
2. import whisper
3. import sounddevice as sd
4. from scipy.io.wavfile import write
5.
6. FS = 44100
7. SECONDS = 5
8. RECORDING_FILE = 'output.wav'
9. LOGGING_FORMAT = '%(asctime)s %(message)s'
10.
11. logging.basicConfig(level=logging.INFO, format=LOGGING_FORMAT)
12.
13.
14. class SmartSpeaker:
15. def __init__(self):
16. self._current_text = None
17. self.model = whisper.load_model("base.en", download_root='.')
18. logging.info("Model loaded")
19.
20. def run(self):
21. if self.record_audio():
22. self.analized_text = self.audio_to_text()
23. logging.info(f"Translated text: {self.analized_text}")
24. if self.is_keyword_in_text:
25. logging.info("Hello, I can't talk yet but I heard you")
26.
27. def record_audio(self) -> bool:
28. try:
29. myrecording = sd.rec(int(SECONDS * FS), samplerate=FS, channels=1)
30. logging.info('Start talking')
31. sd.wait()
32.
33. logging.info('Write output')
34. write('output.wav', FS, myrecording)
35. except Exception as e:
36. logging.error(f"We crashed: {e}")
37. return False
38. return True
39.
40. def audio_to_text(self) -> str:
41. logging.info('Analyze text')
42. result = self.model.transcribe("output.wav")
43. return result["text"]
44.
45. @property
46. def is_keyword_in_text(self) -> bool:
47. return 'speaker' in self.analized_text.lower() or 'hey speaker' in self.analized_text.lower()
48.
49. if __name__ == '__main__':
50. smart_speaker = SmartSpeaker()
51. smart_speaker.run()
Code 8.6
We can see that our code got refactored more towards object-oriented
programming. We have managed to delegate logical areas of the code, such
as recording speech (Code 8.5, lines 8-10), converting it to text (Code 8.5,
lines 17-19) and picking up the key phrase (wake word), into dedicated
methods. The functions converted into methods are used once we can
clearly detect that recording audio was successful (Code 8.6, line 21) and we
have managed to convert speech to text (Code 8.6, line 22).
Once we detect that any of the wake words, like "speaker" or "hey
speaker", is present in the recorded spoken text (Code 8.6, line 24), we
respond back to the user. In this case, we only log the potential reply
message (Code 8.6, line 25).
Response
Our project encompasses not just understanding what the user is trying to
say; since we are focused on building a smart speaker, we should not merely
log what the user said (Code 8.6, line 25). Instead, we should find a way to
speak back to the user.
To make our application talk, we have many options: we could use the
Amazon AWS solution8, Google text to speech9 and many more cloud
solutions. Let us try to see how we can interact with the Google Cloud service.
1. First, we need to install the required Python packages, as shown in the
following code.
1. $ pip install gtts playsound
Code 8.7
2. After installing these packages, we can write our proof-of-concept code
like in the following code example:
1. from gtts import gTTS
2. import os
3. import playsound
4.
5. def text2speak(text):
6. tts = gTTS(text=text, lang='en')
7. filename = "tmp.mp3"
8. tts.save(filename)
9. playsound.playsound(filename)
10. os.remove(filename)
11.
12. text2speak('Hello there, nice to meet you!')
Code 8.8
In the above Code 8.8, we used the Google Cloud service that manages the
conversion from text to speech (line 6). Take note that after fetching the
converted text as audio, we must save the result to a file (lines 7-8). Once the
file is saved, we use the playsound module (line 9) to play what has
been fetched from the Google service. To keep things clean, we remove the
temporary recording (line 10) before exiting the function.
Unfortunately, this solution has one big issue: the temporary file name is
static. Suppose we call our function text2speak multiple times at the same
time (in parallel); we will then have a race condition. One of the
challenges is that the temporary file keeps being overwritten by
multiple instances of the same function, which leads to issues with
playing its content. To fix this, let us refactor the code like in the following
example:
1. import os
2. import playsound
3. from gtts import gTTS
4. from tempfile import mkstemp
5.
6. def text2speak(text):
7. tts = gTTS(text=text, lang='en')
8. filename = mkstemp()[1]
9. tts.save(filename)
10. playsound.playsound(filename)
11. os.remove(filename)
12.
13. text2speak('Hello there, nice to meet you!')
Code 8.9
In this code, we are using a Python module called tempfile10, which allows
us to create unique temporary files for storing recordings, etc. Like in the
previous example Code 8.8, once the playback is done, we remove the
temporary file. Hence, we make sure that there are no breadcrumbs left
behind.
Using third-party solutions like Google can lead to some noticeable delays
from the moment we have a text to read until we can hear it. To
demonstrate this, let us run the following example:
1. $ time python text_to_speech_google_refatored.py
2.
3. python text_to_speech_google_refatored.py 0.21s user 0.06s system 5
% cpu 4.692 total
Code 8.10
In example Code 8.10, we used the system command time, which measures
the execution time of our script. Please note the last parameter (cpu ...
total): it took almost five seconds of system time to execute our script.
Let us try a similar example where we will use Python and the system
speech synthesis11 12 text to speech functionality:
1. from os import system
2.
3. system('say Hello there, nice to meet you!')
Code 8.11
With the embedded system say command, a less natural voice is used;
it sounds more artificial than an actual human voice. This is a
disadvantage for sure, but the advantages of Code 8.11 lie in its simplicity,
its lack of dependency on external commercial services, and its execution time.
You might question why we mention execution time as an advantage
against a third-party commercial service. In the following code, let us
check how much time it takes to execute example Code 8.11:
1. $ time python text_to_speech.py
2.
3. python text_to_speech.py 0.23s user 0.06s system 10% cpu 2.597 total
Code 8.12
You can notice that the execution time is almost half of that in example
Code 8.10. This is the factor we will use here as leverage to
choose the system solution for voice synthesis over third-party services.
Now, it is time to refactor our Code 8.6: instead of logging a response
message, we are going to say it to the user. In the following example, we
refactored Code 8.6 with text to speech included:
1. from os import system
2.
3. class SmartSpeaker:
4.
5. def run(self):
6. if self.record_audio():
7. self.analized_text = self.audio_to_text()
8. logging.info(f"Translated text: {self.analized_text}")
9. if self.is_keyword_in_text:
10. reply_txt = "Hello, I can't talk yet but I heard you"
11. system(f'say {reply_txt}')
Code 8.13
In the method run, once we detect that the wake word phrase has been said
by the user, we replaced the logging with a proper text to speech mechanism.
As a result, the user will hear the text that we defined in line 11.
Building interaction scenarios
Ideally, in the world of smart speakers, we should be able to interact with the
user in such a way that the person talking to the smart speaker feels like they
are talking to a person, or at least to some form of artificial intelligence. Of
course, for the purposes of this book we are not going to build another
ChatGPT13 platform for smart interactions with the user; instead, we are going
to use something simple, yet powerful, for building interaction scenarios.
We will need a Python library that allows us to build intents and that can be
fed scenario files which descriptively drive interactions with the user. We
could use a commercial service14 for building smart conversations, like in the
previous subchapter of this chapter.
In this case, we will use an open-source library to build conversations. In the
following example, we install the Python library that supports this:
1. $ git clone [email protected]:bpbpublications/Fun-with-Python.git
2. $ cd fun-with-Python/chapter_8/neuralintents
3. $ python setup.py install
Code 8.14
Once the library is installed, we can move on to the code that will allow us to
create interactive scenarios. Before we do so, we should clarify something.
The neuralintents15 library is a proof-of-concept library based on
TensorFlow16. It allows us to work with natural languages and
program interactions with users based on natural human speech and grammar.
We can start with the following example where we define scenarios and
intents:
1. {"intents": [
2. {"tag": "greeting",
3. "patterns": ["Hi", "How are you", "Is anyone there?",
"Hello", "Good day", "Whats up", "Hey", "greetings"],
4. "responses": ["Hello!", "Good to see you again!",
"Hi there, how can I help?"],
5. "context_set": ""
6. },
7. {"tag": "goodbye",
8. "patterns": ["cya", "See you later", "Goodbye",
"I am Leaving", "Have a Good day", "bye", "cao", "see ya"],
9. "responses": ["Sad to see you go :
(", "Talk to you later", "Goodbye!"],
10. "context_set": ""
11. },
12. {"tag": "stocks",
13. "patterns": ["what stocks do I own?", "how are my shares?", "what
companies am I investing in?", "what am I doing in the markets?"],
14. "responses": ["You own the following shares: ABBV, AAPL,
FB, NVDA and an ETF of the S&P 500 Index!"],
15. "context_set": ""
16. }
17. ]}
Code 8.15
We have configured our neuralintents to react to different wake words
(lines with the patterns key, for example line 3). With this approach, we can
create custom functions that perform a custom action upon a triggered word. The
great thing about TensorFlow here is that we can ignore complex linguistic
cases such as grammar and inflection. The library will try to reduce words to
their base forms and find the best matching pattern.
In the following example, we read a given word or phrase from the command
line. Based on the prepared scenario patterns, the library will call the
custom function as mentioned.
1. import logging
2. from neuralintents import GenericAssistant
3.
4. LOGGING_FORMAT = '%(asctime)s %(message)s'
5. logging.basicConfig(level=logging.INFO, format=LOGGING_FORMA
T)
6.
7. def greetings_callback():
8. logging.info("Your greetings")
9.
10. def stocks_callback():
11. logging.info("Your stocks")
12.
13. mappings = {
14. 'greeting' : greetings_callback,
15. 'stocks' : stocks_callback
16. }
17.
18. assistant = GenericAssistant('intents.json', model_name="test_model", i
ntent_methods=mappings)
19. assistant.train_model()
20. assistant.save_model()
21.
22. while True:
23. message = input("Message: ")
24. if message == "STOP":
25. break
26. else:
27. assistant.request(message)
Code 8.16
The following example demonstrates the code from Code 8.16 in action. As is
easy to notice, the keyword STOP is used to stop our program. You may also
observe that each time we use words from our scenario patterns (Code 8.15), if
a valid pattern matches, our code calls the custom callbacks (lines 7-10 and
13-16 of Code 8.16). Notice, as mentioned before, that the different
grammatical forms of a word are handled by the linguistic library.
1. $ python neutral.py
2.
3.
4. Message: hi
5. 1/1 [==============================] - 0s 42ms/step
6. 2023-05-21 12:26:55,286 Your greetings
7. Message: stock value
8. 1/1 [==============================] - 0s 13ms/step
9. 2023-05-21 12:26:58,984 Your greetings
10. Message: how are my shares?
11. 1/1 [==============================] - 0s 13ms/step
12. 2023-05-21 12:27:11,358 Your stocks
13. Message: shares
14. 1/1 [==============================] - 0s 12ms/step
15. Message: how are my share?
16. 1/1 [==============================] - 0s 12ms/step
17. 2023-05-21 12:27:23,580 Your stocks
18. Message: STOP
19. $
Code 8.17
Interestingly, lines 10 and 15 show that the library is smart enough to find
the proper pattern for both the singular and the plural form of the same word.
Now, it is time to reuse our callback functions to properly support the wake
word and start responding to the user with some simple response scenarios.
Let us check in the following example code how we can achieve that.
1. import logging
2. import random
3. import whisper
4. import sounddevice as sd
5. from datetime import datetime
6. from os import system
7. from scipy.io.wavfile import write
8. from neuralintents import GenericAssistant
9.
10. FS = 44100
11. SECONDS = 5
12. RECORDING_FILE = 'output.wav'
13. LOGGING_FORMAT = '%(asctime)s %(message)s'
14.
15. logging.basicConfig(level=logging.INFO, format=LOGGING_FORMA
T)
16.
17.
18. class SmartSpeaker:
19. def __init__(self):
20. self._current_text = None
21. self.model = whisper.load_model("base.en", download_root='.')
22. logging.info("Model loaded")
23.
24. def audio2text(self):
25. if self.record_audio():
26. self.analized_text = self.audio_to_text()
27. logging.info(f"Translated text: {self.analized_text}")
28. return self.analized_text
29.
30. def run(self, assistant):
31. self.assistant = assistant
32. analyzed_text = self.audio2text()
33. if analyzed_text and self.is_keyword_in_text:
34. self.__say("Yes, how can I help you?")
35. new_analyzed_text = self.audio2text()
36. if new_analyzed_text:
37. self.assistant.request(new_analyzed_text)
38.
39. def record_audio(self) -> bool:
40. try:
41. myrecording = sd.rec(int(SECONDS * FS), samplerate=FS, cha
nnels=1)
42. logging.info('Start talking')
43. sd.wait()
44.
45. logging.info('Write output')
46. write('output.wav', FS, myrecording)
47. except Exception as e:
48. logging.error(f"We crashed: {e}")
49. return False
50. return True
51.
52. def audio_to_text(self) -> str:
53. logging.info('Analyze text')
54. result = self.model.transcribe("output.wav")
55. return result["text"]
56.
57. def get_response(self, tag):
58. list_of_intents = self.assistant.intents["intents"]
59. for i in list_of_intents:
60. if i["tag"] == tag:
61. return random.choice(i["responses"])
62.
63. def __say(self, message):
64. system(f'say {message}')
65.
66. @property
67. def is_keyword_in_text(self) -> bool:
68. return 'speaker' in self.analized_text.lower() or 'hey speaker' in self.
analized_text.lower()
69.
70. def callback_greetings(self):
71. response = self.get_response('greetings')
72. self.__say(response)
73.
74. def callback_time(self):
75. current_time = datetime.now().strftime("%I:%M%p")
76. response = self.get_response('time')
77. response = response.format(time=current_time)
78. self.__say(response)
79.
80.
81. if __name__ == '__main__':
82. smart_speaker = SmartSpeaker()
83. mappings = {
84. 'greetings': smart_speaker.callback_greetings,
85. 'time': smart_speaker.callback_time
86. }
87.
88. assistant = GenericAssistant('intents_speaker.json', model_name="tes
t_model", intent_methods=mappings)
89. assistant.train_model()
90. assistant.save_model()
91. smart_speaker.run(assistant)
Code 8.18
As you can see in example Code 8.18, we have mostly reused the code
examples that we have learned so far. In a nutshell, they include detecting the
wake word and responding to the end user that we have heard the indicator word.
In lines 24-28, we convert the recorded audio to plain text, and next, in
lines 31-33, we check if the converted text contains the wake word, which is,
as we said before, hey speaker. If a valid wake word is detected, we go
to the next phase (line 34), where we respond back to the user that we
are ready to accept the actual command.
Next, when we accept the actual command, we again try to decipher what
was said by the user, and if we can detect a proper command using neural
intents, we respond with a random response defined in the intents JSON file
(lines 57-61).
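The intents_speaker.json file itself is not shown in the listing. A minimal sketch consistent with the tags used in Code 8.18 (greetings and time, with a {time} placeholder that callback_time fills in) might look like this:

{"intents": [
    {"tag": "greetings",
     "patterns": ["Hi", "Hello", "Hey speaker", "Good day"],
     "responses": ["Hello!", "Hi there, how can I help?"],
     "context_set": ""
    },
    {"tag": "time",
     "patterns": ["What time is it?", "Tell me the time", "Current time please"],
     "responses": ["It is {time}", "The time is {time}"],
     "context_set": ""
    }
]}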
Please check the following example to see how our program works in action:
1. (...)
2. Epoch 197/200
3. 3/3 [==============================] - 0s 463us/step - loss: 0.
1315 - accuracy: 1.0000
4. Epoch 198/200
5. 3/3 [==============================] - 0s 481us/step - loss: 0.
0915 - accuracy: 1.0000
6. Epoch 199/200
7. 3/3 [==============================] - 0s 481us/step - loss: 0.
1208 - accuracy: 1.0000
8. Epoch 200/200
9. 3/3 [==============================] - 0s 463us/step - loss: 0.
0790 - accuracy: 1.0000
10.
11. 2023-05-10 17:07:03,653 Start talking
12. 2023-05-08 17:07:08,774 Write output
13. 2023-05-08 17:07:08,777 Analyze text
14. 2023-05-08 17:07:10,145 Translated text: Hey speaker!
15. 2023-05-08 17:07:12,673 Start talking
16. 2023-05-08 17:07:17,794 Write output
17. 2023-05-08 17:07:17,797 Analyze text
18. 2023-05-08 17:07:18,333 Translated text: What time is it?
19. 1/1 [==============================] - 0s 38ms/step
Code 8.19
Another thing we must pay attention to is that feeding and training the
neural system happens only once, when the application starts. You have
probably already noticed that after running our script, files such as *.pkl
and *.h5 are created locally. These are the trained language model files for
English. In example Code 8.19, we can see that once the application is done
loading the data files, we interact with our smart speaker following
the algorithm that has been described before.
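Since training happens only once at startup, subsequent runs could load the saved model files instead of retraining. A sketch, assuming the library's load_model method restores what save_model wrote:

from neuralintents import GenericAssistant

assistant = GenericAssistant('intents_speaker.json', model_name="test_model")
# If the *.pkl / *.h5 files from a previous run are present, restore them
# instead of calling train_model() again (an assumption about load_model).
assistant.load_model()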
Figure 8.3: Example flow of user interaction with smart speaker and Spotify service
As shown in Figure 8.3, we need to introduce a new API component in our
application schema. This is because Spotify only allows us to integrate with
their service via the HTTP(S) protocol, and authentication must be done via
OAuth20, which requires a web browser. To make things cleaner and
introduce a better separation between the part of our application that handles
all the interactions with the user (the main app) and the part that is responsible
for playing music, we have a stand-alone API block for this. Before we start
building the API service, we must install a package for asynchronous21 HTTP
calls22:
1. $ pip install httpx==0.24.1
Code 8.21
Once we have installed the HTTP client library, let us take a quick look at how
it works in the following example:
1. import httpx
2.
3. url = 'https://en.wikipedia.org/wiki/Python_(mythology)'
4.
5. response = httpx.get(url)
6.
7. print(response)
8. with open("/tmp/tmp_page.txt", 'wb') as w:
9. w.write(response.content)
Code 8.22
As you may have already noticed, using the HTTPX library is not very different
from using the requests23 package. It is worth mentioning that in the following
example we do the same thing, albeit with asynchronous calls via
asyncio:
1. import asyncio
2. import httpx
3.
4. url = 'https://en.wikipedia.org/wiki/Python_(mythology)'
5.
6. async def main():
7. async with httpx.AsyncClient() as client:
8. response = await client.get(url)
9. print(response)
10. with open("/tmp/tmp_page.txt", 'wb') as w:
11. w.write(response.content)
12.
13. asyncio.run(main())
Code 8.23
By running this example, we now know how to use the HTTP library to connect
to an external resource and fetch its content asynchronously. Now, we
need to install another asynchronous library, one that is easy to use and very
efficient for building API systems, called FastAPI24:
1. $ pip install fastapi==0.96.0
Code 8.24
To understand how the FastAPI framework works, let us start with the following
example, where we create the first main endpoint for our API service as an
example script api_step1.py:
1. from fastapi import FastAPI
2.
3. app = FastAPI()
4.
5.
6. @app.get("/")
7. async def main():
8. url = "http://foo.com/redirect"
9. return {"url": url}
Code 8.25
We have a single API endpoint (line 6). When we try to execute our
script to see if the API works, as in the following example, nothing happens:
1. python api_step1.py
Code 8.26
The reason why executing our script from example Code 8.25 is not
effective is that FastAPI is a web API framework: the transport
layer has to be delivered by a web server. In our case, we will use the
HTTP server Gunicorn25 with Uvicorn workers (FastAPI is an ASGI application,
so plain WSGI serving will not do). First, we have to install it as in the
following example:
1. $ pip install gunicorn==20.1.0
Code 8.27
Once we have it installed, we can run our example script from Code 8.25, by
using a proper WSGI server for serving our API. Please check the
proceeding example for a demonstration of the same:
1. $ gunicorn -k uvicorn.workers.UvicornWorker api_step1:app --reload -
b localhost:8888
Code 8.28
This command starts our service, so we can now test whether it
is working by executing the following command:
1. $ curl -v http://localhost:8888/
2.
3. * Trying 127.0.0.1:8888...
4. * Connected to 127.0.0.1 (127.0.0.1) port 8888 (#0)
5. > GET / HTTP/1.1
6. > Host: 127.0.0.1:8888
7. > User-Agent: curl/7.88.1
8. > Accept: */*
9. >
10. < HTTP/1.1 200 OK
11. < date: Sat, 17 Jun 2023 19:24:02 GMT
12. < server: uvicorn
13. < content-length: 33
14. < content-type: application/json
15. <
16. * Connection #0 to host 127.0.0.1 left intact
17.
18. {"url":"http://foo.com/redirect"}
Code 8.29
As we can see in Code 8.29, it is finally possible to retrieve an API response; in
our case, a simple JSON response (line 18). As we mentioned
before, we need to build an API that supports OAuth with the Spotify
system so that, as a result of successful authentication, we can get the access
token. We demonstrated that flow before as basic token authentication (Code
8.20); it was a simplified approach and not full OAuth. In the
following example, we use the full OAuth flow to get the access token.
Take note that we are using a config parser to load the Spotify credentials.
In the following example, we create the credentials file
(api_config.ini) holding the access keys generated in example 8.20:
1. [spotify]
2. client_id = <your client ID>
3. client_secret = <your client secret>
Code 8.30
Before we proceed with the following examples, we must install a Spotify
module that will help us properly follow the OAuth flow26:
1. $ pip install git+https://github.com/darkman66/spotify.py.git
Code 8.31
Once we have the configuration file, let us utilize it in the following
example, where we load the credentials and proceed with the OAuth flow:
1. import configparser
2. import spotify
3. from fastapi import FastAPI
4. from fastapi.responses import RedirectResponse
5. from typing import Tuple
6.
7. config = configparser.ConfigParser()
8. config.sections()
9. config.read('api_config.ini')
10.
11. SPOTIFY_CLIENT_ID = config.get('spotify', 'client_id')
12. SPOTIFY_CLIENT_SECRET = config.get('spotify', 'client_secret')
13. REDIRECT_URI: str = 'http://localhost:8888/spotify/callback'
14. SPOTIFY_CLIENT = spotify.Client(SPOTIFY_CLIENT_ID, SPOTIF
Y_CLIENT_SECRET)
15. OAUTH2_SCOPES: Tuple[str] = ('user-modify-playback-state', 'user-
read-currently-playing', 'user-read-playback-state')
16. OAUTH2: spotify.OAuth2 = spotify.OAuth2(SPOTIFY_CLIENT.id, R
EDIRECT_URI, scopes=OAUTH2_SCOPES)
17. AUTH_TOKEN = None
18.
19. app = FastAPI()
20.
21.
22. @app.get("/")
23. async def main():
24. url = None
25. if not AUTH_TOKEN:
26. url = OAUTH2.url
27. return RedirectResponse(url, status_code=302)
28. return {"url": url}
Code 8.32
We imported the config parser (line 1), loaded our configuration file,
and used it to set the Spotify credentials as static values (lines 7-12) taken
from the configuration. Once we have the credentials sorted, we define the
callback URL used after a successful authentication on Spotify's side.
Remember, we have defined that callback URL in our Spotify application
configuration (Figure 8.2).
Next, we need to initialize the Spotify client instance (line 14) and use it in
the constructor of the OAuth2 client (line 16). What you need to notice is line
15: here we define what scope of private data we want access to27. It is
important to define a proper scope of privileges, so that we get an
authentication token that allows us to access the data we want to read.
Based on the scope, Spotify will give us access only to the kind of data that
we requested.
Another thing to remember is that we are authenticating via OAuth.
To make the authentication flow work, we have to open the URL
http://localhost:8888/ in the browser, so that we are redirected to Spotify to
accept the scope of data that our API wants to read from your Spotify
account. Once you accept it, you are redirected to the defined
redirect URL (Figure 8.2).
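To make the redirect less magical: the OAUTH2.url we redirect to is, roughly, Spotify's standard authorization endpoint with our parameters encoded into the query string. A sketch of how such a URL is assembled (the library builds this for us; the values shown are placeholders):

from urllib.parse import urlencode

params = {
    "client_id": "<your client ID>",
    "response_type": "code",
    "redirect_uri": "http://localhost:8888/spotify/callback",
    # Scopes are passed as one space-separated string
    "scope": "user-modify-playback-state user-read-currently-playing user-read-playback-state",
}
print("https://accounts.spotify.com/authorize?" + urlencode(params))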
In the following example, we authenticate a user account against the
Spotify system, and once the authentication process succeeds, we search
for a music title called drake:
1. import configparser
2. import spotify
3. from fastapi import FastAPI
4. from fastapi.responses import RedirectResponse
5. from typing import Tuple
6.
7. config = configparser.ConfigParser()
8. config.sections()
9. config.read('api_config.ini')
10.
11. SPOTIFY_CLIENT_ID = config.get('spotify', 'client_id')
12. SPOTIFY_CLIENT_SECRET = config.get('spotify', 'client_secret')
13. REDIRECT_URI: str = 'http://localhost:8888/spotify/callback'
14. SPOTIFY_CLIENT = spotify.Client(SPOTIFY_CLIENT_ID, SPOTIF
Y_CLIENT_SECRET)
15. OAUTH2_SCOPES: Tuple[str] = ('user-modify-playback-state', 'user-
read-currently-playing', 'user-read-playback-state', 'app-remote-control')
16. OAUTH2: spotify.OAuth2 = spotify.OAuth2(SPOTIFY_CLIENT.id, R
EDIRECT_URI, scopes=OAUTH2_SCOPES)
17. AUTH_TOKEN = None
18.
19. app = FastAPI()
20.
21.
22. @app.get("/")
23. async def main():
24. url = None
25. if not AUTH_TOKEN:
26. url = OAUTH2.url
27. return RedirectResponse(url, status_code=302)
28. return {"url": url}
29.
30.
31. @app.get('/spotify/callback')
32. async def spotify_callback(code: str):
33. return_url = None
34. try:
35. AUTH_TOKEN = code
36. except KeyError:
37. return {"ready": False}
38. else:
39. print(f"Authentiicaton token: {AUTH_TOKEN}")
40. async with spotify.Client(SPOTIFY_CLIENT_ID, SPOTIFY_CLI
ENT_SECRET) as client:
41. try:
42. response = await spotify.User.from_code(client, code, redirect
_uri=REDIRECT_URI)
43. user = await response
44. results = await client.search('drake')
45. print(results.tracks)
46. if results.tracks and len(results.tracks) > 0:
47. return_url = results.tracks[0].url
48. except spotify.errors.HTTPException as e:
49. print('Token expired?')
50. if 'expired' in str(e).lower() or 'invalid' in str(e).lower():
51. print('redirect-'*5)
52. return RedirectResponse('/', status_code=302)
53.
54. return {"url": return_url}
Code 8.33
To start the example from Code 8.33, we must use the Gunicorn server,
as in the following example:
1. $ gunicorn -k uvicorn.workers.UvicornWorker api_step3:app --reload -
b localhost:8888
Code 8.34
Once you open our API's main URL in the browser, you should see a response
like the following:
1. {"url":"https://open.spotify.com/track/7aRCf5cLOFN1U7kvtChY1G"}
Code 8.35
It works! This is good news. We have successfully managed to authenticate
our application and use the Spotify system to find music records for us.
Let us take a closer look at Code 8.33. You can notice that the method
spotify_callback takes an argument code. In the FastAPI framework, this kind of
function definition means that the argument is read from the query
parameters (we illustrate this binding in isolation right after the list
below). Let us check how this code appears in the URL:
1. http://localhost:8888/spotify/callback?code=<..authentication code..>
Code 8.36
1. In example Code 8.36, we show the URL of the callback that the Spotify
system redirects the user to after successful authentication. It is easy to
notice that the URL has a query parameter code. This is the same
argument that the spotify_callback function receives, as already mentioned.
2. The next step is to create a new client instance (line 40) that we use
in line 42, where we try to get the user ID using the OAuth code that
we received in the callback from Spotify.
3. In the next phase, we call the Spotify API to find a specific
music track. In line 44, we look for the word drake, and from
the returned results we try to get the URL of the very first record
returned by Spotify.
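FastAPI's query-parameter binding can be seen in isolation in this small, hypothetical endpoint (not part of our Spotify API): any function parameter of a simple type that does not appear in the path is read from the query string.

from fastapi import FastAPI

app = FastAPI()

@app.get("/echo")
async def echo(code: str):
    # 'code' is not in the path, so FastAPI binds it from ?code=...
    return {"code": code}

# GET /echo?code=abc123  ->  {"code": "abc123"}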
So far, everything we described happened in a single flow. This is not real
API logic. What we mean is that we want to authenticate against Spotify
only once, when our app starts. Once our API is authenticated, we should be able
to search for any kind of music record in the Spotify library via our API. We
should also be able to ask our API to play the requested music track.
For these requirements, let us look at the following example to see
how we can modify our Code 8.33 to serve our needs:
1. import configparser
2. import os
3. import spotify
4. from fastapi import FastAPI
5. from fastapi.responses import RedirectResponse
6. from pydantic import BaseModel
7. from typing import Tuple
8.
9. config = configparser.ConfigParser()
10. config.sections()
11. config.read("api_config.ini")
12.
13. SPOTIFY_CLIENT_ID = config.get("spotify", "client_id")
14. SPOTIFY_CLIENT_SECRET = config.get("spotify", "client_secret")
15. REDIRECT_URI: str = "http://localhost:8888/spotify/callback"
16. SPOTIFY_CLIENT = spotify.Client(SPOTIFY_CLIENT_ID, SPOTIF
Y_CLIENT_SECRET)
17. OAUTH2_SCOPES: Tuple[str] = (
18. "user-modify-playback-state",
19. "user-read-currently-playing",
20. "user-read-playback-state",
21. "app-remote-control",
22. )
23. OAUTH2: spotify.OAuth2 = spotify.OAuth2(SPOTIFY_CLIENT.id, R
EDIRECT_URI, scopes=OAUTH2_SCOPES)
24. TOKEN_FILE = '/tmp/token.dat'
25.
26. class Item(BaseModel):
27. phrase: str
28.
29. app = FastAPI()
30.
31. async def token_set(auth_code: str):
32. with open(TOKEN_FILE, 'w') as f:
33. f.write(auth_code)
34.
35. async def token():
36. if os.path.exists(TOKEN_FILE):
37. with open(TOKEN_FILE, 'r') as f:
38. return f.read().strip()
39.
40. @app.get("/")
41. async def main():
42. url = None
43. if not await token():
44. url = OAUTH2.url
45. return RedirectResponse(url, status_code=302)
46. return {"url": url}
47.
48.
49. @app.post("/search/")
50. async def spotify_search(item: Item):
51. async with spotify.Client(SPOTIFY_CLIENT_ID, SPOTIFY_CLIEN
T_SECRET) as client:
52. results = await client.search(item.phrase)
53. if results.tracks and len(results.tracks) > 0:
54. track_url = results.tracks[0].url
55. return {"track_url": track_url}
56.
57.
58. @app.get("/spotify/callback")
59. async def spotify_callback(code: str):
60. success = False
61. try:
62. await token_set(code)
63. except KeyError:
64. return {"ready": False}
65. else:
66. print(f"Authentiicaton token: {code}")
67. async with spotify.Client(SPOTIFY_CLIENT_ID, SPOTIFY_CLI
ENT_SECRET) as client:
68. try:
69. response = await spotify.User.from_code(client, code, redirect
_uri=REDIRECT_URI)
70. user = await response
71. print(f"Managed to collect user data: {user}")
72. return RedirectResponse("/", status_code=302)
73. except spotify.errors.HTTPException as e:
74. print("Token expired?")
75. if "expired" in str(e).lower() or "invalid" in str(e).lower():
76. print("redirect-" * 5)
77. return RedirectResponse("/", status_code=302)
Code 8.37
Firstly, we modified the way we store the authentication code received
from the callback URL (lines 31-38). Since Gunicorn is a pre-forking server
that can run multiple worker processes, we cannot store the authentication
code in a single global variable like we did in the previous example (Code
8.33, line 35). In this case, we use a simple yet powerful solution: we store
the authentication code in a flat file (line 62). When we want to use the
authentication code, we can just read it from the file (line 43).
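One refinement worth considering (not in the original listing): writing the token to a temporary file first and atomically swapping it into place means a concurrent worker can never read a half-written token.

import os
import tempfile

TOKEN_FILE = '/tmp/token.dat'

def token_set_atomic(auth_code: str) -> None:
    # Write to a temp file in the same directory, then atomically replace,
    # so readers always see either the old token or the complete new one.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(TOKEN_FILE))
    with os.fdopen(fd, 'w') as f:
        f.write(auth_code)
    os.replace(tmp_path, TOKEN_FILE)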
We also have a simple mechanism to check whether the returned authentication
code has expired or is invalid (lines 73-77). In that case, we return the user
(in the browser) to the OAuth login screen to refresh the authentication code.
You might notice that we have added a new method, search (lines 50-55). It is
a POST method (line 49) and takes a single argument (line 50) called item,
which is deserialized from the JSON28 request body. For (de)serialization and
validation, we use the Pydantic29 framework. To visualize how to call the
search method, please check the following example:
1. $ curl -v -X POST http://localhost:8888/search/ -H 'content-
type: application/json' -d '{"phrase" : "linking park"}'
Code 8.38
As you can see, the -d (data) parameter of the curl command specifies the JSON
payload that we send to our API, and as a response we get something
like the following example:
1. {"track_url":"https://open.spotify.com/track/60a0Rd6pjrkxjPbaKzXjfq"
}
Code 8.39
We now have the basic API functionality working: authentication against
Spotify and searching music records for a given phrase. What we want to
achieve next is functionality where we can send a query to our
API to find interesting music and, upon finding results, ask our API
to play it. To do that properly, it is a good idea to return the track ID in
our API response alongside the full track URL. We will modify the spotify_search
method to look like the following example:
1. @app.post("/search/")
2. async def spotify_search(item: Item):
3. async with spotify.Client(SPOTIFY_CLIENT_ID, SPOTIFY_CLIEN
T_SECRET) as client:
4. results = await client.search(item.phrase)
5. if results.tracks and len(results.tracks) > 0:
6. track_url = results.tracks[0].url
7. track_id = track_url.split('/')[-1]
8. return {
9. "track_url": track_url,
10. "ID": track_id
11. }
Code 8.40
To return the track ID, we used a simple trick (the string split method) to
extract the last part of the full URL. We also keep the full URL in the
returned payload, as we are going to use it later. In the
following example, we add the play method to our API. With this
method and a given track ID, we can play the requested music record:
1. @app.get("/play/{track_id}")
2. async def spotify_playback(track_id: str):
3. code = await token()
4. async with spotify.Client(SPOTIFY_CLIENT_ID, SPOTIFY_CLIE
NT_SECRET) as client:
5. response = await spotify.User.from_code(client, code, redirect_uri
=REDIRECT_URI)
6. user = await response
7. devices = await user.get_devices()
8. device_id = devices[0].id
9. p = await user.get_player()
10. play_url = f"https://open.spotify.com/track/{track_id}"
11. await p.play(play_url, device_id)
Code 8.41
To make our API start playing a music record, we call it as shown in the
following example:
1. curl -v -X POST http://localhost:8888/play/60a0Rd6pjrkxjPbaKzXjfq -
H 'content-type: application/json'
Code 8.42
If you do not have a premium account (paid subscription), you will get an
error message displaying Forbidden (status code: 403): Player command
failed: Premium required. This means that with a free account we cannot
play music. The other limitation is that we get the device ID
(Code 8.41, line 8) by assuming that device element zero in the
returned list is the player in the browser. To make it work, you need to open
the Spotify player in your default web browser30.
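Instead of blindly taking devices[0], we could pick a device by name. A sketch, assuming the library's device objects expose a name attribute the way the underlying Web API's device objects do (only the id attribute is confirmed by our listing):

def pick_device_id(devices, preferred_name="Web Player"):
    # Prefer a device whose name matches; otherwise fall back to the first one.
    for device in devices:
        if preferred_name.lower() in device.name.lower():
            return device.id
    return devices[0].id if devices else None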
The third limitation of our approach is that the Python Spotify library assumes
that the token (Code 8.41, line 5) is the one coming from the OAuth callback
URL. That is correct, but Spotify only allows us to use that code once. Hence,
the spotify_playback function cannot keep reusing this technique. Let us try
the official Spotify API approach with the support of the HTTPX library that
we installed before (Code 8.21).
1. import httpx
2.
3. @app.post("/play/{track_id}")
4. async def spotify_play(track_id: str):
5.     async with httpx.AsyncClient() as client:
6.         headers = {"Content-Type": "application/x-www-form-urlencoded"}
7.         data = f"grant_type=client_credentials&client_id={SPOTIFY_CLIENT_ID}&client_secret={SPOTIFY_CLIENT_SECRET}"
8.         response = await client.post("https://accounts.spotify.com/api/token", headers=headers, data=data)
9.         access_token = response.json()['access_token']
10.
11.         headers = {"Content-Type": "application/json", "Authorization": f"Bearer {access_token}"}
12.         data = {'context_uri': f"spotify:track:{track_id}"}
13.         response = await client.put("https://api.spotify.com/v1/me/player/play", headers=headers, json=data)
14.         print(response.content)
Code 8.43
You can see that the approach shown in example Code 8.43 is a very
low-level way of communicating with Spotify. We do not use a framework
for this; instead we use direct HTTP API calls to Spotify.
In lines 6-9 we get the access token using the client ID and client secret.
Once we have the token, we can make the actual API call (lines 11-14).
Of course, even with this way of requesting playback from Spotify, we cannot
avoid the requirement of a Spotify subscription plan. The
simplest refactoring we can do here is to open the music record URL in the
browser from our API, as in the following example:
1. import webbrowser
2.
3. @app.post("/play/{track_id}")
4. async def spotify_play(track_id: str):
5. play_url = f"https://open.spotify.com/track/{track_id}"
6. webbrowser.open(play_url)
Code 8.44
Conclusion
In this chapter, we learned how to use Python for voice recognition. Next,
we covered how to analyze what is being said by converting speech to
raw text; we now know how to integrate such a powerful technique with
third-party software. With Python, voice recognition opens up countless
integrations with smart home solutions such as smart lamps, garden watering
systems, cameras, and many more.
In the next chapter, we are going to learn how we can use Python to build
music and video downloading software.
1. https://cloud.google.com/ai-
platform/training/docs/algorithms/xgboost
2. https://cmusphinx.github.io/wiki/tutorialam/
3. https://github.com/spatialaudio/python-sounddevice
4. https://scipy.org
5. https://www.python.org/downloads/
6. https://en.wikipedia.org/wiki/44,100_Hz
7. https://github.com/openai/whisper
8. https://aws.amazon.com/polly/
9. https://cloud.google.com/text-to-speech/
10. https://docs.python.org/3/library/tempfile.html
11. https://ss64.com/osx/say.html
12. https://manpages.ubuntu.com/manpages/trusty/man1/say.1.html
13. https://openai.com
14. https://cloud.google.com/dialogflow/es/docs/basics
15. https://pypi.org/project/neuralintents/
16. https://www.tensorflow.org
17. https://www.spotify.com
18. https://developer.spotify.com
19. https://developer.spotify.com/dashboard
20. https://en.wikipedia.org/wiki/OAuth
21. https://docs.python.org/3/library/asyncio.html
22. https://www.python-httpx.org
23. https://docs.python-requests.org/en/latest/index.html
24. https://fastapi.tiangolo.com
25. https://gunicorn.org
26. https://developer.spotify.com/documentation/web-
api/concepts/authorization
27. https://developer.spotify.com/documentation/web-
api/concepts/scopes
28. https://www.w3schools.com/whatis/whatis_json.asp
29. https://pypi.org/project/pydantic/
30. https://open.spotify.com/
31. https://www.raspberrypi.com
32. https://www.raspberrypi.com/documentation/computers/os.html
CHAPTER 9
Make a Music and Video
Downloader
Introduction
On the internet, when we try to download files, we may face technical
challenges. The connection between our computer and the server can be dropped,
and we expect the connection to be re-established automatically.
Another challenge is that some servers throttle connections, so we can
only download certain assets at a highly limited speed (Fair Use Policy1).
Speaking of limiting connection speed, there can be a case where we want to
download a lot of files and need to apply a fair usage policy on our end.
There can also be a need to download files from a list while limiting the
download speed, so as not to disturb our internet connection and
daily work.
In this chapter, we are going to learn how to download web resources with
Python. We are not only going to learn how to download any resource; we
are also going to build a YouTube video downloading tool.
Structure
In this chapter, we will discuss the following topics:
Understanding API concept
Building YouTube API client
Organizing downloaded data
Support for different formats and resolutions
Building batch data downloader
Objectives
By the end of this chapter, you will know how to build a download manager
that helps us download video files from a popular video hosting
platform. We will learn how an external API system works and how to take
advantage of it to download assets. We will also learn how to fetch binary
data from a webserver. All of these skills will be implemented using
the Python language.
Download manager
As mentioned in the chapter objectives, we will build tools
that allow us to download assets, for example images, from a given source.
To make it efficient, we will use the async programming technique. This will
help us avoid blocking pieces of code when accessing blocking
content such as internet assets. Before we dive into the first example, we shall
install the required libraries:
1. $ pip install asyncio==3.4.3
2. $ pip install httpx==0.24.1
3. $ pip install click==8.1.3
Code 9.1
Once we have installed the required packages, we will write a simple
script that fetches the asset given as a command-line argument. Let us check
the following example to understand how to achieve this:
1. import click
2. import asyncio
3. import httpx
4. import os
5. from urllib.parse import urlparse
6.
7.
8. async def main(url):
9. async with httpx.AsyncClient() as client:
10. response = await client.get(url, follow_redirects=True)
11.
12. if response.status_code == 200:
13. u = urlparse(url)
14. file_name = os.path.basename(u.path)
15. with open(f'/tmp/{file_name}', 'wb') as f:
16. f.write(response.content)
17.
18.
19. @click.command()
20. @click.option("--
url", help="File URL path to download ", required=True)
21. def run(url):
22. asyncio.run(main(url))
23.
24. if __name__ == '__main__':
25. run()
Code 9.2
You can see that in example Code 9.2 we used some libraries already known
from previous chapters, like click, httpx, and asyncio. In this case,
we have built a simple script with command-line support; we can specify the
URL of the resource that we want to download (line 20).
When a resource is downloaded, we strip the URL path down to the file name
(lines 13-14) and save the resource in the /tmp folder under its original name.
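The stripping step is worth seeing in isolation; a small sketch using one of the URLs from the list we build later:

import os
from urllib.parse import urlparse

url = "https://www.wikipedia.org/portal/wikipedia.org/assets/img/Wikipedia-logo-v2.png"
u = urlparse(url)                 # split the URL into components
print(os.path.basename(u.path))   # -> Wikipedia-logo-v2.png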
This simple script will be a starting point to build a more advanced
download manager.
In the next step, we will add an option to download files from a given list
instead of a single URL. That said, the following code shows how we can
download files from the provided URLs:
1. import click
2. import asyncio
3. import httpx
4. import os
5. from urllib.parse import urlparse
6.
7.
8. async def download(url):
9. async with httpx.AsyncClient() as client:
10. print(f"Fetching: {url}")
11. response = await client.get(url, follow_redirects=True)
12. if response.status_code == 200:
13. u = urlparse(url)
14. file_name = os.path.basename(u.path)
15. with open(f"/tmp/{file_name}", "wb") as f:
16. f.write(response.content)
17.
18.
19. async def download_list(urls_file):
20. with open(urls_file, "r") as f:
21. for item in f:
22. await download(item.strip())
23.
24.
25. async def main(url=None, url_list=None):
26. if url:
27. return await download(url)
28. if url_list:
29. print("Running downloader for given list of URLs")
30. return await download_list(url_list)
31.
32.
33. @click.command()
34. @click.option("--url", help="File URL path to download")
35. @click.option("--url-list", help="File with URLs to download")
36. def run(url, url_list):
37. asyncio.run(main(url, url_list))
38.
39.
40. if __name__ == "__main__":
41. run()
Code 9.3
As you can see, we have slightly modified the starting function (lines 33-36)
so that we can accept an additional parameter: a file path (url-list) pointing
to a list of URLs to fetch.
To demonstrate how to use this new parameter, let us create a sample file
called example_files_list.txt containing a list of URLs to fetch:
1. https://www.wikipedia.org/portal/wikipedia.org/assets/img/sprite-
8bb90067.svg
2. https://www.wikipedia.org/portal/wikipedia.org/assets/js/index-
86c7e2579d.js
3. https://www.wikipedia.org/portal/wikipedia.org/assets/img/Wikinews-
logo_sister.png
4. https://www.wikipedia.org/portal/wikipedia.org/assets/js/gt-ie9-
ce3fe8e88d.js
5. https://www.wikipedia.org/portal/wikipedia.org/assets/img/Wikipedia-
logo-v2.png
Code 9.4
Now, as shown in Code 9.3, we can point the script at the file containing the
list of URLs. We will use the file from Code 9.4 and fetch the content of the
listed URLs. The following code shows how to specify the file:
1. python download_manager2.py --url-list example_files_list.txt
2.
3. Running downloader for given list of URLs
4. Fetching: https://www.wikipedia.org/portal/wikipedia.org/assets/img/spr
ite-8bb90067.svg
5. Fetching: https://www.wikipedia.org/portal/wikipedia.org/assets/js/i
ndex-86c7e2579d.js
6. Fetching: https://www.wikipedia.org/portal/wikipedia.org/assets/img/Wi
kinews-logo_sister.png
7. (...)
8. httpcore.ConnectTimeout
9.
10. The above exception was the direct cause of the following exception:
11. (...)
12. File "/Users/darkman66/.virtualenvs/fun2/lib/python3.11/site-
packages/httpx/_transports/default.py", line 77, in map_httpcore_except
ions
13. raise mapped_exc(message) from exc
14. httpx.ConnectTimeout
Code 9.5
We ran our example code until it crashed (Code 9.5, lines 8-14). The root
cause of the exception shown in line 14 is a timeout when trying to access
the website whose content we tried to download. Considering the way our
script works (Code 9.3, lines 21-22), this kind of exception is undesired: if
it happens, our loop (Code 9.3, line 21) stops, and we never finish
processing the URL links from the file. The correct approach is to fetch all
the files from the given list and, even if there is a crash, recover and
continue with the download.
Before implementing a fixed version of the code, we need to install a new
library, tenacity, which helps us build retry techniques for async functions.
1. $ pip install tenacity
Code 9.6
Let us consider the following example, Code 9.7, which fixes the case of a
crash or networking issue.
We introduce a retry pattern for fetching external content, to handle the
mentioned timeout issues when downloading a specified web
resource. Once the library is installed, we can apply the retry mechanism. Let us
check Code 9.7 to understand it better.
1. import click
2. import asyncio
3. import httpx
4. import os
5. from urllib.parse import urlparse
6. from tenacity import AsyncRetrying, RetryError, stop_after_attempt
7.
8. RETRIES = 3
9.
10.
11. async def download(url):
12. """Fetch URL resource with retry"""
13. try:
14. async for attempt in AsyncRetrying(stop=stop_after_attempt(RET
RIES)):
15. with attempt:
16. click.echo(f"Fetching: {url}")
17. async with httpx.AsyncClient() as client:
18. response = await client.get(url, follow_redirects=True)
19. if response.status_code == 200:
20. u = urlparse(url)
21. file_name = os.path.basename(u.path)
22. with open(f"/tmp/{file_name}", "wb") as f:
23. f.write(response.content)
24. except RetryError:
25. click.echo(f"Failed to fetch {url} after {RETRIES} tries")
26.
27.
28.
29. async def download_list(urls_file):
30. with open(urls_file, "r") as f:
31. for item in f:
32. if item and 'http' in item:
33. await download(item.strip())
34.
35.
36. async def main(url=None, url_list=None):
37. if url:
38. return await download(url)
39. if url_list:
40. click.echo("Running downloader for given list of URLs")
41. return await download_list(url_list)
42.
43.
44. @click.command()
45. @click.option("--url", help="File URL path to download")
46. @click.option("--url-list", help="File with URLs to download")
47. def run(url, url_list):
48. asyncio.run(main(url, url_list))
49.
50.
51. if __name__ == "__main__":
52. run()
Code 9.7
As you might have noticed, to solve the mentioned problems with asset
timeouts and temporarily inaccessible web resources, we first
added a try/except block (lines 13-25) to catch exceptions when there is an
issue with fetching a network resource.
Next, we added, in line 14, a piece of code that retries when there
is an exception. The number of retries that we will perform is defined as a
static variable (line 8).
The rest of the code (lines 16-23) is the same as before (Code 9.3, lines 9-16),
except that we have wrapped it all in the retry context (Code 9.7, line
15). This is a common technique to fix issues with failing code by performing
micro-restarts.
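Tenacity also lets us pause between attempts. A minimal standalone sketch of the same retry pattern, adding a fixed two-second wait so a flaky server has time to recover (the wait_fixed strategy is an addition, not something Code 9.7 uses):

import asyncio
import httpx
from tenacity import AsyncRetrying, RetryError, stop_after_attempt, wait_fixed

async def fetch_with_pause(url: str) -> bytes:
    try:
        async for attempt in AsyncRetrying(stop=stop_after_attempt(3), wait=wait_fixed(2)):
            with attempt:
                async with httpx.AsyncClient() as client:
                    response = await client.get(url, follow_redirects=True)
                    response.raise_for_status()  # treat HTTP errors as failures too
                    return response.content
    except RetryError:
        raise RuntimeError(f"Failed to fetch {url} after 3 tries")

# asyncio.run(fetch_with_pause("https://www.wikipedia.org"))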
Let us now address a bottleneck in our code that limits the performance of
fetching resources.
What we have built so far downloads network resources from the given list,
but it has a strong disadvantage: we fetch the assets literally one by
one. If one resource takes longer, the next one waits behind the blocking
one, for example a slow resource, or a problematic one where we had to retry
multiple times.
To address this bottleneck, we must start fetching network resources in
parallel. Let us look at the following code to understand how we can modify
the existing Code 9.7 to use a parallel downloads approach.
We have modified the already known download_list method from
example Code 9.7 so that in Code 9.8, inside the loop
(lines 5-6), we create a list of all the download calls to be made
(coroutines).
1. async def download_list(urls_file):
2. calls_to_make = []
3. with open(urls_file, "r") as f:
4. for item in f:
5. if item and "http" in item:
6. calls_to_make.append(download(item.strip()))
7. click.echo(f"Number of URLs to fetch {len(calls_to_make)}")
8. await asyncio.gather(*calls_to_make)
Code 9.8
Next, the asyncio engine makes all these async calls and waits for them to
finish (line 8).
This way of handling concurrency will work very efficiently for us, but we
need to be aware of one issue. When we feed our script many URLs to parse and
fetch, we speed up the whole processing by running all the download actions
as concurrent coroutines; however, processing many hundreds of URLs becomes a
problem. Our whole application runs on a single CPU core, so if we spawn a
lot of concurrent coroutines, we expose our script to a situation where the
OS and CPU cannot process all the requested URLs in real time. This is
related to the limits of the system kernel and the number of simultaneous
sockets that the OS can handle.
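Before reaching for an external pooling library, note that the standard library can express the same bound with a semaphore. A sketch of the idea (an alternative, not the approach this chapter takes next):

import asyncio

CONCURRENCY_SIZE = 2

async def bounded(coro, semaphore):
    async with semaphore:  # at most CONCURRENCY_SIZE coroutines run at once
        return await coro

async def run_all(coros):
    semaphore = asyncio.Semaphore(CONCURRENCY_SIZE)
    await asyncio.gather(*(bounded(c, semaphore) for c in coros))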
The other bottleneck we will face is the limit on the number of parallel
requests that a webserver will accept from the same source IP address. In
other words, modern websites apply protection against DDOS attacks2, which
means that they do not tolerate the kind of aggressive content fetching our
script can produce. Let us learn how to address this problem using the
following steps:
In the following example, let us address the main concern, that is,
limiting the number of concurrent coroutines that our application runs.
Before that, however, we will extend the code from example Code 9.8. To do so,
we install asyncio pooling3.
1. $ pip install asyncio_pool
Code 9.9
Once we have installed the needed pooling library, we can improve Code 9.8
as shown in the following example:
1. import click
2. import asyncio
3. import httpx
4. import os
5. from asyncio_pool import AioPool
6. from urllib.parse import urlparse
7. from tenacity import AsyncRetrying, RetryError, stop_after_attempt
8.
9. RETRIES = 3
10. CONCURRENCY_SIZE=2
11.
12.
13. async def download(url):
14. """Fetch URL resource with retry"""
15. try:
16. async for attempt in AsyncRetrying(stop=stop_after_attempt(RET
RIES)):
17. with attempt:
18. click.echo(f"Fetching: {url}")
19. async with httpx.AsyncClient() as client:
20. response = await client.get(url, follow_redirects=True)
21. if response.status_code == 200:
22. u = urlparse(url)
23. file_name = os.path.basename(u.path)
24. with open(f"/tmp/{file_name}", "wb") as f:
25. f.write(response.content)
26. except RetryError:
27. click.echo(f"Failed to fetch {url} after {RETRIES} tries")
28.
29.
30. async def download_list(urls_file):
31. calls = []
32. with open(urls_file, "r") as f:
33. async with AioPool(size=CONCURRENCY_SIZE) as pool:
34. for item in f:
35. if item and "http" in item:
36. result = await pool.spawn(download(item.strip()))
37. calls.append(result)
38. click.echo(f"Commited {len(calls)} URLs to call")
39.
40. for call_item in calls:
41. call_item.result()
42.
43.
44. async def main(url=None, url_list=None):
45. if url:
46. return await download(url)
47. if url_list:
48. click.echo("Running downloader for given list of URLs")
49. return await download_list(url_list)
50.
51.
52. @click.command()
53. @click.option("--url", help="File URL path to download")
54. @click.option("--url-list", help="File with URLs to download")
55. def run(url, url_list):
56. asyncio.run(main(url, url_list))
57.
58.
59. if __name__ == "__main__":
60. run()
Code 9.10
The modified Code 9.10 is mostly like Code 9.7, except for the essence of the
change: we updated lines 33-37, where the loop (line 34) that previously
called the download function linearly now uses the connection pool module
(line 36).
The pooling system controls how many coroutines the main
asyncio reactor can process simultaneously. The number of parallel download
coroutines evaluated at the same time is declared in line 10.
Additionally, in lines 40-41, we collect the results of calling all the
download coroutines. In our case, the download function does not return any
result; nevertheless, the result-collecting technique acts as a safety stop
for cases where we must wait until all the coroutines are done
processing.
The next step is to address the problem highlighted before: the case where
the destination website uses protection against DDOS.
In the following examples, we will build a web proxy service that helps us
address the DDOS issue. Let us analyze the following figure to
see how we will build our proxy network:
Figure 9.1: Concept of proxy service
As shown in Figure 9.1, we will create a small service (proxy server) that is
installed on at least two different machines. This gives us an advantage, as
the requests we send to the website will be seen as coming from two different
IP addresses.
As you can probably imagine, the more IP addresses (servers) you have, the
better: the chances of being detected by the destination website get lower
when we have a big pool of IP addresses to use.
In the following example, we will check how to build a simple proxy service
using standard Python modules4 5:
1. import socketserver
2. from urllib.request import urlopen
3. from http.server import SimpleHTTPRequestHandler
4.
5. PORT = 9097
6. HOST = 'localhost'
7.
8. class MyProxy(SimpleHTTPRequestHandler):
9. def do_GET(self):
10. url = self.path
11. print(f"Opening URL: {url}")
12. self.send_response(200)
13. self.end_headers()
14. self.copyfile(urlopen(url), self.wfile)
15.
16.
17. with socketserver.TCPServer((HOST, PORT), MyProxy) as server:
18. print(f"Now serving at {PORT}")
19. server.serve_forever()
Code 9.11
1. In the above example, Code 9.11, we use a simple HTTP request
handler, which allows us to catch a GET call coming from a request and
make a new call to the destination server (lines 12-14).
2. We use the copyfile function (line 14) to internally copy the destination
server's response into the response object that we send back to our original
script. To serve it as a proxy server, we must start the HTTP service
(lines 17-19) listening on the port and address defined in lines 5-6.
3. Before we can test whether our newly built proxy service works, we
must save example Code 9.11 as proxy_service.py and start it.
4. Now, let us test our proxy service in the following example by
using the simple CLI tool curl6.
1. $ python proxy_service.py
2.
3. Now serving at 9097
Code 9.12
In the following example, Code 9.13, we use the curl command to fetch
HTML content from python.org. The request goes through the proxy using the
GET method7:
1. $ curl -x http://localhost:9097 http://python.org
Code 9.13
Unfortunately, there is a limitation to our proxy solution. You might have
noticed that we sent the request as GET, because we inherited from
SimpleHTTPRequestHandler8 and only overrode the GET functionality
(Code 9.11, lines 9-14).
Thus, we have an HTTP proxy service that only works with GET methods. In
our case that is enough, since our service (Code 9.11) only needs to
download web resources that are accessible via GET requests.
Another thing worth highlighting is that our simple proxy
solution only supports the HTTP protocol, not HTTPS. The reason is simple:
to build a proper proxy with SSL support, we would have to dig
deeper into SSL certificates. We could also use already available projects to
support this need9 10, although that topic is beyond this chapter.
Let us look at Code 9.14 to understand how to modify the example Code
9.10 to use our small HTTP proxy service:
1. import click
2. import asyncio
3. import httpx
4. import os
5. import random
6. from asyncio_pool import AioPool
7. from hashlib import sha256
8. from urllib.parse import urlparse
9. from tenacity import AsyncRetrying, RetryError, stop_after_attempt
10.
11. RETRIES = 3
12. CONCURRENCY_SIZE = 2
13.
14. class Downloader:
15.
16. def __init__(self, proxies=None):
17. self.proxies = proxies
18.
19. async def download(self, url):
20. """Fetch URL resource with retry"""
21. try:
22. async for attempt in AsyncRetrying(stop=stop_after_attempt(RE
TRIES)):
23. with attempt:
24. proxy_server = None
25. if self.proxies:
26. proxy_server = {
27. "all://": random.choice(self.proxies),
28. }
29. click.echo(f"Fetching: {url}, proxy: {proxy_server}")
30. async with httpx.AsyncClient(proxies=proxy_server) as cli
ent:
31. response = await client.get(url, follow_redirects=True)
32. if response.status_code == 200:
33. u = urlparse(url)
34. file_hash = sha256(url.encode('utf8')).hexdigest()
35. file_name = f"
{os.path.basename(u.path)}_{file_hash}"
36. with open(f"/tmp/{file_name}", "wb") as f:
37. f.write(response.content)
38. except RetryError:
39. click.echo(f"Failed to fetch {url} after {RETRIES} tries")
40.
41. async def download_list(self, urls_file):
42. calls = []
43. with open(urls_file, "r") as f:
44. async with AioPool(size=CONCURRENCY_SIZE) as pool:
45. for item in f:
46. if item and "http" in item:
47. result = await pool.spawn(self.download(item.strip()))
48. calls.append(result)
49.
50.
51. @click.command()
52. @click.option("--url", help="File URL path to download")
53. @click.option("--url-list", help="File with URLs to download")
54. @click.option("--proxy", help="List of proxy servers", multiple=True)
55. def run(url, url_list, proxy):
56. d = Downloader(proxy)
57. if url:
58. run_app = d.download(url)
59. elif url_list:
60. run_app = d.download_list(url_list)
61. if run_app:
62. asyncio.run(run_app)
63. else:
64. click.echo("No option selected")
65.
66.
67. if __name__ == "__main__":
68. run()
Code 9.14
The essential change in Code 9.14 compared to Code 9.10 is in line 30, where we
initialize the HTTP client context and pass the list of proxy servers as an
argument. Notice that this list is initialized in the class
constructor (lines 16-17). Another important change is the body of the whole
download functionality: we converted the function-driven approach (Code 9.10)
to object-oriented programming with a class. With this
move, we managed to simplify the initial main entry point (lines 55-64) and
encapsulate the calls in individual methods.
Let us check how to use our new script from Code 9.14, saved
as download_with_proxy.py, to fetch a single URL.
Our program downloads the resource specified by the url parameter (Code 9.15)
and saves it in the /tmp folder.
We extract the last part of the URL, which is the resource name (Code
9.14, line 33), and calculate a SHA25611 hash (Code 9.14, line 34) over the
full resource URL.
These two parts are joined together into a file name (line 35) that we
use for saving the fetched resource (lines 35-37), which guarantees
file name uniqueness.
1. $ python download_with_proxy.py --url https://www.wikipedia.org
2.
3. Fetching: https://www.wikipedia.org, proxy: None
Code 9.15
Now, let us see how to run the same script with a list of files (the URL
resources from Code 9.4) to fetch, combined with our proxy service.
We need to start the proxy service as in Code 9.12. In the following
example, we use it:
1. $ python download_with_proxy.py --url-list example_files_list.txt --
proxy=http://localhost:9097
2.
3. Fetching: https://www.wikipedia.org/portal/wikipedia.org/assets/img/spr
ite-8bb90067.svg, proxy: {'all://': 'http://localhost:9097'}
Code 9.16
In line 3, we can see example output from the running script. Note
that the URL we try to reach is served via the SSL (HTTPS) protocol.
Now, let us check what we have in the output of our proxy service (Code
9.12):
1. 127.0.0.1 - - [20/Oct/2023 22:17:21] code 501, message Unsupported method ('CONNECT')
2. 127.0.0.1 - - [20/Oct/2023 22:17:21] "CONNECT www.wikipedia.org:443 HTTP/1.1" 501 -
Code 9.17
We can see that our proxy service is returning errors (code 501, line 1), which means there is an issue with our HTTP proxy service. If you check line 2, you will see the detail: when the SSL connection is being initialized, the CONNECT method is not supported.
We will not concentrate on building a full proxy service that works properly with HTTPS connections. Instead, we can use an existing Python library that offers proxy support, called pproxy12.
1. $ pip install pproxy
Code 9.18
After installing the package, we can start the proxy service. The great feature of this package is that it allows us to start the service immediately with zero configuration. Let us check the following example:
1. $ pproxy
2.
3. Serving on :8080 by http,socks4,socks5
Code 9.19
Compared to our example in Code 9.16, we only need to specify the new proxy service address, as in the following example:
1. $ python download_with_proxy.py --url-list example_files_list.txt --proxy=http://localhost:8080
2.
3. Fetching: https://www.wikipedia.org/portal/wikipedia.org/assets/img/sprite-8bb90067.svg, proxy: {'all://': 'http://localhost:8080'}
Code 9.20
This time, having used a coherent proxy service that supports SSL connections without issues, we can notice that fetching the resource does not raise any fatal exceptions. We can now route our requests through a proxy for those services that would otherwise start detecting our requests as a potential DDoS attack.
Conclusion
In this chapter, we learned how to build a very efficient file downloading
program. This program not only allows us to download a single web
resource but also helps download many files from a given list.
We have also learned how to make downloads more efficient, which means faster, by using multiple download channels. The next thing we learned is how to work with streaming services like YouTube and how to download video resources. It is worth noticing that Google keeps changing the YouTube public API, and the techniques for fetching YouTube videos may change with time, but the essential download algorithm that we have learned in this chapter will help you modify the code and make it work with the latest YouTube API. Good luck and keep coding!
In the next chapter, we will learn some more low-level networking with Python.
1. https://www.airtel.in/blog/broadband/fup-internet-plan-significance/
2. https://www.cloudflare.com/en-gb/learning/ddos/what-is-a-ddos-
attack/
3. https://pypi.org/project/asyncio-pool/
4. https://docs.python.org/3/library/urllib.request.html
5. https://docs.python.org/3/library/http.server.html
6. https://curl.se/
7. https://www.w3schools.com/tags/ref_httpmethods.asp
8. https://docs.python.org/3/library/http.server.html#http.server.SimpleHTTPRequestHandler
9. https://mitmproxy.org
10. https://pypi.org/project/pproxy/
11. https://en.wikipedia.org/wiki/SHA-2
12. https://pypi.org/project/pproxy/
13. https://www.gnu.org/software/gzip/
14. https://developer.mozilla.org/en-US/docs/web/http/methods/head
15. https://www.python.org/downloads/release/python-3120/
16. https://www.python.org/downloads/release/python-3120/
17. https://pytube.io/en/latest/index.html
18. https://pypi.org/project/youtube_dl/
19. https://developers.google.com/youtube/registering_an_application
20. https://developers.google.com/youtube/v3/docs/search/list
21. https://developers.google.com/youtube/v1_deprecation_notice
22. https://docs.python.org/3/library/tempfile.html
CHAPTER 10
Make A Program to Safeguard
Websites
Introduction
The modern internet is not only a source of unlimited knowledge sharing and access to many free online encyclopedias. It is also a place where we can all find the latest news and join streaming services where watching a movie takes just one click.
With great power comes great responsibility. Sometimes, unwanted internet content should be blocked from access. Corporate policies in many companies require employees to install firewall software to filter unwanted content. In this chapter, we will learn how to build content-filtering software as a centralized solution. We will look closely at how a local computer opens web content, and how to filter unwanted websites and avoid accessing them.
Structure
This chapter will cover the following topics:
Understanding packet routing policies
Writing your own DNS server
Building a DHCP service
Packet inspection software
Filtering web content
Challenges with encrypted websites
Objectives
In this chapter, we will address these highlighted points using Python modules, some of which we will build from scratch. Additionally, we will shed light on an implementation of these functionalities where all the content filtering is controlled with configuration files.
Conclusion
In this chapter, we learned how to design simple yet powerful routers using Linux and netfilter34 modules. Next, we analyzed how websites are reached by client machines using DHCP address assignment and the DNS service. Once we understood that part, we went deeper into networking, where we learned how to analyze network packets and implement very light yet very efficient firewall filter rules. We also learned what kind of challenges we may face with encrypted web traffic like HTTPS.
In the next chapter, we will learn how to use Python to manage calendars
and how to combine a few calendars into one. This will teach us to build a
very efficient calendar tool that can be used with your favorite calendar
application.
1. https://ieeexplore.ieee.org/document/9822234
2. https://docs.oracle.com/cd/E19455-01/806-0916/6ja85398m/index.html
3. https://standards.ieee.org/faqs/regauth/
4. https://www.techtarget.com/searchnetworking/definition/DHCP
5. https://github.com/search?
q=python%20dhcp%20server&type=repositories
6. https://pypi.org/search/?q=dhcp+server&o=
7. https://mariadb.org/download/?t=mariadb&p=mariadb&r=11.2.2
8. https://www.sqlalchemy.org
9. https://docs.sqlalchemy.org/en/20/orm/quickstart.html
10. https://www.aviransplace.com/post/safe-database-migration-pattern-
without-downtime-1
11. https://alembic.sqlalchemy.org/en/latest/
12. https://docs.sqlalchemy.org/en/20/orm/session_basics.html
13. https://datatracker.ietf.org/doc/html/rfc768
14. https://pypi.org/project/coloredlogs/
15. https://learn.microsoft.com/en-us/windows-
server/troubleshoot/dynamic-host-configuration-protocol-
basics#dhcpdiscover
16. https://learn.microsoft.com/en-us/windows-
server/troubleshoot/dynamic-host-configuration-protocol-
basics#dhcprequest
17. https://support.apple.com/guide/security/wi-fi-privacy-
secb9cb3140c/web
18. https://blogs.cisco.com/networking/randomized-and-changing-mac-
rcm
19. https://datatracker.ietf.org/doc/html/rfc1034
20. https://datatracker.ietf.org/doc/html/rfc1034#autoid-27
21. https://datatracker.ietf.org/doc/html/rfc1035#section-4.2
22. https://linuxize.com/post/how-to-use-dig-command-to-query-dns-in-
linux/
23. https://datatracker.ietf.org/doc/html/rfc5782
24. https://github.com/hagezi/dns-blocklists
25. https://ubuntu.com
26. https://www.nfstream.org
27. https://github.com/ntop/nDPI
28. https://github.com/nfstream/nfstream
29. https://developer.mozilla.org/en-
US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Common_types
30. https://www.netfilter.org/projects/nftables/index.html
31. https://github.com/svinota/pyroute2
32. https://www.netfilter.org/projects/nftables/manpage.html
33. https://wiki.nftables.org/wiki-nftables/index.php/Element_timeouts
34. https://www.netfilter.org/
CHAPTER 11
Centralizing All Calendars
Introduction
Being a digital nomad nowadays can be lots of fun and bring lots of benefits, for sure. However, for every person working on multiple projects, having many agendas and calendars to cover can be very challenging. You may think that a desktop calendar application can solve the problem by configuring multiple calendars in it. That is partially true. The issue appears when we try to switch between devices, or simply want to share with others when we are available for a meeting among all the other meetings across all calendars.
Structure
This chapter will cover following topics:
Building a subscriber tool for web calendars
Google
Office 365
iCal
Calendar parser
Subscribe locally
Synchronize with external calendar
Objectives
In this chapter, we will build a tool that helps us address the problem of multiple calendars: using Python, we will be able to synchronize our busy day across many calendars. We will learn how to use this with the two most popular platforms, Google1 and Office 3652, and we will also see how to work with offline calendar files.
Google
Before we start subscribing to a calendar, we need to configure the API and credentials for Google services. First and foremost, we need to sign up for the Google developer platform3. When the application is ready, we shall create our very first project and mark it as internal; this guarantees that we are the only authorized user of this application for the time being.
The next step is to create OAuth credentials by following the Google guide4. Another important step is to enable the Google Calendar API for our new project.
Figure 11.1: Enabling Google calendar access
When all the setup is done, it is time to start testing. To be able to use Google Calendar, we have to install the following Python modules.
1. $ pip install gcsa beautiful-date
Code 11.1
Once the modules are installed, and assuming some future events already exist in our Google calendar, we can list them by executing the following code.
1. from gcsa.google_calendar import GoogleCalendar
2.
3.
4. gc = GoogleCalendar(credentials_path='/var/tmp/credentials.json')
5.
6. for event in gc.get_events():
7.     print(event)
Code 11.2
Please notice that in line 4, when we initialize the calendar client, we assume that the credentials JSON file we obtained from the Google console is saved under /var/tmp/credentials.json.
After running the code, we should get output like the following example. Since we do not put any limit on the query, it will print out all future events.
1. 2023-03-20 18:00:00+02:00 - Abc Meeting
2. 2023-04-19 17:30:00+02:00 - 123 Meeting!
3. 2023-07-31 17:30:00+01:00 - wow Meeting
We should also notice that after running the script, the default browser is going to open, and you will be asked to grant permissions to your user data in the Google space, as shown in the following figure.
Figure 11.2: Enabling personal calendar access via OAuth authentication
This step is necessary to give access to our calendar and retrieve an API token. It is quite important to notice that this step fetches a token that will expire at some point. If it does, please allow access to the calendar again. In normal use, the Google Calendar Simple API5 module has built-in functionality to refresh the access token. We will use this in the following parts of this chapter.
Office 365
First, we shall make sure we have an account with the Microsoft Office 3656 or Hotmail7 service. The next step is to install the Microsoft Exchange RESTful8 API module9.
1. $ pip install O365
The next step is to follow the process10 of configuring an application in the Azure11 ecosystem. Once we have created the application and configured it by following the mentioned GitHub guide, we need to make sure that we have added the scope parameters as shown in the following figure:
Figure 11.4: Example of accepting authentication and access privileges for test application
Only the very first time will you see the accept permissions screen, after which you will be redirected to a URL that you have to copy in full, paste back into our console, and hit enter to continue. Now we will be able to fetch the token and download all the events from your calendar (lines 11-13). When the script is executed again, it will not ask you to re-login since the token is still valid, so lines 3-8 will not run unless the token expires and authentication is required again. This approach is very similar to the one described in the subchapter about Google Calendar.
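The full listing of that first test is not repeated here; a minimal sketch assembled from the pieces we will reuse later in Code 11.11 (the credentials and the mailbox address are placeholders) could look like this:

from O365 import Account, MSGraphProtocol

# Placeholder credentials taken from the Azure application page.
credentials = ("28xxx-yyy-zzz", "some-secret-key-from-azure")
protocol = MSGraphProtocol()
scopes = ["https://graph.microsoft.com/.default"]

account = Account(credentials, protocol=protocol)
if not account.is_authenticated:
    # Prints the consent URL; after accepting, paste the redirect URL back.
    if account.authenticate(scopes=scopes):
        print("Authenticated")

# Fetch all the events from the default calendar.
schedule = account.schedule(resource="[email protected]")
calendar = schedule.get_default_calendar()
for event in calendar.get_events(include_recurring=False):
    print(event.start, event.subject)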
iCal
Another calendar use case may be the need to import a calendar from third-party software. A very popular standard is iCalendar14. To produce an iCalendar file, we can use any modern calendar tool out there, for instance, the system calendar application. To be able to import and process such a file, we are going to install a Python module15 as in the following example.
1. $ pip install icalendar coloredlogs pytz
Code 11.6
Example code for loading and parsing an iCal file is shown in the following example. First, create a few example entries in your desktop calendar application and export the calendar as the myevents.ics file.
1. import icalendar
2. ics_file = "myevents.ics"
3. with open(ics_file, 'rb') as f:
4.     calendar = icalendar.Calendar.from_ical(f.read())
5. for event in calendar.walk('VEVENT'):
6.     print('-'*10)
7.     print(event.get("name"))
8.     print(event.get("SUMMARY"))
Code 11.7
We can quickly notice that an iCalendar file is parsed much the same as the popular Google and Office 365 web calendars, with the one difference that it is a flat file, so we do not have to use a complex web authentication flow.
Calendar parser
In the previous subchapters, we learned how to load and parse data from three types of external calendars. Now, we are going to use that knowledge to collect data from the external calendars and merge it into one single calendar file (ics). Let us check the following example to see how to achieve this.
1. import coloredlogs
2. import logging
3. import os
4. import pytz
5. from datetime import datetime
6. from icalendar import Calendar, Event
7. from gcsa.google_calendar import GoogleCalendar
8.
9. CALENDAR_FILE = "/var/tmp/calendar.ics"
10.
11.
12. class MyCalendar:
13.     cal = None
14.
15.     def __init__(self):
16.         self.read()
17.         if not self.cal:
18.             self.cal = Calendar()
19.             self.cal.add("prodid", "-//My calendar product//mxm.dk//")
20.             self.cal.add("version", "2.0")
21.
22.     def sync_with_google(self):
23.         pass
24.
25.     def sync_with_office365(self):
26.         pass
27.
28.     def sync_with_file(self, file_path):
29.         pass
30.
31.     def create_event(self, event_dict):
32.         event = Event()
33.         for k, v in event_dict.items():
34.             event.add(k, v)
35.         return event
36.
37.     def find_event(self, event_name, event_start):
38.         for component in self.cal.walk():
39.             if component.name.upper() == "VEVENT" and component.get('name') == event_name and component.decoded("dtstart") == event_start:
40.                 return component
41.
42.     def read(self):
43.         if os.path.exists(CALENDAR_FILE):
44.             with open(CALENDAR_FILE, 'rb') as f:
45.                 self.cal = Calendar.from_ical(f.read())
46.
47.     def save(self):
48.         with open(CALENDAR_FILE, "wb") as f:
49.             f.write(self.cal.to_ical())
50.
51.
52. if __name__ == '__main__':
53.     coloredlogs.install(level=logging.DEBUG)
54.     c = MyCalendar()
55.     c.sync_with_google()
56.     c.sync_with_office365()
57.     c.sync_with_file('some-file/path/calendar.ics')
58.     c.save()
Code 11.8
We try to load the existing calendar file (the destination calendar) in the constructor of our calendar-syncing class MyCalendar (line 16). When this attempt fails, we assume that we need to create a new calendar instance and give it some backwards-compatibility attributes (lines 17-20) so it can be properly processed by calendar applications.
We also added a method for creating an event object (lines 31-34). To simplify its flow, we assume that the method's argument is a dictionary, and we simply add the dictionary entries to the event object as event attributes (line 34). To visualize this, let us check the following example dictionary and the way we call the create_event method.
1. record = {
2.     "summary": event.summary,
3.     "dtstart": event.start,
4.     "dtend": event.end,
5.     "dtstamp": event.created,
6.     "uid": event.event_id
7. }
8. record = self.create_event(record)
Code 11.9
We can see that the dictionary keys map to attributes of the event object. This makes it quite easy to build such a dictionary so that we can easily sync all the attributes from an external calendar.
The other important method we are going to use is the one that helps us find out whether an event we are trying to add to or remove from our local calendar already exists (Code 11.8, lines 37-40). Unfortunately, performance-wise, this method uses the walk method, so traversing a very massive and busy calendar is not as time-efficient as we would like.
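If the destination calendar grows large, one possible optimization (our own sketch, not part of Code 11.8) is to walk the calendar once, build a lookup index, and consult that index instead of walking for every event:

def build_event_index(cal):
    # One pass over the calendar: key each VEVENT by (summary, dtstart)
    # so later lookups are dictionary accesses instead of full walks.
    index = {}
    for component in cal.walk("VEVENT"):
        key = (str(component.get("summary")), component.decoded("dtstart"))
        index[key] = component
    return index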
We also introduced (Code 11.8, lines 55-57) how we are going to synchronize events from the external calendars.
Let us check the following example to see how we are going to take care of events coming from the Google service.
1. def __init__(self):
2.     self.gc = GoogleCalendar(credentials_path="google_credentials.json")
3.
4. def sync_with_google(self):
5.     for event in self.gc.get_events():
6.         component = self.find_event(event.summary, event.start)
7.         if not component:
8.             record = {
9.                 "summary": event.summary,
10.                "dtstart": event.start,
11.                "dtend": event.end,
12.                "dtstamp": event.created,
13.                "uid": event.event_id
14.            }
15.            record = self.create_event(record)
16.            logging.info(f"Adding calendar record to database")
17.            self.cal.add_component(record)
Code 11.10
We added to the class constructor an instance of the Google calendar service (line 2). This is going to trigger Google authentication through the browser, like we learned with Code 11.2. The next part is the actual call to fetch the calendar data (line 5), then going through each event and trying to find it in our local copy (line 6).
When we do not find any instance of the event in the local calendar, we add such an entry (lines 7-17). Then, as shown in Code 11.8 (line 58), we save the newly updated calendar to the local file copy.
In the next example, we will see how we can achieve the same thing with the method supporting the Office 365 connection.
1. def sync_with_office365(self):
2.     credentials = (client_id, secret_id)
3.     protocol = MSGraphProtocol()
4.     scopes = ["https://graph.microsoft.com/.default"]
5.     account = Account(credentials, protocol=protocol)
6.     if not account.is_authenticated:
7.         if account.authenticate(scopes=scopes):
8.             print("Authenticated")
9.     schedule1 = account.schedule(resource=f"{user}@hotmail.com")
10.    calendar1 = schedule1.get_default_calendar()
11.    for event in calendar1.get_events(include_recurring=False):
12.        component = self.find_event(event.subject, event.start)
13.        if not component:
14.            record = {
15.                "summary": event.subject,
16.                "dtstart": event.start,
17.                "dtend": event.end,
18.                "dtstamp": event.created,
19.                "uid": event.object_id,
20.            }
21.            record = self.create_event(record)
22.            logging.info(f"Adding calendar record to database")
23.            self.cal.add_component(record)
Code 11.11
We reused Code 11.4 to be able to synchronize with the Office 365 service. The main improvement we introduce here is in lines 14-20, where we build the dictionary that we use the same way as we did with Google in Code 11.10.
We can also notice that nowhere in our synchronizing class (Code 11.8) do we keep the credentials instance that we have to use for authentication (line 5). We are going to modify our code in such a way that we move the authentication part to a more reusable class property, as in the following code.
1. @property
2. def settings(self):
3.     if not hasattr(self, '_config'):
4.         self._config = configparser.ConfigParser()
5.         self._config.read('sync.ini')
6.     return self._config
Code 11.12
We created a property that checks whether the internal variable storing the settings already exists (line 3); if so, we return its value (line 6). If such a variable does not exist, we initialize it, store the config parser, and then return its content (lines 4-6). For reading the authentication credentials, we are going to use a Python ini16 configuration file (sync.ini), which is shown below.
1. [google]
2. credentials_path = google_credentials.json
3.
4. [office365]
5. user = some.user
6. client_id = 28xxx-yyy-zzz
7. secret_id = some-secret-key-got-from-azure
Code 11.13
We added two main sections to the configuration ini file: one used by the Office 365 connection (lines 4-7) and one for Google authentication (lines 1-2). To be able to use those credentials in our main class file, we are going to refactor the main class in the following way.
1. class MyCalendar:
2.
3.     def __init__(self):
4.         self.gc = GoogleCalendar(credentials_path=self.settings['google']['credentials_path'])
5.
6.     @property
7.     def account(self):
8.         credentials = (
9.             self.settings['office365']['client_id'],
10.            self.settings['office365']['secret_id']
11.        )
12.        protocol = MSGraphProtocol()
13.        scopes = ["https://graph.microsoft.com/.default"]
14.        account = Account(credentials, protocol=protocol)
15.        if not account.is_authenticated:
16.            if account.authenticate(scopes=scopes):
17.                logging.info("Office 365 Authenticated")
18.        return account
Code 11.14
In the class constructor, we left intact the rest of the code that we already introduced (Code 11.8), albeit we updated the part of the object constructor that creates the Google calendar instance (line 4). In this case, we use the configuration instance instead of a hardcoded path. At any point in time, we can update the location of the credentials file without changing the code.
Another addition is the account property, which uses the office365 section of the configuration file (Code 11.14, lines 6-18). We used the same dynamic-property technique as in Code 11.12. In this case, we initialize and authenticate the Office 365 client.
In the following example, we will see how we can refactor our Office 365 synchronization method to use the new approach.
1. def sync_with_office365(self):
2.     user = self.settings["office365"]["user"]
3.     schedule1 = self.account.schedule(resource=f"{user}@hotmail.com")
4.     calendar1 = schedule1.get_default_calendar()
5.     for event in calendar1.get_events(include_recurring=False):
6.         component = self.find_event(event.subject, event.start)
7.         if not component:
8.             record = {
9.                 "summary": event.subject,
10.                "dtstart": event.start,
11.                "dtend": event.end,
12.                "dtstamp": event.created,
13.                "uid": event.object_id,
14.            }
15.            record = self.create_event(record)
16.            logging.info(f"Adding Office365 calendar record to database")
17.            self.cal.add_component(record)
Code 11.15
We now have a local copy of all the events merged from the Google and Office 365 calendars in one place. Next, let us move on to adding parsing of a static ics file. In this case, we have to add the lines below to the ini configuration file.
1. [ics]
2. path = myevents.ics
Code 11.16
The next step is to update the sync_with_file method so that it uses this configuration and parses the myevents.ics file.
1. def sync_with_file(self):
2.     with open(self.settings["ics"]["path"], "rb") as f:
3.         calendar = Calendar.from_ical(f.read())
4.     for event in calendar.walk("VEVENT"):
5.         component = self.find_event(event["SUMMARY"], event["DTSTART"])
6.         if not component:
7.             record = {
8.                 "summary": event["SUMMARY"],
9.                 "dtstart": event["DTSTART"],
10.                "dtend": event["DTEND"],
11.                "dtstamp": event["DTSTAMP"],
12.                "uid": event["UID"],
13.            }
14.            record = self.create_event(record)
15.            logging.info(f"Adding ics calendar record to database")
16.            self.cal.add_component(record)
Code 11.17
We use the mechanics already known to us from the other methods: find whether the event already exists in the local calendar; if not, create a dictionary with the event elements and add it to the local calendar.
There is also one more thing to improve: the method that traverses the local calendar trying to find an already existing event. Since we work with three different types of calendars, and there can be some discrepancies in date standards, we shall update our find_event method as in the following example.
1. def find_event(self, event_name, event_start):
2.     for component in self.cal.walk():
3.         if (
4.             component.name.upper() == "VEVENT"
5.             and component.get("summary") == event_name
6.             and (component.get('dtstart') == event_start
7.                  or component.decoded("dtstart") == event_start)
8.         ):
9.             logging.debug("Found item")
10.            return component
Code 11.18
As was said, we updated this method with a fix for how we compare datetimes when trying to find an existing event (lines 6-7). This way, we are a bit more flexible when comparing event datetimes, so we will not re-add an event that already exists just because the time comparison failed.
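To see why both comparisons help, consider this small illustration (hypothetical event data): the raw property is an icalendar wrapper object, while decoded() returns a plain datetime, so an event_start coming from another calendar source may match only one of the two forms.

from icalendar import Event

# A hypothetical one-event snippet; icalendar expects CRLF line endings.
event = Event.from_ical(
    "BEGIN:VEVENT\r\n"
    "SUMMARY:Demo\r\n"
    "DTSTART:20230320T180000Z\r\n"
    "END:VEVENT\r\n"
)
print(type(event.get("dtstart")))      # icalendar property wrapper
print(type(event.decoded("dtstart")))  # plain datetime.datetime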
Subscribe locally
So far, we have managed to build a tool that allows us to subscribe to external calendars and sync their events with a local database. Now it would be great if we could use this local calendar with a calendar application.
There is one option: we could import1718 the calendar file that our tool generates. Unfortunately, importing a calendar from a file has one big disadvantage: we import the file and update the local calendar application with the events it brings, but when we run the resynchronization script, we update the ics file once again and do not see those changes in the calendar application.
To address this issue, we are going to build a simple yet powerful subscription-driven service that our system calendar application can use for synchronization. In this case, any change made in our local ics file will be reflected almost immediately in the system calendar.
Figure 11.5: Adding remote calendar URL to local calendar application.
To be able to build such a service, as you can see in Figure 11.5, we must write a web service that exposes a dynamic, ics-standard-driven file. For this, we need to install a light web framework called Flask19.
1. $ pip install flask==3.0.0
Code 11.19
Let us create the following file and name it ics_service.py. This file is going to be our calendar service, which we will improve in the next part of this subchapter.
1. from flask import Flask
2.
3. app = Flask(__name__)
4.
5. @app.route("/")
6. def hello_world():
7.     return "<p>Calendar service</p>"
Code 11.20
Now, to start the service, we are going to use the Flask built-in HTTP server, as in the following example.
1. $ flask --app ics_service run
2. * Serving Flask app 'ics_service'
3. * Debug mode: off
4. WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
5. * Running on http://127.0.0.1:5000
6. Press CTRL+C to quit
Code 11.21
We can see that the service starts on localhost, listening on port 5000. If we open the address http://127.0.0.1:5000, we will see the HTML body defined in Code 11.20, lines 5-7.
The next part of building the service is to add a simple ics file handler that exposes the file content to the system calendar application. Let us check the following code to see how to approach that.
1. from flask import Flask
2. from flask import make_response
3. from manage_full import CALENDAR_FILE
4.
5. app = Flask(__name__)
6.
7. @app.route("/")
8. def hello_world():
9.     return "<p>Calendar service</p>"
10.
11. @app.route("/calendar")
12. def calendar():
13.     with open(CALENDAR_FILE, 'rb') as f:
14.         resp = make_response(f.read(), 200)
15.     resp.headers['content-type'] = 'text/calendar'
16.     return resp
Code 11.22
We can clearly see that we have added the /calendar endpoint, which is responsible for returning calendar events to the system calendar application. Another thing worth noticing about the method exposing the calendar (lines 11-12) is that it only handles the GET method. This is quite important, since it means we can only read events from our busy calendar. Let us now add the endpoint we just created to our calendar application. For this exercise, we are going to use the Thunderbird20 application. Once the application is installed, we need to click in the menu → New → Calendar → On the network, and fill in the form with the calendar URL (http://127.0.0.1:5000/calendar), as shown in Figure 11.5.
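The write-back direction mentioned in the conclusion below is not part of Code 11.22; a hedged sketch of how it could look (our assumption: a PUT handler on the same endpoint that simply overwrites the local copy) is:

from flask import request

@app.route("/calendar", methods=["PUT"])
def update_calendar():
    # Overwrite the local ics copy with the body uploaded by the
    # calendar client; 204 signals success with no response body.
    with open(CALENDAR_FILE, "wb") as f:
        f.write(request.get_data())
    return "", 204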
Conclusion
In this chapter, we learned how to use Python to synchronize calendar events from external services to a local copy and expose that single managed calendar to a desktop application. Next, we understood how we can push newly created or changed events from the desktop application to our local copy and then push them back to the external service. This chapter surely helped us learn how to manage our busy day with Python.
In the next chapter, we are going to learn how to use Python to build sophisticated monitoring tools that we can use to check the availability of external services.
1. https://developers.google.com/calendar/api/quickstart/python
2. https://learn.microsoft.com/en-us/previous-versions/office/office-365-
api/api/version-2.0/calendar-rest-operations
3. https://console.cloud.google.com
4. https://developers.google.com/calendar/api/quickstart/python#authorize_credentials_for_a_desktop_application
5. https://google-calendar-simple-
api.readthedocs.io/en/latest/getting_started.html
6. https://www.office.com
7. https://outlook.live.com
8. https://en.wikipedia.org/wiki/Overview_of_RESTful_API_Description_Languages
9. https://github.com/O365/python-o365
10. https://github.com/O365/python-o365#oauth-authentication
11. https://entra.microsoft.com/#view/Microsoft_AAD_RegisteredApps/ApplicationMenuBlade/~/Authentication/
12. https://click.palletsprojects.com/en/
13. https://learn.microsoft.com/en-us/graph/auth-v2-user?
context=graph%2Fapi%2F1.0&view=graph-rest-1.0&tabs=http
14. https://en.wikipedia.org/wiki/ICalendar
15. https://github.com/collective/icalendar
16. https://docs.python.org/3/library/configparser.html
17. https://support.apple.com/en-gb/guide/calendar/icl1023/mac
18. https://support.microsoft.com/en-us/office/import-or-subscribe-to-a-
calendar-in-outlook-com-or-outlook-on-the-web-cff1429c-5af6-41ec-
a5b4-74f2c278e98c?ui=en-us&rs=en-gb&ad=gb
19. https://flask.palletsprojects.com/en/3.0.x/quickstart/#a-minimal-
application
20. https://www.thunderbird.net/en-US/
CHAPTER 12
Developing a Method for
Monitoring Websites
Introduction
The constant challenge for system administrators is keeping all online and network assets consistently available. A crucial part of their everyday work routine is having access to great monitoring tools. Every single occurrence of a service instance being faulty or inaccessible should be reported to the system administrator.
Structure
This chapter will cover the following topics:
Brief introduction to TCP/UDP packets
Understanding how monitoring works
Concept of monitoring probes
Building reporting central
Design alarm system
Objectives
This chapter will show us how to build a simple yet efficient tool for monitoring any kind of website. We will learn things like reporting the availability of defined websites, reporting uptime, and catching those most crucial moments when our important service is no longer accessible or its access time is slow.
TCP/UDP
We have had a brief introduction to how TCP/IP packets work and how they can be simulated with Python, but we did not talk much about the difference between TCP and UDP. Let us check how we can support connecting and processing TCP and UDP packets with Python. Before we do this, we need to understand the key difference between them. In a simplified picture, TCP1 is a communication standard in the network stack where we have guaranteed delivery and error-checked, stable communication.
UDP2, instead, is slacker and does not guarantee packet delivery, since in this standard we send a network packet without waiting for confirmation that the destination party has received it.
1. Knowing this, let us try to simulate a simple Python implementation of a TCP client and server in the following code.
1. import socket
2.
3. HOST = "localhost"
4. PORT = 62222
5.
6. with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
7.     s.bind((HOST, PORT))
8.     s.listen()
9.     conn, addr = s.accept()
10.     with conn:
11.         print(f"Connected by client: {addr}")
12.         while True:
13.             data = conn.recv(1024)
14.             if not data:
15.                 break
16.             conn.sendall(data)
Code 12.1
We can see that we used the same technique we learned in Chapter 10, Make A Program to Safeguard Websites, to build a TCP socket server. We basically created a simple echo service where we reply with the message that was sent to the server (line 16). Let us check the following example to see how the client side looks.
1. import socket
2.
3. HOST = "localhost"
4. PORT = 62222
5.
6. with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
7.     s.connect((HOST, PORT))
8.     s.sendall(b"Hello, world")
9.     data = s.recv(1024)
10.
11. print(f"Received {data!r}")
Code 12.2
We can see that we connect to the same port that the server is listening on (line 6). Once we are connected (line 7), we send a message (line 8) and get the server response.
2. Let us check the following example to see how to build a similar example for a UDP service.
1. import socketserver
2.
3. HOST = "localhost"
4. PORT = 62222
5.
6. class MyUDPHandler(socketserver.BaseRequestHandler):
7.
8.     def handle(self):
9.         data = self.request[0].strip()
10.         socket = self.request[1]
11.         print(f"Received: {data}")
12.         socket.sendto(data.upper(), self.client_address)
13.
14. if __name__ == "__main__":
15.     with socketserver.UDPServer((HOST, PORT), MyUDPHandler) as server:
16.         server.serve_forever()
Code 12.3
We can see that we used the socketserver3 package to simplify the UDP server. We use the same approach as in example 12.1, which means we respond with the same message that we received from the client. Let us look at the client side in the following example.
1. import socket
2.
3. HOST = "localhost"
4. PORT = 62222
5.
6. with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
7.     s.connect((HOST, PORT))
8.     s.sendall(b"Hello, world")
9.     data = s.recv(1024)
10.
11. print(f"Received {data}")
Code 12.4
We managed to write a similar UDP client as we did for TCP, albeit this time we explicitly told the Python socket connection that we will be connecting to a UDP service (line 6) – notice socket.SOCK_DGRAM4, which is needed for establishing a UDP connection.
3. In these two client examples, we assume the server is started and listening on the desired port. Let us check what happens if we run the client script without the server being started.
1. $ python udp_client.py
2. Traceback (most recent call last):
3. File "udp_client.py", line 9, in <module>
4. data = s.recv(1024)
5. ConnectionRefusedError: [Errno 61] Connection refused
Code 12.5
We can see that trying to connect to a port that does not have a service listening on it leads to a fatal exception (line 5). This is something we may use to identify that the port we try to connect to is either closed or the server is not responding properly. Let us check the following example to see how we can drive the connection in a more efficient way.
1. import socket
2.
3. HOST = "localhost"
4. PORT = 62222
5.
6. with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
7.     s.settimeout(10)
8.     s.connect((HOST, PORT))
9.     s.settimeout(None)
10.     s.sendall(b"Hello, world")
11.     data = s.recv(1024)
12.
13. print(f"Received {data}")
Code 12.6
We added a timeout (line 7) and then reset it (line 9). This approach will help us when we try to connect to a service that is listening on the requested port but is somehow faulty – that is why we set the timeout before connecting and reset it afterwards (line 9).
4. Running the above code will still lead to a fatal crash, since we connect to a closed port. Let us try to modify the example so we can catch the connection issue properly.
1. import socket
2.
3. HOST = "localhost"
4. PORT = 62222
5.
6. with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
7.     s.settimeout(10)
8.     status = s.connect_ex((HOST, PORT))
9.     if status == 0:
10.         s.settimeout(None)
11.         s.sendall(b"ping")
12.         data = s.recv(1024)
13.         print(data)
14.     else:
15.         print(f"Connection error code {status}")
Code 12.7
In the following example, we run Code 12.7; we can see that we now properly handle the system error when a new connection is being established to a closed port.
1. $ python tcp_client_1.py
2.
3. Connection error code 61
Code 12.8
We can see in the output of our code that this time there is no exception, since we check the connection status code during the connection (line 8). Besides, you can notice that we replaced the connect method (Code 12.6, line 8) with a more sophisticated approach – we use the connect_ex method (Code 12.7, line 8). This time we get a connection status code instead of an exception being raised.
5. We use this status code (line 9): knowing its value is 0, which means the connection was successful, we try to send a message to the opened socket. In the other case, we print the error code (lines 14-15).
All this works correctly for a TCP connection, as you probably already noticed. For UDP, by its nature, we need to tweak our approach to properly handle the connection issue when a port is closed. Let us check the following example to see how to do it.
1. import socket
2.
3. HOST = "localhost"
4. PORT = 62222
5.
6. with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
7.     s.settimeout(10)
8.     try:
9.         status = s.connect_ex((HOST, PORT))
10.         if status == 0:
11.             s.settimeout(None)
12.             s.sendall(b"ping")
13.             data = s.recv(1024)
14.             print(data)
15.     except ConnectionRefusedError:
16.         status = -1
17.     if status != 0:
18.         print(f"Connection error code {status}")
Code 12.9
Why did we not rely on the connection status code like in the TCP example? It is because of the nature of UDP and how Python handles a closed port. It will not return an error code (line 9) when the port cannot be reached, but it will raise an exception that the connection is refused (the port is closed) when we try to send any data to the closed port (line 12). For this reason, we catch the ConnectionRefusedError exception and set the status code to -1 (line 16), so in the next part of the code we can print the error like in the TCP example (lines 17-18).
Port scanner
In the previous subchapter, we learned the basics of TCP and UDP services. Now let us try to build a more powerful tool that will help us scan all requested ports to see which of them are open or closed. First, let us modify the TCP client script that we built before in the following way.
1. import socket
2. from pprint import pprint
3.
4. HOST = "wikipedia.org"
5. PORTS = [443, 80, 25]
6.
7. connection_results = {}
8.
9. for port in PORTS:
10.     with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
11.         s.settimeout(10)
12.         status = s.connect_ex((HOST, port))
13.         connection_results[port] = True if status == 0 else False
14.
15. pprint(connection_results)
Code 12.10
We defined the host that we are planning to scan (line 4) and the corresponding ports (line 5). Next, we try to establish a connection, as we already did in the examples in the previous subchapter (lines 10-12). When the connection is properly established, we mark it in the results dictionary (line 13). In the end, we print the result, and we should see something like the example below.
1. $ time python tcp_port_scanner_linear.py
2.
3. {25: False, 80: True, 443: True}
4.
5. python tcp_port_scanner_linear.py 0.04s user 0.02s system 33% cpu 0.176 total
Code 12.11
We can see that the Wikipedia website has ports 80 and 443 open, whereas port 25 (SMTP) is closed.
We learned how to build a simple port scanner, but there is one challenge with this scanner that we need to address. As you can notice, when we run the Python code in example 12.11, we put the time5 command in front of the python command. This calculates the execution time of our script (line 5). The conclusion is that running example 12.10 seems super quick (0.04s), but please notice the main potential traps we may face:
For each port we check, we set up a timeout (Code 12.10, line 11)
When we wait out the maximum 10-second timeout for even one single port, the script's execution time exceeds 10s
When more ports have issues, the total execution time rises further
Even without timeout issues, scanning more ports with this linear for loop is very inefficient
Knowing all those potential issues, let us try to improve our scanner example so it can scan ports in parallel.
1. import asyncio
2. from asyncio_pool import AioPool
3.
4. HOST = "wikipedia.org"
5. PORTS = [443, 80, 25]
6.
7. async def tcp_port_check(port) -> bool:
8.     try:
9.         reader, writer = await asyncio.open_connection(HOST, port)
10.         return [port, True]
11.     except Exception as e:
12.         print(e)
13.         return [port, False]
14.
15. async def main():
16.     async with AioPool(size=5) as pool:
17.         result = await pool.map(tcp_port_check, PORTS)
18.         print(dict(result))
19.
20. if __name__ == '__main__':
21.     asyncio.run(main())
Code 12.12
We scan the same ports as in example 12.10, although in Code 12.12 we used the asyncio library to run things in parallel and optimize the port-scanning performance. We again had to use the exception approach to catch the case when a port is closed or has issues responding. Now, let us check the following example to see how efficient our code is.
1. $ time python tcp_port_scanner.py
2.
3. [Errno 61] Connect call failed ('185.15.59.224', 25)
4. {443: True, 80: True, 25: False}
5. python tcp_port_scanner.py 0.06s user 0.03s system 51% cpu 0.180 total
Code 12.13
Advanced scanner
So far, we have built scripts for scanning requested ports on a single host – in our case, Wikipedia. In this subchapter, we will use and improve the knowledge we have already gathered to build something more flexible that is able to scan many sites at once.
Let us start by building a simple service that reads a configuration file and uses the previously created script to scan multiple sites. We will look at the following example to see how we can do this.
1. import asyncio
2. from asyncio_pool import AioPool
3. from pprint import pprint
4.
5. SITES = ["wikipedia.org", "google.com"]
6. PORTS = [443, 80, 25]
7.
8.
9. class MyScanner:
10.     def __init__(self):
11.         self.scanner_results = {}
12.
13.     async def tcp_port_check(self, host, port):
14.         try:
15.             reader, writer = await asyncio.open_connection(host, port)
16.             return [host, port, True]
17.         except Exception as e:
18.             print(e)
19.             return [host, port, False]
20.
21.     async def start_scanning(self):
22.         calls = []
23.         results = {}
24.         async with AioPool(size=5) as pool:
25.             for site in SITES:
26.                 for port in PORTS:
27.                     calls.append(await pool.spawn(self.tcp_port_check(site, port)))
28.
29.         for r in calls:
30.             result = r.result()
31.             if result[0] not in self.scanner_results:
32.                 self.scanner_results[result[0]] = {}
33.             self.scanner_results[result[0]][result[1]] = result[2]
34.
35.     async def run(self):
36.         await self.start_scanning()
37.         pprint(self.scanner_results)
38.
39. if __name__ == "__main__":
40.     scanner = MyScanner()
41.     asyncio.run(scanner.run())
Code 12.14
1. We created a simple scanner application that checks the connection state (lines 14-19). We do not have a method similar to the one in the raw socket module (Code 12.10, line 12), so we had to improvise with a try-except block when we try to establish a connection to a port.
We used the definition of sites and ports (lines 5-6) that we are planning to scan to verify whether the ports are open. We iterate over those sites and ports (lines 24-27) and keep calling the method that checks whether a connection to the port can be established successfully. We have the asyncio-pool module6 in use to help us make sure we are not hammering the destination host too much. We should not be too aggressive if we want to scan ports in parallel on the same host: we could accidentally be detected as a threat and start getting wrong results – ports might start being closed for us.
In lines 29-33, we transform the list of results into something that looks like the following example.
1. $ python scanner.py
2. [Errno 61] Connect call failed ('185.15.59.224', 25)
3. [Errno 61] Connect call failed ('142.250.186.206', 25)
4. {'google.com': {25: False, 80: True, 443: True},
5. 'wikipedia.org': {25: False, 80: True, 443: True}}
Code 12.15
2. So far, we have been building a service that helps us scan ports on a remote server to check whether they are open. You may wonder why we need to scan ports to be sure that a service is operational. Well, if we check that, for instance, port 443 (HTTPS) is open for the wikipedia.org website, that means the website can be opened. This part of checking site availability is crucial for sure, but the other aspect is to check whether the server itself (without checking open ports) is responding – this is called ping7.
Let us install the following package and check how we can build a simple script that sends an ICMP packet (ping) to the destination server to verify that it is alive.
1. $ pip install ping3
Code 12.16
3. Once we have the module installed, we can check the following example, which pings a given hostname to check how quickly it responds.
1. import click
2. from ping3 import ping
3.
4. def ping_host(host):
5.     result = ping(host)
6.     click.echo(f"Response time {result}s")
7.
8.
9. @click.command()
10. @click.option("--host", help="Host to ping")
11. def main(host):
12.     ping_host(host)
13.
14. if __name__ == '__main__':
15.     main()
Code 12.17
The result of running the script is shown below.
1. $ python ping_test.py --host wikipedia.org
2. Response time 0.10447406768798828s
Code 12.18
We can see that the ping module helps us determine the response time from wikipedia.org. In the following part of this subchapter, we will incorporate this simple method into our scanner application. Before we do, let us focus on adding another probe to our scanner.
4. So far, we have been checking on a synthetic level whether the destination host is available. This time we need to check not only whether the port is open and the response time, but also the content of the response itself. This way we will be sure that the service is working properly. Let us check the following example to see how we could validate whether the response received from wikipedia.org is correct.
1. import asyncio
2. import click
3. import httpx
4.
5.
6. async def check_status(url):
7.     async with httpx.AsyncClient() as client:
8.         response = await client.get(url, follow_redirects=True)
9.         status = response.status_code == 200 and len(response.text) >= 50
10.        print(f"Site status: {status}")
11.
12.
13. @click.command()
14. @click.option("--url", help="URL to scan", required=True)
15. def main(url):
16.     asyncio.run(check_status(url))
17.
18.
19. if __name__ == "__main__":
20.     main()
Code 12.19
We check whether the response status code is 2008, which leads us to the conclusion (line 9) that the site is operational, since it managed to respond properly. Additionally, we check that the response body is not empty (line 9) and has at least 50 characters.
The execution of the script is shown in the following example.
1. $ python check_site_status.py --url https://wikipedia.org
2.
3. Site status: True
Code 12.20
5. Having built the basic probes, we shall modify the main scanning script to support configuration files instead of a hardcoded list of sites and ports. Let us check the example below to see how to use a configuration file in the YAML9 format. First, we need to install a Python module.
1. $ pip install PyYAML
Code 12.21
After installing the YAML module, let us check how we are going to prepare the configuration before we digest it in our refactored code.
1. sites:
2.   wikipedia:
3.     url: https://wikipedia.org
4.     ports:
5.       - 443
6.       - 80
7.       - 465
8.   gmail:
9.     url: https://gmail.com
10.     ports:
11.       - 443
12.       - 465
13.       - 587
14.   vimeo:
15.     url: https://vimeo.com
16.     ports:
17.       - 443
18.       - 80
So, we have the configuration ready to scan a few websites and their public ports. Now, we have to refactor the scanner script to use the configuration file and drive the scanning more flexibly.
1. import aioping
2. import asyncio
3. import coloredlogs
4. import click
5. import httpx
6. import logging
7. import yaml
8. from asyncio_pool import AioPool
9. from pprint import pformat
10. from urllib.parse import urlparse
11.
12.
13. class MyScanner:
14.     def __init__(self, config_fpath):
15.         with open(config_fpath, "rb") as f:
16.             self._config = yaml.load(f.read(), Loader=yaml.Loader)
17.         self.scanner_results = {}
18.
19.     def hostname(self, url):
20.         parsed_uri = urlparse(url)
21.         return parsed_uri.netloc
22.
23.     async def tcp_port_check(self, host, port):
24.         try:
25.             fqdn = self.hostname(host)
26.             logging.debug(f"host: {fqdn}, port: {port}")
27.             func = asyncio.open_connection(fqdn, port)
28.             reader, writer = await asyncio.wait_for(func, timeout=3)
29.             return [fqdn, port, True]
30.         except Exception as e:
31.             logging.error(e)
32.             return [fqdn, port, False]
33.
34.     async def start_scanning(self):
35.         calls = []
36.         results = {}
37.         async with AioPool(size=5) as pool:
38.             for item, items in self._config.get('sites', {}).items():
39.
40.                 for port in items['ports']:
41.                     calls.append(await pool.spawn(self.tcp_port_check(items['url'], port)))
42.                 calls.append(await pool.spawn(self.ping_host(items['url'])))
43.                 calls.append(await pool.spawn(self.check_status(items['url'])))
44.
45.         for r in calls:
46.             result = r.result()
47.             if result[0] not in self.scanner_results:
48.                 self.scanner_results[result[0]] = {}
49.             self.scanner_results[result[0]][result[1]] = result[2]
50.
51.     async def run(self):
52.         await self.start_scanning()
53.         logging.debug(f"result: {pformat(self.scanner_results)}")
54.
55.     async def ping_host(self, host) -> float:
56.         fqdn = self.hostname(host)
57.         delay = await aioping.ping(fqdn) * 1000
58.         logging.debug(f"Response time {delay} ms")
59.         return [fqdn, 'ping', delay]
60.
61.     async def check_status(self, url) -> bool:
62.         fqdn = self.hostname(url)
63.         async with httpx.AsyncClient() as client:
64.             response = await client.get(url, follow_redirects=True)
65.             status = response.status_code == 200 and len(response.text) >= 50
66.         logging.debug(f"Site status: {status}")
67.         return [fqdn, 'status', status]
68.
69.
70. @click.command()
71. @click.option("--config", help="Config file path", required=True)
72. def main(config):
73.     coloredlogs.install(level=logging.DEBUG)
74.     scanner = MyScanner(config)
75.     asyncio.run(scanner.run())
76.
77.
78. if __name__ == "__main__":
79.     main()
Code 12.23
In our refactored example, we added support for the configuration file (lines 14-16) and updated the port scanning to read this configuration (lines 37-41). We also added a refactored method for pinging an external host (lines 55-59). To support it, we have to install the async ping module for Python.
1. $ pip install aioping
Code 12.24
When the module is ready, we can see that we use it together with the async pool (line 42). When the result is ready (the value is in milliseconds), we update the dictionary of results for the given FQDN10. The other method that we use in the coroutine pool is check_status (line 43), which, as we wrote in one of the previous examples (Code 12.19), checks the response quality from the server.
6. Let us check in the following example the result of running our scanning application with the given YAML configuration (Code 12.22).
1. $ python scanner2.py --config config.yaml
2.
3. result: {'gmail.com': {443: True,
4.                        465: False,
5.                        587: False,
6.                        'ping': 10.923624999122694,
7.                        'status': True},
8.          'vimeo.com': {80: True, 443: True, 'ping': 13.128791993949562, 'status': True},
9.          'wikipedia.org': {80: True,
10.                           443: True,
11.                           465: False,
12.                           'ping': 33.38837499904912,
13.                           'status': True}}
We can see that the main keys of the returned results dictionary are the FQDNs of the tested websites, where the values are the test results of port scanning, site availability, and ping time.
Reporting
So far, we have been building a command-line tool that can scan remote websites and check their availability. Now, it is time to turn this scanning tool into something more visible. Let us start with the following example, where we need to install some Python packages to be able to write our small web application.
1. $ pip install pandas matplotlib flask
Code 12.26
Once we have the packages installed, we can rewrite a few parts of our scanning tool from Code 12.23 to prepare the data for further processing. Let us check the following code to see how we can update it.
1. import csv
2. from datetime import datetime
3.
4. class MyScanner:
5.
6.     async def start_scanning(self):
7.         calls = []
8.         results = {}
9.         async with AioPool(size=5) as pool:
10.             for item, items in self._config.get("sites", {}).items():
11.
12.                 for port in items["ports"]:
13.                     calls.append(await pool.spawn(self.tcp_port_check(items["url"], port)))
14.                 calls.append(await pool.spawn(self.ping_host(items["url"])))
15.                 calls.append(await pool.spawn(self.check_status(items["url"])))
16.
17.         for r in calls:
18.             result = r.result()
19.             if result[0] not in self.scanner_results:
20.                 self.scanner_results[result[0]] = {}
21.             self.scanner_results[result[0]][result[1]] = result[2]
22.         for site, results in self.scanner_results.items():
23.             self.dump_to_csv(site, results)
24.
25.     def dump_to_csv(self, item, data):
26.         fname = f"/var/tmp/{item}.csv"
27.         headers = sorted([k for k in data.keys()])
28.         headers.insert(0, "date")
29.         data["date"] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
30.         if not os.path.exists(fname):
31.             with open(fname, "a") as f:
32.                 csv_out = csv.writer(f)
33.                 csv_out.writerow(headers)
34.         with open(fname, "a") as f:
35.             csv_out = csv.writer(f)
36.             csv_out.writerow([data.get(k) for k in headers])
Code 12.27
We had to create a new method for storing the scanning results in a CSV file (lines 25-36). The data is saved to a file with a header (lines 30-33), and the following CSV lines are appended to the existing output file (lines 34-36). It is worth noticing that we add a date timestamp as the first column of each CSV row.
The next change is in the whole scanning method: once we finish scanning, we dump the results for every site that we scanned (lines 22-23).
The next step is to create a folder called web with a subfolder called templates. Inside the web folder, we need to create the main application file called web_server.py with the following content.
1. import yaml
2. from flask import Flask
3.
4. app = Flask(__name__)
5.
6. CONFIG_FILE_PATH = "../config.yaml"
7. with open(CONFIG_FILE_PATH, "rb") as f:
8.     CONFIG = yaml.load(f.read(), Loader=yaml.Loader)
9.
10. @app.route("/")
11. def list_of_result():
12.     return "hello"
Code 12.28
We created a basic Flask11 application; to start it, we run flask --app web_server run, just as we did with Code 11.21. What is noticeable is that, as a global variable, we read the config.yaml file, which stores the configuration of all the services that we scan.
Let us try to read data from the configuration and display which websites we scan. To do it, we have to modify the main controller in our web application as in the following code.
1. from flask import render_template
2.
3. @app.route("/")
4. def list_of_result():
5. return render_template("index.html", config=CONFIG, title="Main page")
Code 12.29
We are using a template to display the main page – let us build the template
index.html under the templates directory that we created before. In the
following example, we can see the body of the main template.
1. {% extends 'main.html' %}
2.
3. {% block content %}
4. <div>Monitoring results</div>
5. {% for site in config.sites %}
6. <div><a href="{{ url_for('scanning_results', site=site) }}">{{ site }}</a></div>
7. {% endfor %}
8.
9. {% endblock %}
Code 12.30
The first thing we did (line 1) is inherit from the main template file. Next, we
loop over the list of sites that we are monitoring and create a link to a
subpage (line 6), which leads to more details.
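The base template main.html is not shown above. A minimal sketch that would satisfy the {% extends %} and {% block content %} usage in Code 12.30 – the layout and the use of the title variable passed in Code 12.29 are our assumptions, not the book's exact file – could look like this:
1. <!DOCTYPE html>
2. <html>
3. <head><title>{{ title }}</title></head>
4. <body>
5. {% block content %}{% endblock %}
6. </body>
7. </html>
Note also that url_for('scanning_results', site=site) in line 6 of Code 12.30 requires a Flask view named scanning_results to exist; a hypothetical placeholder like the following keeps the template from raising a BuildError until the real detail page is built:
1. @app.route("/site/<site>")
2. def scanning_results(site):
3.     return f"Results for {site}"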
Let us check in the following figure how the main page is going to look:
Conclusion
In this chapter, we learned how to use Python for monitoring external
resources in an efficient, easy, yet very powerful way. Using
asynchronous connections with a pool of connections is very efficient and
helped us learn how not to accidentally send too many requests to a
monitored website.
In the next chapter, we are going to learn how to use Python to analyze
websites and look for requested products in webstores. Once a desired item
is available, or its price is very attractive, we will learn how to use Python to go
shopping.
1. https://en.wikipedia.org/wiki/Transmission_Control_Protocol
2. https://en.wikipedia.org/wiki/User_Datagram_Protocol
3. https://docs.python.org/3/library/socketserver.html#module-socketserver
4. https://docs.python.org/3/library/socket.html#socket.SOCK_DGRAM
5. https://en.wikipedia.org/wiki/Time_%28Unix%29
6. https://pypi.org/project/asyncio-pool/
7. https://en.wikipedia.org/wiki/Ping_(networking_utility)
8. https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#successful_responses
9. https://en.wikipedia.org/wiki/YAML
10. https://en.wikipedia.org/wiki/Fully_qualified_domain_name
11. https://flask.palletsprojects.com/en/3.0.x/
12. https://jinja.palletsprojects.com/en/2.10.x/templates/#template-inheritance
13. https://vimeo.com
14. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
Introduction
When we try to buy any kind of product on the internet, what can be a bit
challenging is that we would like to get the product from a well-known
provider and an established webstore, although price can be the key factor here.
Sometimes a product is not available at the moment, so
checking many websites while hunting for its availability can be really
cumbersome. We have Python; let us use it as our superpower for this job.
Structure
In this chapter, we will be covering the following topics:
Connecting to the eBay bidding service, placing a bid, and hunting for the best
price
Writing plugins for popular webstores to find the best product price and buy
Tracking prices and generating an alarm when the best price is available
Objectives
Based on the knowledge gathered from previous chapters, in this chapter you
will learn how to build your personal bot that monitors webstores
for certain products you might be interested in buying – watching when a specific
item's price goes up or down and when it is available for purchase. We will
also learn the basics of how to improve this tool so it can auto-purchase
products on its own.
eBay client
This platform is very well known for being on the market for many years,
and it offers an API1 for developers2. Before we can start connecting
to the API, we need to have an account on this platform. This is not the account that
we normally use for bidding and purchasing goods: we need to register in the
developer program3 and wait for our request to be accepted. Once we have
access, the next step is to register our application – we can use a test name or any
name that suits you. Once the application name is given, the system is going to
create a few access keys for us, as shown in the following example.
Figure 13.1: Example sandbox access credentials
1. When we have the keys generated, we need to create an eBay sandbox
account – just follow the links provided on the API keys page. We are going
to use the sandbox since there is no risk of accidentally purchasing any kind
of unwanted product.
When we are ready with the sandbox API credentials and the eBay test account, we
will create the API access as the very last step.
We are introducing a sorting function sort6 that also takes care of
the case when the price of a found item is not present (line 12) by setting it to a default of 0.
This way, the sorting function will treat such an item with the lowest priority.
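The key function itself is not repeated here; as a sketch of the idea – assuming the result items are dictionaries with an optional price field, as in Code 13.24 – it could look like this:
1. def price_key(item):
2.     # A missing or empty price falls back to 0, so the item sorts
3.     # with the lowest priority when ordering from the most expensive.
4.     return float(item.get('price', 0) or 0)
Such a function can then be passed as key=price_key to list.sort or sorted.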
Once all is sorted, we can modify the main function to accept some new
parameters that help us filter out the cheapest or the most expensive
items found.
1. import click
2. from pprint import pprint
3. from clients.main import Main
4.
5.
6. @click.command()
7. @click.option("--order", help="Sorting order", type=click.Choice(['asc', 'desc'], case_sensitive=False))
8. @click.option("--limit", type=int, help="Limit number of results", required=False)
9. @click.option("--phrase", type=str, help="Item name to look for", required=True)
10. def main(order, limit, phrase):
11. m = Main()
12. result = m.collect_results(phrase)
13. if result:
14. if order == 'desc':
15. result.sort(key=lambda x: float(x.get('price', 0)), reverse=True)
16. if limit:
17. result = result[:limit]
18. pprint(result)
19.
20. if __name__ == "__main__":
21. main()
Code 13.24
The refactored code in example 13.24 has a new way of presenting results (line
18). Another thing to notice is the addition of new parameters to the main function –
limit and order (lines 7-8), where sorting is limited to the listed options7 that can
be used. We use those parameters later for sorting (lines 14-15): when the order
is requested as descending, we use the sort method on the result list (line 15)
and sort by price with the reverse argument.
In line 17 we take a chunk of the results array, so in the end we have the
fetched data limited to the requested number of items.
Let us check in the following code how we can use this new approach to get the most
expensive item found in the parsing results.
1. $ python main_app.py --phrase "iphone 14" --limit 1 --order desc
Code 13.25
This example is going to call our main function (Code 13.24) and return a
single result, sorted by the highest price of the found items.
Figure 13.4: Example alert regarding a price drop for the following product
Historical values
When we have all the checkers ready and recording current values for the
tracked products, the next step is to show the values for those products, so we
can know on demand what the current values are if we are willing to buy a
product manually.
Let us check the following code to see how to achieve that:
1. import click
2.
3. class PriceChecker:
4. def show_prices(self):
5. for provider, urls in self._providers.items():
6. print(f"Checking for prices: {provider}")
7. for product_url in urls:
8. c = Cron(product_url)
9. if c.load_price():
10. print(f"Current price: {c.load_price()}")
11. else:
12. print("Price data not found.")
13.
14. @click.command()
15. @click.option("--price", help="Show prices", required=False, default=False, is_flag=True)
16. @click.option("--watch", help="Watch prices", required=False, default=False, is_flag=True)
17. def main(price, watch):
18. if watch:
19. pycron.start()
20. elif price:
21. p = PriceChecker()
22. p.show_prices()
23.
24.
25. if __name__ == "__main__":
26. main()
Code 13.46
We introduced a new entry point in the main function, where we support two
options: the watch parameter (lines 18-19), which runs the code in the
mode where it checks and analyzes the cheapest price available, and the price
parameter (lines 20-22), which helps us check the current prices after
running the code with the --watch parameter. In the PriceChecker class (lines 3-12)
we introduced code that loads and prints the current prices available for the
products that we managed to check.
Let us run the following code and see an example output.
1. $ python watcher.py --watch
2.
3. Exception in thread Thread-1:
4. Traceback (most recent call last):
5. (..)
6. asyncio.run(self.check_prices())
7. raise RuntimeError(
8. RuntimeError: asyncio.run() cannot be called
from a running event loop
9. ^C
10. Aborted!
11. sys:1: RuntimeWarning: coroutine 'PriceChecker.check_prices'
was never awaited
12. RuntimeWarning: Enable tracemalloc to get
the object allocation traceback
Code 13.47
We can see that so far we have been building our watcher program around
the cron job concept, but when we run it, it crashes with the error shown in Code
13.47. You may wonder what it means and why it happened. The
reason is quite trivial: we run cron jobs in a coroutine event loop, which
has been mentioned before. That has a limitation –
we cannot start another event loop inside an already running loop (Code
13.40, lines 17-18). To fix that limitation, we need to update our Code 13.40
with the following code.
1. import asyncio
2. from concurrent.futures import ThreadPoolExecutor
3. class PriceChecker:
4. def start_processing(self):
5. try:
6. asyncio.get_running_loop()
7. # Create a separate thread
8. with ThreadPoolExecutor(1) as pool:
9. result = pool.submit(lambda: asyncio.run(self.check_prices())).result()
10. except RuntimeError:
11. result = asyncio.run(self.check_prices())
Code 13.48
Basically, the method start_processing, which is the main entry point for the price
checker, tries to get the current event loop (line 6), and when that does not raise an
exception, we know that we are already in the middle of a coroutine loop. Thus,
we have to start a new event loop in a newly started thread (lines 8-9). This
solution guarantees that there is no clash of two loops running in the
same thread.
Also worth noticing are lines 10-11: when there is no loop
running, we can start a new one in the current thread. After executing the code, we
wait until all the prices are collected. Next, we can execute the following
code and check the collected results.
1. $ python watcher_6.py --price
2.
3. Checking for prices: <class 'clients.amazon.ClientAmazon'>
4. Current price: 349.99
5. Current price: 389.99
6. Checking for prices: <class 'clients.ebay.ClientEbay'>
7. Current price: 289.99
Code 13.49
We print out the saved values as shown in Code 13.49. It is easy to notice from
Code 13.34 (lines 3-12) that we iterate by reading the config file (Code
13.37, line 3) line by line. Thus, for instance, if we want to check the
cheapest price shown in Code 13.49 (line 7), we can open the config file and
check the first URL specified in the eBay section. Open it in a browser and check the details –
what is happening there and why the price is the lowest.
Conclusion
In this chapter, we learned how we can use Python to build a quite advanced
tool that can help us hunt for the best price available on multiple e-
commerce platforms. We learned how to do it in a modular way, so we can add
support for more websites. At the same time, we dove into the topic of auto-
purchase. We also analyzed how to store sensitive data in a secure way
and how to utilize this data whenever it is needed.
In the following chapter, we are going to learn how to use Python to build
mobile applications.
1. https://en.wikipedia.org/wiki/API
2. https://developer.ebay.com/develop/apis/restful-apis/buy-apis
3. https://developer.ebay.com
4. https://developer.ebay.com/Devzone/XML/docs/ReleaseNotes.html
5. https://beautiful-soup-4.readthedocs.io/en/latest/
6. https://docs.python.org/3/howto/sorting.html
7. https://click.palletsprojects.com/en/8.1.x/options/#choice-options
8. https://opensource.com/article/17/11/how-use-cron-linux
9. https://pypi.org/project/python-cron/
10. https://docs.python.org/3/library/asyncio.html
11. https://crontab.guru
12. https://yaml.org
13. https://www.python-httpx.org
14. https://datatracker.ietf.org/doc/html/rfc9113
15. https://cryptography.io/en/latest/fernet/
CHAPTER 14
Python Goes Mobile
Introduction
In this chapter, we will show how you can use Python on mobile devices
(smartphones) and how to run your own Python programs on these rather
special platforms. You will learn how to write small and efficient code and
use it on mobile systems. The goal of this chapter is to teach you how to
write mobile applications using Python.
Structure
This chapter will cover the following topics:
Brief introduction to mobile applications – their concept and limitations
Overview of Python libraries for mobile devices
A calculator in Python for iOS and Android
Objectives
In this chapter, we will build a simple yet powerful application that
demonstrates how to deploy a fully Python-driven mobile app. We will learn
how to take such an application from concept to an actual running
MVP1. We are going to dive a bit into the topic of mobile operating systems to
learn how to run a Python application on top of them.
Basics
In the mobile world, things work differently when we compare how
applications run in system space. Without going into too many details, the key
points to highlight are:
Applications are limited by programming language – in other words, as a
developer you cannot run an application written in any language you
want; you are bound by the operating system, which means:
iOS limits you to Objective-C and Swift.
Android is a realm of Java.
Applications run in a sandbox and do not have access to some system
resources.
The GUI is driven by the OS, and writing custom components is quite
challenging.
Writing applications in languages other than those listed above is possible, but
you will have to translate them to the native OS language.
Apple's mobile system, iOS2, has been modified in many ways since it was
introduced, although the basic core concepts stayed the same. Every
application is written and compiled using the XCode3 ecosystem, which
means the only way to get an application running under iOS is through Swift4 or
Objective-C5 and the provided UI toolkit. This makes applications
consistent (in their look and feel) but gives us a limitation as well. If we
want to deliver an application to iOS in pure Python, we will have to convert
it to a language and libraries that iOS can understand. In the following
example we can see Swift example code, and we can notice how
different it is from Python.
1. struct Player {
2. var name: String
3. var highScore: Int = 0
4. var history: [Int] = []
5.
6. init(_ name: String) {
7. self.name = name
8. }
9. }
10.
11. var player = Player("Tomas")
Code 14.1
So, let us check how it looks in the Android world. Here we have a very
similar situation: even if the core system itself is more flexible and more
open (because it is used by many mobile device manufacturers), it is still, in its
essence, driven by Java. That means that if we want to deliver a mobile
application, we need to build it using Android Studio and, in essence,
deliver the core code in Java6.
We can see that we have a common path in these two major mobile operating
systems – we need to deliver our Python application translated to native code
that can be run by the mobile operating system. Let us see in the following
code how an example in Java differs from Python.
1. public class Main {
2. public static void main(String[] args) {
3. System.out.println("Hello World");
4. }
5. }
Code 14.2
In the following subchapters we will learn how we can use pure Python code.
Python GUI
So far, we explored options for mobile OSes to understand the key points of how
they work and what kind of limitations we can expect as developers when
writing a Python application for a mobile device.
Let us learn in this subchapter how we can write Python GUI
applications in the first place. When exploring mainstream libraries for
Python, we can highlight these:
TK (Tkinter)7: Very basic yet powerful, and one of the oldest libraries
for building Python applications with a user interface.
wxWidgets8: Mature, with lots of powerful widgets.
QT9: A commercial GUI library with a free license available; like the
previously mentioned ones, mature and powerful.
Kivy10: This seems to be the youngest player in the business of graphical
interfaces, but it has a lot of great features – one of them is
portability, and this is what we will choose to demonstrate how
to build a mobile application.
Toga11: A quite simple yet powerful library for building graphical
interfaces.
GUI
In this section, we will briefly check a few Python GUI libraries to see
where they shine and how much they differ when we want to build a
desktop application.
Toga
As we said, this is a quite simple yet powerful library that is multiplatform-
ready and helps to build quite sophisticated graphical interfaces. Let us start
by installing the library itself.
1. $ pip install toga
Code 14.3
Once we have it installed, let us build our very first hello world application.
1. import toga
2.
3. class MyApp(toga.App):
4. def startup(self):
5. self.main_window = toga.MainWindow()
6. self.main_window.content = toga.Box(children=[toga.Label("Hello!")])
7. self.main_window.show()
8. if __name__ == '__main__':
9.
10. app = MyApp("Realistic App", "org.python.code")
11. app.main_loop()
Code 14.4
As we can notice in example Code 14.4, we create the class MyApp (line 3),
which inherits from toga.App – the base class of the Toga
framework, where all the necessary initializations take place that
lead to generating and displaying the application window.
Kivy
Let us create our very first GUI example. This subchapter covers
building a desktop application. We will learn how to build a hello world application
first and then design a calculator based on the learned foundations.
First, we have to install the Python modules that will help us build such an
application.
1. $ pip install Kivy Kivy-examples Kivy-Garden
Code 14.5
Once we have the modules installed, we shall create the first template of the
application. Let us create a simple hello world example, as shown below.
1. Label:
2.     id: entry
3.     font_size: 24
4.     text: "hello world"
5.     size: 150, 44
Code 14.6
In the UI configuration file (Code 14.6), which we named kivy_example.kv,
we declared that we are going to use the element Label12, which additionally
declares other parameters, like the element id (line 2) and the actual text that
we are about to display.
Now we need to consume this UI element definition in our application. To
be able to do so, we shall import a few elements from the Kivy library and load the
application UI. Let us check the following code to see how we can achieve
this.
1. import kivy
2.
3. from kivy.app import App
4. from kivy.lang import Builder
5. kivy.require('1.9.0')
6.
7. from kivy.core.window import Window
8. from kivy.uix.gridlayout import GridLayout
9. from kivy.config import Config
10.
11. Config.set('graphics', 'resizable', 0)
12.
13. class kivExampleApp(App):
14.
15. def build(self):
16. return Builder.load_file("kivy_example.kv")
17.
18. def main():
19. calcApp = kivExampleApp()
20. return calcApp.run()
21.
22. main()
Code 14.7
We defined configuration for the application (line 11), where we
specified that our application cannot be resized – the user cannot change the size of the
main window. Next, we declared the custom class kivExampleApp, which
inherits from the Kivy App class (line 13). The reason we inherit is
that we want to load our user interface (UI) from the definition file (Code
14.6).
The method where we load the UI definition file is build (Code 14.7, line 15), where
we specify all the elements of the window. We load all the
elements with their corresponding configuration from a file instead of putting
them (code-driven) in the window – which would not be easy to read. As
mentioned, we load the window elements from the configuration file (line 16).
Next, we created the main function (lines 18-20), where we initialize our
custom class and, with it, run the core of our desktop app. Let us check the
following figure to see how the application is going to look in a desktop
environment.
Compiler
That being said, we need to look for options for how we can deliver Python
code to mobile operating systems. We could explore options like
IronPython13, which helps to write Python applications in the .NET14
environment and then compile them to native code for a selected mobile device –
Android or iOS. This technique is a bit complex and reaches far beyond our
interest in this chapter, so we will try to find another way.
After researching, we can agree that the briefcase15 library addresses
our need perfectly – we can compile and pack our Python application with it. First, we
must install the library and its dependencies, as in the following code.
1. $ pip install briefcase
Code 14.8
Once we have installed the main library and all the dependencies, we can create
our blank hello world application. Let us check in the following code how to do
this.
1. $ briefcase new
Code 14.9
After running this command, the system is going to ask us a few questions to be
able to create the template hello world application.
1. Formal Name [Hello World]: <enter>
2.
3. App Name [helloworld]: <enter>
4.
5. Bundle Identifier [com.example]: <enter>
6.
7. Project Name [Hello World]: <enter>
8.
9. Description [My first application]: <enter>
10.
11. Author [Jane Developer]: <enter>
12.
13. Author's Email [[email protected]]: <enter>
14.
15. Application URL [https://example.com/helloworld]: <enter>
16.
17. What license do you want to use for this project's code?
18.
19. 1) BSD license
20. 2) MIT license
21. 3) Apache Software License
22. 4) GNU General Public License v2 (GPLv2)
23. 5) GNU General Public License v2 or later (GPLv2+)
24. 6) GNU General Public License v3 (GPLv3)
25. 7) GNU General Public License v3 or later (GPLv3+)
26. 8) Proprietary
27. 9) Other
28.
29. Project License [1]: 1
30.
31. What GUI toolkit do you want to use for this project?
32.
33. 1) Toga
34. 2) PySide6 (does not support iOS/Android deployment)
35. 3) PursuedPyBear (does not support iOS/Android deployment)
36. 4) Pygame (does not support iOS/Android deployment)
37. 5) None
38.
39. GUI Framework [1]: 1 <enter>
Code 14.10
As is noticeable, we are creating an application that is going to use Toga for the
user interface – we already built some example UI in Code 14.4.
Once all is set, we can install some components that are essential to build a
mobile application – in this subchapter we will focus on building and
compiling the application for the iOS system.
First, we must install XCode16 from the App Store. Once installed, we have to
open XCode and install the iPhone emulator – when this book was written,
iOS 17.4 was the latest available for emulation.
When the emulator is ready and installed, we can shut down XCode and run
the code below to prepare our hello world example.
1. $ briefcase create iOS
Code 14.11
Once the Python environment for our mini example is prepared, we need to
compile it – that means translating the Python code we prepared (Code 14.9)
into iOS binary machine code. To do so, we have to run the following command.
1. $ briefcase build iOS
Code 14.12
In the following example, you can check how a valid, error-free
compilation output should look.
1. [helloworld] Updating app metadata...
2. Setting main module... done
3.
4. [helloworld] Building Xcode project...
5. Building... done
6.
7. [helloworld] Built build/helloworld/ios/xcode/build/Debug-iphonesimulator/Hello World.app
Code 14.13
When the build is ready, we can finally run it locally in the emulator. To
run the compiled code in the emulator, we have to execute the following command.
1. $ briefcase run iOS
Code 14.14
After running Code 14.14, we can check how the application is going to look
in the emulator. Let us check the following screenshot.
Figure 14.2: Example hello world application run in iOS 17.4 emulator
Calculator
After having successfully compiled and started the application in the iOS emulator, it
is time to prepare our calculator program. As the main step, we need to start
preparing the template of the calculator program. For this, we are going to use the
following command.
1. $ briefcase new
Code 14.14
It is easy to notice that we follow the same syntax as in the previous hello world
code, although in this case, when answering questions as in example Code
14.10, we are going to use a new name – calculator. Once the name is given to our
new application, please remember that in this case we also use the GUI
library Toga (as shown in example 14.10).
When all is set, we need to update our main application code so it can draw the
calculator buttons in the UI. To do so, we shall modify the main application source
file src/calculator/app.py, which looks like the following code.
1. import toga
2. from toga.style import Pack
3. from toga.style.pack import COLUMN, ROW
4.
5.
6. class calculator(toga.App):
7. def startup(self):
8. """Construct and show the Toga application.
9.
10. Usually, you would add your application to a
main content box.
11. We then create a main window
(with a name matching the app), and
12. show the main window.
13. """
14. main_box = toga.Box()
15.
16. self.main_window = toga.MainWindow(title=self.formal_name)
17. self.main_window.content = main_box
18. self.main_window.show()
19.
20.
21. def main():
22. return calculator()
Code 14.15
The original Code 14.15 must be modified in such a way that we can
generate the calculator buttons. We will generate buttons on top of a box inside
another box. Let us check first how we are going to generate the UI for those
requirements – see the sketch after this paragraph.
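The listings that build the full button grid are not reproduced here; as a rough sketch of the approach – the widget arrangement and styling values are our assumptions, reusing the imports from Code 14.15 – each row of buttons is a ROW box, and the rows are stacked inside a COLUMN box:
1. main_box = toga.Box(style=Pack(direction=COLUMN))
2. row = toga.Box(style=Pack(direction=ROW))
3. for label in ("7", "8", "9", "÷"):
4.     row.add(toga.Button(label, style=Pack(flex=1)))  # flex=1 stretches each button
5. main_box.add(row)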
Calculation logic
We introduced in Code 14.20 a new attribute (line 9) that makes the input field read-
only – it will help us prevent the user from editing the input field, which should only
be used for presenting calculator results. Another new thing introduced is the
flex attribute, which forces a graphical element, like a button, to fill all available
space in a row.
In the following code snippet, we will see how we can add support for
each individual button that the user presses.
1. class CalculatorMod:
2. def __init__(self, result_widget):
3. self.storage_1 = []
4. self.storage_2 = []
5. self.operator = None
6. self.result_widget = result_widget
7.
8. def addValue(self, widget):
9. if not self.operator and not self.result_widget.value:
10. self.storage_1.append(int(widget.text))
11. else:
12. self.storage_2.append(int(widget.text))
13.
14. def click_operator(self, widget):
15. if not self.operator:
16. self.operator = widget.text
17.
18. def calculate(self, widget):
19. result = None
20. number_1 = int(''.join([str(x) for x in self.storage_1]))
21. number_2 = int(''.join([str(x) for x in self.storage_2]))
22. if self.operator == "+":
23. result = number_1 + number_2
24. elif self.operator == "-":
25. result = number_1 - number_2
26. elif self.operator == "x":
27. result = number_1 * number_2
28. elif self.operator == "÷":
29. result = number_1 / number_2
30. self.show_result(result)
31.
32. def show_result(self, result):
33. if not result:
34. return
35. self.result_widget.value = result
36. self.storage_1 = [*str(result)]
37. self.storage_2 = []
38. self.operator = None
Code 14.21
We created a class that delivers basic support for the fundamental
arithmetic operators: +, -, x and ÷. The method
calculate (lines 18-30) gets as an argument the button instance that was
pressed in the UI (line 18) by the user. Before checking what kind of action the user
performed, we prepare the numbers for the operation (lines 20-21). Let us focus on this matter: when we start a fresh
application, we initialize two arrays (lines 3-4) and an operator helper (line 5).
Additionally, in the initializer of the class, we pass the instance of the widget that
is going to be our container for displaying the result – we created such an input
field with the read-only attribute (Code 14.20, line 9).
When the user clicks any button with a number (1-9), we use the callback
method addValue (lines 8-12), which checks whether we are building the first number
since the application started (lines 9-10), or whether
there is already a number in the calculator's memory or a
calculation result presented in the read-only input field where we show the
user the results of his operations. Let us check the following code to see what
example button syntax looks like.
1. result_input = toga.TextInput(readonly=True, style=Pack(background_color="#333333", flex=1))
2. storage = CalculatorMod(result_input)
3. button_7 = toga.Button("7", style=Pack(flex=1), on_press=storage.addValue)
Code 14.22
Callbacks
As is easy to notice, we add a callback method (line 3) to the example button 7;
it is just a pointer to a method, not the actual method being called – the
call happens upon the button being pressed. Another thing: since we pass a
pointer to the method, we do not pass any arguments, and yet, as we can see in
Code 14.21 (line 8), an argument is passed to the method call – this
happens as part of the Toga framework. The argument is the button instance that the user
pressed.
We also mentioned that we pass the instance of the result_input read-only
input field, which we do in lines 1-2.
Another important group of buttons are the special-use buttons – the operators (i.e.,
+) and the result button (=). Let us check in the example code how these buttons use
callbacks to process the triggered actions in a coordinated way.
1. button_div = toga.Button("÷", style=Pack(flex=1), on_press=storage.click_operator)
2. button_equal = toga.Button("=", style=Pack(flex=1), on_press=storage.calculate)
Code 14.23
We used the same method for stretching the button sizes (flex=1), and we added the
on_press methods click_operator and calculate accordingly. We can see
from example Code 14.21 that the click operator method checks whether we
already have an operator remembered in our class instance – if not, we
remember the operator the user pressed (i.e., +) and keep collecting the
numbers the user presses until the = button is pressed (Code 14.23, line 2).
When that action takes place, we run the calculating procedure (Code 14.21,
lines 18-30).
To calculate the numbers correctly, we applied the following logic: we add
digits to an array – one by one as they are pressed (Code 14.21,
line 10) – as long as the user does not press an operator button and there is no
previous result already presented in the results input field (line 9). Once the user
clicks an operator button, we do the same, although we accumulate the pressed
digits in the second array (Code 14.21, line 12).
At the moment the user presses the result button (=), we turn the accumulated
digits of array number 1 into a number (line 20), and likewise for array 2 (line
21). Next, depending on the operator that was pressed (i.e.,
+), we perform the desired arithmetic operation on those two numbers (line
23) and redirect the calculated result to the result input field (lines 32-38). In the
end, when the result is presented, we store the freshly calculated result back
into array 1 (line 36), so we can use it for following operations that the user
may wish to perform.
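To trace this flow without any UI, the sketch below simulates the presses 1, 2, +, 3, = against the CalculatorMod class from Code 14.21; the FakeWidget class is our stand-in for Toga's widgets, not part of the framework:
1. class FakeWidget:
2.     def __init__(self, text=""):
3.         self.text = text   # what a pressed Button carries
4.         self.value = ""    # what the result TextInput carries
5.
6. display = FakeWidget()
7. calc = CalculatorMod(display)
8. calc.addValue(FakeWidget("1"))        # digits collect in array 1
9. calc.addValue(FakeWidget("2"))
10. calc.click_operator(FakeWidget("+"))  # operator is remembered
11. calc.addValue(FakeWidget("3"))        # digits now collect in array 2
12. calc.calculate(FakeWidget("="))
13. print(display.value)                  # prints 15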
Android
So far, we have been dealing with the iOS calculator application. We can use the
briefcase library18 to build the Android app as well. To build our
application, we could install Android Studio19 and run the compilation manually
– which we strongly do not recommend. The briefcase framework will do the
heavy lifting for us.
We are about to reuse the calculator application that we have been running
under iOS. To do so, we have to run the following command in the main folder of our
calculator app.
1. $ briefcase package android
Code 14.24
By running this command, we start installing the Java libraries, the Android
emulator, and the compiler. As part of the above process, you will be asked to accept a
software license. When all is ready, we shall see a build output like in the
following example.
1. BUILD SUCCESSFUL in 15s
2. 49 actionable tasks: 49 executed
3. Bundling... done
4.
5. [calculator] Packaged dist/calculator-0.0.1.aab
Code 14.25
Now we are ready to run our application in the emulator. Since briefcase
takes care of installing the Android emulator when running Code 14.24, the only
thing we have to do as developers is run the following command.
1. $ briefcase run Android
Code 14.26
When we run the mentioned command, we are first asked to install the
necessary Java libraries (automatically, as mentioned) and how to
run the calculator application, as shown in the following code.
1. Select device:
2.
3. 1) @INFO | Storing crashdata in: /tmp/android-darkman66/emu-
crash-34.1.20.db, detection is enabled for process: 24567 (emulator)
4. 2) Create a new Android emulator
5.
6. > 2
Code 14.27
Unsurprisingly, we choose the emulator as the platform that is going to
run our calculator application. After preparing the emulator stack, we should see
our application executed in the Android emulator, as shown in the following
figure.
Figure 14.6: Example of calculator application running under Android Emulator
So far, as you could notice, we have been building our application with zero
use of heavy, big IDEs like Android Studio or XCode. We should appreciate
how much work was involved in preparing such a flexible and powerful
framework for building mobile applications with Python.
Alternative UI
We have been building the UI for the calculator using the recommended Toga
framework, although we learned in previous subchapters how to install and use the
Kivy framework. Let us use the examples that we have been working with in the
Kivy subchapter.
Android
To be able to run Code 14.7 on Android, we need to install the
buildozer20 and Cython21 libraries, so we can compile the Python code for the Java
VM and pack it into an Android app.
1. $ pip install buildozer cython
Code 14.28
After installing the library and its dependencies, we can create the build configuration.
1. $ buildozer init
Code 14.29
This command creates the spec file buildozer.spec with all the
necessary settings in it to run a Python application in the Android
world. We can keep the default configuration with its default content (a few
typical settings are sketched below). The most
important part is to move the spec file and our application code example to the
same folder and name the application file main.py – this is required by
buildozer.
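For orientation, the handful of settings in buildozer.spec that matter most for our example look roughly like this – the values below are illustrative, not the exact generated file:
1. [app]
2. title = Kivy Example
3. package.name = kivyexample
4. package.domain = org.example
5. source.dir = .
6. requirements = python3,kivy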
Once all is set, we can start installing the many dependencies and libraries by
running the following code.
1. $ buildozer -v android debug
Code 14.30
We have to be patient, since this process is also going to compile a lot of
binary files, which takes some time. When all is set, we should see the already
known emulator window with our example Code 14.7 running in it.
iOS
With Kivy and iOS, we need to install a different set of tools to be able to
convert Python to an iOS application. Let us check in the following code how to install
the libraries.
1. $ pip install Cython kivy-ios
Code 14.31
When all is set, we can start compiling all the necessary libraries22. Let us run the
following code.
1. $ toolchain build python3 kivy
Code 14.32
This step will take a long, long time, so we need to be patient and wait until
all the binary files are ready. Please do not stop the building process, because you
may end up in a situation where you have to start from the beginning. With the freshly
built files, we have to install the plyer module with the following code.
1. $ toolchain pip install plyer
Code 14.33
When all the necessary tools are built and ready, we can start preparing the
project for the actual iOS development. The first thing we have to do is
make sure that our GUI application file is called main.py (as in
example 14.30) and that it is placed in the same folder where we built the tools by
running Code 14.33.
Before we can start preparing the iOS application, we should install XCode23.
Let us run the following code to create the XCode project.
1. $ toolchain create MyApp .
Code 14.34
This command creates the basic stack for XCode. It creates a
new folder called MyApp-ios, inside which we need to put our
main.py file.
1. $ open MyApp-ios/myapp.xcodeproj
Code 14.35
Opening the project triggers XCode to load it with its entire stack. When we want to run our
application, we can use the play icon in the XCode UI – clicking it runs
our application in the iOS emulator.
We can already see that the complexity of how we pack and build an application
differs a lot between Toga and Kivy. We do not claim that
one is better than the other; rather, they give quite different
levels of control over the application-building process.
The other thing worth noticing when comparing these GUI
frameworks is that the final application size, when built for the
Android or iOS stack, differs a lot; the Kivy
manual online gives some tips on how to make the final application much lighter,
which, for sure, is going to help us deliver a more user-friendly app to the
app store.
Conclusion
In this chapter, we learned fundamental practices for building a mobile
application and designing its user interface. Next, we learned how to use
Python for building mobile applications with different frameworks.
We noticed that they operate differently and require a
different set of skills from the developer to build and pack a
market-ready application.
We did not touch on how to build and deliver the final application to the
corresponding app store for Android and iOS, since these procedures may
change over time and be obsolete by the time you read this book.
In the next chapter, we are going to learn how to use Python to read and
generate barcodes and use them to help to organize vCards.
1. https://en.wikipedia.org/wiki/Minimum_viable_product
2. https://www.apple.com/ios/ios-17/
3. https://developer.apple.com/xcode/
4. https://en.wikipedia.org/wiki/Swift_(programming_language)
5. https://en.wikipedia.org/wiki/Objective-C
6. https://en.wikipedia.org/wiki/Java_%28programming_language%29
7. https://docs.python.org/3/library/tkinter.html#module-tkinter
8. https://wxwidgets.org
9. https://doc.qt.io
10. https://kivy.org
11. https://toga.readthedocs.io/en/stable/
12. https://kivy.org/doc/stable/api-kivy.uix.label.html
13. https://ironpython.net
14. https://dotnet.microsoft.com/en-us/
15. https://briefcase.readthedocs.io/
16. https://developer.apple.com/support/xcode/
17. https://toga.readthedocs.io/en/latest/reference/style/pack.html
18. https://briefcase.readthedocs.io/en/stable/how-to/publishing/android.html#
19. https://developer.android.com/studio
20. https://pypi.org/project/buildozer/
21. https://cython.org
22. https://github.com/kivy/kivy-ios
23. https://developer.apple.com/xcode/
CHAPTER 15
QR Generator and Reader
Introduction
In the modern world, QR and bar codes have managed to become part of our daily
lives. We scan grocery shopping items at the self-service cash register, or we
scan advertisements shown as QR codes. We can say those computer-
generated codes are an international standard in our times.
Figure 15.1: Example of QR code –you may try to scan it with your phone
Structure
In this chapter, we will be covering the following topics:
Introduction to barcode and QR codes
Building simple barcode code generator
Building simple QR code generator
Embedding vCard into QR codes
Adding images into QR codes
Uploading and processing QR codes
Objectives
In this chapter, we will explore the use of QR codes, which are two-
dimensional barcodes that can store various kinds of data, such as text,
URLs, phone numbers, or contact information. QR codes are widely used in
applications such as product identification, payment systems,
marketing campaigns, and access control.
With QR codes, we can encode a large amount of information into a small
space, making them easy to scan and read with a smartphone camera or a
dedicated scanner.
QR codes can also be customized with different shapes, colors, logos, or
images, making them attractive and distinctive for branding purposes.
That being said, QR codes can also be dynamic and
updateable, meaning that the data stored in the QR code can change over
time without changing the appearance of the code itself.
In this chapter, we are going to learn how we can use Python to
generate the mentioned QR codes and how we can read them as well. So, let
us get started.
Barcode generator
Before we learn how to generate QR codes with Python, let us briefly
explain what barcodes are and how they work. Barcodes are optical labels
that contain information about an object, such as a product, a book, or a
ticket. They consist of patterns of lines, dots, or squares that can be scanned
by a device and decoded into readable data. Barcodes can store various
types of data, such as numbers, text, or URLs.
1. The first thing we need to do is install the Python library that is going to
help us generate barcodes.
1. $ pip install "python-barcode[images]"
Code 15.1
2. Once we have the packages installed, we need to understand one important
thing: with barcodes, we have plenty of standards1 that barcode readers
follow to read them properly. In the following example, we are going to
generate a barcode in the EAN13 standard.
1. import random
2. from barcode import EAN13
3. from barcode.writer import SVGWriter
4.
5. with open("/tmp/somefile.svg", "wb") as f:
6. EAN13(str(random.randint(111122221111, 666677779999)),
7. writer=SVGWriter()).write(f)
Code 15.2
We import the module random (line 1) to be able to generate a random
number that is long enough (12 digits) to fulfil the EAN13 standard; the library
computes the final check digit itself. We open a file handle (line 5), and next we
generate the EAN13 barcode (lines 6-7). As a result of our code, we get an SVG
file located at /tmp/somefile.svg that is going to look like the following
example figure.
Barcode reader
Once we have installed OpenCV2, we can create image-reader Python code
that is going to help us decipher the barcode SVG file. Another thing that we
are installing is pyzbar3, which will help with analyzing the image loaded
with OpenCV and converting it to Python data. Worth mentioning is that we
must have the zbar library4 preinstalled. We can install it as in the
following example.
1. # MacOS
2. $ brew install zbar
3. # Linux
4. $ sudo apt-get install libzbar0
Code 15.5
Since we exported the barcode image file as SVG, we have to install the following
module to be able to convert it to PNG before we can process its
content with pyzbar.
1. $ pip install cairosvg
Code 15.6
1. When the Cairo library5 Python module is installed, we can finally get
the code to work. Let us check the following example of how to read the barcode
that we prepared with Code 15.2.
1. import cv2
2. import numpy as np
3. from cairosvg import svg2png
4. from io import BytesIO
5. from PIL import Image
6. from pyzbar.pyzbar import decode, ZBarSymbol
7.
8. OUTPUT_FILE = "/tmp/cv.png"
9.
10.
11. with open("/tmp/somefile.svg", "r") as f:
12. png = svg2png(file_obj=f)
13.
14. pil_img = Image.open(BytesIO(png)).convert("RGBA")
15. pil_img.save("/tmp/tmp_barcode.png")
16.
17. cv_img = cv2.cvtColor(np.array(pil_img), cv2.COLOR_RGBA2BGRA)
18. cv2.imwrite(OUTPUT_FILE, cv_img)
19.
20. img = cv2.imread(OUTPUT_FILE)
21. detectedBarcodes = decode(img, symbols=[ZBarSymbol.EAN13])
22. barcode = detectedBarcodes[0]
23. # result
24. print(barcode)
25. print(f"Scanned code: {barcode.data}")
Code 15.7
In our example Code 15.7, we are using a few libraries to manipulate the
image before we analyze the barcode. First, in lines 11-12, we load the SVG file
we generated in example 15.2. Next, we convert it to PNG format, since
OpenCV expects to load a PNG image instead of an SVG. Once
we have the PNG, we convert it to a pixel format that OpenCV can
process (line 17) and save it to the output file (line 18).
2. When the final file is ready, we read it again (line 20) and decode its
content using the pyzbar decode method (line 21). Since we have a single
barcode in the image, we can use the first element (line 22) of the decoded
barcodes array. In the end, we print the decoded object (line 24) and the final
value that we wanted to read automatically from the barcode (line 25).
So far, we have been reading a very easy example, where the barcode is just a
few black stripes on a white background. In the following modifications,
we are going to read a barcode from a photo of a real product containing a
barcode.
Before we can continue, we shall install one more additional module that
is going to help us manipulate images using a wrapper for the OpenCV
library. Let us check the following code to install imutils6.
1. $ pip install imutils
Code 15.8
3. When the module is installed, let us create a template for our main
script. First, we are going to load the photo with OpenCV and read the image
into a variable, as shown in the following example.
1. import numpy as np
2. import click
3. import imutils
4. import cv2
5.
6.
7. @click.command()
8. @click.option("--image-file", type=str, help="Full path to image", required=True)
9. def main(image_file):
10. img = cv2.imread(image_file)
11.
12.
13. if __name__ == '__main__':
14. main()
Code 15.9
4. We assume that we take a photo of a real product that has a
barcode on it. In our case, we took a picture of a bottle of water, as
shown in the photo below.
Figure 15.5: Example of the source photo with applied morphological operations
9. Once we have the mentioned white box, let us apply the following code, which
will help us crop the image so that only the part of the photo that we
want to use for barcode reading is left.
1. contours = cv2.findContours(new_area.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
2. contours = imutils.grab_contours(contours)
3.
4. contours_min = sorted(contours, key=cv2.contourArea, reverse=True)[0]
5. (X, Y, W, H) = cv2.boundingRect(contours_min)
6. rect = cv2.minAreaRect(contours_min)
7. box = cv2.cv.BoxPoints(rect) if imutils.is_cv2() else cv2.boxPoints(rect)
8. box = np.int0(box)
9.
10. cv2.drawContours(img, [box], -1, (0, 255, 0), 3)
11. cropped_image = img[Y:Y + H, X:X + W].copy()
12. cv2.imshow("final cropped", cropped_image)
13. cv2.waitKey(0)
14.
15. detectedBarcodes = decode(cropped_image, symbols=[ZBarSymbol.EAN13])
16. barcode = detectedBarcodes[0]
17. # final result what we found
18. print(barcode)
19. print(f"Scanned code: {barcode.data}")
Code 15.12
In the rest of our main function, we use the prepared white area
(from example 15.11) in lines 1-2 and find its contours9. Then we take
the largest contour found and use it as the bounding area (line 5).
10. Once we have the bounding area, we draw a green box (shown in the
following figure) around the part of the image we are going to use to read the
barcode value. After having the green box and the bounding area, we crop the
main image (line 11) and show the result (lines 12-13). Finally, in lines
15-19, we read the barcode value from the cropped image.
Figure 15.6: Cropped image containing barcode to decode
QR code generator
So far, we learned how to read barcodes and optimize the barcode reader. In this
subchapter, we are going to learn how we can build something more
complex: Quick Response (QR) codes10.
Let us start with a simple example, but before we can do so, we have to install the
following Python modules.
1. $ pip install pyqrcode pypng
Code 15.13
When the modules are installed, we can create an example where we
generate a QR code containing an example URL that we ask the user to open
after scanning.
1. import pyqrcode
2. url = pyqrcode.create('https://www.python.org')
3. url.png('/tmp/qr.png', scale=6, module_color=[0, 0, 0, 128],
4. background=[0xff, 0xff, 0xcc])
Code 15.14
We can see that we create a QR code object that should point to the
https://www.python.org website after scanning (line 2). Next (lines 3-4),
we save the QR code output to an external file – in this case a PNG file. We
can notice that we specify the image scale as well as the background of the
generated QR code PNG file. Once the file is ready, it is going to look like
the following figure.
Figure 15.8: Example of QR code in higher standard able to store more characters.
We can see that our Code 15.16, after running, generates a QR
code that has more pixels than the one shown in Figure 15.7. The reason is the
fact that we try to store more characters in our QR code, which is easily
detected by the QRCode class (line 3), bumping up the version number.
Since we do not specify an error tolerance level, the Python pyqrcode module
uses its default, the highest tolerance level. That means we can have up to 30%
of the pixels in the code itself missing, blurry, damaged, unreadable, etc.,
and the code will still scan. Let us try to decrease the error
level to the lowest offered level – that is, 7%. Let us check the following example
to see how to achieve this.
1. from pyqrcode import QRCode
2.
3. data = "WIFI:S:public-wifi-free;T:WPA;P:somepassword123;H:false;;"
4. q = QRCode(data, error='L')
5. q.png('/tmp/qr_wifi.png', scale=6)
Code 15.17
Now, the result of running our code is going to look like the
following figure.
Figure 15.9: Example of QR code in lowest errors level
We can clearly see that the density of pixels in the final QR code is much lower
compared to the result of running Code 15.16. Even though the information
stored is the same, we now have an error tolerance of only up to 7%, which basically
means the code must be clean and of very good quality for scanning.
In the next example, we can check how to handle the case where we
want to show a logo in the middle of the QR code that we generate. This
should not have any side effect on the generated QR code – it will still be
readable by smartphones – although it will certainly have an impact on the appearance of
the code itself. Let us try to put the Python logo in the middle of the code that
we generated with Code 15.16. Let us check in the following code how we can
achieve this.
1. import pyqrcode
2. from PIL import Image
3.
4. data = "WIFI:S:public-wifi-free;T:WPA;P:somepassword123;H:false;;"
5. url = pyqrcode.QRCode(data, error='H')
6. url.png('test.png',scale=10)
7. im = Image.open('test.png')
8. im = im.convert("RGBA")
9.
10. logo = Image.open('python-logo.png')
11. box = (145, 145, 235, 235)
12. im.crop(box)
13. region = logo
14. region = region.resize((box[2] - box[0], box[3] - box[1]))
15. x = int(im.size[0]/2 ) - int(region.size[0]/2)
16. y = int(im.size[1]/2) - int(region.size[1]/2)
17. im.paste(region, (x, y))
18. im.show()
Code 15.18
In Code 15.18, we use the same information that we want to
encode into the QR code (lines 1-6). Once the PNG file is saved (line 6), we load
that QR code image file (lines 7-8) back into memory. Next, we load the logo file
(line 10), and we create a box that represents the maximum size
of the logo inside our QR code (line 11).
In line 12, we crop a box of that size out of the QR code image; the paste in
line 17 will cover this region. We resize the logo to the size of the box
(lines 13-14). The last thing we have to do before pasting the logo into the QR
code file is to calculate (lines 15-16) where to paste it, so it ends up
right in the center of the final image (line 17). When all is ready, we show
(line 18) our generated QR code.
In the following image, we can see the final result of running
Code 15.18.
QR code reader
So far, we managed to learn how to generate QR codes. Now we are going
to see how we can read such a code using Python. Let us check the
following example to see how we can read the QR code from
example 15.25.
1. from PIL import Image
2. from pyzbar.pyzbar import decode
3.
4. result = decode(Image.open('/tmp/John_Smith.png'))
5. print('decoding result:')
6. print(result[0].data)
Code 15.26
We use two main modules that we have already been using before, when
we were learning about barcodes and how to process them (lines 1-2).
We load the QR code image from running example Code 15.25, which is
saved under /tmp/John_Smith.png. Once the file is loaded, we decode it (line
4) and print the decoding result (line 6), which of course is vCard data. Let us
check the result of running such code.
1. $ python read_qr_code.py
2.
3. decoding result:
4. b'BEGIN:VCARD\n VERSION:4.0\n FN;CHARSET=UTF-
8:John Smith\n
N;CHARSET=UTF-
8:Smith;John;;;\n GENDER:M\n\rBDAY:19780915\n\r
ADR;CHARSET=UTF-
8;TYPE=HOME:;;seasame street;amazing city;;123456;
best country\n TITLE;CHARSET=UTF-8:upper main boss\n
ROLE;CHARSET=UTF-8:CEO\n ORG;CHARSET=UTF-
8:best company
ever\n REV:2024-07-19T20:43:59.177Z\n END:VCARD'
Code 15.27
Python managed to decipher the QR code even though it had some errors in it –
remember, we included the Python logo in the center of the QR code, which
damages it (but stays below the 30% error rate).
Conclusion
In this chapter, we learned how we can generate and analyze barcodes.
We came to understand how Python can process images and extract
barcodes from them. Next, we dove deeper into the topic of QR
codes and built beautiful QR codes with a logo in them, which can be used as a
very attractive way of sharing contact data.
In the next chapter, we are going to build an app to keep track of digital
currencies, which can be a very useful skill, especially in times when crypto-
currencies are so popular.
1. https://en.wikipedia.org/wiki/Barcode#Barcode_verifier_standards
2. https://pypi.org/project/opencv-python/ and https://pyimagesearch.com/2014/11/24/detecting-barcodes-images-python-opencv/
3. https://pypi.org/project/pyzbar/
4. https://github.com/Polyconseil/zbarlight/
5. https://cairographics.org
6. https://github.com/PyImageSearch/imutils
7. https://docs.opencv.org/4.x/d5/d0f/tutorial_py_gradients.html
8. https://docs.opencv.org/4.x/d9/d61/tutorial_py_morphological_ops.html
9. https://docs.opencv.org/3.4/d4/d73/tutorial_py_contours_begin.html
10. https://en.wikipedia.org/wiki/QR_code
11. https://en.wikipedia.org/wiki/MeCard_(QR_code)
12. https://en.wikipedia.org/wiki/VCard
13. https://yaml.org
CHAPTER 16
App to Keep Track of Digital
Currencies
Introduction
Before we dive into the technical details of how to create your own crypto
trading platform, let's briefly review what cryptocurrencies are and why they
are so popular among traders and investors. Cryptocurrencies are digital
assets that use cryptography to secure their transactions and control their
creation. Unlike fiat currencies1, which are issued and backed by central
authorities, cryptocurrencies are decentralized and operate on peer-to-peer
networks. This means that no one can manipulate or censor their
transactions, and users have full control over their own funds.
Cryptocurrencies offer several advantages over traditional payment systems,
such as lower fees, faster processing, global accessibility, transparency,
privacy, and security. They also enable new business models and
innovations, such as smart contracts, decentralized applications, and
tokenization.
You will learn how to build a data stream analyzer that collects and
processes real-time data from various sources. You will also learn how to
design and implement a trading engine that executes orders according to
your custom strategies and rules. Finally, you will learn how to develop a
user interface that displays the data and the results of your trading activities
and allows you to adjust your settings and preferences. By the end of this chapter, you will have a fully functional crypto trading platform that you can use for your own purposes or share with others.
Structure
In this chapter, we will discuss the following topics:
Building a data stream analyzer
Storage for data results - a time-series database
Analysis tool for trends
Learning how to draw graphs with Python
Building alarm logic
Objectives
After reading this chapter, you should know how to build your own crypto market trading platform client and be able to manage your crypto assets, using Python to build a simple yet powerful currency exchange application.
Data stream
Before we can analyze any kind of data, we have to learn how to fetch data from an external web resource. For the following examples we are going to use the crypto.com website, which delivers a crypto exchange market with live updates. Let's check the following code to see how we can fetch example data values for bitcoin.
Before we can make any calls, we have to install the following libraries, which are going to be essential in all our examples.
1. $ pip install requests click
Code 16.1
1. Once the Python modules are installed, we can wrap up a simple example that fetches a crypto coin exchange rate. Let's investigate the following example.
1. import requests
2. from pprint import pprint
3.
4. url = f"https://price-api.crypto.com/price/v1/token-price/bitcoin"
5. result = requests.get(url, headers={"User-Agent": "Firefox"})
6. pprint(result.json())
Code 16.2
In the example Code 16.2 we call the crypto.com website, introducing our call as a "Firefox" browser (line 5) in the request headers. This way the crypto website will not think we are making the call from a command line program but from a real browser. Next, we print the received response (line 6); since we know it is JSON, we call the json method on the response to convert it to a Python dictionary.
Let's take a look at a chunk of the pretty long example response.
1. {'btc_marketcap': 19735953.0,
2. 'btc_price': 1,
3. 'btc_price_change_24h': 0.0,
4. 'btc_volume_24h': 1870973.8893898805,
5. 'circulating_supply': 19735953.0,
6. 'defi_tradable': True,
7. 'exchange_tradable': True,
8. 'max_supply': 21000000.0,
9. 'price_update_time': 1722868080,
10. 'prices': [68665.03455943712,
11. 68078.98719247522,
12. 66877.86280787496,
13. ...
14. 53335.02102136236],
15. 'rank': 1,
16. 'slug': 'bitcoin',
17. 'token_dominance_rate': None,
18. 'token_id': 1,
19. 'usd_marketcap': 1055828079115.311,
20. 'usd_price': 53497.69930620077,
21. 'usd_price_change_24h': -0.116307,
22. 'usd_price_change_24h_abs': 0.116307,
23. 'usd_volume_24h': 99865107623.72562 }
Code 16.3
We can clearly see that the response contains not only the current exchange value (key usd_price) but also historical data. We are going to use this fact in a further part of this chapter.
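For example, a slightly more defensive variant of Code 16.2 (a sketch; the timeout value is an arbitrary choice) can pick out both the current price and the historical samples:
1. import requests
2.
3. url = "https://price-api.crypto.com/price/v1/token-price/bitcoin"
4. resp = requests.get(url, headers={"User-Agent": "Firefox"}, timeout=10)
5. resp.raise_for_status()  # stop early on HTTP errors
6. data = resp.json()
7. print(data["usd_price"])    # current exchange rate in USD
8. print(len(data["prices"]))  # number of historical price samples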
2. Let's refactor example Code 16.2 so we can support more coins and select them dynamically.
1. import click
2. import requests
3. from pprint import pprint
4.
5. SUPPORTED_COINS = {"eth": "ethereum", "btc": "bitcoin"}
6.
7.
8. def fetch_exchange(coin_str):
9. url = f"https://price-api.crypto.com/price/v1/token-price/
{coin_str}"
10. print(f'Calling {url}')
11. result = requests.get(url, headers={"User-Agent": "Firefox"})
12. pprint(result.json())
13.
14.
15. @click.command()
16. @click.option("--
coin", type=click.Choice(SUPPORTED_COINS.keys())
, help="Coin symbol to fetch details about", required=True)
17. def main(coin):
18. if coin not in SUPPORTED_COINS:
19. raise Exception("Invalid coin")
20. fetch_exchange(SUPPORTED_COINS[coin])
21.
22. if __name__ == "__main__":
23. main()
Code 16.4
3. We updated the code to make it cleaner and easier to use. By using the click library (line 16) we limited the supported coins to only two, so we can call our script like in the following example.
1. $ python updater.py --coin eth
Code 16.5
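If we pass a coin outside the supported set, click rejects it before our function even runs; the exact wording below is approximate and depends on the click version.
1. $ python updater.py --coin xrp
2.
3. Error: Invalid value for '--coin': 'xrp' is not one of 'eth', 'btc'.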
Storing stream
To be able to store data that is time driven, we can't simply use a SQL-optimized database. For storing streams and date-based values we need to use a time series database engine2. Of course, there are plenty of choices to choose from, including the PostgreSQL timeseries plugin3. In our case we are going to use the open source Influx DB4. On the website you can find a full description of how to install Influx DB on your local machine. On macOS it is as simple as follows.
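For example, assuming you use the Homebrew package manager, the installation comes down to two commands (a sketch; consult the Influx DB website for the current instructions):
1. $ brew install influxdb
2. $ brew services start influxdb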
1. Once the service is installed and running, we can access it like a regular website on our local system – it is shown in the following figure.
Reading data
Once data is saved, we have to get it from the DB storage into Python. To do so, we have to create an Influx DB connection as a read pointer. Let's check the following example to see how we can accomplish this.
1. from pprint import pprint
2. from influxdb_client import InfluxDBClient
3. from dotenv import dotenv_values
4.
5.
6. url = "http://localhost:8086"
7. config = dotenv_values(".env")
8. client = InfluxDBClient(url=url, token=config['API_KEY'], org=config['org'])
9. query_api = client.query_api()
10.
11. query = """from(bucket: "coins")
12. |> range(start: -100m)
13. |> filter(fn: (r) => r._measurement == "price")"""
14.
15. result = query_api.query(org=config['org'], query=query)
16.
17. results = []
18. for table in result:
19. for record in table.records:
20. results.append((record.get_field(), record.get_value()))
21.
22. pprint(results)
Code 16.12
We have used the same dotenv module as before for reading the configuration from the .env file (line 7). Next, we establish the connection to Influx DB (lines 8-9). When all is ready, we prepare the database query (lines 11-13).
It is easy to notice that the Influx DB query language is quite different from the SQL used in relational databases. In our example we make a simple query where we say: from the bucket coins (line 11) we want all records that were inserted during the last 100 minutes (line 12) and whose measurement name is price (line 13).
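The same pipe syntax lets us push more work into the database. For instance, a variation of the query (a sketch using the standard Flux aggregateWindow function) would average the prices into 10-minute buckets before they ever reach Python:
1. query = """from(bucket: "coins")
2. |> range(start: -100m)
3. |> filter(fn: (r) => r._measurement == "price")
4. |> aggregateWindow(every: 10m, fn: mean)"""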
In lines 17-20 we fetch the records from the database and convert them to a result that is going to look like the following example.
1. [...
2. ('value', 2416.515533583986),
3. ('value', 2492.635832124),
4. ('value', 2459.506617256568),
5. ('value', 2433.083800659239),
6. ('value', 2408.107898192493),
7. ('value', 2439.415079893941),
8. ('value', 2470.170705476894)]
Code 16.13
Data visualization
Once we have data fetched and saved in the local database, it would be fantastic from a usability point of view to be able to visualize trends and currency exchange rates. To perform this task we are going to use the Flask framework as a web service that we will open in the browser.
1. pip install flask==2.2.3
2. pip install plotly pandas
Code 16.14
1. We start by installing Flask and Plotly; the latter is a framework for drawing visualizations and charts. The first thing we are going to do is write a simple "hello world" service. In the following example we created the file hello_world.py.
1. from flask import Flask
2.
3. app = Flask(__name__)
4.
5. @app.route("/")
6. def hello():
7. return "Hello World!"
8.
9. if __name__ == "__main__":
10. app.run(host="localhost", port=5005)
Code 16.15
To start it, we need to execute it by running the following command.
1. $ python hello_world.py
2.
3. * Serving Flask app 'hello_world'
4. * Debug mode: off
5. WARNING: This is a development server. Do not use it in a
production deployment. Use a production WSGI server instead.
6. * Running on http://localhost:5005
7. Press CTRL+C to quit
Code 16.16
It is worth noticing that we specified where the server listens and on which port. Once it is up, we can open our service in any web browser just by accessing the URL http://localhost:5005
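We can also verify the service from the command line, for example with curl:
1. $ curl http://localhost:5005/
2.
3. Hello World!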
2. Working great, right? Before we jump into the topic of drawing any kind of graphs, we need to do some HTML work on our hello world example. We already worked with the concept of MVC5 in one of the previous chapters, so we have some basic knowledge of how web frameworks use it.
The Flask framework takes a lighter, more low-level approach. The developer is the person who decides which framework to choose for each MVC component6.
3. Without diving too deep into the topic, we must make some assumptions. For the view layer, we will use the jinja27 templating framework. To be able to use it in our hello world example, we have to modify our code so it looks like the following.
1. from flask import Flask, render_template
2.
3. app = Flask(__name__)
4.
5. @app.route("/")
6. def hello():
7. return render_template("index.html")
8.
9. if __name__ == "__main__":
10. app.run(host="localhost", port=5005)
Code 16.17
Restart the server, reopen the same URL as in example Code 16.16 and… we have an error like in the following dump from the shell.
1. ERROR in app: Exception on / [GET]
2. Traceback (most recent call last):
3.
4. (...)
5.
6. jinja2.exceptions.TemplateNotFound: index.html
7. 127.0.0.1 - - [02/Apr/2023 08:27:03] "GET / HTTP/1.1" 500 -
Code 16.18
4. That error means that when we opened the URL, Flask executed Code 16.17, lines 6-7, which tried to load a Jinja template that does not exist yet, so Jinja threw an error. Let us create the missing template to address this problem. Create a directory templates and save the following file index.html in that folder.
1. <html>
2. <head>
3. <title>crypto analyzer</title>
4. </head>
5. <body>
6. <p>hello</p>
7. </body>
8. </html>
Code 16.19
Since we know how to generate HTML from templates, we can now make this main template a little nicer. For doing this we will use the popular front-end library called bootstrap8.
5. Thankfully this library comes as precompiled, ready-to-distribute files. We are going to fetch them from a content delivery network (CDN9). Let us modify the example from Code 16.19 and introduce some bootstrap sugar there.
1. <html>
2. <head>
3. <title>crypto analyzer</title>
4. <link href="https://cdn.jsdelivr.net/npm/[email protected]/dist/css/
bootstrap.min.css" rel="stylesheet" crossorigin="anonymous">
5. <script src=https://cdn.jsdelivr.net/npm/[email protected]/dist/js/bo
otstrap.bundle.min.js crossorigin="anonymous"></script>
6. </head>
7. <body>
8. <p>Current data</p>
9. </body>
10. </html>
Code 16.20
6. By introducing bootstrap to our HTML, we can define a completely new look and feel for our template. As a next step we need to define a new method that selects all the current currency rates from the database and displays them on the main page.
1. from pprint import pprint
2. from influxdb_client import InfluxDBClient
3. from dotenv import dotenv_values
4. from flask import Flask, render_template
5.
6.
7. url = "http://localhost:8086"
8. config = dotenv_values(".env")
9. client = InfluxDBClient(url=url, token=config['API_KEY'], org=config['org'])
10. query_api = client.query_api()
11.
12.
13. app = Flask(__name__)
14. app.config["TEMPLATES_AUTO_RELOAD"] = True
15.
16. def get_data():
17. query = """from(bucket: "coins")
18. |> range(start: -200m)
19. |> filter(fn: (r) => r._measurement == "price")"""
20.
21. result = query_api.query(org=config['org'], query=query)
22. results = []
23. for table in result:
24. for record in table.records:
25. results.append((record.get_field(), record.get_value()))
26. return results
Code 16.21
By creating the method get_data we implemented the same querying technique that we already learned in Code 16.12. It returns a list of tuples with the results coming from Influx DB. Once we have the data set prepared, it is time to inject it into our template. Let's take a look at the index page to see how we are going to achieve this.
1. @app.route("/")
2. def hello():
3. context = {"currencies": get_data()}
4. return render_template("./index.html", **context)
Code 16.22
As shown in line 3, we fetch the results from Influx DB using the method from Code 16.21 and assign them directly to the template context. Let's check how we are going to display the data in the template. First, let's inject the following syntax into Code 16.20, between lines 8 and 9, as follows.
1. {% include "currencies.html" %}
Code 16.23
This syntax informs the templating engine to include another template file, currencies.html, and inject it into the place where we used the include tag. The following code shows how we display the Influx DB data.
1. <table class="table table-striped table-hover">
2. <thead>
3. <th>ID</th>
4. <th>Code</th>
5. </thead>
6. {% for item in currencies %}
7. <tr>
8. <td>{{ item[0] }}</td>
9. <td>{{ item[1] }}</td>
10. </tr>
11. {% endfor %}
12. </table>
Code 16.24
7. After preparing all the data, we should restart our webservice and access the same localhost URL as before; this time we are going to see all the data like in the following figure.
Figure 16.2: Example table with sample currency exchange data.
This is a very basic table which displays raw data. It is quite a user-unfriendly way of presenting time-driven data. A much better approach is to draw a graph, which makes trends in the currency exchange easier to follow. Let's check in the following example how we can convert the data received from Influx DB into a form from which we can display a graph.
1. import pandas as pd
2. import plotly.graph_objects as go
3. from influxdb_client import InfluxDBClient
4. from dotenv import dotenv_values
5. from dataclasses import make_dataclass
6.
7.
8. url = "http://localhost:8086"
9. config = dotenv_values(".env")
10. client = InfluxDBClient(url=url, token=config['API_KEY'], org=config['org'])
11. query_api = client.query_api()
12.
13. query = """from(bucket: "coins")
14. |> range(start: -100m)
15. |> filter(fn: (r) => r._measurement == "price")"""
16.
17. result = query_api.query(org=config['org'], query=query)
18.
19. Point = make_dataclass("Point", [("Date", str), ("Value", float)])
20.
21. results = []
22. for table in result:
23. for record in table.records:
24. results.append(Point(record.get_time(), record.get_value()))
25.
26. df = pd.DataFrame(results)
27. fig = go.Figure([go.Scatter(x=df['Date'], y=df['Value'])])
28. fig.show()
Code 16.25
8. We can see that in this example we refactored Code 16.12 in such a way that instead of printing the database results, we create a Point dataclass (line 19) whose instances we later fill with data coming from Influx DB (lines 22-24). Once the list of points is ready, we create a Pandas10 data frame. With everything set, it is time to create the plot figure (line 27), fill it with the pandas data frame and, in the end, show it. When all is done properly and we have correctly updated Influx DB with recent currency exchange data, we shall see a figure like the one shown below.
Figure 16.3: Example crypto currency exchange shown in a human friendly graph
We managed to show all the data that we collected over time. It is easy to read the up and down trends in the crypto currency exchange. There is one problem with our approach – we do not show this as part of our web service (Code 16.21). That being said, let's take a look at how we can include the technique of drawing trend graphs as part of our simple website.
Before we can serve the data graph as part of the webservice, we have to install a few Python modules as follows.
1. $ pip install kaleido dash
Code 16.27
9. Once the packages are installed, we can introduce a new method that displays the same image content as in Figure 16.3, but as part of the webserver response. Let's check the following example.
1. from flask import Flask, render_template, Response
2.
3. @app.route("/graph")
4. def graph():
5. results = get_data()
6. df = pd.DataFrame(results)
7. fig = go.Figure([go.Scatter(x=df['Date'], y=df['Value'])])
8. img_bytes = fig.to_image(format="png")
9. return Response(img_bytes, mimetype="image/png")
Code 16.28
We added Response to the flask import; this function helps us return raw data in the response (line 9). In this case we have to explicitly define what kind of data we are returning – here we inform the Flask framework that the data returned (line 9) is an image (PNG).
10. The rest of the body of the graph method is pretty much the same as what we wrote in Code 16.25, with one major difference – line 8. That line, instead of showing the image directly, dumps the image content into a variable which we later return to the browser.
Once the image is ready, we can modify the body of our template content to show the image as part of our simple website.
1. <body>
2. <p>Current data (USD)</p>
3. <img src="/graph" />
4. </body>
Code 16.29
11. Let us restart our webserver and check how the main website looks now. It should display a graph like in the following figure.
Figure 16.4: Example webserver displaying currency exchange trends over a time
Data estimate
In Chapter 4, Developing App to Analyze Financial Expenses, we learned how to build a data estimator. Let us use this knowledge here to add a tool for analyzing data trends and drawing estimates. The following example shows how we can refactor Code 16.28 to add interpolation to it.
Before we can create the estimating tool, we have to install a few Python modules, as in the following example.
1. $ pip install numpy scikit-learn
Code 16.30
When all is set, we can write the estimating function like in the following code.
1. import numpy as np
2. from sklearn import preprocessing
3. from sklearn.model_selection import train_test_split
4. from sklearn.linear_model import Ridge
5. from datetime import datetime
6.
7. def forecast_data(df):
8. forecast_col = "Value"
9. df.fillna(value=-99999, inplace=True)
10. forecast_size = int(np.ceil(0.03 * len(df)))  # np.ceil avoids an extra math import
11. df['Date'] = df['Date'].apply(lambda x: x.timestamp())
12. df["label"] = df[forecast_col].shift(-forecast_size)
13.
14. x = np.array(df.drop(["label"], axis=1))
15. print(x)
16. x = preprocessing.power_transform(x)
17. x_lately = x[-forecast_size:]
18. x = x[:-forecast_size]
19.
20. df.dropna(inplace=True)
21.
22. y = np.array(df["label"])
23. x_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
24. clf = Ridge(alpha=1.0)
25. clf.fit(x_train, y_train)
26. confidence = clf.score(X_test, y_test)
27.
28. forecast_set = clf.predict(x_lately)
29. df["Forecast"] = np.nan
30. last_date = df['Date'].iloc[-1]  # last stored unix timestamp
31. last_unix = last_date
32. one_step = 60  # one minute, in seconds, between forecast points
33. next_unix = last_unix + one_step
34.
35. for i in forecast_set:
36. next_date = datetime.fromtimestamp(next_unix)
37. next_unix += one_step
38. df.loc[next_date] = [np.nan for _ in range(len(df.columns) - 1)] + [i]
39.
40. @app.route("/graph")
41. def graph():
42. results = get_data()
43. df = pd.DataFrame(results)
44. forecast_data(df)
45. df['Date'] = pd.to_datetime(df['Date'], unit='s')
46. fig = go.Figure([go.Scatter(x=df['Date'], y=df['Value'])])
47. img_bytes = fig.to_image(format="png")
48. return Response(img_bytes, mimetype="image/png")
Code 16.31
We added the new essential Python imports (lines 1-5). Next, we introduced a slightly modified version of the interpolation method (lines 7-38) that we already learned in Chapter 4, Developing App to Analyze Financial Expenses. It is worth noticing that we call it (line 44) by passing the data frame parameter, which gets modified directly inside the method. The data frame argument in this use case works like a pointer in the C language, so there is no need to return and reassign the data frame value after the method runs. Next, we show the graph as we did in previous examples.
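A tiny self-contained illustration of that in-place behaviour (hypothetical names, not part of our app):
1. import pandas as pd
2.
3. def add_flag(df):
4.     df["flag"] = True  # mutates the caller's frame, nothing is returned
5.
6. frame = pd.DataFrame({"Value": [1.0, 2.0]})
7. add_flag(frame)
8. print(list(frame.columns))  # ['Value', 'flag'] - no reassignment needed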
Figure 16.5: Compared crypto currency exchange values without and with data interpolation
As shown, the estimating algorithm is pretty smooth and it calculates the incoming trend based on the ups and downs in the last few data points that were read from Influx DB.
Alarms
So far we have managed to learn how to fetch and store stream data in a database, as well as present the gathered results in a human-friendly way. This time we want to see how to build a simple tool that raises an alarm when a significant change in the currency exchange is detected. For instance, we want to detect when there is a drop or rise of a given percentage in the received crypto currency exchange data. Let's check in the following example how we can modify the code that we built in the subchapter on data streams.
1. def check_value(self, alarm):
2. query = """
3. from(bucket: "coins")
4. |> range(start: -1000m)
5. |> filter(fn: (r) => r._measurement == "price")
6. |> sort(columns: ["_time"], desc: true)
7. |> limit(n:2)
8. """
9. client = InfluxDBClient(url=URL, token=self._config['API_KEY'], org=self._config['org'])
10. query_api = client.query_api()
11. result = query_api.query(org=self._config['org'], query=query)
12. results = []
13. for table in result:
14. for record in table.records:
15. results.append(record.get_value())
16. print(results)
17. ratio = results[0]/results[1]  # newest value divided by the previous one
18. trend_percentage = (ratio*100)-100
19. print(f"Trends change: {trend_percentage}%")
20. if abs(trend_percentage) > alarm:
21. print(f'WARNING: Critical change since last time we fetched data, change: {trend_percentage}%')
Code 16.32
In this example we query the database, asking for all the records that were saved during the last 1000 minutes, then sort them by insert time and take only the last 2 values from the set. The reason why we ask for only the last two elements is that we want to check the percentage of change between the last currency exchange value that we just inserted (check the following code) and the previous value we saved. When we run our script, we should see a result like this.
1. $ python updater_with_storage_alarms.py --coin eth --alarm 5
2.
3. Calling https://price-api.crypto.com/price/v1/token-price/ethereum
4. Trends change: -0.0791962718755741%
Code 16.33
As you can notice, in Code 16.32 (lines 20-21) we added a check on the percentage of trend change, testing whether trend_percentage breaches the given threshold alarm. We calculate the absolute value (line 20) since we want to raise the alarm both when the trend grows above the alarm level and when it falls under the alarm line.
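To make the arithmetic concrete, here is a worked example with two hypothetical readings (newest first, matching the descending sort in Code 16.32):
1. results = [2400.0, 2472.0]  # current value, previous value
2. ratio = results[0] / results[1]
3. trend_percentage = (ratio * 100) - 100  # about -2.91%
4. print(abs(trend_percentage) > 2)  # True - a 2% alarm threshold fires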
In Code 16.33 we use the alarm level as a parameter for our script (Code 16.8). To be able to use this parameter, we have to modify that code as shown in the following example.
1. @click.command()
2. @click.option("--alarm", help="Alarm level - percentage",
required=False, default=False, type=int)
3. @click.option("--seed", help="Seed example data",
required=False, default=False, is_flag=True)
4. @click.option(
5. "--
coin", type=click.Choice(SUPPORTED_COINS.keys()), help="Coin sy
mbol to fetch details about", required=True
6. )
7. def main(alarm, seed, coin):
8. if coin not in SUPPORTED_COINS:
9. raise Exception("Invalid coin")
10. c = CoinApp(SUPPORTED_COINS[coin])
11. if seed:
12. c.seed_data()
13. else:
14. c.update_db(alarm)
Code 16.34
In this case, the new alarm parameter is pretty clear and easy to use. Now, let's check the following example to see how we use it in the method update_db.
1. def update_db(self, alarm):
2. data = self.fetch_exchange()
3. value = data['usd_price']
4. self._save_data(value)
5. self.check_value(alarm)
Code 16.35
As we said for Code 16.32, we fetch only the two last results inserted into the database because, as demonstrated in Code 16.35 (lines 4-5), we call the method check_value only after inserting the freshly fetched crypto currency exchange value into Influx DB.
Conclusion
In this chapter, we learned how to use a publicly accessible website which publishes crypto currency exchange values and statistics. We managed to understand how to store time series data in a database specially designed to hold such data so that it can be accessed quickly and efficiently and queried in a very flexible way. We also managed to write our own simple yet powerful web application that consumes the stored data and presents crypto currency exchange trends in a human-friendly way.
1. https://en.wikipedia.org/wiki/Fiat_money
2. https://en.wikipedia.org/wiki/Time_series_database
3. https://github.com/timescale/timescaledb
4. https://www.influxdata.com
5. https://en.wikipedia.org/wiki/Model–view–controller
6. https://flask-diamond.readthedocs.io/en/latest/model-view-controller/
7. https://palletsprojects.com/p/jinja/
8. https://getbootstrap.com
9. https://www.jsdelivr.com
10. https://pandas.pydata.org
Index
A
Alarms 469-471
Auto Purchase 400-405
B
Barcode Generator 430
Barcode Generator, steps 430-432
Barcode Reader 432
Barcode Reader, steps 432-438
C
Calculator 416-420
Calculator, aspects
Android 424-426
Callbacks 423, 424
Calculator, configuring 422, 423
Calculator, framework
Android 426, 427
iOS 427, 428
Calendar Parser 329-333
Calendar Parser, points
External Data, synchronizing 340-344
Subscribe Locally 336-338
Chat 70, 71
Chatbot 67, 68
Chatbot, aspects
Rules-Based Service 67
Self-Learn 67
Chatterbot 68-70
ClientEbay 380
Client-Server 62
Client-Server, applications 71-78
Client-Server, architecture 62-67
Client-Server, ways 64
Compiler 412, 413
Crypto Currencies, optimizing 182, 183
Crypto Currencies, steps 183-188
Crypto Currencies Trend, analyzing 191-197
Crypto Currencies With Wallet, integrating 203-209
Crypto Market 182
Crypto Market, optimizing 211-215
Crypto Market With Client, building 188-191
D
Data Estimate 467, 468
Data Stream 450
Data Stream, steps 450-452
Data Stream, terms
Reading 457, 458
Storing 452-455
Data Visualization 458
Data Visualization, steps 458-466
DHCP Server 294
DHCP Server, architecture 294-307
Download Manager 250-260
Download Manager Data, analyzing 265-275
E
eBay Client 372
eBay Client, keys 378
eBay Client, parameters
appid 373
certid 373
Devid 373
Token 373
eBay Client, steps 372-377
Excel 88
Excel Driver, building 104-108
Excel Expenses, analyzing 92-96
Excel Outcomes, estimating 96-103
Excel, tasks
Export 88-91
Import 91
F
Flake8 44, 45
Format Resolutions, supporting 283-285
Frontend 79
Frontend, arguments 80
Frontend, configuring 80-86
G
git 50
GUI 410
GUI, applications
Kivy 410-412
Toga 410
H
Hashing 146, 147
Hashing, factors 147
Hash Key, calculating 147-149
I
IDE 46, 47
Interaction, scenarios 226-231
P
Package Inspection 312-315
Package Inspection, challenges 316, 317
Package Routing 288
Package Routing, architecture 288-294
Package Routing, layers 288
Parallel Process 126-134
Parallel Process, architecture 175-179
Parallel Process, method 135
Physical Devices, building 247
pip 68
Plugins 378
Plugins, points
Items, tracking 393-396
Price Tracker, automating 388-392
Plugins, values 398-400
Port Scanner 351
Port Scanner, issues 352, 353
Port Scanner, traps 352
Pre-Commit 47-51
Pre-Commit, challenges 50
pycrypto 190
Pylint 45, 46
Python 2
Python, fundamentals
Classes 16-20
Code Style 28-31
Error, handling 25-27
Functions 10-15
Iterators/Generators 8-10
Loops 6-8
Modules/Package 22-24
Python GUI 409
Python GUI, libraries
Kivy 409
QT 409
TK 409
wxWidgets 409
Python Library, building 54-60
Python, points
Editor 3, 4
Hello World 4, 5
Python, tools
Flake8 44, 45
IDE 46, 47
Pre-Commit 47-53
Pylint 45, 46
Python, uses 2, 3
Python, workbench
Clean Code 42, 43
Libraries, controlling 38-41
Linux 34, 35
Project, controlling 36-38
Windows 36
Q
QR Code Generator 438-444
QR Code Reader 447, 448
R
Reporting 361
Reporting, architecture 361-369
return keyword 9
S
Scanner 353
Scanner, points 354-357
Scikit-Learn 97
sounddevice 218
Speech-To-Text Recognition 218
Speech-To-Text Recognition, components 218
Speech-To-Text Recognition, points
Recording 218-222
Response 223-226
spotify_callback 241
sqlparse 41
T
TCP/UDP 346
TCP/UDP, configuring 346-350
Tempfile 225
Third Party Service, connecting 233-239
V
Viruses 149, 150
Viruses Class, attributes 168
Viruses, issues 153
Viruses Suspicious Files, building 163-166
Viruses, uses 150-163
Virus Scanner, directories 144-146
W
Web Calendars, tool
Google 322-324
iCal 328
Office 365 325-328
Web Content, filtering 317-320
Web Crawler 110-113
Web Crawler, architecture 121-125
Web Crawler, configuring 115-120
Web Crawler, resources
Parallel Process 137, 139
Proxy 139-141
Web Crawler With HTML, analyzing 113-115
Y
YouTube API, building 275-282