Troubleshooting and Debugging Techniques
Introduction to Debugging
Troubleshooting means identifying, analysing, and solving any kind of problem, whether in
the real world or in IT. Problems can be caused by hardware, the operating system, or
applications running on the computer. They can also be caused by the environment and
configuration of the software, by the services the application is interacting with, or by a
wide range of other possible IT causes.
Debugging, on the other hand, means identifying, analysing, and removing bugs from the
code of an application.
There are lots of tools that we can use to get more information about the system and what
the programs in our system are doing. Tools like tcpdump and Wireshark can show us
ongoing network connections and help us analyse the traffic going over our cables.
Tools like ps, top, or free can show us the number and types of resources used in the
system. We can use a tool like strace to look at the system calls made by a program, or
ltrace to look at the library calls made by the software.
Debuggers let us follow the code line by line, inspect changes in variable assignments,
interrupt the program when a specific condition is met, and more.
1. Get information: Gather as much information about the problem as you can. Don't
worry about the solution at first; instead, try to find the possible causes and work out
how critical the problem is.
2. Find the root cause: Once you know the possible causes, further investigation will
lead you to the root cause of the problem, from which you can proceed with the
troubleshooting.
3. Take remedial action: After finding the root cause, you have to take action to solve
the problem, either as a short-term fix or as a long-term remedy. You must know
how big and crucial the problem is and how it could damage or alter the working of
the system.
Also, while executing the above steps, make sure that whatever you do is documented
appropriately, so that you have a track record of how and what you did.
So, when trying to create a reproduction case, we want to find the actions that reproduce
the issue, and we want these to be as simple as possible. The smaller the change in the
environment and the shorter the list of steps to follow, the better.
Bugs that come and go are hard to reproduce, and are extremely annoying to debug.
If you can't modify the code of the program to get more information, check if there's a
logging configuration that you can change. Many applications and services already include a
debugging mode that generates a lot more output than the default mode.
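For example, in a Python script we could switch on more verbose output with the standard
logging module. A minimal sketch (the messages are just illustrative):

    import logging

    # Set the level to DEBUG to see all the extra output;
    # the default level only shows warnings and errors.
    logging.basicConfig(level=logging.DEBUG)

    logging.debug("About to call the flaky service")  # visible in debug mode
    logging.info("Request completed")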
For bugs that occur at random times, we need to prepare our system to give us as much
information as possible when the bug happens.
There is an annoying type of intermittent issue, nicknamed Heisenbug, in honour of Werner
Heisenberg. He's the scientist that first described the observer effect, where just observing
a phenomenon alters the phenomenon.
Heisenbugs are extra hard to understand, because when we meddle with them, the bug
goes away. These bugs usually point to bad resource management. Maybe the memory was
wrongly allocated, the network connections weren't correctly initialized, or the open files
weren't properly handled.
LAB 1
Debugging Python Scripts (Qwiklabs)
Slowness of Code
A problem that we have to deal with a lot when working in IT, is things being slow. This
could be our computer, our scripts, or even complex systems. Slow is a relative term.
Modern computers are much faster and can do many more things than computers a couple
of decades ago. Still, we always want them to be faster and to do more in less time.
The general strategy for addressing slowness is to identify the bottleneck that's causing our
device, our script, or our system to run slowly. The bottleneck could be the CPU time, as we
just mentioned. But it could also be time spent reading data from disk, waiting for data
transmitted over the network, moving data from disk to RAM, or some other resource that's
limiting the overall performance.
We need to monitor the usage of our resources to know which of them is being exhausted.
This means that it's being used completely and programs are getting blocked by not having
access to more of it.
top, on Linux systems, lets us see which currently running processes are using the most
CPU time or, if we sort by memory, which ones are using the most memory. It also
shows a bunch of other load information related to the current state of the computer, like
how many processes are running and how the CPU time or memory is being used.
RAM is a limited resource: if you run enough programs at the same time, you'll fill it up and run out of space. What
happens when you run out of RAM? At first, the OS will just remove from RAM anything
that's cached, but not strictly necessary. If there's still not enough RAM after that, the
operating system will put the parts of the memory that aren't currently in use onto the hard
drive in a space called swap. Reading and writing from disk is much slower than reading and
writing from RAM.
When trying to figure out what's making a computer slow, the first step is to look into when
the computer is slow. If it's slow when starting up, it's probably a sign that there are too
many applications configured to start on boot.
In this case, fixing the problem is just a question of going through the list of programs that
start automatically and disabling any that aren't really needed. If instead the computer
becomes sluggish after days of running just fine, and the problem goes away with a reboot,
it means that there's a program that's keeping some state while running that's causing the
computer to slow down.
If your hard drive has errors, the computer might still be able to apply error correction to
get the data that it needs, but it will affect the overall performance. And once a hard drive
starts having errors, it's only a matter of time until they're bad enough that data starts
getting lost, so it's worth keeping an eye out for them. To do this, we can use some of the
OS utilities that diagnose problems on hard drives or on RAM, and check if there's anything
that could be causing problems.
Monitoring Tools:
https://docs.microsoft.com/en-us/sysinternals/downloads/procmon
http://www.brendangregg.com/linuxperf.html
http://brendangregg.com/usemethod.html
https://www.digitalcitizen.life/how-use-resource-monitor-windows-7
https://docs.microsoft.com/en-us/sysinternals/downloads/process-explorer
The first step is to keep in mind that we can't really make our computer go faster. If we want
our code to finish faster, we need to make our computer do less work, and to do this, we'll
have to avoid doing work that isn't really needed. The most common approaches include
storing data that was already calculated to avoid calculating it again, using the right data
structures for the problem, and reorganizing the code so that the computer can stay busy
while waiting for information from slow sources like disk or the network.
There's a bunch of tools that can help us with that called profilers. A profiler is a tool that
measures the resources that our code is using, giving us a better understanding of what's
going on.
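As an illustration, Python ships with the cProfile module, which counts how often each
function was called and how long it took. A minimal sketch with a made-up workload:

    import cProfile

    def slow_function():
        # Hypothetical workload: sum the squares of the first million integers.
        return sum(i * i for i in range(1_000_000))

    # Prints a table of call counts and cumulative times per function.
    cProfile.run("slow_function()")

Running this points us at the functions worth optimizing, instead of guessing.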
Expensive Loops
Loops are what make our computers do things repeatedly. They are an extremely useful tool
and let us avoid repetitive work, but we need to use them with caution. If you do an
expensive operation inside a loop, you multiply the time it takes to do the expensive
operation by the number of times you repeat the loop.
Instead of making one network call for each element, make one call before the loop. Instead
of reading from disk for each element, read the whole thing before the loop. Even if the
operations done inside the loop aren't especially expensive, if we're going through a list of a
thousand elements and we only need five out of them, we're wasting time on elements we
don't need. Make sure that the list of elements that you're iterating through is only as long
as you really need it to be.
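To make the idea concrete, here's a sketch in Python where the expensive part (simulated
with a sleep) is hoisted out of the loop:

    import time

    def fetch_record(key):
        # Stand-in for an expensive operation, e.g. one network call.
        time.sleep(0.1)
        return key.upper()

    def fetch_all_records(keys):
        # Stand-in for a single batched call that returns everything at once.
        time.sleep(0.1)
        return {key: key.upper() for key in keys}

    keys = ["a", "b", "c", "d"]

    # Wasteful: pays the 0.1s cost once per element.
    slow = [fetch_record(key) for key in keys]

    # Better: one expensive call before the loop, then cheap lookups.
    records = fetch_all_records(keys)
    fast = [records[key] for key in keys]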
Remember that to make our scripts get to their goal faster, we need to avoid having our
computer do unnecessary work. If the script gets executed fairly regularly, it's common to
create a local cache.
Creating caches can be super useful to save us time and make our programs faster. But
they're sometimes tricky to get right. We need to think about how often we're going to
update the cache and what happens if the data in the cache is out of date. If we're looking
for some long-term stats, we can generate the cache once per day, and it won't be a
problem.
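A sketch of such a local cache in Python, assuming daily stats stored in a hypothetical
/tmp/stats_cache.json file:

    import json
    import os
    import time

    CACHE_FILE = "/tmp/stats_cache.json"  # hypothetical location
    ONE_DAY = 24 * 60 * 60

    def compute_stats():
        # Stand-in for an expensive calculation we'd rather not repeat.
        return {"total": 42}

    def get_stats():
        # Reuse the cached file if it's less than a day old.
        if os.path.exists(CACHE_FILE):
            if time.time() - os.path.getmtime(CACHE_FILE) < ONE_DAY:
                with open(CACHE_FILE) as cache:
                    return json.load(cache)
        stats = compute_stats()
        with open(CACHE_FILE, "w") as cache:
            json.dump(stats, cache)
        return stats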
When we call time, it runs the command that we pass to it and prints how long it took to
execute it. There are three different values: real, user, and sys. Real is the amount of
actual time that it took to execute the command. This value is sometimes called wall-clock
time because it's how much time a clock hanging on the wall would measure no matter
what the computer's doing. User is the time spent doing operations in the user space. Sys
is the time spent doing system level operations. The values of user and sys won't
necessarily add up to the value of real because the computer might be busy with other
processes.
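We can take similar measurements from inside a Python script; roughly, time.perf_counter()
tracks wall-clock ("real") time, while time.process_time() tracks the CPU time (user plus
sys) spent by the process:

    import time

    start_wall = time.perf_counter()  # wall-clock ("real") time
    start_cpu = time.process_time()   # CPU time used by this process

    total = sum(i * i for i in range(1_000_000))

    print(f"real: {time.perf_counter() - start_wall:.3f}s")
    print(f"cpu:  {time.process_time() - start_cpu:.3f}s")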
Parallelizing Operations
In typical scripts, while a slow IO operation is going on, nothing else happens. The script is
blocked, waiting for input or output while the CPU sits idle. One way we can make this better is to do
operations in parallel. That way, while the computer is waiting for the slow IO, other work
can take place.
The OS will decide what fraction of CPU time each process gets and switch between them as
needed. So, a very easy way to run operations in parallel is just to split them across different
processes, calling your script many times each with a different input set, and just let the
operating system handle the concurrency.
Another easy thing to do, is to have a good balance of different workloads that you run on a
computer. If you have a process that's using a lot of CPU while a different process is using a
lot of network IO and another process is using a lot of disk IO, these can all run in parallel
without interfering with each other. When using the OS to split the work and the
processes, these processes don't share any memory, and sometimes we might need to have
some shared data. In that case, we'd use threads. Threads let us run parallel tasks inside a
process.
A script is CPU bound if you're running operations in parallel using all available CPU time.
In large complex systems, we have lots of different computers involved. Each one doing a
part of the work and interacting with the others through the network.
To be able to run things in parallel, we'll need to create an executor. This is the process
that's in charge of distributing the work among the different workers. The futures module
provides a couple of different executors, one for using threads and another for using
processes. For example, ThreadPoolExecutor.
The futures module makes it possible to run operations in parallel using different
executors.
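A minimal sketch of a ThreadPoolExecutor, with the slow IO simulated by a sleep:

    from concurrent.futures import ThreadPoolExecutor
    import time

    def slow_task(n):
        # Stand-in for IO-bound work, e.g. a network request.
        time.sleep(0.5)
        return n * n

    # The executor distributes the calls across a pool of worker threads.
    with ThreadPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(slow_task, range(8)))
    print(results)  # finishes in about 1 second instead of 4

Swapping ThreadPoolExecutor for ProcessPoolExecutor would run the tasks in separate
processes instead, which helps when the work is CPU bound rather than IO bound.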
To know more about complex slow systems, follow the links below:
https://realpython.com/python-concurrency/
https://hackernoon.com/threaded-asynchronous-magic-and-how-to-wield-it-bba9ed602c32
LAB 2
Fix a Slow System with Python
Crashing Programs
Systems that Crash
When we come across a program that terminates unexpectedly, we go through our usual
cycle of gathering information about the crash, digging in until we find the root cause, and
then applying the right fix.
When an application crashes and we don't know why, we'll want to look for logs that might
relate to the failure. To look at logs on Linux, we'll open the system log files in /var/log or
the user log files like the .xsession-errors file. On macOS, we generally use the Console app
to look at logs, and on Windows, the Event Viewer. So what kind of data should you look for
in these logs? Most logs have a date and time for each line logged. Knowing when the
application crashed, you can look for a log line around that time.
Whenever we have an error message, no matter how weird it seems, we can search for it
online to try to figure out its meaning. If we're lucky, we might find the official
documentation of what that error means and what we can do about it. If there are no errors,
or the errors aren't useful, we can try to find out more info by enabling debug logging.
Many applications generate a lot more output when debug logging is enabled.
We could find that the problem is caused by a resource not being present that the program
expects to be present.
If the problem is caused by an external service that the application uses and that's no longer
compatible, we could write a service to act as a proxy and make sure that both sides see the
requests and responses they expect. This type of compatibility layer is called a Wrapper. A
Wrapper is a function or program that provides a compatibility layer between two
functions or programs so they can work well together. Using Wrappers is a pretty common
technique when the expected output and input formats don't match.
Another possibility you might need to look at is whether the overall system environment
isn't working well with the application. In this case, you might want to check what
environment the application's developers recommend, and then modify your systems to
match that. This could be running the same version of the operating system, using the same
version of the dynamic libraries, or interacting with the same back-end services.
Sometimes we can't find a way to stop an application from crashing, but we can make sure
that if it crashes, it starts back up again. To do this, we can deploy a watchdog. This is a
process that checks whether a program is running and, when it's not, starts the program
again. To implement this, we need to write a script that stays running in the background
and periodically checks if the other program is running.
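A rough sketch of such a watchdog in Python, assuming a Linux system with pgrep available
and a hypothetical program called myservice:

    import subprocess
    import time

    SERVICE = "myservice"  # hypothetical program to keep alive

    def is_running(name):
        # pgrep exits with 0 if at least one matching process exists.
        return subprocess.run(["pgrep", "-x", name],
                              capture_output=True).returncode == 0

    while True:
        if not is_running(SERVICE):
            subprocess.Popen([SERVICE])  # restart the crashed program
        time.sleep(30)  # check twice a minute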
This works well for services where availability matters more than running continuously and
no matter how you work around the issue, remember to always report the bug to the
application developers.
https://docs.microsoft.com/en-us/sysinternals/downloads/procmon
Code that Crashes
Each process running on our computer asks the operating system for a chunk of memory.
This is the memory used to store values and do operations on them during the program's
execution. The OS keeps a mapping table of which process is assigned which portion of the
memory. Processes aren't allowed to read or write outside of the portions of memory they
were assigned.
During normal working conditions, applications will request a portion of the memory and
then use the space that the OS assigned to them. But programming errors might lead to a
process trying to read or write to a memory address outside of the valid range. When this
happens, the OS will raise an error like segmentation fault or general protection fault.
On top of the information that the computer uses to execute the program, the executable
binary needs to include extra information needed for debugging, like the names of the
variables and functions being used. These symbols are usually stripped away from the
binaries that we run to make them smaller. So we'll need to either recompile the binary to
include the symbols, or download the debugging symbols from the provider of the software
if they're available.
Dr. Memory can assist in finding out if invalid operations are occurring in a program
running on Windows or Linux.
The logging module can be set up so that debug messages are shown when the code fails.
Writing good comments is one of those good habits that pays off when trying to understand
code written by others and also your past self. Unfortunately, a lot of code doesn't include
enough comments, leaving us to try to understand it without enough context. If that's the
case, you can improve things by adding comments as you read the code and figure out what
it's doing.
If you've come across an error and debugged the issue well enough to understand what's
going on, you might be able to fix the problem even if you've never seen that language
before. This is one of those skills that gets better with practice. So it might make sense for
you to start practicing before you need to fix a problem in the code. Take a program that
you use and whose code you have access to, and figure out how it does a specific action.
Core files store all the information related to the crash so that we or someone else can
debug what's going on. It's like taking a snapshot of the crash when it happens to analyze
it later. We need to tell the OS that we want to generate those core files. We do that by
running the ulimit command, using the -c flag for core files, and then saying unlimited
to state that we want core files of any size.
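On Linux, the same limit can also be raised from inside a Python program through the
standard resource module; a sketch, assuming the hard limit already allows it:

    import resource

    # Equivalent of "ulimit -c unlimited": allow core files of any size.
    resource.setrlimit(resource.RLIMIT_CORE,
                       (resource.RLIM_INFINITY, resource.RLIM_INFINITY))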
When dealing with complex systems having good logs is essential to understanding what's
going on. On top of that, you'll want to have good monitoring of what the service is doing
and use version control for all changes so that you can quickly check what's changed and roll
back when needed. It's also important that you can very quickly deploy new machines when
necessary. This could be achieved by either keeping standby servers, in case you need to use
them, or by having a tested pipeline that allows you to deploy new servers on demand.
A lot of companies today have automated processes for deploying services to virtual
machines running in the cloud. This can take a bit of time to set up, but once you've done
that you can very easily increase or reduce the number of servers you're using.
If you don't write down what you've tried or how you fixed the problem, you risk forgetting
some important details and wasting a lot of valuable time when you need to revisit an issue.
When working on a problem, it's always a good idea to document what you're doing in a
bug or ticket.
Documenting what you do, lets you keep track of what you've tried and what the results
were. This might seem unnecessary. But after a whole day of troubleshooting a problem, it's
pretty common for us to forget what we've tried or what was the outcome of a specific
action.
On top of people looking for the root cause and a solution, you want to have a person in
charge of communicating with the people affected. This lets the team avoid forgetting to
update the tracking issue or, even worse, providing contradictory information. This
communications lead needs to know what's going on and provide timely updates on the
current state and how long until the problem is resolved. They can act as a shield for
questions from users, letting the rest of the team focus on the actual problem.
Effective Post-mortems
Post-mortems are documents that describe the details of an incident to help us learn from our
mistakes. When writing a post-mortem, the goal isn't to blame whoever caused the
incident, but to learn from what happened to prevent the same issue from happening again.
To do this, we usually document what happened, why it happened, how it was diagnosed,
how it was fixed, and finally figure out what we can do to avoid the same event happening
in the future. Remember the main goal is to learn from our mistakes. Writing a post-mortem
isn't about getting someone fired but about making sure that next time we do better.
A memory leak happens when a chunk of memory that's no longer needed is not
released. If the memory leak is small, we might not even notice it, and it probably won't
cause any problems. But, when the memory that's leaked becomes larger and larger over
time, it can cause the whole system to start misbehaving.
Languages like Python, Java, and Go request the necessary memory when we create
variables, and then run a tool called a garbage collector that's in charge of freeing the
memory that's no longer in use. To detect when that's the case, the garbage collector looks
at the variables in use and the memory assigned to them, and then checks whether there
are any portions of memory that aren't being referenced by any variables.
The OS will normally release any memory assigned to a process once the process finishes.
So, memory leaks are less of an issue for programs that are short lived, but can become
especially problematic for processes that keep running in the background. Even worse than
these, are memory leaks caused by a device driver, or the OS itself.
We can use a memory profiler to figure out how the memory is being used. As with
debuggers, we'll have to use the right profiler for the language of the application. For
profiling C and C++ programs, we'll use Valgrind, which we mentioned in an earlier video.
For profiling Python, there are a bunch of different tools at our disposal, depending on
what exactly we want to profile.
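One option that ships with Python itself is the tracemalloc module; a minimal sketch with a
deliberately wasteful allocation:

    import tracemalloc

    tracemalloc.start()

    # Stand-in for the code we suspect of holding on to too much memory.
    hoard = [bytes(1024) for _ in range(10_000)]

    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.statistics("lineno")[:3]:
        print(stat)  # the source lines that allocated the most memory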
Another resource that might need our attention is the disk usage of our computer.
Programs may need disk space for lots of different reasons. Installed binaries and libraries,
data stored by the applications, cached information, logs, temporary files or even backups.
It's common for the overall performance of the system to decrease as the available disk
space gets smaller. Data starts getting fragmented across the disk, and operations become
slower. When a hard drive is full, programs may suddenly crash while trying to write
something to disk and finding out that they can't. A full hard drive might even lead to data
loss, as some programs might truncate a file before writing an updated version of it, and
then fail to write the new content, losing all the data that was stored in it before.
If it's a user machine, it might be easily fixed by uninstalling applications that aren't used, or
cleaning up old data that isn't needed anymore. But if it's a server, you might need to look
more closely at what's going on. To figure this out, you want to look at how the space is
being used and what directories are taking up the most space, then drill down until you find
out whether large chunks of space are taken by valid information or by files that should be
purged.
In other cases, the disk might get full due to a program generating large temporary files, and
then failing to clean those up.
Large temporary files may remain behind if an application crashes, since they don't get
cleaned up automatically.
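For keeping an eye on this from a script, Python's standard shutil.disk_usage gives the
totals for the filesystem containing a given path; a small sketch with a made-up threshold:

    import shutil

    usage = shutil.disk_usage("/")
    percent_free = usage.free / usage.total * 100
    print(f"Free space: {percent_free:.1f}%")

    if percent_free < 10:  # hypothetical threshold for raising an alarm
        print("Warning: the disk is getting full, time to clean up")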
Network Saturation
When you work in IT, you interact with services all over the Internet. At one moment, you
might connect to a service running on your local network and the next use another service
running in a data centre located on a different continent. If your network connection is
good, you might not be able to tell where the website you're browsing is hosted. But if
you're dealing with a network service that isn't exactly up to speed, you might
need to get more details about the connection you're using.
The two most important factors that determine the time it takes to get the data over the
network are the latency and the bandwidth of the connection.
The latency is the delay between sending a byte of data from one point and receiving it on
the other. This value is directly affected by the physical distance between the two points
and how many intermediate devices there are between them. The bandwidth is how much
data can be sent or received in a second. This is effectively the data capacity of the
connection. Internet connections are usually sold by the amount of bandwidth the customer
will see.
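As a rough worked example with illustrative numbers, the transfer time is approximately
latency + size / bandwidth:

    latency_s = 0.05               # 50 ms of latency
    size_bits = 250 * 10**6 * 8    # a 250 MB file, in bits
    bandwidth_bps = 100 * 10**6    # a 100 Mbps connection

    transfer_s = latency_s + size_bits / bandwidth_bps
    print(f"about {transfer_s:.1f} seconds")  # roughly 20 seconds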
Computers can transmit data to and from many different points of the Internet at the same
time, but all those separate connections share the same bandwidth. Each connection will
get a portion of the bandwidth, but the split isn't necessarily even. If one connection is
transmitting a lot of data, there may be no bandwidth left for the other connections. When
these traffic jams happen, the latency can increase a lot because packets might get held
back until there's enough bandwidth to send them.
There are limits to how many connections a single server can handle; once that limit is
reached, new connections will be prevented.
To know more about managing resources follow the links given below:
https://realpython.com/python-concurrency/
https://hackernoon.com/threaded-asynchronous-magic-and-how-to-wield-it-bba9ed602c32
https://www.pluralsight.com/blog/tutorials/how-to-profile-memory-usage-in-python
https://www.linuxjournal.com/content/troubleshooting-network-problems
As humans, we want to make sure that we spend our time doing meaningful activities, like
work that we enjoy, and earning the satisfaction of a job well done. When working, we need
to optimize the time we spend to bring the most value to the company.
The Eisenhower Decision Matrix: when using this method, we split tasks into two different
categories: urgent and important.
Some tasks are important, but not urgent, so they need to get done at some point even if it
takes a while to complete them. Some tasks might seem urgent, but aren't really important.
A lot of the interruptions that we need to deal with are in this category. Answering email,
phone calls, texts, or instant messages feels like something that we need to do right away,
but most of the time it's not really the best use of our time.
Spending time on long-term tasks might not bear fruit right away, but it can be critical when
dealing with a large incident.
Technical debt is the pending work that accumulates when we choose a quick-and-easy
solution instead of applying a sustainable long-term one.
If you work independently, you can try to establish a set of hours when users can expect to
reach you for normal requests, and the rest of the time only be available for emergencies.
The key here is to have a window of time reserved when you're not going to be interrupted.
That's the time when you can get the most important tasks done when you can fully
concentrate on dealing with complex issues and finding solutions for tricky problems.
The point is to have all the tasks listed in one place to avoid depending on your not always
perfect memory later. Once you have the list, you can check the real urgency of the tasks.
Ask yourself, if any items don't get done today will something bad happen? If yes, then
those should be worked on first. Once you're done with the most critically urgent tasks, you
can look at the rest of the list and assess the importance of each issue.
If possible, try to start with the larger, most important tasks to get those out of the way first.
But as we called out, when our work involves IT support, we know that we'll have to deal
with interruptions. And working on complex tasks while getting interrupted can be very
frustrating.
Taking breaks is important because it allows our creative minds to stay fresh, and working
on a fun side project can help us research emerging technologies and come up with new
ideas.
You have to roughly assess the amount of time to give to the tasks you want to accomplish
that are both urgent and important. No matter how detailed we are, the final estimate
won't ever exactly match the time the task takes, but it will give us a rough idea of whether
we can complete it in a few hours, days, weeks, or months.
Spare drives are a practical shortcut that can quickly replace hard drives in a pre-fail state.
https://blog.rescuetime.com/how-to-prioritize/
Brian Kernighan, one of the first contributors to the Unix operating system and co-author of
the famous C programming language book, once said, “everyone knows that debugging is
twice as hard as writing a program in the first place. So, if you're as clever as you can be
when you write it, how will you ever debug it?”
This is a warning against writing complicated programs. If the code is clear and simple, it will
be much easier to debug than if it's clever but obscure.
Network Attached Storage (NAS) products from vendors like NetApp can provide additional
shelves to add more storage as the website's content and users' data increase in size.
It's important to focus on building systems and applications that are simple and easy to
understand. So that when something goes wrong, we can figure out how to fix them quickly.
If you're writing code, try writing the tests for the program before the actual code to help
you keep focus on your goal. If you're building a system or deploying an application, having
documentation that states what the end goal should be, and the steps you took to get there
can be really helpful.
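As a tiny illustration of writing the test first, here's a sketch with Python's built-in
unittest module and a made-up conversion function:

    import unittest

    def to_celsius(fahrenheit):
        return (fahrenheit - 32) * 5 / 9

    class TestToCelsius(unittest.TestCase):
        # Written before the function: it pins down what "done" means.
        def test_freezing_point(self):
            self.assertEqual(to_celsius(32), 0)

    if __name__ == "__main__":
        unittest.main()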
We called out at the beginning of this that solving technical problems is a bit of an art, and
that it can be fun when things finally click together. You might still find yourself facing an
issue that you have no idea what to do about, and that's okay. If you're in a sticky situation,
the main thing to do is to remain calm. We need our creative skills to solve problems, and
the worst enemy of creativity is anxiety. So, if you feel that you're out of ideas, it's better to
take your mind off the problem for a while.
Sometimes a change of scenery is all we need for a new idea to pop up and help us figure
out what we're missing, true in coding and in life.
If the problem you're trying to solve is complex and affects a lot of people, it can get really
stressful to try to fully debug it with everyone waiting on you. That's why it's better to focus
first on the short-term solution, and then look for the long-term remediation once those
affected are able to get back to work. And don't be afraid to ask for help. Sometimes just
the act of explaining the problem to someone else can help us realize what we're missing.
Even if you're not in charge of the development of the software, you can still run automatic
tests whenever there's a new version, just to check if it still works as expected. So make sure
you perform these tests whenever a new version of the application comes around. Finally,
regardless of whether the bug came from software that you wrote or someone else wrote,
make sure that you document the key pieces of what you did, how you diagnosed the issue,
and how you squashed it. That way, if the issue happens again, you or whoever else needs
to deal with it will be able to quickly apply the solution, instead of spending valuable time
investigating.
https://simpleprogrammer.com/understanding-the-problem-domain-is-the-hardest-part-of-programming/
https://blog.turbonomic.com/blog/on-technology/thinking-like-an-architect-understanding-failure-domains
https://landing.google.com/sre/sre-book/chapters/effective-troubleshooting/