If you are a student completing this project as part of a class at Allegheny
College, you can check the schedule on the course web site for the due date or
ask the course instructor for more information about the due date. You can also
find the deadline for the project, as reported by GitHub Classroom, by clicking
the grey box at the top of this file. Please furthermore note that the content
provided in the README.md file for this GitHub repository is an overview of
the project and thus may not include the details for every step needed to
successfully complete every project deliverable. This means that you may need to
schedule a meeting during the course instructor's office hours to discuss
aspects of this project.
Even though the course instructor will have covered all of the concepts central to this project before you start to work on it, please note that not every detail needed to successfully complete the assignment will have been covered during prior classroom sessions. This is by design: an important skill that you must practice as a security engineer is to search for, understand, and ultimately apply the technical content found in additional resources.
This project invites you to implement and use a program called programtracer
that can produce a detailed trace of a program's execution in the service of
automated malware analysis. As explained by CrowdStrike in the article entitled
"10 Malware Detection Techniques", there are several different, yet often
complementary, techniques for detecting malware on a system. This project
invites you to explore dynamic malware analysis: you will write a
programtracer that observes the execution of a Python program and then records
a program trace that a malware analyst could study to better understand the
program's behavior and perhaps extract a behavior signature that could be used
to detect it in the future. The programtracer should also be able to perform a
rudimentary analysis of a saved trace and compare two different traces that it
previously produced.
After cloning this repository to your computer, please take the following steps to get started on the project:
- Make sure that you are using a recent version of Python 3.12 to complete this assignment by typing `python --version` in your terminal; if you are not using a recent version of Python, please upgrade before proceeding.
- Make sure that you are using a recent version of Poetry 1.8 to complete this assignment by typing `poetry --version` in your terminal; if you are not using a recent version of Poetry, please upgrade before proceeding.
- Before moving to the next step, you may need to again type `poetry install` one or more times in order to avoid the appearance of warnings when you next run the programtracer program.
Please note that you are invited to complete all of the background research,
implementation, and experimentation needed to implement and use the
programtracer, as outlined further in the following subsections.
Your programtracer will take as input a Python program and/or a Python
program's test suite, and then produce a detailed trace of the program's
behavior. The trace should record all of the details about the specific
instructions that were run during the execution of the Python program. To learn
more about program tracing, please consult the following references organized
into the following technical categories:
- Concepts: Introduction to the technical concepts of program analysis and dynamic malware analysis.
- Packages: Built-in packages for program tracing in Python
- Tools: Tools for program tracing in Python
After reading all of the background research and exploring the references that the prior section provides, you should pick a small Python program and Pytest test suite (perhaps even one that you wrote yourself), attempt to run it, and then produce a trace of each line of source code that the test suite ran in the Python program. The trace that your tool produces should include all of the executed instructions at the level of the abstract syntax tree (AST), the Python source code, and/or the bytecode produced by the Python interpreter. Whenever possible, the trace should also include the values of variables that were accessed by each of the detected instructions. Finally, the trace should be stored in a file in either a plaintext, comma-separated value (CSV), or JavaScript object notation (JSON) format.
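
As a starting point for this first experiment, the following is a minimal sketch of line-level tracing built on the standard library's `sys.settrace` hook. The names `trace_lines`, `save_trace`, and `total`, as well as the record layout (`file`, `function`, `line`, `source`, `locals`), are assumptions made for illustration and are not part of the provided source code. Your own tracer may record different details or use a different mechanism.

```python
import json
import linecache
import sys
from pathlib import Path


def trace_lines(func, *args, **kwargs):
    """Run func and record each source line executed in it and its callees."""
    records = []

    def tracer(frame, event, arg):
        # "line" events fire just before a source line is executed
        if event == "line":
            filename = frame.f_code.co_filename
            lineno = frame.f_lineno
            records.append(
                {
                    "file": filename,
                    "function": frame.f_code.co_name,
                    "line": lineno,
                    # empty string when the source file is not available
                    "source": linecache.getline(filename, lineno).strip(),
                    # repr() keeps every recorded value JSON-serializable
                    "locals": {
                        name: repr(value)
                        for name, value in frame.f_locals.items()
                    },
                }
            )
        # returning the tracer keeps line-level tracing on for this frame
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        # always restore whatever trace function was active before
        sys.settrace(previous)
    return result, records


def save_trace(records, path):
    """Store the trace as JSON (hypothetical layout: a list of records)."""
    Path(path).write_text(json.dumps(records, indent=2), encoding="utf-8")


def total(numbers):
    """Small hypothetical program to trace."""
    result = 0
    for number in numbers:
        result += number
    return result


if __name__ == "__main__":
    value, trace = trace_lines(total, [1, 2, 3])
    save_trace(trace, "trace.json")
    print(value, len(trace))
```

Note that this sketch records a snapshot of all local variables at each line rather than only the variables that the line actually reads or writes; narrowing the snapshot to the accessed variables (for example, by inspecting the AST of each line) is left as a design decision.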
Once you have an implementation that is working for a small Python project, you
should create a complete implementation of the programtracer project, using
the main.py file to implement the command-line interface (CLI) for the
program. As you add features to your tool you should confirm that it works for
progressively larger Python programs and test suites. You should then implement
the following features into your programtracer tool:
- Command-Line Interface: The programtracer should have a command-line interface that accepts the name of a Python program and/or a Python program's test suite and then performs the program tracing when the tests run on the program.
- Program Tracing: The programtracer should trace the execution of a Python program and save the trace in a suitable format in a specified directory and file.
- Variable Tracking: The programtracer should track the values of variables as they are referenced by the specific instructions in the program's source code.
- Trace Analysis: The programtracer should be able to analyze the trace by reporting information about, for instance, the number of instructions in the trace, the number of times each instruction was executed, the number of times a variable is accessed by instructions, and the number of unique values stored in the variables accessed by the instructions.
- Trace Comparison: The programtracer should be able to compare two traces and surface the similarities and differences between them. This feature would be useful in the context of malware analysis to compare the behavior of a new program to a well-known malware program.
- Efficiency Analysis: The programtracer should offer at least two efficiency analysis features that involve measuring the performance of tasks such as creating the trace, saving the trace, analyzing one or more traces, or the size of the traces when either stored in memory or on disk.
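
The four counts named in the Trace Analysis feature can be sketched with `collections.Counter`. This is only an illustration: the function name `analyze_trace` and the record layout (a list of dictionaries with `file`, `line`, and `locals` keys) are assumptions, and counting how often a variable appears in a snapshot of local variables only approximates how often instructions access it.

```python
from collections import Counter, defaultdict


def analyze_trace(records):
    """Summarize a trace given as a list of dicts (hypothetical layout)."""
    # how many times each (file, line) location was executed
    line_counts = Counter((r["file"], r["line"]) for r in records)
    variable_counts = Counter()
    unique_values = defaultdict(set)
    for record in records:
        for name, value in record.get("locals", {}).items():
            # one observation per record in which the variable is visible
            variable_counts[name] += 1
            unique_values[name].add(value)
    return {
        "total_instructions": len(records),
        "executions_per_line": dict(line_counts),
        "variable_observations": dict(variable_counts),
        "unique_values_per_variable": {
            name: len(values) for name, values in unique_values.items()
        },
    }


# hand-written sample trace used only to demonstrate the function
sample = [
    {"file": "demo.py", "line": 2, "locals": {"numbers": "[1, 2]"}},
    {"file": "demo.py", "line": 3, "locals": {"numbers": "[1, 2]", "result": "0"}},
    {"file": "demo.py", "line": 4,
     "locals": {"numbers": "[1, 2]", "result": "0", "number": "1"}},
    {"file": "demo.py", "line": 3,
     "locals": {"numbers": "[1, 2]", "result": "1", "number": "1"}},
    {"file": "demo.py", "line": 4,
     "locals": {"numbers": "[1, 2]", "result": "1", "number": "2"}},
]

if __name__ == "__main__":
    print(analyze_trace(sample))
```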
You should aim to fully implement all of these features as a part of your
programtracer tool. If you are not able to implement a specific feature, then
you must both document the steps that you took and explain why it was not
possible to fully implement the feature in the writing/reflection.md file.
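
To tie these features together in main.py, one possible layout for the command-line interface uses one subcommand per task. The sketch below relies on the standard library's `argparse`; the subcommand names (`trace`, `analyze`, `compare`), the `--output` option, and its default value are all hypothetical, and the provided main.py may already depend on a different CLI library, in which case you should follow that library instead.

```python
import argparse


def build_parser():
    """Build a hypothetical CLI with one subcommand per programtracer task."""
    parser = argparse.ArgumentParser(prog="programtracer")
    subcommands = parser.add_subparsers(dest="command", required=True)

    # trace a program or test suite and save the result
    trace = subcommands.add_parser("trace", help="trace a program or test suite")
    trace.add_argument("target", help="path to a Python program or test suite")
    trace.add_argument("--output", default="trace.json", help="trace file to write")

    # report statistics about one saved trace
    analyze = subcommands.add_parser("analyze", help="summarize a saved trace")
    analyze.add_argument("trace_file", help="path to a saved trace")

    # report similarities and differences between two saved traces
    compare = subcommands.add_parser("compare", help="compare two saved traces")
    compare.add_argument("first_trace", help="path to the first saved trace")
    compare.add_argument("second_trace", help="path to the second saved trace")
    return parser


if __name__ == "__main__":
    arguments = build_parser().parse_args()
    print(arguments)
```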
To evaluate the programtracer tool, you should conduct an experiment that
(loosely) follows the following steps:
- Select a Python Program and Test Suite: Choose at least five small- to medium-sized Python programs and their corresponding Pytest test suites. Make sure that these are all programs that you did not implement yourself. Aim to strike a balance between programs that are realistic and programs that are small enough that you can feasibly analyze and understand their traces.
- Run the programtracer Tool: Execute the programtracer tool on each of the selected Python programs and its test suite. Ensure that the tool generates a trace file in the specified format (i.e., plaintext, CSV, or JSON).
- Verify the Trace Output: For each selected program and its test suite, manually inspect the majority of the trace file to verify that it accurately records the program's execution. Check that the trace includes details such as executed instructions, variable values, and any other relevant runtime information.
- Analyze the Trace: For each selected program and its test suite, use the programtracer tool's analysis features to gather information about the trace. This includes:
  - The number of instructions in the trace.
  - The number of times each instruction was executed.
  - The number of times variables were accessed by instructions.
  - The number of unique values stored in the variables accessed by the instructions.
- Compare Traces: After making a change to the source code of each Python program, run the programtracer tool on it. Manually compare the traces that arise from this modified program and the original to identify the similarities and differences in their execution behavior. You could imagine that this is the step that a malware analyst would take to (a) compare the behavior of a new program to a well-known malware program or (b) compare the behavior of a program before it was infected with malware to after it was infected.
- Efficiency Analysis: For each selected program and its test suite, time the execution of the programtracer tool when it is completing tasks such as creating the trace, saving the trace, and analyzing the trace. Record the size of the trace files when stored in memory and on disk.
- Collect Data: Collect all relevant data from the analysis and efficiency measurements. Ensure that the data is well-organized and clearly labeled and add it to the writing/reflection.md file.
- Report Results: Summarize the findings from the experiment in a report. The report in the writing/reflection.md file should include:
  - An overview of the selected Python programs and test suites.
  - A description of the trace output and its verification.
  - Results from the trace analysis, including any notable patterns or insights.
  - A comparison of different traces, highlighting key differences.
  - Efficiency analysis results, including performance metrics and trace file sizes.
  - Any challenges encountered during the experiment and how they were addressed.
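
The comparison and efficiency steps above can be combined in a small sketch that diffs two traces with `difflib.SequenceMatcher` and times the comparison with `time.perf_counter`. The names `compare_traces`, `timed`, and `trace_sizes`, the record keys (`function`, `source`), and the two sample traces are assumptions made for illustration. The in-memory figure comes from `sys.getsizeof`, which reports only the shallow size of the list, so treat it as a lower bound.

```python
import difflib
import json
import sys
import time


def compare_traces(first, second):
    """Compare two traces, each a list of dicts (hypothetical layout)."""
    # reduce each record to a hashable pair so the sequences can be aligned
    a = [(r["function"], r["source"]) for r in first]
    b = [(r["function"], r["source"]) for r in second]
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    differences = [
        {"change": tag, "first": a[i1:i2], "second": b[j1:j2]}
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    # ratio() is 1.0 for identical traces and 0.0 for disjoint ones
    return {"similarity": matcher.ratio(), "differences": differences}


def timed(task, *args):
    """Run task(*args) and return its result with the elapsed seconds."""
    start = time.perf_counter()
    result = task(*args)
    return result, time.perf_counter() - start


def trace_sizes(records):
    """Report a shallow in-memory size and the size of the JSON encoding."""
    return {
        "in_memory_shallow_bytes": sys.getsizeof(records),
        "serialized_json_bytes": len(json.dumps(records).encode("utf-8")),
    }


# hand-written sample traces used only to demonstrate the functions
original = [
    {"function": "total", "source": "result = 0"},
    {"function": "total", "source": "result += number"},
    {"function": "total", "source": "return result"},
]
modified = [
    {"function": "total", "source": "result = 0"},
    {"function": "total", "source": "result -= number"},
    {"function": "total", "source": "return result"},
]

if __name__ == "__main__":
    report, seconds = timed(compare_traces, original, modified)
    print(report["similarity"], report["differences"], seconds)
    print(trace_sizes(original))
```

A single timing is noisy, so when you record results for the reflection it is reasonable to repeat each measurement several times and report a summary such as the minimum or the median.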
- If you have already installed the GatorGrade program that runs the automated grading checks provided by GatorGrader you can, from the repository's base directory, run the automated grading checks by typing `gatorgrade --config config/gatorgrade.yml`.
- You may also review the output from running GatorGrader in GitHub Actions.
- Don't forget to provide all of the required responses to the technical writing prompts in the writing/reflection.md file.
- Please make sure that you completely delete the `TODO` markers and their labels from all of the provided source code. This means that instead of only deleting the `TODO` marker from the code you should delete the `TODO` marker and the entire prompt and then add your own comments to demonstrate that you understand all of the source code in this project.
- Please make sure that you also completely delete the `TODO` markers and their labels from every line of the writing/reflection.md file. This means that you should not simply delete the `TODO` marker but instead delete the entire prompt so that your reflection is a document that contains polished technical writing that is suitable for publication on your professional web site.