Ba Luethi
Burkhard Stiller
HyperDtct: Hypervisor-Based Ransomware Detection
Supervisor: Jan von der Assen, Chao Feng, Dr. Alberto Huertas Celdran
Date of Submission: July 1st, 2024
University of Zurich
Department of Informatics (IFI)
Binzmühlestrasse 14, CH-8050 Zürich, Switzerland
Bachelor Thesis
Communication Systems Group (CSG)
URL: http://www.csg.uzh.ch/
Summary
Ransomware poses a growing threat to institutions and critical infrastructure. The rise of
Ransomware-as-a-Service enables even users without in-depth technical knowledge to attack
high-value targets, which makes ransomware a lucrative criminal business model. New
detection and classification systems are needed to prevent the execution of ransomware.
Numerous solutions for detecting ransomware exist, but many are potentially insecure
because they reside on the same operating system as the malware. Some solutions use a
hypervisor to isolate and hide their detection system from the ransomware. To gain insight
into the virtual machine and collect the data required to detect malware, many of them
rely on the same library.
This work presents HyperDtct, a novel approach to collecting system calls at the hypervisor
level. HyperDtct uses the collected system calls to detect ransomware. The proposed system
is a sandbox. Various classification and anomaly detection algorithms, as well as feature
selection techniques, are evaluated using thirteen benign and eleven ransomware samples.
These include prolific and harmful samples such as Babuk and LockBit Dark. The results of
these evaluations show that HyperDtct can classify the considered samples as benign or
malicious with a high F1 score of 0.97. In addition, the experiments have shown that
HyperDtct can detect the ransomware samples LockBit and Babuk, previously unknown to the
system, in less than ten seconds.
Abstract
This work proposes HyperDtct, a novel way to collect system calls at the hypervisor level
and detect ransomware based on these collected logs. The proposed system functions as
a sandbox. The experiments conducted throughout this thesis assess various classifier
and anomaly detection algorithms and feature selection techniques using thirteen benign
samples and eleven ransomware samples. This includes prolific and harmful samples such
as Babuk and LockBit Dark. These evaluations indicate that HyperDtct can classify the
considered samples as benign or malicious with a high F1 score of 0.97. Experiments
have also shown that HyperDtct can detect the previously unseen samples LockBit and
Babuk in less than ten seconds.
Acknowledgments
First and foremost, I would like to thank my supervisor, Jan von der Assen, for his
support and guidance, the exciting discussions, and his tremendously helpful suggestions
throughout this thesis. His unwavering enthusiasm and knowledge of the topic greatly
contributed to the success of this work and made the entire process much more engaging.
Furthermore, I would like to extend my gratitude to Prof. Dr. Burkhard Stiller and the
Communication Systems Group for providing me with the opportunity to explore such a
fascinating topic.
Contents
Abstract
Acknowledgments
1 Introduction
   1.1 Motivation
   1.2 Description of Work
   1.3 Thesis Outline
2 Background
   2.1 Hypervisor
      2.1.1 Xen
      2.1.2 Kernel-Based Virtual Machine
      2.1.3 ESXi
      2.1.4 Hyper-V
   2.2 Virtual Machine Introspection
   2.3 Malware
      2.3.1 Ransomware
      2.3.2 Evasive Malware
   2.4 Malware Detection
      2.4.1 Malware analysis
      2.4.2 Detection Techniques
   2.5 Leveraging Machine Learning For Malware Detection
      2.5.1 Performance Metrics
3 Related Work
   3.1 Overview
      3.2.1 BitVisor
      3.2.2 Xen
      3.2.4 ESXi
      3.2.5 Custom
4 Research Opportunities
      4.1.1 BitVisor
      4.1.3 Xen
   4.3 Discussion
5 Architecture
   5.1 Scenario
      5.4.1 Bag-of-Words
6 Implementation
   6.1 Client
   6.2 Controller
   6.3 Assumptions
      6.4.2 Ransomware-PoC
   6.5 Detection
7 Evaluation
      7.3.1 Babuk
      7.3.2 LockBit
      7.3.5 V3 Models
Bibliography
Abbreviations
List of Figures
List of Tables
A Installation Guidelines
   A.1 Controller
   A.2 Client
   A.3 Detection
B Contents of the CD
Chapter 1
Introduction
Ransomware is malware that encrypts files on a device, making them useless to the victim.
It demands a ransom to decrypt and restore access to the files. The earliest appearance of
ransomware can be traced back to 1989 when Joseph Popp created the AIDS ransomware
program. This program was spread as a Trojan on a floppy disk, which was state-of-the-
art at the time. After the malware successfully encrypted the files on the main system
drive, it demanded a ransom payment to be sent to a post office box [1].
Since then, ransomware has become an ever-growing global threat, targeting high-profile
institutions and critical infrastructure, ranging from hospitals and schools to government
agencies. Criminals now have various payment techniques to extract the ransom anony-
mously: various prepaid services and digital currencies, such as Bitcoin, offer anonymity
and are not tied to traceable bank accounts [1].
1.1 Motivation
Because the threat posed by ransomware keeps growing, reliable detection systems are
needed. These detection systems leverage two methods to gather
data for detecting ransomware [7]: static methods try to extract features from the exe-
cutable files without executing them. While this method is secure since the file does not
need to be executed to classify it as benign or malware, it is also easily circumvented by
malware. Dynamic methods, on the other hand, execute the file and collect features from
its runtime behavior. This is harder to evade for malware, as even malware that tries
to obfuscate its behavior must execute some fundamental behavior to reach its malicious
purpose. To leverage the collected data, a machine learning (ML) model can be utilized
to detect malware. ML has been widely deployed in malware detection because it outper-
forms heuristic detection approaches, particularly for unknown and complex malware [8].
Many existing ransomware detection systems run in the same operating system as the
ransomware. This approach can be bypassed if the ransomware elevates its privileges to the
kernel level. Indeed, privilege escalation is attempted and partially achieved by several
ransomware families and samples [9], [10]. A popular approach to keep the detection
system isolated from the operating system is Virtual Machine Introspection (VMI). VMI
enables the collection of behavioral data from outside a virtual machine at the hypervisor
level, which has the advantage of having higher privileges than the kernel mode of the
guest operating systems (OS). Even if a rootkit gains access to the kernel of the guest-level
operating system, the detection tool remains secure.
Various studies use VMI to extract information from virtual machines while keeping their
detectors safe in another VM or machine (cf. Section 3.1). These previous works then
use the collected features to detect whether malware is running in the guest VM using
detection methods ranging from ML to heuristic detection. Various related works rely on
system calls as features [10]–[13], which shows how effective they are for detecting
malware. However, studies detecting malware solely based on system calls rely on the
LibVMI [14] library to collect their behavioral data. Anti-forensic malware could find
and exploit a way to bypass detection in these environments, which would render
system-call-based detection systems built on LibVMI ineffective. Therefore, there is a need to find
new ways to collect system calls of malware samples at the hypervisor level and evaluate
whether this data can be used for detection systems. HyperDbg [15] is a hypervisor-based
debugger that can be used for high-performance and stealthy debugging of user and kernel
applications. The debugger allows for the automation of debugging and monitoring flows
and could, therefore, be used to automatically collect data from malicious samples at
the hypervisor level. The developers have already shown that their debugger is stealthy
enough to monitor a large proportion of evasive malware. However, it remains to be
seen whether the extracted data can be used to detect ransomware.
1.2 Description of Work
This work proposes HyperDtct, a prototypical detection system based on HyperDbg.
HyperDtct is designed and implemented as a proof-of-concept system that demonstrates how
system calls can be collected using HyperDbg to detect and classify benign and malicious
software. The system is implemented as a sandbox designed to execute potentially mali-
cious executables in a safe environment isolated from vulnerable systems.
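To make the idea of using system calls as detection features more concrete, the following minimal Python sketch turns a system call trace into a bag-of-words count vector. The trace, the vocabulary, and the function name `bag_of_words` are illustrative assumptions and do not reproduce HyperDtct's actual data or code.

```python
from collections import Counter

# Hypothetical vocabulary of Windows system calls (assumption for illustration).
VOCABULARY = ["NtCreateFile", "NtReadFile", "NtWriteFile", "NtClose"]


def bag_of_words(trace, vocabulary=VOCABULARY):
    """Count how often each vocabulary entry occurs in a system call trace."""
    counts = Counter(trace)
    # Calls outside the vocabulary are ignored; calls that never occur count as zero.
    return [counts[name] for name in vocabulary]


# Made-up trace resembling a read-and-write loop over two files.
example_trace = [
    "NtCreateFile", "NtReadFile", "NtWriteFile", "NtClose",
    "NtCreateFile", "NtReadFile", "NtWriteFile", "NtClose",
    "NtQuerySystemTime",
]

vector = bag_of_words(example_trace)
print(vector)  # [2, 2, 2, 2]
```

A vector of this kind has a fixed length regardless of how long the trace is, which is what allows traces of different samples to be compared by a model.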
To verify that the data collected with the proposed system can be used for a detection
system, several ML and anomaly detection (AD) algorithms were implemented and fitted
to the collected data. The effectiveness and performance of these models were evaluated
on eleven ransomware samples, ranging from ransomware built for academic purposes to
recent samples of Babuk and LockBit Dark ransomware. Additionally, thirteen benign
samples were used for comparison. The results achieved on all samples indicate that
HyperDtct can successfully be used as a detection system, with the best model achieving
an F1 score of 0.97 when trained and evaluated on all samples.
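For orientation, the F1 score is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts; the counts in the usage example are invented for illustration and are not results from this thesis.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Return the F1 score, the harmonic mean of precision and recall."""
    if true_positives == 0:
        return 0.0  # No correct detections: F1 is defined as zero here.
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Invented counts: 8 ransomware runs detected, 1 false alarm, 2 missed runs.
print(round(f1_score(8, 1, 2), 3))  # 0.842
```

Because the harmonic mean is dominated by the smaller of the two values, a high F1 score requires both few false alarms and few missed detections.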
1.3 Thesis Outline
Following this introduction, Chapter 2 provides background on
how hypervisors work, the process of VMI, relevant malware and how it can be detected,
and finally, how ML can be leveraged for this detection. This knowledge assists the
reader in comprehending the related work described in Chapter 3. Chapter 4 then explores
the resulting research opportunities and elaborates on the VMI capabilities of
different hypervisors. Chapter 5 introduces the reader to the scenario, gives a system
overview, and elaborates on how data is extracted and preprocessed for the models.
Chapter 6 shows how HyperDtct is initially implemented. This implementation is then
evaluated using various ransomware samples and extended in Chapter 7. Finally, the
last chapter summarizes the findings and outlines future work.
Chapter 2
Background
The purpose of this chapter is to provide the information needed to understand this
thesis and its related work. First, different hypervisors, their architectures, and basic
functionality are elaborated. Subsequently, the discussion shifts to the concept of virtual
machine introspection before malware and malware detection are explored. The chapter
culminates with an introduction to machine learning in the context of computer security,
thus providing a comprehensive overview of the background knowledge required for this
thesis.
2.1 Hypervisor
Virtualization allows the resources of a single physical machine to be split up and shared
among several virtual machines (VMs). This is enabled by a hypervisor or virtual machine
monitor (VMM). Hypervisors are responsible for presenting the guest OS with a virtualized
version of the hardware and scheduling its execution on the physical hardware.
The Intel x86 architecture is built around four protection rings, where ring 0 (kernel
mode) is the most privileged and ring 3 (user mode) is the least privileged [16]. Hypervisors
operate with more privileges than kernel mode. Thus, they are said to operate at ring -1,
although this is not an actual protection ring [7].
Two main types of hypervisors are distinguished, Type I and Type II [17]. Type II, or
hosted, hypervisors run on a host OS and use various functions of the underlying OS,
such as its file system, to run their VMs. Examples of this type of hypervisor include
VMware Workstation and Oracle VirtualBox. Type I, or native/bare-metal, hypervisors
run directly on the hardware and thus do not have the layer of a host OS underneath
them. Examples include Xen Project and VMware ESXi. In the subsequent chapters, the
term “hypervisor” is used synonymously with a Type I hypervisor, as this thesis focuses
on Type I hypervisors. The difference between the two types of hypervisors is further
illustrated in Figure 2.1.
[Figure 2.1: Layer diagrams of Type I and Type II hypervisors (labels: Guest OS, Hypervisor, Host OS, Hardware)]
    1. The VMM provides an environment for programs that is essentially identical to the
       original machine
When an OS runs a process, it runs it in user mode, while the OS itself runs in kernel
mode. To satisfy requirement number 3, a guest OS must run in user mode, but to adhere
to requirement number 1, the guest OS should not be aware of this. [18] define two
sets of instructions: Sensitive instructions are those that can only be executed in kernel
mode, while privileged instructions trigger a trap if executed in user mode. A trap is a
synchronous exception intentionally triggered to invoke the OS [19]. When the guest OS
executes a sensitive instruction, which is only allowed in kernel mode, the hypervisor
must therefore trap it and execute it as if the guest OS were running
directly on the hardware while keeping the VM isolated from other VMs possibly running
concurrently on the same hardware [19].
There are various approaches to do this [19]. Trap-and-emulate emulates all kernel-
mode code at the hypervisor level. Before Intel and AMD introduced virtualization in
their CPUs in 2005, this approach was only possible if the sensitive instructions were
a subset of the privileged instructions, as defined by [18]. Because of this, there was
a need for other approaches to virtualization. Binary translation allowed VMs before
hardware-assisted virtualization: The hypervisor replaces sensitive instructions in code
executed on a guest VM at runtime with calls to the hypervisor to handle them. With
various optimizations, this method can operate at nearly full speed. Paravirtualization,
on the other hand, modifies the source code of the guest VMs: Instead of executing
sensitive instructions, the guest OS invokes hypercalls, similar to processes issuing system
calls to the OS. The hypervisor is responsible for providing an application programming
interface (API) to handle said hypercalls. Hardware-assisted virtualization: Intel and
AMD extended most of their CPUs to support hardware-assisted virtualization. This is
done with additional instructions, specifically for virtualization, as well as adding hardware
support for additional page tables required by VMs. With these extensions, the trap-and-
emulate approach becomes possible. Today, most hypervisors discussed in this thesis rely
on hardware support whenever possible.
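The condition for classical trap-and-emulate virtualization described above can be expressed as a simple set check. The following sketch uses a small, hand-picked selection of x86 instructions as an assumption for illustration; it is not a complete classification of the instruction set, and the function names are hypothetical.

```python
# Illustrative subsets of pre-2005 x86 instructions (assumed, not exhaustive).
# Privileged: instructions that trap when executed in user mode.
PRIVILEGED = {"LGDT", "LIDT", "HLT", "MOV_TO_CR3"}
# Sensitive: instructions that should only be executed in kernel mode.
SENSITIVE = {"LGDT", "LIDT", "MOV_TO_CR3", "POPF", "SGDT"}


def non_trapping_sensitive(sensitive, privileged):
    """Return the sensitive instructions that would not trap in user mode."""
    return sensitive - privileged


def supports_trap_and_emulate(sensitive, privileged):
    """Classical trap-and-emulate needs sensitive to be a subset of privileged."""
    return sensitive <= privileged


print(supports_trap_and_emulate(SENSITIVE, PRIVILEGED))      # False
print(sorted(non_trapping_sensitive(SENSITIVE, PRIVILEGED)))  # ['POPF', 'SGDT']
```

In this simplified picture, binary translation and paravirtualization deal with exactly the instructions returned by `non_trapping_sensitive`, whereas hardware-assisted virtualization removes the problem by letting the hypervisor intercept them.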
2.1.1 Xen
The Xen hypervisor refers to its guest VMs as “domains.” It supports two kinds of domains:
Control Domain (Dom0) and unprivileged domains (DomU). Each Xen hypervisor has
exactly one Dom0, which is the first VM it starts and contains all the physical drivers
required for the hardware the system is running on. This VM can also be used to monitor
and manage the other unprivileged VMs [20]. Figure 2.2 illustrates a simplified Xen
architecture and its interaction with the hardware.
[Figure 2.2: Simplified Xen architecture (labels: Dom0 with system services and Dom0 kernel, DomU with applications, Xen, Hardware)]
2.1.2 Kernel-Based Virtual Machine
Kernel-based Virtual Machine (KVM) turns a Linux kernel into a hypervisor.
VMs running on that hypervisor appear as running processes on the host. Contrary
to Xen, the Control Domain (or Dom0) is not on the same level as the guest VMs: The
OS whose kernel was turned into a hypervisor operates below the guest VM, as
portrayed in Figure 2.3 [21]. Because of this, the KVM environment can be interpreted as
a Type II hypervisor (e.g., as [19] did). However, this thesis interprets the entire KVM
environment as a Type I hypervisor because the kernel, which is turned into a hypervisor,
operates directly on the hardware and has no OS underneath it.
[Figure 2.3: KVM architecture (labels: KVM guest with applications and guest OS, KVM, Linux kernel with KVM, native driver, Hardware)]
2.1.3 ESXi
Similarly to KVM, VMware ESXi does not run a privileged VM but instead implements
a microkernel at the hypervisor level. This means that while it has the same structure as
KVM, contrary to KVM, the footprint of an ESXi hypervisor is smaller, as it implements
only the minimal mechanisms of an OS [22]. Also, contrary to the hypervisors mentioned
above, ESXi is proprietary software.
2.1.4 Hyper-V
Like ESXi, Microsoft Hyper-V is proprietary software distributed for free with Windows
and Windows Server. VMs running on a Hyper-V hypervisor are called “partitions,” and
similarly to Xen, there is a privileged Root Partition and possibly multiple unprivileged
Child Partitions. The Child Partitions are managed and created from the Root Partition
and can be further distinguished into enlightened and unenlightened: Enlightened Parti-
tions are aware that they run in a virtualized environment. Thus, they accelerate their
I/O operations by accessing hardware directly over VMBus and bypassing the slower em-
ulated hardware. On the other hand, unenlightened Partitions access resources over the
hypervisor [23]. The difference between enlightened and unenlightened Child Partitions
is visualized in Figure 2.4.
[Figure 2.4: Hyper-V architecture (labels: Root Partition, Enlightened Child Partition, Unenlightened Child Partition, VMBus, Hyper-V, Hardware)]
2.2 Virtual Machine Introspection
VMI is a method that allows obtaining internal state information from a guest VM from
outside of said VM: The security monitor is located at the hypervisor level and inspects
a guest VM. This approach has two benefits [24]. First, the monitor is isolated from
the guest VM, making tampering with the detection system more difficult for malicious
software. Second, because the hypervisor has access to the hardware state of the VM,
the monitor can obtain information about the VM’s activities. The concept was first
introduced by [25]: The authors created Livewire, which detects intrusions in a modified
version of VMware Workstation.
Because the required data is collected at the hypervisor level, below the high-level
abstractions provided by the OS (e.g., processes and files), there is a
semantic gap between the low-level data collected and the desired higher-level view of the
OS. This semantic gap is one of the biggest challenges of VMI [26]. “Bridging the semantic
gap” refers to deriving the higher-level view of the introspected OS from low-level data
collected at the hypervisor level.
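The semantic gap can be illustrated with a small sketch: the hypervisor only sees raw bytes of guest memory, and knowledge of the guest's data layout is needed to turn them into a process entry. The record layout below (a 4-byte PID followed by a 16-byte name) is a hypothetical simplification and not the structure of any real OS.

```python
import struct
from typing import NamedTuple

# Hypothetical layout: little-endian uint32 PID, then a 16-byte NUL-padded name.
PROCESS_RECORD = struct.Struct("<I16s")


class ProcessInfo(NamedTuple):
    pid: int
    name: str


def parse_process_record(raw: bytes) -> ProcessInfo:
    """Interpret raw guest-memory bytes as a process record."""
    pid, name_bytes = PROCESS_RECORD.unpack(raw[:PROCESS_RECORD.size])
    # Cut the name at the first NUL byte, as the padding carries no meaning.
    name = name_bytes.split(b"\x00", 1)[0].decode("ascii", errors="replace")
    return ProcessInfo(pid, name)


# Low-level view: 20 opaque bytes as they might be read from guest memory.
raw_memory = struct.pack("<I16s", 4242, b"notepad.exe")
print(raw_memory.hex())
# High-level view after applying the layout knowledge.
print(parse_process_record(raw_memory))
```

Without the layout encoded in `PROCESS_RECORD`, the 20 bytes carry no recognizable meaning, which is the essence of the gap.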
There are various approaches to bridge the semantic gap, which [27] classified into four
categories, depending on how the semantic information is retrieved:
   • In-VM: Approaches that fall into this category do not bridge the semantic gap.
     Instead, they avoid it by deploying an agent inside the guest VM, which exposes the
     guest OS’ behavior to the hypervisor.
Various research studies were conducted on each category; however, the available technology
depends mainly on the hypervisor. For a concise overview of which tool is available
on which hypervisor, see Section 4.1.
VMI enables the investigation of a guest VM without interrupting it. Because of its
independence from the monitored OS, VMI assists in malware collection, malware analysis,
intrusion detection, intrusion prevention, stealthy debugging, cloud security, and mobile
security [28].
2.3 Malware
Malware is short for malicious software, while benign software refers to harmless
software. Malware is code running on a computerized system whose presence or behavior
the system administrators are unaware of; if they were aware of the code and its behavior,
they would not permit it to run. Malware compromises the confidentiality, integrity, or
availability of the system by exploiting existing vulnerabilities in a system or by creating
new ones [7].
Zero-day vulnerabilities (or zero-days) are those vulnerabilities for which no patch or
fix has been publicly released and where the vendor might not even be aware of the
vulnerability yet. Several parties are interested in information about these vulnerabilities,
such as software vendors and cybercriminals, and finding such information can fetch a
great deal of money [29].
Various types of malware exist; however, this thesis focuses on ransomware,
which is explained in its own subsection. Ransomware employs multiple countermeasures
against detection and recovery systems, such as rootkits, which elevate the malware’s
privileges. A hypervisor-based analysis environment is safe from this, as the hypervisor
operates below the kernel of the guest VM; however, certain rootkits pose a threat to
this environment: The firmware operating a device, such as a USB device, can be replaced
with malicious firmware that exposes the hypervisor and enables malicious operations on
it [30]. After elaborating on ransomware, further evasion techniques employed by malware
are explored.
2.3.1 Ransomware
Ransomware can be categorized into four main forms [31]. Scareware informs the user
that they have been infected with malware. This information is usually delivered by a fake
antivirus interface, which includes offers to purchase software to remove the malware. The
scareware is removed as soon as the user purchases this software and runs it. Contrary
to other forms of ransomware, most scareware does no damage to the system and instead
hopes to extract money from users by simply scaring them [7]. Locker ransomware locks
the user out of the computer and tries to block access to the device or an application.
This form of ransomware is generally easy to overcome by rebooting in safe mode or
running an on-demand virus scanner [32]. Leakware or Doxware threatens to publish
users’ data unless a ransom is paid [31]. Crypto ransomware is the most common type of
ransomware and the one focused on in this thesis. It aims to encrypt data important to
victims but does not interfere with essential computer functions. The data is encrypted
using cryptographic algorithms such as AES and RSA [32]. Because data encrypted with these
algorithms is practically impossible to recover without the key if they are implemented
correctly, ransomware must be detected and stopped before it can encrypt too many files.
In most cases, ransomware cannot be assigned to exactly one of these categories, as many
families of ransomware combine aspects of the forms above. For example, WannaCry
encrypts data and threatens victims with releasing their data publicly unless the ransom
is paid [31]. Many ransomware families also pressure victims by accusing them of a crime
or setting a deadline by which the victim must pay ransom to recover their data [1].
Ransomware attacks typically follow a sequence, which can be split up into six primary
stages [33]:
  1. Distribution: The malware is spread to the targeted device in this phase. Several
     attack vectors exist, such as phishing emails or malicious websites.
  2. Infection: The executable malicious code is downloaded, and the ransomware installs
     itself on the targeted device.
  4. Scanning: The ransomware scans the targeted device and its networks for files to
     encrypt.
  5. Encryption: This is the stage where the encryption of the files begins. The malware
     encrypts all previously identified files while the user is unaware of what is happening.
  6. Payday: The victim is informed about what happened, and a ransom is demanded.
2.3.2 Evasive Malware
Malware authors actively try to prevent the detection of their software and use various
approaches to achieve this. [34] have developed a compiler that automatically exports
critical system calls to separate processes, thus splitting up suspicious system call patterns
across multiple processes. Their results showed that malware compiled with their custom
compiler could evade detection by real-world behavioral detection tools. With more research
leveraging learning models to detect malware, evading such models
has also become a focus of malware authors. Depending on the features used by the
model to detect malware, evading it becomes easy: [35] have shown that performing a few
Anti-forensic malware is a subcategory of evasive malware equipped with methods and
tools to obstruct investigations. These might include avoiding detection or detecting the
presence of a forensic tool to alter the behavior accordingly [37]. This kind of evasive
malware tries to identify whether it is executed in a forensic environment, such as
under a debugger, in a VM, or inside a sandbox. If, for example, a VM is detected, the
malware might alter its behavior by not executing or by trying to infect the host. There are
many ways to identify whether the current OS is running as a VM, such as identifying
special instructions, timing measurements, and OS markers [38].
2.4 Malware Detection
To counter these evasion techniques, malware detection systems employ various
methods. Detecting malware can be separated into three
stages [8]: malware analysis, feature extraction, and classification.
2.4.1 Malware analysis
Malware analysis is the process of extracting data from a malicious sample. Malware can
be analyzed using dynamic or static analysis: While static analysis extracts
information without executing the software, dynamic analysis extracts information from
the running software [39].
The static analysis approach relies on attributes found in a sample, which can be extracted
without executing the sample. Examples of such attributes include dynamic-link library
(DLL) imports, hex dumps, and assembly code, which [40] used to train and evaluate their
detection model. The advantage of this approach is that it is fast. However, detectors
relying only on static analysis can be easily evaded: Various code obfuscation techniques
bypass models trained on static attributes. General code obfuscation techniques aim to
confuse the understanding of how a program functions. These can range from simple
layout transformations to complicated changes in control and data flow [41]. The most
common code obfuscation technique is packing [42]: A packer is a program that transforms
an executable binary file into another configuration using compression and/or encryption
to protect/hide the executable’s original content. While packing can be used with non-
malicious intentions, it is commonly used by malware authors to buy time until detection
or avoid detection altogether.
Dynamic analysis requires several properties to hold to allow the safe collection of relevant
data [7]:
   • Data gathered by the analysis framework must not be compromised by the malware
   • The framework needs to meet the requirements of the malware to execute its
     malicious behavior
Many researchers have proposed hybrid approaches, using static and dynamic features to
leverage the advantages of both techniques when analyzing malware. [43] extracted 261
combined features from one dynamic and static analysis dataset. To test the detection
model that was built, they took 311 application samples consisting of 165 benign apps
from the Play Store and 146 malicious apps from VirusShare. The test results showed
that their hybrid analysis model could increase detection by about 5%.
After the malware has been analyzed and relevant features extracted, different techniques
exist to detect it. Each of these techniques comes with its advantages and disadvantages,
and there is no single solution that fits all scenarios:
The signature-based approach creates signatures of malware executables and stores them in
the detection system's database. To classify an unknown file using traditional
signature-based techniques, its signature is created and compared with the signatures of
known malware in the database. If a match is found, the file is classified as malicious
[39]. Various approaches exist to extract a signature. A signature can be a static feature
of the executable file or a behavioral signature. String scanning, for example, extracts
byte sequences of
malware executables. This technique is extensively used by antivirus scanners such as
ClamAV [8]. [44], on the other hand, leveraged behavior sequences to create behavioral
signatures for polymorphic malware detection. In general, signature-based techniques
have the advantage that they are faster than behavior-based detection. However, these
detection models are also very prone to obfuscation methods and cannot detect zero-day
attacks [8].
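The signature-based workflow described above can be illustrated with a minimal Python sketch. It is not how any particular scanner is implemented: the signature names and byte patterns are made up for illustration, and real products such as ClamAV use much richer signature formats. The last call also hints at why obfuscation defeats this approach.

```python
# Hypothetical signature database: name -> byte sequence (made-up values).
SIGNATURES = {
    "Example.Ransom.A": bytes.fromhex("deadbeef4c6f636b"),
    "Example.Wiper.B": b"WIPE_ALL_SECTORS",
}


def scan(data, signatures=SIGNATURES):
    """Return the sorted names of all signatures whose bytes occur in data."""
    return sorted(name for name, pattern in signatures.items() if pattern in data)


benign = b"MZ\x90\x00 plain program bytes"
infected = b"MZ\x90\x00" + bytes.fromhex("deadbeef4c6f636b") + b" payload"
# A trivially "packed" copy: every byte is XOR-encoded with a fixed key.
obfuscated = bytes(b ^ 0x5A for b in infected)

print(scan(benign))      # no signature matches
print(scan(infected))    # the known byte sequence is found
print(scan(obfuscated))  # the encoded copy no longer contains the sequence
```

The lookup is fast because it only searches for known byte sequences, which is also why a sample that has never been seen, or whose bytes have been transformed, passes unnoticed.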
Behavior-based techniques observe the behavior performed by malware during execution to
detect it. This approach is not as easy to evade by obfuscation as the signature-based
detection approach is: Although the program code might change, the program's behavior will
still be similar [8]. Thus, the majority of new malware can be detected with this method
[45]. Various procedures can be leveraged to detect malware based on behavior: monitoring
system calls, file changes, processes, and network activity, to name a few mentioned in [8].
Disadvantages of the behavior-based technique include the high computational overhead
and the fact that the malware is already executing when it can be detected, potentially
already doing damage [39].
This thesis leverages ML to use the data extracted with VMI. In this context, ML refers
to algorithms and processes that can generalize past data to predict future outcomes.
Supervised ML methods use probabilities of previously observed events to infer the prob-
abilities of new events, while unsupervised methods draw abstractions from unlabeled
datasets and apply these to new data. On the other hand, AD is applied where a significant
class imbalance is present, with most samples being "normal" while only a tiny
percentage are outliers [46]. In the context of this work, the terms anomaly and outlier
are used as synonyms.
Related work uses various supervised ML algorithms to classify malware. A broad overview
of algorithms used in related work or this thesis is given here, based on [46]:
     • Naive Bayes (NB): The NB classifier is one of the oldest statistical classifiers and is
       called "naive" because it makes the strong statistical assumption that features are
       independent of one another given the class. Even though this assumption often does
       not hold, it is quite effective for problems such as spam classification.
     • Support Vector Machine (SVM): An SVM is a linear classifier that tries to separate
       two classes in the dataset by producing a hyperplane in the vector space. It uses a
       hinge loss function, which penalizes the points on the wrong side of the hyperplane
       or very near it on the correct side.
     • K-Nearest Neighbors (KNN): KNN is a lazy learning algorithm that puts off most
       computations to prediction time instead of training time. Instead of generalizing
       training data, KNN stores them in the model and predicts a result based on the
       most common label among the test sample's k nearest neighbors.
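To make the KNN description concrete, the following Python sketch classifies a sample by majority vote among its k nearest training points. The two-dimensional feature vectors and the labels are invented for illustration and do not stem from any dataset used in this thesis.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Return the most common label among the k training points nearest to query.

    train is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Made-up feature vectors, for illustration only.
train = [
    ((0.1, 0.2), "benign"), ((0.2, 0.1), "benign"), ((0.3, 0.3), "benign"),
    ((0.9, 0.8), "ransomware"), ((0.8, 0.9), "ransomware"), ((1.0, 1.0), "ransomware"),
]

print(knn_predict(train, (0.85, 0.85)))
print(knn_predict(train, (0.15, 0.25)))
```

Note that no work happens before `knn_predict` is called: the "model" is the stored training list, which is what makes KNN a lazy learner.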
Unsupervised learning algorithms, on the other hand, offer the promise to skip the labori-
ous step of feature engineering: Instead of carefully selecting features to feed the learning
algorithm as with supervised learning, unsupervised learning algorithms extract the fea-
tures themselves [46]. Related work used the following unsupervised ML algorithms [47]:
   • Deep Neural Network (DNN): DNNs are feed-forward networks, which means data
     flows through the graph composed of nodes and edges without forming a cycle. A
     DNN consists of an input layer, n hidden layers, and an output layer. Figure 2.5
     provides an overview of the architecture.
   • Convolutional Neural Network (CNN): CNNs are primarily used in image process-
     ing. CNNs are structured similarly to DNNs; however, at least one of the hidden
     layers performs convolution. To use CNNs in malware analysis, collected data must
     be encoded as images [48].
[Figure 2.5: DNN architecture with input nodes X1 to Xp, hidden nodes h1 to hn, and output nodes O1 to Oc]
The possibility of avoiding feature selection is very tempting; however, [49] found that RF
accuracy outperforms the accuracy of DNN models in malware classification using various
feature sets.
problem. AD makes use of both supervised and unsupervised ML, as well as other tech-
niques to detect outliers. AD can further be separated into novelty detection, where the
training data is unpolluted by anomalies, and outlier detection, where it is assumed that
the data is contaminated by some anomalies [46]. This work considers the following al-
gorithms for AD, based on [46]. Although they are based on supervised ML algorithms,
the modifications to these algorithms in the context of AD give them characteristics of
unsupervised learning:
     • Isolation Forest (IForest): The algorithm of IForest iterates through data points
       in the training set, randomly selects a feature, and then randomly selects a split
       value in the range between the maximum and minimum of the selected feature in
       the dataset. Then, the number of such splits to isolate a single sample is considered,
       with the intuition being that anomalous samples require fewer splits than inliers to
       be isolated.
     • Local Outlier Factor (LOF): LOF classifies anomalies using the local density around
       a sample. This means it measures the concentration of other points in the immediate
       surrounding region, where the size of this region can be defined as either a fixed
       distance or the closest n neighbors. Samples with a significantly lower local density
       than their neighbors are considered anomalies.
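The intuition behind IForest can be sketched in plain Python. The code below is a simplification under stated assumptions: it does not build and store trees or subsample the data as a full implementation would, but instead follows a single point through repeated random splits and counts how many splits are needed to isolate it. The data points are made up (a small grid of inliers and one distant outlier), and LOF is not covered by this sketch.

```python
import random


def isolation_depth(point, data, rng, depth=0, max_depth=20):
    """Count the random splits needed until point is alone in its partition."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    # Only features whose values still differ inside the partition can split it.
    features = [f for f in range(len(point))
                if min(row[f] for row in data) < max(row[f] for row in data)]
    if not features:
        return depth
    f = rng.choice(features)
    split = rng.uniform(min(row[f] for row in data), max(row[f] for row in data))
    # Keep only the rows that fall on the same side of the split as the point.
    same_side = [row for row in data if (row[f] < split) == (point[f] < split)]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)


def mean_isolation_depth(point, data, n_trees=200, seed=0):
    """Average the isolation depth of point over n_trees random split sequences."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees


# Made-up data: a 5x5 grid of inliers plus one distant outlier.
inliers = [(x / 10, y / 10) for x in range(5) for y in range(5)]
outlier = (5.0, 5.0)
data = inliers + [outlier]

outlier_depth = mean_isolation_depth(outlier, data)
inlier_depth = mean_isolation_depth((0.2, 0.2), data)
print(outlier_depth, inlier_depth)
```

In this toy dataset, most random splits fall into the wide gap between the grid and the outlier, so the outlier is usually isolated after one split, whereas the grid's center point always needs at least four.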
But how can the performance of different algorithms be compared? All following defini-
tions are taken from [46]. A straightforward approach to comparing two models is the
cost function, computing the cost of the two models on the same dataset and choosing
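the lower-cost one. Assuming a binary (positive, negative) classification scenario, the
predictions of a model can fall into four possible pairs of predicted and actual class:
     • True positive (TP): predicted positive, actually positive
     • False positive (FP): predicted positive, actually negative
     • True negative (TN): predicted negative, actually negative
     • False negative (FN): predicted negative, actually positive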
To compare binary classifier models using these metrics, one can plot the receiver
operating characteristic (ROC) curve. This curve plots the false positive rate (FPR) (cf.,
Equation 2.1) on the x-axis and the true positive rate (TPR) (cf., Equation 2.2, also
known as recall) on the y-axis. A truly random classifier would achieve an ROC curve of
the form y = x, while a perfect classifier's ROC curve would enclose an area of 1.0. The
area under the curve (AUC) can be calculated to utilize the ROC curve for performance
comparison. The AUC can be interpreted as the probability that the classifier ranks a
randomly chosen truly positive example above a randomly chosen truly negative
example. This is best illustrated with a random classifier,
where the ROC curve would be y = x. It follows that the AUC is 0.5, which is equivalent
to a random ordering of samples.
\[ \mathrm{FPR} = \frac{FP}{FP + TN} \tag{2.1} \]
\[ \mathrm{TPR} = \frac{TP}{TP + FN} \tag{2.2} \]
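The relation between the ROC curve and the AUC can be checked with a small Python sketch. It computes the AUC twice: once as the area under the ROC points obtained from Equations 2.1 and 2.2, and once as the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The labels and scores are invented for illustration.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs obtained by thresholding at each distinct score."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def auc_pairwise(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = positive) and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(auc_trapezoid(roc_points(labels, scores)))
print(auc_pairwise(labels, scores))
```

Both computations give 8/9 for this example: eight of the nine positive-negative pairs are ordered correctly, and only the positive sample scored 0.6 is ranked below the negative sample scored 0.7.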
The F-score (cf., Equation 2.4) combines precision (cf., Equation 2.3) and recall (cf.,
Equation 2.2) into one metric and harshly penalizes extremes. α is the relative weighting
of precision and recall, where recall is considered α times as important as precision. The
F1 score (α = 1) gives recall and precision the same weight.
\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{2.3} \]
\[ F_\alpha = \frac{1 + \alpha}{\frac{1}{\mathrm{precision}} + \frac{\alpha}{\mathrm{recall}}} \tag{2.4} \]
Another metric often mentioned by related work is classification accuracy, which is the
total proportion of correct labels. For a binary classification model, Equation 2.5 is the
formal definition of the metric.
\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FN + FP} \tag{2.5} \]
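Equations 2.1 to 2.5 translate directly into code. The following Python sketch computes all five metrics from the four confusion counts; the counts in the example are made up and chosen to mimic an imbalanced dataset. The function assumes that none of the denominators is zero.

```python
def classification_metrics(tp, fp, tn, fn, alpha=1.0):
    """Compute FPR, TPR, precision, F-alpha, and accuracy from confusion counts.

    Assumes that every denominator is non-zero.
    """
    precision = tp / (tp + fp)                       # Equation 2.3
    recall = tp / (tp + fn)                          # Equation 2.2 (TPR)
    return {
        "fpr": fp / (fp + tn),                       # Equation 2.1
        "tpr": recall,
        "precision": precision,
        "f_alpha": (1 + alpha) / (1 / precision + alpha / recall),  # Equation 2.4
        "accuracy": (tp + tn) / (tp + tn + fn + fp),  # Equation 2.5
    }


# Made-up counts: 10 positive and 90 negative samples.
m = classification_metrics(tp=8, fp=2, tn=88, fn=2)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is 0.96 while precision, recall, and the F1 score are all 0.8, which illustrates why accuracy alone can look overly favorable when most samples are negative.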
Chapter 3
Related Work
This chapter gives an overview of the work related to this thesis. Table 3.1 first provides
an overview of the related work before elaborating on malware detection using VMI. The
section about VMI is organized according to the hypervisor used by the study due to its
impact on the selection of available introspection tools.
3.1 Overview
Because of the mentioned impact on the selection of available introspection tools, Table
3.1 organizes related research by the hypervisor used, mirroring the structure of this
chapter. Grouping by hypervisor, we see five groups of papers: The first is built upon the
work of [50] and uses BitVisor for malware detection. The groups of Xen and KVM share
similarities in their choice of tools and represent the bulk of the research. Meanwhile,
proprietary hypervisors like ESXi and Hyper-V attract little to no attention from the
research community. Lastly, some researchers implement a custom hypervisor to conduct
their research, similar to [50], allowing for highly tailored research.
Ever since the paper [58] introduced the idea of VMI, various research has been conducted
on malware detection using VMI. Because of their different architecture and licensing,
some hypervisors are better suited as an intrusion detection system than others, which is
noticeable in the amount of research per hypervisor:
3.2.1 BitVisor
Building upon the work of BitVisor, [59] propose WaybackVisor: An extension of BitVisor,
which automatically transfers all I/O logs of SATA drives to a Hadoop cluster. Wayback-
Visor intercepts all write operations by extending the ATA driver at the hypervisor level
to send all intercepted information over the network to a Hadoop cluster. Leveraging
WaybackVisor, [51] intercepted all write operations from WannaCry, TeslaCrypt and the
benign Zip software. As features, the authors extracted the entropy of sectors, the total
number of read sectors, the total number of written sectors, the variance of the Logical
Block Address (LBA) in read requests, and the variance of the LBA in write requests from
the logs created by WaybackVisor. They then generated and evaluated ML models using the
RF, SVM, and KNN algorithms. Their models achieved an F1 score of 0.98, with the KNN
algorithm performing best.
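The five storage features named above can be sketched in Python as follows. The request format, a list of (operation, LBA, payload) tuples with one sector per request, is a simplifying assumption made for this example and is not the log format of WaybackVisor; the function and key names are likewise hypothetical.

```python
import math
from collections import Counter
from statistics import pvariance


def shannon_entropy(data):
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in Counter(data).values())


def window_features(requests):
    """Summarise a window of I/O requests given as (op, lba, payload) tuples."""
    reads = [lba for op, lba, _ in requests if op == "R"]
    writes = [lba for op, lba, _ in requests if op == "W"]
    written = [payload for op, _, payload in requests if op == "W"]
    return {
        "mean_write_entropy": (sum(map(shannon_entropy, written)) / len(written)
                               if written else 0.0),
        "sectors_read": len(reads),
        "sectors_written": len(writes),
        "read_lba_variance": pvariance(reads) if reads else 0.0,
        "write_lba_variance": pvariance(writes) if writes else 0.0,
    }


# Made-up window: two reads, one high-entropy write, one all-zero write.
requests = [
    ("R", 100, b""),
    ("R", 104, b""),
    ("W", 100, bytes(range(256)) * 2),  # every byte value equally frequent
    ("W", 104, b"\x00" * 512),          # a single repeated byte value
]
features = window_features(requests)
print(features)
```

Encrypted data is close to uniformly distributed, so sectors written by ransomware tend toward the upper end of the entropy range, which is the reason entropy appears among the features.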
To enhance WaybackVisor, [52] also add a logging functionality for low-level memory
access patterns. They then collected memory access data for ransomware (WannaCry,
REvil, Darkside), wiper malware (CaddyWiper), and benign software (AESCrypt, Zip,
Excel, PowerPoint, Firefox). When creating and evaluating their ML models (RF, SVM,
KNN), they discovered that they were able to classify running software with an F1 score of
0.93 and detect malware with an F1 score of 0.95 without needing to bridge the semantic
gap.
To enhance research into the storage access patterns of ransomware and its use in ran-
somware detection, [60] published RanSap, an open dataset consisting of seven samples
of ransomware, as well as five popular benign software samples. Using said dataset, [53]
build Foreseer, a DNN that models the entire feature set using Long Short-Term Memory
aided by attention mechanisms. Foreseer can project future events to cut the time to
predict malware presence by 40%. The combined work of [53], [60] shows how publishing
datasets can lead to more focused research.
The need for standard ransomware datasets is also outlined by [61]: Their survey found
that most researchers download ransomware samples from public websites and build their
required datasets themselves. Thus, the authors proposed the implementation of standard
ransomware datasets, such as RanSap, as a potential research direction.
3.2.2 Xen
Contrary to BitVisor, the Xen hypervisor can host multiple VMs. Various tools used for
malware analysis build upon the Xen hypervisor, such as Drakvuf [62] and LibVMI.
Because of the availability of these tools, Xen offers itself as a viable platform for malware
detection research:
Using similar tools and methods, VMShield [12] extracts logs of VM memory from the VMM.
These memory logs are then parsed into system call logs using Drakvuf. From the system call
logs, features are extracted as a bag of n-grams. The researchers then utilize an RF
model to detect malware, using the University of New Mexico (UNM) and BareCloud
datasets to validate their model. While the UNM dataset is based on Linux system calls,
the BareCloud dataset consists of Windows binaries. VMShield achieved an accuracy of
81.23% to 99.52% on the UNM dataset and 97.66% to 99.91% on the BareCloud
dataset. Although multiple studies use the UNM dataset to train and evaluate their
corresponding models, the dataset can no longer be found under the link cited by
these studies, as it appears to have been taken down.
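A bag of n-grams over a system call trace can be sketched in a few lines of Python. The trace below is invented, and the choice of n = 3 is an assumption for illustration rather than the value used in [12].

```python
from collections import Counter


def bag_of_ngrams(calls, n=3):
    """Count every run of n consecutive system calls in a trace."""
    return Counter(tuple(calls[i:i + n]) for i in range(len(calls) - n + 1))


def to_vector(bag, vocabulary):
    """Turn a bag into a fixed-length count vector over a given n-gram vocabulary."""
    return [bag.get(ngram, 0) for ngram in vocabulary]


# Made-up system call trace.
trace = ["open", "read", "write", "close", "open", "read", "write", "close"]
bag = bag_of_ngrams(trace, n=3)
vocabulary = sorted(bag)

for ngram in vocabulary:
    print(ngram, bag[ngram])
print(to_vector(bag, vocabulary))
```

The count vector discards where in the trace an n-gram occurred but keeps short-range ordering, and its fixed length makes it usable as input to a classifier such as RF.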
With the same set of tools, [54] extract multiple features from VM memory, the hypervisor
layer, and the hardware layer (using perf as an additional tool). In total, they extract
235 events. To allow a classifier model to use this many features, they use ensemble
learning combined with a voting method. The model was evaluated using malicious
samples obtained from VirusShare [64], which include various malware types, such as
trojans, worms, and ransomware. Using these samples, the authors reach an F1 score of
0.94.
In contrast to previous research using the Xen hypervisor, [55] detect malware without
such a deep view inside the guest system. To achieve this, they extract performance
counters maintained at the hypervisor level, performance counters at the hardware level,
and traces collected at the hypervisor. They collect 329 indicators using the following
tools: perf,
xenperf, and xentrace. These indicators are separated into their originating VMs, and
the program behaviors of the VMs are inferred. Thus, the authors bridge the semantic
gap without requiring detailed insight from tools like Drakvuf. The authors use over 2000
benign and malicious software executables from various sources to train and evaluate their
model and achieve a detection accuracy of 0.875 on the trained RF classifiers.
Like Xen, KVM also supports LibVMI, thus allowing a similar extraction method: [13]
propose KVMInspector, which uses both LibVMI for virtual machine introspection and
strace, to get system call logs from inside the guest VM. The features extracted at the
guest OS and hypervisor levels are then used to detect various malware, including rootkits.
The proposed approach uses an ensemble heterogeneous classifier where the outputs of
different models are aggregated using a VotingClassifier. The model reaches an accuracy
of up to 99.92% when evaluated on the UNM dataset.
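The idea of aggregating the outputs of heterogeneous models by voting can be sketched as follows. This is a generic hard-voting example in Python and not the implementation of [13]: the three rule-based "classifiers", their feature names, and their thresholds are hypothetical stand-ins for trained models.

```python
from collections import Counter


def hard_vote(classifiers, sample):
    """Return the label predicted by the majority of the given classifiers."""
    votes = Counter(clf(sample) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for trained models; features and thresholds are made up.
classifiers = [
    lambda s: "malicious" if s["write_entropy"] > 7.0 else "benign",
    lambda s: "malicious" if s["files_renamed"] > 100 else "benign",
    lambda s: "malicious" if s["network_connections"] > 50 else "benign",
]

sample = {"write_entropy": 7.8, "files_renamed": 250, "network_connections": 3}
print(hard_vote(classifiers, sample))
```

Using an odd number of voters avoids ties in the binary case; with an even number, the sketch above would fall back to the label that was counted first.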
Without needing to collect any information at the guest OS level, [48] utilize Intel Pro-
cessor Trace (IPT) to collect information about the executions of a process. To start
recording IPT packets, VMI is used to extract the identification information of the target
process and then send it to the hypervisor for IPT configuration. After this, the con-
trol flow information is converted to color images, and a CNN detects malware with an
accuracy of 95% when evaluated using malicious executables as well as benign programs.
Similar to [13], [10] monitor system calls in a KVM environment. However, the proposed
application, called RansomSpector, does not need a component inside the guest VM and
instead traps all system calls to the hypervisor, using LibVMI to resolve addresses, where
file and network system calls are sent to the detector. This detector matches the ob-
served system call patterns with known ransomware patterns. To evaluate their tool,
the authors use a dataset of various types of ransomware obtained from VirusShare and
VirusTotal [65]. The authors conclude that their tool can effectively detect ransomware
attacks with a small performance overhead.
3.2.4 ESXi
While the hypervisors mentioned above are open-source, ESXi is a proprietary hypervi-
sor developed by VMware. Although VMware is a significant player in the virtualization
technologies market and even a market leader in the hyper-converged infrastructure mar-
ket [66], very little public research on malware detection was conducted using its ESXi
hypervisor. On VMware’s security blog, [56] published an article outlining how memory
forensics can be employed on an ESXi VM: When creating a snapshot of a VM on ESXi,
it is possible to include the complete state of the guest VM, including the contents of its
memory, which are saved into a file with the extension ".vmem". This memory snapshot
is then analyzed using Volatility. To detect malware presence in a snapshot, the
authors suggest several approaches:
  2. Using memory analysis by directly accessing the OS objects and identifying “un-
     accounted” components or suspicious modifications to the function addresses of a
     process
The authors have developed a plugin for Volatility to detect API hooks of malware trying
to hide its presence. Using this plugin, they execute and analyze the evasive Thanos
ransomware [68], which tries to hide its presence by letting other processes execute it
using API Hooks.
3.2.5 Custom
Using Intel VT-x, it is possible to write a custom hypervisor. Various examples exist
of custom hypervisors [69], [70]. The advantage of implementing a custom hypervisor
is that intercepting instructions and system calls can be done from within the custom
hypervisor. The authors of [57] use this to trap instructions and hook system calls to
expose the virtual environment-detecting behavior that evasive malware applies to change
its behavior in VMs. Instructions are trapped as explained in Section 2.1, while system
calls are hooked by replacing the physical memory page of the system call issued by
the guest in the Intel Extended Page Table with a fake page table, adding a logging
functionality before returning to the original function. An advantage of the proposed
approach is that, contrary to previous research, it is independent of different virtualization
environments; however, to implement the proposed approach, for example, with Hyper-V,
an in-depth understanding of the architecture of the hypervisor is necessary. To evaluate
their approach, the authors used 23 samples of evasive malware and compared them to
other detection methods, such as Drakvuf [62] and VMShield [12]. Most other methods
failed to identify any of the evasive samples, with Drakvuf performing the best by detecting
56.52% of all samples. However, their model successfully identified all samples, achieving
a 100% detection rate. Additionally, it detected threats more than six times faster than
the other methods.
Similarly to [57], HyperDbg [15] uses Intel VT-x and Extended Page Table to leverage
the powers of a hypervisor for debugging purposes. The implemented debugger can be
used for high-performance and stealthy debugging of user and kernel applications. It
operates on ring -1 by virtualizing an already running system. To verify the stealthiness
of the debugger, the authors tested the debugger with 13 packers and protectors, none of
which detected the debugger. HyperDbg is open-source and features a VMX-compatible
script engine and extensive documentation [71], making it a useful tool for various tasks.
Using their debugger, the authors analyzed over 10’000 samples collected from a malware
database [72] and found that their debugger enabled debugging of 22% more samples
than WinDbg. Additionally, the debugger is 2018x faster than WinDbg for system call
recording.
Much related work has been conducted on the open-source hypervisors Xen and KVM,
leaving little room for a new research gap. A possible issue in this part of the related
work is that all related work using only system calls as features relies on the library
LibVMI. Therefore, novel ways to extract system calls could be explored. In contrast to
open-source hypervisors, proprietary hypervisors like ESXi and Hyper-V are rarely used
in current research, thus offering a possible research direction. Additionally, a custom
hypervisor could be leveraged as a ransomware detection system, as only [57] did this to
catch virtual environment-detecting malware. While implementing a custom hypervisor
would require a deep understanding of the processor and the corresponding virtualization
technology, such as Intel VT-x or AMD-V, plenty of custom hypervisors that could be
used exist [15], [69], [70]. Related work also uses various detection methods, from simple
policy-based detection to complex neural networks. However, none have used AD models
to detect ransomware. Therefore, the following research directions have been identified
and are explored in the next chapter.
Chapter 4
Research Opportunities
This chapter examines the theoretical and practical feasibility of the research directions
identified in the related work. First, the VMI tools and opportunities are listed per
hypervisor in Table 4.1. This section should give the reader an overview of what is
possible on each hypervisor. After this, the research directions outlined in Section 3.3
are combined with the capabilities identified in Section 4.1 to form research opportunities.
These research opportunities are then explored and tested for feasibility. The chapter
concludes by selecting the research opportunity pursued in this thesis.
4.1.1 BitVisor
This thin hypervisor is open-source and was developed as a research project [50]. It
features a para-passthrough architecture designed to shrink hypervisor code size by let-
ting most guest OS I/O operations bypass the hypervisor, only mediating the minimal
access needed for security. To allow direct access from the guest device drivers, the para-
passthrough architecture limits the number of VMs to one and can thus be seen as a
thin layer between OS and Hardware. Because only one VM is possible, traditional VMI
approaches are impossible with this hypervisor. Instead, virtual machine introspection is
achieved by modifying the hypervisor, as [51], [52], [60] did. Therefore, extracting system
calls should be possible with an approach similar to trap and emulate; however, it requires
a deep understanding of the hypervisor and the guest OS and is therefore not considered
a research opportunity.
4.1.3 Xen
Linux KVM allows turning a Linux Kernel into a Type 1 hypervisor [21]. With some
additional configurations, KVM offers support for LibVMI [14], thus allowing researchers
to monitor the low-level details of a VM easily. To bridge the semantic gap, tools similar
to those available for Xen are accessible, except for Drakvuf. At the time of writing,
Drakvuf is available only for the Xen hypervisor. Linux KVM also allows the collection
of IPT from a guest VM as [48] demonstrated.
Hyper-V does not support any form of VMI. However, it can be debugged using LiveKd [77],
which allows the extraction of a memory dump. Using LiveKd, extracting hypercalls and
other hypervisor-level information should also be possible; however, this requires a more
intricate setup [78]. Enabling IPT monitoring should be possible [79]. However, no official
library for simplifying the collection of those traces on Windows was found in this work.
Several opportunities can be explored with the capabilities found in Section 4.1 and the
possible research directions outlined in Section 3.3. These opportunities are summarized
in Table 4.2, displaying the hypervisor, the introspection method and tools used, and
the research directions the opportunity covers. Because performance logs were always
combined with other features, and IPT requires a deep understanding of the Intel archi-
tecture, these two extraction methods were not considered in the research opportunities.
For convenience, the research directions outlined in Section 3.3 are as follows:
Opportunity Nr. 1
   • Windows Server 2022 Standard Hyper-V partition (L1) with 4056 MB of RAM and
     nested virtualization enabled.
After installing LiveKd and Volatility3, a memory snapshot of L2 was taken on L1 using
the command shown in Listing 4.1.
                         Listing 4.1: LiveKd Memory Snapshot
  livekd64.exe -hv Win10L2
              -o C:\Users\Administrator\Desktop\test.dmp
To test the usability of this memory dump, the processes running in L2 were extracted in
Listing 4.2.
This approach produced an accurate list of processes running in L2, thus showing the
usability of this dump file for VMI. However, extracting a memory dump took a long time
(over an hour for 1024 MB), making this approach impractical.
Opportunity Nr. 2
Because ESXi is proprietary software, an evaluation license was acquired to explore the
second opportunity. This work tried to install ESXi on the test desktop (Intel i7-3770),
but after the installation failed multiple times with the error "No network adapters were
detected", this opportunity was not further explored.
Opportunity Nr. 3
The third opportunity to develop a custom setup to analyze system calls was explored
using the following environment.
      • Guest: VMware Player VM, with nested virtualization enabled and added virtual
        serial port (Windows 10).
The connection from host to guest was established according to HyperDbg’s documen-
tation [80], where the guest was the debuggee and the host the debugger. To test the
capabilities of HyperDbg, system calls were extracted. Listing 4.3 shows how HyperDbg
can be used to extract system calls from a VM, while Listing 4.4 displays the exemplary
output of this command:
To resolve the system call number to the corresponding system call, we need context
information about the operating system running on the client, such as the OS and version.
Given this information, system call tables collected by the community like [81] can be
leveraged to resolve the corresponding system call.
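Resolving a logged system call number therefore amounts to a lookup that is keyed by the OS and its version first. The Python sketch below illustrates this; the build identifiers and the number-to-name mappings are placeholders invented for this example and must be replaced with entries from a real table, such as [81], for the build under inspection.

```python
# Hypothetical excerpt of a system call table keyed by (OS, build).
# All numbers and build names are placeholders for illustration only.
SYSCALL_TABLES = {
    ("Windows 10", "build-A"): {0x06: "NtReadFile", 0x08: "NtWriteFile", 0x55: "NtCreateFile"},
    ("Windows 10", "build-B"): {0x06: "NtReadFile", 0x08: "NtWriteFile", 0x57: "NtCreateFile"},
}


def resolve_syscall(number, os_name, build, tables=SYSCALL_TABLES):
    """Map a system call number to its name for the given OS and build."""
    table = tables.get((os_name, build))
    if table is None:
        raise KeyError(f"no system call table for {os_name} {build}")
    # Unknown numbers are kept as a readable placeholder instead of failing.
    return table.get(number, f"unknown_0x{number:x}")


print(resolve_syscall(0x55, "Windows 10", "build-A"))
print(resolve_syscall(0x55, "Windows 10", "build-B"))
```

The second call shows why the context information matters: the same number can resolve to a different call, or to none at all, when the wrong table is used.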
4.3 Discussion
Although multiple opportunities would be feasible, the decision was made to prioritize the
speed of the extraction process. Consequently, the first two opportunities are impractical
due to their longer extraction times. Using HyperDbg, the detection system gains a
deep insight into the virtualized system while collecting information protected by the
hypervisor with little intrusion on the inspected client. The opportunity combines three
research directions, allowing the exploration of a novel way to extract system calls at the
hypervisor level, using a custom hypervisor and AD models to detect anomalies based on
the extracted features.
Chapter 5
Architecture
The purpose of this chapter is to give the reader an overview of the proposed system, as
well as the scenario in which this proposed system must operate, before elaborating on
the concrete implementation of the system in the subsequent chapter.
5.1 Scenario
This work uses HyperDbg as part of a sandbox to monitor a running executable. The
goal of the sandbox is to determine whether a running executable is ransomware while
avoiding harm to other systems or the network. To assess the maliciousness of software,
this work uses HyperDbg to analyze the system calls issued by the executables. For
this purpose, HyperDbg monitors and logs the entirety of the system calls issued by the
system in a timespan while the executable is running. ML and AD models then use
these system calls to determine whether ransomware was running on the system. To train
and evaluate these models, this work considers various ransomware and benign samples.
Benign samples include various office apps and procedures such as zipping and unzipping
files.
This work considered various setups to extract system calls at the hypervisor level using
HyperDbg.
  1. Debugging a physical machine over a serial port: This requires two Windows ma-
     chines. If these two machines are physical, at least one must have a serial port to
     be debuggable.
  2. Debugging a VM running on VMware Workstation: This setup would allow for only
     a single physical machine, but our test environment would operate on a Type 2
     hypervisor using nested virtualization.
  3. Debugging the local machine and sending data over a TCP socket: HyperDbg allows
     local machine monitoring. To extract the collected information, HyperDbg can create
     different output sources, including writing to files, named pipes, and TCP sockets.
     This work utilizes this capability to send the collected logs over a TCP socket to
     another machine in the network.
These setups have advantages and disadvantages: Setups one and two allow for more
control over the debugged system because HyperDbg can operate in debugger mode.
Debugger mode allows HyperDbg to break to the debugger and step through instructions in
kernel mode. While setup three is less intrusive, it operates in VMI mode, which means that
features such as breaking to the debugger, stepping through instructions in kernel mode,
and starting processes from HyperDbg are not available [82]. This thesis proposes to use
setup number three because HyperDbg's main purpose here is to monitor. Debugging features
are not required, so the advantage of being less intrusive outweighs the loss of control.
An overview of the setup is provided in Figure 5.1:
[Figure 5.1 (diagram not reproduced): Client (HyperDbg, virtualized OS, executable /
ransomware) and Controller (monitoring module), linked by a network connection labeled
"Settings, Instructions, Executables".]
     • Client: This is the machine where HyperDbg runs and malware is executed. Because
       this machine is running malware, it should be as isolated as possible from other
       machines.
     • Controller: This machine is connected to the client over the network. It is respon-
       sible for collecting the logs from the client and detecting whether ransomware is
       running on the client.
5.3 Logging System Calls
With HyperDbg, system calls can be logged conveniently: With the !Syscall command,
HyperDbg registers an event that triggers whenever Windows tries to execute a system
call. This command unsets the Syscall Enable bit in the Extended Feature Enable Register,
which causes every system call to raise an undefined opcode exception. This exception is
intercepted at the hypervisor level, where the system call is emulated and can be
intercepted by the debugger [83]. At this point, context information of the system call can
be extracted. [10] describes which registers contain this information on the x64 platform:
The RAX register holds the number of the system call. When a system call is invoked, the
first four parameters are put into the RCX, RDX, R8, and R9 registers, and the remaining
parameters are pushed onto the stack. HyperDbg also provides easy access to system state
information: the variables $pname, $pid, and $tid contain the process name, the process
id, and the thread id, respectively. These variables make it easy to determine who issued
a system call. Resolving the system call number to the corresponding system call is not
necessary for our detection system but is possible using system call tables collected by
the community, such as [81]. The system call parameters could then be resolved to obtain
more information about the purpose of the system call, for example, by using the types
and structures defined in the Winternl.h header file. Resolving all system calls with their
parameters requires more computation and memory accesses on the client machine, which
this work tries to avoid. Because related work, except for [10], largely avoided computing
the context information for system calls [12], [13], [84], this work avoids it as well.
To obtain services from the OS, a user program must make a system call, which invokes
the operating system through a trap. The trap instruction switches from user to kernel
mode and starts the operating system [19]. Because of this, various related works use
system calls to detect malware, applying multiple approaches to turn the raw system call
logs into features. An overview of these approaches can be found in Table 5.1.
5.4.1 Bag-of-Words
This approach is inspired by and commonly used in natural language processing. It converts
a text document or a sentence into a vector of counts, where the vector contains an entry
for every possible word in the vocabulary. If a word, such as "Syscall", appears three times
in a document, the corresponding vector position for that word will have the value three
[85]. In the context of system calls, this approach counts the occurrences of each
unique system call in a trace. Because Bag-of-Words loses the original sequence and
structure of the analyzed text, the works listed next to this approach in Table 5.1 use
an extension called Bag-of-n-Grams: Instead of only counting the occurrences per word,
a sliding window, which shifts by one item to the right for each sequence, counts the
occurrences of each contiguous sequence of n items. For example, the sentence "System
Calls are great" generates the 3-grams (three words per sequence) "System Calls are" and
"Calls are great". In related work, Bag-of-n-Grams was used in two ways (as depicted in
Table 5.1):
1) Frequency: With this approach, a vector is created in which each n-gram is annotated
with its number of occurrences, the same as in the Bag-of-Words approach. Thus, the
vector has the form $V_{frequency} = \langle c_1, c_2, c_3, \ldots, c_Z \rangle$, where $Z$
is the number of unique n-grams and $c_i$ is the number of occurrences of the $i$-th
n-gram in the trace.
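The sliding window and the resulting frequency vector can be illustrated with a short, dependency-free Python sketch. It is not the implementation used in this work; the function names and the example trace of system call numbers are hypothetical and chosen only for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Slide a window of size n over the tokens, shifting by one each step."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequency_vector(trace, vocabulary, n):
    """Count how often each n-gram of the vocabulary occurs in the trace."""
    counts = Counter(ngrams(trace, n))
    return [counts[gram] for gram in vocabulary]


# The example sentence from the text yields two 3-grams.
sentence = "System Calls are great".split()
print(ngrams(sentence, 3))

# Hypothetical trace of system call numbers (illustrative values only).
trace = [35, 15, 35, 15, 35]
vocabulary = sorted(set(ngrams(trace, 2)))  # the unique 2-grams of the trace
print(vocabulary, frequency_vector(trace, vocabulary, 2))
```

In this sketch, the vocabulary is built from a single trace; in practice, it would be built from all traces in the training data so that every vector has the same length.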
Choosing the optimal value for the parameter n in the Bag-of-n-Grams approach can
be difficult. [86] investigated, theoretically and experimentally, which sequence
length is optimal for intrusion detection and found that 6-grams and 7-grams perform
best. [12], [13], [84] all use 6-grams to detect malware. To find the best
configuration, this work explores the performance of the detection models with various
combinations of the approaches mentioned above and values for n. The results of this
exploration are documented in Chapter 6.
Chapter 6
Implementation
As mentioned, this chapter aims to give the reader a complete overview of the concrete
implementation of the architecture proposed in Chapter 5. HyperDtct includes two
physical machines: a client and a controller. The client is the machine on which HyperDbg
monitors the execution of a sample, while the controller collects the logs generated by
HyperDbg and instructs the client on how and what logs should be collected. This
implementation makes use of the following platforms:
   • Intel i7-6700k PC with 16 Gigabytes (GB) RAM and 120 GB SSD running Windows
     10 version 22H2 as the client machine.
[Figure 6.1 (diagram not reproduced): Client with HyperDbg and executable / ransomware
(labels: "Launches w. Configuration", "Launches after timeout"); Cat. 6a twisted pair
cable; firewall (deny all, except ports 9090 and 8989); controller with file/control
server (port 9090) and log server (port 8989).]
These platforms are interconnected as visualized in Figure 6.1. The client can be moni-
tored from any machine, not just a Raspberry Pi. However, a Pi was used in this prototype
to demonstrate that even a resource-constrained device can function as the controller.
The implementation of HyperDtct is available under [87]. The following sections describe
these platforms and their corresponding components in depth. First, the client and the
collection of system call logs using HyperDbg are described. Then, the collection of
logs is described from the controller's perspective before the assumptions underlying the
described setup are stated and the samples used to train the models are presented. The
last part of this chapter focuses on detecting malicious samples and evaluating these
initial detection models.
6.1 Client
Setting up the client machine requires modification of the bare Windows setup: First, to
give ransomware some data to encrypt, the documents folder is populated with
documents in various formats, such as PDF, HTML, and PNG, collected from [88]. Using
client/Utils/download_govdocs.py, five archives were downloaded from [88], which
equals roughly three GB of files to encrypt. Additionally, the zip folder containing these
documents is also placed in the documents folder to allow simulating unzip operations and
to provide a larger target on the system for ransomware to encrypt. This implementation
used version 0.8.2 of HyperDbg to monitor the client. At the time of writing, HyperDbg is
still under active development and requires some configuration before it can be run to
analyze the local system. This implementation followed the guide in [89] to set up the
environment for HyperDbg.
In addition to the modifications suggested in the guide, Windows Defender was disabled
because it was found that Windows’ native antivirus software interfered with HyperDbg,
as it led to a system crash each time it scanned the system concurrently with HyperDbg
running. Therefore, this work disabled all components related to Windows Defender. As
the client is part of a sandbox isolated from the rest of the network and HyperDtct requires
malware to run uninterrupted to collect data for training, this modification is necessary
in any case. The following steps were followed to disable Windows Defender.
Disabling Windows Defender reduced the occurrences of a system crash when logging
system calls. However, it did not completely avoid them. To launch HyperDbg, Driver
Signature Enforcement (DSE) must be disabled on Windows. While various approaches
exist to achieve this, this work used the following approach:
     1. Press and hold the shift key while clicking the restart button.
     2. This opens the recovery settings. Select Troubleshoot > Advanced Options >
        Startup Settings and click Restart
  3. The system reboots. Upon reboot, several options are presented. Press the key F7
     to disable DSE.
This approach temporarily disables DSE on the system and must be executed again after
each reboot. HyperDbg must be launched using the highest privileges and thus requires
client/start_logging.py to be run as Administrator. To allow this while at the same
time executing the potentially malicious samples with an unprivileged user, the client
machine is set up with two user accounts:
• Client: This is the unprivileged user in whose session all samples are executed.
HyperDtct provides a wrapper for HyperDbg to collect system call logs: To start
monitoring system calls on the client, run the script client/start_logging.py in the
unprivileged user's session, entering the password of the administrative user. The script
serves as the entry point for logging system calls. It first checks whether HyperDbg is
runnable by executing HyperDbg and trying to catch a single system call. Upon success,
the script tests the connection to the controller machine. If both checks succeed, it
starts the logging procedure. To log the system calls for a single executable,
the following steps are executed:
  1. Request the next file from the controller. The file is expected to be a zip file
     containing a batch script named execute.bat.
  3. Request the next log socket from the controller. The answer to this request contains
     information such as how long the log duration is expected to be and whether the
     file is expected to be malicious.
The process is further outlined in Figure 6.2 as a sequence diagram, focusing on the
sequence from the client’s perspective.
[Figure 6.2 (sequence diagram not reproduced): recoverable labels are "Check if HyperDbg
is runnable", "Runnable", and "ACK".]
  # Configure symbols, load them once, and then use only the local path
  .sympath SRV*c:\Symbols
  .sym reload
6.2 Controller
The controller is a machine connected to the client over the network. Because the client
is a sandbox, this work tries to isolate it as much as possible from the controller. There-
fore, using Ubuntu’s uncomplicated firewall (UFW), all connections to the controller are
denied except connections on the configured controller socket (Port 9090) and log socket
(Port 8989). A separate log socket is created because HyperDbg closes the socket con-
nection after successfully logging all system calls to it. To start the monitoring server on
the controller, the script controller/start_server.py can be run. The script accepts
two arguments, --input and --log_dir. The input is either a zip file with the name format
[malicious|benign]_[file_name]_[duration]min.zip or a directory, which contains a
settings.json file, outlining in what order and with which settings the files in the direc-
tory are sent. The log directory is the directory where the collected logs will be output.
For each file to be sent, the following settings can be adjusted in the file settings.json:
   • file: The name of the zip file to be sent. This file is expected to be in the same
     directory as settings.json.
   • malicious: Whether the executable in the zip directory is expected to be malicious.
   • minutes: For how many minutes the file should be monitored.
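When a single zip file is passed as input, the same settings are encoded in its name. The following Python sketch shows one way to derive them from the name format given above. The function name, the regular expression, the additional `name` key, and the example file name are assumptions made for illustration and are not taken from the HyperDtct repository.

```python
import re

# Pattern for [malicious|benign]_[file_name]_[duration]min.zip
NAME_PATTERN = re.compile(r"^(malicious|benign)_(.+)_(\d+)min\.zip$")


def parse_sample_name(zip_name):
    """Derive the per-file settings from a zip file name."""
    match = NAME_PATTERN.match(zip_name)
    if match is None:
        raise ValueError(f"unexpected file name: {zip_name}")
    label, file_name, minutes = match.groups()
    return {
        "file": zip_name,                  # same key as in settings.json
        "malicious": label == "malicious",
        "minutes": int(minutes),
        "name": file_name,                 # hypothetical extra key
    }


# Hypothetical example name; underscores inside the file name are allowed.
print(parse_sample_name("benign_7zip_unzip_3min.zip"))
```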
After the script is started, the controller socket accepts the following requests:
     • next file: Send the next file of the defined files. If all files were sent, the controller
       sends a special command, signaling to the client that all files were sent.
     • next log: Creates a new log socket for HyperDbg to connect to and sends the settings
       related to the expected executed file.
     • test connection: Sends a simple ACK, confirming the connection is working.
Because HyperDbg is not entirely stable and can crash the client unexpectedly, the con-
troller must gracefully handle this sudden loss of connection. Therefore, to avoid the
scenario where the controller waits endlessly on the client’s input, a timeout of 60 seconds
is set before the controller closes the connection and waits for new connections.
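The request handling and the timeout can be sketched as follows. This is a simplified illustration under stated assumptions, not the code of controller/start_server.py: the textual wire format, the reply strings, and all function names are hypothetical, and the creation of the actual log socket is reduced to a placeholder reply.

```python
import socket

CLIENT_TIMEOUT_SECONDS = 60  # from the text: give up on a silent client


def handle_request(command, pending_files):
    """Map one request to a reply; commands and replies are hypothetical."""
    if command == "test connection":
        return "ACK"
    if command == "next file":
        # A special reply signals that every file has been sent.
        return pending_files.pop(0) if pending_files else "ALL_FILES_SENT"
    if command == "next log":
        # Placeholder: the real controller opens a log socket and sends settings.
        return "LOG_SOCKET_READY"
    return "UNKNOWN_COMMAND"


def serve_once(server, pending_files):
    """Serve one client connection; return False if the client went silent."""
    connection, _address = server.accept()
    connection.settimeout(CLIENT_TIMEOUT_SECONDS)
    with connection:
        try:
            while True:
                data = connection.recv(1024)
                if not data:
                    return True  # client closed the connection normally
                reply = handle_request(data.decode().strip(), pending_files)
                connection.sendall(reply.encode())
        except socket.timeout:
            return False  # client presumably crashed; wait for a new one
```

Calling `serve_once` in a loop on a listening socket lets the controller return to accepting connections after a client crash instead of blocking indefinitely.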
6.3 Assumptions
This work makes several assumptions for collecting system call logs using HyperDbg: First,
it is assumed that ransomware cannot reboot or crash the system, as this would lead to
HyperDbg not running anymore because DSE would be enabled again. Second, HyperDtct
assumes that ransomware cannot infect the controller over the configured controller socket
or log socket. In Chapter 7, these assumptions are evaluated and tested for different
ransomware.
System call logs for several scenarios and executables were collected using the explained
setup.
To collect data and evaluate the system, this work used benign applications from [91],
which distributes free applications for Windows, specifically packaged for portability. The
platform is fully open-source, free, and maintained by over a hundred developers, trans-
lators, application packagers, designers, and release testers. The advantage of portable
applications in the context of this work is that these executables can be sent to the client
the same way ransomware would be, without requiring an installation process. For the
process of initial data collection, the following samples were considered with the corre-
sponding monitoring duration:
• copying all PDFs from the documents directory to another directory: 3 minutes
These samples were packaged into a directory containing an execute.bat script, defining
how the executable should be run, and compressed to a zip file. Each such script is
responsible for ensuring that the software only runs during the defined monitoring period.
To monitor the executable, it is executed on the client using the mentioned execute.bat
script and run for the defined time. This process happens automatically, and no input
from a user is required.
6.4.2 Ransomware-PoC
Ransomware-PoC was run twice for 10 minutes and encrypted files in the documents
folder, populated with files collected from [88].
6.5 Detection
The detection module leverages the Python packages scikit-learn [93], pandas [94], and
PyOD [95]. After preprocessing the collected logs using pandas and scikit-learn, the
extracted features are used to train and evaluate the following algorithms:
      • IForest: Building on the success of the RF algorithm, the IForest is selected as the
        AD alternative to the classifier RF.
Thus, the prototypical implementation of HyperDtct considers two anomaly detectors and
two classifiers to detect malicious samples. The performance of each model is evaluated
at each stage of development, and the models, as well as their corresponding hyperparam-
eters, are subject to change, should their performance disappoint during evaluation.
The collected log files are read into a pandas DataFrame. Based on the file's name and
the issuing process's name, all system calls are labeled as benign or malicious. Only the
system call number is considered, while the attributes of the system call are ignored.
The system call numbers are grouped by timestamp and PID. HyperDtct only considers one
client; thus, logs are not collected concurrently from multiple machines. Because of
this, grouping by timestamp and PID creates a unique index. Additionally, the timestamp
is rounded to a certain time interval, such as five seconds, which results in a list of
system calls issued by a process within five seconds. This process is outlined in
Listing 6.2. During the implementation phase, this work explored several time intervals;
however, no difference was found in model performance. To satisfy both the need for
sufficiently long time windows that allow normal software to issue enough system calls
and the need for enough samples to train and evaluate the models, a window of five
seconds was chosen.
This list of system calls issued by a process in five seconds is then converted to a space-
separated string of system call numbers, such that it can be used with the string-based
bag-of-n-grams approach. In total, over 103’000 system calls were collected across the
mentioned samples, over 27’000 of which stemmed from malicious samples. By grouping
by PID and five seconds, 5230 benign and 241 malicious dataset entries are available for
training and evaluation of the detection models.
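The grouping step can be illustrated with a dependency-free Python sketch. It is not the pandas-based preprocessing used by HyperDtct; the record layout (seconds since start, PID, system call number), the example values, and the choice to round timestamps down to the start of the window are assumptions.

```python
from collections import defaultdict

WINDOW_SECONDS = 5


def group_syscalls(records, window=WINDOW_SECONDS):
    """Group (timestamp, pid, syscall_number) records by time window and PID.

    Returns a dict mapping (window_start, pid) to a space-separated string
    of system call numbers, in order of arrival.
    """
    groups = defaultdict(list)
    for timestamp, pid, syscall in records:
        window_start = int(timestamp // window) * window  # round down
        groups[(window_start, pid)].append(str(syscall))
    return {key: " ".join(calls) for key, calls in groups.items()}


# Hypothetical log records: (seconds since start, PID, system call number).
records = [(0.4, 100, 35), (1.2, 100, 15), (4.9, 200, 7), (5.1, 100, 35)]
for key, sequence in group_syscalls(records).items():
    print(key, sequence)
```

Each resulting string corresponds to one dataset entry and can be passed to a string-based n-gram vectorizer.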
The preprocessed data is used to train all models, allocating 80% for training and re-
serving 20% for evaluation. Using this split, only 56 malicious samples are available in
the evaluation dataset, which must be considered during model and vectorizer evaluation.
As outlined in Section 7.1, this work uses the F1 score to compare models. The models
were initially fitted and evaluated without adjusting hyperparameters, except setting
contamination to 0.044 in the AD models. This led to the F1 scores depicted in Figure
6.3. While the classifiers achieved high F1 scores of over 95 % across all vectorizers and
n-grams, the AD models did not achieve high scores. The notable exception was the
LOF model, which showed considerably better performance when trained on frequency-
based n-grams. Specifically, the LOF model trained on a frequency-based 3-gram dataset
achieved the highest F1 score of 76 %.
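The contamination value of 0.044 coincides with the share of malicious entries in the dataset described above (241 of 5471 entries). Whether the value was derived in exactly this way is an assumption; the arithmetic can be checked with a few lines of Python:

```python
benign_entries = 5230
malicious_entries = 241

# Share of malicious entries among all dataset entries.
contamination = malicious_entries / (benign_entries + malicious_entries)
print(round(contamination, 3))
```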
[Figure 6.3: Model F1 Score by Vectorizer and N-Gram Range. Panels: RF and NB
(classification), LOF and IForest (anomaly detection). Y-axis: F1 score (0.0 to 1.0);
x-axis: N-Gram (5 to 1); series: CountVectorizer, TfidfVectorizer.]
According to [96], n_estimators sets the number of base estimators in the ensemble (with
the default being 100), max_samples defines the number of samples, and max_features
defines the number of features to draw from X to train each base estimator. With those
hyperparameters considered and the combination of n and vectorizers, a total of 270
combinations of parameters are tested.
To improve the LOF algorithm models, the parameters outlined in Listing 6.4 were con-
sidered.
n_neighbors is the number of neighbors to use by default for kneighbors queries, algorithm
defines the algorithm used to compute the nearest neighbors, and metric defines how the
distance to the neighbors is measured [97]. Evaluating this parameter grid results in a
total of 270 combinations to be tested.
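How such a count arises can be shown by enumerating a parameter grid. The candidate values in the following Python sketch are hypothetical placeholders, chosen only so that five values of n, two vectorizers, and three values per hyperparameter multiply to 270; they are not the values from Listing 6.4.

```python
from itertools import product

# All values below are hypothetical placeholders, chosen only so that the
# product matches the 270 combinations reported in the text.
grid = {
    "n": [1, 2, 3, 4, 5],
    "vectorizer": ["frequency", "tf-idf"],
    "n_neighbors": [5, 20, 50],
    "algorithm": ["ball_tree", "kd_tree", "brute"],
    "metric": ["euclidean", "manhattan", "chebyshev"],
}

# One dict per combination; keys and values are paired in insertion order.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combinations))  # 5 * 2 * 3 * 3 * 3 = 270
```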
To perform the tuning process and identify the optimal parameter combinations, the script
controller/detection/gridSearch.py was executed with the corresponding model passed
as an argument. The best F1 scores from these runs and their respective configurations
are presented in Table 6.1.
[Figure 6.4: F1 score of the AD models with tuned hyperparameters across n-grams.
Y-axis: F1 score; x-axis: N-Gram (5 to 1), two panels.]
Compared to the previous results in Figure 6.3, Figure 6.4 depicts the F1 score across
different n-grams using the AD models with the aforementioned, updated hyperparameters.
While the results for the LOF models have not improved notably, tuning the
hyperparameters of the IForest algorithm improved the results on both the frequency-based
and the TF-IDF-based 2-gram data and led to a model with an F1 score of over 80 %. Despite
these improvements, the classifier models still perform better than the AD models. Table
6.2 gives an overview of the best results achieved per algorithm and its evaluation metrics:
Chapter 7
Evaluation
This chapter evaluates the sandbox prototype. To this end, each section describes the
additional samples collected and any modifications made to the system. Each section
builds on the previous sections and considers the changes made in earlier evaluations.
Because the data collected in this work is highly unbalanced, with only a minority of entries
being malicious, choosing the right performance metric to evaluate models is important.
Section 2.5.1 describes several performance metrics from which this work can select.
Classification accuracy is a misleading metric in this work, as a model detecting not a
single malicious entry would still achieve a high score because of the data's inherently
high percentage of benign entries. A better metric is the F1 score, the harmonic mean of
precision and recall. This metric provides a more accurate assessment of a model's
performance in detecting malicious entries, highlighting models that not only detect a
high proportion of malicious entries (recall) but also raise few false alarms (precision).
Therefore, this work relies mainly on the F1 score to compare models.
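The difference between the two metrics can be made concrete with the class sizes of the initial dataset from Section 6.5 (5230 benign and 241 malicious entries). The following Python sketch computes the metrics from confusion-matrix counts for a model that labels every entry as benign; the function name is hypothetical, and the malicious class is treated as the positive class.

```python
def scores(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 for the malicious class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


# A model that labels every entry as benign, applied to the 5230 benign
# and 241 malicious entries of the initial dataset.
accuracy, precision, recall, f1 = scores(tp=0, fp=0, fn=241, tn=5230)
print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")
```

Such a model reaches an accuracy of roughly 96 % although it detects no malicious entry at all, whereas its F1 score is zero.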
Only two samples of the same ransomware were considered for training the models during
implementation. Thus, more malicious samples are needed. However, collecting data
from malicious samples using the existing monitoring system is inefficient, as the client
requires a manual reset after each sample. This section elaborates on how this work
efficiently collected more malicious samples before elaborating on what additional samples
were collected and how the model performance was affected.
To improve data collection efficiency, the system needs to be updated in two ways: to
enable automatic client recovery after a malware infection and to handle potential client
crashes. These updates will involve making changes to both the client and controller
machines.
Because the considered prototype does not use an established VMM but a custom hy-
pervisor, snapshots are not supported, and another approach must be found to reset
the system. Windows provides functionalities to create backups for files and entire vol-
umes, such as the command-line tool wbadmin, but not all functionalities of such tools
are available for every version of Windows. On the client machine running Windows
10 version 22H2, creating a backup using wbadmin was supported, but restoring it was
not. Another approach would be implementing a tool to automatically use the control
panel GUI to restore the system, similar to how [15] restored their systems during evalu-
ation. Because the malware samples considered in this version do not attempt to escalate
privileges and only encrypt the user’s folder without affecting the state of the operating
system beyond a reboot, a more straightforward restore functionality was implemented.
client/System/recovery.py recovers the encrypted files using compressed backup files
stored on a drive only accessible with administrative privileges. Archive file formats con-
sidered include .tar.gz, .zip and .7z, among which the archive created using 7-Zip [98] was
found to be the fastest, restoring a directory containing 63 GB of files in less than thirteen
minutes. The command shown in Listing 7.1 was used to create a backup archive of a
folder.
           Listing 7.1: Create a File Backup of the Archive Directory using 7-Zip
     7z.exe a -mx1 D:\FileBackup\Documents.7z .\Archive\
To restore the compressed backups, the algorithm in Listing 7.2 restores all archives
contained in FILE_BACKUP_PATH to their original location, specified by the name of
the archive. The algorithm assumes that all encrypted data can be found within one
directory RECOVERY_PARENT_DIR, such as the user's home directory. Because this
recovery method was sufficient to restore the system and automate data collection for
the samples considered in this version, a recovery method that restores the operating
system is deferred to later versions or future work.
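The idea behind this restore procedure can be sketched in Python. The sketch is not the code of client/System/recovery.py: to stay within the standard library, it handles .zip archives instead of .7z, and it assumes that each archive is named after the directory it restores and contains that directory's contents at its top level. The function and parameter names are hypothetical.

```python
import shutil
from pathlib import Path


def restore_backups(file_backup_path, recovery_parent_dir):
    """Replace each directory under recovery_parent_dir with its backup.

    Assumes one .zip archive per directory, named after the directory it
    restores (e.g. Documents.zip restores <recovery_parent_dir>/Documents).
    """
    restored = []
    for archive in sorted(Path(file_backup_path).glob("*.zip")):
        target = Path(recovery_parent_dir) / archive.stem
        if target.exists():
            shutil.rmtree(target)  # discard the (possibly encrypted) files
        shutil.unpack_archive(str(archive), str(target))
        restored.append(target)
    return restored
```

Deleting the target directory before unpacking also removes files that the ransomware created, such as ransom notes, which plain overwriting would leave behind.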
Two obstacles must be addressed to automatically continue logging after a potential crash
of the client machine: The script client/start_logging.py must be launched
automatically with administrative privileges on client startup, and DSE must be disabled
without requiring manual intervention. The problem of automatically launching the
script client/start_logging.py with administrative privileges can be solved using the
Task Scheduler functionality in combination with the Autologon functionality provided
with Windows. To automatically log in as the privileged user, the following values were
modified in the Windows Registry Editor in the key
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon:
• AutoAdminLogon: 1
• DefaultUserName: ClientAdmin
The necessary configuration to automatically run the monitoring script with adminis-
trative privileges is documented in the script client/System/setup_logging_task.ps1,
which can be run to apply these configurations. After these two configurations, the system
automatically continues monitoring after a restart or potential crash.
To disable DSE after a potential crash, the method to disable DSE was updated: Instead
of manually disabling DSE temporarily using the startup menu, the client machine boots
from a thumb drive set up with EfiGuard [99]. EfiGuard is a portable x64 UEFI bootkit
that patches the Windows boot manager, boot loader, and kernel at boot time to disable
PatchGuard and DSE. Using EfiGuard, DSE can be disabled from the command line using
EfiDSEFix.exe and thus can be disabled without an additional restart after a potential
crash. To avoid malware modifying or encrypting the contents of EfiGuard, the thumb
drive’s access rights are set so that no user can access it.
These modifications update the assumptions of the prototype outlined in Section 6.3.
The first assumption that ransomware cannot reboot or crash the system is relaxed, as
in version V2, a reboot or crash can be handled as long as the monitoring task starts up
Unlike the version described in Chapter 6, this version of the controller must function
without any user intervention and, therefore, must detect a potential crash of the client
machine and be able to handle the client's requests again after it has recovered. To
make the controller's behavior configurable depending on the sample causing the crash,
the option on_crash was added to the file settings with the following possible
configurations:
     • retry (default): If a crash occurred while monitoring this file, try to monitor it again.
       This option is best used for samples where a crash is unexpected.
     • skip: Skip monitoring this file again in case of a crash. Using this option may result
       in an incomplete log.
     • retry_append: To avoid having to monitor the entirety of the configured time again
       while still having logs of the configured duration, this option starts appending to the
       existing log and, after the client has recovered, monitors the sample for the remaining
       time.
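The effect of the three configurations can be summarized in a small Python sketch. The function name, its parameters, and the spelling of the option values with underscores are assumptions for illustration; the sketch does not reproduce the controller's actual code.

```python
def after_crash(on_crash, configured_minutes, logged_minutes):
    """Decide how to continue after a client crash.

    Returns (minutes_to_monitor, append_to_existing_log); zero minutes
    means that the sample is not monitored again.
    """
    if on_crash == "skip":
        return 0, False
    if on_crash == "retry_append":
        remaining = max(configured_minutes - logged_minutes, 0)
        return remaining, True
    # "retry" is the default: monitor for the full duration again.
    return configured_minutes, False


# A sample configured for 10 minutes that crashed the client after 4 minutes.
print(after_crash("retry_append", configured_minutes=10, logged_minutes=4))
```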
The data for V2 was collected in two iterations, with each sample being run twice. In
the first iteration (V2), the samples were run on 20 GB of files to encrypt, while in the
second iteration (V2-1), 60 GB of files were available. V2 was run for a shorter time and
already featured the client file-recovery implementation. Because several samples crashed
the client during the first iteration, the system was modified to handle crashes as described
above. However, no crashes were observed during V2-1. Table 7.1 gives an overview of the
considered samples, their settings, and which samples crashed the client in the first run.
The collected ransomware samples are briefly introduced in the following subsections.
1. Cry
RAASNet
To avoid the required login and update some deprecated imports of RAASNet, this work
forked the repository from [101], a fork of the original, banned repository. The repository
with the mentioned changes used in this work can be found under [105]. RAASNet
demonstrates how RaaS allows non-technical users to create custom ransomware payloads,
customizing the command and control server settings, the encryption method, and the
displayed messages. All this can be done from a graphical user interface without requiring
the attacker to write or edit a single line of code. It also allows the attacker to launch a
command and control server to collect the private encryption keys. The payload can be
created in the menu depicted in Figure 7.1.
While several settings can be customized, this work edited the following settings to gen-
erate the payloads:
   • Set Target Dirs: This setting was modified to encrypt only the user’s documents
     directory.
Using PyInstaller [106], RAASNet compiles executables from the generated payload, which
can be easily distributed and executed.
RanSim
JavaRansomware
Roar
With the data collected in V2 and V2-1, this work has amassed over 68’000 entries of five-
second system call traces issued by several processes. In this subsection, the data collected
in V2 and V2-1 is referred to as data collected in V2, as no distinction is made on the
detection level. An overview of the data and how it is split into training and evaluation
sets is provided by Figure 7.2. To estimate how well the different models perform when
faced with samples not seen before, the models trained on the data collected in V1 were
used to predict the newly collected data. For each algorithm considered in V1 (NB, RF,
IForest, and LOF), the model yielding the best result, outlined in Table 6.2, was used to
predict the maliciousness of samples collected in V2. Their performance was evaluated
on the evaluation dataset of V2, such that it can be compared to models trained on data
collected in V2. The results of these predictions are outlined in Table 7.3.
[Figure 7.2 (pie chart of the data categories): Malicious Training: 5109 entries (7.5%);
Malicious Evaluation: 1264 entries (1.9%); Benign Training: 49446 entries (72.5%);
Benign Evaluation: 12375 entries (18.1%).]
 Table 7.3: F1 Scores of Best V1 Models per Algorithm Across V2 Evaluation Dataset
                     Algorithm     Vectorizer    N-Gram      F1 Score
                     NB            Frequency     5           0.15
                     RF            TF-IDF        5           0.0
                     IForest       TF-IDF        2           0.08
                     LOF           Frequency     3           0.61
LOF reached a high F1 score, considering the small number of dataset entries it was trained
on. The other models, however, did not reach an F1 score exceeding 0.15. To train models
with the new training dataset, HyperDtct was modified to handle larger datasets. In V1,
the vectorized datasets were handled as dense matrices. This worked fine, considering
the entire dataset only consisted of about 5’500 entries. However, handling more entries
becomes memory inefficient, as most values in the matrix are zero. Therefore, the features
are converted to a scikit-learn-compatible sparse matrix format when they are vectorized.
Unlike in V1, all algorithms are now provided by scikit-learn, as the considered PyOD
algorithms did not support sparse matrices. This change was implemented in the models
in the directory controller/detection/models/V2, so that the V1 models stored in the
directory controller/detection/models/V1 can still be reproduced as they were
documented in Chapter 6. Additionally, the contamination parameter of the AD models
was updated to represent the contamination factor of the newly collected data, which is
9.3 %. After these changes were implemented and the models trained on all collected
data, the F1 scores outlined in Figure 7.3 were observed across vectorizers and n-grams.
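The move to sparse matrices and the updated contamination parameter can be illustrated with a few lines of scikit-learn. This is a minimal sketch under stated assumptions, not HyperDtct's pipeline: the system call names in the toy traces are placeholders, the choice of bigrams and of n_neighbors is arbitrary, and only the entry counts are taken from Figure 7.2.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import LocalOutlierFactor

# Toy traces, one string per (process, five-second timeframe).
# The call names are placeholders and not taken from the collected logs.
traces = [
    "NtOpenFile NtReadFile NtClose",
    "NtOpenFile NtReadFile NtClose NtOpenFile",
    "NtOpenFile NtQueryInformationFile NtClose",
    "NtReadFile NtClose NtOpenFile NtReadFile",
    "NtWriteFile NtWriteFile NtWriteFile NtSetInformationFile",
    "NtOpenFile NtReadFile NtClose NtReadFile",
]

# Contamination factor computed from the entry counts of Figure 7.2.
malicious = 5109 + 1264
total = malicious + 49446 + 12375
contamination = malicious / total  # roughly 0.093

# CountVectorizer returns a SciPy sparse matrix that stores only non-zero counts.
vectorizer = CountVectorizer(ngram_range=(2, 2), lowercase=False)
X = vectorizer.fit_transform(traces)

# scikit-learn's LOF accepts the sparse matrix directly.
lof = LocalOutlierFactor(n_neighbors=2, contamination=contamination)
labels = lof.fit_predict(X)  # 1 = inlier, -1 = outlier
```

Because the matrix never has to be densified, memory use grows with the number of n-grams that actually occur in each timeframe rather than with the size of the whole vocabulary.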
Similar to the previous version, the performance of the classifier models remains better
than that of the AD models. Overall, performance scores have decreased. The best
models, outlined in Table 7.4, do not reach F1 scores as high as the ones in V1, summarized
in Table 6.2.
[Figure 7.3: F1 score per n-gram (5 to 1) and vectorizer (CountVectorizer, TfidfVectorizer)
for the classification models (NB, RF) and the anomaly detection models (LOF, IForest).]
The threshold parameter in the VT selection method was set to 0.0001. Because the
considered vectorized data is sparse, this threshold leads to a substantial reduction of
features, reducing the number of features considered across the selected feature spaces
from a range of 50’000 to 99’000 features to a range of 400 to 5’500 features. The selected
value worked well; however, no further experiments with different values were conducted.
For the RFE method, the number of features to select was set to half of the complete
feature space. Logistic regression was used as the estimator, and features were removed
in steps of 1000 features. Except for the IForest models, neither feature selection method
heavily influenced the F1 score. For feature spaces vectorized using a frequency-based
approach, the F1 score deviated little from that of models trained and evaluated on the
whole feature space for both the VT and RFE feature selection methods. Feature spaces
vectorized using TF-IDF, on the other hand, fluctuated more, with the F1 score decreasing
by up to 0.04. Feature selection methods applied to datasets for the IForest algorithm led
to high instability in F1 performance, sometimes increasing it by up to 0.08 and other
times decreasing it by up to 0.13. These feature selection techniques can thus be applied
to some, but not all, algorithms and n-gram ranges to reduce model scoring time, should
these models be used in a real-time detection scenario. Because neither higher n-grams
nor models with reduced feature spaces outperform the models in Table 7.4, applying
feature selection techniques is left to future work, and neither higher n-grams nor feature
selection is implemented in HyperDtct. The experiments and their corresponding results
regarding feature selection and higher n-grams can be found in
controller/detection/fs_experiments.ipynb.
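The two feature selection settings described above map onto scikit-learn's VarianceThreshold and RFE classes. The sketch below is an illustration on synthetic data and not the code of the referenced notebook: the data, the random seed, and the step size of 50 (instead of 1000, because the toy feature space is small) are assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_selection import RFE, VarianceThreshold
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic sparse count matrix: 200 samples, 400 features, mostly zeros.
y = rng.integers(0, 2, size=200)
dense = rng.poisson(0.05, size=(200, 400))
dense[:, :50] = 0            # 50 features that never occur (zero variance)
dense[y == 1, 50:60] += 3    # a few features that separate the classes
X = csr_matrix(dense.astype(np.float64))

# VT: drop every feature whose variance does not exceed the threshold.
vt = VarianceThreshold(threshold=0.0001)
X_vt = vt.fit_transform(X)

# RFE: keep half of the features, removing them in fixed-size steps,
# with logistic regression as the ranking estimator.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=0.5, step=50)
X_rfe = rfe.fit_transform(X, y)
```

On sparse count data most n-grams occur in very few timeframes, which is why even a small variance threshold removes a large share of the feature space.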
To improve the classifier models, it was attempted to balance the training dataset by
oversampling malicious vectors. To achieve this, two oversampling methods were con-
sidered [108]: Random Oversampling (ROS) and Synthetic Minority Oversampling Tech-
nique (SMOTE). While ROS naively duplicates and adds more samples of the underrep-
resented class, SMOTE creates synthetic minority class examples. These oversampling
techniques were applied to the V2 training dataset and evaluated using the RF and NB
classifier algorithms. With both oversampling techniques, the contamination factor of the
dataset was increased to 0.5. After fitting the classifiers on the oversampled datasets, it
was observed that the F1 score improved for models trained on 4- and 5-gram datasets,
while those trained on 1- and 2-gram datasets experienced a decrease, regardless of the
vectorizer or oversampling technique used. The observed difference in F1 score ranged
from +0.018 to -0.070. Because no model was found to beat the previous best F1 score
outlined in Table 7.4, it was decided not to implement oversampling into HyperDtct’s
training process. The experiments and results regarding oversampling can be found in
controller/detection/os_experiments.ipynb.
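To make the ROS idea concrete, the following sketch duplicates randomly chosen minority entries until both classes are equally large, which corresponds to a contamination factor of 0.5. It is a plain-Python illustration and not the code of the referenced notebook; the function name and the list-based data layout are assumptions. SMOTE would instead create new minority vectors by interpolating between a minority entry and its nearest minority neighbors.

```python
import random

def random_oversample(X, y, minority_label=1, seed=0):
    """Duplicate random minority rows until both classes have the same size."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    if not minority or len(minority) >= len(majority):
        return list(X), list(y)  # nothing to balance
    # Draw (with replacement) as many extra minority rows as are missing.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    indices = list(range(len(y))) + extra
    return [X[i] for i in indices], [y[i] for i in indices]

# One malicious and nine benign vectors become nine of each.
X_toy = [[i] for i in range(10)]
y_toy = [1] + [0] * 9
X_bal, y_bal = random_oversample(X_toy, y_toy)
contamination = sum(y_bal) / len(y_bal)
```

Since ROS only repeats existing vectors, it adds no new information about malicious behavior, which is one plausible reason why it may not improve a classifier.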
Similar to V1, hyperparameter tuning was conducted in an attempt to improve the per-
formance of the models. Because the classifier models did not reach a perfect score in
this version, their corresponding hyperparameters were also tuned. For the AD models,
the considered parameter grid remained the same as outlined in Listings 6.3 and 6.4,
the exception being the removal of the algorithm parameter from LOF, as scikit-learn
overrides this setting to ’brute’ when the model is fitted to sparse data [109]. For the NB
algorithm, only the hyperparameter alpha, an additive smoothing parameter, was
considered. Experiments considered the values 0.2, 1.0 (default), and 2.0 [110]. For the
RF algorithm, more hyperparameters were considered, as outlined in Listing 7.3.
n_estimators denotes the number of trees in the forest, criterion is the function to measure
the quality of a split and thus determines how a tree grows, and max_features determines
how the number of features considered when looking for the best split is calculated [111].
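A search over the alpha values named above could look as follows. This is a hedged sketch: the text does not state which search utility or which NB variant was used, so GridSearchCV, MultinomialNB, the five-fold cross-validation, and the synthetic data are all assumptions. Only the three alpha values come from the text.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

# Synthetic stand-in for vectorized system call counts (not thesis data).
y = np.array([0] * 80 + [1] * 20)
counts = rng.poisson(1.0, size=(100, 30))
counts[y == 1, :5] += 4      # malicious rows use some n-grams more often
X = csr_matrix(counts)

# The alpha values named in the text; 1.0 is the scikit-learn default.
param_grid = {"alpha": [0.2, 1.0, 2.0]}

# Score every candidate with the F1 metric used throughout the evaluation.
search = GridSearchCV(MultinomialNB(), param_grid, scoring="f1", cv=5)
search.fit(X, y)
best_alpha = search.best_params_["alpha"]
```

The same pattern would apply to a random forest by swapping the estimator and using a grid over n_estimators, criterion, and max_features.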
7.3. V3: REAL-WORLD RANSOMWARE EVALUATION                                                  57
The best configurations and the corresponding F1 scores achieved in these experiments
are outlined in Table 7.5. For all algorithms except NB, better scores were achieved than
without hyperparameter tuning (cf. Table 7.4). Therefore, the improved hyperparameters
were configured in the algorithms contained in the directory
controller/detection/models/V2.
Until this point, the evaluated ransomware consists only of samples created for academic
purposes. To assess whether HyperDtct is also able to handle samples from the real world,
the ransomware Babuk, responsible for the attack on the D.C. Metropolitan Police
Department, and LockBit Black, a modern descendant of the ransomware responsible for the
attack on Colonial Pipeline Co., are considered in this version of the work. Additionally,
to verify that the models have not been overfitted to classify all system-call-intensive tasks
as malicious, the file restoration process implemented in V2 (i.e., a benign workload) is
monitored as a sample. After a short introduction to the ransomware, the steps to monitor
the samples are described, and the performance of the models is evaluated.
7.3.1 Babuk
A sample of the ransomware was fetched from [112]. When Babuk is executed, it first
terminates a hard-coded list of processes, including various backup solutions. Then, it at-
tempts to delete Windows Shadow Copies before encrypting files. Babuk spawns multiple
threads and potentially does not encrypt all files to speed up this process. For large files,
it avoids encrypting the entire file and instead only encrypts parts, rendering it unusable
[113].
7.3.2 LockBit
With this sample [114], a very recent ransomware sample of LockBit 3.0, also called
LockBit Black, is considered. It is called LockBit Black because it is a successor of
BlackMatter, which came from the Darkside ransomware family [115] responsible for the
Colonial Pipeline Co. attack. LockBit is widely recognized as the world’s most prolific
and harmful ransomware, responsible for causing billions of euros in damage. LockBit is
offered as a RaaS solution, and version 3.0 encrypts a victim’s files and threatens to publish
them. More recent samples remain available despite a series of arrests and infrastructure
takeovers disrupting the group’s operations in early 2024 [116]. To avoid detection and
analysis, LockBit makes use of code obfuscation and anti-debugging methods [115] and
attempts to escalate its privileges [114].
When attempting to execute Babuk, it was found that it did not encrypt files without
administrative privileges. Therefore, Babuk was executed as the administrative user.
Additionally, it was found that the client machine crashed when more than two drives
were attached while Babuk was running. Therefore, the drive containing the file backups
was removed. After these changes, Babuk ran successfully. The LockBit sample, on the
other hand, did not need to be run as administrator. However, the monitoring process
had to be restarted several times before the sample ran successfully, as a memory
corruption crashed the client. Although HyperDtct can successfully run and monitor the
prolific samples Babuk and LockBit, the assumptions made in V2 did not hold.
The monitoring process requires manual intervention because the system recovery method
implemented in V2 is insufficient to restore the system. During V3, the system recovery
was conducted using Clonezilla [117], and the samples were run manually without relying
on automatic restoration. For completeness, /controller/input/V3/settings.json was
added, documenting the settings with which the samples were run.
To evaluate whether models fitted on data collected in V2 are able to detect LockBit and
Babuk, the best-performing V2 models are used to predict the newly collected samples.
The results of the models evaluated on the Babuk dataset are outlined in Table 7.6,
while the performance on LockBit is displayed in Table 7.7. Both tables display F1 score,
precision, and recall, as well as the detection time, which corresponds to the first
five-second timeframe of the ransomware that was flagged as malicious.
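The detection time defined above can be derived from the per-timeframe predictions of a model. The sketch below assumes that the predictions are ordered by time, that 1 marks a timeframe flagged as malicious, and that the detection time is counted up to the end of the first flagged timeframe; the function name is hypothetical.

```python
def detection_time(predictions, timeframe_seconds=5):
    """Return the seconds elapsed at the end of the first flagged timeframe.

    `predictions` holds one label per consecutive timeframe of the ransomware
    process, ordered by time, with 1 meaning "flagged as malicious".
    Returns None if no timeframe was flagged.
    """
    for index, label in enumerate(predictions):
        if label == 1:
            return (index + 1) * timeframe_seconds
    return None
```

Under these assumptions, a model that misses the first timeframe but flags the second one has a detection time of ten seconds.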
Although the models have been fitted on neither Babuk nor LockBit, they achieve high
performance in detecting these samples. Notably, the RF algorithm achieves high results
in both instances. All models detect these particular ransomware samples within five to
ten seconds and achieve a high recall metric. However, when evaluated on the benign file
restoration sample, all models falsely flagged most timeframes in which 7-zip issued
system calls as malicious. These results are outlined in Table 7.8, which uses accuracy as
the performance metric, as the dataset contains no malicious entries. This indicates that the
models considered at this stage are overfitted, flagging all system-call-intensive processes
as malicious. When evaluating these three datasets combined, the models outlined in
Table 7.9 achieve the best performance per algorithm. Notably, all these models achieve a
high recall metric while suffering performance loss in precision by flagging benign samples
as malicious.
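The relation between the reported metrics, and the reason accuracy is used for the benign-only dataset, can be checked with a few lines of Python. The helper names and the toy labels below are illustrative assumptions; the precision and recall values in the last lines are those reported for NB and RF in Table 7.9.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 with 1 as the malicious class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)

def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Benign-only data: there are no true positives, so F1 carries no information,
# while accuracy still shows how many timeframes were flagged falsely.
benign_true = [0] * 10
benign_pred = [1] * 7 + [0] * 3
benign_f1 = precision_recall_f1(benign_true, benign_pred)[2]
benign_accuracy = accuracy(benign_true, benign_pred)

# Consistency check against the rounded values reported in Table 7.9.
nb_f1 = f1_from(0.57, 0.99)
rf_f1 = f1_from(0.63, 0.97)
```

The harmonic mean also explains why a recall of 0.99 cannot compensate for a precision of 0.57: F1 is pulled toward the lower of the two values.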
7.3.5 V3 Models
Because of the low precision achieved by the V2 models in Table 7.9, newer models
should consider more system-call-intensive benign behavior during training. Therefore,
     Table 7.9: Best Results of V2 Models per Algorithm on the Combined Dataset
            Algorithm   Vectorizer      N-Gram      F1 Score    Precision   Recall
            NB          Frequency       2           0.72        0.57        0.99
            RF          Frequency       2           0.76        0.63        0.97
            IForest     Frequency       1           0.40        0.25        0.99
            LOF         TF-IDF          3           0.30        0.19        0.84
While these samples are collected as logs to the directory controller/logs/V3, the logs
of Babuk, LockBit, and file recovery are moved to controller/logs/V3-Eval, which is
excluded from training. After training the models on the additionally collected
system-call-intensive benign samples and re-evaluating them on the combination of the
Babuk, LockBit, and file recovery datasets, as in Table 7.9, an increase in performance
can be observed in Table 7.10. With no loss in recall, the precision increased for most
models, indicating that the additional training data led to the models flagging fewer benign
timeframes as malicious. The best-performing vectorizer and n-gram remained largely
the same as in Table 7.9, except for the RF model, where TF-IDF was found to perform
better, and the LOF model, where the n-gram decreased from three to two.
     Table 7.10: Best Results of V3 Models per Algorithm on the Combined Dataset
            Algorithm   Vectorizer      N-Gram      F1 Score    Precision   Recall
            NB          Frequency       2           0.74        0.59        0.99
            RF          TF-IDF          2           0.80        0.67        0.97
            IForest     Frequency       1           0.42        0.27        0.99
            LOF         TF-IDF          2           0.31        0.19        0.99
V3-1
Because collecting and fitting more system-call-intensive benign behavior has led to
increased performance, oversampling the three collected samples was considered in
iteration V3-1. Therefore, the directory controller/logs/V3 was duplicated to
controller/logs/V3-1, naively oversampling the system-call-intensive benign behavior.
However, because the duplicated entries keep their original timestamps, the timeframes of
V3 end up containing twice as many system calls, instead of the three collected samples
being oversampled. Despite this mistake, models trained on these logs featured increased
precision, as outlined in Table 7.11. No improvement was noticed, however, when this
naive oversampling method was implemented correctly by moving the timestamps of the
copies into the future; therefore, the oversampled logs were removed. To repeat the
experiment, the code used for correctly copying and modifying the logs can be found in
controller/detection/v3_model_performance.ipynb. Still, the improved results outlined
in Table 7.11, achieved by mistake, indicate that more system-call-intensive benign
samples would benefit the models, so further experiments were conducted.
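The difference between the accidental and the intended oversampling in V3-1 comes down to how log entries are grouped into timeframes. The following sketch reproduces both cases on three illustrative log entries; the tuple layout, the PID, and the system call names are assumptions, and the grouping is a simplified stand-in for the preprocessing step.

```python
from collections import Counter
from datetime import datetime, timedelta

# Each log entry: (pid, timeframe start, system call). Values are illustrative.
t0 = datetime(2024, 5, 1, 12, 0, 0)
log = [
    (42, t0, "NtReadFile"),
    (42, t0, "NtWriteFile"),
    (42, t0 + timedelta(seconds=5), "NtClose"),
]

def calls_per_timeframe(entries):
    """Count system calls per (pid, timeframe), mimicking the grouping step."""
    return Counter((pid, ts) for pid, ts, _ in entries)

# Plain duplication: the copies share the original timestamps, so every
# existing timeframe simply contains twice as many system calls.
duplicated = calls_per_timeframe(log + log)

# Shifted duplication: the copies land in new timeframes one day later,
# which doubles the number of timeframes and leaves their size unchanged.
shifted_copy = [(pid, ts + timedelta(days=1), call) for pid, ts, call in log]
shifted = calls_per_timeframe(log + shifted_copy)
```

In the first case the models see the same number of training vectors with inflated counts, whereas in the second case they see each original vector twice.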
     Table 7.11: Best Results of V3-1 Models per Algorithm on the Combined Dataset
            Algorithm   Vectorizer      N-Gram      F1 Score    Precision   Recall
            NB          Frequency       2           0.75        0.60        0.99
            RF          TF-IDF          2           0.82        0.71        0.97
            IForest     Frequency       1           0.46        0.30        0.99
            LOF         TF-IDF          2           0.32        0.19        0.99
V3-2
In another attempt to improve the models with oversampling, this version trains them
on selectively oversampled data. Processes that issue more than 40 system calls in most
timeframes in the additional samples collected in V3 are oversampled. This approach is
further described by Listing 7.4. After training the models on
the oversampled logs, the models’ performance did not increase more than one percentage
point, achieving similar results as the failed oversampling approach of V3-1. Therefore,
the data created in this version is also not considered for training in future versions. To
allow reproduction of this experiment, the full version of the algorithm outlined in Listing
7.4 can be found in controller/detection/v3_model_performance.ipynb.
               Listing 7.4: Oversampling System-Call-Intensive Processes
  import os
  from datetime import datetime, timedelta

  # read_logs_from_dir and Preprocessor are HyperDtct helpers; the full version
  # of this excerpt is in controller/detection/v3_model_performance.ipynb.

  # Load V3 samples, group by PID and timestamp
  log_dir = os.path.join(os.getcwd(), '../logs/V3')
  df = read_logs_from_dir(log_dir)
  prep_df = Preprocessor.get(version=2).group_by_pid_and_timestamp(df)

  # Get the five PIDs issuing the most timeframes with more than 40 system calls
  filtered_df = prep_df[prep_df['syscall'].apply(lambda x: len(x) > 40)]
  syscall_intense_pids = list(filtered_df.value_counts('pid').head(5).index)

  # Each time the df is stacked, shift the timestamps by another day so that the
  # copies are not grouped into the original timeframes
  stacked_dfs = []
  for i in range(1, 4):
      new_df = df.copy()
      new_date = datetime.now().date() + timedelta(days=i)
      new_df['timestamp'] = new_df['timestamp'].apply(
          lambda x: x.replace(year=new_date.year, month=new_date.month, day=new_date.day))
      stacked_dfs.append(new_df)
V3-3
Finally, after the previous partly successful improvements in the classification of un-
seen data, this unseen data is included in the training set. Therefore, the models are
also trained on Babuk, LockBit, and file restoration samples by copying the logs in
controller/logs/V3-Eval to controller/logs/V3-3, which is included in the training.
This leads to the results in Table 7.12, depicting the performance of the best models per
algorithm when trained and evaluated on all samples collected in this work.
In summary, this thesis explored three research directions, identified based on related
work. First, it proposed a novel way to extract system calls for a ransomware detection
system. Secondly, it investigated the implementation or modification of a custom hyper-
visor for a ransomware detection system. Thirdly, it explored those aspects in the context
of AD models to detect ransomware.
The results for these algorithms, vectorizers, and n-grams were evaluated across eleven
ransomware and thirteen benign samples in three iterative development cycles, referred
to as V1, V2, and V3. The key findings of these evaluations were the following. i)
HyperDtct’s best model (RF) achieved an F1 score of 0.97, showing that HyperDbg is
a viable alternative to related work to extract system calls for ransomware detection.
ii) Classifier models outperform the AD models in this work. While the best classifier
model (RF) reached an F1 score of 0.97, the best AD model (IForest) only reached an F1
score of 0.49 in the final evaluation. The RF algorithm achieved the best performance,
similar to related work. iii) The RF model also performed well when used on completely
unseen samples, with an F1 score of 0.80 achieved when evaluated on the unseen behaviors
obtained from Babuk, LockBit, and file restoration. iv) RF detected the unseen Babuk
and LockBit samples in five to ten seconds. Other models were even faster but achieved
a lower precision than RF. v) While related literature found ranges of n-gram from six to
seven most effective (cf. Chapter 5), HyperDtct performed best on ranges from one to four,
with the best model (RF) performing best on a frequency-based one-gram dataset. vi)
Using hyperparameter tuning improved model performances in V1 and V2. The optimal
parameters found for the RF algorithm in V2 are ’n_estimators’: 100, ’criterion’: ’log_loss’
and ’max_features’: ’log2’. vii) Trying larger values for n-grams in combination with
feature elimination did not achieve better results in V2. viii) While oversampling malicious
64                                   CHAPTER 8. SUMMARY AND FUTURE WORK
entries in V2 and benign entries in V3 did not improve model performance, accidentally
duplicating the number of system calls issued in benign timeframes improved performance
in V3.
Based on these results, this thesis contributes to the defined research directions. The first
and second directions were addressed by using HyperDbg. The evaluations demonstrated
that HyperDbg is an effective tool for capturing system calls using a custom hypervisor,
as evidenced by the high performance of the RF model using the extracted logs. The
third direction, using AD models to detect ransomware, was explored but found to be
less effective than classifier models. However, this work provides various experiments and
evaluations that can serve as a foundation for future work on improving AD models in
the context of system-call-based ransomware detection using HyperDbg.
66                                                                   BIBLIOGRAPHY
AD        Anomaly Detection
API       Application Programming Interface
AUC       Area Under the ROC Curve
CNN       Convolutional Neural Network
DLL       Dynamic Link Library
DNN       Deep Neural Network
Dom0      Control Domain (Xen)
DomU      Unprivileged Domain (Xen)
DSE       Driver Signature Enforcement
FN        False Negative
FP        False Positive
FPR       False Positive Rate
GB        Gigabyte
IForest   Isolation Forest
IPT       Intel Processor Trace
KNN       K-Nearest Neighbors
KVM       Kernel-Based Virtual Machine
LBA       Logical Block Access
LOF       Local Outlier Factor
ML        Machine Learning
NB        Naive Bayes
OS        Operating System
PID       Process Identifier
RaaS      Ransomware-as-a-Service
RF        Random Forest
RFE       Recursive Feature Elimination
RL        Reinforcement Learning
ROAR      Ransomware Optimized with AI for Resource-constrained devices
ROC       Receiver Operating Characteristic
ROS       Random Oversampling
SMOTE     Synthetic Minority Oversampling Technique
SVM       Support Vector Machine
TF-IDF    Term Frequency-Inverse Document Frequency
TID       Thread ID
TN        True Negative
TP        True Positive
76                                    ABBREVIATIONS
78   LIST OF FIGURES
List of Tables
7.11 Best Results of V3-1 Models per Algorithm on the Combined Dataset
Appendix A
Installation Guidelines
A.1 Controller
The installation guidelines for the controller machine are written for a Raspberry Pi 3
Model B+ with Ubuntu 22.04.4 LTS installed. Nevertheless, these guidelines should work
for an arbitrary Ubuntu system. It is assumed that a user with administrative privileges
is already set up, Git [118] and Python [119] are installed, and the system is connected to
the internet.
An unprivileged user named "logger" is created, and the repository is cloned. This process
is outlined in Listing A.1.
               Listing A.1: Create Logger-User and Download Repository
  # Create the logger user and clone the repository
  sudo -s
  adduser logger
  cd /home/logger/
  git clone https://github.com/Cyber-Tracer/HyperDtct.git
To configure the controller’s network and firewall, execute the commands outlined in
Listing A.2.
                      Listing A.2: Setup the Controller’s Network
  # Configure the network of the controller
  cd /home/logger/HyperDtct/controller/setup
  chmod u+x ./setup_network.sh
  chmod u+x ./setup_ufw.sh
  ./setup_network.sh
  ./setup_ufw.sh
82                                         APPENDIX A. INSTALLATION GUIDELINES
A.2 Client
The client machine must be connected to the internet and use a CPU supported by
HyperDbg. This work used an Intel i7-6700K CPU. Install Git [118], Python [119], and
7-zip [98] for Windows (skip the 7-zip installation if V1 data is collected, as it is installed
as part of the monitoring process in that version) and download HyperDtct according to
Listing A.4.
After setting up Windows with an administrative user, HyperDbg must be installed
according to the guide [89]. Move the latest release’s directory to C:\HyperDtct\HyperDbg.
Then, download the latest release of EfiGuard from [99], and follow the instructions
provided by [99] to set up a bootable loader thumb drive. Move EfiDSEFix.exe to
C:\HyperDtct\client\System\EfiDSEFix.exe. Configure the BIOS to boot from the thumb
drive first.
To add the client user, execute the commands outlined in Listing A.6. PASSWORD is
the password of the newly created client user. To allow executables to run with standard
privileges, enter and store the client user’s credentials.
                           Listing A.6: Store Client Credentials
  net user Client PASSWORD /add
  py C:\HyperDtct\client\System\runas.py store_creds
To allow monitoring samples over an extended period, set the screen saver and standby
timeout to never by running client/System/setup/setup_power_settings.ps1. To al-
low malicious samples to run uninterrupted for monitoring purposes, disable Microsoft De-
fender by running client/System/setup/disable_ms_defender.ps1 in an elevated shell.
The client is then populated with documents, fetched from [88], and a backup of these
files is stored on the second drive. Additionally, the backup drive and the directory of
HyperDtct are made inaccessible for the client user. These steps are outlined in Listing
A.7.
      Listing A.7: Populate Client with Documents and Configure Access Rights
  rem Populate Client user with documents
  py C:\HyperDtct\client\System\setup\download_govdocs.py 0 100 C:\Users\Client\Documents
A.3 Detection
• --log_dir: Directory where the logs for the different versions are stored
Contents of the CD
4. HyperDtct’s source code, including serialized models considered during the thesis.