Solid State Technology
Volume: 63 Issue: 6
Publication Year: 2020
Effectiveness of the Static Analysis using PE files for
Malware Detection
Amit Parmar1, Dr. Keyur N Brahmbhatt2
1 Ph.D. Research Scholar, Gujarat Technological University, Ahmedabad
2 Associate Professor, Information Technology Department, Birla Vishvakarma Mahavidyalaya, Vallabh Vidyanagar
Email: [email protected],
[email protected]
Abstract- Computer malware is a very widespread problem and one of the most serious threats to
computer systems. This paper examines the effectiveness and significance of static analysis for malware
detection. The study reviews malware detection using static analysis, dynamic analysis and hybrid
analysis. A survey table lists, for each surveyed work, the malware and benign samples used and the
reported accuracy rate. The survey covers techniques such as N-gram analysis, opcode analysis, byte
sequencing and PE analysis. The survey table is intended as a quick reference for researchers working on
distinguishing malware from benign files. The study compares the techniques, with supporting tables and
figures, and suggests the best method.
Keywords- Malware detection, Static analysis, Dynamic analysis, Machine learning, PE files
I. INTRODUCTION
Effective identification of unknown malware is one of the major challenges in cyber security today.
Malware can harm a computer-based system and expose data or confidential information to unauthorized
parties or hackers who can misuse it. Opcode sequences have been used to construct a matrix, using
different high-level architectures, for identifying and mitigating known malware [1]. High-dimensional
malware features can also be obtained from PE files: the static characteristics of malware can be
extracted from 32-bit as well as 64-bit Windows Portable Executable (PE) files and combined with opcode
sequence analysis. When malicious software is analysed without being executed, the process is known as
static analysis. It relies on detection patterns such as byte sequences, control flow graphs, opcode
frequency distributions, n-grams and string signatures. Dynamic analysis, on the other hand, executes
the program in a controlled and safe environment so that its running state can be observed accurately,
which helps against malware obfuscation techniques. The behavioural characteristics of the program are
revealed in this way and can be used to detect malware.
Dynamic analysis is useful not only for debugging a running program but also for tracking and recording
its activities. Previous studies have reported that dynamic analysis is more effective than static
analysis at identifying malware.
This survey paper sheds light on techniques for detecting malware using different analysis tools. We
discuss the factors that matter for identifying unknown malware and minimizing the negative impact of
attacks. Static analysis, static techniques and byte sequencing are discussed at length to define the
processes through which malware can be detected. Diagrams and tables are provided in support.
Previous studies have shown that n-gram analysis and PE file features are effective for identifying
unknown malware based on frequency scores. Previous studies have also reported that dynamic analysis is
more effective than static analysis. This study compares the different methods as a step towards an
incremental malware detection system. The malware detection process can be divided into two parts,
static analysis and dynamic analysis. The contributions of this paper are: (1) a review of static
analysis and its effectiveness for malware detection; (2) a review of dynamic analysis for detecting
unknown malware; (3) an evaluation of n-gram analysis; (4) a discussion of byte sequencing and opcode
analysis; and (5) a comparison of the techniques, supported by tables and figures, with a suggestion of
the best method.
II. RELATED WORK
A. Static analysis
When malware is detected without being executed, the process is known as static analysis. Static
analysis follows detection patterns that include byte-sequence n-grams, control flow graphs, opcode
frequencies, string signatures and more. It helps to classify a file as malicious or benign without
going through an execution phase [1]. Previous research has found dynamic malware detection to be more
effective than static analysis, and static analysis is undecidable in general. Even so, static analysis
is a significant protective layer in malware defence, because files that are detected can be mitigated
before they execute.
Fig. 1. Traditional vs. advanced malwares
B. Static Technique
In this technique, the executable file must be unpacked and decrypted before static analysis can be
completed. Debugger and memory dumper tools are usually used for reversing compiled Windows executables.
For example, as shown in Fig. 1, IDA Pro and OllyDbg are two major debugger tools that display the code
of the malware as Intel x86 assembly instructions. This helps to identify the inner activities and
patterns of the malware. LordPE and OllyDump are memory dumper tools that obtain protected code located
in system memory and dump it to a file. This technique helps in analysing packed executables that were
previously difficult to disassemble.
C. Dynamic analysis
Dynamic analysis identifies and analyses the internal behaviour of malicious code through its
interaction with the affected system. The malicious code is executed in a controlled environment such as
a simulator, sandbox, virtual machine or emulator. As shown in Table I, monitoring tools are installed
and activated beforehand for specific purposes.
TABLE I. Monitoring Tools And Their Purposes [4]
Tools | Purposes
Process monitor | File system
Capture BAT | Registry monitoring
Process explorer | Process monitoring
Process Hacker Replace | Process monitoring
Wireshark | Network monitoring
Regshot | System change detection
Dynamic analysis can be performed using techniques such as “function call monitoring, instruction
traces, function parameter analysis, autostart extensibility points, information flow tracking and
others”. Previous studies have reported that dynamic analysis is more effective than static analysis
for identifying malware because it does not require the executable to be disassembled [3]. It discloses
the natural behaviour of the malware, which resists static analysis. The major disadvantage of dynamic
analysis is that it consumes time and resources. In addition, the malware executes in a virtual
environment that differs from the real world, so it may behave differently in a real deployment. Some
malware behaviour is also triggered only under specific conditions and may therefore go undetected in
the virtual environment. Norman Sandbox, Anubis and ThreatExpert are online tools that perform dynamic
malware analysis.
III. MACHINE LEARNING CLASSIFICATION FOR MALWARES
A. Byte sequencing
Previous studies have evaluated a significant number of machine learning approaches for classifying
malware. As shown in Fig. 2, association rules, decision trees, Naive Bayes and SVM (Support Vector
Machine) are some of these approaches. Portable executable features, byte sequences and strings were
among the first static features used to detect malware. PE features are derived from the DLL functions
used by the file, while string features are the text strings encoded in the program file and extracted
from the executable. The byte-sequence approach extracts sequences of n bytes from an executable file.
Many researchers have surveyed this area and have successfully detected malware using techniques
including byte sequences, decision tree analysis, n-gram analysis and others. Table IV lists the
techniques, the malware and benign datasets used, and the reported accuracy rates.
B. N-gram analysis in Byte sequencing
Previous studies have found data mining approaches helpful for identifying computer malware. In this
technique, a set of malware and benign programs serves as the training set for building a
classification model that separates malware from benign programs. N-grams of byte sequences are the
basic features that the classifiers use to identify malware. N-grams are overlapping substrings of the
byte sequence [5]. The value of n sets the length of each substring, and a larger n captures the
frequencies of longer substrings. N-grams can be extracted from the executable sections of malware and
benign PE files. For example, Table II shows the n-grams of the pattern C3908D7426005589 [5].
N-grams for different values of n = 1, 2, 3, 4, ..., 8.
TABLE II. N-GRAMS FOR DIFFERENT VALUES (1 TO 8) [5]
1-gram | 2-gram | 3-gram | 4-gram
C3 | C390 | C3908D | C3908D74
90 | 908D | 908D74 | 908D7426
8D | 8D74 | 8D7426 | 8D742600
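The overlapping n-gram extraction described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in [5]; the function names `byte_ngrams` and `ngram_counts` are hypothetical, and the input is the example pattern C3908D7426005589 from the text.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int) -> list[str]:
    """Return every overlapping n-byte window of `data` as an uppercase hex string."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sliding window of width n, advanced one byte at a time.
    return [data[i:i + n].hex().upper() for i in range(len(data) - n + 1)]


def ngram_counts(data: bytes, n: int) -> Counter:
    """Count how often each n-gram occurs (a simple frequency feature)."""
    return Counter(byte_ngrams(data, n))


# Example pattern taken from the text.
pattern = bytes.fromhex("C3908D7426005589")
for n in (1, 2, 3, 4):
    print(n, byte_ngrams(pattern, n)[:3])
```

For an 8-byte pattern there are 9 - n n-grams for each n from 1 to 8, so the 8-gram is the pattern itself. In practice the counts from `ngram_counts` over many files would form the feature vectors passed to a classifier.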
Table III shows n-grams for different values of n (n = 1, 2, 3, 4).
TABLE III. N-GRAMS FOR DIFFERENT VALUES (1 TO 4) [5]
1-gram | 2-gram | 3-gram | 4-gram
C9 | 555B | 85C074 | 85C08945
15 | 58EC | 85C075 | 85C07410
DB | 45EC | 8B4508 | 83F8FF74
85 | 8D45 | 85C00F | 83F8FF89
Fig. 2. Taxonomy for malware detection [3]
TABLE IV. The Survey Of Related Work
Paper | Technique | Classification methods | Malware type | Dataset | Size of data | Accuracy
[1] | Minimal contrast frequent pattern mining | Minimal Contrast Frequent Subgraph Miner (MCFS) | Exploit-based worm, Mass-mailing worm, IRC worm, Trojan | Malware: VX Heavens, Open malware, Malware Domain List; Benign: Windows samples | Malware: 1,083; Benign: 1,000; Total: 2,083 | 92%
[2] | Opcode sequence | Decision tree, k-NN, Bayesian network, random forest and SVM | SDBot, Alamar, Bropia, Snowdoor, JSGen, Kelvir, Skydance, Caznova, Sonic, Redhack and Theef | Malware: Heavens; Benign: Own machine | Malware: 1,000; Benign: 1,000; Total: 2,000 | 94.83
[3] | Mining format information | J48, Adaboost, Bagging, Random Forest | Backdoor, Constructor, Virtool, DoS, Nuker, Flooder, Exploit, Hacktool, Worm, Trojan, Virus | Malware: VX Heaven; Benign: Windows folder and Program Files folder and other legitimate software | Malware: 10,521; Benign: 8,592; Total: 19,113 | 91.1%
[9] | Behavior of malware in n-gram | Elastic-Net, L1 regularized, Logistic Regression and the novel Multi-Byte | Unspecified | Malware: VirusShare; Benign: Cygwin, Windows XP, 7 and 8 | Malware: 2,722; Benign: 2,488; Total: 5,210 | 98.4%
[4] | PE header information | Authors' own algorithm | Unspecified | Malware: Unspecified; Benign: Collected from downloads.com and Softpedia | Malware: 5,598; Benign: 1,237; Total: 6,875 | 99.5%
[7] | PE header information (raw and derived values) | Decision Tree, Random Forest, kNN, Logistic Regression, Linear Discriminant Analysis and Naive Bayes | Virus, Worm, Trojan, Bot, Spyware, Downloader and Backdoor | Malware: Open malware repository, VirusShare; Benign: Windows XP and 7 files | Malware: 2,722; Benign: 2,488; Total: 5,210 | 98.4%
[4] | Variable length instruction sequence | Decision tree and random forest | Worm | Malware: unspecified; Benign: unspecified | Malware: 1444; Benign: 1330; Total: 2774 | 90%
[4] | PE approach, strings and byte sequence | Naïve Bayes | Unspecified | Malware: VX Heavens; Benign: unspecified | Malware: 3265; Benign: 1001; Total: 4266 | 97.11%
[5] | n-gram model in byte sequence | Naïve Bayes, Instance based learner, J48 and AdaBoost1 | Viruses, worms, Trojan, backdoors | Malware: VX Heavens; Benign: Windows Executables | Malware: 1018; Benign: 1120; Total: 2138 | 99%
C. Opcode analysis
An opcode is the part of a machine language instruction that identifies the operation to be executed.
An instruction is a pair composed of an operation code and its operands. For example, in the x86
instruction set, opcodes appear as assembly mnemonics such as PUSH, ADD, MOV and POP. Previous research
has stated that opcode analysis is significant for distinguishing malicious software from benign
software, and that the opcode distribution is an effective predictor of whether an executable is
malicious. As shown in Table V, the executable files are first disassembled, and an opcode profile is
then built that contains a list of opcode substrings together with the computed term frequency (TF) of
every opcode [4]. A graph or matrix can also be used to represent pairs of consecutive opcodes in the
opcode sequences. There are more than 1000 varieties of opcodes, so it is difficult to identify the most
effective opcode subset. According to the research, TF-IDF is an effective method for selecting opcodes
for malware detection.
TF-IDF stands for term frequency-inverse document frequency, and the TF-IDF weight is often used in
information retrieval and text mining [6]. The intuition is that a word occurring many times in a
document is important to that document. Accordingly, we use TF to compute the frequency of occurrence
of each opcode sequence within a file:
$TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$ (1)
Where, $n_{i,j}$ is the number of times opcode $i$ appears in program file $j$
$\sum_{k} n_{k,j}$ = total number of terms in the whole program file $j$
Then we use the IDF to measure how much information each opcode provides:
$IDF_{i} = \log \frac{|D|}{|\{j : n_{i,j} > 0\}|}$ (2)
Where, $|D|$ = total number of files
$|\{j : n_{i,j} > 0\}|$ = number of documents where opcode $i$ appears
Thereafter, the calculation of the TF-IDF is as follows,
$TF\text{-}IDF_{i,j} = TF_{i,j} \times IDF_{i}$ (3)
Summing TF-IDF of each opcode,
(4)
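As a worked illustration of the TF, IDF and TF-IDF definitions in (1) to (3), the following Python sketch computes the weights for a toy corpus of opcode sequences. It is a minimal sketch under stated assumptions: the function names and the two example files are hypothetical, and the natural logarithm is assumed because the text does not state the base.

```python
import math
from collections import Counter


def term_frequency(opcodes):
    """TF as in (1): count of each opcode divided by the total number of terms in the file."""
    counts = Counter(opcodes)
    total = len(opcodes)
    return {op: count / total for op, count in counts.items()}


def inverse_document_frequency(files):
    """IDF as in (2): log(|D| / number of files containing the opcode). Natural log is assumed."""
    n_files = len(files)
    # set(f) makes each file contribute at most once per opcode.
    doc_freq = Counter(op for f in files for op in set(f))
    return {op: math.log(n_files / df) for op, df in doc_freq.items()}


def tf_idf(files):
    """TF-IDF as in (3): one {opcode: weight} dictionary per file."""
    idf = inverse_document_frequency(files)
    return [{op: tf * idf[op] for op, tf in term_frequency(f).items()} for f in files]


# Hypothetical toy corpus: each inner list is the opcode sequence of one program file.
files = [["MOV", "PUSH", "MOV", "CALL"], ["PUSH", "POP", "RET"]]
for weights in tf_idf(files):
    print(weights)
```

In this toy corpus PUSH appears in every file, so its IDF and therefore its TF-IDF weight is zero, while MOV, which appears twice in the first file only, receives the largest weight there.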
TABLE V. Extracted opcodes and operands from disassembled PE files [6]
Line | Opcode | Operand
1 | MOV | ESI, ECX
2 | CALL | 0x4012c0
3 | TEST | BYTE [ESP+0x8], 0x1
4 | JZ | 0x401168
5 | PUSH | ESI
6 | CALL | 0x6a16c8
7 | ADD | ESP, 0x4
8 | MOV | EAX, ESI
9 | POP | ESI
10 | RET | 0x4
11 | INT | 3
12 | INT | 3
13 | PUSH | -0x1
14 | MOV | EAX, [FS:0x0]
15 | PUSH | DWORD 0x6d4bf7
16 | PUSH | EAX
17 | MOV | [FS:0x0], ESP
18 | PUSH | EBX
19 | PUSH | EBP
20 | PUSH | ESI
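The opcode profile and the consecutive-opcode pairs described in this section can be sketched as follows, using the twenty instructions of Table V as input. This is an illustrative sketch only: the names `LISTING`, `opcode_sequence` and `opcode_profile` are hypothetical, and the parsing assumes each line starts with the opcode mnemonic followed by its operands.

```python
from collections import Counter

# The disassembly of Table V, one "OPCODE operands" string per instruction.
LISTING = [
    "MOV ESI, ECX", "CALL 0x4012c0", "TEST BYTE [ESP+0x8], 0x1", "JZ 0x401168",
    "PUSH ESI", "CALL 0x6a16c8", "ADD ESP, 0x4", "MOV EAX, ESI", "POP ESI",
    "RET 0x4", "INT 3", "INT 3", "PUSH -0x1", "MOV EAX, [FS:0x0]",
    "PUSH DWORD 0x6d4bf7", "PUSH EAX", "MOV [FS:0x0], ESP", "PUSH EBX",
    "PUSH EBP", "PUSH ESI",
]


def opcode_sequence(lines):
    """Keep only the mnemonic (first token) of each non-empty instruction line."""
    return [line.split(None, 1)[0].upper() for line in lines if line.strip()]


def opcode_profile(lines):
    """Return (opcode counts, counts of consecutive opcode pairs)."""
    seq = opcode_sequence(lines)
    # zip(seq, seq[1:]) pairs every opcode with the one that follows it.
    return Counter(seq), Counter(zip(seq, seq[1:]))


freq, pairs = opcode_profile(LISTING)
print(freq.most_common(3))
print(pairs.most_common(3))
```

Dividing each count in `freq` by the length of the sequence gives the term frequency of every opcode, and the pair counts can be arranged as the matrix of consecutive opcodes mentioned in the text.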
D. PE header
The PE header is a collection of metadata that describes a PE (Portable Executable) file. Previous
research has shown that basic features can be extracted from the PE32 header, such as “size of header,
size of stack reserve as well as size of uninitialized data” [6]. For example, researchers have used
decision tree analysis on the structural information of the PE header to distinguish malware from
benign software. Along with 125 header characteristics, 29 section characteristics as well as 31
section characteristics have been used.
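To make the header features quoted above concrete, the following Python sketch reads three of them (size of headers, size of stack reserve and size of uninitialized data) from a PE32 file using only the standard `struct` module. It is a minimal sketch, not the feature extractor of any surveyed work: the function name `parse_pe32_header` is hypothetical, the field offsets follow the documented PE32 optional header layout, and 64-bit (PE32+) files are deliberately rejected to keep the example short.

```python
import struct


def parse_pe32_header(data: bytes) -> dict:
    """Extract a few header fields from the raw bytes of a 32-bit PE file."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew at offset 0x3C holds the file offset of the PE signature.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    coff = e_lfanew + 4                      # COFF file header (20 bytes)
    machine, n_sections = struct.unpack_from("<HH", data, coff)
    opt = coff + 20                          # optional header follows the COFF header
    (magic,) = struct.unpack_from("<H", data, opt)
    if magic != 0x10B:                       # 0x10B marks a PE32 optional header
        raise ValueError("not a PE32 optional header")
    (size_uninit,) = struct.unpack_from("<I", data, opt + 12)
    (size_headers,) = struct.unpack_from("<I", data, opt + 60)
    (stack_reserve,) = struct.unpack_from("<I", data, opt + 72)
    return {
        "machine": machine,
        "number_of_sections": n_sections,
        "size_of_uninitialized_data": size_uninit,
        "size_of_headers": size_headers,
        "size_of_stack_reserve": stack_reserve,
    }
```

A typical use would be `parse_pe32_header(open(path, "rb").read())`, where `path` is the file under analysis; the resulting dictionary can be flattened into a feature vector for a classifier such as a decision tree.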
E. Static PE analysis
Researchers have stated that static analysis of PE32 files faces a significant number of challenges.
32-bit malware can affect 64-bit platforms, and the Windows operating system supports Portable
Executable (PE) files [8]. Hence, it is not surprising that much malware is found on the Windows OS.
The use of machine learning for malware detection has sped up analysis and reduced human interaction,
which has also reduced the possibility of human error. However, an effective and scientific taxonomy is
still required for static malware analysis.
F. Static PE features
Static PE features represent imported functions, byte sequences and strings. They make it possible to
distinguish PE32 malware from benign files with 99% accuracy. A decision tree classifier can also be
used to detect malicious PE files on the Windows OS [4].
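Of the static PE features listed above, strings are the simplest to extract. The following Python sketch pulls printable ASCII strings out of raw file bytes; it is a minimal illustration, and the function name `printable_strings` and the minimum length of 4 characters are assumptions rather than values taken from the surveyed papers.

```python
import re


def printable_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of at least `min_len` printable ASCII characters found in `data`."""
    # Bytes 0x20 to 0x7E are the printable ASCII range (space through tilde).
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(data)]


# Hypothetical fragment of a binary containing an imported DLL and function name.
sample = b"\x00\x01KERNEL32.dll\x00ab\x00CreateFileA\x00\xff"
print(printable_strings(sample))
```

Strings such as imported DLL and function names recovered this way can be used directly as binary features (present or absent) for each file.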
IV. CONCLUSION
Static analysis, dynamic analysis and hybrid analysis are the major techniques for distinguishing
malware from benign files. Static analysis identifies malicious files before they execute and exposes
the malware code so that its activities can be identified easily. Dynamic analysis, on the other hand,
executes PE files in a controlled environment and is more resilient to evasion than traditional static
analysis. However, the virtual environment differs from the real world, so malware can behave
differently once it runs on a real system. In that situation, dynamic analysis fails to detect the
malware. It is also an expensive process that consumes time and resources. The hybrid technique
combines static and dynamic analysis; it is costly in both time and resources and, in most cases, is
not practical in the real world.
After analyzing all the malware detection techniques, we conclude that static analysis is the most
effective technique for identifying malware and benign files and protecting the system.
REFERENCES
[1] A. Hellal and L.B. Romdhane, "Minimal Contrast Frequent Pattern Mining for Malware Detection",
Computers & Security, vol. 62, pp. 19-32, 2016.
[2] D. Ucci, L. Aniello and R. Baldoni, "Survey of machine learning techniques for malware analysis",
Computers & Security, vol. 81, pp. 123-147, 2019.
[3] A. Shabtai, R. Moskovitch, Y. Elovici and C. Glezer, "Detection of malicious code by applying
machine learning classifiers on static features: A state-of-the-art survey", Information Security
Technical Report, vol. 14, no. 1, pp. 16-29, 2009.
[4] E. Gandotra, D. Bansal and S. Sofat, "Malware Analysis and Classification: A Survey", 2020.
[5] S. Jain and Y. Meena, "Byte Level n-Gram Analysis for Malware Detection", Communications in
Computer and Information Science, pp. 51-59, 2011.
[6] Z. Sun et al., "An Opcode Sequences Analysis Method For Unknown Malware Detection",
Proceedings of the 2019 2nd International Conference on Geoinformatics and Data Analysis, 2019.
[7] H. S. Anderson and P. Roth, "EMBER: An Open Dataset for Training Static PE Malware
Machine Learning Models", arXiv.org, 2020. Available: https://arxiv.org/pdf/1804.04637.pdf.
[8] A. Shalaginov, S. Banin, A. Dehghantanha and K. Franke, "Machine Learning Aided Static Malware
Analysis: A Survey and Tutorial", Advances in Information Security, pp. 7-45, 2018.
[9] E. Raff, R. Zak, R. Cox et al., "An investigation of byte n-gram features for malware
classification", Journal of Computer Virology and Hacking Techniques, vol. 14, pp. 1-20, 2018.
Dr. Keyur N. Brahmbhatt Dr. Dinesh B. Vaghela Dr. U. K. Jaliya
Supervisor DPC Member 1 DPC Member 2