
2021 IEEE Sixth International Conference on Data Science in Cyberspace (DSC)

An APT Malware Classification Method Based on Adaboost Feature Selection and LightGBM

1st Na Xu
Cyberspace Institute of Advanced Technology
Guangzhou University
Guangzhou, China
2ll2006240@[Link]

2nd Shudong Li *
Cyberspace Institute of Advanced Technology
Guangzhou University
Guangzhou, China
lishudong@[Link]

3rd Xiaobo Wu
School of Computer Science and Cyber Engineering
Guangzhou University
Guangzhou, China
happywxb@[Link]

4th Weihong Han
Cyberspace Institute of Advanced Technology
Guangzhou University
Guangzhou, China
hanweihong@[Link]

5th Xiaojing Luo
Cyberspace Institute of Advanced Technology
Guangzhou University
Guangzhou, China
sharnenall@[Link]

Abstract-Advanced Persistent Threat (APT) attack activities themed on COVID-19 and vaccines are growing rapidly. The targets of APT attacks have gradually expanded from government agencies to vaccine manufacturers, the medical industry and so on. Moreover, APT groups have a strict organizational structure and a professional division of labor, and malware delivered by the same APT group is similar. Classifying malware samples into known APT groups in time can minimize losses as early as possible and keep the relevant industries vigilant. In this paper, we propose a multi-classification method for APT malware based on Adaboost and LightGBM. We collect real APT malware samples that have been delivered by 12 known APT groups. The API call sequence of each APT malware sample is obtained through a sandbox. To capture the relationship between adjacent APIs, we use the TF-IDF algorithm combined with bi-grams. Then, the Adaboost algorithm is used to select the important API features, which form the target feature subset. Finally, we use this subset with the LightGBM ensemble algorithm to train multiple classifiers, named Ada-LightGBM. The experimental results show that our method is superior to Adaboost or LightGBM used alone. The classifier has good recognition performance on the test samples.

Keywords-Advanced Persistent Threat Attack, Malware, Adaboost, LightGBM

I. INTRODUCTION
APT [1] attacks have always been one of the hot topics in the field of cyberspace security. Compared with traditional cyberspace threats, an APT attack is a long-planned malicious espionage threat that carries out long-term, continuous and precise attacks on specific targets. It is characterized by strong purpose, complex form, strong concealment and long-term continuity. With the global outbreak of COVID-19, APT attack activities are also growing rapidly. The 2020 annual report of QI-ANXIN Global Advanced Persistent Threat [2] covers 642 public reports involving 151 attack groups. Among them, the five APT groups mentioned most often are Lazarus, Darkhotel, Kimsuky, OceanLotus and Bitter. The healthcare industry has become the primary target of APT attacks worldwide, surpassing government, finance, national defense, energy and telecommunications for the first time. At the end of 2020, 360 Secure Brain Monitor captured an APT attack that mainly targeted Chinese social scientists studying South Asian relations [3][4]. The attacker delivered a program disguised as an invitation letter document to induce users to click a self-extracting program, which released and opened the bait file and the subsequent malicious software samples. The ultimate goal of this APT attack was to return the information and intelligence collected on the target computer to the APT server.

Under normal conditions, APT groups have a strict organizational structure and a professional division of labor, and malware samples delivered by the same APT group are similar. Classifying malware samples into known APT groups in time can minimize losses as early as possible and keep the relevant industries vigilant. At present, there are many detection methods based on traffic information and log information, while APT-related malware classification is relatively rare. From this research perspective, our paper proposes an APT malware classification method based on Adaboost feature selection and LightGBM. Its main contributions are as follows:

• The proposed method is a multi-classification experiment on real APT malware datasets, which provides a feasible method for detecting APT attacks in malware research.
• We use the Adaboost algorithm to calculate the importance of each feature. Our experimental results on a real dataset show that the proposed method has good accuracy.
• The Ada-LightGBM model we construct is superior to DT, KNN, Adaboost, LightGBM and XGBoost used alone.

978-1-6654-1815-7/21/$31.00 ©2021 IEEE
DOI 10.1109/DSC53577.2021.00101
Authorized licensed use limited to: Amrita School of Engineering. Downloaded on November 05,2024 at [Link] UTC from IEEE Xplore. Restrictions apply.
II. RELATED WORK
Because of the complexity and concealment of APT attacks, a variety of detection methods are needed; it is difficult to detect APT attacks with a single method. According to the data source, researchers divide detection into three broad approaches: traffic detection, log detection and APT malware detection. Traffic detection isolates malicious traffic by filtering out large volumes of normal traffic, so that targeted defense can be carried out. Shang L, et al. [5] proposed using an LSTM to extract temporal features across data streams and a CNN to extract features within data streams, and then isolated malicious traffic with GBDT for targeted traceability tracking. Log detection analyzes log information to find evidence of an attack and to track it. Liu F, et al. [6] found traces of APT attacks by defining rules and constructing heterographs from the initial information in user logs, and then used a clustering algorithm to isolate malicious logs. However, because user traffic and logs involve security and privacy information, they are not easy for ordinary researchers to obtain. At present, there are many detection methods based on traffic information and log information, while methods for classifying APT-related malware by organization are relatively rare. Liras, et al. [7] collected static analysis fields, dynamic analysis fields and network-traffic-related fields of APT malware, and used LR, KNN, Random Forest and SVM algorithms for multi-class classification, which provided a foundation for studying APT detection from the perspective of malicious software.

In recent years, malware analysis methods have been roughly divided into two categories: static analysis and dynamic analysis. Static analysis [8-12] focuses on the malware as text. Because samples from the same malware family tend to come from the same author or team, their code is similar. Based on this, static features extracted from the binary file, executable file or disassembly file of the malware are analyzed. Dynamic analysis [13-19] focuses on the behavior characteristics of malware. By executing malware samples in a controlled virtual environment, behavior information such as behavior logs, context parameters and API call sequences is recorded. By running the malware, one can find out which servers the APT attack connects to, which parameters are modified, and which device inputs/outputs are executed. Dynamic analysis bypasses the limitations of static analysis.

III. PROPOSED METHOD
This paper proposes an APT malware organization classification method based on Adaboost feature extraction and LightGBM. Firstly, behavior information of APT malware is obtained through the sandbox, and the initial features are extracted and quantified with the TF-IDF algorithm combined with bi-grams. Then, Adaboost is used to select the important features, which form the target feature subset. Finally, we use this subset with the LightGBM ensemble algorithm to train multiple classifiers, named Ada-LightGBM. The overall design framework is shown in Fig. 1.

We build an automated analysis platform based on the Cuckoo sandbox to record the behavior information of malware and extract API call information. Since a very small number of samples use anti-sandbox techniques, which makes it impossible to obtain behavior reports, we remove these samples directly. The data extracted from the sample execution report is shown in Fig. 2.

Figure 1: Overall design framework diagram

    "api": "LdrGetProcedureAddress",
    "category": "system",
    "return_value": 0,
    "stacktrace": [ ],
    "flags": { },
    "arguments": {
        "function_address": "0x76746ba9",
        "ordinal": 0,
        "module_address": "0x76710000",
        "module": "kernel32",
        "function_name": "GetComputerNameA"
    },
    "tid": 2744,
    "status": ...,
    "time": ...

Figure 2: Part of the sample executable report information

A. Data Preprocessing
We use the TF-IDF algorithm combined with bi-grams to convert API sequences into numeric vectors. A bi-gram is a pair of two consecutive APIs extracted from the continuous API sequence, which captures the execution context before and after each API call of APT malware. For example, {"GetSystemTimeAsFileTime", "NtAllocateVirtualMemory", "NtFreeVirtualMemory", "LdrGetDllHandle", ...} becomes {"GetSystemTimeAsFileTime NtAllocateVirtualMemory", "NtAllocateVirtualMemory NtFreeVirtualMemory", "NtFreeVirtualMemory LdrGetDllHandle", ...} after bi-gram treatment. TF-IDF is composed of TF and IDF and evaluates the importance of API words for a single malware sample or for all malware. In this paper, the vector X with fixed length is

obtained after TF-IDF and bi-gram processing. Due to the large dimension of the transformed feature vector, it is necessary to filter the features and reduce the dimension. We divide the original data set into a training set and a test set with a ratio of 7 : 3. The partition is stratified by label: samples are assigned to the training set and the test set according to the class proportions in the original data, so that the proportions in both sets are the same as those in the original data set.

B. Adaboost Feature Selection
The Adaboost algorithm [20] is one of the boosting ensemble learning algorithms. The classification model built for Adaboost feature selection examines the performance of each feature in every iteration and evaluates the importance of all the features. Specifically, the steps of feature selection based on the Adaboost algorithm are as follows.

Input:
$X = \{(x_1^1, x_2^1, \ldots, x_n^1), (x_1^2, x_2^2, \ldots, x_n^2), \ldots, (x_1^m, x_2^m, \ldots, x_n^m)\}$ and $Labels = \{y_1, y_2, \ldots, y_m\}$, where m is the number of samples in the training set (the training set used in this paper has a total of 2436 samples) and n is the feature dimension. A decision tree is selected as the base classifier.

Output:
Strong classifier $H(x)$ and final feature importance score S.

The detailed calculation steps are as follows:

Step 1: We initialize the weights of all original samples to $D_1 = \frac{1}{m}$ and set the number of iterations to T, where $D_k$, $1 \le k \le T$, represents the weight distribution of the samples in the k-th iteration. Considering both running time and LightGBM accuracy, setting the number of decision trees to 50 gives better results.

Step 2: For $k = 1, 2, 3, \ldots, T$:

a. Classifier $h_k$ is trained under the weight distribution $D_k$.

b. Evaluate the error rate $\varepsilon_k$ of $h_k$ as the sum of the weights of the misclassified samples, and calculate all feature importance scores $S_k$.

c. If the error rate $\varepsilon_k$ is below the threshold, we consider $h_k$ an effective classifier, then update the weight of the classifier to $\alpha_k = \frac{1}{2} \ln \frac{1 - \varepsilon_k}{\varepsilon_k}$ and update the weights of all samples to $D_{k+1}$. The role of $Z_k$ is normalization, so that the cumulative sum of the sample weights after updating is still 1.

$$D_{k+1}(x) = \begin{cases} \dfrac{D_k(x)}{Z_k} e^{-\alpha_k}, & y_{prediction} = y_{label} \\ \dfrac{D_k(x)}{Z_k} e^{\alpha_k}, & y_{prediction} \neq y_{label} \end{cases}$$

Step 3: After the iteration is completed, the final strong classifier $H(x) = \mathrm{sign}\left(\sum_{k=1}^{T} \alpha_k h_k(x)\right)$ is obtained, and the importance score of all features is $S_T$.

Through experimental comparison, we find that the best feature subset is formed by selecting the top 400 features in S, as shown in Fig. 3.

Figure 3: Adaboost feature selection (plot titled "Adaboost Feature Selection"; horizontal axis: top 50 to top 400 features; vertical axis from 0.0 to 1.0)

C. LightGBM Classifier
In 2017, the Microsoft team proposed a new efficient gradient boosting algorithm based on decision trees, named LightGBM [21]. The main computing cost of the traditional GBDT algorithm is the learning of each decision tree, and the main cost of that learning process is the search for the optimal split point. The most popular method is the pre-sorted method, which sorts all the features in advance and then traverses all the data over all possible split points. Its disadvantage is that the computational overhead and memory usage are very large. The other approach is based on histograms. It discretizes the continuous feature values into buckets and then constructs a histogram from the number of values in each bucket. Finally, it traverses the discrete values of the histogram to find the optimal split point. Histogram-based algorithms cost less time and space. For this reason, LightGBM uses the histogram algorithm to reduce the number of candidate split points. At the same time, LightGBM further optimizes the histogram algorithm. First of all, it chooses a leaf-wise growth strategy with a depth limit, which each time finds the leaf with the largest splitting gain among all current leaves, instead of the level-wise decision tree growth strategy. With the same number of splits, leaf-wise growth can reduce more errors than level-wise growth and obtain better accuracy. However, leaf-wise growth also has its disadvantages: when the decision tree is deep, the model is prone to over-fitting. LightGBM therefore limits the maximum depth of the decision tree on top of the leaf-wise strategy, ensuring high efficiency while preventing overfitting.

In this paper, the LightGBM model automatically optimizes four parameters, namely the learning rate, the number of iterations, the number of leaf nodes and the tree depth, using GridSearchCV. We first set a relatively large learning rate of 0.5, an initial number of iterations of 150 and an initial number of leaf nodes of 100, with a multi-classification objective function. Other parameters use their default values. Then the GridSearchCV function is used to find the optimal parameters. Finally, the optimal learning rate is 0.05, the

number of iterations is 86, the maximum depth of the tree is 6, and the number of leaf nodes is 40.

IV. EXPERIMENTS
A. APT Malware Dataset
The dataset used in this paper includes 12 APT organizations and a total of 3594 malware samples, which are initiated by five different countries [22]. All malicious samples come from open source threat intelligence released by domestic and foreign security agencies. Each malicious software sample is downloaded in batches from the VirusTotal website [23], and its source report links are recorded in the overview.csv file [24]. The distribution of APT groups is shown in Table 1.

Table 1: Distribution of APT groups

APT Group        Country       Numbers   Numbers of successful reports
APT 1            China         405       400
APT 10           China         244       240
APT 19           China         32        24
APT 21           China         106       90
APT 28           Russia        214       200
APT 29           Russia        281       277
APT 30           China         164       155
DarkHotel        North Korea   273       257
Energetic Bear   Russia        132       132
Winnti           China         387       350
Gorgon Group     Pakistan      961       960
Equation Group   USA           395       395
Total                          3594      3480

B. Model evaluation index
We choose the commonly used multi-classification evaluation indicators Accuracy, Precision, Recall and F1 to evaluate the performance of our model in the experiment. The arithmetic average of each per-class index is calculated to get the macro average, which measures the overall effect of each classification algorithm. The calculation is as follows:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

$$Precision_{macro} = \frac{TP}{TP + FP}$$

$$Recall_{macro} = \frac{TP}{TP + FN}$$

$$F1_{macro} = \frac{2 \times Precision_{macro} \times Recall_{macro}}{Precision_{macro} + Recall_{macro}}$$

In the formulas, True Positive (TP) is the number of positive samples correctly identified as positive, True Negative (TN) is the number of negative samples correctly identified as negative, False Positive (FP) is the number of negative samples wrongly identified as positive, and False Negative (FN) is the number of positive samples wrongly identified as negative.

C. Experimental results
This paper first compares the performance of our model Ada-LightGBM with that of the single models Adaboost, LightGBM and XGBoost on the real data set, and then uses the classical classification algorithms DT and KNN to demonstrate the superiority of the constructed model. The experimental results are compared in Table 2.

Table 2: Classification results of each model

Model          Accuracy (macro)   Precision (macro)   Recall (macro)   F1 (macro)
Adaboost       0.9037             0.8192              0.8001           0.8060
LightGBM       0.9009             0.8571              0.8438           0.8452
XGBoost        0.9152             0.8264              0.8137           0.8159
DT             0.8664             0.7678              0.7664           0.7635
KNN            0.8520             0.7544              0.7534           0.7488
Ada-LightGBM   0.9353             0.8965              0.8472           0.8662

In terms of accuracy, the constructed Ada-LightGBM model scores highest, reaching 0.9353. In terms of precision and recall, the constructed model is superior to the single models Adaboost, LightGBM and XGBoost, and to the classical classification models DT and KNN. Overall, the constructed model holds an advantage on every reported index, so our model is effective.

V. CONCLUSIONS
In this paper, we discuss how to analyze APT organizations from the perspective of malware. We propose a multi-classification model of APT malware based on Adaboost and LightGBM. Firstly, the behavior information of malicious samples is obtained through the sandbox, and it is extracted and quantified with the TF-IDF algorithm combined with bi-grams. Then, Adaboost is used to select the important features, which form the target feature subset. Finally, we use this subset with the LightGBM ensemble algorithm to train multiple classifiers, named Ada-LightGBM. On the real data set from 12 APT organizations, we show that this model has better accuracy than the single algorithms and the classical algorithms in the multi-classification task of APT organizations. Our next step is to use deep learning [25] to explore the differences between APT malware and non-APT malware.

ACKNOWLEDGMENT
This research was funded by Key R&D Program of Guangdong Province (No. 2019B010136003), NSFC (No.

62072131, 61972106), Science and Technology Projects in Guangzhou (No. 202102010442), National Key Research and Development Program of China (No. 2019QY1406), Guangdong Higher Education Innovation Group (No. 2020KCXTD007), Guangzhou Higher Education Innovation Group (No. 202032854).

REFERENCES
[1] Maloney and Sarah. "What is an Advanced Persistent Threat (APT)?". Available at [Link] threat-apt, 2018.
[2] QI-ANXIN Threat Intelligence Center: "The 2020 Annual Report of QI-ANXIN Global Advanced Persistent Threat (APT)". Available at [Link] 126, 2021.
[3] 360 Secure Brain Monitor: "Disclosure of Attack Activities of manlinghua Organization (apt-c-08) using Warzone Rat". Available at [Link] 2021.
[4] 360 Government&Enterprise Security Group: "Reproduce the attack process of manlinghua Organization (apt-c-08)". Available at [Link] 2021.
[5] Shang L, Guo D, Ji Y, et al. "Discovering unknown advanced persistent threat using shared features mined by neural networks". Computer Networks, 2021, 189(2):107937.
[6] Liu, F, et al. "Log2vec: A Heterogeneous Graph Embedding Based Approach for Detecting Cyber Threats within Enterprise". the 2019 ACM SIGSAC Conference ACM, 2019.
[7] Liras, Lfm, A. Soto and M. A. Prada, "Feature analysis for data-driven APT-related malware discrimination". Computers & Security 104.1(2021):102202.
[8] Kong, D and G. Yan, "Discriminant malware distance learning on structural information for automated malware classification". Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining ACM, 2013.
[9] Sun, B, et al. "Malware family classification method based on static feature extraction". IEEE International Conference on Computer & Communications IEEE, 2017:507-513.
[10] Chen, L, et al. "An Ensemble Learning Approach to Detect Malwares Based on Static Information", 2020.
[11] H. Fereidooni, M. Conti, D. Yao and A. Sperduti, "ANASTASIA: ANdroid mAlware detection using STatic analySIs of Applications". 2016 8th IFIP International Conference on New Technologies, Mobility and Security (NTMS), 2016, pp. 1-5, doi: 10.1109/NTMS.2016.7792435.
[12] Shudong Li, Yuan Li, Weihong Han, Xiaojiang Du, Mohsen Guizani, Zhihong Tian, "Malicious mining code detection based on ensemble learning in cloud computing environment", Simulation Modelling Practice and Theory, Volume 113, 2021, 102391.
[13] Pektas, Abdurrahman and T. Acarman, "Classification of malware families based on runtime behaviors". Journal of Information Security and Applications [Link].(2017):91-100.
[14] Hangfeng Yang, Shudong Li, Xiaobo Wu, Hui Lu, Weihong Han, "A Novel Solution for Malicious Code Detection and Family Clustering based on Machine Learning". IEEE Access. Vol. 7(1), pp. 148853-148860.
[15] Meihua Fan, Shudong Li, Xiaobo Wu, and Weihong Han, Zhaoquan Gu and Zhihong Tian, "A Novel Malware Detection Framework Based on Weighted Heterograph". CIAT 2020: 2020 International Conference on Cyberspace Innovation of Advanced Technologies Guangzhou China. December 4-6, 2020, pp. 39-43.
[16] S. Yang, S. Li, W. Chen and Y. Liu, "A Real-Time and Adaptive-Learning Malware Detection Method Based on API-Pair Graph". IEEE Access, vol. 8, pp. 208120-208135, 2020, doi: 10.1109/ACCESS.2020.3038453.
[17] Kim H, Kim J, Kim Y, et al. "Improvement of malware detection and classification using API call sequence alignment and visualization[J]". Cluster Computing, 2017.
[18] Amer E A and Zelinka I, "A dynamic Windows malware detection and prediction method based on contextual understanding of API call sequence[J]". Computers & Security, 2020.
[19] Shudong Li, Qianqing Zhang, Xiaobo Wu, Weihong Han, Zhihong Tian, "Attribution Classification Method of APT Malware in IoT Using Machine Learning Techniques", Security and Communication Networks, vol. 2021, Article ID 9396141, 12 pages, 2021.
[20] Freund, Yoav and R. E. Schapire. "A decision-theoretic generalization of on-line learning and an application to boosting". Journal of Computer and System Sciences 55(1997):119-139.
[21] KE G, MENG Q, FINLEY T, et al. "Lightgbm: A highly efficient gradient boosting decision tree[C]". NIPS 2017: 2017 Advances in Neural Information Processing Systems. 2017:3146-3154.
[22] Github: "APTMalware Dataset". Available at [Link] research/APTMalware/tree/master/samples, 2019.
[23] Virustotal: "APT Malware Sample download website". Available at [Link]
[24] Github: "[Link]". Available at [Link] research/APTMalware/blob/master/[Link], 2019.
[25] Shudong Li, Laiyuan Jiang, Xiaobo Wu, Weihong Han, Dawei Zhao, Zhen Wang. A Weighted Network Community Detection Algorithm Based on Deep Learning. Applied Mathematics and Computation. 401 (2021):126012.


Common questions


Ada-LightGBM combines the feature selection strengths of Adaboost with the efficient classification capacity of LightGBM, resulting in superior performance. The model benefits from Adaboost's ability to weigh and select the most significant features, which enhances classifier performance, and from LightGBM's usage of the histogram-based algorithm for efficient decision tree learning and optimal segmentation point finding. This combination allows Ada-LightGBM to achieve higher accuracy, precision, recall, and F1 scores compared to using Adaboost, LightGBM, or XGBoost alone in classifying APT malware .
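The macro-averaged scores mentioned above can be computed from a multi-class confusion matrix. The following is a minimal sketch, not the authors' evaluation code: the 3-class matrix is invented for illustration, and F1 is taken as the harmonic mean of macro precision and macro recall, following the formula given in the paper's evaluation section.

```python
def macro_scores(confusion):
    """Return (accuracy, macro precision, macro recall, macro F1).

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    precisions, recalls = [], []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n)) - tp  # predicted c, truly another class
        fn = sum(confusion[c]) - tp                       # truly c, predicted another class
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    accuracy = sum(confusion[c][c] for c in range(n)) / total
    precision = sum(precisions) / n   # arithmetic mean over classes (macro average)
    recall = sum(recalls) / n
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


# Hypothetical 3-class confusion matrix, for illustration only.
confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 1, 9],
]
accuracy, precision, recall, f1 = macro_scores(confusion)
print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
```

Because every class contributes equally to a macro average, a small group such as APT 19 weighs as much as a large one such as Gorgon Group, which is one reason macro scores can sit well below accuracy on an imbalanced dataset.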

The TF-IDF (Term Frequency-Inverse Document Frequency) and bi-gram algorithms are used to process and convert API call sequences from APT malware into a vectorized format that captures both the importance and the sequence of API calls. TF-IDF evaluates the significance of API terms by considering their frequency across malware samples, while bi-grams enhance this by capturing the sequential relationship between adjacent APIs. This preprocessing enables the extraction of richer behavioral patterns, providing a more informative feature set for subsequent classification tasks .
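The preprocessing described above can be sketched in a few lines of standard-library Python. This is an illustrative sketch only: the function names are hypothetical, the second sample sequence is invented, and the smoothed IDF variant is an assumption, since the paper does not state which TF-IDF formula it uses.

```python
import math
from collections import Counter


def bigrams(api_calls):
    """Join each pair of adjacent API names into one bi-gram token."""
    return [f"{a} {b}" for a, b in zip(api_calls, api_calls[1:])]


def tfidf_vectors(sequences):
    """Return (vocabulary, vectors): one fixed-length TF-IDF vector per sequence."""
    docs = [Counter(bigrams(seq)) for seq in sequences]
    vocab = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)
    # Document frequency: number of samples in which each bi-gram occurs.
    df = {t: sum(1 for doc in docs if t in doc) for t in vocab}
    # Smoothed IDF (assumption; other variants exist).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        total = sum(doc.values()) or 1  # guard against sequences shorter than 2 calls
        vectors.append([doc[t] / total * idf[t] for t in vocab])
    return vocab, vectors


# First sequence is the example from the paper; the second is made up.
sample_a = ["GetSystemTimeAsFileTime", "NtAllocateVirtualMemory",
            "NtFreeVirtualMemory", "LdrGetDllHandle"]
sample_b = ["NtAllocateVirtualMemory", "NtFreeVirtualMemory",
            "NtAllocateVirtualMemory", "NtFreeVirtualMemory"]
vocab, vectors = tfidf_vectors([sample_a, sample_b])
for token, weight in zip(vocab, vectors[0]):
    print(f"{weight:.3f}  {token}")
```

A bi-gram that occurs in every sample receives a lower weight than one that is specific to a single sample, which is the behaviour that makes TF-IDF useful for separating groups.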

Traditional traffic and log-based APT detection methods face limitations in capturing the covert and complex nature of APT activities, as they primarily rely on identifying anomalous patterns in large datasets, which can lead to false positives. They are also less effective in detailing the behaviors of malware already infiltrating systems. APT malware-focused classification, like using Ada-LightGBM, directly analyzes the malware's behavioral characteristics, such as API call sequences, facilitating more accurate categorization and reducing reliance on network or log anomalies alone, offering a more targeted and precise detection approach .

Challenges in APT malware detection include the complexity and concealment strategies of APT attacks, which make single-method detection insufficient. These attacks often involve sophisticated techniques to evade traditional security measures and present a dynamic threat landscape. Therefore, new classification methods like the Ada-LightGBM feature selection and classification model are developed to address these challenges by enhancing detection capabilities through multi-layered analysis of malware behavior, leveraging advanced algorithms for more precise classification, reducing false positives, and enabling adaptive defenses .

The combination of Adaboost and LightGBM is effective because Adaboost efficiently selects the most impactful features from the dataset by adjusting features' weights based on their predictive importance, while LightGBM swiftly processes these reduced, high-quality feature sets through its optimized, histogram-based tree learning method. Together, they enhance accuracy and reduce overfitting, resulting in better performance in classifying complex and evasive APT malware samples .
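To make the feature selection step concrete, here is a deliberately simplified sketch. It is not the authors' implementation: it assumes binary labels in {+1, -1} and one-feature decision stumps as base classifiers (the paper uses decision trees on 12 classes), and it scores a feature by the total classifier weight of the stumps that split on it. The toy data and function names are hypothetical.

```python
import math


def best_stump(X, y, w):
    """Return (weighted error, feature, threshold, polarity) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for pol in (1, -1):
                # The stump predicts `pol` when the feature value is <= thr, else `-pol`.
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (pol if row[j] <= thr else -pol) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best


def adaboost_importance(X, y, rounds=10):
    """Run AdaBoost with stumps and return normalised per-feature importance."""
    m = len(X)
    w = [1.0 / m] * m                           # initial distribution D_1 = 1/m
    scores = [0.0] * len(X[0])
    for _ in range(rounds):
        err, j, thr, pol = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)   # keep the logarithm finite
        if err >= 0.5:
            break                               # stump is no better than chance
        alpha = 0.5 * math.log((1 - err) / err)
        scores[j] += alpha                      # credit the feature the stump used
        preds = [pol if row[j] <= thr else -pol for row in X]
        # Correct samples are scaled by e^{-alpha}, misclassified ones by e^{alpha}.
        w = [wi * math.exp(-alpha * yi * pi) for wi, yi, pi in zip(w, y, preds)]
        z = sum(w)                              # normalisation factor
        w = [wi / z for wi in w]
    total = sum(scores) or 1.0
    return [s / total for s in scores]


def top_k_features(importance, k):
    """Indices of the k highest-scoring features."""
    return sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)[:k]


# Toy data: feature 0 is noise, feature 1 separates the two classes.
X = [[0.3, 1.0], [0.9, 2.0], [0.1, 3.0], [0.7, 4.0], [0.2, 5.0], [0.8, 6.0]]
y = [1, 1, 1, -1, -1, -1]
importance = adaboost_importance(X, y, rounds=5)
print(importance, top_k_features(importance, 1))
```

In the paper's setting the same idea is applied with k = 400, and only the selected columns of the TF-IDF matrix are passed to LightGBM.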

LightGBM optimizes decision tree construction by employing histogram-based algorithms, which discretize continuous feature values into buckets, significantly reducing time and space consumption compared to pre-sorted methods. Additionally, it uses a leaf-wise approach with depth limitations, focusing on splitting leaves with the largest potential gain, which enhances accuracy while controlling for overfitting. This allows LightGBM to maintain high efficiency and precision when classifying malware, ensuring robust detection in dynamic threat environments .
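The histogram idea can be illustrated with a small sketch. It is a simplification, not LightGBM's actual implementation: it assumes equal-width bins, squared loss with unit hessians and no regularisation, so the gain of a split reduces to G_L^2/n_L + G_R^2/n_R - G^2/n, where G is a sum of gradients and n a sample count. All names and data are hypothetical.

```python
def histogram_best_split(values, gradients, n_bins=8):
    """Bucket one feature into equal-width bins and scan the bin edges for the best split."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0           # avoid zero width for a constant feature
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    for v, g in zip(values, gradients):
        b = min(int((v - lo) / width), n_bins - 1)
        grad_sum[b] += g                        # accumulate statistics per bucket
        count[b] += 1
    total_g, total_n = sum(grad_sum), sum(count)
    best_gain, best_threshold = 0.0, None
    left_g, left_n = 0.0, 0
    # Only n_bins - 1 candidate thresholds are checked, not one per distinct value.
    for b in range(n_bins - 1):
        left_g += grad_sum[b]
        left_n += count[b]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        gain = left_g ** 2 / left_n + right_g ** 2 / right_n - total_g ** 2 / total_n
        if gain > best_gain:
            best_gain, best_threshold = gain, lo + (b + 1) * width
    return best_threshold, best_gain


# Two well separated clusters with opposite gradient signs.
values = [0.1, 0.2, 0.3, 0.4, 2.1, 2.2, 2.3, 2.4]
gradients = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
threshold, gain = histogram_best_split(values, gradients)
print(threshold, gain)
```

The cost of the scan depends on the number of bins rather than on the number of samples, which is where the time and memory savings over the pre-sorted method come from.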

Healthcare industries have become prominent targets for APT attacks largely due to the high value and sensitivity of the data they handle, such as patient health records and vaccine development information, especially amid the global COVID-19 pandemic. APT groups have shifted their focus here, as these sectors offer lucrative opportunities for espionage and data theft. Moreover, the critical nature of healthcare operations, the rapid digitalization during the pandemic, and potentially weaker cybersecurity defenses compared to traditional sectors like finance and government make them vulnerable targets .

Grid SearchCV systematically explores a predefined range of hyperparameter settings for the LightGBM model to identify the best combination that maximizes model performance, such as learning rate, iteration numbers, and tree depth. This ensures that the model is finely tuned to achieve optimal classification outcomes, balancing between accuracy, efficiency, and avoiding overfitting, particularly in complex task environments like APT classification .
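A tuning run of this kind might look like the sketch below. The candidate lists in `param_grid` are assumptions made for illustration: the paper reports only its starting values (learning rate 0.5, 150 iterations, 100 leaves) and the selected optimum (learning rate 0.05, 86 iterations, depth 6, 40 leaves), not the grid that was searched. The function `tune_lightgbm`, the 5-fold setting and the accuracy scoring are likewise hypothetical; it relies on `lightgbm.LGBMClassifier` and `sklearn.model_selection.GridSearchCV`, which must be installed before it is called.

```python
from itertools import product

# Hypothetical search grid; the values are illustrative, chosen so that the
# optimum reported in the paper lies inside the grid.
param_grid = {
    "learning_rate": [0.5, 0.1, 0.05, 0.01],
    "n_estimators": [50, 86, 150],
    "num_leaves": [20, 40, 100],
    "max_depth": [4, 6, 8],
}


def count_candidates(grid):
    """Number of parameter combinations an exhaustive grid search evaluates."""
    return sum(1 for _ in product(*grid.values()))


def tune_lightgbm(X_train, y_train, grid=param_grid, cv=5):
    """Exhaustive search with cross-validation; needs lightgbm and scikit-learn."""
    from lightgbm import LGBMClassifier
    from sklearn.model_selection import GridSearchCV

    # Starting configuration as described in the paper.
    base = LGBMClassifier(objective="multiclass", learning_rate=0.5,
                          n_estimators=150, num_leaves=100)
    search = GridSearchCV(base, grid, scoring="accuracy", cv=cv)
    search.fit(X_train, y_train)
    return search.best_params_, search.best_score_


print(count_candidates(param_grid), "combinations per cross-validation fold")
```

Since the number of fits grows as the product of the list lengths times the number of folds, it is common to search a coarse grid first and then refine around the best point.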

Static analysis examines malware without executing it, focusing on the analysis of code features, such as binary data or disassembly files, to identify patterns or similarities within malware families. In contrast, dynamic analysis involves executing the malware in a controlled environment to observe its behavior, such as API calls or network interactions. Dynamic analysis can reveal real-time behavior and additional insights into malware actions, but requires more resources and might be circumvented by anti-sandbox technologies used by some malware .
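For the dynamic side, the API call sequence has to be pulled out of the sandbox report before any feature extraction. The sketch below assumes a Cuckoo-style JSON layout in which calls are stored under `behavior` → `processes` → `calls`, each with an `api` field; that layout, the function name and the tiny report are assumptions for illustration and should be checked against the reports a given sandbox version actually produces.

```python
import json


def api_sequence(report):
    """Collect API names in recorded order from a Cuckoo-style behaviour report."""
    calls = []
    for process in report.get("behavior", {}).get("processes", []):
        for call in process.get("calls", []):
            name = call.get("api")
            if name:
                calls.append(name)
    return calls


# Minimal made-up report with the fields used above.
raw = json.dumps({
    "behavior": {
        "processes": [
            {"calls": [
                {"api": "LdrGetProcedureAddress", "category": "system", "tid": 2744},
                {"api": "NtAllocateVirtualMemory", "category": "process", "tid": 2744},
            ]},
        ]
    }
})
report = json.loads(raw)
sequence = api_sequence(report)
print(sequence)

# A sample whose report has no behaviour section yields an empty sequence,
# so it can be dropped in the way the paper drops anti-sandbox samples.
print(api_sequence({}))
```

Calls from different processes are simply concatenated here; ordering them by timestamp would be a reasonable refinement if the report provides one.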

APT attacks are characterized by their long-term and continuous precision targeting, strong concealment, organizational complexity, and professional division of labor. Unlike traditional threats which may be opportunistic in nature, APT attacks are long-planned, with a strong purpose, targeting specific victims, such as government agencies, vaccine manufacturers, and healthcare industries. The malware used by APT groups tends to be similar within the same group, allowing for classification into known groups to mitigate damages promptly .
