MalPacDetector Base Paper
Abstract—The Node Package Manager (NPM) registry contains millions of JavaScript packages widely shared among developers worldwide. However, NPM has also been abused by attackers to spread malicious packages, highlighting the importance of detecting malicious NPM packages. Existing malicious NPM package detectors suffer from, among other things, high false positives and/or high false negatives. In this paper, we propose a novel Malicious NPM Package Detector (MalPacDetector), which leverages a Large Language Model (LLM) to automatically and dynamically generate features (rather than asking experts to manually define them). To evaluate the effectiveness of MalPacDetector and existing detectors, we construct a new NPM package dataset, which overcomes the weaknesses of existing datasets (e.g., a small number of examples and a high repetition rate of malicious fragments). The experimental results show that MalPacDetector outperforms existing detectors by achieving a false positive rate of 1.3% and a false negative rate of 7.5%. In particular, MalPacDetector detects 39 previously unknown malicious packages, which are confirmed by the NPM security team.

Index Terms—Malicious package detection, npm, malicious features, large language model.

I. INTRODUCTION

The NPM is the default package manager for the Node.js environment. The NPM hosts more than 2 million reusable packages [1]. This popularity has made it a target of attackers for spreading malicious JavaScript packages [2]–[4]. For instance, the eslint-scope attacker first compromises the NPM account of the maintainer of the ESLint package and then publishes malicious versions of the eslint-scope and eslint-config-eslint packages, which download and execute a malicious payload on a victim computer running these malicious packages to steal the sensitive data stored in the victim computer's .npmrc file (typically the victim user's credentials for publishing packages at NPM) [5]. In another attack, the getcookies package contains a backdoor that enables an attacker to execute arbitrary code on a victim computer [6].

Attacks against NPM, including those mentioned above, have motivated the investigation of malicious NPM package detectors. Existing detectors follow two approaches: program analysis vs. machine learning. The program analysis approach detects malicious packages mainly via pattern matching, clone detection, and dynamic analysis [7]–[10]. However, this approach often uses rules to detect malicious packages and
dataset, dubbed MalnpmDB. When compared with existing datasets, MalnpmDB has the following advantages: it contains 3,258 malicious packages (covering 7 types of attacks) and 4,051 benign packages; by contrast, existing datasets focus on a single type of attack and have a high repetition rate of malicious code fragments [7], [14], [15].

Third, we use MalnpmDB to evaluate the effectiveness of MalPacDetector and the existing detectors that are publicly available. Experimental results show MalPacDetector outperforms the existing detectors by achieving a 98.0% precision, a 1.3% false positive rate, a 7.5% false negative rate, and a 95.2% F1-measure; this represents a 2.9% higher precision and a 3.8% higher F1-measure when compared with the state-of-the-art machine learning-based detector [11]. In particular, MalPacDetector detects 39 previously unknown malicious packages, which are confirmed by the NPM security team.

To enable others to verify or leverage our results, we have made the source code of MalPacDetector and the MalnpmDB dataset publicly available at https://github.com/MalPacDetector/MalPacDetector-core.

Ethical consideration. We reported the 39 previously unknown malicious packages to the NPM security team, which then removed these 39 packages from the NPM registry. Communication emails with the NPM security team are available for verification if necessary.

Paper organization. Section II describes MalPacDetector. Section III presents the MalnpmDB dataset. Section IV reports our experiments. Section V reviews related prior studies. Section VI discusses limitations of this study. Section VII concludes the paper.

II. THE MALPACDETECTOR

As highlighted in Figure 1, MalPacDetector has two phases: training and detection. In the training phase (Steps 1-3), the input is NPM packages and the output is a trained model; in the detection phase (Steps 4-5), the input is target NPM packages and the trained model, and the output is malicious lines of code in the target NPM packages.

• Step 1: LLM-based feature generation. This step uses an LLM to generate features of NPM packages.
• Step 2: Feature value extraction for training packages. This step extracts feature values from the training packages.
• Step 3: Model training. This step trains a model as a malicious package detector.
• Step 4: Feature value extraction for target packages. This step extracts feature values from the target packages.
• Step 5: Malicious code detection in target packages. This step detects malicious lines of code in target packages.

These steps are elaborated below.

A. Step 1: LLM-based Feature Generation

Unlike existing malicious NPM package detectors that rely on experts to manually define features, we propose using an LLM to automatically generate features as follows.

Step 1.1: Identifying malicious code in training NPM packages. We propose using an LLM to analyze program code in malicious NPM packages (the input), extract malicious code snippets, and summarize their malicious behaviors. The idea is to guide an LLM to focus on malicious code snippets, analyze malicious behaviors, and summarize features. This step applies to malicious, but not benign, packages, because we hope to use the LLM to learn the malicious behaviors, and thus features, of malicious packages that would not be exhibited by benign packages. This step is conducted iteratively, as one needs to query the LLM in question in multiple rounds with increasingly refined (i.e., adaptive) prompts.

As illustrated in Figure 2, this step starts with an initial prompt to ask an LLM to analyze the program files in a malicious NPM package. By standardizing the output format of the LLM, we can automatically process its response to extract malicious code and keywords. These, along with their locations (i.e., the lines of code) and descriptions of the malicious behaviors, are then incorporated into a new prompt to the LLM as comments. The basic idea guiding the design of prompts is to instruct the LLM to summarize malicious behaviors of an input NPM package, so that one can use a keyword extractor (e.g., YAKE! [16]) to characterize the malicious behaviors. We repeat this process until the response from the LLM converges and then use the converged response as the malicious behavior.

Figure 3 presents a running example. Figure 3(a) shows a piece of malicious JavaScript code, which calls a function to retrieve local information, packs the information into a JSON file, and sends it to a remote server controlled by the attacker. Figure 3(b) shows the analysis result by an LLM (i.e., GPT-3.5 [17]), namely the lines of code and their associated malicious behaviors (i.e., packaging users' data and sending the package to a server).

Step 1.2: Building a malicious snippet set. This step is to automatically extract malicious code snippets and malicious behaviors from a malicious package, and construct a malicious snippet set with the data structure (malicious snippet, malicious behavior), which is made possible by the way the prompts are constructed in Step 1.1. Given results in the format shown in Figure 3(b), by limiting the output of the LLM (i.e., highlighting malicious behavior), one can automatically extract keywords to reflect malicious behaviors exhibited by the example (e.g., the keywords "data leakage" and "remote code execution" in the example shown). Figure 3(c) shows how the malicious snippet set is built based on the output of the LLM as shown in Figure 3(b). The 'code' column is obtained by the LLM identifying malicious code, and its corresponding 'behavior' is obtained by the extractor YAKE! summarizing the output of the LLM.

Step 1.3: Summarizing malicious behavior features. This step is to process the extracted malicious snippet set obtained in Step 1.2, by automatically merging similar malicious behaviors (e.g., "process execution", "child process", and "create process") with the LLM and summarizing their behavior features. Figure 3(d) shows the summarized features for the running example. In this example, most of the malicious code snippets exhibiting the malicious behavior of "data leakage" contain OS-related APIs, meaning the use of operating systems by these code elements.
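To make Steps 1.1 and 1.2 concrete, the following is a minimal sketch of the iterative query loop and the keyword extraction, assuming the OpenAI chat API and the yake package; the prompt wording, the convergence test, and the response format are illustrative, not the paper's exact prompts.

```python
# Minimal sketch of Steps 1.1-1.2; prompt text and convergence test are
# illustrative assumptions, not the paper's exact prompts.
import yake
from openai import OpenAI

client = OpenAI()

def analyze_package(source: str, max_rounds: int = 5) -> str:
    """Step 1.1: query the LLM with adaptively refined prompts until the
    response converges, then treat it as the malicious-behavior summary."""
    prompt = ("Identify malicious code snippets in this JavaScript file. "
              "For each snippet, report the line numbers and describe the "
              "malicious behavior.\n\n" + source)
    previous = ""
    for _ in range(max_rounds):
        reply = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        if reply == previous:   # converged: stop refining
            break
        previous = reply
        # Feed the extracted snippets/descriptions back into the next
        # prompt as comments (simplified here).
        prompt += "\n\n// Previous analysis:\n// " + reply.replace("\n", "\n// ")
    return previous

def to_snippet_set(llm_output: str) -> list[tuple[str, str]]:
    """Step 1.2: pair each reported snippet with YAKE-extracted behavior
    keywords, e.g., ('var req = http.request(...)', 'remote code execution')."""
    extractor = yake.KeywordExtractor(lan="en", n=2, top=3)
    pairs = []
    for block in llm_output.split("\n\n"):   # one block per reported snippet
        keywords = [kw for kw, _ in extractor.extract_keywords(block)]
        pairs.append((block, ", ".join(keywords)))
    return pairs
```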
[Figure 1: a pipeline diagram. Training phase: training NPM package dataset → Step 1: LLM-based feature generation → Step 2: feature value extraction for training packages → training/test sets → Step 3: model training → a trained model.]

Figure 1: Overview of the training and detection process of MalPacDetector: the training phase generates a trained model and the detection phase uses the trained model to detect malicious lines of code in target NPM packages.
[Figure 3: the running example. (a) Input: malicious JavaScript that packs local information (package metadata, home directory, hostname, username, DNS servers, local and public IPs) into a JSON string and sends it as an HTTP POST request to dc.glc.st. (b) Identifying malicious code in NPM packages: the LLM flags lines 3-15 as potential data leakage (sensitive data converted into a JSON string) and lines 29-36 as potential remote code execution (trackingData sent to an external server that may belong to an attacker and may return malicious code or commands to execute). (c) Building a malicious snippet set: code snippets paired with behaviors such as "data leakage" and "remote code execution". (d) Summarizing features: e.g., operating system operation, HTTP request, process operation. (e) Output: the feature set.]

Figure 3: An example of NPM package and the LLM-generated features, where red boxes highlight the resulting code, behavior, and feature summary of this package.
During the manual development of our extraction code, for each malicious behavior pattern, we combine the feature names generated by the LLM with frequency analysis to select the most appropriate code elements as detection targets. These statistical results provide a selection scope and priority guidance for our code element selection. This data-driven approach effectively captures the common characteristics of malicious behaviors. For example, for the malicious behavior of remote code execution in a malicious snippet set, we identify 312 snippets that contain the eval function. Consequently, we use eval as the detection target to extract the feature value of useEval. Similarly, for the useNetwork feature, we choose detection targets such as http, request, and node-fetch that appear frequently in the malicious snippet set.

C. Step 3: Model Training

This is a standard process for training a classifier based on the feature representation of NPM packages, including both malicious and benign ones, to detect malicious packages. The decision on which classifier to train can be made based on factors such as the size of the dataset (i.e., the number of examples). When the dataset is small, deep learning models may not be appropriate.

D. Step 4: Target Package Feature Value Extraction

In the detection phase, we first generate the AST of the target package, then extract feature vectors from the nodes of the AST based on our feature set, and finally detect the malicious target packages using the trained model. While outputting the results, we can also obtain the malicious lines of code for malicious packages.

Figure 4(a) shows a target package, which contains two files: package.json and index.js. Note that index.js imports exec from child_process and then executes a shell command to collect sensitive information and send it to an outside server. Recall that Step 1 leads to features related to the installation and execution of scripts. When dealing with the package.json file, preinstall is extracted as a feature value. Figure 4(b) shows the AST of the target package and how the APIs and strings that perform malicious behaviors are extracted from the AST. Figure 4(c) shows the feature vector that reflects the malicious behavior of creating a process and accessing a domain.

E. Step 5: Malicious Package Detection

This step applies the trained model (obtained in Step 3 of the training phase) to the feature representation of the target NPM packages (obtained in Step 4) to classify whether a target package is malicious or not, and if so, which lines of code are malicious.

Continuing with the preceding example, Figure 4(d) presents the detection result, not only showing that the target package is malicious but also pinpointing the malicious lines of code, which are highlighted in red and italic.
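The following is a minimal sketch of the Step 4 extraction for a few of the features, using the esprima Python port as a stand-in for the paper's babel-based AST tooling; the library choice and the per-feature rules are assumptions for illustration.

```python
# Minimal sketch of Step 4: walk a JavaScript AST and set boolean feature
# values; esprima stands in for the paper's babel toolchain, and only a few
# of the 22 features are shown.
import json
import re
import esprima

NET_LIBS = {"http", "https", "node-fetch", "request"}
PROC_LIBS = {"child_process"}
DOMAIN_RE = re.compile(r"[a-z0-9-]+(\.[a-z0-9-]+)+", re.I)

def extract_features(index_js: str, package_json: str) -> dict:
    features = {"includeInstallScript": False, "includeDomain": False,
                "useProcess": False, "useNetwork": False, "useEval": False}
    scripts = json.loads(package_json).get("scripts", {})
    features["includeInstallScript"] = any(
        k in scripts for k in ("preinstall", "install", "postinstall"))

    def visit(node, meta):
        if node.type == "CallExpression":
            if getattr(node.callee, "name", None) == "eval":
                features["useEval"] = True
            if getattr(node.callee, "name", None) == "require" and node.arguments:
                mod = getattr(node.arguments[0], "value", "")
                if mod in PROC_LIBS:
                    features["useProcess"] = True
                if mod in NET_LIBS:
                    features["useNetwork"] = True
        if node.type == "Literal" and isinstance(node.value, str):
            if DOMAIN_RE.search(node.value):   # string constants with domains
                features["includeDomain"] = True

    esprima.parseScript(index_js, delegate=visit)
    return features
```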
[Figure 4: (a) Target package with two files: package.json declares "scripts": { "preinstall": "node index.js" }, and index.js calls exec from child_process with a shell command that collects hostname, working directory, and username, queries https://ifconfig.me, and exfiltrates the hex-encoded result via nslookup to subdomains of ji5ykbsz4xj4576mrf6n0fptkkqaez.oastify.com. (b) The AST of the target package (Program → VariableDeclaration → CallExpression nodes for require("child_process") and exec(...)). (c) The extracted feature vector: includeInstallScript = true, includeDomain = true, useProcess = true, useFileSystem = false, useEval = false, etc. (d) The detection result with the malicious lines highlighted.]

Figure 4: An example showing a target package is detected as malicious where malicious code is highlighted in red and italic.
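Continuing this example, Step 5 can be sketched as follows, assuming a trained sklearn-style classifier, the feature dict produced by extract_features above, and a per-feature record of the AST line numbers collected in Step 4; the line-reporting heuristic is illustrative.

```python
# Minimal sketch of Step 5; the feature ordering and line-reporting heuristic
# are illustrative assumptions.
import numpy as np

FEATURE_ORDER = ["includeInstallScript", "includeDomain", "useProcess",
                 "useNetwork", "useEval"]

def detect(clf, features: dict, feature_lines: dict) -> tuple[bool, list[int]]:
    """Classify the package; if malicious, report the lines of code at which
    the fired features were observed while walking the AST."""
    x = np.array([[float(features[name]) for name in FEATURE_ORDER]])
    is_malicious = bool(clf.predict(x)[0])
    lines = sorted({ln for name, lns in feature_lines.items()
                    for ln in lns if features.get(name)}) if is_malicious else []
    return is_malicious, lines
```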
III. THE MALNPMDB DATASET

Rather than proposing just another dataset for benchmarking malicious NPM package detectors, we propose the desired characteristics of such datasets to guide us in preparing the MalnpmDB dataset.

A. Desired Characteristics

We propose the following desired characteristics of benchmarking datasets for malicious NPM package detectors.

• Size. A dataset should contain as many examples as possible, including both malicious and benign examples (i.e., NPM packages).
• Ground truth. Each example is associated with its ground-truth label, where malicious examples (i.e., packages) are accompanied by the description of their malicious behaviors, which are not exhibited by the benign examples.
• Duplication. A dataset should not contain any duplicate examples, or alternatively should contain only a small fraction of duplicate examples. This is relevant because attackers often publish a malicious package under different package names, and thus these duplicate examples, which differ only in terms of package names, would affect the training and evaluation of detectors. Therefore, it is important to eliminate such duplicate examples when applicable.
• Representativeness. A dataset ideally should contain all kinds of malicious behaviors of NPM packages and reflect the distribution of malicious behaviors in the real world so that the trained model can generalize well. As an approximate measure of this characteristic, one may consider using the Vendi Score [19], which measures the diversity of examples in a dataset and is thus considered suitable for our purpose. Moreover, we consider the distribution of the 7 common behaviors of NPM packages [20]: file system, network, process, operating system, code execution, obfuscation, and install script.
• Balance. A dataset should be balanced in terms of the ratio of malicious vs. benign examples and the ratio between the different kinds of malicious behaviors (i.e., ideally equal).

B. Constructing the MalnpmDB Dataset

Under the guidance of the desired characteristics mentioned above, we construct the MalnpmDB dataset as follows. First, we review the existing datasets for the same purpose [7], [12], [15] and the malicious NPM packages collected by the open-source community [21], [22] with respect to the characteristics mentioned above to identify their weaknesses. We find that the existing datasets have the following weaknesses: there are many duplicate examples in Knife [15] and a few duplicate examples in MalOSS [7]; the implication of these issues will be seen when we use these datasets to evaluate the effectiveness of detectors. Moreover, it is difficult to obtain the complete source code of malicious packages in Snyk [21]. These weaknesses suggest that we should propose a higher-quality dataset by collecting more malicious examples with source code while eliminating duplicates.

Second, under the guidance of the desired characteristics and the findings about the weaknesses of existing datasets, we collect malicious NPM packages by searching projects on GitHub as well as old versions of Docker images and mirrors based on the list of malicious NPM packages we identified. We also communicate with security enterprises to obtain malicious packages they encountered in the past. This allows us to collect 6,789 malicious packages from the various sources described in Table I.

Table I: Sources of malicious NPM packages we collected

Source    | Knife [15] | MalOSS [7] | Dockers | Enterprises | Total
#packages | 2,086      | 567        | 1,239   | 2,897       | 6,789

Third, we clean up the malicious examples by removing duplicate packages, as attackers indeed reuse malicious packages under different package names. Two packages are duplicates of each other as long as they have (i) the same scripts field in their package.json files and (ii) the same code files. Note that (i) can be justified by the fact that the scripts field in the package.json file contains shell commands that will be automatically executed, but the other fields (e.g., package name, version, description, and author's information) are irrelevant to code execution; whereas (ii) can be justified by the fact that only code files are relevant to whether a package is malicious or benign, but explanatory files (e.g., README.md, LICENSE, package-lock.json) are not.
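A minimal sketch of this duplicate check, assuming packages unpacked into directories; the file-selection and hashing details are illustrative.

```python
# Minimal sketch: two packages are duplicates iff they have the same scripts
# field and identical code files; selection/hashing details are illustrative.
import hashlib
import json
from pathlib import Path

CODE_SUFFIXES = {".js", ".mjs", ".cjs"}   # explanatory files are ignored

def package_fingerprint(pkg_dir: str) -> str:
    root = Path(pkg_dir)
    scripts = json.loads((root / "package.json").read_text()).get("scripts", {})
    h = hashlib.sha256(json.dumps(scripts, sort_keys=True).encode())
    for f in sorted(root.rglob("*")):                 # deterministic order
        if f.is_file() and f.suffix in CODE_SUFFIXES:
            h.update(f.read_bytes())                  # code files only
    return h.hexdigest()

def are_duplicates(pkg_a: str, pkg_b: str) -> bool:
    return package_fingerprint(pkg_a) == package_fingerprint(pkg_b)
```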
Fourth, we manually check that the remaining packages are indeed malicious. Each malicious package is reviewed by two researchers to confirm its maliciousness and classify its malicious behavior. Only when the labels and the malicious behavior classifications given by the two researchers are consistent will the package be considered malicious; in the case of a dispute between the two researchers, the package is reviewed by a third researcher. This took four weeks of three experienced researchers, who also refine the malicious label of an example with a specific type of maliciousness mentioned in Table III (e.g., file system and network). This leads to 3,258 verified malicious NPM packages.

Fifth, we identify benign packages by treating the most popular packages as benign because they have likely been scrutinized by the community. To balance the malicious and benign examples in our dataset, and considering the number of malicious examples after cleaning, we download the top 5,000 popular NPM packages [23]. Since some of these packages cannot be downloaded successfully, we obtain 4,051 packages as benign examples. We advocate using balanced datasets for two reasons: (i) this aligns with conventional dataset construction practices; and (ii) it enables the model to learn more comprehensive malicious behavior patterns while minimizing false negatives. That is, the dataset contains 3,258 malicious NPM packages and 4,051 benign packages, leading to a total of 7,309 packages.

C. Analyzing the MalnpmDB Dataset

Now we show that the MalnpmDB dataset possesses the desired characteristics mentioned above. First, in terms of size, Table II shows that MalnpmDB contains a large number of malicious examples. Note that we focus on the number of malicious examples because they are often more difficult to obtain than benign examples.

Table II: Datasets comparison

Metrics                        | Knife [15] | MalOSS [7] | MalnpmDB
Size (# of malicious packages) | 2,086      | 567        | 3,258
Duplication (%)                | 77.3       | 18.2       | 0.0
Vendi Score                    | 3.42       | 2.94       | 7.61

Second, in terms of ground truth, we manually verify that the malicious examples are indeed malicious, while using heuristics to identify benign examples, deeming the most popular packages benign as they have been scrutinized by the community.

Third, in terms of duplication, Table II shows that MalnpmDB has 0% duplication in malicious packages, but the existing datasets, Knife [15] and MalOSS [7], have a significant percentage of duplicates among their malicious packages. It is interesting to note that after removing duplicate packages, Knife and MalOSS have about the same small number of malicious packages. In the following experiments (RQ2), we test the effect of the same model on different datasets. Too many repeated examples will cause the model to overfit.

Fourth, in terms of representativeness via the Vendi Score, our dataset exhibits a higher Vendi Score and thus higher representativeness. In terms of the distribution of malicious behaviors, we extract feature vectors from the malicious examples and then perform dimensionality reduction via the t-SNE algorithm [24] and clustering via the k-means algorithm [25]. Figures 5(a)-(c) present the scatter plots of feature vectors extracted from the three datasets, respectively. We observe that the distribution of feature vectors in Figure 5(c) is more scattered, indicating that the malicious behaviors in MalnpmDB (mal) (i.e., the malicious examples in MalnpmDB) are more diverse and representative. By contrast, Knife and MalOSS are sparser and their malicious behaviors are monotonous: the feature vector distribution of Knife is similar to that of MalOSS, even though the former contains more data points (because the former contains duplicates).

Fifth, in terms of balance, MalnpmDB has a 44:56 ratio of malicious:benign packages. For optimal results, it is advisable to ensure a roughly equal number of examples for each class [26].

Insight 1: MalnpmDB is a better benchmark dataset than Knife and MalOSS for evaluating malicious NPM package detectors.

IV. EXPERIMENTS AND RESULTS

A. Experiments

We run experiments (the 5 steps of MalPacDetector) on a computer with an Intel(R) Xeon(R) Gold 6240 CPU running at 2.60GHz with 16GB of memory. The LLM used in our experiments is GPT-3.5. The other tools used in the experiments include the keyword extractor YAKE! and the JavaScript compiler babel.

Selection of classifier to instantiate MalPacDetector in our experiments. As mentioned above, the NPM detector can be instantiated with many kinds of machine learning models, and the choice of a specific model depends on factors such as the size of the dataset. Although MalnpmDB has more malicious packages than any existing dataset, the number of malicious examples is still small. Thus, we choose the Random Forest (RF) model in our experiments because it can avoid overfitting, reduce variance, generalize well, evaluate the importance of features, and discover the interactions between features, which is beneficial for the analysis of our results. When training an RF model, we use a robust stratified κ-fold cross-validation to train and test a classifier, by partitioning the dataset into κ-1 subsets for training and using the remaining subset for testing.
[Figure 5: three t-SNE scatter plots of malicious feature vectors, one per dataset.]

Figure 5: Scatter plots of feature vectors from (a) Knife, (b) MalOSS, (c) MalnpmDB (mal), where colors represent different types of malicious behaviors, and a straight line links each example to the center of the cluster to which it belongs.
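The diversity analysis behind Figure 5 can be reproduced along the following lines, assuming X_mal holds the feature vectors of the malicious examples; the 7 clusters follow the 7 behavior types mentioned above, and the plotting details are illustrative.

```python
# Minimal sketch of the Figure 5 analysis: t-SNE reduction plus k-means
# clustering; plotting details are illustrative.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

def plot_behavior_clusters(X_mal: np.ndarray, n_clusters: int = 7) -> None:
    pts = TSNE(n_components=2, random_state=0).fit_transform(X_mal)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(pts)
    centers = np.array([pts[labels == c].mean(axis=0)
                        for c in range(n_clusters)])
    plt.scatter(pts[:, 0], pts[:, 1], c=labels, s=8)
    for p, c in zip(pts, labels):   # line from each example to its center
        plt.plot([p[0], centers[c][0]], [p[1], centers[c][1]],
                 linewidth=0.2, alpha=0.3)
    plt.show()
```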
Effectiveness metrics. To evaluate the effectiveness of a detector, we use the widely-used metrics: accuracy, precision, False Negative Rate (FNR), False Positive Rate (FPR), and F1-measure (F1). Accuracy is the overall correctness of a detector, namely the ratio of correctly predicted packages (including true positives and true negatives) to the total number of packages in question. Precision is the ratio of correctly predicted positive packages to the detected positive packages. FNR is the ratio of false negative packages to the positive packages. FPR is the ratio of incorrectly predicted positive packages to the negative packages. The overall effectiveness F1-measure is defined as F1 = (2 · Precision · (1 − FNR)) / (Precision + (1 − FNR)), where 1 − FNR is the recall.

Research Questions (RQs). Our experiments are geared towards answering:

• RQ1: How effective is LLM in generating features?
• RQ2: How effective is MalPacDetector in detecting malicious NPM packages with known ground truth?
• RQ3: How useful is MalPacDetector in detecting malicious NPM packages in the real world where the ground truth is unknown?

B. How Effective Is LLM? (RQ1)

Table III summarizes the 22 features extracted from MalnpmDB, with the assistance of GPT-3.5 as described above. We divide these 22 features into the following 7 categories.

• File system. Malicious JavaScript code can leverage the file system to conduct malicious activities, such as (i) stealing, tampering with, and deleting sensitive data (e.g., login credentials) and personal documents and (ii) creating files or registry entries to persist on a victim's computer. This category has 3 features.
• Network. Malicious code may exfiltrate sensitive data from victim computers, causing network activities that may be leveraged to detect them. This category has 5 features.
• Process. Process-related malicious activities include downloading a malicious program from an attacker and executing it, stealing process environment variables, and running a backdoor program to receive and execute malicious instructions from the attacker. This category has 4 features.
• Operating system. Attackers can obtain information about the operating system running on a computer, its hostname, and its domain name. This information may help an attacker choose a specific attack or exploit. An attacker may execute system calls or commands to perform malicious actions, such as creating, deleting, or modifying files and starting external processes. This category has one feature.
• Code execution. Malicious code often leverages dynamic code execution to perform actions such as installing malware, exfiltrating sensitive data, and evading detection. These are often conducted by leveraging the eval function, leading to one feature.
• Obfuscation. Malicious code often employs obfuscation techniques to bypass security controls and evade the detection of security tools. Common obfuscation methods include encoding strings into base64 or byte format, using the AES encryption algorithm, and using a compression tool. This category has 7 features.
• Install script. In each NPM package, there is a package.json file that contains metadata about the package. This allows package writers to customize their script commands, including preinstall, install, and postinstall, which are automatically executed before, during, and after the package installation, respectively. Attackers can use this mechanism to automatically execute malicious instructions before, during, and after installation. This leads to the feature that considers the JavaScript code files that are involved in the install command as part of the install script.

Discovery of new malicious behavior features. Table III compares the features identified by the LLM with the features summarized in other papers [8], [11], [12], [27]. We can obtain all the malicious behavior features summarized in other papers through the LLM. In addition, we also found two new malicious behavior features. These features are related to a novel obfuscation method that achieves malicious behavior by inserting base64-obfuscated code into script files. By applying the feature set with the two new features to the entire detection tool, we achieve a significant reduction in the false positive rate on the MalnpmDB dataset, from 4.6% down to 1.3%.

Performance comparison between different LLMs. We test other LLMs, such as GPT-4 and Deepseek-R1, against the malicious samples in MalnpmDB.
Table III: Features generated with the assistance of GPT-3.5, from MalnpmDB

Feature category | Feature name                | Feature description                                                  | Others | MalnpmDB
File system      | useFileSystem               | Use file system related libraries or functions in code              | ✓      | ✓
File system      | useFileSystemInScript       | Use file system related libraries or functions in the install script | ✓      | ✓
File system      | includeSensitiveFiles       | Include sensitive files in string                                    | ✓      | ✓
Network          | includeIP                   | Include IP                                                           | ✓      | ✓
Network          | includeDomain               | Include domain name                                                  | ✓      | ✓
Network          | includeDomainInScript       | Include domain name in the install script                            | ✓      | ✓
Network          | useNetwork                  | Use network related libraries or functions in code                   | ✓      | ✓
Network          | useNetworkInScript          | Use network related libraries or functions in the install script     | ✓      | ✓
Process          | useProcess                  | Use process related libraries or functions in code                   | ✓      | ✓
Process          | useProcessInScript          | Use process related libraries or functions in the install script     | ✓      | ✓
Process          | useProcessEnv               | Use process environment related libraries or functions in code       | ✓      | ✓
Process          | useProcessEnvInScript       | Use process environment related libraries in the install script      | ✓      | ✓
Operating system | useOperatingSystem          | Use operating system related libraries or functions in code          | ✓      | ✓
Code execution   | useEval                     | Call eval function in code                                           | ✓      | ✓
Obfuscation      | useBase64Conversion         | Use base64 conversion in code                                        | ✓      | ✓
Obfuscation      | useBase64ConversionScript   | Use base64 conversion in the install script                          | -      | ✓
Obfuscation      | includeBase64String         | Include base64 string in code                                        | ✓      | ✓
Obfuscation      | includeBase64StringInScript | Include base64 string in the install script                          | -      | ✓
Obfuscation      | includeByteString           | Include byte string in code                                          | ✓      | ✓
Obfuscation      | useBuffer                   | Use Buffer library or functions in code                              | ✓      | ✓
Obfuscation      | useEncrytAndEncode          | Use encryption and encoding libraries or functions in code           | ✓      | ✓
Install script   | includeInstallScript        | Have at least one install script in package.json                     | ✓      | ✓
GPT-3.5 detects 8,793 malicious behaviors (e.g., Figure 3(b)), but GPT-4 detects 126 more and Deepseek-R1 detects 74 more. However, most of these newly detected malicious behaviors are similar to the known ones and did not result in the generation of new features. This partly owes to the feature threshold mentioned in Algorithm 1, which prevents previously unseen malicious behaviors from being summarized as features (i.e., limited capabilities in dealing with new behaviors). Another reason is that the number of malicious samples in MalnpmDB is limited, so improvements in LLM performance are difficult to translate into significant results.

Insight 2: LLM can identify malicious behaviors from NPM packages and summarize them into features.

C. How Effective Is MalPacDetector? (RQ2)

We construct a feature extractor by utilizing the 22 features summarized in Table III and the malicious code snippets generated by GPT-3.5. This extractor serves as the foundation for our subsequent experimental evaluations.

Experiments of MalPacDetector with different datasets. To measure the effectiveness of MalPacDetector in detecting malicious NPM packages with known ground truth, we conduct experiments on 4 datasets: Knife [15] and its variant dubbed Knife*, which is obtained by removing the duplicated examples (i.e., only one copy is kept), MalOSS [7], and our dataset MalnpmDB. Since Knife (and thus Knife*) and MalOSS only contain malicious NPM packages, we use the same 4,051 benign packages in our MalnpmDB as their benign examples. We use a 4-fold cross-validation, meaning κ = 4 in these experiments. The time complexity of the experiments is insignificant; e.g., the training time of MalPacDetector with respect to MalnpmDB is 1,877 seconds and the average detection time for a package is 0.25 seconds (thus, we omit the time complexity in subsequent experiments).

Table IV: Effectiveness of MalPacDetector (metrics unit: %)

Dataset                | Accuracy | Precision | FNR  | FPR | F1
Knife [15]             | 98.8     | 99.3      | 2.8  | 0.1 | 98.2
Knife*                 | 94.3     | 84.2      | 26.4 | 1.6 | 78.4
MalOSS [7]             | 95.4     | 88.6      | 26.9 | 1.4 | 80.1
MalnpmDB (our dataset) | 95.8     | 98.0      | 7.5  | 1.3 | 95.2

Table IV summarizes the effectiveness of MalPacDetector when applied to the 4 datasets. We observe that it performs poorly when applied to the MalOSS dataset, with a high false negative rate (26.9%). This can be attributed to the small number of malicious examples. When applied to the Knife dataset, it achieves the smallest FNR (2.8%) and FPR (0.1%). This can be attributed to the fact that the dataset contains many duplicate examples, because after removing the duplicates (Knife*), the FNR increases substantially (to 26.4%) and the FPR increases to 1.6%.

Comparing MalPacDetector with other detectors. Since MalnpmDB is better than the other datasets, we use it to compare the effectiveness of MalPacDetector with that of the following publicly available state-of-the-art malicious NPM package detectors: (i) the OSS Detect Backdoor (ODB) [8], which detects malicious behavior in NPM packages through regular expression matching; (ii) MalWuKong [27], which proposes an innovative approach that integrates source code slicing, inter-procedural analysis, and cross-file inter-procedural analysis; (iii) Ohm et al.'s detector [11], which is instantiated as an RF model and detects malicious packages using manually defined features; and (iv) the AMALFI detector [12], which is instantiated as a Support Vector Machine (SVM) model and considers the differences introduced when a malicious package is updated. Note that ODB and MalWuKong do not require model training, so we directly test their effectiveness on the entire MalnpmDB dataset. Note that we do not consider the DONAPI detector [28] because its source code and dataset are not publicly available, and we do not consider the MalOSS detector [7] because it uses dynamic features that are specific to the MalOSS dataset but do not apply to all NPM packages in MalnpmDB. In these experiments, we also set κ = 4 (i.e., 4-fold cross-validation).
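For reference, the following sketch computes the metrics defined in Section IV-A from a detector's package-level predictions; the label encoding (1 = malicious) is an assumption.

```python
# Minimal sketch implementing the metric definitions of Section IV-A.
import numpy as np

def detection_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp)
    fnr = fn / (fn + tp)          # false negatives over all positives
    fpr = fp / (fp + tn)          # false positives over all negatives
    f1 = 2 * precision * (1 - fnr) / (precision + (1 - fnr))
    return {"accuracy": (tp + tn) / len(y_true), "precision": precision,
            "FNR": fnr, "FPR": fpr, "F1": f1}
```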
Table V: Comparison of detectors via MalnpmDB (columns 2-5 metrics unit: %)

Detector        | Accuracy | Precision | FNR  | FPR  | F1   | P-Value
ODB [8]         | 47.2     | 45.4      | 9.9  | 75.5 | 60.4 | 0.01
MalWuKong [27]  | 93.1     | 95.0      | 10.5 | 3.7  | 92.2 | 0.43
Ohm et al. [11] | 93.5     | 97.9      | 12.5 | 1.3  | 92.4 | 0.54
AMALFI [12]     | 90.1     | 92.5      | 14.9 | 5.4  | 88.6 | 0.05
MalPacDetector  | 95.8     | 98.0      | 7.5  | 1.3  | 95.2 | 1.00

Table V summarizes the experimental results. We observe that, among the existing detectors, Ohm et al.'s detector [11] with RF achieves the best effectiveness, with an F1 of 92.4%. ODB has an extremely high FPR (75.5%) and a high FNR (9.9%), likely owing to its use of general rules. Nevertheless, MalPacDetector outperforms the other detectors in all metrics, with a precision of 98.0%, an FPR of 1.3%, and an F1 of 95.2%. This can be attributed to the fact that the features generated by the LLM capture the distinctions between malicious and benign packages.

To rigorously evaluate the performance differences among the models, we conduct Friedman and Nemenyi statistical tests [29]. Since MalnpmDB encompasses all available malicious samples in our collection and no additional malware datasets exist for testing, we enhance the conventional Friedman test methodology. Our improved approach employs 4-fold cross-validation to partition the dataset, creating pseudo-repeated observations that simulate multiple dataset scenarios. We record each classifier's performance metrics during every validation fold, using F1-scores as our statistical measure.

The Friedman test yields a significant P-Value of 0.019, indicating statistically distinguishable performance among the classifiers. Subsequent Nemenyi post-hoc testing reveals specific differences between classifiers, as detailed in the "P-Value" column of Table V, which compares the other detectors against MalPacDetector. While Ohm et al.'s detector [11] and MalWuKong demonstrate performance comparable to our approach, their overall effectiveness remains inferior. The other two tools show substantially larger performance gaps compared to our solution.

Insight 3: MalPacDetector outperforms the state-of-the-art malicious NPM package detectors.

Is RF the only option for instantiating MalPacDetector? To answer this question, we use MalnpmDB to train and compare 4 classifiers: Naive Bayes (NB), Multi-Layer Perceptron (MLP), RF, and SVM; we consider these models because they all offer good interpretability. We perform 4-fold cross-validation using different models and hyperparameters, and select the best hyperparameters based on the average F1 score.

Table VI: Effectiveness of MalPacDetector with four machine learning classification models on our MalnpmDB dataset employing a 4-fold cross-validation methodology (metrics unit: %)

Model | Accuracy | Precision | FNR  | FPR | F1
NB    | 93.4     | 97.3      | 12.5 | 2.0 | 92.1
MLP   | 95.7     | 97.0      | 6.8  | 2.4 | 95.0
RF    | 95.8     | 98.0      | 7.5  | 1.3 | 95.2
SVM   | 95.9     | 98.1      | 7.5  | 1.4 | 95.2

Table VI shows the effectiveness of MalPacDetector when instantiated with the 4 classifiers, respectively. We observe that all instantiations have good precision. Nevertheless, MLP, RF, and SVM achieve better accuracy and FNR. In particular, RF excels by achieving the lowest FPR.

Insight 4: The random forest model is a better option for instantiating MalPacDetector, but other classifiers are also effective.

[Figure 6: a barplot of the average value of each of the 22 features over the malicious vs. benign examples.]

Figure 6: Barplot of average feature values, where malicious examples (i.e., packages) and benign ones exhibit a clear difference in the install script, meaning that attackers tend to insert malicious code into the install script.

Characterizing the features generated by LLM. As described in Step 1 of MalPacDetector, the feature generation process relies on a combination of domain expertise and statistical analysis. Moreover, certain features are customized to JavaScript packages. As highlighted in Figure 6, the features corresponding to the install script are includeInstallScript, includeBase64StringInScript, includeDomainInScript, useNetworkInScript, useBase64ConversionInScript, useProcessEnvInScript, useProcessInScript, and useFileSystemInScript, which have an average value close to 0 for benign examples and thus show a significant discrepancy between the benign examples and the malicious examples. This can be attributed to the fact that installation commands such as preinstall are often used to prepare the environment for installation (e.g., installing required dependencies, creating specific file structures, or setting environment variables). In most cases, preinstall requires only common shell commands and does not require JavaScript code to be executed; even when code execution is required, access to the libraries that are associated with these features is not necessary. Note that the install and postinstall commands are similar. The values of in-code features such as includeDomain, useNetwork, useOperatingSystem, useEval, useProcess, and includeIP show obvious discrepancies between malicious and benign examples. Other features, such as includeBase64String, useFileSystem, useBuffer, and useProcessEnv, have similar average values for benign and malicious examples, which can be attributed to the fact that all examples often use these basic and common modules. The preceding discussion underscores the effectiveness of the features generated by the LLM in discriminating between benign and malicious packages.

Insight 5: LLM, or more specifically GPT-3.5, can effectively generate features to distinguish benign and malicious NPM packages.
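Returning to the detector comparison, the Friedman/Nemenyi procedure reported alongside Table V can be sketched as follows, assuming scipy and the scikit-posthocs package; the fold-wise F1 scores shown are illustrative placeholders, not the measured values.

```python
# Minimal sketch of the Friedman/Nemenyi tests over per-fold F1 scores;
# the score matrix is an illustrative placeholder.
import numpy as np
import scikit_posthocs as sp
from scipy.stats import friedmanchisquare

# rows = the 4 validation folds; columns = the 5 detectors of Table V
fold_f1 = np.array([
    [0.60, 0.92, 0.93, 0.88, 0.95],
    [0.61, 0.93, 0.92, 0.89, 0.96],
    [0.59, 0.91, 0.92, 0.88, 0.95],
    [0.62, 0.93, 0.93, 0.89, 0.95],
])

stat, p = friedmanchisquare(*fold_f1.T)   # one score series per detector
print(f"Friedman p-value: {p:.3f}")
if p < 0.05:                              # post-hoc pairwise comparison
    print(sp.posthoc_nemenyi_friedman(fold_f1))
```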
D. How Useful Is MalPacDetector? (RQ3)

To evaluate the usefulness of MalPacDetector in practice, where the ground truth is unknown, we apply MalPacDetector to the newly released packages on the NPM registry over three weeks. Due to the balanced nature of our training dataset, direct deployment of our detection model in real-world environments may result in an unacceptably high false positive rate. To address this issue, we adopt the methodology proposed by AMALFI [12] to implement a daily manual review process for all detected packages. Furthermore, we incorporate these verified packages (including both benign and malicious ones) into our training dataset, enabling iterative model refinement through continuous retraining.

We analyze 269,175 packages using MalPacDetector, which initially identifies 1,604 potentially malicious packages (most of them occurring in the first three days). Through manual verification, we confirm that 39 are indeed malicious packages. We report them to the NPM security team, which confirmed the maliciousness of all of them. As a result, these 39 packages have been removed from the NPM registry.

Table VII summarizes the 39 malicious packages detected by MalPacDetector. We observe that 22 packages have been downloaded multiple times, but the others have not been downloaded because we informed the NPM security team about their maliciousness rapidly. We find that the 39 packages are mainly about stealing user or host computer information. We summarize their attack and anti-detection methods in 5 categories:

• Shell scripts in installation. This type of package uses shell scripts to leverage preinstall, install, or postinstall to wage attacks before, during, or after the installation process, as described in the scripts field of the package.json file.
• Shell scripts in code. This type of malicious package uses shell scripts in JavaScript code to attack the user's host.
• Pure malicious code. The payload steals users' information (e.g., home path, hostname, username, and DNS servers' information) and sends it to a remote server. It does not employ obfuscation or other techniques to evade detection.
• DNS resolution. In this case, the stolen information is sent through the DNS resolution protocol rather than common communication protocols such as HTTP or HTTPS. It splits the data into many pieces, encodes and wraps each piece into the start of a domain name, and then adds some random bytes or salt into the piece to avoid DNS caching.
• Obfuscation. This attack uses obfuscation to lower readability, such as splitting keywords into several parts and storing them in a "keyword dictionary". When calling functions, it uses a dictionary lookup function to search for keywords via their index. It also contains much junk code (e.g., adding useless conditional branch statements and wrapping a normal function call statement into multiple function calls). It periodically executes a function that interrupts the debugger to prevent security personnel from dynamically debugging the code.

Table VII: The 39 malicious NPM packages detected by MalPacDetector and their attack methods, where "A1" means shell scripts in installation, "A2" means shell scripts in code, "A3" means pure malicious code, "A4" means DNS resolution, "A5" means obfuscation, "✓" means that a package contains an attack method, and "-" means that a package does not contain an attack method.

Package name                           | #Downloads | A1 | A2 | A3 | A4 | A5
emails-helper-2.0.20                   | 539        | -  | -  | ✓  | ✓  | -
emails-helpers-2.0.20                  | 0          | -  | -  | ✓  | ✓  | -
hydra-consent-app-express-2.0.0        | 66         | -  | -  | ✓  | -  | -
rust-functions-1.0.2                   | 73         | -  | -  | ✓  | -  | -
web3-provider-patchers-1.0.2           | 0          | -  | -  | -  | -  | ✓
wallet-watch-asset-1.0.2, 21.0.2       | 213        | -  | -  | -  | -  | ✓
wallet-add-chain-1.0.2                 | 47         | -  | -  | -  | -  | ✓
metronome-ui-1.0.2, 21.0.2             | 186        | -  | -  | -  | -  | ✓
chain-list-2.0.0                       | 106        | -  | -  | -  | -  | ✓
master-oracle-lib-2.0.0                | 60         | -  | -  | -  | -  | ✓
vesper-synth-user-lib-2.0.0            | 64         | -  | -  | -  | -  | ✓
metronome-synth-user-lib-1.0.2         | 0          | -  | -  | -  | -  | ✓
betterbit-frame-pkg-2.0.0              | 58         | -  | -  | -  | -  | ✓
dm commons utilities-99.99.0           | 0          | -  | -  | ✓  | -  | -
hardhat-gas-report-1.1.18              | 245        | ✓  | -  | -  | -  | -
helio tawa-5.0.1                       | 196        | -  | -  | ✓  | -  | -
@gds-web-ui/core-1.0.0                 | 0          | -  | -  | ✓  | -  | -
@gds-web-ui/sodalite-0.23.1            | 0          | -  | -  | ✓  | -  | -
@grabdefence/trust-feature-1.0.5-rc.1  | 0          | -  | -  | -  | ✓  | -
paysecure-9.0.3                        | 0          | -  | -  | ✓  | -  | -
paysecuretest-99.9.9                   | 0          | -  | -  | ✓  | -  | -
grablink-web-sdk-1.1.2                 | 0          | -  | -  | ✓  | -  | -
kwaishop-radar-99.9.1                  | 0          | -  | -  | -  | -  | ✓
vforvenom-1.999.0                      | 0          | -  | ✓  | ✓  | -  | -
pb-styles-v1-5.1.1                     | 0          | -  | -  | ✓  | -  | -
zara-mkt-core-1.0.0, 9.9.1             | 74         | ✓  | -  | -  | -  | -
npm-random-gen-1.0.0, 1.0.1            | 0          | ✓  | -  | -  | -  | -
send-orchestrator-event-lambda-8.8.8   | 77         | -  | -  | ✓  | -  | -
walletconnect-website-4.4.4            | 0          | -  | -  | ✓  | -  | -
subspace-relayer-front-end-4.4.4       | 0          | -  | -  | ✓  | -  | -
pathkit-local-9.9.9                    | 60         | ✓  | -  | -  | -  | -
producer-journey-1.0.4                 | 378        | -  | -  | ✓  | -  | -
adidas-data-mesh-9.9.8                 | 760        | ✓  | -  | -  | -  | -
inteken-app-client-9.9.1               | 303        | ✓  | -  | -  | -  | -
surf-sharekit-frontend-9.9.7           | 206        | -  | -  | ✓  | -  | -
symbl ai-1.0.0                         | 66         | -  | -  | ✓  | -  | -
puppeteer-example-0.1.10               | 1,187      | -  | -  | ✓  | -  | -
mfp-food-diary-0.1.2                   | 140        | ✓  | -  | -  | -  | -
kendo-react-messages-1.999.0, 1.999.1  | 0          | -  | -  | ✓  | -  | -

Table VII summarizes the attack methods of the 39 malicious packages detected by MalPacDetector. We observe that the two most common attacks are pure malicious code and obfuscation, which are used in 20 and 10 packages, respectively. The third most frequent attack is shell scripts in installation (7 packages).

Insight 6: MalPacDetector is useful in detecting malicious NPM packages in the real world.

V. RELATED WORK

We divide related prior studies into three categories: malicious NPM package detection, malicious NPM package datasets, and the application of LLM to code security.

A. Malicious NPM Package Detectors

There are two types of approaches to detecting malicious NPM packages: program analysis-based and machine learning-based detectors.
Program analysis-based detectors. These detectors use rules to detect malicious code while leveraging other information (e.g., historical versions, developers, and software components). These detectors often suffer from a high false positive rate or a high false negative rate [8], [30]. Duan et al. [7] leveraged a combination of static analysis and dynamic analysis, together with package metadata, to build an analysis pipeline and detect malicious packages. However, this detector requires the manual definition and updating of detection rules. Scalco et al. [14], [31] examined the differences between a package and its purported source code to detect malware. However, code difference is a weak indicator of attack; thus, the detector can miss many attacks. Huang et al. [28] combined code reconstruction techniques and static analysis, and extracted API call sequences to confirm and detect obfuscated content. However, the effectiveness of their static analysis part still depends on the selection of features. Detectors such as Microsoft's Application Inspector [30] and OSS Detect Backdoor [8] provide regular expression-based scanning with a high false positive rate.

Machine learning-based detectors. These detectors are trained via supervised learning from manually defined features (e.g., [11], [12], [28]) or unsupervised learning, but still from manually defined features (e.g., [10], [13], [32]–[34]). Ohm et al. [11] analyzed three commonly employed supervised machine learning techniques and concluded that pre-selecting a number of packages for further manual analysis could help detectors work. Sejfia and Schäfer [12] proposed a detector consisting of three complementary techniques in the context of NPM packages, by leveraging manually defined features. By contrast, we use an LLM to learn features automatically. Experimental results show that our detector simultaneously achieves a lower false positive rate and a lower false negative rate. In this study, we propose and investigate how to apply an LLM to define/generate effective features.

B. Malicious NPM Package Datasets

At present, the mainstream ways to obtain datasets are through communities and academic research. The official NPM registry regularly removes malicious packages, which makes it extremely difficult to obtain source code datasets.

Communities. In the public community, only malicious package names and version information can be obtained, and it is difficult to obtain complete malicious packages with source code. In open-source communities such as GitHub [22], researchers collect lists of malicious packages and classify them. Public detection platforms such as Snyk [21] regularly publish information about malicious packages they discover. In comparison, we collect a list of malicious packages from the past three years and the source code of the packages from some mirror sources and Docker [34].

Academic research. Datasets from academic research are another important source, but currently there are fewer works on malicious package detection and the available datasets are limited. The Backstabber's Knife Collection [15] is a widely used dataset, and it took the lead in collecting and analyzing malicious packages. However, this dataset has monotonous malicious behaviors and many repetitive segments. MalOSS [7] utilizes recent malicious packages gathered from the community to create experimental data; however, the limited size of the dataset is insufficient to meet the demands of machine learning. AMALFI [12] obtained some malicious examples from the NPM officials and built a dataset, but the authors only disclosed the version information of the malicious packages. Yu et al. [35] used an LLM to convert other types of malicious code into JavaScript code. This solution can enrich the types of malicious behaviors, but there is a gap between the converted malicious code and real-world malicious code. The above three datasets cannot meet the needs of our model in terms of quantity and quality, so we built our own dataset.

C. Application of LLM to Code Security

LLMs have been applied for code detection and test generation purposes [36]–[38]. Closely related to our study are [39], which applies an LLM to detect malicious NPM and Python packages, and [40], which applies an LLM to detect small NPM packages. The present study is the first to apply an LLM to generate features from malicious packages and then use these features to train models that are faster and/or more lightweight than deep learning models; deep learning models are also not feasible in our setting because of the limited dataset size.

VI. DISCUSSIONS

A. Limitations

The present study has several limitations. First, the main challenge in malicious NPM package detection lies in anti-detection technology, especially code obfuscation. While the AST allows us to handle basic code minification to some degree, it is ineffective against advanced obfuscation techniques, such as complex code compression. In these cases, we cannot accurately extract the original feature vectors but must process such samples uniformly as obfuscated code. By considering how attackers inject malicious code into normal packages, we can counter obfuscation to some extent. Further research needs to be conducted to deal with code obfuscation attacks.

Second, MalPacDetector uses static detection technology, which means that it has limited capability in dealing with dynamic behaviors. For instance, when dealing with unknown domain names and scripts, manual review is necessary. A package may have multiple malicious features but not perform malicious behavior. This type of package is rare in our dataset, but it is foreseeable that the model will have a high false positive rate for such packages.

Third, the quality of feature generation heavily depends on the dataset. When we adapt this approach to Python packages, the results are significantly less impressive than their counterparts with respect to NPM packages. This can be attributed to the limited quantity and low diversity of malicious samples available in Python.

Fourth, the feature generation relies on the performance of the LLM in question. For new types of malicious packages, the LLM may not be able to extract the malicious behavior snippets, leading to ineffective features. This issue needs to be addressed by further studies.
the LLM may not be able to extract the malicious behavior snippets, leading to ineffective features. This issue needs to be addressed by further studies.

B. Threats to Validity

In our approach of using LLMs to generate features, we utilize all malicious samples from MalnpmDB. This means that the training and test sets subsequently divided from MalnpmDB cannot be guaranteed to be completely independent. This compromise stems from two practical constraints: (i) A fixed feature set and feature extractor are essential. Given the limited number of malicious samples and the need to prevent overfitting, 4-fold cross-validation proves critical; specifically, only the training-test partitioning varies when computing the model's average performance. Constructing a separate feature extractor for each fold would compromise experimental rigor. (ii) The feature set and extractor are constructed based on the observed distribution of malicious behaviors. Under ideal conditions, random dataset partitioning maintains this behavioral distribution, ensuring that the test set involved in feature generation does not affect subsequent model training.
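This protocol can be sketched as follows in TypeScript; buildFeatureExtractor and evaluate are hypothetical placeholders of ours, not the actual pipeline, and the point is only that the extractor is built once while the partition varies.

  // Sketch of the evaluation protocol: the feature extractor is built once
  // from the whole dataset, and only the train/test split changes per fold.
  type Sample = { code: string; label: 0 | 1 };

  function buildFeatureExtractor(dataset: Sample[]): (s: Sample) => number[] {
    // Placeholder for the fixed, LLM-derived feature set.
    return (s) => [s.code.length];
  }

  function evaluate(train: Sample[], test: Sample[],
                    extract: (s: Sample) => number[]): number {
    return 0; // placeholder for train-then-score on this fold
  }

  function fourFoldAverage(dataset: Sample[]): number {
    const extract = buildFeatureExtractor(dataset); // built once, shared by all folds
    const k = 4;
    const foldSize = Math.ceil(dataset.length / k);
    let total = 0;
    for (let i = 0; i < k; i++) {
      const test = dataset.slice(i * foldSize, (i + 1) * foldSize);
      const train = [
        ...dataset.slice(0, i * foldSize),
        ...dataset.slice((i + 1) * foldSize),
      ];
      total += evaluate(train, test, extract); // only the partition changes
    }
    return total / k;
  }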
To partially demonstrate our tool's capability in detecting completely unknown samples, we provide real-world detection results. Conducting a more rigorous quantitative evaluation remains a direction for our future work.

VII. CONCLUSION

We have presented MalPacDetector, a malicious NPM package detector, and showed that it is effective and useful. One key innovation is to use an LLM to guide the generation of features that describe the malicious behaviors of NPM packages. We have not only constructed a new dataset, MalnpmDB, which advances the state of the art, but also proposed a set of desired characteristics that should be satisfied by a good benchmark dataset. The limitations of MalPacDetector serve as interesting open problems for future research.

REFERENCES

[1] N. Zahan, T. Zimmermann, P. Godefroid, B. Murphy, C. S. Maddila, and L. A. Williams, "What are weak links in the npm supply chain?" in Proceedings of the 44th IEEE/ACM International Conference on Software Engineering: Software Engineering in Practice, Pittsburgh, PA, USA, 2022, pp. 331–340.
[2] A. Bagmar, J. Wedgwood, D. Levin, and J. Purtilo, "I know what you imported last summer: A study of security threats in the Python ecosystem," arXiv preprint arXiv:2102.06301, 2021.
[3] I. Koishybayev and A. Kapravelos, "Mininode: Reducing the attack surface of node.js applications," in Proceedings of the 23rd International Symposium on Research in Attacks, Intrusions and Defenses (RAID), San Sebastian, Spain, 2020, pp. 121–134.
[4] X. Jiang, L. Meng, S. Li, and D. Wu, "Active poisoning: Efficient backdoor attacks on transfer learning-based brain-computer interfaces," Science China Information Sciences, vol. 66, no. 8, 2023.
[5] Eslint-scope advisory, https://github.com/advisories/GHSA-hxxf-q3w9-4xgw.
[6] M. Zimmermann, C. Staicu, C. Tenny, and M. Pradel, "Small world with high risks: A study of security threats in the npm ecosystem," in Proceedings of the 28th USENIX Security Symposium, Santa Clara, CA, USA, 2019, pp. 995–1010.
[7] R. Duan, O. Alrawi, R. P. Kasturi, R. Elder, B. Saltaformaggio, and W. Lee, "Towards measuring supply chain attacks on package managers for interpreted languages," in Proceedings of the 28th Annual Network and Distributed System Security Symposium (NDSS), Virtual Event, The Internet Society, 2021.
[8] Microsoft OSS Detect Backdoor System, https://github.com/microsoft/OSSGadget/wiki/OSS-Detect-Backdoor.
[9] D. L. Vu, I. Pashchenko, F. Massacci, H. Plate, and A. Sabetta, "Towards using source code repositories to identify software supply chain attacks," in Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, Virtual Event, USA, 2020, pp. 2093–2095.
[10] M. Ohm, A. Sykosch, and M. Meier, "Towards detection of software supply chain attacks by forensic artifacts," in Proceedings of the 15th International Conference on Availability, Reliability and Security, Virtual Event, Ireland, 2020, pp. 65:1–65:6.
[11] M. Ohm, F. Boes, C. Bungartz, and M. Meier, "On the feasibility of supervised machine learning for the detection of malicious software packages," in Proceedings of the 17th International Conference on Availability, Reliability and Security, Vienna, Austria, 2022, pp. 127:1–127:10.
[12] A. Sejfia and M. Schäfer, "Practical automated detection of malicious npm packages," in Proceedings of the 44th IEEE/ACM International Conference on Software Engineering (ICSE), Pittsburgh, PA, USA, 2022, pp. 1681–1692.
[13] M. Ohm, L. Kempf, F. Boes, and M. Meier, "Supporting the detection of software supply chain attacks through unsupervised signature generation," arXiv preprint arXiv:2011.02235, pp. 1–20, 2020.
[14] S. Scalco, R. Paramitha, D. L. Vu, and F. Massacci, "On the feasibility of detecting injections in malicious npm packages," in Proceedings of the 17th International Conference on Availability, Reliability and Security, Vienna, Austria, 2022, pp. 115:1–115:8.
[15] M. Ohm, H. Plate, A. Sykosch, and M. Meier, "Backstabber's knife collection: A review of open source software supply chain attacks," in Proceedings of the 17th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, Lisbon, Portugal, 2020, pp. 23–43.
[16] R. Campos, V. Mangaravite, A. Pasquali, A. Jorge, C. Nunes, and A. Jatowt, "YAKE! Keyword extraction from single documents using multiple local features," Information Sciences, vol. 509, pp. 257–289, 2020.
[17] GPT-3.5, https://platform.openai.com/docs/models/gpt-3-5.
[18] Babel, https://babeljs.io/.
[19] D. Friedman and A. B. Dieng, "The Vendi Score: A diversity evaluation metric for machine learning," Transactions on Machine Learning Research, 2023.
[20] P. Ladisa, H. Plate, M. Martinez, and O. Barais, "SoK: Taxonomy of attacks on open-source software supply chains," in Proceedings of the 44th IEEE Symposium on Security and Privacy (S&P), San Francisco, CA, USA, 2023, pp. 1509–1526.
[21] Snyk Vulnerability Database of npm, https://security.snyk.io/vuln/npm.
[22] GitHub Advisory Database of npm, https://github.com/advisories?query=type%3Areviewed+ecosystem%3Anpm.
[23] Libraries.io of npm, https://libraries.io/npm.
[24] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
[25] A. M. Ikotun, A. E. Ezugwu, L. Abualigah, B. Abuhaija, and J. Heming, "K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data," Information Sciences, vol. 622, pp. 178–210, 2023.
[26] R. Mohammed, J. Rawashdeh, and M. Abdullah, "Machine learning with oversampling and undersampling techniques: Overview study and experimental results," in Proceedings of the 11th International Conference on Information and Communication Systems (ICICS), 2020, pp. 243–248.
[27] N. Li, S. Wang, M. Feng, K. Wang, M. Wang, and H. Wang, "MalWuKong: Towards fast, accurate, and multilingual detection of malicious code poisoning in OSS supply chains," in Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), Luxembourg, 2023, pp. 1993–2005.
[28] C. Huang, N. Wang, Z. Wang, S. Sun, L. Li, J. Chen, Q. Zhao, J. Han, Z. Yang, and L. Shi, "DONAPI: Malicious NPM packages detector using behavior sequence knowledge mapping," in Proceedings of the 33rd USENIX Security Symposium (USENIX Security), Philadelphia, PA, USA, 2024.
[29] S. García and F. Herrera, "An extension on 'Statistical comparisons of classifiers over multiple data sets' for all pairwise comparisons," Journal of Machine Learning Research, vol. 9, no. 12, 2008.
[30] Microsoft OSS ApplicationInspector, https://github.com/microsoft/ApplicationInspector.
[31] D. L. Vu, F. Massacci, I. Pashchenko, H. Plate, and A. Sabetta, "LastPyMile: Identifying the discrepancy between sources and packages," in Proceedings of the 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Athens, Greece, 2021, pp. 780–792.
[32] K. A. Garrett, G. Ferreira, L. Jia, J. Sunshine, and C. Kästner, "Detecting suspicious package updates," in Proceedings of the 41st International Conference on Software Engineering: New Ideas and Emerging Results, Montreal, QC, Canada, 2019, pp. 13–16.
[33] M. Ohm, L. Kempf, F. Boes, and M. Meier, "Towards detection of malicious software packages through code reuse by malevolent actors," Gesellschaft für Informatik, Bonn, 2022.
[34] D. U. Brand, O. Stussi, and E. Wåreus, "Supply chain attacks in open source projects," Master's thesis, Lund University, 2022.
[35] Z. Yu, M. Wen, X. Guo, and H. Jin, "Maltracker: A fine-grained npm malware tracker copiloted by LLM-enhanced dataset," in Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 2024, pp. 1759–1771.
[36] W. Tang, M. Tang, M. Ban, Z. Zhao, and M. Feng, "CSGVD: A deep learning approach combining sequence and graph embedding for source code vulnerability detection," Journal of Systems and Software, vol. 199, p. 111623, 2023.
[37] K. Zhang, D. Wang, J. Xia, W. Y. Wang, and L. Li, "ALGO: Synthesizing algorithmic programs with generated oracle verifiers," in Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 2023.
[38] M. Alqarni and A. Azim, "Low level source code vulnerability detection using advanced BERT language model," in Proceedings of the 35th Canadian Conference on Artificial Intelligence, Toronto, Ontario, Canada, 2022, pp. 1–11.
[39] J. Zhang, K. Huang, B. Chen, C. Wang, Z. Tian, and X. Peng, "Malicious package detection in npm and pypi using a single model of malicious behavior sequence," arXiv preprint arXiv:2309.02637, 2023.
[40] N. Zahan, P. Burckhardt, M. Lysenko, F. Aboukhadijeh, and L. Williams, "Shifting the lens: Detecting malware in npm ecosystem with large language models," arXiv preprint arXiv:2403.12196, 2024.

Jixiang Qu received the bachelor's degree from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2018. He is currently pursuing a master's degree with the School of Computer Science and Technology, HUST, under the supervision of Yunhe Zhang. His research interests include malicious code detection.

Deqing Zou received the PhD degree from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2004. He is currently a professor of Cyberspace Security at HUST. His main research interests include system security, trusted computing, virtualization, and cloud security. He has served as a reviewer for several prestigious journals, such as IEEE Transactions on Dependable and Secure Computing, IEEE Transactions on Computers, IEEE Transactions on Parallel and Distributed Systems, and IEEE Transactions on Cloud Computing. He is on the editorial boards of four international journals and has served as PC chair/PC member of more than 40 international conferences.

Shouhuai Xu (Senior Member, IEEE) received the PhD degree in computer science from Fudan University, Shanghai, China, in 2000. He is the Gallogly Chair Professor in cybersecurity with the Department of Computer Science, University of Colorado Colorado Springs (UCCS). Prior to joining UCCS, he was with the University of Texas at San Antonio. He pioneered the Cybersecurity Dynamics approach as the foundation for the emerging science of cybersecurity, with three pillars: first-principle cybersecurity modeling and analysis (the x-axis); cybersecurity data analytics (the y-axis, to which the present paper belongs); and cybersecurity metrics (the z-axis). He co-initiated the International Conference on Science of Cyber Security (SciSec) and is serving as its Steering Committee chair. He is/was an associate editor of IEEE Transactions on Dependable and Secure Computing (IEEE TDSC), IEEE Transactions on Information Forensics and Security (IEEE TIFS), and IEEE Transactions on Network Science and Engineering (IEEE TNSE).

Ziteng Xu joined Ant Technology Group Co., Ltd, Hangzhou, China, in 2017. He is currently a researcher in computer software. His research interests include code security and malicious code detection.

Zhen Li (Member, IEEE) received the Ph.D. degree from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2019. She is currently an Associate Professor of Cyberspace Security at HUST. Her research interests mainly include software security and AI security.

Hai Jin (Fellow, IEEE) received the Ph.D. degree in computer engineering from the Huazhong University of Science and Technology in 1994. He was with The University of Hong Kong from 1998 to 2000 and a Visiting Scholar with the University of Southern California from 1999 to 2000. He is currently the Chair Professor of Computer Science and Engineering at the Huazhong University of Science and Technology (HUST), China. He has coauthored more than 20 books and published over 900 research articles. His research interests include computer architecture, parallel and distributed computing, big data processing, data storage, and system security. He is a Fellow of IEEE, a Fellow of CCF, and a Life Member of ACM. In 1996, he was awarded the German Academic Exchange Service Fellowship to visit the Technical University of Chemnitz, Germany. He received the Excellent Youth Award from the National Science Foundation of China in 2001.