KNighter: Transforming Static Analysis with LLM-Synthesized Checkers

Chenyuan Yang (University of Illinois Urbana-Champaign, USA; [email protected])
Zijie Zhao (University of Illinois Urbana-Champaign, USA; [email protected])
Zichen Xie (Zhejiang University, China; [email protected])
Haoyu Li (USA; [email protected])
Lingming Zhang (University of Illinois Urbana-Champaign, USA; [email protected])
our analysis identifying at least six historical patches addressing it, no static analysis tool had been developed to systematically detect these issues. Even specialized kernel checkers like Smatch [2] fail to identify these vulnerabilities because they lack the domain-specific knowledge that devm_kzalloc may return NULL upon failure.

Our approach. KNighter extracts the critical insight from the patch: unchecked return values from devm_kzalloc represent potential Null-Pointer-Dereference vulnerabilities. The synthesized checker (written in CSA, Figure 2c) tracks null-check status across execution paths while correctly handling pointer aliasing, a sophisticated static analysis capability. This checker discovered 3 new vulnerabilities in the Linux kernel. Figure 2b presents one such vulnerability, exhibiting the same pattern where a null pointer check is missing for the pointer returned by the devm_kzalloc call. This bug was subsequently fixed and assigned CVE-2024-50103.

Advantages over direct LLM scanning. Directly using LLMs to scan the Linux kernel would be prohibitively expensive, as devm_kzalloc alone appears over 7K times across 5.4K files. In contrast, KNighter's static analyzers primarily consume CPU resources rather than repeated LLM invocations, making the approach both scalable and cost-effective. Moreover, since generating the checkers is mostly a one-time effort, they can naturally evolve alongside the system.

Technical challenges and solutions. Creating effective static analyzers with LLMs presents several challenges. First, writing robust checkers end-to-end is complex. KNighter addresses this through a multi-stage synthesis pipeline that breaks down the complex task into manageable steps. Second, LLM hallucination can produce incorrect analyzers. KNighter mitigates this by validating synthesized checkers against historical patches, verifying that they correctly distinguish between buggy and patched code. Finally, to reduce false positives, we implement a bug triage agent that identifies false alarms, enabling iterative refinement of the checkers.

3 Design

Terminology. KNighter takes a patch commit as input and outputs a corresponding CSA checker. Valid checkers correctly distinguish between buggy and patched code, flagging pre-patch code as defective while recognizing post-patch code as correct. Plausible checkers¹ are valid checkers that additionally demonstrate practical utility through low false positive rates or a manageable number of reports. We provide formal definitions of these terms in § 4.

¹ We adopt the term "plausible" from program repair [40, 49], where a "plausible" patch passes all test cases and is potentially the correct fix.

Overview. KNighter leverages an agentic workflow to process patch commits for static analyzer synthesis, as illustrated in Figure 3. It operates in two phases: checker synthesis (§ 3.1) and checker refinement (§ 3.2). In the checker synthesis phase, KNighter analyzes the input patch to identify the bug pattern (§ 3.1.1), synthesizes a detection plan (§ 3.1.2), and implements a checker using CSA (§ 3.1.3). If compilation errors occur, a syntax-repair agent automatically repairs them based on the error messages. This phase concludes with the generation of valid checkers (§ 3.1.4). In the subsequent checker refinement phase, these valid checkers are deployed to scan the entire codebase for potential bugs. When bug reports are generated, a triage agent evaluates them for false positives, and KNighter refines the checker accordingly. If the scan produces a manageable number of reports with a low false positive rate, KNighter presents the plausible checkers and their filtered reports as potential bugs for review.
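Figures 2a–2c, referenced in the motivation above, are not reproduced in this excerpt. For orientation, the bug pattern they illustrate has roughly the following shape; this is a hypothetical stand-in (the struct, the allocator wrapper, and the function names are invented for illustration), not the actual kernel code from Figure 2.

    /* Hypothetical stand-in for the Figure 2 pattern: devm_kzalloc-style
     * allocators can return NULL on failure, so the result must be
     * NULL-checked before it is dereferenced. */
    struct foo_priv { int mode; };

    struct foo_priv *alloc_priv(void);   /* may return NULL, like devm_kzalloc */

    int probe_buggy(void) {
        struct foo_priv *p = alloc_priv();
        p->mode = 1;                      /* potential Null-Pointer-Dereference */
        return 0;
    }

    int probe_patched(void) {
        struct foo_priv *p = alloc_priv();
        if (!p)
            return -12;                   /* -ENOMEM in real kernel code */
        p->mode = 1;                      /* safe: p has been NULL-checked */
        return 0;
    }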
spi: mchp-pci1xxx: Fix a possible null pointer dereference in pci1xxx_spi_probe

In function pci1xxxx_spi_probe, there is a potential null pointer that may be caused by a failed memory allocation by the function devm_kzalloc. Hence, a null pointer check needs to be added to prevent null pointer dereferencing later in the code.

To fix this issue, spi_bus->spi_int[iter] should be checked. The memory allocated by devm_kzalloc will be automatically released, so just directly return -ENOMEM.

Figure 4. Patch commit message.

Algorithm 1: Synthesize checkers with input patch.
 1  Function GenChecker(patch):
 2      # Iterative checker generation and evaluation
 3      for i = 1 to maxIterations do
 4          # Stage 1: Bug Pattern Analysis
 5          pattern ← AnalyzePatch(patch)
 6          # Stage 2: Detection Plan Synthesis
 7          plan ← SynthesizePlan(patch, pattern)
 8          # Stage 3: Analyzer Implementation and Repair
 9          checker ← Implement(patch, pattern, plan)
10          attempts ← 0
11          while hasCompilationErrors(checker) and attempts < maxAttempts do
12              checker ← RepairChecker(checker)
13              attempts ← attempts + 1
14          if hasCompilationErrors(checker) then
15              # Skip evaluation if checker still has errors
16              Continue
17          # Stage 4: Validation
18          isValid ← ValidateChecker(checker, patch)
19          if isValid then
20              return checker
21      return Null

3.1 Checker Synthesis

Algorithm 1 presents the multi-stage pipeline of checker synthesis. In the first stage, KNighter analyzes the bug pattern shown in the patch (Line 5). Next, KNighter synthesizes the plan based on the patch and the identified bug pattern (Line 7). With the plan in hand, KNighter implements the checker using CSA (Line 9). If any compilation issues arise, a syntax-repair agent is invoked to debug and repair them (Line 12). The repair process is allowed up to maxAttempts (default is 5) attempts. If the checker compiles successfully, KNighter validates it by checking whether it can distinguish between the buggy and patched code (Line 18). Once the checker is deemed valid, it is returned for the next phase (Line 20). Otherwise, the synthesis pipeline continues iterating until reaching maxIterations. If all iterations fail, the process returns Null, indicating that a valid checker could not be synthesized (Line 21).

3.1.1 Bug Pattern Analysis. The initial stage involves analyzing patch commits to identify underlying bug patterns. Patch commits typically consist of diff patches and may include developer comments describing the bug being fixed, as illustrated in Figure 4. Our goal is to extract patterns that can be translated into static analysis rules for bug detection. While bug patterns are sometimes explicitly described in commit messages, they often require deeper analysis of the code changes within the patch.

We have developed an LLM-based agent specifically designed to perform this pattern analysis, with the prompt template shown in Figure 5a. In addition to the patch, we extract from the kernel codebase the complete code of the functions modified by the patch. This additional context is crucial because the patch diff alone may not capture all relevant buggy patterns, as some issues depend on the broader context of the code. By providing both the patch and the complete function code to LLMs, we enable a more comprehensive understanding of the bug being patched.

A single bug pattern identified from a patch can be expressed with varying scope and complexity. Consider the Null-Pointer-Dereference involving devm_kzalloc (Figure 2a). A broad pattern (e.g., check any potentially null return) is comprehensive, but identifying all relevant functions/conditions poses significant static analysis challenges, hindering robust implementation by LLMs. Consequently, our approach favors more targeted bug patterns derived from the patch context. These facilitate precise and tractable checker synthesis by the LLMs. For the devm_kzalloc example, focusing specifically on its return value yields a targeted pattern that effectively addresses the observed bug class while being significantly more manageable for the LLM to implement correctly compared to the broader, more complex alternative.

3.1.2 Plan Synthesis. Once the bug pattern is identified, KNighter generates a high-level plan for implementing the static analyzer. This plan serves two critical purposes: first, it provides structured guidance to the LLMs during implementation, preventing confusion and promoting effective execution. Second, it facilitates debugging of the entire pipeline by making the LLMs' reasoning process transparent and traceable. Our ablation study in § 5.4.2 confirms the value of this plan synthesis, demonstrating improved performance consistent with findings in other domains [44].

For instance, synthesizing a checker for the unchecked devm_kzalloc return value pattern (illustrated in Figure 2c) might generate a plan with key steps such as: (1) using program state to track memory regions from devm_kzalloc, (2) monitoring conditional branches (checkBranchCondition) to mark regions as checked if a null check occurs, and (3) reporting a potential Null-Pointer-Dereference when a tracked region that has not been checked is dereferenced.
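To make the shape of such a checker concrete, the following is a minimal, hypothetical CSA sketch organized around these plan steps. It is not the checker KNighter actually synthesized (the synthesized checkers average over a hundred lines and handle aliasing and macros such as unlikely() more carefully); the class name, diagnostic text, and the simplified branch handling are our own, and checker-registration boilerplate is omitted.

    // Hypothetical, simplified sketch of a CSA checker for the unchecked
    // devm_kzalloc pattern (registration boilerplate omitted).
    #include "clang/StaticAnalyzer/Core/BugReporter/BugType.h"
    #include "clang/StaticAnalyzer/Core/Checker.h"
    #include "clang/StaticAnalyzer/Core/PathSensitive/CallEvent.h"
    #include "clang/StaticAnalyzer/Core/PathSensitive/CheckerContext.h"
    #include <memory>

    using namespace clang;
    using namespace ento;

    namespace {
    class UncheckedDevmKzallocChecker
        : public Checker<check::PostCall, check::BranchCondition, check::Location> {
      mutable std::unique_ptr<BugType> BT;

    public:
      void checkPostCall(const CallEvent &Call, CheckerContext &C) const;
      void checkBranchCondition(const Stmt *Cond, CheckerContext &C) const;
      void checkLocation(SVal Loc, bool IsLoad, const Stmt *S,
                         CheckerContext &C) const;
    };
    } // namespace

    // Step (1): program-state map from devm_kzalloc regions to "null-checked?".
    REGISTER_MAP_WITH_PROGRAMSTATE(DevmAllocMap, const MemRegion *, bool)

    void UncheckedDevmKzallocChecker::checkPostCall(const CallEvent &Call,
                                                    CheckerContext &C) const {
      const IdentifierInfo *II = Call.getCalleeIdentifier();
      if (!II || II->getName() != "devm_kzalloc")
        return;
      if (const MemRegion *R = Call.getReturnValue().getAsRegion())
        C.addTransition(C.getState()->set<DevmAllocMap>(R, /*Checked=*/false));
    }

    void UncheckedDevmKzallocChecker::checkBranchCondition(
        const Stmt *Cond, CheckerContext &C) const {
      // Step (2): a branch whose condition mentions a tracked pointer
      // (e.g., `if (!p)`) marks that pointer as checked.
      const auto *E = dyn_cast<Expr>(Cond);
      if (!E)
        return;
      const Expr *Inner = E->IgnoreParenCasts();
      if (const auto *UO = dyn_cast<UnaryOperator>(Inner))
        if (UO->getOpcode() == UO_LNot)
          Inner = UO->getSubExpr()->IgnoreParenCasts();
      const MemRegion *R = C.getSVal(Inner).getAsRegion();
      if (R && C.getState()->contains<DevmAllocMap>(R))
        C.addTransition(C.getState()->set<DevmAllocMap>(R, /*Checked=*/true));
    }

    void UncheckedDevmKzallocChecker::checkLocation(SVal Loc, bool IsLoad,
                                                    const Stmt *S,
                                                    CheckerContext &C) const {
      // Step (3): dereferencing a tracked-but-unchecked region is reported
      // as a potential Null-Pointer-Dereference.
      const MemRegion *R = Loc.getAsRegion();
      if (!R)
        return;
      const bool *Checked = C.getState()->get<DevmAllocMap>(R->getBaseRegion());
      if (!Checked || *Checked)
        return;
      if (ExplodedNode *N = C.generateNonFatalErrorNode()) {
        if (!BT)
          BT.reset(new BugType(this, "Unchecked devm_kzalloc result",
                               "Null-Pointer-Dereference"));
        C.emitReport(std::make_unique<PathSensitiveBugReport>(
            *BT, "devm_kzalloc may return NULL; pointer is dereferenced "
                 "without a NULL check", N));
      }
    }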
    int sh_pfc_register_pinctrl(struct sh_pfc *pfc) {
        struct sh_pfc_pinctrl *pmx;
        int ret;
        pmx = devm_kzalloc(pfc->dev, sizeof(*pmx), GFP_KERNEL);
        if (unlikely(!pmx))        /* <- report labeled FP by triage agent */
            return -ENOMEM;
        pmx->pfc = pfc;            /* <- reported by checker */
        ...

Figure 7. A report labeled as FP by our triage agent.

Table 1. Distribution of patch commits across 10 bug categories and the validity status of their synthesized checkers. "NPD" denotes "Null-Pointer-Dereference" and "UBI" indicates "Use-Before-Initialization".

                                        Valid
Bug Type           Total   Invalid   Direct  Refined  Fail
NPD                   6       1         2       2       1
Integer-Overflow      7       3         1       3       0
Out-of-Bound          6       2         4       0       0
Buffer-Overflow       5       3         2       0       0
Memory-Leak           5       2         3       0       0
Use-After-Free        7       4         2       1       0
Double-Free           8       1         5       1       1
UBI                   5       1         1       3       0
Concurrency           5       2         3       0       0
Misuse                7       3         3       1       0
Total                61      22        26      11       2

Our refinement pipeline addresses these challenges methodically. First, to manage report complexity, we distill generated bug reports to their essential components—primarily the "relevant lines" highlighted by the static analyzer (e.g., CSA [12]) and the corresponding trace path—stripping extraneous context while preserving critical diagnostics. Second, to navigate the complexity of analysis and modification, we employ specialized LLM-based agents. A triage agent classifies each distilled report, focusing strictly on alignment with the target bug pattern (rather than general code correctness); the prompt template is shown in Figure 5c.

If the triage agent identifies a report as a false positive, as exemplified in Figure 7, a dedicated refinement agent takes over. In the case shown, the initial checker (derived from the patch in Figure 2a) flagged the use of pmx->pfc because its logic failed to recognize if (unlikely(!pmx)) as a valid null check, perhaps confused by the unlikely() macro. The triage agent correctly interprets the check semantically and flags the report as FP. The refinement agent then uses this information to adjust the checker's logic, specifically enhancing its ability to handle constructs like unlikely(), thereby preventing this type of false positive in subsequent scans while ensuring it can still detect the original vulnerability.

A refined checker is accepted only if it satisfies two criteria: (1) it no longer generates warnings for the previously identified false positive cases, and (2) it maintains its validity by correctly differentiating between the original buggy and patched code versions. This criterion ensures the semantic accuracy of the refined checkers.

4 Implementation

Input commit collection. To collect patch commits for rigorous evaluation, we implemented a systematic classification and selection process. First, we established 10 distinct bug categories. We then used relevant keywords to identify potentially related commits. A commit was included in our dataset only when two authors independently agreed on its categorization. For each bug type, we initially examined the first 20 commits that matched our search criteria. We continued reviewing commits beyond the initial 20 if we hadn't yet collected 5 qualifying commits for a given category. Our goal was to gather a minimum of 5 commits per bug type whenever possible. Table 1 presents our categorization of 10 bug types and their corresponding patch commit counts. These manually collected and labeled commits served as our benchmark dataset for rigorous evaluation.

Few-shot examples. We prepared three end-to-end examples for in-context learning. These are patch commits 3027e7b15b02 (Null-Pointer-Dereference), 3948abaa4e2b (Use-Before-Initialization), and 4575962aeed6 (Double-Free). The design and implementation of the checkers for these three commits required approximately 40 person-hours. This was a one-time effort, yielding reusable examples. We also explore the use of real-world, off-the-shelf examples (§ 5.4.2).

Utility functions. While implementing the example checkers, we identified several common helper operations. We implemented 9 such utility functions (e.g., getMemRegionFromExpr) to encapsulate low-level Clang Static Analyzer tasks, simplifying checker development, particularly for LLM synthesis. These utilities were designed for simplicity and extensibility.

Valid checkers. To evaluate checker validity, we verify that a checker can both detect the original bug and recognize its fix. We first identify buggy objects by examining the modified files in the diff patch. Next, we check out the repository to the buggy commit (immediately preceding the patch) and scan these objects to count the number of bug reports (𝑁𝑏𝑢𝑔𝑔𝑦). We then scan these objects after applying the patch commit to obtain the number of remaining bug reports (𝑁𝑝𝑎𝑡𝑐ℎ𝑒𝑑). A checker is considered valid if 𝑁𝑏𝑢𝑔𝑔𝑦 > 𝑁𝑝𝑎𝑡𝑐ℎ𝑒𝑑 and 𝑁𝑝𝑎𝑡𝑐ℎ𝑒𝑑 < 𝑇𝑣𝑎𝑙𝑖𝑑, where 𝑇𝑣𝑎𝑙𝑖𝑑 is a threshold value (50 by default).

Plausible checkers. We determine plausible checkers based on their performance when analyzing the entire Linux kernel. Our approach is founded on the principle that high-quality checkers, especially those derived from historical commits, should generate a reasonable number of actionable bug reports. A checker is classified as plausible if it either: (1) produces fewer reports than a predefined threshold 𝑇𝑝𝑙𝑎𝑢𝑠𝑖𝑏𝑙𝑒 (default: 20), or (2) demonstrates an acceptable false positive rate in sampled warnings.
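Stated as code, the two definitions above reduce to the following predicates (a sketch with names of our own choosing; the report counts come from running the checker under CSA on the checked-out kernel trees, and the "at most one sampled false positive" reading of criterion (2) follows the concrete thresholds given in the checker-refinement discussion in § 5):

    // Validity: the checker must flag the pre-patch code more often than the
    // post-patch code, and must not be excessively noisy on the patched code.
    bool isValidChecker(unsigned NBuggy, unsigned NPatched,
                        unsigned TValid = 50) {
        return NBuggy > NPatched && NPatched < TValid;
    }

    // Plausibility: few whole-kernel reports, or at most one false positive
    // among the (five) sampled warnings labeled by the triage agent.
    bool isPlausibleChecker(unsigned totalKernelReports,
                            unsigned sampledFalsePositives,
                            unsigned TPlausible = 20) {
        return totalKernelReports < TPlausible || sampledFalsePositives <= 1;
    }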
Checker refinement. We evaluate each valid checker by scanning the entire kernel codebase independently, with execution bounded by either a one-hour time limit or a maximum of 100 warnings during the refinement process. Note that these limits are only applied during the checker refinement phase; when performing actual bug detection, we run the checkers without such constraints. The refinement process begins with LLM-assisted triage of the checker's output. Using a consistent random seed, we sample 5 warnings for LLM inspection due to cost considerations. A checker qualifies as plausible if it either generates fewer than 𝑇𝑝𝑙𝑎𝑢𝑠𝑖𝑏𝑙𝑒 = 20 total reports or exhibits at most one false positive in the evaluated sample (labeled by our triage agent). For checkers failing these criteria, we implement an iterative refinement protocol targeting the identified false positives, permitting up to three refinement iterations to improve precision.

    int ice_set_fc(struct ice_port_info *pi, ...)
    {
        struct ice_aqc_get_phy_caps_data *pcaps __free(kfree);
        if (!pi || !aq_failures)
            return -EINVAL;    /* <- path without any assignment to pcaps */
        ...

(a) Bug in Use-Before-Initialization patch.

    struct x509_certificate *x509_cert_parse(const void *data ..)
    {
        struct x509_certificate *cert __free(x509_free_certificate);
        /* Auto-cleanup pointer not initialized to NULL (False Alarm) */
        struct x509_parse_context *ctx __free(kfree) = NULL;
        cert = kzalloc(sizeof(struct x509_certificate), GFP_KERNEL);
        /* <- cert receives an assignment on every path */
        if (!cert)
            return ERR_PTR(-ENOMEM);
        ...

(b) False positive report for UBI.

Figure 8. Examples of false positives by KNighter.

5 Evaluation

We explore the following research questions for KNighter:

RQ-1. Can KNighter generate high-quality checkers?
RQ-2. Can the checkers generated by KNighter find real-world kernel bugs?
RQ-3. Are the capabilities of KNighter orthogonal to the human-written checkers?
RQ-4. Are all the key components in KNighter effective?

Evaluation metrics. We conduct an extensive evaluation by using the following metrics:
Checker Validity Rate. A valid checker successfully identifies the buggy pattern in the original code and confirms its absence in the patched version. This metric reflects our framework's and LLMs' ability to understand patch semantics and synthesize discriminative checkers.
Plausible Checker Rate. This metric measures the number of high-quality checkers synthesized, representing those that are both valid and exhibit a low false positive rate.
Bug Detection. We assess the number of real-world bugs successfully detected by the synthesized checkers.
Resource Efficiency. This metric captures the computational time and monetary costs associated with both checker synthesis and execution.
Checker Error Categories. We classify checker failures into the following categories, ordered by severity:
• Compilation Failures: Checkers that fail during compilation due to syntax or dependency errors.
• Runtime Errors: Checkers that compile successfully but crash during execution (e.g., "The analyzer encountered problems on source files").
• Semantic Issues: Checkers that cannot distinguish between the buggy and patched code.
Static Analysis Capabilities. We further examine the static analysis capabilities employed by checkers, including path sensitivity, region sensitivity, and advanced state tracking.

Hardware and software. All our experiments are run on a workstation with 64 cores, 256 GB RAM, and 4 Nvidia A6000 GPUs, operating on Ubuntu 20.04.5 LTS. We use O3-mini as our default LLM backend. By default, when scanning the entire codebase, we use -j32. We evaluated using Linux v6.13, and for bug finding, we examined versions from v6.9 to v6.15. The Linux configuration used is allyesconfig.

5.1 RQ1: Synthesized Checkers

We evaluate KNighter on the 61 commits listed in Table 1 to show that it can synthesize high-quality checkers across various bug types, beyond those in our few-shot examples.

5.1.1 Checker Synthesis. In total, valid checkers were generated for 39 commits. The 39 synthesized checkers average 125.7 lines of code and exhibit diverse static analysis capabilities. Specifically, 37 checkers are path-sensitive, 13 incorporate region sensitivity, and 16 employ advanced state tracking; in contrast, only 2 rely on AST traversal alone. This suggests that KNighter can generate complex analysis logic, not just simple pattern matching.

Cost. The full synthesis process required 15.9 hours, during which 8.2 million input tokens were processed and 1.2 million output tokens produced, resulting in an approximate cost of $0.24 per commit using O3-mini. For commits that ultimately yielded valid checkers, an average of 2.4 synthesis attempts was necessary (with a maximum of 8 attempts observed).
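As a rough consistency check on this figure, assume O3-mini list prices of about $1.10 per million input tokens and $4.40 per million output tokens (our assumption; the paper does not state the prices used):

\[
8.2\,\mathrm{M} \times \frac{\$1.10}{\mathrm{M}} + 1.2\,\mathrm{M} \times \frac{\$4.40}{\mathrm{M}} \approx \$14.3,
\qquad
\frac{\$14.3}{61\ \text{commits}} \approx \$0.23\ \text{per commit},
\]

which is in line with the reported cost of roughly $0.24 per commit.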
Failure analysis. We now break down the failures from two perspectives: the underlying failure root causes and the observed failure symptoms during the synthesis process.

(i) Failure root causes. Among the 61 commits processed, 22 did not result in any valid checker. Our investigation into these failures indicates that:
• 2 (9%) commits failed due to an inaccurate bug pattern,
• 7 (32%) failed owing to an inaccurate plan, and
• 13 (59%) were caused by an inaccurate implementation.
For the implementation-related failures, a common issue was that compiler optimizations inlined certain function calls (e.g., strcpy and memset), which prevented the checker from properly intercepting these calls.

Our approach exhibits limitations in handling buffer overflow and use-after-free commits. We believe these challenges stem from two main factors: static analysis inherently struggles with precise value determination—especially when establishing buffer bounds at compile time—and it also faces significant hurdles in analyzing multi-threaded code, for instance, when assessing the proper use of locks.

(ii) Failure symptoms. During synthesis (allowing up to 10 attempts per commit), 273 failed attempts were recorded across all 61 commits. The failures can be categorized as:
• 65 attempts (23.8%) resulted in compilation errors,
• 1 attempt (0.4%) led to a runtime error, and
• 207 attempts (75.8%) suffered from semantic issues that obstructed proper bug identification.
Of the 207 semantic failures, 34 checkers erroneously flagged both buggy and patched code as potentially problematic, while the remaining 173 misclassified both versions as bug-free. This outcome underscores the challenge of accurately distinguishing buggy code.

Interestingly, even the 173 checkers that did not recognize the specific bug in their input patch can still be valuable. When deployed across the large system, these checkers may successfully detect bugs with similar patterns in other contexts. This apparent paradox likely arises because the failure to detect the training bug is sometimes due to edge cases or context-specific complexities rather than inherent deficiencies in the checkers' detection logic. Moreover, these checkers generally exhibit lower false positive rates compared to those that incorrectly flag both buggy and patched code, enhancing their practical utility for bug detection.

5.1.2 Checker Refinement. After scanning the entire kernel codebase with these 39 valid checkers, 26 of them were labeled "plausible" directly. Our refinement pipeline was applied to the remaining 13 valid checkers, successfully refining 11 of them. In total, 19 refinement steps were completed successfully. This demonstrates the effectiveness of our refinement pipeline, which successfully refined 84.6% of the valid checkers that were not "plausible" initially.

False positive rate. Of the 37 plausible checkers, 16 did not report any bugs. For the remaining checkers, we applied our bug triage agent to filter all the reports, focusing only on those labeled as "bug", since our triage agent demonstrated a low false negative rate in our evaluation (as shown in § 5.4.1). In total, we obtained 90 reports labeled as "bug". Upon manual verification, we confirmed 61 true positives. This indicates that the combination of our plausible checkers and bug triage agent has a false positive rate of 32.2%.

Our manual analysis of the 29 false positives revealed three recurring patterns leading to incorrect reports:
• Inaccurate bug pattern: In 5 cases, although the checker correctly identified the original bug/patch scenario, the inferred bug pattern lacked the necessary precision for reliable detection across different contexts.
• Incorrect pattern matching: For 6 reports, the checker correctly identified the bug pattern but applied it too broadly, flagging code segments that did not meet the specific constraints intended by the pattern.
• Trigger condition mismanagement: The most common issue (18 reports) involved checkers where both the pattern and matching were correct, but the checker failed to manage trigger or state conditions properly (e.g., failing to recognize a pointer had already been validated before use).

Case study: A high false positive rate checker. In commit 90ca6956d383 ("ice: Fix freeing uninitialized pointers", see Figure 8a), the issue stems from the pointer pcaps not being initialized to NULL. If an early return or error path occurs before the pointer is allocated, the cleanup routine may inadvertently attempt to free an uninitialized (or garbage) pointer. In such cases, the checker should account for the possibility of an early exit leaving the pointer unset. In contrast, the bug report in Figure 8b highlights a scenario where, although the cert pointer starts uninitialized, it is immediately assigned a valid value along every execution path, ensuring it is never left in an unassigned state. Thus, despite the initial uninitialized state, the code does not constitute a bug. Our synthesized checker did not incorporate these nuanced constraints, and the triage agent likewise failed to recognize the critical differences, ultimately leading to a false positive.

Table 2. Newly detected bugs by KNighter.

             Total   Confirmed   Fixed   Pending   CVE
KNighter       92        77        57       15      30

5.2 RQ2: Detected Bugs

5.2.1 Overall. To date, static analyzers synthesized by KNighter have identified 92 new bugs in the Linux kernel. As summarized in Table 2, developer confirmation has been received for 77 of these bugs. Among those confirmed, 57 have already been fixed. The remaining 15 bugs (Total − Confirmed) are currently awaiting developer review. Notably, 30 of the discovered bugs have been assigned CVE numbers, showing their practical security impact.
The checkers for bug detection originate from two sources: (i) the initial 61 manually collected commits used for evaluation in § 3.1 across diverse bug types (shown in Table 1), and (ii) [...]

Bug types. Analysis of checkers from the manually collected commits reveals that KNighter can detect a diverse range of bug types (see Figure 9a). Null-Pointer-Dereference bugs [...] this area. In response, we expanded our effort by automatically [...] in the kernel [9]. Additionally, 10 and 7 bugs were identified in the sound and net subsystems, respectively. Notably, 2 bugs were found in the samples directory—an area that provides example usage for kernel developers and where correctness is especially critical.

Bug lifetime. Figure 9c illustrates the distribution of bug lifetimes. Notably, the average bug lifetime is 4.3 years, and 26 bugs had existed for over five years before detection. This [...]

(d) Number of bugs detected by each commit.

Figure 9. Details of new bugs. Subfigures (a), (b), (c), and (d) show breakdowns by type, subsystem, lifetime, and commit.

CVE-2025-21715. Figure 10a (the input patch to KNighter) shows a fix for a Use-After-Free vulnerability. In this patch, free_netdev must be invoked only after all references to its private data have been released; otherwise, it could cause a Use-After-Free issue. Leveraging this patch, the checker generated by KNighter identified a similar bug in dm9000_drv_remove, as shown in Figure 10b, where dm (the private data of ndev) remains in use after ndev is freed, causing a Use-After-Free. This newly discovered issue was assigned CVE-2025-21715.

CVE-2024-50259. Figure 10c shows an input patch fixing a buffer overflow vulnerability. The patch mitigates the risk by limiting the number of bytes copied via copy_from_user to sizeof(mybuf) - 1, thereby preserving space for a trailing zero. This trailing zero is essential for subsequent string operations, such as sscanf, to function correctly.
Taking this patch as input, the checker generated by KNighter identified a similar bug in nsim_nexthop_bucket_activity_write, as shown in Figure 10d. In this case, the omission of appending a trailing zero after copying data from userspace could lead to improper string handling and potential overflow issues. This detected issue was subsequently fixed by adding the trailing zero and was assigned CVE-2024-50259.

    static int emac_remove(struct platform_device *pdev) {
        ...
        mdiobus_unregister(adpt->mii_bus);
    -   free_netdev(netdev);
        if (adpt->phy.digital)
            iounmap(adpt->phy.digital);
        iounmap(adpt->phy.base);
    +   free_netdev(netdev);
        return 0;
    }

(a) Input Use-After-Free patch.

    static void dm9000_drv_remove(struct platform_device *pdev) {
        ...
        dm9000_release_board(pdev, dm);
        free_netdev(ndev);               /* free device structure */
        if (dm->power_supply)            /* <- uses the private data dm after freeing ndev */
            regulator_disable(dm->power_supply);
    }

(b) CVE-2025-21715, found by the checker for the patch above.

    int lpfc_debugfs_lockstat_write(struct file *file, ...) {
        char mybuf[64];
        int i;
    +   size_t bsize;
        memset(mybuf, 0, sizeof(mybuf));
    +   bsize = min(nbytes, (sizeof(mybuf) - 1));
    -   if (copy_from_user(mybuf, buf, nbytes))
    +   if (copy_from_user(mybuf, buf, bsize))
            return -EFAULT;
        ...
    }

(c) Input Buffer-Overflow patch.

    static ssize_t nsim_nexthop_bucket_activity_write(...) {
        ...
        memset(mybuf, 0, sizeof(mybuf));
    -   if (size > sizeof(buf))          /* <- possible buffer overflow */
    +   if (size > sizeof(buf) - 1)
            return -EINVAL;
        if (copy_from_user(buf, user_buf, size))
            return -EFAULT;
    +   buf[size] = 0;
        ...
    }

(d) CVE-2024-50259, found by the checker for the patch above.

Figure 10. Example vulnerabilities detected by KNighter.

5.3 RQ3: Orthogonality with Smatch

Since no comparable automated static analyzer generation approaches exist for this domain, we evaluate KNighter against expert-written checkers. Our baseline is provided by Smatch [2], which is widely used in Linux kernel analysis and supports tailored checks for all bug types considered in Table 1. We conducted the comparative analyses by running Smatch on the entire codebase to determine if it could detect the bugs found by KNighter.

Smatch reported a total of 1970 errors and 2870 warnings across the kernel. We manually inspected all the files where bugs were detected to assess whether any of the true positive bugs identified by KNighter were also detected by Smatch. Notably, Smatch failed to detect any of our true positive bugs, underscoring the unique detection capabilities of KNighter.

Further analysis of Smatch's checkers revealed that they do not fully leverage the domain-specific knowledge embedded in the Linux kernel—a resource that KNighter effectively extracts from historical patches. For instance, Smatch's check_deref checker employs static range analysis to identify potential null pointers but lacks domain-specific insights. It fails to recognize that functions like devm_kzalloc may return NULL under error conditions that conventional static range analysis cannot detect. Consequently, Smatch identified only three potential null pointer dereferences, all of which were confined to unit test files rarely prioritized by developers. We conclude that KNighter and Smatch detect different classes of bugs, demonstrating KNighter's effectiveness in learning domain knowledge from patches and subsequently identifying diverse bugs and vulnerabilities.

5.4 RQ4: Effectiveness of Components

5.4.1 Bug Triage Agent. To evaluate our bug triage agent, we sampled up to 5 reports per checker from the 39 valid checkers, aiming to reduce manual inspection effort while maintaining evaluation coverage. In total, we collected 79 reports from 18 checkers, while the remaining 21 valid checkers didn't generate reports. Our triage agent classified 29 reports as "bug" (positive) and 50 as "not-a-bug" (negative). These classifications were compared against ground truth labels established by manual review from two authors. The agent achieved 7 true positives (TP), 22 false positives (FP), 50 true negatives (TN), and zero false negatives (FN). The absence of false negatives is particularly important, as it indicates the agent effectively prioritized all potentially true bugs in this set, minimizing the risk of overlooking genuine issues, even though the 22 false positives require further filtering.

We also evaluated 5-way self-consistency via majority voting [48]. For each report, we ran our triage agent five independent times and labeled the report as a "bug" only if the agent made that prediction in at least 𝑡 of the runs. Compared to our single-sample baseline, which identified 7 TPs and 22 FPs, majority voting did not offer a significant improvement. Using a threshold of 𝑡 = 3 kept the TP count at 7 but slightly increased FPs to 24. A stricter threshold of 𝑡 = 4 also resulted in 7 TPs while reducing FPs to 20. Ultimately, majority voting only slightly shifted the false positive count in this setting. There could be two potential reasons. For TPs, the agent likely identifies these bugs with high confidence, meaning a single run is sufficient to detect all of them. For FPs, the agent's predictions are likely less confident and more inconsistent across different runs.
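The voting rule itself is straightforward; a minimal sketch (with names of our own choosing; the per-run prediction is the LLM triage call described above) is:

    #include <algorithm>
    #include <array>

    // Label a report as "bug" only if at least t of the five independent
    // triage runs predicted "bug" (t = 3 and t = 4 were evaluated above).
    bool majorityVoteIsBug(const std::array<bool, 5> &runPredictions, int t) {
        return std::count(runPredictions.begin(), runPredictions.end(), true) >= t;
    }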
Table 3. Ablation study results. "Default" means KNighter's standard configuration utilizing multi-stage synthesis, fixed few-shot examples, and the O3-mini model. Alternative configurations are compared against this baseline.

                                      Errors
Variants             Valid   Syntax  Runtime  Semantics
Default                12      28       0        75
W/o multi-stage         8      52       3        75
W/ RAG                 12      37       4        62
W/ GPT-4o              11      31       0        76
W/ DeepSeek-R1         11      29       8        66
W/ Gemini-2-flash       4     130       2        44

5.4.2 Ablation Study. To evaluate our design choices, we created a sample dataset of patch commits for an ablation study. We randomly sampled 2 commits from each bug type using zero as the random seed. This resulted in a dataset of 20 commits (2 commits × 10 bug types). Table 3 shows the overall results of our ablation study.

Checker synthesis. First, we assessed the impact of our default three-stage synthesis approach (bug pattern analysis, plan synthesis, checker implementation) compared to directly synthesizing checkers in a single stage (omitting explicit pattern/plan steps), using identical few-shot examples. As shown in Table 3, the multi-stage approach proved more effective, yielding valid checkers for 12 commits compared to only 8 for the single-stage method. Furthermore, the single-stage approach resulted in significantly more syntax errors (52 vs. 28), often leading to checkers that failed to compile. This highlights the value of the structured multi-stage process for improving both validity and compilability.

Second, we explored using Retrieval-Augmented Generation (RAG) [16] for selecting few-shot examples dynamically, comparing it against our default set of fixed examples. This variant utilizes a knowledge base derived from 118 official CSA checkers [12], embedded using text-embedding-ada-002 [4]. During synthesis, three relevant examples are retrieved based on semantic similarity. Our results indicate that RAG-based example selection achieved comparable effectiveness to our fixed examples, also generating valid checkers for 12 commits. However, because the official CSA checkers used are substantially longer than our curated fixed examples, the RAG approach incurred approximately double the input/output token cost. Given the similar effectiveness, our default fixed few-shot examples offer better cost-efficiency for this task.

LLM choice. We evaluated the performance of KNighter across different language models for checker synthesis. In addition to our default model, O3-mini, we tested GPT-4o, Gemini-2-flash, and the open-source DeepSeek-R1. As detailed in Table 3, O3-mini yielded the most valid checkers (12). GPT-4o and DeepSeek-R1 performed comparably to each other, generating 11 valid checkers each, only slightly fewer than O3-mini. This suggests that multiple high-capability models, including open-source options, are viable, although minor differences in performance exist.

In contrast, Gemini-2-flash performed substantially worse, producing valid checkers for only 4 commits. Upon closer inspection, we found that Gemini-2-flash struggled with CSA implementation, frequently using non-existent APIs and generating syntax errors at a much higher rate (130 vs. 28). This highlights a crucial insight: successful checker synthesis demands more than general coding proficiency; accurate knowledge or inference of the target framework's specific APIs and conventions (CSA in this case) is essential.

6 Limitations and Discussion

Limitations. KNighter faces challenges with highly complex bug patterns, particularly those involving state-machine reasoning, such as use-after-free, and concurrency issues that require analyzing multi-threaded code and locking schemes. Synthesizing precise checkers for these issues is difficult for today's LLMs, as it requires a sophisticated understanding of a program's temporal and interprocedural behavior. To address this, a promising future direction is to improve the analysis agent to automatically abstract complex bug reports into more formal state-machine representations. This would enable the synthesis of precise, state-aware checkers that can trace the specific sequences of operations leading to a bug. A second, complementary strategy is to focus on selectively identifying high-quality, canonical bug fixes. This involves developing methods to filter out patches that are overly complex or specific, allowing KNighter to learn from reusable, idiomatic repair patterns and improve the overall performance of checker synthesis.

Generalizability. KNighter employs a general three-stage workflow: LLM-driven synthesis of checkers from bug-fix patches, patch-grounded validation, and triage/refinement to reduce false positives. While the Linux kernel served as an ideal and challenging initial target due to its scale, complexity, and importance, our approach is not fundamentally tied to it. The workflow's applicability extends to any project that meets two criteria: a version history with bug-fix commits for learning, and a static analysis framework to serve as a synthesis target (e.g., Chromium [1]). Furthermore, the implementation is not limited to C/C++. Although we demonstrated its use with the Clang Static Analyzer [12], the pipeline can be readily adapted—via a small set of few-shot examples—to generate checkers or rules for other ecosystems (e.g., CodeQL [17] or Semgrep [41]).
[...] LLMs to scan large systems like the Linux kernel is often prohibitively expensive and faces scalability challenges.

Our synthesis approach. In contrast, KNighter uses LLMs to synthesize the entire static analyzer from patches. Unlike LLM-augmented tools, this minimizes reliance on pre-existing manual analyzer development, enhancing adaptability for diverse bug types. Unlike direct LLM analysis, KNighter generates efficient, reusable static checkers, avoiding the high costs and scalability issues of scanning massive codebases directly with LLMs. A very recent concurrent work, MoCQ [27], also explores checker/query synthesis, but it focuses on general bug patterns and relies on manual examples for validation. KNighter distinguishes itself not only by automatically inferring specific, nuanced patterns from patches but also by incorporating a closed-loop triage-refinement pipeline to iteratively improve checker precision. This fully automated refinement process makes our approach more scalable for detecting complex defects in system software, establishing it as a practical paradigm for applying LLM intelligence to large-scale static analysis.

8 Conclusion

This paper introduces KNighter, a novel approach that transforms how LLMs can contribute to static analysis for complex systems like the Linux kernel. By synthesizing static analyzers rather than directly analyzing code, KNighter bridges the gap between LLMs' reasoning capabilities and the practical constraints of analyzing massive systems. KNighter's practical impact is shown by the discovery of 92 new, long-latent bugs in the Linux kernel, with 77 confirmed, 57 fixed, and 30 CVEs assigned.

Looking forward, KNighter opens new possibilities for scalable LLM-based static analysis. Future work could extend this approach to other systems beyond the Linux kernel, incorporate additional learning paradigms, and further refine checker generation techniques to address more complex bug patterns. By leveraging LLMs to synthesize tools rather than perform analysis directly, we establish a scalable, reliable, and traceable paradigm for utilizing AI in critical software security applications.

Acknowledgments

We are grateful to the anonymous reviewers for their valuable feedback that helped to improve this paper. This work was partially supported by NSF grant CCF-2131943.

References

[1] Chromium. https://www.chromium.org/chromium-projects/.
[2] Smatch. https://github.com/error27/smatch.
[3] Syzkaller. https://github.com/google/syzkaller/.
[4] text-embedding-ada-002. https://openai.com/index/new-and-improved-embedding-model/.
[5] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
[6] Ahmadi, M., Farkhani, R. M., Williams, R., and Lu, L. Finding bugs using your own code: Detecting functionally-similar yet inconsistent code. In 30th USENIX Security Symposium (USENIX Security 21) (2021), pp. 2025–2040.
[7] Ayewah, N., Pugh, W., Hovemeyer, D., Morgenthaler, J. D., and Penix, J. Using static analysis to find bugs. IEEE Software 25, 5 (2008), 22–29.
[8] Bai, J.-J., Lawall, J., Chen, Q.-L., and Hu, S.-M. Effective static analysis of concurrency Use-After-Free bugs in Linux device drivers. In 2019 USENIX Annual Technical Conference (USENIX ATC 19) (Renton, WA, July 2019), USENIX Association, pp. 255–268.
[9] Bursey, J., Sani, A. A., and Qian, Z. SyzRetrospector: A large-scale retrospective study of Syzbot, 2024.
[10] Cai, Y., Yao, P., Ye, C., and Zhang, C. Place your locks well: Understanding and detecting lock misuse bugs. In 32nd USENIX Security Symposium (USENIX Security 23) (2023), pp. 3727–3744.
[11] Chen, W., Zhang, B., Wang, C., Tang, W., and Zhang, C. Seal: Towards diverse specification inference for Linux interfaces from security patches. In Proceedings of the Twentieth European Conference on Computer Systems (2025), pp. 1246–1262.
[12] Clang and LLVM. Clang Static Analyzer. https://clang-analyzer.llvm.org/.
[13] Deng, Y., Xia, C. S., Peng, H., Yang, C., and Zhang, L. Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (2023), pp. 423–435.
[14] Du, X., Zheng, G., Wang, K., Feng, J., Deng, W., Liu, M., Chen, B., Peng, X., Ma, T., and Lou, Y. Vul-RAG: Enhancing LLM-based vulnerability detection via knowledge-level RAG. arXiv preprint arXiv:2406.11147 (2024).
[15] Engler, D., Chen, D. Y., Hallem, S., Chou, A., and Chelf, B. Bugs as deviant behavior: A general approach to inferring errors in systems code. ACM SIGOPS Operating Systems Review 35, 5 (2001), 57–72.
[16] Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, H., and Wang, H. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997 (2023).
[17] GitHub. CodeQL. https://codeql.github.com/.
[18] Gong, S., Peng, D., Altinbüken, D., Fonseca, P., and Maniatis, P. Snowcat: Efficient kernel concurrency testing using a learned coverage predictor. In Proceedings of the 29th Symposium on Operating Systems Principles (2023), pp. 35–51.
[19] Gong, S., Wang, R., Altinbüken, D., Fonseca, P., and Maniatis, P. Snowplow: Effective kernel fuzzing with a learned white-box test mutator. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (2025), pp. 1124–1138.
[20] Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024).
[21] He, L., Su, P., Zhang, C., Cai, Y., and Ma, J. One simple API can cause hundreds of bugs: An analysis of refcounting bugs in all modern Linux kernels. In Proceedings of the 29th Symposium on Operating Systems Principles (2023), pp. 52–65.
[22] Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems 43, 2 (2025), 1–55.
[23] Ji, Y., Dai, T., Zhou, Z., Tang, Y., and He, J. Artemis: Toward accurate detection of server-side request forgeries through LLM-assisted inter-procedural path-sensitive taint analysis. Proceedings of the ACM on Programming Languages 9, OOPSLA1 (2025), 1349–1377.
[24] Jiang, Y., Zhou, Z., Xu, B., Liu, B., Xu, R., and Huang, P. Training with confidence: Catching silent errors in deep learning training with automated proactive checks. In Proceedings of the 19th USENIX Symposium on Operating Systems Design and Implementation (Boston, MA, USA, July 2025), OSDI '25, USENIX Association.
[25] Lattuada, A., Hance, T., Cho, C., Brun, M., Subasinghe, I., Zhou, Y., Howell, J., Parno, B., and Hawblitzel, C. Verus: Verifying Rust programs using linear ghost types. Proceedings of the ACM on Programming Languages 7, OOPSLA1 (2023), 286–315.
[26] Li, H., Hao, Y., Zhai, Y., and Qian, Z. Enhancing static analysis for practical bug detection: An LLM-integrated approach. Proceedings of the ACM on Programming Languages 8, OOPSLA1 (2024), 474–499.
[27] Li, P., Yao, S., Korich, J. S., Luo, C., Yu, J., Cao, Y., and Yang, J. Automated static vulnerability detection via a holistic neuro-symbolic approach. arXiv preprint arXiv:2504.16057 (2025).
[28] Li, T., Bai, J.-J., Han, G.-D., and Hu, S.-M. LR-Miner: Static race detection in OS kernels by mining locking rules. In 33rd USENIX Security Symposium (USENIX Security 24) (2024), pp. 6149–6166.
[29] Li, T., Bai, J.-J., Sui, Y., and Hu, S.-M. Path-sensitive and alias-aware typestate analysis for detecting OS bugs. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2022), ASPLOS '22, Association for Computing Machinery, pp. 859–872.
[30] Li, Z., Dutta, S., and Naik, M. LLM-assisted static analysis for detecting security vulnerabilities. arXiv preprint arXiv:2405.17238 (2024).
[31] Lin, M., Chen, K., and Xiao, Y. Detecting API Post-Handling bugs using code and description in patches. In 32nd USENIX Security Symposium (USENIX Security 23) (2023), pp. 3709–3726.
[32] Liu, J., Yi, L., Chen, W., Song, C., Qian, Z., and Yi, Q. LinKRID: Vetting imbalance reference counting in Linux kernel with symbolic execution. In 31st USENIX Security Symposium (USENIX Security 22) (Boston, MA, Aug. 2022), USENIX Association, pp. 125–142.
[33] Lou, C., Jing, Y., and Huang, P. Demystifying and checking silent semantic violations in large distributed systems. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22) (2022), pp. 91–107.
[34] Lou, C., Parikesit, D. S., Huang, Y., Yang, Z., Diwangkara, S., Jing, Y., Kistijantoro, A. I., Yuan, D., Nath, S., and Huang, P. Deriving semantic checkers from tests to detect silent failures in production distributed systems. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25) (2025), pp. 19–38.
[35] Lu, K., Pakki, A., and Wu, Q. Detecting Missing-Check bugs via semantic- and context-aware criticalness and constraints inferences. In 28th USENIX Security Symposium (USENIX Security 19) (2019), pp. 1769–1786.
[36] Lyu, Y., Fang, Y., Zhang, Y., Sun, Q., Ma, S., Bertino, E., Lu, K., and Li, J. Goshawk: Hunting memory corruptions via structure-aware and object-centric memory operation synopsis. In 2022 IEEE Symposium on Security and Privacy (SP) (2022), IEEE, pp. 2096–2113.
[37] Mathai, A., Huang, C., Maniatis, P., Nogikh, A., Ivančić, F., Yang, J., and Ray, B. KGym: A platform and dataset to benchmark large language models on Linux kernel crash resolution. Advances in Neural Information Processing Systems 37 (2024), 78053–78078.
[38] Min, C., Kashyap, S., Lee, B., Song, C., and Kim, T. Cross-checking semantic correctness: The case of finding file system bugs. In Proceedings of the 25th Symposium on Operating Systems Principles (2015), pp. 361–377.
[39] Oracle. Kernel-Fuzzing. https://github.com/oracle/kernel-fuzzing.
[40] Qi, Z., Long, F., Achour, S., and Rinard, M. An analysis of patch plausibility and correctness for generate-and-validate patch generation systems. In Proceedings of the 2015 International Symposium on Software Testing and Analysis (2015), pp. 24–36.
[41] Semgrep. Semgrep. https://github.com/semgrep/semgrep.
[42] Shestov, A., Levichev, R., Mussabayev, R., Maslov, E., Zadorozhny, P., Cheshkov, A., Mussabayev, R., Toleu, A., Tolegen, G., and Krassovitskiy, A. Finetuning large language models for vulnerability detection. IEEE Access 13 (2025), 38889–38900.
[43] Sun, C., Sheng, Y., Padon, O., and Barrett, C. Clover: Closed-loop verifiable code generation. In International Symposium on AI Verification (2024), Springer, pp. 134–155.
[44] Sun, S., Liu, Y., Wang, S., Zhu, C., and Iyyer, M. Pearl: Prompting large language models to plan and execute actions over long documents. arXiv preprint arXiv:2305.14564 (2023).
[45] Suzuki, K., Ishiguro, K., and Kono, K. Balancing analysis time and bug detection: Daily development-friendly bug detection in Linux. In 2024 USENIX Annual Technical Conference (USENIX ATC 24) (2024), pp. 493–508.
[46] Team, G., Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023).
[47] Wang, C., Liu, J., Peng, X., Liu, Y., and Lou, Y. Boosting static resource leak detection via LLM-based resource-oriented intention inference, 2024.
[48] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171 (2022).
[49] Xia, C. S., Wei, Y., and Zhang, L. Automated program repair in the era of large pre-trained language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) (2023), IEEE, pp. 1482–1494.
[50] Xia, C. S., and Zhang, L. Automated program repair via conversation: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (2024), pp. 819–831.
[51] Yan, H., Sui, Y., Chen, S., and Xue, J. Spatio-temporal context reduction: A pointer-analysis-based static approach for detecting use-after-free vulnerabilities. In Proceedings of the 40th International Conference on Software Engineering (New York, NY, USA, 2018), ICSE '18, Association for Computing Machinery, pp. 327–337.
[52] Yang, C., Deng, Y., Lu, R., Yao, J., Liu, J., Jabbarvand, R., and Zhang, L. WhiteFox: White-box compiler fuzzing empowered by large language models. Proceedings of the ACM on Programming Languages 8, OOPSLA2 (2024), 709–735.
[53] Yang, C., Deng, Y., Yao, J., Tu, Y., Li, H., and Zhang, L. Fuzzing automatic differentiation in deep-learning libraries. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) (2023), IEEE, pp. 1174–1186.
[54] Yang, C., Li, X., Misu, M. R. H., Yao, J., Cui, W., Gong, Y., Hawblitzel, C., Lahiri, S. K., Lorch, J. R., Lu, S., Yang, F., Zhou, Z., and Lu, S. AutoVerus: Automated proof generation for Rust code. Proceedings of the ACM on Programming Languages 9, OOPSLA2 (2025).
[55] Yang, C., Zhao, Z., and Zhang, L. KernelGPT: Enhanced kernel fuzzing via large language models. ASPLOS '25, Association for Computing Machinery, pp. 560–573.
[56] Zhai, Y., Hao, Y., Zhang, H., Wang, D., Song, C., Qian, Z., Lesani, M., Krishnamurthy, S. V., and Yu, P. UBITect: A precise and scalable method to detect use-before-initialization bugs in Linux kernel. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (2020), pp. 221–232.
[57] Zhang, C., Liu, H., Zeng, J., Yang, K., Li, Y., and Li, H. Prompt-enhanced software vulnerability detection using ChatGPT. In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings (New York, NY, USA, 2024), ICSE-Companion '24, Association for Computing Machinery, pp. 276–277.
[58] Zhang, H., Chen, W., Hao, Y., Li, G., Zhai, Y., Zou, X., and Qian, Z. Statically discovering high-order taint style vulnerabilities in OS kernels. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security (New York, NY, USA, 2021), CCS '21, Association for Computing Machinery, pp. 811–824.