ReDef: Do Code Language Models Truly Understand Code Changes for Just-in-Time Software Defect Prediction?

Nam, Doha; Kim, Taehyoun; Ryu, Duksan; Baik, Jongmoon

Computer Science > Software Engineering

arXiv:2509.09192 (cs)

[Submitted on 11 Sep 2025 (v1), last revised 2 Apr 2026 (this version, v2)]



Abstract: Just-in-Time software defect prediction (JIT-SDP) plays a critical role in prioritizing risky code changes during code review and continuous integration. However, existing datasets often suffer from noisy labels and low precision in identifying bug-inducing commits. To address this, we present ReDef (Revert-based Defect dataset), a high-confidence benchmark of function-level modifications curated from 22 large-scale C/C++ projects. Defective cases are anchored by revert commits, while clean cases are validated through post-hoc history checks. Ambiguous instances are conservatively filtered out via a GPT-assisted triage process involving multiple votes and audits. This pipeline yields 3,164 defective and 10,268 clean modifications, offering substantially more reliable labels than prior resources. Beyond dataset construction, we provide a systematic evaluation of how Code Language Models (CLMs), specifically CodeBERT, CodeT5+, UniXcoder, and Qwen2.5, reason about code modifications. We first investigate which input encodings most effectively expose change information under five different strategies. We then design four counterfactual perturbation strategies (e.g., swapping added/deleted blocks, inverting diff polarity) to serve as diagnostic probes. We posit that if models genuinely capture change semantics, such distortions should lead to a clear decline in predictive performance. Our results show that compact diff-style encodings consistently outperform whole-function formats across all CLMs, supported by rigorous statistical confirmation. However, under counterfactual tests, performance remains effectively stable, revealing that what appears to be robustness in fact reflects a reliance on superficial cues rather than true semantic understanding.
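As an illustration of the counterfactual probes mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical compact diff encoding in which each changed line carries a `+` or `-` marker, and it shows one possible reading of "inverting diff polarity" and "swapping added/deleted blocks". The names `encode`, `invert_polarity`, and `swap_blocks`, the marker format, and the sample C lines are all assumptions made for this example; the paper's actual encodings and perturbations may differ.

```python
from typing import List, Tuple

# (marker, text): marker is "+" for an added line, "-" for a deleted line.
# This representation is an assumption for illustration only.
DiffLine = Tuple[str, str]


def encode(diff: List[DiffLine]) -> str:
    """Render a diff as a compact, marker-prefixed string (hypothetical format)."""
    return "\n".join(f"{marker} {text}" for marker, text in diff)


def invert_polarity(diff: List[DiffLine]) -> List[DiffLine]:
    """Flip every marker so additions read as deletions and vice versa."""
    flip = {"+": "-", "-": "+"}
    return [(flip[marker], text) for marker, text in diff]


def swap_blocks(diff: List[DiffLine]) -> List[DiffLine]:
    """Reorder the diff so the added block comes before the deleted block."""
    added = [line for line in diff if line[0] == "+"]
    deleted = [line for line in diff if line[0] == "-"]
    return added + deleted


# Hypothetical one-line change to a bounds check.
original: List[DiffLine] = [
    ("-", "if (len > size) return -1;"),
    ("+", "if (len >= size) return -1;"),
]

print(encode(original))
print(encode(invert_polarity(original)))
print(encode(swap_blocks(original)))
```

Under this reading, a model that tracks change semantics should score the original and the polarity-inverted input differently, since the inverted version describes the opposite edit. The abstract reports that performance instead stays effectively stable under such perturbations.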

Comments:	Accepted to FSE 2026; An anonymous link containing the dataset, construction scripts, and experimental code is publicly available for reproducibility: this https URL
Subjects:	Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2509.09192 [cs.SE]
	(or arXiv:2509.09192v2 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2509.09192

Submission history

From: Doha Nam
[v1] Thu, 11 Sep 2025 07:07:11 UTC (608 KB)
[v2] Thu, 2 Apr 2026 18:43:42 UTC (689 KB)
