Abstract
Large Language Models (LLMs) are one of the most promising developments in the field of artificial intelligence, and the software engineering community has readily noticed their potential role in the software development life-cycle. Developers routinely ask LLMs to generate code snippets, increasing productivity but also potentially introducing ownership, privacy, correctness, and security issues. Previous work has highlighted how code generated by mainstream commercial LLMs is often not safe, containing vulnerabilities, bugs, and code smells. In this paper, we present a framework that leverages testing and static analysis to assess the quality, and guide the self-improvement, of code generated by general-purpose, open-source LLMs. First, we ask LLMs to generate C code to solve a number of programming tasks. Then, we employ ground-truth tests to assess the (in)correctness of the generated code, and a static analysis tool to detect potential safety vulnerabilities. Next, we assess the models' ability to evaluate the generated code, by asking them to detect errors and vulnerabilities. Finally, we test the models' ability to fix the generated code, providing the reports produced during the static analysis and incorrectness evaluation phases as feedback. Our results show that models often produce incorrect code, and that the generated code can include safety issues. Moreover, the models perform very poorly at detecting either issue. On the positive side, we observe a substantial ability to fix flawed code when provided with information about failed tests or potential vulnerabilities, indicating a promising avenue for improving the safety of LLM-based code generation tools.
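The pipeline outlined above reduces to a feedback loop: generate, test, analyze, repair. The following Python fragment is a minimal sketch of that loop, not the implementation used in the paper; the four callables (generate, run_tests, run_infer, repair) and the iteration bound are hypothetical placeholders for the corresponding phases.

```python
# Minimal sketch of the generate/test/analyze/repair loop (not the paper's code).
# All four callables and max_rounds are hypothetical placeholders.
from typing import Callable, List

def improve(task_prompt: str,
            generate: Callable[[str], str],
            run_tests: Callable[[str], List[str]],
            run_infer: Callable[[str], List[str]],
            repair: Callable[[str, str], str],
            max_rounds: int = 3) -> str:
    code = generate(task_prompt)               # ask the LLM for C code
    for _ in range(max_rounds):
        failed = run_tests(code)               # failing ground-truth assertions
        issues = run_infer(code)               # static analysis findings
        if not failed and not issues:
            break                              # code passes tests and analysis
        feedback = "\n".join(failed + issues)  # report fed back to the model
        code = repair(code, feedback)          # ask the LLM to fix its own code
    return code
```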
Data availability
We provide a replication package at https://doi.org/10.6084/m9.figshare.26984716, containing (i) the benchmark suite of tasks (Sect. 2), including both the raw and cleaned source code generated by the models we considered, (ii) the source code of the scripts used in our experimental evaluation (Sect. 6), and (iii) the analysis results computed by Infer, covering both the vulnerability analysis and the repair phases. We also provide a README file with instructions on how to reproduce our experiments for each phase of the pipeline. To reproduce our experimental setting, the user needs to install Infer and obtain a token for the Groq API.
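As an illustration of the reproduction steps, the sketch below runs Infer on a single generated C file and prints its findings. It assumes Infer is installed and on the PATH; the file name task.c is a placeholder for the actual files in the package.

```python
# Sketch: run Infer on one generated C file and print its findings.
# Assumes Infer is on PATH; "task.c" is a placeholder file name.
import json
import subprocess

subprocess.run(["infer", "run", "--", "gcc", "-c", "task.c"], check=True)

# Infer writes its findings to infer-out/report.json by default.
with open("infer-out/report.json") as f:
    for bug in json.load(f):
        print(f"{bug['bug_type']} at line {bug['line']}: {bug['qualifier']}")
```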
Acknowledgements
This work was supported by Bando di Ateneo 2024 per la Ricerca, funded by the University of Parma (FIL_2024_PROGETTI_B_IOTTI - CUP D93C24001250005), and by the SERICS project (PE00000014 - CUP H73C2200089001), funded by PNRR NextGeneration EU.
Author information
Contributions
G.D.: Software, Data Curation, Writing, Visualization, Experiments. V.A.: Software, Supervision, Writing, Experiments. E.I.: Software, Writing, Visualization, Experiments. S.M.: Conceptualization, Supervision, Writing. A.C.: Resources, Supervision, Writing. E.Z.: Validation, Supervision, Writing. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Ethical approval
Not applicable.
Consent to participate
Not applicable.
Consent to publish
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
1.1 Prompt experiments
For each phase, we report the prompt experiments and the results we obtained. The names of the best-performing prompts, whose results are reported in the paper, are highlighted in italics.
1.2 Code generation experiments
See Fig. 14 (Vanilla system prompt for generating C code) and Table 11.
See Fig. 15 (Example and Counterexample system prompt for generating C code) and Table 12.
See Fig. 16 (Chain of Thought system prompt for generating C code) and Table 13.
See Fig. 17 (Combo system prompt for generating C code) and Table 14.
See Fig. 18 (Vanilla + Implicit CoT system prompt for generating C code) and Table 15.
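As an illustration of how such a system prompt could be submitted to a model, the sketch below uses the Groq API mentioned in the Data availability section. The system prompt, user task, and model identifier are illustrative placeholders, not the ones used in our experiments.

```python
# Sketch: query a model through the Groq API with a system prompt for C code
# generation. Prompt text and model name are illustrative placeholders.
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

SYSTEM_PROMPT = "You are a C programmer. Reply with a single C function and no prose."

response = client.chat.completions.create(
    model="llama3-70b-8192",  # placeholder; substitute the model under test
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Write a C function that reverses a string in place."},
    ],
)
print(response.choices[0].message.content)
```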
1.3 Self-evaluation experiments - correctness
See Table 16 (Vanilla system prompt to perform correctness classification), together with the preference heatmap for the self-correctness classification under the Vanilla prompt; rows list the evaluated models, columns the evaluator models.
See Table 17 (Example and counterexample system prompt to perform correctness classification), together with the preference heatmap under the example-and-counterexample prompt; rows list the evaluated models, columns the evaluator models.
See Table 18 (Chain of Thought system prompt to perform correctness classification), together with the preference heatmap under the Chain of Thought prompt; rows list the evaluated models, columns the evaluator models.
See also the Combo system prompt to perform correctness classification.
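A self-evaluation query for correctness might look like the following sketch, which reuses a Groq-style client; the classification prompt is illustrative, and the actual prompts are in the replication package.

```python
# Sketch: ask a model to classify a generated C snippet as correct or incorrect.
# The prompt wording is illustrative, not the paper's.
def classify_correctness(client, model: str, task: str, code: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "You are a C code reviewer. Answer CORRECT or INCORRECT only."},
            {"role": "user",
             "content": f"Task: {task}\n\nCode:\n{code}\n\nDoes this code solve the task?"},
        ],
    )
    return response.choices[0].message.content.strip()
```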
1.4 Self-evaluation experiments - safety
See Table 19 and Fig. 26 (Vanilla system prompt to perform vulnerability detection).
See Table 20 and Fig. 27 (Example and counterexample system prompt to perform vulnerability detection).
See Table 21 (Chain of Thought system prompt to perform vulnerability detection).
See also the Combo system prompt to perform vulnerability detection.
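A vulnerability-detection query can be sketched analogously; again, the prompt wording is an illustrative placeholder.

```python
# Sketch: ask a model to flag potential safety issues in a C snippet.
# The prompt wording is illustrative, not the paper's.
def detect_vulnerabilities(client, model: str, code: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": ("You are a security auditor for C code. List any potential "
                         "vulnerabilities (e.g. buffer overruns, null dereferences), "
                         "or answer NONE.")},
            {"role": "user", "content": code},
        ],
    )
    return response.choices[0].message.content
```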
1.5 Code repair experiments - correctness
See Fig. 30 (Vanilla prompt) and Table 22.
See Fig. 31 (Chain of Thought prompt) and Table 23.
See Fig. 32 (One Assert at a Time prompt): the system prompt is the Vanilla prompt, but we provide one failed assertion at a time, iteratively, for a maximum of six iterations; a sketch of this strategy follows.
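The sketch below illustrates the One Assert at a Time strategy just described; run_tests and ask_model are hypothetical helpers standing in for the test harness and the LLM call.

```python
# Sketch of the "one assert at a time" repair strategy: feed the model a single
# failing assertion per round, for at most six rounds, as described above.
# run_tests and ask_model are hypothetical helpers.
def repair_one_assert_at_a_time(code, run_tests, ask_model, max_rounds=6):
    for _ in range(max_rounds):
        failed = run_tests(code)      # failing ground-truth assertions, if any
        if not failed:
            break                     # all tests pass
        prompt = (f"The following C code fails this test:\n{failed[0]}\n\n"
                  f"Code:\n{code}\n\nReturn only a fixed version of the code.")
        code = ask_model(prompt)
    return code
```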
1.6 Code repair experiments - safety
See Fig. 33 (Vanilla prompt) and Table 29.
See Fig. 34 (Chain of Thought prompt) and Table 30.
See Fig. 35 (Instructions prompt) and Table 31.
See Fig. 36 (Combo prompt) and Table 32.
See Fig. 37 (No Info prompt) and Table 33: in this experiment, the content prompt provides no information about the kind of vulnerabilities present in the code.
See Fig. 38 (No Line prompt) and Table 34: in this experiment, the content prompt omits the line number where the vulnerability occurs.
See Fig. 39 (One Vulnerability at a Time prompt): the repair loop receives one Infer finding per iteration; a sketch of this strategy follows.
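A minimal sketch of the One Vulnerability at a Time strategy is given below; run_infer and ask_model are hypothetical helpers, and the six-round bound mirrors the correctness setting as an assumption.

```python
# Sketch of the "one vulnerability at a time" repair strategy: feed the model a
# single Infer finding per round. run_infer and ask_model are hypothetical
# helpers; the iteration bound is an assumption.
def repair_one_vulnerability_at_a_time(code, run_infer, ask_model, max_rounds=6):
    for _ in range(max_rounds):
        findings = run_infer(code)    # Infer findings: bug type, message, line
        if not findings:
            break                     # no remaining reports
        f = findings[0]
        prompt = (f"Infer reports a {f['bug_type']} at line {f['line']}: "
                  f"{f['qualifier']}\n\nCode:\n{code}\n\nReturn only a fixed version.")
        code = ask_model(prompt)
    return code
```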
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Dolcetti, G., Arceri, V., Iotti, E. et al. Helping LLMs improve code generation using feedback from testing and static analysis. Discov Artif Intell (2026). https://doi.org/10.1007/s44163-026-01009-5