Abstract
Large Language Models (LLMs) are one of the most promising developments in the field of artificial intelligence, and the software engineering community has readily noticed their potential role in the software development life-cycle. Developers routinely ask LLMs to generate code snippets, increasing productivity but also potentially introducing ownership, privacy, correctness, and security issues. Previous work has highlighted how code generated by mainstream commercial LLMs is often not safe, containing vulnerabilities, bugs, and code smells. In this paper, we present a framework that leverages testing and static analysis to assess the quality, and guide the self-improvement, of code generated by general-purpose, open-source LLMs. First, we ask LLMs to generate C code to solve a number of programming tasks. Then, we employ ground-truth tests to assess the (in)correctness of the generated code, and a static analysis tool to detect potential safety vulnerabilities. Next, we assess the models' ability to evaluate the generated code, by asking them to detect errors and vulnerabilities. Finally, we test the models' ability to fix the generated code, providing the reports produced during the static analysis and incorrectness evaluation phases as feedback. Our results show that models often produce incorrect code, and that the generated code can include safety issues. Moreover, the models perform very poorly at detecting either issue. On the positive side, we observe a substantial ability to fix flawed code when provided with information about failed tests or potential vulnerabilities, indicating a promising avenue for improving the safety of LLM-based code generation tools.
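The pipeline outlined above reduces to a feedback loop: generate, test, analyze, repair. The following Python fragment is a minimal sketch of that loop, not the implementation used in the paper; the four callables (generate, run_tests, run_infer, repair) and the iteration bound are hypothetical placeholders for the corresponding phases.

```python
# Minimal sketch of the generate/test/analyze/repair loop (not the paper's code).
# All four callables and max_rounds are hypothetical placeholders.
from typing import Callable, List

def improve(task_prompt: str,
            generate: Callable[[str], str],
            run_tests: Callable[[str], List[str]],
            run_infer: Callable[[str], List[str]],
            repair: Callable[[str, str], str],
            max_rounds: int = 3) -> str:
    code = generate(task_prompt)               # ask the LLM for C code
    for _ in range(max_rounds):
        failed = run_tests(code)               # failing ground-truth assertions
        issues = run_infer(code)               # static analysis findings
        if not failed and not issues:
            break                              # code passes tests and analysis
        feedback = "\n".join(failed + issues)  # report fed back to the model
        code = repair(code, feedback)          # ask the LLM to fix its own code
    return code
```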
Data availability
We provide a replication package at https://doi.org/10.6084/m9.figshare.26984716, containing (i) the benchmark suite of tasks (Sect. 2), including both the raw and cleaned source code generated by the models we considered, (ii) the source code of the scripts used in our experimental evaluation (Sect. 6), and (iii) the analysis results computed by Infer, covering both the vulnerability analysis and the repair phases. We also provide a README file with instructions on how to reproduce our experiments for each phase of the pipeline. To reproduce our experimental setting, the user needs to install Infer and obtain a token for the Groq API.
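As an illustration of the reproduction steps, the sketch below runs Infer on a single generated C file and prints its findings. It assumes Infer is installed and on the PATH; the file name task.c is a placeholder for the actual files in the package.

```python
# Sketch: run Infer on one generated C file and print its findings.
# Assumes Infer is on PATH; "task.c" is a placeholder file name.
import json
import subprocess

subprocess.run(["infer", "run", "--", "gcc", "-c", "task.c"], check=True)

# Infer writes its findings to infer-out/report.json by default.
with open("infer-out/report.json") as f:
    for bug in json.load(f):
        print(f"{bug['bug_type']} at line {bug['line']}: {bug['qualifier']}")
```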
Acknowledgements
This work was supported by Bando di Ateneo 2024 per la Ricerca, funded by the University of Parma (FIL_2024_PROGETTI_B_IOTTI - CUP D93C24001250005), and by the SERICS project (PE00000014 - CUP H73C2200089001), funded by PNRR NextGeneration EU.
Author information
Contributions
G.D.: Software, Data Curation, Writing, Visualization, Experiments. V.A.: Software, Supervision, Writing, Experiments. E.I.: Software, Writing, Visualization, Experiments. S.M.: Conceptualization, Supervision, Writing. A.C.: Resources, Supervision, Writing. E.Z.: Validation, Supervision, Writing. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Ethical approval
Not applicable.
Consent to participate
Not applicable.
Consent to publish
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
1.1 Prompt experiments
For each phase, we report the prompt experiments and the results we obtained. The names of the best-performing prompts, whose results are reported in the paper, are highlighted in italics.
1.2 Code generation experiments
See Fig. 14 (Vanilla system prompt for generating C code) and Table 11.
See Fig. 15 (Example and Counterexample system prompt for generating C code) and Table 12.
See Fig. 16 (Chain of Thought system prompt for generating C code) and Table 13.
See Fig. 17 (Combo system prompt for generating C code) and Table 14.
See Fig. 18 (Vanilla + Implicit CoT system prompt for generating C code) and Table 15.
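As an illustration of how such a system prompt could be submitted to a model, the sketch below uses the Groq API mentioned in the Data availability section. The system prompt, user task, and model identifier are illustrative placeholders, not the ones used in our experiments.

```python
# Sketch: query a model through the Groq API with a system prompt for C code
# generation. Prompt text and model name are illustrative placeholders.
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

SYSTEM_PROMPT = "You are a C programmer. Reply with a single C function and no prose."

response = client.chat.completions.create(
    model="llama3-70b-8192",  # placeholder; substitute the model under test
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Write a C function that reverses a string in place."},
    ],
)
print(response.choices[0].message.content)
```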
1.3 Self-evaluation experiments - correctness
See Table 16 (Vanilla system prompt to perform correctness classification), together with the preference heatmap for the self-correctness classification under the Vanilla prompt; rows list the evaluated models, columns the evaluator models.
See Table 17 (Example and counterexample system prompt to perform correctness classification), together with the preference heatmap under the example-and-counterexample prompt; rows list the evaluated models, columns the evaluator models.
See Table 18 (Chain of Thought system prompt to perform correctness classification), together with the preference heatmap under the Chain of Thought prompt; rows list the evaluated models, columns the evaluator models.
See also the Combo system prompt to perform correctness classification.
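A self-evaluation query for correctness might look like the following sketch, which reuses a Groq-style client; the classification prompt is illustrative, and the actual prompts are in the replication package.

```python
# Sketch: ask a model to classify a generated C snippet as correct or incorrect.
# The prompt wording is illustrative, not the paper's.
def classify_correctness(client, model: str, task: str, code: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "You are a C code reviewer. Answer CORRECT or INCORRECT only."},
            {"role": "user",
             "content": f"Task: {task}\n\nCode:\n{code}\n\nDoes this code solve the task?"},
        ],
    )
    return response.choices[0].message.content.strip()
```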
1.4 Self-evaluation experiments - safety
See Table 19 and Fig. 26 (Vanilla system prompt to perform vulnerability detection).
See Table 20 and Fig. 27 (Example and counterexample system prompt to perform vulnerability detection).
See Table 21 (Chain of Thought system prompt to perform vulnerability detection).
See also the Combo system prompt to perform vulnerability detection.
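A vulnerability-detection query can be sketched analogously; again, the prompt wording is an illustrative placeholder.

```python
# Sketch: ask a model to flag potential safety issues in a C snippet.
# The prompt wording is illustrative, not the paper's.
def detect_vulnerabilities(client, model: str, code: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": ("You are a security auditor for C code. List any potential "
                         "vulnerabilities (e.g. buffer overruns, null dereferences), "
                         "or answer NONE.")},
            {"role": "user", "content": code},
        ],
    )
    return response.choices[0].message.content
```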
1.5 Code repair experiments - correctness
See Fig. 30 (Vanilla prompt) and Table 22.
See Fig. 31 (Chain of Thought prompt) and Table 23.
See Fig. 32 (One Assert at a Time prompt): the system prompt is the Vanilla prompt, but we provide one failed assertion at a time, iteratively, for a maximum of six iterations; a sketch of this strategy follows.
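The sketch below illustrates the One Assert at a Time strategy just described; run_tests and ask_model are hypothetical helpers standing in for the test harness and the LLM call.

```python
# Sketch of the "one assert at a time" repair strategy: feed the model a single
# failing assertion per round, for at most six rounds, as described above.
# run_tests and ask_model are hypothetical helpers.
def repair_one_assert_at_a_time(code, run_tests, ask_model, max_rounds=6):
    for _ in range(max_rounds):
        failed = run_tests(code)      # failing ground-truth assertions, if any
        if not failed:
            break                     # all tests pass
        prompt = (f"The following C code fails this test:\n{failed[0]}\n\n"
                  f"Code:\n{code}\n\nReturn only a fixed version of the code.")
        code = ask_model(prompt)
    return code
```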
1.6 Code repair experiments - safety
See Fig. 33 (Vanilla prompt) and Table 29.
See Fig. 34 (Chain of Thought prompt) and Table 30.
See Fig. 35 (Instructions prompt) and Table 31.
See Fig. 36 (Combo prompt) and Table 32.
See Fig. 37 (No Info prompt) and Table 33: in this experiment, the content prompt provides no information about the kind of vulnerabilities present in the code.
See Fig. 38 (No Line prompt) and Table 34: in this experiment, the content prompt omits the line number where the vulnerability occurs.
See Fig. 39 (One Vulnerability at a Time prompt): the repair loop receives one Infer finding per iteration; a sketch of this strategy follows.
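A minimal sketch of the One Vulnerability at a Time strategy is given below; run_infer and ask_model are hypothetical helpers, and the six-round bound mirrors the correctness setting as an assumption.

```python
# Sketch of the "one vulnerability at a time" repair strategy: feed the model a
# single Infer finding per round. run_infer and ask_model are hypothetical
# helpers; the iteration bound is an assumption.
def repair_one_vulnerability_at_a_time(code, run_infer, ask_model, max_rounds=6):
    for _ in range(max_rounds):
        findings = run_infer(code)    # Infer findings: bug type, message, line
        if not findings:
            break                     # no remaining reports
        f = findings[0]
        prompt = (f"Infer reports a {f['bug_type']} at line {f['line']}: "
                  f"{f['qualifier']}\n\nCode:\n{code}\n\nReturn only a fixed version.")
        code = ask_model(prompt)
    return code
```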
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Dolcetti, G., Arceri, V., Iotti, E. et al. Helping LLMs improve code generation using feedback from testing and static analysis. Discov Artif Intell (2026). https://doi.org/10.1007/s44163-026-01009-5