Helping LLMs improve code generation using feedback from testing and static analysis

  • Research
  • Open access
  • Published: 05 March 2026

  • Greta Dolcetti3,
  • Vincenzo Arceri1,
  • Eleonora Iotti1,
  • Sergio Maffeis2,
  • Agostino Cortesi3 &
  • Enea Zaffanella1

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Abstract

Large Language Models (LLMs) are one of the most promising developments in the field of artificial intelligence, and the software engineering community has readily noticed their potential role in the software development life-cycle. Developers routinely ask LLMs to generate code snippets, increasing productivity but also potentially introducing ownership, privacy, correctness, and security issues. Previous work highlighted how code generated by mainstream commercial LLMs is often not safe, containing vulnerabilities, bugs, and code smells. In this paper, we present a framework that leverages testing and static analysis to assess the quality, and guide the self-improvement, of code generated by general-purpose, open-source LLMs. First, we ask LLMs to generate C code to solve a number of programming tasks. Then we employ ground-truth tests to assess the (in)correctness of the generated code, and a static analysis tool to detect potential safety vulnerabilities. Next, we assess the models' ability to evaluate the generated code, by asking them to detect errors and vulnerabilities. Finally, we test the models' ability to fix the generated code, providing the reports produced during the static analysis and incorrectness evaluation phases as feedback. Our results show that models often produce incorrect code, and that the generated code can include safety issues. Moreover, the models perform very poorly at detecting either kind of issue. On the positive side, we observe a substantial ability to fix flawed code when provided with information about failed tests or potential vulnerabilities, indicating a promising avenue for improving the safety of LLM-based code generation tools.
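The generate, test, analyze, and repair pipeline described above can be sketched as the following loop. All three helper functions (`query_llm`, `run_tests`, `run_static_analysis`) are hypothetical stand-ins for the model API call, the ground-truth test harness, and the static analyzer; they are stubbed here so the control flow runs standalone.

```python
# Sketch of the assess-and-repair loop; the three helpers below are
# placeholders, not the authors' actual implementation.

def query_llm(prompt):
    # Placeholder: a real implementation would call a chat-completion API.
    return "int add(int a, int b) { return a + b; }"

def run_tests(code):
    # Placeholder for compiling the C code and running ground-truth tests;
    # returns the list of failed assertions (empty means correct).
    return []

def run_static_analysis(code):
    # Placeholder for a static analysis run; returns reported issues.
    return []

def generate_and_repair(task, max_rounds=3):
    code = query_llm(f"Write a C function for this task:\n{task}")
    for _ in range(max_rounds):
        failures = run_tests(code)
        issues = run_static_analysis(code)
        if not failures and not issues:
            return code, True  # passes tests, no reported vulnerabilities
        feedback = "\n".join(failures + issues)
        code = query_llm(f"Fix this C code.\nCode:\n{code}\nFeedback:\n{feedback}")
    return code, False

code, ok = generate_and_repair("add two integers")
print(ok)
```

The key design point is that the feedback fed back to the model comes from external oracles (tests and static analysis), not from the model's own judgment.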


Data availability

We provide a replication package at https://doi.org/10.6084/m9.figshare.26984716, containing (i) the benchmark suite of tasks (Sect. 2), including both the raw and cleaned source code generated by the models we considered, (ii) the source code for the scripts used in our experimental evaluation (Sect. 6), and (iii) the analysis results computed by Infer, covering both the vulnerability analysis and the repair phases. We also provide a README file with instructions on how to reproduce our experiments for each phase of the pipeline. To reproduce our experimental setting, the user will need to install Infer and obtain a token for the GROQ API.
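As an illustration of how Infer's findings can be turned into textual feedback for a model, the sketch below parses a small excerpt modeled on Infer's `report.json` output. The field names (`bug_type`, `file`, `line`, `qualifier`) follow Infer's JSON report format, but the sample entries themselves are invented.

```python
import json

# Invented sample, shaped like entries from Infer's report.json.
sample_report = """[
  {"bug_type": "NULL_DEREFERENCE", "file": "task042.c", "line": 17,
   "qualifier": "pointer could be null and is dereferenced."},
  {"bug_type": "BUFFER_OVERRUN_L1", "file": "task042.c", "line": 23,
   "qualifier": "offset may exceed array size."}
]"""

def report_to_feedback(report_json):
    """Turn analyzer findings into one textual feedback line per issue."""
    issues = json.loads(report_json)
    lines = [f"Line {i['line']}: {i['bug_type']} - {i['qualifier']}"
             for i in issues]
    return "\n".join(lines)

print(report_to_feedback(sample_report))
```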


Acknowledgements

This work was supported by Bando di Ateneo 2024 per la Ricerca, funded by University of Parma (FIL_2024_PROGETTI_B_IOTTI - CUP D93C24001250005) and by SERICS (PE00000014 - CUP H73C2200089001) project funded by PNRR NextGeneration EU.

Author information

Authors and Affiliations

  1. University of Parma, Parco Area delle Scienze 53/A, Parma, 43124, PR, Italy

    Vincenzo Arceri, Eleonora Iotti & Enea Zaffanella

  2. Imperial College London, South Kensington Campus, London, SW7 2AZ, United Kingdom

    Sergio Maffeis

  3. Ca’ Foscari University of Venice, Via Torino 155, Venice, 30172, VE, Italy

    Greta Dolcetti & Agostino Cortesi


Contributions

G.D.: Software, Data Curation, Writing, Visualization, Experiments. V.A.: Software, Supervision, Writing, Experiments. E.I.: Software, Writing, Visualization, Experiments. S.M.: Conceptualization, Supervision, Writing. A.C.: Resources, Supervision, Writing. E.Z.: Validation, Supervision, Writing. All authors reviewed the manuscript.

Corresponding author

Correspondence to Greta Dolcetti.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Ethical approval

Not applicable.

Consent to participate

Not applicable.

Consent to publish

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 Prompt experiments

For each phase, we report the prompt experiments and the results we obtained. The names of the best prompts, whose results are reported in the paper, are highlighted in italics.

1.2 Code generation experiments

See Fig. 14.

Fig. 14 Vanilla system prompt for generating C code

See Table 11.

Table 11 Correctness results for each model (Vanilla). OK: all tests passed; Exec: an execution error or timeout occurred; Assert: at least one test failed; Comp: a compilation error occurred
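The four outcome categories used throughout the correctness tables (OK, Exec, Assert, Comp) can be decided from the compile and run results roughly as follows. This is an illustrative sketch, not the authors' actual test harness.

```python
# Map compile/run observations to the outcome categories of the tables.
# Illustrative only; the real harness compiles and executes the C code.

def classify(compiled, timed_out, crashed, assertion_failed):
    if not compiled:
        return "Comp"    # compilation error
    if timed_out or crashed:
        return "Exec"    # execution error or timeout
    if assertion_failed:
        return "Assert"  # at least one ground-truth test failed
    return "OK"          # all tests passed

print(classify(True, False, False, False))
```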

See Fig. 15.

Fig. 15 Example and Counterexample system prompt for generating C code

See Table 12.

Table 12 Correctness results for each model (Example and Counterexample). OK: all tests passed; Exec: an execution error or timeout occurred; Assert: at least one test failed; Comp: a compilation error occurred

See Fig. 16.

Fig. 16 Chain of Thought system prompt for generating C code

See Table 13.

Table 13 Correctness results for each model (Chain of Thought). OK: all tests passed; Exec: an execution error or timeout occurred; Assert: at least one test failed; Comp: a compilation error occurred

See Fig. 17.

Fig. 17 Combo system prompt for generating C code

See Table 14.

Table 14 Correctness results for each model (Combo). OK: all tests passed; Exec: an execution error or timeout occurred; Assert: at least one test failed; Comp: a compilation error occurred

See Fig. 18.

Fig. 18 Vanilla + Implicit CoT system prompt for generating C code

See Table 15.

Table 15 Correctness results for each model (Vanilla + Implicit CoT). OK: all tests passed; Exec: an execution error or timeout occurred; Assert: at least one test failed; Comp: a compilation error occurred

1.3 Self-evaluation experiments - correctness

See Table 16.

Table 16 Results of correctness self-evaluation using the vanilla prompt

See Figs. 19 and 20.

Fig. 19 Vanilla system prompt to perform correctness classification

Fig. 20 Preference heatmap for the self-correctness classification, using the vanilla prompt. Rows: evaluated models; columns: evaluator models

See Table 17.

Table 17 Results of correctness self-evaluation using the example and counterexample prompt

See Figs. 21 and 22.

Fig. 21 Example and counterexample system prompt to perform correctness classification

Fig. 22 Preference heatmap for the self-correctness classification, using the example and counterexample prompt. Rows: evaluated models; columns: evaluator models

See Table 18.

Table 18 Results of correctness self-evaluation using the Chain of Thought prompt

See Figs. 23, 24, and 25.

Fig. 23 Chain of Thought system prompt to perform correctness classification

Fig. 24 Preference heatmap for the self-correctness classification, using the Chain of Thought prompt. Rows: evaluated models; columns: evaluator models

Fig. 25 Combo system prompt to perform correctness classification

1.4 Self-evaluation experiments - safety

See Table 19.

Table 19 Self-safety analysis results for each evaluator model on each evaluated model, using the vanilla prompt

See Fig. 26.

Fig. 26 Vanilla system prompt to perform vulnerability detection

See Table 20.

Table 20 Self-safety analysis results for each evaluator model on each evaluated model, using the example and counterexample prompt

See Fig. 27.

Fig. 27 Example and counterexample system prompt to perform vulnerability detection

See Table 21.

Table 21 Self-safety analysis results for each evaluator model on each evaluated model, using the Chain of Thought prompt

See Figs. 28 and 29.

Fig. 28 Chain of Thought system prompt to perform vulnerability detection

Fig. 29 Combo system prompt to perform vulnerability detection

1.5 Code repair experiments - correctness

See Fig. 30.

Fig. 30 Vanilla Prompt

See Table 22.

Table 22 Overall results of code correctness after the repair and code cleaning phase for each model (Vanilla)

See Fig. 31.

Fig. 31 Chain of Thought Prompt

See Table 23.

Table 23 Overall results of code correctness after the repair and code cleaning phase for each model (CoT)

See Fig. 32.

Fig. 32 One Assert at the Time Prompt. The system prompt is the vanilla prompt, but we provide one failed assertion at a time, iteratively, for a maximum of 6 iterations
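This iterative strategy can be sketched as the loop below. `failed_assertions` and `ask_model_to_fix` are hypothetical stand-ins for the ground-truth test harness and the LLM call, stubbed so the loop runs standalone.

```python
# Sketch of the "One Assert at the Time" repair strategy: feed back a single
# failed assertion per round, for at most 6 rounds. Not the actual harness.

MAX_ITERATIONS = 6

def failed_assertions(code):
    # Stand-in: pretend the code passes once it contains the marker "fixed".
    return [] if "fixed" in code else ["assert(add(2, 2) == 4);"]

def ask_model_to_fix(code, assertion):
    # Stand-in for an LLM call: vanilla repair prompt plus one failed assertion.
    return code + " /* fixed */"

def repair_one_assert_at_a_time(code):
    for _ in range(MAX_ITERATIONS):
        failures = failed_assertions(code)
        if not failures:
            return code, True
        code = ask_model_to_fix(code, failures[0])  # only the first failure
    return code, False
```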

See Tables 24, 25, 26, 27, and 28.

Table 24 Overall results of code correctness after the repair and code cleaning phase for each model (One Assert at the Time)

Table 25 Results of code correctness after the repair and code cleaning phase for gemma-7b-it with the One Assert at the Time prompt

Table 26 Results of code correctness after the repair and code cleaning phase for llama3-8b-8192 with the One Assert at the Time prompt

Table 27 Results of code correctness after the repair and code cleaning phase for llama3-70b-8192 with the One Assert at the Time prompt

Table 28 Results of code correctness after the repair and code cleaning phase for mixtral-8x7b-32768 with the One Assert at the Time prompt

1.6 Code repair experiments - safety

See Fig. 33.

Fig. 33 Vanilla Prompt

See Table 29.

Table 29 Aggregate vulnerability analysis for each model after the repair phase (Vanilla)

See Fig. 34.

Fig. 34 Chain of Thought Prompt

See Table 30.

Table 30 Aggregate vulnerability analysis for each model after the repair phase (Chain of Thought)

See Fig. 35.

Fig. 35 Instructions Prompt

See Table 31.

Table 31 Aggregate vulnerability analysis for each model after the repair phase (Instructions)

See Fig. 36.

Fig. 36 Combo Prompt

See Table 32.

Table 32 Aggregate vulnerability analysis for each model after the repair phase (Combo)

See Fig. 37.

Fig. 37 No Info Prompt. In this experiment, the content prompt does not include any information about the kind of vulnerabilities present in the code

See Table 33.

Table 33 Aggregate vulnerability analysis for each model after the repair phase (No Info)

See Fig. 38.

Fig. 38 No Line Prompt. In this experiment, the content prompt does not include the line number where the vulnerability is present

See Table 34.

Table 34 Aggregate vulnerability analysis for each model after the repair phase (No Line)
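The No Info and No Line variants differ from the full-report prompt only in how much of the analyzer's report is echoed back to the model. The sketch below illustrates the three feedback formats; the function name and exact wording are invented for illustration.

```python
# Three levels of vulnerability feedback, as in the No Info / No Line /
# full-report prompt variants. Wording is illustrative, not the paper's.

def make_feedback(bug_type, line, variant):
    if variant == "no_info":  # neither the vulnerability kind nor its location
        return "The code contains a vulnerability. Please fix it."
    if variant == "no_line":  # the kind, but not the location
        return f"The code contains a {bug_type} vulnerability. Please fix it."
    # full report: both kind and line number
    return f"Line {line}: {bug_type}. Please fix it."

print(make_feedback("NULL_DEREFERENCE", 17, "no_line"))
```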

See Fig. 39.

Fig. 39 One Vulnerability at the Time Prompt

See Tables 35, 36, 37, 38, and 39.

Table 35 Aggregate vulnerability analysis for each model after the repair phase (One Vulnerability at the Time)

Table 36 Breakdown of vulnerabilities for model gemma-7b-it (One Vulnerability at the Time prompt)

Table 37 Breakdown of vulnerabilities for model llama3-70b-8192 (One Vulnerability at the Time prompt)

Table 38 Breakdown of vulnerabilities for model llama3-8b-8192 (One Vulnerability at the Time prompt)

Table 39 Breakdown of vulnerabilities for model mixtral-8x7b-32768 (One Vulnerability at the Time prompt)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Dolcetti, G., Arceri, V., Iotti, E. et al. Helping LLMs improve code generation using feedback from testing and static analysis. Discov Artif Intell (2026). https://doi.org/10.1007/s44163-026-01009-5


  • Received: 23 September 2025

  • Accepted: 09 February 2026

  • Published: 05 March 2026

  • DOI: https://doi.org/10.1007/s44163-026-01009-5


Keywords

  • Large language models
  • Code generation
  • Static analysis
  • Code repair
