
International Journal of Future Engineering Innovations [Link]

Code Generation Using LLMs


Tenneti Lekhya Sri Durga 1*, Mallak Keshava Gayatri 2, Mukku Deepthi Prabha 3, Dr. D Shravani 4
1-3 BE, AI&DS, VIII Sem, SCETW, Hyderabad, INDIA
4 Associate Professor, ADCE dept, SCETW, OU, Hyderabad, INDIA

* Corresponding Author: Tenneti Lekhya Sri Durga

Article Info

ISSN (online): 3049-1215
Volume: 02
Issue: 02
March-April 2025
Received: 28-01-2025
Accepted: 24-02-2025
Page No: 16-22
DOI: [Link]

Abstract

Code generation using Large Language Models (LLMs) has brought a significant change to the software industry, helping people with tasks such as understanding code, writing code, and code completion. The main challenge code generation models face is hallucination, where the generated code may not fully satisfy the user's requirements or may deviate from them. Hallucinations fall into several types, such as intent conflicting, which includes semantic conflicts, and context deviation, which includes inconsistency, repetition, and dead code. To address the repetition type of hallucination, we propose fine-tuning a pre-trained model on a specific dataset to aid code generation and reduce repetitive output. Repetitive code hallucinations occur when the model generates redundant lines, increasing complexity and confusing developers. To address this, we curated a dataset of Python code snippets extracted from GitHub repositories. We also created a user interface for efficient usability, and conducted a comparative study of different models and their results. This work can be further improved through prompt engineering and by fine-tuning on larger datasets covering different languages, training the model to give more accurate and efficient results.

Keywords: Large Language Models (LLMs), Code Generation, Hallucinations, Repetitive Code

1. Introduction
The rapid advancement of technology has led to the evolution of Large Language Models (LLMs), which are used in various fields for tasks like explaining concepts, design assistance, and giving suggestions, and, most importantly for software development, for generation, debugging, and autocompletion of code. LLMs have also helped non-technical people with development work by generating the code a user requires, giving rise to code generation models. Despite these advances, these models face a major challenge: hallucinations. LLMs are trained on vast amounts of data for multiple tasks, which makes them general-purpose; for example, a code generation model might be trained on vast amounts of data spanning various programming languages, which contributes to the problem of hallucinations. Hallucinations remain a persistent challenge in code generation: the model produces incorrect, redundant, or irrelevant code, increasing the number of lines and reducing readability. In other words, the generated code deviates from the user's requirements, leading to inappropriate results. Hallucinations come in multiple types, such as semantic inconsistencies, context deviations, and knowledge conflicts. Among these, repetitive code poses a significant problem, reducing the readability and efficiency of the generated code. To address this challenge of repetitive code, our contributions are as follows:
 We developed a fine-tuned model by training it on a specifically curated dataset of high-quality Python code.
 We compared the results of our fine-tuned model with other models to understand its effectiveness and efficiency.


 Created a high-quality dataset and fine-tuned the model with different hyper-parameters, aiming to reduce repetition in the generated code and to increase its efficiency.
 Designed a user interface to make the model easier to access and use.

The remainder of the paper is organized as follows: Section 2 reviews related work on different studies and implementations in the field of code generation using LLMs. Section 3 presents the proposed solution. Section 4 presents the experimental results, and Section 5 discusses future research directions and concludes the study.

2. Related Work
This section reviews literature on existing works in the field of code generation using LLMs. Fang Liu [1] studied the types of hallucinations encountered during code generation with LLMs and classified them into different groups. Simiao Zhang [2] built an AI-aided software development prototype that generates use cases, system designs, and implementations from high-level user requirements while integrating continuous feedback. Chong Wang [3] came up with ToolGen, an approach that integrates autocompletion tools into the code LLM generation process to address dependencies. Tianyu Wang [4] categorized programming questions based on educational requirements, applied various prompt engineering strategies, and assessed the effectiveness of LLM-generated responses. Sarah Fakhoury [5] studied TiCoder for guided intent clarification through tests for more accurate code generation. Nikhil Parasaram [6] studied fact selection for generating appropriate code, proposing a model that selects facts specific to a given bug to include in the prompt. Arun-Balajiee Lekshmi-Narayanan [7] assessed the feasibility of using LLMs to generate code explanations. Daye Nam [8] studied a UI built directly into the IDE that is geared towards code understanding. Shubham Ugare [9] proposed a novel framework for efficient and general syntactical decoding with LLMs. Wen-Ding Li [10] investigated the ability of LLMs to perform programming by example, where the algorithm is generated from input-output examples.
Juyong Jiang [11] conducted a systematic literature review that serves as a valuable reference for researchers investigating cutting-edge progress in LLMs for code generation. Shuyuan Xu [12] developed a novel system for code representation and execution that employs an LLM as an interpreter to execute natural language programs. Hagit Gabbay [13] explored the potential of LLMs to generate feedback on code assignments and to address gaps in automated test-based feedback. Islem Bouzenia [14] introduced RepairAgent to address the program repair challenge through an autonomous agent based on an LLM. Boyang Yang [15] assessed LLMs' repair performance on TutorCode, measuring repair correctness and patch precision. Anupam Garg [16] explored different prompting strategies, from basic prompts to advanced techniques like chain-of-thought prompting and iterative refinement. Chris Brown [17] conducted a preliminary evaluation exploring the beliefs and behaviors of LLMs used to support software development tasks. Federico Cassano [18] came up with an effective approach for boosting the performance of code LLMs on low-resource languages using semi-synthetic data. Anisha Agarwal [19] introduced the Copilot evaluation harness, a set of data and tools for evaluating LLM-guided IDE interactions covering various programming scenarios and languages. Albert Contreras [20] presented an extensible architecture for defining LLM-based assistance tasks and binding them to IDE commands and natural language prompts.
Parshin Shojaee [21] introduced a novel approach that leverages the extensive scientific knowledge and robust code generation capabilities of LLMs to discover scientific equations from data in an efficient manner. John Chen [22] explored how LLMs can assist in agent-based modeling, specifically using NetLogo, a popular programming language for agent-based modeling. Yichen Li [23] proposed IDECoder, a practical framework that leverages IDE-native static contexts for cross-context construction and diagnosis results for self-refinement. Julian Van Santen [24] demonstrated how LLM chatbots can be restricted by teachers to provide a helpful tool for students learning functional programming and its concepts. Yanggyu Lee [25] proposed an effective approach for detecting logical errors with LLMs that makes use of relations among error types in chain-of-thought and tree-of-thought prompts. Irene Weber [26] provided a taxonomy for LLM-integrated applications, offering a framework for analyzing and describing these systems. Dewu Zheng [27] conducted an empirical study to deeply understand LLM code generation performance in settings that reflect the evolving nature of software development. Gregor Jošt [28] explored the nuanced impact of informal LLM usage on undergraduate students' learning outcomes in software development education, focusing on React applications. Haonan Li [29] proposed a framework that synergizes static analysis and LLMs, with a spotlight on identifying use-before-initialization bugs within the Linux kernel. Kaibo Liu [30] proposed AID, which combines LLMs with differential testing to generate fault-revealing test inputs and oracles targeting plausibly correct programs.

3. Proposed Framework


Fig 1: Flowchart depicting the process

We proposed a solution to mitigate repetitive code hallucination in LLM-based code generation by fine-tuning the Salesforce CodeGen model on a high-quality, well-processed dataset. This dataset was curated by extracting Python code from multiple GitHub repositories and applying rigorous data preprocessing techniques to ensure consistency and relevance. We carefully selected optimal hyperparameters to enhance the model's ability to learn meaningful code patterns while preventing issues such as exploding or vanishing gradients. Our fine-tuning strategy enables the model to generate clear, concise, and structurally accurate code. We evaluated the effectiveness of our approach through BLEU and METEOR scores, demonstrating significant improvements over baseline models.
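The preprocessing applied to the curated dataset is not spelled out in detail; a minimal sketch of what such cleaning could look like is below. The function name `clean_snippet` and its specific rules (dropping comment-only and exactly duplicated lines, then re-validating with `ast.parse`) are illustrative assumptions, not the authors' actual pipeline.

```python
import ast

def clean_snippet(source: str):
    """Illustrative preprocessing pass: strip comment-only and exact
    duplicate lines, then keep the snippet only if it still parses."""
    seen = set()
    kept = []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):       # drop comment-only lines
            continue
        if stripped and stripped in seen:  # drop exact duplicate lines
            continue
        if stripped:
            seen.add(stripped)
        kept.append(line)
    cleaned = "\n".join(kept)
    try:
        ast.parse(cleaned)                 # discard snippets that no longer parse
    except SyntaxError:
        return None
    return cleaned
```

Deduplicating lines this aggressively can break legitimately repetitive code, which is why the sketch re-parses the result and discards snippets that no longer compile.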


4. Experimental Results
We fine-tuned our model by collecting various Python code snippets from GitHub repositories and then performed data preprocessing, such as removing redundant code and unhelpful comments. The dataset consisted of 809 rows of Python code on which the model was fine-tuned. The resulting model has a BLEU score of 61.76%, a METEOR score of 85.88%, and a BERTScore with Precision 73.33%, Recall 92.55%, and F1 82.65%. We made a comparative study of our model against GPT-2; the table below reports the results.

Table 1: Comparing metrics of GPT-2 and the fine-tuned Salesforce CodeGen model


Metric GPT-2 Model Fine-tuned Model
BLEU Score 12.60 61.76
ROUGE-1 28.57 75.77
ROUGE-2 25.93 75.03
ROUGE-L 28.57 75.61
METEOR 53.18 85.88
BERTScore F1 28.99 82.65
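For reference, BLEU (the first row of Table 1) is a modified n-gram precision combined with a brevity penalty. A minimal smoothed variant can be sketched in pure Python; this is only an illustration of the metric, and the paper's scores were presumably computed with a standard library implementation such as NLTK or sacrebleu.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference: str, candidate: str, max_n: int = 4) -> float:
    """Smoothed BLEU: geometric mean of modified n-gram precisions
    (add-one smoothing) times a brevity penalty."""
    ref, cand = reference.split(), candidate.split()
    precisions = []
    for n in range(1, max_n + 1):
        ref_counts = Counter(ngrams(ref, n))
        cand_counts = Counter(ngrams(cand, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(len(cand) - n + 1, 0)
        # add-one smoothing so one empty n-gram order does not zero the score
        precisions.append((overlap + 1) / (total + 1))
    # brevity penalty discourages candidates much shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

An identical reference and candidate score 1.0; any missing or substituted token lowers every n-gram order it touches.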

Fig 2: Comparing different metrics of GPT-2 and Salesforce CodeGen fine-tuned model

The above graph shows the difference between metrics such as BLEU, ROUGE, METEOR, BERTScore, Recall, and F1 for GPT-2 and the Salesforce CodeGen fine-tuned model. It shows that the Salesforce CodeGen fine-tuned model achieves better metrics than the other model.

Fig 3: Decreasing validation loss in the Salesforce CodeGen fine-tuned model with every epoch


The above figure shows a declining curve depicting the decrease in validation loss with every epoch, indicating that the model learns the data progressively better.

Fig 4: Comparing the repetitive code hallucination in GPT-2 and Salesforce CodeGen fine-tuned model

The graph above compares repetitive code hallucination for both the GPT-2 and Salesforce CodeGen fine-tuned models. From the graph, we can see that repetitive code hallucination in the Salesforce CodeGen fine-tuned model is reduced compared to the GPT-2 model.

Fig 5: Outputs of GPT-2 and Salesforce CodeGen fine-tuned model for Factorial


The above images are the outputs generated by GPT-2 and the Salesforce CodeGen fine-tuned model, showing that the fine-tuned model generates code without repeating lines, which improves readability, efficiency, and accuracy.

5. Conclusion and future scope
In this study, we proposed a fine-tuned code generation model based on the Salesforce CodeGen pre-trained model. We created a high-quality dataset by extracting Python code snippets from various GitHub repositories and applying data preprocessing techniques. The fine-tuned model successfully reduced repetitive code generation, improving the clarity and readability of the output. A comparative evaluation with other models demonstrated the effectiveness of our approach, achieving a BLEU score of 71.32%. While our model shows promising results, further improvements can be made. In the future, we aim to expand the dataset to include multiple programming languages for improved generalization. Additionally, we plan to address other hallucination types beyond code repetition, such as semantic inconsistencies and context deviation, to enhance the reliability of code generation models.

6. References
1. Liu F, Liu Y, Shi L, Huang H, Wang R, Yang Z, et al. Exploring and evaluating hallucinations in LLM-powered code generation. arXiv preprint arXiv:2404.00971. 2024.
2. Zhang S, Wang J, Dong G, Sun J, Zhang Y, Pu G. Experimenting a new programming practice with LLMs. arXiv preprint arXiv:2401.01062. 2024.
3. Wang C, Zhang J, Feng Y, Li T, Sun W, Liu Y, et al. Teaching code LLMs to use autocompletion tools in repository-level code generation. arXiv preprint arXiv:2401.06391. 2024.
4. Wang T, Zhou N, Chen Z. Enhancing computer programming education with LLMs: A study on effective prompt engineering for Python code generation. arXiv preprint arXiv:2407.05437. 2024.
5. Fakhoury S, Naik A, Sakkas G, Chakraborty S, Lahiri SK. LLM-based test-driven interactive code generation: User study and empirical evaluation. arXiv preprint arXiv:2404.10100. 2024.
6. Parasaram N, Yan H, Yang B, Flahy Z, Qudsi A, Ziaber D, et al. The fact selection problem in LLM-based program repair. arXiv preprint arXiv:2404.05520. 2024.
7. Narayanan ABL, Oli P, Chapagain J, Hassany M, Banjade R, Brusilovsky P, et al. Explaining code examples in introductory programming courses: LLM vs humans. In: AI for Education: Bridging Innovation and Responsibility at the 38th AAAI Annual Conference on AI. 2024 Feb.
8. Nam D, Macvean A, Hellendoorn V, Vasilescu B, Myers B. Using an LLM to help with code understanding. In: Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 2024 Apr; p. 1–13.
9. Ugare S, Suresh T, Kang H, Misailovic S, Singh G. Improving LLM code generation with grammar augmentation. arXiv preprint arXiv:2403.01632. 2024.
10. Li WD, Ellis K. Is programming by example solved by LLMs? arXiv preprint arXiv:2406.08316. 2024.
11. Jiang J, Wang F, Shen J, Kim S, Kim S. A survey on large language models for code generation. arXiv preprint arXiv:2406.00515. 2024.
12. Xu S, Li Z, Mei K, Zhang Y. AIOS Compiler: LLM as interpreter for natural language programming and flow programming of AI agents. CoRR. 2024.
13. Gabbay H, Cohen A. Combining LLM-generated and test-based feedback in a MOOC for programming. In: Proceedings of the Eleventh ACM Conference on Learning@Scale. 2024 Jul; p. 177–87.
14. Bouzenia I, Devanbu P, Pradel M. RepairAgent: An autonomous, LLM-based agent for program repair. arXiv preprint arXiv:2403.17134. 2024.
15. Yang B, Tian H, Pian W, Yu H, Wang H, Klein J, et al. CREF: An LLM-based conversational software repair framework for programming tutors. In: Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 2024 Sep; p. 882–94.
16. Arora C, Venaik U, Singh P, Goyal S, Tyagi J, Goel S, et al. Analyzing LLM usage in an advanced computing class in India. arXiv preprint arXiv:2404.04603. 2024.
17. Brown C, Cusati J. Exploring the evidence-based beliefs and behaviors of LLM-based programming assistants. arXiv preprint arXiv:2407.13900. 2024.
18. Cassano F, Gouwar J, Lucchetti F, Schlesinger C, Freeman A, Anderson CJ, et al. Knowledge transfer from high-resource to low-resource programming languages for code LLMs. Proceedings of the ACM on Programming Languages. 2024;8(OOPSLA2):677–708.
19. Agarwal A, Chan A, Chandel S, Jang J, Miller S, Moghaddam RZ, et al. Copilot evaluation harness: Evaluating LLM-guided software programming. arXiv preprint arXiv:2402.14261. 2024.
20. Contreras A, Guerra E, de Lara J. Towards an extensible architecture for LLM-based programming assistants in IDEs. arXiv preprint. 2024.
21. Shojaee P, Meidani K, Gupta S, Farimani AB, Reddy CK. LLM-SR: Scientific equation discovery via programming with large language models. arXiv preprint arXiv:2404.18400. 2024.
22. Chen J, Lu X, Du Y, Rejtig M, Bagley R, Horn M, et al. Learning agent-based modeling with LLM companions: Experiences of novices and experts using ChatGPT & NetLogo chat. In: Proceedings of the CHI Conference on Human Factors in Computing Systems. 2024 May; p. 1–18.
23. Li Y, Peng Y, Huo Y, Lyu MR. Enhancing LLM-based coding tools through native integration of IDE-derived static context. In: Proceedings of the 1st International Workshop on Large Language Models for Code. 2024 Apr; p. 70–74.
24. Santen J. Using LLM chatbots to improve the learning experience in functional programming courses [Bachelor's thesis]. Enschede, Netherlands: University of Twente; 2024.
25. Lee Y, Jeong S, Kim J. Improving LLM classification of logical errors by integrating error relationship into prompts. In: International Conference on Intelligent Tutoring Systems. Cham: Springer Nature Switzerland; 2024 Jun; p. 91–103.
26. Weber I. Large language models as software components: A taxonomy for LLM-integrated applications. arXiv preprint arXiv:2406.10300. 2024.
27. Zheng D, Wang Y, Shi E, Zhang R, Ma Y, Zhang H, et al. Towards more realistic evaluation of LLM-based code generation: An experimental study and beyond. arXiv preprint arXiv:2406.06918. 2024.
28. Jošt G, Taneski V, Karakatič S. The impact of large language models on programming education and student learning outcomes. Applied Sciences. 2024;14(10):4115.
29. Li H, Hao Y, Zhai Y, Qian Z. Enhancing static analysis for practical bug detection: An LLM-integrated approach. Proceedings of the ACM on Programming Languages. 2024;8(OOPSLA1):474–99.
30. Liu K, Liu Y, Chen Z, Zhang JM, Han Y, Ma Y, et al. LLM-powered test case generation for detecting tricky bugs. arXiv preprint arXiv:2404.10304. 2024.
31. Wikipedia contributors. Large language model [Internet]. Wikipedia. 2024 [cited 2024 Mar 19]. Available from: [Link]
