Kantek DP
Master’s Thesis
Acknowledgements
I would like to express my gratitude to Dr. Bruno Rossi, my supervi-
sor, for the numerous constructive feedback sessions and insightful
discussions about the thesis. I would also like to thank my family for
everything.
Abstract
In recent years, applications of artificial intelligence have seen notable
success across various fields. Large Language Models (LLMs) have
particularly found extensive use in the field of software development.
From source code generation and code understanding to documenta-
tion translation, tools based on LLMs can enhance the effectiveness of
the software development life cycle and assist software engineers in
their daily tasks, all within the comfort of their favorite Integrated De-
velopment Environments (IDEs). An open question regarding these
tools revolves around the quality of the source code they produce,
as low-quality code could potentially do more harm than good, es-
pecially when it comes to common security vulnerabilities in source
code.
Building upon this line of thought, this work establishes two key
goals. The first goal is to identify the best AI-based tool from a selected
list of popular options, including GitHub Copilot, Tabnine, ChatGPT,
and CodeGeeX, for source code generation based on the quality of the
produced code. The experiments reveal that GitHub Copilot ranks
first, demonstrating the best results across eight research questions.
The second goal involves deploying a custom LLM tool for code gen-
eration using the TabbyML platform and cloud hosting. This custom
tool is then integrated with IDEs.
Keywords
Large Language Models, Transformer, Deep Learning, Static Code
Analysis, Software Quality, Vulnerabilities
Contents
Introduction
4 Related Works
5 Experimental Design
    5.1 Data Collection
    5.2 Goals, Research Questions, and Metrics
6 Experimental Evaluation
    6.1 Goal 1
    6.2 Goal 2
    6.3 Threats to Validity
7 Conclusion
Bibliography
Introduction
Artificial Intelligence (AI) has become increasingly prominent in vari-
ous domains. One area where AI holds immense potential is in soft-
ware development. AI-based tools can instantly generate source code
across a wide range of programming languages, enhancing develop-
ers’ effectiveness throughout the software development life cycle. As
organizations aim to deliver high-quality software products within
shorter time frames, ensuring source code quality becomes especially
crucial when incorporating AI-generated code into the codebase.
Problem Statement
Despite the advancements in AI-driven software development, chal-
lenges still persist. One significant issue is accurately assessing and
enhancing source code quality. AI-based code generation tools are
trained on vast amounts of human-written source code—mined from
both public and private repositories—potentially incorporating com-
mon vulnerabilities into the generated code. Beyond vulnerabilities,
the generated code may also contain code smells, impacting the main-
tainability of software projects.
1 AI-Driven Code Generation
There are many similarities between natural languages and the program-
ming languages used in software development. Although natural
languages tend to be complex, creative, ambiguous, and exhibit other
such characteristics, what humans actually use from the whole
potential set of natural language is simpler, more repetitive,
and thus more easily modelled with a statistical language model. [2]
defines software as natural in the same way as natural language, since
it is created by daily, repetitive human work that shares the same
modellable characteristics.
This repetitiveness was studied in [3], and the findings showed that a
corpus consisting of 420 million lines of code from over 6,000 projects
written in C, C++, and Java contained a large degree of repetition1.
This finding further supports the argument that source code can be
modelled by language models, and that language models can in turn
be used in software engineering to aid programmers in their
daily work.
1. The amount of repetitiveness depended on the size of the code fragment
compared to the rest of the project, as smaller fragments tended to be naturally
more repetitive than bigger ones.
$$p(s) = p(a_1) \cdot p(a_2 \mid a_1) \cdot p(a_3 \mid a_1 a_2) \cdots p(a_n \mid a_1 \ldots a_{n-1})$$

$$p(a_i \mid a_1 \ldots a_{i-1}) \simeq p(a_i \mid a_{i-3} a_{i-2} a_{i-1})$$

$$p(a_4 \mid a_1 a_2 a_3) = \frac{\mathrm{count}(a_1 a_2 a_3 a_4)}{\mathrm{count}(a_1 a_2 a_3 \, \ast)}$$
The last definition left is that of a performance metric, which in the case
of language modelling and distribution estimation on text corpora is
cross entropy; in NLP it measures how perplexed a model is when
processing new, previously unseen text.
Figure 1.1: The relationship between order of N-Grams and the value
of 10-Fold Cross Validation Cross Entropy. Line represents natural
language corpus and boxplots represent code corpora. [2]
Code Generation
Code generation is the most general and widely applicable NLP task in
source code applications. It is an automated process of transforming
natural language specifications or descriptions into executable source
code, bridging the gap between human language and programming
constructs. Natural language specifications can take the form of
code-level comments, prompts, documentation, and others.
Code Completion
A common feature of integrated development environments (IDEs)
that suggests and automates the insertion of code elements as develop-
ers type, streamlining the coding process and reducing the amount of
typing. There have been many types of code completion systems,
based on rules, type inference, statistical models5, and machine
learning models.
Code Suggestion
A subtask of code generation providing developers with intelligent
recommendations of code snippets for code enhancements, optimiza-
tions, or alternative implementations during the coding phase. This
task takes instructions in the form of prompts, code-level comments,
function signatures, and others.
Code Translation
Also called transpilation, code translation is the conversion of code
from one programming language to equivalent code in another programming language.
5. N-grams.
Code Refinement
Improving or optimizing automatically generated source code. In an
NLP setting, code refinement involves enhancing the quality, read-
ability, and efficiency of code that has been generated from natural
language specifications or descriptions.
Code Summarization
Creating concise and informative summaries of code snippets or entire
codebases to facilitate comprehension, documentation, and knowledge
transfer. This task is especially useful for legacy codebases where
human knowledge has been lost, or for more exotic languages and
technologies that few software engineers have experience with.
Defect Detection
The identification and analysis of bugs, errors, or imperfections in
software code to improve its correctness, reliability, and functionality.
Unlike the rest of the tasks, defect detection can be implemented as a
binary classification task, where the input code snippet is categorized
as either defective or correct.
Code Repair
Automatic or semi-automatic techniques for identifying and fixing
issues or errors in source code. This task provides additional function-
ality on top of defect detection, as the implementation must not
only recognize that the input code snippet contains an error but must
also be able to fix it6.
6. There is a new field of research called self-healing applications. In this field,
applications utilize generative AI to rewrite themselves whenever they detect an
error, exception, or a bug, and thus can potentially automate bug fixing and
debugging. There are several open-source projects which deal with this field.
Clone Detection
The process of identifying redundant or similar sections of code within
a software project to enhance maintainability and reduce redundancy.
Clone detection can be seen as a practical application of the DRY7
software engineering principle.
Documentation Translation
The translation of software documentation from one language to an-
other, ensuring accessibility and comprehension for a broader audi-
ence. This task is probably the closest to common NLP tasks such as
machine translation since it deals mostly with natural language8 .
NL Code Search
The ability to search for relevant code snippets using natural language
queries, enabling developers to locate code based on contextual de-
scriptions rather than on more trivial search methods such as full-text
search in the source code repository.
7. Don’t Repeat Yourself - it aims at reducing source code clones and repetitions.
8. But software documentation can of course also contain code snippets.
9. Transformers are not limited only to NLP; they have also achieved state-of-the-art
performance in other areas, such as computer vision.
10. Unlike RNNs, which accept tokens one by one. This also makes Transformers
more computationally efficient than RNNs, as they can better utilize modern GPUs.
11. In other words, FNNs introduce non-linearity to the Transformer through the
non-linear activation functions of their neurons and multiple hidden layers.
12. A well-known example is the BERT transformer.
Definition 1.2.1 (Query, Key, Value [4]). Let $X$ be the output of the
previous network layer, $d_{\mathrm{model}}$ the dimension of the layers’ outputs,
$d_k$ the dimension of queries and keys, $d_v$ the dimension of values, and
$W^Q \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}$, $W^K \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}$, $W^V \in \mathbb{R}^{d_{\mathrm{model}} \times d_v}$ be the learned projection
(weight) matrices of a single self-attention layer. Query $Q$, key $K$, and
value $V$ are defined as

$$Q = X \cdot W^Q, \qquad K = X \cdot W^K, \qquad V = X \cdot W^V$$
The terms query, key, and value in the definition refer to self-attention
instead of general attention. This distinction arises when the query,
key, and value are computed from the same input, leading to the use
of the term self-attention in such cases. The concept of query, key, and
value is based on the logic from information retrieval:
Query. The query represents what information the model is inter-
ested in or what it wants to retrieve from the input sequence or the
outputs of the previous layer.
Key. The key represents the information contained in the retriev-
able element (metadata) and how it can be used to provide context
for the query.
Value. The value represents the actual information that can be
retrieved based on query and key.
$$\mathrm{MHA}(X) = \mathrm{Concat}_{i \in [1,h]}\big(A(Q_i, K_i, V_i)\big) \cdot W^O$$

where

$$Q_i = X \cdot W_i^Q, \qquad K_i = X \cdot W_i^K, \qquad V_i = X \cdot W_i^V$$
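As a concrete illustration of the definitions above, the following NumPy sketch (mine, using the standard scaled dot-product formulation for A(Q, K, V), which the surrounding text assumes) projects an input X into per-head queries, keys, and values, applies attention, and concatenates the heads.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention A(Q, K, V).
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # similarity of every query with every key
    return softmax(scores, axis=-1) @ V   # attention-weighted sum of values

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    # MHA(X): per-head projections, attention, concatenation, output projection W_O.
    heads = []
    for W_q, W_k, W_v in zip(W_Q, W_K, W_V):
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        heads.append(attention(Q, K, V))
    return np.concatenate(heads, axis=-1) @ W_O

# Toy dimensions: sequence length 4, d_model = 8, h = 2 heads with d_k = d_v = 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_Q = [rng.normal(size=(8, 4)) for _ in range(2)]
W_K = [rng.normal(size=(8, 4)) for _ in range(2)]
W_V = [rng.normal(size=(8, 4)) for _ in range(2)]
W_O = rng.normal(size=(8, 8))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O).shape)  # (4, 8)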
Figure 1.5: The evolution of NLP models considering their task solving
capacity [5].
13. For instance, the LLaMA 2 LLM from Meta comprises 70B parameters. It was
trained on 2T tokens and utilized 2000 Nvidia A100 80GB GPUs [6]. Although the
training time is not disclosed in the LLaMA 2 paper, its predecessor of a comparable
size, trained on only 1.4T tokens, took 21 days to train on 2048 GPUs of the same type [7].
$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \sim 0.076, \; N_c \sim 8.8 \times 10^{13}$$

$$L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad \alpha_D \sim 0.095, \; D_c \sim 5.4 \times 10^{13}$$

$$L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}, \qquad \alpha_C \sim 0.050, \; C_c \sim 3.1 \times 10^{8}$$
LLM Architectures
Pre-Training of LLMs
Table 1.2: Commonly used text corpora for pre-training of LLMs [5].
Corpora        Size     Source          Description
BookCorpus     5GB      Books           11,000 books
Gutenberg      -        Books           70,000 books
BigQuery       -        Code            Code snippets from public repositories
CC-Stories-R   31GB     CommonCrawl     Narrative-like style
CC-NEWS        78GB     CommonCrawl     News articles from news sites
REALNEWS       120GB    CommonCrawl     News articles from 5,000 domains
C4             800GB    CommonCrawl     Subset of Common Crawl
The Pile       800GB    Other           Diverse sources
ROOTS          1.6TB    Other           Diverse smaller datasets
OpenWebText    38GB     Reddit links    Highly upvoted Reddit links
Pushshift.io   2TB      Reddit links    Regularly updated entire Reddit content history
Wikipedia      21GB     Wikipedia       Wikipedia articles
Fine-Tuning of LLMs
The second phase of training LLMs is fine-tuning. Utilizing suitable
training procedures can further improve the performance, general-
ization abilities, applicability to a more diverse set of tasks, and stabil-
ity of predictions of LLMs. Instruction tuning, an almost-supervised
fine-tuning objective, involves providing LLMs with general training
15. Even for inference, which is orders of magnitude less expensive than training,
GPUs are necessary for larger LLMs to achieve reasonable latency.
16. The combination of data parallelism, pipeline parallelism, and tensor parallelism
is collectively referred to as 3D parallelism [5].
17. For instance, from 32-bit to 16-bit floating-point numbers.
18. The entire question-answering dataset could be augmented with a single human-
written task description (e.g., ’Please answer this question’ [5]). Nevertheless, the
effectiveness of employing such a general task description for all examples in the
dataset is questionable.
19. The use of the RLHF algorithm in alignment tuning is known to push safe
behavior to an extreme, often leading LLMs to generate outputs that are deemed
safe but not really useful. This challenge is recognized as the evasion problem [5].
While the most likely token still carries a high probability of being
generated, other tokens with significant probabilities20 can also be
generated. As a result, running an LLM multiple times with the same
input sequence can produce diverse outputs.
To avoid sampling tokens with very low probabilities, a tempera-
ture parameter is introduced in the softmax function, adjusting the
20. For meaningful outputs, the most likely token and the set of the second most
likely tokens share lexical semantics (e.g., the most likely token is "dog," and the set
of other likely tokens is "cat," "rabbit," "hamster").
probability distribution toward more likely tokens. The lower the tem-
perature parameter, the more deterministic the outputs become, and
vice versa. Certain LLM-based tools let users adjust the temperature
parameter, providing control over the creativity of the LLM’s outputs.
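The following minimal sketch (illustrative only, not taken from any particular tool) shows how the temperature parameter reshapes a softmax distribution over toy next-token logits before sampling.

import numpy as np

def sample_next_token(logits, temperature=1.0, seed=42):
    # Temperature-scaled softmax: low temperature sharpens, high temperature flattens.
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    token = np.random.default_rng(seed).choice(len(probs), p=probs)
    return token, probs

logits = [4.0, 3.5, 1.0, 0.2]  # e.g. "dog", "cat", "rabbit", "hamster"
for t in (0.2, 1.0, 2.0):
    _, probs = sample_next_token(logits, temperature=t)
    print(f"temperature {t}: {np.round(probs, 3)}")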
21. In fact, the same LLM can be used for both machine translation and code-related
tasks, given that suitable fine-tuning objectives were employed during its fine-tuning.
have both seen upgrades in the form of new minor versions, namely
GPT-3.5 Turbo and GPT-4 Turbo. However, precise technical details
about the architecture, training methods, and training data for both
GPT-3 and GPT-4, as well as their newer minor versions, have not been
disclosed in their official publication papers [9, 10].
GPT-1. Released in 2018, GPT-1 was the first model in the GPT
series, containing 117M parameters. It combined unsupervised pre-
training for language modeling with supervised fine-tuning [11].
GPT-2. Released in 2019, GPT-2 scaled the number of parameters
to 1.5B. In contrast to the previous generation, it relied solely on un-
supervised training by predicting the next word in a sequence. The
authors aimed to unify all NLP tasks as word prediction tasks [12].
However, GPT-2 was found to be insufficiently complex for employing
only unsupervised training, as its fine-tuned versions achieved better
performance [5].
GPT-3. Released in 2020, this generation marked a significant leap
in model size with 175B parameters and can be regarded as the in-
ception of LLMs. It employed in-context learning, enabling it to excel
in unseen tasks without requiring changes to the model’s parame-
ters. The substantial size of 175B parameters empowered the model to
showcase remarkable capabilities for NLP tasks [9].
Codex. Released in 2021, GPT-3 was noted for its limitations in
solving complex reasoning problems, including code generation and
mathematical reasoning [5]. In response, OpenAI introduced the
Codex model, built upon the GPT-3 architecture but with up to 12B
parameters. This model was fine-tuned on a multi-lingual code dataset
sourced from repositories on GitHub, resulting in improved perfor-
mance in these specific tasks [13].
GPT-4. Released in 2023, GPT-4 continued the trend of increasing
model size and complexity, achieving an impressive 1.76T parameters,
making it one of the largest models ever created. Its newer minor
version, GPT-4 Turbo, introduced task capabilities beyond the scope
of NLP, encompassing text-to-image generation, text-to-speech gener-
ation, and computer vision [10].
LLaMA
LLaMA, a relatively new family of open-source LLMs from Meta, has
become a center of attention in LLM research. These models serve as
the foundation for many researchers, given their open-source nature
and the authors’ release of extensive evaluations on numerous bench-
marks. LLaMA models have found applications as base models for
more specialized variants, frequently undergoing fine-tuning with
instruction tuning objectives tailored to specific domains [5].
LLaMA 1. Released in 2023, LLaMA 1 is available in versions
with 6.7B, 13B, 32.5B, and 65.2B parameters. The training procedure
employed a combination of common open-source datasets, including
CommonCrawl, C4, GitHub, Wikipedia, Books, ArXiv, and StackEx-
change, totaling 1.4T tokens. The authors’ emphasis on the openness
of LLM research is evident in their choice of datasets. Notably, LLaMA
1 with 13B parameters has demonstrated superior performance com-
pared to GPT models from OpenAI, including GPT-3 with 175B pa-
rameters, across most open-source benchmarks [7].
LLaMA 2. Released in 2023, the second generation of LLaMA
models was trained on an even larger dataset, sourcing data from
publicly available repositories, totaling 2T tokens. LLaMA 2 introduces
a general foundational model and a specialized version, LLaMA 2-
CHAT, specifically optimized for dialogue. Both variants are available
in versions with 7B, 13B, 34B, and 70B parameters [6].
Code LLaMA. Released in 2023, Code LLaMA is a family of open-
source LLMs based on LLaMA 2, sharing a foundation pre-trained on
2T tokens. This family underwent additional training on code tokens,
with the foundational model, Code LLaMA, trained on 500B tokens,
Code LLaMA - Python on 500B tokens and 100B Python tokens, and
Code LLaMA - Instruct on 14,000 instruction examples. All versions
are available in 7B, 13B, and 34B parameter models [14].
Hugging Face
Hugging Face is a company that maintains the largest open-source cu-
rated library of transformer neural networks, containing over 400,000
models [15]. Naturally, they also design and train their own open-
source LLMs, which they make available as part of the library.
DeepMind
DeepMind, a company owned by Google, has developed a series of
LLMs with a primary focus on innovating training techniques.
Chinchilla. Released in 2022, Chinchilla is a general LLM with 70B
parameters, trained on 1.4T tokens. The primary purpose of this model
was to estimate the optimal scaling law of LLMs, and it demonstrated
superior performance to GPT-3 on the Massive Multitask Language
Understanding benchmark [19].
AlphaCode. Released in 2022, AlphaCode is an LLM available in
versions with 300M, 1.1B, 2.8B, 8.7B, and 41.1B parameters. It is specif-
ically designed for competition-level code generation with special
requirements [20]. AlphaCode employs a novel training procedure
based on reinforcement learning and clustering of suggested programs.
As a result of this innovative approach, AlphaCode is a state-of-the-art
code generation LLM in the competitive programming domain [20].
Risks of LLMs
LLMs are recognized for not being optimal models in terms of risks
and security. This is primarily attributed to the fact that LLMs are
trained on large and diverse datasets harvested from the public in-
ternet, and their content cannot be fully controlled. In the context
of LLM-based tools for software development, which are primarily
aimed at generating source code rather than natural language, they are
not particularly influenced by the relatively common social risks associ-
ated with LLMs, such as bias, racism, toxicity, and hate speech—except,
perhaps, in the case of chatbot-based or prompt-based code gener-
ation LLMs like ChatGPT, which generate a substantial amount of
complementary text.
The primary risks associated with code generation LLMs include
hallucinations, trustworthiness, and security. Particularly with junior
and inexperienced programmers, there is a notable risk of accepting
incorrectly generated, vulnerable, or otherwise suboptimal code. Fac-
tors such as time pressure, lack of experience, or reluctance to conduct
thorough testing contribute to this risk. This work is motivated, in
part, by the need to illuminate the risks associated with code genera-
tion LLM-based tools and to address questions related to source code
quality through the evaluation of a series of experiments.
Hallucinations
In the context of LLMs, hallucinations refer to instances where the
model generates content that is factually incorrect, fictional, or entirely
fabricated. These occurrences can arise when the model extrapolates
information based on patterns learned during training but produces
outputs that have no basis in reality. Taking code generation as an
example, an LLM may generate code containing fictional libraries or
language syntax, or it may try to convince the user that the generated
code is correct even though it is incorrect.
Hallucinations can also arise from the limited knowledge of LLMs.
A common method to mitigate hallucinations is to extend the LLM’s
knowledge base beyond the statically fixed training dataset. This can
be achieved by utilizing a vector database together with a suitable
information retrieval method, such as retrieval augmented generation
(RAG). Retrieved information is then incorporated into the user’s
prompt to provide the LLM with additional knowledge and context.
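A minimal sketch of the retrieval augmented generation idea described above; the embedding function and in-memory document store are hypothetical stand-ins for a real embedding model and vector database.

import numpy as np

def embed(text):
    # Hypothetical embedding: hash words into a small dense vector (stand-in for a real model).
    vec = np.zeros(64)
    for word in text.lower().split():
        vec[hash(word) % 64] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def retrieve(query, documents, top_k=2):
    # Rank documents by cosine similarity to the query and keep the top_k.
    q = embed(query)
    return sorted(documents, key=lambda d: float(embed(d) @ q), reverse=True)[:top_k]

def build_prompt(query, documents):
    # Prepend the retrieved context to the user prompt before sending it to the LLM.
    context = "\n".join(f"- {d}" for d in retrieve(query, documents))
    return f"Use the following context:\n{context}\n\nQuestion: {query}"

docs = [
    "The project uses FastAPI 0.100 and requires Python 3.11.",
    "Database credentials are loaded from environment variables.",
    "The CI pipeline runs SonarCloud analysis on every pull request.",
]
print(build_prompt("Which Python version does the project require?", docs))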
Trustworthiness
Trustworthiness concerns the reliability and dependability of the infor-
mation generated by LLMs. Trustworthy models should consistently
produce accurate and verifiable content, instilling confidence in users
regarding the correctness of the generated information. Ensuring the
trustworthiness of LLMs is challenging due to potential biases in the
training data, the model’s vulnerability to adversarial attacks, and the
difficulty in verifying the accuracy of generated content, especially in
dynamic or rapidly changing contexts, such as software development.
Security
In the context of code generation LLMs, security concerns involve
potential vulnerabilities in the training data that could be memorized
by the LLM and transferred to the generated code. Another security
concern is the potential leakage of private and sensitive data that could
also be memorized by the LLM from the training dataset.
2 AI-driven Software Development Tools
This chapter introduces specific LLM-based tools designed for integra-
tion into the software development process, with a primary emphasis
on source code generation. The focus will center on complete tools or
software components, given their comprehensive features and higher
likelihood of adoption by software engineers compared to raw LLMs.
Interfacing
Software engineers commonly use Integrated Development Environ-
ments (IDEs) for writing source code. Therefore, code generation
tools focus on seamless integration with popular IDEs such as Visual
Studio Code, IntelliJ IDEs, Neovim, and others, allowing engineers
to work within their familiar environments. The integration of code
generation tools into IDEs is achieved through plugins. These plugins
link IDE actions, such as key presses and other commands, to requests
sent to servers hosting the LLMs or to servers containing logic that,
in turn, dispatches the requests to LLMs hosted in other locations
such as the cloud or Software as a Service (SaaS) platforms. The im-
plementation of plugins is IDE-specific and relies on the respective
IDE’s software development kit or protocol. For example, IntelliJ IDEA
plugins are predominantly written in Java, Visual Studio Code plugins
in JavaScript, and Neovim plugins in Lua.
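To make the plugin-to-server flow concrete, the following Python sketch shows the kind of request a plugin might dispatch to a completion server; the endpoint URL, payload fields, and response format are illustrative assumptions rather than any specific tool’s API.

import json
from urllib import request

def request_completion(prefix, language, server_url="http://localhost:8080/v1/completions"):
    # Send the code written so far to a (hypothetical) completion server and return its suggestion.
    payload = json.dumps({"prompt": prefix, "language": language, "max_tokens": 64}).encode()
    req = request.Request(server_url, data=payload, headers={"Content-Type": "application/json"})
    with request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read())["completion"]  # assumed response field

# A plugin would call this on a keystroke or an explicit completion command:
# suggestion = request_completion("def is_prime(n):\n    ", "python")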
Some code generation tools, such as ChatGPT, offer a web-based
interface or a web API without official plugins for their integration
into IDEs. In such cases, third-party plugins may exist, facilitating
their integration with IDEs. Moreover, there are cases where code
generation tools come with their own dedicated IDEs.
GitHub Copilot
GitHub Copilot is arguably the most popular AI-based code genera-
tion tool, having over a million paying users and being adopted by
more than 37,000 organizations [21]. According to a survey conducted
by GitHub, developers using Copilot reported completing required
tasks in 55% less time [22], showcasing Copilot’s potential to enhance
software development efficiency.
Copilot is built on the OpenAI Codex LLM and has been actively
developed by GitHub since June 2021 [21]. Supporting a wide array of
programming languages and integration with numerous IDEs, Copilot
provides real-time code suggestions, inline completions, suggests code
based on comments in natural language, and features a prompt-style
interface known as GitHub Copilot Chat, which utilizes prompting
for code generation.
GitHub Copilot is a subscription-based service and is available as
part of GitHub’s paid plans. These plans include Copilot Individual for
individual developers, and Copilot Business and Copilot Enterprise
for organizations. The subscription costs are $10, $19, and $39 per user
per month, respectively [21].
A notable feature of GitHub Copilot is the completions panel,
which enables the synthesis of multiple solutions simultaneously,
allowing users to choose the most suitable one. Copilot offers advanced
configuration options, including the temperature parameter, which
controls the level of creativity in Copilot’s outputs. Users can also
configure the number of synthesized solutions in the completions
panel and view additional information, such as the mean probability
of the suggested code snippet. Additional configuration parameters
include the maximum number of generated tokens and the maximum
number of suggested lines of code.
ChatGPT
Tabnine
Tabnine is a closed-source tool that has been around for many years.
It utilizes a collection of multilingual GPT models, but more detailed
information about the architecture or the use of data sets is not public
[24]. Tabnine offers features such as whole-line code completions, full-
function code completions, and natural language-to-code completions.
Users can choose from three plans: Starter (free), Pro ($12 per user
per month), and Enterprise (quoted individually). When using the
Tabnine Pro plan, the user can choose from local, cloud, or hybrid
model serving modes. An advanced feature of Tabnine is its ability to
be trained on private code repositories.
CodeGeeX
CodeGeeX is an open-source and free tool for source code generation.
The newest utilized LLM is the multilingual CodeGeeX2 [25]. It offers
standard features, support for common programming languages, and integration with com-
mon IDEs. Similarly to Tabnine, CodeGeeX also offers an Enterprise
plan which allows organizations to have the tool fine-tuned on their
code repositories.
Other popular AI-based tools for software development include
Replit GhostWriter [26], Codeium [27], Blackbox [28], and Cursor
[29]. Cursor is somewhat unique in that it provides its own IDE,
which has been designed from scratch around AI-driven software
development.
4. Cost optimization
6. More control over the entire tool and potential further develop-
ment
3 Source Code Quality
Software quality is an essential topic in software development. In
the current, rather agile, development process, ensuring high source
code quality has become a recurring task throughout the software
development life cycle. The assessment can be done whenever new
functionality is added to the code base, which in technical terms
translates to assessing the quality of each pull request. The assessment
is carried out by experienced developers based on measurable
properties of source code, i.e. metrics, or even automatically by
comparing the values of source code metrics against default thresholds
for the project or enterprise-level requirements.
Software quality has been standardized by two international stan-
dards, ISO/IEC 9126 [30] and its successor ISO/IEC 25010:2011 [31]1 .
These standards have played a pivotal role in shaping understanding
and assessment of software quality. ISO/IEC 9126 defines a quality
model having functionality, reliability, usability, efficiency, maintain-
ability, and portability as the main characteristics of software quality.
The successor, standard ISO/IEC 25010:2011, added security and com-
patibility to the model. The quality models are primarily associated
with non-functional requirements which go beyond particular features
of software.
These two standards defined a broad perspective of software qual-
ity distinguishing between internal and external quality metrics. This
work researches the quality of AI-generated code in terms of internal
quality metrics of source code, i.e. static measures. Internal quality
metrics will be described in the following two sections. The first section
is focused on software security, particularly common vulnerabilities
found in source code and how they can be automatically detected
using static analysis of source code.
The second section introduces static metrics for measuring the fol-
lowing source code characteristics: readability, complexity, code smells,
maintainability, technical debt, consistency, modularity, reusability,
testability, and documentation.
1. After the standard ISO/IEC 9126 became obsolete, ISO developed a whole
family of standards called SQuaRE (Software product Quality Requirements and
Evaluation), comprised of standards with identifiers ISO/IEC 250xy.
OWASP
Open Worldwide Application Security Project (OWASP) is a non-
profit organization operating in the field of cyber security and par-
ticularly in web application security. OWASP creates and regularly
updates OWASP Top 10 which is a list of the most critical categories
of vulnerabilities found in web applications2 .
The list is created by gathering data3 from companies and from
conducting surveys among cyber security professionals. Of the 10
categories appearing in the final list, 8 are based on the collected data
and 2 come from the surveys. The collected data are analyzed in terms
of the incidence rate of around 400 unique Common Weakness
Enumerations (CWEs), which are then mapped to the 10 categories.
Category                                        Description
A1-Broken Access Control                        Inadequate access controls that allow unauthorized users to perform actions or access data
A2-Cryptographic Failures                       Exposure of sensitive data due to inadequate protection or weak encryption
A3-Injection                                    Attacker sends malicious data as part of a command or query to exploit vulnerabilities in data parsing and processing
A4-Insecure Design                              Missing or ineffective control design
A5-Security Misconfiguration                    Insecurely configured settings, default accounts, and unnecessary services that may expose vulnerabilities
A6-Vulnerable and Outdated Components           Use of components that are outdated, unsupported, or have known vulnerabilities
A7-Identification and Authentication Failures   Weaknesses in authentication mechanisms, such as credential stuffing, weak password policies, and insecure session management
A8-Software and Data Integrity Failures         Relates to code and infrastructure that does not protect against integrity violations
A9-Security Logging and Monitoring Failures     Inadequate logging and monitoring, making it difficult to detect and respond to security incidents
A10-Server Side Request Forgery                 Injection of malicious URLs from user input and crafted requests
where

$$\mathrm{Fr}(\mathrm{CWE}_X) = \frac{\mathrm{count}(\mathrm{CWE}_X \in \mathrm{NVD}) - \min(\mathrm{Freq})}{\max(\mathrm{Freq}) - \min(\mathrm{Freq})}$$

$$\mathrm{Sv}(\mathrm{CWE}_X) = \frac{\mathrm{average\_CVSS}(\mathrm{CWE}_X) - \min(\mathrm{CVSS})}{\max(\mathrm{CVSS}) - \min(\mathrm{CVSS})}$$

$$\mathrm{Freq} = \{\mathrm{count}(\mathrm{CWE}_{X'} \in \mathrm{NVD}) \text{ for each } \mathrm{CWE}_{X'} \text{ in NVD}\}$$
Since both the OWASP Top 10 and CWE provide a good basis for vul-
nerability analysis but are at different levels of abstraction, The
MITRE Corporation published an official tree-like mapping between
OWASP Top 10 and CWE categories.
Table 3.2: The first 15 entries of the CWE Top 25 2023, sorted from the most
critical vulnerabilities [32].

ID          Description
CWE-787     Out-of-bounds Write
CWE-79      Improper Neutralization of Input During Web Page Generation (’Cross-site Scripting’)
CWE-89      Improper Neutralization of Special Elements used in an SQL Command (’SQL Injection’)
CWE-416     Use After Free
CWE-78      Improper Neutralization of Special Elements used in an OS Command (’OS Command Injection’)
CWE-20      Improper Input Validation
CWE-125     Out-of-bounds Read
CWE-22      Improper Limitation of a Pathname to a Restricted Directory (’Path Traversal’)
CWE-352     Cross-Site Request Forgery (CSRF)
CWE-434     Unrestricted Upload of File with Dangerous Type
CWE-862     Missing Authorization
CWE-476     NULL Pointer Dereference
CWE-287     Improper Authentication
CWE-190     Integer Overflow or Wraparound
CWE-502     Deserialization of Untrusted Data
Vulnerability Analysis
Analysing software code for detection of vulnerabilities is a crucial
aspect of software security. Various methods and techniques are used
to identify vulnerabilities in software code. The most basic one is man-
ual inspection of source code by security experts, which is necessary
in complex cases or when a completely new vulnerability is discovered;
however, this method does not scale and cannot be automated.
Penetration tests detect vulnerabilities by simulating cyberattacks. The
focus of the next section will be on automated static analysis of source
code for detecting vulnerabilities using the CodeQL query language.
CodeQL
CodeQL is a declarative, object-oriented logic programming language
for querying relational data models [34]. CodeQL is a general-purpose
query language with SQL-like syntax and Datalog semantics, but it has
found its application mainly in static analysis of source code5. It is at
the core of the GitHub CodeQL tool, which is used for security analysis
of source code6 [34]. GitHub CodeQL allows software engineers and
researchers to detect vulnerabilities in code bases using a query
language, as if the code base were a database7 and a vulnerability a
piece of information to be retrieved by a query.
GitHub CodeQL uses two main methods of static analysis: data flow
analysis and taint analysis. Data flow analysis tracks the flow of data
through a program. This is achieved by implementing a query for
vulnerable patterns which tracks the data flow between a source and
a sink. A source is an untrusted input of a program that is passed to a
sink, a section of code which executes sensitive operations, for instance
5. The goal of static analysis using CodeQL can encompass various objectives,
such as vulnerability analysis, identification of violations of private enterprise-level
coding rules, detection of code smells, and more.
6. Originally developed by Semmle, which was then acquired by GitHub.
7. The program’s source code is not stored directly in the database, but its rep-
resentation is created by a language-specific extractor and stored in the database.
The representation of source code can be thought of as an abstract syntax tree or a
control flow graph.
database reads. This particular data flow can become a security vul-
nerability causing malicious behavior, such as unauthorized data reads.
If the source and sink are not securely implemented, the query is able
to detect it and the code can be rewritten in a more secure way.
Standard data flow analysis has a limitation: it is able to track only
value-preserving data [34]. To overcome this limitation, the tool
uses a modified form of data flow analysis called taint analysis, which
allows tracking of data that does not preserve its value. During
taint analysis, sources are flagged as tainted, and a CodeQL query
examines whether they propagate to sinks, even in cases where their
original values are not preserved, such as when SQL query parameters
are formatted into an SQL query template.
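As an illustration of the source and sink concepts (the example is added here and is not taken from the thesis data set), the following Python snippet contains a classic tainted data flow from user input into an SQL query, together with the parameterized variant that breaks the vulnerable flow.

import sqlite3

def find_user_insecure(conn, username):
    # Source: untrusted input flows into the query string.
    # Sink: execute() runs attacker-controllable SQL (CWE-89, SQL injection).
    query = f"SELECT id, name FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()

def find_user_secure(conn, username):
    # The tainted value is passed as a bound parameter, so it can no longer
    # change the structure of the SQL statement.
    return conn.execute("SELECT id, name FROM users WHERE name = ?", (username,)).fetchall()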
Listing 3.1: A sample CodeQL query for JavaScript.
/**
 * @name FindUnusedVariables
 * @description Detects unused variables.
 */

import javascript

from Variable v
where v.isLocal() and not exists(v.getAnAccess())
select v, "Unused variable " + v.getName() + "."
8. GitHub CodeQL covers a significant portion of the CWE category system; the
list of supported categories can be found in the official CodeQL documentation.
Basic Metrics
The first category of metrics is comprised of simple language-agnostic
metrics that encompass basic information about a codebase. Monitor-
ing the changes in these metrics as new source code is written gives
a rough idea of how relatively big the changes are compared to the
current code base size and whether they are well documented and
tested.
Metric                            Description
Lines of Code                     Total number of lines in the source code, providing a basic measure of code size.
Number of Logical Lines of Code   Count of lines that contribute to the logic and functionality of the program, excluding comments and blank lines.
Comment Ratio                     Proportion of code lines that are comments, offering insights into code readability and documentation.
Test Coverage                     Percentage of code covered by automated tests, indicating the reliability of the codebase.
The lines of code metric is not a source code quality metric per se, but
comparing it to the number of logical lines of code gives an idea of
how complex the expressions written on a single line are, as overly
complex one-line expressions are difficult to read and understand and
thus affect the readability of the codebase. Similarly, code-level
comments help with understanding complex, often algorithm-heavy,
parts of codebases10. High test coverage gives a certain level of
assurance that all or most of the codebase is covered by automated
tests, which can increase efficiency as there is less need for repeated
manual testing. Many open-source projects reach 100% test coverage,
and many enforce that new functionality must also have 100% test
coverage in order to be merged into the codebase11.
10. In the case of Linux kernel, the comment ratio is around 11.4% [35].
11. An example is the Python FastAPI framework.
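As a rough illustration of the basic metrics above, the following Python sketch (a simple line-based approximation added here; real tools use proper parsers) computes lines of code, an approximate count of logical lines, and the comment ratio for a Python source string.

def basic_metrics(source):
    # Rough line-based approximations of the basic metrics.
    lines = source.splitlines()
    loc = len(lines)
    blank = sum(1 for line in lines if not line.strip())
    comments = sum(1 for line in lines if line.strip().startswith("#"))
    lloc = loc - blank - comments  # logical lines, approximated
    comment_ratio = comments / loc if loc else 0.0
    return {"LOC": loc, "LLOC": lloc, "comment_ratio": round(comment_ratio, 3)}

print(basic_metrics("x = 1\n\n# add one\ny = x + 1\n"))  # {'LOC': 4, 'LLOC': 2, 'comment_ratio': 0.25}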
$$\mathrm{CC} = E - N + 2P$$

Since the execution of programs can be represented using graphs, cyclomatic
complexity can be interpreted as the number of linearly independent
paths in the graph, i.e. the number of possible unique execution paths
in the code. Common programming constructs that increase cyclo-
matic complexity are conditionals, for/while loops, exception blocks,
context managers, boolean operators, assertions, comprehensions,
and others. Therefore, the more of these constructs a program contains,
the more complex it is.
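For illustration (an example added here, counted following the constructs listed above), a short Python function and its cyclomatic complexity:

def classify_grade(score):
    # Cyclomatic complexity 4: one base path plus three decision points.
    if score >= 90:        # +1
        return "A"
    elif score >= 75:      # +1
        return "B"
    elif score >= 60:      # +1
        return "C"
    return "F"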
Definition 3.2.2 (Halstead Metrics [37]). Let there be the following
variables computed from the source code
n1 = the number of distinct operators
n2 = the number of distinct operands
N1 = the total number of operators
N2 = the total number of operands
12. This formula is derived from the first Betti number, a term from algebraic topol-
ogy [36].
$$n = n_1 + n_2$$
$$N = N_1 + N_2$$
$$\hat{N} = n_1 \log_2 n_1 + n_2 \log_2 n_2$$
$$V = N \cdot \log_2(n)$$
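A rough worked example (mine; operator and operand counting conventions differ between tools) for the statement x = a + b + a:

import math

# Distinct operators {=, +} and distinct operands {x, a, b} for: x = a + b + a
n1, n2 = 2, 3
N1, N2 = 3, 4          # total operators and total operands
n, N = n1 + n2, N1 + N2
volume = N * math.log2(n)
estimated_length = n1 * math.log2(n1) + n2 * math.log2(n2)
print(f"n={n}, N={N}, volume={volume:.2f}, estimated length={estimated_length:.2f}")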
• 65 ≤ MI ≤ 85 is moderately maintainable
OOP Metrics
The object-oriented programming (OOP) paradigm and OOP languages,
such as Java or C#, have specifically designed metrics that capture
OOP-specific characteristics.
Name                              Description
Weighted Methods Per Class        The sum of complexities of methods in a class. A common complexity measure is cyclomatic complexity.
Depth of Inheritance Tree         The length of the inheritance path from a class to the root class.
Number of Children                The number of immediate subclasses a class has.
Coupling Between Object Classes   Measures the dependencies between different classes.
Response For a Class              The sum of the number of class methods and the number of unique methods invoked in the class code.
Lack of Cohesion in Methods       Measures the internal similarity between class methods.
Linting Metrics
The last type of quality metrics is less formal and more oriented
toward best practices and coding style. Each programming language has
its own set of principles and rules which govern the quality of source
code. These principles and rules can be validated by linting, which
performs static analysis of source code. Linters can be built on modified
versions of official compilers and interpreters, or as autonomous tools.
Many IDEs, such as IntelliJ IDEA, provide their own implementations
of linting for specific programming languages.
Linters can validate the correctness of types13 in a program, pos-
sibly detecting bugs arising from type errors. Linting can also check
for code smells, unused variables or code blocks, and general programming
errors like invalid syntax and missing libraries or packages. Lastly, code
style linters check the length of lines, formatting, and documentation.
When linters detect a violation of a rule, they report an error or
warning, and some are able to automatically fix the issue14. Since
linters perform automatic static analysis, they are commonly integrated
into processes that determine source code quality during the software
development lifecycle, such as pre-commit hooks or CI pipelines.
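As a small illustration of what such linters report (the function is an example added here), the unused local variable below would be flagged by pyflakes/flake8 under rule F841, and many linters can remove such dead code automatically.

def compute_total(prices):
    total = sum(prices)
    unused_discount = 0.1  # flagged by flake8/pyflakes as F841: local variable assigned but never used
    return total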
Sonar
Sonar is a popular platform for automated analysis of source code
quality owned by SonarSource. The platform provides 3 main tools:
SonarQube, SonarCloud, SonarLint.
SonarQube is a server for hosting software projects and analysing
their source code quality. It is suitable for integration with on-premise
source code version control systems like GitHub Enterprise or Bit-
Bucket Server. SonarQube is also able to integrate 3rd party plugins
for additional static code analysis capabilities. There is a community
edition and 3 paid versions: developer, enterprise, and data center.
SonarCloud is a software as a service tool offering similar features
as SonarQube but users do not have to take care of its hosting and main-
tenance. It is suitable for integration with cloud-hosted source code
version control systems like GitHub or BitBucket. For private projects
the pricing is based on the size of analyzed projects and for public
projects this tool is free. Both SonarQube and SonarCloud provide the
results of an analysis as a dashboard or pull request comments.
SonarLint is a linter for various IDEs providing software engineers
with instantaneous feedback when writing code. It comes with a de-
fault set of rules which can be configured based on preferences and
even new rules can be created. SonarLint is completely free and does
not offer any additional paid features.
Sonar provides source code quality metrics for the following char-
acteristics:
• Code Complexity: cyclomatic and cognitive
• Code Duplications: duplicated blocks and files
• Code Maintainability: code smells and the SQALE model
• Code Reliability: number of bugs
• Code Security: coverage of CWE and OWASP
• Code Size: lines of code, classes, number of comment lines
• Tests: coverage and number of tests
• Project Issues: number of opened issues
4 Related Works
With the rise of AI-based tools for generating source code and their
adoption into software development, scientific efforts have been made
to evaluate these tools from multiple perspectives, including secu-
rity, vulnerabilities, risks, correctness, code quality, prompting style,
developer efficiency, and others.
Data sets
The work on AI-generated code often suffers from a lack of available
data. While there are many data sets of human-written code, data sets
for AI-generated code are scarce. Typically, academic works create
their own data sets or reuse the few that are available. [44] created a
prompt data set called SecurityEval, consisting of 130 prompts catego-
rized into 75 distinct CWEs. Alongside the prompts, the data set also
contains generated code for the 130 prompts from GitHub Copilot
Practical Guidelines
In the field of AI code generation, works focused on providing practical
guidelines often emphasize prompting styles and prompt catalogues.
[49] created a catalogue of prompt design techniques and principles
similar to software engineering patterns, which can be used to solve
common problems and tasks. The prompts were tested with ChatGPT
[49]. The effect of the style of prompting was also evaluated in [41],
demonstrating that prompt style can influence the vulnerability of the
generated code.
5 Experimental Design
The experimental section of the thesis is divided into two chapters.
The first chapter details the design process of the experiments, includ-
ing the structure, types of data used, data collection methods, and
the goals along with related research questions. The second chapter
focuses on the evaluation environment and presents an analysis of the
experiment results.
The experiments follow a hierarchy established by the Goal Ques-
tion Metric approach [1]. Two overarching goals guide the experimen-
tation process: a primary goal and a side goal. The primary goal is to
determine the best tool for AI-driven source code generation in terms
of the quality of the produced code, choosing from GitHub Copilot,
ChatGPT, Tabnine, and CodeGeeX. These particular tools were se-
lected due to their large user bases and the fact that they encompass
both proprietary options (GitHub Copilot, ChatGPT, Tabnine) and an
open-source alternative (CodeGeeX). To achieve this goal, a set of re-
search questions has been formulated, with their granularity centered
around domains, programming languages, and scenarios.
The specific choices in this experimental design are determined
by their relevance to the given research question and will be further
discussed in the rest of the chapter. The objective is to ensure the ut-
most consistency in these choices, enabling a meaningful comparison
of results across research questions and extracting conclusive insights
and lessons learned from the experiments.
Each research question is associated with a set of metrics. Research
questions can share metrics or have their own, tailored specifically to
the question’s domain. The metrics are primarily computed from static
analysis tools and code quality platforms, but some are calculated
manually due to the absence of a suitable tool or platform.
The second goal is to establish a custom code generation tool based
on open-source LLMs. The aim of this goal is to provide practical
guidelines for creating one’s own code generation tool, recognizing
that existing tools might not be suitable for various reasons. This goal
is not associated with specific research questions and metrics that
can be computed and answered. Instead, its purpose is to produce a
practical guideline.
python
    copilot ···
        cwe_020 ···
The rest of the data set (approx. 9%) is dedicated to OS shell scripting,
divided into Bash scripts for Linux and UNIX systems and PowerShell
scripts for Windows. The whole data set consists of 860 autonomous
programs. A total of 464 programs (54%) are generated in interpreted
programming languages (Python, JavaScript, Bash, PowerShell), while
396 programs (46%) are generated in compiled languages (C and C#).
The data set is unbalanced in terms of the selected programming
languages. This is due to two main reasons. First, each language has a
For each program, the generated code is placed into a single file. Each
file begins with a prompt header, which separates the prompt (which
can also involve code) from the generated code by AI tools.
#define MAX_USERNAME_LENGTH 256

int main() {
    char username[MAX_USERNAME_LENGTH];
Goal 1
Which of the tools GitHub Copilot, ChatGPT, Tabnine, and
CodeGeeX produces source code with the highest quality?
To achieve the primary Goal 1, there are eight research questions that
aim to assess the quality and security of the source code generated
by AI-based code generation tools. These questions are categorized
into two different domains: five are concerned with security and vul-
nerabilities, and three with code quality. The research questions are
formulated in the same format as the goal, i.e., ’Which tool achieves
the best metrics?’.
To evaluate the research questions consistently and fairly, a scoring
system is utilized. The metrics’ values for each tool determine the
compound score obtained in the given research question. These scores
are then used to answer the research question, as the tool with the
highest local score is considered to have achieved the best metrics. In
other words, that particular tool is the answer to the research question.
The scoring system is not only applied locally for each research
question but also contributes to the global primary goal. Each tool
accumulates its score across all eight research questions. The final
accumulated score determines which tool produces source code with
the highest quality among the selected tools.
JavaScript, contextualized with the CWE Top 25 list from 2023. Re-
search question 5 concentrates on detecting the presence of secrets
within the generated code across all six languages using the Gitleaks1
tool and SonarCloud.
RQ1
Which tool generates the least vulnerable Python source code
based on metrics M1, M2, M3, and M4?
• M1 Number of vulnerabilities
1. https://github.com/gitleaks/gitleaks.
2. The generation of Python programs was not targeted at a specific CPython
version, but using any reasonably modern one (3.7 and above) should work (tested
with 3.7 and 3.11).
RQ2
Which tool generates the least vulnerable C source code based
on metrics M1, M2, M3, and M4?
RQ3
Which tool generates the least vulnerable C# source code based
on metrics M1, M2, M3, and M4?
ated using two project templates: ASP.NET Core Empty and Console
App. The choice of a particular template was determined based on
the program’s context; for instance, if it included a REST API, then
ASP.NET Core Empty was chosen, and if the program was a simple
console application, then Console App was selected.
For ASP.NET Core Empty, the generated code was placed into a
single file, TestController.cs, with a few exceptions where the code
was placed into Program.cs, since the code was intended for the con-
figuration of the application. For Console App, the generated code was
always placed into Program.cs. Both templates include a test.csproj
configuration file for program compilation.
RQ4
Which tool generates the least vulnerable JavaScript source code
based on metrics M1, M2, M3, and M4?
RQ5
Which tool generates the least amount of programs with secrets
based on metrics M5 and M6?
This research question involves the entire data set, comprising 860 pro-
gram samples, as secrets could be present in any generated program.
The analysis will be conducted using a combination of Gitleaks static
The last three research questions evaluate source code quality metrics
defined in Section 3.2. Research question 6 analyzes the number of
valid generated programs, i.e., programs that are compilable or inter-
pretable. This is particularly crucial, as not all generated programs are
valid. Therefore, tools that more frequently generate invalid programs
could have an advantage, as the GitHub CodeQL tool will fail to
analyze these programs. Consequently, the detected number of
vulnerabilities could be lower.
Research question 7 compares the tools based on source code qual-
ity metrics from the SonarCloud platform, evaluating programs’ main-
tainability, readability, and complexity. Research Question 8 focuses
on the tools’ ability to generate correct OS shell scripts.
RQ6
Which tool generates the most valid programs based on metrics
M7 and M8?
The entire data set will be used for this research question. For Python,
C, C#, and JavaScript, the evaluation will utilize GitHub CodeQL,
which fails the analysis if the provided program is not interpretable or
compilable. In the case of Bash and PowerShell, the respective shells
will be used to determine the validity of the scripts.
RQ7
Which tool generates source code with the highest quality based
on metrics M9, M10, M11, M12, and M13?
• M9 Cyclomatic complexity
• M12 Bugs
Source code quality metrics for the entire data set will be computed
using the SonarCloud platform, which defines specific rules for each
programming language.
RQ8
Which tool produces the most correct Bash and PowerShell OS
scripts based on metrics M14 and M15?
The programs are stored in a single file, caseN.sh for Bash and simi-
larly caseN.ps1 for PowerShell, where N represents the case number
ranging from 1 to 10.
Goal 2
Deploy a pre-trained open-source LLM for code generation and
connect it to Visual Studio Code and IntelliJ IDEs using available
plugins.
6 Experimental Evaluation
The results of the experiments are derived from various sources. The
first source consists of CSV files containing vulnerabilities detected
by GitHub CodeQL. These CSV files are stored in the same folder
as the program analyzed, and the results are encapsulated within
the respective CSV file. Subsequently, the CSV files undergo further
processing in Jupyter notebooks.
Another source involves a SonarCloud project that hosts the data
set of generated programs. However, SonarCloud proves less suitable
for automated processing of metrics and analysis results. Consequently,
values were manually extracted from the SonarCloud website and
compiled into Excel files. These files were then subject to additional
aggregation and processing.
Aspect                   Description
Hardware                 DELL E7420 laptop (i7, 3 GHz, 16 GB RAM), MS SurfaceBook (i5, 2.4 GHz, 8 GB RAM)
Operating Systems        Windows 10 Enterprise, Ubuntu 22.04 LTS in WSL
Programming Languages    Python 3.7/3.11, C (GCC 11.4), .NET 7.0, JavaScript, Bash, PowerShell
Software                 CodeQL CLI 2.15.3, SonarCloud, Gitleaks v8.18.1
AI Tools                 GitHub Copilot Individual, Tabnine Pro, ChatGPT 3.5, CodeGeeX2
IDE                      Visual Studio Code 1.84
Data                     Part of the work archive
Source Code              Part of the work archive
Reproducibility          Described in the README in the work archive
6.1 Goal 1
The scoring algorithm is designed to rank code generation tools based
on their performance in a defined set of metrics:
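The exact point allocation is defined in the remainder of the original text; purely as a hedged illustration of one plausible rank-based scheme (an assumption, not necessarily the thesis’s actual algorithm), tools could be ranked per metric and awarded points that accumulate into a total score:

def rank_based_scores(metric_values, points=(4, 3, 2, 1)):
    # metric_values: {metric_name: {tool_name: value}}, lower value assumed better.
    # The point scale per metric is an illustrative assumption; ties are ignored here.
    totals = {}
    for values in metric_values.values():
        ranked = sorted(values, key=values.get)
        for tool, pts in zip(ranked, points):
            totals[tool] = totals.get(tool, 0) + pts
    return totals

example = {
    "vulnerabilities": {"Copilot": 3, "ChatGPT": 5, "Tabnine": 7, "CodeGeeX": 4},
    "code_smells": {"Copilot": 10, "ChatGPT": 12, "Tabnine": 18, "CodeGeeX": 11},
}
print(rank_based_scores(example))  # e.g. {'Copilot': 8, 'CodeGeeX': 6, 'ChatGPT': 4, 'Tabnine': 2}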
The scoring is based on all six available metrics, two from CodeQL
and four from SonarCloud. GitHub Copilot performed the best across
these six metrics, followed by ChatGPT, CodeGeeX, and Tabnine in de-
scending order. In terms of absolute scores, GitHub Copilot obtained
21 points, ChatGPT 17 points, CodeGeeX 16 points, and Tabnine 8
points.
The scoring is based on all six available metrics, two from CodeQL
and four from SonarCloud. CodeGeeX obtained 23 points, GitHub
Copilot 17 points, Tabnine 16 points, and ChatGPT 10 points.
The data set was scanned for hard-coded secrets with Gitleaks and SonarCloud. The secrets detected by Gitleaks were exported into a CSV file, which is included in the work's archive. Secrets detected by SonarCloud were primarily categorized as security hotspots, with only a few cases classified as vulnerabilities.1
The tools performed well in this research question, with no tool exceeding a 6% share of programs containing a secret.
This outcome aligns with the findings in [46], which detected sensitive
information in 8% of the outputs from the OpenAI Codex model
utilized by GitHub Copilot. The secrets identified by Gitleaks and
SonarCloud were mostly in the form of plain text dummy passwords.
Throughout the code generation process, the models frequently issued
warnings against defining secrets directly in the code.
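For reference, a scan of this kind can be reproduced with the Gitleaks v8 CLI; the following sketch wraps such a scan in Python. The scanned folder, the report path, and the column used for the summary are assumptions and may differ from the exact setup of this work.

# gitleaks_scan_sketch.py - illustrative secret scan of the generated programs.
import subprocess
import pandas as pd

subprocess.run(
    [
        "gitleaks", "detect",
        "--source", "vulnerability_analysis",    # folder with the generated programs
        "--no-git",                               # scan plain files instead of git history
        "--report-format", "csv",
        "--report-path", "gitleaks_report.csv",
    ],
    check=False,  # Gitleaks exits with a non-zero code when leaks are found
)

# The 'RuleID' column name is assumed from the Gitleaks CSV report format.
report = pd.read_csv("gitleaks_report.csv")
print(report["RuleID"].value_counts())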
Table 6.10: Gitleaks secrets metrics for the entire data set.
Table 6.11: SonarCloud secrets metrics for the entire data set.
1. All instances involved plain text database connection strings in the code.
The scoring was based on the M5 metric from both Gitleaks and
SonarCloud. GitHub Copilot, Tabnine, and CodeGeeX each received 6
points, as their metric combinations were identical. ChatGPT received
3 points.
◦ RQ7. Which tool generates the source code with the highest
quality based on metrics M9, M10, M11, M12, and M13?
This research question was evaluated based on the metrics computed by SonarCloud. The overall results indicate a low quality of the generated programs, with ratings consistently at E3, as illustrated in Figure 6.1. A notable aspect of the results is the technical debt ratio, which, for all languages and tools, never exceeded 2%4. This is a significant achievement, as technical debt, commonly computed with the SQALE model, serves as a crucial indicator of a project's maintainability. At the level of individual cases, however, there are outliers with high technical debt, specifically C# CWE-352 in the CWE context scenario (16.7%), JavaScript Tabnine CWE-20 in the SecurityEval scenario (15.7%), and C ChatGPT CWE-20 in the SecurityEval scenario (11.4%).
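For context, and assuming SonarCloud's default SQALE configuration, the technical debt ratio is computed approximately as
\[
\text{technical debt ratio} = \frac{\text{remediation cost}}{\text{development cost}}
= \frac{\text{remediation cost}}{\text{cost per line of code} \times \text{lines of code}},
\]
so the default A rating limit of 5% corresponds to a remediation effort of at most roughly 5% of the estimated effort of writing the code from scratch.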
C ranked as the most complex language based on the values of cyclomatic and cognitive complexity (M9 and M10). In terms of bug count, Python and JavaScript exhibited the fewest bugs (even zero in multiple cases for all tools), while C# had the most.
Table 6.13: SonarCloud source code quality metrics of generated pro-
grams part 1/2.
3. The ratings are quite strict, though. For instance, for bugs a program receives an E rating if it contains even a single blocker bug.
4. The default A rating limit is 5%.
GitHub Copilot generated code with the highest quality and ChatGPT with the lowest. The scoring was based on all five metrics, with GitHub Copilot earning 16 points, CodeGeeX 13 points, Tabnine 12 points, and ChatGPT 9 points.
◦ RQ8. Which tool produces the most correct Bash and PowerShell
OS scripts based on metrics M14 and M15?
The scripts underwent manual inspection and testing in a controlled environment, and the findings are included in the work archive. In general, each tool exhibited at least one error. The most common issues were related to incorrect usage of script parameters and inaccurate integer thresholds in the generated CPU monitoring scripts.
Summary
Table 6.16: Resulting scores of the tools across all research questions in
the primary goal. The best-performing tools in each research question
are highlighted in blue.
6.2 Goal 2
The Tabby framework [50] is utilized for developing the custom code
generation tool. The main steps of the process involve running a Tabby
server in a Docker container and then configuring the Tabby IDE
plugin to connect to the URL of the Tabby server. The server can be
initiated locally using an official Docker image from Tabby:
docker run -it --gpus all -p 8080:8080 -v $HOME/.tabby:/data \
  tabbyml/tabby serve --model TabbyML/StarCoder-1B --device cuda
For both the Visual Studio Code and IntelliJ IDEs, the Tabby plugin can be installed from their marketplaces. With the default configuration, the Tabby server runs on port 8080 [50]. In this case, the URL of the Tabby server configured in the plugin settings would be http://localhost:8080.
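Before pointing the plugin at the server, it is worth verifying that the server responds. A minimal check could query the same health endpoint used later for the cloud deployment; nothing is assumed here about the shape of the response, only that the endpoint answers.

# tabby_health_check_sketch.py - quick sanity check of a locally running Tabby server.
import requests

response = requests.get("http://localhost:8080/v1/health", timeout=10)
response.raise_for_status()
print(response.status_code, response.text[:200])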
Cloud Deployment
Since LLMs require extensive computational resources, local deployment is often not feasible. Fortunately, the Tabby server can easily be run in a Docker container, which makes deployment to cloud environments straightforward. The Tabby documentation provides step-by-step instructions for deploying the solution to the serverless cloud platform Modal5. The serverless on-demand deployment model charges only for the RAM and CPU/GPU time actually used.
To deploy a Tabby server using Modal, Python 3.7 or later is required. The initial step involves installing the modal Python library:
pip install modal
After completing the Modal setup, the next step involves using a deployment script. Tabby conveniently provides a suitable default script. This script utilizes the Nvidia T4 GPU, the cheapest GPU option offered by Modal, which is sufficient for the performance requirements of the StarCoder 1B LLM.
To execute the deployment script locally, run:
modal serve deployment_script.py
Once the app is running, the deployment can be verified by querying the health endpoint of the Tabby server:
curl --location \
  'https://<username>--tabby-server-starcoder-1b-app-dev.modal.run/v1/health'
5. https://modal.com/
The final step is to configure the IDE plugin to use the Tabby server de-
ployed on Modal. This can be achieved in the ’API: Endpoint’ settings
of the plugin in Visual Studio Code.
For the IntelliJ IDE, the process is identical: install the Tabby plugin from the JetBrains marketplace and configure it to send requests to the URL of the Tabby server hosted on Modal. Once the configuration for either IDE is completed, the requests from the IDE to the Tabby server can be monitored in Modal's app dashboard.
7 Conclusion
This work introduced deep learning models for code generation, in-
cluding the essential Transformer architecture and LLMs built around
the concept of attention mechanisms. Several state-of-the-art LLMs
were described, establishing the theoretical foundations. The follow-
ing chapter introduced popular AI-based code generation tools, such
as GitHub Copilot, ChatGPT, and CodeGeeX, and explored how they
can be integrated into IDEs. The work also delved into the concepts of
software and source code quality, along with their associated metrics.
A significant focus was placed on vulnerabilities, including the CWE
Top 25 list, CodeQL, and SonarCloud.
As part of the experiments, a multilingual data set consisting of 860 standalone programs with approximately 18,000 lines of code was collected. This data set includes programs written in Python, C, C#, JavaScript, Bash, and PowerShell. The programs are based on the SecurityEval dataset [44], CWE definitions, and the GitHub CodeQL documentation.
The experiments were designed to achieve two primary goals, ad-
dress eight research questions, and evaluate 15 metrics. The main ob-
jective was to identify the best AI-based code generation tool through
a scoring system that considered values from various metrics. GitHub
Copilot emerged as the top-performing tool, scoring 107 points, surpassing the second-best tool, CodeGeeX, which accumulated 96 points.
Each research question provided insight into AI-driven source code generation. For example, on average, 40.2% of the Python programs generated by the tools contained at least one vulnerability. This finding aligns with the results reported in related works [41, 42, 43]. Evidently, no game-changing modifications that would substantially alter performance in either direction have been made to the tools or their underlying LLMs since those works were published.
Future Work
Future work can take diverse paths; one promising direction involves further expanding and curating the data set of generated programs, ulti-
1. https://stackoverflow.blog/2023/06/07/self-healing-code-is-the-future-of-
software-development/
Bibliography
1. BASILI, Victor R.; WEISS, David M. A Methodology for Collecting Valid Software Engineering Data. IEEE Transactions on Software Engineering. 1984, vol. SE-10, no. 6, pp. 728–738. Available from doi: 10.1109/TSE.1984.5010301.
2. HINDLE, Abram et al. On the naturalness of software. In: 2012 34th International Conference on Software Engineering (ICSE). 2012, pp. 837–847. Available from doi: 10.1109/ICSE.2012.6227135.
3. GABEL, Mark; SU, Zhendong. A Study of the Uniqueness of Source Code. In: Santa Fe, New Mexico, USA: Association for Computing Machinery, 2010, pp. 147–156. FSE '10. ISBN 9781605587912. Available from doi: 10.1145/1882291.1882315.
4. VASWANI, Ashish et al. Attention is all you need. Advances in neural information processing systems. 2017, vol. 30.
5. ZHAO, Wayne Xin et al. A Survey of Large Language Models. arXiv preprint arXiv:2303.18223. 2023. Available also from: http://arxiv.org/abs/2303.18223.
6. TOUVRON, Hugo; MARTIN, Louis; STONE, Kevin, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. 2023. Available from arXiv: 2307.09288 [cs.CL].
7. TOUVRON, Hugo et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971. 2023.
8. KAPLAN, Jared et al. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361. 2020.
9. BROWN, Tom et al. Language models are few-shot learners. Advances in neural information processing systems. 2020, vol. 33, pp. 1877–1901.
10. OPENAI. GPT-4 Technical Report. ArXiv. 2023, vol. abs/2303.08774. Available also from: https://api.semanticscholar.org/CorpusID:257532815.
11. RADFORD, Alec; NARASIMHAN, Karthik; SALIMANS, Tim; SUTSKEVER, Ilya, et al. Improving language understanding by generative pre-training. 2018.
12. RADFORD, Alec et al. Language models are unsupervised multitask learners. OpenAI blog. 2019, vol. 1, no. 8, p. 9.
A Attached Source Code
This supplementary material contains a description of the attached
source code files.
root
  analysis → Python helper scripts for the experiments
  llm_deployment → Python scripts for the deployment of an LLM
  results → folder with Excel files containing the experimental results
  templates → folder with template files for generating programs
  vulnerability_analysis → the data set with the generated programs
  README.md → file with other detailed information