
FACE: a Framework for AI-driven Code Generation Evaluation∗

Abstract
Previous work on evaluating code generation solutions is limited to
static test cases due to the difficulty of manually acquiring test data. This
paper presents a framework that enables the automated study of various
code generation solutions using the entirety of an online competitive
programming platform. To evaluate the capability of this framework, we
exhaustively tested solutions generated by ChatGPT and Gemini for
all programming questions on this platform. The resulting statistical and
textual analysis highlights the differences between these two platforms and
demonstrates the contribution of this framework in enabling researchers
to collect and analyze a massive amount of data.

1 Introduction
The recent popularization of AI-enabled tools that are based on large language
models has motivated researchers to investigate the long-term impacts and
implications of these tools on the labor market. One working paper, already
highly cited, finds that information processing industries will be exposed to
“high economic impact without distinguishing between labor-augmenting or
labor-displacing effects” [6]. From the perspective of software development,
there are AI-enabled platforms that can generate code from a problem
statement. These platforms raise the possibility of partially or completely
replacing software developers. A search on Google Scholar with the phrase
ai code generation yielded more than 18,000 results in 2023 alone, with the
first ten pages of results consisting mostly of work published in IEEE/ACM
peer-reviewed conferences and journals.

∗ Copyright ©2021 by the Consortium for Computing Sciences in Colleges. Permission to
copy without fee all or part of this material is granted provided that the copies are not made
or distributed for direct commercial advantage, the CCSC copyright notice and the title of
the publication and its date appear, and notice is given that copying is by permission of the
Consortium for Computing Sciences in Colleges. To copy otherwise, or to republish, requires
a fee and/or specific permission.

One of the challenges in studying AI-enabled code generation platforms is how
to test the quality of the generated code. The existing literature focuses on
utilizing ready-to-use problem statements and test cases collected by mining
publicly available data. This approach provides a massive number of problem
statements but is limited to public test cases only. Hidden test cases, such as
those used by LeetCode or Kattis, are not accessible for research purposes.
In this work, we present an approach to help alleviate the above problem
by designing and implementing a framework for AI-driven code generation
evaluation (FACE). Without direct access to all test cases, FACE can still
leverage the online testing platforms’ API interfaces to capture the evaluation
of AI-generated code, thus enabling the investigation of code generation. FACE
also allows easy modification and interfacing with multiple AI platforms.
The rest of the paper is organized as follows. Section 2 discusses the design and
implementation of the FACE framework. Section 3 studies the effectiveness of
the framework through a case study of two platforms, OpenAI’s ChatGPT
(GPT) and Google’s Gemini (Gemini). A summary of other static test datasets
is presented in Section 4. Section 5 concludes the paper and discusses future
work.

2 Framework Architecture
2.1 Architecture
We design the system to automate the process of collecting thousands of
programming problems from a coding challenge platform, querying an AI
platform to generate solutions, and finally submitting the solutions back to the
original platform for evaluation. Each programming problem typically consists
of a problem description, sample input, and output. The result includes the
status of the submission, whether the solution is accepted or rejected, and the
reason for rejection. Figure 1 illustrates the overall architecture of our proposed
framework, comprising three primary components: the Miner, the Generator,
and the Submitter. Each component plays an important role in the seamless
functioning of the system.

Figure 1: Data Collection Framework
The Miner is responsible for extracting the text of individual coding problems
from the coding challenge platform. It gathers essential information, including
detailed problem descriptions, corresponding test cases, and other requirements.
This information serves as the foundational input for the subsequent stages of
the framework. The Generator leverages the information procured by the Miner
and queries an AI-enabled platform to generate potential solutions to the coding
problem. By interpreting the problem descriptions and test cases, the AI-enabled
platform generates a solution that aims to meet the specified requirements and
constraints of each problem. The Submitter takes over the process of submitting
the solution back to the coding challenge platform for evaluation. This component
ensures that each solution is assessed by the original platform, with the results
being saved in text files, creating a repository of results that can be analyzed
and evaluated to refine the system further.
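
To make the division of responsibilities concrete, the minimal sketch below shows
one way the three components could be wired together. The class and method
names (fetch_problems, generate, submit, save_result) are illustrative placeholders,
not the exact interfaces of our prototype.

# Illustrative skeleton of the FACE pipeline; names are placeholders, not the
# prototype's exact interfaces.
from dataclasses import dataclass

@dataclass
class Problem:
    name: str
    difficulty: float
    description: str
    sample_tests: str

def run_face(miner, generator, submitter):
    """Mine problems, generate solutions, and submit them for evaluation."""
    for problem in miner.fetch_problems():            # Miner: extract problem text
        solution = generator.generate(problem)        # Generator: query an AI platform
        result = submitter.submit(problem, solution)  # Submitter: Kattis evaluation
        submitter.save_result(problem.name, result)   # store the outcome for analysis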

2.2 Implementation
The 8-stage workflow of FACE is visualized in Figure 1. In our prototype, the
Miner obtains coding problems listed on Kattis1, a popular coding challenge
platform. First, the Miner extracts a list of problem names and constructs a
URL for each problem. In our implementation, we employ the autokattis2
Python library to request (Stage 1) an HTML page for each problem. Then,
the problem name, difficulty score, difficulty level, problem description, and
sample test cases are extracted from the HTML page. All data associated with
each problem is stored (Stage 2) in text files in a single folder.

1 https://open.kattis.com/problems
2 https://www.piwheels.org/project/autokattis/
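
As a rough illustration of Stages 1 and 2 (our prototype delegates the actual
scraping to autokattis, whose interface may differ), the sketch below downloads a
problem page and stores its text locally; the output path and the simplified
single-field extraction are assumptions.

# Sketch of Stages 1-2: fetch a Kattis problem page and store its text locally.
# This approximates what autokattis does for us; paths and fields are assumptions.
import pathlib
import requests
from bs4 import BeautifulSoup

def mine_problem(problem_id, out_dir="problems"):
    url = f"https://open.kattis.com/problems/{problem_id}"
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    page_text = soup.get_text(" ", strip=True)  # simplified: full page text only
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    (out / f"{problem_id}.txt").write_text(page_text)
    return page_text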
In Stage 3, the Generator reads the information obtained by the Miner and
begins constructing an AI prompt for each problem. Each prompt includes
a leading question that describes the objective of the prompt, along with the
problem description and the test cases. Table 1 shows an example of a coding
problem from Kattis.

Metadata:
  Problem: different (https://open.kattis.com/problems/different)
  Difficulty: 2.8 (Medium)

Description:
  Write a program that computes the difference between non-negative integers.
  Input: Each line of the input consists of a pair of integers. Each integer
  is between 0 and 10^15 (inclusive). The input is terminated by end of file.
  Output: For each pair of integers in the input, output one line, containing
  the absolute value of their difference.

Sample Input 1:
  10 12
  71293781758123 72784
  1 12345677654321

Sample Output 1:
  2
  71293781685339
  12345677654320

Table 1: An example of a coding problem from Kattis

An example of a leading question is: Write a python program for the following
problem and make sure that the variables’ names and functions’ names are
different, and also, only use internal Python libraries, not external Python
libraries. These constraints are added to ensure that the solutions generated
by the AI platforms (ChatGPT or Gemini) comply with Kattis’ judging system,
which only allows the use of internal Python libraries. In Stage 4, the Generator
uses the OpenAI or Gemini APIs to send the constructed prompt to the selected
platform, which generates a solution. A copy of the generated solution is saved
(Stage 5) locally in a text file before the Submitter takes over and submits it
to Kattis’ judging system for evaluation.
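
A minimal sketch of Stages 3-5 is given below. The leading question is the one
quoted above; the model name, file layout, and the omission of rate limiting and
error handling (see Section 2.3) are simplifications for illustration. The Gemini
path is analogous, using Google's generative AI client instead of the OpenAI one.

# Sketch of Stages 3-5: build the prompt, query ChatGPT, and save the solution.
# Model name and file paths are assumptions; delays and retries are omitted.
import pathlib
from openai import OpenAI

LEADING_QUESTION = (
    "Write a python program for the following problem and make sure that the "
    "variables' names and functions' names are different, and also, only use "
    "internal Python libraries, not external Python libraries."
)

def build_prompt(description, sample_tests):
    return f"{LEADING_QUESTION}\n\n{description}\n\nSample test cases:\n{sample_tests}"

def generate_solution(description, sample_tests, out_file):
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": build_prompt(description, sample_tests)}],
    )
    code = response.choices[0].message.content
    pathlib.Path(out_file).write_text(code)  # Stage 5: keep a local copy
    return code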
Once the Python solutions generated by ChatGPT/Gemini are obtained, the
Submitter reads (Stage 6) the solutions from local storage and submits (Stage
7) them back to Kattis’ judging system for evaluation. Based on the number
of passed test cases, Kattis’ judging system labels the result as either Accepted
(AC), Wrong Answer (WA), Runtime Error (RTE), Time Limit Exceeded (TLE),
or Memory Limit Exceeded (MLE). We discuss the meaning of these five statuses
in the next section. The result is stored (Stage 8) in a text file for further
evaluation. Table 2 shows an example of the result from a Kattis submission
through the Submitter.

Result
Submission received. Submission ID: 13469967.
Submission URL: https://open.kattis.com/submissions/13469967
New...New...New...New...New...New...New...New...
Test cases: [..] 2 / 2

Table 2: An example of a result file from a Kattis submission
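
Assuming the captured result text follows the format shown in Table 2 and
contains one of the five status labels, a small parser such as the hypothetical
one below can recover the status and the pass counts.

# Sketch of parsing a captured Kattis result file into (status, passed, total).
# The status names and the "Test cases: x / y" pattern are assumptions based on
# the result texts we observed, not an official Kattis API.
import re

STATUSES = ["Accepted", "Wrong Answer", "Run Time Error",
            "Time Limit Exceeded", "Memory Limit Exceeded"]

def parse_result(text):
    status = next((s for s in STATUSES if s in text), "Unknown")
    match = re.search(r"Test cases:.*?(\d+)\s*/\s*(\d+)", text, re.DOTALL)
    passed, total = (int(match.group(1)), int(match.group(2))) if match else (0, 0)
    return status, passed, total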

2.3 Technical Discussion


For coding challenge platforms with a large number of problems (e.g., 3,323
coding problems for Kattis), it is important that FACE does not accidentally
flood the platform with a large number of requests. In addition to the rate
limits of ChatGPT and Gemini, various timing delays were included in Stages
1, 4, and 7. We find that random delays of between 60 and 100 seconds are
adequate for all external platforms.
The addition of timing delays significantly increases the time it takes to
generate all solutions. Throughout this process, we encounter various errors,
such as losing the network connection and reaching the monthly quota limits
on ChatGPT/Gemini. To make our framework fault-tolerant, we develop a
checkpoint component that allows the Generator to restart where it failed. To
future-proof FACE, we also decide not to overwrite generated solutions but to
save multiple versions across different runs. This leads to a change in the
Submitter, which identifies and loads only the latest solution. This will let us
later expand FACE to support prompt refinement.
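
The checkpointing and versioning behaviour can be approximated with a simple
file-naming convention, as in the sketch below; the <problem>_<run>.py naming
and the solutions/ directory are hypothetical, not the exact layout used by FACE.

# Sketch of checkpointing and versioning with per-run files "<problem>_<run>.py".
# The naming convention and directory are assumptions for illustration.
import pathlib

SOLUTION_DIR = pathlib.Path("solutions")

def already_generated(problem, run):
    """Checkpoint test: lets the Generator skip problems finished before a crash."""
    return (SOLUTION_DIR / f"{problem}_{run}.py").exists()

def save_solution(problem, run, code):
    SOLUTION_DIR.mkdir(exist_ok=True)
    (SOLUTION_DIR / f"{problem}_{run}.py").write_text(code)

def latest_solution(problem):
    """Submitter side: load only the most recent version across runs."""
    versions = sorted(SOLUTION_DIR.glob(f"{problem}_*.py"),
                      key=lambda p: int(p.stem.rsplit("_", 1)[1]))
    return versions[-1].read_text() if versions else None

The random delays described above could be implemented with a call such as
time.sleep(random.uniform(60, 100)) before each external request in Stages 1, 4,
and 7.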

3 Case Study: Analyzing ChatGPT and Gemini


In this section, we use FACE to study the solution generation process using two
separate AI platforms: ChatGPT and Google Gemini. The generated solutions
are submitted to Kattis and the evaluation results are collected and validated.

3.1 Programming Problem Description


Our study examines 3,323 programming contest problems from Kattis. After
the AI-generated solutions are submitted to Kattis via its Python API, FACE
captures the returned text, which contains information about how many test
cases passed and what the final status is. These results, together with the input
information, form the core feature set to be analyzed. The features can be
categorized into two groups: one containing attributes of the programming
problems (Problem, Difficulty, and Description) and the other containing
attributes of the resulting evaluations (Status, Pass, and Total).
For each programming problem, Problem provides a unique problem name,
which is used to generate a direct URL to the Kattis problem. Difficulty is a
number representing Kattis’ difficulty ranking for the problem. Kattis’ difficulty
values are neither fixed nor manually assigned, but calculated based on the
ratio of successful solutions to failed attempts. Problems that are solved by
many users and have few failed attempts have a lower difficulty score. Problems
that are frequently tried but have more failed submissions have a higher difficulty
score. The lowest difficulty score for Kattis problems is 1.1, and the hardest
problem has a difficulty of 9.7. Description contains the problem’s text
description, which includes all requirements and information needed to solve
the problem.
The resulting text captured through Kattis’ Python API allows the framework
to extract the final evaluation statuses, which include Accepted (AC: the
submitted solution passed all tests), Wrong Answer (WA: the solution ran but
could not pass all tests; failure to pass a test could mean either incorrect results
or an incorrect output format), Run Time Error (RTE: the solution crashed
and could not produce a result), Time Limit Exceeded (TLE: the solution took
too long to run), and Memory Limit Exceeded (MLE: the solution required
more memory than allowed by the problem statement). Pass indicates how
many tests were successfully passed by the submitted solution prior to failure,
and Total indicates how many tests there are in total for the problem.
We are able to collect 3,323 solutions to unique Kattis problems from ChatGPT
but only 2,139 from Gemini. The number of generated solutions from Gemini
is limited by the daily rate limit of our Gemini account. The two platforms
share 1,981 unique problems. Table 3 presents a breakdown of the status counts
across these common problems. The value in cell (i, j) of the matrix represents
the number of problems with solution status i for Gemini and solution status
j for GPT. The final column and the final row represent the total counts for
each status for Gemini and GPT, respectively.
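
A matrix like Table 3 can be produced directly from the collected status records.
The sketch below assumes the shared problems have been loaded into a pandas
DataFrame with one row per problem and the illustrative columns gemini_status
and gpt_status.

# Sketch: build a status summary matrix like Table 3 with pandas.
# Column names are illustrative assumptions.
import pandas as pd

def status_matrix(df):
    return pd.crosstab(df["gemini_status"], df["gpt_status"],
                       rownames=["Gemini"], colnames=["GPT"],
                       margins=True, margins_name="Total")

# Example:
# df = pd.DataFrame({"gemini_status": ["AC", "WA", "WA"],
#                    "gpt_status":    ["AC", "RTE", "WA"]})
# print(status_matrix(df))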

3.2 Exploratory Analysis


As shown in Table 3, the numbers of AC (186/192), WA (1023/1066), and
MLE (20/15) results are similar for both platforms. On the other hand, the
numbers of RTE (431/278) and TLE (71/180) results differ, with GPT having
more RTE statuses and Gemini having more TLE statuses. These differences
are visualized in Figures 2 and 3.

Table 3: Status summary matrix for Gemini solutions (rows) and GPT solutions (columns)

Gemini \ GPT       AC     WA     RTE    TLE    MLE    Gemini Total
AC                 114    75     20     8      0      217
WA                 59     820    296    31     9      1215
RTE                23     142    136    10     2      313
TLE                15     130    38     42     4      229
MLE                0      8      2      1      6      17
GPT Total          211    1175   492    92     21     1981

Figure 2: Status per Difficulty Range (Gemini)
Figure 3: Status per Distribution Range (GPT)

In addition to the evaluation results, FACE also stores the generated Python
solutions, enabling study of the source code itself. Sample generated solutions
from Gemini and GPT are presented in Listings 1 and 2, respectively. Visual
inspection of the generated solutions shows that Gemini generates code with
more comments and well-defined function and variable names, following good
software engineering conventions. On the other hand, GPT generates more
abbreviated code, with function and variable names taken directly from the
problem’s text.
Listing 1: Gemini

def evaluate_sound_duration():
    """Compares the duration of Jon Marius' "aaah" with
    the doctor's requirement."""
    patient_sound = input()
    doctor_sound = input()
    patient_a_count = patient_sound.count('a')
    doctor_a_count = doctor_sound.count('a')
    if patient_a_count >= doctor_a_count:
        print("go")
    else:
        print("no")

evaluate_sound_duration()

Listing 2: GPT

def sore_throat_test():
    jon_aaah = input()
    doctor_aaah = input()
    if len(jon_aaah) >= len(doctor_aaah):
        return 'go'
    else:
        return 'no'

print(sore_throat_test())

A direct text comparison does not work in this scenario. Instead, we utilize
Python’s Abstract Syntax Tree (AST) and the PyASTSim Python library [11]
to compare the generated solutions from the two platforms. PyASTSim first
converts the Python source code to ASTs, removes all comments and docstrings,
and then normalizes the identifiers. Next, the ASTs are converted back to
source code, and the differences between the resulting sources are measured
using the Damerau-Levenshtein distance [2]. The edit distances are then
converted to percentages. Table 4 provides the summary statistics of these
similarity percentage scores for different statuses across different difficulty
ranges. The selected problems are the ones for which both platforms generate
the same status. The median similarity scores for AC problems in the lower
difficulty ranges (0.0-2.0 and 2.0-4.0) are noticeably higher than the similarity
scores for other statuses and difficulty ranges, suggesting that simpler problems
are likely to have more similar solutions.
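
The following sketch shows the general normalize-then-compare idea rather than
PyASTSim's actual implementation: identifiers are renamed to canonical tokens,
docstrings are dropped, the trees are unparsed, and a similarity ratio is computed.
difflib's SequenceMatcher stands in here for the Damerau-Levenshtein measure
used by PyASTSim.

# Sketch of AST-based normalization and similarity scoring (not PyASTSim itself).
import ast
import builtins
import difflib

class Renamer(ast.NodeTransformer):
    """Rename user-defined identifiers to canonical tokens (id0, id1, ...)."""
    def __init__(self):
        self.names = {}

    def _canon(self, name):
        return self.names.setdefault(name, f"id{len(self.names)}")

    def visit_FunctionDef(self, node):
        node.name = self._canon(node.name)
        self.generic_visit(node)
        # Drop a leading docstring if present.
        if (node.body and isinstance(node.body[0], ast.Expr)
                and isinstance(node.body[0].value, ast.Constant)
                and isinstance(node.body[0].value.value, str)):
            node.body = node.body[1:] or [ast.Pass()]
        return node

    def visit_arg(self, node):
        node.arg = self._canon(node.arg)
        return node

    def visit_Name(self, node):
        if not hasattr(builtins, node.id):  # leave built-ins such as print/input alone
            node.id = self._canon(node.id)
        return node

def normalize(source):
    tree = Renamer().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # Python 3.9+

def similarity_percent(code_a, code_b):
    matcher = difflib.SequenceMatcher(None, normalize(code_a), normalize(code_b))
    return 100 * matcher.ratio()

For example, Listings 1 and 2 would first be reduced to the same anonymized
naming scheme before their edit distance is measured.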
Table 4: Summary statistics of text similarity between Gemini/ChatGPT-generated solutions

Problem Difficulty: 0.0 - 2.0


Status Count Median Minimum Maximum Std. Dev.
Accepted 65 46.000 8.000 89.000 16.039
Wrong Answer 28 39.500 0.000 69.000 16.344
Run Time Error 2 50.500 40.000 61.000 14.849
Time Limit Exceeded 0 nan nan nan nan
Memory Limit Exceeded 0 nan nan nan nan
Total 95
Problem Difficulty: 2.0 - 4.0
Status Count Median Minimum Maximum Std. Dev.
Accepted 48 48.000 0.000 91.000 18.656
Wrong Answer 149 36.000 0.000 82.000 14.848
Run Time Error 24 36.000 0.000 70.000 14.940
Time Limit Exceeded 4 38.000 33.000 71.000 17.569
Memory Limit Exceeded 0 nan nan nan nan
Total 225
Problem Difficulty: 4.0 - 6.0
Status Count Median Minimum Maximum Std. Dev.
Accepted 1 24.000 24.000 24.000 nan
Wrong Answer 276 35.000 0.000 79.000 14.344
Run Time Error 45 34.000 0.000 53.000 13.056
Time Limit Exceeded 9 48.000 30.000 76.000 18.824
Memory Limit Exceeded 2 58.000 40.000 76.000 25.456
Total 333
Problem Difficulty: 6.0 - 8.0
Status Count Median Minimum Maximum Std. Dev.
Accepted 0 nan nan nan nan
Wrong Answer 268 33.000 0.000 69.000 14.221
Run Time Error 44 32.500 0.000 53.000 12.777
Time Limit Exceeded 15 40.000 22.000 61.000 11.767
Memory Limit Exceeded 2 20.500 7.000 34.000 19.092
Total 329
Problem Difficulty: 8.0 - 10.0
Status Count Median Minimum Maximum Std. Dev.
Accepted 0 nan nan nan nan
Wrong Answer 99 34.000 0.000 78.000 13.140
Run Time Error 21 33.000 0.000 54.000 12.557
Time Limit Exceeded 4 41.000 0.000 57.000 24.364
Memory Limit Exceeded 2 33.000 29.000 37.000 5.657
Total 126

3.3 Statistical Analysis
Figure 4 provides a visual intuition regarding the correlation between ChatGPT’s
and Gemini’s performance and the problems’ difficulty. Solutions with
AC status concentrate primarily between difficulty levels 0 and 3. As the range
of difficulty increases, the number of AC solutions declines visibly across both
platforms. For WA, RTE, and TLE, the distributions appear to visually fit a
normal distribution with a mean around a difficulty of 6.0.

Figure 4: Difficulty Distribution across Different Status

To determine whether there is a statistically significant difference between the
difficulty distributions and pass-ratio distributions for different statuses across
the two platforms, we first apply the Kolmogorov-Smirnov (KS) test to the two
platforms’ sets of problem difficulty scores for each status. For the KS test,
the null hypothesis is that both score sets come from the same continuous
distribution. Next, we apply the t-test with the null hypothesis that both
score sets have the same expected value. The same procedure is applied to the
platforms’ pass-ratio scores for each status. The p-value results are shown in
Table 5.
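
For reference, the per-status comparisons can be sketched with scipy as follows;
the DataFrame layout and the column names ("status", plus "difficulty" or
"pass_ratio" as the score column) are assumptions.

# Sketch of the per-status KS test and t-test using scipy.
# DataFrame columns ("status", "difficulty", "pass_ratio") are illustrative.
from scipy.stats import ks_2samp, ttest_ind

def compare_per_status(gemini_df, gpt_df, column, status):
    a = gemini_df.loc[gemini_df["status"] == status, column]
    b = gpt_df.loc[gpt_df["status"] == status, column]
    ks_p = ks_2samp(a, b).pvalue  # H0: both samples come from the same distribution
    t_p = ttest_ind(a, b).pvalue  # H0: both samples have the same expected value
    return ks_p, t_p

# Example: compare_per_status(gemini_df, gpt_df, "pass_ratio", "WA")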

From the results, we fail to reject the null hypotheses for both tests in the
cases of AC, RTE, and MLE. In other words, we do not find statistically
significant evidence that the distributions of the solutions for these statuses
differ between the two platforms. For WA, both KS test results, for Difficulty
and for Pass Ratio scores, are statistically significant (p-value < 0.05), while
both t-test results are not. This means that while we cannot reject the null
hypothesis that the expected values of these score distributions are similar
across platforms, there is statistically significant evidence that the distributions
themselves differ. In other words, there is a significant difference between the
WA-causing solutions generated by Gemini and GPT. For TLE, only the null
hypothesis of the KS test for the Pass Ratio scores is rejected, with a p-value
of 0.0274.
Table 5: Summary statistics (p-values) comparing Gemini solutions and GPT solutions

                           Pass Ratio            Difficulty
Status                     KS test    t-test     KS test    t-test
Accepted                   1.0        0.8294     nan        0.4187
Wrong Answer               0.0048     0.802      0.0171     0.2732
Run Time Error             0.9999     0.4498     0.2739     0.0978
Time Limit Exceeded        0.0274     0.8388     0.3818     0.3321
Memory Limit Exceeded      0.2144     0.8329     0.1534     0.7647

From the Damerau-Levenshtein distances collected in the exploratory analysis,
we also graph the distribution of similarity scores between solutions for the
different statuses. In this case, we select the set of problems for which both
platforms generate the same status. Figure 5 indicates that, with the exception
of MLE, all statuses have nearly identical distributions of similarity scores. No
distribution is concentrated at high similarity values, indicating that the two
platforms share some common coding structure while the majority of the
differences lie in the details of the generated code.

Figure 5: Similarity Score Distribution across Different Status
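
A figure of this kind can be produced with a few lines of matplotlib, as in the
sketch below; it assumes a DataFrame of shared problems with the illustrative
columns "status" and "similarity".

# Sketch: plot the similarity-score distribution per shared status.
# Column names and output file are illustrative assumptions.
import matplotlib.pyplot as plt

def plot_similarity_distributions(df, out_file="similarity_distribution.png"):
    fig, ax = plt.subplots()
    for status, group in df.groupby("status"):
        ax.hist(group["similarity"], bins=20, alpha=0.5, label=status)
    ax.set_xlabel("Similarity score (%)")
    ax.set_ylabel("Number of problems")
    ax.legend(title="Status")
    fig.savefig(out_file)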

4 Literature Review
There has been much work focusing on studying the quality of code generated
through LLM platforms [7, 4, 10, 3, 5, 9, 12]. These works utilize datasets
consisting of test cases and code collected or generated previously. One of
the more popular datasets is APPS, a benchmark for code generation from
natural language specifications [7]. It consists of 10,000 problems collected
from 7 sources: codeforces.com, atcoder.jp, www.codechef.com, leetcode.com,
open.kattis.com, www.hackerrank.com, and www.codewars.com. The
input/output test cases are collected from publicly available sources. For
example, the test cases from open.kattis.com are the ones available on the
problems’ pages. Hidden tests are not available from APPS. Another dataset
is the Mostly Basic Programming Problems (MBPP) [1]. These problems
were created by crowdsourcing participants to write a short problem statement,
a single self-contained Python solution, and three test cases that check for
semantic correctness. The Refactory dataset contains 2442 correct and 1783
buggy programs collected from real-world student submissions to an
introductory programming course at a large public university [8]. FACE
provides an additional alternative to these static datasets by allowing users to
evaluate solutions against authentic and extreme test cases that are not readily
available to the public. This, in turn, enables a more rigorous study of
AI-driven code generation platforms.

5 Conclusion
Through FACE, we are able to extensively collect problem statements, capture
AI-generated solutions from different online platforms, and evaluate the quality
of these solutions for comparison purposes. The analysis of the results
demonstrates a clear difference in writing convention and coding style, which
in turn leads to a noticeable difference in the distribution of final evaluation
statuses between the two platforms in our case study: OpenAI’s ChatGPT and
Google’s Gemini. FACE provides a foundational framework from which the
following future studies can be carried out:

• Adding prompt engineering capabilities to customize and resubmit problem
statements to generate better code.

• Investigating approaches to customize and summarize problem statements
to reduce token counts, in order to reduce cost without impacting the quality
of the generated code.

• Investigating the possibility of combining multiple failed solutions with the
problem statement to ask the platforms to generate better code.

References
[1] Jacob Austin et al. “Program Synthesis with Large Language Models”.
In: arXiv e-prints (2021), arXiv–2108.
[2] Bonnie Berger, Michael S Waterman, and Yun William Yu. “Levenshtein
distance, sequence comparison and biological database search”. In: IEEE
transactions on information theory 67.6 (2020), pp. 3287–3294.
[3] Bei Chen et al. CodeT: Code Generation with Generated Tests. 2022.
arXiv: 2207.10397 [cs.CL].
[4] Mark Chen et al. Evaluating Large Language Models Trained on Code.
2021. arXiv: 2107.03374 [cs.LG].
[5] Carlos Eduardo Andino Coello, Mohammed Nazeh Alimam, and Rand
Kouatly. “Effectiveness of ChatGPT in Coding: A Comparative Analysis
of Popular Large Language Models”. In: Digital 4.1 (2024), pp. 114–125.
issn: 2673-6470. doi: 10.3390/digital4010005. url: https://www.mdpi.com/2673-6470/4/1/5.
[6] Tyna Eloundou et al. “GPTs are GPTs: An early look at the labor market
impact potential of large language models”. In: arXiv preprint
arXiv:2303.10130 (2023).
[7] Dan Hendrycks et al. “Measuring Coding Challenge Competence With
APPS”. In: Thirty-fifth Conference on Neural Information Processing
Systems Datasets and Benchmarks Track. 2021.
[8] Yang Hu et al. “Re-factoring based program repair applied to program-
ming assignments”. In: 2019 34th IEEE/ACM International Conference
on Automated Software Engineering (ASE). IEEE. 2019, pp. 388–398.
[9] Jiawei Liu et al. Is Your Code Generated by ChatGPT Really Correct?
Rigorous Evaluation of Large Language Models for Code Generation.
2023. arXiv: 2305.01210 [cs.SE].

[10] Erik Nijkamp et al. CodeGen: An Open Large Language Model for Code
with Multi-Turn Program Synthesis. 2023. arXiv: 2203.13474 [cs.LG].
[11] PyASTSim. https://pypi.org/project/pyastsim/. 2021.
[12] Haoye Tian et al. Is ChatGPT the Ultimate Programming Assistant –
How far is it? 2023. arXiv: 2304.11938 [cs.SE].
