The Fools are Certain; the Wise are Doubtful: Exploring LLM Confidence in Code Completion
Description
Replication package and dataset of the research paper: The Fools are Certain; the Wise are Doubtful: Exploring LLM Confidence in Code Completion
Code completion is the task of providing missing tokens given a surrounding context. It can boost developer productivity while providing a powerful code discovery tool. Following the Large Language Model (LLM) wave, code completion has been approached with diverse LLMs fine-tuned on code (code LLMs). The performance of code LLMs can be assessed with downstream and intrinsic metrics. Downstream metrics are usually employed to evaluate the practical utility of a model, but can be unreliable and require complex calculations and domain-specific knowledge. In contrast, intrinsic metrics such as perplexity, entropy, and mutual information, which measure model confidence or uncertainty, are simple, versatile, and universal across LLMs and tasks, and can serve as proxies for functional correctness and hallucination risk in LLM-generated code. Motivated by this, we evaluate the confidence of LLMs when generating code by measuring code perplexity across programming languages, models, and datasets, using various LLMs and a sample of 1,008 files from 657 GitHub projects. We find that strongly typed languages exhibit lower perplexity than dynamically typed languages. Scripting languages also demonstrate higher perplexity. Perl appears universally high in perplexity, whereas Java appears low. Code perplexity depends on the employed LLM, but not on the code dataset. Although code comments often increase perplexity, the language ranking based on perplexity is barely affected by their presence. LLM researchers, developers, and users can employ our findings to assess the benefits and suitability of LLM-based code completion in specific software projects, based on how language, model choice, and code characteristics impact model confidence.
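To illustrate the intrinsic metric at the center of this study: perplexity is the exponential of the average negative log-likelihood a model assigns to the observed tokens. A minimal sketch (not the package's implementation; `token_logprobs` is a hypothetical list of natural-log token probabilities that any LLM could produce):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood.
    Lower values indicate higher model confidence."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model that is maximally uncertain over a 100-token vocabulary
# assigns each token probability 1/100, yielding perplexity 100.
uniform = [math.log(1 / 100)] * 50
print(round(perplexity(uniform)))  # → 100
```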
The following source code and data files are included.
- ./src:
  - ./src/bigquery:
    - distinct_licenses.sql: Query for GitHub project licenses
    - gpl_projects.sql: Query for projects distributed under GPL licenses
    - llama_projects.sql: Query for projects distributed under the Apache, BSD, and MIT licenses
  - ./src/graphql:
    - get-project-characteristics.py: Script for querying the GitHub GraphQL API to fetch project metadata
    - join-results.sh: Script for joining the results of queries to the GitHub GraphQL API
  - ./src/notebooks:
    - ./src/notebooks/img: Images used in the perplexity analysis notebook
    - perplexity-analysis.ipynb: Analysis of perplexity results
    - sample-files.ipynb: Sampling of files from GitHub projects
    - sample-files-ctx-size.ipynb: Sampling of files (with comments) from languages
    - sample-files-ctx-size-nc.ipynb: Sampling of files (without comments) from languages
    - sample-projects.ipynb: Sampling of GitHub projects
    - shasum-analysis-sample.ipynb: Analysis of duplicate files in sampled projects
    - token-analysis-polycoder.ipynb: Analysis of LLaMA tokens in the PolyCoder evaluation dataset
    - token-analysis-sample.ipynb: Analysis of LLaMA tokens in our sampled files (with comments)
    - token-analysis-sample-nc.ipynb: Analysis of LLaMA tokens in our sampled files (without comments)
  - clone-projects.sh: Script for cloning GitHub projects
  - deduplicate-projects.sh: Script for deduplicating GitHub projects
  - filter-projects.py: Script for filtering GitHub projects
  - get-docstring-headers.sh: Script for finding files with docstring header comments
  - get-filepaths-shasums.sh: Script for getting source code file paths with specified extensions, and their SHA sums
  - get-language-extensions.py: Script for creating a CSV file with the file extensions of all programming languages
  - predict-perplexity-cpp.sh: Script for computing source code perplexity at file level with LLaMA 3.2
  - predict-perplexity-cpp-polycoder.sh: Script for computing file-level source code perplexity of the PolyCoder evaluation dataset with LLaMA 3.2
  - preprocess-files.py: Script for preprocessing source code files
  - preprocess-files-polycoder.py: Script for preprocessing source code files of the PolyCoder evaluation dataset
  - process-perplexity-results.sh: Script for aggregating the perplexity results of all files
  - process-perplexity-results-polycoder.sh: Script for aggregating the perplexity results of all files of the PolyCoder evaluation dataset
  - Makefile: Rules for running the analysis and producing the associated data files
- ./data:
  - ./data/polycoder_ppl_results_ctx_128_stride_1: Perplexity results of the PolyCoder evaluation dataset using LLaMA 3.2
  - ./data/ppl_results_ctx_64_stride_1_{model}: Perplexity results of sampled files with comments using the associated model (CodeLlama, CodeShell, LLaMA 2, LLaMA 3, LLaMA 3.1, LLaMA 3.2, Mistral, Mixtral MoE, StarCoder)
  - ./data/ppl_results_ctx_64_stride_1_{model}_nc: Perplexity results of sampled files without comments using the associated model (LLaMA 3.2)
  - file_extensions_filtered.csv: Curated file extensions for each language
  - filepaths-shasums-sample-ctx-size.csv: SHA sums of sampled files with comments
  - filepaths-shasums-sample-ctx-size-nc.csv: SHA sums of sampled files without comments
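The `ctx`/`stride` parts of the directory names above refer to sliding-window perplexity evaluation (context size 64 or 128 tokens, stride 1). A hedged sketch of that scheme, under stated assumptions: the `logprob` callable is a hypothetical stand-in for a real model's conditional log-probability, whereas the package's actual computation is performed by the predict-perplexity-cpp*.sh scripts:

```python
import math

def windowed_perplexity(tokens, logprob, ctx_size):
    """Stride-1, file-level perplexity: each token is scored given at
    most the ctx_size - 1 tokens that precede it in the file."""
    nll = 0.0
    for i, tok in enumerate(tokens):
        context = tokens[max(0, i - ctx_size + 1):i]
        nll -= logprob(tok, context)
    return math.exp(nll / len(tokens))

# Toy stand-in model: uniform over a 64-token vocabulary, so the
# perplexity of any sequence equals the vocabulary size.
uniform = lambda tok, ctx: math.log(1 / 64)
print(round(windowed_perplexity(list(range(100)), uniform, ctx_size=64)))  # → 64
```

With stride 1, every token gets the longest context the window allows, at the cost of one model evaluation per token; larger strides trade context length for speed.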
Files
- codepred-repl-data.zip (232.4 MB, md5:e1c5ee71325585e209f2566378faf636)