This repository contains a curated list of research papers related to Large Language Models (LLMs) in scientific research, code generation, and idea evaluation.
Systems that leverage LLMs to perform end-to-end scientific research, from idea generation to experimentation and writing.
- The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery (Sakana AI, 2024)
- Description: Automates idea generation, coding, experimentation, plotting, paper writing, and reviewing.
- Paper: arXiv
- DeepScientist: Advancing Frontier-Pushing Scientific Findings Progressively (Weng et al., 2025)
- Towards an AI Co-Scientist (Google / Gemini 2.0, 2025)
- Description: Multi-agent system for biomedical discovery built around a generate-debate-evolve structure (see the sketch after this entry).
- Paper: arXiv
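
The generate-debate-evolve structure can be pictured as the loop below. This is a minimal sketch, not the Co-Scientist implementation: the `ask` callable stands in for any text-in/text-out LLM call, and the prompts, candidate count, and round structure are illustrative assumptions.

```python
# Minimal generate-debate-evolve loop (illustrative; not the AI Co-Scientist code).
# `ask` is a placeholder for any text-in/text-out LLM call.
from typing import Callable, List

def generate_debate_evolve(goal: str, ask: Callable[[str], str],
                           n_candidates: int = 4, rounds: int = 2) -> str:
    # Generate: propose several independent hypotheses for the research goal.
    hypotheses: List[str] = [
        ask(f"Propose one testable hypothesis for: {goal} (variant {i})")
        for i in range(n_candidates)
    ]
    for _ in range(rounds):
        # Debate: critique each hypothesis for novelty and feasibility.
        critiques = [ask(f"Critique this hypothesis:\n{h}") for h in hypotheses]
        # Evolve: revise each hypothesis in light of its critique.
        hypotheses = [
            ask(f"Revise the hypothesis to address the critique.\nHypothesis: {h}\nCritique: {c}")
            for h, c in zip(hypotheses, critiques)
        ]
    # Final selection keeps a single surviving hypothesis.
    return ask("Pick the single best hypothesis from:\n" + "\n".join(hypotheses))
```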
- Coscientist: Autonomous Chemical Research with LLM Agents (Nature, 2023)
- Description: Multi-agent system controlling a cloud lab for complex organic synthesis.
- Paper: Nature
- Many Heads Are Better Than One (VirSci): Improved Scientific Idea Generation by A LLM-Based Multi-Agent System (ACL 2025)
- Description: Virtual Scientists ecosystem with multiple specialist agents for idea generation.
- Repository: GitHub
- Paper: ACL Anthology
- ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models (2024)
- ToolUniverse / From Models to Scientists (Gao et al., 2025)
- Description: A framework connecting LLMs to 600+ scientific tools (a registry-and-dispatch sketch follows this entry).
- Website: Kempner Institute
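
As a rough picture of how a large tool collection can be exposed to an LLM, the sketch below registers callables together with natural-language descriptions and dispatches calls by name. The registry class and the example tool are hypothetical; this is not ToolUniverse's actual API.

```python
# Illustrative tool-registry sketch: register tools with descriptions the LLM can read,
# then dispatch calls by name. Hypothetical; not ToolUniverse's interface.
from typing import Callable, Dict, Any

class ToolRegistry:
    def __init__(self) -> None:
        self._tools: Dict[str, Dict[str, Any]] = {}

    def register(self, name: str, description: str, fn: Callable[..., Any]) -> None:
        # Store the callable alongside a natural-language description for the prompt.
        self._tools[name] = {"description": description, "fn": fn}

    def describe(self) -> str:
        # Serialize tool descriptions so they can be placed in the LLM context.
        return "\n".join(f"{n}: {t['description']}" for n, t in self._tools.items())

    def call(self, name: str, **kwargs: Any) -> Any:
        # Dispatch a tool call requested by the model.
        return self._tools[name]["fn"](**kwargs)

registry = ToolRegistry()
registry.register(
    "molecular_weight",
    "Return the molecular weight (g/mol) for a small set of known molecules.",
    lambda formula: {"H2O": 18.02, "CO2": 44.01}.get(formula),
)
print(registry.describe())
print(registry.call("molecular_weight", formula="CO2"))
```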
- Agent Laboratory: Using LLM Agents as Research Assistants
- Description: A "research OS" for human-AI collaboration in research workflows.
- Repository: GitHub
- Towards Scientific Intelligence: A Survey of LLM-based Scientific Agents (2025)
- Description: Comprehensive survey of scientific agents.
- Paper: arXiv
Multi-agent systems for automatically reproducing research papers as executable code.
- Paper2Code (PaperCoder): Automating Code Generation from Scientific Papers in Machine Learning (2025)
- ResearchCodeAgent: An LLM Multi-Agent System for Automated Codification of Research Methodologies (2025)
- Description: Multi-agent system that translates ML methodologies into executable code; reports 46.9% high-quality code, with 25% outperforming baseline implementations.
- Paper: arXiv
- AutoReproduce: Automatic AI Experiment Reproduction with Paper Lineage (2025)
- Description: Combines paper lineage (citation tracking) with a multi-agent framework for full experiment reproduction; reports a 70%+ improvement over baselines (a lineage-gathering sketch follows this entry).
- Paper: arXiv
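
The "paper lineage" idea can be illustrated as a breadth-first walk over the citation graph, collecting the ancestor papers whose methods the target paper builds on before reproduction starts. The toy in-memory graph and function below are assumptions for illustration, not AutoReproduce's implementation.

```python
# Hedged sketch of paper-lineage gathering: walk the citation graph backwards from a
# target paper to collect ancestor papers for reproduction context. Toy data only.
from collections import deque
from typing import Dict, List

def collect_lineage(paper_id: str, cites: Dict[str, List[str]], max_depth: int = 2) -> List[str]:
    lineage, seen = [], {paper_id}
    queue = deque([(paper_id, 0)])
    while queue:
        pid, depth = queue.popleft()
        if depth == max_depth:
            continue
        for parent in cites.get(pid, []):          # papers that `pid` cites
            if parent not in seen:
                seen.add(parent)
                lineage.append(parent)
                queue.append((parent, depth + 1))
    return lineage  # ordered roughly from closest to most distant ancestors

toy_citation_graph = {"target": ["method_paper", "dataset_paper"],
                      "method_paper": ["foundation_paper"]}
print(collect_lineage("target", toy_citation_graph))
# -> ['method_paper', 'dataset_paper', 'foundation_paper']
```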
- RePro: Reflective Paper-to-Code Reproduction Enabled by Fine-Grained Verification (2025)
- Description: Pairs fingerprint-based verification with an iterative refinement loop; reports a 13% improvement on PaperBench Code-Dev (a verify-and-refine sketch follows this entry).
- Paper: arXiv
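
A verify-and-refine loop of this kind can be sketched as follows, assuming hypothetical `write_code` and `run_checks` callables: failed fine-grained checks ("fingerprints") are fed back to the code-writing model until everything passes or an iteration budget is exhausted. The fingerprint format and prompts are illustrative, not RePro's.

```python
# Minimal reflective verify-and-refine loop (illustrative; not RePro's implementation).
# `write_code` and `run_checks` are placeholder callables.
from typing import Callable, Dict, List

def refine_until_verified(spec: str,
                          write_code: Callable[[str], str],
                          run_checks: Callable[[str], Dict[str, bool]],
                          max_iters: int = 3) -> str:
    code = write_code(spec)
    for _ in range(max_iters):
        results = run_checks(code)                   # fingerprint name -> passed?
        failed: List[str] = [k for k, ok in results.items() if not ok]
        if not failed:
            return code                              # all fine-grained checks pass
        feedback = "These checks failed: " + ", ".join(failed)
        code = write_code(spec + "\n" + feedback)    # reflective revision with feedback
    return code
```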
- SciReplicate-Bench + Sci-Reproducer: Dual-Agent Algorithmic Reproduction (2025)
- Description: A Paper Agent builds a reasoning graph from the paper and a Code Agent implements it, targeting NLP algorithm reproduction (a dual-agent sketch follows this entry).
- Paper: arXiv
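
One way to picture the dual-agent split: a Paper Agent turns the method section into an explicit step/dependency graph, and a Code Agent implements the steps in dependency order. Both functions below, the line-based graph format, and the prompts are hypothetical simplifications rather than Sci-Reproducer's design.

```python
# Hypothetical dual-agent sketch (not Sci-Reproducer's implementation).
from typing import Callable, Dict, List

def paper_agent(method_text: str, ask: Callable[[str], str]) -> Dict[str, List[str]]:
    # Ask for one "step: comma-separated dependencies" line per algorithmic step, then parse.
    reply = ask("List each algorithmic step as 'step: dependencies' for:\n" + method_text)
    graph: Dict[str, List[str]] = {}
    for line in reply.splitlines():
        if ":" in line:
            step, deps = line.split(":", 1)
            graph[step.strip()] = [d.strip() for d in deps.split(",") if d.strip()]
    return graph

def code_agent(graph: Dict[str, List[str]], ask: Callable[[str], str]) -> Dict[str, str]:
    # Implement steps in dependency order, passing completed code as context.
    done: Dict[str, str] = {}
    progress = True
    while progress and len(done) < len(graph):
        progress = False
        for step, deps in graph.items():
            if step not in done and all(d in done for d in deps):
                context = "\n".join(done[d] for d in deps)
                done[step] = ask(f"Implement step '{step}' given:\n{context}")
                progress = True
    return done
```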
- Paper2Code (Autonomous-Scientific-Agents)
- Description: CrewAI-based system for reproducing computational science papers.
- Repository: GitHub
Benchmarks for evaluating AI agents' ability to reproduce research papers and experiments.
- PaperBench: Evaluating AI's Ability to Replicate AI Research (OpenAI, 2025)
- CORE-Bench: Computational Reproducibility Agent Benchmark (2024)
- Description: 90 papers, 270 tasks for computational reproducibility. Includes CORE-Agent baseline.
- Paper: arXiv
- LMR-BENCH: Evaluating LLM Agents' Ability on Reproducing Language Modeling Research (EMNLP 2025)
- Description: 28 tasks from 23 language modeling papers.
- Paper: arXiv
- SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction (2025)
- Description: 100 algorithmic reproduction tasks from 36 NLP papers. Best model ~39% execution accuracy.
- Paper: arXiv
- ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code
- Description: 212 coding challenges from recent ML papers.
- Repository: researchcodebench.github.io
- MLAgentBench (2023)
- Description: Kaggle-style end-to-end experiment benchmark.
- Paper: ar5iv
Tools for evaluating ideas, reviewing papers, and verifying scientific claims.
- DeepReview: Improving LLM-based Paper Review with Human-like Deep Thinking Process (2025)
- ScholarEval: Research Idea Evaluation Grounded in Literature (2025)
- LLM-based Corroborating and Refuting Evidence Retrieval (CREDIFY / CIBER)
- Zero-shot Scientific Claim Verification Using LLMs and Citation Text
- Paper: ACL Anthology
- AI4Research: A Survey of Artificial Intelligence for Scientific Research
- Website: ai-4-research.github.io
- Reproduction Multi-Agent (Paper2Code / ResearchCodeAgent / AutoReproduce / RePro) → Solves "implementation / reproduction & reliability"
- Research Multi-Agent (VirSci / ResearchAgent / AI Co-Scientist / DeepScientist) → Solves "idea / hypothesis / experiment design"
- End-to-end system: VirSci-style idea generation → AutoReproduce/RePro-style code generation → automated evaluation (a pipeline sketch follows this list)
- Systematic comparison on benchmarks: single agent vs multi-role vs reflective multi-agent
- New benchmark: Can multi-agent systems "read paper → question it → reproduce/refute with experiments"?
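
A minimal sketch of how these directions could compose into one pipeline is given below. The `generate_ideas`, `implement`, and `evaluate` callables are placeholders standing in for VirSci-style, AutoReproduce/RePro-style, and reviewer-style components respectively; nothing here mirrors a specific system.

```python
# Hedged end-to-end pipeline sketch: idea generation -> code generation -> evaluation.
# All three stage functions are placeholders; swap in real agents behind the same signatures.
from typing import Callable, Dict, List

def research_pipeline(topic: str,
                      generate_ideas: Callable[[str], List[str]],
                      implement: Callable[[str], str],
                      evaluate: Callable[[str, str], float]) -> Dict[str, object]:
    ideas = generate_ideas(topic)                          # stage 1: idea/hypothesis generation
    scored = []
    for idea in ideas:
        code = implement(idea)                             # stage 2: idea -> executable code
        scored.append((evaluate(idea, code), idea, code))  # stage 3: automated evaluation
    best_score, best_idea, best_code = max(scored)         # keep the highest-scoring pair
    return {"idea": best_idea, "code": best_code, "score": best_score}
```

The same three-stage signature also supports the systematic comparison above: single-agent, multi-role, and reflective multi-agent variants can be dropped into each stage and scored on the same benchmarks.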
- Awesome LLM Agents for Scientific Discovery: GitHub