This repository contains a curated list of research papers related to Large Language Models (LLMs) in scientific research, code generation, and idea evaluation.
Systems that leverage LLMs to perform end-to-end scientific research, from idea generation to experimentation and writing.
- The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery (Sakana AI, 2024)
- Description: Automates idea generation, coding, experimentation, plotting, paper writing, and reviewing.
- Paper: arXiv
- DeepScientist: Advancing Frontier-Pushing Scientific Findings Progressively (Weng et al., 2025)
- Towards an AI Co-Scientist (Google / Gemini 2.0, 2025)
- Description: Multi-agent system for biomedical discovery built around a generate-debate-evolve structure (see the sketch after this entry).
- Paper: arXiv
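
The generate-debate-evolve structure can be pictured as the loop below. This is a minimal sketch, not the Co-Scientist implementation: the `ask` callable stands in for any text-in/text-out LLM call, and the prompts, candidate count, and round structure are illustrative assumptions.

```python
# Minimal generate-debate-evolve loop (illustrative; not the AI Co-Scientist code).
# `ask` is a placeholder for any text-in/text-out LLM call.
from typing import Callable, List

def generate_debate_evolve(goal: str, ask: Callable[[str], str],
                           n_candidates: int = 4, rounds: int = 2) -> str:
    # Generate: propose several independent hypotheses for the research goal.
    hypotheses: List[str] = [
        ask(f"Propose one testable hypothesis for: {goal} (variant {i})")
        for i in range(n_candidates)
    ]
    for _ in range(rounds):
        # Debate: critique each hypothesis for novelty and feasibility.
        critiques = [ask(f"Critique this hypothesis:\n{h}") for h in hypotheses]
        # Evolve: revise each hypothesis in light of its critique.
        hypotheses = [
            ask(f"Revise the hypothesis to address the critique.\nHypothesis: {h}\nCritique: {c}")
            for h, c in zip(hypotheses, critiques)
        ]
    # Final selection keeps a single surviving hypothesis.
    return ask("Pick the single best hypothesis from:\n" + "\n".join(hypotheses))
```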
- Coscientist: Autonomous Chemical Research with LLM Agents (Nature, 2023)
- Description: Multi-agent system controlling a cloud lab for complex organic synthesis.
- Paper: Nature
- Many Heads Are Better Than One (VirSci): Improved Scientific Idea Generation by A LLM-Based Multi-Agent System (ACL 2025)
- Description: Virtual Scientists ecosystem with multiple specialist agents for idea generation.
- Repository: GitHub
- Paper: ACL Anthology
- ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models (2024)
- ToolUniverse / From Models to Scientists (Gao et al., 2025)
- Description: A framework connecting LLMs to 600+ scientific tools (a registry-and-dispatch sketch follows this entry).
- Website: Kempner Institute
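
As a rough picture of how a large tool collection can be exposed to an LLM, the sketch below registers callables together with natural-language descriptions and dispatches calls by name. The registry class and the example tool are hypothetical; this is not ToolUniverse's actual API.

```python
# Illustrative tool-registry sketch: register tools with descriptions the LLM can read,
# then dispatch calls by name. Hypothetical; not ToolUniverse's interface.
from typing import Callable, Dict, Any

class ToolRegistry:
    def __init__(self) -> None:
        self._tools: Dict[str, Dict[str, Any]] = {}

    def register(self, name: str, description: str, fn: Callable[..., Any]) -> None:
        # Store the callable alongside a natural-language description for the prompt.
        self._tools[name] = {"description": description, "fn": fn}

    def describe(self) -> str:
        # Serialize tool descriptions so they can be placed in the LLM context.
        return "\n".join(f"{n}: {t['description']}" for n, t in self._tools.items())

    def call(self, name: str, **kwargs: Any) -> Any:
        # Dispatch a tool call requested by the model.
        return self._tools[name]["fn"](**kwargs)

registry = ToolRegistry()
registry.register(
    "molecular_weight",
    "Return the molecular weight (g/mol) for a small set of known molecules.",
    lambda formula: {"H2O": 18.02, "CO2": 44.01}.get(formula),
)
print(registry.describe())
print(registry.call("molecular_weight", formula="CO2"))
```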
- Agent Laboratory: Using LLM Agents as Research Assistants
- Description: A "research OS" for human-AI collaboration in research workflows.
- Repository: GitHub
- Towards Scientific Intelligence: A Survey of LLM-based Scientific Agents (2025)
- Description: Comprehensive survey of scientific agents.
- Paper: arXiv
Multi-agent systems for automatically reproducing research papers as executable code.
- Paper2Code (PaperCoder): Automating Code Generation from Scientific Papers in Machine Learning (2025)
- ResearchCodeAgent: An LLM Multi-Agent System for Automated Codification of Research Methodologies (2025)
- Description: Multi-agent system that translates ML methodologies into executable code; reports 46.9% high-quality code, with 25% outperforming baseline implementations.
- Paper: arXiv
- AutoReproduce: Automatic AI Experiment Reproduction with Paper Lineage (2025)
- Description: Combines paper lineage (citation tracking) with a multi-agent framework for full experiment reproduction; reports a 70%+ improvement over baselines (a lineage-gathering sketch follows this entry).
- Paper: arXiv
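
The "paper lineage" idea can be illustrated as a breadth-first walk over the citation graph, collecting the ancestor papers whose methods the target paper builds on before reproduction starts. The toy in-memory graph and function below are assumptions for illustration, not AutoReproduce's implementation.

```python
# Hedged sketch of paper-lineage gathering: walk the citation graph backwards from a
# target paper to collect ancestor papers for reproduction context. Toy data only.
from collections import deque
from typing import Dict, List

def collect_lineage(paper_id: str, cites: Dict[str, List[str]], max_depth: int = 2) -> List[str]:
    lineage, seen = [], {paper_id}
    queue = deque([(paper_id, 0)])
    while queue:
        pid, depth = queue.popleft()
        if depth == max_depth:
            continue
        for parent in cites.get(pid, []):          # papers that `pid` cites
            if parent not in seen:
                seen.add(parent)
                lineage.append(parent)
                queue.append((parent, depth + 1))
    return lineage  # ordered roughly from closest to most distant ancestors

toy_citation_graph = {"target": ["method_paper", "dataset_paper"],
                      "method_paper": ["foundation_paper"]}
print(collect_lineage("target", toy_citation_graph))
# -> ['method_paper', 'dataset_paper', 'foundation_paper']
```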
- RePro: Reflective Paper-to-Code Reproduction Enabled by Fine-Grained Verification (2025)
- Description: Pairs fingerprint-based verification with an iterative refinement loop; reports a 13% improvement on PaperBench Code-Dev (a verify-and-refine sketch follows this entry).
- Paper: arXiv
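
A verify-and-refine loop of this kind can be sketched as follows, assuming hypothetical `write_code` and `run_checks` callables: failed fine-grained checks ("fingerprints") are fed back to the code-writing model until everything passes or an iteration budget is exhausted. The fingerprint format and prompts are illustrative, not RePro's.

```python
# Minimal reflective verify-and-refine loop (illustrative; not RePro's implementation).
# `write_code` and `run_checks` are placeholder callables.
from typing import Callable, Dict, List

def refine_until_verified(spec: str,
                          write_code: Callable[[str], str],
                          run_checks: Callable[[str], Dict[str, bool]],
                          max_iters: int = 3) -> str:
    code = write_code(spec)
    for _ in range(max_iters):
        results = run_checks(code)                   # fingerprint name -> passed?
        failed: List[str] = [k for k, ok in results.items() if not ok]
        if not failed:
            return code                              # all fine-grained checks pass
        feedback = "These checks failed: " + ", ".join(failed)
        code = write_code(spec + "\n" + feedback)    # reflective revision with feedback
    return code
```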
- SciReplicate-Bench + Sci-Reproducer: Dual-Agent Algorithmic Reproduction (2025)
- Description: A Paper Agent builds a reasoning graph from the paper and a Code Agent implements it, targeting NLP algorithm reproduction (a dual-agent sketch follows this entry).
- Paper: arXiv
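
One way to picture the dual-agent split: a Paper Agent turns the method section into an explicit step/dependency graph, and a Code Agent implements the steps in dependency order. Both functions below, the line-based graph format, and the prompts are hypothetical simplifications rather than Sci-Reproducer's design.

```python
# Hypothetical dual-agent sketch (not Sci-Reproducer's implementation).
from typing import Callable, Dict, List

def paper_agent(method_text: str, ask: Callable[[str], str]) -> Dict[str, List[str]]:
    # Ask for one "step: comma-separated dependencies" line per algorithmic step, then parse.
    reply = ask("List each algorithmic step as 'step: dependencies' for:\n" + method_text)
    graph: Dict[str, List[str]] = {}
    for line in reply.splitlines():
        if ":" in line:
            step, deps = line.split(":", 1)
            graph[step.strip()] = [d.strip() for d in deps.split(",") if d.strip()]
    return graph

def code_agent(graph: Dict[str, List[str]], ask: Callable[[str], str]) -> Dict[str, str]:
    # Implement steps in dependency order, passing completed code as context.
    done: Dict[str, str] = {}
    progress = True
    while progress and len(done) < len(graph):
        progress = False
        for step, deps in graph.items():
            if step not in done and all(d in done for d in deps):
                context = "\n".join(done[d] for d in deps)
                done[step] = ask(f"Implement step '{step}' given:\n{context}")
                progress = True
    return done
```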
- Paper2Code (Autonomous-Scientific-Agents)
- Description: CrewAI-based system for reproducing computational science papers.
- Repository: GitHub
Benchmarks for evaluating AI agents' ability to reproduce research papers and experiments.
- PaperBench: Evaluating AI's Ability to Replicate AI Research (OpenAI, 2025)
- CORE-Bench: Computational Reproducibility Agent Benchmark (2024)
- Description: 90 papers, 270 tasks for computational reproducibility. Includes CORE-Agent baseline.
- Paper: arXiv
- LMR-BENCH: Evaluating LLM Agents' Ability on Reproducing Language Modeling Research (EMNLP 2025)
- Description: 28 tasks from 23 language modeling papers.
- Paper: arXiv
- SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction (2025)
- Description: 100 algorithmic reproduction tasks from 36 NLP papers. Best model ~39% execution accuracy.
- Paper: arXiv
- ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code
- Description: 212 coding challenges from recent ML papers.
- Repository: researchcodebench.github.io
- MLAgentBench (2023)
- Description: Kaggle-style end-to-end experiment benchmark.
- Paper: ar5iv
Tools for evaluating ideas, reviewing papers, and verifying scientific claims.
- DeepReview: Improving LLM-based Paper Review with Human-like Deep Thinking Process (2025)
- ScholarEval: Research Idea Evaluation Grounded in Literature (2025)
- LLM-based Corroborating and Refuting Evidence Retrieval (CREDIFY / CIBER)
- Zero-shot Scientific Claim Verification Using LLMs and Citation Text
- Paper: ACL Anthology
- AI4Research: A Survey of Artificial Intelligence for Scientific Research
- Website: ai-4-research.github.io
- Reproduction Multi-Agent (Paper2Code / ResearchCodeAgent / AutoReproduce / RePro) → Solves "implementation / reproduction & reliability"
- Research Multi-Agent (VirSci / ResearchAgent / AI Co-Scientist / DeepScientist) → Solves "idea / hypothesis / experiment design"
- End-to-end system: VirSci-style idea generation → AutoReproduce/RePro-style code generation → automated evaluation (a pipeline sketch follows this list)
- Systematic comparison on benchmarks: single agent vs multi-role vs reflective multi-agent
- New benchmark: Can multi-agent systems "read paper → question it → reproduce/refute with experiments"?
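
A minimal sketch of how these directions could compose into one pipeline is given below. The `generate_ideas`, `implement`, and `evaluate` callables are placeholders standing in for VirSci-style, AutoReproduce/RePro-style, and reviewer-style components respectively; nothing here mirrors a specific system.

```python
# Hedged end-to-end pipeline sketch: idea generation -> code generation -> evaluation.
# All three stage functions are placeholders; swap in real agents behind the same signatures.
from typing import Callable, Dict, List

def research_pipeline(topic: str,
                      generate_ideas: Callable[[str], List[str]],
                      implement: Callable[[str], str],
                      evaluate: Callable[[str, str], float]) -> Dict[str, object]:
    ideas = generate_ideas(topic)                          # stage 1: idea/hypothesis generation
    scored = []
    for idea in ideas:
        code = implement(idea)                             # stage 2: idea -> executable code
        scored.append((evaluate(idea, code), idea, code))  # stage 3: automated evaluation
    best_score, best_idea, best_code = max(scored)         # keep the highest-scoring pair
    return {"idea": best_idea, "code": best_code, "score": best_score}
```

The same three-stage signature also supports the systematic comparison above: single-agent, multi-role, and reflective multi-agent variants can be dropped into each stage and scored on the same benchmarks.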
- Awesome LLM Agents for Scientific Discovery: GitHub