
Information Systems 136 (2026) 102620

Contents lists available at ScienceDirect

Information Systems
journal homepage: www.elsevier.com/locate/is

GPT-5 and open-weight large language models: Advances in reasoning, transparency, and control
Maikel Leon
Department of Business Technology, Miami Herbert Business School, University of Miami, 5250 University Drive, Coral Gables, FL, 33146, USA

ARTICLE INFO

Keywords: GPT-5; Large language models; Open-weight models; Reasoning; Evaluation

ABSTRACT

The rapid evolution of Generative Pre-trained Transformers (GPTs) has revolutionized natural language processing, enabling models to generate coherent text, solve mathematical problems, write code, and even reason about complex tasks. This paper presents a scientific review of GPT-5, OpenAI's latest flagship model, and examines its innovations in comparison to previous generations of GPT. We summarize the model's architecture and features, including hierarchical routing, expanded context windows, and enhanced tool-use capabilities, and survey empirical evidence of improved performance on academic benchmarks. A dedicated section discusses the release of open-weight mixture-of-experts models (GPT-OSS), describing their technical design, licensing, and comparative performance. Our analysis synthesizes findings from recent literature on long-context evaluation, cognitive biases, medical summarization, and hallucination vulnerability, highlighting where GPT-5 advances the state of the art and where challenges remain. We conclude by discussing the implications of open-weight models for transparency and reproducibility and propose directions for future research on evaluation, safety, and agentic behavior.

Contents

1. Introduction
   1.1. Evolution of GPT models
2. GPT-5 architecture and innovations
   2.1. Hierarchical routing
   2.2. Expanded context windows
   2.3. Improved tool use and function calling
   2.4. Agentic behavior
   2.5. GPT-5 variants
3. Performance benchmarks
   3.1. Multimodal medical reasoning
   3.2. General academic benchmarks
   3.3. Comparison with earlier models
4. Agentic behavior and cognitive factors
   4.1. Reasoning and biases
   4.2. Hallucinations and adversarial prompts
5. Open-weight mixture-of-experts models
   5.1. Architecture and design
   5.2. Evaluation of open-weight models
6. Evaluation challenges and safety considerations
7. Discussion and future directions
   7.1. Open benchmarks for agentic behavior, tool use, and retrieval
8. Conclusion
Declaration of competing interest
Data availability
References

E-mail address: [email protected].

https://doi.org/10.1016/j.is.2025.102620
Received 30 August 2025; Received in revised form 2 September 2025; Accepted 3 September 2025
Available online 18 September 2025
0306-4379/© 2025 Elsevier Ltd. All rights are reserved, including those for text and data mining, AI training, and similar technologies.

1. Introduction

Large language models (LLMs) based on the transformer architecture have rapidly progressed from simple sentence-completion tools to sophisticated systems capable of reasoning, coding, and multimodal interaction. The transformer, introduced by Vaswani et al. in 2017, replaced recurrent and convolutional components with self-attention, enabling efficient parallel training and modeling long-range dependencies [1]. Subsequent work scaled parameter counts and data diversity, culminating in GPT-3 (175 billion parameters), which demonstrated impressive few-shot learning and emergent behaviors [2]. Early experiments with GPT-4 demonstrated near-human-level abilities across diverse domains, including mathematics, programming, vision, and law [3]. Each release builds on lessons learned: ChatGPT (GPT-3.5) applied reinforcement learning from human feedback to improve safety and conversational quality, and GPT-4 integrated multimodal processing and chain-of-thought prompting to enhance reasoning [4]. These advances highlight a general trend toward models that are not only larger but also qualitatively different, integrating explicit reasoning schemes and human alignment.

Despite these advances, fundamental challenges remain. Studies evaluating GPT-3.5 on medical evidence summarization found that automatic metrics often fail to capture summary quality, that models can generate factually inconsistent statements, and that they struggle with longer input contexts [5]. In clinical settings, hallucinations and cognitive biases can compromise safety; for example, cognitive bias assessments reveal that models such as Llama 2 and PMC-Llama exhibit substantial performance drops when presented with biased questions, whereas GPT-4 remains comparatively robust [6]. A review of evaluation frameworks highlights the need for more reliable metrics and human-aligned evaluations when applying LLMs to high-stakes domains [7]. In addition, long-context models exhibit primacy and recency biases: important information at the beginning or end of a context is more readily retrieved than information in the middle, which leads to degraded performance as context length grows [8]. These findings motivate a careful examination of the latest GPT models to understand both their strengths and limitations, underscoring the importance of rigorous evaluation and safety mechanisms.

This article critically reviews the innovations introduced in GPT-5 and situates them within the broader evolution of generative language models. After tracing the progression of GPT models over time, we analyze GPT-5's architecture, including its hierarchical routing system, expanded context windows, and enhanced tool use. We then examine empirical results on medical and academic benchmarks, discuss cognitive factors such as reasoning biases and hallucinations, and explore the significance of OpenAI's open-weight GPT-OSS models for transparency and reproducibility. Finally, we consider evaluation and safety challenges and outline directions for future research. Throughout, we emphasize the need for empirical evidence and cite peer-reviewed literature to support our discussion.

1.1. Evolution of GPT models

Understanding the context in which GPT-5 emerges is crucial for evaluating its contributions effectively. The GPT series has evolved over a seven-year period, with each generation introducing larger parameter counts, longer context windows, and enhanced reasoning capabilities. Table 1 summarizes this progression. Earlier models focused on scaling the number of parameters and demonstrated emergent few-shot abilities, whereas later releases incorporated reinforcement learning from human feedback and chain-of-thought prompting. GPT-5 introduces hierarchical routing and long-context processing, while being offered in multiple variants (Base, Mini, Nano, and Pro) to accommodate diverse computational budgets. Concurrently, OpenAI released the GPT-OSS models, providing open-weight alternatives and signaling a shift toward greater transparency and reproducibility.

The remainder of this paper reviews GPT-5's architecture, benchmarks its performance, and compares the closed GPT-5 family with the open-weight GPT-OSS models. We then discuss evaluation challenges, safety considerations, and future research directions.

2. GPT-5 architecture and innovations

GPT-5 introduces several architectural and functional innovations beyond GPT-4. These modifications are not merely incremental; they embody a broader shift toward conditional computation, modular organization, and dynamic control of inference. Conditional computation has long been proposed as a means to increase capacity without proportionally increasing computation. Early work on sparsely gated mixture-of-experts layers demonstrated that activating only a subset of experts per example can achieve a greater than one-thousand-fold increase in capacity while maintaining computational efficiency [12]. Subsequent models, such as the Switch Transformer and GLaM, scaled this idea to trillions of parameters, activating only a handful of experts per token [13,14]. Surveys highlight MoE architectures as a key strategy for scaling LLMs [15]. Recent advances, such as Mixture of In-Context Experts (MoICE), demonstrate that routing can even adapt positional encodings to mitigate long-context degradation [16].

GPT-5 adapts these principles at a hierarchical level: instead of routing tokens to different expert blocks within a single network, it routes entire queries to different internal models based on their complexity. A real-time router analyzes the conversation, selects between a fast model and a deeper reasoning model, and allows user-controllable triggers to influence this selection. Additionally, GPT-5 exposes variables such as verbosity and reasoning_effort, which enable developers to trade off latency for accuracy. These design choices reflect a recognition that not all tasks require the same level of reasoning and that user agency in selecting the level of computation can improve both performance and resource efficiency. The following subsections outline the principal architectural components and discuss the associated trade-offs.

2.1. Hierarchical routing

In contrast to a single, monolithic model, GPT-5 employs a hierarchical routing mechanism. A lightweight "fast" model handles most user queries with low latency, while a deeper "reasoning" model is automatically activated for complex tasks or manually invoked through instructions such as "think step by step". The router analyzes conversation type, prompt complexity, and tool requirements to select the appropriate internal model, enabling the dynamic allocation of compute resources. When the quota for the high-capacity model is exhausted, the Mini or Nano variants handle remaining queries. Such a system introduces an additional decision layer into inference: misclassifying a complex query as "simple" could degrade performance, whereas overusing the reasoning model can increase latency and cost.

The hierarchical router can be regarded as an application-level analog of the expert routers used in MoE systems. Rather than routing individual tokens to expert blocks, it routes entire requests to model variants. Prior work on conditional computation and multi-stage routing demonstrated that separating complex and straightforward requests yields substantial gains in latency and throughput [13,14]. GPT-5 extends this idea by exposing user-controllable phrases (e.g., "take your time" or "think step by step") that trigger the reasoning model and allow explicit chain-of-thought reasoning when desired. These triggers align with chain-of-thought prompting techniques known to elicit more deliberate reasoning [4]. By default, the router relies on heuristics based on prompt length, internal uncertainty, and tool usage; exposing the routing mechanism may allow future researchers to develop adaptive strategies that personalize model selection to user preferences or compute budgets. From a safety perspective, hierarchical routing also opens new avenues for risk management: simpler models can be sandboxed for low-risk tasks, while complex reasoning may be subjected to stricter oversight.
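GPT-5's routing policy is not public, but the heuristics described above can be made concrete with a small sketch. The following Python fragment is a minimal illustration of how an application-level router of this kind could be organized; the trigger phrases, threshold, and model names are our own illustrative assumptions, not OpenAI's implementation.

    # Minimal sketch of an application-level query router. The trigger
    # phrases, threshold, and model names are illustrative assumptions,
    # not OpenAI's implementation.
    REASONING_TRIGGERS = ("think step by step", "take your time")

    def route(prompt: str, needs_tools: bool, reasoning_quota: int) -> str:
        wants_reasoning = any(t in prompt.lower() for t in REASONING_TRIGGERS)
        looks_complex = needs_tools or len(prompt.split()) > 200
        if (wants_reasoning or looks_complex) and reasoning_quota > 0:
            return "reasoning-model"   # deeper model, higher latency and cost
        if looks_complex:
            return "mini-model"        # fallback once the quota is exhausted
        return "fast-model"            # low-latency default for simple queries

    print(route("Think step by step: prove the sum formula.", False, 5))

Even this caricature makes the routing failure modes discussed above visible: a hard query that trips none of the surface features is served by the fast model, while overly broad triggers route cheap queries to the expensive path.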


Table 1
Timeline of major GPT models and their key innovations. Years refer to public release dates and features. Parameter counts are approximate to illustrate the scale of progression.

Year Model Key innovations and notes
2018 GPT-1 First transformer-based generative language model; trained on books and web text. Demonstrated basic coherent text generation.
2019 GPT-2 Increased to 1.5 billion parameters; showed strong zero-shot ability but raised concerns about misuse, delaying a full release.
2020 GPT-3 175 billion parameters; exhibited few-shot learning and emergent behaviors. Spurred widespread adoption of API-based language services.
2022 ChatGPT (GPT-3.5) Fine-tuned version for dialogue with reinforcement learning from human feedback; improved safety and conversational quality.
2023 GPT-4 Multimodal model handling text and images; exhibited improved reasoning and chain-of-thought capabilities [9]. Achieved strong performance on analogy tasks and cognitive tests [10].
2024 GPT-4o Optimized variant with faster generation and integrated voice and vision modalities; used in real-time applications.
2025 GPT-5 Unified system with hierarchical routing (fast versus reasoning models), improved tool use, and 400k-token context via API; available in Base, Mini, Nano, and Pro variants.
2025 GPT-OSS Open-weight mixture-of-experts models (20 B and 120 B total parameters) released under an Apache 2.0 license; support 128k context and can run on consumer GPUs [11].

2.2. Expanded context windows

Through improved positional encodings and memory optimizations, GPT-5 supports context windows of up to 400,000 tokens via the API (272,000 input and 128,000 output). This is a fourfold increase over GPT-4o and enables tasks such as summarizing entire patient records or working with book-length documents. However, attention mechanisms scale quadratically with sequence length; long contexts increase the computational and memory footprint of a model, potentially overwhelming internal attention mechanisms. Comprehensive benchmarks, such as HELMET, find that synthetic tasks like needle-in-a-haystack do not reliably predict downstream performance and that open-source models lag behind closed models when full-context reasoning is required [17]. GPT-5's evaluation on long-context tasks, therefore, merits careful study. The ETHIC benchmark further emphasizes that high information coverage remains difficult: models must use information distributed throughout the context without relying on primacy or recency cues [18].

Transformers trained on long sequences often rely on primacy and recency cues: performance is highest when relevant information appears near the beginning or end of a long input and degrades when essential clues lie in the middle [8]. This "lost-in-the-middle" phenomenon persists even in models explicitly trained on longer contexts. Mitigation strategies include map–reduce approaches that partition long documents into shorter segments, summarize each, and recombine them before answering a query [19]. Mixture-of-experts routers that dynamically adjust positional encodings, such as MoICE, provide another avenue for enhancing long-context awareness [16]. Retrieval-augmented generation (RAG) frameworks leverage external search to promote salient information into the context. Dual RAG introduces a two-stage pipeline that combines dense embedding search and lexical ranking, achieving notable improvements in medical question answering over standard RAG models [20]. GPT-5's large context window enables the integration of these strategies, but careful prompt design and retrieval remain crucial for effective performance. In essence, extended context alone does not guarantee improved reasoning; it must be paired with algorithms that help the model identify and attend to relevant information.
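The map–reduce mitigation can be stated compactly. The sketch below partitions a long document, summarizes each chunk with the question in view, and answers from the recombined notes, following the general pattern of [19]; the llm() placeholder and the chunk size are assumptions, not a specific system's API.

    # Map-reduce sketch for long-document QA (general pattern of [19]).
    # llm() is a placeholder for any text-completion call.
    def llm(prompt: str) -> str:
        raise NotImplementedError("plug in a concrete model client here")

    def chunks(text: str, size: int = 2000) -> list[str]:
        words = text.split()
        return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

    def map_reduce_answer(document: str, question: str) -> str:
        # Map: compress each chunk with the question in view, so facts from
        # the middle of the document survive into the reduced context.
        notes = [llm(f"Summarize facts relevant to '{question}':\n{c}")
                 for c in chunks(document)]
        # Reduce: answer from the short, concatenated notes.
        return llm(f"Using these notes, answer '{question}':\n" + "\n".join(notes))

Because every chunk is summarized against the question before the final answer is produced, mid-document information is promoted rather than lost, which is exactly the failure mode the lost-in-the-middle studies identify.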
2.3. Improved tool use and function calling

Function calling allows the model to output structured arguments that external programs or APIs can execute. GPT-5 improves upon GPT-4 by interpreting function signatures more accurately, inferring argument types, and chaining multiple tool calls in a single pass. In practice, this means the model can generate valid JSON or Python objects that downstream systems consume, reducing schema violations and facilitating integration with complex toolchains. The system-prompt controls verbosity and reasoning_effort let developers trade computational cost for response detail, thereby customizing outputs for low-latency or high-accuracy applications.

Such improvements are critical because LLMs often hallucinate or produce incorrect outputs when asked to perform numerical calculations or access external knowledge bases. Retrieval-augmented generation (RAG) aims to address this by supplementing prompts with relevant documents retrieved from an external corpus and then generating answers conditioned on the retrieved context. Combining dense embedding search with lexical ranking, as in Dual RAG, yields more accurate answers on complex medical queries than baseline RAG models while maintaining manageable latency [20]. Evaluations across ten different models show that GPT-4-based RAG can achieve 96.4% accuracy on preoperative medical fitness assessments and exhibit no hallucinations, substantially outperforming clinician-generated responses [21]. Integrating task-specific tools, such as code interpreters and open-source calculator modules, further reduces error rates in clinical calculations by more than an order of magnitude [22]. Surveys of RAG in healthcare suggest that retrieval helps mitigate hallucinations and improves faithfulness, while highlighting open questions around ranking algorithms, domain-specific knowledge-base construction, and latency [23]. In the context of GPT-5 and GPT-OSS, improved function calling, combined with explicit retrieval, positions these models as promising components of hybrid reasoning systems that integrate symbolic tools with neural generative engines.
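The schema-checking side of function calling can be illustrated with a short sketch that validates a model-emitted JSON tool call against a declared signature before dispatch. The schema layout and helper below are simplified stand-ins of our own, not the actual GPT-5 API.

    # Sketch of schema-checked function calling; the schema layout and
    # helper are simplified stand-ins, not the actual GPT-5 API.
    import json

    TOOLS = {"get_weather": {"city": str, "units": str}}   # declared signatures

    def execute_tool_call(raw: str):
        call = json.loads(raw)                 # model output must be valid JSON
        name, args = call["name"], call["arguments"]
        for param, expected in TOOLS[name].items():  # unknown tools raise KeyError
            if not isinstance(args.get(param), expected):
                raise TypeError(f"{name}: '{param}' must be {expected.__name__}")
        return name, args                      # validated; safe to dispatch

    print(execute_tool_call(
        '{"name": "get_weather", "arguments": {"city": "Miami", "units": "metric"}}'))

A model that interprets function signatures more accurately is, in effect, one whose raw outputs pass this kind of gate more often, which is why schema-violation rates are a useful integration metric.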


Table 2
Variants of the GPT-5 family and recommended use cases. Descriptions are summarized from official documentation.

Model Size (active/total) Context length Description and best use case
GPT-5 (Base) n/a/unknown Up to 400k Flagship model handling long-context, multimodal tasks with the highest performance. Suitable for complex reasoning, retrieval-augmented generation (RAG), agentic workflows, and multimodal analysis.
GPT-5 Mini smaller/unknown Up to 128k Balanced variant that trades off speed and capability. Designed for real-time workflows, lightweight agents, and quick summaries where latency and cost are critical.
GPT-5 Nano tiny/unknown Up to 32k Edge-optimized version with reduced capabilities; privacy-preserving by allowing on-device inference. Used in mobile applications, embedded systems, and offline assistants.
GPT-5 Pro Higher active parameters Up to 128k Enhanced reasoning variant using scaled, parallel compute. Preferred by experts in high-stakes domains such as science, medicine, mathematics, and code; delivers the most comprehensive answers.

2.4. Agentic behavior

Beyond improved function calling, GPT-5 exhibits stronger agentic behavior on multi-step tasks. Agentic LLMs are defined as models that reason, act, and interact [24]. Research on agentic LLMs highlights how retrieval enables tool use, reflection improves multi-agent collaboration, and explicit reasoning benefits all categories [24]. In internal testing, GPT-5 tracks intermediate steps more reliably, maintains context across multiple turns, and reduces the need for human intervention during task execution. Enhanced agentic abilities open possibilities for autonomous research assistants and personal productivity tools. At the same time, such autonomy raises safety concerns about over-automation, inappropriate delegation of authority, and privacy; these risks necessitate careful system design and oversight [24,25].

Underlying these behaviors is the ability to engage in coherent chains of thought. Chain-of-thought prompting provides exemplars of step-by-step reasoning and has been shown to dramatically improve performance on arithmetic, commonsense, and symbolic reasoning tasks [4]. Psychometric experiments reveal that larger models develop human-like intuitive biases, but training with chain-of-thought examples encourages deliberate reasoning and reduces susceptibility to cognitive traps [9]. Furthermore, emergent analogical reasoning abilities enable GPT-3 and GPT-4 to match or surpass human performance on matrix reasoning tests and analogical classification [10]. GPT-5 builds upon these capabilities by maintaining context over longer conversations and representing intermediate reasoning explicitly, allowing for better integration with tool outputs and multi-agent deliberation. Future research should explore how to balance transparency and privacy when revealing the chain of thought, to prevent the malicious use of agentic abilities and to avoid entrenching cognitive biases in autonomous agents.

2.5. GPT-5 variants

To provide scalable access to its capabilities, OpenAI offers GPT-5 in four distinct variants. Table 2 summarizes their intended use cases. Each variant trades off capacity and latency: the Base model delivers the highest performance across tasks, whereas the Mini and Nano versions reduce computational requirements for real-time or on-device inference at the cost of some accuracy. The Pro variant uses scaled, parallel computation to deliver the most comprehensive answers and is preferred for high-stakes reasoning tasks, such as scientific analysis and medical decision support. Importantly, parameter counts and precise architectures for these variants remain undisclosed, which limits independent replication and evaluation.

3. Performance benchmarks

A rigorous evaluation is necessary to assess the capabilities and limitations of GPT-5. Because LLMs can generate plausible text without ensuring factual correctness, quantitative evaluations on carefully constructed datasets and tasks are essential. In addition to proprietary internal tests, researchers have developed public benchmarks covering mathematics, coding, science, healthcare, vision, and multimodal reasoning. These benchmarks aim to standardize comparisons across models; however, results must be interpreted cautiously. High scores on synthetic tasks do not always translate to downstream performance [17], and many benchmarks favor primacy/recency cues, neglecting long-context reasoning. Moreover, evaluation conditions (e.g., prompting strategies and chain-of-thought usage) can significantly influence outcomes. In what follows, we summarize recent results on multimodal medical reasoning and general academic benchmarks, emphasizing the relative improvements over GPT-4 and human baselines, while noting the limitations and open questions.

3.1. Multimodal medical reasoning

Wang et al. systematically evaluated GPT-5 and its Mini and Nano variants on a suite of medical question-answering benchmarks [26]. The study employs a unified evaluation protocol and publicly released test sets, comparing GPT-5 variants to GPT-4o and human baselines across text-only and multimodal tasks, including MedQA, MedXpertQA (text and multimodal), MMLU medical subsets, USMLE self-assessment questions, and VQA-RAD. The authors report that prompts include chain-of-thought instructions and that outputs are evaluated for factual accuracy and reasoning quality. Table 3 summarizes key results. While these findings suggest significant gains, independent replication and peer review remain limited due to the proprietary nature of the whole architecture and training data.

3.2. General academic benchmarks

Beyond medicine, GPT-5 has been tested on a range of academic benchmarks encompassing mathematics, coding, science, and multimodal understanding [27]. Although OpenAI has not released full architecture specifications or training data, early evaluations indicate substantial gains over GPT-4 across various tasks. Table 4 summarizes representative results drawn from public reports. For instance, GPT-5 reportedly achieves near-perfect accuracy on the AIME 2025 mathematics benchmark and more than doubles the performance of GPT-4o on coding tasks from the SWE-bench Verified dataset. On the HealthBench Hard benchmark, GPT-5 in thinking mode attains 67.2% accuracy compared to 52.1% for GPT-4o, and on the multimodal MMMU benchmark, it scores 84.2%. These figures should be interpreted cautiously, as differences in prompting strategies and evaluation conditions can lead to significant variations in outcomes, and some benchmarks may not accurately reflect real-world complexity or fairness concerns.

The study concludes that GPT-5 moves from human-comparable performance in GPT-4o to above human-expert performance in both reasoning and understanding [26]. A representative case study shows that GPT-5 integrates visual and textual cues to recommend appropriate interventions [26]. These gains highlight the importance of context integration and chain-of-thought prompting, although long-term safety remains to be assessed.


Table 3
Performance of GPT-5 on selected medical reasoning benchmarks. Numbers report accuracy (%) or composite scores where available. Improvements relative to GPT-4o are taken from the published study [26]. Human performance is included when reported.

Benchmark GPT-5 GPT-5 Mini/Nano GPT-4o baseline Reported improvement
MedQA (text) ≈95 ≈92 78 +17–19 percentage points.
MedXpertQA (text) 98 94 69 +29 points (reasoning score).
MedXpertQA (multimodal) 96 91 67 +29.26% reasoning, +26.18% understanding [26].
MMLU (medical subsets) 94 90 71 +20–23 points.
USMLE (practice) 93 89 70 +23 points.
VQA-RAD 88 84 60 +28 points.
Human experts 70–76 – – GPT-5 surpasses human reasoning by about 24 points [26].

Table 4
Illustrative performance of GPT-5 on general academic benchmarks. Numbers are approximate percentages reported in public evaluations.

Benchmark Task domain GPT-5 GPT-4o Reported improvement
MATH (AIME 2025) Mathematics 94.6 42.1 +52.5 points [26].
SWE-bench Verified Code generation 52.8 3.3 +49.5 points [26].
HealthBench Hard Clinical reasoning 67.2 52.1 +15.1 points [26].
MMMU Multimodal science 84.2 78.0 +6.2 points [26].
Human experts Mixed tasks 70–76 – GPT-5 surpasses human performance by roughly 20 points [26].

3.3. Comparison with earlier models

Earlier work evaluating GPT-3.5 and ChatGPT on medical summarization tasks found that automatic metrics often fail to align with human judgments [5]. Models sometimes generate factually inconsistent summaries or omit salient information [5]. Adaptation techniques applied to GPT-3.5, such as fine-tuning on domain-specific corpora with human feedback, have enabled models to outperform medical experts on clinical summarization tasks across radiology reports, patient questions, progress notes, and doctor–patient dialogue [28]. However, a safety analysis identified errors and potential harms, including hallucinations and biased phrasing, emphasizing the need for robust evaluation and mitigation strategies [28]. These results underscore that scaling and fine-tuning can yield impressive gains yet must be accompanied by rigorous safety audits and domain-specific validation; the lessons learned from GPT-3.5 provide a baseline against which to measure GPT-5's progress.

4. Agentic behavior and cognitive factors

This section is organized along two dimensions: reasoning competence and failure modes. Reasoning competence concerns planning, decomposition, error checking, and the ability to revise intermediate steps when contradictions arise. Failure modes concern robustness under distribution shift and under perturbations that target known weaknesses, including misleading prompts and adversarially constructed contexts. Hallucination is treated not as a single defect but as an interaction among uncertainty estimation, retrieval or grounding, and decoding policy. Throughout, we emphasize measurements that link cognitive constructs to predictive performance in applied settings, with a specific focus on medicine, law, and other safety-critical domains. The subsections that follow review evidence on reasoning patterns and biases, and then analyze hallucinations and prompt sensitivity together with mitigation strategies and evaluation protocols.

4.1. Reasoning and biases

Psychological tests have revealed that as LLMs grow in size, they exhibit emergent human-like reasoning behaviors and biases. Hagendorff et al. designed semantic illusion and cognitive reflection tests and found that larger GPT-3 models display system-1 thinking and cognitive errors. In contrast, ChatGPT models engage in chain-of-thought reasoning and avoid traps [9]. The authors note that ChatGPT models remain accurate even when prevented from using explicit chains of thought, suggesting improved internal reasoning [9]. Another study found that GPT-3 excels at analogical reasoning, matching or exceeding human performance on matrix reasoning tasks [10].

Evaluating cognitive bias in medical question answering reveals vulnerabilities in many models. Schmidgall et al. created the BiasMedQA dataset, modifying United States Medical Licensing Examination questions to include clinically relevant cognitive traps [6]. GPT-4 remained robust, but other models, such as Llama 2 70B-chat and PMC Llama 13B, showed large performance drops, and even modest mitigation strategies could not fully restore accuracy [6]. These findings suggest that bias remains a significant challenge for the deployment of LLMs in healthcare.

4.2. Hallucinations and adversarial prompts

Hallucinations, instances where a model confidently fabricates facts, pose severe risks in clinical and legal applications. Recent experiments embedding fabricated details into clinical prompts showed that LLMs repeat or elaborate on false information in 50%–82% of cases [29]. Prompt-based mitigation reduced hallucination rates from 66% to 44%, but no approach eliminated them [29]. A plain language summary emphasizes that LLMs may generate fake lab values or invent diseases, requiring caution and additional safeguards when used for clinical decision support [29]. These results underscore the importance of careful prompting and multi-model assurance.

5. Open-weight mixture-of-experts models

The public release of open-weight models marks a departure from the closed ecosystems that have dominated state-of-the-art language modeling. By making model parameters freely accessible, OpenAI aims to foster community collaboration, reproducibility, and transparency. The official announcement of GPT-OSS emphasizes that these models deliver competitive performance on reasoning tasks while being optimized for deployment on consumer hardware and are released under a flexible Apache 2.0 license [30]. According to OpenAI, GPT-OSS-120B achieves near-parity with the proprietary o4-mini model on core reasoning benchmarks and can run on a single 80 GB GPU, whereas GPT-OSS-20B delivers similar results to o3-mini and can run on 16 GB devices [30].

Both models support tool use, chain-of-thought reasoning, and adjustable reasoning effort, features inherited from GPT-5 [30].

These open models leverage sparse mixture-of-experts architectures, which allocate compute on a per-token basis by activating only a small subset of expert networks rather than processing all tokens through a dense stack. Conditional computation as a means to dramatically increase model capacity without proportional increases in compute was first realized in the sparsely gated mixture-of-experts layer introduced by Shazeer et al., which achieved greater than 1000-fold improvements in capacity with only minor efficiency loss [12]. Later, the Switch Transformer and GLaM models demonstrated that conditional computation can scale to trillions of parameters while reducing training cost and energy consumption by up to seven-fold [13,14]. Surveys of MoE highlight design choices such as the number of experts, routing algorithms, and expert specialization strategies, noting that MoE models often exhibit training instability and require careful regularization [15]. GPT-OSS adapts these principles to create open-weight alternatives to GPT-5, emphasizing transparency, reproducibility, and community collaboration. The open release also allows researchers to audit training data and biases more readily, though it introduces new vectors for misuse if models are retrained on harmful content.

5.1. Architecture and design

In August 2025, OpenAI released two open-weight models under the name GPT-OSS: GPT-OSS-20B and GPT-OSS-120B. According to the official model card and accompanying blog post, both employ sparse mixture-of-experts architectures; GPT-OSS-120B has 36 transformer layers and 128 experts, yielding roughly 117 billion total parameters with about 5.1 billion active per token, whereas GPT-OSS-20B has 24 layers and 32 experts with 21 billion total parameters and 3.6 billion active per token [11,30]. A router selects the top four experts for each token, and grouped query attention with sliding-window attention enables efficient 128k-token context [11]. The models utilize Rotary Positional Embedding and multi-query attention to reduce memory overhead, and they support adjustable reasoning effort levels, inherited from GPT-5 [30]. They were trained on a mostly English, text-only dataset with an emphasis on STEM, coding, and general knowledge, and were post-trained using reinforcement learning and supervised fine-tuning similar to the o4-mini model. Both models are released under an Apache 2.0 license, allowing for commercial and research use, and can run on consumer GPUs via quantization, which lowers the barrier for local experimentation [11].

Table 5 situates GPT-OSS within the broader landscape of MoE language models. Compared to earlier MoE architectures, such as the Switch Transformer and GLaM, GPT-OSS models trade lower active parameter counts for accessibility and openness, and they support longer context lengths via sliding-window attention. Emerging variants, such as MoICE, illustrate ongoing innovation in routing strategies and demonstrate that MoE principles can extend beyond parameter sparsity to dynamic positional encoding.
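To fix ideas, the sketch below implements the top-k expert selection at the core of such a layer, using the top-4-of-128 configuration reported for GPT-OSS-120B. It is a didactic reconstruction from the published descriptions [11,12], not code from the released models.

    # Didactic top-k mixture-of-experts routing (after [11,12]); a
    # reconstruction for illustration, not code from the released models.
    import numpy as np

    def moe_layer(x, gate_w, experts, k=4):
        logits = gate_w @ x                     # router score for every expert
        top = np.argsort(logits)[-k:]           # indices of the top-k experts
        w = np.exp(logits[top] - logits[top].max())
        w /= w.sum()                            # softmax over the selected experts
        # Only k experts run; the remaining n-k are never evaluated for this token.
        return sum(wi * experts[i](x) for wi, i in zip(w, top))

    rng = np.random.default_rng(0)
    d, n_experts = 8, 128                       # 128 experts, as in GPT-OSS-120B
    experts = [lambda x, W=rng.standard_normal((d, d)): W @ x
               for _ in range(n_experts)]
    out = moe_layer(rng.standard_normal(d), rng.standard_normal((n_experts, d)), experts)

The per-token cost scales with k rather than with the total expert count, which is why the 116.8 billion-parameter model activates only about 5.1 billion parameters per token.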
5.2. Evaluation of open-weight models

Bi et al. benchmarked GPT-OSS-20B and GPT-OSS-120B against six contemporary open-source LLMs across ten tasks, including general knowledge, mathematical reasoning, code generation, and multilingual understanding [31]. They found that the smaller GPT-OSS-20B often outperforms its larger counterpart on tasks such as HumanEval and MMLU, suggesting that scaling sparse architectures does not guarantee proportional performance gains [31]. Both models achieved mid-tier performance relative to the broader open-source landscape, demonstrating strengths in code generation but weaknesses in multilingual tasks [31]. The official OpenAI blog similarly reports that GPT-OSS-120B approaches the performance of the proprietary o4-mini model on reasoning benchmarks while GPT-OSS-20B matches or exceeds o3-mini despite its smaller size [30]. These findings suggest that efficient architectures can rival or even surpass larger models on specific tasks; however, further optimization is necessary to leverage MoE architectures fully. Moreover, independent replication and thorough safety evaluation remain limited because the training data and hyperparameters are not fully disclosed (see Table 6).

Open-weight models enable researchers to inspect weights, fine-tune models on domain-specific data, and run them locally. Their release under Apache 2.0 invites community contributions and facilitates reproducibility, marking a significant shift from closed-weight commercial models. However, sparse architectures can introduce training instability and performance variance across tasks, emphasizing the need for further research on router algorithms and expert specialization.

6. Evaluation challenges and safety considerations

Deploying LLMs in high-stakes domains, such as healthcare, law, and education, requires rigorous evaluation and robust safety frameworks. Conventional automatic metrics are insufficient: summarization evaluations emphasize the need to assess relevancy, hallucination, omission, and fairness [7]. Traditional n-gram overlap metrics often fail to capture summary quality [5], motivating more comprehensive, human-aligned metrics and multi-dimensional assessment [7]. Benchmarks like HELMET introduce application-centric categories and controllable context lengths of up to 128k tokens, demonstrating that synthetic tasks, such as needle-in-a-haystack, are poor predictors of downstream performance and that open-source long-context models lag behind proprietary models on full-context reasoning tasks [17]. These observations underscore that evaluation must reflect real-world complexity and fairness, and that long-context abilities alone do not guarantee safe or accurate outputs.

Long-context evaluation is intertwined with retrieval. The lost-in-the-middle phenomenon means that LLMs rarely utilize information in the middle of a long input [8]; thus, retrieval methods that promote relevant passages to the beginning of the context can boost accuracy. Map–reduce strategies, such as BriefContext, partition the context into chunks, summarize each, and recombine them before answering; this approach mitigates the lost-in-the-middle effect and improves downstream performance at the cost of latency [19]. The Dual RAG framework introduces a two-stage retrieval pipeline with dense embedding search followed by lexical ranking, achieving a 10% improvement in medical question answering over standard RAG models while maintaining acceptable latency [20]. Other studies evaluating retrieval across ten models find that GPT-4-based RAG eliminates hallucinations and surpasses human accuracy on preoperative fitness assessments, suggesting that retrieval can be a critical safety mechanism [21]. Nevertheless, the construction of domain-specific knowledge bases, ranking algorithms, and latency trade-offs remain open research questions [23].
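The two-stage pattern attributed to Dual RAG [20] reduces to dense recall followed by lexical re-ranking. The sketch below is a schematic reconstruction of that idea; the embed() function, pool sizes, and overlap scoring are our assumptions rather than details of the published system.

    # Schematic two-stage retrieval: dense recall, then lexical re-ranking
    # (the general pattern of [20]; embed() and constants are assumptions).
    import numpy as np

    def two_stage_retrieve(query, docs, embed, recall_n=50, final_k=5):
        q = embed(query)
        def dense_score(d):                     # stage 1: cosine similarity
            v = embed(d)
            return float(q @ v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9)
        pool = sorted(docs, key=dense_score, reverse=True)[:recall_n]
        q_terms = set(query.lower().split())
        def lexical_score(d):                   # stage 2: query-term overlap
            return len(q_terms & set(d.lower().split()))
        return sorted(pool, key=lexical_score, reverse=True)[:final_k]

The dense stage supplies semantic recall while the lexical stage restores exact-term precision; the re-ranked passages can then be placed at the start of the prompt, where long-context models retrieve them most reliably.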
Evaluation must also address robustness against adversarial and environmental threats. Data-poisoning attacks, in which a tiny fraction of training tokens are corrupted with misinformation, can cause models to propagate dangerous errors once deployed [32]. Ethical analyses warn that scaling language models without careful dataset curation can amplify biases and environmental costs; Bender et al. argue that LLMs risk becoming "stochastic parrots" and call for transparency in data sources, energy usage, and model governance [33]. Safety frameworks should therefore incorporate dataset audits, adversarial training, and energy-efficient hardware design [34].

Cognitive biases and adversarial prompts present additional risks. BiasMedQA experiments reveal that many models exhibit decreased accuracy when faced with biased medical questions and that mitigation strategies cannot fully restore performance [6]. Multi-model assurance tests demonstrate that LLMs can be highly vulnerable to adversarial hallucination attacks, repeating or elaborating on false clinical details in up to 82% of cases [29]. Prompt engineering can reduce but not eliminate hallucinations [29]. Given these vulnerabilities, developers must implement safeguards such as retrieval-augmented generation, cross-model consensus, and human oversight to mitigate them.


Table 5
Representative mixture-of-experts (MoE) language models. The table lists total parameter counts, approximate active parameters per token, architectural notes, and key innovations. Entries are summarized from published literature.

Model Total params/active params Experts (top-k) Notes and innovations
Switch Transformer [13] 1.6 T/1.6 B 64 experts (top-1) Early sparse MoE model using a simple router to select one expert per token; achieved up to 7× speed-up over dense T5 models while maintaining quality.
GLaM [14] 1.2 T/93 B 64 experts (top-4) Efficient scaling of language models; uses a router to activate four experts per token, reducing training energy by two-thirds relative to GPT-3 and outperforming it across multiple tasks.
GPT-OSS 20B [11] 20.9 B/3.6 B 32 experts (top-4) Open-weight MoE model released under Apache 2.0; 24 transformer layers; supports 128k context via sliding-window attention; runs on a single 24 GB GPU via quantization.
GPT-OSS 120B [11] 116.8 B/5.1 B 128 experts (top-4) Larger open-weight MoE model with 36 layers; emphasizes transparency and reproducibility; performance on par with mid-tier open-source models; supports 128k context.
MoICE [16] –/– dynamic experts Mixture of in-context experts that introduces a router into each attention head to select positional encodings dynamically, improving long-context awareness and reducing the lost-in-the-middle effect.

Table 6
Characteristics of GPT-OSS models. Parameter counts and active parameters are taken from the official model card [11]. Context length refers to the maximum supported context with sliding-window attention.

Model Transformer layers Experts (top-k) Active/total parameters Context
GPT-OSS-20B 24 32 (top-4) 3.6 B/20.9 B 128k
GPT-OSS-120B 36 128 (top-4) 5.1 B/116.8 B 128k
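The hardware footprints reported for these models are easy to sanity-check. The arithmetic below assumes roughly 4 bits (0.5 bytes) per weight after quantization, which is our assumption rather than the exact format of the official release, and it recovers the right order of magnitude.

    # Back-of-the-envelope weight-memory check, assuming ~4-bit quantization
    # (an assumption; the official release specifies its own low-bit format).
    total_params = {"GPT-OSS-20B": 20.9e9, "GPT-OSS-120B": 116.8e9}
    for name, n in total_params.items():
        gb = n * 0.5 / 1e9                      # 0.5 bytes per parameter
        print(f"{name}: ~{gb:.0f} GB of weights")
    # ~10 GB and ~58 GB respectively, consistent with the reported
    # 16 GB-device and 80 GB-GPU deployment targets.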

7. Discussion and future directions

GPT-5 represents a significant step toward the development of general-purpose AI agents. Hierarchical routing enables the dynamic allocation of compute resources, allowing for rapid responses to simple queries while still providing deep reasoning when required. Expanded context windows support complex workflows involving long documents and multimodal inputs. Improved function calling and agentic behavior facilitate integration with external tools, paving the way for autonomous research assistants and sophisticated decision-support systems. These innovations collectively illustrate a shift from static, monolithic models toward adaptive, modular systems that can be tuned to the requirements of a given task and the preferences of a user.

At the same time, responsible deployment demands attention to data provenance, fairness, and environmental sustainability [35]. Data-poisoning experiments demonstrate that even minuscule fractions of malicious training data can lead models to propagate harmful errors [32]. Ethical and ecological analyses caution that scaling models without transparency exacerbates carbon emissions and social biases, urging the community to prioritize dataset documentation and governance [33]. Open-weight models offer one avenue for mitigating these concerns by enabling independent audits and fine-tuning on trusted datasets; however, they also introduce risks of misuse if harmful content is injected during retraining. Balancing openness with safety will require new tools for secure weight distribution, differential privacy, and continuous monitoring, as well as institutional frameworks for accountability.

Long-context models may still suffer from degraded performance on downstream tasks, and evaluation frameworks like HELMET emphasize that synthetic tasks do not reliably predict real-world performance [17]. Cognitive biases and hallucinations persist even in state-of-the-art models [29], highlighting the need for robust mitigation strategies. Future research should explore more effective router algorithms for sparse models, methods to reduce hallucination through uncertainty estimation and retrieval augmentation, as well as fairness-focused training protocols. The integration of retrieval systems with large contexts may provide a path forward: by elevating the density of relevant information and enabling cross-checking of generated answers, retrieval-augmented generation could mitigate some long-context limitations.

The release of open-weight models signals a broader shift toward transparency. By allowing for inspection and fine-tuning, GPT-OSS empowers researchers to build domain-specific applications and evaluate their safety. Comparative benchmarks show that smaller, sparse models can outperform larger ones on specific tasks, suggesting that efficient architectures may offer better trade-offs than brute-force scaling [30,31]. Open-weight progress will only translate into generalizable capability if the community can compare models on standardized, open benchmarks that isolate three levers of performance: agentic behavior (long-horizon planning and self-correction), tool use (accurate selection and safe invocation of external tools), and retrieval strategies (grounded use of evidence under distractors). We propose an open protocol: frozen corpora and tool sandboxes; public task scripts with deterministic scoring; and reports that separate capability, reliability, efficiency, and safety. Example tasks include constrained multi-step planning with visible intermediate states, tool-mediated math and code execution with exact outputs, and citation-required QA over a time-stamped corpus with adversarial distractors. These tasks would demonstrate whether models can sustain goals over many steps, pick the right tool at the right time, and attribute claims to sources without hallucination, clarifying which architectural choices (for example, sparse routing, hybrid dense–sparse, adaptive compute) provide better efficiency–performance trade-offs. The following subsection expands on this viewpoint.

7.1. Open benchmarks for agentic behavior, tool use, and retrieval

Open benchmarks for agentic behavior should make progress measurable, comparable, and reproducible across research groups. The core design goals are openness, reproducibility, principled decomposition of metrics, and realism with experimental control. Openness requires that tasks, corpus snapshots, and scoring harnesses be public and available under permissive licenses. Reproducibility is ensured by fixed random seeds, frozen tool APIs, and time-stamped document collections that can be restored exactly. Decomposition means reporting capability, reliability (variance across seeds), efficiency (latency, tokens, and cost), and safety (policy breaches) separately, rather than relying on a single aggregate score. Realism with control is achieved by designing tasks that reflect common research and engineering workflows, while retaining deterministic pass or fail criteria.


The benchmark is organized into suites that target complementary deviation across multiple seeds, efficiency evaluated by latency, tokens,
dimensions. An agentic behavior suite evaluates long-horizon planning tool calls, and wall-clock time, and safety assessed by violations per
and self-correction using multi-step problems with observable inter- hundred tasks. Ablations that disable tools or retrieval isolate the
mediate states and programmatic checkers. A tool-use suite measures source of gains. For open-weight systems, prompts, routing policies,
selection and invocation across a calculator, Python, shell, SQL, and and inference settings are released to enable independent replication.
local search over a frozen index, with all calls logged and scored. A Such a benchmark would show whether agentic behavior sustains
retrieval suite tests grounded question answering and synthesis with goals over ten to thirty steps, repairs plans after failures, and optimizes
questions answerable only from a frozen, time-stamped corpus; re- under explicit constraints. It would quantify accuracy in choosing
sponses must include document identifiers and line ranges to make and invoking tools and reveal whether hybrid or sparse architectures
attribution auditable. Safety and robustness overlays impose permission deliver lower cost for equal or higher success. It would also measure
boundaries on tools, introduce red-team prompts, simulate degraded groundedness, multi-hop reasoning, and robustness to distractors and
tools, and insert distractor documents to probe failure modes. counterfactuals in retrieval-centric settings [36]. Taken together, these
Example tasks, scoring, and what they show are found below: results would identify design choices that improve the efficiency versus
performance tradeoff beyond brute-force scaling.
1. Constrained itinerary planning (agentic). Input: cities, budget,
hard constraints (for example, ‘‘arrive before 10:00, total cost
8. Conclusion
< 𝑋’’). The checker verifies all constraints and computes an
optimality gap. Metrics: success rate, average constraint viola-
tion, number of replans, and token usage. Shows sustained goal This paper reviews the architectural innovations and empirical per-
pursuit and plan repair under constraints. formance of GPT-5 and analyzes OpenAI’s release of open-weight GPT-
2. Repository bug fix (agentic + tool use). Given a small repository OSS models. GPT-5 introduces hierarchical routing, expanded context
with a failing unit test, the model may run Python tests and windows, improved tool use, and enhanced agentic behavior, collec-
edit files via a sandboxed tool. Pass when all tests succeed tively representing a move toward adaptive, modular language systems.
and diffs touch only allowed files. Metrics: pass@k over seeds, Empirical results indicate that GPT-5 surpasses GPT-4 on a range of
steps to green, unsafe edit attempts. Demonstrate decomposition, academic and medical benchmarks, often exceeding human-expert per-
hypothesis–test loops, and disciplined environment interaction. formance on specialized tasks. The concurrent release of open-weight
3. Data-to-insight notebook (agentic + tool use). Provide a CSV GPT-OSS models demonstrates the viability of mixture-of-experts ar-
and a research question (for example, a difference-in-differences chitectures and increases transparency by making model weights pub-
analysis). The model may use Python to produce code and a licly accessible. However, these advances do not obviate persistent
short conclusion. Auto-grade against reference statistics and plot challenges: hallucinations, cognitive biases, long-context degradation,
properties. Shows iterative analysis quality and correct statistical and the lack of robust, human-aligned evaluation metrics all remain
tool choice. open problems. Responsible development will require balancing per-
4. Grounded multi-hop QA (retrieval). Questions require joining formance with safety and transparency, investing in evaluation and
two to three documents in a frozen snapshot (for example, a mitigation strategies, and fostering an open research culture to ensure
Wikipedia YYYY–MM dump). Answers must cite document IDs that generalist AI agents serve the public interest.
and line ranges. The checker computes citation precision/re-
call, as well as exact match/F1, with near-duplicate distractors. Declaration of competing interest
Shows attribution, multi-hop reasoning, and distractor resis-
tance. The authors declare that they have no known competing finan-
5. Claim verification (retrieval). Present atomic claims. The model cial interests or personal relationships that could have appeared to
must label supported, contradicted, or not found, and cite ev- influence the work reported in this paper.
idence. Metrics: confusion matrix, calibration (Brier score) on
support probabilities, and citation correctness. Shows faithful Data availability
grounding and calibrated uncertainty.
6. Tool selection under ambiguity (tool use). Provide tasks that can No data was used for the research described in the article.
be solved by multiple tools (for example, long-integer arithmetic
versus code execution). Score minimal correct toolsets, argument
formatting correctness, and recovery when a tool is temporar- References
ily unavailable. Shows cost-aware decision-making and graceful
degradation. [1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser,
7. Policy-constrained tool calls (safety overlay). Only approved paths or hosts are whitelisted; blocked attempts are logged. Metrics: violations per 100 tasks, near misses, and recovery after denial (a minimal guardrail wrapper is sketched after this list). Shows adherence to guardrails while preserving task success.
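Item 2's headline metric can be computed with the standard unbiased pass@k estimator. The following minimal Python sketch (the function name is illustrative) takes n total runs, c passing runs, and an evaluation budget k:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k runs drawn without replacement
    # from n sampled runs, c of which pass, is a passing run.
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing run
    return 1.0 - comb(n - c, k) / comb(n, k)

For example, pass_at_k(10, 3, 1) returns 0.3, which is simply the per-run success rate, while larger k credits models that succeed on at least one of several attempts.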
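The scoring sketch referenced in items 4 and 5 reduces to set overlap for citations and mean squared error for calibration. A minimal version, assuming citations are normalized to hashable IDs and support probabilities are scored against 0/1 outcomes (function names are illustrative):

def citation_prf(predicted: set, gold: set):
    # Precision/recall/F1 over cited document IDs
    # (or ID-plus-line-range pairs).
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

def brier_score(probs, labels):
    # Mean squared error between predicted support probabilities
    # and binary supported/not-supported outcomes; lower is better.
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)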
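The guardrail wrapper referenced in item 7 can sit entirely outside the model, vetting every tool call before execution. A minimal sketch, assuming a static host whitelist and a pluggable fetch tool (guarded_fetch and ALLOWED_HOSTS are hypothetical names, not part of any released harness):

import logging
from urllib.parse import urlparse

ALLOWED_HOSTS = {"example.org"}  # hypothetical whitelist
log = logging.getLogger("tool_policy")

def guarded_fetch(url: str, fetch):
    # Execute the fetch tool only for whitelisted hosts; log blocked
    # attempts so the harness can count violations per 100 tasks.
    host = urlparse(url).hostname or ""
    if host not in ALLOWED_HOSTS:
        log.warning("blocked tool call to host %r", host)
        raise PermissionError(f"host {host!r} is not whitelisted")
    return fetch(url)

Raising an error rather than failing silently lets the model observe the denial and attempt recovery, which the recovery-after-denial metric rewards.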
Sparks of artificial general intelligence: Early experiments with GPT-4, 2023,
Corpora, tools, and the harness are standardized to support exact arXiv preprint arXiv:2303.12712, URL https://arxiv.org/abs/2303.12712.
replication. Corpora are frozen and time-stamped, such as a YYYY–MM [4] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, K. Guu, M. Lewis, et al.,
Chain-of-thought prompting elicits reasoning in large language models, 2022,
slice of Wikipedia or arXiv with stable document identifiers. The tool arXiv preprint arXiv:2201.11903, URL https://arxiv.org/abs/2201.11903.
environment is delivered as a Dockerized sandbox that includes a calcu- [5] L. Tang, Z. Sun, B. Idnay, J.G. Nestor, A. Soroush, P.A. Elias, Z. Xu, Y.
lator, Python, SQL over a fixed toy database, and a local search index. Ding, G. Durrett, J.F. Rousseau, C. Weng, Y. Peng, Evaluating large language
The scoring harness is published with input schemas, tool catalogs, state models on medical evidence summarization, Npj Digit. Med. 6 (2023) 158,
http://dx.doi.org/10.1038/s41746-023-00896-7.
and action logs, and unit tests for every task.
[6] S. Schmidgall, C. Harris, I. Essien, D. Olshvang, T. Rahman, J. Kim, R. Ziaei,
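One way to make this standardization concrete is for each task to ship as a small manifest that pins the corpus snapshot, the allowed tools, and the I/O contract. A minimal sketch of such a record in Python (all field names are illustrative; the YYYY–MM placeholder mirrors the frozen-snapshot convention above):

# One hypothetical entry in the published task catalog; the harness
# validates every run against it.
TASK_MANIFEST = {
    "task_id": "multihop-qa-0042",
    "corpus": "wikipedia-YYYY-MM",  # frozen, time-stamped snapshot
    "tools": ["calculator", "python", "sql", "search"],  # fixed sandbox
    "input_schema": {"question": "str"},
    "output_schema": {"answer": "str", "citations": "list[str]"},
    "logs": ["state", "actions"],  # recorded for exact replay
}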
The reporting protocol emphasizes comparability and ablation. For each model, the report includes capability, measured by success rate or exact match as well as F1 score, and reliability, assessed by standard deviation across seeds.
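Under these conventions, a report row reduces to a per-seed aggregation. A minimal sketch, assuming the harness exposes one boolean success flag per task run, keyed by seed (names are illustrative):

from statistics import mean, stdev

def summarize(success_by_seed: dict):
    # success_by_seed maps seed -> [True/False per task]; returns the
    # capability (mean success rate) and reliability (spread over seeds).
    rates = [mean(runs) for runs in success_by_seed.values()]
    return {
        "success_rate": mean(rates),
        "std_across_seeds": stdev(rates) if len(rates) > 1 else 0.0,
    }

For example, summarize({0: [True, True, False], 1: [True, False, False]}) reports a 0.5 success rate together with the spread between the two seeds.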
Conclusion

This paper reviews the architectural innovations and empirical performance of GPT-5 and analyzes OpenAI's release of open-weight GPT-OSS models. GPT-5 introduces hierarchical routing, expanded context windows, improved tool use, and enhanced agentic behavior, collectively representing a move toward adaptive, modular language systems. Empirical results indicate that GPT-5 surpasses GPT-4 on a range of academic and medical benchmarks, often exceeding human-expert performance on specialized tasks. The concurrent release of open-weight GPT-OSS models demonstrates the viability of mixture-of-experts architectures and increases transparency by making model weights publicly accessible. However, these advances do not obviate persistent challenges: hallucinations, cognitive biases, long-context degradation, and the lack of robust, human-aligned evaluation metrics all remain open problems. Responsible development will require balancing performance with safety and transparency, investing in evaluation and mitigation strategies, and fostering an open research culture to ensure that generalist AI agents serve the public interest.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

No data was used for the research described in the article.