
Fine-Tune it Like I'm Five: Supporting Medical Domain Experts in Training NER Models using Cloud, LLM, and Auto Fine-Tuning

1st Benedict Hartmann
Faculty of Mathematics and Computer Science, University of Hagen, 58097 Hagen, Germany
[email protected]

2nd Philippe Tamla
Faculty of Mathematics and Computer Science, University of Hagen, 58097 Hagen, Germany
[email protected]

3rd Florian Freund
Faculty of Mathematics and Computer Science, University of Hagen, 58097 Hagen, Germany
[email protected]

4th Matthias Hemmje
Faculty of Mathematics and Computer Science, University of Hagen, 58097 Hagen, Germany
[email protected]

Abstract—This paper presents a system to manage, train, and optimize Named Entity Recognition models using Cloud resources, named CRM4NER. This system explores features to perform model fine-tuning using automatic parameter searches and context-aware fine-tuning recommendations using a Large Language Model. A goal is to support domain experts in the difficult task of training and fine-tuning Machine Learning models, which often requires expertise that domain experts lack. The system also includes functionalities to manage models and training data for the entire usage life-cycle and is integrated into a knowledge management system. By providing Cloud-based storage and training, domain users are further supported through scalable compute and storage for advanced Machine Learning workloads, such as GPU-based Transformer models, for improved performance.

Index Terms—Cloud Resource Management, Deep Learning, Named Entity Recognition, Transformer, Cloud Computing, Microservice Architecture

I. INTRODUCTION AND MOTIVATION

Named Entity Recognition (NER) is a natural language processing technique that identifies and classifies entities, such as names of people, organizations, and locations, within a text [28]. This paper addresses the development of a microservice-based application to manage Amazon Web Services (AWS) Cloud resources for NER model training. It also involves integrating the application into a medical knowledge management system. Electronic Health Records (EHRs) contain textual data about patients such as treatment history, past conditions, and biographical information. The introduction of EHRs in the medical domain leads to positive developments in patient care [10]. EHRs can improve the productivity of operations through reduced administrative overhead and improve care quality through improving patient safety [10]. However, the rapid adoption of this technology leads to an increased volume of data [39]. This volume also leads to Information Overload (IO) [27]. In IO, the complexity and abundance of information and data overwhelm the user and can lead to lower decision speed and quality [27]. For example, a physician may have access to a large amount of EHR data and may not be able to review this information effectively [33]. The field of Information Retrieval (IR) attempts to address the challenges of IO. However, with traditional IR systems such as search engines, it becomes increasingly difficult to solve IO challenges [44]. NER constitutes a vital component of Natural Language Processing (NLP) and offers support to IR systems [44]. Named Entities (NE) encompass textual elements that can be categorized into specific classes, such as individuals, locations, or organizations, with the particular categories depending on the application context [29]. In the domain of medicine, entities may include drugs, medical conditions, or frequency descriptors [18]. NER leverages Machine Learning (ML) methods [44] to process these entities. The data processed by NER systems can be used to address challenges such as IO. In the medical domain, decision support systems and data analytics [25] could use this data to improve treatment decisions and care quality. ML systems often require large amounts of compute resources and data [15], [43], [46]. Cloud Computing (CC) provides techniques to provision resources [26], [42]. It enables users to run scalable compute services without the need to manage infrastructure [13]. CC may be used to complement NER and can solve challenges such as storing and processing the large amounts of data required for NER [43]. Recent scientific developments such as Transformers [45] have led to exceptional results in NLP, including NER [28]. Furthermore, the recent release of the GPT-4 model demonstrates impressive performance on different language tasks and provides a basis for further applications using Large Language Models (LLMs) [36].

We will now motivate our work by introducing relevant research and development associated with the challenges of applying NER in the medical domain. Artificial Intelligence for Hospitals, Healthcare and Humanity (AI4H3) [22] presents an architectural proposal designed to support medical domain experts through the integration of Artificial Intelligence (AI) technology. The Content and Knowledge Management Ecosystem Portal (KM-EP), developed at the University of Hagen within the Chair of Multimedia and Internet Applications, adopts the platform architecture of AI4H3. KM-EP has previously been employed in projects like MetaPlat [48] and SenseCare [16]. This paper's objective is to seamlessly integrate and further develop a cloud-based NER application into KM-EP. The Framework Independent Toolkit for Named Entity Recognition (FIT4NER) [19] extends the AI4H3 architecture, focusing on supporting medical experts in NER tasks. Cloud-based Information Extraction (CIE) [43] extends FIT4NER by offering Cloud infrastructure for NER tasks. Our system is built upon the CIE architecture. The initial implementation documented in [21] covered only a subset of the use cases outlined in CIE and lacked integration with KM-EP. This earlier implementation also revealed several functional deficiencies that should be solved. Integrating these systems would promote usability and collaboration on a shared platform. ML model Hyperparameters (HPs) play a crucial role in controlling the model learning process [12]. Novice users may face challenges in fine-tuning Transformer models due to the high technical and conceptual knowledge required for configuring and selecting appropriate HPs. They may also lack the knowledge to configure Cloud resources for NER models on a Cloud provider like AWS. This leads to the following Research Questions (RQs):

RQ1: How can a system to manage AWS Cloud resources for NER be expanded and further improved?
RQ2: How can users be better supported in fine-tuning Transformer models?
RQ3: How can the system be integrated into a Knowledge Management System such as KM-EP?

To address these RQs, we employ the well-established Nunamaker methodology [35], a systematic approach to developing information systems. This methodology involves delineating specific Research Objectives (ROs) for each RQ, spanning observation, theory building, implementation, and evaluation stages. Our defined Research Objectives (ROs) are categorized as follows: a) Observation: Review key concepts of NER, Transformers, and Cloud technology; investigate existing fine-tuning strategies and best practices; analyze relevant KM-EP components for expansion. b) Theory Building: Develop models to address the deficiencies identified in the initial implementation [21]; enhance our previous model, including use cases, components, and overall architecture. c) Implementation: Realize the NER training system functionality; implement new GUI and backend components; seamlessly integrate into KM-EP. d) Evaluation: Assess the enhanced system performance; evaluate the new optimization components; examine the system integration within KM-EP.

In this section, we motivated our work, introduced related research, and defined a structured approach to our work. In the next section, we review the relevant state of the art, including NER, fine-tuning strategies, and technologies associated with our system integration in KM-EP.

II. STATE OF THE ART IN SCIENCE AND TECHNOLOGY

NER Techniques: NER is a common method used in IR systems to support Information Extraction (IE) [11]. The term "Named Entity Recognition" was coined at the Sixth Message Understanding Conference (MUC) in 1996 [20]. NER is a subfield of Natural Language Processing (NLP) that focuses on identifying entities in unstructured text data. NER techniques have evolved from rule-based processing and gazetteers [28], [32] to unsupervised ML approaches like clustering [28], [32], [50]. Common supervised NER machine learning techniques include Hidden Markov Models, Decision Trees, Support Vector Machines, and Conditional Random Fields [28], [38]. Recent years have seen the dominance of Deep Learning-based NER models, particularly using Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) architectures [28], [45]. Transformers, introduced by Vaswani et al. [45], have emerged as state-of-the-art models in NER. The NER on CoNLL 2003 (English) leaderboard ranks NER model performance by F1 score [30]. According to the May 2023 leaderboard, the top 10 models exclusively use Transformer architectures or hybrid Transformer approaches. Due to the rapid advancements in NER techniques, selecting the optimal method is challenging. Consequently, we have opted to implement Transformers in our improved system, given their strong performance in current research.
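For reference, the F1 score used by the leaderboard, and later in our evaluation, is the harmonic mean of precision and recall. The compact definition below is the standard formulation and is not quoted from [30]:

    \[
    P = \frac{TP}{TP + FP}, \qquad
    R = \frac{TP}{TP + FN}, \qquad
    F_1 = \frac{2 \cdot P \cdot R}{P + R}
    \]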
NER Frameworks & Technologies: NER is commonly implemented using software frameworks. In this paper, we have chosen the spaCy framework [3], which we successfully applied in our previous system [21]. spaCy, an open-source framework [4], was developed in 2015 by Honnibal and Montani at Explosion [23]. It is a popular NLP framework used in various NER studies [41] and production environments [6], [7]. Hugging Face (HF) [1] is a Software as a Service (SaaS) provider specialized in ML tasks. HF offers a public Transformer library [1] for creating, training, and using Transformer models. spaCy features an HF integration [17], allowing users to use HF models as regular pipeline components for NLP tasks.
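To illustrate how spaCy exposes NER results, the following minimal Python sketch loads a pretrained German pipeline and prints the recognized entities. The model name and the example sentence are illustrative and are not part of the CRM4NER code base:

    import spacy

    # Load a pretrained German pipeline (illustrative model name).
    nlp = spacy.load("de_core_news_md")

    # Run the full pipeline, including the NER component, on a sample sentence.
    doc = nlp("Der Patient erhaelt 40 mg Pantoprazol einmal taeglich.")

    # Each recognized span carries its text and predicted entity label.
    for ent in doc.ents:
        print(ent.text, ent.label_)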
Hyperparameter Optimization Strategies: When selecting model HPs, users face several challenges. First, the effective selection of HPs requires significant ML domain expertise [9]. Second, the definitive effect of HPs on model performance is difficult to predict, which may require trial and error [24]. Third, there is a large number of parameters and combinations, making it difficult to select HPs [9]. We introduce several strategies for optimizing model HPs that can support users in addressing these challenges:
Providing well-working default parameters (S1) [9]: Offering default parameters that have demonstrated strong performance in various applications and research studies, simplifying user choices.

Indicative scoping through useful context information (S2) [14]: Supporting users in fine-tuning hyperparameters by providing informative descriptions and interactive GUI elements, enhancing user understanding.

Meta-Learning (S3) [24]: Meta-Learning is a sub-field of ML in which the goal is to optimize the process of learning itself [24]. Hospedales et al. mention Bayesian Meta-Learning for the optimization of HPs through parameter search. Weights & Biases (W&B) is a Cloud service provider that offers utilities for parameter search through Parameter Sweeps. W&B provides Bayesian parameter search and analysis utilities for ML tasks (a minimal sweep sketch follows this list).

Random Parameter Search (S4) [24]: Conducting random parameter searches followed by promotion of the best-performing parameters, offering a less structured but potentially effective approach. W&B provides random and grid search functionalities.

Leaving fine-tuning to dedicated experts (S5) [9]: Recognizing that hyperparameter fine-tuning may be best handled by experts with extensive experience in optimization.

LLM-supported parameter search (S6): Leveraging LLMs such as OpenAI GPT-4 [36] to provide context on hyperparameter values and model training performance to suggest improvements.
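As an illustration of strategies S3 and S4, the sketch below defines a W&B sweep over a few spaCy-style hyperparameters and launches agents that call a user-supplied training function. The parameter names, value ranges, metric name, and project name are assumptions for illustration and do not reproduce the configurations used in our experiments:

    import wandb

    # Sweep definition: Bayesian search (S3); "random" or "grid" would cover S4.
    sweep_config = {
        "method": "bayes",
        "metric": {"name": "ents_f", "goal": "maximize"},
        "parameters": {
            "dropout": {"min": 0.05, "max": 0.5},
            "learn_rate": {"min": 1e-5, "max": 1e-3},
            "batch_size": {"values": [128, 256, 512]},
        },
    }

    def train_run():
        # Placeholder training function: read wandb.config, train a spaCy model
        # with these hyperparameters, and log the resulting F-score.
        run = wandb.init()
        f_score = 0.0  # ... train and evaluate the model here ...
        wandb.log({"ents_f": f_score})

    sweep_id = wandb.sweep(sweep_config, project="crm4ner-sweeps")
    wandb.agent(sweep_id, function=train_run, count=6)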
KM-EP Components: KM-EP is a web content and knowledge management ecosystem portal utilized in numerous research projects, such as RecomRatio [47] and SenseCare [16]. Its key components encompass the Content Manager, the Content Management System (CMS) for content customization, the Ingest Component for content imports, the Authentication Component for user management, and SNERC—an App for customizing and training NER models using Stanford CoreNLP. KM-EP was developed using the widely-used Symfony framework [5] with PHP and MySQL [2]. Our initial application [21] leveraged AWS resources for Cloud-based NER model training with spaCy customization. It offered a GUI for configuring compute resources and consisted of two primary components: a standalone Model-View-Controller (MVC)-based application and an AWS system for ML model training. Both our initial system and KM-EP adhere to a microservices architecture, enabling seamless integration. KM-EP supports App integration, including SNERC [44], through isolated components, and both applications facilitate RESTful communication. Our objective is to achieve a seamless integration of our app into KM-EP, as detailed in the following section.

III. DESIGN AND MODELING

We employed User Centered System Design (UCSD) [34], a proven methodology for developing information systems, to guide our system's design and development. UCSD prioritizes understanding user needs and requirements to inform design decisions, ensuring the system meets user needs effectively. To apply UCSD, we defined the system's use context, considering insights from our previous implementation [21] and the CIE overall architecture [43]. Our system serves medical experts and NER experts, with medical experts having minimal background in NER and CC, while NER experts possess varying NER and CC knowledge. Both groups aim to train NER models, accessing the system through KM-EP. In addition to the CIE use cases [43], we introduced additional use cases to optimize the NER training process for users. The additional use cases are visible in Figure 2. "Model and Data Management": models and corpus data should be persisted and available for further use. "Selection of compute profiles": the possibility of selecting predefined compute resources without the need to specify them in detail. "Perform NER": the ability to perform NER on text directly in the application. "Training metric viewing": users can view detailed information about model training such as Precision, Recall, F1-Score, memory utilization, and system logs. Additionally, the following new use cases support users specifically in improving their NER models. "Automatic Parameter Searches": the ability to automatically search for hyperparameters using meta-learning optimization algorithms or random search to improve models easily; this is an extension of the CIE use case Train Model in Cloud [43]. "Hyperparameter recommendations via LLM": users should receive fine-tuning recommendations for their model via an LLM, specifically value suggestions for spaCy configuration parameters, to directly support them in improving their models. Integrating W&B into the training system and configuring the UI forms supports strategies S1-S4, while S5 is excluded. S6 requires user-initiated optimization recommendation requests via the UI: the system could compile context information, including the spaCy configuration, corpus metadata, compute resources, and previous training results, to improve recommendations following the LLM prompt guidelines by White et al. [49]. Users receive fine-tuning recommendations and apply them during model training. The system manages and persists models, training data, and training run information.
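The following sketch outlines how such context information could be assembled into a single recommendation request against the OpenAI GPT-4 API, using the pre-1.0 openai Python client. The prompt wording, field names, and helper structure are illustrative assumptions rather than the exact prompts used by CRM4NER:

    import json
    import openai  # assumes openai.api_key has been configured elsewhere

    def recommend_hyperparameters(spacy_config: str, corpus_meta: dict,
                                  history: list) -> str:
        # Compile the model context the recommendation should be based on.
        context = {
            "spacy_config": spacy_config,   # current training configuration
            "corpus": corpus_meta,          # e.g. language, record count
            "previous_runs": history,       # earlier scores and settings
        }
        messages = [
            {"role": "system",
             "content": "You are an assistant that suggests spaCy hyperparameter "
                        "values for NER training and explains each suggestion."},
            {"role": "user",
             "content": "Given this training context, recommend improved "
                        "hyperparameter values:\n" + json.dumps(context)},
        ]
        response = openai.ChatCompletion.create(model="gpt-4", messages=messages)
        return response["choices"][0]["message"]["content"]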
Our improved system primarily manages Cloud resources to provide functionalities around NER. Therefore, we name the system Cloud Resource Management for Named Entity Recognition (CRM4NER), reflecting its core functionality. The CRM4NER system architecture includes: a) an additional KM-EP Component for Symfony, providing the GUI structure, persistence using the KM-EP database, and request handling; the App should leverage the KM-EP CMS elements and Authentication and should follow a similar layout as SNERC. b) A Service Controller for API calls to remote services. c) A Metrics Component for collecting and processing training metrics using W&B. d) An LLM Provider component for gathering model context information and offering hyperparameter recommendations. e) An adaptation of the previous system [21], using an AWS Cloud Component and Training Container for model training in AWS, to support parameter searches with W&B and spaCy. f) Microservice interaction with Cloud and SaaS services via API calls. This design maintains established MVC and component patterns. The component diagram is shown in Figure 1.
Fig. 1. CRM4NER Component diagram.

Fig. 2. CRM4NER System Use Case Diagram.

Fig. 3. CRM4NER Model overview page.

IV. IMPLEMENTATION

This section outlines the system's implementation, where CRM4NER serves as an integrated extension within KM-EP. This extension encompasses various Symfony components, including Controllers, Services, and Entities, along with Twig templates for user interface rendering. KM-EP primarily handles the core responsibilities of the Model-View-Controller (MVC) architecture. The extension communicates with a CRM4NER Service Controller microservice container via an API. This microservice is responsible for critical application logic, such as initiating NER training and interacting with AWS, W&B, and the OpenAI GPT-4 API. The CRM4NER Service Controller is an adaptation of the Flask application from the previous system [21], transformed into an API provider. The user interface has been removed, while the service encompasses Model, Job, and TrainingData classes to execute application logic and interact with the AWS API. The Cloud component relies on AWS Batch [8] for container runtime services, utilizing AWS Fargate for CPU training and AWS EC2 g4dn virtual machines featuring NVIDIA T4 GPUs for GPU Transformer training. AWS EC2 serves as the storage solution. Notably, the training container includes two significant enhancements: the metrics component, configured as a W&B environment, collects training metrics, offers data visualizations, and facilitates parameter searches. Additionally, a report overview has been integrated into the KM-EP platform on a separate page. The OpenAI GPT-4 API is employed as the LLM provider for fine-tuning optimizations.

Figure 3 showcases the CRM4NER Model Overview page, serving as the integrated landing page within KM-EP. This page presents a table of pre-trained models, offering green-tabbed options below the table for various configurations: Tab 1 for training models, Tab 2 for automatic parameter sweeps, Tab 3 for obtaining optimization recommendations via GPT, and Tab 4 for performing NER tasks. Selecting Tab 1 opens a GUI form in which users can choose default or custom parameters for model training and select a compute profile. Tab 2 allows configuration and execution of automatic parameter searches (e.g., Bayes, random, or grid search) within a defined parameter space; selecting it opens a configuration form to specify the search parameters. Fine-tuning recommendations are readily available on the same page upon user request. The microservice component initiates cloud-based container computations. The system gathers context information from the KM-EP database and AWS object storage, including historical scores, configurations, and dataset samples. This data is leveraged to compile prompts within the microservice component, subsequently providing optimization suggestions through the user interface, accompanied by an explanation for recommended values.
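A minimal sketch of how such a Service Controller endpoint could hand a training request to AWS Batch is shown below. The route name, queue and job definition identifiers, region, and parameter passing are assumptions for illustration and do not reproduce the actual CRM4NER implementation:

    import json
    import boto3
    from flask import Flask, request, jsonify

    app = Flask(__name__)
    batch = boto3.client("batch", region_name="eu-central-1")

    @app.route("/jobs", methods=["POST"])
    def submit_training_job():
        # Forward the training configuration to a container job in AWS Batch.
        payload = request.get_json()
        response = batch.submit_job(
            jobName=payload.get("model_name", "crm4ner-training"),
            jobQueue="crm4ner-gpu-queue",            # assumed queue name
            jobDefinition="crm4ner-spacy-training",  # assumed job definition
            containerOverrides={
                "environment": [
                    {"name": "TRAIN_CONFIG", "value": json.dumps(payload)},
                ]
            },
        )
        return jsonify({"jobId": response["jobId"]})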
V. EVALUATION

This section evaluates the integration of our system into KM-EP. To perform the evaluation, we conduct a qualitative Cognitive Walkthrough (CW) [37] of the CRM4NER system and a quantitative experiment comparing the model scores achieved with the optimization component, in order to assess the feasibility of the optimization features.
A. Qualitative evaluation

In a Cognitive Walkthrough (CW), experienced experts conduct an analysis of the system by employing defined user tasks and goals. This assessment aimed to identify potential usability issues within the system. The CW was conducted by a Ph.D. student and a Postdoc, both well-versed in the development of Named Entity Recognition (NER) systems. Several deficiencies were identified during the CW:

Regarding the use case Configuration of NER Training [43], it was noted that the form tabs used for configuring training and parameter searches lacked sufficient title information and descriptions and used generic and varying terminologies. This hindered users' comprehension of the page context across pages and was subsequently adjusted. Notably, the configuration of the pretrained Huggingface Transformer currently relies on a simple text field and a reference link to the Huggingface library. This can be enhanced by directly integrating the field with the Huggingface API; as an initial improvement, the link results are now dynamically filtered for the model language. Concerning the use case Model and Data Management, it was observed that the page navigation structure is not sufficiently streamlined and deviates from the established workflow of existing KM-EP components. Additionally, the configuration for the Automatic Parameter Searches use case was fixed to only three parameter fields, restricting users from configuring additional or fewer search parameters. Furthermore, these fields exclusively accepted ranges and lacked support for arrays of absolute values (e.g., dropout [0.1, 0.2]). This form was subsequently adjusted into an inline code editor, empowering expert users to perform advanced sweep configurations. Providing a pre-filled automatic search configuration also serves as a helpful reference starting point. For the Selection of Compute Profiles use case, explicit text information was added to clarify that resources are provided by the AWS cloud. Furthermore, the costs associated with specific profiles were included within the selection choices instead of being presented separately. Regarding the Perform NER use case, descriptions of the processing stages of the training were adjusted to be shown directly on the page instead of behind an obscuring hover icon. Finally, the Hyperparameter Recommendations via LLM GUI textarea should include a markup parser for improved readability.

Three domain experts in Health Systems Management, Biology, and Education with no knowledge in CC and little to no experience in ML were instructed to train a CPU model on a provided corpus file and perform NER on a section of text. Provided with the CRM4NER system and no further instructions, all participants were able to complete the tasks. However, one participant unknowingly trained a GPU model instead. This shows initial usability of the system for its core use cases, while highlighting some difficulty that domain experts face.

B. Quantitative evaluation

To assess the validity and feasibility of CRM4NER's optimization features, including automatic parameter search and hyperparameter recommendations via Large Language Models (LLMs), we will compare their performance with respect to Named Entity Recognition (NER) model F-score. The F-score combines model Precision and Recall on a scale of 1 (perfect) to 0 (worst) and is computed according to Moosavi et al. [31]. Since spaCy's default parameters are chosen by NLP specialists, training models with these defaults serves as a suitable benchmark for evaluating our optimization methods. An optimization method is deemed valid if it consistently produces scores superior to those of the default model, and feasible if the improvements are substantial and can be achieved with reasonable user effort. We formulate the following hypothesis for our experiment:

Hypothesis (H1): Models trained using CRM4NER's optimization methods, GPT-4 recommendations and Parameter Auto Search, will consistently yield significantly higher scores compared to models using the default spaCy parameters, for both spaCy CPU and GPU Transformer models, all trained on the same GERNERMED corpus [18]. This corpus, consisting of 8599 records from the medical domain in the German language, will be used for training. We will compare the achieved scores across various optimization methods, including default spaCy parameters, random search, Bayes search, and GPT-4 recommendations. Several search configurations with different parameter counts will be constructed to explore method performance beyond a single parameter space.

For GPT recommendations, we will investigate two variants: "limited context" and "full context". In the limited context, GPT-4 will only have access to the latest training configuration, score, and metadata. In the full context, GPT-4 will have access to the complete configuration, score, and metadata history of the model, as well as its own previous recommendations. This exploration aims to discern any performance differences when GPT-4 has access to its own prior recommendations versus having only one-shot information.
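To make the two variants concrete, the sketch below shows one way the request context could be restricted or expanded. The structure of the run records is an assumption for illustration:

    def build_context(runs: list, own_recommendations: list, full_context: bool) -> dict:
        """Assemble the information given to GPT-4 for a recommendation request.

        runs: chronological list of dicts such as
              {"config": {...}, "score": 0.813, "metadata": {...}}.
        """
        if full_context:
            # Full context: the entire configuration/score/metadata history plus
            # the model's own previous recommendations.
            return {"runs": runs, "previous_recommendations": own_recommendations}
        # Limited context: only the latest training configuration, score,
        # and metadata (one-shot information).
        return {"runs": runs[-1:], "previous_recommendations": []}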
The optimization methods will be applied to both the standard spaCy CPU model architecture and the GPU Transformer model using the pretrained GottBert model [40] available on Huggingface. We will execute six attempts for each method, a workload considered manageable for real users conducting fine-tuning in an organizational setting with a comparable corpus, where time and resources are important considerations.

1) spaCy CPU architecture model optimization results: In the CPU experiment, the default spaCy parameters yielded a model score of 0.828. Both GPT-4 recommendation methods achieved slightly improved scores, with the highest model score being 0.831. The automatic parameter search methods did not result in an improvement over the default spaCy parameter score. The detailed data is visually represented in Figure 4 and can be found in Tables I and II. The 4-Parameter Bayes search produced the lowest scores.

2) Transformer model optimization results: In the GPU Transformer model experiment, the spaCy default parameters for Transformer training resulted in a score of 0.813. Notably, GPT-4 Limited Context I and II, as well as the 6-Parameter Random Search, were able to achieve significantly higher scores.
TABLE I
CPU FINE-TUNING METHOD PERFORMANCE DATA (PART I).

Attempt | 3-Parameter Bayes search (config I) | GPT-4 Full Context | GPT-4 Limited Context
1       | 0.822 | 0.826 | 0.826
2       | 0.816 | 0.831 | 0.825
3       | 0.821 | 0.829 | 0.828
4       | 0.822 | 0.827 | 0.827
5       | 0.822 | 0.826 | 0.818
6       | 0.820 | 0.819 | 0.826

TABLE II
CPU FINE-TUNING METHOD PERFORMANCE DATA (PART II).

Attempt | 3-Parameter Random search | 3-Parameter Bayes search (config II) | 4-Parameter Bayes search | 4-Parameter Random Search
1       | 0.821 | 0.185 | 0.234 | 0.816
2       | 0.811 | 0.443 | 0.766 | 0.820
3       | 0.811 | 0.813 | 0.812 | 0.810
4       | 0.821 | 0.811 | 0.650 | 0.823
5       | 0.816 | 0.823 | 0.648 | 0.821
6       | 0.814 | 0.819 | 0.459 | 0.819

TABLE III
GPU TRANSFORMER FINE-TUNING METHOD PERFORMANCE DATA.

Attempt | GPT-4 Limited Context I | GPT-4 Limited Context II | GPT-4 Full Context | 6-Parameter Random Search
1       | 0.813 | 0.813 | 0.813 | 0.823
2       | config error | config error | 0.828 | 0.848
3       | 0.8512812062 | 0.845 | 0.613 | 0.831
4       | 0.848144168 | 0.812 | 0.509 | 0.855
5       | config error | 0.831 | 0.000 | 0.832
6       | 0.852029826 | config error | 0.251 | 0.818

Fig. 4. SpaCy CPU Fine-Tuning optimization method performance (F-score over training attempts 1-6 for the methods in Tables I and II and the default spaCy parameters).

GPT-4 Limited Context I achieved a top score of 0.852, GPT-4 Limited Context II achieved 0.845, and the 6-Parameter Random Search achieved the highest score of 0.8555. All of these scores surpassed the default parameter model score. However, it is worth mentioning that GPT-4 Limited Context I and GPT-4 Limited Context II encountered configuration errors in 2 out of 5 attempts, primarily due to hallucinated spaCy data augmentation configurations. GPT-4 Full Context achieved an improvement over the default parameters with a score of 0.8279, but it subsequently produced much lower model scores, with one unstable training run resulting in a score of 0.00. This was presumably caused by GPT-4 setting the learning rate too high and continuously setting it to an even more unfavorable value; GPT-4 recognized and corrected this on attempt 6.

The experiment data is visualized in Figure 5 and documented in Table III. GPT-4 Limited Context I, GPT-4 Limited Context II, and the 6-Parameter Random Search were able to outperform the 0.842 top score of the best performing Transformer model trained on the GERNERMED corpus in the previous implementation [21]. Models achieving the best scores used configurations adapting the parameters hidden_width, maxout_pieces, batch_size, L2, accumulate_gradient, batcher: buffer, size, and transformer get_spans: stride, window.
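As an illustration of where such values enter a training run, the sketch below launches spaCy's command-line training with dotted configuration overrides. The concrete section paths and values are assumptions for illustration that depend on the pipeline's config.cfg; they are not the exact settings behind Table III:

    import subprocess

    # Dotted overrides address sections of the spaCy config.cfg, e.g. the
    # optimizer and batcher blocks; the paths and values below are illustrative.
    command = [
        "python", "-m", "spacy", "train", "config.cfg",
        "--output", "./output",
        "--gpu-id", "0",
        "--training.accumulate_gradient", "3",
        "--training.optimizer.L2", "0.01",
        "--training.batcher.size", "2000",
        "--training.batcher.buffer", "256",
    ]
    subprocess.run(command, check=True)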
3) Evaluation of optimization component experiment results: In summary, the optimization methods exhibited better performance on GPU Transformer models compared to CPU models. The LLM methods outperformed the parameter searches on the CPU model, with both LLM methods achieving improvements over the default model in all 6 attempts. However, the improvements on the CPU model were marginal, with the highest achieved improvement being only 0.5% (GPT-4 Full Context). This level of improvement does not sufficiently support the feasibility of these methods for CPU models, as the gains are minimal.

None of the search method attempts on the CPU model surpassed the default model's performance. Consequently, this outcome does not support the validity or feasibility of using the Bayes and random search methods for the specified use case scenario on CPU models.

On the other hand, when it comes to GPU Transformer models, all four tested methods achieved scores higher than the default model in all 6 attempts. These methods achieved notable improvements, with a median score of 0.849 against the default model's score of 0.813. Additionally, some methods achieved higher scores than the previous GERNERMED model [21], with the 6-Parameter Random Search achieving the highest score of 0.855. Overall, these results strongly support the validity and feasibility of employing these methods for optimizing GPU Transformer models, as all methods consistently achieved significant improvements within the given attempt window.
Fig. 5. GPU Transformer Fine-Tuning method performance (F-score over training attempts 1-6 for GPT-4 Limited Context I and II, GPT-4 Full Context, the 6-Parameter Random Search, the default GPU spaCy parameters, and the previous GerNerMed top score of 0.842 [21]).

In conclusion, while the results are promising for GPU Transformer models, there is insufficient evidence to fully support the feasibility of these methods for NER fine-tuning as formulated in hypothesis H1. Further experimentation is necessary, involving different models, search and training configurations, a greater number of attempts, varied LLM prompts, and diverse corpora to draw more robust conclusions.

VI. CONCLUSION

This paper presents the design and development of a system for training and enhancing Named Entity Recognition (NER) models to assist both expert and non-expert users in the medical domain. The motivation for this work stems from related research projects, including AI4H3, FIT4NER, and CIE, and builds upon the KM-EP platform and a prior prototype implementation [21]. The research is organized into a review of the state of the art, system design, implementation, and evaluation.

The state-of-the-art analysis revealed that Transformers offer competitive results in NER tasks, with notable technologies such as AWS, spaCy, and Huggingface playing pivotal roles in this domain. Additionally, a range of fine-tuning strategies and technologies were identified, including meta-learning and Large Language Model (LLM) recommendations.

The design phase involved crafting additional use cases through User Centered System Design (UCSD) principles and devising a microservice component architecture for the CRM4NER system following the widely adopted Model-View-Controller (MVC) pattern. This architectural design was subsequently implemented, with system integration into the KM-EP platform facilitated through the interaction of KM-EP and microservice components via APIs.

The implementation phase included the incorporation of parameter search functionality and metrics visualization, achieved through seamless integration with the W&B platform within the CRM4NER system. LLM recommendations were integrated into the system's user interface using the OpenAI GPT-4 API.

The evaluation phase encompassed a Cognitive Walkthrough (CW) involving experts to assess the user functionalities, revealing several implementation deficiencies that were partially addressed. These included inadequate information and inconsistent terminology in the user interface, necessitating improvements in the input implementation for parameter search configuration and pre-trained Transformer model selection. Three domain experts with no CC and little to no ML experience were able to train a NER model and apply it to text using the system without specific instructions.

Furthermore, an experiment was conducted to assess the feasibility of the system's optimization component. Model training was performed over six attempts using different parameter search configurations and LLM recommendations on a medical domain corpus in the German language. The achieved scores were compared against those obtained using spaCy's default parameters, both on CPU (default spaCy architecture) and GPU (using the gottbert-base model). Overall, the optimization methods exhibited superior performance on GPU models, with a median score of 0.849, surpassing the 0.813 score achieved using spaCy's default parameters. However, further experimental data is required to establish the feasibility of these methods conclusively.

Future work should focus on addressing the identified deficiencies in the user interface and implementation, with evaluations conducted by domain experts to validate the improvements. Additional experiments employing diverse search configurations, prompting techniques, corpora, and pre-trained models could provide further insights regarding the implemented optimization methods.

In summary, this research entails the design, implementation, and evaluation of a system aimed at aiding domain experts in NER model training and facilitating the challenging task of fine-tuning through partially and fully automated model optimization strategies.

VII. ACKNOWLEDGEMENTS

The author Benedict Hartmann acknowledges the financial support provided by Allianz Technology SE to attend AICS 2023.

REFERENCES
[1] Hugging face, 2023. Accessed: Sep 19, 2023.
[2] Mysql :: The world's most popular open source database, 2023. Accessed: Sep 19, 2023.
[3] Spacy nlp, 2023. Accessed: Sep 19, 2023.
[4] Spacy nlp github repository, 2023. Accessed: Sep 19, 2023.
[5] Symfony, high performance php framework for web development, 2023. Accessed: Sep 19, 2023.
[6] Explosion AI. Deploying a prodigy cloud service for posh's financial chatbots, 2023. Accessed on: 10/02/2023.
[7] Explosion AI. How the guardian approaches quote extraction with nlp, 2023. Accessed on: 10/02/2023.
[8] Amazon Web Services, Inc. Aws batch. https://aws.amazon.com/batch/. Accessed: 2023-10-02.
[9] Kanav Anand, Ziqi Wang, Marco Loog, and Jan van Gemert. Black magic in deep learning: How human skill impacts network training. arXiv preprint arXiv:2008.05981, 2020.
[10] Hilal Atasoy, Brad N. Greenwood, and Jeffrey Scott McCullough. The digitization of patient care: A review of the effects of electronic health records on health care quality and utilization. Annual Review of Public Health, 40(1):487–500, 2019. PMID: 30566385.
[11] David Bawden and Lyn Robinson. Information overload: An overview. 2020.
[12] James Bergstra, Daniel Yamins, and David Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International conference on machine learning, pages 115–123. PMLR, 2013.
[13] Greg Boss, Padma Malladi, Dennis Quan, Linda Legregni, and Harold Hall. Cloud computing. IBM white paper, 321:224–231, 2007.
[14] Hao-Fei Cheng, Ruotong Wang, Zheng Zhang, Fiona O'Connell, Terrance Gray, F Maxwell Harper, and Haiyi Zhu. Explaining decision-making algorithms through ui: Strategies to help non-expert stakeholders. In Proceedings of the 2019 CHI conference on human factors in computing systems, pages 1–12, 2019.
[15] Hyung Won Chung, Dan Garrette, Kiat Chuan Tan, and Jason Riesa. Improving multilingual models with language-clustered vocabularies. arXiv preprint arXiv:2010.12777, 2020.
[16] Ryan Donovan, Michael Healy, Huiru Zheng, Felix Engel, Binh Vu, Michael Fuchs, Paul Walsh, Matthias Hemmje, and Paul Mc Kevitt. Sensecare: Using automatic emotional analysis to provide effective tools for supporting. In 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 2682–2687, 2018.
[17] Hugging Face. Spacy models on hugging face hub, 2023. Accessed: 10/02/2023.
[18] Johann Frei and Frank Kramer. Gernermed – an open german medical ner model, 2021.
[19] F. Freund, P. Tamla, C. Nawroth, T. Reis, S. Bruchhaus, M.X. Bornschlegl, M. Hemmje, and P.Mc. Kevitt. Fit4ner - towards a framework independent toolkit for named entity recognition. 2022.
[20] Ralph Grishman and Beth Sundheim. Message understanding conference-6: A brief history. In 16th Conference on Computational Linguistics, pages 466–471, 1996.
[21] Benedict Hartmann, Philippe Tamla, and Matthias Hemmje. Supporting deep learning-based named entity recognition using cloud resource management. In HCI International 2023 – Late Breaking Papers: HCI for Health, Well-being, Universal Access and Healthy Aging. Springer, 2023.
[22] M. Hemmje, B. Jordan, M. Pfenninger, A. Madsen, F. Murtagh, M. Kramer, P. Bouquet, A. Hundsdörfer, T. McIvor, J. Malvehy, et al. Artificial intelligence for hospitals, healthcare & humanity (ai4h3): R&d white paper. Research Institute for Telecommunication and Cooperation (FTK): Dortmund, Germany, 2020.
[23] Matthew Honnibal, 2015.
[24] Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. Meta-learning in neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):5149–5169, 2021.
[25] Henning Kagermann. Change through digitization—value creation in the age of industry 4.0. In Management of permanent change, pages 23–45. Springer, 2014.
[26] Mu-Hsing Kuo et al. Opportunities and challenges of cloud computing to improve health care services. Journal of Medical Internet Research, 13(3):e1867, 2011.
[27] Lauren F Laker, Craig M Froehle, Jaime B Windeler, and Christopher John Lindsell. Quality and efficiency of the clinical decision-making process: Information overload and emphasis framing. Production and Operations Management, 27(12):2213–2225, 2018.
[28] Jing Li, Aixin Sun, Jianglei Han, and Chenliang Li. A survey on deep learning for named entity recognition. IEEE Transactions on Knowledge and Data Engineering, 34(1):50–70, 2020.
[29] Andrei Mikheev, Marc Moens, and Claire Grover. Named entity recognition without gazetteers. In Ninth Conference of the European Chapter of the Association for Computational Linguistics, pages 1–8, 1999.
[30] Nafise Sadat Moosavi and Michael Strube. Which coreference evaluation metric do you trust? a proposal for a link-based entity aware metric. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 632–642, 2016.
[31] Nafise Sadat Moosavi and Michael Strube. Which coreference evaluation metric do you trust? a proposal for a link-based entity aware metric. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 632–642, Berlin, Germany, August 2016. Association for Computational Linguistics.
[32] David Nadeau and Satoshi Sekine. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26, 2007.
[33] S. Nijor, G. Rallis, N. Lad, and E. Gokcen. Patient safety issues from information overload in electronic medical records. J Patient Saf, 18(6):e999–e1003, Sep 2022. Epub 2022 Apr 7.
[34] Donald A Norman and Stephen W Draper. User centered system design: New perspectives on human-computer interaction. 1986.
[35] Jay F Nunamaker Jr, Minder Chen, and Titus DM Purdin. Systems development in information systems research. Journal of Management Information Systems, 7(3):89–106, 1990.
[36] OpenAI. Gpt-4 technical report, 2023.
[37] Peter G Polson, Clayton Lewis, John Rieman, and Cathleen Wharton. Cognitive walkthroughs: a method for theory-based evaluation of user interfaces. International Journal of Man-Machine Studies, 36(5):741–773, 1992.
[38] Lawrence Rabiner and Biinghwang Juang. An introduction to hidden markov models. IEEE ASSP Magazine, 3(1):4–16, 1986.
[39] MK Ross, Wei Wei, and L Ohno-Machado. "big data" and the electronic health record. Yearbook of Medical Informatics, 23(01):97–104, 2014.
[40] Raphael Scheible, Fabian Thomczyk, Patric Tippmann, Victor Jaravine, and Martin Boeker. Gottbert: a pure german language model, 2020.
[41] Xavier Schmitt, Sylvain Kubler, Jérémy Robert, Mike Papadakis, and Yves LeTraon. A replicable comparison study of ner software: Stanfordnlp, nltk, opennlp, spacy, gate. In 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), pages 338–343. IEEE, 2019.
[42] Lanfang Sun, Xin Jiang, Huixia Ren, and Yi Guo. Edge-cloud computing and artificial intelligence in internet of medical things: architecture, technology and application. IEEE Access, 8:101079–101092, 2020.
[43] P. Tamla, B. Hartmann, N. Nguyen, C. Kramer, F. Freund, and M. Hemmje. Cie: A cloud-based information extraction system for named entity recognition in aws, azure, and medical domain. 2023.
[44] Philippe Tamla. Supporting Access to Textual Resources Using Named Entity Recognition and Document Classification. PhD thesis, Hagen, 2022.
[45] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[46] Julian Von der Mosel, Alexander Trautsch, and Steffen Herbold. On the validity of pre-trained transformers for natural language processing in the software engineering domain. IEEE Transactions on Software Engineering, 2022.
[47] Binh Vu. A Taxonomy Management System Supporting Crowd-based Taxonomy Generation, Evolution, and Management. PhD thesis, Hagen, 2020.
[48] Binh Vu, Yanxin Wu, Haithem Afli, Paul Mc Kevitt, Paul Walsh, Felix Engel, Michael Fuchs, and Matthias Hemmje. A metagenomic content and knowledge management ecosystem platform. In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 1–8, 2019.
[49] Jules White, Quchen Fu, Sam Hays, Michael Sandborn, Carlos Olea, Henry Gilbert, Ashraf Elnashar, Jesse Spencer-Smith, and Douglas C Schmidt. A prompt pattern catalog to enhance prompt engineering with chatgpt. arXiv preprint arXiv:2302.11382, 2023.
[50] Shicheng Zu and Xiulai Wang. Resume information extraction with a novel text block segmentation algorithm. Int J Nat Lang Comput, 8:29–48, 2019.
