Toward LLM-enabled business process coherence checking based on multi-level process documentation

Schulte, Marek; Franzoi, Sandro; Köhne, Frank; vom Brocke, Jan

doi:10.1007/s44311-025-00024-6

Toward LLM-enabled business process coherence checking based on multi-level process documentation

Research
Open access
Published: 06 November 2025

Volume 2, article number 22, (2025)
Cite this article

You have full access to this open access article

Download PDF

Save article

View saved research

Process Science Aims and scope Submit manuscript

Toward LLM-enabled business process coherence checking based on multi-level process documentation

Download PDF

Marek Schulte¹,
Sandro Franzoi^1,2,
Frank Köhne³ &
…
Jan vom Brocke^1,2

1586 Accesses
Explore all metrics

Abstract

In this paper, aProCheCk, an Autonomous Process Coherence Checking method, is developed. aProCheCk leverages large language models (LLMs) to enhance the coherence checking of multi-level process documentation within business process management (BPM). This research addresses the need for automated ways of managing incoherencies in process documentation. The development of the artifact was guided by a design science research approach, which involved iterative development and refinement. This was achieved through expert interviews with researchers and practitioners, iterative experimental benchmarking, and focus group validation based on demonstrations of a prototypical implementation with naturalistic data from diverse industries. aProCheCk can dynamically analyze and assess changes in BPM documentation, detect incoherencies, and provide actionable insights for maintaining process coherence. The findings reveal significant potential for improving operational efficiency, reducing manual effort, and detecting negative and positive process variation early to support continuous process innovation. This research contributes to the field of BPM by integrating LLMs into the BPM lifecycle, enhancing generative AI-based applications within BPM practices, introducing the Business Process Change Classification Framework, and providing an open-source dataset that can serve as a foundation for future research and development.

Introduction

Deviations from expected process behavior persist in virtually every organization. As such, understanding and addressing process deviance is a critical concern for both business process management (BPM) research and practice (König et al. 2019). In the context of today’s dynamic business environments, deviations from established processes are not merely challenges; they represent potential opportunities for significant process improvements and innovations (Bartelheimer et al. 2023). When effectively identified and managed, these deviations can enhance operational efficiency and adaptability, providing substantial value to businesses (Setiawan and Sadiq 2013; Delias 2017).

In the past, inconsistencies in process documentation were typically solely regarded as errors that needed to be rectified (van der Aa et al. 2017). However, there is a growing recognition that these deviations can be viewed as opportunities for improvement (Galperin 2012; Mertens and Recker 2017). This is illustrated in the shift toward an understanding that BPM needs to balance the enforcement of process compliance with the identification of positive deviance (Setiawan and Sadiq 2013; Mendling et al. 2020). Identifying and analyzing these deviations can provide insights that lead to more effective business outcomes, innovative process improvements, or necessary corrective actions. This shift in perspective underscores the need for sophisticated tools that can efficiently detect, analyze, and leverage both positive and negative deviations to drive business growth and innovation (König et al. 2019).

The emergence of generative artificial intelligence (AI), particularly large language models (LLMs), signifies a transformative development in this domain. LLMs and their capability to generate and understand human-like text render them particularly suitable for interpreting complex business documentation associated with BPM (Vidgof et al. 2023; Fahland et al. 2024).

The potential role of AI in detecting and managing process deviations becomes crucial when considering the automation and enhancement of these tasks (Weinzierl et al. 2024). In particular, LLMs are capable of analyzing extensive sets of process documentation to identify inconsistencies and changes that may indicate deviations (Vidgof et al. 2023). Recent research shows that LLMs are also increasingly capable of making complex subjective decisions based on the given context (Binz and Schulz 2023; Mcintosh et al. 2024). This characteristic enables the expansion of the scope from detecting quantitative deviations, for instance, based on event logs and process diagrams, to encompass the more complex and subjective problem of checking for incoherencies in text-based business process documentation.

By detecting deviations through the analysis of changes in the process documentation, LLMs enable organizations to not only maintain the integrity of their processes but also to take advantage of these deviations to drive innovation and improve efficiency (Vidgof et al. 2023). This approach addresses a critical research gap identified by Feuerriegel et al. (2024), which emphasizes the potential of generative AI methods to reveal opportunities for process innovation and support process redesign initiatives.

This is especially important because organizations frequently maintain multiple, interrelated documents, often resulting in incoherencies between them – a problem magnified by changes made to individual documents that are not reflected consistently across all related documents (Martin-Toral et al. 2010). This issue is particularly evident in the context of business process documentation, as expressed, for example, in the presence of content inconsistencies between process descriptions and the corresponding process models (van der Aa et al. 2017). This suggests that analyzing changes in process documentation and comparing them with other related process documents may reveal incoherencies, which could potentially be leveraged to detect deviance and improve processes faster and more efficiently in organizations.

To address this problem, we formulate the following two research objectives, which aim to explore the integration of LLMs within process coherence checking based on process documentation:

RO1: Define the essential design objectives and specifications for an LLM-enabled process coherence checking method based on changes to process-related documentation.

RO2: Develop and evaluate a solution for employing LLMs to automatically evaluate the consistency and coherence across different versions of multiple documentation types for business processes.

Addressing these research objectives, we followed a design science research (DSR) approach and developed an artifact called aProCheCk, an acronym for Autonomous Process Coherence Checking. This artifact represents a method that is designed to leverage LLMs to enhance the enactment and evaluation phases of the BPM lifecycle by assessing the consistency and coherence of multi-level process documentation. The core functionality of aProCheCk is to automate process coherence checking, thereby enabling organizations to identify and manage both negative and positive deviance within their processes. Specifically, aProCheCk compares two entities of process documentation (e.g., process model and process description or process model and regulatory document, among others) and their change over time to identify incoherencies and subsequently create notifications (see notification creation in Fig. 2), presented in the form of management summaries, to inform process owners. We instantiate a software prototype to demonstrate and evaluate the utility of our artifact in practice. As such, our research marks a fundamental step toward coherence checking based on multi-level process documentation. The identification and assessment of incoherencies allows organizations to not only rectify errors but also to leverage positive deviations as opportunities for strategic advancement and innovation.

Following established guidelines on structuring DSR (Gregor and Hevner 2013), we present the remainder of the paper as follows. First, we introduce the research background on process coherence checking as well as generative AI in BPM. Next, we outline our methodological approach and its respective phases. Then we describe our developed artifact and provides information on its design and development. Afterwards, we present our experimental and focus group evaluations. Finally, we discuss theoretical and practical implications as well as the limitations of our work before we conclude the paper.

Research background

To provide a comprehensive foundation for our work, the research background section is structured into two subsections. First, we introduce the concept of process coherence checking, highlighting the challenges of maintaining semantic consistency across multi-level process documentation and the limitations of traditional approaches. Second, we shift the focus to generative AI in business process management, outlining how recent advances, particularly in LLMs, are opening up new opportunities for addressing challenges in BPM.

Process coherence checking

Process coherence represents a vital concept for ensuring logical consistency across business processes and their related documentation. In the context of information systems, the antonym of coherence, incoherence, provides a useful point of reference. Saint-Dizier (2018), for instance, defines incoherence as linguistic discrepancies between documents. Martin-Toral et al. (2008) define content incoherence as “[…] the weakness of consistency amongst related documents, or amongst different pieces of the same document, or the lack or excess of information in a document” (p. 283). In a broader sense, coherence is essential for preserving logical consistency across systems, processes, and documentation (Martin-Toral et al. 2010). It ensures that various components interrelate coherently, enabling clear communication and operational efficiency. This principle of coherence is particularly important given the extensive and complex documentation typical of business processes. The concept of process deviance (Mertens and Recker 2017) is closely related. It refers to user actions that deviate from the intended process flow and can be studied through various methods, including process mining (Di Francescomarino et al. 2025). These deviations illustrate how users interact with systems in unpredictable or unintended ways, leading to positive or negative effects (Delias 2017). These user-driven deviations may result in updates to operational process documents, such as training materials, to reflect actual practices. However, other documents that are less user-centric, such as process models, may not be updated synchronously, resulting in potential inconsistencies. In contrast to process deviance, incoherence is more closely associated with general inconsistencies within the documentation of processes, which may occur independently of any user-driven actions (Martin-Toral et al. 2010). Examples of such incoherencies that do not originate from user-driven actions include changes to overarching documents such as regulations or guidelines, organizational updates such as role changes, and software or hardware upgrades (Rosemann et al. 2008).

The introduction of multi-level process documentation adds further complexity. The various types of documentation, including process models, textual descriptions, guidelines, and policies, not only differ in form but also in perspective, level of abstraction, and state of knowledge (Polyvyanyy et al. 2015; Rosemann and vom Brocke 2015; Nwankpa et al. 2022). Some documents may address only specific aspects of a process, whereas others may deal with particular abstraction layers, such as the database or strategic abstraction layers. Similarly, internal or external regulations, as well as modeling conventions, often span multiple processes and documentation types, placing them at a higher abstraction level. Ensuring coherence across these diverse documents is essential for effective BPM.

Based on these insights, process coherence in BPM can be defined as semantic consistency across multi-level process documentation. Maintaining this coherence ensures that documents related to a process remain logically aligned, despite differences in perspective and granularity. This alignment is critical to avoid miscommunication and inefficiencies and to support a cohesive working environment.

To address these challenges, the deployment of advanced technologies, particularly LLMs, enables organizations to automate the coherence checking of their process documentation, thereby capitalizing on the technological advancements over existing methods. For example, Martin-Toral et al. (2008) investigated the detection of incoherencies in a corpus of technical and regulatory documents using traditional methods such as extraction techniques involving text and document mining. Recent advances, as exemplified by the work of Sai et al. (2023), highlight the importance of bridging the gaps between organizational documentation and regulatory texts through the application of advanced NLP techniques. Their study focused on the comparison of general business documents with regulatory frameworks without addressing the specifics of BPM-related documents or the recent developments in LLM technology. The integration of LLMs in this context not only supports regulatory compliance but also improves overall process coherence and reliability.

Generative AI in business process management

The progression of generative AI has redefined the potential applications of AI in a multitude of domains, including BPM (Kampik et al. 2024). By enabling more intuitive, adaptive, and creative forms of process interaction and design, generative AI extends the traditional boundaries of BPM beyond automation toward conversational, autonomous, and sophisticated process capabilities (Rosemann et al. 2024). For example, van Dun et al. (2023) leveraged generative adversarial networks to help process designers in the creation of business process improvement ideas. In another paper, Harl et al. (2024) showcase how generative machine learning can be used for automated business process redesign at runtime. Research has also investigated the opportunities and challenges of natural language processing for BPM (van der Aa et al. 2018) and explored its application, for instance, in predictive business process monitoring (Teinemaa et al. 2016).

More recently, the evolution of transformer architectures has marked another key shift in the field of generative AI in BPM. Here, a key development was the emergence of tools such as ChatGPT, which employ LLMs trained on extensive datasets to generate contextually relevant responses to user queries (Vidgof et al. 2023). In the context of future developments, the potential applications of generative AI in BPM are numerous and diverse (Kampik et al. 2024). One area of significant growth is the development of new generations of process guidance systems (Feuerriegel et al. 2024). In contrast to traditional systems that rely on static, manually crafted knowledge bases, new systems powered by generative AI can dynamically retrieve information from a wide array of structured and unstructured data sources, including emails, manuals, and corporate documents (Morana et al. 2019; Feuerriegel et al. 2024). Such systems can guide a wide range of business process management tasks, such as modeling (Kourani et al. 2024), knowledge management (Franzoi et al. 2025b), or analysis support in process mining (Brützke et al. 2025). This enables the provision of real-time, context-sensitive guidance, making these systems more adaptive and intelligent. This shift from static to dynamic guidance systems indicates a broader move toward more responsive and intelligent BPM tools that greatly enhance decision-making and optimization across organizational levels. As highlighted by Feuerriegel et al. (2024), there is a continuing need to explore how generative AI can unveil new opportunities for process innovation. Their work suggests that further investigation into the capabilities of generative AI could reveal transformative approaches to BPM, redefining operational strategies and competitive dynamics in numerous industries. As such, there is still ample potential to employ LLMs to assess process deviance or process coherence across documents.

Research methodology

To address our research objectives, we employ a multifaceted methodology, centering on the applicability of an emerging technology to address a practical issue. The foundational methodological approach adopted is DSR, which aims at identifying a problem and developing an artifact designed to mitigate the issue within predefined parameters (Hevner et al. 2004).

Specifically, we follow the guidelines proposed by Peffers et al. (2007) and Tuunanen et al. (2024), which provide a comprehensive framework for carrying out research that is centered around the creation and evaluation of IT artifacts intended to solve identified problems. The process proposed by Peffers et al. (2007) is divided into six key phases: problem identification, definition of objectives, and design and development as Build phases, followed by demonstration, evaluation, and communication of research findings as Evaluate phases. The approach is illustrated in Section A in the online appendix^{Footnote 1}.

In adapting the DSR framework for this research, specific modifications were made to tailor the approach to the unique requirements and constraints of this study. The aim was to enhance the focus on the practical application of emerging technology in solving real-world problems. This customized approach allowed for a more targeted development and evaluation of the artifact, ensuring that the research outcomes are both practical and theoretically sound. Additionally, we also incorporated principles of an iterative approach, including frequent evaluation and adaptation (Tuunanen et al. 2024). Our instantiation of the DSR approach is depicted in Fig. 1. The evaluation phase in the DSR approach is critical for validating that the developed artifact meets both theoretical expectations and practical utility in real-world applications. This duality ensures that artifacts are robust in their theoretical underpinnings and highly functional in practical scenarios, indicating a successful bridging of theory and practice (Hevner et al. 2004). Our evaluation strategy employs a structured approach, utilizing various methodologies such as reviewing literature, expert interviews, focus groups, or experimental benchmarking (Sonnenberg and vom Brocke 2012). This strategy is designed to align with various evaluation types as identified in the DSR framework. Importantly, we emphasize the iterative nature of our DSR approach by highlighting the recurring cycles between design and development, as well as demonstration and evaluation. Here, phase 3.a, 3.b, and 3.c represent design and development activities, and 4.a, 4.b, and 4.c represent demonstration and evaluation activities (see Fig. 1). The following paragraphs outline the specific steps of our approach.

Phase 1. Identify problem and motivate (Eval 1)

In the initial phase, the underlying problem is carved out and discussed. For this purpose, existing literature is examined to identify relevant research, establish the theoretical foundation for the work, and justify the problem statement.

Phase 2. Define objectives of a solution

Based on the identification of the problem and the relevant research identified in the previous phase, the scope of the targeted artifact is defined, and provisional design objectives are derived from the literature.

Phase 3. Design and Development

Phase 3.a design specifications and initial proof of concept (PoC)

Guided by the established provisional design objectives, an exemplary application is composed utilizing existing technologies. Furthermore, provisional design specifications are derived from the design objectives.

Phase 3.b design and development of prototype

In accordance with the refined design specifications, an initial working prototype is created.

Phase 3.c adaptation of the prototype

The artifact is refined through an iterative process based on the results of the experimental iterations.

Phase 4. Demonstration and Evaluation

Phase 4.a demonstration and evaluation in Semi-structured expert interviews (Eval 2)

The PoC is demonstrated, and the design objectives and specifications are discussed in six semi-structured interviews with three practitioners in BPM-related roles and three researchers in the BPM field. This phase validates the design objectives and refines the specifications, focusing on understandability, feasibility, applicability, and operationality.

Phase 4.b experimental evaluation of prototype (Eval 3)

The applicability of the artifact is demonstrated through iterative experiments based on a dataset sourced from literature and enriched by a BPM expert. This phase focuses on a quantitative evaluation of the artifact’s efficiency, effectiveness, and robustness.

Phase 4.c demonstration and evaluation in focus groups (Eval 4)

The usefulness of the artifact is evaluated through demonstrations of an instantiation in two focus groups, each comprising four to five BPM and AI consultants. Following the demonstration, a semi-structured discussion assesses practical applicability, usability, and real-world integration.

Phase 5. Communication

The contribution to the knowledge base will be achieved through the publication of the research.

This transparent, structured, and iterative evaluation approach ensures that the artifact adheres to rigorous academic standards while enhancing BPM practices in diverse operational contexts (Hevner et al. 2024). Both formative and summative evaluation methods are applied in artificial as well as naturalistic settings (Venable et al. 2016), strengthening confidence in the overall evaluation methodology (vom Brocke et al. 2020). By aligning the interests and feedback of both researchers and practitioners, the artifact is refined into a robust tool that offers significant theoretical and practical contributions to the field of business process management (Sonnenberg and vom Brocke 2012).

Artifact description

This section outlines the developed artifact, aProCheCk, which represents a method for LLM-enabled process coherence checking. We first present an overview of the final artifact and its three main stages –preprocessing, content comparison, and coherence checking– followed by the design rationale, including objectives, proof of concept, expert evaluation, and the Business Process Change Classification Framework. We then detail the development and operating logic, highlighting prompt engineering techniques and the implementation strategy.

Artifact overview: aProCheCk

Drawing on the presented DSR approach, this section presents an overview of the final developed artifact, aProCheCk, which comprises three primary stages: (1) preprocessing, (2) content comparison, and (3) coherence checking. Each stage can terminate early if the documents are found to be coherent, thereby optimizing efficiency and reducing hallucination risks. Figure 2 depicts an overview of aProCheCk, from identifying and inputting process documents to the optional creation of notifications. The darker grey boxes denote LLM interactions, while dotted lines indicate modifiable elements.

Preprocessing

The preprocessing phase focuses on noise reduction by removing non-relevant visual data from XML representations of BPMN files, which were identified as uncritical for coherence checking through expert interviews and testing. An equality check then determines if the preprocessed documents are identical; if so, the workflow terminates early, indicating coherence and conserving computational resources.

Content comparison

In the content comparison phase, the content of the two process document versions is compared to identify changes. Identified changes are aggregated into a JSON element and classified according to the Business Process Change Dimensions: Task, Data, Control flow, and Organization. This structured approach supports systematic comparison while allowing for future customization and integration with existing BPM tools. If no changes are identified, the process terminates, indicating coherence.

Coherence check

In the coherence check phase, identified changes are categorized as relevant, unrelated, or negligible. Irrelevant changes are discarded, ensuring that only those pertinent to the coherence and integrity of the business process remain. If no relevant changes are found, the documentation is deemed coherent, and the process terminates. When incoherencies are validated and categorized, aProCheCk can optionally generate notifications.

The Notification Creation functionality is demonstrated in the section Focus Group Evaluation using naturalistic data. There, we illustrate how notifications are created through additional LLM interactions, drawing on the structured output of the coherence check to create actionable email alerts for process owners. The robustness, efficiency, and practical utility of aProCheCk are further confirmed through focus group evaluations. In the following two sections, we shed light on the design and the development of aProCheCk.

Artifact design

We present the design of our artifact in four subsections. First, we describe how we derived the design objectives that build the foundation of the artifact design. Second, we outline the development of an initial proof of concept and respective design specifications to validate the feasibility of using LLMs to detect and manage inconsistencies in process documentation. Third, we present the results of an evaluation through semi-structured expert interviews with BPM researchers and practitioners, which we conducted to receive feedback and refine the design specifications. A key outcome of this refinement is the development of the Business Process Change Classification Framework, which systematically categorizes changes in process documentation.

Design objectives

In the context of DSR, the definition of precise design objectives is crucial to guide the development of solutions that effectively address specific functional needs while contributing to both theory and practice (Sonnenberg and vom Brocke 2012). This research uses LLMs to address challenges in BPM that were identified in the first two sections, with a particular focus on improving and automating the detection and management of incoherencies in process documentation. The definition of the design objectives is directly derived from the literature that has highlighted the challenges in BPM in the areas of process documentation, deviation, consistency, and coherence. The literature highlighted how technological advances, specifically LLMs, have the potential to effectively address these issues (Feuerriegel et al. 2024). In the following, we present the design objectives with respective supporting references from prior work.

An approach to detect process incoherencies based on process documentation could:

1.
identify substantial changes to a process along different versions of a process document (Martin-Toral et al. 2008; Bose et al. 2011; Delias 2017).
2.
detect incoherencies in process documentation.
1. a)
  of the same type (Leopold et al. 2013; Polyvyanyy et al. 2015).
2. b)
  of different types (Martin-Toral et al. 2008; Leopold et al. 2013; van der Aa et al. 2017).
3. c)
  between overarching policies/guidelines/regulations and process documentation (Martin-Toral et al. 2008; Becker et al. 2011; Sai et al. 2023).
3.
list potential actions based on the detected incoherence and their implications for the process (Sai et al. 2023; Feuerriegel et al. 2024).

These design objectives utilize the analytical capabilities of LLMs to parse through extensive sets of process documentation, detecting inconsistencies that could undermine process coherence and documentation integrity. The first objective focuses on identifying significant changes between different versions of one and the same process document, which is essential for maintaining clarity and ensuring compliance in business processes. Subsequent objectives address the detection of inconsistencies within similar and dissimilar types of process documentation.

Incorporating these objectives into the design and development of the artifact enables it to not only detect incoherencies but also to provide actionable insights to help organizations effectively address these incoherencies, thereby supporting sustainable process coherence and operational efficiency. This structured approach ensures that the functionality of the developed artifact meets the demands of contemporary BPM, thereby enhancing the ability of organizations to manage multi-level process documentation effectively and efficiently.

Proof of concept

The PoC is a critical element of our methodological approach, serving not only as a demonstration but also as a means to validate initial design objectives and facilitate discussions during semi-structured interviews. By providing tangible evidence, the PoC bridges the gap between theoretical constructs and practical application. The PoC illustrates how LLMs effectively identify and manage process incoherencies in multi-level business process documentation. Key objectives include detecting significant changes between document versions, identifying incoherencies within and between document types, and proposing actions to resolve these issues. The PoC operationalizes these objectives, demonstrating their feasibility in a controlled yet realistic scenario. The dataset for the PoC is based on Sànchez-Ferreres et al. (2018), derived from Eid-Sabbagh et al. (2012) and Friedrich et al. (2011). It consists of pairs of simple BPMN models and associated textual descriptions. To simulate a realistic change, the textual document is modified by adding a description of an additional task, introducing a process incoherence. This dataset was later expanded by systematically modifying the descriptions and process models and classifying these changes according to the Business Process Change Classification Framework. The resulting open-source dataset is available in a separate GitHub repository^{Footnote 2} and described in Section E of the online appendix. Developed using Microsoft Azure AI Studio, the PoC leverages the GPT-4o OpenAI API, ensuring consistency in evaluation and applicability to later research stages. The LLM analyzes both versions of the process description and the BPMN model in two steps: first, it identifies content changes, and second, it checks for consistency with the BPMN model. If incoherence is detected, the LLM generates a management summary highlighting inconsistencies and suggesting options for restoring coherence. The process documents used in this PoC, including both versions of the textual descriptions and the BPMN model, are included in the online appendix in Section B. The output from the GPT-4o API, also depicted in Section B of the online appendix, demonstrates the PoC’s success in addressing design objectives and provides a foundation for exploratory inquiries in subsequent interviews. Thus, the PoC substantiates theoretical underpinnings and offers a credible basis for further evaluation and refinement.

Expert interview evaluation

After presenting the design objectives and the PoC demonstrating the potential of LLMs to identify incoherencies in multi-level process documentation, expert feedback was vital for refining and validating the design specifications. To achieve a balanced evaluation, we conducted interviews with six BPM experts, including both researchers and practitioners from various industries. The semi-structured evaluation interviews began with a thorough presentation of the design objectives and the PoC, followed by discussions on the provisional design specifications. Experts provided critical feedback and suggestions for refinement, which were assessed based on four criteria we derived from Sonnenberg and vom Brocke (2012): Understandability, Feasibility, Applicability, and Operationality. This structured framework facilitated a nuanced evaluation of the artifact’s strengths and areas for improvement. The feedback was instrumental in enhancing the clarity, precision, and practical relevance of the design specifications. Subsequent sections detail how this feedback was integrated, the evaluation results, and key findings and recommendations from the expert evaluation process, laying the groundwork for further artifact development. Interviewees are referenced using the abbreviations introduced in Appendix A, which also includes more details about the interview participants.

Design specifications

In the context of DSR, design specifications serve as a blueprint, guiding the development of artifacts by defining crucial principles of form and function (Sonnenberg and vom Brocke 2012). They ensure alignment with the identified problem and objectives, facilitating a coherent transition from theoretical foundations to practical application (Sonnenberg and vom Brocke 2012). The design specifications for this research were derived from the literature-based design objectives. These objectives highlighted the potential of LLMs in managing incoherencies in multi-level process documentation. Following the development of a PoC, these specifications were refined through semi-structured interviews with BPM practitioners and researchers. The refined design specifications of the artifact are as follows:

1.
The artifact should automatically detect deviances in changes to text-based process documentation that are incoherent with related documents of the same process using Large Language Models.
2.
The artifact should provide management summaries of identified incoherent changes to process documentation, including potential implications for the process of applying or reversing these changes.
3.
The artifact should check changes in overarching process-related documentation (e.g., regulations, policies, guidelines) for coherence with existing text-based process documentation.
4.
The artifact should operate autonomously and only notify users when relevant incoherencies are identified in any of the Business Process Change Dimensions.

During the evaluation phase, which was guided by the Eval 2 phase of the DSR framework of Sonnenberg and vom Brocke (2012), experts assessed the design specifications based on their correctness and completeness, alignment with the design objectives, and comprehensibility and meaningfulness. They reviewed the specifications to ensure that they comprehensively addressed all necessary functional components without omissions, validated the logical and practical alignment with the design objectives, and confirmed that the specifications were clearly articulated and easily interpreted by stakeholders.

The expert validation process led to several key refinements of the design specifications. Clarity and objectivity were improved by classifying changes based on BPM process change dimensions, as suggested by I 2 (IR2), stating, “[…] it would also be good if you could add some kind of categorization of the changes later, I suppose. Like control flow, resources, and so on […]”. The expert feedback also highlighted the need for more precise terminology, which was addressed in the refined specifications. Experts from practice and research also stressed that the artifact should inherently support robust data security practices to ensure the security of internal company documents, even if this is not explicitly listed in the design specifications, as I 5 (IP2) stated, “[…] data security has to be considered. […] I think that’s probably the most important thing for everyone involved.”

Design evaluation

The evaluation of the artifact was carried out following the demonstration of its design specifications and proof of concept. The interviewed BPM experts evaluated the artifact on the basis of four specific criteria that were derived from the Eval 2 phase of Sonnenberg and vom Brocke (2012): Understandability, Feasibility, Applicability, and Operationality. Participants were asked to score each criterion on a Likert scale from 1 to 7 (e.g., for applicability: 1 ‘not at all applicable’ to 7 ‘completely applicable’). Each criterion was accompanied by a guiding question to anchor the discussion and allow for a comprehensive assessment:

Understandability was explored with the guiding question, “How understandable and plausible is the concept presented to you?” to assess the clarity and conceptual integrity of the artifact. This criterion ensures that both researchers and practitioners can fully understand the artifact, and it verifies clear communication of its functionality and potential impact.

Feasibility was examined through the guiding question, “What potential problems and risks do you see in implementing the concept presented?” to identify any challenges or risks in the practical implementation of the artifact. This includes technical, operational, and financial aspects to ensure that the artifact can be implemented in real-world BPM environments.

Applicability was assessed using the guiding question, “Would the system’s notifications be more of a burden to you as a process owner, or would you consider them helpful?” to determine the practical usefulness of the system’s notifications in real-world settings. This criterion assesses whether the artifact effectively improves process coherence and reduces manual effort, thereby providing tangible value to stakeholders.

Operationality was evaluated with the guiding question, “Would such a system feel more like surveillance or support to its users?” to understand whether the artifact would be perceived as rather supportive or intrusive. Positive user perception is crucial for adoption, as the artifact should be seen as a supportive tool that enhances the user’s tasks. Figure 3 summarizes the results of the quantitative assessment of all criteria.

The results of the evaluation showed that understandability and applicability were consistently rated highly, reflecting the clarity and practical benefits of the concept. This suggests that the design specifications and proof of concept were effectively communicated and well received by the experts.

Feasibility also received positive feedback, although it sparked discussion about the need to address data security concerns and the complexity of the problem at hand, with I 2 (IR2) stating, “[…] the problem becomes very complex when you have many different sources of deviance in it.” and I 1 (IR1) adding that the issue has a “[…] high complexity, but impressive if it can be realized.“. Experts expressed confidence that with careful planning and iterative refinement, the artifact could be successfully implemented despite the identified complexities. This feedback reinforces the belief that the artifact is feasible, provided that careful consideration is given to implementation strategies and risk management.

Operationality, although positively rated, had the widest range of responses and provoked discussion about the potential perception of the system as a surveillance tool by individuals within an organization. Nevertheless, it was widely recognized by the experts that the core functionality of the artifact was inherently supportive and positive, as exemplified by I 4 (IP1) saying, “I think that it would be seen primarily as support and that people would feel very little monitored”. Designed to notify users only when relevant incoherencies are detected, the interviewees recognized that the artifact aims to save time and improve process efficiency in the Long term. Although an initial effort is required from users, this investment is expected to yield significant long-term benefits, including more coherent processes, improved process documentation, and earlier identification of changes, as acknowledged by I 3 (IP3) stating, “I don’t generate the benefit immediately, but only later, because the clean process documentation or the overall clean documentation helps me in the future.” According to the experts, the supportive nature of the artifact and the transparency of its notifications are key factors in mitigating surveillance concerns, ensuring that the artifact is perceived as a helpful tool rather than an intrusive one.

Overall, the evaluation highlighted that while the artifact has a strong conceptual foundation and practical utility, areas such as feasibility and operationalization require attention to fully realize its potential. The expert feedback provided valuable insights that guided the next stages of development. Additional key takeaways and supporting statements from the expert interviews can be found in Section C of the online appendix. The combination of design objectives derived from the literature, refined specifications, and extensive expert feedback has resulted in a clear and practical set of requirements and guidelines for the development of an effective process coherence checking artifact based on changes to process-related documentation. The refined design specifications ensure that the artifact can accurately detect incoherencies, provide actionable insights, and operate in a way that supports both operational efficiency and strategic business improvement.

Business process change classification framework

In the context of business processes, the effective classification of changes in process documentation is a critical task, especially when dealing with process coherence. The Business Process Change Classification Framework presented in this section, derived from both literature and expert interviews, plays an essential role in classifying and prioritizing identified changes to maintain process coherence.

The task of categorizing changes in business process documentation is inherently complex and subjective, as recognized by expert interviews from research and practice. Different roles within an organization may interpret and prioritize changes in different ways, adding to this complexity, as proposed, for example, by I 3 (IR3), stating “[…] if you have different roles using the tool, […] they might prioritize these changes differently”. The use of LLMs offers a promising solution to this complexity. LLMs have advanced capabilities for processing and interpreting textual data, enabling them to make subjective decisions based on the context of the content provided (Binz and Schulz 2023; Mcintosh et al. 2024). Due to the inherent complexity and subjective nature of classifying changes in BPM documentation, a certain margin of error is manageable within the system. This is facilitated by the human in the loop, who plays a critical role in supervising and making final decisions. The primary objective of the system’s classifications is to generate accurate and relevant notifications that serve as advisory guidance. The human in the loop then evaluates these notifications and ensures that the final decisions, to implement or not implement changes, are made within the specific organizational context. I 3 (IR3) described the decision made by the human in the loop as determining, “do I want to implement this in the process document, or do I want to not implement it, go back and undo it at that point where the deviation was found”. This human oversight ensures that, despite any potential classification inaccuracies, practical needs and objectives are effectively met.

It is also important to recognize that different Business Process Change Dimensions may have differing levels of relevance to specific roles within an organization, highlighting the opportunity for role-specific notifications, as explained by I 3 (IR3) “[…] the process owner is primarily interested in all changes to the control flow, but the production planner may be more interested in the resource dimension”.

Section C of the online appendix contains a more detailed explanation of the Business Process Change Dimensions and Change Relevance Categories, presenting a framework that enhances the efficiency and accuracy of the coherence checking mechanism. An overview of this framework is depicted in Table 1. By addressing the subjective nature of change classification, leveraging the decision-making capabilities of LLMs, and maintaining human-centric oversight, the artifact aims to more effectively manage process coherence and provide accurate and actionable notifications to users.

Table 1 Business process change classification framework overview

Full size table

Artifact development and operating logic

This section describes the transition from theoretical design to practical implementation of aProCheCk for LLM-enabled business process coherence checking. Importantly, while aProCheCk constitutes a general method for LLM-enabled process coherence checking based process documentation, we also instantiate a software artifact to demonstrate and evaluate aProCheCk in a real-world scenario. The code for the instantiation is available in Section G of the online appendix. The development phase is based on the insights and refined design specifications gained from the expert interviews. This section focuses on prompt engineering techniques that are critical to optimizing the capabilities and performance of the artifact. Section D of the online appendix provides in-depth descriptions of the technology stack underlying the instantiated artifact, the prototype development process, and the software architecture, including its structure, components, and connections. There, we also highlight the iterative nature of the software architecture and the various aspects required to build a robust, scalable artifact based on LLMs.

Prompt engineering represents a vital technique for optimizing the utility of LLMs across a range of domains, evident in its use in the context of BPM (Busch et al. 2023). Despite its rising significance, prompt engineering remains a relatively new area of research. Only recently established and validated techniques have begun to emerge in the literature, significantly increasing the potential for the use of LLMs (Sahoo et al. 2024). The comprehensive taxonomy proposed by Schulhoff et al. (2024) classifies 58 general text-only prompting techniques into six clusters: Zero-Shot, Few-Shot, Thought Generation, Ensembling, Self-Criticism, and Decomposition. Among these approaches, Few-Shot Prompting and Chain of Thought prompting have proven to be particularly effective (Schulhoff et al. 2024). Few-Shot prompting involves providing the model with a limited number of examples from which to learn and perform specific tasks (Sahoo et al. 2024). This technique is particularly valuable in contexts where extensive training data is not available, allowing the model to generalize effectively from minimal examples. Chain of Thought prompting, the only technique introduced in the Thought Generation category, encourages the LLM to articulate its reasoning process step by step before delivering a final answer (Sahoo et al. 2024). This encourages responses that are more accurate and logically structured. Both techniques are integrated into the iterative experimental optimization process described in the section Experiment Composition. The operating logic of the artifact involves a two-stage process of logic-based coherence checking that serves as the foundation for the method, as illustrated in Fig. 4.

The first stage of the process is to compare the content of the modified and original documents. During this stage, individual semantic changes are identified and categorized according to the Business Process Change Dimensions. The second step, the coherence checking step, consists of comparing the previously identified changes with the content of the related document. This comparison includes the description of any required changes to the related document that are necessary to restore process coherence. Here, the identified changes are classified into change categories as described in the Business Process Change Classification Framework. The process can be terminated at any stage if coherence is confirmed. The final step, which is part of our instantiation, involves the generation of the notification, which is constructed in the form of a textual notification based on the structured output. In addition, a notification title and severity indicator are added based on the results of the coherence check. Figure 4 shows a conceptual breakdown of the three stages of the operating logic rather than isolated sub-steps. Since each stage is executed within a single LLM prompt, measuring the performance of each step individually is not feasible. Therefore, we evaluate the accuracy of the coherence-checking logic holistically, as detailed in Appendix B as well as Section E of the online appendix.

Adherence to general prompt engineering best practices is crucial in the development of an LLM-based artifact (Lo 2023; Marvin et al. 2024). Hence, throughout the development of aProCheCk, we incorporated established prompting techniques to optimize performance. The prompts were structured to place key information at the beginning, ensuring clarity from the outset and facilitating processing by the model. Clear structural divisions within prompts allowed for logical flow and coherence, while the strategic use of keywords highlighted critical points and guided the model’s focus. Background information was provided after the task description to provide the necessary context without overwhelming the initial instructions. Step-by-step instructions were used to maintain a logical sequence, with constraints listed at the end to avoid distraction from task performance. Importantly, as part of the experimental iterations, a system message was set to assign the role of an experienced BPM expert to the LLM, helping to improve its contextual understanding and decision accuracy. The final prompts as well as the changes made in each experimental iteration are available in the code base in Section G of the online appendix.

Evaluation

The evaluation of aProCheCk was conducted in two complementary stages to ensure both technical rigor and practical applicability. First, an experimental evaluation systematically assessed the artifact’s effectiveness, robustness, and efficiency through an iterative refinement process. Second, a focus group evaluation examined an instantiation of the artifact in naturalistic settings with real tasks, systems, and users, providing confirmatory insights into its utility, integration potential, and areas for further enhancement. Together, these evaluations establish the artifact’s readiness for real-world application and identify directions for future improvements.

Experimental evaluation

Experiment composition

Experimental evaluation is a crucial step in the validation and benchmarking of aProCheCk, in line with the Eval 3 phase of Sonnenberg and vom Brocke (2012). To accomplish this, we designed the evaluation as a reverse ablation study, that is, we began with a baseline prompt and incrementally added techniques in each iteration. This approach allowed us to isolate and measure the individual impact of each technique. The goal of the evaluation is to iteratively refine the artifact and rigorously assess its performance in a controlled environment, thereby ensuring its readiness for practical application. Conducting such experimental evaluations iteratively is important for the systematic optimization of the artifact, closely following the principles of DSR, which emphasizes iterative development and continuous improvement. Each iteration refines a specific aspect of the artifact, ensuring both practical utility and theoretical soundness. This dynamic approach is consistent with the problem-solving cycle inherent in DSR, which involves problem identification, solution design, implementation, evaluation, and ongoing refinement (Peffers et al. 2007; Tuunanen et al. 2024). This framework ensures that the artifact evolves progressively, incorporating feedback and benchmarking results to improve its performance. The procedure of our experiment iterations is visualized in Fig. 5. The dotted arrows visualize the prospect of rolling back the changes made in the most recent iteration if the benchmarking results of the iterations are not satisfactory.

For the empirical validation, specifically during the experiment phase as defined in the DSR evaluation process, a comprehensive dataset from multiple process repositories was used. This novel Process Coherence Checking Dataset, which is published under the GNU General Public License, is central to assessing the effectiveness and robustness of LLMs in identifying and verifying the coherence of multi-level process documentation.

The initial dataset, derived from the research of Sànchez-Ferreres et al. (2018), includes a wide range of process models originally compiled by the BPM Academic Initiative and further detailed by Eid-Sabbagh et al. (2012). These models, enriched with textual descriptions compiled by expert process modelers, form the basis of a dataset for checking the coherence of documented business processes, as described in more detail in Section E of the online appendix. The criteria used for the evaluation of the experiment are also introduced in detail in Section E of the online appendix.

The first iteration focuses on establishing a baseline using basic prompting guidelines. This baseline uses structured prompting without any advanced prompt engineering or extensive data preparation. This simple approach forms the baseline against which subsequent iterations are compared.

The second iteration aims to improve performance by reducing noise in the dataset. Importantly, this iteration focuses solely on data preprocessing. The prompt itself remains unchanged from the structured baseline prompt. Specifically, irrelevant visual data is removed from BPMN files, significantly reducing the file size and, therefore, the number of input tokens required for processing. This step addresses the needle-in-the-haystack challenge by reducing extraneous data that may divert attention from the incoherence of the content (Nelson et al. 2024). By preserving all the important connections and process flows while removing the noise, the artifact is expected to deliver more accurate results. In addition, an extra check ensures that documents that are identical after the noise reduction step are identified beforehand, saving resources and minimizing the risk of hallucinations in the output.

In the third iteration, few-shot prompting is introduced. This refinement provides the LLM with more comprehensive examples of input and output scenarios, both specific to BPMN models and general process documentation, appended to the basic structured prompting template. The inclusion of more examples is intended to better guide the LLM’s decision-making process and ensure consistent and accurate categorization of changes. The enriched context is expected to improve the artifact’s ability to reliably detect and categorize inconsistencies.

The fourth iteration incorporates advanced prompt engineering techniques. A system message is introduced to frame the LLM as an experienced BPM expert, which is expected to improve its contextual understanding and decision-making. In addition, the chain of thought method is implemented, which prompts the LLM to generate a detailed reasoning process before making categorizations and decisions. This approach aims to improve the reasoning capabilities of the artifact, leading to more accurate and consistent categorization of changes. The iterative refinement process reinforces the principles of DSR by continuously enhancing the artifact through empirical evaluation and feedback integration. Each iteration builds on the previous one, incrementally improving the functionality and reliability of aProCheCk. To ensure full transparency, the final prompt developed in this last iteration, including comments on where content was added in previous iterations, is available in the supplementary repository in Section G of the online appendix. A higher level anatomy of the final prompts used can be found in Appendix C.

Experiment findings

The experimental results provide comprehensive insights into the artifact’s performance across multiple evaluation criteria, confirming its robustness and readiness for practical application. Through a series of iterations, the accuracy, consistency, and cost-effectiveness of aProCheCk were systematically evaluated and refined. The results are summarized in Table 2, where accuracy and consistency are expressed as a value from 0 to 1, with 1 representing the optimal result. Appendix B provides a detailed explanation of how accuracy and consistency are calculated (this is further extended in Section E of the online appendix). The average API cost is expressed on a monetary scale, with the lowest values indicating the best results. The data demonstrates that the mean accuracy and consistency have increased with each iterative refinement, indicating highly effective performance. Moreover, the costs per process run were reduced by over 50% from the first to the second iteration, indicating a significant optimization. Although there was a slight increase in cost in the third and fourth iterations due to the incorporation of additional sophisticated prompting techniques, the overall cost-effectiveness remained significantly better than the baseline established in the first iteration. A more in-depth analysis of the results of the experiment iterations, the statistical significance, and the respective scores for accuracy, robustness, and efficiency can be found in Section E of the online appendix.

Table 2 Experiment result summary

Full size table

Overall, aProCheCk demonstrated exceptional performance, given the complexity and subjectivity of the problem domain. While perfect accuracy is unattainable due to the inherent subjectivity of the problem at hand, the structured and iterative optimization process proved effective in significantly improving robustness, efficiency, and effectiveness, validating the artifact’s readiness for real-world BPM applications.

Focus group evaluation

The focus group evaluation is anchored in the three realities proposed by Sun and Kantor (2006): real tasks, real systems, and real users. These dimensions are central to assessing the practical applicability of the artifact and ensuring its alignment with the requirements of real business processes. Real tasks assess the performance of the artifact in the context of real business processes. Real systems evaluate the artifact’s integration into established BPM ecosystems. Real users highlight the practical utility and user experience of the artifact, ensuring alignment with business needs. The use of naturalistic data from different industries underlines aProCheCk’s ability to effectively address real-world scenarios. This data not only provides a robust dataset for validation but also highlights the versatility and practicality of the artifact. By illustrating the coherence checking capabilities of the artifact’s instantiation in a variety of settings, the readiness for use in real-world BPM environments is demonstrated.

Demonstrating the artifact with naturalistic data

An instantiation of aProCheCk is demonstrated and evaluated in a naturalistic setting, following the Eval 4 phase introduced by Sonnenberg and vom Brocke (2012). This naturalistic demonstration showcases the functionality of the artifact in a wider range of real-world scenarios with a prototypical implementation. The naturalistic demonstration employed for the focus groups includes checking the coherence of changes to a process model with a related BPMN diagram and vice versa. In addition, this demonstration introduces an overarching document coherence check, where process models are compared against updated process modeling guidelines, and a coherence check on a changing process diagram and a related process diagram of a variant of the process. It further generalizes the applicability of aProCheCk by incorporating different file formats, such as SVG for process models.

The data sources for the naturalistic data demonstration consist of real process documents from three German companies of diverse sizes and industries. These companies range from 200 to 300 employees to 10,000–15,000 employees, and the sectors represented include consulting, industrial manufacturing, and the energy and fuel sectors. Table 3 summarizes the demonstration data for the four use cases. To ensure confidentiality, parts of the documents have been anonymized, and only the anonymized results generated by the aProCheCk tool are provided. The notification generated by the initial naturalistic example process is illustrated in Fig. 6. The remaining three examples can be found in Section F of the online appendix, along with detailed descriptions of the exemplary use cases. These examples demonstrate aProCheCk’s ability to identify and manage changes across multiple levels of process documentation. The artifact effectively fulfills its design specifications by providing actionable insights, ensuring process coherence, and operating efficiently in diverse real-world contexts. These demonstrations underline the robustness and applicability of the artifact, highlighting its potential to significantly improve BPM practices in naturalistic scenarios.

Table 3 Summary of naturalistic artifact demonstration data

Full size table

Confirmatory Artifact Evaluation

The confirmatory evaluation of the developed artifact was conducted through two focus groups consisting of 4 and 5 BPM IT consultants, respectively. Each session lasted approximately one hour and aimed to validate aProCheCk in a naturalistic setting, employing the use cases introduced in the previous section. Both focus groups were conducted for the same purpose and using the same methodology. Conducting two separate sessions enabled us to collect more extensive feedback and validate our artifact more robustly.

This approach follows the guidelines established by Tremblay et al. (2010) and is aligned with the Eval 4 phase of the evaluation process proposed by Sonnenberg and vom Brocke (2012). The focus groups were designed to evaluate the artifact in the context of the three realities proposed by Sun and Kantor (2006): real tasks, real systems, and real users, ensuring a comprehensive assessment of the artifact’s performance in realistic conditions. Participants were asked to evaluate the artifact based on three specific criteria derived from the Eval 4 phase: fidelity with real-world phenomenon, impact on artifact environment and user, and applicability (Sonnenberg and vom Brocke 2012). Similar to the conducted expert interviews, each criterion was accompanied by a Guiding question to anchor the discussions and allow for a comprehensive evaluation, and participants were instructed to rate each criterion on a Likert scale from 1 to 7.

Fidelity with real-world phenomenon was explored with the guiding question, “Could the developed artifact be used in a realistic working environment?” to assess its applicability and practicality in real-world settings. This criterion assesses the artifact’s capacity to manage the complexity of authentic BPM operations and to integrate seamlessly into existing workflows.

Impact on artifact environment and user was evaluated through the question “How do you assess the potential influence of the artifact on the working environment and the user?” to understand the impact of the artifact on the existing working environment and user interaction. Participants rated this criterion on a scale from 1, indicating “no positive influence at all”, to 7, indicating “completely positive influence”.

Applicability was assessed using the guiding question, “Would the system’s notifications be more of a burden, or would you find them helpful?” to determine the functional usefulness and practicality of the artifact. This criterion determines whether the artifact’s notifications are actionable and beneficial in improving process coherence while reducing manual effort.

The boxplot diagram in Fig. 7 depicts the distribution of scores across the specified criteria, derived from the nine responses provided by the participants. The results of the evaluation showed a consistently high rating for the criterion of impact on artifact environment and user, reflecting a strong positive perception of the artifact’s potential influence. Moreover, the fidelity with real-world phenomenon was rated favorably, although some reservations were expressed regarding the quality of the data present in practice. Such factors have the potential to impact the artifact’s performance in real-world scenarios. The applicability of the artifact was also met with favorable responses, although some raised concerns regarding the verbosity of the email notifications.

Key takeaways from the focus groups highlight various strengths and suggested improvements to aProCheCk. The experts consistently found the artifact impressive and highly relevant, identifying numerous use cases in their respective client organizations, with I 9 (FG1) stating that it is “Highly relevant, because […] in so many customer settings this issue somehow so quickly and easily leads to uncontrolled documentation”. Focus group participants emphasized the practical applicability of the artifact and provided insights into how its functionality could be enhanced. They suggested incorporating deeper process knowledge by embedding more of the organization’s process documentation in the LLM to identify interrelated changes across different processes, which could significantly enhance the usefulness of aProCheCk. The experts emphasized the considerable potential of aProCheCk in managing overarching process documents, particularly in reducing the necessity for manual work, exemplified by I 11 (FG2) saying, “something like this would extremely reduce the manual effort.“ They highlighted that when internal guidelines are updated, the artifact could automatically confirm coherence across all related documents. This capability eliminates the time-consuming task of manually checking each document individually, thereby ensuring organizational consistency and compliance with new policies or regulations. By automating these checks, the artifact significantly enhances operational efficiency and utility within the organization.

While acknowledging the current limitations, the experts were optimistic that future iterations of LLMs would further enhance the capabilities of the artifact. They suggested that future versions could enable automated process model generation and other advanced features. Data quality issues, prevalent in many organizations, were identified as a challenge, but the subjective decision-making capabilities of LLMs based on given contexts were seen as a promising solution for maintaining process coherence.

In summary, the focus groups provided valuable feedback highlighting the significant potential of aProCheCk, its practical utility, and areas for further improvement. By incorporating these insights into future iterations, the artifact can be refined and optimized to better meet the needs of diverse BPM environments. This confirmatory evaluation highlights both the current strengths of aProCheCk and the promising directions for its continuous improvement.

Discussion

In order to address our research objectives, we employed a comprehensive and multifaceted DSR approach involving multiple iterations and experts from both research and practice. First, design objectives were derived from the existing literature. Subsequently, provisional design specifications were developed based on the design objectives and evaluated and refined through expert interviews with researchers and practitioners. The evaluation was informed by a preliminary PoC demonstration. aProCheCk was developed iteratively, refined through experimental benchmarking, and then validated through focus groups using naturalistic data and an instantiation of the artifact. In addition, a Business Process Change Classification Framework and an open-source business process coherence checking dataset were developed. As such, our research has important implications for research and practice.

Theoretical implications

Our work represents a significant advancement in the field of BPM by applying generative AI, specifically LLMs, to improve process management practices. In particular, we address the research gap identified by Feuerriegel et al. (2024) concerning the detection of positive process deviance through the use of generative AI. Our research makes several important contributions:

First, we contribute to the field of business process coherence checking by introducing a novel approach that utilizes LLMs for the dynamic analysis of diverse process documentation, with the objective of identifying incoherencies. The field is currently dominated by static approaches to inconsistency detection utilizing structured process documentation, such as event logs (Ko and Comuzzi 2023). Existing research on unstructured text-based business process documents has largely focused on static pattern-matching approaches (Martin-Toral et al. 2010; van der Aa et al. 2017). Our LLM-based approach advances this line of work by enabling the autonomous identification of incoherencies in multi-level process documentation, thus moving beyond traditional static analyses.

Second, integrating LLMs into the BPM lifecycle, particularly in the process implementation and the monitoring phase, transforms traditional methods that rely on static data and manual reviews (Vidgof et al. 2023). While first studies have started to investigate the potential of LLMs in BPM (Franzoi et al. 2025b), specific applications, such as process coherence checking, remain scarce. Here, our work provides an important starting point by rigorously developing and evaluating an LLM-based artifact to continuously assess the coherence of multi-level process documentation. By enabling dynamic, AI-driven evaluations, the artifact facilitates the detection of negative deviations indicating inefficiencies and positive deviations suggesting innovation opportunities, thereby supporting more context-sensitive decision-making (Franzoi et al. 2025a) and demonstrating the importance of leveraging generative AI to move from static BPM methods to adaptive, proactive management systems (Feuerriegel et al. 2024).

Third, we contribute to the field by establishing detailed design specifications for coherence checking based on multi-level process documentation. These specifications balance functional requirements with the complexities of maintaining BPM documentation coherence, providing a robust foundation for future research.

Fourth, we introduce the Business Process Change Classification Framework, which comprises Business Process Change Dimensions and Change Relevance Categories. Developed through engagement with established BPM literature and extensive expert interviews, the framework deals with changes in text-based, multi-level business process documentation. By systematically categorizing and quantifying changes, the framework provides a structured mechanism for managing and interpreting the nuanced nature of BPM documentation. This advance improves the theoretical understanding of change management within BPM. Traditional methods often prove inadequate in addressing these complexities and subjectivities, whereas the developed artifact employs LLMs to effectively overcome these challenges.

Fifth, we contribute a robust, open-source dataset based on the established work of Sànchez-Ferreres et al. (2018), which provides further support for empirical BPM research. Enriched with expert insights, this dataset is structured around the introduced Business Process Change Classification Framework, providing a valuable resource for future studies. It supports empirical rigor and facilitates reproducibility, establishing a standard for dataset creation with a focus on business process coherence checking.

Furthermore, our empirical findings regarding the use of BPMN models with current-generation LLMs show that preprocessing BPMN XML files by removing non-essential visual data improves output quality and resource efficiency. These results simplify preprocessing requirements, demonstrate the viability of directly using BPMN files in AI-driven BPM tasks, and make the use of LLMs more accessible for practitioners and researchers alike.

In conclusion, this research makes substantial contributions to the fields of BPM and generative AI by addressing a crucial research gap and providing a comprehensive, empirically validated approach to integrating LLMs into BPM practices. Our contributions –including the development of aProCheck, the design specifications, the Business Process Change Classification Framework, the creation of a public dataset, and insights into BPMN preprocessing– enrich academic discourse and offer a robust foundation for future research and innovation in AI-driven business process management.

Practical implications

Besides implications for research, our work also provides important contributions to BPM practice. Integrating aProCheCk into BPM practices offers significant benefits in terms of operational efficiency, user experience, and organizational adaptability. This is highlighted, for example, by the feedback in the focus groups, which supports the high fidelity of the artifact to real-world scenarios and its potential for a positive impact on both users and operational environments. We identify several practical implications of our work.

First, aProCheck substantially reduces the manual effort required to maintain process coherence. By automating the detection of incoherencies in multi-level process documentation, the artifact enables organizations to reallocate valuable time and resources toward more strategic and creative activities. This shift not only enhances productivity but also increases the accuracy and consistency of process documentation, reducing the risks of human oversight inherent in manual review.

Second, aProCheCk’s ability to identify both negative and positive deviance at an early stage provides critical organizational benefits. Early detection of negative deviations allows for timely interventions to prevent potential process failures or inefficiencies. Similarly, identifying positive deviations can act as a catalyst for process innovation, enabling organizations to proactively incorporate beneficial practices. This dual capability supports a culture of continuous improvement, facilitating both the immediate correction of problems and the proactive exploitation of improvement opportunities.

Third, the artifact’s modular design provides further practical utility by offering the ability to interchange between different LLMs. This modularity allows organizations to comply with specific data security policies, such as selecting local models to protect sensitive information or upgrading to newer models as technology advances. This flexibility ensures that aProCheCk remains adaptable and effective over time.

Fourth, the successful implementation of aProCheCk depends on careful organizational preparation and active user engagement. When introducing the artifact in an organization, comprehensive employee training programs are essential to foster transparency, optimize user acceptance, and maximize the tool’s effectiveness. By educating employees about the functionality and potential benefits of the tool, organizations can foster a positive reception and a seamless integration. Furthermore, robust data security measures are critical to address privacy concerns associated with process documentation. It is essential that notifications are clear and user-centric to minimize resistance and increase practical utility.

Lastly, feedback from focus groups also highlights the broad applicability of the artifact to various BPM use cases. For example, it can be used to ensure compliance with regulatory standards, to maintain process integrity in the context of frequent changes, or to ensure consistency with changing process modeling guidelines. This demonstrates that aProCheCk can be tailored to meet different organizational needs. While its focus is on BPM, the artifact’s design principles suggest potential applications beyond this scope, pointing to its versatility and adaptability as a tool for wider organizational use.

In conclusion, aProCheck holds considerable promise for improving BPM practices by addressing current challenges in process documentation management and enabling AI-driven insights for continuous improvement. While full-scale deployment remains conceptual, the artifact’s foundational design and adaptable framework suggest considerable benefits. By addressing current BPM challenges and leveraging AI-driven insights for continuous improvement, aProCheCk represents a promising approach for organizations seeking to enhance operational efficiency, ensure compliance, and foster innovation. The potential for aProCheCk to provide substantial value across diverse business contexts reinforces its role as a transformative tool in the field of BPM.

Limitations and outlook

While our research makes important contributions, it also comes with several limitations. First, in order to manage the scope of this research, a number of simplifying assumptions were necessary. While defining the ground truth for changes and analyzing only two or three process documents per process with well-defined structures is a reasonable approach, it does not fully capture the complexity of real-world business environments. It would be beneficial for future studies to extend this work by testing aProCheCk in more intricate and varied scenarios, incorporating a diverse array of document types and structures to better reflect actual business conditions. Given these simplifying assumptions, the prospect of implementing aProCheCk in a practical organizational context offers exciting potential for future research. Deployment of the artifact in a real business environment can provide valuable insights and refine its functionality based on actual user feedback and operational demands. A real-world implementation would also allow for the evaluation of aProCheCk’s integration with existing BPM frameworks, enabling an assessment of its effectiveness and adaptability in a dynamic environment. By dealing with live data and real-time process changes, the researchers could further validate the robustness and operational reliability of the artifact, uncovering nuances that may be missed in controlled environments.

Second, the computational demands of current LLMs are substantial due to the inherent complexity of process documentation coherence checking tasks. The high computational and financial costs would be exponentially higher if aProCheCk were to be implemented across existing corporate process landscapes. It is thus critical to optimize computational efficiency. Batch processing, enabled by the artifact’s autonomous and non-time-critical operations, presents a significant opportunity. This could involve processing tasks in scheduled batches, particularly for non-urgent analyses, thereby substantially reducing both costs and environmental impacts. To further reduce resource requirements, smaller models could be used to filter out changes that are unlikely to cause incoherence, thus focusing the analysis on more relevant cases. Strengthening the preprocessing phase to remove further elements of the data that are not necessary for the coherence check based on the given changes could also reduce the computational workload. In addition to further BPMN preprocessing optimizations, additional noise reduction could also be implemented in other file formats for graphical models or structured data beyond BPMN.

Third, the artifact’s restricted access to broader contextual process knowledge beyond the specific process in question constitutes a limiting factor. This hinders a comprehensive understanding of broader organizational processes and interdependencies. Future research could integrate RAG techniques to enrich aProCheCk with detailed contextual process knowledge of the organization on multiple levels (Grisold et al. 2024; Franzoi et al. 2025a). In this way, the artifact can gain a more holistic view of the company’s processes, potentially identifying interdependencies and patterns that are critical to maintaining process coherence. The capabilities of aProCheCk could be further enhanced by integrating multimodal data, including training videos, event logs, and UI logs (e.g., Franzoi et al. 2025b). These data sources offer valuable contextual information and complementary data that may not be captured in text-based documentation alone. Incorporating such diverse data types will increase the breadth and depth of contextual understanding of the artifact, leading to more robust and accurate coherence checking and increased applicability.

Fourth, the management of the considerable volume of user notifications generated by aProCheCk represents a substantial challenge. To prevent users from becoming overwhelmed, these notifications could be integrated into existing BPM tools. Additionally, multiple notifications resulting from similar origins could be grouped into a single notification, thus reducing the overall number of notifications. Similarly, another notable limitation is the considerable user workload associated with the manual implementation of the proposed changes. Future iterations of the artifact may potentially alleviate this burden by directly implementing specific types of changes within the process documentation, as envisaged by the experts in the focus groups. Initial studies have demonstrated the potential of utilizing LLMs to automate alterations to BPMN models (Kourani et al. 2024). By integrating these automated changes with established content merge request mechanisms, users can maintain control while significantly reducing manual effort. Automating the implementation process would ensure that changes are applied consistently and accurately, thereby facilitating greater operational efficiency.

At last, technological limitations also affect the performance of aProCheCk. While conversational LLMs were used in this study and were effective for responsive, real-time processing tasks, they are not optimized for the logic and reasoning tasks that are essential for process coherence checking. Newer models will not only be more efficient but will also include different categories of LLMs, such as recent advances in reasoning-focused models, such as OpenAI’s o1 model, which provide enhanced capabilities specifically designed for solving complex logical tasks. These reasoning models are particularly interesting for the use case of checking the coherence of process documentation due to their logical reasoning capabilities. Future research should explore the integration of these advanced reasoning models, as their characteristics align well with the needs of the complex, logical, and analytical tasks inherent in maintaining process coherence. Additionally, while we focused on the use of LLMs for process coherence checking, we did not include a baseline comparison with simpler NLP techniques such as rule-based approaches. Future work could explore such comparisons to more clearly delineate the added value of LLMs over traditional methods in this context.

In summary, this research represents a significant advancement in the integration of LLMs into BPM, providing a foundation for future studies aimed at enhancing the coherence checking of process documentation in general. By addressing these limitations through expanded research –such as real-world validation, optimization of computational resources, integration of context and different data types, and the incorporation of advanced models– the utility and robustness of tools such as aProCheCk can be enhanced in the future.

Conclusion

In this study, we investigated the integration of LLMs into BPM with the aim of enhancing the coherence checking of multi-level process documentation. A novel artifact, aProCheCk, was developed to autonomously detect and address incoherencies in process documentation. The development process was guided by the DSR methodology, incorporating insights from both academic research and practical expertise. The artifact was refined through a process of iterative development and experimental benchmarking and validated using naturalistic data in focus groups, demonstrating its applicability and effectiveness. As such, our work contributes to BPM research by presenting a specific implementation of an LLM-based artifact, highlighting its benefits. From a theoretical perspective, this study extends an emerging line of work on generative AI applications in BPM, providing insights into how LLMs can be leveraged to analyze multi-level process documentation and support decision-making in process management. Practically, aProCheCk offers organizations an automated way for detecting both negative and positive deviations, reducing manual effort while supporting continuous process improvement and innovation. Overall, this study constitutes a fundamental step toward process coherence checking based on multi-level process documentation by leveraging LLMs.

Data availability

Supplementary materials for the paper titled ‘Toward LLM-Enabled Business Process Coherence Checking Based on Multi-Level Process Documentation’ by Schulte, M.; Franzoi*, S.; Köhne, F.; vom Brocke, J. submitted for publication to the journal Process Science can be accessed here: https://github.com/viadee/process-document-coherence-checker.

Notes

The full online appendix with all relevant documents is available here: https://github.com/viadee/process-document-coherence-checker.
https://github.com/viadee/process-document-coherence-checking-dataset.

References

Bartelheimer C, Wolf V, Beverungen D (2023) Workarounds as generative mechanisms for bottom-up process innovation—insights from a multiple case study. Inform Syst J 33:1085–1150
Article Google Scholar
Becker J, Bergener P, Delfmann P, Eggert M, Weiß B (2011) Supporting Business Process Compliance in Financial Institutions - A Model-Driven Approach. In: Bernstein A (ed) Proceedings of the 10th International Conference on Wirtschaftsinformatik: 16–18 February 2011 Zurich, Switzerland, vol 10, Zürich, pp 355–364
Binz M, Schulz E (2023) Using cognitive psychology to understand GPT-3. Proc Natl Acad Sci U S A 120:1–10
Article Google Scholar
Bose RPJC, van der Aalst WMP, Žliobaitė I, Pechenizkiy M (2011) Handling concept drift in process mining. In: Mouratidis H, Rolland C (eds) Advanced information systems engineering, vol 141, 23rd edn. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 391–405
Google Scholar
Brützke P, Killewald R, Franzoi S, vom Brocke J (2025) AI-assisted Process Mining for Context-sensitive Analysis Support. Proceedings of the European Conference on Information Systems (ECIS)
Busch K, Rochlitzer A, Sola D, Leopold H (2023) Just tell me: prompt engineering in business process management. In: van der Aa H, Bork D, Proper HA, Schmidt R (eds) Lecture notes in business information processing. Enterprise, business-process and information systems modeling, vol. 479. Springer Nature, Switzerland, pp 3–11. https://doi.org/10.1007/978-3-031-34241-7_1
Delias P (2017) A positive deviance approach to eliminate wastes in business processes. Ind Manag Data Syst 117:1323–1339
Article Google Scholar
Di Francescomarino C, Donadello I, Ghidini C, Maggi FM, Puura J (2025) Business process deviance mining with sequential and declarative patterns. Bus Inf Syst Eng
Eid-Sabbagh R-H, Kunze M, Meyer A, Weske M (2012) A platform for research on process model collections. In: van der Aalst W, Mylopoulos J, Rosemann M, Shaw MJ, Szyperski C, Mendling J, Weidlich M (eds) Business process model and notation, vol 125. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 8–22
Chapter Google Scholar
Fahland D, Fournier F, Limonad L, Skarbovsky I, Swevels AJE (2024) How well can large language models explain business processes?
Feuerriegel S, Hartmann J, Janiesch C, Zschech P (2024) Generative AI. Bus Inf Syst Eng 66:111–126
Article Google Scholar
Franzoi S, Hartl S, Grisold T, van der Aa H, Mendling J, vom Brocke J (2025a) Explaining process dynamics: a process mining context taxonomy for sense-making. Process Sci. https://doi.org/10.1007/s44311-025-00008-6
Article Google Scholar
Franzoi S, Delwaulle M, Dyong J, Schaffner J, Burger M, vom Brocke J (2025b) Using large Language models to generate process knowledge from enterprise content. In: Gdowska K, Gómez-López MT, Rehse J-R (eds) Business process management workshops, vol 534. Springer Nature Switzerland, Cham, pp 247–258
Chapter Google Scholar
Friedrich F, Mendling J, Puhlmann F (2011) Process model generation from natural Language text. In: Mouratidis H, Rolland C (eds) Advanced information systems engineering, 23rd edn. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 482–496
Google Scholar
Galperin BL (2012) Exploring the nomological network of workplace deviance: developing and validating a measure of constructive deviance. J Appl Soc Psychol 42:2988–3025
Article Google Scholar
Gregor S, Hevner AR (2013) Positioning and presenting design science research for maximum impact. MIS Q 37:337–355
Article Google Scholar
Marvin G, Hellen N, Jjingo D, Nakatumba-Nabende J (2024) Prompt engineering in large Language models. In: Jacob Ij, Piramuthu S, Falkowski-Gilski P (eds) Data intelligence and cognitive informatics. Springer Nature Singapore, Singapore, pp 387–402
Grisold T, van der Aa H, Franzoi S, Hartl S, Mendling J, vom Brocke J (2024) A Context Framework for Sense-making of Process Mining Results. In: 2024 6th International Conference on Process Mining (ICPM). IEEE, pp 57–64
Harl M, Zilker S, Weinzierl S (2024) Towards automated business process redesign in runtime using generative machine learning. Proceedings of the European Conference on Information Systems (ECIS)
Hevner AR, March ST, Park J, Ram S (2004) Design science in information systems research. MIS Q 28:75–105
Article Google Scholar
Hevner AR, Parsons J, Brendel AB, Lukyanenko R, Tiefenbeck V, Tremblay MC, vom Brocke J (2024) Transparency in design science research. Decis Support Syst 182:1–11
Article Google Scholar
Kampik T, Warmuth C, Rebmann A, Agam R, Egger LNP, Gerber A, Hoffart J, Kolk J, Herzig P, Decker G, van der Aa H, Polyvyanyy A, Rinderle-Ma S, Weber I, Weidlich M (2024) Large Process Models: A Vision for Business Process Management in the Age of Generative AI. KI - Künstliche Intelligenz:1–15
Ko J, Comuzzi M (2023) A systematic review of anomaly detection for business process event logs. Bus Inf Syst Eng 65:441–462
Article Google Scholar
König UM, Linhart A, Röglinger M (2019) Why do business processes deviate? Results from a Delphi study. Bus Res 12:425–453
Article Google Scholar
Kourani H, Berti A, Schuster D, van der Aalst WMP, van der Aa H, Bork D, Schmidt R, Sturm A (2024) Process modeling with large Language models. Enterprise, Business-Process and information systems modeling, vol 511. Springer Nature Switzerland, Cham, pp 229–244
Chapter Google Scholar
Leopold H, Eid-Sabbagh R-H, Mendling J, Azevedo LG, Baião FA (2013) Detection of naming convention violations in process models for different languages. Decis Support Syst 56:310–325
Article Google Scholar
Lo LS (2023) The art and science of prompt engineering: a new literacy in the information age. Internet Ref Serv Q 27:203–210
Google Scholar
Martin-Toral S, Sainz-Palmero G, Dimitriadis Y (2008) Detection Of Incoherences In A Technical And Normative Document Corpus. In: Cordeiro J, Filipe J (eds) Proceedings of the Tenth International Conference on Enterprise Information Systems. SciTePress - Science and and Technology Publications, pp 282–287
Martin-Toral S, Sainz-Palmero G, Dimitriadis Y (2010) Hybrid Approach for Incoherence Detection Based on Neuro-fuzzy Systems and Expert Knowledge. In: Cordeiro J, Filipe J (eds) Proceedings of the 12th International Conference on Enterprise Information Systems. SciTePress - Science and and Technology Publications, pp 408–413
Mcintosh TR, Liu T, Susnjak T, Watters P, Halgamuge MN (2024) A reasoning and value alignment test to assess advanced GPT reasoning. ACM Trans Interact Intell Syst 14:1–37
Article Google Scholar
Mendling J, Pentland BT, Recker J (2020) Building a complementary agenda for business process management and digital innovation. Eur J Inform Syst 29:208–219
Article Google Scholar
Mertens W, Recker J (2017) Positive Deviance and Leadership: An Exploratory Field Study. In: Sprague R, Bui TX (eds) Proceedings of the 50th Hawaii International Conference on System Sciences (2017). Hawaii International Conference on System Sciences
Morana S, Kroenung J, Maedche A, Schacht S (2019) Designing process guidance systems. JAIS 20:499–535
Article Google Scholar
Nelson E, Kollias G, Das P, Chaudhury S, Dan S (2024) Needle in the haystack for memory based large language models. https://doi.org/10.48550/arXiv.2407.01437
Nwankpa JK, Roumani Y, Datta P (2022) Process innovation in the digital age of business: the role of digital business intensity and knowledge management. JKM 26:1319–1341
Article Google Scholar
Peffers K, Tuunanen T, Rothenberger MA, Chatterjee S (2007) A design science research methodology for information systems research. J Manage Inf Syst 24:45–77
Article Google Scholar
Polyvyanyy A, Smirnov S, Weske M (2015) Business process model abstraction. In: vom Brocke J, Rosemann M (eds) Handbook on business process management 1, vol 1. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 147–165
Rosemann M, vom Brocke J (2015) The six core elements of business process management. In: vom Brocke J, Rosemann M (eds) Handbook on business process management 1, vol 1. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 105–122
Rosemann M, Recker J, Flender C (2008) Contextualisation of business processes. IJBPIM 3:47
Article Google Scholar
Rosemann M, vom Brocke J, van Looy A, Santoro F (2024) Business process management in the age of AI – three essential drifts. Inf Syst E-Bus Manage. https://doi.org/10.1007/s10257-024-00689-9
Article Google Scholar
Sai C, Winter K, Fernanda E, Rinderle-Ma S (2023) Detecting deviations between external and internal regulatory requirements for improved process compliance assessment. In: Indulska M, Reinhartz-Berger I, Cetina C, Pastor O (eds) Advanced information systems engineering, vol 13901. Springer Nature Switzerland, Cham, pp 401–416
Sahoo PK, Datta R, Rahman MM, Sarkar D. Sustainable environmental technologies: recent development, opportunities, and key challenges. Applied Sciences. 2024;14(23):10956.
Saint-Dizier P (2018) Mining incoherent requirements in technical specifications: analysis and implementation. Data Knowl Eng 117:290–306
Article Google Scholar
Sànchez-Ferreres J, van der Aa H, Carmona J, Padró L (2018) Aligning textual and model-based process descriptions. Data Knowl Eng 118:25–40
Article Google Scholar
Schulhoff S, Ilie M, Balepur N, Kahadze K, Liu A, Si C, Li Y, Gupta A, Han H, Schulhoff S [Sevien], Dulepet PS, Vidyadhara S, Ki D, Agrawal S, Pham C, Kroiz G, Li F, Tao H, Srivastava A, . . . Resnik P (2024) The prompt report: a systematic survey of prompting techniques. https://doi.org/10.48550/arXiv.2406.06608
Schulhoff S, Ilie M, Balepur N, Kahadze K, Liu A, Si C, Li Y, Gupta A, Han H, Schulhoff S [Sevien], Dulepet PS, Vidyadhara S, Ki D, Agrawal S, Pham C, Kroiz G, Li F, Tao H, Srivastava A, . . . Resnik P (2024) The prompt report: a systematic survey of prompting techniques. https://doi.org/10.48550/arXiv.2406.06608
Setiawan MA, Sadiq S (2013) A methodology for improving business process performance through positive deviance. Int J Inf Syst Model Des 4:1–22
Article Google Scholar
Sonnenberg C, vom Brocke J (2012) Evaluations in the science of the Artificial – Reconsidering the Build-Evaluate pattern in design science research. In: Hutchison D, Kanade T, Kittler J, Kleinberg JM, Mattern F, Mitchell JC, Naor M, Nierstrasz O, Pandu Rangan C, Steffen B, Sudan M, Terzopoulos D, Tygar D, Vardi MY, Weikum G, Peffers K, Rothenberger M, Kuechler B (eds) Design science research in information systems. Advances in theory and practice, vol 7286. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 381–397
Sun Y, Kantor PB (2006) Cross-evaluation: a new model for information system evaluation. J Am Soc Inf Sci 57:614–628
Article Google Scholar
Teinemaa I, Dumas M, Maggi FM, Di Francescomarino C (2016) Predictive business process monitoring with structured and unstructured data. In: La Rosa M, Loos P, Pastor O (eds) Business process management, vol 9850. Springer International Publishing, Cham, pp 401–417
Tuunanen T, Winter R, vom Brocke J (2024) Dealing with complexity in design science research: a methodology using design echelons. MIS Q 48:427–458
Article Google Scholar
van der Aa H, Leopold H, Reijers HA (2017) Comparing textual descriptions to process models – the automatic detection of inconsistencies. Inf Syst 64:447–460
Article Google Scholar
van der Aa H, Carmona J, Leopold H, Mendling J, Padró L (2018) Challenges and opportunities of applying natural language processing in business process management. In: Bender EM (ed) The 27th International Conference on Computational Linguistics - proceedings of the conference: August 20–26, 2018, Santa Fe, New Mexico, USA: COLING 2018. Association for Computational Linguistics, Stroudsburg, PA, pp 2791–2801
van Dun C, Moder L, Kratsch W, Röglinger M (2023) ProcessGAN: supporting the creation of business process improvement ideas through generative machine learning. Decis Support Syst 165:113880
Article Google Scholar
Venable J, Pries-Heje J, Baskerville R (2016) FEDS: a framework for evaluation in design science research. Eur J Inf Syst 25:77–89
Article Google Scholar
Vidgof M, Bachhofner S, Mendling J (2023) Large Language models for business process management. Opportunities and Challenges
vom Brocke J, Winter R, Hevner A, Maedche A (2020) Special issue editorial –accumulation and evolution of design knowledge in design science research: a journey through time and space. JAIS 21:520–544
Article Google Scholar
Weinzierl S, Zilker S, Dunzer S, Matzner M (2024) Machine learning in business process management: A systematic literature review. Expert Syst Appl :1–43

Download references

Funding

Open Access funding enabled and organized by Projekt DEAL. As part of the Change.WorkAROUND project (promotion sign 02J21C166), this research was funded by the German Federal Ministry of Education and Research.

Author information

Authors and Affiliations

University of Münster, Münster, Germany
Marek Schulte, Sandro Franzoi & Jan vom Brocke
European Research Center for Information Systems, Münster, Germany
Sandro Franzoi & Jan vom Brocke
viadee Unternehmensberatung AG, Münster, Germany
Frank Köhne

Authors

Marek Schulte
View author publications
Search author on:PubMed Google Scholar
Sandro Franzoi
View author publications
Search author on:PubMed Google Scholar
Frank Köhne
View author publications
Search author on:PubMed Google Scholar
Jan vom Brocke
View author publications
Search author on:PubMed Google Scholar

Contributions

We describe the authors contributions by describing the respective contribution roles according to the CRediT Taxonomy: Marek Schulte: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Validation, Visualization, Writing – original draft, Writing – review & editing; Sandro Franzoi: Conceptualization, Investigation, Methodology, Project administration, Validation, Writing – original draft, Writing – review & editing; Frank Kühne: Conceptualization, Funding acquisition, Resources, Supervision, Validation, Writing – review & editing; Jan vom Brocke: Conceptualization, Resources, Supervision, Validation, Writing – review & editing.

Corresponding author

Correspondence to Sandro Franzoi.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: details on interview and focus group participants

The following table (Table 4) provides an overview of the interview partners, including details on their professional background and years of experience.

Table 4 Interview partner summary

Full size table

Appendix B: experimental dataset and metric calculation

The following table (Table 5) presents information about the experimental dataset, including the processes, a brief description, the source, and any changes made. Additionally, this appendix provides details on the calculation of accuracy and consistency metrics.

Table 5 Experimental dataset summary

Full size table

Metric Calculation

Accuracy is expressed as a percentage. The calculation proceeds in three steps:

Determine the maximum number of achievable accuracy points.
Calculate the number of achieved accuracy points.
Compute the percentage of achieved points.

The weights for each metric were determined in consultation with a BPM expert who also contributed to the dataset creation process. The pseudocode below outlines the calculation steps.

Maximum Accuracy Points:

Number of Relevant Changes from solution.config.

+ (0.5 * (Number of Relevant + Negligible Changes form solution.config).

Achieved Accuracy Points:

Number of Correctly Identified Changes.

+ (0.5 * Number of Correctly Identified Changes in Wrong Dimension).

- (0.25 * Number of Identified Extra Changes in Correct Dimension).
- (0.5 * Number of Identified Extra Changes in Wrong Dimension).
- (1 * Number of Identified Extra Changes Not In Config).

+ (0.5 * Number of Relevant + Negligible Changes form solution.config).

Accuracy Percentage:

Max(Achieved Accuracy Points/Maximum Accuracy Points, 0).

Consistency, which is also measured on a scale from 0 to 1, is defined as one minus the standard deviation of accuracy values across multiple runs.

Appendix C: structural composition of prompts

The following tables present the structural composition of the prompts used in the different phases of aProCheCk: Content Comparison and Coherence Check (Table 6) and Notification Creation (Table 7). Each table outlines the structural prompt elements, provides a brief description, and includes an example from the final prompts used in the instantiation, which are presented in full in Section G of the online appendix.

Table 6 Structure of content comparison and coherence check prompts

Full size table

Table 7 Structure of notification creation prompt

Full size table

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Schulte, M., Franzoi, S., Köhne, F. et al. Toward LLM-enabled business process coherence checking based on multi-level process documentation. Process Sci 2, 22 (2025). https://doi.org/10.1007/s44311-025-00024-6

Download citation

Received: 30 April 2025
Accepted: 07 September 2025
Published: 06 November 2025
Version of record: 06 November 2025
DOI: https://doi.org/10.1007/s44311-025-00024-6

Keywords

Part of a collection:

Toward LLM-enabled business process coherence checking based on multi-level process documentation

Abstract

Explore related subjects

Introduction

Research background

Process coherence checking

Generative AI in business process management

Research methodology

Phase 1. Identify problem and motivate (Eval 1)

Phase 2. Define objectives of a solution

Phase 3. Design and Development

Phase 3.a design specifications and initial proof of concept (PoC)

Phase 3.b design and development of prototype

Phase 3.c adaptation of the prototype

Phase 4. Demonstration and Evaluation

Phase 4.a demonstration and evaluation in Semi-structured expert interviews (Eval 2)

Phase 4.b experimental evaluation of prototype (Eval 3)

Phase 4.c demonstration and evaluation in focus groups (Eval 4)

Phase 5. Communication

Artifact description

Artifact overview: aProCheCk

Preprocessing

Content comparison

Coherence check

Artifact design

Design objectives

Proof of concept

Expert interview evaluation

Design specifications

Design evaluation

Business process change classification framework

Artifact development and operating logic

Evaluation

Experimental evaluation

Experiment composition

Experiment findings

Focus group evaluation

Demonstrating the artifact with naturalistic data

Confirmatory Artifact Evaluation

Discussion

Theoretical implications

Practical implications

Limitations and outlook

Conclusion

Data availability

Notes

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Additional information

Publisher’s Note

Appendices

Appendices

Appendix A: details on interview and focus group participants

Appendix B: experimental dataset and metric calculation

Appendix C: structural composition of prompts

Rights and permissions

About this article

Cite this article

Share this article

Keywords