Ironies of Generative AI: Understanding and mitigating productivity loss in human-AI interactions

Auste Simkute (University of Edinburgh, Edinburgh, United Kingdom), Lev Tankelevitch (Microsoft Research, Cambridge, United Kingdom), Viktor Kewenig (University College London, London, United Kingdom), Ava Elizabeth Scott (University College London, London, United Kingdom), Abigail Sellen (Microsoft Research, Cambridge, United Kingdom), and Sean Rintel (Microsoft Research, Cambridge, United Kingdom)


Abstract.

Generative AI (GenAI) systems offer opportunities to increase user productivity in many tasks, such as programming and writing. However, while they boost productivity in some studies, many others show that users are working ineffectively with GenAI systems and losing productivity. Despite the apparent novelty of these usability challenges, these ‘ironies of automation’ have been observed for over three decades in Human Factors research on the introduction of automation in domains such as aviation, automated driving, and intelligence. We draw on this extensive research alongside recent GenAI user studies to outline four key reasons for productivity loss with GenAI systems: a shift in users’ roles from production to evaluation, unhelpful restructuring of workflows, interruptions, and a tendency for automation to make easy tasks easier and hard tasks harder. We then suggest how Human Factors research can also inform GenAI system design to mitigate productivity loss by using approaches such as continuous feedback, system personalization, ecological interface design, task stabilization, and clear task allocation. Thus, we ground developments in GenAI system usability in decades of Human Factors research, ensuring that the design of human-AI interactions in this rapidly moving field learns from history instead of repeating it.

Keywords: Generative AI, Copilot, Large Language Models, Human Factors, human-automation interaction, human-centered design, human-AI interaction, usability

Conference: –; 2024; –
CCS Concepts: Human-centered computing → HCI theory, concepts and models; Computing methodologies → Artificial intelligence; Human-centered computing → Interaction design process and methods

1. Introduction

Generative artificial intelligence (GenAI) systems, such as large language models (LLMs) that can generate novel content and perform many other tasks, present myriad opportunities and challenges to humans in knowledge-intensive domains. GenAI applications have emerged in domains such as healthcare (Nova, 2023), research (Lund and Wang, 2023), writing (Dang et al., 2023; Chen and Chan, 2023), creative work (Parra Pennefather, 2023a; Oppenlaender, 2022; Gmeiner et al., 2023; Kulkarni et al., 2023), consulting (Dell’Acqua et al., 2023), and recruitment (Budhwar et al., 2023). Software engineering has been particularly impacted, with GenAI-assisted programming tools, such as GitHub Copilot (Friedman, 2021), being increasingly used to support software engineering practices and perform tasks such as auto-completing code (Kim et al., 2021), translating code across languages, and answering programming questions, among others (Ross et al., 2023; Sarkar et al., 2022).

GenAI’s ability to solve domain-specific problems speaks to its potential to augment human performance and transform productivity. Recent research already suggests the enormous positive impact these systems could have on workers’ performance in domains including programming (Peng et al., 2023), writing (Noy and Zhang, 2023), law (Choi and Schwarcz, 2023), and consulting (Dell’Acqua et al., 2023). Based on this research, the expectation is that new tools will often free up users’ time and allow them to focus on higher-level tasks, increasing their productivity. However, when using the new tools in practice, many users, such as programmers, report increased cognitive load, frustration, and time spent on the tasks that GenAI is intended to support. Feedback from Copilot users, together with usability studies of GenAI-driven programming tools, suggests that, in some cases, using GenAI support can in fact lead to productivity loss. For example, software engineers and novice programmers struggle to prompt systems effectively and to debug generated code, lose their state of flow when interrupted by long code suggestions, and get stuck in ineffective practices, such as reviewing, editing and then ultimately deleting suggestions (Prather et al., 2023; Barke et al., 2023; Sarkar et al., 2022). Similar observations are emerging in creative domains, where graphic (Oppenlaender, 2022; Kulkarni et al., 2023) and manufacturing (Gmeiner et al., 2023) designers struggle with prompt engineering and other aspects of GenAI interaction. This suggests that the potential of GenAI systems to boost productivity may not be guaranteed, evenly distributed, or fully exploited.

These observations mirror the long line of Human Factors studies exploring human-automation interactions in safety-critical systems in aviation, industrial plants, and other areas (Lee and Seppelt, 2009; Endsley, 2017). Indeed, they reflect the ‘ironies of automation’ (Bainbridge, 1983), which capture the idea that the more advanced an automated system is, the more important the human operator may be. [Footnote 1: Endsley (2023) draws a similar parallel between the ironies of automation and the challenges of modern AI systems; however, whereas that work covers both generative and non-generative AI and takes a high-level view of AI, the current paper focuses specifically on GenAI and examines concrete usability challenges documented in recent user studies of GenAI systems.] Despite automation taking over human manual control in areas where it is expected to provide superior performance, humans are still left to supervise automation. However, operators might have insufficient support to supervise, and so instead of being supported by automation, they find themselves cognitively overburdened, trying to decipher systems’ outputs and spot errors. Similarly, in the context of GenAI, users’ roles have shifted from producing output to evaluating it, often with little contextual information and situational awareness. This is exacerbated by GenAI tools’ ability to produce outputs at a capacity too demanding for adequate evaluation, with questionable reliability, and with poor explainability (Liao and Vaughan, 2023; Chen et al., 2023; Schellaert et al., 2023). Moreover, poor system and interface design can result in unhelpful restructuring of workflows, which increases cognitive load and undermines productivity gains (Bainbridge, 1983). This is echoed in programmers’ experiences and feedback around Copilot features (Sarkar et al., 2022; Barke et al., 2023; Prather et al., 2023), with evidence of similar effects emerging in other domains (Dang et al., 2023; Gu et al., 2023a; Gmeiner et al., 2023). Finally, as a result of which tasks get automated, as well as poor system design, automation often makes easy tasks easier while making hard tasks even harder. This same pattern is now being observed in usability studies of GenAI systems (Sarkar et al., 2022; Barke et al., 2023).

In this paper, we answer recent calls for bridging Human Factors and Human-Computer Interaction research to advance human augmentation by AI and human-AI interactions (Chignell et al., 2023). Extrapolating from over 30 years of Human Factors research on the ‘ironies’ of human-automation and productivity loss, we synthesize an overview of the usability and productivity challenges observed in recent GenAI user studies. We demonstrate how these challenges emerging in GenAI systems mirror those experienced by operators when automation was introduced to their workflows decades ago. Based on these parallels, we highlight key areas of productivity loss and provide insights into the human factors leading to these issues, exploring aspects including feedback, situational awareness, cognitive workload, workflow disruptions and others. We focus primarily on programming due to the early adoption of tools like GitHub Copilot and the accompanying usability research, but we also reflect on emerging studies from other domains, such as healthcare, writing, and design, showing that these issues are not limited to a single domain. Moreover, we discuss potential design solutions, emphasizing the importance of following the Human Factors principles of feedback and flexibility when designing GenAI systems. We suggest that the fast-paced innovation of GenAI will benefit from the decades of Human Factors research in order to design GenAI systems that truly harness the full productivity potential of this technology. In summary, our paper makes the following contributions:

(1) Based on Human Factors research and a synthesis of recent GenAI studies, we identify key challenges that can lead to productivity loss, grouped into four broad categories: (i) the production-to-evaluation shift, (ii) unhelpful workflow restructuring, (iii) task interruptions, and (iv) task-complexity polarization.
(2) We provide potential design directions from Human Factors research that address each category of challenges: (i) continuous feedback, (ii) system personalization, (iii) ecological interface design, (iv) main task stabilization and timing, and (v) clear task allocation. Throughout, we also emphasize the importance of following the Human Factors principles of feedback and flexibility.
(3) We motivate further research into the impact of GenAI systems on aspects such as situational awareness and cognitive workload to better understand systems’ unintended effects on human performance. We also encourage future researchers to take advantage of the plethora of relevant Human Factors work to enrich their understanding of existing human-GenAI interaction issues and anticipate others.

2. Productivity Challenges of Generative AI Automation

Here, we outline the key productivity challenges that have been observed in human-automation interaction over decades of Human Factors research and are now becoming apparent in user studies of GenAI systems. Our focus is on GenAI systems, the integrated whole comprising GenAI models and interfaces. Some challenges pertain to GenAI models (e.g., issues around prompting), and some pertain to interface design (e.g., issues around task interruptions).

We begin with challenges related to the shift from manual control or production to a more passive supervisory role of the user, such as monitoring and evaluation of AI outputs (Section 2.1). We explore specific aspects related to this shift, such as reduced situational awareness, the contributory factors of automation’s high capacity, complexity and opaqueness, reliability, and potential resultant complacency and over-reliance. We then outline how the introduction of automation such as GenAI can unhelpfully restructure users’ workflows, stifling their productivity (Section 2.2). We focus on how the introduction of new tasks, such as prompting or output adaptation, can affect user performance and how workflow restructuring can lead to loss of task sequence and feedback. We also explore the influence that task interruptions from AI suggestions can have on users’ productivity (Section 2.3). Finally, we explore how automation such as GenAI can paradoxically lead to easy tasks being made easier and hard tasks made harder, a phenomenon we refer to as ‘task-complexity polarization’ (also known as “clumsy automation” in Human Factors research (Wiener and Curry, 1980); see Section 2.4). Figure 1 outlines the four types of challenges.

Figure 1. Productivity challenges of Generative AI automation: (a) the production-to-evaluation shift, in which users’ situational awareness of their working environment is reduced, increasing the cognitive demand required to evaluate AI outputs; (b) unhelpful workflow restructuring, including the addition of new challenging tasks of prompting systems and adapting outputs, a loss of task sequence due to AI suggestions or other changes, and a loss of feedback when AI suggestions are presented without the relevant context; (c) task interruptions from automated AI suggestions; and (d) task-complexity polarization, in which automation tends to make easy tasks easier and hard tasks harder when implemented in practice.

2.1. The production-to-evaluation shift

Decades ago, the introduction of automation shifted many manual control tasks to monitoring tasks, leaving humans to supervise the automation (Sheridan, 2012). However, monitoring (or vigilance) is tedious and requires attention, and can, therefore, paradoxically impose a considerable workload on humans (Warm et al., 2008; Grubb et al., 1995). For example, when automation was introduced in the aviation context (e.g., detection of air traffic in an aircraft’s vicinity), pilots’ workload was not reduced but moved to supervising activity. Pilots reported spending more time interacting with automation and trying to understand it instead of concentrating their efforts on their primary task of flying the aircraft (Rudisill, 1995). In other domains, operators supervising automation also spent a significant amount of time and effort learning how to manage the new technology (Baxter et al., 2012) (see Section 2.2.2).

GenAI workflows have introduced a similar shift from manual control to monitoring (in this case, from the production of outputs to their evaluation), with Sarkar (2023) terming this new user role “critical integration” (see Figure 1a). [Footnote 2: This shift from production to evaluation is relative rather than absolute, as, for example, crafting prompts still constitutes a form of production (see Section 2.2.1).] In AI-assisted coding, users spend extended periods reviewing and validating code suggestions (Barke et al., 2023; Vaithilingam et al., 2022), sometimes at the expense of other productive tasks like writing code or running tests (Weisz et al., 2022; Vaithilingam et al., 2022). Some programmers have said that working with Copilot felt like a “proofreading task” (Weisz et al., 2022). Accordingly, in some cases, working with current GenAI systems might not benefit users relative to a more manual approach. For example, when Vaithilingam et al. (2022) compared programmers’ experience with Copilot versus traditional autocomplete, they found that Copilot participants failed to complete their tasks more often. When they did complete them, they were no faster than those who used autocomplete. Vaithilingam et al. (2022) suggest that assessing the correctness of generated code created an efficiency bottleneck, often leading participants down an unsuccessful path of debugging. This not only took time out of their main task, thereby decreasing productivity, but also required a significant amount of cognitive effort. A similar shift towards evaluation of outputs has been observed in consultancy (Dell’Acqua et al., 2023), and in creative writing, where most of the writing time is now being replaced by editing AI-generated text (Noy and Zhang, 2023). Overall, practitioners from various domains, such as advertising, education, business and law, overwhelmingly agree that GenAI outputs will require supervision (Woodruff et al., 2023).

2.1.1. Reduced situational awareness

A key reason why monitoring automation (like evaluating GenAI outputs) is so demanding is that, because processing is relatively more passive, it reduces operators’ situational awareness: their perception of data and elements of the situation, comprehension of the situation, and the projection of future status (Endsley, 1995). Passive processing resulting in decreased situational awareness has been observed with experienced air traffic controllers (Endsley et al., 1997; Metzger and Parasuraman, 2001) and in other automated tasks (Manzey et al., 2012). Low situational awareness significantly decreases operators’ ability to effectively monitor and observe errors in the automation and to determine whether the given situation is outside the bounds of automation capabilities (Jones and Endsley, 1996).

Evidence suggests that users of GenAI systems similarly experience reduced situational awareness. For example, participants in (Vaithilingam et al., 2022) reported that their debugging of AI-generated code was hampered because they could not use their intuition about where the bug might be and instead ended up refactoring or abandoning the code entirely. This is echoed by participants in (Barke et al., 2023) who say, e.g., “I don’t see the error immediately, and unfortunately, because this is generated, I don’t understand it as well as I feel like I would’ve if I had written it”. Participants in (Weisz et al., 2022) noted a trade-off between writing and debugging code, citing a lack of comprehension for AI-generated code translation and “spotting errors in ‘foreign’ code” as challenges. Similarly, in data science, users report feeling out of control when unable to understand AI-generated suggestions (McNutt et al., 2023) and highlight readability “as being a critical feature of usable synthesized code” (Drosos et al., 2020). For novices in a domain, this reduced situational awareness can be particularly challenging, as noted in (Prather et al., 2023). In the healthcare domain, AI-generated medical records may lead physicians to become detached from patients’ medical history, and in turn spend additional time analysing GenAI outputs to compensate for the missing information (Preiksaitis et al., 2023). These findings indicate that gaining situational awareness of GenAI output is demanding and takes users’ time and attention away from proceeding with the main task.

Automation research shows that low situational awareness can be exacerbated by several factors, including automation’s high output capacity and systems’ complexity, opaqueness, and low reliability. [Footnote 3: Situational awareness can also be reduced by automation-related unhelpful restructuring of workflows (Section 2.2), including changes in the task sequence (Section 2.2.3) and the loss of feedback (Section 2.2.4).] The next sections cover these factors, as well as a potential outcome of the ‘monitoring’ challenge of automation: complacency and over-reliance.

2.1.2. High automation capacity

Monitoring automation—in this case, evaluating GenAI output—is, ironically, made more difficult by the high capacity of automation, which makes it challenging to understand and anticipate system behaviour. For example, when traders in the digital stock exchange changed roles from executing to monitoring trades, they underperformed as they were unable to effectively monitor the trades in real-time (Haldane and May, 2011). As such, they resorted to monitoring them at a higher level of abstraction and required additional resources to process that information, thereby missing more trades that were executed in the meantime.

Similarly, GenAI is notable for its high capacity in outputting content, such as entire documents or software programs, or multiple simultaneous suggestions (Barke et al., 2023; Schellaert et al., 2023; Chen et al., 2023; Sarkar et al., 2022). This makes evaluating these outputs challenging. In GenAI-assisted coding, Barke et al. (2023) found that users deal with the plethora of code suggestions by quickly assessing them using a “pattern matching” approach, where they search for the presence of certain keywords or control structures. The impact of high output capacity can be worsened by poor system design. For example, participants in (Barke et al., 2023) noted that the separation of Copilot’s multi-suggestion pane from their main code increased cognitive load due to the lack of relevant code context when reviewing and trying to differentiate the code suggestions.

2.1.3. Automation complexity and opaqueness

Evaluation is further challenged by the complexity and opaqueness (i.e., poor explainability) of automated systems, which can reduce situational awareness. More features and modes create more possible interactions among system components and a corresponding reduction in system predictability as the system increasingly considers multiple factors or component states (Endsley et al., 2003). This can lead to unfamiliar and infrequent system states, which add to the challenge of comprehending systems’ workings. For example, even well-trained pilots were startled by unexpected flight automation system behaviours in complex systems (Wiener and Curry, 1980). System opaqueness similarly reduces situational awareness and affects monitoring, for example, in the use of automation aids in local government organisations (Lindgren, 2023). Put another way, system complexity and opaqueness make it more difficult for users to create an accurate mental model of the system needed for the correct interpretation of information, including situations where manual control will be needed (Baxter et al., 2012).

The opaqueness and complexity of GenAI systems are cited as key barriers to usability, including prompting and evaluating outputs (Liao et al., 2023; Sun et al., 2022). One issue, termed ‘fuzzy abstraction matching’ (Sarkar et al., 2022), describes the opaque relationship between the content of prompts and the resultant output, driven by the flexibility of GenAI models to produce plausible but potentially incorrect outputs for prompts with a wide range of abstraction. Another issue is the sheer range of implicit and explicit parameters available to users, which increases systems’ complexity (Schellaert et al., 2023). This not only makes prompting a challenge (e.g., (Zamfirescu-Pereira et al., 2023; Dang et al., 2023)) but also the evaluation of outputs (e.g., (Weisz et al., 2021; Barke et al., 2023; Liang et al., 2023)) as the two are inextricably intertwined in current systems. The top usability issue for AI programming assistants, as surveyed in (Liang et al., 2023), is not knowing what part of users’ code or comments the GenAI system is relying on to produce output. Likewise, one participant in (Barke et al., 2023) laments the challenge of evaluating code suggestions, “it might be nice if it could highlight what it’s doing or which parts are different, just something that gives me clues as to why I should pick one over the other”.

2.1.4. Automation reliability

The challenge of monitoring automation is further exacerbated by systems’ unreliability. For example, Metzger and Parasuraman (2005) found that air traffic controllers who worked with unreliable automation to make aircraft-to-aircraft conflict decisions were unable to monitor the systems effectively and were ultimately better at detecting conflicts without automation. Similar impacts of reliability were found for target detection and decision-making tasks (Galster et al., 2001; Wickens et al., 2000). Evaluation of GenAI outputs is likewise made harder by the non-determinism of GenAI models (Schellaert et al., 2023), which can produce different outputs for the same input, resulting in lower reliability from the user’s perspective. More than merely being non-deterministic, GenAI systems can introduce subtle or non-intuitive errors into outputs, particularly in long outputs such as multi-line code suggestions (Sarkar et al., 2022) (see also Section 2.4). Woodruff et al. (2023) found that knowledge workers across domains overwhelmingly cited a lack of reliability as a key reason for humans having to review GenAI outputs. Example concerns ranged from violation of brand standards and copyrights in generated content, to inaccuracies in legal documents (Woodruff et al., 2023).
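To make the notion of non-determinism concrete, the following is a minimal sketch, not a description of any particular GenAI system. It assumes sampling-based decoding, a common source of run-to-run variation, and uses a hypothetical three-way distribution over candidate completions for one fixed prompt. The probabilities, completion strings, and function names are all illustrative assumptions.

```python
import random

# Hypothetical probabilities over candidate completions for ONE fixed prompt.
# The values and strings are illustrative assumptions, not measured data.
COMPLETION_PROBS = {
    "suggestion A": 0.5,
    "suggestion B": 0.3,
    "suggestion C": 0.2,
}

def greedy_completion(probs):
    """Always pick the most probable completion: same input, same output."""
    return max(probs, key=probs.get)

def sampled_completion(probs, rng):
    """Draw a completion in proportion to its probability: output can vary."""
    completions = list(probs)
    weights = [probs[c] for c in completions]
    return rng.choices(completions, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded so the sketch is reproducible

# Repeat the "same prompt" 200 times under each decoding strategy.
greedy_runs = {greedy_completion(COMPLETION_PROBS) for _ in range(200)}
sampled_runs = {sampled_completion(COMPLETION_PROBS, rng) for _ in range(200)}

print("distinct outputs with greedy decoding:", len(greedy_runs))
print("distinct outputs with sampling:", len(sampled_runs))
```

Under these assumptions, repeated requests with an identical prompt return more than one distinct suggestion when sampling is used, so an evaluation a user has already done for one output does not carry over to the next.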

2.1.5. Potential complacency and over-reliance

Ultimately, as Human Factors research shows, the shift from production to evaluation, the resultant reduced situational awareness, and additional workload can result in complacency, over-reliance on systems, and increased errors (Parasuraman and Riley, 1997). Trying to recover from these errors further increases the workload and, as workload affects monitoring ability, can create a vicious cycle. In high-workload situations, there are fewer attentional resources available for monitoring imperfect automation, resulting in a risk of errors (McBride et al., 2011) and significantly longer error detection time (Dixon et al., 2005). Complacency due to high-workload conditions has been observed in aviation, where pilots would fail to conduct sufficient checks of system state (Parasuraman et al., 1993; Funk et al., 1999). In a spacecraft simulator study, operators did not properly assess the recommendations and simply complied with them, which resulted in missed failures (Manzey et al., 2006).

An increase in complacency and over-reliance related to output evaluation has been observed in GenAI user studies. For example, when verifying the correctness of AI-generated code, some programmers reported skimming through the output rather than reading and evaluating the code rigorously (Sarkar et al., 2022; Vaithilingam et al., 2022). This is especially prevalent for those with less experience, such as end-user programmers (Sarkar et al., 2022) or novices (Prather et al., 2023; Kazemitabaar et al., 2023). In some cases, this has led to errors that users either missed (Ross et al., 2023) or had to later spend time debugging (Vaithilingam et al., 2022). Notably, in advertising, both expert and non-expert writers showed overconfidence in the quality of AI-generated drafts, failing to thoroughly revise them (Chen and Chan, 2023). Complacency and over-reliance have also been reported in the data science domain (Gu et al., 2023b, a; Srinivasa Ragavan et al., 2022); in the legal domain, where “AI-assisted exams were more likely to miss hidden issues” (Choi and Schwarcz, 2023); and in the design domain, where one participant commented, “I would never design it like that, but this [GenAI system] thinks it can do it like that […] But this is what it gave me, so I don’t have a problem with that.” (Gmeiner et al., 2023). Over-reliance has been shown to lead to decreased performance; for example, management consultants showed overall poorer performance when they blindly adopted AI-generated outputs (Dell’Acqua et al., 2023).

2.2. Unhelpful workflow restructuring

Automation can restructure workflows in unhelpful ways by introducing new challenging tasks, disrupting familiar task sequences, and removing informative feedback (Figure 1b). This changes what strategies operators use, how they perceive information, and how they act in a specific context, potentially leading to ineffective use of freed-up time and cognitive resources. Thus, rather than reducing what they work on when all or part of tasks are automated, people instead rely on different strategies for working on that task (Bainbridge, 1983). For example, when automation introduces new tasks in operators’ workflow, disrupting their familiar workflow, they struggle to adapt their strategies (Klein et al., 2006). Likewise, when automation unexpectedly increases the workload during peak times, operators tailor the system or the task to accommodate the automation needs (Cork et al., 1998). If tailoring the system is not possible, users are forced to tailor their tasks, often having to add new tasks to their workload (Cork et al., 1998). For example, physicians using automation aids learned how to manipulate monitors displaying physiological data to fit their work strategies. However, because this manipulation was an additional task physicians had to perform, they avoided using the system in high-workload situations (Cork et al., 1998). Moreover, when automation changes the familiar sequence of the task, for example, by removing a step, operators make errors and repeat their actions. For example, physicians might forget to record a dose of medication in a log and mistakenly repeat the procedure (Altmann and Trafton, 2015). Finally, when automation removes the critical feedback necessary to make an informed decision, operators succumb to errors. For example, in aviation, pilots were missing critical failures due to relevant information from vibration and smell being lost in the automation process (Moray et al., 1986).

2.2.1. Prompting as a new task

The central role of prompting in GenAI systems is one major way in which such systems are restructuring workflows. Studies show that users struggle with prompting, dedicating considerable time and effort to it. In (Xu et al., 2022), programmers using a code generation plugin invested significant effort in experimenting with prompts to understand how their queries worked best. Likewise, in (Jiang et al., 2022), participants using an LLM-driven tool developed various strategies to deal with model failures, for example, rewording prompts by reducing the scope of the request or looking for alternative wording. Trying to adapt prompts is a cognitively demanding task, as participants must form a mental model of what the model can work with (the problem of ‘fuzzy abstraction matching’; (Sarkar et al., 2022)). Beyond being demanding, prompting may interfere with other aspects of users’ workflows. For example, Copilot users’ code commenting workflows can change. Participants in (Barke et al., 2023) wrote and re-wrote detailed comments intended for Copilot, hoping to increase the context available to the system, and then also spent time deleting comments for Copilot after the fact.

Similar workflow changes were observed in the design and writing domains. For example, one non-professional designer in (Kulkarni et al., 2023) complains, “it felt like I was fighting it…I felt like it was helpful, but I also felt like I had to massage every word and select every character very carefully not to upset it so that it could generate something I wanted” (see also (Oppenlaender, 2022)). Dang et al. (2023) distinguish between diegetic prompts (instructions implicitly conveyed by inputted content to be acted on by the system) and non-diegetic prompts (instructions explicitly conveyed to the system). The latter is particularly disruptive to users’ workflows in the writing domain, as they “[force] writers to shift from thinking about their narrative or argument to thinking about instructions to the system” (Dang et al., 2023) (see also (Yuan et al., 2022)), a finding echoed in the coding domain (Jayagopal et al., 2022). More broadly, prompting seems to function as a new task that competes with other workflow tasks, adding to the workload and potentially increasing over-reliance on automation as users invest more time into it (Endsley and Rodgers, 2016). Indeed, this might explain why some users try to coerce AI output to be useful (see Section 2.2.2) or become complacent in reviewing it (see Section 2.1.5).

2.2.2. Output adaptation as a new task

Another workflow change with GenAI is the need to adapt generated output, effectively a new type of task. In (Barke et al., 2023), several participants chose to adapt Copilot suggestions to use as a template for their code. Rather than accepting or rejecting code entirely, they deleted and edited parts so they would not have to write it from scratch. Others used the strategy of slowly breaking down large blocks of code and adapting them as needed or cherry-picking code from multiple suggestions. This suggests that the use of suggestions is not straightforward, and that programmers create complex strategies to fit suggestions into their workflows. The productivity gains of these workflow changes remain unknown, and although participants in (Barke et al., 2023) found them helpful, they may ultimately decrease productivity. For example, if the adapted code has an error, the necessary debugging will add to the workload, as observed in, e.g., (Barke et al., 2023) and (Vaithilingam et al., 2022). In the design domain, Gmeiner et al. (2023) found that manufacturing designers struggled with GenAI assistance. In this case, the GenAI system was found to be “dominating the design process”, and “designers either gave up and accepted unsatisfying results, improvised ‘hacky’ strategies to work around the AI or abandoned the AI assistance altogether and proceeded to work manually”.

The productivity gains or losses of output adaptation may depend on users’ expertise. In (Vaithilingam et al., 2022), participants of varying levels of expertise struggled to adapt the code suggestions, and many abandoned them entirely, thereby losing time. Among novices, code adaptation may particularly reduce productivity. Prather et al. (2023) studied novice programmers working with Copilot, identifying an unproductive interaction mode they termed “shepherding”, in which participants spent considerable time trying to coerce Copilot to produce useful code. This included accepting suggestions, then deleting them without any changes, or spending considerable time adapting suggestions without writing any code of their own. More broadly, the assortment of code adaptation strategies reflects a new layer of complex tasks that programmers are introducing to their workflow to accommodate and effectively use GenAI. Ironically, the more complex the code, the more powerful the potential productivity benefits, yet the more intricate and time-consuming the process of reviewing and adaptation might become (e.g., (Barke et al., 2023)).

2.2.3. Loss of task sequence

Workflow changes can also lead to difficulty in following the familiar sequence of steps in a task. Many tasks have sequential constraints, a set of steps that have to be performed in a specific order. When one of the steps is skipped or repeated, errors can occur (Altmann and Trafton, 2015). To perform a task correctly under sequential constraints, the cognitive system has to keep track of where it is in the sequence and select the correct next step when one step is complete (Altmann and Trafton, 2015). Changes in the structure of the task can make it difficult for one to follow the natural sequence of the steps. Automation research showed that operators’ reactions are slower and less integrated when they cannot generate the sequence of activity themselves (Janssen et al., 2015). Not having a task structure to follow also prevents users from monitoring their own progress. Under manual control, users obtain information about the results of their actions and then can correct themselves (Smith, 1979). Without this information, they are more likely to repeat the same type of errors (Wiener and Curry, 1980).

In GenAI workflows, auto-suggestions generated by the system or the requirement to prompt systems are examples of disruptions to the familiar sequence of steps, which could lead to productivity loss, as evidenced in recent studies. In the coding domain, (Barke et al., 2023) found that long code suggestions in Copilot disrupted users’ task sequence by “forcing them to jump in to write code before coming up with a high-level architectural design”. Analogously, in the design domain, (Gmeiner et al., 2023) found that the need for prompting meant that designers had to specify required parameters in advance instead of working step-by-step, thereby requiring designers “to think through the design problem in advance, which is challenging and different from the usual iterative design process”. This loss of task sequence can be particularly disruptive among novices. For example, (Prather et al., 2023) identified an unproductive interaction pattern among novice programmers called “drifting”, in which participants spent time adapting code suggestions, then deleting them, and repeating the cycle. Thus, they unproductively drifted from suggestion to suggestion without a direction. Moreover, this was exacerbated if the generated output contained an error, which sent users down a “debugging rabbithole”, in which they spent time trying to adapt incorrect code rather than focusing on the correct solution (Prather et al., 2023). In film production, Parra Pennefather (2023a) observed a filmmaker working with GenAI who had to shift between multiple software tools, struggling to identify which was the most suitable for which part of their creative process. The creative described the process as “an exercise in randomization and an attempt to control chaos” (Parra Pennefather, 2023b) (see (Oppenlaender, 2022) for similar observations with creative text-to-image generation workflows).

Task sequence can also be obscured when a large part of the workflow is automated. For example, both expert and non-expert copywriters were anchored to GenAI suggestions and produced lower-quality results when GenAI generated the majority of the text versus when it only provided feedback to users (Chen and Chan, 2023). Similarly, professional novel writers (Calderwood et al., 2020) and inexperienced writers working with GenAI (Arnold et al., 2021) found guidance more useful than the injection of generated text. In these examples, users’ familiar task sequences in a given domain are disrupted by aspects of GenAI systems.

2.2.4. Loss of feedback

Automation can deprive users of key feedback needed to assess the state of automation and its ability to perform tasks. For example, automation can cause users to change from processing raw data to processing integrated information. Introducing automation into paper-making plants moved operators away from the information associated with informal feedback (e.g., smells, sounds) and put them in control rooms (Lee and Seppelt, 2009). This change not only required operators to learn the task of plant control but also deprived them of contextual information that could help them diagnose automation failures and intervene appropriately. Similarly, in aviation, relevant information from vibration and smell was lost in the automation of process control operations (Moray et al., 1986), and the automation of auto-feathering systems in commercial aircraft removed the signal telling pilots about engine shut-downs (Billings, 1991). The lack of transparency or supporting contextual feedback often only becomes an issue under system failures when operators lack the relevant detail for detecting or addressing them (Endsley et al., 1997).

An analogous loss of feedback has also been observed in GenAI-assisted coding. Participants in (Vaithilingam et al., 2022) noted that, in comparison to internet search tools like Stack Overflow, Copilot lacked additional information, such as discussions, explanations, and comparisons of code solutions. This sentiment was echoed by participants in (Ross et al., 2023), who noted that their AI code assistant “lacked the ‘multiple answers’…and ‘rich social commentary’…that accompanies answers on Q&A sites”. Thus, programmers using these tools see the code, comments, and data but miss out on the rich feedback that is usually available when programming with access to various media sources.

2.3. Task interruptions

Another aspect stifling productivity gains from GenAI is task interruption (Figure 1c). There are various cognitive costs related to interruptions (Altmann and Trafton, 2002; Janssen et al., 2011; Salvucci and Taatgen, 2011). Interruptions can disrupt the user’s thought processes (Altmann et al., 2014) and initiate a switch between tasks that requires time and cognitive resources, which negatively affects performance (Janssen et al., 2015). Particularly long and complex interruptions significantly disrupt people’s ability to resume their original tasks (Mark et al., 2012; Monk et al., 2008; Mark et al., 2008). Moreover, interruptions can also break the user’s flow state (Taekman and Shelley, 2010).

Copilot auto-suggestions have been shown to interrupt users’ main tasks, with programmers referring to Copilot auto-suggestions as “interrupting their thoughts” (Sarkar et al., 2022), “intrusive”, and “messing up thought process” (Prather et al., 2023). Accordingly, some programmers decided to switch the suggestions off to avoid distractions (Sarkar et al., 2022) or chose to disable the tool completely (Barke et al., 2023), while others admitted being “tempted to follow what it’s saying instead of just thinking about it” (Prather et al., 2023). Beyond programming, similar interruptions are reported in the writing domain (Clark et al., 2018; Dang et al., 2023; Bhat et al., 2023) and in data science (Mcnutt et al., 2023; Gu et al., 2023a).

Particularly distracting are the long, multi-line code suggestions. For example, these have been observed to break programmers’ flow when in ‘acceleration mode’, a state in which programmers work with well-formed intent, relative to an ‘exploration mode’, in which programmers start a novel task or debug (Barke et al., 2023). Programmers were distracted from their flow as they felt compelled to read the code. If they chose to consider it, they then had to review it for errors. Thus, long code suggestions force users to switch back and forth between writing and reviewing code, and if the code has errors, they must then switch to debugging (Vaithilingam et al., 2022). This may be particularly disruptive if the errors are unrelated to the current task focus, as found in (Weisz et al., 2021). Interruptions may be particularly impactful for novice programmers, who are tempted to read the large blocks of code despite perceiving them as a nuisance (Prather et al., 2023). Accordingly, their attention is shifted from thinking and problem-solving to deciphering. Ironically, the feature that should accelerate productivity significantly increases participants’ cognitive load due to the associated task-switching.

Programmers, particularly experienced ones, eventually learned to dismiss long, multi-line suggestions (Barke et al., 2023; Sarkar et al., 2022). Nevertheless, even when ultimately rejecting these, their thought processes were already disrupted. This was the case not only for novice programmers who reported “[wasting] time reading instead of thinking” (Prather et al., 2023), but also for experienced programmers: “I was about to write the code, and I knew what I wanted to write. But now I’m sitting here, seeing if somehow Copilot came up with something better than the person who’s been writing Haskell for five years…” (Barke et al., 2023). Similarly, in the writing domain, some users learned to ignore suggestions in certain contexts, whereas others deliberately sped up their writing to avoid getting distracted by a suggestion (Bhat et al., 2023).

That complex code suggestions are the most distracting during ‘acceleration’ and are more helpful during ‘exploration’ (Barke et al., 2023) suggests that their timing is a key factor. Indeed, automation research speaks to this. People respond faster to interrupting tasks if the interruptions are scheduled at breakpoints between main task chunks (Iqbal and Bailey, 2008) or occur at subtask boundaries (Bailey and Iqbal, 2008; Iqbal and Bailey, 2005; Janssen and Brumby, 2010). Similarly, (Cutrell et al., 2000) found that users interrupted earlier in a task were more likely to request a reminder after being interrupted, and (Cutrell and Guan, 2007) showed that the later in the main task an interruption occurs, the less recovery time is needed when subsequently returning to it. Indeed, in the data science domain, (Gu et al., 2023b) found that when AI suggestions were out of sync with users’ current analysis plans, participants were either distracted or ignored them.

2.4. Task-complexity polarization

Automation often makes easy tasks easier but fails to reduce the workload during cognitively demanding tasks, and in fact, often makes them harder (Lee and Seppelt, 2009). This has been termed “clumsy automation” in Human Factors research (Cook et al., 1991), but we introduce the more precise term task-complexity polarization (Figure 1d). One explanation is that easy tasks are easier to automate, and so the more difficult tasks tend to remain under manual control, albeit alongside the additional task of monitoring automation, and within a now more fragmented workflow (Lee and Seppelt, 2009). For example, automation has been shown to reduce pilots’ mental workload when it is already low during easy tasks, as when the plane is on autopilot during a straight flight. However, automation increased the mental workload of pilots when the flight-related workload was already high, e.g., during landing, as they then had to simultaneously reprogram the system managing autopilot, activate landing procedures, and manage communication (Wiener and Curry, 1980). Humans are also ineffective in shifting cognitive resources saved by automation to support more difficult tasks. In the study by Metzger and Parasuraman (2005), air traffic controllers used automation designed to aid conflict detection and resolution tasks. This was expected to free up mental resources that controllers could then allocate to more complex tasks. However, automation did not reduce the mental workload in routine tasks that were demanding, such as communication and accepting and handing off aircraft. Either the aid did not free enough resources, or the controllers could not allocate them to improve communication performance. Studies on automated decision-making used to support government tasks showed that the new technology often only reduced the easy assignments but left the difficult ones to the government workers, making their work more difficult and fragmented (Lindgren, 2023).

GenAI studies show that a similar pattern is emerging in current users of GenAI systems. First, there is evidence that GenAI systems are most helpful at making easy tasks even easier. For example, GenAI has been shown to be the most effective in supporting novice writers performing easy assignments and low-skilled customer service agents in entry-level tasks (Frey and Osborne, 2023). In AI-assisted programming, users across studies were most confident in using GenAI for simpler tasks, such as “writing boilerplate, repetitive code” (Barke et al., 2023), “short chunks of code” (Ross et al., 2023), or “coding in narrow contexts” (Sarkar et al., 2022). Barke et al. (2023) found that the most successful Copilot users were able to decompose the coding task into “microtasks”, which Copilot was effective at completing (see also (Vaithilingam et al., 2022)). However, it is precisely the task decomposition process itself that is the more cognitively demanding task, and for which Copilot was not able to provide support. Indeed, Copilot’s limitations with larger coding problems meant that “[it] led to more task failures in medium and hard tasks” (Vaithilingam et al., 2022) (see also (Ross et al., 2023; Sarkar et al., 2022)). In the data science domain, some users similarly reported feeling most confident in relying on GenAI for “peripheral tasks such as error-checking or report generation, rather than the central analysis process” (Gu et al., 2023b). Likewise, in a study of AI-assisted legal analysis using GPT-4, Choi and Schwarcz (2023) conclude that “AI helps with simple legal analysis but stumbles over complex legal reasoning”. Thus, whereas GenAI succeeds at making easy tasks even easier, current systems are less effective at supporting harder tasks.

There is also evidence that GenAI can make hard tasks even harder. First, as discussed throughout, GenAI systems can shift users’ roles to one of cognitively demanding output evaluation (Section 2.1), restructure workflows in unhelpful ways (Section 2.2), and interrupt workflows (Section 2.3), all of which can interfere with users as they work on demanding tasks, for example by depriving them of relevant context or disrupting their task sequence. This can be particularly disruptive for novices, as one participant noted about long code suggestions, “if you do not know what you’re doing, it can confuse you more” (Prather et al., 2023).

Secondly, GenAI systems can introduce errors into outputs that users must deal with. In AI-assisted coding, GenAI systems can “introduce subtle, difficult-to-detect bugs, which are not the kind that would be introduced by a human programmer writing code manually” (Sarkar et al., 2022). Errors are particularly likely in longer code suggestions (Barke et al., 2023; Sarkar et al., 2022), precisely the ones that might help users address complex tasks. This makes the already demanding task of debugging even more difficult, not only because of the inherent challenge of debugging ‘foreign’ code (as discussed in Section 2.1), but also because of errors’ subtlety and the difficulty in discerning whether an error is the user’s or the system’s fault (Barke et al., 2023; Vaithilingam et al., 2022; Sarkar et al., 2022). A similar concern about GenAI systems introducing errors has been raised in the data science domain (Gu et al., 2023b).

Thirdly, when users are stuck on a demanding task, although GenAI systems can provide multiple suggestions to help, this ends up overwhelming some users. Weisz et al. (2022) found that users’ frustration and mental demand were significantly heightened when multiple AI-generated code translations were shown to participants. Users similarly found the multi-suggestion pane in Copilot to be overwhelming when they accessed it during a state of coding “exploration” (i.e., starting a novel task or stuck on a task (Barke et al., 2023)). Thus, ironically, GenAI systems can make hard tasks even harder in various ways that may ultimately leave users with the same or increased cognitive workload.

3. Human Factors Solutions

Beyond diagnosing the usability challenges of automation, Human Factors research has spent decades studying approaches to mitigate these challenges (e.g., (Endsley, 2017; Sheridan and Parasuraman, 2005; Parasuraman et al., 2000, 1997)). Here, we outline some key potential design solutions that could reduce the productivity loss in human-GenAI interaction. These include providing continuous relevant feedback to users (Section 3.1), enabling system personalization (Section 3.2), applying ecological interface design (Section 3.3), using task stabilization and interruption timing techniques (Section 3.4), and enabling clear task allocation between users and systems (Section 3.5). Besides targeting individual productivity loss challenges, these solutions share the underlying Human Factors principles of providing feedback and enabling system flexibility (Carayon and Hoonakker, 2019).

More broadly, we argue that these proposed approaches aim to (i) increase user agency in how they adapt the GenAI support to users’ preferred ways of working, reducing the cognitive load stemming from disrupted workflows; (ii) increase users’ situational awareness of system changes and potential errors, reducing the cognitive load associated with the monitoring and evaluation of AI outputs; and (iii) increase user flexibility through the more granular application of AI support to their tasks, freeing users from having to make a binary decision of either using GenAI tools potentially ineffectively or not using them at all (Sarkar et al., 2022; Chen et al., 2023). Throughout, we focus on the programming domain as an example of how these approaches can be applied to GenAI systems.

3.1. Continuous feedback

When GenAI is introduced to users’ workflows, their role can shift from active involvement in performing the task (i.e., production) to more passively reviewing the AI-generated outputs for errors (i.e., output evaluation). The latter is a cognitively demanding task due to the lack of supporting contextual information and the resultant loss of situational awareness. We propose that feedback about system behavior is a key strategy to keep users engaged and in the loop of GenAI system performance.

During the monitoring stages, receiving continuous feedback is crucial for the operator to remain in the loop and recognise moments when interruption and input are needed (Loft et al., 2007; Lee and Seppelt, 2009). Feedback is essential to help operators know if their requests have been received, if the actions of the automation system are being performed properly, and if any errors are occurring (Norman et al., 1997). With GenAI systems, this includes knowing which aspects of the input are serving as prompts, how they are being interpreted by the system, how the output matches them, and whether there are any errors. Thus, feedback is tied to carefully designed explainability features (Liao and Vaughan, 2023; Sun et al., 2022). It should help users know why the system is responding in a certain way and allow them to build mental models of the system’s behaviour, how it interacts with them, and where they can expect failures (i.e., cause-and-effect relationships). Moreover, by helping users develop a more accurate mental model of the system, feedback can also serve to support users in better prompting and output adaptation (Chen et al., 2023; Liao and Vaughan, 2023), thereby helping them structure their workflows more effectively.

We suggest that GenAI tools should continuously provide relevant feedback to users, updating them on the system’s state, particularly during the monitoring stages. Feedback should be informative but non-intrusive, where the amount and form of feedback adapts to the interactive style of the participants and the nature of the problem (Norman et al., 1997).

In the context of GenAI systems, feedback is important for understanding system inputs and outputs and the cause-and-effect relationship between them. In the case of automated suggestions, users expressed a need for more information on specifically which code and comments Copilot relies on as inputs (Barke et al., 2023). In the case of conversational interfaces, feedback could highlight prompt changes and the resulting output changes (Zamfirescu-Pereira et al., 2023). Feedback could also be used to support pattern-matching between the AI suggestions and users’ task goals. For example, the output could have keywords highlighted, such as function calls or variable names, that would be a meaningful indication of a code fit (Barke et al., 2023). It could also include more context and documentation with the output, e.g., links to Stack Overflow or official documentation pages (Xu et al., 2022), or provide relevant usage examples (Moreno et al., 2015). To understand outputs, Vaithilingam et al. (2022) suggested using inline comments or highlighting different parts of the code based on confidence to help users understand the code generated by Copilot (see also (Weisz et al., 2021; Vasconcelos et al., 2023)). The authors also suggested supporting debugging by automatically generating test cases and test data for users to validate and identify corner cases (Vaithilingam et al., 2022). Weisz et al. (2022) proposed using alternate translations, where the system showed users the alternative it had considered to help them identify errors. In the writing domain, (Yuan et al., 2022) proposed that systems should give prompt suggestions to users.
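As an illustration of confidence-based highlighting, the following minimal sketch marks each generated line according to a confidence band. It assumes that a per-line confidence score in [0, 1] is available from the system; the `SuggestedLine` type, the `annotate` function, the thresholds, and the text markers are all hypothetical and stand in for whatever visual treatment an interface would actually use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SuggestedLine:
    text: str
    confidence: float  # hypothetical per-line score in [0, 1]


def annotate(lines: List[SuggestedLine], low: float = 0.5, high: float = 0.8) -> List[str]:
    """Prefix each line with a marker for its confidence band.

    "  " = high confidence, "? " = medium, "! " = low (review first).
    The thresholds are illustrative assumptions, not empirically derived.
    """
    annotated = []
    for line in lines:
        if line.confidence >= high:
            marker = "  "
        elif line.confidence >= low:
            marker = "? "
        else:
            marker = "! "
        annotated.append(marker + line.text)
    return annotated
```

In an editor, the markers would more plausibly be rendered as background colours or gutter icons, so that low-confidence regions draw the user's review effort first.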

Feedback can be overwhelming if it is poorly presented or excessive. It can also be incomprehensible without proper context, abstraction, and integration (Lee and Seppelt, 2009). As such, feedback should be provided by applying methods of ecological interface design (Rasmussen and Vicente, 1989) (see Section 3.3) and notification design (Paul et al., 2015) (see Section 3.4), which are effective approaches for improving situational awareness and error detection.

3.2. System personalization

Human Factors studies have shown that when system personalization is constrained, the cognitive demands on operators and the associated productivity loss both increase (Cook and Woods, 1997). Indeed, as described in Section 2.2, increased cognitive demand and productivity loss have been observed in studies of GenAI-assisted programming as users try to understand and accommodate systems by changing their ways of working. This could be mitigated by allowing users to flexibly personalize systems to fit their tasks and ways of working (Lee and Seppelt, 2009).

For example, users could personalize the system to provide help when needed rather than having suggestions generated automatically. In creative writing, choosing when to receive feedback from GenAI, rather than receiving AI-generated text, preserved writers’ creativity and alleviated anchoring effects and over-reliance (Chen and Chan, 2023). Users should also be able to inform the system about their state of work (e.g., ‘acceleration’ or ‘exploration’, as per (Barke et al., 2023)), so suggestions would better match the users’ goals in terms of complexity, variety, length, and frequency (Gu et al., 2023a). Systems could automatically detect users’ states (Barke et al., 2023; Gu et al., 2023a), guided by user-adjustable parameters, and respond according to user-provided preferences (Rao et al., 2023), feedback (Madaan et al., 2022), or through the use of prompts (Wu et al., 2022b). Users should also be able to personalize the inputs to the system. For example, (Barke et al., 2023) proposed that users should be able to control the context they provide to Copilot, enable comments that make code invisible to the tool, or decide that the tool will rely on Stack Overflow-style prompts rather than in-context code.

Personalization is particularly important as users might have varying levels of task and domain expertise, which has been shown to affect their preferences and needs regarding the amount and kind of information provided (Paris, 1988). For example, novice programmers might want to spend some time working on the problem themselves and only ask Copilot for support when they are stuck (Prather et al., 2023), whereas experts might want to simply complete their lines (Barke et al., 2023).
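To make the idea of user-adjustable parameters concrete, the sketch below shows one way a system might filter automatic suggestions by the user's declared work mode. The `WorkMode` values follow the 'acceleration' and 'exploration' distinction of (Barke et al., 2023); the `Preferences` fields, the default limits, and the `should_show` function are hypothetical and are not taken from any existing tool.

```python
from dataclasses import dataclass
from enum import Enum


class WorkMode(Enum):
    ACCELERATION = "acceleration"  # user has well-formed intent
    EXPLORATION = "exploration"    # user is starting a novel task or debugging


@dataclass
class Preferences:
    auto_suggest: bool = True          # False = help only on explicit request
    max_lines_acceleration: int = 1    # e.g., line completions only
    max_lines_exploration: int = 20    # longer suggestions tolerated


def should_show(suggestion: str, mode: WorkMode, prefs: Preferences,
                requested: bool = False) -> bool:
    """Decide whether to surface a suggestion under the user's preferences."""
    if requested:
        return True  # an explicit request always overrides the filters
    if not prefs.auto_suggest:
        return False
    n_lines = len(suggestion.splitlines())
    if mode is WorkMode.ACCELERATION:
        return n_lines <= prefs.max_lines_acceleration
    return n_lines <= prefs.max_lines_exploration
```

A novice who prefers to attempt the problem first could set `auto_suggest=False` and request help only when stuck, whereas an expert could keep the defaults and receive only line completions while in acceleration mode.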

3.3. Ecological interface design

The introduction of GenAI to users’ workflows can disrupt them, leaving workers looking to adjust their ways of working or their familiar task structure. These processes increase cognitive load and result in productivity losses. Moreover, these disruptive changes can prevent users from being able to exercise their expertise and from benefiting from AI support. To align GenAI systems with users’ workflows effectively, we suggest that GenAI systems be designed according to an ecological interface design (EID) approach. EID emphasizes designing interfaces that reflect users’ perceptual constraints within a work environment in a highly domain- and context-specific manner (Rasmussen and Vicente, 1989). Specifically, it emphasizes (i) combining what users control and what they see in the system so that they can interact using clear, real-time signals; (ii) providing a consistent mapping between work domain constraints and interface cues; and (iii) showing the system’s key relationships directly on the screen, making it easier for users to form a mental model of the system (Rasmussen and Vicente, 1989; McIlroy and Stanton, 2015). EID has been shown to reduce workload and improve performance in aviation risk management (King et al., 2022), medical domains (Effken et al., 1997), and automation-assisted driving (Stoner et al., 2003).

In practice, this approach suggests that an automation aid or AI system should be designed to perform consistently with operators’ mental models, preferences, and expectations in a given work domain (Goodrich and Olsen, 2003). For example, GenAI systems should consider a broader domain context for their inputs by including information from interactions with external sources within the work domain (e.g., with Copilot, the consideration of code beyond the current file (Bird et al., 2023)). Which sources are considered, and when, should be clearly communicated to users to support real-time control.

Systems should also consider work domain constraints. For example, Copilot should consider the natural task sequence of certain programming tasks by providing support for high-level architectural design (or planning) when it is needed and avoiding code suggestions that might interfere with this process (Gu et al., 2023a) (see also Section 3.4 for more on managing interruptions). Likewise, interfaces should adapt to support debugging when long code suggestions are provided, as outlined in Section 3.1. Systems should also help users understand how code suggestions map to and affect other aspects of the code beyond the local insertion point. Likewise, when helping physicians with administrative tasks, GenAI system outputs should include records of patients’ unique medical histories and physicians’ clinical reasoning (Preiksaitis et al., 2023).

EID also aims to support users’ ways of perceiving information in a specific domain. For example, it encourages using a hierarchical visual structure to display relevant information to allow multiple levels of information to be (meaningfully) visible simultaneously in the interface. This way, users can guide their attention to the level of interest, depending on their level of expertise and current task demands (Rasmussen and Vicente, 1989). This also supports flexibility, as users do not have to attend to a specific description level. For example, depending on where users are in their workflow, GenAI systems can provide programmers with suggestions at different levels of abstraction (Gu et al., 2023a), from high-level pseudo-code to low-level implementations, organized in a visual hierarchy, which would be particularly helpful for novices (Prather et al., 2023). Similarly, (Gu et al., 2023a) suggest that interactive visualizations, linked to users’ code and other parts of the interface, can be used to support decision-making.

Finally, as discussed in Section 3.1, explainability features are essential to help users form an accurate mental model. These features should be integrated directly into the interface (e.g., as in AI Chains (Wu et al., 2022a)), taking into consideration the work domain context. In the healthcare domain, explainability has been shown to be most effective when combined with insights from medical experts. Without considering domain specifics, explanations lacked important context and included unnecessary information (e.g., background skin texture) that confused expert dermatologists (DeGrave et al., 2023) (see also (Huang et al., 2023) for similar results in radiology).

3.4. Main task stabilization and interruption timing

As discussed in Section 2.3, GenAI system suggestions (e.g., Copilot code suggestions) interrupt users, especially during their flow states, distracting them and potentially leading to productivity loss. Accordingly, some users disable auto-suggestion features or GenAI systems entirely because of their distracting nature (Barke et al., 2023; Sarkar et al., 2022; Liang et al., 2023). Writers similarly prefer not to be interrupted by AI-generated snippets of text (Chen and Chan, 2023). Instead of forcing users to avoid interruptions by disabling tools, systems should preserve users’ flow states by incorporating task stabilization techniques or by carefully timing interruptions around their flow states.

3.4.1. Task stabilization via attention guidance

Interruptions can be designed to support task stabilization, i.e., to help users prepare their current (main) task for the temporary switch in focus (Czerwinski et al., 2004; Parnin and DeLine, 2010). For example, among software users and developers, (Paul et al., 2015) found that interruptions were helpful when they directed users to the parts of the current task (or a new task) they needed to attend to. Interruptive notifications were also useful as progress indicators, helping users plan and resume their next task after interruption. In the case of GenAI systems such as Copilot, this could manifest in long code suggestions being divided (e.g., via colour) into small logical units for programmers to easily parse during the acceleration (flow) mode (Barke et al., 2023). Alternatively, systems could direct users’ attention to certain keywords (e.g., via highlighting) that could help them identify the applicability of the suggestion by using “pattern matching” (Barke et al., 2023). In line with Human Factors principles, interface design should provide cues to guide users’ attention to the next appropriate action. Otherwise, users may fall into ‘procedural traps’ (Rasmussen and Vicente, 1989; Reason et al., 1997), novel situations where they rely on their normal rule set but without the usual success. Indeed, this has been observed in Copilot studies, where programmers end up in ‘debugging rabbitholes’ (Prather et al., 2023; Vaithilingam et al., 2022).
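As a rough illustration of dividing a long suggestion into small logical units, the sketch below splits suggested code at blank lines, so that each unit could then be given its own colour or reviewed separately. Treating blank lines as unit boundaries is a simplifying assumption of ours; a real system would more plausibly use the syntactic structure of the code. The function name `split_into_units` is hypothetical.

```python
from typing import List


def split_into_units(suggestion: str) -> List[str]:
    """Split suggested code into units separated by one or more blank lines."""
    units: List[str] = []
    current: List[str] = []
    for line in suggestion.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # A blank line closes the unit collected so far.
            units.append("\n".join(current))
            current = []
    if current:
        units.append("\n".join(current))
    return units


suggestion = "a = 1\nb = 2\n\n\nprint(a + b)\n"
for index, unit in enumerate(split_into_units(suggestion), start=1):
    print(f"--- unit {index} ---")
    print(unit)
```

Presenting units one at a time would let programmers accept or reject each part without parsing the whole block at once.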

3.4.2. Task stabilization via pre-interruption alerts

Task stabilization can also be achieved by using pre-interruption alerts, which function as progress indicators, helping users plan and resume their next task after interruption (Paul et al., 2015). Andrews et al. (Andrew, 2003) found that participants who received a pre-interruption alert could resume the main task faster than those who did not. This aligns with studies showing that adding a brief lag period before an interruption helps users set place-keepers at their current task point, making it easier for them to return to it after being interrupted (Altmann and Trafton, 2015; Brumby et al., 2013). Similar pre-interruption alerts may be helpful for GenAI systems. For example, when Copilot is about to suggest a long code chunk, an alerting notification could create the brief pause needed to lock in the user’s main task state. Even better, AI systems should set place-keepers automatically together with auto-suggestions, along with any other context-relevant information that could help users return to their train of thought. This would begin to address the challenge of helping users regain their prior context post-interruption, as has been raised in GenAI-assisted coding (Ross et al., 2023) and data science (Gu et al., 2023a, b).
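
To make the idea of automatic place-keepers concrete, the following is a minimal sketch in Python, not a description of how Copilot or any existing tool works. All names (PlaceKeeper, SuggestionSession, long_threshold, lag_seconds) and the threshold values are hypothetical assumptions introduced for illustration. The sketch records the user's position and queues a pre-interruption alert only when a suggestion is long, then lets the editor restore the saved position afterwards.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PlaceKeeper:
    """Hypothetical snapshot of where the user was before a suggestion appeared."""
    file: str
    line: int
    note: str  # e.g., the identifier or comment the user was editing


@dataclass
class SuggestionSession:
    """Hypothetical wrapper that alerts before long suggestions and saves a place-keeper."""
    long_threshold: int = 5    # assumption: suggestions with more lines count as "long"
    lag_seconds: float = 1.0   # assumption: brief lag announced before the suggestion
    events: List[str] = field(default_factory=list)
    place_keeper: Optional[PlaceKeeper] = None

    def offer(self, suggestion: str, file: str, line: int, note: str) -> None:
        n_lines = len(suggestion.splitlines())
        if n_lines > self.long_threshold:
            # Save the user's position automatically, then warn before interrupting.
            self.place_keeper = PlaceKeeper(file, line, note)
            self.events.append(
                f"alert: {n_lines}-line suggestion in {self.lag_seconds}s"
            )
        self.events.append("show suggestion")

    def resume(self) -> Optional[PlaceKeeper]:
        """Return (and clear) the saved position so the editor can restore context."""
        keeper, self.place_keeper = self.place_keeper, None
        return keeper
```

In this sketch, short suggestions are shown without any alert, so the pause is reserved for cases where the interruption is likely to be costly. A real system would also need to decide what contextual information to store alongside the position, which the sketch leaves as a single free-text note.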

3.4.3. Timing of interruptions

Timing interruptions thoughtfully is another way to reduce their associated productivity loss. Interruptions benefit user productivity when they provide awareness of things outside the user’s attention, such as new or background tasks (Paul et al., 2015). However, interruptions can be disruptive when related to a task currently in focus. We propose that systems such as Copilot should be able to recognise when the user is focused on a task (Barke et al., 2023; Gu et al., 2023a). Then, interruptions should be limited to supporting contextual alerts or providing information about ongoing tasks in the background (e.g., providing explainability information). Otherwise, during this stage, suggestions should carefully align with users’ flow (Gu et al., 2023a), in line with ecological interface design. The system should recognise the strategies that users employ during the flow state and support them by completing their thought processes, for example, by auto-completing the end of the code line (Barke et al., 2023) or by providing only short code suggestions (Prather et al., 2023). When systems recognise that users are not in a flow state, they could give users prompt examples and suggestions (Yuan et al., 2022), feedback (Chen and Chan, 2023), or goal-oriented guidance (Arnold et al., 2021). This could be supported further by user personalization as per Section 3.2. This would enable GenAI support to be used more narrowly (e.g., to provide warning messages and supporting contextual information or short snippets of code) rather than users having to use the GenAI ineffectively or turn it off completely.
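
The gating logic described above can be sketched as a small policy function. This is an illustrative Python sketch under stated assumptions: the suggestion categories, the function names, and in particular the flow heuristic (treating a user who typed within the last two seconds as being in flow) are hypothetical and are not drawn from the cited studies. Detecting flow reliably is an open problem that the sketch does not attempt to solve.

```python
from enum import Enum


class Kind(Enum):
    """Hypothetical categories of GenAI output."""
    LINE_COMPLETION = "line_completion"
    SHORT_SNIPPET = "short_snippet"
    LONG_SUGGESTION = "long_suggestion"
    PROMPT_EXAMPLE = "prompt_example"
    BACKGROUND_ALERT = "background_alert"


# Assumption: only these low-disruption kinds are shown while the user is in flow.
IN_FLOW_ALLOWED = {Kind.LINE_COMPLETION, Kind.SHORT_SNIPPET, Kind.BACKGROUND_ALERT}


def in_flow(seconds_since_keystroke: float, idle_cutoff: float = 2.0) -> bool:
    """Crude placeholder heuristic: recent typing is treated as a flow state."""
    return seconds_since_keystroke < idle_cutoff


def should_show(kind: Kind, seconds_since_keystroke: float) -> bool:
    """Decide whether a suggestion of the given kind should be shown now."""
    if in_flow(seconds_since_keystroke):
        return kind in IN_FLOW_ALLOWED
    # Outside flow, richer support (long suggestions, prompt examples) is allowed.
    return True
```

Under this policy the tool stays enabled throughout but narrows what it offers during flow, which is the alternative to switching it off entirely. The allowed set and the cutoff are natural candidates for the user personalization discussed in Section 3.2.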

3.5. Clear task allocation

GenAI user studies suggest that current systems make easy tasks easier and hard tasks harder for users, a phenomenon we have termed task-complexity polarization (and referred to as “clumsy automation” in the Human Factors literature (WIENER and CURRY, 1980)). Thus, it appears that these systems are not applied effectively to reduce overall workload. Human Factors research shows that one way to address this is by clearly specifying how tasks are allocated between the human and the system, particularly during high workload periods (Enstrom and Rouse, 1977; Wallace Sinaiko, 1972). This not only better distributes the workload according to the respective strengths and weaknesses of humans and automated systems but also reduces the cognitive demand on users stemming from trying to discern the relative responsibilities on a moment-by-moment basis. For example, in aviation, reducing pilots’ workflow to a single loop (eliminating the need for the operator to interact with the automation during high workload tasks) resulted in better performance in a cockpit simulator. Similarly, allocating tasks to the computer and allowing the operator to deal with the queue items manually have also been shown to reduce workload (Chu and Rouse, 1979). We suggest that the allocation of tasks between the user and GenAI system should be clearly defined and supported by GenAI systems. The user should know which tasks the GenAI system deals with at a given moment (Cook and Woods, 1997).

As discussed in Sections 2.1 and 2.4, for simple tasks or in low workload conditions, users often let the GenAI system operate continuously. However, when complex tasks need to be performed, they often step in and override the system and, in some cases, engage in ineffective practices (e.g., reviewing code suggestions, editing, and then deleting them (Prather et al., 2023)). Instead of having to do this, users should be able to proactively allocate responsibilities to the GenAI system. For example, according to their experience with the system, personal preferences, or expertise, they could identify tasks or parts of tasks that they are confident the AI will perform successfully without their oversight, or ones with which they have found the AI most helpful. Users might, for instance, prefer manually translating certain types of code (Weisz et al., 2022) or allowing the tool to be responsible for generating control structures while the user fills out the body (Barke et al., 2023). Likewise, users could allocate only repetitive ‘boilerplate’ code for the system to complete autonomously while requesting its high-level planning support (rather than entire code completion) during more complex or exploratory tasks. In creative domains, this might mean that GenAI tools provide ideas in an open-ended form (e.g., probing questions), rather than as explicit suggestions (Arnold et al., 2021), an approach that was found to be particularly helpful in copywriting (Chen and Chan, 2023). Making this initial allocation of responsibility and clearly understanding how tasks are divided would reduce the cognitive load of interacting with the GenAI system throughout demanding tasks. Moreover, it would help users better manage their demanding role as evaluators of AI output (as per Section 2.1).
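
One way to picture such proactive allocation is as an explicit, user-editable policy that the system consults before acting. The Python sketch below is a minimal illustration under assumptions: the owner labels, the task-type names, and the AllocationPolicy class are all hypothetical, and the sketch does not address how a system would classify the user's current task into one of these types.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical owner labels for a task type.
USER = "user"                            # the user does this manually
SYSTEM = "system"                        # the GenAI system completes it autonomously
SYSTEM_PLANS_ONLY = "system_plans_only"  # the system offers high-level planning only
VALID_OWNERS = (USER, SYSTEM, SYSTEM_PLANS_ONLY)


@dataclass
class AllocationPolicy:
    """Hypothetical user-defined mapping from task types to who is responsible."""
    default_owner: str = USER  # assumption: unallocated tasks stay with the user
    owners: Dict[str, str] = field(default_factory=dict)

    def assign(self, task_type: str, owner: str) -> None:
        if owner not in VALID_OWNERS:
            raise ValueError(f"unknown owner: {owner}")
        self.owners[task_type] = owner

    def owner_of(self, task_type: str) -> str:
        return self.owners.get(task_type, self.default_owner)

    def describe(self) -> List[str]:
        """Human-readable summary, so the user can see the current division of labour."""
        return [f"{task}: {owner}" for task, owner in sorted(self.owners.items())]


# Example allocation mirroring the cases discussed in the text.
policy = AllocationPolicy()
policy.assign("boilerplate", SYSTEM)
policy.assign("control_structures", SYSTEM)
policy.assign("exploratory_design", SYSTEM_PLANS_ONLY)
```

Here policy.owner_of("code_translation") falls back to the user because that task type was never allocated, and describe() gives the user a single place to check which tasks the system is handling at a given moment.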

Supporting effective task allocation depends on GenAI systems having a clear understanding of the work domain context, which is enabled by ecological interface design (see Section 3.3). As such, the described Human Factors approaches work in synergy to support human-GenAI interaction and productivity.

4. Conclusions

We have synthesized and analyzed the productivity challenges emerging during human-GenAI interactions, focusing on the much-studied domain of software development and noting similarities in areas such as data science, design, and writing. We have demonstrated the parallels between productivity challenges in older Human Factors automation studies and recent GenAI studies. Drawing on the human-automation studies, we have categorised these challenges and the underlying reasons related to Human Factors, such as workload, feedback, and situational awareness. We show how aspects like the shift from active production (e.g., writing code) to passive evaluation (e.g., reviewing code), unhelpful workflow restructuring, task interruptions, and task-complexity polarization can stifle human performance and effective implementation of GenAI.

Further extrapolating from human-automation studies, we have provided a set of design solutions that could help avoid productivity losses in human-GenAI interaction. More broadly, we argue for more consideration of users’ workflows, unique ways of working, and domain specificities when designing GenAI tools. To achieve this, we propose that systems be designed in accordance with ecological interface design, the principle of continuous feedback, support for flexibility via task allocation between users and systems, and user-guided system personalization. We also provide concrete design solutions for effectively guiding user attention during interruptions.

Our paper is an initial bridge between Human Factors and Human-Computer Interaction issues of human-GenAI interaction. There is, of course, far more nuanced Human Factors research that can help understand and address the key productivity challenges in this fast-paced area. Reciprocally, we also expect that future Human-Computer Interaction research may open up new domains of exploration for Human Factors.

Acknowledgements.

Anonymized.

References

Altmann and Trafton (2002) Erik M. Altmann and J. Gregory Trafton. 2002. Memory for goals: an activation-based model. Cognitive Science 26, 1 (2002), 39–83. https://doi.org/10.1207/s15516709cog2601_2
Altmann and Trafton (2015) Erik M. Altmann and J. Gregory Trafton. 2015. Brief Lags in Interrupted Sequential Performance: Evaluating a Model and Model Evaluation Method. International Journal of Human-Computer Studies 79 (July 2015), 51–65. https://doi.org/10.1016/j.ijhcs.2014.12.007
Altmann et al. (2014) Erik M. Altmann, J. Gregory Trafton, and David Z. Hambrick. 2014. Momentary interruptions can derail the train of thought. Journal of Experimental Psychology: General 143, 1 (2014), 215–226. https://doi.org/10.1037/a0030986
Andrew (2003) Alex M. Andrew. 2003. Humans And Automation: System Design And Research Issues, by Thomas B. Sheridan, Wiley, in cooperation with the Human Factors and Ergonomics Society, Santa Barbara, California, 2002, pp. xii, 264. ISBN 0-471-23428-1. Wiley Series in System Engineering and Management HFES Issues in Human Factors and Ergonomics Series, Vol. 3 (Hardback, £37.50). Robotica 21, 3 (June 2003), 345–345. https://doi.org/10.1017/S0263574702274858
Arnold et al. (2021) Kenneth C Arnold, April M Volzer, and Noah G Madrid. 2021. Generative Models can Help Writers without Writing for Them. Joint Proceedings of the ACM IUI 2021 Workshops (2021).
Bailey and Iqbal (2008) Brian P. Bailey and Shamsi T. Iqbal. 2008. Understanding changes in mental workload during execution of goal-directed tasks and its application for interruption management. ACM Transactions on Computer-Human Interaction 14, 4 (Jan. 2008), 21:1–21:28. https://doi.org/10.1145/1314683.1314689
Bainbridge (1983) Lisanne Bainbridge. 1983. Ironies of automation. Automatica 19, 6 (Nov. 1983), 775–779. https://doi.org/10.1016/0005-1098(83)90046-8
Barke et al. (2023) Shraddha Barke, Michael B. James, and Nadia Polikarpova. 2023. Grounded Copilot: How Programmers Interact with Code-Generating Models. Proceedings of the ACM on Programming Languages 7, OOPSLA1 (April 2023), 78:85–78:111. https://doi.org/10.1145/3586030
Baxter et al. (2012) Gordon Baxter, John Rooksby, Yuanzhi Wang, and Ali Khajeh-Hosseini. 2012. The ironies of automation: still going strong at 30?. In Proceedings of the 30th European Conference on Cognitive Ergonomics. ACM, Edinburgh United Kingdom, 65–71. https://doi.org/10.1145/2448136.2448149
Bhat et al. (2023) Advait Bhat, Saaket Agashe, Parth Oberoi, Niharika Mohile, Ravi Jangir, and Anirudha Joshi. 2023. Interacting with Next-Phrase Suggestions: How Suggestion Systems Aid and Influence the Cognitive Processes of Writing. In Proceedings of the 28th International Conference on Intelligent User Interfaces (IUI ’23). Association for Computing Machinery, New York, NY, USA, 436–452. https://doi.org/10.1145/3581641.3584060
Billings (1991) Charles E. Billings. 1991. Toward a Human-Centered Aircraft Automation Philosophy. The International Journal of Aviation Psychology 1, 4 (Oct. 1991), 261–270. https://doi.org/10.1207/s15327108ijap0104_1
Bird et al. (2023) Christian Bird, Denae Ford, Thomas Zimmermann, Nicole Forsgren, Eirini Kalliamvakou, Travis Lowdermilk, and Idan Gazit. 2023. Taking Flight with Copilot: Early insights and opportunities of AI-powered pair-programming tools. Queue 20, 6 (Jan. 2023), 10:35–10:57. https://doi.org/10.1145/3582083
Brumby et al. (2013) Duncan P. Brumby, Anna L. Cox, Jonathan Back, and Sandy J. J. Gould. 2013. Recovering from an interruption: Investigating speed-accuracy trade-offs in task resumption behavior. Journal of Experimental Psychology: Applied 19, 2 (2013), 95–107. https://doi.org/10.1037/a0032696
Budhwar et al. (2023) Pawan Budhwar, Soumyadeb Chowdhury, Geoffrey Wood, Herman Aguinis, Greg J. Bamber, Jose R. Beltran, Paul Boselie, Fang Lee Cooke, Stephanie Decker, Angelo DeNisi, Prasanta Kumar Dey, David Guest, Andrew J. Knoblich, Ashish Malik, Jaap Paauwe, Savvas Papagiannidis, Charmi Patel, Vijay Pereira, Shuang Ren, Steven Rogelberg, Mark N. K. Saunders, Rosalie L. Tung, and Arup Varma. 2023. Human resource management in the age of generative artificial intelligence: Perspectives and research directions on ChatGPT. Human Resource Management Journal 33, 3 (2023), 606–659. https://doi.org/10.1111/1748-8583.12524
Calderwood et al. (2020) Alex Calderwood, Vivian Qiu, K. Gero, and Lydia B. Chilton. 2020. How Novelists Use Generative Language Models: An Exploratory User Study. In IUI 2020 Workshops. https://www.semanticscholar.org/paper/How-Novelists-Use-Generative-Language-Models%3A-An-Calderwood-Qiu/8cf1fc0b87dfda2a11bfaaaa3a0bf9f9e069bb0f
Carayon and Hoonakker (2019) Pascale Carayon and Peter Hoonakker. 2019. Human Factors and Usability for Health Information Technology: Old and New Challenges. Yearbook of Medical Informatics 28, 1 (Aug. 2019), 71–77. https://doi.org/10.1055/s-0039-1677907
Chen et al. (2023) Xiang ’Anthony’ Chen, Jeff Burke, Ruofei Du, Matthew K. Hong, Jennifer Jacobs, Philippe Laban, Dingzeyu Li, Nanyun Peng, Karl D. D. Willis, Chien-Sheng Wu, and Bolei Zhou. 2023. Next Steps for Human-Centered Generative AI: A Technical Perspective. https://doi.org/10.48550/arXiv.2306.15774 arXiv:2306.15774 [cs].
Chen and Chan (2023) Zenan Chen and Jason Chan. 2023. Large Language Model in Creative Work: The Role of Collaboration Modality and User Expertise. https://doi.org/10.2139/ssrn.4575598
Chignell et al. (2023) Mark Chignell, Lu Wang, Atefeh Zare, and Jamy Li. 2023. The Evolution of HCI and Human Factors: Integrating Human and Artificial Intelligence. ACM Transactions on Computer-Human Interaction 30, 2 (April 2023), 1–30. https://doi.org/10.1145/3557891
Choi and Schwarcz (2023) Jonathan H. Choi and Daniel Schwarcz. 2023. AI Assistance in Legal Analysis: An Empirical Study. https://doi.org/10.2139/ssrn.4539836
Chu and Rouse (1979) Yee-Yeen Chu and William B. Rouse. 1979. Adaptive Allocation of Decisionmaking Responsibility between Human and Computer in Multitask Situations. IEEE Transactions on Systems, Man, and Cybernetics 9, 12 (Dec. 1979), 769–778. https://doi.org/10.1109/TSMC.1979.4310128
Clark et al. (2018) Elizabeth Clark, Anne Spencer Ross, Chenhao Tan, Yangfeng Ji, and Noah A. Smith. 2018. Creative Writing with a Machine in the Loop: Case Studies on Slogans and Stories. In 23rd International Conference on Intelligent User Interfaces (IUI ’18). Association for Computing Machinery, New York, NY, USA, 329–340. https://doi.org/10.1145/3172944.3172983
Cook and Woods (1997) Richard Cook and David Woods. 1997. Adapting to New Technology in the Operating Room. Human factors 38 (Jan. 1997), 593–613. https://doi.org/10.1518/001872096778827224
Cook et al. (1991) Richard I. Cook, David D. Woods, Elizabeth Mccolligan, and Michael B. Howie. 1991. Cognitive consequences of clumsy automation on high workload, high consequence human performance. In NASA, Lyndon B. Johnson Space Center, Fourth Annual Workshop on Space Operations Applications and Research (SOAR 90). https://ntrs.nasa.gov/citations/19910011398
Cork et al. (1998) Randy D. Cork, William M. Detmer, and Charles P. Friedman. 1998. Development and Initial Validation of an Instrument to Measure Physicians’ Use of, Knowledge about, and Attitudes Toward Computers. Journal of the American Medical Informatics Association 5, 2 (March 1998), 164–176. https://doi.org/10.1136/jamia.1998.0050164
Cutrell and Guan (2007) Edward Cutrell and Zhiwei Guan. 2007. What are you looking for?: an eye-tracking study of information usage in web search. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, San Jose California USA, 407–416. https://doi.org/10.1145/1240624.1240690
Cutrell et al. (2000) Edward B. Cutrell, Mary Czerwinski, and Eric Horvitz. 2000. Effects of instant messaging interruptions on computing tasks. In CHI ’00 Extended Abstracts on Human Factors in Computing Systems. ACM, The Hague The Netherlands, 99–100. https://doi.org/10.1145/633292.633351
Czerwinski et al. (2004) Mary Czerwinski, Eric Horvitz, and Susan Wilhite. 2004. A diary study of task switching and interruptions. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, Vienna Austria, 175–182. https://doi.org/10.1145/985692.985715
Dang et al. (2023) Hai Dang, Sven Goller, Florian Lehmann, and Daniel Buschek. 2023. Choice Over Control: How Users Write with Large Language Models using Diegetic and Non-Diegetic Prompting. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23). Association for Computing Machinery, New York, NY, USA, 1–17. https://doi.org/10.1145/3544548.3580969
DeGrave et al. (2023) Alex J. DeGrave, Zhuo Ran Cai, Joseph D. Janizek, Roxana Daneshjou, and Su-In Lee. 2023. Dissection of medical AI reasoning processes via physician and generative-AI collaboration. medRxiv (May 2023), 2023.05.12.23289878. https://doi.org/10.1101/2023.05.12.23289878
Dell’Acqua et al. (2023) Fabrizio Dell’Acqua, Edward McFowland, Ethan R. Mollick, Hila Lifshitz-Assaf, Katherine Kellogg, Saran Rajendran, Lisa Krayer, François Candelon, and Karim R. Lakhani. 2023. Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality. https://doi.org/10.2139/ssrn.4573321
Dixon et al. (2005) Stephen R. Dixon, Christopher D. Wickens, and Dervon Chang. 2005. Mission Control of Multiple Unmanned Aerial Vehicles: A Workload Analysis. Human Factors 47, 3 (Sept. 2005), 479–487. https://doi.org/10.1518/001872005774860005
Drosos et al. (2020) Ian Drosos, Titus Barik, Philip J. Guo, Robert DeLine, and Sumit Gulwani. 2020. Wrex: A Unified Programming-by-Example Interaction for Synthesizing Readable Code for Data Scientists. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI ’20). Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/3313831.3376442
EFFKEN et al. (1997) JUDITH A. EFFKEN, NAM-GYOON KIM, and ROBERT E. SHAW. 1997. Making the constraints visible: testing the ecological approach to interface design. Ergonomics 40, 1 (Jan. 1997), 1–27. https://doi.org/10.1080/001401397188341
Endsley (1995) Mica R. Endsley. 1995. Measurement of Situation Awareness in Dynamic Systems. Human Factors 37, 1 (March 1995), 65–84. https://doi.org/10.1518/001872095779049499
Endsley (2017) Mica R. Endsley. 2017. From Here to Autonomy: Lessons Learned From Human–Automation Research. Human Factors 59, 1 (Feb. 2017), 5–27. https://doi.org/10.1177/0018720816681350
Endsley (2023) Mica R. Endsley. 2023. Ironies of artificial intelligence. Ergonomics 66, 11 (Nov. 2023), 1656–1668. https://doi.org/10.1080/00140139.2023.2243404
Endsley et al. (2003) Mica R. Endsley, Cheryl A. Bolstad, Debra G. Jones, and Jennifer M. Riley. 2003. Situation Awareness Oriented Design: From User’s Cognitive Requirements to Creating Effective Supporting Technologies. Proceedings of the Human Factors and Ergonomics Society Annual Meeting 47, 3 (Oct. 2003), 268–272. https://doi.org/10.1177/154193120304700304
Endsley et al. (1997) Mica R Endsley, Richard H Mogford, Kenneth R Allendoerfer, and Michael D Snyder. 1997. Effect of Free Flight Conditions on Controller Performance, Workload, and Situation Awareness. Technical Report. Federal Aviation Administration.
Endsley and Rodgers (2016) Mica R. Endsley and Mark D. Rodgers. 2016. Distribution of Attention, Situation Awareness and Workload in a Passive Air Traffic Control Task: Implications for Operational Errors and Automation. Air Traffic Control Quarterly (Aug. 2016). https://doi.org/10.2514/atcq.6.1.21
Enstrom and Rouse (1977) Kenneth D. Enstrom and William B. Rouse. 1977. Real-Time Determination of How a Human Has Allocated His Attention between Control and Monitoring Tasks. IEEE Transactions on Systems, Man, and Cybernetics 7, 3 (March 1977), 153–161. https://doi.org/10.1109/TSMC.1977.4309679
Frey and Osborne (2023) Carl Benedikt Frey and Michael Osborne. 2023. Generative AI and the Future of Work: A Reappraisal. Brown Journal of World Affairs (2023).
Friedman (2021) Nat Friedman. 2021. Introducing GitHub Copilot: your AI pair programmer. https://github.blog/2021-06-29-introducing-github-copilot-ai-pair-programmer/
Funk et al. (1999) Ken Funk, Beth Lyall, Jennifer Wilson, Rebekah Vint, Mary Niemczyk, Candy Suroteguh, and Griffith Owen. 1999. Flight Deck Automation issues. The International Journal of Aviation Psychology (April 1999). https://doi.org/10.1207/s15327108ijap0902_2
Galster et al. (2001) Scott M. Galster, Robert S. Bolia, Merry M. Roe, and Raja Parasuraman. 2001. Effects of Automated Cueing on Decision Implementation in a Visual Search Task. Proceedings of the Human Factors and Ergonomics Society Annual Meeting 45, 4 (Oct. 2001), 321–325. https://doi.org/10.1177/154193120104500412
Gmeiner et al. (2023) Frederic Gmeiner, Humphrey Yang, Lining Yao, Kenneth Holstein, and Nikolas Martelaro. 2023. Exploring Challenges and Opportunities to Support Designers in Learning to Co-create with AI-based Manufacturing Design Tools. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23). Association for Computing Machinery, New York, NY, USA, 1–20. https://doi.org/10.1145/3544548.3580999
Goodrich and Olsen (2003) M.A. Goodrich and D.R. Olsen. 2003. Seven principles of efficient human robot interaction. In SMC’03 Conference Proceedings. 2003 IEEE International Conference on Systems, Man and Cybernetics. Conference Theme - System Security and Assurance (Cat. No.03CH37483), Vol. 4. IEEE, Washington, DC, USA, 3942–3948. https://doi.org/10.1109/ICSMC.2003.1244504
Grubb et al. (1995) Paula L. Grubb, Joel S. Warm, William N. Dember, and Daniel B. Berch. 1995. Effects of Multiple-Signal Discrimination on Vigilance Performance and Perceived Workload. Proceedings of the Human Factors and Ergonomics Society Annual Meeting 39, 21 (Oct. 1995), 1360–1364. https://doi.org/10.1177/154193129503902101
Gu et al. (2023a) Ken Gu, Madeleine Grunde-McLaughlin, Andrew M. McNutt, Jeffrey Heer, and Tim Althoff. 2023a. How Do Data Analysts Respond to AI Assistance? A Wizard-of-Oz Study. https://doi.org/10.48550/arXiv.2309.10108 arXiv:2309.10108 [cs].
Gu et al. (2023b) Ken Gu, Ruoxi Shang, Tim Althoff, Chenglong Wang, and Steven M. Drucker. 2023b. How Do Analysts Understand and Verify AI-Assisted Data Analyses? https://doi.org/10.48550/arXiv.2309.10947 arXiv:2309.10947 [cs].
Haldane and May (2011) Andrew G. Haldane and Robert M. May. 2011. Systemic risk in banking ecosystems. Nature 469, 7330 (Jan. 2011), 351–355. https://doi.org/10.1038/nature09659
Huang et al. (2023) Jonathan Huang, Luke Neill, Matthew Wittbrodt, David Melnick, Matthew Klug, Michael Thompson, John Bailitz, Timothy Loftus, Sanjeev Malik, Amit Phull, Victoria Weston, J. Alex Heller, and Mozziyar Etemadi. 2023. Generative Artificial Intelligence for Chest Radiograph Interpretation in the Emergency Department. JAMA Network Open 6, 10 (Oct. 2023), e2336100. https://doi.org/10.1001/jamanetworkopen.2023.36100
Iqbal and Bailey (2005) Shamsi T. Iqbal and Brian P. Bailey. 2005. Investigating the effectiveness of mental workload as a predictor of opportune moments for interruption. In CHI ’05 Extended Abstracts on Human Factors in Computing Systems (CHI EA ’05). Association for Computing Machinery, New York, NY, USA, 1489–1492. https://doi.org/10.1145/1056808.1056948
Iqbal and Bailey (2008) Shamsi T. Iqbal and Brian P. Bailey. 2008. Effects of intelligent notification management on users and their tasks. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’08). Association for Computing Machinery, New York, NY, USA, 93–102. https://doi.org/10.1145/1357054.1357070
Janssen and Brumby (2010) Christian P. Janssen and Duncan P. Brumby. 2010. Strategic Adaptation to Performance Objectives in a Dual-Task Setting. Cognitive Science 34, 8 (2010), 1548–1560. https://doi.org/10.1111/j.1551-6709.2010.01124.x
Janssen et al. (2011) Christian P. Janssen, Duncan P. Brumby, John Dowell, Nick Chater, and Andrew Howes. 2011. Identifying Optimum Performance Trade-Offs Using a Cognitively Bounded Rational Analysis Model of Discretionary Task Interleaving. Topics in Cognitive Science 3, 1 (2011), 123–139. https://doi.org/10.1111/j.1756-8765.2010.01125.x
Janssen et al. (2015) Christian P. Janssen, Sandy J. J. Gould, Simon Y. W. Li, Duncan P. Brumby, and Anna L. Cox. 2015. Integrating knowledge of multitasking and interruptions across different perspectives and research methods. International Journal of Human-Computer Studies 79 (July 2015), 1–5. https://doi.org/10.1016/j.ijhcs.2015.03.002
Jayagopal et al. (2022) Dhanya Jayagopal, Justin Lubin, and Sarah E. Chasins. 2022. Exploring the Learnability of Program Synthesizers by Novice Programmers. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology (UIST ’22). Association for Computing Machinery, New York, NY, USA, 1–15. https://doi.org/10.1145/3526113.3545659
Jiang et al. (2022) Ellen Jiang, Edwin Toh, Alejandra Molina, Kristen Olson, Claire Kayacik, Aaron Donsbach, Carrie J Cai, and Michael Terry. 2022. Discovering the Syntax and Strategies of Natural Language Programming with Generative Language Models. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI ’22). Association for Computing Machinery, New York, NY, USA, 1–19. https://doi.org/10.1145/3491102.3501870
Jones and Endsley (1996) D G Jones and M R Endsley. 1996. Sources of situation awareness errors in aviation. Aviation, space, and environmental medicine 67, 6 (June 1996), 507–512.
Kazemitabaar et al. (2023) Majeed Kazemitabaar, Xinying Hou, Austin Henley, Barbara J. Ericson, David Weintrop, and Tovi Grossman. 2023. How Novices Use LLM-Based Code Generators to Solve CS1 Coding Tasks in a Self-Paced Learning Environment. https://doi.org/10.48550/arXiv.2309.14049 arXiv:2309.14049 [cs].
Kim et al. (2021) Seohyun Kim, Jinman Zhao, Yuchi Tian, and Satish Chandra. 2021. Code Prediction by Feeding Trees to Transformers. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, Madrid, ES, 150–162. https://doi.org/10.1109/ICSE43902.2021.00026
King et al. (2022) Brandon J. King, Gemma J.M. Read, and Paul M. Salmon. 2022. Clear and present danger? Applying ecological interface design to develop an aviation risk management interface. Applied Ergonomics 99 (Feb. 2022), 103643. https://doi.org/10.1016/j.apergo.2021.103643
Klein et al. (2006) G. Klein, B. Moon, and R.R. Hoffman. 2006. Making Sense of Sensemaking 1: Alternative Perspectives. IEEE Intelligent Systems 21, 4 (July 2006), 70–73. https://doi.org/10.1109/MIS.2006.75
Kulkarni et al. (2023) Chinmay Kulkarni, Stefania Druga, Minsuk Chang, Alex Fiannaca, Carrie Cai, and Michael Terry. 2023. A Word is Worth a Thousand Pictures: Prompts as AI Design Material. https://doi.org/10.48550/arXiv.2303.12647 arXiv:2303.12647 [cs].
Lee and Seppelt (2009) John Lee and Bobbie Seppelt. 2009. Human Factors in Automation Design. In Springer Handbook of Automation. Springer, 417–436. https://doi.org/10.1007/978-3-540-78831-7_25
Liang et al. (2023) Jenny T. Liang, Chenyang Yang, and Brad A. Myers. 2023. Understanding the Usability of AI Programming Assistants. https://doi.org/10.48550/arXiv.2303.17125 arXiv:2303.17125 [cs].
Liao et al. (2023) Q. Vera Liao, Hariharan Subramonyam, Jennifer Wang, and Jennifer Wortman Vaughan. 2023. Designerly Understanding: Information Needs for Model Transparency to Support Design Ideation for AI-Powered User Experience. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23). Association for Computing Machinery, New York, NY, USA, 1–21. https://doi.org/10.1145/3544548.3580652
Liao and Vaughan (2023) Q. Vera Liao and Jennifer Wortman Vaughan. 2023. AI Transparency in the Age of LLMs: A Human-Centered Research Roadmap. https://doi.org/10.48550/arXiv.2306.01941 arXiv:2306.01941 [cs].
Lindgren (2023) Ida Lindgren. 2023. Ironies of Public Service Automation – Bainbridge Revisited. In Proceedings of the 24th Annual International Conference on Digital Government Research. ACM, Gdańsk Poland, 395–404. https://doi.org/10.1145/3598469.3598514
Loft et al. (2007) Shayne Loft, Penelope Sanderson, Andrew Neal, and Martijn Mooij. 2007. Modeling and Predicting Mental Workload in En Route Air Traffic Control: Critical Review and Broader Implications. Human Factors 49, 3 (June 2007), 376–399. https://doi.org/10.1518/001872007X197017 Publisher: SAGE Publications Inc.
Lund and Wang (2023) Brady D. Lund and Ting Wang. 2023. Chatting about ChatGPT: how may AI and GPT impact academia and libraries? Library Hi Tech News 40, 3 (Jan. 2023), 26–29. https://doi.org/10.1108/LHTN-01-2023-0009 Publisher: Emerald Publishing Limited.
Madaan et al. (2022) Aman Madaan, Shuyan Zhou, Uri Alon, Yiming Yang, and Graham Neubig. 2022. Language Models of Code are Few-Shot Commonsense Learners. https://arxiv.org/abs/2210.07128v3
Manzey et al. (2006) Dietrich Manzey, J. Elin Bahner, and Anke-Dorothea Hueper. 2006. Misuse of Automated Aids in Process Control: Complacency, Automation Bias and Possible Training Interventions. Proceedings of the Human Factors and Ergonomics Society Annual Meeting 50, 3 (Oct. 2006), 220–224. https://doi.org/10.1177/154193120605000303 Publisher: SAGE Publications Inc.
Manzey et al. (2012) Dietrich Manzey, Juliane Reichenbach, and Linda Onnasch. 2012. Human Performance Consequences of Automated Decision Aids: The Impact of Degree of Automation and System Experience. Journal of Cognitive Engineering and Decision Making 6, 1 (March 2012), 57–87. https://doi.org/10.1177/1555343411433844 Publisher: SAGE Publications.
Mark et al. (2008) Gloria Mark, Daniela Gudith, and Ulrich Klocke. 2008. The cost of interrupted work: more speed and stress. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, Florence Italy, 107–110. https://doi.org/10.1145/1357054.1357072
Mark et al. (2012) Gloria Mark, Stephen Voida, and Armand Cardello. 2012. “A pace not dictated by electrons”: an empirical study of work without email. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, Austin Texas USA, 555–564. https://doi.org/10.1145/2207676.2207754
McBride et al. (2011) Sara E. McBride, Wendy A. Rogers, and Arthur D. Fisk. 2011. Understanding the Effect of Workload on Automation Use for Younger and Older Adults. Human Factors 53, 6 (Dec. 2011), 672–686. https://doi.org/10.1177/0018720811421909 Publisher: SAGE Publications Inc.
McIlroy and Stanton (2015) Rich C. McIlroy and Neville A. Stanton. 2015. Ecological Interface Design Two Decades On: Whatever Happened to the SRK Taxonomy? IEEE Transactions on Human-Machine Systems 45, 2 (April 2015), 145–163. https://doi.org/10.1109/THMS.2014.2369372
Mcnutt et al. (2023) Andrew M Mcnutt, Chenglong Wang, Robert A Deline, and Steven M. Drucker. 2023. On the Design of AI-powered Code Assistants for Notebooks. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23). Association for Computing Machinery, New York, NY, USA, 1–16. https://doi.org/10.1145/3544548.3580940
Metzger and Parasuraman (2001) Ulla Metzger and Raja Parasuraman. 2001. The Role of the Air Traffic Controller in Future Air Traffic Management: An Empirical Study of Active Control versus Passive Monitoring. Human Factors 43, 4 (Dec. 2001), 519–528. https://doi.org/10.1518/001872001775870421 Publisher: SAGE Publications Inc.
Metzger and Parasuraman (2005) Ulla Metzger and Raja Parasuraman. 2005. Automation in Future Air Traffic Management: Effects of Decision Aid Reliability on Controller Performance and Mental Workload. Human Factors 47, 1 (March 2005), 35–49. https://doi.org/10.1518/0018720053653802 Publisher: SAGE Publications Inc.
Monk et al. (2008) Christopher A. Monk, J. Gregory Trafton, and Deborah A. Boehm-Davis. 2008. The effect of interruption duration and demand on resuming suspended goals. Journal of Experimental Psychology: Applied 14, 4 (2008), 299–313. https://doi.org/10.1037/a0014402 Place: US Publisher: American Psychological Association.
Moray et al. (1986) Neville Moray, Pam Lootsteen, and Jan Pajak. 1986. Acquisition of Process Control Skills. IEEE Transactions on Systems, Man, and Cybernetics 16, 4 (July 1986), 497–504. https://doi.org/10.1109/TSMC.1986.289252
Moreno et al. (2015) Laura Moreno, Gabriele Bavota, Massimiliano Di Penta, Rocco Oliveto, and Andrian Marcus. 2015. How Can I Use This Method?. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 1. 880–890. https://doi.org/10.1109/ICSE.2015.98 ISSN: 1558-1225.
Norman et al. (1997) D. A. Norman, Donald Eric Broadbent, Alan David Baddeley, and J. Reason. 1997. The ‘problem’ with automation: inappropriate feedback and interaction, not ‘over-automation’. Philosophical Transactions of the Royal Society of London. B, Biological Sciences 327, 1241 (Jan. 1997), 585–593. https://doi.org/10.1098/rstb.1990.0101 Publisher: Royal Society.
Nova (2023) Kannan Nova. 2023. Generative AI in Healthcare: Advancements in Electronic Health Records, facilitating Medical Languages, and Personalized Patient Care. Journal of Advanced Analytics in Healthcare Management 7, 1 (April 2023), 115–131. https://research.tensorgate.org/index.php/JAAHM/article/view/43 Number: 1.
Noy and Zhang (2023) Shakked Noy and Whitney Zhang. 2023. Experimental Evidence on the Productivity Effects of Generative Artificial Intelligence. https://doi.org/10.2139/ssrn.4375283
Oppenlaender (2022) Jonas Oppenlaender. 2022. The Creativity of Text-to-Image Generation. In Proceedings of the 25th International Academic Mindtrek Conference. 192–202. https://doi.org/10.1145/3569219.3569352 arXiv:2206.02904 [cs].
Parasuraman et al. (1993) Raja Parasuraman, Robert Molloy, and Indramani L. Singh. 1993. Performance Consequences of Automation-Induced ‘Complacency’. The International Journal of Aviation Psychology (Jan. 1993). https://doi.org/10.1207/s15327108ijap0301_1 Publisher: Lawrence Erlbaum Associates, Inc.
Parasuraman et al. (1997) Raja Parasuraman, Mustapha Mouloua, and Robert Molloy. 1997. Effects of Adaptive Task Allocation on Monitoring of Automated Systems. Human Factors 38 (Jan. 1997), 665–679. https://doi.org/10.1518/001872096778827279
Parasuraman and Riley (1997) Raja Parasuraman and Victor Riley. 1997. Humans and Automation: Use, Misuse, Disuse, Abuse. Human Factors 39, 2 (June 1997), 230–253. https://doi.org/10.1518/001872097778543886 Publisher: SAGE Publications Inc.
Parasuraman et al. (2000) R. Parasuraman, T.B. Sheridan, and C.D. Wickens. 2000. A model for types and levels of human interaction with automation. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans 30, 3 (May 2000), 286–297. https://doi.org/10.1109/3468.844354
Paris (1988) Cecile L. Paris. 1988. Tailoring Object Descriptions to a User’s Level of Expertise. Computational Linguistics 14, 3 (1988), 64–78. https://aclanthology.org/J88-3006
Parnin and DeLine (2010) Chris Parnin and Robert DeLine. 2010. Evaluating cues for resuming interrupted programming tasks. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, Atlanta Georgia USA, 93–102. https://doi.org/10.1145/1753326.1753342
Parra Pennefather (2023a) Patrick Parra Pennefather. 2023a. AI and the Future of Creative Work. In Creative Prototyping with Generative AI: Augmenting Creative Workflows with Generative AI, Patrick Parra Pennefather (Ed.). Apress, Berkeley, CA, 387–410. https://doi.org/10.1007/978-1-4842-9579-3_13
Parra Pennefather (2023b) Patrick Parra Pennefather. 2023b. Use Cases. In Creative Prototyping with Generative AI: Augmenting Creative Workflows with Generative AI, Patrick Parra Pennefather (Ed.). Apress, Berkeley, CA, 339–385. https://doi.org/10.1007/978-1-4842-9579-3_12
Paul et al. (2015) Celeste Lyn Paul, Anita Komlodi, and Wayne Lutters. 2015. Interruptive notifications in support of task management. International Journal of Human-Computer Studies 79 (July 2015), 20–34. https://doi.org/10.1016/j.ijhcs.2015.02.001
Peng et al. (2023) Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. 2023. The Impact of AI on Developer Productivity: Evidence from GitHub Copilot. http://arxiv.org/abs/2302.06590 arXiv:2302.06590 [cs].
Prather et al. (2023) James Prather, Brent N. Reeves, Paul Denny, Brett A. Becker, Juho Leinonen, Andrew Luxton-Reilly, Garrett Powell, James Finnie-Ansley, and Eddie Antonio Santos. 2023. “It’s Weird That it Knows What I Want”: Usability and Interactions with Copilot for Novice Programmers. https://doi.org/10.48550/arXiv.2304.02491 arXiv:2304.02491 [cs].
Preiksaitis et al. (2023) Carl Preiksaitis, Christine A. Sinsky, and Christian Rose. 2023. ChatGPT is not the solution to physicians’ documentation burden. Nature Medicine 29, 6 (June 2023), 1296–1297. https://doi.org/10.1038/s41591-023-02341-4 Number: 6 Publisher: Nature Publishing Group.
Rao et al. (2023) Haocong Rao, Cyril Leung, and Chunyan Miao. 2023. Can ChatGPT Assess Human Personalities? A General Evaluation Framework. https://arxiv.org/abs/2303.01248v2
Rasmussen and Vicente (1989) Jens Rasmussen and Kim J. Vicente. 1989. Coping with human errors through system design: implications for ecological interface design. International Journal of Man-Machine Studies 31, 5 (Nov. 1989), 517–534. https://doi.org/10.1016/0020-7373(89)90014-X
Reason et al. (1997) J. Reason, Donald Eric Broadbent, and Alan David Baddeley. 1997. The contribution of latent human failures to the breakdown of complex systems. Philosophical Transactions of the Royal Society of London. B, Biological Sciences 327, 1241 (Jan. 1997), 475–484. https://doi.org/10.1098/rstb.1990.0090 Publisher: Royal Society.
Ross et al. (2023) Steven I. Ross, Fernando Martinez, Stephanie Houde, Michael Muller, and Justin D. Weisz. 2023. The Programmer’s Assistant: Conversational Interaction with a Large Language Model for Software Development. In Proceedings of the 28th International Conference on Intelligent User Interfaces (IUI ’23). Association for Computing Machinery, New York, NY, USA, 491–514. https://doi.org/10.1145/3581641.3584037
Rudisill (1995) Marianne Rudisill. 1995. Line Pilots’ Attitudes About And Experience With Flight Deck Automation: Results Of An International Survey And Proposed Guidelines. Proceedings of the Eighth International Symposium on Aviation Psychology (1995).
Salvucci and Taatgen (2011) Dario D. Salvucci and Niels A. Taatgen. 2011. Toward a Unified View of Cognitive Control. Topics in Cognitive Science 3, 2 (2011), 227–230. https://doi.org/10.1111/j.1756-8765.2011.01134.x
Sarkar (2023) Advait Sarkar. 2023. Exploring Perspectives on the Impact of Artificial Intelligence on the Creativity of Knowledge Work: Beyond Mechanised Plagiarism and Stochastic Parrots. In Proceedings of the 2nd Annual Meeting of the Symposium on Human-Computer Interaction for Work (CHIWORK ’23). Association for Computing Machinery, New York, NY, USA, 1–17. https://doi.org/10.1145/3596671.3597650
Sarkar et al. (2022) Advait Sarkar, Andrew D. Gordon, Carina Negreanu, Christian Poelitz, Sruti Srinivasa Ragavan, and Ben Zorn. 2022. What is it like to program with artificial intelligence? https://doi.org/10.48550/arXiv.2208.06213 arXiv:2208.06213 [cs].
Schellaert et al. (2023) Wout Schellaert, Fernando Martínez-Plumed, Karina Vold, John Burden, Pablo A. M. Casares, Bao Sheng Loe, Roi Reichart, Sean Ó hÉigeartaigh, Anna Korhonen, and José Hernández-Orallo. 2023. Your Prompt is My Command: On Assessing the Human-Centred Generality of Multimodal Models. Journal of Artificial Intelligence Research 77 (June 2023), 377–394. https://doi.org/10.1613/jair.1.14157
Sheridan (2012) Thomas B. Sheridan. 2012. Human Supervisory Control. In Handbook of Human Factors and Ergonomics. John Wiley & Sons, Ltd, 990–1015. https://doi.org/10.1002/9781118131350.ch34
Sheridan and Parasuraman (2005) Thomas B. Sheridan and Raja Parasuraman. 2005. Human-Automation Interaction. Reviews of Human Factors and Ergonomics 1, 1 (June 2005), 89–129. https://doi.org/10.1518/155723405783703082 Publisher: SAGE Publications.
Smith (1979) H. P. R. Smith. 1979. A simulator study of the interaction of pilot workload with errors, vigilance, and decisions. Technical Report NASA-TM-78482. NASA. https://ntrs.nasa.gov/citations/19790006598
Srinivasa Ragavan et al. (2022) Sruti Srinivasa Ragavan, Zhitao Hou, Yun Wang, Andrew D Gordon, Haidong Zhang, and Dongmei Zhang. 2022. GridBook: Natural Language Formulas for the Spreadsheet Grid. In 27th International Conference on Intelligent User Interfaces (IUI ’22). Association for Computing Machinery, New York, NY, USA, 345–368. https://doi.org/10.1145/3490099.3511161
Stoner et al. (2003) Heather A. Stoner, Emily E. Wiese, and John D. Lee. 2003. Applying Ecological Interface Design to the Driving Domain: The Results of an Abstraction Hierarchy Analysis. Proceedings of the Human Factors and Ergonomics Society Annual Meeting 47, 3 (Oct. 2003), 444–448. https://doi.org/10.1177/154193120304700341 Publisher: SAGE Publications Inc.
Sun et al. (2022) Jiao Sun, Q. Vera Liao, Michael Muller, Mayank Agarwal, Stephanie Houde, Kartik Talamadupula, and Justin D. Weisz. 2022. Investigating Explainability of Generative AI for Code through Scenario-based Design. In 27th International Conference on Intelligent User Interfaces (IUI ’22). Association for Computing Machinery, New York, NY, USA, 212–228. https://doi.org/10.1145/3490099.3511119
Taekman and Shelley (2010) Jeffrey M. Taekman and Kirk Shelley. 2010. Virtual Environments in Healthcare: Immersion, Disruption, and Flow. International Anesthesiology Clinics 48, 3 (2010), 101. https://doi.org/10.1097/AIA.0b013e3181eace73
Vaithilingam et al. (2022) Priyan Vaithilingam, Tianyi Zhang, and Elena L. Glassman. 2022. Expectation vs. Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models. In Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems (CHI EA ’22). Association for Computing Machinery, New York, NY, USA, 1–7. https://doi.org/10.1145/3491101.3519665
Vasconcelos et al. (2023) Helena Vasconcelos, Gagan Bansal, Adam Fourney, Q. Vera Liao, and Jennifer Wortman Vaughan. 2023. Generation Probabilities Are Not Enough: Exploring the Effectiveness of Uncertainty Highlighting in AI-Powered Code Completions. https://doi.org/10.48550/arXiv.2302.07248 arXiv:2302.07248 [cs].
Wallace Sinaiko (1972) H. Wallace Sinaiko. 1972. Human intervention and full automation in control systems. Applied Ergonomics 3, 1 (March 1972), 3–7. https://doi.org/10.1016/0003-6870(72)90003-8
Warm et al. (2008) Joel S. Warm, Raja Parasuraman, and Gerald Matthews. 2008. Vigilance requires hard mental work and is stressful. Human Factors 50, 3 (June 2008), 433–441. https://doi.org/10.1518/001872008X312152
Weisz et al. (2021) Justin D. Weisz, Michael Muller, Stephanie Houde, John Richards, Steven I. Ross, Fernando Martinez, Mayank Agarwal, and Kartik Talamadupula. 2021. Perfection Not Required? Human-AI Partnerships in Code Translation. In 26th International Conference on Intelligent User Interfaces (IUI ’21). Association for Computing Machinery, New York, NY, USA, 402–412. https://doi.org/10.1145/3397481.3450656
Weisz et al. (2022) Justin D. Weisz, Michael Muller, Steven I. Ross, Fernando Martinez, Stephanie Houde, Mayank Agarwal, Kartik Talamadupula, and John T. Richards. 2022. Better Together? An Evaluation of AI-Supported Code Translation. In 27th International Conference on Intelligent User Interfaces (IUI ’22). Association for Computing Machinery, New York, NY, USA, 369–391. https://doi.org/10.1145/3490099.3511157
Wickens et al. (2000) Christopher D. Wickens, Keith Gempler, and M. Ephimia Morphew. 2000. Workload and Reliability of Predictor Displays in Aircraft Traffic Avoidance. Transportation Human Factors 2, 2 (June 2000), 99–126. https://doi.org/10.1207/STHF0202_01 Publisher: Routledge.
Wiener and Curry (1980) Earl L. Wiener and Renwick E. Curry. 1980. Flight-deck automation: promises and problems. Ergonomics (Nov. 1980). https://doi.org/10.1080/00140138008924809 Publisher: Taylor & Francis Group.
Woodruff et al. (2023) Allison Woodruff, Renee Shelby, Patrick Gage Kelley, Steven Rousso-Schindler, Jamila Smith-Loud, and Lauren Wilcox. 2023. How Knowledge Workers Think Generative AI Will (Not) Transform Their Industries. http://arxiv.org/abs/2310.06778 arXiv:2310.06778 [cs] version: 1.
Wu et al. (2022a) Tongshuang Wu, Michael Terry, and Carrie Jun Cai. 2022a. AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI ’22). Association for Computing Machinery, New York, NY, USA, 1–22. https://doi.org/10.1145/3491102.3517582
Wu et al. (2022b) Yiqing Wu, Ruobing Xie, Yongchun Zhu, Fuzhen Zhuang, Xu Zhang, Leyu Lin, and Qing He. 2022b. Personalized Prompt for Sequential Recommendation. https://arxiv.org/abs/2205.09666v2
Xu et al. (2022) Frank F. Xu, Bogdan Vasilescu, and Graham Neubig. 2022. In-IDE Code Generation from Natural Language: Promise and Challenges. ACM Transactions on Software Engineering and Methodology 31, 2 (March 2022), 29:1–29:47. https://doi.org/10.1145/3487569
Yuan et al. (2022) Ann Yuan, Andy Coenen, Emily Reif, and Daphne Ippolito. 2022. Wordcraft: Story Writing With Large Language Models. In 27th International Conference on Intelligent User Interfaces. ACM, Helsinki Finland, 841–852. https://doi.org/10.1145/3490099.3511105
Zamfirescu-Pereira et al. (2023) J.D. Zamfirescu-Pereira, Richmond Y. Wong, Bjoern Hartmann, and Qian Yang. 2023. Why Johnny Can’t Prompt: How Non-AI Experts Try (and Fail) to Design LLM Prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. ACM, Hamburg Germany, 1–21. https://doi.org/10.1145/3544548.3581388