
Autonomous Robots (2023) 47:999–1012

https://doi.org/10.1007/s10514-023-10135-3

ProgPrompt: program generation for situated robot task planning using large language models

Ishika Singh1 · Valts Blukis2 · Arsalan Mousavian2 · Ankit Goyal2 · Danfei Xu2 · Jonathan Tremblay2 · Dieter Fox2,3 · Jesse Thomason1 · Animesh Garg2,4

Received: 1 May 2023 / Accepted: 3 August 2023 / Published online: 28 August 2023
© The Author(s) 2023

Abstract
Task planning can require defining myriad domain knowledge about the world in which a robot needs to act. To ameliorate that effort, large language models (LLMs) can be used to score potential next actions during task planning, and even generate action sequences directly, given an instruction in natural language with no additional domain information. However, such methods either require enumerating all possible next steps for scoring, or generate free-form text that may contain actions not possible on a given robot in its current context. We present a programmatic LLM prompt structure that enables plan generation functional across situated environments, robot capabilities, and tasks. Our key insight is to prompt the LLM with program-like specifications of the available actions and objects in an environment, as well as with example programs that can be executed. We make concrete recommendations about prompt structure and generation constraints through ablation experiments, demonstrate state-of-the-art success rates in VirtualHome household tasks, and deploy our method on a physical robot arm for tabletop tasks. Website and code: progprompt.github.io.

Keywords Robot task planning · LLM code generation · Planning domain generalization · Symbolic planning

Corresponding author: Ishika Singh, [email protected]

Valts Blukis, [email protected] · Arsalan Mousavian, [email protected] · Ankit Goyal, [email protected] · Danfei Xu, [email protected] · Jonathan Tremblay, [email protected] · Dieter Fox, [email protected] · Jesse Thomason, [email protected] · Animesh Garg, [email protected]

1 Computer Science, University of Southern California, Los Angeles, CA 90089, USA
2 Seattle Robotics Lab, NVIDIA, Seattle, WA 98105, USA
3 Computer Science and Engineering, University of Washington, Seattle, WA 98195, USA
4 School of Interactive Computing, Georgia Institute of Technology, Atlanta, GA 30308, USA

1 Introduction

Everyday household tasks require both commonsense understanding of the world and situated knowledge about the current environment. To create a task plan for "Make dinner," an agent needs common sense: object affordances, such as that the stove and microwave can be used for heating; logical sequences of actions, such as that an oven must be preheated before food is added; and task relevance of objects and actions, such as that heating and food are related to "dinner" in the first place. However, this reasoning is infeasible without state feedback. The agent needs to know what food is available in the current environment, such as whether the freezer contains fish or the fridge contains chicken.

Autoregressive large language models (LLMs) trained on large corpora to generate text sequences conditioned on input prompts have remarkable multi-task generalization. This ability has recently been leveraged to generate plausible action plans in the context of robotic task planning (Ahn et al., 2022; Huang et al., 2022a, b; Zeng et al., 2022) by either scoring next steps or generating new steps directly. In scoring mode, the LLM evaluates an enumeration of actions and their arguments from the space of what's possible. For instance, given a goal to "Make dinner" with the first action being "open the fridge", the LLM could score a list of possible actions: "pick up the chicken", "pick up the soda", "close the fridge", ..., "turn on the lightswitch." In text-generation mode, the LLM can produce the next few words, which then need to be mapped to actions and world objects available to the agent. For example, if the LLM produced "reach in and pick up the jar of pickles," that string would have to neatly map to an executable action like "pick up jar." A key component missing in LLM-based task planning is state feedback from the environment. The fridge in the house might not contain chicken, soda, or pickles, but a high-level instruction "Make dinner" doesn't give us that world state information. Our work introduces situated awareness in LLM-based robot task planning.

We introduce ProgPrompt, a prompting scheme that goes beyond conditioning LLMs in natural language. ProgPrompt utilizes programming language structures, leveraging the fact that LLMs are trained on vast web corpora that include many programming tutorials and code documentation (Fig. 1). ProgPrompt provides an LLM with a Pythonic program header with an import statement for available actions and their expected parameters, a list of environment objects, and function definitions like make_dinner whose bodies are sequences of actions operating on objects. We incorporate situated state feedback from the environment by asserting preconditions of our plan, such as being close to the fridge before attempting to open it, and responding to failed assertions with recovery actions. What's more, we show that including natural language comments in ProgPrompt programs to explain the goal of the upcoming action improves task success of generated plan programs.

Fig. 1 ProgPrompt leverages LLMs' strengths in both world knowledge and programming language understanding to generate situated task plans that can be directly executed

2 Background and related work

2.1 Task planning

For high-level planning, most works in robotics use search in a pre-defined domain (Fikes and Nilsson, 1971; Jiang et al., 2018; Garrett et al., 2020). Unconditional search can be hard to scale in environments with many feasible actions and objects (Puig et al., 2018; Shridhar et al., 2020) due to large branching factors. Heuristics are often used to guide the search (Baier et al., 2007; Hoffmann, 2001; Helmert, 2006; Bryce and Kambhampati, 2007). Recent works have explored learning-based task and motion planning, using methods such as representation learning, hierarchical learning, language as planning space, learning compositional skills, and more (Akakzia et al., 2021; Eysenbach et al., 2019; Jiang et al., 2019; Kurutach et al., 2018; Mirchandani et al., 2021; Nair and Finn, 2020; Shah et al., 2022; Sharma et al., 2022; Silver et al., 2022; Srinivas et al., 2018; Xu et al., 2018, 2019; Zhu et al., 2020). Our method sidesteps search to directly generate a plan that includes conditional reasoning and error-correction.

We formulate task planning as the tuple (O, P, A, T, I, G, t). O is the set of all objects available in the environment; P is a set of properties of the objects, which also informs object affordances; A is a set of executable actions that changes depending on the current environment state, defined as s ∈ S. A state s is a specific assignment of all object properties, and S is the set of all possible assignments. T represents the transition model T : S × A → S, and I and G are the initial and goal states. The agent does not have access to the goal state g ∈ G, but only to a high-level task description t.

Consider the task t = "microwave salmon". Task-relevant objects microwave, salmon ∈ O will have properties modified during action execution. For example, action a = open(microwave) will change the state from closed(microwave) ∈ s to ¬closed(microwave) ∈ s′ if a is admissible, i.e., ∃(a, s, s′) s.t. a ∈ A ∧ s, s′ ∈ S ∧ T(s, a) = s′. In this example a goal state g ∈ G could contain the conditions heated(salmon) ∈ g, ¬closed(microwave) ∈ g, and ¬switchedOn(microwave) ∈ g.
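To make the formulation concrete, the following minimal sketch encodes the microwave-salmon example as symbolic states and a transition check. This is our illustration of the definitions above, not the authors' implementation; all names and the toy precondition/effect tables are ours.

```python
# Minimal sketch of the tuple (O, P, A, T, I, G, t) for "microwave salmon".
# States are sets of grounded properties; entries here are illustrative.
PRECONDITIONS = {
    "open(microwave)": {"closed(microwave)", "agent_close_to(microwave)"},
}
EFFECTS = {
    "open(microwave)": (
        {"closed(microwave)"},      # properties removed from s
        {"not_closed(microwave)"},  # properties added to s'
    ),
}

def admissible(action, state):
    # a is admissible in s iff its preconditions hold in s
    return PRECONDITIONS[action] <= state

def transition(state, action):
    # T : S x A -> S; inadmissible actions leave the state unchanged
    if not admissible(action, state):
        return state
    removed, added = EFFECTS[action]
    return (state - removed) | added

initial = {"closed(microwave)", "agent_close_to(microwave)"}
goal = {"heated(salmon)", "not_closed(microwave)"}
next_state = transition(initial, "open(microwave)")
print(goal <= next_state)  # False: the salmon still needs to be heated
```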

2.2 Planning with LLMs

A Large Language Model (LLM) is a neural network with many parameters—currently hundreds of billions (Brown et al., 2020; Chen et al., 2021)—trained on unsupervised learning objectives such as next-token prediction or masked-language modelling. An autoregressive LLM is trained with a maximum likelihood loss to model the probability of a sequence of tokens y conditioned on an input sequence x, i.e. θ* = argmax_θ P(y | x; θ), where θ are model parameters. The trained LLM is then used for prediction ŷ = argmax_{y∈S} P(y | x; θ*), where S is the set of all text sequences. Since the search space S is huge, approximate decoding strategies are used for tractability (Holtzman et al., 2020; Luong et al., 2015; Wiseman et al., 2017).

LLMs are trained on large text corpora, and exhibit multi-task generalization when provided with a relevant prompt input x. Prompting LLMs to generate text useful for robot task planning is a nascent topic (Ahn et al., 2022; Jansen, 2020; Huang et al., 2022a, b; Li et al., 2022; Patel and Pavlick, 2022). Prompt design is challenging given the lack of paired natural language instruction text with executable plans or robot action sequences (Liu et al., 2021). Devising a prompt for task plan prediction can be broken down into a prompting function and an answer search strategy (Liu et al., 2021). A prompting function, f_prompt(·), transforms the input state observation s into a textual prompt. Answer search is the generation step, in which the LLM outputs from the entire LLM vocabulary or scores a predefined set of options.

Closest to our work, Huang et al. (2022a) generates open-domain plans using LLMs. In that work, planning proceeds by: 1) selecting a similar task in the prompt example (f_prompt); 2) open-ended task plan generation (answer search); and 3) 1:1 prediction-to-action matching. The entire plan is generated open-loop without any environment interaction, and later tested for executability of matched actions. However, action matching based on generated text doesn't ensure the action is admissible in the current situation. Inner Monologue (Huang et al., 2022b) introduces environment feedback and state monitoring, but still found that LLM planners proposed actions involving objects not present in the scene. Our work shows that a programming-language-inspired prompt generator can inform the LLM of both situated environment state and available robot actions, ensuring output compatibility with robot actions.

The related SayCan (Ahn et al., 2022) uses natural language prompting with LLMs to generate a set of feasible planning steps, re-scoring matched admissible actions using a learned value function. SayCan constructs a set of all admissible actions expressed in natural language and scores them using an LLM. This is challenging to do in environments with combinatorial action spaces. Concurrent with our work are Socratic models (Zeng et al., 2022), which also use code-completion to generate robot plans. We go beyond Zeng et al. (2022) by leveraging additional, familiar features of programming languages in our prompts. We define an f_prompt that includes import statements to model robot capabilities, natural language comments to elicit common sense reasoning, and assertions to track execution state. Our answer search is performed by allowing the LLM to generate an entire, executable plan program directly.

2.3 Recent developments following ProgPrompt

Vemprala et al. (2023) further explores API-based planning with ChatGPT¹ in domains such as aerial robotics, manipulation, and visual navigation. They discuss the design principles for constructing interaction APIs, for action and perception, and prompts that can be used to generate code for robotic applications. Huang et al. (2023) builds on SayCan (Ahn et al., 2022) and generates planning steps token-by-token while scoring the tokens using both the LLM and the grounded pretrained value function. Cao and Lee (2023) explores generating behavior trees to study hierarchical task planning using LLMs. Skreta et al. (2023) proposes iterative error correction via a syntax verifier that repeatedly prompts the LLM with the previous query appended with a list of errors. Mai et al. (2023), similar in approach to Zeng et al. (2022) and Huang et al. (2022b), integrates pretrained models for perception, planning, control, memory, and dialogue zero-shot, for active exploration and embodied question answering tasks. Gupta and Kembhavi (2022) extends the LLM code generation and API-based perceptual interaction approach to a variety of vision-language tasks. Some recent works (Xie et al., 2023a; Capitanelli and Mastrogiovanni, 2023) use PDDL as the translation language instead of code, and use the LLM to generate either a PDDL plan or the goal. A classical planner then plans for the PDDL goal or executes the generated plan. This approach removes the need to generate preconditions using the LLM; however, it needs the domain rules to be specified for the planner.

¹ https://openai.com/blog/chatgpt/

3 Our method: ProgPrompt

We represent robot plans as Pythonic programs. Following the paradigm of LLM prompting, we create a prompt structured as Pythonic code and use an LLM to complete the code (Fig. 2). We use features available in Python to construct prompts that elicit an LLM to generate situated robot task plans, conditioned on a natural language instruction.

3.1 Representing robot plans as pythonic functions

Plan functions consist of API calls to action primitives, comments to summarize actions, and assertions for tracking execution (Fig. 3). Primitive actions use objects as arguments. For example, the "put salmon in the microwave" task includes API calls like find(salmon).

Fig. 2 Our ProgPrompts include import statement, object list, and example tasks (PROMPT for Planning). The Generated Plan is for microwave salmon. We highlight prompt comments, actions as imported function calls with objects as arguments, and assertions with recovery steps. PROMPT for State Feedback represents example assertion checks. We further illustrate the execution of the program via a scenario where an assertion succeeds or fails, and how the generated plan corrects the error before executing the next step. Full Execution is shown in bottom-right. '...' used for brevity

We utilize comments in the code to provide natural language summaries for subsequent sequences of actions. Comments help break down the high-level task into logical sub-tasks. For example, in Fig. 3, the "put salmon in microwave" task is broken down into sub-tasks using the comments "# grab salmon" and "# put salmon in microwave". This partitioning could help the LLM to express its knowledge about tasks and sub-tasks in natural language and aid planning. Comments also inform the LLM about immediate goals, reducing the possibility of incoherent, divergent, or repetitive outputs. Prior work (Wei et al., 2022) has also shown the efficacy of similar intermediate summaries, called 'chain of thought', for improving performance of LLMs on a range of arithmetic, commonsense, and symbolic reasoning tasks. We empirically verify the utility of comments (Table 1; column Comments).

Assertions provide an environment feedback mechanism that encourages preconditions to be met, and allow error recovery when they are not. For example, in Fig. 3, before the grab(salmon) action, the plan asserts that the agent is close to salmon. If not, the agent first executes find(salmon). In Table 1, we show that such assert statements (column Feedback) benefit plan generation and improve success rates.

Fig. 3 Pythonic ProgPrompt plan for "put salmon in the microwave"
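For reference, a plan in the spirit of Fig. 3 could read as follows. This is our plain-Python paraphrase of the figure, not a verbatim reproduction: ProgPrompt's generated programs express the precondition check with an assert-with-recovery construct handled by its interpreter, which we approximate here with an if-statement. The primitive names match the VH action space described in Sect. 4.1 and are assumed to be provided by the robot's action API.

```python
# Our paraphrase of a Fig. 3-style plan; find/grab/open/putin/close/switchon
# are assumed action primitives imported from the robot's API, and close_to
# is an assumed state query used for the precondition check.
def put_salmon_in_microwave():
    # grab salmon
    if not close_to("salmon"):   # recovery if the asserted precondition fails
        find("salmon")
    grab("salmon")
    # put salmon in microwave
    find("microwave")
    open("microwave")
    putin("salmon", "microwave")
    close("microwave")
    switchon("microwave")
```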

3.2 Constructing programming language prompts

We provide information about the environment and primitive actions to the LLM through prompt construction. As done in few-shot LLM prompting, we also provide the LLM with examples of sample tasks and plans. Figure 2 illustrates our prompt function f_prompt, which takes in all the information (observations, action primitives, examples) and produces a Pythonic prompt for the LLM to complete. The LLM then predicts the <next_task>(.) as an executable function (microwave_salmon in Fig. 2).

In the task microwave_salmon, a reasonable first step that an LLM could generate is take_out(salmon, grocery bag). However, the agent responsible for executing the plan might not have a primitive action to take_out. To inform the LLM about the agent's action primitives, we provide them as Pythonic import statements. These encourage the LLM to restrict its output to only functions that are available in the current context. To change agents, ProgPrompt just needs a new list of imported functions representing agent actions. A grocery bag object might also not exist in the environment. We provide the available objects in the environment as a list of strings. Since our prompting scheme explicitly lists out the set of functions and objects available to the model, the generated plans typically contain actions an agent can take and objects available in the environment.

ProgPrompt also includes a few example tasks—fully executable program plans. Each example task demonstrates how to complete a given task using available actions and objects in the given environment. These examples demonstrate the relationship between the task name, given as the function handle, and the actions to take, as well as the restrictions on actions and objects to involve.
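Putting the pieces together, a prompt header of this shape could be assembled as in the sketch below. The helper names and example strings are ours (the paper's exact prompt text is shown in Fig. 2); treat this as an illustration of the structure, not the authors' code.

```python
# Sketch of assembling a ProgPrompt-style prompt: imported action primitives,
# an object list, example plans, and an incomplete function header that the
# LLM is asked to complete. All identifiers here are illustrative.
ACTION_PRIMITIVES = ["walk", "find", "grab", "open", "close", "putin", "switchon"]
OBJECTS = ["salmon", "microwave", "fridge", "plate", "kitchen_cabinet"]
EXAMPLE_PLANS = [
    "def throw_away_lime():\n"
    "    # find and grab the lime\n"
    "    find('lime')\n"
    "    grab('lime')\n"
    "    # put it in the garbage can\n"
    "    find('garbagecan')\n"
    "    open('garbagecan')\n"
    "    putin('lime', 'garbagecan')\n"
]

def build_prompt(task: str) -> str:
    header = "from actions import " + ", ".join(ACTION_PRIMITIVES)
    obj_list = "objects = " + repr(OBJECTS)
    query = f"def {task.lower().replace(' ', '_')}():"  # left for the LLM to complete
    return "\n\n".join([header, obj_list, *EXAMPLE_PLANS, query])

print(build_prompt("microwave salmon"))
```

Swapping the imported function list or the object list retargets the same prompt skeleton to a different agent or scene, which is the generalization mechanism the section describes.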
3.3 Task plan generation and execution

The given task is fully inferred by the LLM based on the ProgPrompt prompt. Generated plans are executed on a virtual agent or a physical robot system using an interpreter that executes each action command against the environment. Assertion checking is done in a closed-loop manner during execution, providing current environment state feedback.
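One way such an interpreter could be organized is sketched below; the step encoding and the env interface are our assumptions, not the paper's implementation.

```python
# Sketch of closed-loop plan execution. Each step is either
# ("action", name, args) or ("assert", condition, recovery_step);
# env.check and env.execute are assumed environment hooks.
def run_plan(plan, env):
    for step in plan:
        if step[0] == "assert":
            _, condition, recovery = step
            if not env.check(condition):   # query current environment state
                _, name, args = recovery
                env.execute(name, *args)   # recovery action, e.g. find('salmon')
        else:
            _, name, args = step
            env.execute(name, *args)       # primitive API call against the env
```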
4 Experiments

We evaluate our method with experiments in a virtual household environment and on a physical robot manipulator.

4.1 Simulation experiments

We evaluate our method in the Virtual Home (VH) Environment (Puig et al., 2018), a deterministic simulation platform for typical household activities. A VH state s is a set of objects O and properties P. P encodes information like in(salmon, microwave) and agent_close_to(salmon). The action space is A = {grab, putin, putback, walk, find, open, close, switchon, switchoff, sit, standup}.

We experiment with 3 VH environments. Each environment contains 115 unique object instances (Fig. 2), including class-level duplicates. Each object has properties corresponding to its action affordances. Some objects also have a semantic state like heated, washed, or used. For example, an object in the Food category can become heated whenever in(object, microwave) ∧ switched_on(microwave).

We create a dataset of 70 household tasks. Tasks are posed with high-level instructions like "microwave salmon". We collect a ground-truth sequence of actions that completes the task from an initial state, and record the final state g that defines a set of symbolic goal conditions, g ∈ P.

When executing generated programs, we incorporate environment state feedback in response to assertions. VH provides observations in the form of a state graph with object properties and relations. To check assertions in this environment, we extract information about the relevant object from the state graph and prompt the LLM to return whether the assertion holds or not, given the state graph and assertion as a text prompt (Fig. 2, PROMPT for State Feedback). We choose this design over rule-based checking since it is more general.
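A hedged sketch of such a check is below; the fact serialization, the prompt wording, and the llm_complete hook are our assumptions rather than the paper's exact prompt (which is shown in Fig. 2).

```python
# Sketch of LLM-based assertion checking: serialize the state-graph facts
# that mention the relevant object, then ask the LLM a True/False question.
def check_assertion(state_graph, obj, assertion, llm_complete):
    # state_graph: iterable of (subject, relation, target) triples from VH
    facts = [f"{rel}({a}, {b})" for a, rel, b in state_graph if obj in (a, b)]
    prompt = (
        "State: " + ", ".join(facts) + "\n"
        f"Assertion: {assertion}\n"
        "Does the assertion hold in this state? Answer True or False:"
    )
    return llm_complete(prompt).strip().lower().startswith("true")
```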
4.2 Real-robot experiments

We use a Franka-Emika Panda robot with a parallel-jaw gripper. We assume access to a pick-and-place policy. The policy takes as input two pointclouds of a target object and a target container, and performs a pick-and-place operation to place the object on or inside the container. We use the system of Danielczuk et al. (2021) to implement the policy, and use MPPI for motion generation, SceneCollisionNet (Danielczuk et al., 2021) to avoid collisions, and Contact-GraspNet (Sundermeyer et al., 2021) to generate grasp poses.

We specify a single import statement for the action grab_and_putin(obj1, obj2) for ProgPrompt. We use ViLD (Gu et al., 2022), an open-vocabulary object detection model, to identify and segment objects in the scene and construct the available object list for the prompt. Unlike in the virtual environment, where the object list was a global variable common to all tasks, here the object list is a local variable for each plan function, which allows greater flexibility to adapt to new objects. The LLM outputs a plan containing function calls of the form grab_and_putin(obj1, obj2). Here, objects obj1 and obj2 are text strings that we map to pointclouds using ViLD segmentation masks and the depth image. Due to real-world uncertainty, we do not implement assertion-based closed-loop options in the tabletop plans.
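A generated tabletop plan in this setting might therefore look like the sketch below; this is our illustration, consistent with the sorting rollout shown later in Fig. 4, not a verbatim model output.

```python
# Illustrative generated plan for the Fig. 4 sorting task; grab_and_putin is
# the single imported primitive, and the object list comes from ViLD output.
def sort_fruits_on_the_plate_and_bottles_in_the_box():
    objects = ["banana", "strawberry", "bottle", "plate", "box", "drill"]
    # put the fruits on the plate
    grab_and_putin("banana", "plate")
    grab_and_putin("strawberry", "plate")
    # put the bottles in the box
    grab_and_putin("bottle", "box")
```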
4.3 Evaluation metrics

We use three metrics to evaluate system performance: success rate (SR), executability (Exec), and goal conditions recall (GCR). The task-relevant goal conditions are the set of goal conditions that changed between the initial and final state in the demonstration. SR is the fraction of executions that achieved all task-relevant goal conditions. Exec is the fraction of actions in the plan that are executable in the environment, even if they are not relevant for the task. GCR is measured using the set difference between the ground-truth final state conditions g and the final state ĝ achieved with the generated plan, divided by the number of task-specific goal conditions; SR = 1 only if GCR = 1.
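Expressed over sets of goal conditions, the three metrics can be computed as in the following sketch (our formulation of the definitions above; variable names are ours):

```python
# Sketch of the evaluation metrics. g_true: task-relevant ground-truth goal
# conditions (a set); g_hat: conditions achieved after executing the plan;
# n_executable / n_actions: action-level executability counts.
def gcr(g_true, g_hat):
    # fraction of task-specific goal conditions achieved
    return 1.0 - len(g_true - g_hat) / len(g_true)

def exec_rate(n_executable, n_actions):
    # fraction of plan actions executable in the environment
    return n_executable / n_actions

def success_rate(runs):
    # SR = 1 for a run only if GCR = 1, i.e., all goal conditions hold
    return sum(gcr(r["g_true"], r["g_hat"]) == 1.0 for r in runs) / len(runs)
```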


Table 1 Evaluation of generated programs on Virtual Home

#   Format       Comments  Feedback  LLM Backbone  SR           Exec         GCR
*   ProgPrompt   ✓         ✓         GPT4          0.37 ± 0.06  0.87 ± 0.01  0.64 ± 0.02
*   ProgPrompt   ✓         ✓         Davinci-003   0.47 ± 0.15  0.85 ± 0.02  0.74 ± 0.07
1   ProgPrompt   ✓         ✓         Codex         0.40 ± 0.11  0.90 ± 0.05  0.72 ± 0.09
2   ProgPrompt   ✓         ✓         Davinci       0.22 ± 0.04  0.60 ± 0.04  0.46 ± 0.04
3   ProgPrompt   ✓         ✓         GPT3          0.34 ± 0.08  0.84 ± 0.01  0.65 ± 0.05
4   ProgPrompt   ✓         ✗         GPT3          0.28 ± 0.04  0.82 ± 0.01  0.56 ± 0.02
5   ProgPrompt   ✗         ✓         GPT3          0.30 ± 0.00  0.65 ± 0.01  0.58 ± 0.02
6   ProgPrompt   ✗         ✗         GPT3          0.18 ± 0.04  0.68 ± 0.01  0.42 ± 0.02
7   LangPrompt   –         –         GPT3          0.00 ± 0.00  0.36 ± 0.00  0.42 ± 0.02
8   Baseline from Huang et al. (2022a)  GPT3      0.00 ± 0.00  0.45 ± 0.03  0.21 ± 0.03

ProgPrompt uses 3 fixed example programs, except the Davinci backbone, which can fit only 2 in the available API. Huang et al. (2022a) use 1 dynamically selected example, as described in their paper. LangPrompt uses 3 natural language text examples. The best performing model with a GPT3 backbone is shown in italic (used for our ablation studies); the best performing model overall is shown in bold. ProgPrompt significantly outperforms the baseline of Huang et al. (2022a) and LangPrompt. We also showcase how each ProgPrompt feature adds to the performance of the method.

Table 2 ProgPrompt performance on the VH test-time tasks and their ground truth action sequence lengths |A|

Task description                                   |A|  SR           Exec         GCR
watch tv                                            3   0.20 ± 0.40  0.42 ± 0.13  0.63 ± 0.28
turn off light                                      3   0.40 ± 0.49  1.00 ± 0.00  0.65 ± 0.30
brush teeth                                         8   0.80 ± 0.40  0.74 ± 0.09  0.87 ± 0.26
throw away apple                                    8   1.00 ± 0.00  1.00 ± 0.00  1.00 ± 0.00
make toast                                          8   0.00 ± 0.00  1.00 ± 0.00  0.54 ± 0.33
eat chips on the sofa                               5   0.00 ± 0.00  0.40 ± 0.00  0.53 ± 0.09
put salmon in the fridge                            8   1.00 ± 0.00  1.00 ± 0.00  1.00 ± 0.00
wash the plate                                     18   0.00 ± 0.00  0.97 ± 0.04  0.48 ± 0.11
bring coffeepot and cupcake to the coffee table     8   0.00 ± 0.00  1.00 ± 0.00  0.52 ± 0.14
microwave salmon                                   11   0.00 ± 0.00  0.76 ± 0.13  0.24 ± 0.09
Avg: 0 ≤ |A| ≤ 5                                        0.20 ± 0.40  0.61 ± 0.29  0.60 ± 0.25
Avg: 6 ≤ |A| ≤ 10                                       0.60 ± 0.50  0.95 ± 0.11  0.79 ± 0.29
Avg: 11 ≤ |A| ≤ 18                                      0.00 ± 0.00  0.87 ± 0.14  0.36 ± 0.16

5 Results

We show that ProgPrompt is an effective method for prompting LLMs to generate task plans for both virtual and physical agents.

5.1 Virtual experiment results

Table 1 summarizes the performance of our task plan generation and execution system in the seen environment of VirtualHome. We utilize GPT3 as a language model backbone to receive ProgPrompt prompts and generate plans.

Each result is averaged over 5 runs in a single VH environment across 10 tasks. The variability in performance across runs arises from sampling LLM output. We include 3 Pythonic task plan examples per prompt after evaluating performance on VH with between 1 and 7 examples and finding that 2 or more result in roughly equal performance for GPT3. The plan examples are fixed to be: "put the wine glass in the kitchen cabinet", "throw away the lime", and "wash mug".

We also include results on the recent GPT4 backbone. Unlike the GPT3 language model, GPT4 is a chat-bot model trained with reinforcement learning from human feedback (RLHF) to act as a helpful digital assistant (OpenAI, 2023). GPT4 takes as input a system prompt followed by one or more user prompts. Instead of simply auto-completing the code in the prompt, GPT4 interprets user prompts as questions and generates answers as an assistant. To make GPT4 auto-complete our prompt, we used the following system prompt: "You are a helpful assistant." The user prompt is the same f_prompt as shown in Fig. 2.
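For illustration, such a query could be issued roughly as below, using the openai-python chat interface as it existed around the time of the paper (the API surface has since changed, so treat this strictly as a sketch; build_prompt reuses the hypothetical helper from the Sect. 3.2 example).

```python
# Sketch of eliciting code completion from a chat model; uses the pre-1.0
# openai-python ChatCompletion interface that was current at the time.
import openai

prompt = build_prompt("microwave salmon")  # the Pythonic f_prompt text (Fig. 2)

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ],
    temperature=0.0,
)
plan_code = response["choices"][0]["message"]["content"]
```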


Table 3 ProgPrompt results on Virtual Home in additional scenes. We evaluate on 10 tasks each in two additional VH scenes beyond scene Env-0, where other reported results take place

VH scene   SR           Exec         GCR
Env-0      0.34 ± 0.08  0.84 ± 0.01  0.65 ± 0.05
Env-1      0.56 ± 0.08  0.85 ± 0.02  0.81 ± 0.07
Env-2      0.56 ± 0.05  0.85 ± 0.03  0.72 ± 0.09
Average    0.48 ± 0.13  0.85 ± 0.02  0.73 ± 0.10
We can draw several conclusions from Table 1. First, ProgPrompt (rows 3–6) outperforms prior work (Huang et al., 2022a) (row 8) by a substantial margin on all metrics using the same large language model backbone. Second, we observe that the Codex (Chen et al., 2021) and Davinci models (Brown et al., 2020)—themselves GPT3 variants—show mixed success at the task. In particular, Davinci, the original GPT3 version, does not match base GPT3 performance (row 2 versus row 3), possibly because its prompt length constraints limit it to 2 task examples versus the 3 available to other rows. Additionally, Codex exceeds GPT3 performance on every metric (row 1 versus row 3), likely because Codex is explicitly trained on programming language data. However, Codex has limited access in terms of number of queries per minute, so we continue to use GPT3 as our main LLM backbone in the following ablation experiments. Our recommendation to the community is to utilize a program-like prompt for LLM-based task planning and execution, for which base GPT3 works well, and we note that an LLM fine-tuned further on programming language data, such as Codex, can do even better. We additionally report results on Davinci-003 and GPT4 (rows *), respectively the latest GPT3 variant and the latest GPT variant in the series at the time of this submission. Davinci-003 has a better SR and GCR, indicating it might have an improved commonsense understanding, but lower Exec compared to Codex. The newest model, GPT-4, does not seem to be better than the latest GPT3 variant on our tasks. Most of our results use the Davinci-002 variant (which we refer to as GPT3 in this paper), the latest model available when this study was conducted.

We explore several ablations of ProgPrompt. First, we find that Feedback mechanisms in the example programs, namely the assertions and recovery actions, improve performance (rows 3 versus 4 and 5 versus 6) across metrics, the sole exception being that Exec improves a bit without Feedback when there are no Comments in the prompt example code. Second, we observe that removing Comments from the prompt code substantially reduces performance on all metrics (rows 3 versus 5 and 4 versus 6), highlighting the usefulness of the natural language guidance within the programming language structure.

We also evaluate LangPrompt, an alternative to ProgPrompt that builds prompts from natural language text descriptions of available objects and example task plans (row 7). LangPrompt is similar to the prompts built by Huang et al. (2022a). The outputs of LangPrompt are generated action sequences, rather than our proposed, program-like structures. Thus, we finetune GPT2 to learn a policy P(a_t | s_t, GPT3 step, a_{1:t−1}) to map those generated sequences to executable actions in the simulation environment. We use the 35 tasks in the training set, and annotate the text steps and the corresponding action sequences to get 400 data points for training and validation of this policy. We find that while this method achieves reasonable partial success through GCR, it does not match Huang et al. (2022a) for program executability Exec and does not generate any fully successful task executions.

Task-by-Task Performance ProgPrompt performance for each task in the test set is shown in Table 2. We observe that tasks that are similar to prompt examples, such as throw away apple versus wash the plate, have higher GCR, since the ground truth prompt examples hint about good stopping points. Even with high Exec, some task GCRs are low because some tasks have multiple appropriate goal states, but we only evaluate against a single "true" goal. For example, after microwaving and plating salmon, the agent may put the salmon on a table or a countertop.

Other Environments We evaluate ProgPrompt in two additional VH environments (Table 3). For each, we append a new object list representing the new environment after the example tasks in the prompt, followed by the task to be completed in the new scene. The action primitives and other ProgPrompt settings remain unchanged. We evaluate on 10 tasks with 5 runs each. For new tasks like wash the cutlery in dishwasher, ProgPrompt is able to infer that cutlery refers to spoons and forks in the new scenes, despite the fact that cutlery always refers to knives in the example prompts.

5.2 Qualitative analysis and limitations

We manually inspect generated programs and their execution traces from ProgPrompt and characterize common failure modes. Many failures stem from the decision to make ProgPrompt agnostic to the deployed environment and its peculiarities, which may be resolved by explicitly communicating, for example, object affordances of the target environment as part of the ProgPrompt prompt.


• Environment artifacts: the VH agent cannot find or interact with nearby objects when sitting, and some common sense actions for objects, such as opening the tvstand's cabinets, are not available in VH.
• Environment complexities: when an object is not accessible, the generated assertions might not be enough. For example, if the agent finds an object in a cabinet, it may not plan to open the cabinet to grab the object.
• Action success feedback is not provided to the agent, which may lead to failure of subsequent actions. Assertion recovery modules in the plan can help, but aren't generated to cover all possibilities.
• Incomplete generation: some plans are cut short by LLM API caps. One possibility is to query the LLM again with the prompt and the partially generated plan.

In addition to these failure modes, our strict final state checking means that if the agent completes the task and then some, we may infer failure, because the environment goal state will not match our precomputed ground truth final goal state. For example, after making coffee, the agent may take the coffeepot to another table. Similarly, some task descriptions are ambiguous and have multiple plausible correct programs. For example, "make dinner" can have multiple possible solutions. ProgPrompt generates plans that cook salmon using the fryingpan and stove; sometimes the agent adds bellpepper or lime, or serves the salmon with a side of fruit, or on a plate with cutlery. When run in a different VH environment, the agent cooks chicken instead. ProgPrompt is able to generate plans for such complex tasks while using objects available in the scene that are not explicitly mentioned in the task. However, automated evaluation of such tasks requires enumerating all valid and invalid possibilities or introducing human verification.

Furthermore, we note that while the reasoning capabilities of current LLMs are impressive, our proposed method does not make any claims of providing guarantees. However, the evaluations reported in Table 1 offer insights into the capabilities of different LLMs within our task settings. While our method effectively prevents the LLM from generating unavailable actions or objects, it is worth acknowledging that, depending on the LLM's generation quality and reasoning capabilities, there is still a possibility of hallucination.

5.3 Physical robot results

Fig. 4 Robot plan execution rollout example on the sorting task showing relevant objects banana, strawberry, bottle, plate and box, and a distractor object drill. The LLM recognizes that banana and strawberry are fruits, and generates plan steps to place them on the plate, while placing the bottle in the box. The LLM ignores the distractor object drill. See Fig. 1 for the prompt structure used

Table 4 Results on the physical robot by task type

Task description                   Distractors  SR  Plan SR  GCR
put the banana in the bowl         0            1   1        1/1
                                   4            1   1        1/1
put the pear on the plate          0            1   1        1/1
                                   4            1   1        1/1
put the banana on the plate        0            1   1        2/2
and the pear in the bowl           3            1   1        2/2
sort the fruits on the plate       0            0   1        2/3
and the bottles in the box         1            1   1        3/3
                                   2            0   0        2/3

The physical robot results are shown in Table 4. We evaluate on 4 tasks of increasing difficulty, listed in Table 4. For each task we perform two experiments: one in a scene that contains only the necessary objects, and one with one to four distractor objects added.

All results shown use ProgPrompt with comments, but not feedback. Our physical robot setup did not allow reliably tracking system state and checking assertions, and is prone to random failures due to things like grasps slipping. The real world introduces randomness that complicates a quantitative comparison between systems. Therefore, we intend the physical results to serve as a qualitative demonstration of the ease with which our prompting approach allows constraining and grounding LLM-generated plans to a physical robot system. We report an additional metric, Plan SR, which refers to whether the plan would have likely succeeded, provided successful pick-and-place execution without gripper failures.

Across tasks, with and without distractor objects, the system almost always succeeds, failing only on the sort task. The run without distractors failed due to a random gripper failure. The run with 2 distractors failed because the model mistakenly considered a soup can to be a bottle.

The executability of the generated plans was always Exec = 1. An execution rollout example is illustrated in Fig. 4.

After this study was conducted, we re-attempted plan generation for the failed plan with GPT-4, using the same system prompt as in Sect. 5.1. GPT-4 was able to successfully predict the correct plan and not confuse the soup can for a bottle.

6 Conclusions and future work

We present an LLM prompting scheme for robot task planning that brings together the two strengths of LLMs: commonsense reasoning and code understanding. We construct prompts that include situated understanding of the world and robot capabilities, enabling LLMs to directly generate executable plans as programs. Our experiments show that ProgPrompt programming language features improve task performance across a range of metrics. Our method is intuitive and flexible, and generalizes widely to new scenes, agents and tasks, including a real-robot deployment.

As a community, we are only scratching the surface of task planning as robot plan generation and completion. We hope to study broader use of programming language features, including real-valued numbers to represent measurements, nested dictionaries to represent scene graphs, and more complex control flow. Several works from the NLP community show that LLMs can do arithmetic and understand numbers, yet their capabilities for complex robot behavior generation are still relatively under-explored.

7 FAQs and discussion

Question 1 How does this approach compare with end-to-end robot learning models, and what are the current limitations?

ProgPrompt is a hierarchical solution to task planning in which the abstract task description leverages the LLM's reasoning and the task plan is mapped to grounded environment labels. In end-to-end approaches, by contrast, the model generally learns reasoning, planning, and grounding implicitly, while mapping the abstract task description to the action space directly.

Pros:

• LLMs can do long-horizon planning from an abstract task description.
• Decoupling the LLM planner from the environment makes generalization to new tasks and environments feasible.
• ProgPrompt enables LLMs to intelligently combine the robot's capabilities with the environment and their own reasoning ability to generate an executable and valid task plan.
• The precondition checking helps recover from some failure modes that can happen if actions are generated in the wrong order or are missed by the base plan.

Cons:

• Requires action space discretization and formalization of environments and objects.
• Plan generation is open-loop, with commonsense precondition-checking-based environment interaction.
• Plan generation doesn't consider low-level continuous aspects of the environment state, and only reasons with the semantic state for planning as well as precondition checking.
• The amount of information exchange between language models and other modules, such as the robot's perceptual or proprioceptive state encoders, is limited, since API-based access to these recent LLMs only allows textual queries. However, this is still promising, as it indicates the need for a multimodal encoder that can work with input such as vision, touch, force, temperature, etc.

Question 2 How does it compare with the concurrent work Code-as-Policies (CaP) (Liang et al., 2023)?

• We believe that the general approach is quite similar to ours. CaP defines Hints and Examples, which may correspond to Imports/Object lists and Task Plan examples in ProgPrompt.
• CaP uses actions as API calls with certain parameters for the calls, such as robot arm pose, velocity, etc. We use actions as API calls with objects as parameters.
• CaP uses APIs to obtain environment information as well, like object pose or segmentation, for the purpose of plan generation. However, ProgPrompt extracts environment information via precondition checking on the current environment state, to ensure plan executability. ProgPrompt also generates the prompt conditioned on information from perception models.

Question 3 During "PROMPT for State Feedback", it seems that the prompt already includes all information about the environment state. Is it necessary to prompt the LLM again for the assertion (compared to a simple rule-based algorithm)?

• The environment state input to the model is not the full state, for brevity. Thus, checking preconditions against the full state separately helps, as shown in Table 1.
• The environment state could change during execution.


• Using an LLM as opposed to a rule-based algorithm is a design choice made to keep the approach more general, instead of using a hand-coded rule-based algorithm. The assertion checking may also be replaced with a visual-state-conditioned module when a semantic state is not available, such as in the real-world scenario. However, we leave these aspects to be addressed in future research.

Question 4 Is it possible that the generated code might lead the robot to be stuck in an infinite loop?

LLM code generation could lead to loops by predicting the same actions repeatedly as a generation artifact. LLMs used to suffer from such degeneration, but with the latest LLMs (i.e. GPT-3) we have not encountered it at all.

Question 5 Why are real-robot experiments simpler than virtual experiments?

The real-robot experiments were done as a demonstration of the approach on a real robot, while studying the method in depth in a virtual simulator, for the sake of simplicity and efficiency.

Question 6 What's the difference between the various GPT3 model versions used in this project?

We name GPT3 the latest available version of the GPT3 model on OpenAI at the time the paper was written: text-davinci-002. We name davinci the original version of GPT3 released: text-davinci.²

² More info on GPT3 model variations and naming can be found here: https://platform.openai.com/docs/models/overview

Question 7 Why can't a planning language like PDDL (or other planning languages) be used to construct ProgPrompt? Any advantages of using a Pythonic structure?

• GPT-3 has been trained on data from the internet. There is a lot of Python code on the internet, while PDDL is a language of much more narrow interest. Thus, we expect the LLM to better understand Python syntax.
• Python is a general purpose language, so it has more features than PDDL. Furthermore, we want to avoid specifying the full planning domain, instead relying on the knowledge learned by the LLM to make common-sense inferences. A recent work, Xie et al. (2023b), uses LLMs to generate PDDL goals; however, it requires full domain specification for a given environment.
• Python is an accessible language that a larger community is familiar with.

Question 8 How to handle multiple instances of the same object type in the scene?

ProgPrompt doesn't tackle the issue; however, Xie et al. (2023b) shows that multiple instances of the same objects can be handled by using labels with object IDs such as "book_1, book_2".

Question 9 Why doesn't the paper compare the performance of the proposed method to InnerMonologue, SayCan, or Socratic models?

At the time of writing, the datasets or models from the above papers were not public. However, we do compare with a proxy approach, similar in underlying idea to the above approaches, in the VirtualHome environment. LangPlan, in our baselines, uses GPT3 to get textual plan steps, which are then executed using a GPT-2-based trained policy.

Question 10 So the next step in this direction of research is to create highly structured inputs and outputs that could be compiled, since eventually we want something that compiles on robotic machines?

The disconnect and information bottleneck between the LLM planning module and the skill execution module make it less concrete "how much" and "what" information should be passed through the LLM during planning. That said, we think this would be an interesting direction to pursue, testing the limits of the LLM's highly structured input understanding and generation.

Question 11 How does it compare to a classical planner?

• Classical planners require concrete goal condition specification. An LLM planner reasons out a feasible goal state from a high-level task description, such as "microwave salmon". From a user's perspective, it is desirable to not have to specify a concrete semantic goal state of the environment and just be able to give an instruction to act on.
• The search space would also be huge without the common sense priors that an LLM planner leverages, as opposed to a classical planner. Moreover, we also bypass the need to specify the domain knowledge needed to roll out the search.
• Moreover, the domain specification and search space will grow non-linearly with the complexity of the environment.

Question 12 Is it possible to decouple high-level language planning from low-level perceptual planning?

It may be feasible to an extent; however, we believe that a clean decoupling might not be "all we need". For instance, imagine an agent being stuck at an action that needs to be resolved at the semantic level of reasoning, which is probably very hard for the visual module to figure out. For instance, while placing a dish on an oven tray, the robot may need to pull the dish rack out of the oven to be successful in the task.


Question 13 What are the kinds of failures that can happen with a ProgPrompt-like 2-stage decoupled pipeline?

A few broad failure categories could be:

• Generation of a semantically wrong action.
• The robot might fail to execute the action at the perception/action/skill level.
• The robot needs to recover from a failure by taking a different high-level action, i.e., a precondition needs to be satisfied. The challenge is to identify that precondition from the current state of the environment and the agent.

Question 14 What are the assumptions made about the actions used for ProgPrompt?

We assume a set of available action APIs that are implemented on the robot, without assuming the implementation method (e.g. motion planning or reinforcement learning). ProgPrompt abstracts over and complements other research on developing flexible robot skills. This assumption is similar to those made in classical TAMP planners, where the planning space is restricted by the available robot skills.

Question 15 Can the ProgPrompt planner handle more expressive situations where "the embodied agent has to grasp an object in a specific way in order to complete an aspect of the task"?

This is possible, provided the deployed robot is capable of handling the requested action. For example, one can specify 'how' along with 'what' parameters for an action as function arguments, which may be discrete semantic grounded labels affecting the low-level skill execution, e.g. to select between different modes of grasping intended for different task purposes. However, it is an open question as to what the right level of abstraction is between high-level task specification and continuous control space actions, and the answer might depend on the application domain.

Author Contributions IS lead the research, conducted experiments, and drafted the manuscript; VB provided feedback, conducted experiments, drafted and reviewed the manuscript; JT and AG provided feedback, drafted and reviewed the manuscript; AM, AG, DX, JT, and DF provided feedback and reviewed the manuscript.

Funding Open access funding provided by SCELC, Statewide California Electronic Library Consortium. This project was conducted at and funded by NVIDIA.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., & Yan, M. (2022). Do as I can, not as I say: Grounding language in robotic affordances. arXiv.
Akakzia, A., Colas, C., Oudeyer, P. Y., Chetouani, M., & Sigaud, O. (2021). Grounding language to autonomously-acquired skills via goal generation. In International conference on learning representations.
Baier, J. A., Bacchus, F., & McIlraith, S. A. (2007). A heuristic search approach to planning with temporally extended preferences. In Proceedings of the 20th international joint conference on artificial intelligence (pp. 1808–1815). Morgan Kaufmann Publishers Inc.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., & Amodei, D. (2020). Language models are few-shot learners. arXiv.
Bryce, D., & Kambhampati, S. (2007). A tutorial on planning graph based reachability heuristics. AI Magazine, 28(1), 47.
Cao, Y., & Lee, C. (2023). Robot behavior-tree-based task generation with large language models. arXiv preprint arXiv:2302.12927.
Capitanelli, A., & Mastrogiovanni, F. (2023). A framework to generate neurosymbolic PDDL-compliant planners. arXiv preprint arXiv:2303.00438.
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., & Zaremba, W. (2021). Evaluating large language models trained on code. arXiv.
Danielczuk, M., Mousavian, A., Eppner, C., & Fox, D. (2021). Object rearrangement using learned implicit collision functions. In IEEE international conference on robotics and automation (ICRA).
Eysenbach, B., Salakhutdinov, R. R., & Levine, S. (2019). Search on the replay buffer: Bridging planning and reinforcement learning. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in neural information processing systems (vol. 32). Curran Associates, Inc.
Fikes, R. E., & Nilsson, N. J. (1971). STRIPS: A new approach to the application of theorem proving to problem solving. In Proceedings of the 2nd international joint conference on artificial intelligence (pp. 608–620). Morgan Kaufmann Publishers Inc.
Garrett, C. R., Lozano-Pérez, T., & Kaelbling, L. P. (2020). PDDLStream: Integrating symbolic planners and blackbox samplers via optimistic adaptive planning. Proceedings of the International Conference on Automated Planning and Scheduling, 30(1), 440–448.
Gu, X., Lin, T. Y., Kuo, W., & Cui, Y. (2022). Open-vocabulary object detection via vision and language knowledge distillation. In International conference on learning representations.
Gupta, T., & Kembhavi, A. (2022). Visual programming: Compositional visual reasoning without training. arXiv preprint arXiv:2211.11559.
Helmert, M. (2006). The fast downward planning system. Journal of Artificial Intelligence Research, 26(1), 191–246.
Hoffmann, J. (2001). FF: The fast-forward planning system. AI Magazine, 22(3), 57.
Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). The curious case of neural text degeneration. In International conference on learning representations.


Huang, W., Abbeel, P., Pathak, D., & Mordatch, I. (2022a). Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. arXiv preprint arXiv:2201.07207.
Huang, W., Xia, F., Shah, D., Driess, D., Zeng, A., Lu, Y., et al. (2023). Grounded decoding: Guiding text generation with grounded models for robot control. arXiv preprint arXiv:2303.00855.
Huang, W., Xia, F., Xiao, T., Chan, H., Liang, J., Florence, P., & Ichter, B. (2022b). Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608.
Jansen, P. (2020). Visually-grounded planning without vision: Language models infer detailed plans from high-level instructions. In Findings of the association for computational linguistics: EMNLP 2020 (pp. 4412–4417). Association for Computational Linguistics.
Jiang, Y., Gu, S. S., Murphy, K. P., & Finn, C. (2019). Language as an abstraction for hierarchical deep reinforcement learning. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in neural information processing systems (vol. 32). Curran Associates, Inc.
Jiang, Y., Zhang, S., Khandelwal, P., & Stone, P. (2018). Task planning in robotics: An empirical comparison of PDDL-based and ASP-based systems. arXiv.
Kurutach, T., Tamar, A., Yang, G., Russell, S. J., & Abbeel, P. (2018). Learning plannable representations with causal InfoGAN. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems (vol. 31). Curran Associates, Inc.
Li, S., Puig, X., Paxton, C., Du, Y., Wang, C., Fan, L., & Zhu, Y. (2022). Pre-trained language models for interactive decision-making. arXiv.
Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Ichter, B., & Zeng, A. (2023). Code as policies: Language model programs for embodied control.
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., & Neubig, G. (2021). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv.
Luong, T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 conference on empirical methods in natural language processing (pp. 1412–1421). Association for Computational Linguistics.
Mai, J., Chen, J., Li, B., Qian, G., Elhoseiny, M., & Ghanem, B. (2023). LLM as a robotic brain: Unifying egocentric memory and control. arXiv preprint arXiv:2304.09349.
Mirchandani, S., Karamcheti, S., & Sadigh, D. (2021). ELLA: Exploration through learned language abstraction. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, & J. W. Vaughan (Eds.), Advances in neural information processing systems (vol. 34, pp. 29529–29540). Curran Associates, Inc.
Nair, S., & Finn, C. (2020). Hierarchical foresight: Self-supervised learning of long-horizon tasks via visual subgoal generation. In International conference on learning representations.
OpenAI (2023). GPT-4 technical report. arXiv.
Patel, R., & Pavlick, E. (2022). Mapping language models to grounded conceptual spaces. In International conference on learning representations.
Puig, X., Ra, K., Boben, M., Li, J., Wang, T., Fidler, S., & Torralba, A. (2018). VirtualHome: Simulating household activities via programs. In 2018 IEEE/CVF conference on computer vision and pattern recognition (pp. 8494–8502).
Shah, D., Toshev, A. T., Levine, S., & Ichter, B. (2022). Value function spaces: Skill-centric state abstractions for long-horizon reasoning. In International conference on learning representations.
Sharma, P., Torralba, A., & Andreas, J. (2022). Skill induction and planning with latent language. In Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 1713–1726). Association for Computational Linguistics.
Shridhar, M., Thomason, J., Gordon, D., Bisk, Y., Han, W., Mottaghi, R., & Fox, D. (2020). ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In The IEEE conference on computer vision and pattern recognition (CVPR).
Silver, T., Chitnis, R., Kumar, N., McClinton, W., Lozano-Perez, T., Kaelbling, L. P., & Tenenbaum, J. (2022). Inventing relational state and action abstractions for effective and efficient bilevel planning. In The multi-disciplinary conference on reinforcement learning and decision making (RLDM).
Skreta, M., Yoshikawa, N., Arellano-Rubach, S., Ji, Z., Kristensen, L. B., Darvish, K., & Garg, A. (2023). Errors are useful prompts: Instruction guided task programming with verifier-assisted iterative prompting. arXiv preprint arXiv:2303.14100.
Srinivas, A., Jabri, A., Abbeel, P., Levine, S., & Finn, C. (2018). Universal planning networks: Learning generalizable representations for visuomotor control. In J. Dy & A. Krause (Eds.), Proceedings of the 35th international conference on machine learning (vol. 80, pp. 4732–4741). PMLR.
Sundermeyer, M., Mousavian, A., Triebel, R., & Fox, D. (2021). Contact-GraspNet: Efficient 6-DoF grasp generation in cluttered scenes. In 2021 IEEE international conference on robotics and automation (ICRA) (pp. 13438–13444).
Vemprala, S., Bonatti, R., Bucker, A., & Kapoor, A. (2023). ChatGPT for robotics: Design principles and model abilities.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., & Zhou, D. (2022). Chain of thought prompting elicits reasoning in large language models. arXiv.
Wiseman, S., Shieber, S., & Rush, A. (2017). Challenges in data-to-document generation. In Proceedings of the 2017 conference on empirical methods in natural language processing (pp. 2253–2263). Association for Computational Linguistics.
Xie, Y., Yu, C., Zhu, T., Bai, J., Gong, Z., & Soh, H. (2023a). Translating natural language to planning goals with large-language models. arXiv preprint arXiv:2302.05128.
Xie, Y., Yu, C., Zhu, T., Bai, J., Gong, Z., & Soh, H. (2023b). Translating natural language to planning goals with large-language models.
Xu, D., Martín-Martín, R., Huang, D. A., Zhu, Y., Savarese, S., & Fei-Fei, L. F. (2019). Regression planning networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in neural information processing systems (vol. 32). Curran Associates, Inc.
Xu, D., Nair, S., Zhu, Y., Gao, J., Garg, A., Fei-Fei, L., & Savarese, S. (2018). Neural task programming: Learning to generalize across hierarchical tasks. In 2018 IEEE international conference on robotics and automation (ICRA) (pp. 3795–3802).
Zeng, A., Attarian, M., Ichter, B., Choromanski, K., Wong, A., Welker, S., & Florence, P. (2022). Socratic models: Composing zero-shot multimodal reasoning with language. arXiv.
Zhu, Y., Tremblay, J., Birchfield, S., & Zhu, Y. (2020). Hierarchical planning for long-horizon manipulation with geometric and symbolic scene graphs. arXiv.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Ishika Singh is a 3rd year PhD student advised by Professor Jesse Thomason in the Computer Science department at the University of Southern California. Her research focuses on problems in language-conditioned robot learning such as vision-language navigation, manipulation and task planning. Previously, she was an undergrad at IIT Kanpur.

Valts Blukis is a research scientist at NVIDIA. His research goal is creating scalable and generalizable machine learning algorithms and models that enable robots to interact with people through natural language while observing the unstructured world through first-person sensor observations. He received his PhD from Cornell University and Cornell Tech.

Arsalan Mousavian is a senior research scientist at NVIDIA Seattle Robotics Lab. He is interested in using computer vision and 3D vision for robotics tasks such as object manipulation. Prior to NVIDIA, he finished his PhD in the Computer Science department at George Mason University.

Ankit Goyal is a Research Scientist in Robotics at NVIDIA. He did his Ph.D. in Computer Science at Princeton University, and completed a Masters at the University of Michigan and a Bachelors at IIT Kanpur. He is interested in understanding various aspects of intelligence, especially reasoning and common sense. In particular, he wants to develop computational models for various reasoning skills that humans possess.

Danfei Xu is an Assistant Professor at the School of Interactive Computing at Georgia Tech and a (part-time) Research Scientist at NVIDIA AI. His current research focuses on visuomotor skill learning, long-horizon manipulation planning, and data-driven approaches to human-robot collaboration. He received his Ph.D. in CS from Stanford University.

Jonathan Tremblay is a research scientist at NVIDIA. His research interests are in computer vision, synthetic data, and reinforcement learning for robotics applications. At NVIDIA, Jonathan has focused on using synthetic data to train object detectors, object pose estimation, few shot learning, etc. Jonathan's goal is to create robust and accessible computer vision systems for roboticists to use on their systems. Prior to joining NVIDIA, Jonathan received a Ph.D. in computer science from McGill University.

Dieter Fox is Senior Director of Robotics Research at NVIDIA. His research is in robotics, with strong connections to artificial intelligence, computer vision, and machine learning. He is currently on partial leave from the University of Washington, where he is a Professor in the Paul G. Allen School of Computer Science & Engineering. At UW, he also heads the UW Robotics and State Estimation Lab. From 2009 to 2011, he was Director of the Intel Research Labs Seattle. Dieter obtained his Ph.D. from the University of Bonn, Germany.

Jesse Thomason is an Assistant Professor at USC leading the Grounding Language in Multimodal Observations, Actions, and Robots (GLAMOR) lab. GLAMOR brings together natural language processing and robotics (RoboNLP). Jesse joined USC in 2021 and received his PhD from the University of Texas at Austin in 2018.

Animesh Garg is a Stephen Fleming Early Career Professor in Computer Science at Georgia Tech. Previously, he was an Assistant Professor of Computer Science at the University of Toronto and a Faculty Member at the Vector Institute. He is also a Sr. Research Scientist at NVIDIA. He earned his Ph.D. in Operations Research from UC Berkeley and completed a postdoc at Stanford. His group focuses on multimodal object-centric and spatiotemporal event representations, self-supervised pre-training for reinforcement learning & control, and principles of efficient dexterous skill learning.
