Program Generation For Situated Robot Task Planning Using Large Language Models
https://doi.org/10.1007/s10514-023-10135-3
Received: 1 May 2023 / Accepted: 3 August 2023 / Published online: 28 August 2023
© The Author(s) 2023
Abstract
Task planning can require defining myriad domain knowledge about the world in which a robot needs to act. To ameliorate
that effort, large language models (LLMs) can be used to score potential next actions during task planning, and even generate
action sequences directly, given an instruction in natural language with no additional domain information. However, such
methods either require enumerating all possible next steps for scoring, or generate free-form text that may contain actions
not possible on a given robot in its current context. We present a programmatic LLM prompt structure that enables plan
generation functional across situated environments, robot capabilities, and tasks. Our key insight is to prompt the LLM with
program-like specifications of the available actions and objects in an environment, as well as with example programs that
can be executed. We make concrete recommendations about prompt structure and generation constraints through ablation
experiments, demonstrate state-of-the-art success rates in VirtualHome household tasks, and deploy our method on a physical
robot arm for tabletop tasks. Website and code at progprompt.github.io.
Keywords Robot task planning · LLM code generation · Planning domain generalization · Symbolic planning
1 Introduction
language modelling. An autoregressive LLM is trained with a maximum likelihood loss to model the probability of a sequence of tokens y conditioned on an input sequence x, i.e. θ* = arg max_θ P(y | x; θ), where θ are the model parameters. The trained LLM is then used for prediction ŷ = arg max_{y∈S} P(y | x; θ), where S is the set of all text sequences. Since the search space S is huge, approximate decoding strategies are used for tractability (Holtzman et al., 2020; Luong et al., 2015; Wiseman et al., 2017).

LLMs are trained on large text corpora and exhibit multi-task generalization when provided with a relevant prompt input x. Prompting LLMs to generate text useful for robot task planning is a nascent topic (Ahn et al., 2022; Jansen, 2020; Huang et al., 2022a, b; Li et al., 2022; Patel and Pavlick, 2022). Prompt design is challenging given the lack of paired natural language instruction text with executable plans or robot action sequences (Liu et al., 2021). Devising a prompt for task plan prediction can be broken down into a prompting function and an answer search strategy (Liu et al., 2021). A prompting function, f_prompt(·), transforms the input state observation s into a textual prompt. Answer search is the generation step, in which the LLM either outputs from the entire LLM vocabulary or scores a predefined set of options.

Closest to our work, Huang et al. (2022a) generates open-domain plans using LLMs. In that work, planning proceeds by: 1) selecting a similar task in the prompt example (f_prompt); 2) open-ended task plan generation (answer search); and 3) 1:1 prediction-to-action matching. The entire plan is generated open-loop without any environment interaction, and later tested for executability of matched actions. However, action matching based on generated text does not ensure the action is admissible in the current situation. Inner Monologue (Huang et al., 2022b) introduces environment feedback and state monitoring, but still found that LLM planners proposed actions involving objects not present in the scene. Our work shows that a programming language-inspired prompt generator can inform the LLM of both situated environment state and available robot actions, ensuring output compatibility to robot actions.

The related SayCan (Ahn et al., 2022) uses natural language prompting with LLMs to generate a set of feasible planning steps, re-scoring matched admissible actions using a learned value function. SayCan constructs a set of all admissible actions expressed in natural language and scores them using an LLM. This is challenging to do in environments with combinatorial action spaces. Concurrent with our work are Socratic models (Zeng et al., 2022), which also use code completion to generate robot plans. We go beyond Zeng et al. (2022) by leveraging additional, familiar features of programming languages in our prompts. We define an f_prompt that includes import statements to model robot capabilities, natural language comments to elicit common sense reasoning, and assertions to track execution state. Our answer search is performed by allowing the LLM to generate an entire, executable plan program directly.

2.3 Recent developments following ProgPrompt

Vemprala et al. (2023) further explores API-based planning with ChatGPT¹ in domains such as aerial robotics, manipulation, and visual navigation. They discuss the design principles for constructing interaction APIs, for action and perception, and prompts that can be used to generate code for robotic applications. Huang et al. (2023) builds on SayCan (Ahn et al., 2022) and generates planning steps token-by-token while scoring the tokens using both the LLM and the grounded pretrained value function. Cao and Lee (2023) explores generating behavior trees to study hierarchical task planning using LLMs. Skreta et al. (2023) proposes iterative error correction via a syntax verifier that repeatedly prompts the LLM with the previous query appended with a list of errors. Mai et al. (2023), similar in approach to Zeng et al. (2022) and Huang et al. (2022b), integrates pretrained models for perception, planning, control, memory, and dialogue zero-shot, for active exploration and embodied question answering tasks. Gupta and Kembhavi (2022) extends the LLM code generation and API-based perceptual interaction approach to a variety of vision-language tasks. Some recent works (Xie et al., 2023a; Capitanelli and Mastrogiovanni, 2023) use PDDL as the translation language instead of code, and use the LLM to generate either a PDDL plan or the goal. A classical planner then plans for the PDDL goal or executes the generated plan. This approach removes the need to generate preconditions using the LLM; however, it needs the domain rules to be specified for the planner.

¹ https://openai.com/blog/chatgpt/

3 Our method: ProgPrompt

We represent robot plans as pythonic programs. Following the paradigm of LLM prompting, we create a prompt structured as pythonic code and use an LLM to complete the code (Fig. 2). We use features available in Python to construct prompts that elicit an LLM to generate situated robot task plans, conditioned on a natural language instruction.

3.1 Representing robot plans as pythonic functions

Plan functions consist of API calls to action primitives, comments to summarize actions, and assertions for tracking execution (Fig. 3). Primitive actions use objects as arguments. For example, the "put salmon in the microwave" task includes API calls like find(salmon).
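To make this concrete, the sketch below shows the kind of pythonic plan function described in Sect. 3.1. It is a hypothetical illustration rather than verbatim ProgPrompt output: the primitive names mirror the VirtualHome action set, their bodies are placeholder stubs so the snippet runs standalone, and the if-based recovery stands in for the assertion-and-recover pattern shown in Figs. 2 and 3.

# Hypothetical sketch of a pythonic plan function (cf. Sect. 3.1 and Fig. 3).
# The primitives below are placeholder stubs; a real deployment binds them to
# simulator or robot APIs.
def walk(obj): print(f"walk to {obj}")
def find(obj): print(f"find {obj}")
def grab(obj): print(f"grab {obj}")
def open_(obj): print(f"open {obj}")
def close(obj): print(f"close {obj}")
def putin(obj, container): print(f"put {obj} in {container}")
def switchon(obj): print(f"switch on {obj}")
def is_holding(obj): return True  # stand-in for an environment state query

def microwave_salmon():
    # 1: find the salmon and pick it up
    find('salmon')
    grab('salmon')
    # recovery step if the grab did not succeed (assertion-style check)
    if not is_holding('salmon'):
        grab('salmon')
    # 2: open the microwave and put the salmon inside
    walk('microwave')
    open_('microwave')
    putin('salmon', 'microwave')
    # 3: close the door and switch the microwave on
    close('microwave')
    switchon('microwave')

microwave_salmon()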
Fig. 2 Our ProgPrompts include import statement, object list, and example tasks (PROMPT for Planning). The Generated Plan is for microwave salmon. We highlight prompt comments, actions as imported function calls with objects as arguments, and assertions with recovery steps. PROMPT for State Feedback represents example assertion checks. We further illustrate the execution of the program via a scenario where an assertion succeeds or fails, and how the generated plan corrects the error before executing the next step. Full Execution is shown in bottom-right. '...' used for brevity
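Since the figure itself cannot be reproduced in this text, the sketch below approximates the prompt layout the caption describes: imported action primitives, an available-object list, fully executable example tasks, and an incomplete function header for the new task that the LLM is asked to complete. The specific action names, objects, and example task are illustrative assumptions, not the verbatim prompt.

# Hypothetical approximation of the PROMPT for Planning layout described in
# the Fig. 2 caption. The prompt is plain text handed to the LLM; the model's
# completion of the final, unfinished function is the generated plan.
PROMPT_FOR_PLANNING = '''
from actions import walk, find, grab, open, close, putin, switchon

objects = ['salmon', 'lime', 'microwave', 'fridge', 'garbage_can', 'plate']

def throw_away_lime():
    # 1: find and grab the lime
    find('lime')
    grab('lime')
    # 2: put it in the garbage can
    walk('garbage_can')
    open('garbage_can')
    putin('lime', 'garbage_can')
    close('garbage_can')

def microwave_salmon():
'''

if __name__ == '__main__':
    # In the full system this text is sent to the LLM completion API; here we
    # only show the structure.
    print(PROMPT_FOR_PLANNING)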
bag). However, the agent responsible for executing the plan might not have a primitive action to take_out. To inform the LLM about the agent's action primitives, we provide them as Pythonic import statements. These encourage the LLM to restrict its output to only functions that are available in the current context. To change agents, ProgPrompt just needs a new list of imported functions representing agent actions. A grocery bag object might also not exist in the environment. We provide the available objects in the environment as a list of strings. Since our prompting scheme explicitly lists out the set of functions and objects available to the model, the generated plans typically contain actions an agent can take and objects available in the environment.

ProgPrompt also includes a few example tasks—fully executable program plans. Each example task demonstrates how to complete a given task using available actions and objects in the given environment. These examples demonstrate the relationship between the task name, given as the function handle, and the actions to take, as well as the restrictions on actions and objects to involve.

3.3 Task plan generation and execution

The given task is fully inferred by the LLM based on the ProgPrompt prompt. Generated plans are executed on a virtual agent or a physical robot system using an interpreter that executes each action command against the environment. Assertion checking is done in a closed-loop manner during execution, providing current environment state feedback.

4 Experiments

We evaluate our method with experiments in a virtual household environment and on a physical robot manipulator.

4.1 Simulation experiments

We evaluate our method in the VirtualHome (VH) environment (Puig et al., 2018), a deterministic simulation platform for typical household activities. A VH state s is a set of objects O and properties P. P encodes information like in(salmon, microwave) and agent_close_to(salmon). The action space is A = {grab, putin, putback, walk, find, open, close, switchon, switchoff, sit, standup}. We experiment with 3 VH environments. Each environment contains 115 unique object instances (Fig. 2), including class-level duplicates. Each object has properties corresponding to its action affordances. Some objects also have a semantic state like heated, washed, or used. For example, an object in the Food category can become heated whenever in(object, microwave) ∧ switched_on(microwave).

We create a dataset of 70 household tasks. Tasks are posed with high-level instructions like "microwave salmon". We collect a ground-truth sequence of actions that completes the task from an initial state, and record the final state g that defines a set of symbolic goal conditions, g ∈ P.

When executing generated programs, we incorporate environment state feedback in response to assertions. VH provides observations in the form of a state graph with object properties and relations. To check assertions in this environment, we extract information about the relevant object from the state graph and prompt the LLM to return whether the assertion holds or not, given the state graph and assertion as a text prompt (Fig. 2, Prompt for State Feedback). We choose this design over rule-based checking since it is more general.

4.2 Real-robot experiments

We use a Franka Emika Panda robot with a parallel-jaw gripper. We assume access to a pick-and-place policy. The policy takes as input two pointclouds of a target object and a target container, and performs a pick-and-place operation to place the object on or inside the container. We use the system of Danielczuk et al. (2021) to implement the policy, use MPPI for motion generation, SceneCollisionNet (Danielczuk et al., 2021) to avoid collisions, and generate grasp poses with Contact-GraspNet (Sundermeyer et al., 2021).

We specify a single import statement for the action grab_and_putin(obj1, obj2) for ProgPrompt. We use ViLD (Gu et al., 2022), an open-vocabulary object detection model, to identify and segment objects in the scene and construct the available object list for the prompt. Unlike in the virtual environment, where the object list was a global variable common to all tasks, here the object list is a local variable for each plan function, which allows greater flexibility to adapt to new objects. The LLM outputs a plan containing function calls of the form grab_and_putin(obj1, obj2). Here, objects obj1 and obj2 are text strings that we map to pointclouds using ViLD segmentation masks and the depth image. Due to real-world uncertainty, we do not implement assertion-based closed-loop options on the tabletop plans.

4.3 Evaluation metrics

We use three metrics to evaluate system performance: success rate (SR), executability (Exec), and goal conditions recall (GCR). The task-relevant goal-conditions are the set of goal-conditions that changed between the initial and final state in the demonstration. SR is the fraction of executions that achieved all task-relevant goal-conditions. Exec is the fraction of actions in the plan that are executable in the environment, even if they are not relevant for the task. GCR is measured using the set difference between the ground truth final state conditions g and the final state ĝ achieved with the generated plan, divided by the number of task-specific goal-conditions; SR = 1 only if GCR = 1.
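Read literally, SR and GCR can be computed from sets of symbolic goal conditions. The sketch below is a minimal interpretation of the definitions above; the condition strings are invented for illustration, and Exec would additionally require counting which plan actions the environment accepted.

# Minimal sketch of the GCR and SR computations described above, with goal
# conditions represented as strings. The specific conditions are illustrative.
def goal_conditions_recall(g_true, g_achieved, task_relevant):
    """GCR: 1 minus the fraction of task-relevant goal conditions that are
    present in the ground-truth final state but missing from the achieved one."""
    missing = (g_true - g_achieved) & task_relevant
    return 1.0 - len(missing) / len(task_relevant)

def success(g_true, g_achieved, task_relevant):
    """SR = 1 only if every task-relevant goal condition is achieved (GCR = 1)."""
    return goal_conditions_recall(g_true, g_achieved, task_relevant) == 1.0

# Example for "microwave salmon" with two task-relevant conditions.
g_true = {'in(salmon, microwave)', 'switched_on(microwave)', 'closed(fridge)'}
g_achieved = {'in(salmon, microwave)', 'closed(fridge)'}
task_relevant = {'in(salmon, microwave)', 'switched_on(microwave)'}

print(goal_conditions_recall(g_true, g_achieved, task_relevant))  # 0.5
print(success(g_true, g_achieved, task_relevant))                 # False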
Table 1 (notes) ProgPrompt uses 3 fixed example programs, except for the Davinci backbone, which can fit only 2 in the available API. Huang et al. (2022a) use 1 dynamically selected example, as described in their paper. LangPrompt uses 3 natural language text examples. The best performing model with a GPT3 backbone is shown in italics (and used for our ablation studies); the best performing model overall is shown in bold. ProgPrompt significantly outperforms the baselines of Huang et al. (2022a) and LangPrompt. We also show how each ProgPrompt feature adds to the performance of the method.

5 Results

We show that ProgPrompt is an effective method for prompting LLMs to generate task plans for both virtual and physical agents.

5.1 Virtual experiment results

Table 1 summarizes the performance of our task plan generation and execution system in the seen environment of VirtualHome. We utilize GPT3 as the language model backbone to receive ProgPrompt prompts and generate plans. Each result is averaged over 5 runs in a single VH environment across 10 tasks. The variability in performance across runs arises from sampling LLM output. We include 3 pythonic task plan examples per prompt after evaluating performance on VH with between 1 and 7 examples and finding that 2 or more examples result in roughly equal performance for GPT3. The plan examples are fixed to be: "put the wine glass in the kitchen cabinet", "throw away the lime", and "wash mug".

We also include results on the recent GPT4 backbone. Unlike the GPT3 language model, GPT4 is a chat-bot model trained with reinforcement learning from human feedback (RLHF) to act as a helpful digital assistant (OpenAI, 2023). GPT4 takes as input a system prompt followed by one or more user prompts. Instead of simply auto-completing the code in the prompt, GPT4 interprets user prompts as questions
and generates answers as an assistant. To make GPT4 auto-complete our prompt, we used the following system prompt: "You are a helpful assistant." The user prompt is the same f_prompt as shown in Fig. 2.

We can draw several conclusions from Table 1. First, ProgPrompt (rows 3–6) outperforms prior work (Huang et al., 2022a) (row 8) by a substantial margin on all metrics using the same large language model backbone. Second, we observe that the Codex (Chen et al., 2021) and Davinci models (Brown et al., 2020)—themselves GPT3 variants—show mixed success at the task. In particular, Davinci, the original GPT3 version, does not match base GPT3 performance (row 2 versus row 3), possibly because its prompt length constraints limit it to 2 task examples versus the 3 available to other rows. Additionally, Codex exceeds GPT3 performance on every metric (row 1 versus row 3), likely because Codex is explicitly trained on programming language data. However, Codex has limited access in terms of number of queries per minute, so we continue to use GPT3 as our main LLM backbone in the following ablation experiments. Our recommendation to the community is to utilize a program-like prompt for LLM-based task planning and execution, for which base GPT3 works well, and we note that an LLM fine-tuned further on programming language data, such as Codex, can do even better. We additionally report results on Davinci-003 and GPT4 (row *), which are the latest GPT3 variant and the latest GPT variant in the series, respectively, at the time of this submission. Davinci-003 has a better SR and GCR, indicating it might have an improved common-sense understanding, but lower Exec compared to Codex. The newest model, GPT-4, does not seem to be better than the latest GPT3 variant on our tasks. Most of our results use the Davinci-002 variant (which we refer to as GPT3 in this paper), which was the latest model available when this study was conducted.

We explore several ablations of ProgPrompt. First, we find that Feedback mechanisms in the example programs, namely the assertions and recovery actions, improve performance (rows 3 versus 4 and 5 versus 6) across metrics, the sole exception being that Exec improves a bit without Feedback when there are no Comments in the prompt example code. Second, we observe that removing Comments from the prompt code substantially reduces performance on all metrics (rows 3 versus 5 and 4 versus 6), highlighting the usefulness of the natural language guidance within the programming language structure.

We also evaluate LangPrompt, an alternative to ProgPrompt that builds prompts from natural language text descriptions of the objects available and example task plans (row 7). LangPrompt is similar to the prompts built by Huang et al. (2022a). The outputs of LangPrompt are generated action sequences, rather than our proposed program-like structures. Thus, we finetune GPT2 to learn a policy P(a_t | s_t, GPT3 step, a_{1:t−1}) to map those generated sequences to executable actions in the simulation environment. We use the 35 tasks in the training set, and annotate the text steps and the corresponding action sequence to get 400 data points for training and validation of this policy. We find that while this method achieves reasonable partial success through GCR, it does not match Huang et al. (2022a) for program executability Exec and does not generate any fully successful task executions.

Task-by-Task Performance ProgPrompt performance for each task in the test set is shown in Table 2. We observe that tasks that are similar to prompt examples, such as throw away apple versus wash the plate, have higher GCR since the ground truth prompt examples hint about good stopping points. Even with high Exec, some task GCRs are low because some tasks have multiple appropriate goal states, but we only evaluate against a single "true" goal. For example, after microwaving and plating salmon, the agent may put the salmon on a table or a countertop.

Other Environments We evaluate ProgPrompt in two additional VH environments (Table 3). For each, we append a new object list representing the new environment after the example tasks in the prompt, followed by the task to be completed in the new scene. The action primitives and other ProgPrompt settings remain unchanged. We evaluate on 10 tasks with 5 runs each. For new tasks like wash the cutlery in dishwasher, ProgPrompt is able to infer that cutlery refers to spoons and forks in the new scenes, despite the fact that cutlery always refers to knives in the example prompts.

Table 3 ProgPrompt results on VirtualHome in additional scenes. We evaluate on 10 tasks each in two additional VH scenes beyond scene Env-0, where other reported results take place

VH scene  SR           Exec         GCR
Env-0     0.34 ± 0.08  0.84 ± 0.01  0.65 ± 0.05
Env-1     0.56 ± 0.08  0.85 ± 0.02  0.81 ± 0.07
Env-2     0.56 ± 0.05  0.85 ± 0.03  0.72 ± 0.09
Average   0.48 ± 0.13  0.85 ± 0.02  0.73 ± 0.10

5.2 Qualitative analysis and limitations

We manually inspect generated programs and their execution traces from ProgPrompt and characterize common failure modes. Many failures stem from the decision to make ProgPrompt agnostic to the deployed environment and its peculiarities, which may be resolved through explicitly communicating, for example, object affordances of the target environment as part of the ProgPrompt prompt.

• Environment artifacts: the VH agent cannot find or interact with objects nearby when sitting, and some
common sense actions for objects, such as open tvstand's cabinets, are not available in VH.
• Environment complexities: when an object is not accessible, the generated assertions might not be enough. For example, if the agent finds an object in a cabinet, it may not plan to open the cabinet to grab the object.
• Action success feedback is not provided to the agent, which may lead to failure of the subsequent actions. Assertion recovery modules in the plan can help, but aren't generated to cover all possibilities.
• Incomplete generation: some plans are cut short by LLM API caps. One possibility is to query the LLM again with the prompt and the partially generated plan.

Table 4 Results on the physical robot by task type

Task description                                          Distractors  SR  Plan SR  GCR
put the banana in the bowl                                0            1   1        1/1
                                                          4            1   1        1/1
put the pear on the plate                                 0            1   1        1/1
                                                          4            1   1        1/1
put the banana on the plate and the pear in the bowl      0            1   1        2/2
                                                          3            1   1        2/2
sort the fruits on the plate and the bottles in the box   0            0   1        2/3
                                                          1            1   1        3/3
                                                          2            0   0        2/3

Fig. 4 Robot plan execution rollout example on the sorting task, showing relevant objects banana, strawberry, bottle, plate and box, and a distractor object drill. The LLM recognizes that banana and strawberry are fruits, and generates plan steps to place them on the plate, while placing the bottle in the box. The LLM ignores the distractor object drill. See Fig. 1 for the prompt structure used
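To make the rollout described in the Fig. 4 caption concrete, the sketch below shows the kind of tabletop plan the LLM could generate for the sorting task. grab_and_putin is the single action primitive imported in Sect. 4.2; its body here is a stub so the example is self-contained, whereas the real system maps the string arguments to ViLD-segmented pointclouds and calls the pick-and-place policy.

# Hypothetical example of a generated tabletop plan for the sorting task in
# Fig. 4. The stub below replaces the real pick-and-place policy call.
def grab_and_putin(obj, container):
    print(f"pick up {obj} and place it in/on {container}")

def sort_fruits_on_the_plate_and_bottles_in_the_box():
    # objects detected by ViLD in this scene; per Sect. 4.2 the list is a
    # local variable of the plan function. The drill is a distractor.
    objects = ['banana', 'strawberry', 'bottle', 'plate', 'box', 'drill']
    # 1: fruits go on the plate
    grab_and_putin('banana', 'plate')
    grab_and_putin('strawberry', 'plate')
    # 2: bottles go in the box
    grab_and_putin('bottle', 'box')
    # the distractor object (drill) is ignored

sort_fruits_on_the_plate_and_bottles_in_the_box()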
Executability for the generated plans was always Exec = 1. An execution rollout example is illustrated in Fig. 4.

After this study was conducted, we re-attempted plan generation for the failed plan with GPT-4, using the same system prompt as in Sect. 5.1. GPT-4 was able to successfully predict the correct plan and not confuse the soup can for a bottle.

reasoning ability to generate an executable and valid task plan.
• The precondition checking helps recover from some failure modes that can happen if actions are generated in the wrong order or are missed by the base plan.

Cons:
• Using an LLM, as opposed to a hand-coded rule-based algorithm, is a design choice made to keep the approach more general. The assertion checking may also be replaced with a visual state-conditioned module when a semantic state is not available, such as in the real-world scenario. However, we leave these aspects to be addressed in future research.

Question 4 Is it possible that the generated code might lead the robot to be stuck in an infinite loop?

LLM code generation could lead to loops by predicting the same actions repeatedly as a generation artifact. LLMs used to suffer from such degeneration, but with the latest LLMs (i.e. GPT-3) we have not encountered it at all.

Question 5 Why are real-robot experiments simpler than virtual experiments?

The real-robot experiments were done as a demonstration of the approach on a real robot, while studying the method in depth in a virtual simulator, for the sake of simplicity and efficiency.

Question 6 What's the difference between the various GPT3 model versions used in this project?

We use GPT3 to refer to the latest available version of the GPT3 model on OpenAI at the time the paper was written: text-davinci-002. We use davinci to refer to the original version of GPT3 released: text-davinci.²

² More info on GPT3 model variations and naming can be found here: https://platform.openai.com/docs/models/overview

Question 7 Why can't a planning language like PDDL (or another planning language) be used to construct ProgPrompt? Any advantages of using a pythonic structure?

• GPT-3 has been trained on data from the internet. There is a lot of Python code on the internet, while PDDL is a language of much more narrow interest. Thus, we expect the LLM to better understand Python syntax.
• Python is a general-purpose language, so it has more features than PDDL. Furthermore, we want to avoid specifying the full planning domain, instead relying on the knowledge learned by the LLM to make common-sense inferences. A recent work, Xie et al. (2023b), uses LLMs to generate PDDL goals; however, it requires full domain specification for a given environment.
• Python is an accessible language that a larger community is familiar with.

Question 8 How to handle multiple instances of the same object type in the scene?

ProgPrompt doesn't tackle this issue; however, Xie et al. (2023b) shows that multiple instances of the same object can be handled by using labels with object IDs such as "book_1, book_2" (see the illustrative sketch below).

Question 9 Why doesn't the paper compare the performance of the proposed method to InnerMonologue, SayCan, or Socratic models?

At the time of writing, the datasets or models from the above papers were not public. However, we do compare with a proxy approach, similar in underlying idea to the above approaches, in the VirtualHome environment. LangPlan, in our baselines, uses GPT3 to get textual plan steps, which are then executed using a GPT-2-based trained policy.

Question 10 So the next step in this direction of research is to create highly structured inputs and outputs that could be compiled, since eventually we want something that compiles on robotic machines?

The disconnect and information bottleneck between the LLM planning module and the skill execution module make it less concrete "how much" and "what" information should be passed through the LLM during planning. That said, we think that this would be an interesting direction to pursue, to test the limits of the LLM's highly structured input understanding and generation.

Question 11 How does it compare to a classical planner?

• Classical planners require a concrete goal condition specification. An LLM planner reasons out a feasible goal state from a high-level task description, such as "microwave salmon". From a user's perspective, it is desirable to not have to specify a concrete semantic goal state of the environment and just be able to give an instruction to act on.
• The search space would also be huge without the common sense priors that an LLM planner leverages, as opposed to a classical planner. Moreover, we also bypass the need to specify the domain knowledge needed for the search to roll out.
• Moreover, the domain specification and search space will grow non-linearly with the complexity of the environment.

Question 12 Is it possible to decouple high-level language planning from low-level perceptual planning?

It may be feasible to an extent; however, we believe that a clean decoupling might not be "all we need". For instance, imagine an agent being stuck at an action that needs to be resolved at a semantic level of reasoning, which is probably very hard for the visual module to figure out: while placing a dish on an oven tray, the robot may need to pull the dish rack out of the oven to be successful in the task.
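Returning to the answer to Question 8: a minimal, hypothetical sketch of how instance-labelled object names could be exposed to the planner. The primitives are stubs, and the labels follow the "book_1, book_2" convention attributed to Xie et al. (2023b) above.

# Hypothetical sketch of instance-labelled object names (cf. Question 8).
# Duplicate objects appear in the object list with ID suffixes, so a generated
# plan can refer to one specific instance. The primitives are stubs.
def find(obj): print(f"find {obj}")
def grab(obj): print(f"grab {obj}")
def putback(obj, target): print(f"put {obj} on {target}")

objects = ['book_1', 'book_2', 'shelf', 'table']

def put_one_book_on_the_shelf():
    # refer to a specific instance instead of the ambiguous class name 'book'
    find('book_1')
    grab('book_1')
    find('shelf')
    putback('book_1', 'shelf')

put_one_book_on_the_shelf()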
Question 13 What are the kinds of failures that can happen with a ProgPrompt-like 2-stage decoupled pipeline?

A few broad failure categories could be:
Huang, W., Abbeel, P., Pathak, D., & Mordatch, I. (2022). Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. arXiv preprint arXiv:2201.07207.
Huang, W., Xia, F., Shah, D., Driess, D., Zeng, A., Lu, Y., et al. (2023). Grounded decoding: Guiding text generation with grounded models for robot control. arXiv preprint arXiv:2303.00855.
Huang, W., Xia, F., Xiao, T., Chan, H., Liang, J., Florence, P., & Ichter, B. (2022). Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608.
Jansen, P. (2020). Visually-grounded planning without vision: Language models infer detailed plans from high-level instructions. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 4412–4417). Association for Computational Linguistics.
Jiang, Y., Gu, S. S., Murphy, K. P., & Finn, C. (2019). Language as an abstraction for hierarchical deep reinforcement learning. In Advances in Neural Information Processing Systems (vol. 32). Curran Associates, Inc.
Jiang, Y., Zhang, S., Khandelwal, P., & Stone, P. (2018). Task planning in robotics: An empirical comparison of PDDL-based and ASP-based systems. arXiv.
Kurutach, T., Tamar, A., Yang, G., Russell, S. J., & Abbeel, P. (2018). Learning plannable representations with causal InfoGAN. In Advances in Neural Information Processing Systems (vol. 31). Curran Associates, Inc.
Li, S., Puig, X., Paxton, C., Du, Y., Wang, C., Fan, L., & Zhu, Y. (2022). Pre-trained language models for interactive decision-making. arXiv.
Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Ichter, B., & Zeng, A. (2023). Code as policies: Language model programs for embodied control.
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., & Neubig, G. (2021). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv.
Luong, T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (pp. 1412–1421). Association for Computational Linguistics.
Mai, J., Chen, J., Li, B., Qian, G., Elhoseiny, M., & Ghanem, B. (2023). LLM as a robotic brain: Unifying egocentric memory and control. arXiv preprint arXiv:2304.09349.
Mirchandani, S., Karamcheti, S., & Sadigh, D. (2021). ELLA: Exploration through learned language abstraction. In Advances in Neural Information Processing Systems (vol. 34, pp. 29529–29540). Curran Associates, Inc.
Nair, S., & Finn, C. (2020). Hierarchical foresight: Self-supervised learning of long-horizon tasks via visual subgoal generation. In International Conference on Learning Representations.
OpenAI (2023). GPT-4 technical report. arXiv.
Patel, R., & Pavlick, E. (2022). Mapping language models to grounded conceptual spaces. In International Conference on Learning Representations.
Puig, X., Ra, K., Boben, M., Li, J., Wang, T., Fidler, S., & Torralba, A. (2018). VirtualHome: Simulating household activities via programs. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8494–8502).
Shah, D., Toshev, A. T., Levine, S., & Ichter, B. (2022). Value function spaces: Skill-centric state abstractions for long-horizon reasoning. In International Conference on Learning Representations.
Sharma, P., Torralba, A., & Andreas, J. (2022). Skill induction and planning with latent language. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1713–1726). Association for Computational Linguistics.
Shridhar, M., Thomason, J., Gordon, D., Bisk, Y., Han, W., Mottaghi, R., & Fox, D. (2020). ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Silver, T., Chitnis, R., Kumar, N., McClinton, W., Lozano-Perez, T., Kaelbling, L. P., & Tenenbaum, J. (2022). Inventing relational state and action abstractions for effective and efficient bilevel planning. In The Multi-Disciplinary Conference on Reinforcement Learning and Decision Making (RLDM).
Skreta, M., Yoshikawa, N., Arellano-Rubach, S., Ji, Z., Kristensen, L. B., Darvish, K., & Garg, A. (2023). Errors are useful prompts: Instruction guided task programming with verifier-assisted iterative prompting. arXiv preprint arXiv:2303.14100.
Srinivas, A., Jabri, A., Abbeel, P., Levine, S., & Finn, C. (2018). Universal planning networks: Learning generalizable representations for visuomotor control. In Proceedings of the 35th International Conference on Machine Learning (vol. 80, pp. 4732–4741). PMLR.
Sundermeyer, M., Mousavian, A., Triebel, R., & Fox, D. (2021). Contact-GraspNet: Efficient 6-DoF grasp generation in cluttered scenes. In 2021 IEEE International Conference on Robotics and Automation (ICRA) (pp. 13438–13444).
Vemprala, S., Bonatti, R., Bucker, A., & Kapoor, A. (2023). ChatGPT for robotics: Design principles and model abilities.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., & Zhou, D. (2022). Chain of thought prompting elicits reasoning in large language models. arXiv.
Wiseman, S., Shieber, S., & Rush, A. (2017). Challenges in data-to-document generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (pp. 2253–2263). Association for Computational Linguistics.
Xie, Y., Yu, C., Zhu, T., Bai, J., Gong, Z., & Soh, H. (2023a). Translating natural language to planning goals with large-language models. arXiv preprint arXiv:2302.05128.
Xie, Y., Yu, C., Zhu, T., Bai, J., Gong, Z., & Soh, H. (2023b). Translating natural language to planning goals with large-language models.
Xu, D., Martín-Martín, R., Huang, D. A., Zhu, Y., Savarese, S., & Fei-Fei, L. (2019). Regression planning networks. In Advances in Neural Information Processing Systems (vol. 32). Curran Associates, Inc.
Xu, D., Nair, S., Zhu, Y., Gao, J., Garg, A., Fei-Fei, L., & Savarese, S. (2018). Neural task programming: Learning to generalize across hierarchical tasks. In 2018 IEEE International Conference on Robotics and Automation (ICRA) (pp. 3795–3802).
Zeng, A., Attarian, M., Ichter, B., Choromanski, K., Wong, A., Welker, S., & Florence, P. (2022). Socratic models: Composing zero-shot multimodal reasoning with language. arXiv.
Zhu, Y., Tremblay, J., Birchfield, S., & Zhu, Y. (2020). Hierarchical planning for long-horizon manipulation with geometric and symbolic scene graphs. arXiv.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.