- Danny Driess1,2
- Fei Xia1
- Mehdi S. M. Sajjadi3
- Corey Lynch1
- Aakanksha Chowdhery3
- Brian Ichter1
- Ayzaan Wahid1
- Jonathan Tompson1
- Quan Vuong1
- Tianhe Yu1
- Wenlong Huang1
- Yevgen Chebotar1
- Pierre Sermanet1
- Daniel Duckworth3
- Sergey Levine1
- Vincent Vanhoucke1
- Karol Hausman1
- Marc Toussaint2
- Klaus Greff3
- Andy Zeng1
- Igor Mordatch3
- Pete Florence1
1 Robotics at Google, 2 TU Berlin, 3 Google Research
Abstract
Large language models have been demonstrated to perform complex tasks. However, enabling general inference in the real world, e.g. for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Inputs to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks, including sequential robotic manipulation planning, visual question answering, and captioning. Our evaluations show that PaLM-E, a single large embodied multimodal model, can address a variety of embodied reasoning tasks, from a variety of observation modalities, on multiple embodiments, and further, exhibits positive transfer: the model benefits from diverse joint training across internet-scale language, vision, and visual-language domains. Our largest model, PaLM-E-562B with 562B parameters, in addition to being trained on robotics tasks, is a visual-language generalist with state-of-the-art performance on OK-VQA, and retains generalist language capabilities with increasing scale.
Approach
The main architectural idea of PaLM-E is to inject continuous, embodied observations such as images, state estimates, or other sensor modalities into the language embedding space of a pre-trained language model. This is realized by encoding the continuous observations into a sequence of vectors with the same dimension as the embedding space of the language tokens. The continuous information is hence injected into the language model in an analogous way to language tokens. PaLM-E is a decoder-only LLM that generates textual completions autoregressively given a prefix or prompt. We call our model PaLM-E, since we use PaLM (Chowdhery et al., 2022) as the pre-trained language model, and make it Embodied.
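To make the injection mechanism concrete, below is a minimal PyTorch sketch, assuming a single linear projection per observation; the class `MultimodalPrefix` and everything inside it are illustrative placeholders, not the actual PaLM-E implementation (which trains modality-specific encoders jointly with the LLM).

```python
import torch
import torch.nn as nn


class MultimodalPrefix(nn.Module):
    """Illustrative: map continuous observations into the LLM's token-embedding
    space and splice them into the text-embedding sequence."""

    def __init__(self, obs_dim: int, embed_dim: int, vocab_size: int):
        super().__init__()
        # Hypothetical encoder head: projects a continuous observation feature
        # (e.g. an image embedding or a state estimate) to the width of the
        # language model's token embeddings.
        self.obs_projector = nn.Linear(obs_dim, embed_dim)
        # Stand-in for the pre-trained LLM's token-embedding table.
        self.token_embedding = nn.Embedding(vocab_size, embed_dim)

    def forward(self, text_ids, obs_features, obs_positions):
        # text_ids: (T,) token ids; obs_features: (K, obs_dim);
        # obs_positions: K token indices at which observations are injected.
        text_emb = self.token_embedding(text_ids)    # (T, embed_dim)
        obs_emb = self.obs_projector(obs_features)   # (K, embed_dim)
        pieces, prev = [], 0
        for k, pos in enumerate(obs_positions):
            pieces.append(text_emb[prev:pos])
            pieces.append(obs_emb[k:k + 1])          # inject one observation vector
            prev = pos
        pieces.append(text_emb[prev:])
        return torch.cat(pieces, dim=0)              # (T + K, embed_dim)


# The resulting multi-modal sequence takes the place of the token-embedding
# prefix that a decoder-only LLM conditions on when generating its completion.
prefix = MultimodalPrefix(obs_dim=512, embed_dim=4096, vocab_size=32000)
sequence = prefix(torch.randint(0, 32000, (10,)), torch.randn(2, 512), [3, 7])
print(sequence.shape)  # torch.Size([12, 4096])
```

In this sketch each observation contributes a single vector; in practice an image encoder would typically yield several such vectors per observation, all projected to the same embedding width.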
Results
We show a few example videos of PaLM-E planning and executing long-horizon tasks on two different real embodiments. Note that all of these results were obtained with the same model trained on all data. In the first video, we execute the long-horizon instruction "bring me the rice chips from the drawer", which requires multiple planning steps as well as incorporating visual feedback from the robot's camera. Finally, we show another example on the same robot where the instruction is "bring me a green star". The green star is an object that this robot was not directly exposed to.
Next, we show PaLM-E controlling a tabletop robot that arranges blocks. PaLM-E can successfully plan over multiple stages based on visual and language input: our model plans the long-horizon task "sort blocks by colors into different corners". We show another example of planning over multiple stages while incorporating visual feedback over long time horizons, and finally a long-horizon pushing task on the same robot. Here the first instruction is "move remaining blocks to the group"; PaLM-E sequences step-by-step commands to the low-level policy, such as "move the yellow hexagon to the green star" and "move the blue triangle to the group", as sketched below.
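To make this control flow concrete, here is a minimal sketch of such a closed planning loop; `plan_next_step`, `capture_image`, and `execute_skill` are hypothetical placeholders rather than APIs from the paper or its codebase.

```python
def plan_and_execute(instruction, plan_next_step, capture_image, execute_skill,
                     max_steps=10):
    """Illustrative closed loop: the high-level model proposes one language
    subgoal at a time from the current image, and a low-level policy executes
    it before the scene is re-observed."""
    executed = []
    for _ in range(max_steps):
        image = capture_image()                      # current visual feedback
        step = plan_next_step(image, instruction, executed)
        if step == "terminate":                      # planner decides the task is done
            break
        execute_skill(step)                          # e.g. "move the blue triangle to the group"
        executed.append(step)
    return executed


# Stubbed usage (a real system would wire in the planner and the robot's policy):
steps = plan_and_execute(
    "sort blocks by colors into different corners",
    plan_next_step=lambda img, instr, done: "terminate" if len(done) >= 2
                                            else f"subgoal {len(done) + 1}",
    capture_image=lambda: None,
    execute_skill=lambda s: None,
)
print(steps)  # ['subgoal 1', 'subgoal 2']
```

Re-observing the scene after every executed subgoal is what lets the plan react to visual feedback over long horizons.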
We also demonstrate two examples of generalization. In the first, the instruction is "push red blocks to the coffee cup". The dataset contains only three demonstrations with a coffee cup in them, and none of them included red blocks. In the second, the instruction is "push green blocks to the turtle"; the robot successfully executes this task even though it has never seen the turtle before.
In addition to unlocking new capabilities in robot planning, PaLM-E is a competent vision-language model. Please check out our paper for more details and see the demo below.
Demo
The examples below are all completions from PaLM-E, shown in orange. The prompt consists of one or more images together with the text shown in gray.
Acknowledgements
The authors would like to thank the following people for their advice, help, and support: Xi Chen, Etienne Pot, Sebastian Goodman, Ted Xiao, Keerthana Gopalakrishnan, Kehang Han, Henryk Michalewski, Neil Houlsby, Basil Mustafa, Justin Gilmer, Yonghui Wu, Erica Moreira, Victor Gomes, Tom Duerig, and Kendra Byrne.