Danny Driess Zhiao Huang Yunzhu Li Russ Tedrake Marc Toussaint
TU Berlin, MIT, UC San Diego
The video shows the closed-loop visual planning and execution result in the simulator. The goal is to move all yellow blocks to the top and all blue blocks to the bottom. The initial state is relatively adversarial, i.e., a greedy strategy of just pushing the objects straight towards the goal region would fail. Notice how the planned behavior sometimes pushes multiple objects at once, since this reduces the cost the most.
This video shows the closed-loop visual planning and execution result in the simulator for a box-sorting task with 6 objects, although the model has been trained on scenes that contain exactly 4 objects. Due to the structure of our proposed framework, the model generalizes to more objects than seen during training and can therefore solve this complex box-sorting task.
This video shows the predictions of the model over a very long horizon, after observing the scene only at the beginning. While clearly not perfect, given how many steps it predicts into the future, both the reconstructions and the dynamics behavior are still qualitatively useful. There is little drift of the objects, which is achieved by exploiting the adjacency matrix estimation from the model itself.
This video shows the predictions of the model after observing the scene only once at the beginning. The right video corresponds to a case where the adjacency matrix is still estimated at every time step, but neither used during message passing (Sec. VII.C-2) nor to exploit the quasi-static assumption (Sec. VII.C-1). When the information in the estimated adjacency matrix is fully exploited, the predictions are more stable far into the future, with little drift.
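To give an intuition for how the quasi-static assumption can be exploited, here is a minimal NumPy sketch. It is not the paper's implementation: the function names, the generic dynamics callback, and the rule "an object with no estimated edges is not in contact and therefore does not move" are all assumptions for illustration. Freezing isolated objects in this way prevents them from slowly drifting over long rollouts.

```python
import numpy as np

def rollout_step(z, A, dynamics):
    """One prediction step that exploits an estimated adjacency matrix.

    z: (n, d) array of per-object latent vectors.
    A: (n, n) estimated adjacency matrix (no self-loops).
    dynamics: callable (z, A) -> (n, d), a stand-in for the learned model.

    Hypothetical quasi-static rule: an object whose adjacency row is
    empty has no estimated interactions, so its latent is copied
    forward unchanged instead of being re-predicted.
    """
    z_next = dynamics(z, A)
    isolated = A.sum(axis=1) == 0   # no edges -> treated as not in contact
    z_next[isolated] = z[isolated]  # freeze isolated objects to avoid drift
    return z_next
```

With this rule, prediction error can only accumulate on objects that the model itself believes are interacting.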
Here we compare against a GNN with a dense adjacency matrix (Sec. VII.C-3). As one can see, a dense adjacency matrix leads to significantly more drift than our proposed method of estimating the adjacency matrix from the model's own predictions. This drift is the reason why planning with a model that uses a dense adjacency matrix fails (Table 1 in the paper).
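The difference between the two adjacency choices can be sketched as follows. This is a minimal NumPy illustration with hypothetical weight matrices, not the paper's architecture: the estimated adjacency matrix gates which pairs of objects exchange messages, whereas a dense adjacency lets every object influence every other object at every step, which is one way spurious interactions (and hence drift) can accumulate.

```python
import numpy as np

def message_passing(z, A, W_msg, W_upd):
    """One round of GNN message passing over per-object latents z.

    z: (n, d) latent vectors; A: (n, n) adjacency matrix.
    Messages flow only along pairs with A[i, j] > 0.
    W_msg, W_upd: stand-in learned weight matrices (here random).
    """
    n, d = z.shape
    msgs = np.zeros((n, d))
    for i in range(n):
        for j in range(n):
            if i != j and A[i, j] > 0:
                # Pairwise message, gated by the adjacency matrix.
                msgs[i] += np.tanh(np.concatenate([z[i], z[j]]) @ W_msg)
    # Node update from the aggregated messages.
    return np.tanh(np.concatenate([z, msgs], axis=1) @ W_upd)

rng = np.random.default_rng(0)
n, d = 4, 8
z = rng.normal(size=(n, d))
W_msg = rng.normal(size=(2 * d, d)) * 0.1
W_upd = rng.normal(size=(2 * d, d)) * 0.1

A_sparse = np.array([[0, 1, 0, 0],     # only nearby objects interact
                     [1, 0, 0, 0],
                     [0, 0, 0, 1],
                     [0, 0, 1, 0]])
A_dense = np.ones((n, n)) - np.eye(n)  # every object influences every other

z_sparse = message_passing(z, A_sparse, W_msg, W_upd)
z_dense = message_passing(z, A_dense, W_msg, W_upd)
# With the sparse adjacency, objects 0/1 receive no messages from 2/3,
# so interactions between distant objects cannot accumulate over a rollout.
```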
Replacing the compositional NeRF decoder with a 2D compositional CNN decoder, not only is the reconstruction at the very beginning significantly less sharp, but the predictions of the CNN-decoder-based model also become useless after only a few time steps.
Here we show the forward predictions of the learned model on a test dataset containing 4 objects. Each scene is observed only once and the subsequent reconstructions are purely generated from the model's own predictions. The images in the top row are rendered with the learned model.
This video shows how novel scenes can be composed/transformed with the model. In the first part, objects are observed individually and then composed into a scene while applying rigid transformations to the objects. In the second part, rigid transformations are applied to individual objects in an observed scene. Note that the rigid transformations are achieved through the implicit object encoder, i.e., the transformed objects lead to new latent vectors describing their configurations in the scene; it is not only the appearance/rendered image that changes. All images with black background are rendered through the model.
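For intuition, composing a scene from per-object fields can be sketched as below. Note the hedges: the Gaussian-blob density is a stand-in for a per-object NeRF decoder, combining densities by summation is a generic compositional-rendering choice, and applying the rigid transform at the decoder's query points differs from the mechanism described above, where the transform acts through the implicit object encoder and produces new latent vectors.

```python
import numpy as np

def object_density(z, x):
    """Stand-in for a per-object NeRF density network conditioned on
    latent z. Here: a Gaussian blob whose center is read from the
    latent (a purely illustrative choice)."""
    center = z[:3]
    return np.exp(-np.sum((x - center) ** 2) / 0.1)

def composed_density(latents, transforms, x):
    """Density of the composed scene at world point x.

    Each object is evaluated in its own canonical frame: the world
    query point is mapped through the inverse of that object's rigid
    transform (R, t). Moving an object then means changing (R, t),
    while each object's decoder stays fixed. Densities are combined
    by summation.
    """
    total = 0.0
    for z, (R, t) in zip(latents, transforms):
        x_obj = R.T @ (x - t)  # world frame -> object canonical frame
        total += object_density(z, x_obj)
    return total
```

Translating an object (changing t) moves its density peak accordingly, which is the decoder-side analogue of re-encoding the transformed object into a new latent.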