Synthesizing controllable 6-DOF object manipulation trajectories in 3D environments is essential for enabling robots to interact with complex scenes, yet remains challenging due to the need for accurate spatial reasoning, physical feasibility, and multimodal scene understanding. Existing approaches often rely on 2D or partial 3D representations, limiting their ability to capture full scene geometry and constraining trajectory precision.
We present GMT, a multimodal transformer framework that generates realistic and goal-directed object trajectories by jointly leveraging 3D bounding box geometry, point cloud context, semantic object categories, and target end poses. The model represents trajectories as continuous 6-DOF pose sequences and employs a tailored conditioning strategy that fuses geometric, semantic, contextual, and goal-oriented information.
Extensive experiments on synthetic and real-world benchmarks demonstrate that GMT outperforms state-of-the-art human motion and human-object interaction baselines, such as CHOIS and GIMO, achieving substantial gains in spatial accuracy and orientation control. Our method establishes a new benchmark for learning-based manipulation planning and shows strong generalization to diverse objects and cluttered 3D environments.
Given an observed trajectory and scene context, our model predicts future 6-DOF object trajectories conditioned on a specified goal state. The pipeline consists of:
The green trajectory represents the input history across all experiments. Only our model produces trajectories that both reach the target and avoid collisions, while also achieving shorter path lengths than the ground-truth natural trajectories.
The green points indicate the input history. Our model generates trajectories that are more efficient than the ground truth, while all baselines remain stuck in repetitive motions.
HD-EPIC only annotates object positions at pickup and drop events. We use the interacting hand as a proxy for the object's motion: by tracking the hand via Project Aria's Machine Perception Services (MPS) and detecting hand-object contact with a pretrained Hands23 detector, we infer dense 6-DOF object trajectories between pickup and release.
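The hand-as-proxy idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, array shapes, and the rigid-grasp assumption (the hand-to-object transform estimated at pickup is held fixed until release) are ours.

```python
import numpy as np

def infer_object_trajectory(hand_poses, contact, obj_pose_at_pickup):
    """Propagate an object's 6-DOF pose along the hand's contact interval.

    hand_poses: (T, 4, 4) world-frame hand poses (e.g. from hand tracking).
    contact:    (T,) bool flags from a hand-object contact detector.
    obj_pose_at_pickup: (4, 4) annotated object pose at the pickup frame.

    Assumes a rigid grasp: the hand-to-object offset estimated once at
    pickup is kept fixed until release.
    """
    idx = np.flatnonzero(contact)
    if idx.size == 0:
        return np.empty((0, 4, 4))
    start, end = idx[0], idx[-1]
    # Fixed hand -> object transform, estimated at the pickup frame.
    hand_to_obj = np.linalg.inv(hand_poses[start]) @ obj_pose_at_pickup
    # Object pose at each in-contact frame: hand pose composed with offset.
    return hand_poses[start:end + 1] @ hand_to_obj
```

The first pose of the returned sequence reproduces the annotated pickup pose by construction, and subsequent poses follow the hand rigidly until the last in-contact frame.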
HD-EPIC lacks 3D bounding box annotations for scene objects. We recover them automatically: given a 2D annotation mask, we back-project the masked pixels into 3D using depth estimates from UniK3D, filter out depth outliers, and fit a 3D bounding box to the filtered point cloud.
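A minimal sketch of this lifting step is shown below. It is illustrative only: the function name and the median-based depth filter are our assumptions, and for simplicity it fits an axis-aligned box in the camera frame rather than an oriented one.

```python
import numpy as np

def mask_to_bbox3d(mask, depth, K, depth_tol=0.15):
    """Lift a 2D object mask to an axis-aligned 3D bounding box.

    mask:      (H, W) bool object mask.
    depth:     (H, W) per-pixel depth from a monocular depth estimator.
    K:         (3, 3) camera intrinsics matrix.
    depth_tol: relative deviation from the median depth kept as inlier.
    Returns (min_corner, max_corner), each a (3,) camera-frame point.
    """
    v, u = np.nonzero(mask)
    z = depth[v, u]
    # Reject depth outliers (background bleeding through the mask edge).
    med = np.median(z)
    keep = np.abs(z - med) < depth_tol * med
    u, v, z = u[keep], v[keep], z[keep]
    # Back-project masked pixels to camera-frame 3D points.
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    pts = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z], axis=1)
    # Axis-aligned box: min and max corners of the filtered cloud.
    return pts.min(axis=0), pts.max(axis=0)
```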
@inproceedings{zeng2026gmt,
title = {{GMT}: Goal-Conditioned Multimodal Transformer for 6-DOF Object Trajectory Synthesis in 3D Scenes},
author = {Zeng, Huajian and Saroha, Abhishek and Cremers, Daniel and Wang, Xi},
booktitle = {International Conference on 3D Vision (3DV)},
year = {2026},
}