GMT: Goal-Conditioned Multimodal Transformer for 6-DOF Object Trajectory Synthesis in 3D Scenes

1Technical University of Munich (TUM),   2Munich Center for Machine Learning (MCML),   3ETH Zürich,   4Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
3DV 2026 - International Conference on 3D Vision, Vancouver, BC, Canada, March 20-23, 2026

Given an observed trajectory, scene context, and action description, GMT predicts plausible future 6-DOF object trajectories. The generated trajectories are more efficient than natural human motions.

Abstract

Synthesizing controllable 6-DOF object manipulation trajectories in 3D environments is essential for enabling robots to interact with complex scenes, yet remains challenging due to the need for accurate spatial reasoning, physical feasibility, and multimodal scene understanding. Existing approaches often rely on 2D or partial 3D representations, limiting their ability to capture full scene geometry and constraining trajectory precision.

We present GMT, a multimodal transformer framework that generates realistic and goal-directed object trajectories by jointly leveraging 3D bounding box geometry, point cloud context, semantic object categories, and target end poses. The model represents trajectories as continuous 6-DOF pose sequences and employs a tailored conditioning strategy that fuses geometric, semantic, contextual, and goal-oriented information.

Extensive experiments on synthetic and real-world benchmarks demonstrate that GMT outperforms state-of-the-art human motion and human-object interaction baselines, such as CHOIS and GIMO, achieving substantial gains in spatial accuracy and orientation control. Our method establishes a new benchmark for learning-based manipulation planning and shows strong generalization to diverse objects and cluttered 3D environments.

Pipeline Overview


Given an observed trajectory and scene context, our model predicts future 6-DOF object trajectories conditioned on a specified goal state. The pipeline consists of:

  • Multi-modal Feature Extraction — We encode (a) trajectory dynamics via a linear layer; (b) local geometry, propagated from the scene point cloud to the object's bounding box with PointNet++; (c) semantic fixture boxes and labels, along with natural language descriptions, using CLIP; and (d) fixture bounding-box interactions via self-attention.
  • Feature Fusion — A Perceiver IO-inspired multimodal transformer with learnable latent arrays performs hierarchical fusion through stacked cross-attention and self-attention blocks, prioritizing hard geometric constraints over softer semantic cues.
  • Trajectory Prediction — The fused latent is fed directly to the prediction head, generating future 6-DOF object trajectories conditioned on the multimodal scene representation and goal state.
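The fusion stage above can be sketched in code. The following is a minimal, illustrative PyTorch version of a Perceiver IO-style latent bottleneck, not the paper's implementation: the `LatentFusion` name, dimensions, and layer counts are assumptions, and each modality is simply a tensor of tokens attended to in priority order.

```python
import torch
import torch.nn as nn

class LatentFusion(nn.Module):
    """Illustrative Perceiver IO-style fusion: a learnable latent array
    cross-attends to each modality's tokens in turn, with latent
    self-attention in between. Dimensions are for illustration only."""
    def __init__(self, dim=64, num_latents=16, heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, modalities):
        # modalities: list of (B, N_i, dim) token tensors, ordered by priority
        B = modalities[0].shape[0]
        z = self.latents.unsqueeze(0).expand(B, -1, -1)
        for tokens in modalities:
            attn, _ = self.cross(z, tokens, tokens)  # latents query modality
            z = self.norm(z + attn)
            attn, _ = self.self_attn(z, z, z)        # refine the latent array
            z = self.norm(z + attn)
        return z  # (B, num_latents, dim) fused representation
```

Feeding the modality token tensors in a fixed order (geometry first, semantics last) is one simple way to realize the "hard geometric constraints before softer semantic cues" prioritization described above.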

Qualitative Results

Results on the ADT Dataset

Qualitative results on the ADT dataset

The green trajectory represents the input history across all experiments. Only our model produces trajectories that both reach the target and avoid collisions, while achieving shorter path lengths than the ground-truth natural trajectories.

Results on the HD-EPIC Dataset

Qualitative results on the HD-EPIC dataset

The green points indicate the input history. Our model generates trajectories that are more efficient than the ground truth, while all baselines remain stuck in repetitive motions.

Hand Tracking for Object Trajectory Extraction

Hand tracking pipeline for HD-EPIC object trajectory extraction

HD-EPIC only annotates object positions at pickup and drop events. We use the interacting hand as a proxy for the object's motion: by tracking the hand via Project Aria's Machine Perception Services (MPS) and detecting hand-object contact with a pretrained Hands23 detector, we infer dense 6-DOF object trajectories between pickup and release.
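The rigid-attachment logic behind this proxy can be sketched in a few lines. This is a hedged illustration, not the actual extraction code: `contact_segments` and `object_trajectories` are hypothetical helper names, and the per-frame contact flags stand in for the output of a Hands23-style detector.

```python
import numpy as np

def contact_segments(in_contact):
    """Split a per-frame hand-object contact signal into
    (pickup_frame, release_frame) intervals."""
    segments, start = [], None
    for t, c in enumerate(in_contact):
        if c and start is None:
            start = t                      # pickup: contact begins
        elif not c and start is not None:
            segments.append((start, t))    # release: contact ends
            start = None
    if start is not None:                  # still in contact at the end
        segments.append((start, len(in_contact)))
    return segments

def object_trajectories(hand_poses, in_contact):
    """While the detector reports contact, the object is assumed rigidly
    attached to the hand, so its 6-DOF trajectory is the tracked hand
    trajectory over each contact interval.
    hand_poses: (T, 7) array of [x, y, z, qw, qx, qy, qz] per frame."""
    return [hand_poses[s:e] for s, e in contact_segments(in_contact)]
```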

3D Bounding Box Extraction for HD-EPIC

3D Bounding Box Extraction Pipeline for HD-EPIC

HD-EPIC lacks 3D bounding box annotations for scene objects. We recover them automatically: given a 2D annotation mask, we filter 3D point correspondences using depth estimates from UniK3D, back-project them into 3D space, and fit a 3D bounding box to the filtered point cloud.
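The back-projection step follows standard pinhole camera geometry. Below is a minimal NumPy sketch, assuming a conventional intrinsics matrix `K` and a metric depth map standing in for the UniK3D estimate; the function names are hypothetical.

```python
import numpy as np

def backproject_mask(mask, depth, K):
    """Back-project pixels inside a 2D mask into 3D camera coordinates
    using per-pixel depth and intrinsics K = [[fx, 0, cx],
                                              [0, fy, cy],
                                              [0,  0,  1]]."""
    v, u = np.nonzero(mask)          # pixel coordinates inside the mask
    z = depth[v, u]
    valid = z > 0                    # drop pixels with missing depth
    u, v, z = u[valid], v[valid], z[valid]
    x = (u - K[0, 2]) * z / K[0, 0]  # pinhole inverse projection
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)

def fit_aabb(points):
    """Fit an axis-aligned 3D bounding box: (min corner, max corner)."""
    return points.min(axis=0), points.max(axis=0)
```

An axis-aligned box is the simplest fit; an oriented box (e.g. via PCA of the back-projected points) would follow the same pattern.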

More Qualitative Results

Additional Results on the ADT Dataset

Additional qualitative results on the ADT dataset

Additional Results on the HD-EPIC Dataset

Additional qualitative results on the HD-EPIC dataset

BibTeX

@inproceedings{zeng2026gmt,
  title     = {{GMT}: Goal-Conditioned Multimodal Transformer for 6-DOF Object Trajectory Synthesis in 3D Scenes},
  author    = {Zeng, Huajian and Saroha, Abhishek and Cremers, Daniel and Wang, Xi},
  booktitle = {International Conference on 3D Vision (3DV)},
  year      = {2026},
}