FlowHOI: Flow-based Semantics-Grounded Generation of Hand-Object Interactions for Dexterous Robot Manipulation

1Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI),   2Technical University of Munich (TUM),   3National University of Singapore (NUS),   4Westlake University

FlowHOI generates hand-object interaction (HOI) motions conditioned on an egocentric observation, a text command, and 3D scene context. The generated HOI sequence can be retargeted to robot hands for execution in the real world.

Abstract

Recent vision-language-action (VLA) models can generate plausible end-effector motions, yet they often fail in long-horizon, contact-rich tasks because the underlying hand-object interaction (HOI) structure is not explicitly represented. An embodiment-agnostic interaction representation that captures this structure would make manipulation behaviors easier to validate and transfer across robots.

We propose FlowHOI, a two-stage flow-matching framework that generates semantically grounded, temporally coherent HOI sequences, comprising hand poses, object poses, and hand-object contact states, conditioned on an egocentric observation, a language instruction, and a 3D Gaussian splatting (3DGS) scene reconstruction. We decouple geometry-centric grasping from semantics-centric manipulation, conditioning the latter on compact 3D scene tokens and training it with a motion–text alignment loss, so that the generated interactions are grounded in both the physical scene layout and the language instruction.

To address the scarcity of high-fidelity HOI supervision, we introduce a reconstruction pipeline that recovers aligned hand-object trajectories and meshes from large-scale egocentric videos, yielding an HOI prior for robust generation. Across the GRAB and HOT3D benchmarks, FlowHOI achieves the highest action recognition accuracy among all compared methods and a 1.7× higher physics-simulation success rate than the strongest diffusion-based baseline, while delivering a 40× inference speedup. We further demonstrate real-robot execution on four dexterous manipulation tasks, showing that the generated HOI representations can be retargeted to physical robot hands.
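
To make the generation mechanism concrete, the sketch below shows one conditional flow-matching training step and a few-step Euler sampler in PyTorch, the latter being the source of the inference speedup over iterative diffusion sampling. It assumes a rectified-flow (linear-path) formulation; the model signature, tensor shapes, and step count are illustrative rather than FlowHOI's exact design, and the motion–text alignment term is omitted.

import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1, cond):
    # x1:   ground-truth HOI sequence of shape (B, T, D): per-frame hand pose,
    #       object pose, and contact state flattened into D channels.
    # cond: conditioning tokens (scene tokens, text embedding, observation).
    x0 = torch.randn_like(x1)                            # noise endpoint
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)  # per-sample time in [0, 1)
    xt = (1.0 - t) * x0 + t * x1                         # linear interpolation path
    v_target = x1 - x0                                   # constant target velocity
    v_pred = model(xt, t.view(-1), cond)                 # predicted velocity field
    return F.mse_loss(v_pred, v_target)

@torch.no_grad()
def sample(model, cond, shape, num_steps=10, device="cpu"):
    # Integrate the learned velocity field from noise (t = 0) to data (t = 1)
    # with a few explicit Euler steps; this needs far fewer network evaluations
    # than a typical diffusion sampler, which is where the speedup comes from.
    x = torch.randn(shape, device=device)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0],), i * dt, device=device)
        x = x + dt * model(x, t, cond)
    return x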

Pipeline Overview


Given an egocentric observation, a text command, and 3D scene context, FlowHOI generates hand-object interaction motions through a two-stage pipeline:

  • Stage 1: Grasping — Generates hand motion to approach and grasp the object. A pretrained grasping prior, fine-tuned on high-fidelity HOI data reconstructed from large-scale egocentric videos, provides contact-stable initializations.
  • Stage 2: Manipulation — Produces the subsequent interaction sequence conditioned on scene context and language instructions, ensuring semantically grounded and physically plausible object state changes.
  • Scene Conditioning — Both stages leverage a Diffusion Transformer (DiT) with compact 3D scene tokens, fused from geometric and semantic features via a gated fusion mechanism, to anchor the generated motions in the physical scene layout (a sketch of this fusion follows the list).
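
As referenced above, the following is a minimal sketch of what such a gated fusion module could look like, assuming per-point geometric features (e.g. from the 3DGS reconstruction) and per-point semantic features over the same point set; the sigmoid gate, the attention pooling, and all layer sizes are illustrative assumptions, not FlowHOI's actual implementation.

import torch
import torch.nn as nn

class GatedSceneFusion(nn.Module):
    # Fuses per-point geometric and semantic scene features with a sigmoid
    # gate, then pools them into a fixed number of compact scene tokens via
    # learned-query cross-attention. token_dim must be divisible by num_heads.
    def __init__(self, geo_dim, sem_dim, token_dim, num_tokens, num_heads=4):
        super().__init__()
        self.geo_proj = nn.Linear(geo_dim, token_dim)
        self.sem_proj = nn.Linear(sem_dim, token_dim)
        self.gate = nn.Sequential(nn.Linear(2 * token_dim, token_dim), nn.Sigmoid())
        self.queries = nn.Parameter(torch.randn(num_tokens, token_dim))
        self.pool = nn.MultiheadAttention(token_dim, num_heads, batch_first=True)

    def forward(self, geo_feat, sem_feat):
        # geo_feat: (B, N, geo_dim) geometry features, e.g. from 3DGS points.
        # sem_feat: (B, N, sem_dim) semantic features for the same points.
        g = self.geo_proj(geo_feat)
        s = self.sem_proj(sem_feat)
        alpha = self.gate(torch.cat([g, s], dim=-1))   # per-point gate in (0, 1)
        fused = alpha * g + (1.0 - alpha) * s          # gated mix of both streams
        q = self.queries.unsqueeze(0).expand(geo_feat.shape[0], -1, -1)
        tokens, _ = self.pool(q, fused, fused)         # (B, num_tokens, token_dim)
        return tokens

For example, GatedSceneFusion(64, 512, 256, 32) would compress N scene points into 32 tokens of width 256 for the DiT to attend over as conditioning.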

Qualitative Results

HOI Generation on GRAB

"The person grabs the mug from the container and drinks a few sips, then sets it back down on the table, always using their right hand."

HOI Generation on HOT3D

"The person picks up the whiteboard eraser with their right hand from the table, uses it to erase markings on the whiteboard, and then places it back on the table in its original position using the same hand."

Real-world Application

"The person grabs the cup and drinks, then sets it back down on the table, always using their right hand."

"The person picks up the bottle of ranch dressing from the black platform, using their left hand, pours it onto the plate, then moves it to the right side."

More Qualitative Results

"The person picks up the train with their right hand, passes it to their left hand, investigates it, then rides the train on a track by pushing it through the air with their left hand, and finally puts it down on the table with their left hand."

"The person picks up the mug, toasts it with others across the table, drinks from it, and then places the cup on the table, all with their right hand."

"The person picks up the birdhouse toy from the small round table using their right hand, inspects it briefly while holding it, and then places it back down on the same table with their right hand."

"The person picks up the can of parmesan cheese from the shelf using their right hand, inspects it closely by rotating and examining it with both hands, and then holds it upright in front of them."

More Real-world Results

"The person picks up the milk carton from the table with their left hand and then pours it."

"The person picks up the flask from the table using their right hand, pours its contents, and then places the flask back onto the original table with right hand."

BibTeX

@article{zeng2026flowhoi,
  title  = {{FlowHOI}: Flow-based Semantics-Grounded Generation of Hand-Object Interactions for Dexterous Robot Manipulation},
  author = {Zeng, Huajian and Chen, Lingyun and Yang, Jiaqi and Zhang, Yuantai and Shi, Fan and Liu, Peidong and Zuo, Xingxing},
  year   = {2026},
}