Obstruction reasoning for robotic grasping

1Fondazione Bruno Kessler, 2University of Trento

CVPR 2026


Not only spatial reasoning but also obstruction reasoning


Teaser Image

UNOGrasp performs multi-step obstruction reasoning for robotic grasping in cluttered scenes. Given an RGB-D image and a natural-language goal (e.g., grasp the white iphone box), UNOGrasp reasons over and grounds spatial information to infer the sequence of steps needed to unobstruct the requested object. We also introduce UNOBench to comprehensively benchmark obstruction reasoning.


Abstract

Successful robotic grasping in cluttered environments not only requires a model to visually ground a target object but also to reason about obstructions that must be cleared beforehand. While current vision-language embodied reasoning models show emergent spatial understanding, they remain limited in terms of obstruction reasoning and accessibility planning. To bridge this gap, we present UNOGrasp, a learning-based vision-language model capable of performing visually-grounded obstruction reasoning to infer the sequence of actions needed to unobstruct the path and grasp the target object. We devise a novel multi-step reasoning process based on obstruction paths originating from the target object, and anchor each reasoning step with obstruction-aware visual cues to incentivize reasoning capability. UNOGrasp combines supervised and reinforcement finetuning through verifiable reasoning rewards. Moreover, we construct UNOBench, a large-scale dataset for both training and benchmarking, based on MetaGraspNetV2, with over 100k obstruction paths annotated by humans with obstruction ratios, contact points, and natural-language instructions. Extensive experiments and real-robot evaluations show that UNOGrasp significantly improves obstruction reasoning and grasp success across both synthetic and real-world environments, outperforming generalist and proprietary alternatives.
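To make the notion of an obstruction path concrete, the sketch below shows how a removal order could be read off a hand-specified obstruction graph. It is an illustrative Python toy (the unobstruction_sequence helper, the object names, and the graph itself are assumptions for this example), not UNOGrasp's learned reasoning.

from collections import defaultdict

def unobstruction_sequence(obstructs, target):
    """Return objects to remove (obstructors first) before grasping `target`."""
    blocked_by = defaultdict(list)              # object -> objects obstructing it
    for blocker, blocked in obstructs:          # edge (blocker, blocked)
        blocked_by[blocked].append(blocker)

    order, seen = [], set()
    def visit(obj):                             # depth-first: clear blockers first
        for blocker in blocked_by[obj]:
            if blocker not in seen:
                seen.add(blocker)
                visit(blocker)
                order.append(blocker)

    visit(target)
    return order + [target]

# Toy scene: the screwdriver lies on a box that in turn covers the iphone box.
edges = [("screwdriver", "box"), ("box", "white iphone box")]
print(unobstruction_sequence(edges, "white iphone box"))
# -> ['screwdriver', 'box', 'white iphone box']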


Benchmark: UNOBench

Overview of UNOBench

UNOBench features two unique characteristics: (i) human-annotated free-form language instructions about objects in cluttered bins, and (ii) per-bin obstruction graphs for grounded spatial reasoning. Human annotators, recruited through the Prolific platform, refined the initial GPT-4o-generated annotations. UNOBench covers three levels of difficulty and introduces novel evaluation metrics.
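For intuition, a single UNOBench sample might be organized as sketched below. The field names (scene_id, obstruction_graph, contact_points, ...) and values are assumptions made for illustration and do not necessarily match the released format.

sample = {
    "scene_id": "bin_000123",
    "image": "rgb/000123.png",                        # RGB frame; depth stored alongside
    "instruction": "grasp the white iphone box",      # human-refined free-form goal
    "difficulty": "hard",                             # one of the three difficulty levels
    "obstruction_graph": [                            # per-bin graph: blocker -> blocked
        {"blocker": "screwdriver", "blocked": "box", "obstruction_ratio": 0.4},
        {"blocker": "box", "blocked": "white iphone box", "obstruction_ratio": 0.7},
    ],
    "contact_points": {                               # annotated contact points (pixels)
        "screwdriver": [412, 233],
        "box": [380, 260],
        "white iphone box": [351, 289],
    },
}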


Method: UNOGrasp

Architecture of UNOGrasp
UNOGrasp is a VLM trained through supervised fine-tuning (SFT) on UNOBench to learn structured obstruction-path reasoning, and through GRPO-based reinforcement fine-tuning (RFT) to further boost its reasoning ability using outcome-driven IoU and format rewards. At inference time, given an RGB image and a target object specified as a language instruction, UNOGrasp reasons over multiple obstruction paths (<think> traces) and directly outputs the sequence of actions (<answer>) required to remove obstructions and grasp the target.
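As a rough illustration of the outcome-driven rewards mentioned above, the sketch below implements a format reward that checks the <think>/<answer> template and a standard bounding-box IoU reward. The exact reward definitions used to train UNOGrasp may differ; this is only an assumed instantiation.

import re

def format_reward(completion: str) -> float:
    """1.0 if the output follows the <think>...</think><answer>...</answer> template."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), flags=re.DOTALL) else 0.0

def iou_reward(pred_box, gt_box) -> float:
    """Intersection-over-union between predicted and annotated boxes (x1, y1, x2, y2)."""
    x1, y1 = max(pred_box[0], gt_box[0]), max(pred_box[1], gt_box[1])
    x2, y2 = min(pred_box[2], gt_box[2]), min(pred_box[3], gt_box[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area(pred_box) + area(gt_box) - inter
    return inter / union if union > 0 else 0.0

# A per-sample GRPO reward could then combine both terms, e.g.
# reward = format_reward(output_text) + iou_reward(predicted_box, annotated_box)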

Real-world Demo


Qualitative Results

Qualitative results under different splits. SR-F1 / MP_NED scores are reported at the bottom of each image.

Qualitative splits

Scenario with unseen objects. Although our model is trained on top-down views, it can effectively handle novel frontal views, suggesting its generalization potential.

Front-view generalization

More examples


BibTeX

@inproceedings{jiao2025obstruction,
  author    = {Runyu Jiao and Matteo Bortolon and Francesco Giuliari and Alice Fasoli and Sergio Povoli and Guofeng Mei and Yiming Wang and Fabio Poiesi},
  title     = {Obstruction reasoning for robotic grasping},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference},
  year      = {2026}
}