Performing robotic grasping from a cluttered bin based on human instructions is a challenging task, as it requires understanding both the nuances of free-form language and the spatial relationships between objects. Vision-Language Models (VLMs) trained on web-scale data, such as GPT-4o, have demonstrated remarkable reasoning capabilities across both text and images. But can they truly be used for this task in a zero-shot setting? And what are their limitations?
In this paper, we explore these research questions via the free-form language-based robotic grasping task, and propose a novel method, FreeGrasp, leveraging the pre-trained VLMs’ world knowledge to reason about human instructions and object spatial arrangements.
Our method detects all objects as keypoints and uses these keypoints to annotate marks on images, aiming to facilitate GPT-4o's zero-shot spatial reasoning. This allows our method to determine whether a requested object is directly graspable or whether other objects must be grasped and removed first. Since no existing dataset is specifically designed for this task, we introduce a synthetic dataset, FreeGraspData, by extending the MetaGraspNetV2 dataset with human-annotated instructions and ground-truth grasping sequences. We conduct extensive analyses on FreeGraspData and real-world validation with a gripper-equipped robotic arm, demonstrating state-of-the-art performance in grasp reasoning and execution.
Our method takes two inputs: (i) an RGB-D image capturing the 3D scene from a top-down view, and (ii) a free-form text instruction from the user.
From the RGB observation, the module first localizes all objects within the container by prompting Molmo, forming a holistic understanding of the scene. Notably, the user can also specify an operation area beyond the container through the prompt, enhancing the method's robustness in complex scenarios.
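To make the localization step concrete, below is a minimal Python sketch of how per-object keypoints could be recovered from Molmo's pointing-style output. The prompt wording, the query_molmo helper, and the assumption that Molmo returns <point>/<points> tags with coordinates given as percentages of the image size are illustrative assumptions rather than the exact implementation used in the paper.

import re
from typing import List, Tuple

# Hypothetical helper: in practice this would run Molmo on the image with a
# pointing prompt such as "Point to all objects in the bin." and return the
# raw text produced by the model.
def query_molmo(image, prompt: str) -> str:
    raise NotImplementedError("Wrap your Molmo inference call here.")

def parse_molmo_points(text: str, width: int, height: int) -> List[Tuple[str, float, float]]:
    """Parse Molmo-style <point .../> and <points .../> tags into (label, x, y) keypoints.

    Coordinates are assumed to be percentages of the image size and are
    converted to pixel coordinates.
    """
    keypoints = []
    # Single-point tags: <point x="12.3" y="45.6" alt="apple">apple</point>
    for m in re.finditer(r'<point x="([\d.]+)" y="([\d.]+)"[^>]*>([^<]*)</point>', text):
        x, y, label = float(m.group(1)), float(m.group(2)), m.group(3).strip()
        keypoints.append((label, x / 100.0 * width, y / 100.0 * height))
    # Multi-point tags: <points x1="..." y1="..." x2="..." y2="..." ...>label</points>
    for m in re.finditer(r'<points ([^>]*)>([^<]*)</points>', text):
        attrs = dict(re.findall(r'(\w+)="([\d.]+)"', m.group(1)))
        label = m.group(2).strip()
        i = 1
        while f"x{i}" in attrs and f"y{i}" in attrs:
            keypoints.append((label,
                              float(attrs[f"x{i}"]) / 100.0 * width,
                              float(attrs[f"y{i}"]) / 100.0 * height))
            i += 1
    return keypoints

# Example with a mocked Molmo response for a 640x480 image.
sample = '<point x="25.0" y="40.0" alt="apple">apple</point>'
print(parse_molmo_points(sample, width=640, height=480))
# -> [('apple', 160.0, 192.0)]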
To facilitate visual spatial reasoning in VLMs, we augment the visual prompt by annotating an identity (ID) mark for each localized object. We then feed the mark-based visual prompt to GPT-4o, which reasons about spatial relationships. This reasoning involves two key steps: (i) identifying the ID and class name of the object referred to by the user's free-form instruction; and (ii) determining whether the target object is obstructed and, if so, planning to remove the first graspable obstructing object. GPT-4o's final output is the ID and class name of the next object to grasp.
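A minimal sketch of the mark-based visual prompting and reasoning step is given below, assuming the OpenAI Python client for GPT-4o. The ID-mark drawing, the prompt text, and the JSON response schema (target ID, next grasp ID, class name) are assumptions made for illustration; the exact prompts used by FreeGrasp may differ.

import base64
import io
import json
from PIL import Image, ImageDraw
from openai import OpenAI  # pip install openai

def draw_id_marks(image: Image.Image, keypoints) -> Image.Image:
    """Overlay a numeric ID mark at each detected keypoint (label, x, y)."""
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for obj_id, (_, x, y) in enumerate(keypoints):
        draw.ellipse([x - 12, y - 12, x + 12, y + 12], fill="red")
        draw.text((x - 6, y - 8), str(obj_id), fill="white")
    return marked

def reason_next_grasp(marked: Image.Image, instruction: str) -> dict:
    """Ask GPT-4o which marked object to grasp next, given the user instruction."""
    buf = io.BytesIO()
    marked.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()

    # Illustrative prompt; the prompt used by FreeGrasp may differ.
    prompt = (
        f"The image shows a bin of objects, each annotated with a numeric ID mark.\n"
        f"User instruction: '{instruction}'.\n"
        "1) Identify the ID and class name of the requested target object.\n"
        "2) If the target is obstructed, return instead the ID and class name of the "
        "first graspable obstructing object to remove.\n"
        'Answer in JSON as {"target_id": int, "grasp_id": int, "class_name": str}.'
    )
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)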
Using the class name and the keypoint coordinates associated with the returned ID, the object to grasp is segmented unambiguously. Finally, the grasp estimation module determines the most suitable 6-DoF grasp pose for picking the segmented object.
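As one possible instantiation of these two modules, the sketch below prompts Segment Anything (SAM) with the selected object's keypoint and passes the resulting mask to a grasp estimator. Using SAM here and the estimate_6dof_grasps placeholder are assumptions for illustration; the segmentation and grasp estimation modules actually used may differ.

import numpy as np
from segment_anything import SamPredictor, sam_model_registry  # pip install segment-anything

def segment_selected_object(rgb: np.ndarray, keypoint, checkpoint="sam_vit_h_4b8939.pth"):
    """Segment the object to grasp from a single positive point prompt (x, y)."""
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(rgb)  # HxWx3 uint8 RGB image
    masks, scores, _ = predictor.predict(
        point_coords=np.array([keypoint], dtype=np.float32),  # [[x, y]]
        point_labels=np.array([1]),                           # 1 = foreground point
        multimask_output=True,
    )
    return masks[np.argmax(scores)]  # keep the highest-scoring mask

def pick_best_grasp(rgb, depth, mask, estimate_6dof_grasps):
    """Hypothetical grasp selection: score candidate 6-DoF grasps restricted to the mask.

    estimate_6dof_grasps(rgb, depth, mask) is a placeholder for any grasp
    estimation network returning (pose, score) candidates.
    """
    candidates = estimate_6dof_grasps(rgb, depth, mask)
    return max(candidates, key=lambda g: g[1])[0]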
FreeGraspData is built upon MetaGraspNetV2 to evaluate robotic reasoning and grasping with free-form language instructions. Grasping difficulty is categorized into six levels based on obstruction level and instance ambiguity—where ambiguity refers to scenes with multiple objects of the same category as the target, such as multiple apples when targeting one.
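For illustration, the sketch below encodes the six difficulty levels as three assumed obstruction tiers crossed with a binary ambiguity flag; this factorization is an assumption, and the authoritative level definitions are those shipped with FreeGraspData.

from dataclasses import dataclass
from enum import Enum

class Obstruction(Enum):
    """Assumed obstruction tiers: how blocked the target object is."""
    EASY = 0    # target directly graspable
    MEDIUM = 1  # a few obstructing objects
    HARD = 2    # heavily buried target

@dataclass(frozen=True)
class DifficultyLevel:
    obstruction: Obstruction
    ambiguous: bool  # True if the scene contains multiple same-category instances

# 3 obstruction tiers x 2 ambiguity settings = 6 difficulty levels.
ALL_LEVELS = [DifficultyLevel(o, a) for o in Obstruction for a in (False, True)]
assert len(ALL_LEVELS) == 6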
Why FreeGraspData?
Compared to existing relational grasping datasets (e.g., REGRAD) and instruction-based grasp detection datasets (e.g., GraspAnything), FreeGraspData poses unique challenges in instruction ambiguity and object obstruction.
Examples of FreeGraspData at different task difficulties, each with three user-provided instructions. ★ indicates the user-described target object, and 🟢 indicates the ground-truth (GT) object(s) to pick.
This work builds upon many excellent prior works.
@article{FreeGrasp2025,
  author = {Jiao, Runyu and Fasoli, Alice and Giuliari, Francesco and Bortolon, Matteo and Povoli, Sergio and Mei, Guofeng and Wang, Yiming and Poiesi, Fabio},
  title  = {Free-form Language-based Robotic Reasoning and Grasping},
  year   = {2025}
}