Free-form Language-based Robotic Reasoning and Grasping

Equal Contribution

1Fondazione Bruno Kessler, 2University of Trento, 3Istituto Italiano di Tecnologia
Teaser Image

To enable a human to command a robot using free-form language instructions, our method leverages the world knowledge of Vision-Language Models to interpret instructions and reason about object spatial relationships. This is important when the target object is not directly graspable, requiring the robot to first identify and remove the obstructing objects (🟢). By optimizing the sequence of actions, our approach ensures efficient task completion.


Abstract

Performing robotic grasping from a cluttered bin based on human instructions is a challenging task, as it requires understanding both the nuances of free-form language and the spatial relationships between objects. Vision-Language Models (VLMs) trained on web-scale data, such as GPT-4o, have demonstrated remarkable reasoning capabilities across both text and images. But can they truly be used for this task in a zero-shot setting? And what are their limitations?

In this paper, we explore these research questions via the free-form language-based robotic grasping task, and propose a novel method, FreeGrasp, leveraging the pre-trained VLMs’ world knowledge to reason about human instructions and object spatial arrangements.

Our method detects all objects as keypoints and uses these keypoints to annotate marks on the image, aiming to facilitate GPT-4o's zero-shot spatial reasoning. This allows our method to determine whether a requested object is directly graspable or whether other objects must be grasped and removed first. Since no existing dataset is specifically designed for this task, we introduce a synthetic dataset, FreeGraspData, by extending the MetaGraspNetV2 dataset with human-annotated instructions and ground-truth grasping sequences. We conduct extensive analyses on FreeGraspData and validate our method in the real world with a gripper-equipped robotic arm, demonstrating state-of-the-art performance in grasp reasoning and execution.

Method

Architecture of FreeGrasp

Our method takes two inputs: (i) an RGB-D image capturing the 3D scene from a top-down view, and (ii) a free-form text instruction from the user.

Given the RGB observation, the object localization module first localizes all objects within the container as keypoints, forming a holistic understanding of the scene. Notably, the user can also specify an operation area beyond the container, enhancing the method's robustness in complex scenarios. Specifically, this localization is achieved by prompting Molmo.
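As an illustration, the sketch below shows one way this keypoint localization could be wired up, prompting Molmo through Hugging Face Transformers and parsing its point-style output with a regular expression. The checkpoint name, prompt wording, and the assumption that Molmo returns coordinates as percentages of the image size are ours, not details taken from the paper.

```python
import re
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

# Load Molmo; usage follows the public model card and may differ across versions.
MODEL_ID = "allenai/Molmo-7B-D-0924"
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

def point_to_objects(image: Image.Image,
                     prompt: str = "Point to every object inside the bin."):
    """Ask Molmo to point at all objects; return pixel keypoints as (x, y) tuples."""
    inputs = processor.process(images=[image], text=prompt)
    inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}
    output = model.generate_from_batch(
        inputs,
        GenerationConfig(max_new_tokens=256, stop_strings="<|endoftext|>"),
        tokenizer=processor.tokenizer,
    )
    text = processor.tokenizer.decode(
        output[0, inputs["input_ids"].size(1):], skip_special_tokens=True
    )
    # Molmo emits points such as <point x="43.5" y="71.2" alt="banana">banana</point>;
    # we assume coordinates are percentages of the image size and convert to pixels.
    return [
        (float(x) / 100 * image.width, float(y) / 100 * image.height)
        for x, y in re.findall(r'x\d*="([\d.]+)"\s+y\d*="([\d.]+)"', text)
    ]
```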

To facilitate visual spatial reasoning in VLMs, we augment the visual prompt by annotating an identity (ID) mark for each localized object. We then feed the mark-based visual prompt to GPT-4o, which reasons about spatial relationships. This reasoning involves two key aspects: (i) identifying the object ID and class name corresponding to the user's free-form instruction, and (ii) determining whether the target object is obstructed and, if so, planning to remove the first graspable obstructing object. GPT-4o's final output is the ID and class name of the next object to grasp.
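A minimal sketch of the mark-based visual prompting and the GPT-4o reasoning step is given below, using the OpenAI Python SDK. The mark rendering, prompt text, and JSON answer format are illustrative assumptions rather than the exact prompts used by FreeGrasp.

```python
import base64
import io
import json

from openai import OpenAI
from PIL import Image, ImageDraw

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def annotate_marks(image: Image.Image, keypoints) -> Image.Image:
    """Overlay a numeric ID mark on each detected keypoint (mark-based visual prompt)."""
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for obj_id, (x, y) in enumerate(keypoints):
        draw.ellipse([x - 12, y - 12, x + 12, y + 12], fill="white", outline="black")
        draw.text((x - 6, y - 8), str(obj_id), fill="black")
    return marked

def reason_next_grasp(marked_image: Image.Image, instruction: str) -> dict:
    """Ask GPT-4o which marked object to grasp next, given the user instruction."""
    buf = io.BytesIO()
    marked_image.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()
    prompt = (
        f'User instruction: "{instruction}". Identify the target among the numbered '
        "marks. If it is obstructed, choose the first graspable obstructing object "
        'instead. Answer in JSON as {"id": <int>, "name": "<class name>"}.'
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)
```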

Using the class name and the keypoint coordinates associated with the selected ID, the object is segmented unambiguously. Finally, the grasp estimation module determines the most suitable 6-DoF grasp pose for picking the segmented object.
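Putting the pieces together, one plausible top-level step could look like the following sketch. Here, segment_object and estimate_grasp_6dof are hypothetical placeholders for the segmentation and grasp estimation modules, while point_to_objects, annotate_marks, and reason_next_grasp refer to the sketches above.

```python
from typing import Tuple

import numpy as np

def segment_object(rgb, point: Tuple[float, float], class_name: str) -> np.ndarray:
    """Hypothetical point-prompted segmenter; returns a binary object mask."""
    raise NotImplementedError

def estimate_grasp_6dof(rgb, depth, mask: np.ndarray) -> np.ndarray:
    """Hypothetical grasp estimator; returns a 4x4 gripper pose in the camera frame."""
    raise NotImplementedError

def free_grasp_step(rgb, depth, instruction: str):
    """One perception-reasoning-grasping step assembling the modules described above."""
    keypoints = point_to_objects(rgb)                 # keypoint localization (Molmo)
    marked = annotate_marks(rgb, keypoints)           # mark-based visual prompt
    choice = reason_next_grasp(marked, instruction)   # GPT-4o spatial reasoning
    x, y = keypoints[choice["id"]]
    mask = segment_object(rgb, (x, y), choice["name"])
    grasp_pose = estimate_grasp_6dof(rgb, depth, mask)
    return choice, grasp_pose
```

If the selected object is an obstruction rather than the requested target, the step can simply be repeated after each grasp until GPT-4o selects the target itself; this looping behavior is our assumption about how the pipeline is run to completion.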


Video


Free-form language grasping dataset

FreeGraspData is built upon MetaGraspNetV2 to evaluate robotic reasoning and grasping with free-form language instructions. Grasping difficulty is categorized into six levels based on obstruction level and instance ambiguity—where ambiguity refers to scenes with multiple objects of the same category as the target, such as multiple apples when targeting one.
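For concreteness, the snippet below sketches what a single FreeGraspData sample might look like once loaded. The field names, the example instruction, and the encoding of the six difficulty levels are illustrative assumptions for exposition, not the dataset's actual schema.

```python
# Hypothetical structure of one FreeGraspData sample; all field names and values
# are illustrative, not the dataset's actual schema.
sample = {
    "scene_id": "metagraspnetv2_scene_0000",     # MetaGraspNetV2 scene the sample extends
    "instruction": "Pick up the apple next to the blue box.",  # free-form, human-written
    "target": {"id": 7, "category": "apple"},    # user-described target object
    "gt_grasp_sequence": [3, 7],                 # obstructing object(s) first, then the target
    # Six levels = three obstruction tiers crossed with with/without instance ambiguity
    # (our reading of the categorization described above).
    "difficulty": {"obstruction": "hard", "ambiguous": True},
}
```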

Why FreeGraspData?
Compared to existing relational grasp datasets (e.g. REGRAD) or instruction grasp detection datasets (e.g. GraspAnything), FreeGraspData offers unique challenges:

  • Complex bin-picking scenarios with diverse objects, deep occlusion graphs, and random placements, making spatial reasoning highly challenging for vision-language models.
  • A fine-grained difficulty classification based on obstruction level and instance ambiguity, enabling better evaluation of model reasoning capabilities.
  • Free-form language instructions collected from human annotators, enhancing natural language understanding in grasping tasks.
Example categories: grounding open-vocabulary object names and attributes; grounding complex relations between objects; grounding user affordances with contextual semantics.

Examples of FreeGraspData at different task difficulties with three user-provided instructions.
A marker highlights the user-described target object, and 🟢 marks the GT object(s) to pick.

Real-world setup


Samples from real-world experiments for different task difficulties.
A marker highlights the user-described target object, and 🟢 marks the GT object(s) to pick.


Example of the complete method

Method Visualization

BibTeX

@article{FreeGrasp2025,
  author    = {Jiao, Runyu and Fasoli, Alice and Giuliari, Francesco and Bortolon, Matteo and Povoli, Sergio and Mei, Guofeng and Wang, Yiming and Poiesi, Fabio},
  title     = {Free-form Language-based Robotic Reasoning and Grasping},
  year      = {2025}
}