Performing robotic grasping from a cluttered bin based on human instructions is a challenging task, as it requires understanding both the nuances of free-form language and the spatial relationships between objects. Vision-Language Models (VLMs) trained on web-scale data, such as GPT-4o, have demonstrated remarkable reasoning capabilities across both text and images. But can they truly be used for this task in a zero-shot setting? And what are their limitations?
In this paper, we explore these research questions via the free-form language-based robotic grasping task, and propose a novel method, FreeGrasp, leveraging the pre-trained VLMs’ world knowledge to reason about human instructions and object spatial arrangements.
Our method detects all objects as keypoints and uses these keypoints to annotate marks on images, aiming to facilitate GPT-4o's zero-shot spatial reasoning. This allows our method to determine whether a requested object is directly graspable or whether other objects must be grasped and removed first. Since no existing dataset is specifically designed for this task, we introduce a synthetic dataset, FreeGraspData, by extending the MetaGraspNetV2 dataset with human-annotated instructions and ground-truth grasping sequences. We conduct extensive analyses on FreeGraspData and real-world validation with a gripper-equipped robotic arm, demonstrating state-of-the-art performance in grasp reasoning and execution.
Our method takes two inputs: (i) an RGB-D image capturing the 3D scene from a top-down view, and (ii) a free-form text instruction from the user.
From the RGB observation, the module first localizes all objects within the container by prompting Molmo, forming a holistic understanding of the scene. Notably, the user can also specify an operation area beyond the container through the prompt, enhancing the method's robustness in complex scenarios.
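To make the localization step concrete, below is a minimal Python sketch of how per-object keypoints could be recovered from Molmo's pointing-style output. The prompt wording, the query_molmo helper, and the assumption that Molmo returns <point>/<points> tags with coordinates given as percentages of the image size are illustrative assumptions rather than the exact implementation used in the paper.

import re
from typing import List, Tuple

# Hypothetical helper: in practice this would run Molmo on the image with a
# pointing prompt such as "Point to all objects in the bin." and return the
# raw text produced by the model.
def query_molmo(image, prompt: str) -> str:
    raise NotImplementedError("Wrap your Molmo inference call here.")

def parse_molmo_points(text: str, width: int, height: int) -> List[Tuple[str, float, float]]:
    """Parse Molmo-style <point .../> and <points .../> tags into (label, x, y) keypoints.

    Coordinates are assumed to be percentages of the image size and are
    converted to pixel coordinates.
    """
    keypoints = []
    # Single-point tags: <point x="12.3" y="45.6" alt="apple">apple</point>
    for m in re.finditer(r'<point x="([\d.]+)" y="([\d.]+)"[^>]*>([^<]*)</point>', text):
        x, y, label = float(m.group(1)), float(m.group(2)), m.group(3).strip()
        keypoints.append((label, x / 100.0 * width, y / 100.0 * height))
    # Multi-point tags: <points x1="..." y1="..." x2="..." y2="..." ...>label</points>
    for m in re.finditer(r'<points ([^>]*)>([^<]*)</points>', text):
        attrs = dict(re.findall(r'(\w+)="([\d.]+)"', m.group(1)))
        label = m.group(2).strip()
        i = 1
        while f"x{i}" in attrs and f"y{i}" in attrs:
            keypoints.append((label,
                              float(attrs[f"x{i}"]) / 100.0 * width,
                              float(attrs[f"y{i}"]) / 100.0 * height))
            i += 1
    return keypoints

# Example with a mocked Molmo response for a 640x480 image.
sample = '<point x="25.0" y="40.0" alt="apple">apple</point>'
print(parse_molmo_points(sample, width=640, height=480))
# -> [('apple', 160.0, 192.0)]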
To facilitate visual spatial reasoning in VLMs, we augment the visual prompt by annotating an identity (ID) mark for each localized object. We then feed the mark-based visual prompt to GPT-4o, which reasons about spatial relationships. This reasoning involves two key steps: (i) identifying the ID and class name of the object referred to by the user's free-form instruction; and (ii) determining whether the target object is obstructed and, if so, planning to remove the first graspable obstructing object. GPT-4o's final output is the ID and class name of the next object to grasp.
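A minimal sketch of the mark-based visual prompting and reasoning step is given below, assuming the OpenAI Python client for GPT-4o. The ID-mark drawing, the prompt text, and the JSON response schema (target ID, next grasp ID, class name) are assumptions made for illustration; the exact prompts used by FreeGrasp may differ.

import base64
import io
import json
from PIL import Image, ImageDraw
from openai import OpenAI  # pip install openai

def draw_id_marks(image: Image.Image, keypoints) -> Image.Image:
    """Overlay a numeric ID mark at each detected keypoint (label, x, y)."""
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for obj_id, (_, x, y) in enumerate(keypoints):
        draw.ellipse([x - 12, y - 12, x + 12, y + 12], fill="red")
        draw.text((x - 6, y - 8), str(obj_id), fill="white")
    return marked

def reason_next_grasp(marked: Image.Image, instruction: str) -> dict:
    """Ask GPT-4o which marked object to grasp next, given the user instruction."""
    buf = io.BytesIO()
    marked.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()

    # Illustrative prompt; the prompt used by FreeGrasp may differ.
    prompt = (
        f"The image shows a bin of objects, each annotated with a numeric ID mark.\n"
        f"User instruction: '{instruction}'.\n"
        "1) Identify the ID and class name of the requested target object.\n"
        "2) If the target is obstructed, return instead the ID and class name of the "
        "first graspable obstructing object to remove.\n"
        'Answer in JSON as {"target_id": int, "grasp_id": int, "class_name": str}.'
    )
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)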
Using the class name and the keypoint coordinates associated with the returned ID, the object to grasp is segmented unambiguously. Finally, the grasp estimation module determines the most suitable 6-DoF grasp pose for picking the segmented object.
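As one possible instantiation of these two modules, the sketch below prompts Segment Anything (SAM) with the selected object's keypoint and passes the resulting mask to a grasp estimator. Using SAM here and the estimate_6dof_grasps placeholder are assumptions for illustration; the segmentation and grasp estimation modules actually used may differ.

import numpy as np
from segment_anything import SamPredictor, sam_model_registry  # pip install segment-anything

def segment_selected_object(rgb: np.ndarray, keypoint, checkpoint="sam_vit_h_4b8939.pth"):
    """Segment the object to grasp from a single positive point prompt (x, y)."""
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(rgb)  # HxWx3 uint8 RGB image
    masks, scores, _ = predictor.predict(
        point_coords=np.array([keypoint], dtype=np.float32),  # [[x, y]]
        point_labels=np.array([1]),                           # 1 = foreground point
        multimask_output=True,
    )
    return masks[np.argmax(scores)]  # keep the highest-scoring mask

def pick_best_grasp(rgb, depth, mask, estimate_6dof_grasps):
    """Hypothetical grasp selection: score candidate 6-DoF grasps restricted to the mask.

    estimate_6dof_grasps(rgb, depth, mask) is a placeholder for any grasp
    estimation network returning (pose, score) candidates.
    """
    candidates = estimate_6dof_grasps(rgb, depth, mask)
    return max(candidates, key=lambda g: g[1])[0]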
FreeGraspData is built upon MetaGraspNetV2 to evaluate robotic reasoning and grasping with free-form language instructions. Grasping difficulty is categorized into six levels based on obstruction level and instance ambiguity—where ambiguity refers to scenes with multiple objects of the same category as the target, such as multiple apples when targeting one.
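For illustration, the sketch below encodes the six difficulty levels as three assumed obstruction tiers crossed with a binary ambiguity flag; this factorization is an assumption, and the authoritative level definitions are those shipped with FreeGraspData.

from dataclasses import dataclass
from enum import Enum

class Obstruction(Enum):
    """Assumed obstruction tiers: how blocked the target object is."""
    EASY = 0    # target directly graspable
    MEDIUM = 1  # a few obstructing objects
    HARD = 2    # heavily buried target

@dataclass(frozen=True)
class DifficultyLevel:
    obstruction: Obstruction
    ambiguous: bool  # True if the scene contains multiple same-category instances

# 3 obstruction tiers x 2 ambiguity settings = 6 difficulty levels.
ALL_LEVELS = [DifficultyLevel(o, a) for o in Obstruction for a in (False, True)]
assert len(ALL_LEVELS) == 6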
Why FreeGraspData?
Compared to existing relational grasping datasets (e.g., REGRAD) and instruction-based grasp detection datasets (e.g., GraspAnything), FreeGraspData poses unique challenges in instruction ambiguity and object obstruction.
Examples of FreeGraspData at different task difficulties, each with three user-provided instructions. ★ indicates the user-described target object, and 🟢 indicates the ground-truth (GT) object(s) to pick.
This work builds upon many excellent prior works.
@article{FreeGrasp2025,
  author = {Jiao, Runyu and Fasoli, Alice and Giuliari, Francesco and Bortolon, Matteo and Povoli, Sergio and Mei, Guofeng and Wang, Yiming and Poiesi, Fabio},
  title  = {Free-form Language-based Robotic Reasoning and Grasping},
  year   = {2025}
}