SynthFun3D: Action-guided generation of 3D functionality segmentation data

¹TeV - Fondazione Bruno Kessler, ²University of Trento, ³TU Wien, ⁴NVIDIA, ⁵ETH Zurich, ⁶MPI for Informatics, ⁷Stanford University, ⁸USI Lugano

SynthFun3D teaser

We introduce SynthFun3D, the first method for generating 3D functionality segmentation data directly from action descriptions. SynthFun3D constructs a plausible 3D scene by retrieving objects with part-level annotations from a large-scale asset repository and arranging them under spatial and semantic constraints. SynthFun3D renders multi-view images and automatically identifies the target functional element, producing precise ground-truth masks without manual annotation. Our synthetic data generation pipeline provides a scalable and effective complement to manual annotation for 3D functionality understanding.

Abstract

3D functionality segmentation aims to identify the interactive element in a 3D scene required to perform an action described in free-form language (e.g., the handle to "Open the second drawer of the cabinet near the bed"). Progress has been constrained by the scarcity of annotated real-world data, as collecting and labeling fine-grained 3D masks is prohibitively expensive. To address this limitation, we introduce SynthFun3D, the first method for generating 3D functionality segmentation data directly from action descriptions. Given an action description, SynthFun3D constructs a plausible 3D scene by retrieving objects with part-level annotations from a large-scale asset repository and arranging them under spatial and semantic constraints. SynthFun3D renders multi-view images and automatically identifies the target functional element, producing precise ground-truth masks without manual annotation. We demonstrate the effectiveness of the generated data by training a VLM-based 3D functionality segmentation model. Augmenting real-world data with our synthetic data consistently improves performance, with gains of +2.2 mAP, +6.3 mAR, and +5.7 mIoU over real-only training.

Method

Architecture of SynthFun3D
We design SynthFun3D as a training-free approach that leverages an LLM and a large-scale 3D asset database with part-level annotations to generate 3D scenes from textual descriptions. Inspired by Holodeck [1], an LLM generates the room layout and descriptions of the objects in the room from the input prompt. Then, for each object description, a retrieval pipeline finds the most suitable 3D asset. The object on which the action is to be performed is retrieved together with the mask of the functional element required to complete the action, and the retrieved objects are arranged according to the spatial relationships defined in the input prompt via a depth-first-search algorithm. Finally, multiple views are generated by rendering a random trajectory around the object of interest, and material augmentation is applied to increase the variety of the training data.
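The constraint-driven placement step can be illustrated with a minimal sketch: objects are placed one at a time by depth-first search over candidate positions, backtracking whenever a candidate violates a spatial constraint such as "near". All function and variable names below are hypothetical illustrations of the general technique, not the actual SynthFun3D implementation.

```python
from itertools import product

def overlaps(a, b):
    """Axis-aligned overlap test for (x, y, w, h) footprints."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def near(a, b, max_dist=2):
    """Illustrative 'near' constraint: Manhattan distance between centres."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return abs((ax + aw / 2) - (bx + bw / 2)) + \
           abs((ay + ah / 2) - (by + bh / 2)) <= max_dist

def place(objects, constraints, room=(10, 10), placed=None):
    """DFS placement: `objects` are (w, h) footprints, `constraints`
    maps an index pair (i, j) to a predicate that must hold between
    object i's box and the already-placed box of object j."""
    placed = placed or {}
    if len(placed) == len(objects):
        return placed               # all objects placed: success
    i = len(placed)
    w, h = objects[i]
    for x, y in product(range(room[0] - w + 1), range(room[1] - h + 1)):
        box = (x, y, w, h)
        if any(overlaps(box, other) for other in placed.values()):
            continue                # collision: try the next position
        if all(pred(box, placed[j])
               for (a, j), pred in constraints.items()
               if a == i and j in placed):
            result = place(objects, constraints, room, {**placed, i: box})
            if result:
                return result       # deeper placement succeeded
    return None                     # backtrack
```

For example, `place([(3, 2), (2, 1)], {(1, 0): near})` places a 3x2 "bed" and then a 2x1 "cabinet" constrained to lie near it, returning a collision-free layout.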

Qualitative scene generation results

Qualitative scene generation results

We show examples of ground-truth data generated by SynthFun3D from different textual descriptions. For each prompt, we show the original RGB image and the ground-truth mask of the functional element.

Performance on downstream task

We measure the performance of SynthFun3D by using its generated data (S) and its augmented version (A) to train a VLM to point at functional elements in images, following the approach of Fun3DU [2]. We test the trained model on real data from SceneFun3D [3]. The best performance is obtained by training with real data, our data, and the augmentation-enhanced frames (R+S+A). Our method generates training data that achieves comparable performance at a fraction of the cost required to collect and annotate real data for functionality understanding.
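The mIoU metric reported above compares predicted and ground-truth functional masks. As a minimal illustration of the standard intersection-over-union definition (the exact evaluation code is not given on this page, so this is only an assumed sketch):

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-union between two boolean masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0  # two empty masks agree perfectly

# Toy example: a 4-pixel prediction against a 6-pixel ground truth.
pred = np.zeros((4, 4), dtype=bool); pred[:2, :2] = True
gt   = np.zeros((4, 4), dtype=bool); gt[:2, :3] = True
# intersection = 4, union = 6, so IoU = 4/6 ≈ 0.667
```

mIoU is then the mean of this score over all evaluated action descriptions.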

Performance on downstream task

Related work

[1] Yang, Yue, et al. "Holodeck: Language guided generation of 3d embodied ai environments." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[2] Corsetti, Jaime, et al. "Functionality understanding and segmentation in 3D scenes." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2025.

[3] Delitzas, Alexandros, et al. "Scenefun3D: Fine-grained functionality and affordance understanding in 3D scenes." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Citation

If you find SynthFun3D useful for your work, please cite:
BibTeX
  @inproceedings{corsetti2026synthfun3d,
    title={Action-guided generation of 3D functionality segmentation data},
    author={Corsetti, Jaime and Giuliari, Francesco and Boscaini, Davide and Hermosilla, Pedro and Pilzer, Andrea and Mei, Guofeng and Delitzas, Alexandros and Engelmann, Francis and Poiesi, Fabio},
    booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)},
    year={2026}
  }