⚗️ Distilling 3D distinctive local descriptors for 6D pose estimation

¹TeV - Fondazione Bruno Kessler, ²University of Trento

We introduce dGedi, a 3D point cloud encoder trained by distilling GeDi features. GeDi suffers from slow inference because it processes points sequentially, first extracting local reference frames (LRFs) and then computing descriptors with PointNet++ (PN++). In contrast, dGedi retains GeDi's generalization and distinctiveness while being over 170 times faster, making it suitable for real-time robotics applications.

Abstract

Three-dimensional local descriptors are crucial for encoding geometric surface properties, making them essential for various point cloud understanding tasks. Among these descriptors, GeDi has demonstrated strong zero-shot 6D pose estimation capabilities but remains computationally impractical for real-world applications due to its expensive inference process.

Can we retain GeDi's effectiveness while significantly improving its efficiency? In this work, we explore this question by introducing a knowledge distillation framework that trains an efficient student model to regress local descriptors from a GeDi teacher. Our key contributions include: an efficient large-scale training procedure that ensures robustness to occlusions and partial observations while operating under compute and storage constraints, and a novel loss formulation that handles weak supervision from non-distinctive teacher descriptors.

We validate our approach on five BOP Benchmark datasets and demonstrate a significant reduction in inference time while maintaining competitive performance with existing methods, bringing zero-shot 6D pose estimation closer to real-time feasibility.

Method

Method Overview

We introduce a novel distillation procedure to transfer the distinctive properties of a slow teacher encoder to a faster student encoder. We refer to our distillation approach as object-oriented because training supervision is provided at the object level.

First, we extract teacher descriptors $\mathcal{F}^Q=\Phi_\Theta(\mathcal{P}^Q), \mathcal{F}^T=\Phi_\Theta(\mathcal{P}^T)$ using the GeDi encoder $\Phi_\Theta$. Then, we learn student descriptors $\mathcal{G}^Q=\Psi_\Omega(\mathcal{P}^Q), \mathcal{G}^T=\Psi_\Omega(\mathcal{P}^T)$ with a PTV3 encoder $\Psi_\Omega$. During distillation, we optimize the parameters $\Omega$ so that $\mathcal{G}^Q \approx \mathcal{F}^Q$ and $\mathcal{G}^T \approx \mathcal{F}^T$, while $\Theta$ remains frozen. Note that this differs from online knowledge distillation, where teacher and student are trained jointly and $\Theta$ and $\Omega$ are learned simultaneously.
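The following is a minimal PyTorch-style sketch of one distillation step under these definitions. It assumes both encoders expose a per-point descriptor interface (an $(N, 3)$ point cloud in, an $(N, D)$ descriptor matrix out); the function name `distillation_step` and the MSE regression objective are illustrative assumptions, not the exact training code.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, points_q, points_t, optimizer):
    """One distillation step: regress frozen teacher (GeDi) descriptors
    with the trainable student (PTV3) encoder."""
    with torch.no_grad():                      # teacher parameters Theta stay frozen
        feat_q_teacher = teacher(points_q)     # F^Q
        feat_t_teacher = teacher(points_t)     # F^T
    feat_q_student = student(points_q)         # G^Q
    feat_t_student = student(points_t)         # G^T

    # Regression objective: push G^Q -> F^Q and G^T -> F^T.
    loss = F.mse_loss(feat_q_student, feat_q_teacher) \
         + F.mse_loss(feat_t_student, feat_t_teacher)

    optimizer.zero_grad()
    loss.backward()                            # gradients only update Omega (student)
    optimizer.step()
    return loss.item()
```

In practice the teacher outputs would be precomputed and loaded from storage rather than recomputed at every step, as described below.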

In our case, the architectures of $\Phi_\Theta$ and $\Psi_\Omega$ are distinct. Teacher features are precomputed, stored in memory, and loaded as needed during distillation. To optimize this process, we propose storing teacher features only for query objects and introduce a module that leverages ground-truth 6D poses to transfer them to target objects. Moreover, we propose a custom loss function that focuses learning on noise-free points, leading to improved performance.
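Below is a hedged sketch of one plausible realization of this idea: teacher descriptors stored for the query object are carried over to target points via the ground-truth rotation `R` and translation `t`, and target points without a close aligned match are masked out of the loss. The nearest-neighbor transfer, the distance threshold `tau`, and the MSE form are assumptions for illustration, not the paper's exact module or loss.

```python
import torch

def transfer_teacher_descriptors(points_q, feats_q, points_t, R, t, tau=0.005):
    """Transfer precomputed teacher descriptors from the query object to the
    target point cloud using the ground-truth 6D pose (R, t).
    points_q: (Nq, 3) query points with stored teacher features feats_q: (Nq, D)
    points_t: (Nt, 3) target points (possibly occluded or cluttered)
    Returns per-target-point features (Nt, D) and a boolean mask marking points
    whose nearest aligned query point lies within tau (treated as noise-free)."""
    points_q_in_t = points_q @ R.T + t             # align query to the target frame
    dists = torch.cdist(points_t, points_q_in_t)   # (Nt, Nq) pairwise distances
    min_dist, nn_idx = dists.min(dim=1)            # nearest aligned query point
    feats_t = feats_q[nn_idx]                      # copy its teacher descriptor
    valid = min_dist < tau                         # drop points with no close match
    return feats_t, valid

def masked_descriptor_loss(student_feats, teacher_feats, valid):
    """Supervise the student only on noise-free target points."""
    if valid.sum() == 0:
        return student_feats.new_zeros(())
    return torch.nn.functional.mse_loss(student_feats[valid], teacher_feats[valid])
```

Restricting the loss to the `valid` mask is what focuses learning on noise-free points; everything outside the mask (background, occluders, sensor noise) contributes no gradient.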


Citation

If you find this work useful, please cite:
@article{hamza2025distilling3ddistinctivelocal,
  author    = {Hamza, Amir and Caraffa, Andrea and Boscaini, Davide and Poiesi, Fabio},
  title     = {Distilling 3D distinctive local descriptors for 6D pose estimation},
  journal   = {arXiv},
  year      = {2025},
}