⚗️ Distilling 3D distinctive local descriptors for 6D pose estimation

¹TeV - Fondazione Bruno Kessler, ²University of Trento

We introduce dGedi, a 3D point cloud encoder trained by distilling GeDi features. GeDi suffers from slow inference because it processes points sequentially, first extracting local reference frames (LRFs) and then computing descriptors with PointNet++ (PN++). In contrast, dGedi retains GeDi's generalization and distinctiveness while being over 170 times faster, making it suitable for real-time robotics applications.

Abstract

Three-dimensional local descriptors are crucial for encoding geometric surface properties, making them essential for various point cloud understanding tasks. Among these descriptors, GeDi has demonstrated strong zero-shot 6D pose estimation capabilities but remains computationally impractical for real-world applications due to its expensive inference process.

Can we retain GeDi's effectiveness while significantly improving its efficiency? In this work, we explore this question by introducing a knowledge distillation framework that trains an efficient student model to regress local descriptors from a GeDi teacher. Our key contributions include: an efficient large-scale training procedure that ensures robustness to occlusions and partial observations while operating under compute and storage constraints, and a novel loss formulation that handles weak supervision from non-distinctive teacher descriptors.

We validate our approach on five BOP Benchmark datasets and demonstrate a significant reduction in inference time while maintaining competitive performance with existing methods, bringing zero-shot 6D pose estimation closer to real-time feasibility.

Method

Method Overview

We introduce a novel distillation procedure to transfer the distinctive properties of a slow teacher encoder to a faster student encoder. We refer to our distillation approach as object-oriented because training supervision is provided at the object level.

First, we extract teacher descriptors $\mathcal{F}^Q=\Phi_\Theta(\mathcal{P}^Q), \mathcal{F}^T=\Phi_\Theta(\mathcal{P}^T)$ using the GeDi encoder $\Phi_\Theta$. Then, we learn student descriptors $\mathcal{G}^Q=\Psi_\Omega(\mathcal{P}^Q), \mathcal{G}^T=\Psi_\Omega(\mathcal{P}^T)$ with a PTV3 encoder $\Psi_\Omega$. During distillation, we optimize the parameters $\Omega$ so that $\mathcal{G}^Q \approx \mathcal{F}^Q$ and $\mathcal{G}^T \approx \mathcal{F}^T$, while $\Theta$ remains frozen. Note that this differs from online knowledge distillation, where teacher and student are trained jointly and $\Theta$ and $\Omega$ are learned simultaneously.
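The following is a minimal PyTorch-style sketch of one distillation step under these definitions. It assumes both encoders expose a per-point descriptor interface (an $(N, 3)$ point cloud in, an $(N, D)$ descriptor matrix out); the function name `distillation_step` and the MSE regression objective are illustrative assumptions, not the exact training code.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, points_q, points_t, optimizer):
    """One distillation step: regress frozen teacher (GeDi) descriptors
    with the trainable student (PTV3) encoder."""
    with torch.no_grad():                      # teacher parameters Theta stay frozen
        feat_q_teacher = teacher(points_q)     # F^Q
        feat_t_teacher = teacher(points_t)     # F^T
    feat_q_student = student(points_q)         # G^Q
    feat_t_student = student(points_t)         # G^T

    # Regression objective: push G^Q -> F^Q and G^T -> F^T.
    loss = F.mse_loss(feat_q_student, feat_q_teacher) \
         + F.mse_loss(feat_t_student, feat_t_teacher)

    optimizer.zero_grad()
    loss.backward()                            # gradients only update Omega (student)
    optimizer.step()
    return loss.item()
```

In practice the teacher outputs would be precomputed and loaded from storage rather than recomputed at every step, as described below.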

In our case, the architectures of $\Phi_\Theta$ and $\Psi_\Omega$ are distinct. Teacher features are precomputed, stored in memory, and loaded as needed during distillation. To optimize this process, we propose storing teacher features only for query objects and introduce a module that leverages ground-truth 6D poses to transfer them to target objects. Moreover, we propose a custom loss function that focuses learning on noise-free points, leading to improved performance.
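Below is a hedged sketch of one plausible realization of this idea: teacher descriptors stored for the query object are carried over to target points via the ground-truth rotation `R` and translation `t`, and target points without a close aligned match are masked out of the loss. The nearest-neighbor transfer, the distance threshold `tau`, and the MSE form are assumptions for illustration, not the paper's exact module or loss.

```python
import torch

def transfer_teacher_descriptors(points_q, feats_q, points_t, R, t, tau=0.005):
    """Transfer precomputed teacher descriptors from the query object to the
    target point cloud using the ground-truth 6D pose (R, t).
    points_q: (Nq, 3) query points with stored teacher features feats_q: (Nq, D)
    points_t: (Nt, 3) target points (possibly occluded or cluttered)
    Returns per-target-point features (Nt, D) and a boolean mask marking points
    whose nearest aligned query point lies within tau (treated as noise-free)."""
    points_q_in_t = points_q @ R.T + t             # align query to the target frame
    dists = torch.cdist(points_t, points_q_in_t)   # (Nt, Nq) pairwise distances
    min_dist, nn_idx = dists.min(dim=1)            # nearest aligned query point
    feats_t = feats_q[nn_idx]                      # copy its teacher descriptor
    valid = min_dist < tau                         # drop points with no close match
    return feats_t, valid

def masked_descriptor_loss(student_feats, teacher_feats, valid):
    """Supervise the student only on noise-free target points."""
    if valid.sum() == 0:
        return student_feats.new_zeros(())
    return torch.nn.functional.mse_loss(student_feats[valid], teacher_feats[valid])
```

Restricting the loss to the `valid` mask is what focuses learning on noise-free points; everything outside the mask (background, occluders, sensor noise) contributes no gradient.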


Citation

If you find this work useful, please cite:
@article{hamza2025distilling3ddistinctivelocal,
  author    = {Hamza, Amir and Caraffa, Andrea and Boscaini, Davide and Poiesi, Fabio},
  title     = {Distilling 3D distinctive local descriptors for 6D pose estimation},
  journal   = {arXiv},
  year      = {2025},
}