Flose: Generative 6D Pose Estimation via Conditional Flow Matching

¹TeV - Fondazione Bruno Kessler, ²University of Trento, ³Technical University of Munich, ⁴Munich Center for Machine Learning

🏆 Ranked 1st on the BOP Leaderboard - Model-based 6D Localization of Seen Objects

Abstract

Existing methods for instance-level 6D pose estimation typically rely on neural networks that either directly regress the pose in SE(3) or estimate it indirectly via local feature matching. The former struggle with object symmetries, while the latter fail in the absence of distinctive local features.

To overcome these limitations, we propose a novel formulation of 6D pose estimation as a conditional flow matching problem in ℝ³. We introduce Flose, a generative method that infers object poses through a denoising process conditioned on local features. Unlike prior conditional flow matching approaches that rely solely on geometric guidance, Flose integrates appearance-based semantic features to mitigate ambiguities caused by object symmetries.
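The flow matching formulation above can be illustrated with a minimal NumPy sketch, assuming the standard linear (rectified-flow style) probability path between the clean point cloud at $t=0$ and Gaussian noise at $t=1$; `flow_matching_pair` is a hypothetical helper name, not part of the paper's code, and the per-point conditioning features are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_pair(x0, t, rng):
    """Sample a point on the linear path between the clean points x0 (t=0)
    and a Gaussian noise endpoint x1 (t=1). For this path the regression
    target for the network is the constant displacement x1 - x0."""
    x1 = rng.standard_normal(x0.shape)   # noise endpoint X(1)
    xt = (1.0 - t) * x0 + t * x1         # linear interpolant X(t)
    target = x1 - x0                     # dX(t)/dt along the linear path
    return xt, target

x0 = rng.standard_normal((1024, 3))      # stand-in for an aligned cloud in R^3
xt, v = flow_matching_pair(x0, t=0.5, rng=rng)
# at t=0.5 the interpolant sits exactly midway: xt == x0 + 0.5 * v
assert np.allclose(xt, x0 + 0.5 * v)
```

A denoising network trained to regress `target` from `(xt, t)` plus conditioning features can then transport noise back to the aligned configuration by integrating the learned field.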

We further incorporate RANSAC-based registration to handle outliers. We validate Flose on five datasets from the established BOP benchmark, where it outperforms prior methods by +4.5 Average Recall (AR) on average.

Method

Flose Method Overview

Overview of Flose. Given the query object point cloud $\mathcal{Q}$ and an RGBD image $\mathbf{I}$ as input (left), Flose estimates the 6D pose $(\hat{\mathbf{R}}, \hat{\mathbf{t}})$ (bottom right) through three stages: feature encoding (red), generative denoising (blue), and pose estimation (green).

Feature encoding: An overlap-aware encoder $\Phi_\Theta$ and an appearance-aware encoder $\Gamma$ produce per-point descriptors, which are fused into $\mathbf{F}^\mathcal{Q}, \mathbf{F}^\mathcal{T}$. Colors encode feature similarity: corresponding regions share similar colors, while semantically distinct parts differ.

Generative denoising: A generative network $\Psi_\Omega$, conditioned on $\mathbf{F}^\mathcal{Q}, \mathbf{F}^\mathcal{T}$, learns a displacement field that iteratively denoises a Gaussian-noised point cloud $\mathbf{X}(1)$ into an aligned configuration $\mathbf{X}(0)$.
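The denoising stage can be sketched as a fixed-step Euler integration of the learned field from $t=1$ (noise) back to $t=0$ (aligned); this is a minimal sketch in which `velocity_fn` is a placeholder for the conditioned network $\Psi_\Omega$, not the paper's architecture or solver settings.

```python
import numpy as np

def euler_denoise(x1, velocity_fn, num_steps=10):
    """Integrate dX/dt = v(X, t) backward from t=1 (Gaussian-noised cloud)
    to t=0 (aligned cloud) with fixed-step Euler. velocity_fn(x, t) stands
    in for the conditioned generative network Psi_Omega."""
    x = x1.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = 1.0 - k * dt          # current time, from 1 down to dt
        x = x - dt * velocity_fn(x, t)
    return x
```

With the oracle field of the linear path, $v(\mathbf{x}, t) = (\mathbf{x} - \mathbf{X}(0))/t$, this loop recovers $\mathbf{X}(0)$ exactly; in practice more steps trade compute for accuracy of the aligned cloud.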

Pose estimation: The final 6D pose $(\hat{\mathbf{R}}, \hat{\mathbf{t}})$ is recovered using a RANSAC-based Kabsch solver followed by ICP refinement.
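The closed-form solver inside the pose-estimation stage can be illustrated as follows; this is a plain Kabsch/SVD least-squares fit on corresponding points, shown without the surrounding RANSAC hypothesis loop or the ICP refinement described above.

```python
import numpy as np

def kabsch(P, Q):
    """Least-squares rigid transform (R, t) mapping points P onto Q,
    i.e. minimizing ||Q - (P @ R.T + t)||, via SVD of the 3x3 covariance.
    This is the solver a RANSAC loop would score hypotheses with."""
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    H = (P - p_mean).T @ (Q - q_mean)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_mean - R @ p_mean
    return R, t
```

Given the denoised cloud $\mathbf{X}(0)$ and the query $\mathcal{Q}$, running this solver on RANSAC-sampled correspondences yields $(\hat{\mathbf{R}}, \hat{\mathbf{t}})$, which ICP then refines.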

Quantitative Results

Comparison with RGBD methods in terms of AR. The top block (rows 1–5) trains a separate model for each object, while the bottom block (rows 6–11) trains a Single Model (S.M.) per dataset. Row 12 reports the per-dataset absolute improvement of Flose over the strongest S.M. competitor, PFA (row 10).

| # | Method | S.M. | LM-O | T-LESS | TUD-L | IC-BIN | YCB-V | Avg |
|---|--------|------|------|--------|-------|--------|-------|-----|
| 1 | Pix2Pose | ✗ | 58.8 | 51.2 | 82.0 | 39.0 | 78.8 | 62.0 |
| 2 | ZebraPose | ✗ | 75.2 | 72.7 | 94.8 | 65.2 | 86.6 | 78.9 |
| 3 | GDRNPP (BOP22) | ✗ | 77.5 | 87.4 | 96.6 | 72.2 | 92.1 | 85.2 |
| 4 | HccePose(BF) | ✗ | 80.5 | 87.9 | 94.4 | 72.4 | 91.1 | 85.3 |
| 5 | GDRNPP (BOP23) | ✗ | 79.4 | 91.4 | 96.4 | 73.7 | 92.8 | 86.7 |
| 6 | Koenig-Hybrid | ✓ | 63.1 | 65.5 | 92.0 | 43.0 | 70.1 | 66.7 |
| 7 | CosyPose | ✓ | 71.4 | 70.1 | 93.9 | 64.7 | 86.1 | 77.2 |
| 8 | SurfEmb | ✓ | 75.8 | 83.3 | 93.3 | 65.6 | 82.4 | 80.1 |
| 9 | CIR | ✓ | 73.4 | 77.6 | 96.8 | 67.6 | 89.3 | 81.0 |
| 10 | PFA | ✓ | 79.7 | 85.0 | 96.0 | 67.6 | 88.8 | 83.4 |
| 11 | Ours | ✓ | 86.1 | 86.9 | 98.8 | 74.8 | 92.8 | 87.9 |
| 12 | Improvement | | +6.4 | +1.9 | +2.8 | +7.2 | +4.0 | +4.5 |

BOP Leaderboard


Flose ranks 1st on the BOP leaderboard for the task of Model-based 6D Localization of Seen Objects – BOP Classic Core.

Citation

If you find this work useful, please cite:
@article{flose2026amir,
  author    = {Hamza, Amir et al.},
  title     = {Flose: Generative 6D Pose Estimation via Conditional Flow Matching},
  journal   = {arXiv},
  year      = {2026},
}