Existing methods for instance-level 6D pose estimation typically rely on neural networks that either directly regress the pose in SE(3) or estimate it indirectly via local feature matching. The former struggle with object symmetries, while the latter fail in the absence of distinctive local features.
To overcome these limitations, we propose a novel formulation of 6D pose estimation as a conditional flow matching problem in ℝ³. We introduce Flose, a generative method that infers object poses through a denoising process conditioned on local features. Unlike prior conditional flow matching approaches that rely solely on geometric guidance, Flose integrates appearance-based semantic features to mitigate ambiguities caused by object symmetries.
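As a rough illustration of what flow matching in ℝ³ involves, the sketch below builds a straight-line probability path between noise samples and target points, then integrates a velocity field with Euler steps to carry the noise to the targets. It assumes a linear interpolation path and uses a closed-form `oracle_velocity` as a stand-in for a learned, feature-conditioned network. All function names, the path choice, and the step count are hypothetical and are not taken from Flose.

```python
import numpy as np


def cfm_training_pair(x0, x1, t):
    """Point on the straight path from x0 to x1 at time t, plus its target velocity.

    Assumes a linear interpolation path; a network would be trained to
    regress `target_velocity` given `xt`, `t`, and conditioning features.
    """
    xt = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0
    return xt, target_velocity


def integrate_flow(velocity_fn, x0, num_steps=10):
    """Euler integration of dx/dt = v(x, t) from t = 0 to t = 1."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1
        x = x + dt * velocity_fn(x, t)
    return x


rng = np.random.default_rng(0)
x1 = rng.uniform(-1.0, 1.0, size=(5, 3))  # hypothetical target 3D points
x0 = rng.normal(size=(5, 3))              # Gaussian noise samples


def oracle_velocity(x, t):
    # Stand-in for a learned network: the exact velocity that moves x
    # toward x1 along the straight path (only valid for t < 1).
    return (x1 - x) / (1.0 - t)


x_final = integrate_flow(oracle_velocity, x0, num_steps=10)
```

In a real system the oracle would be replaced by a trained model that never sees `x1` at inference time, and the conditioning features would be what resolves which target each noisy point should move toward.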
We further incorporate RANSAC-based registration to handle outliers. We validate Flose on five datasets from the established BOP benchmark, where it outperforms prior methods with an average improvement of +4.5 Average Recall.
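To make the role of RANSAC-based registration concrete, the following sketch estimates a rigid transform from 3D point correspondences that contain outliers. It assumes minimal samples of three correspondences, a Kabsch (SVD) least-squares fit, and a fixed Euclidean inlier threshold. These choices, the function names, and the synthetic data are illustrative assumptions and not the configuration used by Flose.

```python
import numpy as np


def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst ≈ src @ R.T + t."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def ransac_rigid(src, dst, num_iters=200, inlier_thresh=0.05, seed=0):
    """RANSAC over 3-point samples, followed by a refit on the best inlier set."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(num_iters):
        idx = rng.choice(len(src), size=3, replace=False)
        R, t = kabsch(src[idx], dst[idx])
        residuals = np.linalg.norm(src @ R.T + t - dst, axis=1)
        inliers = residuals < inlier_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers.sum() < 3:
        raise RuntimeError("RANSAC found no consistent model")
    R, t = kabsch(src[best_inliers], dst[best_inliers])
    return R, t, best_inliers


# Synthetic example: 40 correspondences, the first 10 replaced by outliers.
rng = np.random.default_rng(1)
angle = 0.7
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.3, -0.2, 1.0])
src = rng.uniform(-1.0, 1.0, size=(40, 3))
dst = src @ R_true.T + t_true
dst[:10] = rng.uniform(-5.0, 5.0, size=(10, 3))  # corrupted correspondences

R_est, t_est, inlier_mask = ransac_rigid(src, dst)
```

The final refit on all inliers is a common design choice: the three-point model is only used to select a consensus set, and the returned pose is averaged over every correspondence that agrees with it.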
@article{flose2026amir,
author = {Hamza, Amir et al.},
title = {Generative 6D Pose Estimation via Conditional Flow Matching},
journal = {arXiv},
year = {2026},
}