High-Resolution Open-Vocabulary Object 6D Pose Estimation

Publication: Article, Journal, Preprint, 01 Feb 2026
Embargo end date: 01 Jun 2024
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 48, pages 2066-2077 (ISSN: 0162-8828, eISSN: 1939-3539)
Funded by: EC | AI-PRISM

Authors: Jaime Corsetti; Davide Boscaini; Francesco Giuliari; Changjae Oh; Andrea Cavallaro; Fabio Poiesi

doi: 10.1109/tpami.2025.3624589, 10.48550/arxiv.2406.16384, 10.5281/zenodo.18790300, 10.5281/zenodo.18790301

pmid: 41129457

arXiv: 2406.16384

Abstract

Generalising to unseen objects is a major challenge in 6D pose estimation. While Vision-Language Models (VLMs) make it possible to use natural language descriptions to support 6D pose estimation of unseen objects, these solutions underperform model-based methods. In this work we present Horyon, an open-vocabulary VLM-based architecture that addresses relative pose estimation between two scenes of an unseen object, described by a textual prompt only. We use the textual prompt to identify the unseen object in the scenes and then obtain high-resolution multi-scale features. These features are used to extract cross-scene matches for registration. We evaluate our model on a benchmark with a large variety of unseen objects across four datasets, namely REAL275, Toyota-Light, Linemod, and YCB-Video. Our method achieves state-of-the-art performance on all datasets, outperforming the previous best-performing approach by 12.6 in Average Recall.

Technical report. Extension of CVPR paper "Open-vocabulary object 6D pose estimation". Project page: https://jcorsetti.github.io/oryon
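
The abstract mentions extracting cross-scene matches from per-scene features but does not say how the matching is done. As a rough illustration only, the sketch below uses mutual nearest-neighbour matching under cosine similarity, a common generic strategy that is assumed here and is not claimed to be Horyon's method. The function names `cosine_similarity` and `mutual_nearest_matches` and the toy feature vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def mutual_nearest_matches(feats_a, feats_b):
    """Return index pairs (i, j) where feature i of scene A and feature j
    of scene B are each other's nearest neighbour.

    Assumption: this generic matcher stands in for whatever matching
    strategy the paper actually uses.
    """
    if not feats_a or not feats_b:
        return []
    # Full similarity matrix: sim[i][j] compares A's feature i with B's feature j.
    sim = [[cosine_similarity(fa, fb) for fb in feats_b] for fa in feats_a]
    # Best match in B for every feature in A, and vice versa.
    best_b_for_a = [max(range(len(feats_b)), key=lambda j: row[j]) for row in sim]
    best_a_for_b = [
        max(range(len(feats_a)), key=lambda i: sim[i][j])
        for j in range(len(feats_b))
    ]
    # Keep only the pairs that agree in both directions.
    return [(i, j) for i, j in enumerate(best_b_for_a) if best_a_for_b[j] == i]


if __name__ == "__main__":
    # Toy 2D features standing in for per-point descriptors of the two scenes.
    scene_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    scene_b = [[0.0, 2.0], [3.0, 0.1]]
    print(mutual_nearest_matches(scene_a, scene_b))
```

In a pipeline of the kind the abstract describes, matched points like these would be passed on to a registration step that estimates the relative 6D pose. The mutual check is a simple way to discard one-sided matches, such as the third feature of `scene_a` in the toy data.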

Related Organizations

Queen Mary University of London, United Kingdom
École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
Fondazione Bruno Kessler, Italy

Keywords

FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition

Impact by BIP!

selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Green

Fields of Science

engineering and technology

electrical engineering, electronic engineering, information engineering
