Open-vocabulary object 6D pose estimation

Publication type: Article, Preprint, Conference object

Date: 16 Jun 2024

Embargo end date: 01 Jan 2023

Country: Italy

Publisher: IEEE

Journal: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Funded by: UKRI | JADE: Joint Academic Data...

Authors: Corsetti, Jaime; Boscaini, Davide; Oh, Changjae; Cavallaro, Andrea; Poiesi, Fabio

doi: 10.1109/cvpr52733.2024.01711 , 10.48550/arxiv.2312.00690 , 10.5281/zenodo.18803543 , 10.5281/zenodo.18803544

arXiv: 2312.00690

handle: 11572/447551


Abstract

We introduce the new setting of open-vocabulary object 6D pose estimation, in which a textual prompt is used to specify the object of interest. In contrast to existing approaches, in our setting (i) the object of interest is specified solely through the textual prompt, (ii) no object model (e.g., CAD or video sequence) is required at inference, and (iii) the object is imaged from two RGBD viewpoints of different scenes. To operate in this setting, we introduce a novel approach that leverages a Vision-Language Model to segment the object of interest from the scenes and to estimate its relative 6D pose. The key to our approach is a carefully devised strategy to fuse object-level information provided by the prompt with local image features, resulting in a feature space that can generalize to novel concepts. We validate our approach on a new benchmark based on two popular datasets, REAL275 and Toyota-Light, which collectively encompass 34 object instances appearing in four thousand image pairs. The results demonstrate that our approach outperforms both a well-established hand-crafted method and a recent deep learning-based baseline in estimating the relative 6D pose of objects in different scenes. Code and dataset are available at https://jcorsetti.github.io/oryon.
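The abstract states that the output is the relative 6D pose (a rotation plus a translation) of the object between two RGBD views. As a minimal sketch, assuming matched pixel pairs between the two views are already available (how Oryon obtains and filters them is not described in this record), the pose can be recovered by back-projecting the matched pixels to 3D with the depth maps and pinhole intrinsics, then solving for the rigid transform with the Kabsch (SVD) algorithm. The function names `backproject` and `kabsch` are hypothetical, and this is a generic registration technique, not the authors' implementation.

```python
import numpy as np


def backproject(uv, depth, K):
    """Lift pixel coordinates uv (N, 2) with per-pixel depth (N,) to 3D
    camera-frame points (N, 3), assuming a pinhole camera with intrinsics K (3x3)."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (uv[:, 0] - cx) * depth / fx
    y = (uv[:, 1] - cy) * depth / fy
    return np.stack([x, y, depth], axis=1)


def kabsch(P, Q):
    """Least-squares rigid transform (R, t) with R @ p + t ~= q for matched rows
    of P (N, 3) and Q (N, 3). Assumes at least 3 non-collinear correspondences."""
    p_mean = P.mean(axis=0)
    q_mean = Q.mean(axis=0)
    # Cross-covariance of the centred point sets.
    H = (P - p_mean).T @ (Q - q_mean)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so R is a proper rotation (det = +1), not a reflection.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T
    t = q_mean - R @ p_mean
    return R, t
```

With real matches, some correspondences are wrong, so a closed-form solver like this is usually wrapped in a robust estimator such as RANSAC rather than applied to all matches at once.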

Camera-ready version (CVPR 2024, poster highlight). New Oryon version: arXiv:2406.16384

Country

Italy

Related Organizations

University of Trento, Italy
Queen Mary University of London, United Kingdom
Idiap Research Institute, Switzerland

Keywords

FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition

Impact by BIP!

- Selected citations: 1. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.


Open access: Green

Funded by

UKRI | JADE: Joint Academic Data science Endeavour - 2