
Under review. Pre-print version. The emergence of vision-language models like CLIP has significantly advanced open-vocabulary object detection, enabling object recognition through free-text descriptions at inference time. However, existing approaches primarily focus on class-level discrimination and often fail to capture fine-grained object attributes such as color, pattern, and material. In this paper, we introduce Fine-Grained Open-Vocabulary Object Detection and propose a benchmark suite to assess the ability of models to detect, differentiate, and describe objects with fine-grained attributes, even in the presence of challenging negative captions. Our benchmark suite covers multiple difficulty levels and attribute types, providing a comprehensive evaluation of state-of-the-art open-vocabulary object detectors. Extensive experiments reveal that most detection models struggle to capture subtle object attributes effectively. To mitigate the critical failures of the probed models, we prepare a weakly labeled training set and introduce a distillation-based adaptation method that balances attribute-level and class-level detection. This approach improves the trade-off between fine- and coarse-grained recognition, helping to bridge the gap that emerges in current state-of-the-art models. Our results highlight current limitations and suggest promising directions for improving fine-grained open-world detection. Data and code are available at https://lorebianchi98.github.io/FG-OVD/.
