
handle: 2158/1433620
This study explores the use of Vision Large Language Models (VLLMs) for identifying items in complex graphical documents. In particular, we focus on detecting furniture objects (e.g. beds, tables, and chairs) and structural items (doors and windows) in floorplan images. We evaluate one object detection model (YOLO) and state-of-the-art VLLMs on two datasets featuring diverse floorplan layouts and symbols. The experiments with VLLMs are performed in a zero-shot setting, meaning the models are tested without any training or fine-tuning, as well as with a few-shot approach, where examples of the items to be found in the image are given to the models in the prompt. The results highlight the strengths and limitations of VLLMs in recognizing architectural elements, providing guidance for future research on the use of multimodal vision-language models for graphics recognition.
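The abstract does not specify the prompts, models, or APIs used in the experiments. As a purely illustrative sketch, the Python snippet below shows one plausible way to build a zero-shot prompt and a few-shot prompt with example symbol crops for a floorplan image; the model identifier ("gpt-4o"), the OpenAI-style chat API, and the prompt wording are assumptions, not the authors' setup.

# Hypothetical sketch: zero-shot vs. few-shot prompting of a vision-language
# model for floorplan item recognition. Model name, prompt text, and the
# OpenAI-style API are illustrative assumptions, not the paper's method.
import base64
from openai import OpenAI

client = OpenAI()

def encode_image(path: str) -> str:
    # Read an image file and return a base64 data URL the API accepts.
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

PROMPT = (
    "List every furniture object (bed, table, chair) and structural item "
    "(door, window) visible in this floorplan, one item per line."
)

def zero_shot(floorplan_path: str) -> str:
    # Zero-shot: only the instruction and the target floorplan are supplied.
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model identifier
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url", "image_url": {"url": encode_image(floorplan_path)}},
            ],
        }],
    )
    return response.choices[0].message.content

def few_shot(floorplan_path: str, example_paths: dict[str, str]) -> str:
    # Few-shot: example crops of each symbol (e.g. {"door": "door.png", ...})
    # are placed in the prompt before the target floorplan.
    content = [{"type": "text", "text": "Reference symbols follow."}]
    for label, path in example_paths.items():
        content.append({"type": "text", "text": f"Example of a {label}:"})
        content.append({"type": "image_url", "image_url": {"url": encode_image(path)}})
    content.append({"type": "text", "text": PROMPT})
    content.append({"type": "image_url", "image_url": {"url": encode_image(floorplan_path)}})
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model identifier
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content

The few-shot variant simply prepends labeled example images to the same instruction, mirroring the described setup in which symbol examples are given to the model in the prompt.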
Visual Large Language Models; Graphics Recognition; Zero-shot Prompting; Few-shot Prompting
