
Voice Assistants (VAs) are becoming an increasingly important part of our lives. However, most widespread VAs fail to take the user's spatiotemporal context into account [11], forcing users into more descriptive, less natural dialogue. This paper introduces VOICE, an open-source VA that leverages multimodal interaction and vision-language models to enable more flexible and natural communication. Additionally, we present a preliminary user study evaluating VOICE's ability to understand queries with contextual references.
Voice Assistant, Vision-Language Model, Mixed Reality, Multimodal Interaction
