Object-level Visual Prompts for Compositional Image Generation

Publication: Article, Preprint
Date: 14 Dec 2025
Embargo end date: 01 Jan 2025
Publisher: ACM
Journal: Proceedings of the SIGGRAPH Asia 2025 Conference Papers
Funded by: NSF | CAREER: Exploiting Deep G...

Authors: Gaurav Parmar; Or Patashnik; Kuan-Chieh Wang; Daniil Ostashev; Srinivasa Narasimhan; Jun-Yan Zhu; Daniel Cohen-Or; +1 Authors

doi: 10.1145/3757377.3763867, 10.48550/arxiv.2501.01424

arXiv: 2501.01424

Abstract

We introduce a method for composing object-level visual prompts within a text-to-image diffusion model. Our approach addresses the task of generating semantically coherent compositions across diverse scenes and styles, similar to the versatility and expressiveness offered by text prompts. A key challenge in this task is to preserve the identity of the objects depicted in the input visual prompts, while also generating diverse compositions across different images. To address this challenge, we introduce a new KV-mixed cross-attention mechanism, in which keys and values are learned from distinct visual representations. The keys are derived from an encoder with a small bottleneck for layout control, whereas the values come from a larger bottleneck encoder that captures fine-grained appearance details. By mixing keys and values from these complementary sources, our model preserves the identity of the visual prompts while supporting flexible variations in object arrangement, pose, and composition. During inference, we further propose object-level compositional guidance to improve the method's identity preservation and layout correctness. Results show that our technique produces diverse scene compositions that preserve the unique characteristics of each visual prompt, expanding the creative potential of text-to-image generation.

Project: https://snap-research.github.io/visual-composer/
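
The KV-mixed cross-attention idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it only shows, under stated assumptions, how attention weights can be computed from one feature set (keys) while the attended content comes from another (values). The function name `kv_mixed_cross_attention`, the projection matrices `w_k` and `w_v`, and all dimensions are hypothetical, and random matrices stand in for the outputs of the small-bottleneck and large-bottleneck encoders.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def kv_mixed_cross_attention(queries, key_feats, value_feats, w_k, w_v):
    """Cross-attention whose keys and values come from different feature sets.

    queries:     (n_q, d) image-side queries
    key_feats:   (n_tokens, d_small) stand-in for small-bottleneck encoder output
    value_feats: (n_tokens, d_large) stand-in for large-bottleneck encoder output
    w_k:         (d_small, d) hypothetical key projection
    w_v:         (d_large, d) hypothetical value projection
    """
    keys = matmul(key_feats, w_k)        # (n_tokens, d)
    values = matmul(value_feats, w_v)    # (n_tokens, d)
    d = len(queries[0])
    keys_t = [list(col) for col in zip(*keys)]  # (d, n_tokens)
    scores = matmul(queries, keys_t)            # (n_q, n_tokens)
    # Attention weights depend only on the queries and the key features.
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    # The attended content comes from the value features.
    return matmul(weights, values), weights


def random_matrix(rows, cols, rng):
    """Random Gaussian matrix used as placeholder data."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_q, n_tokens, d, d_small, d_large = 4, 3, 8, 2, 16  # illustrative sizes only
queries = random_matrix(n_q, d, rng)
key_feats = random_matrix(n_tokens, d_small, rng)
value_feats = random_matrix(n_tokens, d_large, rng)
w_k = random_matrix(d_small, d, rng)
w_v = random_matrix(d_large, d, rng)

out, attn = kv_mixed_cross_attention(queries, key_feats, value_feats, w_k, w_v)
print(len(out), len(out[0]))  # output has one d-dimensional row per query
```

In this sketch, swapping `value_feats` changes the output but leaves the attention weights untouched, which mirrors the separation the abstract describes between layout control (keys) and appearance detail (values).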

Related Organizations

Tel Aviv University, Israel
Carnegie Mellon University, United States

Keywords

FOS: Computer and information sciences, Computer Science - Graphics, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Graphics (cs.GR)

Impact by BIP!

selected citations: 4
These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).

popularity: Top 10%
This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.

influence: Average
This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).

impulse: Top 10%
This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Open access route: Green

Funded by

NSF | CAREER: Exploiting Deep Generative Models for Visual Recognition