Separate-and-Enhance: Compositional Finetuning for Text2Image Diffusion Models

Keywords: FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition

Zhipeng Bao; Yijun Li 0001; Krishna Kumar Singh; Yu-Xiong Wang; Martial Hebert

- arXiv.org e-Print Archive: Preprint, 2023. Data sources: arXiv.org e-Print Archive
- https://dx.doi.org/10.48550/ar...: Article, 2023. License: arXiv Non-Exclusive Distribution. Data sources: Datacite
- DBLP: Article. Data sources: DBLP

Publication: Article, Preprint. Date: 01 Jan 2023. Embargo end date: 01 Jan 2023. Publisher: arXiv. Journal: CoRR, volume abs/2312.06712

doi: 10.48550/arxiv.2312.06712

arXiv: 2312.06712

Abstract

Despite the significant recent strides made by diffusion-based Text-to-Image (T2I) models, current systems still struggle to ensure compositional generation that is aligned with the text prompt, particularly for multi-object generation. This work identifies the fundamental reasons for such misalignment, pinpointing two issues: low attention activation scores and mask overlaps. While previous research efforts have tackled these issues individually, we assert that a holistic approach is paramount. Thus, we propose two novel objectives, the Separate loss and the Enhance loss, that reduce object mask overlaps and maximize attention scores, respectively. Our method diverges from conventional test-time-adaptation techniques by finetuning critical parameters, which enhances scalability and generalizability. Comprehensive evaluations demonstrate the superior performance of our model in terms of image realism, text-image alignment, and adaptability, notably outperforming prominent baselines. Ultimately, this research paves the way for T2I diffusion models with enhanced compositional capacities and broader applicability.
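The abstract names the two objectives but this record does not give their formulas. The following is only a minimal sketch of the general idea, under stated assumptions: each object token has a flattened, non-negative cross-attention map; the "enhance" term is taken as one minus the peak attention value, and the "separate" term as a soft intersection-over-union between pairs of maps. The function names, the soft-IoU choice, and the weights `w_sep` and `w_enh` are all hypothetical and are not taken from the paper.

```python
from itertools import combinations


def enhance_loss(attn_maps):
    """Hypothetical enhance term: mean over objects of (1 - peak attention).

    Lower values mean every object token has at least one strongly
    activated location in its attention map.
    """
    return sum(1.0 - max(m) for m in attn_maps) / len(attn_maps)


def separate_loss(attn_maps, eps=1e-8):
    """Hypothetical separate term: mean soft IoU over all object pairs.

    Intersection and union are the element-wise min and max of two maps,
    so the value is 0 when the maps never overlap.
    """
    pairs = list(combinations(attn_maps, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        inter = sum(min(x, y) for x, y in zip(a, b))
        union = sum(max(x, y) for x, y in zip(a, b))
        total += inter / (union + eps)
    return total / len(pairs)


def combined_loss(attn_maps, w_sep=1.0, w_enh=1.0):
    """Weighted sum of the two illustrative terms (weights are assumptions)."""
    return w_sep * separate_loss(attn_maps) + w_enh * enhance_loss(attn_maps)


# Two toy 4-pixel attention maps per case, one map per object token.
overlapping = [[0.2, 0.3, 0.1, 0.0], [0.2, 0.3, 0.0, 0.1]]  # weak and entangled
separated = [[0.9, 0.1, 0.0, 0.0], [0.0, 0.0, 0.1, 0.9]]    # strong and disjoint

if __name__ == "__main__":
    print("overlapping:", combined_loss(overlapping))
    print("separated:  ", combined_loss(separated))
```

In this toy setting the weak, entangled maps score a higher combined loss than the strong, disjoint ones, which is the behavior the abstract attributes to the two objectives. In a finetuning setup these terms would be computed on differentiable attention tensors rather than Python lists.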

Related research products (2)

- diffusers, software on GitHub (IsRelatedTo)
- clean-sid, software on GitHub (IsRelatedTo)

Impact by BIP!

- Selected citations: 0. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Popularity: Average. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- Influence: Average. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- Impulse: Average. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Green
