MITIGATING MODE COLLAPSE TO IMPROVE DIVERSITY IN TEXT-TO-IMAGE GAN OUTPUTS: STRATEGIES IN ARCHITECTURAL DESIGN, TRAINING METHODOLOGIES, AND EVALUATION TECHNIQUES

SUBUHI KASHIF ANSARI, MANAL AL KHAMMASH, ANJALI APPUKUTTAN, ANNE ANOOP, SANDEEP KUMAR MATHARIYA, SHEELA D V, MOHAMMED SALEH AL ANSARI

ZENODO

Article . 2026

License: CC BY

Data sources: ZENODO

ZENODO

Article . 2026

License: CC BY

Data sources: Datacite

Publication: Article, 31 Mar 2026. Publisher: Little Lion Scientific

doi: 10.5281/zenodo.19365469, 10.5281/zenodo.19365468

Abstract

Text-to-image generation using Generative Adversarial Networks (GANs), which synthesizes images from textual descriptions, has advanced significantly in recent years. However, mode collapse remains a critical challenge that limits output diversity. This systematic review analyzes strategies for mitigating mode collapse in text-to-image GANs, examining architectural designs, training methodologies, latent-space techniques, and evaluation metrics. The review covers 45 studies published between 2015 and 2025, grouped into four categories: architectural innovations (18 papers), training-based strategies (12 papers), latent-space and loss-function methods (10 papers), and evaluation-centric approaches (5 papers). The findings show that attention-based models, multi-scale architectures, and semantic-spatial models enhance semantic alignment and diversity, although each has specific limitations. Training-based approaches, including curriculum learning, adaptive training, gradient penalties, and progressive growing of GANs, help stabilize training and mitigate collapse. Latent-space techniques, such as mode-seeking losses, contrastive losses, and noise manipulation, promote output diversity. However, evaluation metrics such as Fréchet Inception Distance (FID), Inception Score (IS), Learned Perceptual Image Patch Similarity (LPIPS), and Multi-Scale Structural Similarity Index (MS-SSIM) are limited in their ability to capture semantic diversity. Progress in mitigating mode collapse depends on combining architectural design, training stability, and loss-function engineering. Future priorities include developing unified benchmarks for evaluating semantic diversity, exploring hybrid architectures, and designing adaptive training protocols, so that text-to-image models become more robust and generate diverse, semantically coherent outputs.
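As an illustration of the mode-seeking losses mentioned among the latent-space techniques, the following is a minimal sketch and not the formulation of any specific reviewed study. It assumes the common form of the regularizer: the ratio of the distance between two generated images to the distance between the latent codes that produced them, inverted so that it can be minimized. The function names, the L1 distance, the `eps` constant, and the toy generator are all hypothetical choices made for this example.

```python
import random

def l1_distance(a, b):
    """Mean absolute difference between two equal-length flat vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def mode_seeking_loss(img1, img2, z1, z2, eps=1e-5):
    """Inverse of the image-distance / latent-distance ratio (assumed form).

    Minimizing this value rewards a generator that maps distinct latent
    codes to distinct images under the same text condition.
    """
    ratio = l1_distance(img1, img2) / (l1_distance(z1, z2) + eps)
    return 1.0 / (ratio + eps)

def toy_generator(z, collapse):
    """Hypothetical stand-in for G(text, z); a collapsed generator ignores z."""
    if collapse:
        return [0.5] * len(z)        # same output for every latent code
    return [2.0 * v for v in z]      # output varies with the latent code

rng = random.Random(0)
z1 = [rng.gauss(0, 1) for _ in range(16)]
z2 = [rng.gauss(0, 1) for _ in range(16)]

collapsed_loss = mode_seeking_loss(
    toy_generator(z1, True), toy_generator(z2, True), z1, z2)
diverse_loss = mode_seeking_loss(
    toy_generator(z1, False), toy_generator(z2, False), z1, z2)

# The collapsed generator is penalized far more heavily than the diverse one.
print(collapsed_loss > diverse_loss)
```

In a real training loop this term would typically be added, with a weighting coefficient, to the generator's adversarial loss, and the images would be tensors rather than Python lists.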

Keywords

Text-to-Image GANs, Mode Collapse, Output Diversity, Architectural Design, Training Methodologies, Evaluation Techniques, Attention Mechanisms, Latent-Space Techniques
