
handle: 10061/14268
Orally describing what we see is a simple task in daily life. In the natural language processing field, however, this task is usually bridged by a textual modality that helps the system generalize across the various objects in an image and the various pronunciations in speech utterances. In this study, we propose an end-to-end Image2Speech system that requires no textual information during training. We use a vector-quantized variational autoencoder (VQ-VAE) to learn a discrete representation of a speech caption in an unsupervised manner; the resulting discrete labels are then used by an image-captioning model. This self-supervised speech representation enables the Image2Speech model to be trained with a minimal amount of paired image-speech data while still maintaining the quality of the speech captions. Our experimental results on a multi-speaker natural speech dataset demonstrate that our proposed text-free Image2Speech system performs close to one trained with textual information. Furthermore, our approach also outperforms the most recent phoneme-based and grounding-based Image2Speech frameworks.
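The core discretization step described above can be sketched as follows. This is a minimal illustration of VQ-VAE-style vector quantization, not the authors' implementation: each continuous encoder frame is snapped to its nearest codebook entry, and the resulting index sequence serves as the text-free "pseudo-label" target for the image-captioning model. All names, shapes, and the toy data are assumptions for illustration.

```python
import numpy as np

def vector_quantize(z_e, codebook):
    """Map each encoder output vector to its nearest codebook entry.

    z_e:      (T, D) continuous encoder outputs for T speech frames
    codebook: (K, D) learned embedding vectors
    Returns the discrete indices (the pseudo-labels an image-captioning
    model could be trained to predict) and the quantized vectors z_q.
    """
    # Squared Euclidean distance between every frame and every code
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)   # (T,) discrete labels
    z_q = codebook[indices]          # (T, D) quantized outputs
    return indices, z_q

# Toy example: 4 frames of 2-D features, codebook with 3 entries.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(3, 2))
# Frames are noisy copies of codes 0, 0, 2, 1.
z_e = codebook[[0, 0, 2, 1]] + 0.01 * rng.normal(size=(4, 2))
indices, z_q = vector_quantize(z_e, codebook)
print(indices.tolist())
```

In the full model, the codebook is learned jointly with the speech encoder and decoder, and a straight-through estimator passes gradients through the non-differentiable `argmin`; the sketch above shows only the inference-time lookup.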
keywords: image captioning, image-to-speech, vector-quantized variational autoencoder, self-supervised speech representation, untranscribed unknown language, speech recognition, decoding, data models, image reconstruction, training, task analysis
| Indicator | Description | Value |
| --- | --- | --- |
| selected citations | Citations derived from selected sources; an alternative to the "Influence" indicator, which reflects the overall/total impact of the article in the research community at large, based on the underlying citation network (diachronically). | 10 |
| popularity | Reflects the "current" impact/attention (the "hype") of the article in the research community at large, based on the underlying citation network. | Top 10% |
| influence | Reflects the overall/total impact of the article in the research community at large, based on the underlying citation network (diachronically). | Average |
| impulse | Reflects the initial momentum of the article directly after its publication, based on the underlying citation network. | Top 10% |
