Deep Inverse Cooking

Medical images are widely used in hospitals for the diagnosis and treatment of many diseases, such as skin cancer or diabetic retinopathy. Machine learning algorithms have recently been shown to outperform human doctors in a broad variety of diagnosis tasks. A diagnosis is often posed either as a semantic segmentation problem, where models are trained to classify each pixel of an image, or as a multi-label classification task, where the output is a set of tags. However, both types of output are hard to interpret because they give no reasoning about how the decision was reached. A diagnosis made by a medical doctor is different. When a family doctor refers a patient to a specialist, the doctor expects a medical report in which the specialist explains the diagnosis. Likewise, the output of a neural network would be more useful if it were accompanied by a medical report written in natural language. Image-to-text models have recently progressed so much that automatically generating medical reports can now be considered feasible. However, such models require a large amount of paired data, i.e. images paired with medical reports. To the best of the author's knowledge, there is no publicly available dataset of such paired data. In order to experiment with image-to-text models, the domain was therefore switched from medicine to cooking, where such data is abundant. A dataset consisting of 0.9M recipes and 1.3M images was acquired by crawling five different cooking platforms. Since the majority of the recipes originate from community cooking websites, an extensive data cleaning pipeline had to be implemented. This allowed the number of unique ingredients to be reduced from 1M to 1.3k at the cost of dropping some recipes. Using this dataset, a multi-task neural network model was implemented, trained and evaluated. Based on an image of a dish, it generates a list of ingredients (cf. medical features), a title and cooking instructions (cf. medical report).
The model uses a VGG-16 encoder to extract image features. Given these features, a transformer-based decoder generates a list of ingredients. Finally, an additional transformer decoder generates the recipe title and the cooking instructions by processing the image features and the ingredient features simultaneously. Evaluation on unseen test data showed that the model achieves an F1 score of 38.62% for ingredient prediction, a BLEU-1 score of 7.17% for title generation and a BLEU-4 score of 6.15% for instruction generation. A comparison of the inverse cooking architecture with medical image captioning systems from the literature shows several similarities. Therefore, it is expected that the proposed model can be adapted and extended for generating medical reports in the future.
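
To make the ingredient metric concrete, the following is a minimal sketch of a set-based F1 score between a predicted and a reference ingredient list. It is an illustration only: the function name `ingredient_f1` and the example ingredients are hypothetical, and the abstract does not state how the reported F1 score was aggregated over the test set, so this per-recipe formulation is an assumption.

```python
def ingredient_f1(predicted, reference):
    """F1 score between two ingredient lists, compared as sets.

    Assumption: order and duplicates are ignored. The report may
    aggregate differently (e.g. over the whole test set).
    """
    pred, ref = set(predicted), set(reference)
    if not pred and not ref:
        return 1.0  # both empty: treat as a perfect match
    true_pos = len(pred & ref)
    if true_pos == 0:
        return 0.0  # no overlap, so precision and recall are both zero
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 2 of 3 predictions are correct, 2 of 4 references found.
score = ingredient_f1(
    ["flour", "sugar", "egg"],
    ["flour", "egg", "butter", "milk"],
)
print(f"{score:.4f}")
```

Here precision is 2/3 and recall is 1/2, which gives an F1 of 4/7. Treating ingredients as sets fits the task because the order of an ingredient list carries no meaning.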

+ Publication ID: hslu_78709 + Type of contribution: Report + Language: English + Last updated: 2020-07-16 16:35:45

Country

Switzerland

Related Organizations

Zentral und Hochschulbibliothek Luzern
Switzerland
Lucerne University of Applied Sciences and Arts
Switzerland

Keywords

Machine Learning, Deep Learning, Computer Vision, Transformers, Convolutional Neural Networks, Image Captioning, Cooking, Supervised Learning, Natural Language Processing

Impact by BIP!

Selected citations: 0
Popularity: Average
Influence: Average
Impulse: Average