Multimodal3DIdent

This upload contains the Multimodal3DIdent dataset introduced in the paper "Identifiability Results for Multimodal Contrastive Learning", presented at ICLR 2023. The dataset provides an identifiability benchmark with image/text pairs generated from controllable ground truth factors, some of which are shared between the image and text modalities. The training, validation, and test sets contain 125000, 10000, and 10000 image/text pairs and ground truth factors, respectively. The code for the data generation is publicly available: https://github.com/imantdaunhawer/Multimodal3DIdent.

Description
------------------

The generated dataset contains image and text data as well as the ground truth factors of variation for each modality. Each split (train/val/test) of the dataset is structured as follows:

    .
    ├── images
    │   ├── 000000.png
    │   ├── 000001.png
    │   └── etc.
    ├── text
    │   └── text_raw.txt
    ├── latents_image.csv
    └── latents_text.csv

The directories images and text contain the generated image and text data, whereas the CSV files latents_image.csv and latents_text.csv contain the values of the respective latent factors. There is an index-wise correspondence between images, sentences, and latent factors. For example, the first line in the file text_raw.txt is the sentence that corresponds to the first image in the images directory.

Latent factors: We use the following ground truth latent factors to generate image and text data. Each factor is sampled from a uniform distribution defined on the specified set of values for the respective factor.

| Modality | Latent Factor | Values | Details |
|----------|---------------|--------|---------|
| Image | Object shape | {0, 1, ..., 6} | Mapped to Blender shapes like "Teapot", "Hare", etc. |
| Image | Object x-position | {0, 1, 2} | Mapped to {-3, 0, 3} for Blender |
| Image | Object y-position | {0, 1, 2} | Mapped to {-3, 0, 3} for Blender |
| Image | Object z-position | {0} | Constant |
| Image | Object alpha-rotation | [0, 1]-interval | Linearly transformed to [-pi/2, pi/2] for Blender |
| Image | Object beta-rotation | [0, 1]-interval | Linearly transformed to [-pi/2, pi/2] for Blender |
| Image | Object gamma-rotation | [0, 1]-interval | Linearly transformed to [-pi/2, pi/2] for Blender |
| Image | Object color | [0, 1]-interval | Hue value in HSV transformed to RGB for Blender |
| Image | Spotlight position | [0, 1]-interval | Transformed to a unique position on a semicircle |
| Image | Spotlight color | [0, 1]-interval | Hue value in HSV transformed to RGB for Blender |
| Image | Background color | [0, 1]-interval | Hue value in HSV transformed to RGB for Blender |
| Text | Object shape | {0, 1, ..., 6} | Mapped to strings like "teapot", "hare", etc. |
| Text | Object x-position | {0, 1, 2} | Mapped to strings "left", "center", "right" |
| Text | Object y-position | {0, 1, 2} | Mapped to strings "top", "mid", "bottom" |
| Text | Object color | string values | Color names from 3 different color palettes |
| Text | Text phrasing | {0, 1, ..., 4} | Mapped to 5 different English sentences |

Image rendering: We use the Blender rendering engine to create visually complex images depicting a 3D scene. Each image in the dataset shows a colored 3D object of a certain shape or class (i.e., teapot, hare, cow, armadillo, dragon, horse, or head) in front of a colored background and illuminated by a colored spotlight that is focused on the object and located on a semicircle above the scene. The resulting RGB images are of size 224 x 224 x 3.

Text generation: We generate a short sentence describing the respective scene. Each sentence describes the object's shape or class (e.g., teapot), position (e.g., bottom-left), and color. The color is represented in a human-readable form (e.g., "lawngreen", "xkcd:bright aqua", etc.) as the name of the color (from a randomly sampled palette) that is closest to the sampled color value in RGB space. The sentence is constructed from one of five pre-configured phrases with placeholders for the respective ground truth factors.

Relation between modalities: Three latent factors (object shape, x-position, y-position) are shared between image/text pairs. The object color also exhibits a dependence between modalities; however, it is not a 1-to-1 correspondence because the color palette is sampled randomly from a set of multiple palettes. Additionally, there is a causal dependence of object color on object x-position since the range of hue values [0, 1] is split into three equally sized intervals, each of which is associated with a fixed x-position of the object. For instance, if x-position is "left", we sample the hue value from the interval [0, 1/3]. Consequently, the color of the object can be predicted to some degree from the object's position.

Acknowledgements
-------------------------------

The Multimodal3DIdent dataset builds on the following resources:

- 3DIdent dataset
- Causal3DIdent dataset
- CLEVR dataset
- Blender open-source 3D creation suite
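
Example: reading one split. To illustrate the directory layout and the index-wise correspondence described above, here is a minimal Python sketch that pairs each image path with its sentence and latent rows, plus small helpers that mirror the value mappings listed in the latent factor table. It is not taken from the data-generation repository. The function names (load_split, position_to_blender, rotation_to_blender, hue_to_rgb) are hypothetical, and the sketch assumes that both CSV files start with a header row, that the linear rotation map is increasing, and that hues are converted at full saturation and value.

```python
import colorsys
import csv
import math
from pathlib import Path


def load_split(split_dir):
    """Return a list of (image_path, sentence, image_latents, text_latents)."""
    split_dir = Path(split_dir)
    # Zero-padded file names (000000.png, 000001.png, ...) sort into index order.
    image_paths = sorted((split_dir / "images").glob("*.png"))
    with open(split_dir / "text" / "text_raw.txt", encoding="utf-8") as f:
        sentences = [line.rstrip("\n") for line in f]
    # Assumption: each CSV file begins with a header row naming its columns.
    with open(split_dir / "latents_image.csv", newline="", encoding="utf-8") as f:
        latents_image = list(csv.DictReader(f))
    with open(split_dir / "latents_text.csv", newline="", encoding="utf-8") as f:
        latents_text = list(csv.DictReader(f))
    counts = {len(image_paths), len(sentences), len(latents_image), len(latents_text)}
    if len(counts) != 1:
        raise ValueError("files in this split disagree on the number of samples")
    # Index-wise correspondence: the i-th entry of every source describes sample i.
    return list(zip(image_paths, sentences, latents_image, latents_text))


def position_to_blender(index):
    """Map a position index in {0, 1, 2} to the coordinate in {-3, 0, 3}."""
    return (-3, 0, 3)[index]


def rotation_to_blender(u):
    """Map u in [0, 1] linearly to [-pi/2, pi/2] (assumed to be increasing)."""
    return (u - 0.5) * math.pi


def hue_to_rgb(hue):
    """Convert an HSV hue in [0, 1] to RGB (assumes saturation = value = 1)."""
    return colorsys.hsv_to_rgb(hue, 1.0, 1.0)
```

A call such as load_split("train") would then return one tuple per sample, where "train" stands for whatever the split directory is named after extraction.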

Related Organizations

ETH Zurich
Switzerland

Keywords

representation learning, machine learning, causal inference, causal discovery

