ZENODO
Presentation . 2020
Data sources: Datacite

MNIST Large Scale data set

Authors: Jansson, Ylva; Lindeberg, Tony

Abstract

Motivation

The MNIST Large Scale data set is based on the classic MNIST data set but contains scale variations up to a factor of 16. The motivation behind creating this data set was to enable testing the ability of different algorithms to learn in the presence of large scale variability, and specifically the ability to generalise, over large scale ranges, to new scales not present in the training set.

The MNIST Large Scale data set was originally introduced in:

[1] Y. Jansson and T. Lindeberg (2021) "Exploring the ability of CNNs to generalise to previously unseen scales over wide scale ranges", International Conference on Pattern Recognition (ICPR 2020), pp. 1181–1188. An extended preprint, which includes additional information about data set creation, is available as arXiv:2004.01536.

A more extensive experimental description of this data set, including a published account of the details of data set creation as well as compact performance measures that can serve as benchmarks for scale generalisation performance, is given in:

[2] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, https://doi.org/10.1007/s10851-022-01082-2.

Access and rights

The data set is freely available under the condition that you reference both the original MNIST data set:

[3] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998) "Gradient-based learning applied to document recognition", Proceedings of the IEEE, 86(11), 2278–2324,

and this derived version, via either of the references [1] or [2] (preferably [2]). The data set is made available on request. If you are interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.

The dataset

The MNIST Large Scale data set is based on the classic MNIST data set [3] but contains scale variations up to a factor of 16.
The data set is created by rescaling the original MNIST images with varying scale factors and embedding each resulting image in a 112x112 image with a uniform background, followed by smoothing and soft thresholding to reduce discretisation artifacts. The details of data set creation are described in [1] and [2]. All training data sets are created from the first 50,000 examples in the original MNIST training set, while the validation data sets are created from the last 10,000 images of the original MNIST training set. The test data sets are created from the 10,000 images in the original MNIST test set.

There are three data sets (7.0 GB each) for single-scale training at three different scales (scales 1, 2 and 4), which also include test and validation data for the same scales:

mnist_large_scale_tr50000_vl10000_te10000_outsize112-112_sctr1p000_scte1p000.h5
mnist_large_scale_tr50000_vl10000_te10000_outsize112-112_sctr2p000_scte2p000.h5
mnist_large_scale_tr50000_vl10000_te10000_outsize112-112_sctr4p000_scte4p000.h5

In addition, there are 17 data sets (1.0 GB each) for testing the ability to generalise to scales not present in the training set. These data sets include scale factors 2^(k/4) with k in the range [-4, 12], i.e. spanning the scale range [1/2, 8]:

mnist_large_scale_te10000_outsize112-112_scte0p500.h5
mnist_large_scale_te10000_outsize112-112_scte0p595.h5
mnist_large_scale_te10000_outsize112-112_scte0p707.h5
mnist_large_scale_te10000_outsize112-112_scte0p841.h5
mnist_large_scale_te10000_outsize112-112_scte1p000.h5
mnist_large_scale_te10000_outsize112-112_scte1p189.h5
mnist_large_scale_te10000_outsize112-112_scte1p414.h5
mnist_large_scale_te10000_outsize112-112_scte1p682.h5
mnist_large_scale_te10000_outsize112-112_scte2p000.h5
mnist_large_scale_te10000_outsize112-112_scte2p378.h5
mnist_large_scale_te10000_outsize112-112_scte2p828.h5
mnist_large_scale_te10000_outsize112-112_scte3p364.h5
mnist_large_scale_te10000_outsize112-112_scte4p000.h5
mnist_large_scale_te10000_outsize112-112_scte4p757.h5
mnist_large_scale_te10000_outsize112-112_scte5p657.h5
mnist_large_scale_te10000_outsize112-112_scte6p727.h5
mnist_large_scale_te10000_outsize112-112_scte8p000.h5

The above data sets were used for the experiments presented in Figure 2 and Figure 4 in [1]. The numerical performance scores for a vanilla CNN and the different scale-channel architectures evaluated in the paper are given in Table I in [1].

To evaluate the ability of different algorithms to learn from data with large scale variations when only a limited number of training samples are available, there is also a data set where the training, validation and test data all span the scale range [1, 4]:

mnist_large_scale_tr50000_vl10000_te10000_outsize112-112_sctr1-4_scte1-4.h5

This data set was used for the experiment presented in Figure 5 in [1]. The numerical performance scores for a vanilla CNN and the different scale-channel architectures evaluated in [1] are given in Table III in [1]. When evaluating how the performance varies with the number of training samples for this data set, the first n samples from the training set should be used for training, while the full test set should be used for testing.
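As an illustration of the creation procedure described above, the following sketch rescales a digit and embeds it centred in a 112x112 image with a uniform background. This is not the authors' code: the function name is our own, nearest-neighbour resampling stands in for the actual interpolation, and the smoothing and soft-thresholding steps are omitted.

```python
import numpy as np

def embed_at_scale(digit, scale, out_size=112):
    """Illustrative sketch: rescale a greyscale digit by `scale`
    (nearest-neighbour) and embed it centred in an out_size x out_size
    image with a uniform (zero) background."""
    h, w = digit.shape
    nh, nw = int(round(h * scale)), int(round(w * scale))
    # Nearest-neighbour resampling via integer index maps.
    rows = np.clip((np.arange(nh) / scale).astype(int), 0, h - 1)
    cols = np.clip((np.arange(nw) / scale).astype(int), 0, w - 1)
    scaled = digit[np.ix_(rows, cols)]
    # Centre the rescaled digit on a uniform background.
    canvas = np.zeros((out_size, out_size), dtype=np.float32)
    r0, c0 = (out_size - nh) // 2, (out_size - nw) // 2
    canvas[r0:r0 + nh, c0:c0 + nw] = scaled
    return canvas

digit = np.zeros((28, 28), dtype=np.float32)
digit[10:18, 10:18] = 1.0           # toy stand-in for an MNIST digit
img = embed_at_scale(digit, 4.0)
print(img.shape)                    # (112, 112)
```

In the released data set this rescaling is followed by smoothing and soft thresholding to reduce discretisation artifacts, as described in [1] and [2].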
Instructions for loading the data set

The data set is saved in HDF5 format. The four training data sets are stored as six partitions in the respective HDF5 files ("/x_train", "/x_val", "/x_test", "/y_train", "/y_val", "/y_test") and can be loaded in Python as follows:

import h5py
import numpy as np

with h5py.File(<filename>, 'r') as f:
    x_train = np.array(f["/x_train"], dtype=np.float32)
    x_val = np.array(f["/x_val"], dtype=np.float32)
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_train = np.array(f["/y_train"], dtype=np.int32)
    y_val = np.array(f["/y_val"], dtype=np.int32)
    y_test = np.array(f["/y_test"], dtype=np.int32)

or in Matlab as:

x_train = h5read(<filename>, '/x_train');
x_val = h5read(<filename>, '/x_val');
x_test = h5read(<filename>, '/x_test');
y_train = h5read(<filename>, '/y_train');
y_val = h5read(<filename>, '/y_val');
y_test = h5read(<filename>, '/y_test');

The 17 test data sets can be loaded in Python as:

with h5py.File(<filename>, 'r') as f:
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_test = np.array(f["/y_test"], dtype=np.int32)

or in Matlab as:

x_test = h5read(<filename>, '/x_test');
y_test = h5read(<filename>, '/y_test');

(The test data sets additionally contain a single train and validation sample in the "/x_train", "/x_val", "/y_train" and "/y_val" partitions, to enable compatibility with code that always loads all three splits. This sample is not intended to be used.)

Note that the greyscale images are stored in the HDF5 files using row-major (C-style) order, i.e. as [n_samples, xdim, ydim, n_channels], where the size of the channel dimension is 1. For convenience, we also provide a Jupyter notebook and a Matlab script for loading and inspecting the data sets at https://github.com/spacemir/MNISTLargeScaleDataset.
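Following the evaluation convention above (first n training samples, full test set), a minimal sketch of the subset selection and a sanity check of the stored layout. The array names match the loading snippet, but the zero-filled arrays here merely stand in for data loaded from one of the HDF5 files:

```python
import numpy as np

# Stand-ins for arrays loaded from one of the training HDF5 files.
x_train = np.zeros((50000, 112, 112, 1), dtype=np.float32)
y_train = np.zeros((50000,), dtype=np.int32)

# Sanity-check the [n_samples, xdim, ydim, n_channels] layout.
assert x_train.shape[1:] == (112, 112, 1)
assert len(x_train) == len(y_train)

# When studying performance as a function of the number of training
# samples, use the first n samples for training (the full test set
# is always used for testing).
n = 10000
x_train_n, y_train_n = x_train[:n], y_train[:n]
print(x_train_n.shape)   # (10000, 112, 112, 1)
```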

References

[1] Y. Jansson and T. Lindeberg (2021) "Exploring the ability of CNNs to generalise to previously unseen scales over wide scale ranges", International Conference on Pattern Recognition (ICPR 2020), pp. 1181–1188. Extended preprint: arXiv:2004.01536.
[2] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, https://doi.org/10.1007/s10851-022-01082-2.
[3] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998) "Gradient-based learning applied to document recognition", Proceedings of the IEEE, 86(11), 2278–2324.

Keywords

Invariant neural networks, Scale invariance, Convolutional neural networks, MNIST

Metrics (provided by BIP! and OpenAIRE UsageCounts)

Citations (from selected sources): 2
Popularity: Average
Influence: Average
Impulse: Average
Views: 192
Downloads: 60