FastLloyd Clustering Datasets

This artifact bundles the five dataset archives used in our private federated clustering evaluation, corresponding to the real-world benchmarks, scaling experiments, ablation studies, and timing performance tests described in the paper.

- real_datasets.tar.xz includes ten established clustering benchmarks drawn from UCI and the Clustering basic benchmark (DOI: https://doi.org/10.1007/s10489-018-1238-7).
- scale_datasets.tar.xz contains the SynthNew family, generated with the R clusterGeneration package to assess scalability.
- ablate_datasets.tar.xz holds the AblateSynth sets, which vary cluster separation for ablation analysis and are also generated with clusterGeneration.
- g2_datasets.tar.xz packages the G2 sets (Gaussian data of 2048 samples with two clusters each, across dimensions 2–1024), collected from the Clustering basic benchmark (DOI: https://doi.org/10.1007/s10489-018-1238-7).
- timing_datasets.tar.xz includes the real s1 and lsun datasets alongside the TimeSynth files (balanced synthetic clusters for timing), following Mohassel et al.’s experimental framework.

Contents

1. real_datasets.tar.xz

Contains ten real-world benchmark datasets, each formatted as one sample per line with space-separated features:

- iris.txt: 150 samples, 4 features, 3 classes; classic UCI Iris dataset of petal/sepal measurements.
- lsun.txt: 400 samples, 2 features, 3 clusters; two-dimensional variant of the LSUN dataset for clustering experiments.
- s1.txt: 5,000 samples, 2 features, 15 clusters; synthetic benchmark from Fränti’s S1 series.
- house.txt: 1,837 samples, 3 features, 3 clusters; housing data transformed for clustering tasks.
- adult.txt: 48,842 samples, 6 features, 3 clusters; UCI Census Income (“Adult”) dataset for income bracket prediction.
- wine.txt: 178 samples, 13 features, 3 cultivars; UCI Wine dataset with chemical analysis features.
- breast.txt: 569 samples, 9 features, 2 classes; Wisconsin Diagnostic Breast Cancer dataset.
- yeast.txt: 1,484 samples, 8 features, 10 localization sites; yeast protein localization data.
- mnist.txt: 10,000 samples, 784 features (28×28 pixels), 10 digit classes; MNIST handwritten digits.
- birch2.txt: a random subset of 25,000 of the 100,000 samples, 2 features, 100 clusters; synthetic BIRCH2 dataset for high-cluster-count evaluation.

2. scale_datasets.tar.xz

Holds the SynthNew_{k}_{d}_{s}.txt files for the scaling experiments, where:

- $k \in \{2,4,8,16,32\}$ is the number of clusters,
- $d \in \{2,4,8,16,32,64,128,256,512\}$ is the dimensionality,
- $s \in \{1,2,3\}$ is the random seed.

These files are generated with the R clusterGeneration package, with cluster sizes following a $1:2:\dots:k$ ratio. Each dataset includes a random number (in $[0, 100]$) of randomly sampled outliers, and its cluster separation degree is drawn randomly from $[0.16, 0.26]$, a range that spans partially overlapping to separated clusters.

3. ablate_datasets.tar.xz

Contains the AblateSynth_{k}_{d}_{sep}.txt files for the ablation studies, where:

- $k \in \{2,4,8,16\}$ is the number of clusters,
- $d \in \{2,4,8,16\}$ is the dimensionality,
- $sep \in \{0.25, 0.5, 0.75\}$ is the cluster separation degree.

These files are also generated with clusterGeneration.

4. g2_datasets.tar.xz

Packages the G2 synthetic sets (g2-{dim}-{var}.txt) from the Clustering basic benchmark:

- $N = 2048$ samples,
- $k = 2$ Gaussian clusters,
- dimensions $d \in \{1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024\}$,
- cluster overlap $var \in \{10, 20, 30, 40, 50, 60, 70, 80, 90, 100\}$.

5. timing_datasets.tar.xz

Includes:

- s1.txt, lsun.txt: two real datasets for baseline timing.
- timesynth_{k}_{d}_{n}.txt: synthetic timing datasets with balanced cluster sizes $C_{avg} = N/k$, where $k \in \{2,5\}$, $d \in \{2,5\}$, and $N \in \{10000, 100000\}$.

The synthetic timing sets are generated similarly to the scaling sets, following Mohassel et al.’s timing experiment protocol.

Usage

Unpack any archive with tar -xJf <archive>.tar.xz to access the .txt files directly for replication of the clustering experiments. Each file contains one data point per line, with features separated by spaces.
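As a minimal sketch of how the files might be consumed after unpacking, the Python snippet below reads the one-point-per-line, space-separated format and splits the parameters out of a synthetic file name. The helper names load_points and parse_synth_name are hypothetical and not part of the artifact, the two demo rows are illustrative values only, and the description above does not say whether any file carries a label column, so every column is treated as a feature here.

```python
import re


def load_points(lines):
    """Parse an iterable of text lines (e.g. an open file) into rows of floats.

    Assumes the format described above: one data point per line,
    features separated by whitespace. Blank lines are skipped.
    """
    return [[float(x) for x in line.split()] for line in lines if line.strip()]


def parse_synth_name(filename):
    """Split a synthetic file name into (family, k, d, last_field).

    The last field is the seed for SynthNew, the separation degree for
    AblateSynth, and the sample count for timesynth, following the naming
    patterns listed above.
    """
    pattern = r"(SynthNew|AblateSynth|timesynth)_(\d+)_(\d+)_([\d.]+)\.txt"
    match = re.fullmatch(pattern, filename)
    if match is None:
        raise ValueError(f"unrecognized file name: {filename}")
    family, k, d, last = match.groups()
    return family, int(k), int(d), float(last)


# Illustrative demo rows (not read from the archives).
demo_lines = ["5.1 3.5 1.4 0.2\n", "4.9 3.0 1.4 0.2\n", "\n"]
points = load_points(demo_lines)
print(len(points), len(points[0]))

print(parse_synth_name("AblateSynth_4_8_0.25.txt"))
```

For a real file, passing an open file handle works the same way, for example load_points(open("iris.txt")) after extracting real_datasets.tar.xz into the current directory (the extracted path is an assumption about the archive layout).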

Related Organizations

University of Waterloo
Canada
