
This artifact bundles the five dataset archives used in our private federated clustering evaluation, corresponding to the real-world benchmarks, scaling experiments, ablation studies, and timing performance tests described in the paper. real_datasets.tar.xz includes ten established clustering benchmarks drawn from UCI and the Clustering basic benchmark (DOI: https://doi.org/10.1007/s10489-018-1238-7). scale_datasets.tar.xz contains the SynthNew family, generated with the R clusterGeneration package to assess scalability. ablate_datasets.tar.xz holds the AblateSynth sets, which vary cluster separation for the ablation analysis and are also generated with clusterGeneration. g2_datasets.tar.xz packages the G2 sets: Gaussian data of size 2048 across dimensions 2–1024 with two clusters each, collected from the Clustering basic benchmark (DOI: https://doi.org/10.1007/s10489-018-1238-7). timing_datasets.tar.xz includes the real s1 and lsun datasets alongside the TimeSynth files (balanced synthetic clusters for timing), following Mohassel et al.’s experimental framework.

Contents

1. real_datasets.tar.xz

Contains ten real-world benchmark datasets, formatted as one sample per line with space-separated features:

- iris.txt: 150 samples, 4 features, 3 classes; classic UCI Iris dataset of petal/sepal measurements.
- lsun.txt: 400 samples, 2 features, 3 clusters; two-dimensional variant of the LSUN dataset for clustering experiments.
- s1.txt: 5,000 samples, 2 features, 15 clusters; synthetic benchmark from Fränti’s S1 series.
- house.txt: 1,837 samples, 3 features, 3 clusters; housing data transformed for clustering tasks.
- adult.txt: 48,842 samples, 6 features, 3 clusters; UCI Census Income (“Adult”) dataset for income bracket prediction.
- wine.txt: 178 samples, 13 features, 3 cultivars; UCI Wine dataset with chemical analysis features.
- breast.txt: 569 samples, 9 features, 2 classes; Wisconsin Diagnostic Breast Cancer dataset.
- yeast.txt: 1,484 samples, 8 features, 10 localization sites; yeast protein localization data.
- mnist.txt: 10,000 samples, 784 features (28×28 pixels), 10 digit classes; MNIST handwritten digits.
- birch2.txt: a random subset of 25,000 of the 100,000 samples, 2 features, 100 clusters; synthetic BIRCH2 dataset for high-cluster-count evaluation.

2. scale_datasets.tar.xz

Holds the SynthNew_{k}_{d}_{s}.txt files for the scaling experiments, where:

- $k \in \{2,4,8,16,32\}$ is the number of clusters,
- $d \in \{2,4,8,16,32,64,128,256,512\}$ is the dimensionality,
- $s \in \{1,2,3\}$ indexes the random seed.

These files are generated with the R clusterGeneration package, with cluster sizes following a $1:2:\dots:k$ ratio. Each dataset includes a random number (in $[0, 100]$) of randomly sampled outliers, and the cluster separation degree is drawn randomly from $[0.16, 0.26]$, spanning partially overlapping to separated clusters.

3. ablate_datasets.tar.xz

Contains the AblateSynth_{k}_{d}_{sep}.txt files for the ablation studies, with:

- $k \in \{2,4,8,16\}$ clusters,
- $d \in \{2,4,8,16\}$ dimensions,
- $sep \in \{0.25, 0.5, 0.75\}$ controlling the cluster separation degree.

These files are also generated with clusterGeneration.

4. g2_datasets.tar.xz

Packages the G2 synthetic sets (g2-{dim}-{var}.txt) from the Clustering basic benchmark:

- $N = 2048$ samples,
- $k = 2$ Gaussian clusters,
- dimensions $d \in \{1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024\}$,
- cluster overlap $var \in \{10, 20, 30, 40, 50, 60, 70, 80, 90, 100\}$.

5. timing_datasets.tar.xz

Includes:

- s1.txt, lsun.txt: two real datasets for baseline timing.
- timesynth_{k}_{d}_{n}.txt: synthetic timing datasets with balanced cluster sizes $C_{avg} = N/k$, varying $k \in \{2,5\}$, $d \in \{2,5\}$, and $N \in \{10000, 100000\}$. These are generated similarly to the scaling sets, following Mohassel et al.’s timing experiment protocol.

Usage

Unpack any archive with tar -xJf <archive>.tar.xz (for example, tar -xJf real_datasets.tar.xz) to access the .txt files directly for replication of the clustering experiments. Each file contains one data point per line, with features separated by spaces.
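
As an illustration of the file format, the following is a minimal Python sketch for reading an unpacked .txt file and recovering the parameters encoded in a SynthNew file name. The function names parse_points, load_points, and parse_scale_name are hypothetical helpers and not part of the artifact. The sketch assumes that the files have no header row and that every whitespace-separated value on a line is numeric; whether any column carries a ground-truth label is not stated here, so all columns are returned as-is.

```python
import re
from pathlib import Path

# File name pattern for the scaling sets, taken from the description:
# SynthNew_{k}_{d}_{s}.txt (k clusters, d dimensions, seed s).
SCALE_PATTERN = re.compile(r"^SynthNew_(\d+)_(\d+)_(\d+)\.txt$")


def parse_points(text):
    """Parse 'one data point per line, space-separated' text into rows of floats.

    Assumption: no header line; blank lines are skipped.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        rows.append([float(value) for value in line.split()])
    # Every data point should have the same number of values.
    if rows and any(len(row) != len(rows[0]) for row in rows):
        raise ValueError("rows have inconsistent numbers of values")
    return rows


def load_points(path):
    """Read an unpacked dataset file from disk (hypothetical helper)."""
    return parse_points(Path(path).read_text())


def parse_scale_name(filename):
    """Extract k, d, and the seed from a SynthNew_{k}_{d}_{s}.txt file name."""
    match = SCALE_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not a SynthNew file name: {filename!r}")
    k, d, seed = (int(group) for group in match.groups())
    return {"k": k, "d": d, "seed": seed}
```

For instance, after unpacking real_datasets.tar.xz, load_points("iris.txt") would return one list of floats per line of the file, under the assumptions stated above.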
