ZENODO
Dataset . 2025
License: CC BY
Data sources: ZENODO, Datacite

FastLloyd Clustering Datasets

Authors: Diaa, Abdulrahman; Humphries, Thomas; Kerschbaum, Florian

Abstract

This artifact bundles the five dataset archives used in our private federated clustering evaluation, corresponding to the real-world benchmarks, scaling experiments, ablation studies, and timing performance tests described in the paper. real_datasets.tar.xz includes ten established clustering benchmarks drawn from UCI and the Clustering basic benchmark (DOI: https://doi.org/10.1007/s10489-018-1238-7). scale_datasets.tar.xz contains the SynthNew family, generated with the R clusterGeneration package to assess scalability. ablate_datasets.tar.xz holds the AblateSynth sets, which vary cluster separation for ablation analysis and are also generated with clusterGeneration. g2_datasets.tar.xz packages the G2 sets (two Gaussian clusters per set, 2048 samples, dimensions 2–1024), collected from the Clustering basic benchmark (DOI: https://doi.org/10.1007/s10489-018-1238-7). timing_datasets.tar.xz includes the real s1 and lsun datasets alongside the TimeSynth files (balanced synthetic clusters for timing), following Mohassel et al.’s experimental framework.

Contents

1. real_datasets.tar.xz

Contains ten real-world benchmark datasets, formatted as one sample per line with space-separated features:

  • iris.txt: 150 samples, 4 features, 3 classes; classic UCI Iris dataset of petal/sepal measurements.
  • lsun.txt: 400 samples, 2 features, 3 clusters; two-dimensional variant of the LSUN dataset for clustering experiments.
  • s1.txt: 5,000 samples, 2 features, 15 clusters; synthetic benchmark from Fränti’s S1 series.
  • house.txt: 1,837 samples, 3 features, 3 clusters; housing data transformed for clustering tasks.
  • adult.txt: 48,842 samples, 6 features, 3 clusters; UCI Census Income (“Adult”) dataset for income bracket prediction.
  • wine.txt: 178 samples, 13 features, 3 cultivars; UCI Wine dataset with chemical analysis features.
  • breast.txt: 569 samples, 9 features, 2 classes; Wisconsin Diagnostic Breast Cancer dataset.
  • yeast.txt: 1,484 samples, 8 features, 10 localization sites; yeast protein localization data.
  • mnist.txt: 10,000 samples, 784 features (28×28 pixels), 10 digit classes; MNIST handwritten digits.
  • birch2.txt: a random subset of 25,000 of the 100,000 samples, 2 features, 100 clusters; synthetic BIRCH2 dataset for high-cluster-count evaluation.

2. scale_datasets.tar.xz

Holds the SynthNew_{k}_{d}_{s}.txt files for scaling experiments, where:

  • $k \in \{2,4,8,16,32\}$ is the number of clusters,
  • $d \in \{2,4,8,16,32,64,128,256,512\}$ is the dimensionality,
  • $s \in \{1,2,3\}$ is the random seed.

These are generated with the R clusterGeneration package, with cluster sizes following a $1:2:...:k$ ratio. We add a random number (in $[0, 100]$) of randomly sampled outliers and draw the cluster separation degree randomly from $[0.16, 0.26]$, spanning partially overlapping to separated clusters.

3. ablate_datasets.tar.xz

Contains the AblateSynth_{k}_{d}_{sep}.txt files for ablation studies, with:

  • $k \in \{2,4,8,16\}$ clusters,
  • $d \in \{2,4,8,16\}$ dimensions,
  • $sep \in \{0.25, 0.5, 0.75\}$ controlling the cluster separation degree.

Also generated via clusterGeneration.

4. g2_datasets.tar.xz

Packages the G2 synthetic sets (g2-{dim}-{var}.txt) from the clustering-data benchmarks:

  • $N=2048$ samples,
  • $k=2$ Gaussian clusters,
  • dimensions $d \in \{1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024\}$,
  • cluster overlap $var \in \{10, 20, 30, 40, 50, 60, 70, 80, 90, 100\}$.

5. timing_datasets.tar.xz

Includes:

  • s1.txt, lsun.txt: two real datasets for baseline timing.
  • timesynth_{k}_{d}_{n}.txt: synthetic timing datasets with balanced cluster sizes $C_{avg} = N/k$, varying:
      • $k \in \{2,5\}$
      • $d \in \{2,5\}$
      • $N \in \{10000, 100000\}$

Generated similarly to the scaling sets, following Mohassel et al.’s timing experiment protocol.

Usage

Unpack any archive with tar -xJf <archive>.tar.xz (for example, tar -xJf real_datasets.tar.xz) to access the .txt files directly for replication of the clustering experiments. Each file contains one data point per line, with features separated by spaces.
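
The usage notes above can be sketched in Python using only the standard library. This is a minimal illustration, not code shipped with the artifact: the function names extract_archive, load_points, and parse_synthnew_name are hypothetical, and the parser assumes the file name fields appear in the order given by the SynthNew_{k}_{d}_{s}.txt pattern. The description does not say whether any file carries a label column, so every whitespace-separated value is treated as a feature.

```python
import re
import tarfile
from pathlib import Path


def extract_archive(archive_path, out_dir):
    """Unpack a .tar.xz archive, equivalent to `tar -xJf` on the command line."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:xz") as tar:
        tar.extractall(out_dir)
    return out_dir


def load_points(txt_path):
    """Read one data point per line, features separated by whitespace.

    Returns a list of rows, each a list of floats. Blank lines are skipped.
    Assumption: every value on a line is a feature (no label column).
    """
    points = []
    with open(txt_path, encoding="utf-8") as fh:
        for line in fh:
            fields = line.split()
            if fields:
                points.append([float(x) for x in fields])
    return points


# Hypothetical helper: matches names such as SynthNew_8_16_2.txt.
_SYNTHNEW_RE = re.compile(r"^SynthNew_(\d+)_(\d+)_(\d+)\.txt$")


def parse_synthnew_name(filename):
    """Recover (k, d, seed) from a SynthNew_{k}_{d}_{s}.txt file name."""
    match = _SYNTHNEW_RE.match(Path(filename).name)
    if match is None:
        raise ValueError(f"not a SynthNew file name: {filename}")
    k, d, seed = (int(g) for g in match.groups())
    return {"k": k, "d": d, "seed": seed}
```

A typical call sequence would be extract_archive("scale_datasets.tar.xz", "data"), then load_points on a file found under the output directory, checking that the row length matches the d value returned by parse_synthnew_name. The directory layout inside each archive is not described here, so locating the .txt files (for example with Path.rglob) is left to the reader.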
