MLNX Job Placement Failure Dataset for Simulated Datacenter Clusters with Reconfigurable Optical Networks

📌 Overview

This dataset contains cluster-level snapshots and job placement outcomes generated in a simulated large-scale datacenter environment. The data is intended for training and evaluating machine learning models that predict whether a job submission will succeed or fail given the current cluster state and the job's resource request.

The dataset was produced as part of the MLSysOps project (EU Horizon Europe) and supports research on:

- job admission control,
- failure prediction,
- resource fragmentation, and
- network feasibility in modern datacenter architectures.

Each data sample represents a single scheduling decision and includes both:

- detailed cluster state features, and
- the observed outcome of the placement attempt.

🏢 System Context

Simulated Datacenter Architecture

The dataset is generated using a proprietary datacenter simulator that models a hierarchical cluster composed of Scalable Units (SUs).

Cluster configuration:

- 32 Scalable Units (SUs)
- 32 servers per SU (1024 servers total)
- 8 leaf switches per SU
- 8 GPUs per server
- Leaf switches interconnected via a reconfigurable optical circuit switch (OCS)

Failure Modes Captured

Each job placement attempt can result in:

- Successful placement
- Failure due to insufficient servers
- Failure due to insufficient or infeasible uplink connectivity

While server insufficiency can be determined via simple capacity checks, uplink infeasibility is more complex, as it depends on:

- current optical circuit configurations,
- contention between jobs, and
- connectivity constraints of the OCS fabric.

The dataset explicitly captures these outcomes to support learning-based approaches for failure prediction.

📂 Dataset Structure

Format: Apache Parquet
Granularity: One row per scheduling decision

Each row contains:

- Job request features
- Cluster state features (scalar + vector)
- Ground-truth placement outcome label

Rows are treated as independent samples.
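As a rough illustration of the cluster dimensions and of the "simple capacity check" mentioned under Failure Modes Captured, the sketch below derives the totals from the stated configuration and tests whether a request fits into the free servers. The function name `has_enough_servers` is hypothetical, and treating the check as a cluster-wide count of free servers (with one requested node equal to one server) is an assumption; the simulator's actual logic is not published.

```python
# Topology constants taken from the cluster configuration described above.
NUM_SUS = 32
SERVERS_PER_SU = 32
LEAVES_PER_SU = 8
GPUS_PER_SERVER = 8

# Derived totals.
TOTAL_SERVERS = NUM_SUS * SERVERS_PER_SU       # 32 * 32
TOTAL_LEAVES = NUM_SUS * LEAVES_PER_SU         # 32 * 8
TOTAL_GPUS = TOTAL_SERVERS * GPUS_PER_SERVER   # 1024 * 8


def has_enough_servers(servers_in_use: int, requested_nodes: int) -> bool:
    """Hypothetical cluster-wide capacity check (not the simulator's code).

    Assumes one requested node corresponds to one server.
    """
    free_servers = TOTAL_SERVERS - servers_in_use
    return requested_nodes <= free_servers


print(TOTAL_SERVERS, TOTAL_LEAVES, TOTAL_GPUS)
print(has_enough_servers(servers_in_use=1000, requested_nodes=16))
print(has_enough_servers(servers_in_use=1000, requested_nodes=32))
```

The derived totals of 1024 servers and 256 leaf switches match the lengths of the two vector features listed under Feature Description. A check of this kind says nothing about uplink feasibility, which is the harder case the dataset is meant to help predict.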
🏷️ Ground-Truth Labels

The dataset includes a label column (named l1_failed in the statistical summary) encoding the observed outcome of the job placement:

Value  Meaning
0      Job placement succeeded
1      Job placement failed due to insufficient servers
2      Job placement failed due to insufficient uplinks / infeasible network connectivity

Notes: Labels 1 and 2 both indicate job failure, but with different root causes. This encoding allows:

- binary failure prediction,
- failure cause analysis, and
- future multi-class modeling.

📊 Feature Description

Scalar Cluster Features

These features summarize utilization, imbalance, and fragmentation across the cluster:

Column                         Description
f1_event_type                  The recorded event: add, failed_server, failed_uplink
f2_mean_util                   Mean server utilization
f3_diff_max_min_util           Utilization imbalance across SUs
f4_cv_util                     Coefficient of variation of server utilization
f5_ratio_max_to_mean_workload  Workload skew across SUs
f6_mean_uplink_util            Mean uplink utilization
f7_diff_max_min_uplink_util    Uplink utilization imbalance
f8_cv_uplink_util              Coefficient of variation of uplink utilization
f9_mean_combined_util          Combined compute and network utilization
f10_resource_imbalance         Compute vs network mismatch
f11_bottleneck_ratio           Network-to-compute utilization ratio
f12_frag_spread_sus            Fragmentation due to SU spread
f13_frag_wasted                Fragmentation due to wasted capacity
f14_frag_su_sparseness         Intra-SU sparseness
f15_total_servers_used         Total servers in use
f16_total_sus_used             Number of active SUs
f17_total_uplink_utilized      Total uplink usage

Vector Features

Feature               Description
f18_su_server_bitmap  Binary vector (length 1024) indicating per-server usage
f19_leaf_up           Vector (length 256) indicating leaf switch uplink utilization

Job Request Feature

Column               Type         Description
f20_requested_nodes  int / float  Number of nodes requested by the job

🧪 Data Collection Methodology

Environment: Simulated datacenter
Workloads: Synthetic job traces with varying sizes and arrival patterns
Placement policy: Simulator-internal scheduling logic
Labeling: Determined by placement outcome (success or failure cause)

The simulator executes job placement attempts under varying load, fragmentation, and network conditions to generate diverse training examples.

⚠️ The simulator itself is not publicly released. Only the resulting dataset is provided.

📊 Statistical Summary

The dataset contains a total of 1,062,943 rows, each corresponding to a single job placement attempt in the simulated cluster. The table below summarizes the distribution of all numeric columns, including the ground-truth label.

Column Summary and Data Types

Descriptor                     Type     Count      Mean     Std     Min     25%     50% (Median)  75%     Max
l1_failed                      int32    1,062,943  0.6198   0.7127  0.0     0.0     0.0           1.0     2.0
f2_mean_util                   float32  1,062,943  0.9129   0.1089  0.0078  0.8906  0.9404        0.9717  1.0
f3_diff_max_min_util           float32  1,062,943  0.5843   0.3332  0.0     0.2813  0.5625        1.0     1.0
f4_cv_util                     float32  1,062,943  0.1982   0.3047  0.0     0.0712  0.1446        0.2500  5.5678
f5_ratio_max_to_mean_workload  float32  1,062,943  1.1865   1.2299  1.0     1.0291  1.0633        1.1228  32.0
f6_mean_uplink_util            float32  1,062,943  0.5637   0.1156  0.0     0.5195  0.5840        0.6367  0.9598
f7_diff_max_min_uplink_util    float32  1,062,943  0.9016   0.1717  0.0     0.8125  0.9063        0.9688  2.0
f8_cv_uplink_util              float32  1,062,943  0.4681   0.3022  0.0     0.3526  0.4258        0.5076  3.8730
f9_mean_combined_util          float32  1,062,943  0.7383   0.1002  0.0039  0.7139  0.7563        0.7915  0.9716
f10_resource_imbalance         float32  1,062,943  0.3493   0.1010  0.0001  0.2803  0.3350        0.4043  0.8926
f11_bottleneck_ratio           float32  1,062,943  0.6137   0.1137  0.0     0.5636  0.6345        0.6894  1.7378
f12_frag_spread_sus            float32  1,062,943  1.0713   0.0818  1.0     1.0261  1.0524        1.0922  4.0
f13_frag_wasted                float32  1,062,943  0.0713   0.0818  0.0     0.0261  0.0524        0.0922  3.0
f14_frag_su_sparseness         float32  1,062,943  0.0177   0.0176  0.0     0.0065  0.0135        0.0235  0.2589
f15_total_servers_used         int64    1,062,943  934.86   111.47  8       912     963           995     1024
f16_total_sus_used             int64    1,062,943  31.11    3.10    1       31      32            32      32
f17_total_uplink_utilized      int64    1,062,943  4617.66  946.63  0       4256    4784          5216    7863
f20_requested_nodes            int64    1,062,943  54.96    38.77   8       20      44            87      128

Ground-truth label distribution note: The l1_failed column encodes job outcomes as:

- 0: success
- 1: failure due to insufficient servers
- 2: failure due to insufficient uplinks / infeasible connectivity

Both 1 and 2 correspond to job failures.

🧰 Working with the Data

Loading the Dataset (Python)

    import pandas as pd

    # Read the full Parquet file into a DataFrame and preview the first rows.
    df = pd.read_parquet("final_merged.parquet")
    print(df.head())

Loading Selected Columns

    import pandas as pd

    # Read only the listed columns, which skips the large vector features.
    cols = ["f20_requested_nodes", "f2_mean_util", "l1_failed"]
    df = pd.read_parquet("final_merged.parquet", columns=cols)

Tools and Documentation

- Apache Parquet specification: https://parquet.apache.org/docs
- Pandas Parquet I/O: https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html
- PyArrow Parquet support: https://arrow.apache.org/docs/python/parquet.html

🎯 Intended Use

This dataset is intended for:

- machine learning research on job failure prediction,
- benchmarking admission-control models,
- studying resource fragmentation and network feasibility,
- offline evaluation of scheduling heuristics.

It is not intended to represent any specific production datacenter.

⚠️ Limitations

- Data is generated from a simulator, not a production system.
- The cluster topology is fixed and may not generalize to other architectures.
- Temporal dependencies between jobs are not explicitly modeled.
- Network behavior is abstracted and may differ from real optical fabrics.

📜 Citation

If you use this dataset in your research, please cite it using the citation provided by Zenodo (available in the right sidebar of the dataset record).

🤝 Acknowledgements & Funding

This work is part of the MLSysOps project and is funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No. 101092912. More information: https://mlsysops.eu/
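Returning to the label encoding described under Ground-Truth Labels, the following is a minimal sketch of how the three-valued label could be collapsed into a binary failure target or mapped to a readable cause. The helper names and the cause strings are hypothetical choices for this example, and the sample label list is made up rather than taken from the dataset.

```python
from collections import Counter

# Label values as documented for the l1_failed column.
# The cause strings are hypothetical names chosen for this sketch.
CAUSES = {
    0: "success",
    1: "insufficient_servers",
    2: "insufficient_uplinks",
}


def to_binary(label: int) -> int:
    """Collapse the label to 0 (success) or 1 (any failure)."""
    if label not in CAUSES:
        raise ValueError(f"unexpected label value: {label}")
    return 0 if label == 0 else 1


def to_cause(label: int) -> str:
    """Map the label to a human-readable cause name."""
    return CAUSES[label]


# Made-up labels for illustration only, not rows from the dataset.
sample_labels = [0, 1, 2, 0, 2]

binary = [to_binary(x) for x in sample_labels]
causes = Counter(to_cause(x) for x in sample_labels)

print(binary)
print(dict(causes))
```

With pandas, the same helpers can be applied to the loaded column, for example `df["l1_failed"].map(to_binary)`, since `Series.map` accepts a function.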

Related Organizations

University Of Thessaly
Greece

Keywords

optical circuit switching, Machine Learning, datacenter simulation, job placement, datacenter scheduling, reconfigurable optical networks, Cloud Computing, cluster resource management
