ZENODO
Model · 2026
License: CC BY
Data sources: Datacite

# MLNX ML Model for Predicting Job Placement Failures in Datacenter Clusters

**Authors:** Patras, Alexandros; Syrivelis, Dimitris; Terzenidis, Nikolaos

## Abstract

### Overview

This repository contains a trained binary classification model, exported to ONNX, that predicts whether a submitted job will fail or run successfully, given:

- the current state of a simulated datacenter cluster, and
- the resource request of an incoming job.

The model was developed within the MLSysOps research project and is intended for offline analysis, benchmarking, and integration into scheduling or admission-control pipelines.

### Problem Statement

Modern large-scale clusters must decide whether to admit a job under uncertainty. Poor placement decisions can lead to job failures, even when aggregate resources appear sufficient. In this work, a job failure can occur due to two distinct causes:

1. **Insufficient compute resources (servers).** If the cluster does not have enough free servers to satisfy the job request, failure can be determined through a simple availability check.
2. **Insufficient or infeasible network connectivity (uplinks).** Even when the total number of uplinks appears sufficient, the job may still fail because the required connectivity cannot be realized.

The latter case arises from the presence of a reconfigurable optical circuit switch (OCS) interconnecting leaf switches. Although OCS-based fabrics provide high bandwidth and flexibility, they introduce topological and temporal constraints: not all feasible matchings between leaf switches can be realized simultaneously, and reconfiguration constraints may prevent forming the necessary end-to-end paths. As a result, uplink feasibility is not a simple counting problem but a combinatorial one that depends on:

- the current circuit configuration,
- contention with existing jobs, and
- connectivity constraints imposed by the optical fabric.

**Goal:** The model learns to predict whether a job will fail due to either compute insufficiency or network infeasibility, based on a snapshot of the cluster state and the job request.
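The counting-vs-matching distinction can be illustrated with a toy sketch. This is not the project's simulator: the function names are hypothetical, and the circuit check below captures only the one-circuit-per-leaf-port constraint, not reconfiguration timing.

```python
def servers_feasible(free_servers: int, requested: int) -> bool:
    # Compute feasibility is a simple availability check: a counting problem.
    return requested <= free_servers

def circuits_feasible(busy_leaves: set, requested_pairs: list) -> bool:
    # Toy OCS model: each leaf-switch port carries at most one circuit,
    # so the requested circuits must form a matching over free leaves.
    used = set(busy_leaves)
    for a, b in requested_pairs:
        if a in used or b in used:
            return False
        used.update((a, b))
    return True

# 4 leaves (0..3); leaves 1 and 2 already hold a circuit.
busy = {1, 2}
# Aggregate view: 2 leaves free and the job needs 2 endpoints, so a pure
# count looks feasible. But the job needs circuits (0,1) and (2,3), and
# both touch busy leaves, so no valid matching exists.
print(servers_feasible(free_servers=10, requested=4))   # True
print(circuits_feasible(busy, [(0, 1), (2, 3)]))        # False
print(circuits_feasible(busy, [(0, 3)]))                # True
```

The last two calls show why uplink feasibility cannot be reduced to counting free ports: the same number of free leaves admits one request and rejects another.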
## Dataset

The model was trained and evaluated using a large-scale simulated dataset of job placement attempts.

📎 Dataset repository (Zenodo): https://zenodo.org/records/18485585

The dataset repository provides detailed system context, feature descriptions, ground-truth label semantics, statistical summaries, and usage examples.

⚠️ The dataset is released separately and is required to reproduce training or evaluation results.

## Model Summary

- Task: Binary classification (job failure prediction)
- Framework: PyTorch
- Training orchestration: Ray Train / Ray Tune
- Export format: ONNX
- Inference backend: ONNX Runtime

The model consumes tabular features plus fixed-length vectors describing cluster utilization. Although the dataset distinguishes between different failure causes, the released model produces a binary output: `not failed` or `failed`.

## Inputs and Preprocessing

The model expects:

- scalar numeric features describing cluster utilization and fragmentation,
- fixed-length vector features representing server and uplink utilization.

All preprocessing steps are defined in `bundle.json`, including:

- feature column order,
- normalization parameters (StandardScaler),
- vector dimensions.

⚠️ `bundle.json` must always be treated as the authoritative source of truth for model inputs.

## Quick Start

### Prerequisites

Install the required Python dependencies:

```bash
pip install numpy pandas pyarrow onnxruntime
```

or

```bash
pip install -r requirements.txt
```

### Basic Usage

The `src/inference_runtime.py` script loads the ONNX model and preprocessing bundle, reads rows from a Parquet file, and outputs predictions.
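For intuition, the preprocessing the script applies from `bundle.json` amounts to ordering columns as the model expects and applying StandardScaler normalization. The sketch below is illustrative only: the key names (`feature_order`, `mean`, `scale`) and feature names are assumptions, not the real schema, which is defined by `bundle.json` in the repository.

```python
import json

# Hypothetical bundle contents; the real key names and features are
# defined by bundle.json in the repository and may differ.
bundle = json.loads("""{
  "feature_order": ["free_servers", "frag_score"],
  "mean":  [512.0, 0.5],
  "scale": [128.0, 0.2]
}""")

def preprocess(row: dict) -> list:
    # Select columns in the model's expected order, then apply
    # StandardScaler normalization: z = (x - mean) / scale.
    xs = [row[name] for name in bundle["feature_order"]]
    return [(x - m) / s for x, m, s in zip(xs, bundle["mean"], bundle["scale"])]

print(preprocess({"free_servers": 640.0, "frag_score": 0.7}))  # approx [1.0, 1.0]
```

Because the model was trained on normalized inputs, skipping this step or reordering columns silently produces meaningless predictions, which is why `bundle.json` is the source of truth.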
#### Run inference on the first 1000 rows

```bash
python src/inference_runtime.py \
  --onnx model/model.onnx \
  --bundle model/bundle.json \
  --parquet model/data.parquet \
  --n 1000
```

Output format (per row):

```
0 not failed proba=0.023456
1 failed     proba=0.987654
2 not failed proba=0.012345
```

#### Evaluate metrics (if ground-truth labels are available)

If your Parquet file includes the ground-truth label column, you can compute evaluation metrics:

```bash
python src/inference_runtime.py \
  --onnx model/model.onnx \
  --bundle model/bundle.json \
  --parquet model/data.parquet \
  --n 1000 \
  --label_col l1_failed
```

Additional output:

```
Metrics on loaded rows: accuracy=0.925980 precision=0.933392 recall=0.910949 f1=0.922034
```

### Command-Line Arguments

The inference script (`inference_runtime.py`) supports the following command-line arguments:

| Argument | Required | Default | Description |
|---|---|---|---|
| `--onnx` | Yes | (none) | Path to the `model.onnx` file |
| `--bundle` | Yes | (none) | Path to `bundle.json` containing preprocessing metadata |
| `--parquet` | Yes | (none) | Path to the input Parquet file |
| `--n` | No | 1000 | Number of rows to load from the Parquet file |
| `--label_col` | No | None | Name of the ground-truth label column (used only for metrics) |

ℹ️ If `--label_col` is not provided, the script performs inference only and does not compute evaluation metrics.

### Notes

The exact feature column order and normalization parameters are stored in `bundle.json`.

## Model Constraints

The released model is subject to several explicit constraints that must be respected for correct and meaningful use.

### Fixed Input Schema

The model expects a fixed set of input features:

- scalar numeric features,
- a server utilization bitmap of fixed length,
- a leaf-switch utilization vector of fixed length.

The exact feature order, normalization parameters, and vector lengths are defined in `bundle.json`.

### Fixed Cluster Topology Assumption

The model is trained assuming a specific cluster architecture:

- 32 Scalable Units (SUs),
- 32 servers per SU (1024 total servers),
- 8 leaf switches per SU (256 total leaf uplinks).
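Given these fixed dimensions, inputs can be sanity-checked before inference. A minimal sketch (the `validate_vectors` helper is hypothetical, not part of the repository; the constants come from the topology listed above):

```python
# Fixed topology constants from the model card.
NUM_SUS = 32
SERVERS_PER_SU = 32
LEAVES_PER_SU = 8

SERVER_VEC_LEN = NUM_SUS * SERVERS_PER_SU   # 1024
UPLINK_VEC_LEN = NUM_SUS * LEAVES_PER_SU    # 256

def validate_vectors(server_bitmap, uplink_vector):
    # Reject inputs whose lengths do not match the training topology;
    # the model's input vectors have fixed dimensions.
    if len(server_bitmap) != SERVER_VEC_LEN:
        raise ValueError(f"server bitmap must have {SERVER_VEC_LEN} entries")
    if len(uplink_vector) != UPLINK_VEC_LEN:
        raise ValueError(f"uplink vector must have {UPLINK_VEC_LEN} entries")

validate_vectors([0] * 1024, [0.0] * 256)   # passes silently
```

A length check like this catches topology mismatches early, before they surface as shape errors or, worse, silently wrong predictions.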
The server and uplink vectors are not dynamically resizable. Applying the model to clusters with:

- different numbers of servers,
- different SU layouts, or
- different network topologies

requires retraining or careful feature remapping and validation.

### Binary Output Only

Although the dataset distinguishes between server-related failures and uplink-related failures, the released model produces a binary output only: `failed` or `not failed`. The model does not indicate why a failure is predicted.

### Probabilistic Predictions

The model outputs a probability of failure, not a deterministic decision. The default classification threshold is 0.5, but:

- different operational settings may require different thresholds, and
- threshold tuning should consider false-positive vs. false-negative trade-offs.

Predictions should be interpreted as risk estimates, not guarantees. The model is intended to be used as a decision-support component, not as a standalone scheduler.

⚠️ Users integrating this model into larger systems should ensure that all constraints above are satisfied and validated before relying on predictions in operational workflows.

## Citation

If you use this model, please cite it using the Zenodo DOI.

## Acknowledgements & Funding

This work is part of the MLSysOps project, funded by the European Union's Horizon Europe research and innovation programme under grant agreement No. 101092912. More information about the project is available at https://mlsysops.eu/

## Keywords

Machine Learning, Job Failure Prediction, Cloud Computing

BIP! impact indicators: selected citations: 0 · popularity: Average · influence: Average · impulse: Average