
Model Training

The time series classification model was trained using a deep neural architecture based on stacked ConvLSTM1D layers. Input data consisted of time series samples of shape ($T$, $F$), where $T$ denotes the number of timesteps and $F$ the number of features. The same model architecture was used for both univariate and multivariate inputs; in the univariate case, the number of features is $F = 1$. Prior to training, the data was normalized to zero mean and unit variance using StandardScaler.

To facilitate temporal feature learning, the input sequences were divided into smaller temporal blocks. Specifically, each sequence of length $T$ was segmented into n_steps parts, where n_steps corresponds to the third smallest integer divisor of $T$ above 2. The segment length was then computed as n_length = $T$ / n_steps, resulting in a reshaped input tensor of shape (n_steps, n_length, $F$). This restructuring enables the model to capture both local patterns within segments and long-range dependencies across segments.

The model architecture includes two ConvLSTM1D layers with 64 and 32 filters, respectively, each using a kernel size of 9 and ReLU activation. A dropout layer with a rate of 0.5 follows for regularization, and the output is flattened to create an intermediate embedding representation of length $N$. This is followed by two fully connected layers: one with 100 ReLU units and another with softmax activation for classification. Class imbalance was addressed using computed class weights during training. The model was trained using the Adam optimizer and categorical cross-entropy loss for 25 epochs with a batch size of 64.

The generic architecture of the models across all datasets is presented below. The exact number of trainable parameters depends on the dataset-specific input shapes ($T$, $F$, and $C$ classes).
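The segmentation rule described above can be sketched in plain Python. This is a minimal illustration (the function name is ours); the fallback for a length $T$ with fewer than three integer divisors above 2 is an assumption, since the text does not specify that edge case:

```python
def choose_segmentation(T: int) -> tuple:
    """Split a sequence of length T into n_steps segments of length n_length.

    n_steps is the third smallest integer divisor of T above 2, as described
    in the text. The fallback to the largest available divisor (for lengths
    with fewer than three such divisors) is our assumption.
    """
    divisors = [d for d in range(3, T + 1) if T % d == 0]
    n_steps = divisors[2] if len(divisors) >= 3 else divisors[-1]
    return n_steps, T // n_steps

# Example: T = 144 has divisors 3, 4, 6, ... above 2, so n_steps = 6
n_steps, n_length = choose_segmentation(144)  # (6, 24)
```

The resulting pair feeds the Reshape layer, which turns each ($T$, $F$) sample into an (n_steps, n_length, $F$) tensor.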
Layer (type)                 Output Shape
=============================================================
reshape (Reshape)            (None, n_steps, n_length, F)
conv_lstm1d (ConvLSTM1D)     (None, n_steps, n_length, 64)
conv_lstm1d_1 (ConvLSTM1D)   (None, n_steps, n_length, 32)
dropout (Dropout)            (None, n_steps, n_length, 32)
embedding (Flatten)          (None, N)
dense (Dense)                (None, 100)
dense_1 (Dense)              (None, C)
=============================================================

The dataset was split into training and testing sets using a predefined stratified partitioning strategy: 75% of the samples were used for model training, while the remaining 25% were held out for evaluation, preserving the class distribution across both sets. Labels were one-hot encoded to support the categorical cross-entropy loss.

Post-hoc Local Explanations

The explanations were computed for both training and test subsets. All indices in the explanation files are aligned with the respective train/test instances.

Explanation coverage:
- Univariate: Anchor (31.33%), LIME (100.00%), SHAP (100.00%), PHAR (100.00%)
- Multivariate: Anchor (20.00%), LIME (95.00%), SHAP (100.00%), PHAR (100.00%)

Rule-based explanations:

Anchor: Each explanation is a list of rule sets grouped per instance, e.g.:

[[{'index': 0,
   'success': True,
   'prediction': '1',
   'rule': {'feature_1': ['>-0.74'],
            'feature_4': ['-0.74'],
            'feature_42': ['-0.74']},
   'confidence': 0.9565,
   'coverage': 0.7197}]]

Each entry corresponds to a sample index. The `rule` defines a conjunction of feature constraints satisfied by the sample. `confidence` measures the fraction of samples fulfilling the rule for which the model gives the same prediction, while `coverage` is the fraction of the train/test set satisfying the rule.

PHAR rules follow a similar pattern, here for a multivariate series:

[[{'index': 0,
   'success': True,
   'prediction': '2',
   'rule': {'var_0_ts_24': ['>-1.05', '-0.98', '-1.12', '0.15', '0.10', '-0.55', '<=-0.42'],
            ...},
   'confidence': 0.9812,
   'coverage': 0.8450}]]

In addition to the extracted rules, the PHAR archive includes two metadata files generated during its two-stage hyperparameter optimization (HPO) process via the Optuna framework:
- phar_metadata.json: Captures the single optimal configuration applied globally to extract the final rules. It details the winning parameters (such as the base explainer, threshold percentile, and perturb sigma) and provides aggregated statistics for confidence, coverage, and sparsity.
- phar_trials_log.jsonl: A JSON Lines file recording the complete history of all multi-objective Pareto-front optimization trials. Each line details a single trial's tested configuration, execution time, evaluated metrics, and the intermediate rules generated for the evaluation pool.

Attribution-based explanations:

SHAP and LIME: Unlike the discrete rule records, SHAP and LIME provide continuous feature attributions. The explanations are stored as serialized NumPy arrays (.pickle files) containing float values.

Format: numpy.array of shape (M, C, T, F), where:
- M is the number of observations (instances) in the respective train or test split.
- C is the number of target classes.
- T is the length of the time series (number of timesteps).
- F is the dimensionality (number of features or channels; F = 1 for univariate series).

Each numerical value in this 4D array represents the local attribution score (importance weight) for a specific class, timestep, and channel of a given instance. A positive weight indicates a feature that contributed towards the model's predicted probability for that specific class, while a negative weight indicates a feature that pushed the prediction away from it.

Accompanying GitHub repository

The complete open-source codebase used to construct the ExplainTS benchmark, train the underlying deep learning models, and compute the post-hoc explanations is publicly hosted on GitHub.
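The rule metrics and the 4D attribution arrays defined in the explanation sections above can be post-processed with small NumPy helpers. The following sketch uses our own helper names (they are not part of the archive): one recomputes `coverage` and `confidence` from a boolean rule-satisfaction mask, and one ranks timesteps by total absolute attribution in an (M, C, T, F) array:

```python
import numpy as np

def coverage_and_confidence(rule_mask, model_preds, rule_prediction):
    """Recompute the two rule metrics described above.

    rule_mask: boolean array, True where a sample satisfies the rule.
    model_preds: the model's predicted labels for the same samples.
    rule_prediction: the label the rule is associated with.
    """
    coverage = float(rule_mask.mean())
    if not rule_mask.any():
        return coverage, 0.0
    confidence = float((model_preds[rule_mask] == rule_prediction).mean())
    return coverage, confidence

def top_timesteps(attributions, instance, cls, k=3):
    """Indices of the k timesteps with the largest total absolute
    attribution for one instance and class, given an (M, C, T, F) array."""
    per_timestep = np.abs(attributions[instance, cls]).sum(axis=-1)  # shape (T,)
    return np.argsort(per_timestep)[::-1][:k]
```

For example, `coverage_and_confidence(mask, preds, '1')` mirrors the `coverage` and `confidence` fields stored in the Anchor and PHAR records, and `top_timesteps` can be applied directly to a loaded SHAP or LIME array.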
Repository link: https://github.com/mozo64/papers/tree/main/zenodo-ucr

Code used for generation: The repository is structured to ensure full reproducibility and is organized into the following key directories:

notebooks/ — contains the core Jupyter notebooks executing the entire pipeline:
- UCR-train.ipynb: Handles data loading, preprocessing, and training of the unified ConvLSTM-based classifiers.
- UCR-explainers-lime-shap.ipynb: Executes the extraction of continuous feature attributions using DeepSHAP and LIME.
- UCR-explainers-anchor.ipynb: Manages the Anchor explainer pipeline, including the multi-stage cascade retry mechanism for continuous signals.
- UCR-explainers-phar.ipynb: Runs the PHAR rule extraction, including the Optuna-based multi-objective Pareto-front hyperparameter optimization.
- datasets_summary.ipynb: Evaluates model accuracies and aggregates dataset statistics.
- ExplainTS_CaseStudy.ipynb: An educational template demonstrating how to load the precomputed artifacts, visualize them, and calculate XAI metrics (such as the Jaccard index).

services/ — contains helper Python utilities (model_manager.py, model_server.py) used to parallelize and distribute computationally heavy extraction jobs (specifically for Anchor) across GPUs.

scripts/ — includes bash shell utilities used to validate completeness and assemble the final Zenodo archives (e.g., compress_models.sh, compress_explainers.sh, filter_move.sh, and compress_all.sh).

List of All Time Series

The following tables summarize the properties of the 20 multivariate and 83 univariate time-series datasets included in the ExplainTS benchmark. For each dataset, the tables report the number of instances in the standardized 75/25 train/test splits, sequence length, dimensionality, and the number of target classes.
Furthermore, they detail the test-set accuracy of the provided baseline ConvLSTM classifiers, along with the availability of precomputed local explanations across the four evaluated methods (LIME, DeepSHAP, Anchor, and PHAR). A "Yes" indicates that the respective explanations were successfully generated and are available in the repository archives for both the training and test sets.

Multivariate:

| Time Series | Train | Test | Len. | Dim. | Classes | Acc. (%) | LIME | SHAP | Anchor | PHAR |
|---|---|---|---|---|---|---|---|---|---|---|
| ArticularyWordRecognition | 431 | 144 | 144 | 9 | 25 | 97.22 | Yes | Yes | No | Yes |
| AtrialFibrillation | 22 | 8 | 640 | 2 | 3 | 50.00 | Yes | Yes | No | Yes |
| BasicMotions | 60 | 20 | 100 | 6 | 4 | 100.00 | Yes | Yes | No | Yes |
| Cricket | 135 | 45 | 1197 | 6 | 12 | 91.11 | Yes | Yes | No | Yes |
| Epilepsy | 206 | 69 | 206 | 3 | 4 | 91.30 | Yes | Yes | No | Yes |
| ERing | 225 | 75 | 65 | 4 | 6 | 96.00 | Yes | Yes | No | Yes |
| EthanolConcentration | 393 | 131 | 1751 | 3 | 4 | 32.82 | Yes | Yes | Yes | Yes |
| FaceDetection | 7060 | 2354 | 62 | 144 | 2 | 50.25 | No | Yes | Yes | WIP |
| FingerMovements | 312 | 104 | 50 | 28 | 2 | 56.73 | Yes | Yes | No | Yes |
| HandMovementDirection | 175 | 59 | 400 | 10 | 4 | 25.42 | Yes | Yes | No | Yes |
| Handwriting | 750 | 250 | 152 | 3 | 26 | 56.40 | Yes | Yes | No | Yes |
| Heartbeat | 306 | 103 | 405 | 61 | 2 | 72.82 | Yes | Yes | No | Yes |
| Libras | 270 | 90 | 45 | 2 | 15 | 65.56 | Yes | Yes | Yes | Yes |
| LSST | 3693 | 1232 | 36 | 6 | 14 | 22.40 | Yes | Yes | No | Yes |
| NATOPS | 270 | 90 | 51 | 24 | 6 | 86.67 | Yes | Yes | No | Yes |
| PenDigits | 8244 | 2748 | 8 | 2 | 10 | 98.62 | Yes | Yes | Yes | Yes |
| RacketSports | 227 | 76 | 30 | 6 | 4 | 80.26 | Yes | Yes | No | Yes |
| SelfRegulationSCP1 | 420 | 141 | 896 | 6 | 2 | 90.78 | Yes | Yes | No | Yes |
| SelfRegulationSCP2 | 285 | 95 | 1152 | 7 | 2 | 47.37 | Yes | Yes | No | Yes |
| UWaveGestureLibrary | 330 | 110 | 315 | 3 | 8 | 92.73 | Yes | Yes | No | Yes |

Univariate:

| Time Series | Train | Test | Len. | Dim. | Classes | Acc. (%) | LIME | SHAP | Anchor | PHAR |
|---|---|---|---|---|---|---|---|---|---|---|
| Adiac | 585 | 196 | 176 | 1 | 37 | 28.06 | Yes | Yes | No | Yes |
| Beef | 45 | 15 | 470 | 1 | 5 | 46.67 | Yes | Yes | No | Yes |
| BeetleFly | 30 | 10 | 512 | 1 | 2 | 80.00 | Yes | Yes | No | Yes |
| BirdChicken | 30 | 10 | 512 | 1 | 2 | 60.00 | Yes | Yes | No | Yes |
| BME | 135 | 45 | 128 | 1 | 3 | 91.11 | Yes | Yes | No | Yes |
| CBF | 697 | 233 | 128 | 1 | 3 | 100.00 | Yes | Yes | No | Yes |
| Chinatown | 272 | 91 | 24 | 1 | 2 | 98.90 | Yes | Yes | Yes | Yes |
| Coffee | 42 | 14 | 286 | 1 | 2 | 78.57 | Yes | Yes | No | Yes |
| Computers | 375 | 125 | 720 | 1 | 2 | 64.80 | Yes | Yes | No | Yes |
| CricketX | 585 | 195 | 300 | 1 | 12 | 64.62 | Yes | Yes | Yes | Yes |
| CricketY | 585 | 195 | 300 | 1 | 12 | 62.05 | Yes | Yes | No | Yes |
| CricketZ | 585 | 195 | 300 | 1 | 12 | 64.62 | Yes | Yes | No | Yes |
| Crop | 18000 | 6000 | 46 | 1 | 24 | 68.68 | Yes | Yes | No | Yes |
| DiatomSizeReduction | 241 | 81 | 345 | 1 | 4 | 96.30 | Yes | Yes | No | Yes |
| DistalPhalanxOutlineAgeGroup | 404 | 135 | 80 | 1 | 3 | 77.78 | Yes | Yes | Yes | Yes |
| DistalPhalanxOutlineCorrect | 657 | 219 | 80 | 1 | 2 | 77.17 | Yes | Yes | Yes | Yes |
| DistalPhalanxTW | 404 | 135 | 80 | 1 | 6 | 68.89 | Yes | Yes | No | Yes |
| DodgerLoopDay | 118 | 40 | 288 | 1 | 7 | 57.50 | Yes | Yes | Yes | Yes |
| DodgerLoopGame | 118 | 40 | 288 | 1 | 2 | 87.50 | Yes | Yes | Yes | Yes |
| DodgerLoopWeekend | 118 | 40 | 288 | 1 | 2 | 97.50 | Yes | Yes | No | Yes |
| Earthquakes | 345 | 116 | 512 | 1 | 2 | 69.83 | Yes | Yes | No | Yes |
| ECG200 | 150 | 50 | 96 | 1 | 2 | 86.00 | Yes | Yes | Yes | Yes |
| ECG5000 | 3750 | 1250 | 140 | 1 | 5 | 92.16 | Yes | Yes | Yes | Yes |
| ECGFiveDays | 663 | 221 | 136 | 1 | 2 | 99.55 | Yes | Yes | Yes | Yes |
| ElectricDevices | 12477 | 4160 | 96 | 1 | 7 | 84.69 | Yes | Yes | No | Yes |
| FaceFour | 84 | 28 | 350 | 1 | 4 | 92.86 | Yes | Yes | Yes | Yes |
| FiftyWords | 678 | 227 | 270 | 1 | 47 | 63.00 | Yes | Yes | No | Yes |
| FordA | 3690 | 1231 | 500 | 1 | 2 | 83.92 | Yes | Yes | No | Yes |
| FordB | 3334 | 1112 | 500 | 1 | 2 | 85.16 | Yes | Yes | No | Yes |
| FreezerRegularTrain | 2250 | 750 | 301 | 1 | 2 | 97.33 | Yes | Yes | No | Yes |
| FreezerSmallTrain | 2158 | 720 | 301 | 1 | 2 | 94.17 | Yes | Yes | No | Yes |
| Fungi | 153 | 51 | 201 | 1 | 18 | 3.92 | Yes | Yes | Yes | Yes |
| GunPoint | 150 | 50 | 150 | 1 | 2 | 84.00 | Yes | Yes | No | Yes |
| GunPointAgeSpan | 338 | 113 | 150 | 1 | 2 | 90.27 | Yes | Yes | Yes | Yes |
| GunPointMaleVersusFemale | 338 | 113 | 150 | 1 | 2 | 100.00 | Yes | Yes | No | Yes |
| GunPointOldVersusYoung | 338 | 113 | 150 | 1 | 2 | 100.00 | Yes | Yes | No | Yes |
| Herring | 96 | 32 | 512 | 1 | 2 | 56.25 | Yes | Yes | No | Yes |
| InsectWingbeatSound | 1650 | 550 | 256 | 1 | 11 | 65.64 | Yes | Yes | No | Yes |
| ItalyPowerDemand | 822 | 274 | 24 | 1 | 2 | 95.99 | Yes | Yes | No | Yes |
| LargeKitchenAppliances | 562 | 188 | 720 | 1 | 3 | 63.83 | Yes | Yes | No | Yes |
| Lightning2 | 90 | 31 | 637 | 1 | 2 | 54.84 | Yes | Yes | No | Yes |
| Lightning7 | 107 | 36 | 319 | 1 | 7 | 66.67 | Yes | Yes | No | Yes |
| Meat | 90 | 30 | 448 | 1 | 3 | 63.33 | Yes | Yes | No | Yes |
| MedicalImages | 855 | 286 | 99 | 1 | 9 | 65.03 | Yes | Yes | Yes | Yes |
| MiddlePhalanxOutlineAgeGroup | 415 | 139 | 80 | 1 | 3 | 78.42 | Yes | Yes | Yes | Yes |
| MiddlePhalanxOutlineCorrect | 668 | 223 | 80 | 1 | 2 | 70.40 | Yes | Yes | No | Yes |
| MiddlePhalanxTW | 414 | 139 | 80 | 1 | 6 | 64.75 | Yes | Yes | No | Yes |
| MoteStrain | 954 | 318 | 84 | 1 | 2 | 95.28 | Yes | Yes | No | Yes |
| OliveOil | 45 | 15 | 570 | 1 | 4 | 13.33 | Yes | Yes | Yes | Yes |
| OSULeaf | 331 | 111 | 427 | 1 | 6 | 60.36 | Yes | Yes | No | Yes |
| PhalangesOutlinesCorrect | 1993 | 665 | 80 | 1 | 2 | 67.82 | Yes | Yes | Yes | Yes |
| Plane | 157 | 53 | 144 | 1 | 7 | 94.34 | Yes | Yes | Yes | Yes |
| PowerCons | 270 | 90 | 144 | 1 | 2 | 100.00 | Yes | Yes | Yes | Yes |
| ProximalPhalanxOutlineAgeGroup | 453 | 152 | 80 | 1 | 3 | 69.74 | Yes | Yes | No | Yes |
| ProximalPhalanxOutlineCorrect | 668 | 223 | 80 | 1 | 2 | 71.75 | Yes | Yes | No | Yes |
| ProximalPhalanxTW | 453 | 152 | 80 | 1 | 6 | 46.71 | Yes | Yes | Yes | Yes |
| RefrigerationDevices | 562 | 188 | 720 | 1 | 3 | 43.62 | Yes | Yes | No | Yes |
| ScreenType | 562 | 188 | 720 | 1 | 3 | 40.96 | Yes | Yes | No | Yes |
| ShapeletSim | 150 | 50 | 500 | 1 | 2 | 48.00 | Yes | Yes | No | Yes |
| ShapesAll | 900 | 300 | 512 | 1 | 60 | 66.00 | Yes | Yes | No | Yes |
| SmallKitchenAppliances | 562 | 188 | 720 | 1 | 3 | 59.57 | Yes | Yes | No | Yes |
| SmoothSubspace | 225 | 75 | 15 | 1 | 3 | 93.33 | Yes | Yes | Yes | Yes |
| SonyAIBORobotSurface1 | 465 | 156 | 70 | 1 | 2 | 98.72 | Yes | Yes | No | Yes |
| SonyAIBORobotSurface2 | 735 | 245 | 65 | 1 | 2 | 99.18 | Yes | Yes | No | Yes |
| Strawberry | 737 | 246 | 235 | 1 | 2 | 73.58 | Yes | Yes | Yes | Yes |
| SwedishLeaf | 843 | 282 | 128 | 1 | 15 | 83.33 | Yes | Yes | Yes | Yes |
| Symbols | 765 | 255 | 398 | 1 | 6 | 94.90 | Yes | Yes | No | Yes |
| SyntheticControl | 450 | 150 | 60 | 1 | 6 | 89.33 | Yes | Yes | Yes | Yes |
| ToeSegmentation2 | 124 | 42 | 343 | 1 | 2 | 85.71 | Yes | Yes | No | Yes |
| Trace | 150 | 50 | 275 | 1 | 4 | 68.00 | Yes | Yes | Yes | Yes |
| TwoLeadECG | 871 | 291 | 82 | 1 | 2 | 90.03 | Yes | Yes | No | Yes |
| TwoPatterns | 3750 | 1250 | 128 | 1 | 4 | 99.84 | Yes | Yes | No | Yes |
| UMD | 135 | 45 | 150 | 1 | 3 | 91.11 | Yes | Yes | Yes | Yes |
| UWaveGestureLibraryAll | 3358 | 1120 | 945 | 1 | 8 | 95.54 | Yes | Yes | No | Yes |
| UWaveGestureLibraryX | 3358 | 1120 | 315 | 1 | 8 | 80.89 | Yes | Yes | No | Yes |
| UWaveGestureLibraryY | 3358 | 1120 | 315 | 1 | 8 | 71.96 | Yes | Yes | No | Yes |
| UWaveGestureLibraryZ | 3358 | 1120 | 315 | 1 | 8 | 75.62 | Yes | Yes | No | Yes |
| Wafer | 5373 | 1791 | 152 | 1 | 2 | 100.00 | Yes | Yes | Yes | Yes |
| Wine | 83 | 28 | 234 | 1 | 2 | 60.71 | Yes | Yes | No | Yes |
| WordSynonyms | 678 | 227 | 270 | 1 | 25 | 67.84 | Yes | Yes | No | Yes |
| Worms | 193 | 65 | 900 | 1 | 5 | 52.31 | Yes | Yes | No | Yes |
| WormsTwoClass | 193 | 65 | 900 | 1 | 2 | 50.77 | Yes | Yes | No | Yes |
| Yoga | 2475 | 825 | 426 | 1 | 2 | 90.91 | Yes | Yes | No | Yes |
Collection of Time-Series Classification Datasets with Pretrained DL Models and Local Post-hoc Explainers (SHAP, LIME, Anchor & PHAR)

Precomputed bundle of post-hoc explanations and black-box models for time-series classification. This dataset introduces ExplainTS, a comprehensive testbed containing 83 univariate and 20 multivariate time-series datasets from the UCR/UEA repository (source: TSC), each used for multiclass classification with a deep learning model.

For each dataset, we provide:
- Precomputed train/test splits (75/25).
- A trained ConvLSTM-based TensorFlow model (SavedModel and .h5 formats).
- Post-hoc local explanations generated using four methods: Shapley Additive Explanations (SHAP), Local Interpretable Model-agnostic Explanations (LIME), Anchor, and Post-hoc Attribution Rules [1] (PHAR).

Impact of the dataset

These datasets provide a ready-to-use, frozen benchmark layer for Explainable AI in time-series classification. Since models and explanation outputs are precomputed, researchers can immediately use them for evaluation, visualization, or developing new post-hoc XAI metrics without the massive computational overhead of retraining or re-explaining models.

- Comprehensive Coverage: 83 univariate and 20 multivariate UCR/UEA time-series, each with a standardized 75/25 train-test split in NumPy .pickle format.
- Pretrained Models: Ready-to-use ConvLSTM1D models for every dataset (requiring no custom dependencies), eliminating costly training and ensuring experimental consistency.
- Precomputed Explanations: Post-hoc outputs for training and test sets spanning attribution scores (DeepSHAP, LIME) and discrete rule sets (Anchor, PHAR) with confidence and coverage metadata.
- Living Resource: ExplainTS is designed as a community-driven resource; we actively invite researchers to contribute their locally computed explanation artifacts to future releases.
- Educational & Prototyping Value: Includes a ready-to-use Jupyter notebook demonstrating how to calculate XAI stability metrics and render publication-quality explanation plots directly over time-series signals.

The following visualizations, generated using the included educational notebook, demonstrate a practical XAI auditing use case on the ECG5000 dataset. We compare the explanations of a reference sample against its nearest neighbor to evaluate method stability.
- SHAP attributions: https://raw.githubusercontent.com/mozo64/papers/main/zenodo-ucr/results/shap_stability_casestudy_ECG5000.png
- Discrete PHAR interval rules applied to the same signals: https://raw.githubusercontent.com/mozo64/papers/main/zenodo-ucr/results/phar_casestudy_comparison_ECG5000.png

Companion Code & Notebooks

All resources are openly available under the CC-BY-4.0 license. The linked GitHub repository (https://github.com/mozo64/papers/tree/main/zenodo-ucr) provides a suite of Python scripts and Jupyter notebooks for reproducing and interacting with the benchmark:
- ExplainTS_CaseStudy.ipynb — an educational case study for calculating XAI stability metrics and plotting explanations over time series.
- UCR-train.ipynb — pipeline for training the baseline ConvLSTM models.
- UCR-explainers-lime-shap.ipynb — execution of the DeepSHAP and LIME explainers.
- UCR-explainers-anchor.ipynb — execution of the Anchor explainer with cascade retries.
- UCR-explainers-phar.ipynb — execution of the PHAR rule extraction and hyperparameter optimization.
- datasets_summary.ipynb — utility for extracting dataset statistics and model accuracies.

Repository content

- train_test.zip — contains files of the form {uni|multi}_{series_name}_train_and_test.zip. Each includes: trainX.pickle, trainy.pickle, testX.pickle, testy.pickle. Format: `numpy.array`
- models.zip — trained models as directories in the form {uni|multi}_{series_name}_model.
Each contains a TensorFlow SavedModel and an .h5 file for loading flexibility.
- shap.zip — DeepSHAP values in {series_name}_shap_values.zip. Files: svtr.pickle (train), svts.pickle (test). Format: `numpy.array`
- lime.zip — LIME values in {series_name}_lime_values.zip. Files: lvtr.pickle (train), lvts.pickle (test). Format: `numpy.array`
- anchor.zip — Anchor rule records in {series_name}_anchor_values.zip. Files: avtr.pickle (train), avts.pickle (test). Format: `List[List[Dictionary]]`
- phar.zip — PHAR rules and hyperoptimization logs in {series_name}_phar_values.zip. Files: pvtr.pickle and pvts.pickle, plus phar_metadata.json and phar_trials_log.jsonl.

[1] Mozolewski, M., Bobek, S., & Nalepa, G. J. (2026). Explaining Time Series Classifiers with PHAR: Rule Extraction and Fusion from Post-hoc Attributions. arXiv preprint arXiv:2508.01687. https://arxiv.org/abs/2508.01687
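The per-split pickle names listed above can be resolved and loaded with a short helper. This is a sketch with our own function names (not part of the archive), assuming the relevant {series_name}_*_values.zip has already been extracted to a local directory:

```python
import pickle
from pathlib import Path

# Per-method pickle names documented above: (train file, test file)
SPLIT_FILES = {
    "shap":   ("svtr.pickle", "svts.pickle"),
    "lime":   ("lvtr.pickle", "lvts.pickle"),
    "anchor": ("avtr.pickle", "avts.pickle"),
    "phar":   ("pvtr.pickle", "pvts.pickle"),
}

def explanation_file(method: str, split: str) -> str:
    """Map a method ('shap', 'lime', 'anchor', 'phar') and a split
    ('train' or 'test') to its documented pickle file name."""
    train_name, test_name = SPLIT_FILES[method]
    return train_name if split == "train" else test_name

def load_explanations(extracted_dir, method, split):
    """Load one artifact from an extracted {series_name}_{method}_values.zip.

    SHAP/LIME files deserialize to numpy arrays of shape (M, C, T, F);
    Anchor/PHAR files deserialize to List[List[dict]] rule records.
    """
    path = Path(extracted_dir) / explanation_file(method, split)
    with open(path, "rb") as f:
        return pickle.load(f)
```

For example, `load_explanations("ECG5000_shap_values", "shap", "test")` would return the test-set DeepSHAP array, assuming the archive was extracted to that folder.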
Data Provenance and Licensing

The raw time series used in this benchmark originate from the Time Series Classification Repository, a well-established and widely used academic resource for time series classification research. All time series datasets were obtained using the standardized data access interface provided by the aeon Python toolkit:

Middlehurst, M., Ismail-Fawaz, A., Guillaume, A., Holder, C., Guijo-Rubio, D., Bulatova, G., Tsaprounis, L., Mentel, L., Walter, M., Schäfer, P., & Bagnall, A. (2024). aeon: a Python Toolkit for Learning from Time Series. Journal of Machine Learning Research, 25(289), 1–10. http://jmlr.org/papers/v25/23-1444.html

For academic applications and non-commercial usage, the files are made available under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Anchor, deep learning, Time Series, LIME, TS, rule-based explanations, TensorFlow, XAI, classification, DL, SHAP, Explainable AI, PHAR (Post-hoc Attribution Rules), UCR/UEA repository
