Raw data for SimpleFold-Turbo preprint

SimpleFold-Turbo: Adaptive Inference Caching Yields 14-fold Acceleration of Flow-matching Protein Structure Prediction

General Information for Raw Data

Description: This dataset contains all benchmarking data, predicted structures, and analysis results for the SimpleFold-Turbo manuscript. SimpleFold-Turbo applies TeaCache-style adaptive step-skipping to SimpleFold diffusion models across six model scales (100M–3B parameters), evaluated on a structurally diverse subset of 300 CATH domains.

Total size: ~572 MB (553 MB predicted structure files)

Benchmark Set

- CATH300.csv: Benchmark set of 300 CATH domains: name, sequence, and length
- diverse_cath_300.fasta: Sequences for the 300 domains in FASTA format
- diverse_cath_300.json: Extended metadata for each domain (CATH classification, structural annotations)
- ss_content.json: Secondary structure content (helix/sheet/coil fractions) per domain

Core Benchmark Results

- cath_benchmark_full.csv: Per-protein benchmark results (3,600 rows): model, TeaCache threshold, TM-score, RMSD, lDDT, inference time, and cache hit rate
- cath_benchmark_full.json: Same data in JSON format
- gt_comparison.csv: Side-by-side quality comparison (baseline vs. TeaCache) against ground-truth experimental structures: TM-score, RMSD, and lDDT differences

Dual Sweep (Uniform vs. Adaptive Step-Skipping)

- dual_sweep_simplefold_{100M,360M,700M,1.1B,1.6B,3B}.json: Per-protein results for each model scale (5,100 entries each). Each entry records method (uniform or adaptive), condition (number of steps or threshold), inference time, cache hit rate, computed steps, and quality metrics (RMSD, TM-score)
- dual_sweep_summary.csv: Aggregated summary across all models and conditions (30,600 rows)
- uniform_vs_adaptive.csv: Head-to-head comparison of uniform vs. adaptive skipping at matched compute budgets

Threshold and Step Sweeps

- threshold_sweep.json: Per-protein results across TeaCache threshold values (2,400 entries)
- threshold_summary.csv: Aggregated: mean time, speedup, cache hit rate, and quality loss per threshold
- uniform_step_sweep.json: Per-protein results for uniform step counts (2,700 entries)

Mechanistic Analyses

- skip_patterns.json: Timestep-resolved skip/compute patterns across the denoising trajectory. Includes per-step skip rates and a summary of always-computed warmup steps (11) vs. always-skipped steps (200) vs. variable steps (289)
- warmup_comparison.json: Analysis of warmup phase: compares the first 11 (always-computed) steps to full 500-step trajectories across 300 proteins
- clustering_results.json: Clustering of denoising timesteps into two regimes based on skip behavior, with secondary-structure correlation
- crystallization_results.json: Atom-level settling ("crystallization") analysis: per-protein statistics on when atomic coordinates stabilize during denoising (20 proteins)
- dimensionality_control.csv: Cache hit rate vs. chain length and embedding dimensionality (synthetic and empirical)
- dimensionality_control.json: Full dimensionality control experiment data including Pearson correlations

Predicted Structures

structures.zip: 30,811 PDB files (~553 MB compressed). Organized as structures/{model}/{method}_{condition}/{domain}.pdb, where model is one of simplefold_{100M,360M,700M,1.1B,1.6B,3B}, method is uniform or adaptive, and condition is the step count or threshold value.
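The structures/{model}/{method}_{condition}/{domain}.pdb layout described above can be split back into its components when indexing the archive. The following is a minimal sketch, not part of the released scripts: the function and class names are hypothetical, the condition is kept as a string because its exact formatting in the archive is not specified here, and the threshold value and domain identifier in the usage note are made up for illustration.

```python
from pathlib import PurePosixPath
from typing import NamedTuple

# Model directory names as listed in the dataset description.
MODELS = {f"simplefold_{s}" for s in ("100M", "360M", "700M", "1.1B", "1.6B", "3B")}
METHODS = ("uniform", "adaptive")


class StructurePath(NamedTuple):
    model: str
    method: str
    condition: str  # step count or threshold, kept as text (format is an assumption)
    domain: str


def parse_structure_path(path: str) -> StructurePath:
    """Split structures/{model}/{method}_{condition}/{domain}.pdb into its fields."""
    parts = PurePosixPath(path).parts
    if len(parts) != 4 or parts[0] != "structures":
        raise ValueError(f"unexpected layout: {path}")
    _, model, run, filename = parts
    if model not in MODELS:
        raise ValueError(f"unknown model directory: {model}")
    # The method name contains no underscore, so split on the first one only.
    method, sep, condition = run.partition("_")
    if method not in METHODS or not sep or not condition:
        raise ValueError(f"unexpected run directory: {run}")
    if not filename.endswith(".pdb"):
        raise ValueError(f"not a PDB file: {filename}")
    return StructurePath(model, method, condition, filename[: -len(".pdb")])
```

For example, parse_structure_path("structures/simplefold_1.1B/adaptive_0.1/1abcA00.pdb") would yield model "simplefold_1.1B", method "adaptive", condition "0.1", and domain "1abcA00"; the last two values are placeholders rather than entries known to be in the archive.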
File Formats

- CSV files use comma delimiters with a header row
- JSON files are either arrays of per-protein result objects or dictionaries with descriptive top-level keys
- FASTA follows standard format with CATH domain identifiers as headers
- PDB files follow standard Protein Data Bank format

Reproducing the Figures

The Python scripts used to generate all manuscript figures from these data files are included in the publication/ directory of the GitHub repo (figure1.py, figure2.py, figure_supplement.py).
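For readers who want to inspect the data files directly, the CSV and JSON conventions listed under File Formats can be handled with the Python standard library alone. This is a minimal sketch under those stated conventions; the helper names are hypothetical, and no column names or JSON keys are assumed because the exact schemas are not reproduced in this description.

```python
import csv
import json


def load_csv_rows(path):
    """Read a comma-delimited file with a header row into a list of dicts.

    All values are returned as strings; numeric columns need explicit conversion.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def load_json_records(path):
    """Load a JSON file and report which of the two documented shapes it has.

    Returns ("array", list) for per-protein result arrays, or
    ("dict", dict) for dictionaries with descriptive top-level keys.
    """
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if isinstance(data, list):
        return "array", data
    if isinstance(data, dict):
        return "dict", data
    raise ValueError(f"unexpected top-level JSON type in {path}")
```

Checking the top-level shape first avoids assuming that every JSON file is a list of per-protein records, since some are dictionaries keyed by analysis name.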
