
This dataset contains the precomputed variant effect predictions and interpretability features that power the Evo Variant Effect Explorer (EVEE) web application. It accompanies the preprint "EVEE: Interpretable variant effect prediction from genomic foundation model embeddings" (Pearce et al., 2026, doi:10.64898/2026.04.10.717844).

Each row is one ClinVar variant (4,252,870 in total) and carries its genomic coordinates, gene and consequence annotations, ClinVar clinical significance, and an Evo 2 embedding-based pathogenicity score. Each row also carries roughly 4,900 additional probe outputs: protein-level disruption features (InterPro domains, post-translational modifications, secondary structure, active/binding sites, disorder, etc.), regulatory-track predictions (ChromHMM states, ATAC-seq and ChIP-seq peaks across multiple cell types, cCRE annotations), amino-acid and consequence classifiers, and per-variant reference-predictor scores (AlphaMissense, REVEL, CADD, PrimateAI, SpliceAI, and others).

The table is released as five chromosome-balanced Parquet shards (clean_shard_0.parquet through clean_shard_4.parquet, each 6.8–7.3 GB) plus a manifest.json that records which chromosomes live in each shard. Consumers can read all shards as a single logical table with polars.scan_parquet("clean_shard_*.parquet") or duckdb.read_parquet. This is the exact artifact used to build the EVEE variants.duckdb served at https://evee.goodfire.ai.
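Because every shard is several gigabytes, it can help to open only the shards that hold the chromosomes of interest. The sketch below shows one way to do that with the manifest and a lazy Polars scan. It rests on assumptions that are not confirmed by this card: that manifest.json is a flat mapping from shard file name to a list of chromosome names, and that chromosomes are labelled like "chr1". The helper names shards_for_chromosomes and scan_variants are hypothetical, so check the real manifest layout before relying on this.

```python
import json
from pathlib import Path


def shards_for_chromosomes(manifest: dict, chromosomes: list) -> list:
    """Return the shard file names that contain any requested chromosome.

    ASSUMPTION: `manifest` maps shard file name -> list of chromosome names,
    e.g. {"clean_shard_0.parquet": ["chr1", "chr9"], ...}. The real
    manifest.json may be laid out differently.
    """
    wanted = set(chromosomes)
    return sorted(
        name for name, chroms in manifest.items() if wanted & set(chroms)
    )


def scan_variants(data_dir, chromosomes=None):
    """Lazily scan all shards, or only those holding `chromosomes`."""
    import polars as pl  # imported lazily so the helper above needs no extras

    data_dir = Path(data_dir)
    if chromosomes is None:
        # Glob over every shard, as suggested in the description above.
        return pl.scan_parquet(str(data_dir / "clean_shard_*.parquet"))

    manifest = json.loads((data_dir / "manifest.json").read_text())
    names = shards_for_chromosomes(manifest, chromosomes)
    if not names:
        raise ValueError(f"no shard lists any of {chromosomes!r}")
    # scan_parquet also accepts a list of paths.
    return pl.scan_parquet([data_dir / name for name in names])
```

With roughly 4,900 probe columns per row, staying lazy and selecting only the needed columns before calling collect() should keep memory use manageable; note that a shard may hold several chromosomes, so a row-level filter on the chromosome column is still needed after the scan.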
