Learning biologically informed RNA representations for sequence-structure-function modeling

Fan, Xiaoya; Feng, Yufan; Zhong, Jiaxin; Liu, Jingwen; Zhi, Xuanchen; Junwei, Chen; Rongling, Wu; Danko, Charles G.; Zhao, Zheng; Zhao, Qi; Wang, Zhong

Research data: Dataset (under curation). Publisher: Zenodo

doi: 10.5281/zenodo.19363040

Abstract

# VQRNA Dataset and pretrained model weights

## 1. Overview

This archive contains curated RNA datasets and model weights used in the VQ-RNA study, covering both structure-related and function-related downstream tasks.

- Project name: `VQRNA_dataset`
- Release date: `2026-03-22`
- Scope: distance map prediction, secondary structure modeling, ncRNA classification, RNACentral RNA type classification, miRNA-mRNA interaction prediction, 5'UTR MRL regression, torsion angle prediction, motif analysis (MEME and VQ code-window FASTA)
- Root directory: this `README.md` is placed at the dataset root

## 2. Directory Layout

    VQRNA_dataset/
    ├─ DistanceMap/
    │  ├─ train.csv, val.csv, test.csv, RFAM19.csv
    │  └─ distance_map/*.npy
    ├─ human_utr_mrl/
    │  └─ GSM4084997_varying_length_25to100.csv
    ├─ nRC/
    │  └─ *.csv
    ├─ Motif_Analysis/
    │  ├─ fasta/
    │  │  └─ code_*.fasta
    │  └─ meme/*.meme
    ├─ RNACentral_type/
    │  └─ *.fasta
    ├─ rr_inter/
    │  └─ MirTarRAW.csv
    ├─ SS/
    │  ├─ train_0.8_72025_1_17.pkl
    │  ├─ TR0/TR0.pkl, TS0/TS0.pkl
    │  ├─ ArchiveII/archiveII.pkl
    │  ├─ bpRNAnew/bpRNAnew.pkl
    │  ├─ rnastralign/train.pkl
    │  ├─ RNA3DB_2D/{bpseq/, pickle/, raw_pickle/}
    │  └─ casprna_2d/{bpseq/, pkl/}
    ├─ torsion_angle/
    │  └─ data/fixed_final_output_{TR,VL,TS1,TS2,TS3}_seqs.csv
    └─ weights/
       ├─ VQRNA_backbone_weights.pth
       ├─ VQRNA_distance_map.pth
       ├─ VQRNA_mrl.pth
       ├─ VQRNA_nrc.pth
       ├─ VQRNA_rr_inter.pth
       ├─ VQRNA_SS.pth
       └─ VQRNA_torsion_angle.pth

## 3. Dataset Summary

### 3.1 DistanceMap

- Files:
  - `DistanceMap/train.csv` (179 sequences)
  - `DistanceMap/val.csv` (23 sequences)
  - `DistanceMap/test.csv` (80 sequences)
  - `DistanceMap/RFAM19.csv` (19 sequences)
  - `DistanceMap/distance_map/*.npy` (distance matrices)
- CSV columns: `id,input`
- Typical usage: sequence-level input paired with a 2D distance map as the regression target

### 3.2 Human UTR MRL

- File: `human_utr_mrl/GSM4084997_varying_length_25to100.csv` (106,530 rows)
- Key columns: `utr`, `set`, read-count-derived features, `rl`, `len`
- Current `set` distribution in this file:
  - `random`: 87,000
  - `human`: 16,739
  - `controls`: 1,017
  - `with_uaugs`: 900
  - `no_uaugs`: 874
- Typical usage: sequence-level regression (the target is usually `rl`)

### 3.3 nRC

- Main files:
  - `nRC/finetune/train_6320_13.csv` (6,320 rows, 13 classes)
  - `nRC/finetune/test_2600_13.csv` (2,600 rows, 13 classes)
- CSV columns: `id,seq_str,class`
- These files were downloaded directly from the nRC dataset.

### 3.4 RNACentral_type

- Folder: `RNACentral_type/`
- Format: per-class FASTA files (`*.fasta`)
- Current content:
  - 16 RNA type files
  - 11,200 sequences in total
  - 700 sequences per type
- Retained RNA types: `rRNA`, `tRNA`, `lncRNA`, `pre_miRNA`, `snRNA`, `snoRNA`, `piRNA`, `tmRNA`, `hammerhead_ribozyme`, `SRP_RNA`, `miRNA`, `RNase_P_RNA`, `siRNA`, `antisense_RNA`, `Y_RNA`, `RNase_MRP_RNA`

### 3.5 rr_inter

- File: `rr_inter/MirTarRAW.csv` (27,719 rows)
- CSV columns: `a_name,a_seq,b_name,b_seq,label`
- Label distribution in the current file:
  - `label=0`: 13,860
  - `label=1`: 13,859

### 3.6 Secondary Structure (SS)

- This folder aggregates multiple SS benchmarks/resources in `.pkl` and `.bpseq` formats:
  - bpRNA-based splits (`TR0`, `TS0`)
  - ArchiveII
  - RNA3DB-2D
  - CASPRNA-2D
  - RNAStrAlign
- For RNA3DB-2D and CASPRNA-2D, both intermediate structural files (`.bpseq`) and model-ready serialized files (`.pkl`/`.pickle`) are included.
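The `.bpseq` files mentioned above can be inspected without the serialized pickles. The following is a minimal sketch, assuming the files follow the common BPSEQ convention of one `index base partner` line per nucleotide, with partner `0` for unpaired positions; the archive itself does not document the exact layout, so header handling here is a guess, and `parse_bpseq` is a hypothetical helper name, not part of the VQRNA code.

```python
from typing import List, Tuple


def parse_bpseq(text: str) -> Tuple[str, List[Tuple[int, int]]]:
    """Parse BPSEQ text into (sequence, sorted list of 1-based base pairs)."""
    bases: List[str] = []
    pairs = set()
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, comments, and any header line that is not
        # of the form "<index> <base> <partner>" (assumed layout).
        if len(fields) != 3 or not fields[0].isdigit() or not fields[2].isdigit():
            continue
        i, base, j = int(fields[0]), fields[1], int(fields[2])
        bases.append(base)
        if j > 0:
            # Store each pair once, smaller index first.
            pairs.add((min(i, j), max(i, j)))
    return "".join(bases), sorted(pairs)


# Toy hairpin written for illustration; it is not taken from the archive.
example = """\
# toy hairpin
1 G 8
2 C 7
3 A 0
4 A 0
5 A 0
6 A 0
7 G 2
8 C 1
"""

seq, pairs = parse_bpseq(example)
print(seq, pairs)
```

For a real file, the same function can be applied to the text read from any path under `SS/RNA3DB_2D/bpseq/` or `SS/casprna_2d/bpseq/`. Pairs are returned as index tuples rather than dot-bracket notation so that pseudoknotted structures, if present, are not lost.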
### 3.7 Torsion Angle

- Files:
  - `fixed_final_output_TR_seqs.csv`
  - `fixed_final_output_VL_seqs.csv`
  - `fixed_final_output_TS1_seqs.csv`
  - `fixed_final_output_TS2_seqs.csv`
  - `fixed_final_output_TS3_seqs.csv`
- CSV columns: `Base,alpha,beta,gamma,delta,epsilon,zeta,chi,id,pdb_id`
- Note: these are nucleotide-level rows, so each sequence spans multiple rows.
- Current row / unique `pdb_id` counts:
  - TR: 30,783 rows / 245 structures
  - VL: 1,540 rows / 28 structures
  - TS1: 8,807 rows / 58 structures
  - TS2: 2,598 rows / 27 structures
  - TS3: 2,137 rows / 47 structures

### 3.8 Motif Analysis (MEME + FASTA)

- Folder: `Motif_Analysis/`
- Subfolders:
  - `Motif_Analysis/fasta/`
  - `Motif_Analysis/meme/`
- FASTA content (`Motif_Analysis/fasta/`):
  - FASTA files (`code_*.fasta`)
  - Sequence definition: 21-nt window sequences for each VQ-Tokenizer code
- MEME content (`Motif_Analysis/meme/`):
  - 14 MEME motif database files (`*.meme`)
  - Current content includes:
    - Public RNA/RBP resources:
      - `ATtRACT.meme`
      - `RBPDB_motifs.meme`
      - `Ray2013_rbp_RNA.meme`
      - `SpliceAid.meme`
      - `cisbp_rna.meme`
    - Vocabulary-derived motif sets:
      - `3_mer_vocab.meme`
      - `4_mer_vocab.meme`
      - `5_mer_vocab.meme`
      - `6_mer_vocab.meme`
      - `BPE_vocab.meme`
    - VQ tokenizer-derived motif sets:
      - `vq_tokenizer_nrc_bb6a38.meme`
      - `vq_tokenizer_nrc_bb6a38_cut_7.meme`
      - `vq_tokenizer_nrc_bb6a38_cut_9.meme`
      - `vq_tokenizer_nrc_bb6a38_cut_11.meme`

## 4. Weights

- Files:
  - `weights/VQRNA_backbone_weights.pth`: VQRNA pretrained weights
  - `weights/VQRNA_distance_map.pth`: fine-tuned on the DistanceMap task
  - `weights/VQRNA_mrl.pth`: fine-tuned on the MRL task
  - `weights/VQRNA_nrc.pth`: fine-tuned on the nRC task
  - `weights/VQRNA_rr_inter.pth`: fine-tuned on the RR Interaction task
  - `weights/VQRNA_SS.pth`: fine-tuned on the secondary structure task
  - `weights/VQRNA_torsion_angle.pth`: fine-tuned on the torsion angle task
- Description: pretrained backbone weights plus task-specific fine-tuned weights for the downstream tasks listed above.

## 5. Data Usage and License Notes

- This collection is built from publicly available datasets and derived processing outputs.
- Reuse must follow the license/terms of each original source dataset.
- Before final Zenodo publication, set an explicit license in Zenodo metadata and ensure compatibility with all included sources.
- If needed, add a `LICENSE` file at the root for the packaged release policy.

## 6. Contact

- Maintainer: `XiaoYa Fan`
- Email: `xiaoyafan@dlut.edu.cn`
- Project link (optional): `https://github.com/XploreAI-Lab/SA-VQRNA`
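Section 3.7 notes that the torsion angle files hold one row per nucleotide, so a sequence has to be reassembled before use. The sketch below shows one way to do that with the standard library. It assumes that rows sharing a `pdb_id` belong to one structure and appear in sequence order, which the README implies but does not state; `group_by_structure` is a hypothetical helper, and the toy rows (including the `id` values) are invented for illustration.

```python
import csv
import io
from typing import Dict

ANGLES = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta", "chi"]


def group_by_structure(handle) -> Dict[str, dict]:
    """Collect nucleotide rows into one record per pdb_id, keeping file order."""
    structures: Dict[str, dict] = {}
    for row in csv.DictReader(handle):
        rec = structures.setdefault(row["pdb_id"], {"seq": [], "angles": []})
        rec["seq"].append(row["Base"])
        # Angles stay as raw strings: the encoding of missing values
        # in the real files is not documented, so no float() conversion here.
        rec["angles"].append([row[name] for name in ANGLES])
    for rec in structures.values():
        rec["seq"] = "".join(rec["seq"])
    return structures


# Invented rows using the documented column names.
toy_csv = (
    "Base,alpha,beta,gamma,delta,epsilon,zeta,chi,id,pdb_id\n"
    "G,-60.1,170.2,55.0,82.3,-150.4,-70.8,-160.0,0,toyA\n"
    "C,-65.0,175.1,52.7,81.0,-152.9,-68.3,-158.2,1,toyA\n"
    "A,-62.4,168.9,57.3,83.5,-149.0,-72.1,-161.7,0,toyB\n"
)

grouped = group_by_structure(io.StringIO(toy_csv))
for pdb_id, rec in grouped.items():
    print(pdb_id, rec["seq"], len(rec["angles"]))
```

To run it on the archive, pass an open file instead of the `StringIO` object, for example `open("torsion_angle/data/fixed_final_output_TR_seqs.csv", newline="")`. If the grouping assumption holds, the number of keys returned for the TR file should match the 245 structures listed in Section 3.7.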
