
This dataset contains all data files and experimental results associated with the manuscript "Interpretable multimodal learning from sequence and genomic context for lncRNA classification".

Related Code Repository: https://github.com/cbib/beta_vae_lnclassifier
Code Archive DOI: 10.5281/zenodo.18833347
Manuscript: [Citation when available]

File Organization

The dataset is organized into three ZIP archives:

  data.zip - Input data files and preprocessing outputs
  gencode_v47_experiments.zip - All experimental results on GENCODE v47
  gencode_v49_experiments.zip - All experimental results on GENCODE v49

1. data.zip

Contains all input sequences, features, and lncRNA-BERT baseline results.

Contents:

cdhit_clusters/
CD-HIT clustered transcript sequences for training:
  v47_lncRNA_clustered.fa - CD-HIT clustered lncRNA sequences (v47)
  v47_pc_clustered.fa - CD-HIT clustered protein-coding sequences (v47)
  v49_lncRNA_clustered.fa - CD-HIT clustered lncRNA sequences (v49)
  v49_pc_clustered.fa - CD-HIT clustered protein-coding sequences (v49)

dataset_biotypes/
Biotype annotations for datasets:
  v47_dataset_biotypes_cdhit.csv - Transcript biotype labels (v47)
  v49_dataset_biotypes_cdhit.csv - Transcript biotype labels (v49)
Format: CSV with columns including transcript_id, biotype, gene_id

lncRNABERT_results/
Zero-shot baseline results from lncRNA-BERT:
  v47_lncRNABERT_embeddings.h5 - Learned embeddings (v47)
  v47_lncRNABERT_results.csv - Predictions and metrics (v47)
  v49_lncRNABERT_embeddings.h5 - Learned embeddings (v49)
  v49_lncRNABERT_results.csv - Predictions and metrics (v49)

processed_features/
Cleaned and normalized feature vectors with associated metadata:
  v47_nonb_feature_names.txt - Non-B DNA feature names (v47)
  v47_nonb_features_clean.csv - Processed non-B DNA features (v47)
  v47_nonb_scaler.pkl - Scikit-learn scaler for non-B features (v47)
  v47_te_feature_names.txt - TE feature names (v47)
  v47_te_features_clean.csv - Processed TE features (v47)
  v47_te_scaler.pkl - Scikit-learn scaler for TE features (v47)
  v49_nonb_feature_names.txt - Non-B DNA feature names (v49)
  v49_nonb_features_clean.csv - Processed non-B DNA features (v49)
  v49_nonb_scaler.pkl - Scikit-learn scaler for non-B features (v49)
  v49_te_feature_names.txt - TE feature names (v49)
  v49_te_features_clean.csv - Processed TE features (v49)
  v49_te_scaler.pkl - Scikit-learn scaler for TE features (v49)
Description: Feature scalers (.pkl) can be loaded with scikit-learn to apply the same normalization used during training.

split_gencode_47/
Train/test split for GENCODE v47:
  lnc_test.fa - lncRNA test set
  lnc_trainval.fa - lncRNA training+validation set
  pc_test.fa - Protein-coding test set
  pc_trainval.fa - Protein-coding training+validation set
  split_manifest.json - Split metadata and statistics

split_gencode_49/
Train/test split for GENCODE v49 (same structure as split_gencode_47/)

2. gencode_v47_experiments.zip

Experimental results for all models trained and evaluated on GENCODE v47.

Contents:

beta_vae_contrastive_g47/
β-VAE with contrastive learning (sequence-only baseline):
  evaluation_csvs/ - Evaluation metrics and predictions
  global_biotype_enrichment/ - Biotype enrichment analysis
  models/ - Model checkpoints
  performance_figures/ - Performance visualization plots
  spatial_analysis/ - Spatial clustering analysis
  umap_visualizations/ - UMAP embedding visualizations
  ANALYSIS_SUMMARY.md - Summary of key findings
  biotype_mapping.json - Biotype label mappings
  cv_evaluation_results.json - Cross-validation results
  cv_fold_results.csv - Per-fold cross-validation metrics
  embeddings_all_folds.npz - Concatenated embeddings from all CV folds
  embeddings_best_fold.npz - Embeddings from best performing fold
  model_architecture.txt - Model architecture description
  model_paths.csv - Paths to saved model files
  test_results.json - Final test set results

beta_vae_features_attn_g47/
β-VAE with attention-based feature fusion (TE + non-B DNA):
  Same structure as beta_vae_contrastive_g47/

beta_vae_features_g47/
β-VAE with concatenated features (TE + non-B DNA):
  Same structure as beta_vae_contrastive_g47/

cnn_g47/
CNN baseline (sequence-only):
  Same structure as beta_vae_contrastive_g47/

stat_results/
Statistical analysis results across all models:

  ablations_v47/
  Ablation study results:
    bootstrap_f1_ci.csv - Bootstrap confidence intervals for F1 scores
    delongauc_ci.csv - DeLong test for AUC comparisons
    fold_summary.csv - Summary statistics per fold

  g47/
  GENCODE v47 statistical analysis:
    g47_bootstrap_f1_ci.csv - Bootstrap F1 confidence intervals
    g47_fold_summary.csv - Per-fold summary statistics
    hardcase_jaccard_pairwise_v47.csv - Jaccard similarity for hard cases
    hardcase_jaccard_v47.csv - Hard case Jaccard indices
    hardcase_membership_long_v47.csv - Hard case membership matrix
    hardcase_upset_v47.png - UpSet plot for hard case overlaps

3. gencode_v49_experiments.zip

Experimental results for all models trained and evaluated on GENCODE v49.

Contents:

Same directory structure as gencode_v47_experiments.zip:
  beta_vae_contrastive_g49/
  beta_vae_features_attn_g49/
  beta_vae_features_g49/
  cnn_g49/
  stat_results/ablations_v49/ and stat_results/g49/

Reproducibility

To reproduce the results:
  Refer to the code repository (DOI: 10.5281/zenodo.18833347) for scripts.
  The split_manifest.json files document the exact train/test splits used.

Citation

If you use this dataset, please cite:

[Author list]. (2026). [Manuscript title]. Bioinformatics. DOI: [DOI when available]

Dataset DOI: 10.5281/zenodo.18849718
Code DOI: 10.5281/zenodo.18833347

License

CC BY 4.0

Contact

For questions or issues regarding this dataset, please contact:
  Mikaël Georges: mikael.georges@ibgc.cnrs.fr
  Macha Nikolski: macha.nikolski@u-bordeaux.fr

Or open an issue on the GitHub repository: https://github.com/cbib/beta_vae_lnclassifier

Last Updated: 03/03/26
Version: 1.0.0
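Example: sanity-checking a train/test split

The split directories described above (split_gencode_47/ and split_gencode_49/) ship the train/test partitions as FASTA files. As a minimal sketch, not taken from the code repository, the snippet below counts records per file and reports any transcript ID that appears in both the training+validation and test sets. The helper names (read_fasta_ids, check_split) are hypothetical, and the ID parsing is an assumption: it takes the first token of each header, cut at whitespace or "|". Adjust it if the headers in these files are formatted differently. The demo runs on tiny synthetic files, not on the real data.

```python
import tempfile
from pathlib import Path


def read_fasta_ids(path):
    """Return the set of record IDs found in a FASTA file."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                if not header:
                    continue
                # Assumption: the ID is the first token before whitespace or "|".
                ids.add(header.split()[0].split("|")[0])
    return ids


def check_split(split_dir):
    """Count records per file and list IDs shared by trainval and test."""
    split_dir = Path(split_dir)
    report = {}
    for prefix in ("lnc", "pc"):
        trainval = read_fasta_ids(split_dir / f"{prefix}_trainval.fa")
        test = read_fasta_ids(split_dir / f"{prefix}_test.fa")
        report[prefix] = {
            "n_trainval": len(trainval),
            "n_test": len(test),
            "shared_ids": sorted(trainval & test),
        }
    return report


def _demo():
    """Run check_split on small synthetic FASTA files (not real data)."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        (tmp / "lnc_trainval.fa").write_text(">tx1|g1\nACGT\n>tx2|g2\nGGCC\n")
        (tmp / "lnc_test.fa").write_text(">tx3|g3\nTTAA\n")
        # The synthetic pc files deliberately share one ID to show the check.
        (tmp / "pc_trainval.fa").write_text(">tx4 desc\nATGC\n")
        (tmp / "pc_test.fa").write_text(">tx4 desc\nATGC\n")
        return check_split(tmp)


demo_report = _demo()
print(demo_report)
```

On the real data, the call would look like check_split("split_gencode_47"), with the path adjusted to wherever data.zip was extracted. An empty shared_ids list for both lnc and pc indicates that no transcript ID occurs on both sides of the split. The authoritative description of the split remains split_manifest.json.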
RNA, Long Noncoding
