miRBench datasets

Changelog: 2026-06-04 (v7)

The primary objective of this release is to eliminate sources of duplicate miRNA:target-site sequences both within labels and across labels, resulting in cleaner and more consistent datasets. The following reported issues with the v5/v6 datasets were addressed in the v7 release:

- Added the Nunique column from HybriDetector outputs to the final datasets.
- Updated noncodingRNA_name to retain all candidate miRNA names associated with a miRNA sequence, removing sequence-to-annotation mapping inconsistencies.
- Regenerated negative examples following fixes to the negative generation pipeline:
  - Negative candidate exclusion is now based on clusters of unique target sites rather than all target sites.
  - Target-site clusters missed during negative candidate exclusion due to data type inconsistencies are now correctly excluded.
  - miRNA families with insufficient negative candidates are now downsampled to maintain a positive-to-negative ratio closer to 1:1.
- Re-annotated target sites using the genomic_region_annotator tool, adding the columns:
  - dominant_region: region with greatest overlap in the selected transcript.
  - regions_present: all overlapping regions in the selected transcript.
  - read_start_in_sel_tx_1based: 1-based transcript-relative start coordinate.
  - read_end_in_sel_tx_1based: 1-based transcript-relative end coordinate.
- Standardized all coordinate columns to integer data types.

Changelog (v6)

PhyloP and PhastCons conservation scores for the target gene sequence have been added to the test/train/leftout datasets as two additional columns, 'gene_phyloP' and 'gene_phastCons'. Both new columns contain a list of conservation scores rounded to 3 decimal places, one score for each nucleotide in the gene sequence. PhyloP and PhastCons scores were obtained from:

- https://hgdownload.cse.ucsc.edu/goldenPath/hg38/phyloP100way/hg38.phyloP100way.bw
- https://hgdownload.cse.ucsc.edu/goldenpath/hg38/phastCons100way/hg38.phastCons100way.bw

Downloaded on 15 September 2024.
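To illustrate what the v7 release means by duplicate miRNA:target-site sequences within and across labels, here is a minimal sketch of a duplicate check. It is not the release pipeline itself. It assumes a duplicate is an exact match on the (noncodingRNA, gene) sequence pair; the function name find_duplicate_pairs is hypothetical.

```python
from collections import defaultdict


def find_duplicate_pairs(rows):
    """Report repeated (miRNA, target-site) sequence pairs.

    rows: iterable of dicts with 'noncodingRNA', 'gene' and 'label' keys.
    Assumption: a duplicate is an exact match on both sequences.
    Returns two dicts keyed by (noncodingRNA, gene):
      within: pairs seen more than once under the same label
      across: pairs seen under more than one label
    """
    labels_by_pair = defaultdict(list)
    for row in rows:
        pair = (row["noncodingRNA"], row["gene"])
        labels_by_pair[pair].append(str(row["label"]))

    within = {}
    across = {}
    for pair, labels in labels_by_pair.items():
        distinct = set(labels)
        if len(labels) > len(distinct):
            # At least one label occurs twice for this pair.
            within[pair] = labels
        if len(distinct) > 1:
            # The same pair is labelled both positive and negative.
            across[pair] = labels
    return within, across
```

On a dataset that meets the v7 objective under this definition of a duplicate, both returned dictionaries would be empty.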
Dataset Summary (v5)

The following datasets were recreated via a series of post-processing pipelines (available here) to eliminate a bias between the positive and negative classes (miRNA family imbalance) discovered in previous versions of the datasets. All have a 1:1 positive-to-negative class ratio.

- AGO2_eCLIP_Manakov2022_leftout.tsv.gz
- AGO2_eCLIP_Manakov2022_test.tsv.gz
- AGO2_eCLIP_Manakov2022_train.tsv.gz
- AGO2_eCLIP_Klimentova2022_test.tsv.gz
- AGO2_CLASH_Hejret2023_test.tsv.gz
- AGO2_CLASH_Hejret2023_train.tsv.gz

The following dataset is the concatenated HybriDetector output of all the selected samples from the available Manakov sample files. It therefore contains only a raw version of the positive class of the Manakov dataset, and it is the input to the post-processing pipelines for the Manakov dataset.

- AGO2_eCLIP_Manakov2022_full_dataset.tsv.gz

The other inputs to the post-processing pipelines, for the Hejret and Klimentova datasets, are found at the following links.

- Hejret dataset
- Klimentova dataset

The structure of each dataset is consistent, with the following column order:

- gene: A string of length 50 indicating the binding site sequence in the 5' to 3' direction.
- noncodingRNA: A string of variable length (16–28) indicating the mature miRNA sequence in the 5' to 3' direction.
- noncodingRNA_name: A string indicating the name of the miRNA.
- noncodingRNA_fam: A string indicating the name of the miRNA family the miRNA belongs to.
- feature: A string indicating the feature annotation on the genome where the binding site occurs.
- label: A boolean value indicating whether the example belongs to the positive or negative class.
- chr: A string indicating the chromosome number on the genome where the binding site occurs.
- start: An integer indicating the 1-based start position of the binding site on the genome.
- end: An integer indicating the 1-based end position of the binding site on the genome.
- strand: A string indicating whether the binding site occurs on the '+' or '-' strand on the genome.
- gene_cluster_ID: An integer indicating the cluster ID of the binding site sequence used to generate the negative class.

Note that the binding sites reported in all datasets are consistent with GRCh38.
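The column descriptions above can be turned into a quick sanity check after download. The following is a minimal sketch using only the Python standard library. It assumes the files are gzip-compressed, tab-separated and carry a header row; the helper names (read_dataset, missing_columns, check_row, class_counts) are hypothetical and not part of any miRBench tooling.

```python
import csv
import gzip
from collections import Counter

# Columns documented for the v5 datasets, in the documented order.
EXPECTED_COLUMNS = [
    "gene", "noncodingRNA", "noncodingRNA_name", "noncodingRNA_fam",
    "feature", "label", "chr", "start", "end", "strand", "gene_cluster_ID",
]


def read_dataset(path):
    """Read a gzipped TSV into a list of dicts (assumes a header row)."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def missing_columns(rows):
    """Return the documented columns that are absent from the file."""
    present = set(rows[0]) if rows else set()
    return [name for name in EXPECTED_COLUMNS if name not in present]


def check_row(row):
    """Return a list of deviations from the documented column constraints."""
    problems = []
    if len(row["gene"]) != 50:
        problems.append("gene is not 50 nt long")
    if not 16 <= len(row["noncodingRNA"]) <= 28:
        problems.append("noncodingRNA length is outside 16-28 nt")
    if row["strand"] not in {"+", "-"}:
        problems.append("strand is neither '+' nor '-'")
    return problems


def class_counts(rows):
    """Count examples per label value, as stored in the file."""
    return Counter(row["label"] for row in rows)
```

For example, assuming AGO2_CLASH_Hejret2023_test.tsv.gz is in the working directory, `rows = read_dataset("AGO2_CLASH_Hejret2023_test.tsv.gz")` followed by `class_counts(rows)` gives a quick view of the positive-to-negative ratio. Later releases add further columns, so the sketch only checks that the documented columns are present rather than requiring an exact match.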

Found an issue? Give us feedback