
Changelog: 2026-06-04 (v7)

The primary objective of this release is to eliminate sources of duplicate miRNA:target-site sequences, both within labels and across labels, resulting in cleaner and more consistent datasets. The following reported issues with the v5/v6 datasets were addressed in the v7 release:

- Added the Nunique column from HybriDetector outputs to the final datasets.
- Updated noncodingRNA_name to retain all candidate miRNA names associated with a miRNA sequence, removing sequence-to-annotation mapping inconsistencies.
- Regenerated negative examples following fixes to the negative generation pipeline:
  - Negative candidate exclusion is now based on clusters of unique target sites rather than all target sites.
  - Target-site clusters missed during negative candidate exclusion due to data type inconsistencies are now correctly excluded.
  - miRNA families with insufficient negative candidates are now downsampled to maintain a positive-to-negative ratio closer to 1:1.
- Re-annotated target sites using the genomic_region_annotator tool, adding the columns:
  - dominant_region: region with the greatest overlap in the selected transcript.
  - regions_present: all overlapping regions in the selected transcript.
  - read_start_in_sel_tx_1based: 1-based transcript-relative start coordinate.
  - read_end_in_sel_tx_1based: 1-based transcript-relative end coordinate.
- Standardized all coordinate columns to integer data types.

Changelog (v6)

PhyloP and PhastCons conservation scores for the target gene sequence have been added to the test/train/leftout datasets as two additional columns, 'gene_phyloP' and 'gene_phastCons'. Both new columns contain a list of conservation scores rounded to 3 decimal places, one score for each nucleotide in the gene sequence. PhyloP and PhastCons scores were obtained from:

- https://hgdownload.cse.ucsc.edu/goldenPath/hg38/phyloP100way/hg38.phyloP100way.bw
- https://hgdownload.cse.ucsc.edu/goldenpath/hg38/phastCons100way/hg38.phastCons100way.bw

Downloaded on 15 September 2024.

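The v6 note states that gene_phyloP and gene_phastCons hold one score per nucleotide of the gene sequence, which gives a simple consistency check. The sketch below is illustrative only: it assumes each cell is serialized as a bracketed, comma-separated list such as `[0.123, -1.5]`, which this card does not confirm, and the helper names `parse_scores` and `scores_match_sequence` are hypothetical.

```python
def parse_scores(cell):
    """Parse one conservation-score cell into a list of floats.

    ASSUMPTION: the cell looks like '[0.123, -1.5, 0.0]'. The on-disk
    format is not documented here, so adjust this if the files differ.
    """
    body = cell.strip().strip("[]")
    if not body.strip():
        return []
    # float() also accepts 'nan', in case missing scores are stored that way.
    return [float(token) for token in body.split(",")]


def scores_match_sequence(gene, cell):
    """Check that there is exactly one score per nucleotide in the gene."""
    return len(parse_scores(cell)) == len(gene)


# Synthetic values for illustration only; not taken from any dataset.
print(scores_match_sequence("ACG", "[0.123, -1.5, 0.0]"))
```

Running the check over both columns for every row is a quick way to detect truncated or misaligned score lists after loading.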
Dataset Summary (v5)

The following listed datasets were recreated via a series of post-processing pipelines (available here) to eliminate a bias between the positive and negative classes (miRNA family imbalance) discovered in previous versions of the datasets. All have a 1:1 positive to negative class ratio.

- AGO2_eCLIP_Manakov2022_leftout.tsv.gz
- AGO2_eCLIP_Manakov2022_test.tsv.gz
- AGO2_eCLIP_Manakov2022_train.tsv.gz
- AGO2_eCLIP_Klimentova2022_test.tsv.gz
- AGO2_CLASH_Hejret2023_test.tsv.gz
- AGO2_CLASH_Hejret2023_train.tsv.gz

The following listed dataset is the concatenated HybriDetector output of all the selected samples from the available Manakov sample files. It therefore contains only a raw version of the positive class of the Manakov dataset. It is the input to the series of post-process pipelines for the Manakov dataset.

- AGO2_eCLIP_Manakov2022_full_dataset.tsv.gz

The other inputs to the post-process pipelines for the Hejret and Klimentova datasets are found at the following links.

- Hejret dataset
- Klimentova dataset

The structure of each dataset is consistent, with the following column order:

- gene: A string of length 50 indicating the binding site sequence in the 5’ to 3’ direction.
- noncodingRNA: A string of variable length (16–28) indicating the mature miRNA sequence in the 5’ to 3’ direction.
- noncodingRNA_name: A string indicating the name of the miRNA.
- noncodingRNA_fam: A string indicating the name of the miRNA family the miRNA belongs to.
- feature: A string indicating the feature annotation on the genome where the binding site occurs.
- label: A boolean value indicating whether the example belongs to the positive or negative class.
- chr: A string indicating the chromosome number on the genome where the binding site occurs.
- start: An integer indicating the 1-based start position of the binding site on the genome.
- end: An integer indicating the 1-based end position of the binding site on the genome.
- strand: A string indicating whether the binding site occurs on the ’+’ or ’-’ strand on the genome.
- gene_cluster_ID: An integer indicating the cluster ID of the binding site sequence used to generate the negative class.

Note that the binding sites reported in all datasets are consistent with GRCh38.
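
The column descriptions above can be turned into a small sanity check. The following is a minimal sketch using only the Python standard library. The names `V5_COLUMNS`, `read_dataset`, and `check_row` are hypothetical helpers, not part of any released tooling, and the demo record is synthetic. The encoding of the label column is not specified here, so the sketch does not interpret it.

```python
import csv
import io

# Column order as listed in the v5 dataset summary.
V5_COLUMNS = [
    "gene", "noncodingRNA", "noncodingRNA_name", "noncodingRNA_fam",
    "feature", "label", "chr", "start", "end", "strand", "gene_cluster_ID",
]


def read_dataset(handle):
    """Read a TSV text stream; return (rows, columns beyond the v5 list)."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in V5_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    extra = [col for col in header if col not in V5_COLUMNS]
    return list(reader), extra


def check_row(row):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    if len(row["gene"]) != 50:
        problems.append("gene is not 50 nt long")
    if not 16 <= len(row["noncodingRNA"]) <= 28:
        problems.append("noncodingRNA length outside 16-28")
    if row["strand"] not in {"+", "-"}:
        problems.append("unexpected strand value")
    if int(row["start"]) > int(row["end"]):
        problems.append("start is greater than end")
    return problems


# Synthetic record for illustration only; not taken from any dataset.
demo_row = ["A" * 50, "ACGT" * 5 + "AC", "example-miRNA", "example-family",
            "example_feature", "1", "1", "1001", "1050", "+", "7"]
demo_tsv = "\t".join(V5_COLUMNS) + "\n" + "\t".join(demo_row) + "\n"
rows, extra = read_dataset(io.StringIO(demo_tsv))
print(check_row(rows[0]), extra)
```

For a real file, the same functions can be applied to a handle opened with `gzip.open(path, "rt", newline="")`. Columns added in later versions (for example the v6 conservation scores) would be reported in the second return value rather than treated as errors.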
