
There are two versions of the data sets included in this repo: the prototype (24_05_07) and the finalized (25_05_11).

The major differences between the prototype and the finalized data sets are as follows:
  - Prototype has more unsupervised clustering labels (60) than finalized (54)
  - Prototype has more clusters associated with Y-Phos, K-Sumo, and K-Malo
  - Prototype has K-Succ and finalized does not
  - Finalized has PK-Hydr and prototype does not
  - Both are put through the same curation pipeline but with different random seeds for sampling

Most of the differences are an outcome of testing the stability of each PTM type (class) in a multi-class classification setting. Most PTM types (all that were tested) are stable (they converge) in a single binary classification setting. When the PTM type "K-Succ" was added to the multi-class setting, it degraded its own performance and that of other PTM types (observed in anecdotal behavior testing). The swap to a different finalized training set was primarily due to this clash in performance. In theory, different hyper-parameters could fix this.

There are 3 types of files used (benchmark, individual, and matrix). Benchmarks are constrained to residues of interest for both negative and positive data. “Individual” and “matrix” hold nearly identical data, but the “individual” data sets are the flattened version of the “matrix” file for easier data handling. The “matrix” version of the file is there for training and as an easier way to keep track of peptides associated with multiple PTMs.

The “uniprot_IDs_Pos” column in each file can hold multiple associated locations, which are listed out and separated by “--”. The current mapping of the shared locations per peptide only holds true if you are looking at 21mers. Any increase in context window size might break up the shared peptide associations, and the peptide might no longer be considered a multi-PTM event in that case.

The UniProt IDs are from multiple older versions of UniProt, so UniProt IDs can repeat, but repeats are denoted as such in the accession.
Due to possible repeat UniProt IDs, different UniProt IDs are denoted with the extension “-outOfDateV2” in the accession to maintain unique sequence mapping.

Finalized Training, Testing, and Validation(25_05_11)
--------------------
individual_train_hd3_CustSeqDistSpecClus_1to60NegRaio-25_05_11.csv
matrix_train_hd3_CustSeqDistSpecClus_1to60NegRaio-25_05_11.csv
  - Used to train the final model; utilized in all figures except Fig2 and S2_Fig

individual_val_hd3_CustSeqDistSpecClus_1to1NegRaio-25_05_11.csv
matrix_val_hd3_CustSeqDistSpecClus_1to1NegRaio-25_05_11.csv
  - Used to help train the final model

individual_test_hd3_CustSeqDistSpecClus_1to1NegRaio-25_05_11.csv
matrix_test_hd3_CustSeqDistSpecClus_1to1NegRaio-25_05_11.csv
  - Used as the benchmark in Fig4 and S_Fig4
  - Fig4 and S_Fig4 used the positive labels, with only the negative labels that share the same residue type serving as the negative class in this benchmark
  - Same positive labels as HUMAN_labs.txt

All data in the finalized data set are labeled using the unsupervised clustering labels (54) rather than the final labels (20) for the matrix files.

Finalized Benchmarks
--------------------
benchmark_test_HUMAN_hd3_Phosphorylation(ST)-25_05_11.csv
benchmark_test_HUMAN_hd3_Phosphorylation(Y)-25_05_11.csv
benchmark_test_HUMAN_hd3_Ubiquitination-25_05_11.csv
benchmark_test_HUMAN_hd3_Acetylation(K)-25_05_11.csv
benchmark_test_HUMAN_hd3_Acetylation(AM)-25_05_11.csv
benchmark_test_HUMAN_hd3_N-linked-Glycosylation-25_05_11.csv
benchmark_test_HUMAN_hd3_O-linked-Glycosylation-25_05_11.csv
benchmark_test_HUMAN_hd3_Methylation-25_05_11.csv
benchmark_test_HUMAN_hd3_Sumoylation-25_05_11.csv
benchmark_test_HUMAN_hd3_Malonylation-25_05_11.csv
benchmark_test_HUMAN_hd3_Sulfoxidation-25_05_11.csv
benchmark_test_HUMAN_hd3_S-palmitoylation-25_05_11.csv
benchmark_test_HUMAN_hd3_Glutathionylation-25_05_11.csv
benchmark_test_HUMAN_hd3_Hydroxylation-25_05_11.csv

All benchmarks are filtered for the residue of interest from individual_test_hd3_CustSeqDistSpecClus_1to1NegRaio-25_05_11.csv
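
As a rough illustration of the residue filtering mentioned for the benchmarks, the sketch below keeps only the rows whose central residue is one of the residues of interest. It assumes each row stores a 21mer with the site of interest at the center. The column names `peptide` and `label` are hypothetical placeholders, not the actual headers of the files in this repo, and this is not the curation pipeline itself.

```python
import csv
from io import StringIO


def center_residue(peptide):
    """Return the middle residue of an odd-length peptide (index 10 for a 21mer)."""
    return peptide[len(peptide) // 2]


def filter_by_center_residue(rows, residues, peptide_column="peptide"):
    """Keep rows whose central residue is in `residues` (e.g. "ST" or "K").

    `peptide_column` is a hypothetical column name; adjust it to the real header.
    """
    wanted = set(residues)
    return [row for row in rows if center_residue(row[peptide_column]) in wanted]


# Toy stand-in for an "individual" test file: three 21mers centered on S, K, and T.
toy_csv = "peptide,label\n" + "\n".join([
    "A" * 10 + "S" + "A" * 10 + ",1",
    "A" * 10 + "K" + "A" * 10 + ",0",
    "A" * 10 + "T" + "A" * 10 + ",0",
]) + "\n"

rows = list(csv.DictReader(StringIO(toy_csv)))
st_rows = filter_by_center_residue(rows, "ST")  # analogous to a Phosphorylation(ST) subset
print(len(st_rows))
```

For a real file, `StringIO(toy_csv)` would be replaced by an opened CSV file, with the column name adjusted to match its header.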
out_of_distribution.zip
-- YEAST/
--/YEAST/benchmark_YEAST_S_Phosphorylation-25_05_11.csv
--/YEAST/benchmark_YEAST_T_Phosphorylation-25_05_11.csv
--/YEAST/benchmark_YEAST_Y_Phosphorylation-25_05_11.csv
--/YEAST/benchmark_YEAST_K_Ubiquitination-25_05_11.csv
--/YEAST/...
-- MOUSE/
--/MOUSE/...
-- ECOLI/
--/ECOLI/...
-- DROME/
--/DROME/...
-- CAEEL/
--/CAEEL/...

Each species has files similar to the human Finalized Benchmarks, but they are residue specific.
  - Files are separated by organism, and each benchmark is further separated by PTM type and residue of interest
  - Do note that some of the benchmarks do not have enough data to be accurate
  - Negative labels are other PTM types and are only used if they share the residue(s) of interest for the PTM
  - All positive and negative labels were under-sampled to have a max of 500
  - Each species-specific PTM type benchmark was only used if it has at least 100 positive examples and 50 negatives
  - Used in Fig. 5 and S6 Fig.

Prototype (24_05_07)
--------------------
individual_train_hd3_CustSeqDistSpecClus_1to60NegRaio-24_05_07.csv
matrix_train_hd3_CustSeqDistSpecClus_1to60NegRaio-24_05_07.csv
  - Used to train the initial model that was used in Fig2 and S_Fig2
  - Segmented to build the Single Binary Classification models for Fig2 and S_Fig2

individual_val_hd3_CustSeqDistSpecClus_1to1NegRaio-24_05_07.csv
matrix_val_hd3_CustSeqDistSpecClus_1to1NegRaio-24_05_07.csv
  - Used to help train the initial model and used to benchmark Fig2 and S_Fig2
  - Fig3 and S_Fig1 used the positive labels, with all negative labels as the negative class in this benchmark
  - Fig2 and S_Fig2 used the positive labels, with all negative labels as the negative class in this benchmark (Single Binary Classification)

All data in the prototype data set are labeled using the unsupervised clustering labels (60) rather than the final labels (20) for the matrix files.

Different Res and Neg ratios (24_05_07)
--------------------
(not good performance in practice but could work in theory)

individual_train_hd3_CustSeqDistSpecClus_1to70NegRaio-24_05_07.csv
matrix_train_hd3_CustSeqDistSpecClus_1to70NegRaio-24_05_07.csv
  - Pos/Neg ratio is 1/70 (each class)
  - Medium/Easy negative residue ratio is uniform 1/20
  - total residue ratio NOT uniform

individual_train_hd3_CustSeqDistSpecClus_1to100NegRaio-24_05_07.csv
matrix_train_hd3_CustSeqDistSpecClus_1to100NegRaio-24_05_07.csv
  - Pos/Neg ratio is 1/100 (each class)
  - Medium/Easy negative residue ratio is uniform 1/20
  - total residue ratio NOT uniform

individual_train_hd3_CustSeqDistSpecClus_uniResRatio_CustNegRaio-24_05_07.csv
matrix_train_hd3_CustSeqDistSpecClus_uniResRatio_CustNegRaio-24_05_07.csv
  - Pos/Neg ratio is 1/1000 (each class)
  - Medium/Easy negative residue ratio is NOT uniform
  - total residue ratio is uniform

Mappable FASTA
--------------------
Here are the FASTA files that can be used to get full sequence context. The UniProt IDs are from multiple older versions of UniProt, so UniProt IDs can repeat, but repeats are denoted as such in the accession. Due to possible repeat UniProt IDs, different UniProt IDs are denoted with the extension “-outOfDateV2” in the accession to maintain unique sequence mapping.

uniprot_version_control_train_test_val_25_05_11.fasta
  - Used to map the "Finalized Training, Testing, and Validation(25_05_11)" and "Finalized Benchmarks" files.

uniprot_version_control_train_test_val_24_05_07.fasta
  - Used to map the "Prototype (24_05_07)" and "Different Res and Neg ratios (24_05_07)" files.

uniprot_version_control_OOD_25_05_11.fasta
  - Used to map the "out_of_distribution.zip" files.
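
A minimal sketch of how these FASTA files could be used to recover sequence context is given below. Several things here are assumptions, not confirmed by this repo: that the accession is either the first token of each FASTA header or the second field of a UniProt-style `db|ACCESSION|NAME` header, that site positions are 1-based, and that windows running off the end of a protein are padded with "-". The helper names (`read_fasta`, `split_locations`, `context_window`) are hypothetical. Only the “--” separator of the “uniprot_IDs_Pos” column and the “-outOfDateV2” accession suffix come from the description above; the layout of each individual location entry is not parsed here because it is not specified.

```python
from io import StringIO


def read_fasta(handle):
    """Parse FASTA text into {accession: sequence}.

    Assumption: the accession is the first whitespace-delimited header token,
    or its second "|"-separated field for UniProt-style headers.
    """
    records = {}
    accession = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if accession is not None:
                records[accession] = "".join(chunks)
            header = line[1:].split()[0]
            parts = header.split("|")
            accession = parts[1] if len(parts) >= 3 else header
            chunks = []
        else:
            chunks.append(line)
    if accession is not None:
        records[accession] = "".join(chunks)
    return records


def split_locations(value):
    """Split one uniprot_IDs_Pos cell into its "--"-separated entries."""
    return value.split("--")


def context_window(sequence, position, flank=10, pad="-"):
    """Return the (2 * flank + 1)-residue window centered on a 1-based position.

    flank=10 gives a 21mer; the pad character is an assumption.
    """
    if not 1 <= position <= len(sequence):
        raise ValueError("position is outside the sequence")
    center = position - 1
    left = sequence[max(0, center - flank):center]
    right = sequence[center + 1:center + 1 + flank]
    return (pad * (flank - len(left)) + left + sequence[center]
            + right + pad * (flank - len(right)))


# Toy FASTA text standing in for one of the files listed above.
toy_fasta = (
    ">P00001-outOfDateV2\n"
    "MKTAYIAKQR\n"
    "QISFVKSHFS\n"
    ">sp|P00002|TEST_HUMAN\n"
    "ACDEFGHIKLMNPQRSTVWY\n"
)
sequences = read_fasta(StringIO(toy_fasta))
print(context_window(sequences["P00002"], 3, flank=2))
```

Raising `flank` above 10 widens the window beyond a 21mer, which is the situation where the shared-location mapping described earlier may no longer hold.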
