
# AMP Dataset Documentation

This repository contains the datasets used to develop ProteoGPT, AMPSorter, AMPGenix, and BioToxiPept, together with datasets annotated with the models' prediction results. Each dataset included in this repository is described below.

## Datasets

### 1. `uniprot-compressed_true_download_true_format_fasta_includeIsoform_tr-2022.10.13-02.51.31.70.fasta`

- **Description**: 609,216 non-redundant canonical and isoform protein sequences.

### 2. `protein_seqs_1000.json`

- **Description**: Contains the training data for ProteoGPT.

### 3. `amp_unique_16062.json`

- **Description**: Contains the training data for AMPGenix.

### 4. `AMPSorter&BioToxiPept dataset.xlsx`

- **Description**: Contains the fine-tuning data, including:
  - **AMP_data split**: Data used for training, validating, and evaluating AMPSorter.
  - **AMP_test Set**: Data used for testing AMPSorter.
  - **AMP_benchmarking Set**: A set of peptides used for benchmarking the AMP models.
  - **AMP_external Validation Dataset**: A separate dataset used for external validation of AMPSorter.
  - **Toxin_data split**: Data used for training, validating, and evaluating BioToxiPept.
  - **Toxin_test set**: A set of peptides used for testing the toxin models.

### 5. `NRSPDs`

- **Description**: A large dataset that includes:
  - **410,192,277 non-redundant short peptides**.
  - **A candidate pool of 82,694,928 peptides**.
  - **Logits** with the results predicted by AMPSorter and BioToxiPept.

### 6. `GNRSPDs`

- **Description**: Contains:
  - **7,798 generated sequences**.
  - **A candidate pool of 4,736 peptides**.
  - **Logits** with the results predicted by AMPSorter and BioToxiPept.

### 7. `196 tested peptides.xlsx`

- **Description**: A set of 196 selected and experimentally tested peptides, with their experimentally measured values.

### 8. `20 pilot tested peptides.xlsx`

- **Description**: 20 selected peptides with prediction results and the values measured experimentally in the pilot test.

### 9. `Sequences generated by different models.xlsx`

- **Description**: Sequences generated by different models, with prediction results.
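
As a minimal sketch of how the FASTA file listed as dataset 1 might be read, the snippet below parses FASTA records using only the Python standard library. It assumes the standard FASTA layout (a header line starting with `>` followed by one or more sequence lines). The helper name `read_fasta` is hypothetical and not part of this repository, and the demo uses a tiny made-up in-memory sample rather than the real file.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA-formatted text stream.

    Hypothetical helper for illustration; assumes standard FASTA layout.
    """
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped; collect and join them later.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    # Made-up sample data, used only so the sketch runs without the real file.
    sample = io.StringIO(">seq1 example\nMKTLLV\nAAG\n>seq2\nGGHK\n")
    for name, seq in read_fasta(sample):
        print(name, len(seq))
```

To read the actual dataset, the same function could be given a file handle instead, for example `open(path, encoding="utf-8")`, where `path` points to wherever the `.fasta` file was downloaded.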
