
Dataset description: The Encyclopedia of Domains (TED) is a joint effort by CATH (Orengo group) and the Jones group at University College London to identify and classify protein domains in AlphaFold2 models from AlphaFold Database version 4, covering over 188 million unique sequences and 324 million domain assignments.

In this data release, we make available to the community a table of domain boundaries and additional metadata on quality (pLDDT, globularity, number of secondary structures), taxonomy and putative CATH superfamily or fold assignments for all 324 million domains in TED100. For all chains in the TED-redundant dataset, the attached file contains boundary predictions, consensus level and information on the TED100 representative. Additionally, an archive with chain-level consensus domain assignments is available for 21 model organisms and 25 global health proteomes. For both TED100 and TED-redundant we provide the domain boundary predictions output by each of the three methods employed in the project (Chainsaw, Merizo, UniDoc). We also make available PDB files for 7,427 novel folds identified during the TED classification process, together with an annotation table sorted by novelty.

Please use the gunzip command to extract files with a '.gz' extension.

CATH annotations have been assigned using the Foldseek algorithm applied in various modes and the Foldclass algorithm, both of which are used to report significant structural similarity to a known CATH domain.

Note: The TED protocol differs from our standard CATH assignment protocol for superfamily assignment, which also involves HMM-based protocols and manual curation for remote matches.

This dataset contains:

ted_214m_per_chain_segmentation.tsv
The file contains all 214M protein chains in TED with consensus domain boundaries and proteome information in the following columns.
1. AFDB_model_ID - chain identifier from AFDB in the format AF--F1-model_v4, e.g. AF-A0A1V6M2Y0-F1-model_v4
2. md5 hash for chain sequence
3. nres - number of residues in chain
4. n_high - number of high consensus domains predicted in chain
5. n_med - number of medium consensus domains predicted in chain
6. n_low - number of low consensus domains predicted in chain
7. high_consensus - boundaries of high consensus domains predicted in chain
8. med_consensus - boundaries of medium consensus domains predicted in chain
9. low_consensus - boundaries of low consensus domains predicted in chain
10. proteome_id - proteome identifier in the format proteome-tax_id--_v4, e.g. proteome-tax_id-67581-0_v4

ted_365m_domain_boundaries_consensus_level.tsv.gz
The file contains all domain assignments in TED100 and TED-redundant (365M) in the following columns.
1. TED_ID - TED domain identifier in the format AF--F1-model_v4_TED, e.g. AF-A0A1V6M2Y0-F1-model_v4_TED03
2. Boundaries - domain boundaries in the format - or -_- for discontinuous domains
3. Consensus - either high or medium

ted_324m_seq_clustering.cathlabels.tsv
The file contains the results of clustering the domain sequences with MMseqs2. Columns:
1. Cluster_representative
2. Cluster_member
3. CATH code assignment if available, e.g. 3.40.50.300 for a domain with a homologous match or 3.20.20 for a domain matching at the fold level in the CATH classification
4. CATH assignment type - either Foldseek-T, Foldseek-H or Foldclass

ted_100_324m.domain_summary.cath.globularity.taxid.tsv and novel_folds_set.domain_summary.tsv
Both files are header-less, with the following tab-separated columns. novel_folds_set.domain_summary.tsv is sorted by novelty.
1. ted_id - TED domain identifier in the format AF--F1-model_v4_TED, e.g. AF-A0A1V6M2Y0-F1-model_v4_TED03
2. md5_domain - md5 hash of domain sequence
3. consensus_level - medium (2 methods in agreement) or high (3 methods in agreement)
4. chopping - domain boundaries in the format - or -_- for discontinuous domains
5. nres_domain - number of residues in domain
6. num_segments - number of individual segments in domain
7. plddt - average pLDDT for domain
8. num_helix_strand_turn - number of helices, strands and turns predicted by STRIDE
9. num_helix - number of helices predicted by STRIDE
10. num_strand - number of strands predicted by STRIDE
11. num_helix_strand - number of helices and strands predicted by STRIDE
12. num_turn - number of turns predicted by STRIDE
13. proteome_id - proteome identifier in the format proteome-tax_id--_v4, e.g. proteome-tax_id-67581-0_v4
14. cath_label - CATH code if predicted, either a C.A.T.H. homologous superfamily or a C.A.T. fold assignment, e.g. 3.40.50.300
15. cath_assignment_level - H for homologous superfamily assignment, T for fold level assignment
16. cath_assignment_method - method used to assign a CATH label, either Foldseek or Foldclass
17. packing_density - metric used to determine globularity. A domain with packing_density >= 10.333 and norm_rg below 0.356 is considered globular
18. norm_rg - normalised radius of gyration, used together with packing_density to determine globularity (same thresholds as above)
19. tax_common_name - common name for organism
20. tax_scientific_name - scientific name for organism
21. tax_lineage - full taxonomic lineage

ted_redundant_39m.multichain.consensus_domain_summary.taxid.tsv and ted_redundant_39m.singlechain.consensus_domain_summary.taxid.tsv
These files contain the domain assignments for TED-redundant using multi-chain and single-chain consensus, respectively. Both files contain a header and are tab-separated.

Fields in ted_redundant_39m.multichain.consensus_domain_summary.taxid.tsv:
1. TED_redundant_id - TED chain identifier in the format AF--F1-model_v4, e.g. AF-A0A1V6M2Y0-F1-model_v4
2. md5 - md5 hash for chain sequence
3. nres - number of residues in chain
4. n_high - number of high consensus domains predicted in chain
5. n_med - number of medium consensus domains predicted in chain
6. high_consensus - boundaries of high consensus domains predicted in chain
7. med_consensus - boundaries of medium consensus domains predicted in chain
8. ndom_consensus - number of consensus domains predicted in chain
9. n_targets - number of chains considered for consensus calculation
10. proteome_id - proteome identifier in the format proteome-tax_id--_v4, e.g. proteome-tax_id-67581-0_v4
11. TED_redundant_species - scientific name of the organism the chain originally comes from
12. TED100_chain_rep - TED100 representative for chain
13. TED100_chain_rep_species - species of TED100 representative for chain

Fields in ted_redundant_39m.singlechain.consensus_domain_summary.taxid.tsv:
1. TED_redundant_id - TED chain identifier in the format AF--F1-model_v4, e.g. AF-A0A1V6M2Y0-F1-model_v4
2. md5 - md5 hash for chain sequence
3. nres - number of residues in chain
4. n_high - number of high consensus domains predicted in chain
5. n_med - number of medium consensus domains predicted in chain
6. high_consensus - boundaries of high consensus domains predicted in chain
7. med_consensus - boundaries of medium consensus domains predicted in chain
8. proteome_id - proteome identifier in the format proteome-tax_id--_v4, e.g. proteome-tax_id-67581-0_v4
9. TED_redundant_species - scientific name of the organism the chain originally comes from
10. TED100_chain_rep - TED100 representative for chain
11. TED100_chain_rep_species - species of TED100 representative for chain

novel_folds_set_models.tar.gz
Contains PDB files of all novel folds identified in TED100.

Per-tool domain boundary predictions
All per-tool domain boundary predictions (Chainsaw, Merizo, UniDoc) are in the same format, with the following columns.
1. TED_chainID - TED chain identifier in the format AF--F1-model_v4, e.g. AF-A0A1V6M2Y0-F1-model_v4
2. TED_chain_md5 - md5 hash for chain sequence
3. TED_chain_length - number of residues in chain
4. ndoms - number of domains predicted in chain
5. Domain boundaries - domain boundaries in the format - or -_- for discontinuous domains
6. Prediction probability - probability of each per-chain prediction

Domain boundary predictions share the same format, with each segment separated by '_' and segment boundaries (start, stop) separated by '-'. For example, the domain prediction by Merizo for AF-A0A000-F1-model_v4 is:

    AF-A0A000-F1-model_v4 e8872c7a0261b9e88e6ff47eb34e4162 394 2 10-52_289-394,53-288 0.90077

Merizo predicts one discontinuous domain and one continuous domain:
Domain 1 (discontinuous): 10-52_289-394
    segment 1: 10-52
    segment 2: 289-394
Domain 2 (continuous): 53-288
    segment 1: 53-288

Software and model archives
ted-tools-main.zip - copy of the https://github.com/psipred/ted-tools repository, containing tools and software used to generate TED.
cath-alphaflow-main.zip - copy of CATH-AlphaFlow, used to generate globularity scores for TED domains.
ted-web-master.zip - copy of TED-web, containing code to generate the web interface of TED (https://ted.cathdb.info).
gofocus_data.tar.bz2 - GOFocus model weights.
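The boundary format used throughout these files can be unpacked with a few string splits. The following is a minimal Python sketch, not part of the TED tooling: the function names parse_chopping and domain_length are hypothetical, the ',' separator between domains is inferred from the Merizo example above, and it assumes the fields are tab-separated and that residue ranges are inclusive. Rows without any predicted domain are not handled.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start, stop) residue numbers


def parse_chopping(chopping: str) -> List[List[Segment]]:
    """Split a boundary string into domains, each a list of (start, stop) segments.

    Assumed layout: domains separated by ',', segments of one domain
    separated by '_', and start/stop of a segment separated by '-'.
    """
    domains = []
    for domain_str in chopping.split(","):
        segments = []
        for segment_str in domain_str.split("_"):
            start, stop = segment_str.split("-")
            segments.append((int(start), int(stop)))
        domains.append(segments)
    return domains


def domain_length(segments: List[Segment]) -> int:
    """Number of residues in a domain, assuming inclusive start and stop."""
    return sum(stop - start + 1 for start, stop in segments)


# The Merizo example row from the description, written here with tab separators.
row = (
    "AF-A0A000-F1-model_v4\te8872c7a0261b9e88e6ff47eb34e4162\t394\t2\t"
    "10-52_289-394,53-288\t0.90077"
)
chain_id, md5, nres, ndoms, chopping, probability = row.split("\t")

domains = parse_chopping(chopping)
for index, segments in enumerate(domains, start=1):
    kind = "discontinuous" if len(segments) > 1 else "continuous"
    print(f"Domain {index} ({kind}): {segments}, {domain_length(segments)} residues")
```

Comparing len(domains) against the ndoms column is a cheap sanity check when reading the per-tool files in bulk.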
FOS: Computer and information sciences, Protein Structure, Protein Folds, Bioinformatics, AlphaFoldDB, CATH, TED
