PSSH2 - database of protein sequence-to-structure homologies

The PSSH2 data set PSSH2 is a database of protein sequence-to-structure homologies based on HHblits, an alignment method employing iterative comparisons of hidden Markov models (HMMs). To ensure the highest possible final alignment quality for matches in Aquaria using HHblits, we first calculate HMM profiles for each unique PDB sequence (PDB_full) and also for each unique Swiss-Prot sequence. We generated PSSH2 using HHblits to find similarities between HMMs from PDB and HMMs from UniProt sequences. This dataset contains the Swissprot and PDB data used for generating PSSH2 along with the PSSH2 data itself. This consists of the sequence-to-structure alignments used in Aquaria (aquaria.ws) and also for the Covid19 resource of Aquaria (http://aquaria.ws/covid). Calculating PSSH2 The main bunch of Swissprot and PDB data was downloaded in February 2020, but incremental updates, especially as related to Covid19 were added until July 2020. Generating PSSH2: We used Uniclust30 from HH-suite, a database of non-redundant UniProt sequence clusters in which the highest pairwise sequence identity between clusters was 30% (http://gwdu111.gwdg.de/~compbiol/uniclust/2018_08/uniclust30_2018_08_hhsuite.tar.gz). The HHblits code and the code for running the calculations was retrieved from git (https://github.com/soedinglab/hh-suite.git and https://github.com/aschafu/PSSH2.git respectively) at the respective time of calculation in the timeframe between February and July 2020. Evaluating PSSH2 The resulting alignment data was analysed using CATH domain assignments downloaded from /cath/releases/all-releases/v4_2_0/cath-classification-data/ to define correct hits and false hits: The set of query sequences is defined by the CATH non-redundant S40_overlap_60 dataset (ftp://orengoftp.biochem.ucl.ac.uk/cath/releases/all-releases/v4_2_0/non-redundant-data-sets/) The set of all expected hits are all pdb structures containing a domain with the same CATH code if contained in the set of processed sequences (-> all) or only if also contained in the set of non redundant sequences (-> nr40). The set of true positives is defined by sharing the same CATH code up to the level of homology ("CATH") or up to the level of topology ("CAT"). The data was evaluated with respect to false discovery rate (FDR) and recall (true positive rate TPR) by cumulatively considering all hits with an E-value below the threshold ("C") or in bins with an E-value between the threshold and one tenth of the threshold ("B"). This evaluation was carried out for the data obtained in February 2020 (202002) as well as previous data from September 2017 (201709) and has since been repeated for data from October 2020 (202010). The results are collected in PSSH CATH validation.csv.

Related Organizations

Garvan Institute of Medical Research
Australia

Keywords

protein sequence, hmm sequence alignment profiles, protein structure, homology modelling

Impact byBIP!

	selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	0
	popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.	Average
	influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).	Average
	impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.	Average

Usage byUsageCounts

visibility	views	5
download	downloads	1

5
views
1
downloads
Powered by

Found an issue? Give us feedback

visibility

download

0

Average

5

1