Protein haplotype sequences obtained by ProHap from the Haplotype Reference Consortium Release 1.1 dataset

This database of protein sequences was generated by running ProHap (https://github.com/ProGenNo/ProHap) on the phased genotypes published by the Haplotype Reference Consortium (HRC), Release 1.1 (https://ega-archive.org/datasets/EGAD00001002729). We used Ensembl v.110 to map coordinates between genes, exons, and transcripts.

HRC Release 1.1 is aligned to the GRCh37 reference genome, so we lifted the variants over to the GRCh38 reference using GeneBe (https://genebe.net/tools/liftover). Variants whose reported alternative allele is the reference allele in GRCh38 were removed. The remaining variants were filtered at a minor allele frequency threshold of 1%. After translation, the resulting unique non-canonical sequences were filtered at a frequency threshold of 0.5%. The complete configuration file for the ProHap run is attached to this repository.

This dataset contains one compressed directory with the following files:

F1: The concatenated FASTA file, ready to be used with search engines. It contains:
    Protein haplotype sequences obtained by ProHap
    Reference proteome as per Ensembl v.110
    Contaminant sequences from the cRAP project (https://www.thegpm.org/crap/)
The file is provided in two formats, full and simplified. The simplified FASTA contains only the artificial protein identifier and the matching gene name, and is optimised for compatibility with a wide range of tools. For annotation of peptides using the PeptideAnnotator, please provide the header (F1.2) in addition to the FASTA file.

F2: Additional information about the haplotype sequences, to be used for mapping identified peptides to the original haplotypes.

F3: Translations of haplotype cDNA sequences, before merging with the reference proteome.

For further description of the files, please refer to https://github.com/ProGenNo/ProHap/wiki/Output-files.
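The variant and haplotype filtering steps described above can be illustrated with a minimal Python sketch. This is not ProHap's code: the `Variant` record, its field names, and the helper functions are hypothetical, and the sketch assumes that entries at or above each threshold are retained. The actual behaviour is defined by the configuration file attached to this repository.

```python
from dataclasses import dataclass

MAF_THRESHOLD = 0.01         # 1% minor allele frequency (from the description)
HAPLOTYPE_THRESHOLD = 0.005  # 0.5% haplotype frequency (from the description)


@dataclass(frozen=True)
class Variant:
    """Hypothetical record for a lifted-over variant (not a ProHap type)."""
    chrom: str
    pos: int         # position on GRCh38 after liftover
    ref: str         # reference allele as reported on GRCh37
    alt: str         # alternative allele as reported on GRCh37
    grch38_ref: str  # reference allele at this position in GRCh38
    alt_freq: float  # frequency of the alternative allele in the panel


def keep_variant(v: Variant) -> bool:
    """Return True if the variant passes both variant-level filters."""
    # Remove variants whose alternative allele is the GRCh38 reference.
    if v.alt == v.grch38_ref:
        return False
    # The minor allele is the rarer of the two alleles.
    maf = min(v.alt_freq, 1.0 - v.alt_freq)
    # Assumption: the threshold is inclusive.
    return maf >= MAF_THRESHOLD


def keep_haplotypes(freqs: dict[str, float]) -> dict[str, float]:
    """Keep non-canonical sequences at or above the haplotype threshold."""
    return {seq: f for seq, f in freqs.items() if f >= HAPLOTYPE_THRESHOLD}


# Made-up example data, for illustration only.
variants = [
    Variant("1", 1000, "A", "G", "A", 0.20),   # passes both filters
    Variant("1", 2000, "C", "T", "T", 0.40),   # alt allele is the GRCh38 reference
    Variant("1", 3000, "G", "A", "G", 0.004),  # below the 1% MAF threshold
]
kept_variants = [v for v in variants if keep_variant(v)]
kept_haplotypes = keep_haplotypes({"MKTAYV": 0.02, "MKTSYV": 0.001})
```

In this toy input only the variant at position 1000 and the sequence `MKTAYV` survive filtering.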
For the usage of these databases with search engines, and for downstream analysis of identified peptides, please refer to the project's wiki page: https://github.com/ProGenNo/ProHap/wiki/Using-the-database-for-proteomic-searches.

When using these databases in your publication, please cite: Vašíček, J., Kuznetsova, K.G., Skiadopoulou, D. et al. ProHap enables human proteomic database generation accounting for population diversity. Nat Methods (2024). https://doi.org/10.1038/s41592-024-02506-0

Related Organizations

Royal Institute of Technology
Sweden
University of Bergen
Norway
Norwegian Institute of Public Health
Norway
University of Rostock
Germany

Keywords

haplotypes, proteogenomics, protein database


Related to Research communities

European University for Smart Urban Coastal Sustainability

UArctic