Roles of Solvent Accessibility and Gene Expression in Modeling Protein Sequence Evolution

Article English OPEN
Wang, Kuangyu ; Yu, Shuhui ; Ji, Xiang ; Lakner, Clemens ; Griffing, Alexander ; Thorne, Jeffrey L (2015)
  • Publisher: Libertas Academica
  • Journal: Evolutionary Bioinformatics Online, volume 11, pages 85-96 (issn: 1176-9343, eissn: 1176-9343)
  • Related identifiers: doi: 10.4137/EBO.S22911, pmc: PMC4415675
  • Subject: protein structure | protein evolution | scaled selection coefficient | solvent accessibility | Original Research | Biology (General) | gene expression | codon usage | QH301-705.5

Models of protein evolution tend to ignore functional constraints, although structural constraints are sometimes incorporated. Here we propose a probabilistic framework for codon substitution that evaluates the joint effects of relative solvent accessibility (RSA), a structural constraint, and gene expression, a functional constraint. First, we explore the relationship between RSA and codon usage at the genomic scale as well as at the scale of individual genes. Motivated by these results, we construct our framework by determining how probable an amino acid is, given RSA and gene expression, and then evaluating the relative probability of observing a codon compared to its synonymous codons. We come to the biologically plausible conclusion that both RSA and gene expression are related to amino acid frequencies but that, among synonymous codons, the relative probability of a particular codon is more closely related to gene expression than to RSA. To illustrate the potential applications of our framework, we propose a new codon substitution model. Using this model, we obtain estimates of 2Ns, where N is the effective population size and s is the relative fitness difference of an allele. For a training data set consisting of human proteins with known structures and expression data, 2Ns is estimated separately for synonymous and nonsynonymous substitutions in each protein. We then contrast the patterns of synonymous and nonsynonymous 2Ns estimates across proteins while also taking the gene expression levels of the proteins into account. We conclude that our 2Ns estimates are too concentrated around 0, and we discuss potential explanations for this lack of variability.
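The two-step construction described in the abstract (first the probability of an amino acid given RSA and gene expression, then the relative probability of a codon among its synonymous codons) can be sketched as a simple product of two lookups. The sketch below is illustrative only and is not the authors' implementation: the bin labels, the tables `P_AA` and `P_SYN`, the function `codon_probability`, and every number are hypothetical placeholders. As a simplifying assumption, the synonymous-codon term is conditioned on expression alone, and only two amino acids with two codons each are included.

```python
# Toy illustration only: all names and probabilities below are made-up
# placeholders, not estimates or code from the paper.

# Hypothetical P(amino acid | RSA bin, expression bin).
P_AA = {
    ("buried", "high"): {"Leu": 0.7, "Lys": 0.3},
    ("exposed", "high"): {"Leu": 0.4, "Lys": 0.6},
}

# Hypothetical P(codon | amino acid, expression bin) among synonymous codons.
# Assumption: this term ignores RSA entirely, which is a simplification.
P_SYN = {
    ("Leu", "high"): {"CTG": 0.8, "CTC": 0.2},
    ("Lys", "high"): {"AAA": 0.4, "AAG": 0.6},
}


def codon_probability(codon, amino_acid, rsa_bin, expr_bin):
    """Return P(codon | RSA, expression) as P(aa | RSA, expr) * P(codon | aa, expr)."""
    p_aa = P_AA[(rsa_bin, expr_bin)][amino_acid]
    p_syn = P_SYN[(amino_acid, expr_bin)][codon]
    return p_aa * p_syn


def total_probability(rsa_bin, expr_bin):
    """Sum the codon probabilities over every codon in the toy tables."""
    return sum(
        codon_probability(codon, aa, rsa_bin, expr_bin)
        for aa in P_AA[(rsa_bin, expr_bin)]
        for codon in P_SYN[(aa, expr_bin)]
    )


if __name__ == "__main__":
    print(codon_probability("AAG", "Lys", "exposed", "high"))
    print(total_probability("exposed", "high"))
```

Because each table is normalized within its own context, the codon probabilities for any given RSA and expression bin sum to one, which is a quick sanity check on a factorization of this kind.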