Information‐theoretical entropy as a measure of sequence variability

Publication: Article, 01 Dec 1991, English. Publisher: Wiley. Journal: Proteins: Structure, Function, and Bioinformatics, volume 11, pages 297-313 (issn: 0887-3585, eissn: 1097-0134)

Authors: P. S. Shenkin; B. Erman; L. D. Mastrandrea

doi: 10.1002/prot.340110408

pmid: 1758884

Abstract

We propose the use of the information-theoretical entropy, $S = -\sum_i p_i \log_2 p_i$, as a measure of variability at a given position in a set of aligned sequences. Here $p_i$ is the fraction of times the $i$-th type appears at that position. For protein sequences, the sum has up to 20 terms; for nucleotide sequences, up to 4 terms; and for codon sequences, up to 61 terms. We compare $S$ and $V_S$, a related measure, in detail with $V_K$, the traditional measure of immunoglobulin sequence variability, both in the abstract and as applied to the immunoglobulins. We conclude that $S$ has desirable mathematical properties that $V_K$ lacks and has intuitive and statistical meanings that accord well with the notion of variability. We find that $V_K$ and the $S$-based measures are highly correlated for the immunoglobulins. We show by analysis of sequence data and by means of a mathematical model that this correlation is due to a strong tendency for the frequency of occurrence of amino acid types at a given position to be log-linear. It is not known whether the immunoglobulins are typical or atypical of protein families in this regard, nor is the origin of the observed rank-frequency distribution obvious, although we discuss several possible etiologies.
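As an illustration of the entropy measure defined in the abstract, the following minimal Python sketch computes S for each column of a small alignment. It also computes a traditional variability score under the assumption that the abstract's traditional measure is the Wu-Kabat variability, N * k / n1 (number of sequences, times number of distinct types, divided by the count of the most common type); the abstract itself does not state that formula. The function names and the toy sequences are hypothetical, and gap characters are not treated specially here.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy S = -sum(p_i * log2(p_i)) over the types seen in one column."""
    counts = Counter(column)
    n = sum(counts.values())
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def wu_kabat_variability(column):
    """Assumed Wu-Kabat definition: N * k / n1 (not given in the abstract)."""
    counts = Counter(column)
    n = sum(counts.values())    # N: number of sequences at this position
    k = len(counts)             # k: number of distinct types observed
    n1 = max(counts.values())   # n1: count of the most common type
    return n * k / n1


def alignment_columns(sequences):
    """Transpose equal-length aligned sequences into a list of columns."""
    return list(zip(*sequences))


# Hypothetical toy alignment: four aligned sequences of length four.
toy_alignment = ["ACDA", "ACEA", "ACFG", "ACGG"]
toy_scores = [
    (column_entropy(col), wu_kabat_variability(col))
    for col in alignment_columns(toy_alignment)
]
```

In this toy case the fully conserved columns give S = 0 and a Wu-Kabat score of 1, the column with four equally frequent types gives S = 2 bits and a score of 16, and the column split evenly between two types gives S = 1 bit and a score of 4.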

Related Organizations

Barnard College, United States
Brandeis University, United States
Hamilton College, United States
State University of New York at Potsdam, United States

Keywords

Models, Statistical; Chemical Phenomena; Chemistry, Physical; Molecular Sequence Data; Information Theory; Genetic Variation; Humans; Immunoglobulins; Amino Acid Sequence; Sequence Alignment

Impact by BIP!

selected citations: 182 (citations derived from selected sources)
popularity: Top 1% (the "current" impact or attention of an article in the research community at large, based on the underlying citation network)
influence: Top 1% (the overall impact of an article in the research community at large, based on the underlying citation network, diachronically)
impulse: Top 10% (the initial momentum of an article directly after its publication, based on the underlying citation network)