
The likelihood of O-GlcNAc glycosylation in human proteins is predicted using a linear probability model (LPM) estimated by ridge regression. To achieve this, sequences from three similar post-translational modifications (PTMs) of proteins occurring at, or very near, the S or T site are analyzed: N-glycosylation, O-mucin type (O-GalNAc) glycosylation, and phosphorylation. The findings include: 1) The consensus composite sequon for O-glycosylation does not have W on either side of the glycosylation site. 2) The same holds for the consensus sequon for phosphorylation. 3) For LPM estimation, N-glycosylated sequences are found to be good approximations to non-O-glycosylatable sequences. 4) The selective positioning of an amino acid along the sequence differentiates the PTMs of proteins. 5) Some N-glycosylated sequences are also phosphorylated at the S or T site. 6) ASA values for N-glycosylated sequences are stochastically larger than those for O-GlcNAc glycosylated sequences. 7) Structural attributes (beta turn II, II', helix, beta bridges, beta hairpin, and the phi angle) are significant LPM predictors of O-GlcNAc glycosylation. The LPM with sequence and structural data as explanatory variables yields a Kolmogorov-Smirnov (KS) statistic of 99%. 8) With only sequence data, the KS statistic drops to 80%, underscoring the relevance of structural information, which is sparse for O-glycosylated sequences. With 50% as the cutoff probability for predicting O-GlcNAc glycosylation, this LPM mispredicts 21% of out-of-sample O-GlcNAc glycosylated sequences as not being glycosylated. The 95% confidence interval around this misprediction rate is 16% to 26%.
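The quantities named in the abstract (a ridge-estimated LPM, the KS statistic between the two classes, and the misprediction rate at a 50% cutoff) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the penalty value `lam`, the choice to leave the intercept unpenalized, and the synthetic features, which stand in for the paper's sequence and structural variables. It is not the authors' implementation and reproduces none of their results.

```python
import numpy as np


def fit_ridge_lpm(X, y, lam=1.0):
    """Ridge estimate of a linear probability model (0/1 response).

    Assumption: the intercept is left unpenalized; the paper may differ.
    """
    Xb = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    penalty = lam * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0  # do not shrink the intercept
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)


def predict_lpm(beta, X):
    """Fitted LPM scores; unlike logistic scores they can leave [0, 1]."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return Xb @ beta


def ks_statistic(scores, y):
    """Largest gap between the empirical score CDFs of the two classes."""
    grid = np.unique(scores)  # sorted unique score values
    pos = np.sort(scores[y == 1])
    neg = np.sort(scores[y == 0])
    cdf_pos = np.searchsorted(pos, grid, side="right") / len(pos)
    cdf_neg = np.searchsorted(neg, grid, side="right") / len(neg)
    return float(np.max(np.abs(cdf_pos - cdf_neg)))


def false_negative_rate(scores, y, cutoff=0.5):
    """Share of true positives scored below the cutoff probability."""
    return float(np.mean(scores[y == 1] < cutoff))


if __name__ == "__main__":
    # Synthetic stand-in data (hypothetical, not the paper's sequences).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(400, 5))
    noise = rng.normal(scale=0.5, size=400)
    y = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(float)

    beta = fit_ridge_lpm(X[:300], y[:300], lam=1.0)  # in-sample fit
    scores = predict_lpm(beta, X[300:])              # out-of-sample scores
    print("KS statistic:", ks_statistic(scores, y[300:]))
    print("Misprediction rate at 0.5:", false_negative_rate(scores, y[300:]))
```

A KS statistic near 1 (100%) means the score distributions of the two classes barely overlap, which is the sense in which the abstract's 99% and 80% figures compare the two models.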
40 pages
Models, Molecular, Glycosylation, N-glycosylation, Quantitative Biology - Quantitative Methods, Statistics, Nonparametric, 62J05, 62J07, Consensus Sequence, Humans, Amino Acid Sequence, Amino Acids, Phosphorylation, Databases, Protein, Quantitative Methods (q-bio.QM), Probability, O-glycosylation, Analysis of Variance, QH573-671, phosphorylation, Proteins, Logistic Models, linear, FOS: Biological sciences, Linear Models, Cytology, consensus sequon, Protein Processing, Post-Translational, probability model, Research Article
