
EvoMotif: Evolutionary Protein Motif Discovery and Statistical Validation

OVERVIEW
EvoMotif discovers evolutionarily conserved protein motifs through multi-species sequence analysis, combining information theory, evolutionary substitution matrices, and rigorous statistical validation.

CORE ALGORITHMS
1. Dual-Metric Conservation Scoring
- Shannon Entropy: H(i) = -Σ p_a(i) × log₂ p_a(i), normalized to [0,1]. Detects strict conservation (identical residues at catalytic sites).
- BLOSUM62 Score: Captures functional constraints from evolutionary substitution data. Detects functional conservation (physicochemically similar substitutions).
- Combined Score: C_final(i) = 0.5 × C_shannon(i) + 0.5 × B_norm(i)
2. Sliding Window Motif Discovery
- Multi-scale scanning: windows of 5, 7, 9, 11, 13, 15, 17, 19, 21 residues
- Adaptive thresholding (default: conservation ≥ 0.70)
- Overlap resolution: keeps highest-scoring windows
- Gap filtering: requires ≥70% sequence coverage
3. Statistical Validation
- Permutation Testing: 10,000 permutations per motif for empirical p-values
- FDR Correction: Benjamini-Hochberg procedure at α = 0.05
- Effect Size: Cohen's d > 0.5 required for reporting
- Only motifs with p < 0.05 and d > 0.5 are reported

VALIDATION RESULTS
Tested against known functional sites in hemoglobin α-chain, p53 tumor suppressor, and BRCA1:
- Hemoglobin: 100% detection of heme-binding residues (His59, His88)
- p53: All 5 Zn²⁺-binding cysteines identified, R248 and R273 cancer hotspots detected
- BRCA1: RING domain Cys/His residues, BRCT phospho-peptide binding sites found
Conclusion: All discovered motifs correspond to experimentally validated functional sites.

PERFORMANCE BENCHMARKS (Intel Core i7-9700K, 16 GB RAM)
- Ubiquitin (50 seq, 76 res): 45 sec total, 350 MB memory, 4 motifs
- Hemoglobin α (100 seq, 143 res): 2.5 min total, 580 MB memory, 9 motifs
- p53 (150 seq, 393 res): 8 min total, 1.2 GB memory, 12 motifs
- BRCA1 (200 seq, 1863 res): 28 min total, 3.8 GB memory, 38 motifs

USE CASES
1. Mutagenesis Planning: Identify critical residues (conservation > 0.85) vs safe targets (< 0.4)
2. Disease Variant Interpretation: Assess pathogenicity of missense mutations
3. Functional Domain Annotation: Discover domains in unannotated proteins
4. Protein Engineering: Design minimal functional constructs
5. Structural Biology: Correlate conservation with AlphaFold confidence scores
6. Comparative Genomics: Study evolutionary constraints across protein families

PIPELINE STAGES
Sequence retrieval (NCBI) → Alignment (MAFFT) → Conservation scoring (Shannon + BLOSUM62) → Motif discovery (sliding windows) → Statistical validation (permutation + FDR) → Phylogenetic tree (FastTree) → Structure mapping (PDB)

OUTPUT FILES
- FASTA: sequences and alignments
- JSON: conservation scores, motifs with p-values and effect sizes
- Newick: phylogenetic trees
- PDB: conservation mapped to B-factor column

INSTALLATION
pip install evomotif
External dependencies: mafft, fasttree (via apt, brew, or conda)

DOCUMENTATION
GitHub: https://github.com/tahagill/EvoMotif
Complete Guide: https://github.com/tahagill/EvoMotif/blob/main/docs/COMPLETE_GUIDE.md
PyPI: https://pypi.org/project/evomotif/

REQUIREMENTS
Python 3.8-3.11, Linux/macOS/WSL, 8 GB RAM minimum (16 GB recommended)

LICENSE
MIT License
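
EXAMPLE (ILLUSTRATIVE SKETCH)
The conservation scoring and sliding-window steps described under CORE ALGORITHMS can be sketched in a few lines of standard-library Python. This is not EvoMotif's own code or API: the function names (shannon_conservation, combined_score, scan_windows) are hypothetical, the normalization C_shannon = 1 - H / log₂(20) is an assumption about how the entropy is mapped to [0,1], and scoring a window by its mean per-position conservation is likewise an assumption. The BLOSUM62 term B_norm is taken as an input rather than computed.

```python
import math
from collections import Counter

# Window sizes listed in the description above.
WINDOW_SIZES = (5, 7, 9, 11, 13, 15, 17, 19, 21)


def shannon_conservation(column, alphabet_size=20):
    """Return 1 - H / log2(alphabet_size) for one alignment column.

    Assumption: gaps ('-') are ignored and entropy is normalized by the
    maximum entropy of a 20-letter amino-acid alphabet.
    """
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0  # all-gap column: treated as unconserved
    n = len(residues)
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(residues).values())
    return 1.0 - entropy / math.log2(alphabet_size)


def combined_score(c_shannon, b_norm, weight=0.5):
    """C_final = weight * C_shannon + (1 - weight) * B_norm."""
    return weight * c_shannon + (1.0 - weight) * b_norm


def scan_windows(scores, sizes=WINDOW_SIZES, threshold=0.70):
    """Return (start, size, mean_score) for every window at or above threshold.

    Assumption: a window qualifies when the mean of its per-position
    scores is >= threshold; overlap resolution is not performed here.
    """
    hits = []
    for size in sizes:
        for start in range(len(scores) - size + 1):
            mean = sum(scores[start:start + size]) / size
            if mean >= threshold:
                hits.append((start, size, mean))
    return hits


if __name__ == "__main__":
    # Toy alignment: three aligned sequences of equal length.
    alignment = ["MKHCA", "MKHCG", "MRHCS"]
    per_column = [shannon_conservation(col) for col in zip(*alignment)]
    print(scan_windows(per_column, sizes=(5,), threshold=0.70))
```

A fuller version would also compute B_norm from a BLOSUM62 matrix, drop columns below the 70% coverage cutoff, and keep only the highest-scoring window among overlapping hits before passing candidates to permutation testing.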
phylogenetics, protein-structure, protein-motifs, evolution, variant-analysis, bioinformatics, conservation-analysis, structural-biology, computational-biology, multiple-sequence-alignment
