EvoMotif: Evolution-Driven Framework for Protein Motif Discovery

EvoMotif: Evolutionary Protein Motif Discovery and Statistical Validation

OVERVIEW
EvoMotif discovers evolutionarily conserved protein motifs through multi-species sequence analysis, combining information theory, evolutionary substitution matrices, and rigorous statistical validation.

CORE ALGORITHMS

1. Dual-Metric Conservation Scoring
- Shannon Entropy: H(i) = -Σ p_a(i) × log₂ p_a(i), normalized to [0,1]
  Detects strict conservation (identical residues at catalytic sites)
- BLOSUM62 Score: Captures functional constraints from evolutionary substitution data
  Detects functional conservation (physicochemically similar substitutions)
- Combined Score: C_final(i) = 0.5 × C_shannon(i) + 0.5 × B_norm(i)

2. Sliding Window Motif Discovery
- Multi-scale scanning: windows of 5, 7, 9, 11, 13, 15, 17, 19, 21 residues
- Adaptive thresholding (default: conservation ≥ 0.70)
- Overlap resolution: keeps highest-scoring windows
- Gap filtering: requires ≥70% sequence coverage

3. Statistical Validation
- Permutation Testing: 10,000 permutations per motif for exact p-values
- FDR Correction: Benjamini-Hochberg procedure at α = 0.05
- Effect Size: Cohen's d > 0.5 required for reporting
- Only motifs with p < 0.05 and d > 0.5 are reported

VALIDATION RESULTS
Tested against known functional sites in hemoglobin α-chain, p53 tumor suppressor, and BRCA1:
- Hemoglobin: 100% detection of heme-binding residues (His59, His88)
- p53: All 5 Zn²⁺-binding cysteines identified, R248 and R273 cancer hotspots detected
- BRCA1: RING domain Cys/His residues, BRCT phospho-peptide binding sites found
Conclusion: All discovered motifs correspond to experimentally validated functional sites.

PERFORMANCE BENCHMARKS (Intel Core i7-9700K, 16GB RAM)
- Ubiquitin (50 seq, 76 res): 45 sec total, 350 MB memory, 4 motifs
- Hemoglobin α (100 seq, 143 res): 2.5 min total, 580 MB memory, 9 motifs
- p53 (150 seq, 393 res): 8 min total, 1.2 GB memory, 12 motifs
- BRCA1 (200 seq, 1863 res): 28 min total, 3.8 GB memory, 38 motifs

USE CASES
1. Mutagenesis Planning: Identify critical residues (conservation > 0.85) vs safe targets (< 0.4)
2. Disease Variant Interpretation: Assess pathogenicity of missense mutations
3. Functional Domain Annotation: Discover domains in unannotated proteins
4. Protein Engineering: Design minimal functional constructs
5. Structural Biology: Correlate conservation with AlphaFold confidence scores
6. Comparative Genomics: Study evolutionary constraints across protein families

PIPELINE STAGES
Sequence retrieval (NCBI) → Alignment (MAFFT) → Conservation scoring (Shannon + BLOSUM62) → Motif discovery (sliding windows) → Statistical validation (permutation + FDR) → Phylogenetic tree (FastTree) → Structure mapping (PDB)

OUTPUT FILES
- FASTA: sequences and alignments
- JSON: conservation scores, motifs with p-values and effect sizes
- Newick: phylogenetic trees
- PDB: conservation mapped to B-factor column

INSTALLATION
pip install evomotif
External dependencies: mafft, fasttree (via apt, brew, or conda)

DOCUMENTATION
GitHub: https://github.com/tahagill/EvoMotif
Complete Guide: https://github.com/tahagill/EvoMotif/blob/main/docs/COMPLETE_GUIDE.md
PyPI: https://pypi.org/project/evomotif/

REQUIREMENTS
Python 3.8-3.11, Linux/macOS/WSL, 8GB RAM minimum (16GB recommended)

LICENSE
MIT License
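
EXAMPLE: SHANNON CONSERVATION AND WINDOW SCANNING (ILLUSTRATIVE SKETCH)
The Shannon-entropy score and the sliding-window scan described under CORE ALGORITHMS can be sketched in plain Python. This is not EvoMotif's own code or API. The function names are hypothetical, and two points are assumptions rather than statements from the description above: conservation is taken as 1 minus the entropy divided by log₂(20), and a window counts as a hit when its mean score meets the threshold. The BLOSUM62 component, overlap resolution, and gap filtering are left out.

```python
import math
from collections import Counter


def shannon_conservation(column, alphabet_size=20):
    """Score one alignment column as 1 - H / log2(alphabet_size).

    Assumption: gaps ('-') are ignored and an all-gap column scores 0.0.
    """
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    # H = -sum(p_a * log2(p_a)) over the residues observed in this column
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return 1.0 - entropy / math.log2(alphabet_size)


def column_scores(alignment):
    """Score every column of an alignment given as equal-length strings."""
    return [shannon_conservation(col) for col in zip(*alignment)]


def conserved_windows(scores, width=5, threshold=0.70):
    """Return (start, mean_score) for each window whose mean meets the threshold."""
    hits = []
    for start in range(len(scores) - width + 1):
        mean = sum(scores[start:start + width]) / width
        if mean >= threshold:
            hits.append((start, mean))
    return hits


if __name__ == "__main__":
    # Toy alignment: the first two columns are identical in every sequence.
    toy = ["HCAKL", "HCGRL", "HCTKV", "HC-EL"]
    scores = column_scores(toy)
    print([round(s, 2) for s in scores])
    print(conserved_windows(scores, width=2, threshold=0.70))
```

Running the scan once per window width (5, 7, ..., 21) and then merging overlapping hits would approximate the multi-scale step. How EvoMotif itself resolves overlaps is not specified beyond keeping the highest-scoring windows.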
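
EXAMPLE: BENJAMINI-HOCHBERG FDR CORRECTION (ILLUSTRATIVE SKETCH)
The FDR step listed under Statistical Validation uses the Benjamini-Hochberg procedure at α = 0.05. The sketch below is a generic textbook version of that procedure, not EvoMotif's implementation. The function name and the list-of-booleans return format are hypothetical. It assumes one p-value per candidate motif has already been obtained, for example from the permutation test.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if rejected at FDR level alpha.

    Step-up rule: find the largest rank k (1-based, ascending p-values)
    with p_(k) <= (k / m) * alpha, then reject every hypothesis ranked <= k.
    """
    m = len(p_values)
    # Indices of the p-values in ascending order of p
    order = sorted(range(m), key=lambda i: p_values[i])

    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank

    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical p-values for four candidate motifs
    motif_p = [0.001, 0.04, 0.03, 0.2]
    print(benjamini_hochberg(motif_p))
```

In this toy input only the first motif survives: 0.03 and 0.04 fall below 0.05 individually but exceed their rank-scaled thresholds (0.025 and 0.0375), which is the extra stringency that FDR control adds over an uncorrected cutoff.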

Keywords

phylogenetics, protein-structure, protein-motifs, evolution, variant-analysis, bioinformatics, conservation-analysis, structural-biology, computational-biology, multiple-sequence-alignment
