MLProbs: A Data-Centric Pipeline for Better Multiple Sequence Alignment

Publication: Article, 01 Jan 2023
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics, volume 20, pages 524-533 (issn: 1545-5963, eissn: 2374-0043)

Authors: Mengmeng Kuang; Yong Zhang; Tak-Wah Lam; Hing-Fung Ting

doi: 10.1109/tcbb.2022.3148382

pmid: 35120007

Abstract

In this paper, we explore a data-centric approach to the Multiple Sequence Alignment (MSA) construction problem. Unlike the algorithm-centric approach, which reduces the construction problem to a combinatorial optimization problem based on an abstract mathematical model, the data-centric approach uses classification models trained on existing benchmark data to guide the construction. We identified two simple classifications that help us choose a better alignment tool and determine whether, and how much, to carry out realignment. We show that shallow machine-learning algorithms suffice to train sensitive models for these classifications. Based on these models, we implemented a new multiple sequence alignment pipeline called MLProbs. Compared with 10 other popular alignment tools over four benchmark databases (namely, BAliBASE, OXBench, OXBench-X and SABMark), MLProbs consistently gives the highest TC score. More importantly, MLProbs shows non-trivial improvement for protein families with low similarity; in particular, when evaluated against the 1,356 protein families with similarity ≤ 50%, MLProbs achieves a TC score of 56.93, while the next three best tools score in the range [55.41, 55.91], a relative improvement of more than 1.8%. We also compared the performance of MLProbs and other MSA tools in two real-life applications, Phylogenetic Tree Construction Analysis and Protein Secondary Structure Prediction, and MLProbs again had the best performance. In our study, we used only shallow machine-learning algorithms to train our models. It would be interesting to study whether deep-learning methods can help make further improvements, so we suggest some possible research directions in the conclusion section.
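
To make the reported metric concrete, the following is a minimal Python sketch of a total-column (TC) style score and of the relative-improvement arithmetic quoted in the abstract. It is an illustration under stated assumptions, not the authors' evaluation code: the function names `column_signatures`, `tc_score`, and `relative_gain` are hypothetical, the score is assumed to be reported on a 0-100 scale, every reference column is scored (benchmark suites often restrict scoring to designated core columns), and both alignments are assumed to list the same sequences in the same order.

```python
def column_signatures(alignment):
    """Turn each alignment column into a tuple of residue indices.

    A gap ('-') is recorded as -1; a residue is recorded as its
    0-based position in the ungapped sequence.
    """
    counters = [0] * len(alignment)
    signatures = []
    for column in zip(*alignment):
        signature = []
        for i, ch in enumerate(column):
            if ch == "-":
                signature.append(-1)
            else:
                signature.append(counters[i])
                counters[i] += 1
        signatures.append(tuple(signature))
    return signatures


def tc_score(reference, test):
    """Percentage of reference columns reproduced exactly in the test alignment.

    Assumption: all reference columns are scored, with no core-block filtering.
    """
    ref_cols = column_signatures(reference)
    test_cols = set(column_signatures(test))
    matched = sum(1 for col in ref_cols if col in test_cols)
    return 100.0 * matched / len(ref_cols)


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


# Toy example (made-up sequences, not benchmark data).
reference = ["AC-GT", "ACTGT"]
test = ["ACG-T", "ACTGT"]
print(tc_score(reference, test))      # 3 of 5 reference columns are reproduced
print(relative_gain(56.93, 55.91))    # gain over the best of the next three tools
```

With the figures from the abstract, the gain of 56.93 over 55.91 works out to roughly 1.82%, which is consistent with the stated "more than 1.8%".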

Related Organizations

Chinese Academy of Sciences, China (People's Republic of)
University of Hong Kong, China (People's Republic of)
Shenzhen Institutes of Advanced Technology, China (People's Republic of)

Keywords

Computational Biology, Proteins, Sequence Alignment, Phylogeny, Algorithms, Software

Related research products (4)

Additional file 1 of MSAIndelFR: a scheme for multiple protein sequence alignment using information on indel flanking regions (2015)
DLPAlign: A Deep Learning based Progressive Alignment for Multiple Protein Sequences (2020)
Multiple sequence alignment (2006)
DLPAlign: A Deep Learning based Progressive Alignment Method for Multiple Protein Sequences (2020)

Impact by BIP!

selected citations: 1
popularity: Average
influence: Average
impulse: Average

Fields of Science

engineering and technology

medical engineering
