Powered by OpenAIRE graph
Found an issue? Give us feedback
image/svg+xml art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos Open Access logo, converted into svg, designed by PLoS. This version with transparent background. http://commons.wikimedia.org/wiki/File:Open_Access_logo_PLoS_white.svg art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos http://www.plos.org/ DROPS - Dagstuhl Res...arrow_drop_down
image/svg+xml art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos Open Access logo, converted into svg, designed by PLoS. This version with transparent background. http://commons.wikimedia.org/wiki/File:Open_Access_logo_PLoS_white.svg art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos http://www.plos.org/
image/svg+xml art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos Open Access logo, converted into svg, designed by PLoS. This version with transparent background. http://commons.wikimedia.org/wiki/File:Open_Access_logo_PLoS_white.svg art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos http://www.plos.org/
image/svg+xml art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos Open Access logo, converted into svg, designed by PLoS. This version with transparent background. http://commons.wikimedia.org/wiki/File:Open_Access_logo_PLoS_white.svg art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos http://www.plos.org/
Algorithms for Molecular Biology
Article . 2019 . Peer-reviewed
License: CC BY
Data sources: Crossref
image/svg+xml art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos Open Access logo, converted into svg, designed by PLoS. This version with transparent background. http://commons.wikimedia.org/wiki/File:Open_Access_logo_PLoS_white.svg art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos http://www.plos.org/
Algorithms for Molecular Biology
Article
License: CC BY
Data sources: UnpayWall
image/svg+xml art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos Open Access logo, converted into svg, designed by PLoS. This version with transparent background. http://commons.wikimedia.org/wiki/File:Open_Access_logo_PLoS_white.svg art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos http://www.plos.org/
image/svg+xml art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos Open Access logo, converted into svg, designed by PLoS. This version with transparent background. http://commons.wikimedia.org/wiki/File:Open_Access_logo_PLoS_white.svg art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos http://www.plos.org/
PubMed Central
Other literature type . 2019
Data sources: PubMed Central
image/svg+xml art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos Open Access logo, converted into svg, designed by PLoS. This version with transparent background. http://commons.wikimedia.org/wiki/File:Open_Access_logo_PLoS_white.svg art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos http://www.plos.org/
image/svg+xml art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos Open Access logo, converted into svg, designed by PLoS. This version with transparent background. http://commons.wikimedia.org/wiki/File:Open_Access_logo_PLoS_white.svg art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos http://www.plos.org/
Algorithms for Molecular Biology
Article . 2019
Data sources: DOAJ
image/svg+xml art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos Open Access logo, converted into svg, designed by PLoS. This version with transparent background. http://commons.wikimedia.org/wiki/File:Open_Access_logo_PLoS_white.svg art designer at PLoS, modified by Wikipedia users Nina, Beao, JakobVoss, and AnonMoos http://www.plos.org/
image/svg+xml Jakob Voss, based on art designer at PLoS, modified by Wikipedia users Nina and Beao Closed Access logo, derived from PLoS Open Access logo. This version with transparent background. http://commons.wikimedia.org/wiki/File:Closed_Access_logo_transparent.svg Jakob Voss, based on art designer at PLoS, modified by Wikipedia users Nina and Beao
image/svg+xml Jakob Voss, based on art designer at PLoS, modified by Wikipedia users Nina and Beao Closed Access logo, derived from PLoS Open Access logo. This version with transparent background. http://commons.wikimedia.org/wiki/File:Closed_Access_logo_transparent.svg Jakob Voss, based on art designer at PLoS, modified by Wikipedia users Nina and Beao
https://dx.doi.org/10.60692/f7...
Other literature type . 2019
Data sources: Datacite
https://dx.doi.org/10.60692/fk...
Other literature type . 2019
Data sources: Datacite
https://dx.doi.org/10.48550/ar...
Article . 2018
License: arXiv Non-Exclusive Distribution
Data sources: Datacite
DBLP
Conference object
Data sources: DBLP
DBLP
Article
Data sources: DBLP
DBLP
Article
Data sources: DBLP
versions View all 14 versions
addClaim

This Research product is the result of merged Research products in OpenAIRE.

You have already added 0 works in your ORCID record related to the merged Research product.

External memory BWT and LCP computation for sequence collections with applications

الذاكرة الخارجية BWT وحساب الناظمة القلبية بدون أسلاك لمجموعات التسلسل مع التطبيقات
Authors: Lavinia Egidi; Felipe A. Louza; Giovanni Manzini; Guilherme P. Telles;

External memory BWT and LCP computation for sequence collections with applications

Abstract

Les technologies de séquençage produisent des collections de plus en plus importantes de bioséquences qui doivent être stockées dans des index compressés prenant en charge des opérations de recherche rapides. De nombreux indices compressés sont basés sur la transformée de Burrows–Wheeler (BWT) et le plus long tableau de préfixe commun (LCP). En raison de la taille de l'entrée, il est important de construire ces structures de données dans la mémoire externe et le temps en utilisant de la meilleure façon possible la RAM disponible. Nous proposons un algorithme peu encombrant pour calculer le tableau BWT et LCP pour une collection de séquences dans le paramètre de mémoire externe ou semi-externe. Notre algorithme divise la collection d'entrée en sous-collections suffisamment petites pour pouvoir calculer leur BWT en RAM à l'aide d'un algorithme de temps linéaire optimal. Ensuite, il fusionne les BWT partiels en mémoire externe ou semi-externe et, dans le processus, il calcule également les valeurs LCP. Notre algorithme peut être modifié pour produire deux tableaux supplémentaires qui, combinés au tableau BWT et LCP, fournissent des algorithmes de mémoire externes simples, basés sur le balayage, pour trois problèmes bien connus en bioinformatique : le calcul des répétitions maximales, les chevauchements de suffixe et de préfixe de toutes les paires et la construction de graphiques succincts de Bruijn. Nous prouvons que notre algorithme effectue $${\mathcal {O}}(n\, \mathsf {maxlcp})$$ E/S séquentielles, où n est la longueur totale de la collection et $$\mathsf {maxlcp}$$ est la valeur LCP maximale. Les résultats expérimentaux montrent que notre algorithme n'est que légèrement plus lent que l'état de l'art pour les séquences courtes mais il est jusqu'à 40 fois plus rapide pour les séquences plus longues ou lorsque la RAM disponible est au moins égale à la taille de l'entrée.

Las tecnologías de secuenciación producen colecciones cada vez más grandes de biosecuencias que deben almacenarse en índices comprimidos que respalden las operaciones de búsqueda rápida. Muchos índices comprimidos se basan en la transformación Burrows–Wheeler (BWT) y la matriz de prefijo común más largo (LCP). Debido al gran tamaño de la entrada, es importante construir estas estructuras de datos en memoria externa y tiempo utilizando de la mejor manera posible la RAM disponible. Proponemos un algoritmo eficiente en el espacio para calcular la matriz BWT y LCP para una colección de secuencias en la configuración de memoria externa o semiexterna. Nuestro algoritmo divide la colección de entrada en subcolecciones lo suficientemente pequeñas como para poder calcular su BWT en RAM utilizando un algoritmo de tiempo lineal óptimo. A continuación, fusiona los BWT parciales en memoria externa o semi-externa y en el proceso también calcula los valores LCP. Nuestro algoritmo se puede modificar para generar dos matrices adicionales que, combinadas con la matriz BWT y LCP, proporcionan algoritmos de memoria externa simples y basados en escaneo para tres problemas bien conocidos en bioinformática: el cálculo de repeticiones máximas, las superposiciones de sufijo-prefijo de todos los pares y la construcción de gráficos de Bruijn sucintos. Demostramos que nuestro algoritmo realiza $${\mathcal {O}}(n\, \mathsf {maxlcp})$$ I/Os secuenciales, donde n es la longitud total de la colección y $$\mathsf {maxlcp}$$ es el valor LCP máximo. Los resultados experimentales muestran que nuestro algoritmo es solo ligeramente más lento que el estado de la técnica para secuencias cortas, pero es hasta 40 veces más rápido para secuencias más largas o cuando la RAM disponible es al menos igual al tamaño de la entrada.

Sequencing technologies produce larger and larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows–Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input it is important to build these data structures in external memory and time using in the best possible way the available RAM. We propose a space-efficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semi-external memory setting. Our algorithm splits the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external or semi-external memory and in the process it also computes the LCP values. Our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scan-based, external memory algorithms for three well known problems in bioinformatics: the computation of maximal repeats, the all pairs suffix–prefix overlaps, and the construction of succinct de Bruijn graphs. We prove that our algorithm performs $${\mathcal {O}}(n\, \mathsf {maxlcp})$$ sequential I/Os, where n is the total length of the collection and $$\mathsf {maxlcp}$$ is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input.

تنتج تقنيات التسلسل مجموعات أكبر وأكبر من التسلسلات الحيوية التي يجب تخزينها في مؤشرات مضغوطة تدعم عمليات البحث السريع. تعتمد العديد من المؤشرات المضغوطة على تحويل بوروز- ويلر (BWT) وأطول مصفوفة بادئة شائعة (LCP). نظرًا للحجم الهائل للمدخلات، من المهم بناء هياكل البيانات هذه في الذاكرة الخارجية والوقت باستخدام ذاكرة الوصول العشوائي المتاحة بأفضل طريقة ممكنة. نقترح خوارزمية فعالة من حيث المساحة لحساب مصفوفة BWT و LCP لمجموعة من التسلسلات في إعداد الذاكرة الخارجية أو شبه الخارجية. تقسم الخوارزمية الخاصة بنا مجموعة المدخلات إلى مجموعات فرعية صغيرة بما يكفي بحيث يمكنها حساب BWT في ذاكرة الوصول العشوائي باستخدام خوارزمية زمنية خطية مثالية. بعد ذلك، يدمج BWTs الجزئي في الذاكرة الخارجية أو شبه الخارجية وفي هذه العملية يحسب أيضًا قيم الناظمة القلبية بدون أسلاك. يمكن تعديل خوارزميتنا لإخراج مصفوفتين إضافيتين توفران، جنبًا إلى جنب مع مصفوفة BWT و LCP، خوارزميات ذاكرة خارجية بسيطة وقائمة على المسح الضوئي لثلاث مشاكل معروفة جيدًا في المعلوماتية الحيوية: حساب التكرارات القصوى، وتداخل جميع الأزواج اللاحقة، وبناء مخططات دي بروين الموجزة. نثبت أن خوارزميتنا تؤدي $${\ mathcal {O}}(n\, \mathsf {maxlcp })$ تسلسلي الإدخال/الإخراج، حيث n هو الطول الإجمالي للمجموعة و $$\mathsf {maxlcp }$$ هو الحد الأقصى لقيمة LCP. تُظهر النتائج التجريبية أن خوارزميتنا أبطأ قليلاً فقط من الحالة الفنية للتسلسلات القصيرة ولكنها تصل إلى 40 مرة أسرع للتسلسلات الأطول أو عندما تكون ذاكرة الوصول العشوائي المتاحة مساوية على الأقل لحجم الإدخال.

Countries
Germany, Italy
Keywords

FOS: Computer and information sciences, All pairs suffix-prefix overlaps; Burrows-Wheeler Transform; External memory algorithms; Longest common prefix array; Maximal repeats; Succinct de Bruijn graph, Plant Science, All pairs suffix-prefix overlaps, QH426-470, External memory algorithms, Agricultural and Biological Sciences, All pairs suffix–prefix overlaps, Data Structures and Algorithms (cs.DS), Biology (General), Auxiliary memory, Suffix, Text Compression and Indexing Algorithms, Life Sciences, Prefix, 004, FOS: Philosophy, ethics and religion, Programming language, Maximal repeats, Algorithm, Data structure, Physical Sciences, sequence alignment, Longest Common Prefix Array, Burrows-Wheeler Transform, QH301-705.5, External memory algorithm, Binary logarithm, Epistemology, Succinct de Bruijn graph, Suffix tree, Maximal repeat, Artificial Intelligence, Biochemistry, Genetics and Molecular Biology, Suffix array, Computer Science - Data Structures and Algorithms, Out-of-core algorithm, FOS: Mathematics, Genetics, All pairs suffix-prefix overlap, RNA Sequencing Data Analysis, Longest common prefix array, Molecular Biology, Biology, Research, All pairs suffix-prefix overlaps; Burrows-Wheeler Transform; Longest Common Prefix Array; Maximal repeats; Succinct de Bruijn graph;, Plant Nutrient Uptake and Signaling Pathways, Linguistics, Computer hardware, Compressed suffix array, Computer science, De Bruijn sequence, Burrows–Wheeler Transform, Philosophy, Combinatorics, FOS: Biological sciences, Computer Science, Computation, Simple (philosophy), FOS: Languages and literature, Mathematics, Sequence (biology), ddc: ddc:004

  • BIP!
    Impact byBIP!
    selected citations
    These citations are derived from selected sources.
    This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    27
    popularity
    This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
    Top 10%
    influence
    This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
    Top 10%
    impulse
    This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
    Top 10%
Powered by OpenAIRE graph
Found an issue? Give us feedback
selected citations
These citations are derived from selected sources.
This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
BIP!Citations provided by BIP!
popularity
This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
BIP!Popularity provided by BIP!
influence
This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
BIP!Influence provided by BIP!
impulse
This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.
BIP!Impulse provided by BIP!
27
Top 10%
Top 10%
Top 10%
Green
gold