
The major bottleneck in searching genomic databases is their sheer size. A number of solutions to the problem of aligning query sequences to genomic databases have been proposed, including the widely used BLAST and FASTA systems. While such systems are effective for traditional applications such as query alignment, they do not scale well to applications such as whole-genome shotgun sequencing and all-versus-all comparisons of one organism against another. The latter application has quadratic time complexity in the size of the databases involved and requires a different approach from BLAST-type search engines, which rely on a linear scan of the database. Our approach relies on a two-stage filter to prune a significant fraction of the database prior to alignment. The filter uses the MRS index [8] as its first stage, followed by a novel indexing scheme that we propose in this paper. The MRS index screens sequences that map to the same frequency vector and has been shown to produce speedups of up to 12 times over systems that do not employ such an index. However, the MRS index is inadequate against sequences that are inherently different yet still map to the same frequency vector. Our filter, based on the prime factor indexing scheme, eliminates a large fraction of the false positives that survive the MRS index. Our experiments show that at least 75% of these false positives are eliminated, resulting in speedups of up to 5 times over the MRS indexing scheme.
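To illustrate why a frequency-vector filter alone admits false positives, the following minimal sketch counts nucleotide occurrences and compares two sequences. It assumes a frequency vector is simply the per-base count over the alphabet A, C, G, T; the function names `frequency_vector` and `same_frequency_vector` are hypothetical and are not taken from the MRS index or from the prime factor indexing scheme, neither of which is reproduced here.

```python
from collections import Counter
from typing import Tuple

# Assumed alphabet for DNA sequences (illustrative only).
ALPHABET = "ACGT"


def frequency_vector(seq: str) -> Tuple[int, ...]:
    """Return the number of occurrences of each base, in ALPHABET order."""
    counts = Counter(seq)
    return tuple(counts[base] for base in ALPHABET)


def same_frequency_vector(a: str, b: str) -> bool:
    """True when two sequences are indistinguishable by base counts alone."""
    return frequency_vector(a) == frequency_vector(b)


if __name__ == "__main__":
    query = "ACGTACGT"
    candidate = "AACCGGTT"
    # Both sequences contain two of each base, so a filter that looks only
    # at frequency vectors cannot tell them apart, although the sequences
    # themselves differ.
    print(frequency_vector(query), frequency_vector(candidate))
    print(same_frequency_vector(query, candidate), query == candidate)
```

Pairs like this one pass a frequency-only first stage and would still have to be aligned, which is the kind of false positive a second-stage filter is meant to prune.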
