Bringing large-scale multiple genome analysis one step closer: ScalaBLAST and beyond

Christopher S. Oehmen 1, Heidi J. Sofia 1, Douglas Baxter 2, Ernest Szeto 3, Philip Hugenholtz 4, Nikos Kyrpides 5, Victor Markowitz 3, Tjerk P. Straatsma 1

1 Computational Sciences and Mathematics Division, Pacific Northwest National Laboratory (PNNL), 902 Battelle Boulevard, P.O. Box 999, Richland, WA USA
2 William R. Wiley Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory (PNNL), 902 Battelle Boulevard, P.O. Box 999, Richland, WA USA
3 Biological Data Management and Technology Center, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, USA
4 Microbial Ecology Program, Department of Energy Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, USA
5 Microbial Genome Analysis Program, Department of Energy Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, USA

Genome sequence comparisons of exponentially growing data sets form the foundation for the comparative analysis tools provided by community biological data resources such as the Integrated Microbial Genomes (IMG) system at the Joint Genome Institute (JGI). We present an example of how ScalaBLAST, a high-throughput sequence analysis program, harnesses high-performance computing to perform the sequence analysis that is a critical component of maintaining a state-of-the-art sequence data repository.

The Integrated Microbial Genomes (IMG) system [1] is a data management and analysis platform for microbial genomes hosted at the JGI. IMG contains both draft and complete JGI genomes, integrated with other publicly available microbial genomes from all three domains of life. IMG provides tools and viewers for interactive analysis of genomes, genes and functions, individually or in a comparative context. Most of these tools are based on pre-computed pairwise sequence similarities involving millions of genes.
These computations are becoming prohibitively time-consuming, both because of the rapid increase in the number of newly sequenced genomes incorporated into IMG and because the content of IMG must be refreshed regularly to reflect changes in the annotations of existing genomes. For example, building IMG 2.0 (released on December 1, 2006) entailed reloading from NCBI's RefSeq all the genomes in the previous version of IMG (IMG 1.6, as of September 1, 2006) together with 1,541 new public microbial, viral and eukaryal genomes, bringing the total number of IMG genomes to 2,301. A critical part of building IMG 2.0 was the use of PNNL's ScalaBLAST software to compute pairwise similarities for over 2.2 million genes in under 26 hours on 1,000 processors, illustrating the impact that new-generation bioinformatics tools are poised to make in biology.

The BLAST algorithm [2, 3] is a familiar bioinformatics application for computing sequence similarity, and has become a workhorse in large-scale genomics projects. The rapid growth of genome resources such as IMG cannot be sustained without more powerful tools, such as ScalaBLAST, that make more effective use of large-scale computing resources to perform the core BLAST calculations.
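
To put the IMG 2.0 run quoted above in perspective, the following back-of-the-envelope sketch converts the reported figures (over 2.2 million genes, under 26 hours, 1,000 processors) into processor-hours and per-processor throughput. The function names are illustrative only and are not part of ScalaBLAST. The sketch assumes that each gene is submitted once as a query, and the single-processor estimate further assumes perfect linear scaling, which the text does not claim. Because the reported figures are themselves bounds, the results should be read as rough bounds rather than measurements.

```python
# Figures quoted in the text. Both are bounds: "over" 2.2 million genes
# and "under" 26 hours, so the derived rates are approximate lower bounds.
GENES = 2_200_000
WALL_HOURS = 26
PROCESSORS = 1_000


def processor_hours(wall_hours, processors):
    """Total compute budget: wall-clock hours times processor count."""
    return wall_hours * processors


def queries_per_processor_hour(genes, wall_hours, processors):
    """Average query rate per processor.

    Assumption (not stated in the text): each gene is used once as a query.
    """
    return genes / processor_hours(wall_hours, processors)


def serial_days(wall_hours, processors):
    """Equivalent run time on one processor, in days.

    Assumption: perfect linear scaling, used here only for illustration.
    """
    return processor_hours(wall_hours, processors) / 24


if __name__ == "__main__":
    print("processor-hours:", processor_hours(WALL_HOURS, PROCESSORS))
    print("queries per processor-hour: %.1f"
          % queries_per_processor_hour(GENES, WALL_HOURS, PROCESSORS))
    print("single-processor equivalent (days): %.0f"
          % serial_days(WALL_HOURS, PROCESSORS))
```

Under these assumptions the run corresponds to at most 26,000 processor-hours, or roughly three years of computation on a single processor, which is why regular rebuilds of a resource the size of IMG depend on parallel tools.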