Recent advances in molecular biology and genomics have led to rapid growth in the volume of digital biological data. These data are typically analyzed by repeatedly applying conceptually parallel algorithms; example domains include sequence alignment, gene expression analysis, QTL (Quantitative Trait Locus) analysis and haplotype reconstruction. Parallel implementations of these algorithms can yield significant performance improvements.
MapReduce is an easy-to-use, general-purpose parallel programming model tailored for the analysis of large datasets on commodity hardware. Developers only need to write two functions: Map, which converts an input key/value pair into a set of intermediate key/value pairs, and Reduce, which merges all intermediate values associated with a given intermediate key. The framework automatically handles all low-level details such as data partitioning, scheduling, load balancing, machine failure handling and inter-machine communication.
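To make the Map/Reduce contract concrete, the following is a minimal single-process sketch in Python. It is not Hadoop code and not part of Biodoop: the k-mer counting task and the names `map_kmers`, `reduce_counts` and `run_mapreduce` are hypothetical, chosen only for illustration. The small driver stands in for the framework's grouping step and does none of the partitioning, scheduling or failure handling described above.

```python
from collections import defaultdict


def map_kmers(seq_id, sequence, k=3):
    """Map: turn one (seq_id, sequence) input pair into (k-mer, 1) pairs."""
    return [(sequence[i:i + k], 1) for i in range(len(sequence) - k + 1)]


def reduce_counts(kmer, counts):
    """Reduce: merge all intermediate values that share the same key."""
    return kmer, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Toy in-memory driver (assumption: a real framework does this step
    across many machines). Groups intermediate values by key, then reduces."""
    groups = defaultdict(list)
    for key, value in records:
        for inter_key, inter_value in mapper(key, value):
            groups[inter_key].append(inter_value)
    # Sort keys so the result order is deterministic.
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))


# Hypothetical input: two short DNA reads keyed by an identifier.
reads = [("read1", "ACGTAC"), ("read2", "GTACGT")]
kmer_counts = run_mapreduce(reads, map_kmers, reduce_counts)
print(kmer_counts)  # {'ACG': 2, 'CGT': 2, 'GTA': 2, 'TAC': 2}
```

Because the mapper sees one record at a time and the reducer sees one key at a time, neither function shares state with other invocations, which is what allows a framework to run them in parallel on different machines.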
Several bioinformatics applications appear compatible with the MapReduce paradigm. Biodoop, currently under development, is a suite of parallel bioinformatics applications built on Hadoop, a popular open-source Java implementation of MapReduce. We are currently working on three qualitatively different algorithms: BLAST, GSEA and GRAMMAR. The last of these was originally implemented as part of the GenABEL R package.