Design and Engineering of Efficient Distributed MapReduce Algorithms for Bioinformatics Problems
Department seminar by Prof. Umberto Ferraro Petrillo.
Following the example of other scientific disciplines, and even commercial enterprises, that are successfully addressing “Big Data” problems, Hadoop, Spark and the associated MapReduce algorithmic paradigm are being actively investigated to support bioinformatics applications, in particular genomic sequence analysis. Although these middleware platforms are designed to provide scalability, the need for a specific Hadoop- or Spark-based development of bioinformatics algorithms has been recognized as essential to assess the extent of their impact in that area.
In this talk, we will present some carefully designed and engineered distributed MapReduce algorithms that efficiently solve some fundamental tasks in bioinformatics. The experimental results show that the proposed algorithms are competitive, in terms of time performance, with well-established non-distributed algorithms, while allowing for (virtually) unlimited horizontal scalability. Methodologically, we show that the successful and effective development of Hadoop- or Spark-based bioinformatics pipelines must account for both specific algorithmic design and engineering, a fact that has been overlooked so far.