Preliminary Studies on de novo Assembly with Short Reads
Technical Report Identifier: EECS-2009-172
December 15, 2009
Abstract: Recent developments in next-generation sequencing present new computational challenges for assembly algorithms. Any effective and practical de novo assembly algorithm must cope with short read lengths, base-calling errors, and enormous data sizes. In this report we present our efforts to address these challenges in de novo assembly with short reads. Specifically, we show that quality scores contain vital information and that algorithms which use them can achieve better results. We also show that error-correction preprocessing can make de novo assembly algorithms more tolerant of base-calling errors. Finally, we present a novel parallel algorithm that clusters sequence reads based on overlap information, and we show that it has the potential to scale efficiently to millions of reads.
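As a small illustration of why quality scores carry usable information, the sketch below converts per-base quality scores into error probabilities and sums them to estimate the number of miscalled bases in a read. It assumes Phred-scaled scores (Q = -10 log10 P) stored as ASCII characters with an offset of 33; the report does not state which encoding its data uses, and the function names are hypothetical, chosen only for this example.

```python
def phred_to_error_prob(q):
    """Convert a Phred-scaled quality score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_qualities(quality_string, offset=33):
    """Decode an ASCII quality string into integer scores.

    Assumption: the offset is 33 (Sanger-style); some platforms use 64 instead.
    """
    return [ord(c) - offset for c in quality_string]


def expected_errors(quality_string, offset=33):
    """Estimate the expected number of miscalled bases in one read."""
    scores = decode_qualities(quality_string, offset)
    return sum(phred_to_error_prob(q) for q in scores)


if __name__ == "__main__":
    # 'I' decodes to Q40 and '5' decodes to Q20 under the assumed offset of 33.
    print(decode_qualities("I5"))
    print(expected_errors("I5"))
```

An assembler or error-correction step could use such per-base probabilities to weight or discard low-confidence bases rather than treating every base call as equally reliable.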