Data-efficient De novo Genome Assembly Algorithm: Theory and Practice
Lam, Ka Kit
Technical Report Identifier: EECS-2016-43
May 9, 2016
Abstract: We study data-efficient and practical de novo genome assembly algorithms. Due to advances in high-throughput sequencing, the cost of read data has dropped rapidly in recent years. Thus, there is a growing need to perform genome assembly cost-effectively at large scale. However, state-of-the-art assemblers are not designed to be data-efficient. This leads to waste of read data, which translates into significant monetary loss for large-scale sequencing projects. Moreover, as technology advances, the nature of the read data also evolves, making existing assemblers insufficient for extracting all the information in the data. Therefore, there is a need to reinvent genome assemblers to meet this growing demand. In this dissertation, we address the data-efficiency aspect of genome assembly as follows. First, we show that even when the reads are noisy, one can successfully reconstruct genomes with information requirements close to the noiseless fundamental limit. We develop a new assembly algorithm, X-phased Multibridging, designed based on a probabilistic model of the genome. We show, through analysis, that it performs well on the model, and, through simulation, that it performs well on real genomes. Second, we introduce FinisherSC, a repeat-aware and scalable tool for upgrading de novo haploid genome assemblies using long and noisy reads. Experiments with real data suggest that FinisherSC can produce longer and higher-quality contigs than existing tools while maintaining high concordance. Thus, FinisherSC achieves higher data efficiency than state-of-the-art haploid genome assembly pipelines. Third, we study whether post-processing metagenomic assemblies with the original input long reads can improve assembly quality. Previous approaches have focused on pre-processing reads and optimizing assemblers. We introduce BIGMAC, which takes an alternative perspective and focuses on the post-processing step.
Using both the assembled contigs and the original long reads as input, BIGMAC first breaks the contigs at potentially mis-assembled locations and subsequently scaffolds the contigs. Our experiments on metagenomes assembled from long reads show that BIGMAC can improve assembly quality by reducing the number of mis-assemblies while maintaining or increasing N50 and N75. Thus, BIGMAC achieves higher data efficiency than state-of-the-art metagenomic assembly pipelines. Moreover, we also discuss theoretical work on genome assembly in terms of data, time, and space efficiency, and a promising post-processing tool, POSTME, for improving metagenomic assembly using both short and long reads.
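The N50 and N75 statistics referenced above are standard summaries of a contig length distribution: Nx is the largest length L such that contigs of length at least L together cover at least x% of the total assembled bases. A minimal sketch of how these are computed (the helper name `nx` and the example contig lengths are illustrative, not taken from BIGMAC):

```python
def nx(lengths, x):
    """Return the Nx statistic for a list of contig lengths (in bp):
    the length of the contig at which the cumulative sum, taken over
    contigs sorted from longest to shortest, first reaches x% of the
    total assembled bases."""
    total = sum(lengths)
    threshold = total * x / 100.0
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return 0

# Hypothetical contig lengths for illustration only.
contigs = [80, 70, 50, 40, 30, 20, 10]  # total = 300 bp
print(nx(contigs, 50))  # N50 -> 70 (80 + 70 = 150 >= 50% of 300)
print(nx(contigs, 75))  # N75 -> 40 (80 + 70 + 50 + 40 = 240 >= 75% of 300)
```

A larger N50 at the same number of mis-assemblies indicates longer, more contiguous contigs, which is why BIGMAC's evaluation tracks both quantities together.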