Human population assembly and error-correction of sequence reads.
Z. Iqbal1, S. McCarthy2, H. Zheng Bradley3, C. Xiao4, A. Marcketta5, G. McVean1,6
1) Wellcome Trust Centre for Huma, University of Oxford, Oxford, United Kingdom; 2) Wellcome Trust Sanger Institute, Hinxton, UK; 3) European Bioinformatics Institute, Hinxton, UK; 4) National Centre for Biotechnology Information, NIH, Bethesda, USA; 5) Albert Einstein School of Medicine; 6) Department of Statistics, University of Oxford.

   As sequencing technologies improve and read lengths increase, a major challenge will be the error correction of reads, especially at low coverage. Current methods rely either on coverage as a proxy for truth or on a reference genome. For those interested in SNP and indel analysis, it is important not to discard true polymorphisms while removing errors. Ideally, we would make use of prior knowledge and place more confidence in reads that match known sequence. A recent development in genome analysis, introduced in [1,2], is the use of de novo assembly not just to study a single individual but to learn about an entire species. This allows unbiased access to all sequence; for example, [1] found gene sequence that was highly differentiated between Europe, Africa and Asia but was missing from the reference genome.
    Using the Cortex assembler, we have built assembly graphs of 1092 humans from 14 populations from Phase 1 of the 1000 Genomes Project. We show how the 1092-sample graph can be used as a repository of known sequence, allowing single-pass, quality-aware error correction of reads and improving both power and concordance with genotype arrays. We demonstrate this both on a single high-coverage sample and on a cohort of low-coverage samples from a population absent from the graph. Although the majority of Illumina reads require zero or one base to be corrected, a non-negligible number have more than 10 bases corrected, including correction of Ns. These reads, which previously had no BLAST hits, now BLAST confidently to human sequence.
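    To make the idea of single-pass, quality-aware correction against a repository of known sequence concrete, here is a minimal toy sketch in Python. It is not the Cortex implementation: a plain set of k-mers stands in for the population graph, and the function names, the `min_qual` threshold and the rule "change a base only if exactly one substitution yields a trusted k-mer" are all assumptions made for illustration. The point it shows is that only low-quality bases (or Ns) are eligible for correction, so a high-quality base that disagrees with known sequence is left alone as a possible true polymorphism.

```python
from typing import Iterable, List, Set

BASES = "ACGT"


def build_trusted_kmers(sequences: Iterable[str], k: int) -> Set[str]:
    """Collect every k-mer from trusted sequences (a toy stand-in for graph nodes)."""
    kmers: Set[str] = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    return kmers


def correct_read(read: str, quals: List[int], trusted: Set[str],
                 k: int, min_qual: int = 10) -> str:
    """Correct a read in one left-to-right pass over its k-mer windows.

    A base may change only if it is an N or its quality is below min_qual
    (hypothetical threshold), and only if exactly one substitution turns
    the current window into a trusted k-mer.
    """
    bases = list(read)
    for i in range(len(bases) - k + 1):
        if "".join(bases[i:i + k]) in trusted:
            continue  # window already matches known sequence
        for j in range(i, i + k):
            if bases[j] != "N" and quals[j] >= min_qual:
                continue  # high-quality base: never overwritten
            candidates = [
                b for b in BASES
                if b != bases[j]
                and "".join(bases[i:j] + [b] + bases[j + 1:i + k]) in trusted
            ]
            if len(candidates) == 1:  # unambiguous fix only
                bases[j] = candidates[0]
                break
    return "".join(bases)


if __name__ == "__main__":
    trusted = build_trusted_kmers(["ACGTTGCAAGCT"], k=5)
    quals = [30] * 6 + [2] + [30] * 5  # one low-quality base at index 6
    print(correct_read("ACGTTGTAAGCT", quals, trusted, k=5))
```

    In this sketch, the same mismatch with high quality at every position is left unchanged, which mirrors the stated aim of not throwing out true polymorphisms while removing errors.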
   The method extends transparently, so it is possible to use trusted graphs built from dbSNP, the 1000 Genomes SNP calls, or any assemblies. We discuss the value of using prior information in this manner. These approaches will be of great value in an era of low-coverage long reads.
   [1] Z Iqbal, M Caccamo, I Turner, P Flicek, G McVean. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nature Genetics (2012).
   [2] Z Iqbal, I Turner, G McVean. High-throughput microbial population genomics using the Cortex variation assembler. Bioinformatics (2012).
