Increased complexity of the human genome revealed by single-molecule sequencing.

M. J. P. Chaisson (1), J. Huddleston (1), P. H. Sudmant (1), M. Malig (1), F. Hormozdiari (1), U. Surti (3), R. Wilson (4), M. Hunkapiller (2), J. Korlach (2), E. E. Eichler (1)

1) Genome Sciences, University of Washington, Seattle, WA; 2) Pacific Biosciences, Menlo Park, CA; 3) Department of Pathology, University of Pittsburgh, Pittsburgh, PA; 4) Washington University School of Medicine, St. Louis, MO.

   The human genome is arguably the highest-quality mammalian reference assembly, yet more than 150 interstitial gaps remain and aspects of its structural variation remain poorly understood 10 years after its completion. We generated and analyzed 40-fold sequence coverage of a haploid human genome (CHM1) using Pacific Biosciences single-molecule, real-time (SMRT) sequencing (average mapped read length 5.8 kbp). We developed methods to detect indels and structural variants ranging from several bases up to 20 kbp. We closed or extended 55% of the remaining interstitial gaps in the human GRCh37 reference genome and found that 78% of closed gaps carry long polypyrimidine/purine tracts multiple kilobases in length. Comparing the single haplotype to the human reference, we resolved 34,000 indels and structural variants at the base-pair level with 99.9% sequence accuracy. We find a 3:1 excess of simple tandem repeat (STR) insertions over deletions; 393 of the STR and variable number tandem repeat insertions are greater than 1 kbp. We find that 51% of such sequences vary at least twofold in copy number, representing sites of potential genetic instability. Of the STR insertions, 1,566 correspond to likely deficiencies in the reference sequence. In addition, the analysis uncovers other categories of complex variation that have been difficult to assess, including mobile element insertions (e.g., SVA) and inversions mapping within more complex and GC-rich regions of the genome. Our results suggest a systematic bias against the assembly of longer and more complex repetitive DNA, a bias that can now be partially resolved by applying new sequencing technologies.
