Understanding revisions to the human reference genome and assembly model.
D. Church1, P. Flicek2, T. Graves3, T. Hubbard4, V. Schneider1, The Genome Reference Consortium
1) Natl Ctr Biotech Infor/NIH, Bldg 45 rm 5AS43, Bethesda, MD; 2) EBI, Hinxton, Cambridge, UK; 3) The Genome Institute at Washington University, St. Louis, MO; 4) The Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK.
The publication of an assembly of the human genome more than a decade ago was a milestone in biology, and this resource has transformed basic and clinical research. One of the key insights into human biology made possible by the reference assembly was the discovery of a previously unrecognized degree of genetic variation among individuals. In today's era of whole-genome sequencing, alignment of next-generation sequencing reads against the high-quality reference assembly remains a critical step in the interpretation of variation data. However, these analyses have made it increasingly clear that the linear chromosome models used in the human reference assembly do not always adequately represent the most variant and complex regions of the human genome. To address this issue, the Genome Reference Consortium (GRC; http://genomereference.org), the group overseeing the human reference assembly, developed a new assembly model. GRCh37, the current reference assembly, was the first to use this model, which has now also been implemented in the recently released mouse reference assembly, GRCm38. The model retains the intuitive linear chromosomes of previous assemblies but adds alternate assembly representations for more diverse and complicated genomic regions; these are placed in chromosome context via alignment. Assembly patches are the second key feature of the model. Representing novel sequences and assembly corrections, the patches enable the GRC to provide timely updates to the reference assembly without disruptive coordinate changes. Like the alternate assemblies, the patches are stand-alone scaffold sequences placed in chromosome context via alignment. Patches are released quarterly; there have been 10 patch releases associated with GRCh37, including more than 71 novel sequence representations and more than 69 assembly corrections.
We will show how the adoption of this assembly model has enabled GRCh37 to capture substantial amounts of human sequence not represented in previous assembly versions. We will also present data showing how the use of the alternate loci and patches improves the ability of the reference to act as an alignment substrate, and we will discuss the need for new analytic resources that take advantage of these assembly features. Lastly, we will discuss ongoing GRC efforts to address assembly issues, including the use of single-haplotype resources to resolve complex regions and of 1000 Genomes data to update rare or erroneous bases.
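As an illustration of what "placed in chromosome context via alignment" can mean in practice, the following Python sketch projects a position on a stand-alone scaffold (an alternate locus or a patch) onto chromosome coordinates. It is a minimal sketch under simplifying assumptions, not the GRC's actual data model: the `Placement` class, the `to_chrom` function, the `kind` labels, and the example scaffold name and coordinates are all hypothetical, and each placement is assumed to be a single ungapped, forward-strand aligned block.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Placement:
    """One ungapped, forward-strand aligned block (hypothetical model).

    Real placements are full alignments that may contain gaps and
    mismatches; this simplification is an assumption for illustration.
    """
    scaffold: str        # name of the stand-alone scaffold (made up below)
    kind: str            # illustrative label: "alt_locus", "fix" or "novel"
    chrom: str           # chromosome the scaffold is placed on
    chrom_start: int     # 1-based chromosome position of the block start
    scaffold_start: int  # 1-based scaffold position of the block start
    length: int          # number of aligned bases in the block


def to_chrom(p: Placement, scaffold_pos: int) -> Optional[int]:
    """Project a scaffold position onto the chromosome.

    Returns None when the position lies outside the aligned block,
    i.e. when the scaffold base has no chromosome counterpart here.
    """
    offset = scaffold_pos - p.scaffold_start
    if 0 <= offset < p.length:
        return p.chrom_start + offset
    return None


# Hypothetical example: a 500-base block of a patch placed on chr1.
example = Placement(
    scaffold="example_patch",
    kind="fix",
    chrom="chr1",
    chrom_start=1000,
    scaffold_start=1,
    length=500,
)

print(to_chrom(example, 10))   # inside the block
print(to_chrom(example, 501))  # outside the block
```

Because the scaffold keeps its own coordinate system and is tied to the chromosome only through the placement, adding or revising such a scaffold leaves existing chromosome coordinates unchanged, which is the property the patch mechanism relies on.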
You may contact the first author (during and after the meeting) at