Consider the geneset: Why the transcripts used for variant annotation matter.

A. Frankish (1), JM. Mudge (1), R. Petryszak (2), GRS. Ritchie (3,4), A. Brazma (2), JL. Harrow (1), GENCODE Consortium

1) Human and Vertebrate Analysis and Annotation Group, Computation Genomics, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
2) Functional Genomics Team, EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
3) Human Genetics, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
4) Vertebrate Genomics Team, EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK

   McCarthy et al. [1] recently demonstrated large differences in the prediction of loss-of-function variation depending on whether RefSeq or Ensembl transcripts are used for annotation. Ensembl displays the GENCODE geneset, the reference human gene annotation for the ENCODE project. Although the GENCODE and RefSeq genesets contain similar numbers of protein-coding genes, there are significant differences between them, e.g. in the annotation of alternative splicing, where GENCODE protein-coding loci have a mean of 7.6 alternatively spliced transcripts while RefSeq loci have only 2.1. Similarly, compared to RefSeq, the GENCODE geneset is enriched in the annotation of long non-coding RNAs and pseudogenes, genomic coverage of annotated exons, extent of manual curation, experimental validation, and functionally descriptive biotypes. By representing more transcriptional complexity, the GENCODE geneset allows the annotation of a greater number of potentially interesting variants; the more detailed functional annotation of transcripts also assists with consequence calling. We will discuss GENCODE's extension and refinement of the geneset through the integration of RNAseq, CAGE, polyAseq, ribosome profiling, mass spectrometry and epigenomic data to identify novel loci, define 5' and 3' transcript boundaries, identify novel translation initiation sites and improve functional annotation, e.g. by confirming the translation of putative protein-coding transcripts. While our deep representation of the transcriptome is beneficial for some aspects of variant annotation, it may prove a hindrance to others, e.g. where a variant is predicted to have conflicting effects on different transcripts of the same gene.
To address this, we will describe the filtering options that allow the user to reduce the complexity of the GENCODE geneset, and explain our use of RNAseq data to investigate the abundance of GENCODE-annotated genes, transcripts and exons in order to present a smaller, but biologically relevant, set of features, e.g. the reduced set of genes expressed in a tissue of interest. In summary, the set of transcripts selected as a basis for annotating variants affects both the number of variants identified as genic and their predicted functional consequences. The GENCODE geneset captures transcriptional complexity and describes its functional potential while permitting filtering of features to facilitate accurate interpretation of variation.

[1] Genome Medicine 2014, 6:26.
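As an illustration of the kind of filtering described above, the following minimal sketch counts transcripts per gene in GTF-formatted annotation and then restricts the set to transcripts carrying a given tag. It is not the consortium's own tooling. The records and the identifiers G1, G2 and T1 to T4 are invented for the example, and the helper names (parse_transcripts, transcripts_per_gene, filter_by_tag) are hypothetical; only the tab-separated GTF layout and the quoted key/value attribute column are assumed.

```python
import re
from collections import defaultdict

# Matches quoted GTF attributes such as: gene_id "G1"; tag "basic";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def parse_transcripts(gtf_text):
    """Yield (gene_id, transcript_id, tags) for each 'transcript' row."""
    for line in gtf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = defaultdict(list)
        for key, value in ATTR_RE.findall(fields[8]):
            attrs[key].append(value)  # a key such as 'tag' may repeat
        yield attrs["gene_id"][0], attrs["transcript_id"][0], set(attrs["tag"])


def transcripts_per_gene(records):
    """Group transcript IDs by gene ID."""
    by_gene = defaultdict(list)
    for gene_id, transcript_id, _tags in records:
        by_gene[gene_id].append(transcript_id)
    return dict(by_gene)


def filter_by_tag(records, tag="basic"):
    """Keep only transcripts annotated with the given tag."""
    return [record for record in records if tag in record[2]]


def gtf_row(feature, attributes):
    """Build one tab-separated GTF row with placeholder coordinates."""
    return "\t".join(["chr1", "HAVANA", feature, "100", "900", ".", "+", ".", attributes])


# Invented example records (hypothetical IDs, placeholder coordinates).
SAMPLE = "\n".join([
    gtf_row("transcript", 'gene_id "G1"; transcript_id "T1"; tag "basic";'),
    gtf_row("exon", 'gene_id "G1"; transcript_id "T1"; tag "basic";'),
    gtf_row("transcript", 'gene_id "G1"; transcript_id "T2";'),
    gtf_row("transcript", 'gene_id "G1"; transcript_id "T3"; tag "basic"; tag "CCDS";'),
    gtf_row("transcript", 'gene_id "G2"; transcript_id "T4"; tag "basic";'),
])

records = list(parse_transcripts(SAMPLE))
per_gene = transcripts_per_gene(records)
mean_all = sum(len(ids) for ids in per_gene.values()) / len(per_gene)

basic = filter_by_tag(records, "basic")
per_gene_basic = transcripts_per_gene(basic)

print("mean transcripts per gene (all):", mean_all)
print("transcripts kept after tag filter:", [r[1] for r in basic])
```

The same pattern could be extended to expression-based filtering by replacing the tag test with a lookup in a table of per-transcript abundance, under whatever threshold suits the tissue of interest.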

You may contact the first author (during and after the meeting) at