dbGaP Genotype Fingerprint Collection. Y. Jin, S. Stefanov, S. Dracheva, Z. Wang, N. Sharopova, A. Sturcke, S. Sherry, M. Feolo. National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

   The database of Genotypes and Phenotypes (dbGaP) has accessioned more than one million samples from over 750,000 human individuals. At this scale, it is not uncommon for multiple independent samples to be collected from the same individual (or subject) for different research purposes and submitted to dbGaP under different studies. dbGaP has established a genotype fingerprint collection to detect these cryptic duplicates in the database. In theory, a few dozen independent and informative SNPs are enough to uniquely identify an individual. However, because genotypes submitted to dbGaP are obtained using different methods and cover different genomic regions, many more SNPs are needed to ensure that a sufficient number of informative genotyped SNPs overlap between any two samples. We have selected 11,000 SNPs for fingerprinting using the following requirements: 1) the SNP is covered by at least 80% of the genotyping methods used by dbGaP studies; 2) the SNP is biallelic, with a minor allele frequency of at least 0.17 as reported by the 1000 Genomes Project; 3) the SNP is separated from its nearest neighbor in the set by a physical distance of at least 50,000 bp; 4) the SNP is not palindromic; 5) the SNP is autosomal. Non-palindromic SNPs were selected to avoid the DNA strand orientation problem: their two alleles are not complementary to each other (unlike A/T or C/G SNPs), so a strand flip can always be recognized. For example, if one genotyping chip reports the two alleles of a certain SNP as A/G and another chip reports T/C for the same SNP, then we know that genotype AA on the first chip is the same as TT on the second. We have created computer programs to read genotypes from different formats, including PLINK ped and bed files, transposed datasets, and other matrix formats. To minimize the footprint of the collection, we use four two-bit codes to represent the three genotypes and one missing state, and store the genotypes of four SNPs in one byte. We have loaded genotypes of about 600,000 samples into the fingerprint collection.
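The four-genotypes-per-byte storage described above can be illustrated with a minimal sketch. The code assignment (0, 1, 2 for the number of copies of one allele, 3 for missing), the bit order within a byte, and the function names are assumptions made for illustration; the abstract does not specify the actual layout used by dbGaP.

```python
# Hypothetical 2-bit genotype codes (an assumption, not dbGaP's documented layout):
# 0, 1, 2 = copies of the alternate allele; 3 = missing call.
MISSING = 3


def pack_genotypes(codes):
    """Pack 2-bit genotype codes into bytes, four SNPs per byte."""
    packed = bytearray((len(codes) + 3) // 4)
    for i, code in enumerate(codes):
        if not 0 <= code <= 3:
            raise ValueError(f"genotype code out of range at SNP {i}: {code}")
        # SNP i occupies bits 2*(i % 4) and 2*(i % 4) + 1 of byte i // 4.
        packed[i // 4] |= code << (2 * (i % 4))
    # Fill unused slots in the last byte with the missing code.
    for i in range(len(codes), 4 * len(packed)):
        packed[i // 4] |= MISSING << (2 * (i % 4))
    return bytes(packed)


def unpack_genotypes(packed, n_snps):
    """Recover the first n_snps genotype codes from a packed fingerprint."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n_snps)]
```

Under this layout, a fingerprint of 11,000 SNPs occupies 2,750 bytes per sample.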
We have also developed algorithms to identify duplicates quickly. Using these algorithms, we have found about 70,000 pairs of cryptic duplicate samples that were collected either from the same subjects or from identical twins. This presentation will introduce the dbGaP genotype fingerprint collection and describe how we use it to discover sample/subject overlaps between dbGaP studies, to find inconsistencies among the submitted subject-sample mapping files, pedigree files, and genotype datasets, and to estimate per-study genotyping error rates.
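The abstract does not describe the duplicate-detection algorithms themselves. As a rough illustration of the underlying idea only, a naive pairwise comparison could count the fingerprint SNPs called in both samples and the fraction with identical genotypes. The function names, the genotype codes (0, 1, 2, with 3 for missing), and both thresholds below are hypothetical and are not taken from dbGaP.

```python
MISSING = 3  # assumed code for a missing call; 0, 1, 2 are genotype codes


def concordance(a, b):
    """Return (shared, matches) over SNPs called in both fingerprints."""
    if len(a) != len(b):
        raise ValueError("fingerprints must cover the same SNP set")
    shared = matches = 0
    for x, y in zip(a, b):
        if x == MISSING or y == MISSING:
            continue  # skip SNPs not genotyped in both samples
        shared += 1
        if x == y:
            matches += 1
    return shared, matches


def is_likely_duplicate(a, b, min_shared=1000, min_rate=0.98):
    """Flag a pair as a candidate duplicate (thresholds are illustrative)."""
    shared, matches = concordance(a, b)
    # Require enough overlapping SNPs before trusting the match rate.
    return shared > 0 and shared >= min_shared and matches / shared >= min_rate
```

A comparison of this kind is quadratic in the number of samples, so it is a sketch of the matching criterion rather than of a fast search. Among pairs confirmed as duplicates, the remaining mismatch fraction (1 - matches / shared) is one plausible starting point for estimating genotyping error.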

You may contact the first author (during and after the meeting) at