I cannot believe that this paper is already a year old. A printed copy sat on my desk, but its contents never made it from my eyes to my brain! I finally found the time to review the paper and collate its most valuable information to share here.
Whole Exome Sequencing (WES) is fast becoming the most common NGS application. It allows querying almost all of the coding genome (the 3% of 3 billion nucleotides that we understand most about) at a relatively low cost and time investment. In any list of notable sequencing papers, the most common title is “Exome sequencing identifies the causal variant for XYZ”. However, WES data always contains a small but persistent share of spurious results. This article does a great job of elucidating the common false positives and sources of noise in WES data.
- 118 WES samples from 29 families seen by the NIH Undiagnosed Diseases Program
- 401 additional exomes from the ClinSeq study for cross-checking
- Agilent 38Mb and 50Mb All Exon capture kits; GA-IIx, 76 and 100 bp paired-end reads
- Method: ELAND -> Cross_Match -> bam2mpg genotype -> CDPred prediction -> VarSifter -> Galaxy
- Used hg18; No duplicate removal
- False-positive candidate variants are usually:
  - located in highly polymorphic genomic regions
  - caused by assembly misalignment
  - due to errors in the reference genome
- 23,389 positions with excess heterozygosity (alignment errors)
- 1,009 positions where the reference genome contains the minor allele (excess homozygosity)
- Errors arise from: library construction bias; polymerase error; higher error rates towards the ends of short reads; loss of synchrony within a cluster (Illumina sequencing); platform-specific mechanistic issues
- Highly variable genes: these frequently contain numerous putatively pathogenic variants and are therefore unlikely to be disease-causing (defined as genes with >10 high-quality variants; this count should be normalized by gene length and by where in the CDS the variants were found)
- Pseudogenes: 392 high-quality variants were heterozygous in all 118 exomes
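
To make the filters in the list above more concrete, here is a minimal Python sketch of how one might flag the same three kinds of suspicious calls: positions with excess heterozygosity, positions heterozygous in every exome, and highly variable genes. This is not the paper's pipeline (which used bam2mpg and VarSifter). The function names, the VCF-style genotype strings (`0/1`), the biallelic diploid assumption, and the per-kb normalization are all my own assumptions for illustration; only the ">10 variants" threshold comes from the paper.

```python
from collections import Counter


def allele_pair(gt):
    """Split a diploid genotype string such as '0/1' or '0|1' into its alleles."""
    return tuple(gt.replace("|", "/").split("/"))


def het_excess(calls):
    """Observed minus Hardy-Weinberg-expected heterozygote fraction.

    Assumes a biallelic site with alleles coded '0' and '1'. Missing
    calls (anything containing '.') are skipped. Returns None if no
    usable calls remain.
    """
    pairs = [allele_pair(g) for g in calls if "." not in g]
    if not pairs:
        return None
    n = len(pairs)
    alt = sum(a == "1" for pair in pairs for a in pair)
    p = alt / (2 * n)                      # alternate allele frequency
    observed = sum(a != b for a, b in pairs) / n
    expected = 2 * p * (1 - p)             # Hardy-Weinberg expectation
    return observed - expected


def always_heterozygous(genotypes):
    """Positions where every called sample is heterozygous.

    `genotypes` maps a position key, e.g. ('chr1', 100), to the list of
    genotype strings across samples (a hypothetical input layout).
    """
    flagged = []
    for pos, calls in genotypes.items():
        pairs = [allele_pair(g) for g in calls if "." not in g]
        if pairs and all(a != b for a, b in pairs):
            flagged.append(pos)
    return sorted(flagged)


def highly_variable_genes(variant_genes, cds_length_bp, min_variants=10):
    """Genes with more than `min_variants` high-quality variants.

    `variant_genes` holds one gene name per variant. Each flagged gene
    maps to (count, variants per kb of CDS), so long genes can be judged
    on rate rather than raw count. Per-kb scaling is an assumption here.
    """
    counts = Counter(variant_genes)
    return {
        gene: (n, 1000 * n / cds_length_bp[gene])
        for gene, n in counts.items()
        if n > min_variants
    }
```

For a site where all samples are `0/1`, `het_excess` returns 0.5 (observed 1.0 against an expected 0.5), which is the signature one would expect when reads from a pseudogene or paralog are aligned onto the same position. In practice the cutoff on `het_excess`, and whether to use a formal Hardy-Weinberg test instead, would need tuning against the cohort size.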