Exomes vs Genomes (re-visited)

The paper by Lupski et al in Genome Medicine adds fuel to the perpetual debate of Whole Exome Sequencing (WES) vs Whole Genome Sequencing (WGS). It takes me down memory lane to my own presentation, “Genomes or Exomes: evaluation of cost, time and coverage”, at the Beyond the Genome 2011 conference. (If you would like to check it out, my poster is available here at the Faculty of 1000 resource, along with many others from the conference.) My work summarized WES vs WGS results on a single blood sample from an individual with cardiomyopathy. Although WGS gave better coverage of the UCSC exons evaluated, WES identified exclusive variants missed by WGS.

Sequencing coverage has always been the key to elucidating variants from NGS data. Lupski et al worked on a CMT (also known as HMSN) case. In my own generic evaluation of WES read-depth coverage of CMT-related genes (JNNP paper), 93% of CCDS exons had good coverage, and about 89% of the known mutations in the 33 CMT genes, including SH3TC2, were covered at 10x (10-fold) sequencing coverage. As those results suggest, WES still misses a meaningful share of coding regions, including important known mutations (roughly 11% of them in that evaluation), so one needs to be careful, especially when using it in clinical medicine.

Back to WGS vs WES. Let's start with the key points to consider for the comparison:

Key Point | Favors (WES/WGS) | Notes
--- | --- | ---
Cost | WES | Typical WES requires 60-100 million 100bp reads for decent sequencing coverage, whereas WGS requires almost a billion 100bp reads for an average of 30x coverage
Time | WES | For the same reason as above, WES can be generated and analyzed with a much faster turn-around time. For clinically specific WGS analysis, I developed a novel iterative method (PLoS One) that delivers variant results in 5 hours!
Average Coverage: Depth | WES | WES, being targeted, provides much deeper coverage of the captured coding regions
Average Coverage: Breadth | WGS | Coverage from WGS is much more uniform, covering more of the annotated exons, and is independent of annotation sources. WGS also covers regions where capture probes are difficult to design, providing sequencing coverage and thus the potential for variant calling
Structural Variants | WGS | Broad, uniform coverage from WGS, coupled with mature algorithms and tools, allows for better Structural Variant, CNV and large INDEL detection
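
The read counts in the Cost row can be sanity-checked with the usual back-of-the-envelope coverage formula (reads x read length / target size). The sketch below is only illustrative: the 3.1 Gb genome size and the 60% on-target fraction are my assumptions, not figures from the paper, and the 42Mb target is the capture region size mentioned later in this post.

```python
def mean_coverage(n_reads, read_length_bp, target_size_bp, on_target_fraction=1.0):
    """Expected mean depth = usable sequenced bases / size of the region sequenced."""
    return n_reads * read_length_bp * on_target_fraction / target_size_bp


GENOME_BP = 3.1e9   # assumed approximate human genome size
EXOME_BP = 42e6     # 42Mb capture region (see the capture kit discussion in this post)
ON_TARGET = 0.6     # assumed fraction of WES reads landing on target (hypothetical)

wgs_depth = mean_coverage(1e9, 100, GENOME_BP)
wes_depth_low = mean_coverage(60e6, 100, EXOME_BP, ON_TARGET)
wes_depth_high = mean_coverage(100e6, 100, EXOME_BP, ON_TARGET)

print(f"WGS, 1 billion reads:   ~{wgs_depth:.0f}x")
print(f"WES, 60 million reads:  ~{wes_depth_low:.0f}x")
print(f"WES, 100 million reads: ~{wes_depth_high:.0f}x")
```

Under these assumptions WES reaches several-fold deeper average coverage of its targets with roughly a tenth of the reads, which is what drives the cost and time advantage.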

Lupski et al performed a variety of sequencing experiments on different NGS instruments, including Illumina, ABI SOLiD and Ion Torrent. The best part is that all this data is publicly available on NCBI SRA. The scientific community can make much bigger strides through open data sharing. Such a deep dataset from multiple platforms and applications is extremely beneficial, providing a distinct advantage over simulated datasets for algorithm development, software evaluation and benchmarking.

  • SOLiD sequencer: 1 WES + 1 WGS
  • Illumina GAII: 2 WES
  • Illumina HiSeq: 2 WES + 1 WGS
  • Ion Torrent: 2 WES (PGM and Proton)

To summarize the paper: all the WES libraries were captured using the NimbleGen VCRome 2.1 capture kit. Its 42Mb capture region includes Vega, CCDS and RefSeq gene models along with miRNA and regulatory regions. Interestingly, the Clark et al (Nature Biotechnology) review of different WES capture technologies concluded that the densely packed, overlapping baits of NimbleGen SeqCap EZ generated the highest-efficiency target enrichment. On the other hand, the recent review of WES capture by Chilamakuri et al in BMC Genomics found that Illumina capture data showed higher coverage of annotated exons.

Lupski et al analyzed Illumina data using BWA (align) -> GATK (re-calibrate) -> Atlas2 (SNV/INDEL) -> Cassandra (annotate). Ion Torrent data was analyzed using TMAP (align) -> Picard/Torrent-Suite (duplicates) -> VarIONt (SNV) -> Cassandra (annotate). The paper does not detail why these tools were chosen, or why others such as VQSR from GATK were not used. Two metrics that readers would have liked to see for the WGS datasets are ‘Targets hit’ and ‘Targeted bases with 10+ coverage’ in Table 1. They should be relatively straightforward to calculate and would give a good perspective on how WGS compares with WES.
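
As a rough illustration of why these two metrics are easy to compute for any dataset, here is a minimal sketch. It assumes per-base depth is already available as a lookup (for example parsed from the output of a depth tool), uses BED-style 0-based half-open intervals, and defines a target as "hit" when at least one of its bases is covered; that last definition is my assumption, since the paper's exact definition is not given here. All names are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open


def target_metrics(targets: List[Interval],
                   depth: Dict[Tuple[str, int], int],
                   min_depth: int = 10) -> Dict[str, float]:
    """Percent of targets hit and percent of targeted bases at >= min_depth."""
    targets_hit = 0
    bases_total = 0
    bases_ok = 0
    for chrom, start, end in targets:
        # Positions missing from the lookup are treated as zero coverage.
        depths = [depth.get((chrom, pos), 0) for pos in range(start, end)]
        if any(d > 0 for d in depths):
            targets_hit += 1
        bases_total += len(depths)
        bases_ok += sum(d >= min_depth for d in depths)
    return {
        "targets_hit_pct": 100.0 * targets_hit / len(targets) if targets else 0.0,
        "bases_min_depth_pct": 100.0 * bases_ok / bases_total if bases_total else 0.0,
    }


# Toy example: three targets, one well covered, one shallow, one missed.
targets = [("chr1", 0, 10), ("chr1", 20, 30), ("chr2", 0, 5)]
depth = {("chr1", p): 12 for p in range(0, 10)}
depth.update({("chr1", p): 4 for p in range(20, 25)})

metrics = target_metrics(targets, depth)
print(metrics)
```

The same function works unchanged for WGS data, as long as the target list is the exome capture design, which is what makes the WES vs WGS comparison like-for-like.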

The most striking observation concerned SNVs called in all WES datasets but absent from WGS! Here are some of the summary points:

  • 3709 coding SNVs were concordantly called in all WES datasets but missed by the original SOLiD (~30x coverage) WGS. This is huge, as those 3709 SNVs were identified in all six WES results and thus should be of good quality.
  • Variant concordance of the same sample using Illumina HiSeq & GAII – Figure 3
      • more than 96% and 98% of SNVs are concordant between HiSeq-HiSeq and GAII-GAII replicates respectively.
      • only 83% and 82% of INDELs are concordant between HiSeq-HiSeq and GAII-GAII replicates respectively. Once again, INDEL calling is noisier, though it was not clear whether the authors left-aligned the INDELs to get rid of false discordance caused by INDEL start and stop coordinates not lining up perfectly. I wonder how the recent Scalpel tool, which promises higher indel calling sensitivity, might perform on these datasets.
      • even higher discordance when comparing HiSeq to GAII data (for the same sample and exome capture!!)
  • Properties of ‘private’ or exclusive SNVs from WES results – Figure 4, Figure 5. As expected, a large majority of exclusive SNVs are questionable on basic quality metrics.
      • low variant fraction (% reads supporting alternate or non-reference allele)
      • low coverage depth
      • strand bias or multiply-mapped reads (leading to low variant quality)
  • Both WES and WGS found the 12 pharmacologically relevant variants
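
To make the left-alignment point from the INDEL bullet concrete: the same deletion inside a repeat can be reported at several coordinates, and two replicates that disagree only on placement look discordant until both are shifted to the leftmost equivalent position. The sketch below is a simplified illustration of that idea (a pure insertion or deletion on a toy reference, with no anchor base), not the paper's method; the `Indel` representation and the Jaccard-style concordance are my own assumptions. In practice one would normalize VCF files with an existing tool rather than hand-rolled code.

```python
from typing import NamedTuple, Set


class Indel(NamedTuple):
    pos: int   # 0-based: first deleted base, or base before which seq is inserted
    kind: str  # "DEL" or "INS"
    seq: str   # deleted or inserted bases


def left_align(ref: str, indel: Indel) -> Indel:
    """Shift an indel left while an equivalent representation exists."""
    pos, seq = indel.pos, indel.seq
    # If the base just left of the indel equals the indel's last base,
    # rotating the indel sequence by one yields the same resulting haplotype.
    while seq and pos > 0 and ref[pos - 1] == seq[-1]:
        seq = seq[-1] + seq[:-1]
        pos -= 1
    return Indel(pos, indel.kind, seq)


def concordance(a: Set[Indel], b: Set[Indel]) -> float:
    """Shared calls divided by all calls (Jaccard index)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


ref = "GCACACAT"
replicate_1 = {Indel(3, "DEL", "CA")}  # same CA deletion, placed mid-repeat
replicate_2 = {Indel(5, "DEL", "CA")}  # same CA deletion, placed at the right end

raw = concordance(replicate_1, replicate_2)
normalized = concordance({left_align(ref, i) for i in replicate_1},
                         {left_align(ref, i) for i in replicate_2})
print(raw, normalized)
```

Both calls collapse to the same left-aligned record, so the apparent discordance disappears. Any residual INDEL discordance after this step is more likely to be real calling noise.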

In all, this round goes to WES, mostly due to the higher coverage achieved compared to WGS. The higher coverage made it possible to detect strand bias and to judge whether a sufficient proportion of reads supported the alternate allele, which reduces the particular false-positive (FP) and false-negative (FN) variants discussed in the paper. It would be interesting to generate a WGS dataset with much higher average coverage and assess whether some regions or genes are better suited for evaluation using WES. To conclude, I quote from the paper: “the high yet skewed depth of coverage in targeted regions afforded by the (W)ES methods may offer higher likelihood of recovery of significant variants and resolution of their true genotypes when compared to the lower, but more uniform WGS coverage.”

Journal Club: False-positive signals in exome sequencing

Detecting false-positive signals in exome sequencing

Human Mutation

I cannot believe that this paper is already a year old. There was a printed copy on my desk, but it never got transmitted from the eyes into the brain!! Finally, there was enough time to review the paper and collate all the valuable information to share here.

Whole Exome Sequencing (WES) is fast becoming the most common NGS application. It allows querying almost all of the coding genome (the 3% of 3 billion nucleotides that we understand most about) at a relatively low cost and time investment. Look up any list of notable sequencing papers, and the most common title is “Exome sequencing identifies the causal variant for XYZ”. However, we know about the small but omnipresent spurious results that are part of WES data. This article does a great job of elucidating the common false positives and sources of noise in WES data.

  • 118 WES samples from 29 families seen by the NIH Undiagnosed Diseases Program
  • 401 additional exomes from the ClinSeq study for cross-checking
  • Agilent 38Mb and 50Mb All Exon capture kits; GA-IIx 76bp and 100bp paired-end reads
  • Method: ELAND -> Cross_Match -> bam2mpg genotype -> CDPred prediction -> VarSifter -> Galaxy
  • Used hg18; no duplicate removal
  • False-positive candidate variants are usually
    • located in highly polymorphic genomic regions
    • caused by assembly misalignment
    • due to errors in the reference genome
  • 23,389 positions with excess heterozygosity (alignment error)
  • 1009 positions where the reference genome contains the minor allele (excess homozygosity)
  • Errors arise from – library construction bias; polymerase error; higher error rate towards the end of short reads; loss of synchrony within a cluster (Illumina sequencing); platform-specific mechanistic issues
  • Highly Variable Genes – frequently contain numerous putatively pathogenic variants, and are thus unlikely to be disease-causing (genes with >10 high quality variants; should normalize by gene length and by where in the CDS the variants were found)
  • (Pseudogenes) 392 high quality variants were heterozygous in all 118 exomes
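
Two of the checks in the notes above are easy to prototype on a cohort genotype table: flagging positions that are heterozygous in (nearly) every sample, and normalizing per-gene variant counts by length. This is only a sketch under my own assumptions: the thresholds, gene names and data layout are hypothetical, and the paper's actual statistical test for excess heterozygosity is not reproduced here.

```python
from collections import Counter
from typing import Dict, List

CALLED = {"0/0", "0/1", "1/1"}  # anything else (e.g. "./.") counts as a no-call


def het_fraction(genotypes: List[str]) -> float:
    """Fraction of called genotypes that are heterozygous."""
    called = [g for g in genotypes if g in CALLED]
    if not called:
        return 0.0
    return sum(g == "0/1" for g in called) / len(called)


def flag_excess_het(calls: Dict[str, List[str]], min_fraction: float = 1.0) -> List[str]:
    """Positions heterozygous in at least min_fraction of called samples."""
    return sorted(pos for pos, gts in calls.items()
                  if het_fraction(gts) >= min_fraction)


def variants_per_kb(variant_genes: List[str], cds_length_bp: Dict[str, int]) -> Dict[str, float]:
    """High quality variant count per gene, normalized to variants per kb of CDS."""
    counts = Counter(variant_genes)
    return {gene: 1000.0 * n / cds_length_bp[gene] for gene, n in counts.items()}


calls = {
    "chr1:100": ["0/1", "0/1", "0/1", "0/1"],   # het in every sample: suspicious
    "chr1:200": ["0/0", "0/1", "1/1", "0/0"],   # ordinary polymorphism
    "chr2:50":  ["0/1", "0/1", "./.", "0/1"],   # het in every called sample
}
flagged = flag_excess_het(calls)

# Two hypothetical genes with the same raw count but very different lengths.
density = variants_per_kb(["GENE_A"] * 12 + ["GENE_B"] * 12,
                          {"GENE_A": 1200, "GENE_B": 24000})
print(flagged, density)
```

Both toy genes pass the raw ">10 variants" cutoff, but only the short one stands out once length is taken into account, which is the point of the normalization suggested above.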

Similar reading:
PLOS ONE: Limitations of the Human Reference Genome for Personalized Genomics

Journal Club: Indels in 179 genomes (1000genome data)

The origin, evolution and functional impact of short insertion-deletion variants identified in 179 human genomes

Genome Research

Finally, there is a comprehensive analysis of indels, and of course it is Next Generation Sequencing data that is driving it. I have my concerns about the biases of NGS technology and analysis, along with the ensuing false positives in indel detection. Nonetheless, the authors have done a good job of summarizing the information, touching upon the important points and making some valuable observations. It would be great to see this comprehensive analysis repeated on the public Complete Genomics genomes or the growing body of Ion Torrent data, to corroborate these findings as general rather than specific to one platform or pipeline.

  • Dataset used = 179 (~4x coverage) genomes from 1000 genomes pilot data of 3 populations
  • 1.6 million indels – 50% of them in 4% of the genome (indel hotspots)
  • Polymerase slippage accounts for about 75% of indels (almost all indels in hotspots and 50% of indels in non-repeat regions are due to slippage)
  • indels are subject to stronger purifying selection than SNVs (the authors call them SNPs)
  • recombination hotspots that are known to be enriched with SNVs are not enriched with indels
  • longer and frameshift indels have stronger effect on fitness
  • indels on average have a stronger functional effect than SNVs
  • Method
    • STAMPY: aligner with high sensitivity and low reference bias
    • DINDEL genotyper: Use alt-supporting reads to select high quality indels
    • build implied haplotypes (LD between SNV/indel, then impute) and an error model for homopolymers
    • ignore indels in long (>10bp) homopolymers
    • validate with Sanger
  • the 1.6 million indels are about 8-fold fewer than the SNVs called from these genomes
  • selected novel indels (not seen in the 1000 genomes report nor in dbSNP129)
  • chose 2 CEU as validation targets and sampled calls predicted to segregate in them
  • randomly selected a subset; able to design primers for 111; 60 Sanger sequenced
  • 36 matches; 12 low-quality Sanger traces; 12 discordant => 25% FDR for this novel set, i.e. 12 of the 48 interpretable calls (4.6% total FDR)
  • INDEL classes
    • Homopolymer Run (6nt+) – HR – 10-fold indel enrichment compared to the genomic average (even higher if longer homopolymers are included)
    • Tandem Repeat – TR – 20-fold indel enrichment
    • Predicted hotspot – PR – predicted indel rate > predicted SNV rate
    • Non-repetitive sites – NR
    • change in copy-number count – CCC – NR-CCC & NR non-CCC
  • HR + TR + PR = 4% of the genome (hotspot) with 50% of indels – deletions dominate short tracts, insertions longer tracts, and then del again for much longer tracts
  • 100-fold increase in polymorphism rate going from 4-bp homopolymer to 8-bp
  • the 25% of indels not due to polymerase slippage are mostly NR non-CCC, and mostly deletions (about 90%) – perhaps due to formation of a double-stranded break intermediate and imperfect repair
  • the remaining 2.5% (insertions) most often involve palindromic repeats
  • 43 genes with high individual predicted mutation rate in coding regions – 10 of those do not show SNV enrichment and thus have exclusive indel enrichment causing high mutational load – includes HTT (Huntington's disease), AR (prostate cancer), ARID1B (neurodevelopmental), MED and MAML genes
  • GWAS: common indels are well tagged by SNVs – possible to phase indels into SNV haplotype reference panels
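
Two pieces of the notes above lend themselves to a small worked example: deciding whether a site sits in a homopolymer run (the HR class starts at 6nt, and runs longer than 10bp were ignored), and the validation arithmetic. The sketch below covers only the homopolymer part of the classification; the TR, PR and CCC classes need repeat annotation and the authors' rate model, which I do not attempt here. Function names and the toy reference are hypothetical, and the FDR is computed by taking the Sanger counts in the notes at face value, with the low-quality traces excluded.

```python
def homopolymer_run_length(ref: str, pos: int) -> int:
    """Length of the run of identical bases that contains ref[pos]."""
    base = ref[pos]
    left = pos
    while left > 0 and ref[left - 1] == base:
        left -= 1
    right = pos
    while right + 1 < len(ref) and ref[right + 1] == base:
        right += 1
    return right - left + 1


def classify_site(ref: str, pos: int, hr_min: int = 6, hr_max: int = 10) -> str:
    """Label a site by homopolymer context only (simplified)."""
    run = homopolymer_run_length(ref, pos)
    if run > hr_max:
        return "ignored"   # long homopolymers were excluded from calling
    if run >= hr_min:
        return "HR"
    return "other"         # would need further checks for TR / PR / NR


def false_discovery_rate(confirmed: int, discordant: int) -> float:
    """Discordant calls as a fraction of calls with an interpretable result."""
    return discordant / (confirmed + discordant)


# Toy reference: a 7bp A-run, some unique sequence, then a 12bp T-run.
ref = "ACGT" + "A" * 7 + "CGTACG" + "T" * 12 + "GCA"

labels = [classify_site(ref, 6), classify_site(ref, 12), classify_site(ref, 20)]
novel_fdr = false_discovery_rate(confirmed=36, discordant=12)
print(labels, novel_fdr)
```

The FDR for the novel set comes out at 12 / (36 + 12) = 0.25. The much lower overall FDR quoted in the notes cannot be derived from these counts alone, since it depends on how many calls were novel versus previously known.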