The paper by Lupski et al in Genome Medicine provides fuel to the perpetual debate of Whole Exome Sequencing (WES) vs Whole Genome Sequencing (WGS). It takes me down the memory lane to my own presentation “Genomes or Exomes: evaluation of cost, time and coverage” at Beyond the Genome 2011 conference. (If you would like to check this out, my poster is available here at Faculty of 1000 resource, with so many others from the conference). My work summarized the WES vs WGS results on a single blood sample of an individual with cardio-myopathy. Although WGS gave better coverage of UCSC exons evaluated, WES identified exclusive variants missed by WGS.
Sequencing coverage has always been the key to elucidation of variants from NGS data. Lupski et al worked on a CMT (also known as HMSN) case, and from my generic evaluation of WES read-depth coverage of CMT related genes 93% of CCDS exons had good coverage (JNNP paper). I found about 89% of the known mutations in the 33 CMT genes, including SH3TC2, to be covered at 10x (or 10-fold) sequencing coverage. As the results suggest (JNNP paper) WES misses a lot of coding regions, including important known mutations, that one needs to be careful of, especially in utilization for clinical medicine.
Back to the WGS vs WES, lets start with the key points to consider for the comparison:
|Cost||WES||Typical WES requires 60-100 million 100bp reads for decent sequencing coverage, whereas WGS requires almost a billion 100bp reads for average 30x coverage|
|Time||WES||For same reason as above, WES can be generated and analyzed with a much faster turn-around time. For clinically specific WGS analysis, I developed a novel iterative method (PLoS One) that delivers variant results in 5 hours!|
|– Depth||WES||WES, being targeted, provides much deeper coverage of the captured coding regions|
|– Breadth||WGS||Coverage from WGS is much more uniform, covering more of the annotated exons and independent of annotation sources. WGS has the advantage of analyzing regions with difficulty designing capture probes, providing sequencing coverage and thus potential for variant calling|
|Structural Variants||WGS||Broad uniform coverage from WGS coupled with mature algorithms and tools allows for better Structural Variant, CNV and large INDEL detection for WGS data|
Lupski et al performed a variety of sequencing experiments on different NGS instruments including Illumina, ABI SOLiD and Ion Torrent. The best part is, all this data is publicly available on NCBI SRA. The scientific community can make much bigger strides by open data sharing. Such a deep dataset from multiple platforms and applications is extremely beneficial providing a distinct advantage over simulated datasets for algorithm development, software evaluation and benchmarking.
- SOLID sequencer: 1 WES + 1 WGS
- Illumina GAII: 2 WES
- Illumina HiSeq: 2 WES + 1 WGS
- Ion Torrent: 2 WES (PGM and Proton)
Summarizing the paper, all the WES were captured using the NimbleGen VCRome 2.1 capture kit. Its 42Mb capture region includes Vega, CCDS and RegSeq gene models along with miRNA and regulatory regions. Interestingly, the Clark et al (Nature Biotechnology) review of different WES capture technologies concluded that the densely packed, overlapping baits of Nimblegen SeqCap EZ generated highest efficiency target enrichment. On the other hand, the recent review of WES capture by Chilamakuri et al in BMC Genomics found Illumina capture data showing higher coverage of annotated exons.
Lupski et al analyzed Illumina data using BWA (align) -> GATK (re-calibrate) -> Atlas2 (SNV/INDEL) -> Cassandra (annotate). Ion Torrent data was analyzed using TMAP (aligner) -> Picard/Torrent-Suite (duplicates) -> VarIONt (SNV) -> Cassandra (annotate). The choice of tools used, and tools like VQSR from GATK that were not used is not detailed in the paper. A particular metric that readers would have liked to know about WGS datasets is ‘Targets hit’ and ‘Targeted bases with 10+ coverage’ in Table 1. The metric should be relatively straight-forward to calculate and provides a good perspective of how metrics compare with those from WES.
The most striking observation was regarding SNV called from all WES datasets absent from WGS! Here are some of the summary points:
- 3709 coding SNV were concordantly called in all WES datasets, missed by the original SOLID (~30x coverage) WGS. This is huge as those 3709 SNV were identified in all six WES results, and thus should be good quality.
- Variant concordance of the same sample using Illumina HiSeq & GAII – Figure 3
- more than 96% and 98% SNV are concordant between HiSeq-HiSeq and GAII-GAII replicates respectively.
- only 83% and 82% INDEL are concordant between HiSeq-HiSeq and GAII-GAII replicates respectively. Once again, INDEL calling is more noisy, though it was not clear if the authors used the ‘left-align’ on INDEL to get rid of false discordance due to the start and stop coordinates of INDEL not perfectly aligning. Wonder how the recent Scalpel tool that promises higher indel calling sensitivity might perform on these datasets.
- even higher discordance when comparing HiSeq to GAII data (for the same sample and exome capture!!)
- Properties of ‘private’ or exclusive SNV from WES results – Figure 4, Figure 5. As expected, a large majority of exclusive SNV are questionable due to basic quality metrics.
- low variant fraction (% reads supporting alternate or non-reference allele)
- low coverage depth
- strand bias or multiply-mapped reads (leading to low variant quality)
- Both WES and WGS found the 12 pharmacologically relevant variants
In all, this round goes to WES, mostly due to higher coverage achieved compared to WGS. The higher coverage allowed for elucidation of strand bias and appropriate proportion of alternate-supporting (variant calling) reads to reduce the particular FP and FN variants discussed in the paper. It would be interesting to generate a much higher average coverage WGS dataset and assess if some regions or genes are better suited for evaluation using WES. And to conclude, I quote from the paper “the high yet skewed depth of coverage in targeted regions afforded by the (W)ES methods may offer higher likelihood of recovery of significant variants and resolution of their true genotypes when compared to the lower, but more uniform WGS coverage“