With the ever growing abundance of Next Generation Sequencing (NGS) data there continues to be a challenge faced by the research community to not only standardize best practices and analysis methods, but to embrace the foray of open-source tools and file formats. Having been engaged since almost the beginning of this NGS data outburst with Illumina’s read length extending from 30bp to beyond 250bp, for the file format category SAM, BAM and VCF have become well accepted formats to some extent. I say well accepted because a few versions ago CLCbio still reported variants in a csv file format, but maybe I shouldn’t digress.
So in part due to the 1000 Genome Project and Hapmap consortium, formats like VCF are largely accepted as output format as most open-source tools report variants as VCF reports. This has allowed development of tools / parsers to use the file formats as a standard and provide efficient functionality and data portability.
A recent discussion on twitter about the efficiency and functionality of tools made me compile a list of these VCF parsers for future reference. You will be amazed at the variety of tools available for helping parse the eventual VCF reports from NGS data – free of cost!
Feel free to point out pros/cons of these tools and I will continue to update and make a comprehensive post. Also, it would be most helpful if folks could share their experiences with different tools and the example use-cases of the functionality!!
Following up on the previous post here are some more detail on the recent Science paper along with a round-up of “what do we know, what have we learned” thus far.
The Gire et al paper in Science was huge amount of work and a giant collaborative research effort. Being a computational biology researcher, I appreciate their in-depth and detailed evaluation utilizing numerous bioinformatics software tools. Gleaning through the supplemental text, I created the flowchart below as a summary of all the analysis that went into the eventual results and interpretations. This was created using the wonderful Gliffy tool.
Flowchart of the impressive work accomplished by the Gire et al Science paper (I made this using notes from their Supplemental Data)
The Broad Institute and Harvard University worked with Sierra Leone Ministry of Health and Sanitation along with other researchers to provide the comprehensive paper in Science describing the sequencing of current Ebola genomes. Simultaneously, the human trial on NIH and GSK’s investigational Ebola vaccine is to begin this week, as it performed well on primate studies. Hopefully this one will be faster than the usual 10-year turn-around observed for a vaccine trial. Although the experimental drug ZMapp is being used on the cases, it is with mixed results and much and more still needs to be done. Interestingly the drug is a three-mouse monoclonal antibody and the primate research itself was published in Nature last week. Details of how it seemed to have worked on the 2 US health care workers in the midst of this outbreak is a pretty ‘miraculous’ story!
The major points to note thus far:
First Ebola Virus Disease of 2014 confirmed in Sierra Leone on May 25
It seems like there was one instance of the EBOV transmitting from the ‘natural reservoir’ to humans and has since been transmitted from human to human (implying there is rare, though present, chance of non-human transmission)
Substitution rate is twice as high implying that continued progression of this epidemic could allow a viral adaptation, thus the need for rapid containment
The 2014 outbreak has a doubling period of about 35 days!!
Complicating matters, positive diagnosis for malaria does not necessarily rule out Ebola Virus Disease
If you have ever played the card game Killer Bunnies and your Bunnies in the Bunny circle have died because of the Level 11 Weapon of Ebola Virus you want to read this.
Research community is making strides to understand whether the virus is adapting to its host or changing as it spreads through the different populations as more countries get in its warp, in West Africa.
5 of the 50 co-authors on this Science article were infected with the deadly Zaire Ebola Virus (EBOV) themselves. Nothing short of a thriller, the events trace back to the funeral of a healer which kick-started the spread of Ebola in the region. Also reviewed here is a paper from 2008 in which the authors have pointed the VP35 protein, which during their experiments was identified as a critical component of this hemorrhagic fever.
Ebola’s genomic sequence:
Linear, single-stranded genome
Inverse-complementary 3′ and 5′ termini
~19 kb (19 thousand nucleotides long compared to 3 billion human genome)
Seven genes (compared to ~20k in humans)
The current outbreak is due to the EBOV virus, one of the five Ebola virus known to infect humans. Research groups are trying to identify whether the genetic sequence of this virus is changing fast enough in regions that are key for the accuracy of the PCR based diagnostic tests.
This EBOV virus in the 2014 epidemic has been reported to be 97% similar to the virus that first emerged in 1976. Articles across the web estimate that EBOV is set to evolve at about 7×10-4 substitutions per site per year suggesting that the current strain of EBOV would have accumulated many substitutions over the 40 year time period since 1976.
In this article Gire et al use genomic data and inferences by using next generation sequencing technologies to explain whether the virus is accumulating significant mutations as it changes hosts.
Methods compared to ascertain choice for sequencing:
Library preparation: Nugen and Nextera
Sequencing instruments: PacBio and Illumina
Nextera and Illumina provided most complete genome assembly and intrahost SNV identification
99 virus genomes, 78 patients in Sierra Leone sequenced at a median coverage >2,000x across 99.9% of EBOV coding regions
Intra and Interhost genetic variations to characterize transmission patterns
341 fixed substitutions identified between previous and 2014 EBOV
35 nonsynonymous, 173 synonymous, 133 noncoding
55 single nucleotide polymorphisms (SNPs) among this West African outbreak
15 nonsynonymous, 25 synonymous, 15 noncoding
Genetic similarity across sequenced 2014 samples suggests single transmission
Josh Herr of Michigan State University and Daniel Park of Broad Institute aim to maintain an analysis wiki for solving the underlying genomic riddle, by studying the different strains of the virus and are encouraging contributors (ebola-crowdsource).
In an earlier paper published in Journal of Virology in 2008, Hartman et al discuss how whole genome expression profiling reveals that the innate immune response of the host can be inhibited and reversed by single amino acid change in VP35 Protein.
Two reverse genetic-generated Ebola virus strains
Encode wild-type VP35 protein or VP35 with an arginine (R)-to-alanine (A) amino acid substitution at position 312
Whole-genome expression profiling of the host cells in human liver
Host cells reveal differences in response to introduction of these viruses differing by a single amino acid
VP35 protein plays a vital role in inhibiting immune responses of the host
Single amino acid change exhibits the ability to eliminate this inhibitory effect
VP35 Protein demonstrates a critical role in the severity of the disease
Dr. Lipkin professor of epidemiology at the Columbia University discusses a pertinent question of whether ‘Ebola can travel to the United States’ . He explains in a matter-of-fact way that although there is a possibility of the virus traveling to US like anywhere else, there’s also a high likelihood of it being monitored and isolated by health authorities at the earliest possible.
Lets get to a round of that card game now, shall we.
The paper by Lupski et al in Genome Medicine provides fuel to the perpetual debate of Whole Exome Sequencing (WES) vs Whole Genome Sequencing (WGS). It takes me down the memory lane to my own presentation “Genomes or Exomes: evaluation of cost, time and coverage” at Beyond the Genome 2011 conference. (If you would like to check this out, my poster is available here at Faculty of 1000 resource, with so many others from the conference). My work summarized the WES vs WGS results on a single blood sample of an individual with cardio-myopathy. Although WGS gave better coverage of UCSC exons evaluated, WES identified exclusive variants missed by WGS.
Sequencing coverage has always been the key to elucidation of variants from NGS data. Lupski et al worked on a CMT (also known as HMSN) case, and from my generic evaluation of WES read-depth coverage of CMT related genes 93% of CCDS exons had good coverage (JNNP paper). I found about 89% of the known mutations in the 33 CMT genes, including SH3TC2, to be covered at 10x (or 10-fold) sequencing coverage. As the results suggest (JNNP paper) WES misses a lot of coding regions, including important known mutations, that one needs to be careful of, especially in utilization for clinical medicine.
Back to the WGS vs WES, lets start with the key points to consider for the comparison:
Typical WES requires 60-100 million 100bp reads for decent sequencing coverage, whereas WGS requires almost a billion 100bp reads for average 30x coverage
For same reason as above, WES can be generated and analyzed with a much faster turn-around time. For clinically specific WGS analysis, I developed a novel iterative method (PLoS One) that delivers variant results in 5 hours!
WES, being targeted, provides much deeper coverage of the captured coding regions
Coverage from WGS is much more uniform, covering more of the annotated exons and independent of annotation sources. WGS has the advantage of analyzing regions with difficulty designing capture probes, providing sequencing coverage and thus potential for variant calling
Broad uniform coverage from WGS coupled with mature algorithms and tools allows for better Structural Variant, CNV and large INDEL detection for WGS data
Lupski et al performed a variety of sequencing experiments on different NGS instruments including Illumina, ABI SOLiD and Ion Torrent. The best part is, all this data is publicly available on NCBI SRA. The scientific community can make much bigger strides by open data sharing. Such a deep dataset from multiple platforms and applications is extremely beneficial providing a distinct advantage over simulated datasets for algorithm development, software evaluation and benchmarking.
SOLID sequencer: 1 WES + 1 WGS
Illumina GAII: 2 WES
Illumina HiSeq: 2 WES + 1 WGS
Ion Torrent: 2 WES (PGM and Proton)
Summarizing the paper, all the WES were captured using the NimbleGen VCRome 2.1 capture kit. Its 42Mb capture region includes Vega, CCDS and RegSeq gene models along with miRNA and regulatory regions. Interestingly, the Clark et al (Nature Biotechnology) review of different WES capture technologies concluded that the densely packed, overlapping baits of Nimblegen SeqCap EZ generated highest efficiency target enrichment. On the other hand, the recent review of WES capture by Chilamakuri et al in BMC Genomics found Illumina capture data showing higher coverage of annotated exons.
Lupski et al analyzed Illumina data using BWA (align) -> GATK (re-calibrate) -> Atlas2 (SNV/INDEL) -> Cassandra (annotate). Ion Torrent data was analyzed using TMAP (aligner) -> Picard/Torrent-Suite (duplicates) -> VarIONt (SNV) -> Cassandra (annotate). The choice of tools used, and tools like VQSR from GATK that were not used is not detailed in the paper. A particular metric that readers would have liked to know about WGS datasets is ‘Targets hit’ and ‘Targeted bases with 10+ coverage’ in Table 1. The metric should be relatively straight-forward to calculate and provides a good perspective of how metrics compare with those from WES.
The most striking observation was regarding SNV called from all WES datasets absent from WGS! Here are some of the summary points:
3709 coding SNV were concordantly called in all WES datasets, missed by the original SOLID (~30x coverage) WGS. This is huge as those 3709 SNV were identified in all six WES results, and thus should be good quality.
Variant concordance of the same sample using Illumina HiSeq & GAII – Figure 3
more than 96% and 98% SNV are concordant between HiSeq-HiSeq and GAII-GAII replicates respectively.
only 83% and 82% INDEL are concordant between HiSeq-HiSeq and GAII-GAII replicates respectively. Once again, INDEL calling is more noisy, though it was not clear if the authors used the ‘left-align’ on INDEL to get rid of false discordance due to the start and stop coordinates of INDEL not perfectly aligning. Wonder how the recent Scalpel tool that promises higher indel calling sensitivity might perform on these datasets.
even higher discordance when comparing HiSeq to GAII data (for the same sample and exome capture!!)
Properties of ‘private’ or exclusive SNV from WES results – Figure 4, Figure 5. As expected, a large majority of exclusive SNV are questionable due to basic quality metrics.
low variant fraction (% reads supporting alternate or non-reference allele)
low coverage depth
strand bias or multiply-mapped reads (leading to low variant quality)
Both WES and WGS found the 12 pharmacologically relevant variants
In all, this round goes to WES, mostly due to higher coverage achieved compared to WGS. The higher coverage allowed for elucidation of strand bias and appropriate proportion of alternate-supporting (variant calling) reads to reduce the particular FP and FN variants discussed in the paper. It would be interesting to generate a much higher average coverage WGS dataset and assess if some regions or genes are better suited for evaluation using WES. And to conclude, I quote from the paper “the high yet skewed depth of coverage in targeted regions afforded by the (W)ES methods may offer higher likelihood of recovery of significant variants and resolution of their true genotypes when compared to the lower, but more uniform WGS coverage“
Mitochondrial genomes can be extracted from Whole Exome Sequencing (WES) data as outlined by this paper in Nature methods by Ernesto Picardi & Graziano Pesole. Tools like Mito Seek are now available that gather mitochondrial read sequences from NGS data and perform high throughput sequence analysis. Availability of mitochondrial genomes is important as genomic variation in mitochondria has been implicated in a variety of neuro-muscular and metabolic disorders, along with roles in aging and cancer.
However here we ponder upon the feasibility of how effective it is to extract mitochondria from different capture kits used for WES. Picardi et al used the MitoSeek tool to successfully assemble 100%, 95% and 72% of the mtDNA genome from the TruSeq (Illumina), SureSelect (Agilent) and SeqCap EZ-Exome (NimbleGen) platforms, respectively. We set out to assess the mitochondrial genome data extraction using a different approach and tool-set. Using the same sample’s dataset from three different capture kits, and Whole Genome Sequenced (WGS) data as the gold standard we evaluated alignment and variant-calling results.
Clark et al sequenced and analyzed a human blood sample (healthy, anonymous volunteer) at the Stanford University using three commonly used WES kits:
Agilent SureSelect Human All Exon kit
Nimblegen SeqCap EZ Exome Library v2.0
Illumina TruSeq Exome Enrichment
Illumina HiSeq instrument was used for WGS and all three WES capture kits. Clark et al highlight comparisons between the three capture kits, from library preparation to sequencing time. The paper discusses effectiveness of using each of these kits based on metrics such as baits, capture of UTR regions, etc. They compare variant calls across all three WES kits and WGS and discuss the ability of WES to detect additional small variants that were missed by WGS. Although this paper doesn’t provide an in-depth instrument comparison, the readers here assume that Illumina is the leader in sequencing technology (at least until tonight!)
We use this data set to compare and contrast the availability and quality of mitochondrial sequencing in off-target data from WES. A standard WGS experiment at 35× mean genomic coverage was compared to exome sequencing experiments yielding average exome target coverage of 30× for Illumina, 60× for Agilent and 68× for Nimblegen
We also utilized a single custom capture sequenced sample from Teer et al to study the feasibility of gleaning mitochondria from a custom capture experiment.
Clark et al have made this data set downloadable from NCBI in the SRA file format
Using the SRA toolkit we converted SRA to FASTQ. As these are paired end reads we used fastq-dump with the –split-3 option. This generated 2 fastq files for R1 and R2
Using BWA-MEM algorithm we aligned reads in these fastq files to allchr.fa. Additionally for the Truseq data we also used BWA-SAMPE algorithm to compare BWA alignment algorithm
The BWA alignment provided SAM files for each of three WES (Agilent, Nimblegen, Illumina) and WGS. Using Samtools we converted SAM files to BAM for easy storage and interpretability
We filtered for reads that mapped to chromosome M and those that had PHRED-scale mapping quality >= 20 (more than 99% probability of being accurate)
For calling variants we employed a custom perl script on the the generated pileup to determine variant calling at different thresholds of >=1% >=5% and >=10% variant supporting reads
All metrics for 10x/5x are using reads mapped with PHRED-scale mapping quality >= 20. The length of mitochondrial genome covered at more than 5x (5-fold) coverage and 10x is summarized for the sequencing data from different capture kits (Table 1).
All results are for BWA-MEM except for the Illumina TruSeq capture data that was also aligned using BWA-SAMPE. Our comparisons show that BWA-MEM aligned more reads and had generally better performance.
A custom capture sample was evaluated simply to see the potential of extracting mitochondrial genome from that data-type as well. It performed really well, generating more than 900 RPM for mitochondrial genome, implying much greater off-target throughput
All reads (millions)
Mapped reads (millions)
% mapped reads
Q20 chrM RPM*
> 10x chrM
> 5x chrM
SRR309293.pe (Illumina SAMPE)
*: Q20 mapped chrM reads per Million Mapped reads for that sample Table 1: Sequencing throughput and mitochondrial genome coverage from NGS data on whole-genome, exome and custom-captured samples
Coverage of Mitochondrial Genome
WGS data generated really good coverage of the mitochondrial genome, almost always > 700-fold
Coverage from Illumina Truseq data was consistent between results from using BWA-mem or BWA-sampe aligner, though the latter gave slightly lesser coverage due to fewer mapped reads
Agilent off-target data generated sufficient mitochondria mapped reads considering ~95% of mitochondrial genome covered at 5x. Higher overall throughput for the sequenced sample could have provided greater off-target sequence reads yielding higher mitochondrial genome coverage.
Nimblegen off-target data was the least abundant, and the coverage profile across mitochondrial genome was also different from other datasets. This may also be due to the high-density overlapping bait design of Nimblegen, giving focused on-target coverage, leaving fewer off-target reads.
Variant Calling on the Mitochondrial Genome
The sequencing data depicted high variability when using 1% alternate-supporting reads to annotate a mitochondrial genomic position as variant. So we used a threshold of at-least 10% reads at any given nucleotide position to be supporting the alternate allele to define a variant. The above venn-diagram highlights that the vast majority (33/41) of called variants on mitochondrial genome from WGS and WES data overlap. Another 6 variants identified in WGS were also observed in Agilent and Illumina WES data, but missed by Nimblegen WES due to low coverage. We do not provide a comprehensive iteration of the exclusive variants, but most of them suffer from low read-depth, low quality, and strand bias.
With the decreasing cost and increasing availability of exome sequencing data, there is a vast resource of mitochondrial genomes that can be mined for mitochondria-focused research. Data from large consortia like 1000 genomes and NHLBI exome datasets can be utilized for a comparative mitochondrial variation evaluation. As reported by Picardi et al, Illumina Truseq and Agilent exome kits generate better mitochondrial genome coverage compared to Nimblegen. Interestingly, even the custom-capture kit we evaluated generated a decent amount of mitochondrial genome coverage. This opens up a plethora of small NGS panel and custom-capture datasets for mitochondrial genome evaluation.
Finally there is a comprehensive analysis on indels, and of course it is the Next Generation Sequencing data that is driving it. I have my concerns with the biases of NGS technology and analysis along with ensuing false-positives in indel detection. Nonetheless, the authors have done a good job in summarizing the information and touching upon the important points making some valuable observations. It would be great to see this comprehensive analysis repeated on the public Complete Genomics genomes or the increasing Ion Torrent data to corroborate these findings as generic and not specific to any variables.
Dataset used = 179 (~4x coverage) genomes from 1000 genomes pilot data of 3 populations
1.6 million indels – 50% of them in 4% of the genome (indel hotspots)
Polymerase slippage is the main cause of 75% of indels (almost all indels in hotspots and 50% indels in non-repeat regions are due to slippage)
indels subject to stronger purifying selection than SNVs (they call it SNPs)
recombination hotspots that are known to be enriched with SNVs are not enriched with indels
longer and frameshift indels have stronger effect on fitness
indels on average have a stronger functional effect than SNVs
STAMPY: aligner with high sensitivity and low reference bias
DINDEL genotyper: Use alt-supporting reads to select high quality indels
build implied haplotypes (LD betw SNV/indel and impute) and error model for homopolymers
ignore indels in long (>10bp) homopolymers
validate with sanger
the 1.6 million indels are 8-fold lower than SNVs from these genomes
selected novel indels (not seen in 1000 genomes report not dbSNP129)
chose 2 CEU as validation targets and sampled calls predicted to segregate in them
randomly selected a subset; able to design primers for 111; 60 sanger sequenced
36 matches; 12 low-Q sanger; 12 discordant => 0.25% FDR for this novel set (4.6% total FDR)
Homopolymer Run (6nt+) – HR – 10-fold indel enrichment compared to genomic average (even higher if include longer homopolymers)
change in copy-number count – CCC – NR-CCC & NR non-CCC
HR + TR + PR = 4% of the genome (hotspot) with 50% of indels – deletions dominate short tracts, insertions longer tracts, and then del again for much longer tracts
100-fold increase in polymorphism rate going from 4-bp homopolymer to 8-bp
25% indels not due to polymerase slippage mostly NR non-CCC – mostly deletions (about 90%) – perhaps due to formation of double-stranded break intermediate and imperfect repair
the remaining 2.5% insertions most often involve palindromic repeat
43 genes with high individual predicted mutation rate in coding regions – 10 of those do not show SNV enrichment and thus have exclusive indel enrichment to cause high mutational load – includes HTT (huntington), AR (prostrate cancer), ARID1B (neurodevelopmental), MED and MAML genes
GWAS: common indels are well tagged by SNVs – possible to phase indels into SNV haplotype reference panels
Plenary Session: Genomic Technologies
Len Pennacchio, Lawrence Berkeley National Laboratory, Chair
— could not take notes on some of the talks and afternoon session
9:00 a.m. – 9:30 a.m. Rebecca Leary, Johns Hopkins Kimmel Cancer Center
“Personalized Approaches to Non-invasive Cancer Detection”
– personalized analysis of rearranged ends (PARE)-identify structural alterations in solid tumors
– generate personalized biomarkers for the detection of circulating tumor DNA
– Tumor-derived mate-pair library -> somatic rearrangements -> confirmed by PCR in tumor & matched normal
– Application = monitor disease progression, identify residual disease (predict relapse), surgical margins
– Plasma Aneuploidy Score – clearly differentiates normals from colorectal cancer samples (just 10x physical coverage – detect rearrangements)
– 0.75% circulating tumor DNA – 90%+ sensitivity, 99%+ specificity using 1 HiSeq lane
9:30 a.m. – 9:55 a.m. * Eric Antoniou, Cold Spring Harbor Laboratory
“Increased Read Length and Sequence Quality with Pacific Biosciences Magbead Loading System and a New DNA Polymerase”
– duckweed as Biofuel (40tonnes/acre/yr), .1 ton yields .025tons of ethanol by weight and is ~7.5 gallons a day
– rice genome (470 Mbp) sequenced using the Pacific Biosciences RS sequencer (MagBead loading system) – hybrid de novo assembly with Illumina data
– 10kbp insert library; 9X coverage of the rice genome (mean read length – 3kb, max 21kb)
– mean accuracy mode of single pass long read – 90%, (85-87% for current C2 chemistry)
9:55 a.m. – 10:20 a.m. * Tim Harkins, Life Technologies
“Ovarian Cancer Evolution: a Tale of Two Paths”
– ovarian cancer 9th leading cancer among women, 5th leading cause of cancer related death, high relapse rate
10:45 a.m. – 11:10 a.m. * X. Sunney Xie, Harvard University
“Detecting Single Nucleotide and Copy Number Variations of a Single Human Cell by Whole Genome Sequencing”
– Individual cells of identical descent can have different genomes (dynamic changes in DNA) – important to many biological investigations and medical diagnoses
– Single-cell whole-genome amplification methods – exponential amplification bias => low genome coverage
– Multiple Annealing and Looping Based Amplification Cycles (MALBAC) – 93% genome coverage ≥ 1x for a single human cell at 30x mean sequencing depth
– detection of digitized CNV & SNVs – ~76% efficiency for a single cancer cell
– 2.5 single-base substitutions per mitosis in human tumor cell line identified using single cell amplification/sequencing
– circulating tumor cells (CTCs) of same patient show similar CNV; CTCs of lung cancer patients show similar CTC
– clinical trial for pre-implantation genomic screening for IVF using single polar bodies of oocytes
– male’s genome can be phased by seq sperm, female’s genome phased using polar bodies genomes
– 0.1X genome coverage is enough to determine aneuploidy (at 8-cell stage) for MALBAC’s single-cell sequencing in IVF
– anomalous transition/transversion ratio for newly acquired SNVs
11:10 a.m. – 11:35 a.m. * Jeremy Schmutz, HudsonAlpha Institute
“Evaluating Moleculo Long Read Technology for de novo Whole Genome Sequencing”
– Moleculo Long Read technology – sequencing two complex plant genomes (inbred diploid switchgrass comparator Panicum hallii (600 Mb) and the outbred tetraploid Miscanthus sinensis (~2.3 Gb)
– incldue long, retrotransposon-derived repeats, diverse GC-content and present significant challenges for short-read NGS whole genome shotgun sequencing
– Moleculo reads – 10kb reads (5kb avg), high accuracy (1.26bp error/10k), tunable to genome size/complexity, reduces computational complexity
– limitations = distribution of reads depends on local repetitive content & global repeat freq; illumina based => localized chemistry issues; some amplification bias
11:35 a.m. – 12:00 p.m. * Jonas Korlach, Pacific Biosciences
“Automated, Non-Hybrid De Novo Genome Assemblies and Epigenomes of Bacterial Pathogens”