With the ever growing abundance of Next Generation Sequencing (NGS) data there continues to be a challenge faced by the research community to not only standardize best practices and analysis methods, but to embrace the foray of open-source tools and file formats. Having been engaged since almost the beginning of this NGS data outburst with Illumina’s read length extending from 30bp to beyond 250bp, for the file format category SAM, BAM and VCF have become well accepted formats to some extent. I say well accepted because a few versions ago CLCbio still reported variants in a csv file format, but maybe I shouldn’t digress.
So in part due to the 1000 Genome Project and Hapmap consortium, formats like VCF are largely accepted as output format as most open-source tools report variants as VCF reports. This has allowed development of tools / parsers to use the file formats as a standard and provide efficient functionality and data portability.
A recent discussion on twitter about the efficiency and functionality of tools made me compile a list of these VCF parsers for future reference. You will be amazed at the variety of tools available for helping parse the eventual VCF reports from NGS data – free of cost!
Feel free to point out pros/cons of these tools and I will continue to update and make a comprehensive post. Also, it would be most helpful if folks could share their experiences with different tools and the example use-cases of the functionality!!
Mitochondrial genomes can be extracted from Whole Exome Sequencing (WES) data as outlined by this paper in Nature methods by Ernesto Picardi & Graziano Pesole. Tools like Mito Seek are now available that gather mitochondrial read sequences from NGS data and perform high throughput sequence analysis. Availability of mitochondrial genomes is important as genomic variation in mitochondria has been implicated in a variety of neuro-muscular and metabolic disorders, along with roles in aging and cancer.
However here we ponder upon the feasibility of how effective it is to extract mitochondria from different capture kits used for WES. Picardi et al used the MitoSeek tool to successfully assemble 100%, 95% and 72% of the mtDNA genome from the TruSeq (Illumina), SureSelect (Agilent) and SeqCap EZ-Exome (NimbleGen) platforms, respectively. We set out to assess the mitochondrial genome data extraction using a different approach and tool-set. Using the same sample’s dataset from three different capture kits, and Whole Genome Sequenced (WGS) data as the gold standard we evaluated alignment and variant-calling results.
Clark et al sequenced and analyzed a human blood sample (healthy, anonymous volunteer) at the Stanford University using three commonly used WES kits:
Agilent SureSelect Human All Exon kit
Nimblegen SeqCap EZ Exome Library v2.0
Illumina TruSeq Exome Enrichment
Illumina HiSeq instrument was used for WGS and all three WES capture kits. Clark et al highlight comparisons between the three capture kits, from library preparation to sequencing time. The paper discusses effectiveness of using each of these kits based on metrics such as baits, capture of UTR regions, etc. They compare variant calls across all three WES kits and WGS and discuss the ability of WES to detect additional small variants that were missed by WGS. Although this paper doesn’t provide an in-depth instrument comparison, the readers here assume that Illumina is the leader in sequencing technology (at least until tonight!)
We use this data set to compare and contrast the availability and quality of mitochondrial sequencing in off-target data from WES. A standard WGS experiment at 35× mean genomic coverage was compared to exome sequencing experiments yielding average exome target coverage of 30× for Illumina, 60× for Agilent and 68× for Nimblegen
We also utilized a single custom capture sequenced sample from Teer et al to study the feasibility of gleaning mitochondria from a custom capture experiment.
Clark et al have made this data set downloadable from NCBI in the SRA file format
Using the SRA toolkit we converted SRA to FASTQ. As these are paired end reads we used fastq-dump with the –split-3 option. This generated 2 fastq files for R1 and R2
Using BWA-MEM algorithm we aligned reads in these fastq files to allchr.fa. Additionally for the Truseq data we also used BWA-SAMPE algorithm to compare BWA alignment algorithm
The BWA alignment provided SAM files for each of three WES (Agilent, Nimblegen, Illumina) and WGS. Using Samtools we converted SAM files to BAM for easy storage and interpretability
We filtered for reads that mapped to chromosome M and those that had PHRED-scale mapping quality >= 20 (more than 99% probability of being accurate)
For calling variants we employed a custom perl script on the the generated pileup to determine variant calling at different thresholds of >=1% >=5% and >=10% variant supporting reads
All metrics for 10x/5x are using reads mapped with PHRED-scale mapping quality >= 20. The length of mitochondrial genome covered at more than 5x (5-fold) coverage and 10x is summarized for the sequencing data from different capture kits (Table 1).
All results are for BWA-MEM except for the Illumina TruSeq capture data that was also aligned using BWA-SAMPE. Our comparisons show that BWA-MEM aligned more reads and had generally better performance.
A custom capture sample was evaluated simply to see the potential of extracting mitochondrial genome from that data-type as well. It performed really well, generating more than 900 RPM for mitochondrial genome, implying much greater off-target throughput
All reads (millions)
Mapped reads (millions)
% mapped reads
Q20 chrM RPM*
> 10x chrM
> 5x chrM
SRR309293.pe (Illumina SAMPE)
*: Q20 mapped chrM reads per Million Mapped reads for that sample Table 1: Sequencing throughput and mitochondrial genome coverage from NGS data on whole-genome, exome and custom-captured samples
Coverage of Mitochondrial Genome
WGS data generated really good coverage of the mitochondrial genome, almost always > 700-fold
Coverage from Illumina Truseq data was consistent between results from using BWA-mem or BWA-sampe aligner, though the latter gave slightly lesser coverage due to fewer mapped reads
Agilent off-target data generated sufficient mitochondria mapped reads considering ~95% of mitochondrial genome covered at 5x. Higher overall throughput for the sequenced sample could have provided greater off-target sequence reads yielding higher mitochondrial genome coverage.
Nimblegen off-target data was the least abundant, and the coverage profile across mitochondrial genome was also different from other datasets. This may also be due to the high-density overlapping bait design of Nimblegen, giving focused on-target coverage, leaving fewer off-target reads.
Variant Calling on the Mitochondrial Genome
The sequencing data depicted high variability when using 1% alternate-supporting reads to annotate a mitochondrial genomic position as variant. So we used a threshold of at-least 10% reads at any given nucleotide position to be supporting the alternate allele to define a variant. The above venn-diagram highlights that the vast majority (33/41) of called variants on mitochondrial genome from WGS and WES data overlap. Another 6 variants identified in WGS were also observed in Agilent and Illumina WES data, but missed by Nimblegen WES due to low coverage. We do not provide a comprehensive iteration of the exclusive variants, but most of them suffer from low read-depth, low quality, and strand bias.
With the decreasing cost and increasing availability of exome sequencing data, there is a vast resource of mitochondrial genomes that can be mined for mitochondria-focused research. Data from large consortia like 1000 genomes and NHLBI exome datasets can be utilized for a comparative mitochondrial variation evaluation. As reported by Picardi et al, Illumina Truseq and Agilent exome kits generate better mitochondrial genome coverage compared to Nimblegen. Interestingly, even the custom-capture kit we evaluated generated a decent amount of mitochondrial genome coverage. This opens up a plethora of small NGS panel and custom-capture datasets for mitochondrial genome evaluation.