Mitochondrial Gold Rush

Mitochondrial genomes can be extracted from Whole Exome Sequencing (WES) data as outlined by this paper in Nature methods by Ernesto Picardi Graziano Pesole. Tools like Mito Seek are now available that gather mitochondrial read sequences from NGS data and perform high throughput sequence analysis. Availability of mitochondrial genomes is important as genomic variation in mitochondria has been implicated in a variety of neuro-muscular and metabolic disorders, along with roles in aging and cancer.

However here we ponder upon the feasibility of how effective it is to extract mitochondria from different capture kits used for WES. Picardi et al used the MitoSeek tool to successfully assemble 100%, 95% and 72% of the mtDNA genome from the TruSeq (Illumina), SureSelect (Agilent) and SeqCap EZ-Exome (NimbleGen) platforms, respectively. We set out to assess the mitochondrial genome data extraction using a different approach and tool-set. Using the same sample’s dataset from three different capture kits, and Whole Genome Sequenced (WGS) data as the gold standard we evaluated alignment and variant-calling results.

Clark et al sequenced and analyzed a human blood sample (healthy, anonymous volunteer) at the Stanford University using three commonly used WES kits:

  1. Agilent SureSelect Human All Exon kit
  2. Nimblegen SeqCap EZ Exome Library v2.0
  3. Illumina TruSeq Exome Enrichment

Illumina HiSeq instrument was used for WGS and all three WES capture kits. Clark et al highlight comparisons between the three capture kits, from library preparation to sequencing time. The paper discusses effectiveness of using each of these kits based on metrics such as baits, capture of UTR regions, etc. They compare variant calls across all three WES kits and WGS and discuss the ability of WES to detect additional small variants that were missed by WGS. Although this paper doesn’t provide an in-depth instrument comparison, the readers here assume that Illumina is the leader in sequencing technology (at least until tonight!)

We use this data set to compare and contrast the availability and quality of mitochondrial sequencing in off-target data from WES. A standard WGS experiment at 35× mean genomic coverage was compared to exome sequencing experiments yielding average exome target coverage of 30× for Illumina, 60× for Agilent and 68× for Nimblegen

We also utilized a single custom capture sequenced sample from Teer et al to study the feasibility of gleaning mitochondria from a custom capture experiment.

  1. Clark et al have made this data set downloadable from NCBI in the SRA file format
  2. Using the SRA toolkit we converted SRA to FASTQ. As these are paired end reads we used fastq-dump with the –split-3 option. This generated 2 fastq files for R1 and R2
  3. Using BWA-MEM algorithm we aligned reads in these fastq files to allchr.fa. Additionally for the Truseq data we also used BWA-SAMPE algorithm to compare BWA alignment algorithm
  4. The BWA alignment provided SAM files for each of three WES (Agilent, Nimblegen, Illumina) and WGS. Using Samtools we converted SAM files to BAM for easy storage and interpretability
  5. We filtered for reads that mapped to chromosome M and those that had PHRED-scale mapping quality >= 20 (more than 99% probability of being accurate)
  6. For calling variants we employed a custom perl script on the the generated pileup to determine variant calling at different thresholds of >=1% >=5% and >=10% variant supporting reads

Read Metrics:

All metrics for 10x/5x are using reads mapped with PHRED-scale mapping quality >= 20. The length of mitochondrial genome covered at more than 5x (5-fold) coverage and 10x is summarized for the sequencing data from different capture kits (Table 1).

All results are for BWA-MEM except for the Illumina TruSeq capture data that was also aligned using BWA-SAMPE. Our comparisons show that BWA-MEM aligned more reads and had generally better performance.

A custom capture sample was evaluated simply to see the potential of extracting mitochondrial genome from that data-type as well. It performed really well, generating more than 900 RPM for mitochondrial genome, implying much greater off-target throughput

Capture/WGS All reads (millions) Mapped reads (millions) % mapped reads chrM reads Q20 chrM Q20 chrM RPM* > 10x chrM > 5x chrM
SRR309291 (Agilent) 124.193 123.949 99.80 2836 2647 21.36 12615 15691
SRR309292 (Nimblegen) 185.088 184.588 99.73 3770 3466 18.78 5563 11271
SRR309293 (Illumina) 113.369 113.070 99.74 27326 24645 217.96 16569 16569 (Illumina SAMPE) 112.886 105.777 93.70 25149 22894 216.44 16569 16569
SRR341919 (WGS) 1,312.649 1,253.840 95.52 436042 417365 332.87 16569 16569
(Custom Capture)
5.313 5.086 95.73 5346 4897 962.75 9997 14318

*: Q20 mapped chrM reads per Million Mapped reads for that sample
Table 1: Sequencing throughput and mitochondrial genome coverage from NGS data on whole-genome, exome and custom-captured samples


Coverage of Mitochondrial Genome

Figure 1: Contrasting coverage of mitochondrial genome from WGS and WES sequencing data (truseq-pe data was aligned using BWA-sampe tool while all others were aligned using BWA-mem)
  • WGS data generated really good coverage of the mitochondrial genome, almost always > 700-fold
  • Coverage from Illumina Truseq data was consistent between results from using BWA-mem or BWA-sampe aligner, though the latter gave slightly lesser coverage due to fewer mapped reads
  • Agilent off-target data generated sufficient mitochondria mapped reads considering ~95% of mitochondrial genome covered at 5x. Higher overall throughput for the sequenced sample could have provided greater off-target sequence reads yielding higher mitochondrial genome coverage.
  • Nimblegen off-target data was the least abundant, and the coverage profile across mitochondrial genome was also different from other datasets. This may also be due to the high-density overlapping bait design of Nimblegen, giving focused on-target coverage, leaving fewer off-target reads.

Variant Calling on the Mitochondrial Genome

33 variants shared by all 4 (WGS, Illumina/Nimblegen/Agilent capture)
Venn Diagram (generated using Venny) to compare the mitochondiral variants identified in the same sample from WGS and off-target data from different capture kits (10% or more alternate-supporting reads implied a variant call)

The sequencing data depicted high variability when using 1% alternate-supporting reads to annotate a mitochondrial genomic position as variant. So we used a threshold of at-least 10% reads at any given nucleotide position to be supporting the alternate allele to define a variant. The above venn-diagram highlights that the vast majority (33/41) of called variants on mitochondrial genome from WGS and WES data overlap. Another 6 variants identified in WGS were also observed in Agilent and Illumina WES data, but missed by Nimblegen WES due to low coverage. We do not provide a comprehensive iteration of the exclusive variants, but most of them suffer from low read-depth, low quality, and strand bias.


With the decreasing cost and increasing availability of exome sequencing data, there is a vast resource of mitochondrial genomes that can be mined for mitochondria-focused research. Data from large consortia like 1000 genomes and NHLBI exome datasets can be utilized for a comparative mitochondrial variation evaluation. As reported by Picardi et al, Illumina Truseq and Agilent exome kits generate better mitochondrial genome coverage compared to Nimblegen. Interestingly, even the custom-capture kit we evaluated generated a decent amount of mitochondrial genome coverage. This opens up a plethora of small NGS panel and custom-capture datasets for mitochondrial genome evaluation.

Venn Diagrams Simplified

Not a week goes by without me having to compare multiple lists to evaluate overlap. Re-analyzing any data and comparing that with previously computed data, comparing results from multiple tools or multiple samples. Sometimes the unix commands diff, paste and join come handy, but most times am looking for some additional information.

Almost always, am looking to find out the shared elements of the list, which ones are exclusive to a particular list, and so forth. Many a times, these evaluations end up in simple VENN diagrams to diagrammatically depict the comparison.

The simplest way to do so, with small enough lists, is to use this wonderful tool VENNY. It has a very simple to use interface, allows for pasting upto 4 lists, and creates a very nice visual that one can download as an image. VENNY also allows one to click any region of the figure, to list out the elements in that sub-group.

However, as is the case with Bioinformatics, the lists are most times much larger, and if one has enough grasp of command-line unix and basic perl, a simple script can help compare any such lists.

Here is my simple code that has been used 100s of times over the last few years. Feel free to try, leave a comment, or suggest any improvements or faster/better ways to do this sort of evaluation!

Input: single column lists that need to be compared (if your data is multiple columns, use something like `cut -f1 fileName > tmp1` in unix to cut out the first (or whichever using -f) column from the file)

Output: 2-column file, the first column providing a key, the 2nd column listing the actual element from the lists. (Using a pipe ‘|’ with cut -f1 | sort | uniq -c gives the count of each group. In the script below, using 2 files

key=1 => element is unique to the first list / file
key=2 => element is unique to the second list / file
key=3 => element is shared by both the lists / files

## Script to take 2 files, and report venn diagram numbers for an image
## Usage: $ perl w1 w2 | cut -f1 | sort | uniq -c
use strict;
use warnings;
my %hash;
open(DAT, $ARGV[0]); ##1st file
while (my $l = <DAT>) {
chomp $l;
$hash{$l} = 1; ## in set1
close DAT;
open(DAT, $ARGV[1]); ##2nd file
while (my $l = <DAT>) {
chomp $l;
if (defined $hash{$l} && $hash{$l} == 1) {
$hash{$l} = 3; ##Common in sets 1 & 2
} else {$hash{$l} = 2;} ## Set 2, not set 1
close DAT;
foreach my $k (keys %hash) {
print "$hash{$k}\t$k\n";

Similarly, a small tweak allows this script to be used on 3 lists. The scripts run pretty fast for everyday comparisons. One caveat is to use clean data, make sure there are no extra spaces or quotes on some lists, which can really mess up the numbers.

## Script to take 3 files, and report venn diagram numbers for an image
## Usage: $ perl w1 w2 w3 | cut -f1 | sort | uniq -c

## Key: 1 => 1-only, 2 => 2-only, 3 => 1 & 2, 4 => 3-only, 5 => 1 & 3
## 6 => 1 & 2 & 3, 7 => 2 & 3
use strict;
use warnings;
my %hash;
open(DAT, $ARGV[0]); ##1st file
while (my $l = <DAT>) {
chomp $l;
$hash{$l} = 1; ## in set1
close DAT;
open(DAT, $ARGV[1]); ##2nd file
while (my $l = <DAT>) {
chomp $l;
if (defined $hash{$l} && $hash{$l} == 1) {
$hash{$l} = 3; ##Common in sets 1 & 2
} else {$hash{$l} = 2;} ## Set 2, not set 1
close DAT;
if (defined $ARGV[2]) {
open(DAT, $ARGV[2]); ##3rd file
while (my $l = <DAT>) {
chomp $l;
if (!defined $hash{$l}) {$hash{$l} = 4;} ## Set 3 only
elsif ($hash{$l} == 1) {
$hash{$l} = 5; ## sets 1 & 3, not 2
} elsif ($hash{$l} == 2) {
$hash{$l} = 7; ## sets 2 & 3, not 1
} elsif ($hash{$l} == 3) {
$hash{$l} = 6; ## sets 1 & 2 & 3
close DAT;
if (defined $ARGV[3]) {
open(DAT, $ARGV[3]); ##3rd file
while (my $l = ) {
chomp $l;
if ($hash{$l} == 1) {
$hash{$l} = 8; ## sets 1 & 4 only
} elsif ($hash{$l} == 2) {
$hash{$l} = 9; ## sets 2 & 4 only
} elsif ($hash{$l} == 3) {
$hash{$l} = 10; ## sets 1 & 2 & 4
} elsif ($hash{$l} == 4) {
$hash{$l} = 11; ## Set 3 & 4 only
} elsif ($hash{$l} == 5) {
$hash{$l} = 12; ## Set 1 & 3 & 4 only
} elsif ($hash{$l} == 6) {
$hash{$l} = 13; ## Set 1 & 2 & 3 & 4
} elsif ($hash{$l} == 7) {
$hash{$l} = 14; ## Set 1 & 2 & 4 only
} else {$hash{$l} = 15;} ## Set 4 only
close DAT;
foreach my $k (keys %hash) {
print "$hash{$k}\t$k\n";

UPDATE: There is this useful R package VennDIagram pointed by @genetics_blog who BTW has an amazing bioinformatics blog: Getting Genetics Done

R Line Plots – the easiest fastest plot ever

What you need:

  • Data
  • R
  • Eager Bioinformatician

For data I used a file with 7 columns where the first column was a counter and the other columns (2-7) had different values that I would like to compare across using a line plot to visualize the variation in my data points

Serves: As many data points you would like to extend it to

Time: Once you have a parsed dataset, this is fast: you wouldn’t want to blink

Plotting lines with a legend 🙂

> x <- read.table(“filename”);

> png (“filename.png”)

> plot (x[,2], type=”l”, col = “steelblue”, ylab=”Heading”)

> lines (x[,3], col = “pink”)

> lines (x[,4], col = “cyan”)

> lines (x[,5], col = “magenta”)

> lines (x[,6], col = “green3”)

> lines (x[,7], col = “blue”)

> legend(11000, 2, c(“124-100”, “124-50”, “124-20”, “124-10”, “124-5”, “124-2”), lty=c(1,1), lwd=c(2.5,2.5), col=c(“steelblue”,”pink”,”cyan”,”magenta”,”green3″,”blue”))


Feel free to change/add colors* and serve embedded within your document/ppt!

* You can find a detailed list of color names in R here:

Clinical Findings from Sequencing keep flowing…

Just a glance at GenomeWeb’s “In Print: Clinical Sequencing Papers of Note of the Last Two Weeks” is enough to show how hard it is to keep up with the sequencing revolution. Of the 37 publications mentioned here, 15 talk about whole-exome sequencing (WES) (and two amplicon sequencing and three whole-genome seq) identifying the causal variant(s).

Most of them use a slightly different way to align the reads, call variants and interpret the results. This reflects on the recent review paper in Bioinformatics journal (Tools for mapping high-throughput sequencing data) and the relevant blog post talking about the challenges of such a diverse tool-set (An embargo on short read alignment tools).

Amidst all this, the NGS Catalog is a wonderful tool to correlate technology with gene or disease or even specific mutations, along with author names and publications. So much data that can be mined for very useful snippets of information.

Thanks to the express permission of GenomeWeb editors, I am able to post below the list of papers highlighted in their section (which is behind a login access)

In Print: Clinical Sequencing Papers of Note of the Last Two Weeks

 ADAR1 promotes malignant progenitor reprogramming in chronic myeloid leukemia.
Jiang Q, Crews LA, Barrett CL, Chun HJ, et al.
Proc Natl Acad Sci U S A. 2012 Dec 28. [Epub ahead of print]

DGKE variants cause a glomerular microangiopathy that mimics membranoproliferative GN.
Ozaltin F, Li B, Rauhauser A, An SW, et al.
J Am Soc Nephrol. 2012 Dec 28. [Epub ahead of print]

Combining highly multiplexed PCR with semiconductor-based sequencing for rapid cancer genotyping.
Beadling C, Neff TL, Heinrich MC, Rhodes K, et al.
J Mol Diagn. 2012 Dec 27. [Epub ahead of print]

Next-generation sequencing reveals deep intronic cryptic ABCC8 and HADH splicing founder mutations causing hyperinsulinism by pseudoexon activation.
Flanagan SE, Xie W, Caswell R, Damhuis A, et al.
Am J Hum Genet. 2012 Dec 22. [Epub ahead of print]

Exome sequencing identifies INPPL1 mutations as a cause of opsismodysplasia.
Huber C, Faqeih EA, Bartholdi D, Bole-Feysot C, et al.
Am J Hum Genet. 2012 Dec 22. [Epub ahead of print]

Whole-genome analysis reveals that mutations in inositol polyphosphate phosphatase-like 1 cause opsismodysplasia.
Below JE, Earl DL, Shively KM, McMillin MJ, et al.
Am J Hum Genet. 2012 Dec 22. [Epub ahead of print]

Mitochondrial disease genetic diagnostics: optimized whole-exome analysis for all MitoCarta nuclear genes and the mitochondrial genome.
Falk MJ, Pierce EA, Consugar M, Xie MH, et al.
Discov Med. 2012 Dec;14(79):389-99.

Genetic diagnosis of autosomal dominant polycystic kidney disease by targeted capture and next-generation sequencing: Utility and limitations.
Qi XP, Du ZF, Ma JM, Chen XL, et al.
Gene. 2012 Dec 20. [Epub ahead of print]

RNA-seq reveals differentially expressed isoforms and novel splice variants in buccal mucosal cancer.
Jakhesara SJ, Koringa PG, Bhatt VD, Shah TM, et al.
Gene. 2012 Dec 20. [Epub ahead of print]

Next-generation sequencing identified new oncogenes and tumor suppressor genes in human hepatic tumors.
Amaddeo G, Guichard C, Imbeaud S, Zucman-Rossi J.
Oncoimmunology. 2012 Dec 1;1(9):1612-1613.

Exomic sequencing of medullary thyroid cancer reveals dominant and mutually exclusive oncogenic mutations in RET and RAS.
Agrawal N, Jiao Y, Sausen M, Leary R, et al.
J Clin Endocrinol Metab. 2012 Dec 21. [Epub ahead of print]

Newborn screening for SCID identifies patients with Ataxia telangiectasia.
Mallott J, Kwan A, Church J, Gonzalez-Espinosa D, et al.
J Clin Immunol. 2012 Dec 20. [Epub ahead of print]

Exome sequencing identifies mutation in CNOT3 and ribosomal genes RPL5 and RPL10 in T-cell acute lymphoblastic leukemia.
De Keersmaecker K, Atak ZK, Li N, Vicente C, et al.
Nat Genet. 2012 Dec 23. [Epub ahead of print]

Germline mutations affecting the proofreading domains of POLE and POLD1 predispose to colorectal adenomas and carcinomas.
Palles C, Cazier JB, Howarth KM, Domingo E, et al.
Nat Genet. 2012 Dec 23. [Epub ahead of print]

Molecular diagnostics of a single multifocal non-small cell lung cancer case using targeted next generation sequencing.
Geurts-Giele WR, Dirkx-van der Velden AW, Bartalits NM, Verhoog LC, et al.
Virchows Arch. 2012 Dec 22. [Epub ahead of print]

Exome sequencing identifies mutations in CCDC114 as a cause of primary ciliary dyskinesia.
Knowles MR, Leigh MW, Ostrowski LE, Huang L, et al.
Am J Hum Genet. 2012 Dec 19. [Epub ahead of print]

Mutations in ECEL1 cause distal arthrogryposis type 5D.
McMillin MJ, Below JE, Shively KM, Beck AE, et al.
Am J Hum Genet. 2012 Dec 19. [Epub ahead of print]

Rare, low-frequency, and common variants in the protein-coding sequence of biological candidate genes from GWASs contribute to risk of rheumatoid arthritis.
Diogo D, Kurreeman F, Stahl EA, Liao KP, et al.
Am J Hum Genet. 2012 Dec 19. [Epub ahead of print]

Whole-genome sequencing in autism identifies hot spots for de novo germline mutation.
Michaelson JJ, Shi Y, Gujral M, Zheng H, et al.
Cell. 2012 Dec 21;151(7):1431-42.

RNA-seq analysis of synovial fibroblasts brings new insights into rheumatoid arthritis.
Heruth DP, Gibson M, Grigoryev DN, Zhang LQ, Ye SQ.
Cell Biosci. 2012 Dec 21;2(1):43.

Use of next generation sequencing technologies in research and beyond: are participants with mental health disorders fully protected?
Jaitovich Groisman I, Mathieu G, Godard B.
BMC Med Ethics. 2012 Dec 20;13(1):36.

Mutation of the PDGFRB gene as a cause of idiopathic basal ganglia calcification.
Nicolas G, Pottier C, Maltête D, Coutant S, et al.
Neurology. 2012 Dec 19. [Epub ahead of print]

Identification of TP53 as an acute lymphocytic leukemia susceptibility gene through exome sequencing.
Powell BC, Jiang L, Muzny DM, Treviño LR, et al.
Pediatr Blood Cancer. 2012 Dec 19. [Epub ahead of print]

Histamine-2 receptor blockers alter the fecal microbiota in premature infants.
Gupta RW, Tran L, Norori J, Ferris MJ, et al.
J Pediatr Gastroenterol Nutr. 2012 Dec 18. [Epub ahead of print]

Genome-wide sequencing for the identification of rearrangements associated with Tourette syndrome and obsessive-compulsive disorder.
Hooper SD, Johansson AC, Tellgren-Roth C, Stattin EL, et al.
BMC Med Genet. 2012 Dec 19;13(1):123.

Detection of BCR-ABL1 mutations in chronic myeloid leukaemia by massive parallel sequencing.
Eyal E, Amir A, Cesarkas K, Jacob-Hirsch J, et al.
Br J Haematol. 2012 Dec 17. [Epub ahead of print]

West syndrome caused by ST3Gal-III deficiency.
Edvardson S, Baumann AM, Mühlenhoff M, Stephan O, et al.
Epilepsia. 2012 Dec 17. [Epub ahead of print]

Exome sequencing identifies a founder frameshift mutation in an alternative exon of USH1C as the cause of autosomal recessive retinitis pigmentosa with late-onset hearing loss.
Khateb S, Zelinger L, Ben-Yosef T, Merin S, et al.
PLoS One. 2012;7(12):e51566.

The ChIP-seq-defined networks of Bcl-3 gene binding support its required role in skeletal muscle atrophy.
Jackman RW, Wu CL, Kandarian SC.
PLoS One. 2012;7(12):e51478.

Characterization and validation of insertions and deletions in 173 patient exomes.
Lescai F, Bonfiglio S, Bacchelli C, Chanudet E, et al.
PLoS One. 2012;7(12):e51292.

Next-generation sequencing meets genetic diagnostics: development of a comprehensive workflow for the analysis of BRCA1 and BRCA2 genes.
Feliubadaló L, Lopez-Doriga A, Castellsagué E, Del Valle J, et al.
Eur J Hum Genet. 2012 Dec 19. [Epub ahead of print]

A compound heterozygous mutation in DPAGT1 results in a congenital disorder of glycosylation with a relatively mild phenotype.
Iqbal Z, Shahzad M, Vissers LE, van Scherpenzeel M, et al.
Eur J Hum Genet. 2012 Dec 19. [Epub ahead of print]

Analysis of a Chinese pedigree with Zellweger syndrome reveals a novel PEX1 mutation by next-generation sequencing.
Sun Y, Wang L, Wei X, Zhu Q, et al.
Clin Chim Acta. 2012 Dec 13. [Epub ahead of print]

Filling the gaps — the generation of full genomic sequences for 15 common and well-documented HLA class I alleles using next-generation sequencing technology.
Lind C, Ferriola D, Mackiewicz K, Sasson A, Monos D.
Hum Immunol. 2012 Dec 12. [Epub ahead of print]

Integrating GWAS and expression data for functional characterization of disease-associated SNPs: an application to follicular lymphoma.
Conde L, Bracci PM, Richardson R, Montgomery SB, Skibola CF.
Am J Hum Genet. 2012 Dec 11. [Epub ahead of print]

Whole-exome sequencing identifies LRIT3 mutations as a cause of autosomal-recessive complete congenital stationary night blindness.
Zeitz C, Jacobson SG, Hamel CP, Bujakowska K, et al.
Am J Hum Genet. 2012 Dec 11. [Epub ahead of print]

Exome sequencing reveals a signal transducer and activator of transcription 1 (STAT1) mutation in a child with recalcitrant cutaneous fusariosis.
Wang X, Lin Z, Gao L, Wang A, et al.
J Allergy Clin Immunol. 2012 Dec 13. [Epub ahead of print]

HuGGV Poster – Analyzing Variants in 1000 Human Genome Data

Variation accross chromosomes - variant data 1000 Human Genome Project

Been sometime since I worked on this poster (almost a semester) but its in here before I get myself to work on these variation results and find some answers. Though its interesting to see the trends of different variation across different chromosomes. I found chromosome 19 to stand out in most comparisons. Need to dig more to find some plausible explanations for that.. and more.