Bioinformatics tools for VCF files

With the ever-growing abundance of Next Generation Sequencing (NGS) data, the research community faces a continuing challenge: not only to standardize best practices and analysis methods, but also to embrace the array of open-source tools and file formats. Having been engaged almost from the beginning of this NGS data outburst, as Illumina's read lengths extended from 30bp to beyond 250bp, I have watched SAM, BAM and VCF become well-accepted formats, to some extent. I say "to some extent" because a few versions ago CLCbio still reported variants in a CSV file format, but I shouldn't digress.

So, thanks in part to the 1000 Genomes Project and the HapMap consortium, VCF is now largely accepted as the standard output format, and most open-source tools report variants as VCF. This has allowed the development of tools and parsers that treat the file format as a standard and provide efficient functionality and data portability.

A recent discussion on Twitter about the efficiency and functionality of tools prompted me to compile a list of these VCF parsers for future reference. You will be amazed at the variety of tools available for parsing the eventual VCF reports from NGS data – free of cost!

Feel free to point out pros/cons of these tools and I will continue to update this into a comprehensive post. It would also be most helpful if folks could share their experiences with different tools and example use-cases of the functionality!



VAWK awk-like arithmetic on a VCF file
Bioawk support of several common biological data formats, including optionally gzip’ed BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats
VCFFilterJS Filters a VCF with javascript
bio-vcf new generation VCF parser
bcftools contains all the vcf* commands
VCFtools provide easily accessible methods for working with complex genetic variation data in the form of VCF files
VariantFiltration Filters variant calls using a number of user-selectable, parameterizable criteria
PyVCF Variant Call Format Parser for Python
vcflib simple C++ library for parsing and manipulating VCF files
wormtable write-once read-many table for large scale datasets (vcf2wt)
SnpSift toolbox to filter and manipulate annotated files
gvcftools Utilities to create and analyze gVCF files
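As a quick illustration of what these parsers deal with, here is a minimal, hand-rolled sketch of reading one VCF data line in Python. For real work use one of the tools above; this only covers the basics of the VCF 4.x fixed columns, and the field names are from the spec, not from any particular parser:

```python
# Minimal sketch of parsing a VCF data line by hand (illustration only --
# in practice use one of the parsers listed above, e.g. PyVCF or bcftools).

def parse_vcf_line(line):
    """Parse one non-header VCF data line into a dict."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, vfilter, info = fields[:8]
    info_dict = {}
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        info_dict[key] = value if value else True  # flag fields have no '='
    return {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": vid,
        "REF": ref,
        "ALT": alt.split(","),  # ALT may list multiple alleles
        "QUAL": None if qual == "." else float(qual),
        "FILTER": vfilter,
        "INFO": info_dict,
    }

record = parse_vcf_line("chr1\t12345\t.\tA\tG,T\t50\tPASS\tDP=100;DB")
print(record["POS"], record["ALT"], record["INFO"]["DP"])
```

Even this toy version shows why dedicated parsers exist: multi-allelic sites, flag vs key=value INFO fields, and missing values all need special handling.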


Exomes vs Genomes (re-visited)

The paper by Lupski et al in Genome Medicine adds fuel to the perpetual debate of Whole Exome Sequencing (WES) vs Whole Genome Sequencing (WGS). It takes me down memory lane to my own presentation, "Genomes or Exomes: evaluation of cost, time and coverage", at the Beyond the Genome 2011 conference. (If you would like to check it out, my poster is available at the Faculty of 1000 resource, along with many others from the conference.) My work summarized the WES vs WGS results on a single blood sample from an individual with cardiomyopathy. Although WGS gave better coverage of the UCSC exons evaluated, WES identified exclusive variants missed by WGS.

Sequencing coverage has always been the key to elucidating variants from NGS data. Lupski et al worked on a CMT (also known as HMSN) case; in my own generic evaluation of WES read-depth coverage of CMT-related genes, 93% of CCDS exons had good coverage (JNNP paper). I found about 89% of the known mutations in the 33 CMT genes, including SH3TC2, to be covered at 10x (10-fold) sequencing coverage. As those results suggest (JNNP paper), WES misses a lot of coding regions, including important known mutations, which one needs to be careful of, especially when it is used in clinical medicine.

Back to WGS vs WES; let's start with the key points to consider for the comparison:

Key Point – WES or WGS? – Notes

Cost – WES – A typical WES run requires 60-100 million 100bp reads for decent sequencing coverage, whereas WGS requires almost a billion 100bp reads for an average 30x coverage
Time – WES – For the same reason as above, WES can be generated and analyzed with a much faster turn-around time. For clinically focused WGS analysis, I developed a novel iterative method (PLoS One) that delivers variant results in 5 hours!
Average Coverage (Depth) – WES – WES, being targeted, provides much deeper coverage of the captured coding regions
Average Coverage (Breadth) – WGS – Coverage from WGS is much more uniform, covering more of the annotated exons independent of annotation source. WGS also has the advantage in regions where capture probes are difficult to design, providing sequencing coverage and thus the potential for variant calling
Structural Variants – WGS – Broad, uniform coverage from WGS, coupled with mature algorithms and tools, allows better detection of structural variants, CNVs and large INDELs
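The read counts quoted in the Cost row follow from simple arithmetic (coverage = reads × read length / target size). A quick sketch, using round-number assumptions for the genome and exome sizes and ignoring duplicates and off-target loss:

```python
# Back-of-the-envelope check of the read counts quoted above.
# Genome/target sizes are round-number assumptions, not from the paper.

def reads_needed(mean_coverage, target_size_bp, read_length_bp=100):
    """Reads required for a given mean depth over a target."""
    return mean_coverage * target_size_bp / read_length_bp

GENOME = 3.2e9  # approximate human genome size
EXOME = 64e6    # approximate exome capture footprint (kit dependent)

wgs_reads = reads_needed(30, GENOME)   # ~30x whole genome
wes_reads = reads_needed(100, EXOME)   # ~100x on-target exome

print(f"WGS 30x: {wgs_reads / 1e6:.0f} million reads")      # ~960 million
print(f"WES 100x on-target: {wes_reads / 1e6:.0f} million") # ~64 million
```

The WGS figure lands right at "almost a billion", and the WES figure sits inside the 60-100 million range once you remember that real experiments need extra reads to offset duplicates and off-target capture.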

Lupski et al performed a variety of sequencing experiments on different NGS instruments, including Illumina, ABI SOLiD and Ion Torrent. The best part is that all this data is publicly available on NCBI SRA; the scientific community can make much bigger strides through open data sharing. Such a deep dataset from multiple platforms and applications is extremely beneficial, providing a distinct advantage over simulated datasets for algorithm development, software evaluation and benchmarking.

  • SOLiD sequencer: 1 WES + 1 WGS
  • Illumina GAII: 2 WES
  • Illumina HiSeq: 2 WES + 1 WGS
  • Ion Torrent: 2 WES (PGM and Proton)

Summarizing the paper: all the WES libraries were captured using the NimbleGen VCRome 2.1 capture kit. Its 42Mb capture region includes Vega, CCDS and RefSeq gene models, along with miRNA and regulatory regions. Interestingly, the Clark et al (Nature Biotechnology) review of different WES capture technologies concluded that the densely packed, overlapping baits of NimbleGen SeqCap EZ generated the highest-efficiency target enrichment. On the other hand, the recent review of WES capture by Chilamakuri et al in BMC Genomics found that Illumina capture data showed higher coverage of annotated exons.

Lupski et al analyzed Illumina data using BWA (align) -> GATK (recalibrate) -> Atlas2 (SNV/INDEL) -> Cassandra (annotate). Ion Torrent data was analyzed using TMAP (align) -> Picard/Torrent-Suite (duplicates) -> VarIONt (SNV) -> Cassandra (annotate). The rationale for the tools used, and for tools like GATK's VQSR that were not used, is not detailed in the paper. One metric readers would have liked to see for the WGS datasets is 'Targets hit' and 'Targeted bases with 10+ coverage' in Table 1. It should be relatively straightforward to calculate and would give a good perspective on how the WGS metrics compare with those from WES.

The most striking observation concerned SNVs called in all WES datasets yet absent from WGS! Here are some of the summary points:

  • 3709 coding SNVs were concordantly called in all WES datasets but missed by the original SOLiD (~30x coverage) WGS. This is huge, as those 3709 SNVs were identified in all six WES results and thus should be of good quality.
  • Variant concordance of the same sample using Illumina HiSeq & GAII – Figure 3
      • more than 96% and 98% of SNVs are concordant between the HiSeq-HiSeq and GAII-GAII replicates, respectively.
      • only 83% and 82% of INDELs are concordant between the HiSeq-HiSeq and GAII-GAII replicates, respectively. Once again, INDEL calling is noisier, though it was not clear whether the authors left-aligned the INDELs to remove false discordance caused by the start and stop coordinates of the same INDEL not aligning perfectly. I wonder how the recent Scalpel tool, which promises higher indel-calling sensitivity, would perform on these datasets.
      • even higher discordance when comparing HiSeq to GAII data (for the same sample and exome capture!!)
  • Properties of 'private' or exclusive SNVs from the WES results – Figure 4, Figure 5. As expected, a large majority of the exclusive SNVs are questionable on basic quality metrics:
      • low variant fraction (% reads supporting alternate or non-reference allele)
      • low coverage depth
      • strand bias or multiply-mapped reads (leading to low variant quality)
  • Both WES and WGS found the 12 pharmacologically relevant variants
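The left-alignment issue mentioned in the INDEL comparison is easy to sketch: the same deletion inside a repeat can be reported at several different coordinates by different callers, so two identical variants look discordant. A minimal Python illustration of the standard normalization idea (shift the variant as far left as the reference allows; this follows the approach popularized by tools like vt and bcftools norm, not any code from the paper):

```python
# Sketch of INDEL left-alignment (the normalization step discussed above).

def left_align(seq, pos, ref, alt):
    """Left-align a VCF-style variant. seq is the reference sequence,
    pos is 1-based, ref/alt are the allele strings."""
    ref, alt = list(ref), list(alt)
    while True:
        if ref and alt and ref[-1] == alt[-1]:
            ref.pop()          # trim an identical trailing base
            alt.pop()
        elif not ref or not alt:
            if pos == 1:
                break
            pos -= 1           # extend both alleles left with the reference base
            prev = seq[pos - 1]
            ref.insert(0, prev)
            alt.insert(0, prev)
        else:
            break
    # trim identical leading bases while keeping one anchor base
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref.pop(0)
        alt.pop(0)
        pos += 1
    return pos, "".join(ref), "".join(alt)

# a CA deletion inside a CACACA repeat, reported right-most:
print(left_align("GGGCACACACAGGG", 9, "ACA", "A"))  # -> (3, 'GCA', 'G')
```

Without this step, the right-most and left-most representations of the same deletion would be counted as a discordant pair.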

In all, this round goes to WES, mostly due to the higher coverage achieved compared to WGS. The higher coverage allowed elucidation of strand bias and of the appropriate proportion of alternate-supporting (variant-calling) reads, reducing the particular FP and FN variants discussed in the paper. It would be interesting to generate a much higher average-coverage WGS dataset and assess whether some regions or genes are better suited to evaluation by WES. To conclude, I quote from the paper: "the high yet skewed depth of coverage in targeted regions afforded by the (W)ES methods may offer higher likelihood of recovery of significant variants and resolution of their true genotypes when compared to the lower, but more uniform WGS coverage".

Mitochondrial Gold Rush

Mitochondrial genomes can be extracted from Whole Exome Sequencing (WES) data, as outlined by this paper in Nature Methods by Ernesto Picardi and Graziano Pesole. Tools like MitoSeek are now available that gather mitochondrial reads from NGS data and perform high-throughput sequence analysis. Availability of mitochondrial genomes is important, as genomic variation in mitochondria has been implicated in a variety of neuromuscular and metabolic disorders, along with roles in aging and cancer.

Here, however, we ponder the feasibility of extracting the mitochondrial genome from the different capture kits used for WES. Picardi et al used the MitoSeek tool to successfully assemble 100%, 95% and 72% of the mtDNA genome from the TruSeq (Illumina), SureSelect (Agilent) and SeqCap EZ-Exome (NimbleGen) platforms, respectively. We set out to assess mitochondrial genome extraction using a different approach and tool-set. Using the same sample's dataset from three different capture kits, with Whole Genome Sequencing (WGS) data as the gold standard, we evaluated alignment and variant-calling results.

Clark et al sequenced and analyzed a human blood sample (healthy, anonymous volunteer) at Stanford University using three commonly used WES kits:

  1. Agilent SureSelect Human All Exon kit
  2. Nimblegen SeqCap EZ Exome Library v2.0
  3. Illumina TruSeq Exome Enrichment

An Illumina HiSeq instrument was used for WGS and all three WES capture kits. Clark et al highlight comparisons between the three capture kits, from library preparation to sequencing time. The paper discusses the effectiveness of each kit based on metrics such as bait design, capture of UTR regions, etc. They compare variant calls across the three WES kits and WGS, and discuss the ability of WES to detect additional small variants that were missed by WGS. Although the paper doesn't provide an in-depth instrument comparison, the readers here assume that Illumina is the leader in sequencing technology (at least until tonight!)

We use this data set to compare and contrast the availability and quality of mitochondrial sequence in off-target data from WES. A standard WGS experiment at 35× mean genomic coverage was compared to exome sequencing experiments yielding average exome target coverage of 30× for Illumina, 60× for Agilent and 68× for Nimblegen.

We also utilized a single custom capture sequenced sample from Teer et al to study the feasibility of gleaning mitochondria from a custom capture experiment.

  1. Clark et al have made this data set downloadable from NCBI in the SRA file format
  2. Using the SRA toolkit we converted SRA to FASTQ. As these are paired-end reads, we used fastq-dump with the --split-3 option, which generated two FASTQ files for R1 and R2
  3. Using the BWA-MEM algorithm we aligned reads in these FASTQ files to allchr.fa. Additionally, for the TruSeq data we also used the BWA-SAMPE algorithm to compare BWA alignment algorithms
  4. The BWA alignment provided SAM files for each of the three WES kits (Agilent, Nimblegen, Illumina) and WGS. Using Samtools we converted the SAM files to BAM for easy storage and interpretability
  5. We filtered for reads that mapped to chromosome M with PHRED-scale mapping quality >= 20 (more than 99% probability of being correctly mapped)
  6. For calling variants we employed a custom perl script on the generated pileup, calling variants at thresholds of >=1%, >=5% and >=10% variant-supporting reads
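The thresholding in step 6 is simple to sketch. Our actual script was in Perl; this hypothetical Python version illustrates the idea on per-position base calls taken from a pileup:

```python
# Minimal sketch of the variant-fraction thresholding in step 6 (our real
# script was Perl; this is an illustrative re-implementation, not that code).

from collections import Counter

def call_variants(ref_base, bases, threshold=0.10):
    """Return alternate alleles supported by >= threshold of the reads
    at one position. bases is the string of base calls from a pileup."""
    counts = Counter(b.upper() for b in bases)
    depth = sum(counts.values())
    calls = []
    for allele, n in counts.items():
        if allele != ref_base and depth and n / depth >= threshold:
            calls.append((allele, n / depth))
    return calls

# 20 reads at one position: 17 match the reference A, 3 support G (15%)
bases = "A" * 17 + "G" * 3
print(call_variants("A", bases, threshold=0.10))  # G passes at the 10% cutoff
print(call_variants("A", bases, threshold=0.25))  # nothing passes at 25%
```

Running the same counts through the 1%, 5% and 10% cutoffs is what produces the three variant sets compared below.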

Read Metrics:

All metrics for 10x/5x use reads mapped with PHRED-scale mapping quality >= 20. The length of the mitochondrial genome covered at more than 5x (5-fold) and 10x coverage is summarized for the sequencing data from the different capture kits (Table 1).

All results are for BWA-MEM, except for the Illumina TruSeq capture data, which was also aligned using BWA-SAMPE. Our comparisons show that BWA-MEM aligned more reads and generally performed better.

A custom capture sample was evaluated simply to see the potential of extracting the mitochondrial genome from that data-type as well. It performed really well, generating more than 900 RPM for the mitochondrial genome, implying much greater off-target throughput.

Capture/WGS | All reads (millions) | Mapped reads (millions) | % mapped reads | chrM reads | Q20 chrM | Q20 chrM RPM* | >10x chrM (bp) | >5x chrM (bp)
SRR309291 (Agilent) | 124.193 | 123.949 | 99.80 | 2836 | 2647 | 21.36 | 12615 | 15691
SRR309292 (Nimblegen) | 185.088 | 184.588 | 99.73 | 3770 | 3466 | 18.78 | 5563 | 11271
SRR309293 (Illumina) | 113.369 | 113.070 | 99.74 | 27326 | 24645 | 217.96 | 16569 | 16569
SRR309293 (Illumina SAMPE) | 112.886 | 105.777 | 93.70 | 25149 | 22894 | 216.44 | 16569 | 16569
SRR341919 (WGS) | 1,312.649 | 1,253.840 | 95.52 | 436042 | 417365 | 332.87 | 16569 | 16569
(Custom Capture) | 5.313 | 5.086 | 95.73 | 5346 | 4897 | 962.75 | 9997 | 14318

*: Q20 mapped chrM reads per Million Mapped reads for that sample
Table 1: Sequencing throughput and mitochondrial genome coverage from NGS data on whole-genome, exome and custom-captured samples
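The RPM* column of Table 1 can be reproduced directly from the other columns, which is a handy sanity check when normalizing off-target yield across samples of very different sizes:

```python
# Reproducing the RPM* column of Table 1 from the raw counts:
# RPM = Q20 chrM reads per million mapped reads.

def chrm_rpm(q20_chrm_reads, mapped_reads_millions):
    """Q20 chrM reads per million mapped reads for a sample."""
    return q20_chrm_reads / mapped_reads_millions

print(round(chrm_rpm(2647, 123.949), 2))     # Agilent row  -> 21.36
print(round(chrm_rpm(24645, 113.070), 2))    # Illumina row -> 217.96
print(round(chrm_rpm(417365, 1253.840), 2))  # WGS row      -> 332.87
```

This normalization is why the small custom-capture run (only ~5 million reads) can still show the highest mitochondrial yield per read.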


Coverage of Mitochondrial Genome

Figure 1: Contrasting coverage of mitochondrial genome from WGS and WES sequencing data (truseq-pe data was aligned using BWA-sampe tool while all others were aligned using BWA-mem)
  • WGS data generated really good coverage of the mitochondrial genome, almost always > 700-fold
  • Coverage from Illumina TruSeq data was consistent between the BWA-mem and BWA-sampe results, though the latter gave slightly lower coverage due to fewer mapped reads
  • Agilent off-target data generated sufficient mitochondria mapped reads considering ~95% of mitochondrial genome covered at 5x. Higher overall throughput for the sequenced sample could have provided greater off-target sequence reads yielding higher mitochondrial genome coverage.
  • Nimblegen off-target data was the least abundant, and its coverage profile across the mitochondrial genome also differed from the other datasets. This may be due to the high-density overlapping bait design of Nimblegen, giving focused on-target coverage and leaving fewer off-target reads.

Variant Calling on the Mitochondrial Genome

Venn diagram (generated using Venny) comparing the mitochondrial variants identified in the same sample from WGS and from off-target data from the different capture kits (10% or more alternate-supporting reads implied a variant call); 33 variants were shared by all 4 datasets (WGS and the Illumina/Nimblegen/Agilent captures)

The sequencing data showed high variability when a 1% alternate-supporting-read threshold was used to annotate a mitochondrial genomic position as a variant, so we required at least 10% of the reads at any given nucleotide position to support the alternate allele in order to call a variant. The Venn diagram above highlights that the vast majority (33/41) of variants called on the mitochondrial genome from WGS and WES data overlap. Another 6 variants identified in WGS were also observed in the Agilent and Illumina WES data, but were missed by Nimblegen WES due to low coverage. We do not provide a comprehensive iteration of the exclusive variants, but most of them suffer from low read-depth, low quality, or strand bias.


With the decreasing cost and increasing availability of exome sequencing data, there is a vast resource of mitochondrial genomes that can be mined for mitochondria-focused research. Data from large consortia like the 1000 Genomes and NHLBI exome datasets can be utilized for comparative evaluation of mitochondrial variation. As reported by Picardi et al, the Illumina TruSeq and Agilent exome kits generate better mitochondrial genome coverage than Nimblegen. Interestingly, even the custom-capture kit we evaluated generated a decent amount of mitochondrial genome coverage. This opens up a plethora of small NGS panel and custom-capture datasets for mitochondrial genome evaluation.

#ASHG2013 Platform and Poster abstract tag-clouds

With more than 6000 scientists (geneticists, bioinformaticians, clinicians, statisticians, genetic counselors…) and more than 200 companies in Boston for this year's American Society of Human Genetics conference, there is a lot of great science to catch up on.

Very quickly, I pulled out the selected platform-talk abstracts and the poster abstracts (too many posters, so I simply picked my biased interest of ~260 Bioinformatics ones) and made these tag-clouds of the popular keywords.

They are very similar! While the Bioinformatics posters have a lot of DATA, coverage and quality, the platform talks have a lot of CANCER, functional and mutations. The platform talks also have a lot of neandertal, pms and mutation. Looking forward to all the excitement!!

Bioinformatics Posters

Platform abstracts

Tools & Parameters: TagCrowd was used to generate the clouds from the text of the PDF files on the ASHG website, showing a maximum of 77 words with a minimum frequency of 5, and excluding these keywords: "boston cambridge chr ma united university".

BTW, do check out the Twitter analysis by @erlichya of #ASHG2013 tweets and keywords

Venn Diagrams Simplified

Not a week goes by without me having to compare multiple lists to evaluate their overlap: re-analyzing data and comparing it with previously computed results, or comparing results from multiple tools or multiple samples. Sometimes the unix commands diff, paste and join come in handy, but most times I am looking for some additional information.

Almost always, I am looking to find out which elements are shared between the lists, which ones are exclusive to a particular list, and so forth. Many times, these evaluations end up as simple Venn diagrams to depict the comparison.

The simplest way to do this, with small enough lists, is to use the wonderful tool VENNY. It has a very simple interface, allows pasting up to 4 lists, and creates a very nice visual that one can download as an image. VENNY also lets you click any region of the figure to list the elements in that sub-group.

However, as is often the case in Bioinformatics, the lists are usually much larger, and with enough grasp of command-line unix and basic perl, a simple script can compare any such lists.

Here is my simple code, used hundreds of times over the last few years. Feel free to try it, leave a comment, or suggest any improvements or faster/better ways to do this sort of evaluation!

Input: single-column lists that need to be compared (if your data has multiple columns, use something like `cut -f1 fileName > tmp1` in unix to cut out the first column (or whichever, using -f) from the file)

Output: a 2-column file, the first column providing a key and the second column listing the actual element from the lists. (Piping the output through cut -f1 | sort | uniq -c gives the count of each group.) For the 2-file script below, the keys are:

key=1 => element is unique to the first list / file
key=2 => element is unique to the second list / file
key=3 => element is shared by both the lists / files

## Script to take 2 files, and report venn diagram numbers for an image
## Usage: $ perl venn2.pl w1 w2 | cut -f1 | sort | uniq -c
use strict;
use warnings;
my %hash;
open(DAT, $ARGV[0]) or die "Cannot open $ARGV[0]: $!"; ## 1st file
while (my $l = <DAT>) {
    chomp $l;
    $hash{$l} = 1; ## in set 1
}
close DAT;
open(DAT, $ARGV[1]) or die "Cannot open $ARGV[1]: $!"; ## 2nd file
while (my $l = <DAT>) {
    chomp $l;
    if (defined $hash{$l} && $hash{$l} == 1) {
        $hash{$l} = 3; ## common to sets 1 & 2
    } else {
        $hash{$l} = 2; ## set 2, not set 1
    }
}
close DAT;
foreach my $k (keys %hash) {
    print "$hash{$k}\t$k\n";
}

Similarly, a small tweak allows this script to be used on 3 (or even 4) lists. The scripts run pretty fast for everyday comparisons. One caveat: use clean data; make sure there are no extra spaces or quotes in some of the lists, which can really mess up the numbers.

## Script to take 3 (optionally 4) files, and report venn diagram numbers for an image
## Usage: $ perl venn3.pl w1 w2 w3 [w4] | cut -f1 | sort | uniq -c

## Key: 1 => 1-only, 2 => 2-only, 3 => 1 & 2, 4 => 3-only, 5 => 1 & 3,
## 6 => 1 & 2 & 3, 7 => 2 & 3, 8 => 1 & 4, 9 => 2 & 4, 10 => 1 & 2 & 4,
## 11 => 3 & 4, 12 => 1 & 3 & 4, 13 => 1 & 2 & 3 & 4, 14 => 2 & 3 & 4, 15 => 4-only
use strict;
use warnings;
my %hash;
open(DAT, $ARGV[0]) or die "Cannot open $ARGV[0]: $!"; ## 1st file
while (my $l = <DAT>) {
    chomp $l;
    $hash{$l} = 1; ## in set 1
}
close DAT;
open(DAT, $ARGV[1]) or die "Cannot open $ARGV[1]: $!"; ## 2nd file
while (my $l = <DAT>) {
    chomp $l;
    if (defined $hash{$l} && $hash{$l} == 1) {
        $hash{$l} = 3; ## common to sets 1 & 2
    } else {
        $hash{$l} = 2; ## set 2, not set 1
    }
}
close DAT;
if (defined $ARGV[2]) {
    open(DAT, $ARGV[2]) or die "Cannot open $ARGV[2]: $!"; ## 3rd file
    while (my $l = <DAT>) {
        chomp $l;
        if (!defined $hash{$l}) { $hash{$l} = 4; }  ## set 3 only
        elsif ($hash{$l} == 1)  { $hash{$l} = 5; }  ## sets 1 & 3, not 2
        elsif ($hash{$l} == 2)  { $hash{$l} = 7; }  ## sets 2 & 3, not 1
        elsif ($hash{$l} == 3)  { $hash{$l} = 6; }  ## sets 1 & 2 & 3
    }
    close DAT;
}
if (defined $ARGV[3]) {
    open(DAT, $ARGV[3]) or die "Cannot open $ARGV[3]: $!"; ## 4th file
    while (my $l = <DAT>) {
        chomp $l;
        if (!defined $hash{$l}) { $hash{$l} = 15; } ## set 4 only
        elsif ($hash{$l} == 1)  { $hash{$l} = 8; }  ## sets 1 & 4 only
        elsif ($hash{$l} == 2)  { $hash{$l} = 9; }  ## sets 2 & 4 only
        elsif ($hash{$l} == 3)  { $hash{$l} = 10; } ## sets 1 & 2 & 4
        elsif ($hash{$l} == 4)  { $hash{$l} = 11; } ## sets 3 & 4 only
        elsif ($hash{$l} == 5)  { $hash{$l} = 12; } ## sets 1 & 3 & 4
        elsif ($hash{$l} == 6)  { $hash{$l} = 13; } ## sets 1 & 2 & 3 & 4
        elsif ($hash{$l} == 7)  { $hash{$l} = 14; } ## sets 2 & 3 & 4
    }
    close DAT;
}
foreach my $k (keys %hash) {
    print "$hash{$k}\t$k\n";
}
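For those who prefer Python over Perl, the same bookkeeping for two lists can be done with sets. A minimal sketch (file reading is elided, and the gene names are just made-up examples):

```python
# A set-based alternative to the Perl scripts above: compute the Venn
# groups for two lists directly. In practice each list would come from
# a single-column file, one element per line.

def venn2(list1, list2):
    """Return the three Venn groups, keyed like the Perl script:
    1 => only in list1, 2 => only in list2, 3 => shared."""
    s1, s2 = set(list1), set(list2)
    return {1: s1 - s2, 2: s2 - s1, 3: s1 & s2}

genes_a = ["BRCA1", "TP53", "SH3TC2", "MFN2"]
genes_b = ["TP53", "MFN2", "PMP22"]

groups = venn2(genes_a, genes_b)
for key in (1, 2, 3):
    print(key, sorted(groups[key]))
```

Printing the sizes of the three groups gives the same counts that `cut -f1 | sort | uniq -c` produces from the Perl output.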

UPDATE: There is a useful R package, VennDiagram, pointed out by @genetics_blog, who BTW has an amazing bioinformatics blog: Getting Genetics Done

R Line Plots – the easiest fastest plot ever

What you need:

  • Data
  • R
  • Eager Bioinformatician

For data, I used a file with 7 columns, where the first column was a counter and the other columns (2-7) held different values that I wanted to compare using a line plot, to visualize the variation in my data points

Serves: as many data points as you would like to extend it to

Time: once you have a parsed dataset, this is fast – you wouldn't want to blink

Plotting lines with a legend 🙂

> x <- read.table("filename")
> png("filename.png")
> plot(x[,2], type="l", col="steelblue", ylab="Heading")
> lines(x[,3], col="pink")
> lines(x[,4], col="cyan")
> lines(x[,5], col="magenta")
> lines(x[,6], col="green3")
> lines(x[,7], col="blue")
> legend(11000, 2, c("124-100", "124-50", "124-20", "124-10", "124-5", "124-2"), lty=c(1,1), lwd=c(2.5,2.5), col=c("steelblue","pink","cyan","magenta","green3","blue"))
> dev.off()


Feel free to change/add colors* and serve embedded within your document/ppt!

* You can find a detailed list of color names in R here:

How recent are rare variants?

The Department of Genome Sciences at the University of Washington in Seattle, in a multi-institutional effort, sequenced 15,336 genes from a total of 6,515 individuals of European American and African American descent for the NHLBI-sponsored ESP project. To identify approaches for disease-gene discovery, it is important to understand the evolutionary history of Homo sapiens and identify the age of mutations.

The group estimated that 73% of all protein-coding SNVs, and around 86% of the SNVs predicted to be deleterious, arose recently, within the last 5,000-10,000 years. European Americans had an excess of deleterious variants and weaker purifying selection, which was explained by the out-of-Africa model.

The gist, you ask: rare variants have an important role in heritable phenotypic variation, disease susceptibility and adverse drug responses. The recent explosive growth in population size has not allowed enough time for selection to act; it has only been 200-300 generations since these mutations arose. This increase in mutations results in more Mendelian disorders and has increased the allelic and genetic heterogeneity of traits.

Though if there’s a positive side to it, it may as well be that we as people have created a new repository of advantageous alleles that have come into being fairly recently and hopefully evolution will act upon in subsequent generations