Bioinformatics tools for VCF files

With the ever-growing abundance of Next Generation Sequencing (NGS) data, the research community faces the challenge of not only standardizing best practices and analysis methods, but also embracing the array of open-source tools and file formats. Having been engaged almost since the beginning of this NGS data outburst, as Illumina's read lengths stretched from 30bp to beyond 250bp, I can say that SAM, BAM and VCF have become reasonably well-accepted file formats. I say reasonably because, only a few versions ago, CLCbio still reported variants in a CSV format, but I shouldn't digress.

So, in part due to the 1000 Genomes Project and the HapMap consortium, formats like VCF are widely accepted as output: most open-source tools report variants as VCF. This has allowed developers of tools and parsers to treat the format as a standard and provide efficient functionality and data portability.

A recent discussion on Twitter about the efficiency and functionality of tools prompted me to compile a list of these VCF parsers for future reference. You will be amazed at the variety of tools available for parsing the eventual VCF reports from NGS data, free of cost!

Feel free to point out pros/cons of these tools and I will continue to update and make a comprehensive post. Also, it would be most helpful if folks could share their experiences with different tools and the example use-cases of the functionality!!

Tools

Tidbits

VAWK – awk-like arithmetic on a VCF file
Bioawk – support for several common biological data formats, including optionally gzip'ed BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats
VCFFilterJS – filters a VCF with JavaScript
bio-vcf – new-generation VCF parser
bcftools – contains all the vcf* commands
VCFtools – easily accessible methods for working with complex genetic variation data in the form of VCF files (Paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3137218/)
VariantFiltration – filters variant calls using a number of user-selectable, parameterizable criteria
PyVCF – Variant Call Format parser for Python
vcflib – simple C++ library for parsing and manipulating VCF files
wormtable – write-once read-many table for large-scale datasets (vcf2wt)
SnpSift – toolbox to filter and manipulate annotated files
gvcftools – utilities to create and analyze gVCF files
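Several of the tools above (VAWK, SnpSift, VCFFilterJS) boil down to filtering VCF records on a column or INFO value. As a minimal illustration of what that involves, here is a sketch of a pure-Python QUAL filter; the column layout follows the VCF specification, but the threshold and field handling are simplified assumptions (real parsers like PyVCF also handle multi-allelic sites, INFO typing, and so on):

```python
import io

def filter_vcf(handle, min_qual=30.0):
    """Yield VCF lines whose QUAL column passes a threshold.

    Header lines (starting with '#') are passed through unchanged.
    """
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("#"):
            yield line
            continue
        fields = line.split("\t")
        # Fixed VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
        chrom, pos, vid, ref, alt, qual = fields[:6]
        if qual != "." and float(qual) >= min_qual:
            yield line

# Tiny in-memory example with two records
vcf = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\t.\tA\tG\t50\tPASS\tDP=20\n"
    "1\t200\t.\tC\tT\t10\tPASS\tDP=4\n"
)
kept = [l for l in filter_vcf(vcf) if not l.startswith("#")]
# only the QUAL=50 record survives the default threshold of 30
```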


The Ebola outbreak of 2014 (Contd.)

Following up on the previous post, here is some more detail on the recent Science paper, along with a round-up of "what do we know, what have we learned" thus far.

The Gire et al paper in Science was a huge amount of work and a giant collaborative research effort. Being a computational biology researcher, I appreciate their in-depth and detailed evaluation utilizing numerous bioinformatics software tools. Combing through the supplemental text, I created the flowchart below as a summary of all the analysis that went into the eventual results and interpretations. It was created using the wonderful Gliffy tool.

Flowchart of the impressive work accomplished by the Gire et al Science paper (I made this using notes from their Supplemental Data)

A slew of articles summarizing the recent Science paper came out as the hype surrounding the WHO warning of this outbreak hitting 20,000 people caught on. This is huge!! Peter Piot, who co-discovered the Ebola virus during the 1976 outbreak, never imagined an outbreak like this, but is confident that 'high-income countries' will do just fine.

The Broad Institute and Harvard University worked with the Sierra Leone Ministry of Health and Sanitation, along with other researchers, to produce the comprehensive Science paper describing the sequencing of current Ebola genomes. Simultaneously, the human trial of NIH and GSK's investigational Ebola vaccine is to begin this week, as it performed well in primate studies. Hopefully this one will be faster than the usual 10-year turnaround for a vaccine trial. Although the experimental drug ZMapp is being used on some cases, the results are mixed, and much more still needs to be done. Interestingly, the drug is a cocktail of three mouse monoclonal antibodies, and the primate research itself was published in Nature last week. The details of how it seems to have worked on the two US health care workers in the midst of this outbreak make for a pretty 'miraculous' story!

The major points to note thus far:

  • First Ebola Virus Disease case of 2014 confirmed in Sierra Leone on May 25
  • It appears there was a single instance of EBOV transmission from the 'natural reservoir' to humans, and the virus has since been transmitted from human to human (implying the chance of non-human transmission is rare, though present)
  • The substitution rate is twice as high as in previous outbreaks, implying that continued progression of this epidemic could allow viral adaptation; hence the need for rapid containment
  • The 2014 outbreak has a doubling period of about 35 days!!
  • Complicating matters, positive diagnosis for malaria does not necessarily rule out Ebola Virus Disease
  • Senegal just became the 5th West African country with a confirmed case of Ebola
  • Breaking News: samples from the Ebola outbreak in Congo (DRC) were evaluated to represent a distinct and independent transmission event, likely via bushmeat consumption!

Hopefully this is contained sooner rather than later…

Exomes vs Genomes (re-visited)

The paper by Lupski et al in Genome Medicine adds fuel to the perpetual debate of Whole Exome Sequencing (WES) vs Whole Genome Sequencing (WGS). It takes me down memory lane to my own presentation "Genomes or Exomes: evaluation of cost, time and coverage" at the Beyond the Genome 2011 conference. (If you would like to check it out, my poster is available at the Faculty of 1000 resource, with so many others from the conference.) My work summarized WES vs WGS results on a single blood sample from an individual with cardiomyopathy. Although WGS gave better coverage of the UCSC exons evaluated, WES identified exclusive variants missed by WGS.

Sequencing coverage has always been key to elucidating variants from NGS data. Lupski et al worked on a CMT (also known as HMSN) case; in my own generic evaluation of WES read-depth coverage of CMT-related genes, 93% of CCDS exons had good coverage (JNNP paper). I found about 89% of the known mutations in the 33 CMT genes, including SH3TC2, to be covered at 10x (10-fold) sequencing coverage. As those results suggest (JNNP paper), WES misses coding regions, including important known mutations, which one needs to be careful about, especially in clinical applications.

Back to WGS vs WES; let's start with the key points to consider for the comparison:

Key point (winner) – notes:

Cost (WES): Typical WES requires 60-100 million 100bp reads for decent sequencing coverage, whereas WGS requires almost a billion 100bp reads for an average 30x coverage
Time (WES): For the same reason, WES can be generated and analyzed with a much faster turn-around time. For clinically targeted WGS analysis, I developed a novel iterative method (PLoS One) that delivers variant results in 5 hours!
Average coverage, depth (WES): WES, being targeted, provides much deeper coverage of the captured coding regions
Average coverage, breadth (WGS): Coverage from WGS is much more uniform, covering more of the annotated exons and independent of annotation source. WGS also has the advantage in regions where capture probes are difficult to design, providing sequencing coverage and thus the potential for variant calling
Structural variants (WGS): Broad, uniform coverage from WGS, coupled with mature algorithms and tools, allows better detection of structural variants, CNVs and large INDELs
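The read counts in the comparison above can be sanity-checked with simple Lander-Waterman-style arithmetic: mean coverage = reads × read length / target size. The target sizes below are assumed round numbers for illustration, not figures from the paper:

```python
def mean_coverage(n_reads, read_len, target_bp):
    """Expected average depth: total sequenced bases / target size."""
    return n_reads * read_len / target_bp

WES_TARGET = 50e6   # ~50 Mb capture design (assumed round number)
GENOME = 3.2e9      # approximate human genome size

wes = mean_coverage(80e6, 100, WES_TARGET)  # 80M 100bp reads, all on-target
wgs = mean_coverage(1e9, 100, GENOME)       # 1 billion 100bp reads

# wes = 160x nominal (before duplicate and off-target losses),
# wgs is just over 31x, matching the ~30x WGS figure quoted above
```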

Lupski et al performed a variety of sequencing experiments on different NGS instruments, including Illumina, ABI SOLiD and Ion Torrent. The best part is that all this data is publicly available on NCBI SRA. The scientific community can make much bigger strides through open data sharing: such a deep dataset from multiple platforms and applications is extremely beneficial, providing a distinct advantage over simulated datasets for algorithm development, software evaluation and benchmarking.

  • SOLiD sequencer: 1 WES + 1 WGS
  • Illumina GAII: 2 WES
  • Illumina HiSeq: 2 WES + 1 WGS
  • Ion Torrent: 2 WES (PGM and Proton)

Summarizing the paper: all the WES were captured using the NimbleGen VCRome 2.1 capture kit. Its 42Mb capture region includes Vega, CCDS and RefSeq gene models, along with miRNA and regulatory regions. Interestingly, the Clark et al (Nature Biotechnology) review of different WES capture technologies concluded that the densely packed, overlapping baits of NimbleGen SeqCap EZ generated the highest-efficiency target enrichment. On the other hand, the recent review of WES capture by Chilamakuri et al in BMC Genomics found Illumina capture data showing higher coverage of annotated exons.

Lupski et al analyzed Illumina data using BWA (alignment) -> GATK (recalibration) -> Atlas2 (SNV/INDEL calling) -> Cassandra (annotation). Ion Torrent data was analyzed using TMAP (alignment) -> Picard/Torrent-Suite (duplicate marking) -> VarIONt (SNV calling) -> Cassandra (annotation). The rationale for the choice of tools, including why tools like GATK's VQSR were not used, is not detailed in the paper. One metric readers would have liked to see for the WGS datasets is 'Targets hit' and 'Targeted bases with 10+ coverage' in Table 1; it should be relatively straightforward to calculate and would give a good perspective on how the WGS metrics compare with those from WES.

The most striking observation concerned SNVs called in all WES datasets but absent from WGS! Here are some of the summary points:

  • 3709 coding SNVs were concordantly called in all WES datasets but missed by the original SOLiD (~30x coverage) WGS. This is huge, as those 3709 SNVs were identified in all six WES results and thus should be of good quality.
  • Variant concordance of the same sample using Illumina HiSeq & GAII – Figure 3
      • more than 96% and 98% of SNVs are concordant between HiSeq-HiSeq and GAII-GAII replicates, respectively.
      • only 83% and 82% of INDELs are concordant between HiSeq-HiSeq and GAII-GAII replicates, respectively. Once again, INDEL calling is noisier, though it was not clear whether the authors left-aligned INDELs to remove false discordance arising when INDEL start and stop coordinates do not align perfectly. I wonder how the recent Scalpel tool, which promises higher indel-calling sensitivity, might perform on these datasets.
      • even higher discordance when comparing HiSeq to GAII data (for the same sample and exome capture!!)
  • Properties of 'private' or exclusive SNVs from the WES results – Figure 4, Figure 5. As expected, a large majority of exclusive SNVs are questionable on basic quality metrics:
      • low variant fraction (% reads supporting alternate or non-reference allele)
      • low coverage depth
      • strand bias or multiply-mapped reads (leading to low variant quality)
  • Both WES and WGS found the 12 pharmacologically relevant variants

In all, this round goes to WES, mostly due to the higher coverage achieved compared to WGS. The higher coverage allowed elucidation of strand bias and of the appropriate proportion of alternate-supporting (variant calling) reads, reducing the particular FP and FN variants discussed in the paper. It would be interesting to generate a much higher average coverage WGS dataset and assess whether some regions or genes remain better suited to evaluation by WES. To conclude, I quote from the paper: "the high yet skewed depth of coverage in targeted regions afforded by the (W)ES methods may offer higher likelihood of recovery of significant variants and resolution of their true genotypes when compared to the lower, but more uniform WGS coverage".

#ASHG2013 Platform and Poster abstract tag-clouds

With more than 6000 scientists (geneticists, bioinformaticians, clinicians, statisticians, genetic counselors…) and more than 200 companies in Boston for this year's American Society of Human Genetics conference, there is a lot of great science to catch up on.

Very quickly, I pulled out the selected platform-talk abstracts and the poster abstracts (too many posters, so I simply picked my biased interest of ~260 Bioinformatics ones) and made these tag-clouds of the popular keywords.

They are very similar! While the Bioinformatics posters have a lot of DATA, coverage and quality, the platform talks have a lot of CANCER, functional and mutations. The platform talks also feature neandertal, pms and mutation prominently. Looking forward to all the excitement!!

Bioinformatics Posters

Platform abstracts

Tools & Parameters: TagCrowd was used to generate the clouds from the text of PDF files on the ASHG website, showing a maximum of 77 words with a minimum frequency of 5 and excluding the keywords "boston cambridge chr ma united university".
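Those TagCrowd settings amount to a stoplisted word-frequency count. A rough sketch of the idea (the tokenizer and parameters below are simplified assumptions, not TagCrowd's actual algorithm):

```python
import re
from collections import Counter

def tag_counts(text, min_freq=5, exclude=frozenset(), top=77):
    """Count lowercase word frequencies, mimicking the settings above:
    show at most `top` words, drop words below `min_freq`, skip a stoplist."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in exclude)
    kept = [(w, c) for w, c in counts.most_common() if c >= min_freq]
    return kept[:top]

# Toy input: 'boston' is stoplisted, the rest appear 5 times each
sample = "cancer cancer cancer cancer cancer data data data data data boston"
tags = tag_counts(sample, min_freq=5, exclude={"boston"})
# tags contains ('cancer', 5) and ('data', 5) but not 'boston'
```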

BTW, do check out the Twitter analysis by @erlichya of #ASHG2013 tweets and keywords.

Venn Diagrams Simplified

Not a week goes by without my having to compare multiple lists to evaluate overlap: re-analyzing data and comparing it with previously computed results, or comparing results from multiple tools or multiple samples. Sometimes the unix commands diff, paste and join come in handy, but most times I am looking for some additional information.

Almost always, I am looking to find the shared elements of the lists, which ones are exclusive to a particular list, and so forth. Many a time, these evaluations end up as simple Venn diagrams to depict the comparison.

The simplest way to do so, with small enough lists, is to use the wonderful tool VENNY. It has a very simple interface, allows pasting up to 4 lists, and creates a very nice visual that one can download as an image. VENNY also lets one click any region of the figure to list the elements in that sub-group.


However, as is often the case in Bioinformatics, the lists are much larger, and if one has enough grasp of command-line unix and basic perl, a simple script can compare any such lists.

Here is my simple code, which has been used hundreds of times over the last few years. Feel free to try it, leave a comment, or suggest improvements or faster/better ways to do this sort of evaluation!

Input: single-column lists to be compared (if your data has multiple columns, use something like `cut -f1 fileName > tmp1` in unix to cut out the first column, or whichever column you need via -f)

Output: a 2-column file, the first column a key, the second column the actual element from the lists. (Piping through cut -f1 | sort | uniq -c gives the count for each group.) For the 2-file script below, the keys are:

key=1 => element is unique to the first list / file
key=2 => element is unique to the second list / file
key=3 => element is shared by both lists / files

## Script to take 2 files, and report venn diagram numbers for an image
## Usage: $ perl 2-way-venn.pl w1 w2 | cut -f1 | sort | uniq -c
use strict;
use warnings;
my %hash;
open(DAT, $ARGV[0]) or die "Cannot open $ARGV[0]: $!"; ## 1st file
while (my $l = <DAT>) {
    chomp $l;
    $hash{$l} = 1; ## in set 1
}
close DAT;
open(DAT, $ARGV[1]) or die "Cannot open $ARGV[1]: $!"; ## 2nd file
while (my $l = <DAT>) {
    chomp $l;
    if (defined $hash{$l} && $hash{$l} == 1) {
        $hash{$l} = 3; ## common to sets 1 & 2
    } else {
        $hash{$l} = 2; ## set 2, not set 1
    }
}
close DAT;
foreach my $k (keys %hash) {
    print "$hash{$k}\t$k\n";
}
exit;

Similarly, a small tweak allows this script to be used on 3 (or even 4) lists. The scripts run pretty fast for everyday comparisons. One caveat: use clean data; make sure there are no extra spaces or quotes in some of the lists, which can really mess up the numbers.

## Script to take 3 (or 4) files, and report venn diagram numbers for an image
## Usage: $ perl 3-way-venn.pl w1 w2 w3 [w4] | cut -f1 | sort | uniq -c

## Key: 1 => 1-only, 2 => 2-only, 3 => 1 & 2, 4 => 3-only, 5 => 1 & 3
## 6 => 1 & 2 & 3, 7 => 2 & 3 (keys 8-15 cover the optional 4th file)
use strict;
use warnings;
my %hash;
open(DAT, $ARGV[0]) or die "Cannot open $ARGV[0]: $!"; ## 1st file
while (my $l = <DAT>) {
    chomp $l;
    $hash{$l} = 1; ## in set 1
}
close DAT;
open(DAT, $ARGV[1]) or die "Cannot open $ARGV[1]: $!"; ## 2nd file
while (my $l = <DAT>) {
    chomp $l;
    if (defined $hash{$l} && $hash{$l} == 1) {
        $hash{$l} = 3; ## common to sets 1 & 2
    } else {
        $hash{$l} = 2; ## set 2, not set 1
    }
}
close DAT;
if (defined $ARGV[2]) {
    open(DAT, $ARGV[2]) or die "Cannot open $ARGV[2]: $!"; ## 3rd file
    while (my $l = <DAT>) {
        chomp $l;
        if (!defined $hash{$l}) {
            $hash{$l} = 4; ## set 3 only
        } elsif ($hash{$l} == 1) {
            $hash{$l} = 5; ## sets 1 & 3, not 2
        } elsif ($hash{$l} == 2) {
            $hash{$l} = 7; ## sets 2 & 3, not 1
        } elsif ($hash{$l} == 3) {
            $hash{$l} = 6; ## sets 1 & 2 & 3
        }
    }
    close DAT;
}
if (defined $ARGV[3]) {
    open(DAT, $ARGV[3]) or die "Cannot open $ARGV[3]: $!"; ## 4th file
    while (my $l = <DAT>) {
        chomp $l;
        if (!defined $hash{$l}) {
            $hash{$l} = 15; ## set 4 only
        } elsif ($hash{$l} == 1) {
            $hash{$l} = 8; ## sets 1 & 4 only
        } elsif ($hash{$l} == 2) {
            $hash{$l} = 9; ## sets 2 & 4 only
        } elsif ($hash{$l} == 3) {
            $hash{$l} = 10; ## sets 1 & 2 & 4
        } elsif ($hash{$l} == 4) {
            $hash{$l} = 11; ## sets 3 & 4 only
        } elsif ($hash{$l} == 5) {
            $hash{$l} = 12; ## sets 1 & 3 & 4 only
        } elsif ($hash{$l} == 6) {
            $hash{$l} = 13; ## sets 1 & 2 & 3 & 4
        } elsif ($hash{$l} == 7) {
            $hash{$l} = 14; ## sets 2 & 3 & 4
        }
    }
    close DAT;
}
foreach my $k (keys %hash) {
    print "$hash{$k}\t$k\n";
}
exit;
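In the spirit of the "faster/better ways" invitation above, the same membership bookkeeping generalizes to any number of lists if, instead of fixed numeric keys, each element is mapped to the tuple of lists that contain it. A sketch in Python (set semantics, so duplicate lines within a file collapse):

```python
from collections import defaultdict

def venn_groups(*lists):
    """Map each element to the group of input lists containing it.

    Returns {membership_tuple: [elements...]}, e.g. key (0, 1) holds
    elements shared by the first two lists only. Unlike the fixed
    numeric keys in the Perl scripts above, this works for any number
    of lists.
    """
    membership = defaultdict(set)
    for i, items in enumerate(lists):
        for item in items:
            membership[item.strip()].add(i)
    groups = defaultdict(list)
    for item, seen_in in membership.items():
        groups[tuple(sorted(seen_in))].append(item)
    return dict(groups)

groups = venn_groups(["a", "b", "c"], ["b", "c", "d"], ["c", "d", "e"])
# groups[(0, 1, 2)] == ["c"]  (shared by all three lists)
# groups[(0,)] == ["a"]       (exclusive to the first list)
```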

UPDATE: There is a useful R package, VennDiagram, pointed out by @genetics_blog, who BTW has an amazing bioinformatics blog: Getting Genetics Done

Journal Club: False-positive signals in exome sequencing

Detecting false-positive signals in exome sequencing

Human Mutation

I cannot believe this paper is already a year old. There was a printed copy on my desk, but it never got transmitted from the eyes into the brain!! Finally there was enough time to review the paper and collate the valuable information to share here.

Whole Exome Sequencing (WES) is fast becoming the most common NGS application. It allows querying almost all of the coding genome (the 3% of 3 billion nucleotides that we understand the most about) at a relatively low cost and time investment. Looking through any list of notable sequencing papers, the most common title is "Exome sequencing identifies the causal variant for XYZ". However, we know about the small but omnipresent spurious results that are part of WES data. This article does a great job of elucidating the common false positives and sources of noise in WES data.

    • 118 WES samples from 29 families seen by the NIH Undiagnosed Diseases Program
    • 401 additional exomes from the ClinSeq study for cross-checking
    • Agilent 38Mb and 50Mb whole-exome capture kits; GA-IIx 76 and 100bp paired-end reads
    • Method: ELAND -> Cross_Match -> bam2mpg (genotyping) -> CDPred (prediction) -> VarSifter -> Galaxy
    • Used hg18; no duplicate removal
    • False-positive candidate variants are usually
      • located in highly polymorphic genomic regions
      • caused by assembly misalignment
      • errors in the reference genome
    • 23,389 positions with excess heterozygosity (alignment error)
    • 1009 positions where the reference genome contains the minor allele (excess homozygosity)
    • Errors arise from library construction bias; polymerase error; higher error rates towards the ends of short reads; loss of synchrony within a cluster (Illumina sequencing); platform-specific mechanistic issues
    • Highly variable genes frequently contain numerous "pathogenic" variants and are thus unlikely to be disease causing (genes with >10 high-quality variants; one should normalize by gene length and where in the CDS the variants fall)
    • (Pseudogenes) 392 high-quality variants were heterozygous in all 118 exomes
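The excess-heterozygosity screen behind that last observation (392 variants heterozygous in all 118 exomes) is simple to sketch: flag positions heterozygous in nearly every sample, since real variants should roughly follow population genotype proportions while paralog misalignment makes everyone look heterozygous. The genotype encoding and the 0.9 threshold below are illustrative assumptions, not the paper's exact method:

```python
def flag_excess_het(genotypes_by_position, het_fraction=0.9):
    """Flag positions heterozygous in nearly every sample -- a hallmark
    of pseudogene/paralog misalignment rather than real variation.

    genotypes_by_position: {position: ["0/1", "0/0", ...]} VCF-style GT strings.
    """
    flagged = []
    for pos, gts in genotypes_by_position.items():
        hets = sum(1 for gt in gts if gt in ("0/1", "1/0", "0|1", "1|0"))
        if gts and hets / len(gts) >= het_fraction:
            flagged.append(pos)
    return flagged

calls = {
    "chr1:1000": ["0/1"] * 118,                # het in every sample: suspicious
    "chr1:2000": ["0/1"] * 10 + ["0/0"] * 108, # plausible real variant
}
suspicious = flag_excess_het(calls)
# suspicious == ["chr1:1000"]
```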

Similar reading:
PLOS ONE: Limitations of the Human Reference Genome for Personalized Genomics

Journal Club: Indels in 179 genomes (1000genome data)

The origin, evolution and functional impact of short insertion-deletion variants identified in 179 human genomes

Genome Research

Finally, there is a comprehensive analysis of indels, and of course it is Next Generation Sequencing data that is driving it. I have my concerns about the biases of NGS technology and analysis, along with the ensuing false positives in indel detection. Nonetheless, the authors have done a good job of summarizing the information, touching on the important points and making some valuable observations. It would be great to see this comprehensive analysis repeated on the public Complete Genomics genomes or the growing body of Ion Torrent data, to corroborate these findings as generic rather than specific to any one platform.

  • Dataset used = 179 (~4x coverage) genomes from 1000 genomes pilot data of 3 populations
  • 1.6 million indels – 50% of them in 4% of the genome (indel hotspots)
  • Polymerase slippage is the main cause of 75% of indels (almost all indels in hotspots and 50% indels in non-repeat regions are due to slippage)
  • indels are subject to stronger purifying selection than SNVs (the paper calls them SNPs)
  • recombination hotspots that are known to be enriched with SNVs are not enriched with indels
  • longer and frameshift indels have stronger effect on fitness
  • indels on average have a stronger functional effect than SNVs
  • Method
    • STAMPY: aligner with high sensitivity and low reference bias
    • DINDEL genotyper: Use alt-supporting reads to select high quality indels
    • build implied haplotypes (LD betw SNV/indel and impute) and error model for homopolymers
    • ignore indels in long (>10bp) homopolymers
    • validate with Sanger sequencing
  • the 1.6 million indels are 8-fold lower than SNVs from these genomes
  • selected novel indels (seen in neither the 1000 genomes report nor dbSNP129)
  • chose 2 CEU as validation targets and sampled calls predicted to segregate in them
  • randomly selected a subset; able to design primers for 111; 60 sanger sequenced
  • 36 matches; 12 low-quality Sanger; 12 discordant => 0.25% FDR for this novel set (4.6% total FDR)
  • INDEL classes
    • Homopolymer Run (6nt+) – HR – 10-fold indel enrichment compared to genomic average (even higher if include longer homopolymers)
    • Tandem Repeat – TR – 20-fold indel enrichment
    • Predicted hotspot – PR – predicted indel rate > predicted SNV rate
    • Non-repetitive sites – NR
    • change in copy-number count – CCC – NR-CCC & NR non-CCC
  • HR + TR + PR = 4% of the genome (hotspot) with 50% of indels – deletions dominate short tracts, insertions longer tracts, and then del again for much longer tracts
  • 100-fold increase in polymorphism rate going from 4-bp homopolymer to 8-bp
  • 25% of indels are not due to polymerase slippage, mostly NR non-CCC and mostly deletions (about 90%), perhaps due to formation of a double-stranded break intermediate and imperfect repair
  • the remaining 2.5% insertions most often involve palindromic repeat
  • 43 genes with a high individual predicted mutation rate in coding regions – 10 of those do not show SNV enrichment and thus have exclusive indel enrichment causing a high mutational load – includes HTT (Huntington's), AR (prostate cancer), ARID1B (neurodevelopmental), and the MED and MAML genes
  • GWAS: common indels are well tagged by SNVs – it is possible to phase indels into SNV haplotype reference panels
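The HR class above (homopolymer runs of 6nt or more, with ~10-fold indel enrichment) is straightforward to locate in any sequence with a simple regex scan; a minimal sketch, with the minimum run length as a parameter:

```python
import re

def homopolymer_runs(seq, min_len=6):
    """Find homopolymer runs of at least min_len nucleotides.

    Returns (start, base, length) tuples, e.g. for annotating the
    'HR' indel-hotspot class discussed above.
    """
    return [(m.start(), m.group(0)[0], len(m.group(0)))
            for m in re.finditer(r"(A+|C+|G+|T+)", seq.upper())
            if len(m.group(0)) >= min_len]

runs = homopolymer_runs("ACGTAAAAAAACGTTTTTTTTG")
# runs == [(4, 'A', 7), (13, 'T', 8)]
```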