Bioinformatics tools for VCF files

With the ever growing abundance of Next Generation Sequencing (NGS) data there continues to be a challenge faced by the research community to not only standardize best practices and analysis methods, but to embrace the foray of open-source tools and file formats. Having been engaged since almost the beginning of this NGS data outburst with Illumina’s read length extending from 30bp to beyond 250bp, for the file format category SAM, BAM and VCF have become well accepted formats to some extent. I say well accepted because a few versions ago CLCbio still reported variants in a csv file format, but maybe I shouldn’t digress.

So in part due to the 1000 Genome Project and Hapmap consortium, formats like VCF are largely accepted as output format as most open-source tools report variants as VCF reports. This has allowed development of tools / parsers to use the file formats as a standard and provide efficient functionality and data portability.

A recent discussion on twitter about the efficiency and functionality of tools made me compile a list of these VCF parsers for future reference. You will be amazed at the variety of tools available for helping parse the eventual VCF reports from NGS data – free of cost!

Feel free to point out pros/cons of these tools and I will continue to update and make a comprehensive post. Also, it would be most helpful if folks could share their experiences with different tools and the example use-cases of the functionality!!



VAWK awk-like arithmetic on a VCF file
Bioawk support of several common biological data formats, including optionally gzip’ed BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats
VCFFilterJS Filters a VCF with javascript
bio-vcf new generation VCF parser
bcftools contains all the vcf* commands
VCFtools provide easily accessible methods for working with complex genetic variation data in the form of VCF files (Paper:
VariantFiltration Filters variant calls using a number of user-selectable, parameterizable criteria
PyVCF Variant Call Format Parser for Python
vcflib simple C++ library for parsing and manipulating VCF files
wormtable write-once read-many table for large scale datasets (vcf2wt)
SnpSift toolbox to filter and manipulate annotated files
gvcftools Utilities to create and analyze gVCF files


Venn Diagrams Simplified

Not a week goes by without me having to compare multiple lists to evaluate overlap. Re-analyzing any data and comparing that with previously computed data, comparing results from multiple tools or multiple samples. Sometimes the unix commands diff, paste and join come handy, but most times am looking for some additional information.

Almost always, am looking to find out the shared elements of the list, which ones are exclusive to a particular list, and so forth. Many a times, these evaluations end up in simple VENN diagrams to diagrammatically depict the comparison.

The simplest way to do so, with small enough lists, is to use this wonderful tool VENNY. It has a very simple to use interface, allows for pasting upto 4 lists, and creates a very nice visual that one can download as an image. VENNY also allows one to click any region of the figure, to list out the elements in that sub-group.

However, as is the case with Bioinformatics, the lists are most times much larger, and if one has enough grasp of command-line unix and basic perl, a simple script can help compare any such lists.

Here is my simple code that has been used 100s of times over the last few years. Feel free to try, leave a comment, or suggest any improvements or faster/better ways to do this sort of evaluation!

Input: single column lists that need to be compared (if your data is multiple columns, use something like `cut -f1 fileName > tmp1` in unix to cut out the first (or whichever using -f) column from the file)

Output: 2-column file, the first column providing a key, the 2nd column listing the actual element from the lists. (Using a pipe ‘|’ with cut -f1 | sort | uniq -c gives the count of each group. In the script below, using 2 files

key=1 => element is unique to the first list / file
key=2 => element is unique to the second list / file
key=3 => element is shared by both the lists / files

## Script to take 2 files, and report venn diagram numbers for an image
## Usage: $ perl w1 w2 | cut -f1 | sort | uniq -c
use strict;
use warnings;
my %hash;
open(DAT, $ARGV[0]); ##1st file
while (my $l = <DAT>) {
chomp $l;
$hash{$l} = 1; ## in set1
close DAT;
open(DAT, $ARGV[1]); ##2nd file
while (my $l = <DAT>) {
chomp $l;
if (defined $hash{$l} && $hash{$l} == 1) {
$hash{$l} = 3; ##Common in sets 1 & 2
} else {$hash{$l} = 2;} ## Set 2, not set 1
close DAT;
foreach my $k (keys %hash) {
print "$hash{$k}\t$k\n";

Similarly, a small tweak allows this script to be used on 3 lists. The scripts run pretty fast for everyday comparisons. One caveat is to use clean data, make sure there are no extra spaces or quotes on some lists, which can really mess up the numbers.

## Script to take 3 files, and report venn diagram numbers for an image
## Usage: $ perl w1 w2 w3 | cut -f1 | sort | uniq -c

## Key: 1 => 1-only, 2 => 2-only, 3 => 1 & 2, 4 => 3-only, 5 => 1 & 3
## 6 => 1 & 2 & 3, 7 => 2 & 3
use strict;
use warnings;
my %hash;
open(DAT, $ARGV[0]); ##1st file
while (my $l = <DAT>) {
chomp $l;
$hash{$l} = 1; ## in set1
close DAT;
open(DAT, $ARGV[1]); ##2nd file
while (my $l = <DAT>) {
chomp $l;
if (defined $hash{$l} && $hash{$l} == 1) {
$hash{$l} = 3; ##Common in sets 1 & 2
} else {$hash{$l} = 2;} ## Set 2, not set 1
close DAT;
if (defined $ARGV[2]) {
open(DAT, $ARGV[2]); ##3rd file
while (my $l = <DAT>) {
chomp $l;
if (!defined $hash{$l}) {$hash{$l} = 4;} ## Set 3 only
elsif ($hash{$l} == 1) {
$hash{$l} = 5; ## sets 1 & 3, not 2
} elsif ($hash{$l} == 2) {
$hash{$l} = 7; ## sets 2 & 3, not 1
} elsif ($hash{$l} == 3) {
$hash{$l} = 6; ## sets 1 & 2 & 3
close DAT;
if (defined $ARGV[3]) {
open(DAT, $ARGV[3]); ##3rd file
while (my $l = ) {
chomp $l;
if ($hash{$l} == 1) {
$hash{$l} = 8; ## sets 1 & 4 only
} elsif ($hash{$l} == 2) {
$hash{$l} = 9; ## sets 2 & 4 only
} elsif ($hash{$l} == 3) {
$hash{$l} = 10; ## sets 1 & 2 & 4
} elsif ($hash{$l} == 4) {
$hash{$l} = 11; ## Set 3 & 4 only
} elsif ($hash{$l} == 5) {
$hash{$l} = 12; ## Set 1 & 3 & 4 only
} elsif ($hash{$l} == 6) {
$hash{$l} = 13; ## Set 1 & 2 & 3 & 4
} elsif ($hash{$l} == 7) {
$hash{$l} = 14; ## Set 1 & 2 & 4 only
} else {$hash{$l} = 15;} ## Set 4 only
close DAT;
foreach my $k (keys %hash) {
print "$hash{$k}\t$k\n";

UPDATE: There is this useful R package VennDIagram pointed by @genetics_blog who BTW has an amazing bioinformatics blog: Getting Genetics Done

The common story of rare variants..

NGS comes to the aid of researchers to find answers for herited complex traits and diseases, paving a path towards the ‘personalized medicine era’. Whole-exome sequencing (WES) has been deployed for identifying rare variants associated with complex diseases and is providing a philip for further research and insight.

Common variants that were identified through Genome Wide Association Studies (GWASs) have not been able to answer the founding questions to an extent where the research community can identify the traits related to heritability. Moving forward and learning as we do from our experiences, the trail of identification of rare variants is helping us fill the gaps of missing heritability that were unsatiated with these genome wide global efforts.

Cirulli and Goldstein mention in their review (Uncovering the roles of rare variants in common disease through whole-genome sequencing) that common variants are being identified in Mendelian disease studies as having a key role as modifiers of the effects of rarer contributors to disease risk.

Potential frequencies of causal variants in complex traits

At conferences like ASHG in San Francisco last year the buzz words were ‘rare variants are common’. Twitter updates of late from the PAGXII have been talking about the same ‘rare variants’ and this, not just for human populations.

Scoping through literature for common and rare variants has been interesting, there are papers that point out in great depth how the thought process from Common Disease Common Variant (CDCV) moved on to the Common Disease Rare Variant (CDRV). Almost 4 years ago, Schork mentioned in his paper (Common vs. Rare Allele Hypotheses for Complex Diseases) that rare genetic variants (less than 5% frequency) can play key roles in influencing complex disease and traits.

Another interesting paper that I came accross from a decade earlier was from Reich and Lander (Lander of the Gangnam fame) On the allelic spectrum of human disease. In this paper the authors discuss the variation in allelic spectra for common disease genes and point that for some genes predominant disease alleles exist, while for others only a rare set. Their theory revolves around the idea that genes responsible for most of the risk for common diseases (hypertension, heart disease etc) have relatively simple allelic spectra and hence the causal variants for a common disease can be found using GWAS.

In his paper An Abundance of Rare Functional Variants in 202 Drug Target Genes Sequenced in 14,002 People, Nelson talks about rare variants being a result of recent mutations and being clustered geographically to some extent. He also points out that the common variants observe only a small fraction of the genetic diversity in any gene.

Number of Variants/kb of sequence

Moving on a decade (or less) from now, I imagine for 23&Me to expand its base to move on from just the common variants they report for ‘someone’s DNA’ to the rare variants that they could possibly do with specific input on ancestry, population, surnames to guide that route.

From the Whitehead Institute for Biomedical Research in Cambridge comes a paper on Identifying Personal Genomes by Surname Inference. Gymrek talks about how surnames can be recovered from profiling short tandem repeats on the Y chromosome from freely available, publicly accessible internet resources. Though the authors point out that their efforts are more in line to see effective policies being established for data sharing and awareness to the patient regarding participating in genetic studies and not for data sharing to recede.

Another interesting read for the statistically inclined people is SKAT test from the Harvard
Rare-Variant Association Testing for Sequencing Data with the Sequence Kernel Association Test. This study proposed a test: sequence kernel association test (SKAT) for studying association between a set of rare (and common) variants and continuous/dichotomous phenotypes. Using aggregates of individual score test statistics of SNPs belonging to a set, it computes p-values from the defined set level.

SKAT steps in to take the role of testing for association between variants in a region, surpassing burden tests. The authors note it is unlikely for most rare variants to influence the phenotype with the same magnitude. And as it is more common for variants within a sequenced region to have little or no effect on phenotype, SKAT allows different variants to have different directions and magnitude of effects.

Coming back to the conferences here is a snippet of what some institutes are doing: @ ASHG 2012 Quite a few sessions also revolved around the rare variants and other findings from the NHLBI Exome Sequencing Project. The NHLBI GO ESP is a dataset characterizing multiple samples of richly phenotype populations, making this endeavor a variation of the 1000 human genome project in many ways. Some sessions like these listed in the table below, highlighted the idea of accelerated gene discovery for complex traits using NGS:

Genomeweb article from ASHG2012 talks about the details of the Rucphen population study and mentions the other ongoing studies that were in the session:

Institution Isolated Population
University of Miami Midwestern Amish populations
University of Maryland Amish populations in Pennsylvania
Cittadella Univ. di Monserrato, Italy Sardinian population
Erasmus Medical Center Rotterdam Rucphen population

While others focussed on the approaches for testing these rare variant datasets:

Institution Approach
Baylor College of Medicine & Fred Hutchinson Cancer Research Center Testing for rare variant associations in the presence of missing data
Baylor College of Medicine & University of Washington Rare Variant Extensions of the Transmission Disequilibrium Test Detects Associations with Autism Exome Sequence Data
Johannes Kepler University, Austria Associating complex traits with rare variants identified by NGS: improving power by a position-dependent kernel approach

So will we step in this personalized genome era soon someday? I’d hate for this bubble to burst so I hope the CAP & CLIA certifications are done and the clinical labs are all set. The public health and data privacy issues are in check and as we breathe into a no nonsense world looking at our detailed genome analysis report we hopefully can feel its more or less uniquely ours or better still closer to our parents than our neighbours.

Clinical Findings from Sequencing keep flowing…

Just a glance at GenomeWeb’s “In Print: Clinical Sequencing Papers of Note of the Last Two Weeks” is enough to show how hard it is to keep up with the sequencing revolution. Of the 37 publications mentioned here, 15 talk about whole-exome sequencing (WES) (and two amplicon sequencing and three whole-genome seq) identifying the causal variant(s).

Most of them use a slightly different way to align the reads, call variants and interpret the results. This reflects on the recent review paper in Bioinformatics journal (Tools for mapping high-throughput sequencing data) and the relevant blog post talking about the challenges of such a diverse tool-set (An embargo on short read alignment tools).

Amidst all this, the NGS Catalog is a wonderful tool to correlate technology with gene or disease or even specific mutations, along with author names and publications. So much data that can be mined for very useful snippets of information.

Thanks to the express permission of GenomeWeb editors, I am able to post below the list of papers highlighted in their section (which is behind a login access)

In Print: Clinical Sequencing Papers of Note of the Last Two Weeks

 ADAR1 promotes malignant progenitor reprogramming in chronic myeloid leukemia.
Jiang Q, Crews LA, Barrett CL, Chun HJ, et al.
Proc Natl Acad Sci U S A. 2012 Dec 28. [Epub ahead of print]

DGKE variants cause a glomerular microangiopathy that mimics membranoproliferative GN.
Ozaltin F, Li B, Rauhauser A, An SW, et al.
J Am Soc Nephrol. 2012 Dec 28. [Epub ahead of print]

Combining highly multiplexed PCR with semiconductor-based sequencing for rapid cancer genotyping.
Beadling C, Neff TL, Heinrich MC, Rhodes K, et al.
J Mol Diagn. 2012 Dec 27. [Epub ahead of print]

Next-generation sequencing reveals deep intronic cryptic ABCC8 and HADH splicing founder mutations causing hyperinsulinism by pseudoexon activation.
Flanagan SE, Xie W, Caswell R, Damhuis A, et al.
Am J Hum Genet. 2012 Dec 22. [Epub ahead of print]

Exome sequencing identifies INPPL1 mutations as a cause of opsismodysplasia.
Huber C, Faqeih EA, Bartholdi D, Bole-Feysot C, et al.
Am J Hum Genet. 2012 Dec 22. [Epub ahead of print]

Whole-genome analysis reveals that mutations in inositol polyphosphate phosphatase-like 1 cause opsismodysplasia.
Below JE, Earl DL, Shively KM, McMillin MJ, et al.
Am J Hum Genet. 2012 Dec 22. [Epub ahead of print]

Mitochondrial disease genetic diagnostics: optimized whole-exome analysis for all MitoCarta nuclear genes and the mitochondrial genome.
Falk MJ, Pierce EA, Consugar M, Xie MH, et al.
Discov Med. 2012 Dec;14(79):389-99.

Genetic diagnosis of autosomal dominant polycystic kidney disease by targeted capture and next-generation sequencing: Utility and limitations.
Qi XP, Du ZF, Ma JM, Chen XL, et al.
Gene. 2012 Dec 20. [Epub ahead of print]

RNA-seq reveals differentially expressed isoforms and novel splice variants in buccal mucosal cancer.
Jakhesara SJ, Koringa PG, Bhatt VD, Shah TM, et al.
Gene. 2012 Dec 20. [Epub ahead of print]

Next-generation sequencing identified new oncogenes and tumor suppressor genes in human hepatic tumors.
Amaddeo G, Guichard C, Imbeaud S, Zucman-Rossi J.
Oncoimmunology. 2012 Dec 1;1(9):1612-1613.

Exomic sequencing of medullary thyroid cancer reveals dominant and mutually exclusive oncogenic mutations in RET and RAS.
Agrawal N, Jiao Y, Sausen M, Leary R, et al.
J Clin Endocrinol Metab. 2012 Dec 21. [Epub ahead of print]

Newborn screening for SCID identifies patients with Ataxia telangiectasia.
Mallott J, Kwan A, Church J, Gonzalez-Espinosa D, et al.
J Clin Immunol. 2012 Dec 20. [Epub ahead of print]

Exome sequencing identifies mutation in CNOT3 and ribosomal genes RPL5 and RPL10 in T-cell acute lymphoblastic leukemia.
De Keersmaecker K, Atak ZK, Li N, Vicente C, et al.
Nat Genet. 2012 Dec 23. [Epub ahead of print]

Germline mutations affecting the proofreading domains of POLE and POLD1 predispose to colorectal adenomas and carcinomas.
Palles C, Cazier JB, Howarth KM, Domingo E, et al.
Nat Genet. 2012 Dec 23. [Epub ahead of print]

Molecular diagnostics of a single multifocal non-small cell lung cancer case using targeted next generation sequencing.
Geurts-Giele WR, Dirkx-van der Velden AW, Bartalits NM, Verhoog LC, et al.
Virchows Arch. 2012 Dec 22. [Epub ahead of print]

Exome sequencing identifies mutations in CCDC114 as a cause of primary ciliary dyskinesia.
Knowles MR, Leigh MW, Ostrowski LE, Huang L, et al.
Am J Hum Genet. 2012 Dec 19. [Epub ahead of print]

Mutations in ECEL1 cause distal arthrogryposis type 5D.
McMillin MJ, Below JE, Shively KM, Beck AE, et al.
Am J Hum Genet. 2012 Dec 19. [Epub ahead of print]

Rare, low-frequency, and common variants in the protein-coding sequence of biological candidate genes from GWASs contribute to risk of rheumatoid arthritis.
Diogo D, Kurreeman F, Stahl EA, Liao KP, et al.
Am J Hum Genet. 2012 Dec 19. [Epub ahead of print]

Whole-genome sequencing in autism identifies hot spots for de novo germline mutation.
Michaelson JJ, Shi Y, Gujral M, Zheng H, et al.
Cell. 2012 Dec 21;151(7):1431-42.

RNA-seq analysis of synovial fibroblasts brings new insights into rheumatoid arthritis.
Heruth DP, Gibson M, Grigoryev DN, Zhang LQ, Ye SQ.
Cell Biosci. 2012 Dec 21;2(1):43.

Use of next generation sequencing technologies in research and beyond: are participants with mental health disorders fully protected?
Jaitovich Groisman I, Mathieu G, Godard B.
BMC Med Ethics. 2012 Dec 20;13(1):36.

Mutation of the PDGFRB gene as a cause of idiopathic basal ganglia calcification.
Nicolas G, Pottier C, Maltête D, Coutant S, et al.
Neurology. 2012 Dec 19. [Epub ahead of print]

Identification of TP53 as an acute lymphocytic leukemia susceptibility gene through exome sequencing.
Powell BC, Jiang L, Muzny DM, Treviño LR, et al.
Pediatr Blood Cancer. 2012 Dec 19. [Epub ahead of print]

Histamine-2 receptor blockers alter the fecal microbiota in premature infants.
Gupta RW, Tran L, Norori J, Ferris MJ, et al.
J Pediatr Gastroenterol Nutr. 2012 Dec 18. [Epub ahead of print]

Genome-wide sequencing for the identification of rearrangements associated with Tourette syndrome and obsessive-compulsive disorder.
Hooper SD, Johansson AC, Tellgren-Roth C, Stattin EL, et al.
BMC Med Genet. 2012 Dec 19;13(1):123.

Detection of BCR-ABL1 mutations in chronic myeloid leukaemia by massive parallel sequencing.
Eyal E, Amir A, Cesarkas K, Jacob-Hirsch J, et al.
Br J Haematol. 2012 Dec 17. [Epub ahead of print]

West syndrome caused by ST3Gal-III deficiency.
Edvardson S, Baumann AM, Mühlenhoff M, Stephan O, et al.
Epilepsia. 2012 Dec 17. [Epub ahead of print]

Exome sequencing identifies a founder frameshift mutation in an alternative exon of USH1C as the cause of autosomal recessive retinitis pigmentosa with late-onset hearing loss.
Khateb S, Zelinger L, Ben-Yosef T, Merin S, et al.
PLoS One. 2012;7(12):e51566.

The ChIP-seq-defined networks of Bcl-3 gene binding support its required role in skeletal muscle atrophy.
Jackman RW, Wu CL, Kandarian SC.
PLoS One. 2012;7(12):e51478.

Characterization and validation of insertions and deletions in 173 patient exomes.
Lescai F, Bonfiglio S, Bacchelli C, Chanudet E, et al.
PLoS One. 2012;7(12):e51292.

Next-generation sequencing meets genetic diagnostics: development of a comprehensive workflow for the analysis of BRCA1 and BRCA2 genes.
Feliubadaló L, Lopez-Doriga A, Castellsagué E, Del Valle J, et al.
Eur J Hum Genet. 2012 Dec 19. [Epub ahead of print]

A compound heterozygous mutation in DPAGT1 results in a congenital disorder of glycosylation with a relatively mild phenotype.
Iqbal Z, Shahzad M, Vissers LE, van Scherpenzeel M, et al.
Eur J Hum Genet. 2012 Dec 19. [Epub ahead of print]

Analysis of a Chinese pedigree with Zellweger syndrome reveals a novel PEX1 mutation by next-generation sequencing.
Sun Y, Wang L, Wei X, Zhu Q, et al.
Clin Chim Acta. 2012 Dec 13. [Epub ahead of print]

Filling the gaps — the generation of full genomic sequences for 15 common and well-documented HLA class I alleles using next-generation sequencing technology.
Lind C, Ferriola D, Mackiewicz K, Sasson A, Monos D.
Hum Immunol. 2012 Dec 12. [Epub ahead of print]

Integrating GWAS and expression data for functional characterization of disease-associated SNPs: an application to follicular lymphoma.
Conde L, Bracci PM, Richardson R, Montgomery SB, Skibola CF.
Am J Hum Genet. 2012 Dec 11. [Epub ahead of print]

Whole-exome sequencing identifies LRIT3 mutations as a cause of autosomal-recessive complete congenital stationary night blindness.
Zeitz C, Jacobson SG, Hamel CP, Bujakowska K, et al.
Am J Hum Genet. 2012 Dec 11. [Epub ahead of print]

Exome sequencing reveals a signal transducer and activator of transcription 1 (STAT1) mutation in a child with recalcitrant cutaneous fusariosis.
Wang X, Lin Z, Gao L, Wang A, et al.
J Allergy Clin Immunol. 2012 Dec 13. [Epub ahead of print]

The New Genomes of 2012

From the first large genome (goat’s genome) to be sequenced and assembled de novo using whole-genome mapping technology, published almost a week ago to around 1000 rice genomes sequenced at 1x coverage: there’s a lot that has happened at the Whole Genome Sequencing level.

Roche/454 is still being used and so is Sanger, not everyone has moved on to Illumina as you’ll see in the table below. With clinical sequencing aiming a 100x coverage most of the research sequencing seems to be lower. In these papers you’re sure to spot BGI some in partnerships like with University of Copenhagen for the Bat paper and their Illumina data.

As this year wraps up, provided food for thought in his ‘The class of 2012’ post. Here’s a snapshot of some quirky points/details that caught my eye in these ‘Whole Genome Sequencing’ papers.

Lets see what more genomes and sequencing technologies await us in 2013!

Organism Sequencing Usefulness Size & Annotation Software Interesting Findings Scope
Capra hircus; 3yr old female Yunnan black goat WGS: Illumina GA IIx12x coverage For the first time whole-genome mapping technology was used for de novo assembly of large genomes Investigate genetic basis of complex traits, in transgene production of peptide medicines ~2.66-Gb genome sequence; 22,175 protein-coding genes Contigs & scaffolds assembled using SOAPdenovo (Release 1.05) and ABYSSWhole genome mapping: Genome-Builder

Annotation: GLEAN

51 genes are differentially expressed in primary and secondary follicles of cashmere goat Markers for breeding better cashmere goats can be identified and/or potential targets for genetic or nongenetic manipulation
Female western lowland gorilla named Kamilah (San Diego Zoo) WGS: Hybrid de novo assembly combining 5.4 Gb Sanger & 166.1 Gb of Illumina short reads57.4x coverage Closest living relatives after chimpanzees will aid in study of human evolution ~3 Gb genome sequence; 20,962 Protein-coding genes Assembler: ABYSS, Phusion assembler, Maq, VelvetChromosomal AGP files: LASTZ

Annotation: ENSEMBL

30% of the gorilla genome is closer to human though its rarer around coding genes Deeper understanding of great ape biology and evolution
Pan paniscus; female bonobo individual Ulindi (Leipzig Zoo) WGS: 454/Roche 23x coverage (additionally 19 bonobo & chimpanzee genomes on Illumina GAIIx) Compared it to genomes of chimpanzees and humans to study its evolutionary relationship 2.7 Gb genome sequence Assembler: Celera Assembler software 25% of human genes contain parts that are more closely related to one of the two apes than the other Illuminate population history and selective events that affected evolution
Solanum lycopersicum; inbred tomato cultivar ‘Heinz 1706’ WGS: Combination of 21Gb of Roche/454 Titanium & 3.3 Gb of Sanger27x Coverage Compared it with closest wild relative, Solanum pimpinellifolium and potato genome (Solanum tuberosum) 900Mb Genome size; 34,727 protein-coding genes Assembler: Newbler, CABOGAnnotation based on EuGene pipeline Tomato genome more than 8% divergence from potato 18,320 orthologous gene pairs Triplications Comparing gene family evolution & understanding bottlenecks that have narrowed tomato genetic diversity
Denisovan, an extinct relative of Neandertals (& 11 present day individuals) WGS; Single-stranded library preparation method; Illumina GA IIx31x coverage DNA library preparation method to reconstruct a high-coverage (30×) genome sequence ~1.86Gb Genome size; Coverage was not biased toward GC-rich sequences Illumina Genome Analyzer RTA 1.6 software, mapped to reference using BWA, GATK (also realignment) Denisovans share more alleles with east Asian & South American populations (Dai, Han, and Karitiana) than with European populations (French and Sardinian) Determine how modern humans expanded in population size & cultural complexity while archaic humans became physically extinct
Musa acuminata (banana) WGS; 27.5 million Roche/454 single reads and 2.1 million Sanger reads (50× of Illumina data used to correct sequence errors) Served as a stepping-stone to finding conserved non-coding sequences conserved beyond monocotyledons 523-Mb genome size; 36,542 protein-coding genes Assembler: NewblerAligner: BLAT Comparison of Musa, rice, sorghum, Brachypodium, date palm and Arabidopsis proteomes revealed 7,674 gene clusters common to all six species Whole-genome duplications Unravel complex genetics & key to identifying genes responsible for agronomic characters, such as fruit quality and pest resistance
Rice: 1,083 O. sativa accessions & 446 O. rufipogon accessions (China, Japan) WGS; Illumina GA IIx 73-bp paired-end reads O. sativa: 1x coverage O. rufipogon 2x coverage Identified 55 selective sweeps that have occurred during domestication Aligner: SmaltSNP caller: Ssaha Pileup Insights into how and where rice was likely domesticated & set of domestication sweeps and putative causal genes Important resource for rice breeders to effectively exploit diverse genetic resources for rice improvement
Female domestic Duroc pig (Sus scrofa) WGS Hybrid de novo assembly based sequences from BAC clones40x coverage Comparison with the genomes of wild and domestic pigs from Europe and Asia Identification of putative disease-causing variants can aid the pig to be a biomedical model ~2.6Gb genome size; 21,640 protein-coding genes Contigs Assembler: PhrapAssembler: SOAPdenovo, Cortex

Aligner: BLAT

Genes associated with immune response and olfaction exhibit fast evolution112 positions where porcine protein has same amino acid that is implicated in a human disease Important resource for improvement in livestock species
Bread wheat (Triticum aestivum); Chinese Spring (CS42) WGS; Hexaploid genome 454/Roche – 5x coverage. SOLiD used for additional sequencing to increase accuracy of homologous SNP identification. Comparison with diploid progenitors and relatives showed overall trend of gene family size reduction in large gene families in wheat. Defined genome-wide catalog of SNPs 17 Gb genome size; ~94k genes MetaSim to generate readsOrthologous genes: OrthoMCL clustering, BLASTX Assembled gene sequences representing a complete gene set were sequenced. Powerful framework for identifying genes Identification of extensive genetic variation can provide a resource for accelerating gene discovery and improving this crop
Pacific oyster Crassostrea gigas (inbred female produced by four generations of brother–sister mating) WGS 155x Illumina (unable to assemble due to high  polymorphism & repetitive sequence) fosmid-pooling strategy Fosmid library 10x; 60x sequencing &  assembly Combination of fosmid pooling, NGS and hierarchical assembly: new, cost-effective alternative for de novo sequencing & assembly of complex genomes 637 Mb; 28,027 genes Alignment: LASTZ Expansion of genes coding for HSP70 & IAPs is probably central to adaptation to sessile life in the highly stressful intertidal zone valuable resources for studying molluscan biology and lophotrochozoan evolution
owl limpet (Lottia gigantea), a marine polychaete (Capitella teleta) and a freshwater leech (Helobdella robusta) WGS with Sanger dideoxy sequencing reads8x coverage compare them with other animal genomes to investigate the origin and diversification of bilaterians from a genomic perspective ~200-300 Mb genome size; ~23,000 to 33,000 protein-coding genes orthologs BLAST ~8K bilaterian gene families likely from single progenitor genes. 231 putative spiralian-specific gene families, members aligned across all three spiralians, indicating purifying selection rate-stratification approach could be used to place problematic taxa when genome data becomes available
Wild-caught bats, fruit bat (P. alecto) and insectivorous (M. davidii) WGS Illumina HiSeq 2000109x-118x Coverage Identify genetic changes associated with the development of bat-specific traits by comparative analyses of two distantly related bat species ~2Gb genome size; protein coding genes: 21,392 P. alecto & 21,705 M. davidii Assembler:SOAPdenovo aligner BWA

Repeat Annotation: Tandem Repeats Finder

Gene annotation: GLEAN

Genes in DNA damage checkpoint/DNA repair pathway found to be under positive selection. So are COL3A1 (skin elasticity) CACNA2D1 (muscle contraction). Entire locus of PYHIN gene family (sensing microbial DNA) is lost Comparison with other species may provide new insights into bat biology and evolution

MAQ SNPfilter

MAQ is this very vesatile and popular tool for mapping of short read data (Solexa, SOLiD).

This is a very curious case. I used MAQ to call SNPs on a Solexa data of 6.5 million 36bp reads. I used maq map .. followed by SNPfilter using all defaults. Interestingly, I trimmed the 1st two bases of each read and used the 24bp reads to map and call SNPs. This run aligned more reads to the reference and called fewer SNPs after the SNP filtering on MAQ data. Not stopping here, I trimmed even the last two bases of all reads. This left me with 32 base reads, off a 36 base original which had the 1st and last two bases clipped. This yielded even more mapped reads (~ 100k more than the original 36 base mapping) and even fewer final filtered SNPs.

Below is the VENN for the SNP calls, and interestingly, most of the exclusive ones are present in SNP calls of the other data-sets, but got filtered out by MAQ’s SNP filtering steps. SNP filtering takes place due to quality and indels. Trimming a few bases off each read can make all this difference? Fortunately, I do have a True Positive data here to compare, and decide which of these is the best solution, but how would one know in a new study?

MAQ filtered SNP results
MAQ filtered SNP results



Blat is a useful algorithm in the Bioinformatics world dominated with blast! Its even named Blast Like Alignment Tool.

I was recently fiddling with blat alignment on human data, and realized how fast and easy blat can be. Working on next-gen data, Solexa reads is what I had.

One of the cool things was, use the chromosomes separately, run the individual BLAT, and then stitch back the results.

For setting BLAT parameters, and reading/refreshing about the algorithm, these links are really helpful:

From Jim Kent himself
The paper
The forum

I simply ran BLAT on human chromosomes one by one, of course outputting in the popular blast -m9 format to get a pretty table

$ for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y; do blat chr$i.fa reads.fa -out=blast9 blat$i.9; done &

Followed by stitching them together with appropriate sorting. I wanted to have them sorted by read, by bit-score, by alignment-length and by chromosome in that order

$ cat *.9 | grep -v ‘#’ | sort -k 1,1 -k 12,12r -k 4,4r -k 2,2 > reads.blat

This gives me a way to look at the top hits, next-best alignments of reads for various analysis on the next-gen Solexa reads.