Bioinformatics tools for VCF files

With the ever-growing abundance of Next Generation Sequencing (NGS) data, the research community continues to face the challenge of not only standardizing best practices and analysis methods, but also embracing the array of open-source tools and file formats. I have been engaged with NGS data since almost the beginning of this outburst, as Illumina’s read lengths extended from 30bp to beyond 250bp, and in the file format category SAM, BAM and VCF have become well accepted formats, at least to some extent. I say "to some extent" because a few versions ago CLCbio still reported variants in a CSV file, but maybe I shouldn’t digress.

In part due to the 1000 Genomes Project and the HapMap consortium, VCF is now largely accepted as an output format, and most open-source tools report variants as VCF. This has allowed the development of tools and parsers that treat the format as a standard and provide efficient functionality and data portability.

A recent discussion on Twitter about the efficiency and functionality of tools made me compile a list of these VCF parsers for future reference. You will be amazed at the variety of tools available for parsing the eventual VCF reports from NGS data, all free of cost!

Feel free to point out pros/cons of these tools and I will keep updating this into a comprehensive post. Also, it would be most helpful if folks could share their experiences with different tools and example use-cases of the functionality!!

Tools and tidbits

VAWK: awk-like arithmetic on a VCF file
Bioawk: awk with support for several common biological data formats, including optionally gzip’ed BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats
VCFFilterJS: filters a VCF with JavaScript
bio-vcf: new generation VCF parser
bcftools: contains all the vcf* commands
VCFtools: provides easily accessible methods for working with complex genetic variation data in the form of VCF files (Paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3137218/)
VariantFiltration: filters variant calls using a number of user-selectable, parameterizable criteria
PyVCF: Variant Call Format parser for Python
vcflib: simple C++ library for parsing and manipulating VCF files
wormtable: write-once read-many table for large scale datasets (vcf2wt)
SnpSift: toolbox to filter and manipulate annotated files
gvcftools: utilities to create and analyze gVCF files
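
To give a feel for what all of these parsers have to do underneath, here is a minimal sketch in plain Python (standard library only) that splits the eight fixed VCF columns and the INFO field. It is not how any of the tools listed above is implemented; the function names and the two toy records are invented for illustration, and real files also carry FORMAT and per-sample columns that this sketch ignores.

```python
import io


def parse_info(info):
    """Turn 'AF=0.14;DB' into {'AF': '0.14', 'DB': True}."""
    if info == ".":
        return {}
    out = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True  # keys without '=' are flags
    return out


def parse_vcf(handle):
    """Yield one dict per data line, skipping '#' header lines."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        chrom, pos, vid, ref, alt, qual, filt, info = line.split("\t")[:8]
        yield {
            "chrom": chrom,
            "pos": int(pos),                      # 1-based position
            "id": None if vid == "." else vid,    # '.' means no ID
            "ref": ref,
            "alt": alt.split(","),                # several ALT alleles possible
            "qual": None if qual == "." else float(qual),
            "filter": filt,
            "info": parse_info(info),
        }


# Toy records, invented for this example (not real 1000 Genomes calls).
EXAMPLE_VCF = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t10583\trs0000001\tG\tA\t100\tPASS\tAF=0.14;DB\n"
    "1\t10611\t.\tCT\tC\t.\tPASS\tAF=0.02\n"
)

records = list(parse_vcf(io.StringIO(EXAMPLE_VCF)))
for rec in records:
    print(rec["chrom"], rec["pos"], rec["id"], rec["ref"], rec["alt"], rec["info"])
```

For anything beyond a quick look, the dedicated tools in the list are the better choice, since they handle compression, indexing and the many corner cases of the format.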

 

HuGGV – Human Genome Genetic Variation

In 2010 the 1000 Genomes Project published its analyses and findings from the pilot data, to determine whether its strategy of using the light sequencing approach was adequate to present variation results.

For the HuGGV project I have studied the genetic variants from the 1000 Genomes 2011 release dataset. I downloaded the variant call format (.vcf) data for all autosomal chromosomes and compared the genetic variations across all chromosomes. I analyzed the data for 22 chromosomes and made a dataset to detect variations of different types. The variations that I have looked at are:

Single nucleotide polymorphisms (SNPs) and insertions/deletions (indels)
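
As a rough sketch of how the two variant types can be told apart from the REF and ALT columns of a VCF record, the Python snippet below uses allele lengths only. The function name and the toy allele pairs are made up for illustration, and this is a simplification rather than the exact procedure used for the project.

```python
from collections import Counter


def classify_variant(ref, alt):
    """Classify one REF/ALT pair by allele length (simplified)."""
    if alt.startswith("<") or alt in (".", "*"):
        return "other"   # symbolic or missing allele, e.g. <DEL>
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"     # single base substituted
    if len(ref) != len(alt):
        return "indel"   # bases inserted or deleted
    return "other"       # same length > 1, e.g. a multi-nucleotide change


# Invented REF/ALT pairs, not taken from the real data set.
toy_alleles = [("G", "A"), ("CT", "C"), ("A", "ATG"), ("AC", "GT"), ("T", "<DEL>")]
counts = Counter(classify_variant(ref, alt) for ref, alt in toy_alleles)
print(counts)
```

A record with several comma-separated ALT alleles would need each allele classified separately.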

I studied the literature and searched the NCBI and OMIM databases to see how these variants compare to our knowledge about the human chromosomes and the diseases associated with them.

I analyzed different statistics, such as comparing known polymorphisms (rsIDs) to the new variants found in the data set. I used the human refFlat gene annotation for hg19 from UCSC and converted it into BED file format to intersect with the genome variation data for every chromosome downloaded from the 1000 Genomes database. Using the intersectBed tool I was able to analyze the variants that lie in the coding regions.
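
The idea behind that intersection can be sketched in a few lines of Python. This is only an illustration of the logic, not a replacement for the BED tool used for the actual analysis: the function names and toy intervals are invented, a variant is treated as "known" simply when its ID starts with "rs", and only the POS of each variant is tested (an indel spanning a region boundary would need more care). The one detail worth getting right is that BED intervals are 0-based and half-open, while VCF positions are 1-based.

```python
import bisect
from collections import Counter, defaultdict


def load_bed(lines):
    """Return {chrom: sorted, merged (start, end) intervals} from BED lines."""
    raw = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        raw[chrom].append((int(start), int(end)))
    merged = {}
    for chrom, intervals in raw.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            last_start, last_end = out[-1]
            if start <= last_end:                  # overlapping or touching
                out[-1] = (last_start, max(last_end, end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def in_regions(regions, chrom, pos):
    """True if a 1-based VCF position lies in a 0-based half-open interval."""
    intervals = regions.get(chrom, [])
    zero_based = pos - 1
    # Index of the last interval whose start is <= zero_based.
    i = bisect.bisect_right(intervals, (zero_based, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= zero_based < intervals[i][1]


def is_known(variant_id):
    """Assumption: known polymorphisms carry an rsID, novel ones have '.'."""
    return variant_id.startswith("rs")


# Toy data, invented for illustration.
toy_bed = ["1\t100\t200\n", "1\t150\t300\n", "2\t50\t60\n"]
toy_variants = [("1", 101, "rs1"), ("1", 100, "."), ("1", 300, "rs2"),
                ("1", 301, "."), ("2", 55, "."), ("3", 10, "rs3")]

regions = load_bed(toy_bed)
summary = Counter()
for chrom, pos, vid in toy_variants:
    if in_regions(regions, chrom, pos):
        summary["known" if is_known(vid) else "novel"] += 1
print(summary)
```

Merging overlapping intervals first matters for gene annotations, where transcripts of the same gene overlap heavily and would otherwise be counted several times.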

I used the NHGRI SeattleSeq Annotation tool to analyze the annotation of SNPs in exon regions of all the chromosomes for known and novel regions. It was interesting to note the missense and nonsense SNPs, which I have plotted in the graphs.

The total number of samples being sequenced is 2500, with 500 from each of the 5 ancestries: European Ancestry, Americas, East Asian Ancestry, West African Ancestry and South Asian Ancestry.

Interpreting these various characteristics can give an overview of the properties of the chromosomes and the diseases related to them.

1000 Human Genome Project

The 1000 Genomes Project aims to sequence the genomes of a number of people from different populations and make this dataset publicly available to the scientific community as a comprehensive resource on human genetic variation. The project provides a thorough characterization of human genome sequence variation from the whole-genome sequences it has collected. This characterization can be used as a foundation for further investigating the relationship between genotypes and their respective phenotypes.

How a project this large works around the (still) high cost of deep sequencing whole genomes is interesting to note. (In whole genome sequencing, multiple copies of the same genome are shredded into pieces, sequenced, and then aligned to a reference sequence to study variation in the sample compared to the reference.)

The 1000 Genomes Project has chosen to take the route of “light sequencing” its collected whole genomes. This entails 4X coverage (as mentioned on the 1000 Genomes website) rather than 30X or more, which lowers sequencing costs compared to deep sequencing methods. The project is designed so that data across many samples will be combined to give efficient detection of most variants in the region of interest, which explains why, from their point of view, light sequencing of a large data set seems more viable than deep sequencing of a smaller one.
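
To make the 4X versus 30X trade-off a bit more concrete, here is a back-of-the-envelope Python sketch. It assumes a simple model that is mine, not the Project's: read depth at a site is Poisson-distributed around the mean coverage, half of the reads at a heterozygous site carry the variant allele, and sequencing errors are ignored. The function names are hypothetical.

```python
import math


def poisson_cdf(k, mean):
    """P(X <= k) for a Poisson variable with the given mean."""
    return sum(math.exp(-mean) * mean ** i / math.factorial(i)
               for i in range(k + 1))


def prob_alt_seen(coverage, min_alt_reads=1, alt_fraction=0.5):
    """P(at least min_alt_reads reads show the variant allele).

    Assumes Poisson depth and that a fraction alt_fraction of reads
    carry the variant (0.5 for a heterozygous site).
    """
    mean_alt = coverage * alt_fraction
    return 1.0 - poisson_cdf(min_alt_reads - 1, mean_alt)


for cov in (4, 30):
    print(cov, round(prob_alt_seen(cov, 1), 3), round(prob_alt_seen(cov, 2), 3))
```

Under these assumptions a single 4X genome shows the variant allele at a heterozygous site in at least one read about 86% of the time, and in at least two reads only about 59% of the time, whereas at 30X it is essentially always seen. If ten carriers are each sequenced at 4X, the pooled variant reads behave like one site with 40X of combined coverage in this model, which is the intuition behind combining data across many samples.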

The Project also aims to detect such variants down to frequencies as low as 1%. Combining data from this big a sample of whole genomes can give accurate insight into the variants and genotypes for each sample, which light sequencing of a smaller sample might not have achieved as effectively.
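
A quick bit of arithmetic shows why sample size matters for a 1% variant. The Python sketch below assumes diploid autosomes (two chromosome copies per person) and that each copy independently carries the variant with the population frequency; both assumptions and the function names are mine, for illustration only.

```python
from math import comb


def expected_copies(n_samples, freq):
    """Expected number of variant copies among 2 * n_samples chromosomes."""
    return 2 * n_samples * freq


def prob_at_least(copies, n_samples, freq):
    """Binomial P(at least `copies` variant copies are present in the sample)."""
    n = 2 * n_samples
    below = sum(comb(n, i) * freq ** i * (1 - freq) ** (n - i)
                for i in range(copies))
    return 1.0 - below


for n in (10, 100, 1000, 2500):
    print(n, expected_copies(n, 0.01), prob_at_least(5, n, 0.01))
```

With 2500 people there are 5000 chromosomes, so a variant at 1% frequency is expected in about 50 copies, while in a sample of 10 people it would most likely not appear at all. Having many copies spread across samples is what lets low coverage per genome still add up to enough evidence.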

After reading their project description we can deduce that this data set, which has now been expanded from the original goal of sequencing 1000 human genomes to sequencing 2500, will provide useful information for research studies in which groups study variation in large samples and draw conclusions by comparison with disease samples. Selection and population structure are among the many interesting aspects that can be studied given the large amount of data sequencing has made available. Interesting.

Playing around with the 1000 Human Genome Data

I have had my eye on the amazing 1000 Human Genome Project that has been going on. For my course Poster Presentation I am planning to delve into this, play with the data and see what I can find.

The 1000 Genomes Project aims to make sequencing data publicly available for 1000 human genomes. This information can be useful for research studies where groups can study variation in large samples and deduce information with comparison to disease samples. Selection and population structure are among other interesting aspects that can be studied given the large amount of data sequencing has made available.

I am interested in studying the genetic variants from the 1000 Genomes release dataset. As the data is available from their website there is only one concern left: getting it all onto my Mac or finding a server. The plant data stays where it is, so I have to hunt for something new! Once I have this dataset I would like to compare the genetic variations across all chromosomes and see how these results compare to past literature and our knowledge about the human chromosomes. This would certainly take a lot more research, so this would have to be an ongoing adventure sorta 😀

Let’s see how much interesting information turns up.