In 2010 the 1000 Genomes Project published the analyses and findings from its pilot data, which were used to determine whether its strategy of light (low-coverage) sequencing was adequate for reporting variation.
For the HuGGV project I studied the genetic variants in the 2011 release of the 1000 Genomes dataset. I downloaded the Variant Call Format (.vcf) data for all 22 autosomal chromosomes, compared the genetic variation across them, and built a dataset for detecting variants of different types. The variant types I looked at are:
Single nucleotide polymorphisms (SNPs) and insertions/deletions (indels)
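As an illustration of how the two variant types can be told apart in a VCF file, here is a minimal Python sketch. It assumes the standard tab-separated VCF layout (REF in column 4, ALT in column 5) and classifies each record by allele length. The function names `classify_variant` and `count_variant_types` are hypothetical and are not taken from the project's own scripts.

```python
def classify_variant(ref, alt):
    """Classify one REF/ALT pair as 'SNP', 'indel', or 'other'."""
    alleles = alt.split(",")  # ALT may list several alleles
    # Symbolic alleles (<DEL>), missing (.) and spanning (*) alleles are set aside.
    if any(a.startswith("<") or a in (".", "*") for a in alleles):
        return "other"
    # Single base replaced by a single base: SNP.
    if len(ref) == 1 and all(len(a) == 1 for a in alleles):
        return "SNP"
    # Any length difference between REF and an ALT allele: insertion or deletion.
    if any(len(a) != len(ref) for a in alleles):
        return "indel"
    return "other"  # e.g. multi-nucleotide substitutions


def count_variant_types(vcf_lines):
    """Count SNPs, indels and other records in an iterable of VCF lines."""
    counts = {"SNP": 0, "indel": 0, "other": 0}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        counts[classify_variant(fields[3], fields[4])] += 1
    return counts
```

Passing an open file handle for one chromosome's .vcf file to `count_variant_types` would give per-chromosome counts that can then be compared across the 22 autosomes.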
I studied the literature and searched the NCBI and OMIM databases to see how these variants compare with current knowledge about the human chromosomes and the diseases associated with them.
I computed several statistics, such as comparing known polymorphisms (variants with rsIDs) with the novel variants found in the dataset. I took the refFlat gene annotation for the hg19 human reference from UCSC and converted it to BED format so that it could be intersected with the variation data for every chromosome downloaded from the 1000 Genomes database. Using the intersectBed tool, I was able to identify the variants that fall in coding regions.
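The conversion and intersection steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the pipeline actually used (which relied on intersectBed). It assumes the 11-column UCSC refFlat layout (gene name, transcript name, chromosome, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds), treats a variant as known when its VCF ID starts with "rs", and strips a leading "chr" so that UCSC and 1000 Genomes chromosome names match. Clipping exons to the coding span is also an assumption about what "coding regions" means here. The function names are hypothetical.

```python
from collections import defaultdict


def strip_chr(chrom):
    """Normalize 'chr1' (UCSC style) to '1' (1000 Genomes style)."""
    return chrom[3:] if chrom.startswith("chr") else chrom


def refflat_to_bed(refflat_lines, coding_only=True):
    """Return {chrom: [(start, end, gene), ...]} in 0-based, half-open BED coordinates."""
    exons = defaultdict(list)
    for line in refflat_lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        gene, chrom = f[0], f[2]
        cds_start, cds_end = int(f[6]), int(f[7])
        # exonStarts/exonEnds are comma-separated lists with a trailing comma.
        starts = [int(x) for x in f[9].split(",") if x]
        ends = [int(x) for x in f[10].split(",") if x]
        for start, end in zip(starts, ends):
            if coding_only:
                # Clip each exon to the coding span, dropping UTR sequence.
                start, end = max(start, cds_start), min(end, cds_end)
            if start < end:
                exons[strip_chr(chrom)].append((start, end, gene))
    return exons


def count_exonic_variants(vcf_lines, exons):
    """Count known (rsID) and novel variants that overlap any interval in `exons`."""
    counts = {"known": 0, "novel": 0}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, var_id, ref = line.split("\t")[:4]
        start = int(pos) - 1    # VCF POS is 1-based; BED is 0-based
        end = start + len(ref)  # span covered by the REF allele
        intervals = exons.get(strip_chr(chrom), [])
        if any(s < end and start < e for s, e, _ in intervals):
            counts["known" if var_id.startswith("rs") else "novel"] += 1
    return counts
```

The linear scan over intervals is chosen for readability only; for whole-chromosome data a sorted or indexed interval structure, or intersectBed itself, would be far faster.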
I used the NHGRI SeattleSeq Annotation tool to annotate the SNPs in the exonic regions of all the chromosomes, for both known and novel SNPs. The missense and nonsense SNPs were of particular interest, and I have plotted them in the graphs.
The total number of samples being sequenced is 2,500: 500 from each of five ancestry groups (European, Americas, East Asian, West African, and South Asian).
Interpreting these characteristics across the chromosomes can give an overview of the properties of the chromosomes and of the diseases related to them.