Journal Club: Indels in 179 genomes (1000 Genomes data)

The origin, evolution and functional impact of short insertion-deletion variants identified in 179 human genomes

Genome Research

Finally, there is a comprehensive analysis of indels, and of course it is Next Generation Sequencing data that is driving it. I have my concerns about the biases of NGS technology and analysis, along with the ensuing false positives in indel detection. Nonetheless, the authors have done a good job of summarizing the information, touching upon the important points and making some valuable observations. It would be great to see this analysis repeated on the public Complete Genomics genomes or the growing body of Ion Torrent data, to confirm that the findings are general and not specific to one sequencing platform or analysis pipeline.

  • Dataset used = 179 low-coverage (~4x) genomes from the 1000 Genomes pilot data, covering 3 populations
  • 1.6 million indels – 50% of them in 4% of the genome (indel hotspots)
  • Polymerase slippage accounts for about 75% of indels (almost all indels in hotspots and 50% of indels in non-repeat regions are due to slippage)
  • indels are subject to stronger purifying selection than SNVs (the authors call them SNPs)
  • recombination hotspots that are known to be enriched with SNVs are not enriched with indels
  • longer and frameshift indels have stronger effect on fitness
  • indels on average have a stronger functional effect than SNVs
  • Method
    • STAMPY: aligner with high sensitivity and low reference bias
    • DINDEL genotyper: Use alt-supporting reads to select high quality indels
    • build implied haplotypes (LD between SNVs and indels, then impute) and an error model for homopolymers
    • ignore indels in long (>10bp) homopolymers
    • validate with Sanger sequencing
  • the 1.6 million indels are 8-fold fewer than the SNVs called from these genomes
  • selected novel indels (seen neither in the 1000 Genomes report nor in dbSNP129)
  • chose 2 CEU as validation targets and sampled calls predicted to segregate in them
  • randomly selected a subset; able to design primers for 111; 60 Sanger sequenced
  • 36 matches; 12 low-quality Sanger traces; 12 discordant => 25% FDR for this novel set (12 of the 48 interpretable calls; 4.6% total FDR)
  • INDEL classes
    • Homopolymer Run (6nt+) – HR – 10-fold indel enrichment compared to genomic average (even higher if include longer homopolymers)
    • Tandem Repeat – TR – 20-fold indel enrichment
    • Predicted hotspot – PR – predicted indel rate > predicted SNV rate
    • Non-repetitive sites – NR
    • change in copy-number count – CCC – NR-CCC & NR non-CCC
  • HR + TR + PR = 4% of the genome (hotspot) with 50% of indels – deletions dominate in short tracts, insertions in longer tracts, and deletions again in much longer tracts
  • 100-fold increase in polymorphism rate going from 4-bp homopolymer to 8-bp
  • 25% of indels are not due to polymerase slippage, and these are mostly NR non-CCC – mostly deletions (about 90%) – perhaps due to formation of a double-stranded break intermediate and imperfect repair
  • the remaining 2.5% (insertions) most often involve a palindromic repeat
  • 43 genes with a high individual predicted mutation rate in coding regions – 10 of those do not show SNV enrichment, so their high mutational load comes from indel enrichment alone – includes HTT (Huntington's disease), AR (prostate cancer), ARID1B (neurodevelopmental), MED and MAML genes
  • GWAS: common indels are well tagged by SNVs – possible to phase indels into SNV haplotype reference panels
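
A few of the figures in the notes above follow from simple arithmetic, so here is a minimal Python sketch that re-derives them. The function names are my own, and one assumption is baked in: the 12 low-quality Sanger traces are excluded from the FDR denominator, which is the only reading of the counts that gives 25%. The 4.6% total FDR is not reproduced, since it depends on the share of novel calls in the full call set, which these notes do not give.

```python
def false_discovery_rate(confirmed: int, discordant: int) -> float:
    """FDR among calls with an interpretable Sanger result.

    Assumption: low-quality traces are left out of the denominator.
    """
    return discordant / (confirmed + discordant)


def fold_enrichment(variant_fraction: float, genome_fraction: float) -> float:
    """Variant density in a region relative to the genome-wide average."""
    return variant_fraction / genome_fraction


# Validation of novel indels: 36 confirmed, 12 discordant (12 low-quality dropped).
novel_fdr = false_discovery_rate(confirmed=36, discordant=12)

# Hotspots: 50% of indels sit in 4% of the genome; the rest sit in the other 96%.
hotspot_enrichment = fold_enrichment(0.50, 0.04)
background_enrichment = fold_enrichment(0.50, 0.96)

# Non-slippage indels are 25% of the total; about 90% of those are deletions,
# so the non-slippage insertions are the remaining 10% of that 25%.
non_slippage_insertions = 0.25 * (1 - 0.90)

print(f"novel-call FDR: {novel_fdr:.0%}")
print(f"hotspot enrichment vs genome average: {hotspot_enrichment:.1f}x")
print(f"hotspot vs non-hotspot density: {hotspot_enrichment / background_enrichment:.0f}x")
print(f"non-slippage insertions: {non_slippage_insertions:.1%} of all indels")
```

The hotspot figure is an average over HR, TR and PR sites combined, so it is consistent with the 10-fold (HR) and 20-fold (TR) per-class enrichments rather than a replacement for them.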