R Line Plots – the easiest fastest plot ever

What you need:

  • Data
  • R
  • Eager Bioinformatician

For data I used a file with 7 columns where the first column was a counter and the other columns (2-7) had different values that I would like to compare across using a line plot to visualize the variation in my data points

Serves: As many data points you would like to extend it to

Time: Once you have a parsed dataset, this is fast: you wouldn’t want to blink

Plotting lines with a legend 🙂

> x <- read.table(“filename”);

> png (“filename.png”)

> plot (x[,2], type=”l”, col = “steelblue”, ylab=”Heading”)

> lines (x[,3], col = “pink”)

> lines (x[,4], col = “cyan”)

> lines (x[,5], col = “magenta”)

> lines (x[,6], col = “green3”)

> lines (x[,7], col = “blue”)

> legend(11000, 2, c(“124-100”, “124-50”, “124-20”, “124-10”, “124-5”, “124-2”), lty=c(1,1), lwd=c(2.5,2.5), col=c(“steelblue”,”pink”,”cyan”,”magenta”,”green3″,”blue”))

> dev.off()

Feel free to change/add colors* and serve embedded within your document/ppt!

* You can find a detailed list of color names in R here: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf

How recent are rare variants?

The Department of Genome Sciences at the University of Washington in Seattle, in a multi-institutional effort sequenced 15,336 genes for the NHLBI sponsored ESP project from a total of 6515 individuals of European American and African American descent.To identify approaches for disease-gene discovery, its important to understand evolutionary history of homosapiens and identify the age of mutations.

The group estimated 73% of all protein-coding SNV’s and around 86% of all SNVs predicted to be deleterious are a recent change within 5,000-10,000 years. European Americans had an excess of deleterious variants and had weaker purifying selection and that was explained with the out-of-africa model.

The gist you ask: rare variants have an important role in heritable phenotypic variation, disease susceptibility and adverse drug responses. The increasing population size has not had enough turn around time for selection to act upon, its only been 200-300 generations since these mutations came to be. Now this increase in mutations results in more Mendelian disorders and has increased the allelic and genetic heterogeneity of traits.

Though if there’s a positive side to it, it may as well be that we as people have created a new repository of advantageous alleles that have come into being fairly recently and hopefully evolution will act upon in subsequent generations

Journal Club: False-positive signals in exome sequencing

Detecting false-positive signals in exome sequencing

Human Mutation

I cannot believe that this paper is already a year old. There was a printed copy on my desk, but never got transmitted from the eyes into the brain!! Finally, there was enough time to review the paper and collate all the valuable information to share here.

Whole Exome Sequencing (WES) is fast becoming the most common NGS application. It allows querying almost all of the coding genome (the 3% of 3 billion nucleotides that we understand most about) at a relatively low cost and time investment. Looking up any list of sequencing papers of note, the most common title is “Exome sequencing identifies the causal variant for XYZ“. However, we know about the small but omnipresent spurious results that are part of the WES data. This article does a great job at elucidating the common false positives and sources of noise in WES data.

    • 118 WES samples from 29 families seen by NIH Undiagnosed Diseases Program
    • 401 additional exomes from ClinSeq study for cross-check
    • Agilent 38Mb and 50Mb all exome capture kits; GA-IIx 76 and 100bp paired-end
    • Method: ELAND -> Cross_Match -> bam2mpg genotype -> CDPred prediction -> VarSifter -> Galaxy
    • Used hg18; No duplicate removal
    • False-positive candidate variants are usually
      • located in highly polymorphic genomic region
      • caused by assembly misalignment
      • error in the reference genome
    • 23,389 positions with excess heterozygosity (alignment error)
    • 1009 positions where reference genome contains the minor allele (excess hom.)
    • Errors arise from – library construction bias; polymerase error; higher error rate towards end of short reads; loss of synchrony within a cluster (Illumina sequencing); platform specific mechanistic issues
  • Highly Variable Genes – frequently contain numerous pathogenic variants, thus unlikely to be disease causing (gene with >10 high quality variants; should normalize by gene length and where in the CDS variants were found)
  • (Pseudo genes) 392 high quality variants were heterozygous in all 118 exomes

Similar reading:
PLOS ONE: Limitations of the Human Reference Genome for Personalized Genomics

Journal Club: Indels in 179 genomes (1000genome data)

The origin, evolution and functional impact of short insertion-deletion variants identified in 179 human genomes

Genome Research

Finally there is a comprehensive analysis on indels, and of course it is the Next Generation Sequencing data that is driving it. I have my concerns with the biases of NGS technology and analysis along with ensuing false-positives in indel detection. Nonetheless, the authors have done a good job in summarizing the information and touching upon the important points making some valuable observations. It would be great to see this comprehensive analysis repeated on the public Complete Genomics genomes or the increasing Ion Torrent data to corroborate these findings as generic and not specific to any variables.

  • Dataset used = 179 (~4x coverage) genomes from 1000 genomes pilot data of 3 populations
  • 1.6 million indels – 50% of them in 4% of the genome (indel hotspots)
  • Polymerase slippage is the main cause of 75% of indels (almost all indels in hotspots and 50% indels in non-repeat regions are due to slippage)
  • indels subject to stronger purifying selection than SNVs (they call it SNPs)
  • recombination hotspots that are known to be enriched with SNVs are not enriched with indels
  • longer and frameshift indels have stronger effect on fitness
  • indels on average have a stronger functional effect than SNVs
  • Method
    • STAMPY: aligner with high sensitivity and low reference bias
    • DINDEL genotyper: Use alt-supporting reads to select high quality indels
    • build implied haplotypes (LD betw SNV/indel and impute) and error model for homopolymers
    • ignore indels in long (>10bp) homopolymers
    • validate with sanger
  • the 1.6 million indels are 8-fold lower than SNVs from these genomes
  • selected novel indels (not seen in 1000 genomes report not dbSNP129)
  • chose 2 CEU as validation targets and sampled calls predicted to segregate in them
  • randomly selected a subset; able to design primers for 111; 60 sanger sequenced
  • 36 matches; 12 low-Q sanger; 12 discordant => 0.25% FDR for this novel set (4.6% total FDR)
  • INDEL classes
    • Homopolymer Run (6nt+) – HR – 10-fold indel enrichment compared to genomic average (even higher if include longer homopolymers)
    • Tandem Repeat – TR – 20-fold indel enrichment
    • Predicted hotspot – PR – predicted indel rate > predicted SNV rate
    • Non-repetitive sites – NR
    • change in copy-number count – CCC – NR-CCC & NR non-CCC
  • HR + TR + PR = 4% of the genome (hotspot) with 50% of indels – deletions dominate short tracts, insertions longer tracts, and then del again for much longer tracts
  • 100-fold increase in polymorphism rate going from 4-bp homopolymer to 8-bp
  • 25% indels not due to polymerase slippage mostly NR non-CCC – mostly deletions (about 90%) – perhaps due to formation of double-stranded break intermediate and imperfect repair
  • the remaining 2.5% insertions most often involve palindromic repeat
  • 43 genes with high individual predicted mutation rate in coding regions – 10 of those do not show SNV enrichment and thus have exclusive indel enrichment to cause high mutational load – includes HTT (huntington), AR (prostrate cancer), ARID1B (neurodevelopmental), MED and MAML genes
  • GWAS: common indels are well tagged by SNVs – possible to phase indels into SNV haplotype reference panels

AGBT 2013 Saturday sessions

Plenary Session: Genomic Technologies
Len Pennacchio, Lawrence Berkeley National Laboratory, Chair

— could not take notes on some of the talks and afternoon session

9:00 a.m. – 9:30 a.m.
Rebecca Leary, Johns Hopkins Kimmel Cancer Center
“Personalized Approaches to Non-invasive Cancer Detection”

– personalized analysis of rearranged ends (PARE)-identify structural alterations in solid tumors
– generate personalized biomarkers for the detection of circulating tumor DNA
– Tumor-derived mate-pair library -> somatic rearrangements -> confirmed by PCR in tumor & matched normal
– Application = monitor disease progression, identify residual disease (predict relapse), surgical margins
– Plasma Aneuploidy Score – clearly differentiates normals from colorectal cancer samples (just 10x physical coverage – detect rearrangements)
– 0.75% circulating tumor DNA – 90%+ sensitivity, 99%+ specificity using 1 HiSeq lane

9:30 a.m. – 9:55 a.m.
* Eric Antoniou, Cold Spring Harbor Laboratory
“Increased Read Length and Sequence Quality with Pacific Biosciences Magbead Loading System and a New DNA Polymerase”

– duckweed as Biofuel (40tonnes/acre/yr), .1 ton yields .025tons of ethanol by weight and is ~7.5 gallons a day
– rice genome (470 Mbp) sequenced using the Pacific Biosciences RS sequencer (MagBead loading system) – hybrid de novo assembly with Illumina data
– 10kbp insert library; 9X coverage of the rice genome (mean read length – 3kb, max 21kb)
– mean accuracy mode of single pass long read – 90%, (85-87% for current C2 chemistry)

9:55 a.m. – 10:20 a.m.
* Tim Harkins, Life Technologies
“Ovarian Cancer Evolution: a Tale of Two Paths”

– ovarian cancer 9th leading cancer among women, 5th leading cause of cancer related death, high relapse rate

10:45 a.m. – 11:10 a.m.
* X. Sunney Xie, Harvard University
“Detecting Single Nucleotide and Copy Number Variations of a Single Human Cell by Whole Genome Sequencing”

– Individual cells of identical descent can have different genomes (dynamic changes in DNA) – important to many biological investigations and medical diagnoses
– Single-cell whole-genome amplification methods – exponential amplification bias => low genome coverage
– Multiple Annealing and Looping Based Amplification Cycles (MALBAC) – 93% genome coverage ≥ 1x for a single human cell at 30x mean sequencing depth
– detection of digitized CNV & SNVs – ~76% efficiency for a single cancer cell
– 2.5 single-base substitutions per mitosis in human tumor cell line identified using single cell amplification/sequencing
– circulating tumor cells (CTCs) of same patient show similar CNV; CTCs of lung cancer patients show similar CTC
– clinical trial for pre-implantation genomic screening for IVF using single polar bodies of oocytes
– male’s genome can be phased by seq sperm, female’s genome phased using polar bodies genomes
– 0.1X genome coverage is enough to determine aneuploidy (at 8-cell stage) for MALBAC’s single-cell sequencing in IVF
– anomalous transition/transversion ratio for newly acquired SNVs

11:10 a.m. – 11:35 a.m.
* Jeremy Schmutz, HudsonAlpha Institute
“Evaluating Moleculo Long Read Technology for de novo Whole Genome Sequencing”

– Moleculo Long Read technology – sequencing two complex plant genomes (inbred diploid switchgrass comparator Panicum hallii (600 Mb) and the outbred tetraploid Miscanthus sinensis (~2.3 Gb)
– incldue long, retrotransposon-derived repeats, diverse GC-content and present significant challenges for short-read NGS whole genome shotgun sequencing
– Moleculo reads – 10kb reads (5kb avg), high accuracy (1.26bp error/10k), tunable to genome size/complexity, reduces computational complexity
– limitations = distribution of reads depends on local repetitive content & global repeat freq; illumina based => localized chemistry issues; some amplification bias

11:35 a.m. – 12:00 p.m.
* Jonas Korlach, Pacific Biosciences
“Automated, Non-Hybrid De Novo Genome Assemblies and Epigenomes of Bacterial Pathogens”

AGBT 2013 Friday sessions

Plenary Session:  Genomic Studies II
John McPherson, Ontario Institute for Cancer Research, Chair

9:00 a.m. – 9:30 a.m.
Steve Scherer, The Hospital for Sick Children
“Whole Genome Sequencing Analysis in Autism”

– Autism Spectrum Disorder (ASD) – high heritability, familial clustering & ~4:1 male to female bias (as many candidates on X-chr)
– 100+ risk genes, ~10 not present on the capture
– WGS (at BGI, >30x) on ASD families; need for better indel callers (indel validation rate ~20%, SNV validation rate >90%)
– better and more uniform X chr and splice site coverage in WGS compared to WES
– also mentions PGP-Canada

9:30 a.m. – 10:00 a.m.
Jay Shendure, University of Washington
“Tackling Genetic Heterogeneity with Massive Multiplexing and Molecular Counting”

Missed out on the talk, but here is an older slide-deck from Shendure which covers most of the stuff presented

10:00 a.m. – 10:30 a.m.
* Gabe Rudy, Golden Helix@gabeinformatics
“Home-Brewed Personalized Genomics: The Quest for Meaningful Analysis Results of a 23andMe Exome Pilot Trio of Myself, Wife, and Son”

– $999 80x exome for the trio, mother with clinically-diagnosed idiopathic rheumatoid arthritis
– 75bp PE, SureSelect capture, BWA/GATKdeliver BAM, VCF, PDF Summary report
– goals = variant call accuracy from NGS, usefulness of 23andme risk variants, usefulness of healthy person’s exome, potential to find driver variants and genes for diagnosis
– 3 Mendel errors, usually due to technical biases (eg mom and dad had non-ref nucleotide messing up child’s genotype)
– 8000 phantom variants (some GATK bug in that version)
– Ingenuity Variant Analysis performed on the exome trio data – look for rare variants within 1-hop of JIA gene

—- Illumina User Meeting Dispatch newsletter

11:00 a.m. – 11:30 a.m.
Mark Yandell, University of Utah
“VAAST: A Probabilistic Disease-gene Finder for Personal Genomes”

VAAST substantially improves upon existing approaches in terms of statistical power, flexibility and scope of use
– identify rare-disease causing loci using single trios of family members, and in small cohorts (n=3) where no two individuals share the same deleterious variants
– also identify genes involved in common, complex diseases using many fewer cases than traditional GWAS
– working to integrate indels, CNV and SV into VAAST, along with pedigrees, and non-human projects (piegeonomics)

11:30 a.m. – 12:00 p.m.
* Agnes Viale, Memorial Sloan Kettering Cancer Center
“RNA-sequencing Analysis Identifies Novel Leukemic Pathways in a Genetically Accurate Model of Acute Myeloid Leukemia”

Bronze Sponsor Workshops
Chad Nusbaum, Broad Institute of MIT and Harvard, Chair

Line-up of all the vendor talks – @PerkinElmer @iontorrent @NuGENInc @illumina @BCILifeSciences @QIAGEN @PacBio @dnanexus

1:40 p.m. – 2:00 p.m.
NuGen Technologies, Inc., Christine Malboeuf, Broad Institute of MIT and Harvard
“Viral RNA Genome Sequencing of Ultra-Low Copy Samples using NuGen’s Ovation RNA-Seq”

– 5pg of RNA is in human cell; ultra-low rna = 5fg (1000 copies) to 5 ag = amount of viral rna and does not work well with qPCR, etc
– Challenges – low quantity, host contamination, diversity (high mutation rate), technological and extraction process
– Ovation rna-seq v2 protocol from NuGen (500pg to 100ng input RNA) – low contamination
– West Nile virus – 50fg input 5M reads 31% map to virus, 48% map to host, covering 100% of viral CDS
– Dilutions starting with lesser material generated reproducible coverage profiles
– HIV – 50fg input rna – 5M reads, 69% viral aligned reads 5% host aligned, covering 100% CDS
– lesser copies of input rna meant 1-2% reads mapping to virus 30-40% mapping to host, but covered ~97% CDS with reproducible coverage profile
– process worked on samples that failed RT-PCR-454 process; method applicable to many other viral sample types (300-75k viral copies)
– applications: surveillance of endemic/emerging viral pathogens; co-infection of multiple viruses; pathogen discovery (viral parasite bacterial fungal)

Concurrent Session: Computational Biology
Mike Zody, Broad Institute of MIT and Harvard, Chair

7:30 p.m. – 7:50 p.m.
* Mark DePristo, Broad Institute of MIT and Harvard
“Overcoming Today’s Limitations in Sequencing Technology for Human Medical Genetics”

– have sequenced 40k+ samples to date from the common (Diabetes, Autism, and Heart Disease) to the uncommon/rare (Crohn’s and Mendelian disorders)
– Variation among individuals in a population – 90% SNPs 10% indels; disease-causing variation, particularly rare diseases, SNP and indel approach 50% / 50%
– indels remain an outstanding challenge; technical and analytic reasons
– PCR-free libraries improve variant calling sensitivity & specificity
– nice visual example of data looking clean with almost everything matching reference with one SNP and some noise calls; actually a het indel!
– better error models and longer reads improve sensitivity to true indels
– sample size is a huge limitation to better calling; but the ensuing massive data aggregation becomes a challenge as well

7:50 p.m. – 8:10 p.m.
* Andrew Farrell, Boston College
“Reference-free Approach for Mutation Detection”

– De novo assembly is prohibitively expensive for most labs – deep read coverage and massive computing power
– practical approach = reference guided alignment; dependent on three factors – reference accuracy, mapper’s ability to correctly place read (uniquely), degree to which a variant allele differs from reference (indels)
– developed a novel completely reference-independent method – no mapping or de novo assembly of the genome; directly compares raw sequence data from two or more samples, and identifies groups of reads unique to a sample
– tested on small genomes but will tackle human (incl. tumor) genomes, metagenomes, transcriptomes

8:10 p.m. – 8:30 p.m.
* James Knight, 454 Life Sciences
“Assembling Human Sequence into Genomes”

8:30 p.m. – 8:50 p.m.
* Aaron Quinlan, University of Virginia
“LUMPY: A Probabilistic Framework for Structural Variant Discovery and Genomic Data Mining”

– structural variation (SV) needs integration of multiple alignment signals – read-pair, split-read and read-depth
– most existing SV discovery approaches utilize only one signal; poor at low sequence coverage and for smaller SVs (Hydra, DELLY, GASVPro)
– LUMPY = extremely flexible probabilistic SV discovery framework – integrates SV detection signals from read alignments or prior evidence
– 4k simulated SV – 1k each deletion, duplication, insertion, inversion – 2x, 5x, 10x, 20x coverage
– potential for a unified variant calling framework and probabilistic analyses of diverse genomic interval datasets (ENCODE)

8:50 p.m. – 9:10 p.m.
* Jeffrey Reid, Baylor College of Medicine
“Discovery of Mobile Element Variation in Ultra-deep Whole Genome Data”

9:10 p.m. – 9:30 p.m.
* Michael Schatz, Cold Spring Harbor Laboratory
“Assembling Crop Genomes with Single Molecule Sequencing”

AGBT13 – Thu 2/21 Evening (Genomic Medicine) session

Concurrent Session: Genomic Medicine
George Grills, Cornell University, Chair

7:30 p.m. – 7:50 p.m.
* Olivier Elemento, Weill Cornell Medical College
“Clonal Architecture and Tumor Evolution in Diffuse Large B Cell Lymphomas”

– DLBCL – 25k patients / yr in US; treated with immunochemotherapy (60% cured, 40% relapse)
– HTS (PCR + MiSeq) of VDJ junctions could reveal clonal architecture of DLBCL and tumor evolution
– Studied 18 diagnosis-relapse pairs – 90% of relapse tumors are clonally related to primary original tumor (same VDJ junctions)
– 2 major evolutionary scenarios

  • linear evolution of relapse disease through major diagnosis clone – 65% cases
  • single, small, divergent, chemoresistant clone at diagnosis gives rise to the relapse tumor – 25% cases

– VDJ seq + WES reveals founder mutations and relapse-specific mutations, some clincally actionable

7:50 p.m. – 8:10 p.m.
* Gustavo Glusman, Institute for Systems Biology
“Multi-genome Analysis: a Crucial Tool for Clinical Genomics”

http://familygenomics.systemsbiology.net/
– Example of their Miller syndrome family of 4 with affected kids and unaffected parents
– Complete Genomics WGS on 600+ individuals – custom workflows + Ingenuity
– HaploScribe for variant phasing, autozygosity for IBD analysis and use Kaviar, FAVA, CMS
– discovery of disease-causing variants
– Dataset for modeling systematic failures and biases in technology, eg.

  • Median coverage profiles – normalization/scaling for CNV analysis
  • 30k commonly mutated segments available in FAVA (includes regions from black-listed genes like titin, mucins, etc)

8:10 p.m. – 8:30 p.m.
* Malachi Griffith, Washington University School of Medicine
“Clinical Cancer Sequencing and Integrated Analysis of Whole Genomes, Exomes and Transcriptomes”

– DNA + RNA-seq for tumor + matched normal of 16 cancers (1 ALL, 4 AML, 6 breast, 1 lung, 4 pancreatic)
– timeline 4 weeks (1wk – tumor/blood isolation, 1wk QC/sequencing, 1wk analysis, 1wk++ all other analysis)
– 30-60x WG coverage, 150x exome coverage for tumor/normal
– All info from these applications complements each other, eg. pancreatic exomes provide much more info because of impure samples and need for really high coverage – not possible with genomes
– Combining WES & WGS = higher sensitivity of clincally relevant tumor associated mutations – in sub-clones and low-purity tumors
– Clin-Seq pipeline & Drug-gene interaction database (DGIdb)
– bimodal non-reference read proportion curve => simple clonal burst history

8:30 p.m. – 8:50 p.m.
* Alexander Hoischen, Radboud University Medical Centre Nijmegen, Department of Human Genetics
“De novo Mutations in Embryonic Development and Early Lethality”

– importance of de novo mutations for rare & common sporadic disorders by the direct detection of de novo mutations WES
– genetics of congenital malformations, intellectual disability, and clinically defined syndromes
– early lethality may lead to mis-detection of mendelian disease genes
– 14 fetuses with hypoplastic left heart (HLH), 14 fetuses with neural tube defect and Arnold-Chiari malformation and 10 fetuses with multiple malformations
– 80x coverage, 200 private (non-syn) variants per case

8:50 p.m. – 9:10 p.m.
* Dagan Wells, University of Oxford
“Rapid Genetic Analysis of Single Cells using a Next Generation Sequencing Method: Application to Human Embryos Reveals Aneuploidy and Mutations of the Nuclear and Mitochondrial Genomes”

– In vitro fertilization (IVF) is very inefficient (30%) – choose from several embryo – subjective choice on morphological grounds
– 85% transferred embryos do not implant; increasing age implies higher chance of aneuploidy (20% at 30yr, 75% at 40yr)
– aneuploidy in embryo – principal cause of a range of clinical problems (congenital abnormalities, mental retardation and miscarriage)
– pre-implantation diagnosis/screening (PGD/PGS)
– whole genome amplification followed by next generation sequencing (Ion Torrent) – custom analysis pipeline
– 52k-110k 200bp reads/sample; variable # of reads per chr, but reproducible; so easy to normalize using normals
– correctly diagnosed abnormalities in 10/10 cells from aneuploid cell lines and in 45/45 embryos donated for research – aCGH validataion
– also identified cystic fibrosis in 5/5 single cells from an affected individual and a heteroplasmic mitochondrial DNA mutation

9:10 p.m. – 9:30 p.m.
* Leonard Lipovich, Wayne State University
“Beyond ENCODE: Primate-specific Long Non-coding RNA Genes are Functional in Human Brain and Cancer”

— had to get to drinks, so over n out!!