The Ebola outbreak of 2014 (Contd.)

Following up on the previous post, here is some more detail on the recent Science paper, along with a round-up of “what do we know, what have we learned” thus far.

The Gire et al paper in Science was a huge amount of work and a giant collaborative research effort. Being a computational biology researcher, I appreciate their in-depth and detailed evaluation utilizing numerous bioinformatics software tools. Combing through the supplemental text, I created the flowchart below as a summary of all the analysis that went into the eventual results and interpretations. This was created using the wonderful Gliffy tool.

Flowchart of the impressive work accomplished by the Gire et al Science paper (I made this using notes from their Supplemental Data)

A slew of articles summarizing the recent Science paper came out as the hype caught on around the WHO's warning that the current outbreak could reach 20,000 people. This is huge!! Peter Piot, who co-discovered the Ebola virus during the 1976 outbreak, never imagined an outbreak like this, but is confident that ‘high-income countries’ will do just fine.

The Broad Institute and Harvard University worked with the Sierra Leone Ministry of Health and Sanitation, along with other researchers, to produce the comprehensive Science paper describing the sequencing of current Ebola genomes. Meanwhile, the human trial of NIH and GSK’s investigational Ebola vaccine is to begin this week, as the vaccine performed well in primate studies. Hopefully this one will be faster than the usual 10-year turn-around observed for a vaccine trial. Although the experimental drug ZMapp is being used on patients, the results are mixed and much more still needs to be done. Interestingly, the drug is a cocktail of three mouse-derived monoclonal antibodies, and the primate research itself was published in Nature last week. The details of how it seemed to have worked on the 2 US health care workers in the midst of this outbreak make for a pretty ‘miraculous’ story!

The major points to note thus far:

  • First Ebola Virus Disease case of 2014 in Sierra Leone confirmed on May 25
  • It seems there was a single transmission of EBOV from the ‘natural reservoir’ to humans, and the virus has since spread from human to human (implying that further transmission from non-human sources is rare, though possible)
  • Substitution rate within the 2014 outbreak is roughly twice as high as between outbreaks, implying that continued progression of this epidemic could allow viral adaptation, thus the need for rapid containment
  • The 2014 outbreak has a doubling period of about 35 days!!
  • Complicating matters, positive diagnosis for malaria does not necessarily rule out Ebola Virus Disease
  • Senegal just became the 5th West African country with a confirmed case of Ebola
  • Breaking News: samples from the Ebola outbreak in Congo (DRC) were found to come from a distinct and independent transmission event, likely via bushmeat consumption!
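To put the doubling period quoted above in perspective, here is a minimal sketch of what unchecked exponential growth with a 35-day doubling time would look like. The starting count of 100 cases is purely hypothetical, and the function name is my own; real epidemics do not grow at a constant rate.

```python
def projected_cases(initial_cases: float, days: float, doubling_days: float = 35.0) -> float:
    """Case count after `days` of unchecked growth with a fixed doubling time."""
    return initial_cases * 2 ** (days / doubling_days)


if __name__ == "__main__":
    # Hypothetical starting point of 100 cases, for illustration only.
    for days in (35, 70, 105, 140):
        print(f"day {days:3d}: {projected_cases(100, days):.0f} cases")
```

Every 35 days the count doubles, so roughly four months of uncontained spread would mean a sixteen-fold increase under these assumptions.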

Hopefully this is contained sooner rather than later…

Solving the Ebola Virus Genome and Identifying Possible Diagnosis

If you have ever played the card game Killer Bunnies and your Bunnies in the Bunny circle have died because of the Level 11 Weapon of Ebola Virus, you want to read this.


The research community is making strides to understand whether the virus is adapting to its host or changing as it spreads through different populations, as more countries in West Africa are drawn into its path.

5 of the 50 co-authors on this Science article were infected with the deadly Zaire Ebola Virus (EBOV) themselves. Nothing short of a thriller, the events trace back to the funeral of a healer, which kick-started the spread of Ebola in the region. Also reviewed here is a paper from 2008 in which the authors point to the VP35 protein, which their experiments identified as a critical component of this hemorrhagic fever.

Ebola’s genomic sequence:

  • Linear, single-stranded genome
  • Inverse-complementary 3′ and 5′ termini
  • ~19 kb (19 thousand nucleotides long, compared to ~3 billion in the human genome)
  • Seven genes (compared to ~20k in humans)

The current outbreak is due to EBOV, one of the five Ebola viruses known to infect humans. Research groups are trying to identify whether the genetic sequence of this virus is changing fast enough in regions that are key for the accuracy of the PCR-based diagnostic tests.

The EBOV in the 2014 epidemic has been reported to be 97% similar to the virus that first emerged in 1976. Articles across the web estimate that EBOV evolves at about 7×10⁻⁴ substitutions per site per year, suggesting that the current strain of EBOV would have accumulated many substitutions over the roughly 40-year period since 1976.
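As a back-of-envelope check of those numbers, the sketch below multiplies the quoted rate by the ~19 kb genome length and the years elapsed since 1976. It assumes a constant rate along a single lineage and ignores repeated hits at the same site, so treat it as a rough illustration only; the function names are mine.

```python
RATE_PER_SITE_PER_YEAR = 7e-4   # estimate quoted in the post
GENOME_LENGTH = 19_000          # ~19 kb genome, as listed above
YEARS = 2014 - 1976             # 38 years between the two outbreaks


def expected_substitutions(rate: float, sites: int, years: float) -> float:
    """Expected substitutions genome-wide under a constant per-site rate."""
    return rate * sites * years


def expected_identity(rate: float, years: float) -> float:
    """Expected fraction of unchanged sites, ignoring multiple hits per site."""
    return 1 - rate * years


if __name__ == "__main__":
    subs = expected_substitutions(RATE_PER_SITE_PER_YEAR, GENOME_LENGTH, YEARS)
    ident = expected_identity(RATE_PER_SITE_PER_YEAR, YEARS)
    print(f"~{subs:.0f} substitutions, ~{ident:.1%} identity")
```

Under these assumptions the estimate lands near 500 substitutions and about 97% identity, which is in the same ballpark as the similarity figure quoted above.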

In this article, Gire et al use genomic data from next-generation sequencing technologies to infer whether the virus is accumulating significant mutations as it changes hosts.

  • Methods compared to ascertain choice for sequencing:
    • Library preparation: Nugen and Nextera
    • Sequencing instruments: PacBio and Illumina
  • Nextera and Illumina provided most complete genome assembly and intrahost SNV identification
  • 99 virus genomes from 78 patients in Sierra Leone, sequenced at a median coverage >2,000x across 99.9% of EBOV coding regions
  • Intra- and interhost genetic variation used to characterize transmission patterns
  • 341 fixed substitutions identified between previous and 2014 EBOV
    • 35 nonsynonymous, 173 synonymous, 133 noncoding
  • 55 single nucleotide polymorphisms (SNPs) among this West African outbreak
    • 15 nonsynonymous, 25 synonymous, 15 noncoding
  • Genetic similarity across sequenced 2014 samples suggests a single transmission from the natural reservoir


Josh Herr of Michigan State University and Daniel Park of the Broad Institute aim to maintain an analysis wiki (ebola-crowdsource) for solving the underlying genomic riddle by studying the different strains of the virus, and are encouraging contributors.


In an earlier paper published in the Journal of Virology in 2008, Hartman et al discuss how whole-genome expression profiling reveals that the inhibition of the host's innate immune response by Ebola virus can be reversed by a single amino acid change in the VP35 protein.

  • Two Ebola virus strains generated by reverse genetics
    • Encode wild-type VP35 protein or VP35 with an arginine (R)-to-alanine (A) amino acid substitution at position 312
  • Whole-genome expression profiling of infected human liver cells
  • Host cells reveal differences in response to introduction of these viruses differing by a single amino acid
  • VP35 protein plays a vital role in inhibiting immune responses of the host
  • Single amino acid change exhibits the ability to eliminate this inhibitory effect
  • VP35 Protein demonstrates a critical role in the severity of the disease


Dr. Lipkin, professor of epidemiology at Columbia University, discusses the pertinent question of whether ‘Ebola can travel to the United States’. He explains in a matter-of-fact way that although the virus could travel to the US like anywhere else, there’s also a high likelihood of it being detected and isolated by health authorities as early as possible.

Let's get to a round of that card game now, shall we?

Mitochondrial Gold Rush

Mitochondrial genomes can be extracted from Whole Exome Sequencing (WES) data, as outlined by this paper in Nature Methods by Ernesto Picardi and Graziano Pesole. Tools like MitoSeek are now available that gather mitochondrial read sequences from NGS data and perform high-throughput sequence analysis. Availability of mitochondrial genomes is important, as genomic variation in mitochondria has been implicated in a variety of neuro-muscular and metabolic disorders, along with roles in aging and cancer.

Here, however, we ask how effectively mitochondrial genomes can be extracted from data generated with the different capture kits used for WES. Picardi et al used the MitoSeek tool to successfully assemble 100%, 95% and 72% of the mtDNA genome from the TruSeq (Illumina), SureSelect (Agilent) and SeqCap EZ-Exome (NimbleGen) platforms, respectively. We set out to assess mitochondrial genome data extraction using a different approach and tool-set. Using the same sample’s dataset from three different capture kits, with Whole Genome Sequenced (WGS) data as the gold standard, we evaluated alignment and variant-calling results.

Clark et al sequenced and analyzed a human blood sample (healthy, anonymous volunteer) at Stanford University using three commonly used WES kits:

  1. Agilent SureSelect Human All Exon kit
  2. Nimblegen SeqCap EZ Exome Library v2.0
  3. Illumina TruSeq Exome Enrichment

An Illumina HiSeq instrument was used for WGS and all three WES capture kits. Clark et al highlight comparisons between the three capture kits, from library preparation to sequencing time. The paper discusses the effectiveness of each of these kits based on metrics such as baits, capture of UTR regions, etc. They compare variant calls across all three WES kits and WGS and discuss the ability of WES to detect additional small variants that were missed by WGS. Although this paper doesn’t provide an in-depth instrument comparison, the readers here assume that Illumina is the leader in sequencing technology (at least until tonight!).

We use this data set to compare and contrast the availability and quality of mitochondrial sequencing in off-target data from WES. A standard WGS experiment at 35× mean genomic coverage was compared to exome sequencing experiments yielding average exome target coverage of 30× for Illumina, 60× for Agilent and 68× for Nimblegen.

We also utilized a single custom capture sequenced sample from Teer et al to study the feasibility of gleaning mitochondria from a custom capture experiment.

  1. Clark et al have made this data set downloadable from NCBI in the SRA file format
  2. Using the SRA toolkit we converted SRA to FASTQ. As these are paired-end reads, we used fastq-dump with the --split-3 option. This generated 2 FASTQ files, one each for R1 and R2
  3. Using the BWA-MEM algorithm we aligned reads in these FASTQ files to allchr.fa. Additionally, for the TruSeq data we also used the BWA-SAMPE algorithm to compare BWA alignment algorithms
  4. The BWA alignment provided SAM files for each of the three WES datasets (Agilent, Nimblegen, Illumina) and WGS. Using Samtools we converted SAM files to BAM for compact storage and easier downstream processing
  5. We filtered for reads that mapped to chromosome M and had PHRED-scale mapping quality >= 20 (at least 99% probability that the mapping position is correct)
  6. For calling variants we ran a custom Perl script on the generated pileup to call variants at different thresholds of >=1%, >=5% and >=10% variant-supporting reads
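The filtering in step 5 can be sketched in a few lines of Python operating on SAM text. This is an illustration and not the pipeline we ran: the contig name "chrM" is an assumption about how the mitochondrial sequence is named in the reference, and the function names are made up for this example.

```python
def mapq_to_accuracy(mapq: int) -> float:
    """Convert a PHRED-scaled mapping quality to the probability the placement is correct."""
    return 1 - 10 ** (-mapq / 10)


def keep_read(sam_line: str, contig: str = "chrM", min_mapq: int = 20) -> bool:
    """True if a SAM alignment line is mapped to `contig` with MAPQ >= `min_mapq`."""
    if sam_line.startswith("@"):          # header lines carry no alignment
        return False
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])                 # SAM column 2: bitwise FLAG
    rname = fields[2]                     # SAM column 3: reference name
    mapq = int(fields[4])                 # SAM column 5: mapping quality
    is_unmapped = bool(flag & 0x4)        # FLAG bit 0x4 marks an unmapped read
    return (not is_unmapped) and rname == contig and mapq >= min_mapq


if __name__ == "__main__":
    # Synthetic alignment records, for illustration only.
    example = [
        "r1\t0\tchrM\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
        "r2\t0\tchrM\t200\t10\t5M\t*\t0\t0\tACGTA\tIIIII",
        "r3\t0\tchr1\t300\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    ]
    print([line.split("\t")[0] for line in example if keep_read(line)])
    print(mapq_to_accuracy(20))
```

A MAPQ of 20 corresponds to an error probability of 10^(-20/10) = 0.01, which is where the 99% figure in step 5 comes from.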

Read Metrics:

All 10x/5x metrics use reads mapped with PHRED-scale mapping quality >= 20. The number of mitochondrial genome bases covered at more than 5x (5-fold) and at more than 10x is summarized for the sequencing data from the different capture kits (Table 1).

All results are for BWA-MEM except for the Illumina TruSeq capture data that was also aligned using BWA-SAMPE. Our comparisons show that BWA-MEM aligned more reads and had generally better performance.

A custom capture sample was evaluated simply to see the potential of extracting the mitochondrial genome from that data type as well. It performed really well, generating more than 900 RPM for the mitochondrial genome, implying much greater relative off-target throughput.

Capture/WGS | All reads (millions) | Mapped reads (millions) | % mapped reads | chrM reads | Q20 chrM reads | Q20 chrM RPM* | chrM bases > 10x | chrM bases > 5x
SRR309291 (Agilent) | 124.193 | 123.949 | 99.80 | 2836 | 2647 | 21.36 | 12615 | 15691
SRR309292 (Nimblegen) | 185.088 | 184.588 | 99.73 | 3770 | 3466 | 18.78 | 5563 | 11271
SRR309293 (Illumina) | 113.369 | 113.070 | 99.74 | 27326 | 24645 | 217.96 | 16569 | 16569
(Illumina SAMPE) | 112.886 | 105.777 | 93.70 | 25149 | 22894 | 216.44 | 16569 | 16569
SRR341919 (WGS) | 1,312.649 | 1,253.840 | 95.52 | 436042 | 417365 | 332.87 | 16569 | 16569
(Custom Capture) | 5.313 | 5.086 | 95.73 | 5346 | 4897 | 962.75 | 9997 | 14318

*: Q20 mapped chrM reads per Million Mapped reads for that sample
Table 1: Sequencing throughput and mitochondrial genome coverage from NGS data on whole-genome, exome and custom-captured samples
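For anyone wanting to reproduce the derived columns, here is a small sketch of how the RPM and covered-fraction numbers follow from the raw counts in Table 1. The function names are mine, and the genome length of 16,569 bp is taken from the fully covered rows of the table.

```python
MT_GENOME_LENGTH = 16_569  # bases; matches the fully covered rows in Table 1


def reads_per_million(q20_chrm_reads: int, mapped_reads_millions: float) -> float:
    """Q20 chrM reads per million mapped reads for one sample."""
    return q20_chrm_reads / mapped_reads_millions


def fraction_covered(bases_at_depth: int, genome_length: int = MT_GENOME_LENGTH) -> float:
    """Fraction of the mitochondrial genome covered at or above a depth cut-off."""
    return bases_at_depth / genome_length


if __name__ == "__main__":
    # Agilent row of Table 1: 2647 Q20 chrM reads, 123.949 million mapped reads,
    # 15691 bases covered at more than 5x.
    print(round(reads_per_million(2647, 123.949), 2))
    print(round(fraction_covered(15691), 3))
```

Normalizing by mapped reads is what makes the small custom-capture run comparable with the much deeper WGS run.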


Coverage of Mitochondrial Genome

Figure 1: Contrasting coverage of mitochondrial genome from WGS and WES sequencing data (truseq-pe data was aligned using BWA-sampe tool while all others were aligned using BWA-mem)
  • WGS data generated really good coverage of the mitochondrial genome, almost always > 700-fold
  • Coverage from Illumina TruSeq data was consistent between the BWA-mem and BWA-sampe aligners, though the latter gave slightly lower coverage due to fewer mapped reads
  • Agilent off-target data generated sufficient mitochondria mapped reads considering ~95% of mitochondrial genome covered at 5x. Higher overall throughput for the sequenced sample could have provided greater off-target sequence reads yielding higher mitochondrial genome coverage.
  • Nimblegen off-target data was the least abundant, and the coverage profile across mitochondrial genome was also different from other datasets. This may also be due to the high-density overlapping bait design of Nimblegen, giving focused on-target coverage, leaving fewer off-target reads.

Variant Calling on the Mitochondrial Genome

33 variants shared by all 4 (WGS, Illumina/Nimblegen/Agilent capture)
Venn diagram (generated using Venny) comparing the mitochondrial variants identified in the same sample from WGS and off-target data from different capture kits (10% or more alternate-supporting reads implied a variant call)

The sequencing data showed high variability when using 1% alternate-supporting reads to annotate a mitochondrial genomic position as variant. So we defined a variant as a nucleotide position where at least 10% of reads support the alternate allele. The Venn diagram above highlights that the vast majority (33/41) of called variants on the mitochondrial genome from WGS and WES data overlap. Another 6 variants identified in WGS were also observed in Agilent and Illumina WES data, but missed by Nimblegen WES due to low coverage. We do not provide a comprehensive listing of the exclusive variants, but most of them suffer from low read-depth, low quality, and strand bias.
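The thresholding logic is simple enough to sketch. The snippet below is not the Perl script we used; it is a minimal Python illustration with hypothetical read counts and positions, showing how the same site can be called or not depending on the cut-off, and how calls from several datasets can be intersected.

```python
def call_variant(ref_reads: int, alt_reads: int, min_alt_fraction: float = 0.10) -> bool:
    """Call a position variant if the alternate-allele read fraction meets the cut-off."""
    depth = ref_reads + alt_reads
    if depth == 0:                 # no coverage, no call
        return False
    return alt_reads / depth >= min_alt_fraction


def shared_calls(*call_sets: set) -> set:
    """Positions called as variant in every dataset (the centre of the Venn diagram)."""
    return set.intersection(*call_sets)


if __name__ == "__main__":
    # Hypothetical site: 97 reference reads, 3 alternate reads.
    for cutoff in (0.01, 0.05, 0.10):
        print(cutoff, call_variant(97, 3, cutoff))

    # Hypothetical variant positions per dataset, for illustration only.
    wgs, agilent, nimblegen = {73, 263, 750}, {73, 263, 750}, {73, 263}
    print(sorted(shared_calls(wgs, agilent, nimblegen)))
```

A 1% cut-off lets through sites supported by only a handful of reads, which is consistent with the variability we saw at that threshold.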


With the decreasing cost and increasing availability of exome sequencing data, there is a vast resource of mitochondrial genomes that can be mined for mitochondria-focused research. Data from large consortia like 1000 genomes and NHLBI exome datasets can be utilized for a comparative mitochondrial variation evaluation. As reported by Picardi et al, Illumina Truseq and Agilent exome kits generate better mitochondrial genome coverage compared to Nimblegen. Interestingly, even the custom-capture kit we evaluated generated a decent amount of mitochondrial genome coverage. This opens up a plethora of small NGS panel and custom-capture datasets for mitochondrial genome evaluation.

#ASHG2013 Platform and Poster abstract tag-clouds

With more than 6000 scientists (geneticists, bioinformaticians, clinicians, statisticians, genetic counselors…) and more than 200 companies in Boston for this year’s American Society of Human Genetics conference, there is a lot of great science to catch up on.

Very quickly, I just pulled out the selected platform talk abstracts, and the poster abstracts (too many posters, so I simply picked my biased interest of ~260 Bioinformatics ones) and made these tag-clouds to get the popular keywords.

They are very similar! While the Bioinformatics posters have a lot of DATA, coverage and quality, the platform talks have a lot of CANCER, functional and mutations. The platform talks also feature neandertal and pms. Looking forward to all the excitement!!

Bioinformatics Posters

Platform abstracts

Tools & Parameters: TagCrowd to generate the cloud using text from PDF files on the ASHG website. Max 77 words to show, min frequency of 5 and excluding these keywords “boston cambridge chr ma united university”.
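If you would rather compute the keyword counts yourself, the same parameters can be approximated in a few lines of Python. This is a rough sketch and not TagCrowd's actual algorithm: it does no stop-word removal or stemming, and the function name is my own.

```python
import re
from collections import Counter

# Keywords excluded from the clouds, as listed above.
EXCLUDED = {"boston", "cambridge", "chr", "ma", "united", "university"}


def tag_counts(text, max_words=77, min_frequency=5, excluded=EXCLUDED):
    """Return up to `max_words` (word, count) pairs occurring at least `min_frequency` times."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word not in excluded)
    frequent = [(word, n) for word, n in counts.most_common() if n >= min_frequency]
    return frequent[:max_words]


if __name__ == "__main__":
    # Toy text standing in for the abstract PDFs.
    sample = "data " * 6 + "cancer " * 5 + "boston " * 9 + "rare " * 2
    print(tag_counts(sample))
```

Feeding it the text extracted from the abstract PDFs should give a ranked list comparable to the clouds, minus the visual sizing.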

BTW, do check out the twitter analysis by @erlichya on #ASHG2013 tweets and keywords

How recent are rare variants?

The Department of Genome Sciences at the University of Washington in Seattle, in a multi-institutional effort, sequenced 15,336 genes from a total of 6,515 individuals of European American and African American descent for the NHLBI-sponsored ESP project. To identify approaches for disease-gene discovery, it's important to understand the evolutionary history of Homo sapiens and identify the age of mutations.

The group estimated that 73% of all protein-coding SNVs and around 86% of all SNVs predicted to be deleterious arose recently, within the last 5,000-10,000 years. European Americans had an excess of deleterious variants and weaker purifying selection, which is explained by the Out-of-Africa model.

The gist, you ask? Rare variants have an important role in heritable phenotypic variation, disease susceptibility and adverse drug responses. With the rapid increase in population size, selection has not had enough time to act on these variants; it's only been 200-300 generations since these mutations came to be. This increase in mutations results in more Mendelian disorders and has increased the allelic and genetic heterogeneity of traits.

Though if there’s a positive side to it, it may well be that we as a species have created a new repository of advantageous alleles that have come into being fairly recently, and hopefully evolution will act upon them in subsequent generations.

AGBT 2013 Saturday sessions

Plenary Session: Genomic Technologies
Len Pennacchio, Lawrence Berkeley National Laboratory, Chair

— could not take notes on some of the talks and afternoon session

9:00 a.m. – 9:30 a.m.
Rebecca Leary, Johns Hopkins Kimmel Cancer Center
“Personalized Approaches to Non-invasive Cancer Detection”

– personalized analysis of rearranged ends (PARE)-identify structural alterations in solid tumors
– generate personalized biomarkers for the detection of circulating tumor DNA
– Tumor-derived mate-pair library -> somatic rearrangements -> confirmed by PCR in tumor & matched normal
– Application = monitor disease progression, identify residual disease (predict relapse), surgical margins
– Plasma Aneuploidy Score – clearly differentiates normals from colorectal cancer samples (just 10x physical coverage – detect rearrangements)
– 0.75% circulating tumor DNA – 90%+ sensitivity, 99%+ specificity using 1 HiSeq lane

9:30 a.m. – 9:55 a.m.
* Eric Antoniou, Cold Spring Harbor Laboratory
“Increased Read Length and Sequence Quality with Pacific Biosciences Magbead Loading System and a New DNA Polymerase”

– duckweed as biofuel (40 tonnes/acre/yr); 0.1 ton yields 0.025 tons of ethanol by weight, which is ~7.5 gallons a day
– rice genome (470 Mbp) sequenced using the Pacific Biosciences RS sequencer (MagBead loading system) – hybrid de novo assembly with Illumina data
– 10kbp insert library; 9X coverage of the rice genome (mean read length – 3kb, max 21kb)
– mean accuracy mode of single pass long read – 90%, (85-87% for current C2 chemistry)
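A quick sanity check on those rice numbers, using the usual coverage relation (coverage = number of reads × mean read length / genome size). This is my own arithmetic from the figures in the notes above and not something shown in the talk.

```python
def reads_needed(coverage: float, genome_size_bp: int, mean_read_length_bp: int) -> float:
    """Number of reads required to reach a mean coverage, from C = N * L / G."""
    return coverage * genome_size_bp / mean_read_length_bp


if __name__ == "__main__":
    # 9X coverage of a 470 Mbp genome with 3 kb mean read length (figures from the notes).
    print(f"{reads_needed(9, 470_000_000, 3_000):,.0f} reads")
```

That works out to roughly 1.4 million long reads for the coverage quoted.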

9:55 a.m. – 10:20 a.m.
* Tim Harkins, Life Technologies
“Ovarian Cancer Evolution: a Tale of Two Paths”

– ovarian cancer 9th leading cancer among women, 5th leading cause of cancer related death, high relapse rate

10:45 a.m. – 11:10 a.m.
* X. Sunney Xie, Harvard University
“Detecting Single Nucleotide and Copy Number Variations of a Single Human Cell by Whole Genome Sequencing”

– Individual cells of identical descent can have different genomes (dynamic changes in DNA) – important to many biological investigations and medical diagnoses
– Single-cell whole-genome amplification methods – exponential amplification bias => low genome coverage
– Multiple Annealing and Looping Based Amplification Cycles (MALBAC) – 93% genome coverage ≥ 1x for a single human cell at 30x mean sequencing depth
– detection of digitized CNV & SNVs – ~76% efficiency for a single cancer cell
– 2.5 single-base substitutions per mitosis in human tumor cell line identified using single cell amplification/sequencing
– circulating tumor cells (CTCs) of same patient show similar CNV; CTCs of lung cancer patients show similar CNV
– clinical trial for pre-implantation genomic screening for IVF using single polar bodies of oocytes
– male’s genome can be phased by seq sperm, female’s genome phased using polar bodies genomes
– 0.1X genome coverage is enough to determine aneuploidy (at 8-cell stage) for MALBAC’s single-cell sequencing in IVF
– anomalous transition/transversion ratio for newly acquired SNVs

11:10 a.m. – 11:35 a.m.
* Jeremy Schmutz, HudsonAlpha Institute
“Evaluating Moleculo Long Read Technology for de novo Whole Genome Sequencing”

– Moleculo Long Read technology – sequencing two complex plant genomes (inbred diploid switchgrass comparator Panicum hallii (600 Mb) and the outbred tetraploid Miscanthus sinensis (~2.3 Gb))
– these include long, retrotransposon-derived repeats and diverse GC-content, and present significant challenges for short-read NGS whole genome shotgun sequencing
– Moleculo reads – 10kb reads (5kb avg), high accuracy (1.26bp error/10k), tunable to genome size/complexity, reduces computational complexity
– limitations = distribution of reads depends on local repetitive content & global repeat freq; illumina based => localized chemistry issues; some amplification bias

11:35 a.m. – 12:00 p.m.
* Jonas Korlach, Pacific Biosciences
“Automated, Non-Hybrid De Novo Genome Assemblies and Epigenomes of Bacterial Pathogens”

AGBT 2013 Friday sessions

Plenary Session:  Genomic Studies II
John McPherson, Ontario Institute for Cancer Research, Chair

9:00 a.m. – 9:30 a.m.
Steve Scherer, The Hospital for Sick Children
“Whole Genome Sequencing Analysis in Autism”

– Autism Spectrum Disorder (ASD) – high heritability, familial clustering & ~4:1 male to female bias (as many candidates on X-chr)
– 100+ risk genes, ~10 not present on the capture
– WGS (at BGI, >30x) on ASD families; need for better indel callers (indel validation rate ~20%, SNV validation rate >90%)
– better and more uniform X chr and splice site coverage in WGS compared to WES
– also mentions PGP-Canada

9:30 a.m. – 10:00 a.m.
Jay Shendure, University of Washington
“Tackling Genetic Heterogeneity with Massive Multiplexing and Molecular Counting”

Missed out on the talk, but here is an older slide-deck from Shendure which covers most of the stuff presented

10:00 a.m. – 10:30 a.m.
* Gabe Rudy, Golden Helix (@gabeinformatics)
“Home-Brewed Personalized Genomics: The Quest for Meaningful Analysis Results of a 23andMe Exome Pilot Trio of Myself, Wife, and Son”

– $999 80x exome for the trio, mother with clinically-diagnosed idiopathic rheumatoid arthritis
– 75bp PE, SureSelect capture, BWA/GATK; deliverables: BAM, VCF, PDF summary report
– goals = variant call accuracy from NGS, usefulness of 23andme risk variants, usefulness of healthy person’s exome, potential to find driver variants and genes for diagnosis
– 3 Mendel errors, usually due to technical biases (eg mom and dad had non-ref nucleotide messing up child’s genotype)
– 8000 phantom variants (some GATK bug in that version)
– Ingenuity Variant Analysis performed on the exome trio data – look for rare variants within 1-hop of JIA gene
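For context on the Mendel errors mentioned above, here is a minimal sketch of how a trio genotype can be checked for Mendelian consistency. It is an illustration only, not the method used in the talk: it assumes diploid autosomal sites and ignores de novo mutations and sex chromosomes, and the function name is hypothetical.

```python
from itertools import product


def is_mendelian_consistent(child, mother, father):
    """True if the child's genotype can be formed from one maternal and one paternal allele.

    Genotypes are 2-tuples of alleles, e.g. ("A", "G").
    """
    return any(sorted((m, f)) == sorted(child) for m, f in product(mother, father))


if __name__ == "__main__":
    print(is_mendelian_consistent(("A", "G"), ("A", "A"), ("G", "G")))
    print(is_mendelian_consistent(("G", "G"), ("A", "A"), ("A", "G")))
```

Sites that fail this check in a trio are more often genotyping artifacts than true de novo events, which fits the technical-bias explanation in the notes.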

—- Illumina User Meeting Dispatch newsletter

11:00 a.m. – 11:30 a.m.
Mark Yandell, University of Utah
“VAAST: A Probabilistic Disease-gene Finder for Personal Genomes”

– VAAST substantially improves upon existing approaches in terms of statistical power, flexibility and scope of use
– identify rare-disease causing loci using single trios of family members, and in small cohorts (n=3) where no two individuals share the same deleterious variants
– also identify genes involved in common, complex diseases using many fewer cases than traditional GWAS
– working to integrate indels, CNV and SV into VAAST, along with pedigrees, and non-human projects (pigeonomics)

11:30 a.m. – 12:00 p.m.
* Agnes Viale, Memorial Sloan Kettering Cancer Center
“RNA-sequencing Analysis Identifies Novel Leukemic Pathways in a Genetically Accurate Model of Acute Myeloid Leukemia”

Bronze Sponsor Workshops
Chad Nusbaum, Broad Institute of MIT and Harvard, Chair

Line-up of all the vendor talks – @PerkinElmer @iontorrent @NuGENInc @illumina @BCILifeSciences @QIAGEN @PacBio @dnanexus

1:40 p.m. – 2:00 p.m.
NuGen Technologies, Inc., Christine Malboeuf, Broad Institute of MIT and Harvard
“Viral RNA Genome Sequencing of Ultra-Low Copy Samples using NuGen’s Ovation RNA-Seq”

– a human cell holds ~5pg of RNA; ultra-low RNA = 5fg (1000 copies) down to 5ag, the typical amount of viral RNA, which does not work well with qPCR, etc
– Challenges – low quantity, host contamination, diversity (high mutation rate), technological and extraction process
– Ovation rna-seq v2 protocol from NuGen (500pg to 100ng input RNA) – low contamination
– West Nile virus – 50fg input 5M reads 31% map to virus, 48% map to host, covering 100% of viral CDS
– Dilutions starting with less material generated reproducible coverage profiles
– HIV – 50fg input rna – 5M reads, 69% viral aligned reads 5% host aligned, covering 100% CDS
– fewer copies of input rna meant 1-2% reads mapping to virus, 30-40% mapping to host, but covered ~97% CDS with reproducible coverage profile
– process worked on samples that failed RT-PCR-454 process; method applicable to many other viral sample types (300-75k viral copies)
– applications: surveillance of endemic/emerging viral pathogens; co-infection of multiple viruses; pathogen discovery (viral parasite bacterial fungal)

Concurrent Session: Computational Biology
Mike Zody, Broad Institute of MIT and Harvard, Chair

7:30 p.m. – 7:50 p.m.
* Mark DePristo, Broad Institute of MIT and Harvard
“Overcoming Today’s Limitations in Sequencing Technology for Human Medical Genetics”

– have sequenced 40k+ samples to date from the common (Diabetes, Autism, and Heart Disease) to the uncommon/rare (Crohn’s and Mendelian disorders)
– Variation among individuals in a population – 90% SNPs 10% indels; disease-causing variation, particularly rare diseases, SNP and indel approach 50% / 50%
– indels remain an outstanding challenge; technical and analytic reasons
– PCR-free libraries improve variant calling sensitivity & specificity
– nice visual example of data looking clean with almost everything matching reference with one SNP and some noise calls; actually a het indel!
– better error models and longer reads improve sensitivity to true indels
– sample size is a huge limitation to better calling; but the ensuing massive data aggregation becomes a challenge as well

7:50 p.m. – 8:10 p.m.
* Andrew Farrell, Boston College
“Reference-free Approach for Mutation Detection”

– De novo assembly is prohibitively expensive for most labs – deep read coverage and massive computing power
– practical approach = reference guided alignment; dependent on three factors – reference accuracy, mapper’s ability to correctly place read (uniquely), degree to which a variant allele differs from reference (indels)
– developed a novel completely reference-independent method – no mapping or de novo assembly of the genome; directly compares raw sequence data from two or more samples, and identifies groups of reads unique to a sample
– tested on small genomes but will tackle human (incl. tumor) genomes, metagenomes, transcriptomes
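To make the idea of comparing raw reads between samples concrete, here is a toy sketch. The talk did not describe the algorithm in detail, so the k-mer set comparison below is purely my assumption of one way to find reads unique to a sample, not the presented method; it ignores reverse complements and sequencing errors, and the function names are mine.

```python
def kmers(read: str, k: int) -> set:
    """All substrings of length k in a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def reads_unique_to_sample(sample_reads, other_reads, k=5):
    """Reads carrying at least one k-mer never seen in the other sample."""
    other_kmers = set()
    for read in other_reads:
        other_kmers |= kmers(read, k)
    return [read for read in sample_reads if kmers(read, k) - other_kmers]


if __name__ == "__main__":
    # Toy reads: the second sample read differs from the reference sample by one base.
    sample = ["ACGTACGTAC", "ACGTTCGTAC"]
    other = ["ACGTACGTAC"]
    print(reads_unique_to_sample(sample, other))
```

No reference genome or assembly is involved here; the comparison happens directly between the two read sets, which is the spirit of what the notes describe.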

8:10 p.m. – 8:30 p.m.
* James Knight, 454 Life Sciences
“Assembling Human Sequence into Genomes”

8:30 p.m. – 8:50 p.m.
* Aaron Quinlan, University of Virginia
“LUMPY: A Probabilistic Framework for Structural Variant Discovery and Genomic Data Mining”

– structural variation (SV) needs integration of multiple alignment signals – read-pair, split-read and read-depth
– most existing SV discovery approaches utilize only one signal; poor at low sequence coverage and for smaller SVs (Hydra, DELLY, GASVPro)
– LUMPY = extremely flexible probabilistic SV discovery framework – integrates SV detection signals from read alignments or prior evidence
– 4k simulated SV – 1k each deletion, duplication, insertion, inversion – 2x, 5x, 10x, 20x coverage
– potential for a unified variant calling framework and probabilistic analyses of diverse genomic interval datasets (ENCODE)

8:50 p.m. – 9:10 p.m.
* Jeffrey Reid, Baylor College of Medicine
“Discovery of Mobile Element Variation in Ultra-deep Whole Genome Data”

9:10 p.m. – 9:30 p.m.
* Michael Schatz, Cold Spring Harbor Laboratory
“Assembling Crop Genomes with Single Molecule Sequencing”