Solving the Ebola Virus Genome and Identifying Possible Diagnosis

If you have ever played the card game Killer Bunnies and lost the bunnies in your Bunny Circle to the Level 11 weapon, Ebola Virus, you will want to read this.


The research community is making strides toward understanding whether the virus is adapting to its host, or changing as it spreads through different populations, as more countries in West Africa are pulled into its path.

Five of the 50 co-authors on this Science article were themselves infected with the deadly Zaire Ebola virus (EBOV). The story reads like a thriller: the events trace back to the funeral of a healer, which kick-started the spread of Ebola in the region. Also reviewed here is a 2008 paper in which the authors point to the VP35 protein, identified in their experiments as a critical component of this hemorrhagic fever.

Ebola’s genomic sequence:

  • Linear, single-stranded genome
  • Inverse-complementary 3′ and 5′ termini
  • ~19 kb (about 19 thousand nucleotides, compared with about 3 billion in the human genome)
  • Seven genes (compared to ~20k in humans)

The current outbreak is due to EBOV, one of the five Ebola viruses known to infect humans. Research groups are trying to identify whether the genetic sequence of this virus is changing fast enough, in the regions targeted by PCR-based diagnostic tests, to affect the accuracy of those tests.

The EBOV in the 2014 epidemic has been reported to be 97% similar to the virus that first emerged in 1976. Articles across the web estimate that EBOV evolves at about 7×10⁻⁴ substitutions per site per year, suggesting that the current strain would have accumulated many substitutions over the nearly 40 years since 1976.
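As a rough back-of-the-envelope check, the quoted rate can be turned into an expected number of substitutions. The sketch below assumes a constant rate, a genome of about 19,000 sites and 38 years between 1976 and 2014; the function name is only illustrative and none of this comes from the paper itself.

```python
def expected_substitutions(rate_per_site_per_year, genome_length, years):
    """Expected number of substitutions under a constant rate (a strong simplification)."""
    return rate_per_site_per_year * genome_length * years


RATE = 7e-4              # substitutions per site per year, as quoted above
GENOME_LENGTH = 19_000   # ~19 kb genome (approximation)
YEARS = 2014 - 1976      # 38 years between first emergence and this outbreak

expected = expected_substitutions(RATE, GENOME_LENGTH, YEARS)

# Number of sites that differ if two genomes are 97% similar
differing_sites = (1 - 0.97) * GENOME_LENGTH

print(f"Expected substitutions over {YEARS} years: {expected:.0f}")
print(f"Sites differing at 97% similarity: {differing_sites:.0f}")
```

Both numbers land in the range of a few hundred, which is the same order of magnitude as the fixed substitutions reported in the paper. Treat this as a sanity check only, since real lineages do not evolve at one fixed rate.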

In this article, Gire et al. use genomic data from next-generation sequencing to examine whether the virus is accumulating significant mutations as it moves between hosts.

  • Methods compared to ascertain choice for sequencing:
    • Library preparation: Nugen and Nextera
    • Sequencing instruments: PacBio and Illumina
  • Nextera and Illumina provided the most complete genome assembly and intrahost SNV identification
  • 99 virus genomes from 78 patients in Sierra Leone, sequenced at a median coverage >2,000x across 99.9% of EBOV coding regions
  • Intrahost and interhost genetic variation used to characterize transmission patterns
  • 341 fixed substitutions identified between previous and 2014 EBOV
    • 35 nonsynonymous, 173 synonymous, 133 noncoding
  • 55 single nucleotide polymorphisms (SNPs) within this West African outbreak
    • 15 nonsynonymous, 25 synonymous, 15 noncoding
  • Genetic similarity across the sequenced 2014 samples suggests a single transmission event
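The category counts in the list above can be tallied quickly to check that they add up and to see how the mix differs between the two sets. This is only a small sketch using the numbers quoted here; the variable and function names are my own.

```python
# Counts as quoted in the list above
fixed = {"nonsynonymous": 35, "synonymous": 173, "noncoding": 133}
outbreak = {"nonsynonymous": 15, "synonymous": 25, "noncoding": 15}


def summarize(counts):
    """Return the total and the fraction contributed by each category."""
    total = sum(counts.values())
    fractions = {name: value / total for name, value in counts.items()}
    return total, fractions


fixed_total, fixed_frac = summarize(fixed)
outbreak_total, outbreak_frac = summarize(outbreak)

for label, total, frac in (("fixed", fixed_total, fixed_frac),
                           ("outbreak", outbreak_total, outbreak_frac)):
    shares = ", ".join(f"{name} {share:.1%}" for name, share in frac.items())
    print(f"{label}: {total} total ({shares})")
```

The totals match the 341 and 55 quoted above, and the nonsynonymous share is noticeably larger among the outbreak SNPs than among the fixed substitutions. With only 55 SNPs, that difference should be read with caution.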

 

Josh Herr of Michigan State University and Daniel Park of the Broad Institute aim to maintain an analysis wiki (ebola-crowdsource) for solving the underlying genomic riddle by studying the different strains of the virus, and they are encouraging contributors.


In an earlier paper, published in the Journal of Virology in 2008, Hartman et al. discuss how whole-genome expression profiling reveals that the virus's inhibition of the host's innate immune response can be reversed by a single amino acid change in the VP35 protein.

  • Two reverse genetic-generated Ebola virus strains
    • Encode wild-type VP35 protein or VP35 with an arginine (R)-to-alanine (A) amino acid substitution at position 312
  • Whole-genome expression profiling of the infected host cells (human liver cells)
  • Host cells reveal differences in response to introduction of these viruses differing by a single amino acid
  • VP35 protein plays a vital role in inhibiting immune responses of the host
  • Single amino acid change exhibits the ability to eliminate this inhibitory effect
  • VP35 Protein demonstrates a critical role in the severity of the disease

 

Dr. Lipkin, professor of epidemiology at Columbia University, discusses the pertinent question of whether ‘Ebola can travel to the United States’. He explains in a matter-of-fact way that although the virus could travel to the US, as it could anywhere else, there is also a high likelihood that health authorities would detect and isolate it at the earliest possible stage.

Let’s get to a round of that card game now, shall we?

Mitochondrial Gold Rush

Mitochondrial genomes can be extracted from Whole Exome Sequencing (WES) data, as outlined in this Nature Methods paper by Ernesto Picardi and Graziano Pesole. Tools like MitoSeek are now available that gather mitochondrial reads from NGS data and perform high-throughput sequence analysis. The availability of mitochondrial genomes is important because genomic variation in mitochondria has been implicated in a variety of neuromuscular and metabolic disorders, along with roles in aging and cancer.

Here, however, we look at how effectively mitochondrial sequence can be extracted from the different capture kits used for WES. Picardi et al. used the MitoSeek tool to successfully assemble 100%, 95% and 72% of the mtDNA genome from the TruSeq (Illumina), SureSelect (Agilent) and SeqCap EZ-Exome (NimbleGen) platforms, respectively. We set out to assess mitochondrial genome extraction using a different approach and tool set. Using datasets from the same sample generated with three different capture kits, with Whole Genome Sequencing (WGS) data as the gold standard, we evaluated alignment and variant-calling results.

Clark et al. sequenced and analyzed a human blood sample (from a healthy, anonymous volunteer) at Stanford University using three commonly used WES kits:

  1. Agilent SureSelect Human All Exon kit
  2. Nimblegen SeqCap EZ Exome Library v2.0
  3. Illumina TruSeq Exome Enrichment

An Illumina HiSeq instrument was used for WGS and for all three WES capture kits. Clark et al. highlight comparisons between the three capture kits, from library preparation to sequencing time. The paper discusses the effectiveness of each kit based on metrics such as bait design, capture of UTR regions, etc. They compare variant calls across all three WES kits and WGS, and discuss the ability of WES to detect additional small variants that were missed by WGS. Although this paper doesn’t provide an in-depth instrument comparison, readers here may assume that Illumina is the leader in sequencing technology (at least until tonight!).

We use this data set to compare and contrast the availability and quality of mitochondrial sequence in off-target data from WES. A standard WGS experiment at 35× mean genomic coverage was compared to exome sequencing experiments yielding average exome target coverage of 30× for Illumina, 60× for Agilent and 68× for Nimblegen.

We also utilized a single custom capture sequenced sample from Teer et al to study the feasibility of gleaning mitochondria from a custom capture experiment.

  1. Clark et al. have made this data set downloadable from NCBI in the SRA file format
  2. Using the SRA Toolkit we converted SRA to FASTQ. As these are paired-end reads, we used fastq-dump with the --split-3 option. This generated two FASTQ files, for R1 and R2
  3. Using the BWA-MEM algorithm we aligned the reads in these FASTQ files to allchr.fa. Additionally, for the TruSeq data we also used the BWA-SAMPE algorithm to compare BWA alignment algorithms
  4. The BWA alignment provided SAM files for each of the three WES kits (Agilent, Nimblegen, Illumina) and for WGS. Using Samtools we converted the SAM files to BAM for compact storage and easier downstream processing
  5. We kept reads that mapped to chromosome M with a PHRED-scaled mapping quality >= 20 (at least a 99% probability that the mapping position is correct)
  6. For calling variants we ran a custom Perl script on the generated pileup, calling variants at thresholds of >=1%, >=5% and >=10% variant-supporting reads
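The steps above can be sketched as a list of shell commands. This is not the exact set of commands we ran: the file names, the contig name chrM, the sort and index steps, and the helper function itself are assumptions for illustration, and the reference is assumed to be already indexed with bwa index. The helper only builds the command strings and does not execute anything.

```python
def build_pipeline(accession, reference="allchr.fa", mito_contig="chrM", min_mapq=20):
    """Return shell command strings for steps 2 to 6; nothing is executed here."""
    r1 = f"{accession}_1.fastq"      # fastq-dump --split-3 names mates _1 and _2
    r2 = f"{accession}_2.fastq"
    sam = f"{accession}.sam"
    bam = f"{accession}.sorted.bam"
    mito_bam = f"{accession}.{mito_contig}.q{min_mapq}.bam"
    pileup = f"{accession}.{mito_contig}.pileup"
    return [
        # Step 2: SRA to paired FASTQ files
        f"fastq-dump --split-3 {accession}.sra",
        # Step 3: align paired-end reads with BWA-MEM (SAM goes to stdout)
        f"bwa mem {reference} {r1} {r2} > {sam}",
        # Step 4: convert to a coordinate-sorted BAM and index it
        f"samtools sort -o {bam} {sam}",
        f"samtools index {bam}",
        # Step 5: keep mitochondrial reads with mapping quality >= min_mapq
        f"samtools view -b -q {min_mapq} -o {mito_bam} {bam} {mito_contig}",
        # Step 6 input: pileup for the downstream variant-calling script
        f"samtools mpileup -f {reference} {mito_bam} > {pileup}",
    ]


for command in build_pipeline("SRR309291"):
    print(command)
```

The mitochondrial contig may be named MT rather than chrM depending on the reference, so check the FASTA headers before filtering.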

Read Metrics:

All 10x/5x metrics use reads mapped with PHRED-scaled mapping quality >= 20. The length of the mitochondrial genome covered at more than 5x (5-fold) and more than 10x coverage is summarized for the sequencing data from the different capture kits (Table 1).

All results are for BWA-MEM except for the Illumina TruSeq capture data that was also aligned using BWA-SAMPE. Our comparisons show that BWA-MEM aligned more reads and had generally better performance.

A custom capture sample was evaluated simply to see the potential of extracting the mitochondrial genome from that data type as well. It performed really well, generating more than 900 RPM for the mitochondrial genome, implying much greater off-target throughput.

Capture/WGS | All reads (millions) | Mapped reads (millions) | % mapped reads | chrM reads | Q20 chrM reads | Q20 chrM RPM* | > 10x chrM (bp) | > 5x chrM (bp)
SRR309291 (Agilent) | 124.193 | 123.949 | 99.80 | 2836 | 2647 | 21.36 | 12615 | 15691
SRR309292 (Nimblegen) | 185.088 | 184.588 | 99.73 | 3770 | 3466 | 18.78 | 5563 | 11271
SRR309293 (Illumina) | 113.369 | 113.070 | 99.74 | 27326 | 24645 | 217.96 | 16569 | 16569
SRR309293.pe (Illumina SAMPE) | 112.886 | 105.777 | 93.70 | 25149 | 22894 | 216.44 | 16569 | 16569
SRR341919 (WGS) | 1,312.649 | 1,253.840 | 95.52 | 436042 | 417365 | 332.87 | 16569 | 16569
SRR062592.s.bam (Custom Capture) | 5.313 | 5.086 | 95.73 | 5346 | 4897 | 962.75 | 9997 | 14318

*: Q20 mapped chrM reads per Million Mapped reads for that sample
Table 1: Sequencing throughput and mitochondrial genome coverage from NGS data on whole-genome, exome and custom-captured samples
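For readers who want to reproduce the derived columns, here is a minimal sketch of how the RPM value and the fraction of the mitochondrial genome covered can be computed from the raw counts in Table 1. The function names are illustrative, and the genome length of 16,569 bp is the human mitochondrial reference length, which matches the fully covered samples in the table.

```python
MT_GENOME_LENGTH = 16_569  # bp, human mitochondrial reference length


def reads_per_million(q20_chrm_reads, mapped_reads_millions):
    """Q20 chrM reads per million mapped reads for one sample."""
    return q20_chrm_reads / mapped_reads_millions


def percent_covered(bases_covered, genome_length=MT_GENOME_LENGTH):
    """Percentage of the mitochondrial genome covered at a given depth."""
    return 100.0 * bases_covered / genome_length


# (Q20 chrM reads, mapped reads in millions, chrM bases covered > 5x) from Table 1
samples = {
    "Agilent": (2647, 123.949, 15691),
    "Nimblegen": (3466, 184.588, 11271),
    "Illumina": (24645, 113.070, 16569),
}

for name, (q20_reads, mapped_millions, bases_5x) in samples.items():
    rpm = reads_per_million(q20_reads, mapped_millions)
    pct = percent_covered(bases_5x)
    print(f"{name}: {rpm:.2f} RPM, {pct:.1f}% of chrM covered at > 5x")
```

Small differences from the table in the last decimal place are expected, because the mapped read counts shown there are rounded to thousands.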

 

Coverage of Mitochondrial Genome

Figure 1: Contrasting coverage of mitochondrial genome from WGS and WES sequencing data (truseq-pe data was aligned using BWA-sampe tool while all others were aligned using BWA-mem)
  • WGS data generated really good coverage of the mitochondrial genome, almost always > 700-fold
  • Coverage from Illumina TruSeq data was consistent between the BWA-mem and BWA-sampe aligners, though the latter gave slightly lower coverage due to fewer mapped reads
  • Agilent off-target data generated sufficient mitochondria mapped reads considering ~95% of mitochondrial genome covered at 5x. Higher overall throughput for the sequenced sample could have provided greater off-target sequence reads yielding higher mitochondrial genome coverage.
  • Nimblegen off-target data was the least abundant, and the coverage profile across mitochondrial genome was also different from other datasets. This may also be due to the high-density overlapping bait design of Nimblegen, giving focused on-target coverage, leaving fewer off-target reads.

Variant Calling on the Mitochondrial Genome

33 variants are shared by all 4 datasets (WGS and Illumina/Nimblegen/Agilent capture)
Venn diagram (generated using Venny) comparing the mitochondrial variants identified in the same sample from WGS and from off-target data of the different capture kits (10% or more alternate-supporting reads implied a variant call)

The sequencing data showed high variability when 1% alternate-supporting reads was used to annotate a mitochondrial genomic position as variant. So, to define a variant, we required at least 10% of reads at a given nucleotide position to support the alternate allele. The Venn diagram above highlights that the vast majority (33/41) of variants called on the mitochondrial genome from WGS and WES data overlap. Another 6 variants identified in WGS were also observed in Agilent and Illumina WES data, but missed by Nimblegen WES due to low coverage. We do not provide a comprehensive enumeration of the exclusive variants, but most of them suffer from low read depth, low quality, and strand bias.
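To make the thresholding concrete, here is a small sketch of calling a position as variant when the fraction of alternate-supporting reads reaches a cutoff. It is not the Perl script we used: the function, the input format (reference and alternate read counts per position) and the toy counts are all hypothetical.

```python
def call_variants(pileup_counts, min_alt_fraction=0.10, min_depth=1):
    """Return positions whose alternate-allele read fraction meets the threshold.

    pileup_counts maps position -> (ref_read_count, alt_read_count).
    """
    calls = []
    for position in sorted(pileup_counts):
        ref_count, alt_count = pileup_counts[position]
        depth = ref_count + alt_count
        if depth < min_depth:
            continue  # no usable coverage at this position
        if alt_count / depth >= min_alt_fraction:
            calls.append(position)
    return calls


# Toy counts, invented for illustration only
toy_pileup = {
    73: (0, 40),    # 100% alternate reads
    150: (97, 3),   # 3% alternate reads
    263: (46, 4),   # 8% alternate reads
    750: (90, 10),  # 10% alternate reads
}

for threshold in (0.01, 0.05, 0.10):
    print(f">= {threshold:.0%}: {call_variants(toy_pileup, threshold)}")
```

In this toy example the 1% threshold calls every position, including ones supported by only a handful of reads, which is the kind of noise that pushed us to the 10% cutoff.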

Conclusions

With the decreasing cost and increasing availability of exome sequencing data, there is a vast resource of mitochondrial genomes that can be mined for mitochondria-focused research. Data from large consortia like the 1000 Genomes Project and the NHLBI exome datasets can be utilized for a comparative evaluation of mitochondrial variation. As reported by Picardi et al., the Illumina TruSeq and Agilent exome kits generate better mitochondrial genome coverage than Nimblegen. Interestingly, even the custom-capture kit we evaluated generated a decent amount of mitochondrial genome coverage. This opens up a plethora of small NGS panel and custom-capture datasets for mitochondrial genome evaluation.

R Line Plots – the easiest fastest plot ever

What you need:

  • Data
  • R
  • Eager Bioinformatician

For data I used a file with 7 columns, where the first column was a counter and columns 2-7 held the different values I wanted to compare in a line plot, to visualize the variation in my data points.

Serves: As many data points you would like to extend it to

Time: Once you have a parsed dataset, this is fast: you wouldn’t want to blink

Plotting lines with a legend 🙂

> x <- read.table("filename")

> png("filename.png")

> # ylim spans all six columns so that no line is clipped
> plot(x[,2], type="l", col="steelblue", ylab="Heading", ylim=range(x[,2:7]))

> lines(x[,3], col="pink")

> lines(x[,4], col="cyan")

> lines(x[,5], col="magenta")

> lines(x[,6], col="green3")

> lines(x[,7], col="blue")

> # the legend is placed at data coordinates (11000, 2); adjust these to your data
> legend(11000, 2, c("124-100", "124-50", "124-20", "124-10", "124-5", "124-2"), lty=1, lwd=2.5, col=c("steelblue","pink","cyan","magenta","green3","blue"))

> dev.off()

Feel free to change/add colors* and serve embedded within your document/ppt!

* You can find a detailed list of color names in R here: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf

How recent are rare variants?

The Department of Genome Sciences at the University of Washington in Seattle, in a multi-institutional effort, sequenced 15,336 genes from a total of 6,515 individuals of European American and African American descent for the NHLBI-sponsored ESP project. To identify approaches for disease-gene discovery, it’s important to understand the evolutionary history of Homo sapiens and to determine the age of mutations.

The group estimated that 73% of all protein-coding SNVs, and around 86% of all SNVs predicted to be deleterious, arose recently, within the past 5,000-10,000 years. European Americans had an excess of deleterious variants, consistent with weaker purifying selection, which was explained by the Out-of-Africa model.

The gist, you ask: rare variants have an important role in heritable phenotypic variation, disease susceptibility and adverse drug responses. With the population growing so quickly, selection has not had enough time to act on these mutations; it’s only been 200-300 generations since they came to be. This increase in mutations results in more Mendelian disorders and has increased the allelic and genetic heterogeneity of traits.

If there’s a positive side to it, it may well be that we as a species have created a new repository of advantageous alleles that came into being fairly recently and that evolution will hopefully act upon in subsequent generations.

The common story of rare variants..

NGS comes to the aid of researchers seeking answers about inherited complex traits and diseases, paving a path towards the ‘personalized medicine era’. Whole-exome sequencing (WES) has been deployed for identifying rare variants associated with complex diseases and is providing a fillip for further research and insight.

Common variants identified through Genome Wide Association Studies (GWASs) have not explained enough for the research community to account for the heritability of these traits. Moving forward and learning from experience, the trail of rare-variant identification is helping us fill the gaps of missing heritability that these genome-wide global efforts left open.

Cirulli and Goldstein mention in their review (Uncovering the roles of rare variants in common disease through whole-genome sequencing) that common variants are being identified in Mendelian disease studies as having a key role as modifiers of the effects of rarer contributors to disease risk.

Potential frequencies of causal variants in complex traits

At conferences like ASHG in San Francisco last year, the buzzwords were ‘rare variants are common’. Recent Twitter updates from PAGXII have been talking about the same ‘rare variants’, and not just for human populations.

Scouring the literature on common and rare variants has been interesting: there are papers that describe in great depth how thinking moved from Common Disease Common Variant (CDCV) to Common Disease Rare Variant (CDRV). Almost 4 years ago, Schork mentioned in his paper (Common vs. Rare Allele Hypotheses for Complex Diseases) that rare genetic variants (less than 5% frequency) can play key roles in influencing complex diseases and traits.

Another interesting paper that I came across, from a decade earlier, is by Reich and Lander (Lander of the Gangnam fame): On the allelic spectrum of human disease. In this paper the authors discuss the variation in allelic spectra for common disease genes and point out that for some genes predominant disease alleles exist, while for others there is only a set of rare alleles. Their theory revolves around the idea that the genes responsible for most of the risk for common diseases (hypertension, heart disease, etc.) have relatively simple allelic spectra, and hence that the causal variants for a common disease can be found using GWAS.

In his paper An Abundance of Rare Functional Variants in 202 Drug Target Genes Sequenced in 14,002 People, Nelson describes rare variants as the result of recent mutations that are, to some extent, clustered geographically. He also points out that common variants capture only a small fraction of the genetic diversity in any gene.

Number of Variants/kb of sequence

A decade (or less) from now, I imagine 23andMe expanding beyond the common variants it reports for ‘someone’s DNA’ to rare variants, possibly with specific input on ancestry, population and surnames to guide that route.

From the Whitehead Institute for Biomedical Research in Cambridge comes a paper on Identifying Personal Genomes by Surname Inference. Gymrek describes how surnames can be recovered by profiling short tandem repeats on the Y chromosome and querying freely available, publicly accessible internet resources. The authors point out, though, that their aim is to see effective policies established for data sharing, and to raise awareness among participants in genetic studies, not to see data sharing recede.

Another interesting read for the statistically inclined is the SKAT paper from Harvard: Rare-Variant Association Testing for Sequencing Data with the Sequence Kernel Association Test. This study proposed the sequence kernel association test (SKAT) for studying association between a set of rare (and common) variants and continuous or dichotomous phenotypes. It aggregates the individual score test statistics of the SNPs belonging to a set and computes a p-value at the set level.

SKAT steps in to take the role of testing for association between variants in a region, surpassing burden tests. The authors note it is unlikely for most rare variants to influence the phenotype with the same magnitude. And as it is more common for variants within a sequenced region to have little or no effect on phenotype, SKAT allows different variants to have different directions and magnitude of effects.
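To see why direction matters, here is a toy sketch contrasting a SKAT-style statistic with a burden-style statistic. It assumes an intercept-only null model with no covariates, unit weights, and made-up genotypes and phenotypes; it stops at the test statistics and does not compute the p-values that the real method derives from a mixture of chi-square distributions.

```python
def score_statistics(genotypes, phenotypes):
    """Per-variant score S_j = sum_i g_ij * (y_i - mean(y)), intercept-only null model."""
    mean_y = sum(phenotypes) / len(phenotypes)
    residuals = [y - mean_y for y in phenotypes]
    n_variants = len(genotypes[0])
    return [
        sum(row[j] * r for row, r in zip(genotypes, residuals))
        for j in range(n_variants)
    ]


def skat_statistic(scores, weights):
    """Weighted sum of squared scores: opposite effect directions do not cancel."""
    return sum((w * s) ** 2 for w, s in zip(weights, scores))


def burden_statistic(scores, weights):
    """Square of the weighted sum of scores: opposite effect directions cancel."""
    return sum(w * s for w, s in zip(weights, scores)) ** 2


# Toy data (hypothetical): rows are individuals, columns are two rare variants.
# Carriers of variant 1 have a raised phenotype, carriers of variant 2 a lowered one.
genotypes = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 0], [0, 0]]
phenotypes = [2.0, 2.0, 0.0, 0.0, 1.0, 1.0]
weights = [1.0, 1.0]

scores = score_statistics(genotypes, phenotypes)
print("scores:", scores)
print("SKAT-style statistic:", skat_statistic(scores, weights))
print("burden-style statistic:", burden_statistic(scores, weights))
```

In this toy set the two variants push the phenotype in opposite directions, so the burden-style statistic collapses to zero while the SKAT-style statistic stays positive. In practice the weights usually depend on minor allele frequency so that rarer variants count more.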

Coming back to the conferences, here is a snippet of what some institutes are doing. At ASHG 2012, quite a few sessions revolved around rare variants and other findings from the NHLBI Exome Sequencing Project. The NHLBI GO ESP is a dataset characterizing multiple samples of richly phenotyped populations, making this endeavor a variation of the 1000 Genomes Project in many ways. Some sessions, like those listed in the table below, highlighted the idea of accelerated gene discovery for complex traits using NGS:

A GenomeWeb article from ASHG 2012 covers the details of the Rucphen population study and mentions the other ongoing studies that were in the session:

Institution | Isolated Population
University of Miami | Midwestern Amish populations
University of Maryland | Amish populations in Pennsylvania
Cittadella Univ. di Monserrato, Italy | Sardinian population
Erasmus Medical Center Rotterdam | Rucphen population

While others focussed on the approaches for testing these rare variant datasets:

Institution | Approach
Baylor College of Medicine & Fred Hutchinson Cancer Research Center | Testing for rare variant associations in the presence of missing data
Baylor College of Medicine & University of Washington | Rare Variant Extensions of the Transmission Disequilibrium Test Detects Associations with Autism Exome Sequence Data
Johannes Kepler University, Austria | Associating complex traits with rare variants identified by NGS: improving power by a position-dependent kernel approach

So will we step into this personalized genome era someday soon? I’d hate for this bubble to burst, so I hope the CAP & CLIA certifications are done, the clinical labs are all set, and the public health and data privacy issues are in check. Then, as we breathe into a no-nonsense world and look at our detailed genome analysis report, we can hopefully feel it’s more or less uniquely ours, or better still, closer to our parents’ than our neighbours’.

The New Genomes of 2012

From the first large genome (the goat’s) to be sequenced and assembled de novo using whole-genome mapping technology, published almost a week ago, to around 1,000 rice genomes sequenced at 1x coverage, a lot has happened at the Whole Genome Sequencing level.

Roche/454 is still being used and so is Sanger; not everyone has moved on to Illumina, as you’ll see in the table below. While clinical sequencing aims for 100x coverage, most of the research sequencing seems to be lower. In these papers you’re sure to spot BGI, sometimes in partnerships, as with the University of Copenhagen for the bat paper and their Illumina data.

As this year wraps up, Nextgenseek.com provided food for thought in its ‘The class of 2012’ post. Here’s a snapshot of some quirky points/details that caught my eye in these ‘Whole Genome Sequencing’ papers.

Let’s see what more genomes and sequencing technologies await us in 2013!

Organism | Sequencing | Usefulness | Size & Annotation | Software | Interesting Findings | Scope
Capra hircus; 3yr old female Yunnan black goat | WGS: Illumina GA IIx; 12x coverage | For the first time whole-genome mapping technology was used for de novo assembly of large genomes; investigate genetic basis of complex traits, in transgene production of peptide medicines | ~2.66-Gb genome sequence; 22,175 protein-coding genes | Contigs & scaffolds assembled using SOAPdenovo (Release 1.05) and ABYSS; Whole genome mapping: Genome-Builder; Annotation: GLEAN | 51 genes are differentially expressed in primary and secondary follicles of cashmere goat | Markers for breeding better cashmere goats can be identified and/or potential targets for genetic or nongenetic manipulation
Female western lowland gorilla named Kamilah (San Diego Zoo) | WGS: Hybrid de novo assembly combining 5.4 Gb Sanger & 166.1 Gb of Illumina short reads; 57.4x coverage | Closest living relatives after chimpanzees will aid in study of human evolution | ~3 Gb genome sequence; 20,962 protein-coding genes | Assembler: ABYSS, Phusion assembler, Maq, Velvet; Chromosomal AGP files: LASTZ; Annotation: ENSEMBL | 30% of the gorilla genome is closer to human, though this is rarer around coding genes | Deeper understanding of great ape biology and evolution
Pan paniscus; female bonobo individual Ulindi (Leipzig Zoo) | WGS: 454/Roche; 23x coverage (additionally 19 bonobo & chimpanzee genomes on Illumina GAIIx) | Compared it to genomes of chimpanzees and humans to study its evolutionary relationship | 2.7 Gb genome sequence | Assembler: Celera Assembler software | 25% of human genes contain parts that are more closely related to one of the two apes than the other | Illuminate population history and selective events that affected evolution
Solanum lycopersicum; inbred tomato cultivar ‘Heinz 1706’ | WGS: Combination of 21 Gb of Roche/454 Titanium & 3.3 Gb of Sanger; 27x coverage | Compared it with closest wild relative, Solanum pimpinellifolium, and potato genome (Solanum tuberosum) | 900 Mb genome size; 34,727 protein-coding genes | Assembler: Newbler, CABOG; Annotation based on EuGene pipeline | Tomato genome more than 8% divergence from potato; 18,320 orthologous gene pairs; triplications | Comparing gene family evolution & understanding bottlenecks that have narrowed tomato genetic diversity
Denisovan, an extinct relative of Neandertals (& 11 present day individuals) | WGS; single-stranded library preparation method; Illumina GA IIx; 31x coverage | DNA library preparation method to reconstruct a high-coverage (30×) genome sequence | ~1.86 Gb genome size; coverage was not biased toward GC-rich sequences | Illumina Genome Analyzer RTA 1.6 software, mapped to reference using BWA, GATK (also realignment) | Denisovans share more alleles with east Asian & South American populations (Dai, Han, and Karitiana) than with European populations (French and Sardinian) | Determine how modern humans expanded in population size & cultural complexity while archaic humans became physically extinct
Musa acuminata (banana) | WGS; 27.5 million Roche/454 single reads and 2.1 million Sanger reads (50× of Illumina data used to correct sequence errors) | Served as a stepping-stone to finding non-coding sequences conserved beyond monocotyledons | 523-Mb genome size; 36,542 protein-coding genes | Assembler: Newbler; Aligner: BLAT | Comparison of Musa, rice, sorghum, Brachypodium, date palm and Arabidopsis proteomes revealed 7,674 gene clusters common to all six species; whole-genome duplications | Unravel complex genetics & key to identifying genes responsible for agronomic characters, such as fruit quality and pest resistance
Rice: 1,083 O. sativa accessions & 446 O. rufipogon accessions (China, Japan) | WGS; Illumina GA IIx 73-bp paired-end reads; O. sativa: 1x coverage; O. rufipogon: 2x coverage | Identified 55 selective sweeps that have occurred during domestication | (not listed) | Aligner: Smalt; SNP caller: Ssaha Pileup | Insights into how and where rice was likely domesticated & set of domestication sweeps and putative causal genes | Important resource for rice breeders to effectively exploit diverse genetic resources for rice improvement
Female domestic Duroc pig (Sus scrofa) | WGS; hybrid de novo assembly based on sequences from BAC clones; 40x coverage | Comparison with the genomes of wild and domestic pigs from Europe and Asia; identification of putative disease-causing variants can aid the pig to be a biomedical model | ~2.6 Gb genome size; 21,640 protein-coding genes | Contigs assembler: Phrap; Assembler: SOAPdenovo, Cortex; Aligner: BLAT | Genes associated with immune response and olfaction exhibit fast evolution; 112 positions where porcine protein has same amino acid that is implicated in a human disease | Important resource for improvement in livestock species
Bread wheat (Triticum aestivum); Chinese Spring (CS42) | WGS; hexaploid genome; 454/Roche, 5x coverage. SOLiD used for additional sequencing to increase accuracy of homologous SNP identification | Comparison with diploid progenitors and relatives showed overall trend of gene family size reduction in large gene families in wheat. Defined genome-wide catalog of SNPs | 17 Gb genome size; ~94k genes | MetaSim to generate reads; Orthologous genes: OrthoMCL clustering, BLASTX | Assembled gene sequences representing a complete gene set were sequenced. Powerful framework for identifying genes | Identification of extensive genetic variation can provide a resource for accelerating gene discovery and improving this crop
Pacific oyster Crassostrea gigas (inbred female produced by four generations of brother–sister mating) | WGS; 155x Illumina (unable to assemble due to high polymorphism & repetitive sequence); fosmid-pooling strategy: fosmid library 10x; 60x sequencing & assembly | Combination of fosmid pooling, NGS and hierarchical assembly: new, cost-effective alternative for de novo sequencing & assembly of complex genomes | 637 Mb; 28,027 genes | Alignment: LASTZ | Expansion of genes coding for HSP70 & IAPs is probably central to adaptation to sessile life in the highly stressful intertidal zone | Valuable resources for studying molluscan biology and lophotrochozoan evolution
Owl limpet (Lottia gigantea), a marine polychaete (Capitella teleta) and a freshwater leech (Helobdella robusta) | WGS with Sanger dideoxy sequencing reads; 8x coverage | Compare them with other animal genomes to investigate the origin and diversification of bilaterians from a genomic perspective | ~200-300 Mb genome size; ~23,000 to 33,000 protein-coding genes | Orthologs: BLAST | ~8K bilaterian gene families likely from single progenitor genes. 231 putative spiralian-specific gene families, members aligned across all three spiralians, indicating purifying selection | Rate-stratification approach could be used to place problematic taxa when genome data becomes available
Wild-caught bats, fruit bat (P. alecto) and insectivorous (M. davidii) | WGS; Illumina HiSeq 2000; 109x-118x coverage | Identify genetic changes associated with the development of bat-specific traits by comparative analyses of two distantly related bat species | ~2 Gb genome size; protein-coding genes: 21,392 P. alecto & 21,705 M. davidii | Assembler: SOAPdenovo; Aligner: BWA; Repeat annotation: Tandem Repeats Finder; Gene annotation: GLEAN | Genes in DNA damage checkpoint/DNA repair pathway found to be under positive selection. So are COL3A1 (skin elasticity) and CACNA2D1 (muscle contraction). Entire locus of PYHIN gene family (sensing microbial DNA) is lost | Comparison with other species may provide new insights into bat biology and evolution

HuGGV Poster – Analyzing Variants in 1000 Human Genome Data

Variation across chromosomes - variant data 1000 Human Genome Project

It’s been some time since I worked on this poster (almost a semester), but it’s in here before I get myself to work on these variation results and find some answers. It is interesting, though, to see the trends of different variation across different chromosomes. I found chromosome 19 to stand out in most comparisons. I need to dig more to find some plausible explanations for that, and more.