Venn Diagrams Simplified

Not a week goes by without me having to compare multiple lists to evaluate overlap. Re-analyzing any data and comparing that with previously computed data, comparing results from multiple tools or multiple samples. Sometimes the unix commands diff, paste and join come handy, but most times am looking for some additional information.

Almost always, am looking to find out the shared elements of the list, which ones are exclusive to a particular list, and so forth. Many a times, these evaluations end up in simple VENN diagrams to diagrammatically depict the comparison.

The simplest way to do so, with small enough lists, is to use this wonderful tool VENNY. It has a very simple to use interface, allows for pasting upto 4 lists, and creates a very nice visual that one can download as an image. VENNY also allows one to click any region of the figure, to list out the elements in that sub-group.

http://bioinfogp.cnb.csic.es/tools/venny/getPicture.php?labels=List%201_x_5_x_85_x_List%202_x_85_x_55_x_List%203_x_225_x_55_x_List%204_x_320_x_85&numbers=0_x_65_x_190_x_resultC1000_x_0_x_130_x_95_x_resultC0100_x_3_x_245_x_95_x_resultC0010_x_0_x_310_x_190_x_resultC0001_x_0_x_100_x_137_x_resultC1100_x_1_x_117_x_245_x_resultC1010_x_0_x_188_x_295_x_resultC1001_x_0_x_257_x_245_x_resultC0101_x_0_x_188_x_137_x_resultC0110_x_2_x_270_x_137_x_resultC0011_x_0_x_230_x_183_x_resultC0111_x_1_x_161_x_267_x_resultC1011_x_1_x_212_x_267_x_resultC1101_x_1_x_145_x_183_x_resultC1110_x_1_x_188_x_235_x_resultC1111

However, as is the case with Bioinformatics, the lists are most times much larger, and if one has enough grasp of command-line unix and basic perl, a simple script can help compare any such lists.

Here is my simple code that has been used 100s of times over the last few years. Feel free to try, leave a comment, or suggest any improvements or faster/better ways to do this sort of evaluation!

Input: single column lists that need to be compared (if your data is multiple columns, use something like `cut -f1 fileName > tmp1` in unix to cut out the first (or whichever using -f) column from the file)

Output: 2-column file, the first column providing a key, the 2nd column listing the actual element from the lists. (Using a pipe ‘|’ with cut -f1 | sort | uniq -c gives the count of each group. In the script below, using 2 files

key=1 => element is unique to the first list / file
key=2 => element is unique to the second list / file
key=3 => element is shared by both the lists / files

## Script to take 2 files, and report venn diagram numbers for an image
## Usage: $ perl 2-way-venn.pl w1 w2 | cut -f1 | sort | uniq -c
use strict;
use warnings;
my %hash;
open(DAT, $ARGV[0]); ##1st file
while (my $l = <DAT>) {
chomp $l;
$hash{$l} = 1; ## in set1
}
close DAT;
open(DAT, $ARGV[1]); ##2nd file
while (my $l = <DAT>) {
chomp $l;
if (defined $hash{$l} && $hash{$l} == 1) {
$hash{$l} = 3; ##Common in sets 1 & 2
} else {$hash{$l} = 2;} ## Set 2, not set 1
}
close DAT;
foreach my $k (keys %hash) {
print "$hash{$k}\t$k\n";
}
exit;

Similarly, a small tweak allows this script to be used on 3 lists. The scripts run pretty fast for everyday comparisons. One caveat is to use clean data, make sure there are no extra spaces or quotes on some lists, which can really mess up the numbers.

## Script to take 3 files, and report venn diagram numbers for an image
## Usage: $ perl 3-way-venn.pl w1 w2 w3 | cut -f1 | sort | uniq -c

## Key: 1 => 1-only, 2 => 2-only, 3 => 1 & 2, 4 => 3-only, 5 => 1 & 3
## 6 => 1 & 2 & 3, 7 => 2 & 3
use strict;
use warnings;
my %hash;
open(DAT, $ARGV[0]); ##1st file
while (my $l = <DAT>) {
chomp $l;
$hash{$l} = 1; ## in set1
}
close DAT;
open(DAT, $ARGV[1]); ##2nd file
while (my $l = <DAT>) {
chomp $l;
if (defined $hash{$l} && $hash{$l} == 1) {
$hash{$l} = 3; ##Common in sets 1 & 2
} else {$hash{$l} = 2;} ## Set 2, not set 1
}
close DAT;
if (defined $ARGV[2]) {
open(DAT, $ARGV[2]); ##3rd file
while (my $l = <DAT>) {
chomp $l;
if (!defined $hash{$l}) {$hash{$l} = 4;} ## Set 3 only
elsif ($hash{$l} == 1) {
$hash{$l} = 5; ## sets 1 & 3, not 2
} elsif ($hash{$l} == 2) {
$hash{$l} = 7; ## sets 2 & 3, not 1
} elsif ($hash{$l} == 3) {
$hash{$l} = 6; ## sets 1 & 2 & 3
}
}
close DAT;
}
if (defined $ARGV[3]) {
open(DAT, $ARGV[3]); ##3rd file
while (my $l = ) {
chomp $l;
if ($hash{$l} == 1) {
$hash{$l} = 8; ## sets 1 & 4 only
} elsif ($hash{$l} == 2) {
$hash{$l} = 9; ## sets 2 & 4 only
} elsif ($hash{$l} == 3) {
$hash{$l} = 10; ## sets 1 & 2 & 4
} elsif ($hash{$l} == 4) {
$hash{$l} = 11; ## Set 3 & 4 only
} elsif ($hash{$l} == 5) {
$hash{$l} = 12; ## Set 1 & 3 & 4 only
} elsif ($hash{$l} == 6) {
$hash{$l} = 13; ## Set 1 & 2 & 3 & 4
} elsif ($hash{$l} == 7) {
$hash{$l} = 14; ## Set 1 & 2 & 4 only
} else {$hash{$l} = 15;} ## Set 4 only
}
close DAT;
}
foreach my $k (keys %hash) {
print "$hash{$k}\t$k\n";
}
exit;

UPDATE: There is this useful R package VennDIagram pointed by @genetics_blog who BTW has an amazing bioinformatics blog: Getting Genetics Done

R Line Plots – the easiest fastest plot ever

What you need:

  • Data
  • R
  • Eager Bioinformatician

For data I used a file with 7 columns where the first column was a counter and the other columns (2-7) had different values that I would like to compare across using a line plot to visualize the variation in my data points

Serves: As many data points you would like to extend it to

Time: Once you have a parsed dataset, this is fast: you wouldn’t want to blink

Plotting lines with a legend 🙂

> x <- read.table(“filename”);

> png (“filename.png”)

> plot (x[,2], type=”l”, col = “steelblue”, ylab=”Heading”)

> lines (x[,3], col = “pink”)

> lines (x[,4], col = “cyan”)

> lines (x[,5], col = “magenta”)

> lines (x[,6], col = “green3”)

> lines (x[,7], col = “blue”)

> legend(11000, 2, c(“124-100”, “124-50”, “124-20”, “124-10”, “124-5”, “124-2”), lty=c(1,1), lwd=c(2.5,2.5), col=c(“steelblue”,”pink”,”cyan”,”magenta”,”green3″,”blue”))

> dev.off()

Feel free to change/add colors* and serve embedded within your document/ppt!

* You can find a detailed list of color names in R here: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf

HuGGV Poster – Analyzing Variants in 1000 Human Genome Data

Variation accross chromosomes - variant data 1000 Human Genome Project

Been sometime since I worked on this poster (almost a semester) but its in here before I get myself to work on these variation results and find some answers. Though its interesting to see the trends of different variation across different chromosomes. I found chromosome 19 to stand out in most comparisons. Need to dig more to find some plausible explanations for that.. and more.

BLAST reciprocal best hit for identifying Orthologs

I want to identify the derived sites in Arabidopsis thaliana – derived sites being the sites that have been derived in the population after evolution and are different from the sites in the ancestor sequence. Reconstructing the ancestral sequence could have been a way to go but instead of venturing in that domain we, hearing experience’s of people from other labs concluded using Arabidopsis lyrata as the outgroup and a representative of the ancestral sequence.

Before I talk about orthologs and the steps I took to identify them let me briefly layout the whole picture of why we need orthologs. Now if I have a sequence where at pos. 222 I have an A in 90% Arabidopsis thaliana strains, C in 10% and A in Arabidopsis lyrata. C is my derived site in that strain. Simple enough. The next scenario is where I have A in 100% A. thaliana strains while T in A.lyrata which tells me that this is evolutionary divergence between these two species.

The situation gets complicated as the number of derived sites vs divergence between the two species increase or change and there’s a combination of differences I can have at every position.

So I am looking to identify these derived sites and plot them. Why it needs to be plotted and the root cause for doing all this will follow in another post. So coming back to how I identified the orthologs for Arabidopsis thaliana and Arabidopsis lyrata and this is where BLAST comes in.

I started using the reciprocal best hit to identify the coding sequences that were the supposed “orthologs”. Though there are different ways to identify orthologs we went with the fastest method of using BLAST reciprocal hits. The research article on Choosing BLAST options for better detection of orthologs as reciprocal best hits by Latimer K was helpful in determining ground support in favor of reciprocal best hits.

Started running blast commands with different thresholds, like e-value ranging from 1e-3 to 1e-10 and got rid of e-values for the results completely – yup no e-value threshold at all.  Why no e-value threshold? Because I knew we would be selecting only top reciprocal hits, once we identified the % of length alignment – we would any ways not consider a snippet of an alignment for an ortholog, so to grab all the hits I could I got rid of the e-value, though 1e-10 was a good threshold, if I were to do it again.

Moving on I wrote a script to get reciprocal best hits from both A.thaliana and A.lyrata BLAST results. What was interesting to see that some of the results were divided into two HSPs and the “-m 8” option in BLAST would only give part1 HSP as a top hit, next part as the second hit- now I had written my script to pick the top hit, if it was reflected respectively in both A.thaliana and A.lyrata results, reporting different HSP of the same gene as multiple alignemnts skewed my results.

Also I needed to set the filter to false for low complexity filter as some results for were divided into multiple HSPs for one specie while the other had a complete hit. So I re-ran BLAST with (no e-value parameter) low complexity filter set to false and wrote a script to count multiple HSPs for every coding region and report the aligned length for that set of reciprocal best hit.

Then I calculated the % of length aligned by using the alignment length (adding them from all HSPs) and length of A.thaliana coding sequences. Finally I came up with a file with where I had all the HSPs counted for the reciprocal hits, with their lengths added and % length calculated to see how well the alignment was. Time to make decisions where we want to cut off and consider hits above what threshold as an ortholog and chuck out other sequences because of poor alignment even though they were a reciprocal hit.