Venn Diagrams Simplified

Not a week goes by without me having to compare multiple lists to evaluate overlap: re-analyzing data and comparing it with previously computed results, or comparing results from multiple tools or multiple samples. Sometimes the Unix commands diff, paste and join come in handy, but most of the time I am looking for some additional information.

Almost always, I am looking to find out which elements the lists share, which ones are exclusive to a particular list, and so forth. Many times, these evaluations end up as simple Venn diagrams that depict the comparison visually.

The simplest way to do so, with small enough lists, is to use the wonderful tool VENNY. It has a very simple interface, allows pasting up to 4 lists, and creates a very nice visual that one can download as an image. VENNY also allows one to click any region of the figure to list the elements in that sub-group.

http://bioinfogp.cnb.csic.es/tools/venny/getPicture.php?labels=List%201_x_5_x_85_x_List%202_x_85_x_55_x_List%203_x_225_x_55_x_List%204_x_320_x_85&numbers=0_x_65_x_190_x_resultC1000_x_0_x_130_x_95_x_resultC0100_x_3_x_245_x_95_x_resultC0010_x_0_x_310_x_190_x_resultC0001_x_0_x_100_x_137_x_resultC1100_x_1_x_117_x_245_x_resultC1010_x_0_x_188_x_295_x_resultC1001_x_0_x_257_x_245_x_resultC0101_x_0_x_188_x_137_x_resultC0110_x_2_x_270_x_137_x_resultC0011_x_0_x_230_x_183_x_resultC0111_x_1_x_161_x_267_x_resultC1011_x_1_x_212_x_267_x_resultC1101_x_1_x_145_x_183_x_resultC1110_x_1_x_188_x_235_x_resultC1111

However, as is often the case in bioinformatics, the lists are usually much larger, and if one has a reasonable grasp of the Unix command line and basic Perl, a simple script can compare any such lists.

Here is my simple code, which I have used hundreds of times over the last few years. Feel free to try it, leave a comment, or suggest any improvements or faster/better ways to do this sort of evaluation!

Input: single-column lists that need to be compared (if your data has multiple columns, use something like `cut -f1 fileName > tmp1` in Unix to cut out the first column, or whichever column you select with -f, from the file).

Output: a 2-column file, the first column providing a key and the second column listing the actual element from the lists. (Piping the output through `cut -f1 | sort | uniq -c` gives the count of each group.) In the script below, which takes 2 files:

key=1 => element is unique to the first list / file
key=2 => element is unique to the second list / file
key=3 => element is shared by both the lists / files

## Script to take 2 files, and report venn diagram numbers for an image
## Usage: $ perl 2-way-venn.pl w1 w2 | cut -f1 | sort | uniq -c
use strict;
use warnings;
my %hash;
open(DAT, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!\n"; ## 1st file
while (my $l = <DAT>) {
    chomp $l;
    $hash{$l} = 1; ## in set 1
}
close DAT;
open(DAT, '<', $ARGV[1]) or die "Cannot open $ARGV[1]: $!\n"; ## 2nd file
while (my $l = <DAT>) {
    chomp $l;
    if (!defined $hash{$l}) {
        $hash{$l} = 2; ## set 2, not set 1
    } elsif ($hash{$l} == 1) {
        $hash{$l} = 3; ## common to sets 1 & 2
    } ## keys 2 and 3 are left alone, so repeated lines in file 2 do not reset them
}
close DAT;
foreach my $k (keys %hash) {
    print "$hash{$k}\t$k\n";
}
exit;
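
As a quick cross-check of the counts from the 2-way script, the same three groups can be computed with the standard sort and comm commands. This is only a sketch, not part of the original script: the function name two_way_counts is made up for illustration, and it assumes single-column input files. It prints the counts using the same keys (1, 2, 3) as above.

```shell
# Hypothetical helper (not from the original post): two_way_counts LIST1 LIST2
# Prints "key count" lines with the same keys as the Perl script:
#   1 = only in LIST1, 2 = only in LIST2, 3 = in both lists.
two_way_counts() {
    tmpdir=$(mktemp -d) || return 1
    # comm requires sorted input; sort -u also collapses duplicate lines
    LC_ALL=C sort -u "$1" > "$tmpdir/a"
    LC_ALL=C sort -u "$2" > "$tmpdir/b"
    # comm -23: lines only in a; -13: lines only in b; -12: lines in both
    echo "1 $(LC_ALL=C comm -23 "$tmpdir/a" "$tmpdir/b" | wc -l | tr -d ' ')"
    echo "2 $(LC_ALL=C comm -13 "$tmpdir/a" "$tmpdir/b" | wc -l | tr -d ' ')"
    echo "3 $(LC_ALL=C comm -12 "$tmpdir/a" "$tmpdir/b" | wc -l | tr -d ' ')"
    rm -rf "$tmpdir"
}
```

Calling `two_way_counts w1 w2` should give the same numbers as `perl 2-way-venn.pl w1 w2 | cut -f1 | sort | uniq -c`, provided both read the same cleaned files.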

Similarly, a small tweak allows this script to be used on 3 lists. The scripts run pretty fast for everyday comparisons. One caveat is to use clean data: make sure there are no extra spaces or quotes in any of the lists, which can really mess up the numbers.
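
One possible way to clean a list before comparing it is sketched here. The function name clean_list is hypothetical, and the cleaning rules are my assumptions about what "clean" means: drop Windows carriage returns, trim leading and trailing whitespace, strip quote characters wrapped around an element, and remove empty and duplicate lines. Adjust them if your identifiers legitimately start or end with quotes.

```shell
# Hypothetical helper (not from the original post): clean_list FILE
# Writes a cleaned, de-duplicated copy of a single-column list to stdout.
clean_list() {
    tr -d '\r' < "$1" |                       # drop carriage returns
        sed -e 's/^[[:space:]]*//' \
            -e 's/[[:space:]]*$//' \
            -e "s/^[\"']*//" \
            -e "s/[\"']*\$//" |               # trim whitespace, then surrounding quotes
        grep -v '^$' |                        # drop empty lines
        LC_ALL=C sort -u                      # sort and remove duplicates
}
```

For example, `clean_list w1 > w1.clean` writes a cleaned copy that can then be passed to the scripts.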

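## Script to take 3 files (optionally a 4th), and report venn diagram numbers for an image
## Usage: $ perl 3-way-venn.pl w1 w2 w3 | cut -f1 | sort | uniq -c

## Key: 1 => 1-only, 2 => 2-only, 3 => 1 & 2, 4 => 3-only, 5 => 1 & 3
## 6 => 1 & 2 & 3, 7 => 2 & 3
## With a 4th file: 8 => 1 & 4, 9 => 2 & 4, 10 => 1 & 2 & 4, 11 => 3 & 4,
## 12 => 1 & 3 & 4, 13 => 1 & 2 & 3 & 4, 14 => 2 & 3 & 4, 15 => 4-only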
use strict;
use warnings;
my %hash;
open(DAT, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!\n"; ## 1st file
while (my $l = <DAT>) {
    chomp $l;
    $hash{$l} = 1; ## in set 1
}
close DAT;
open(DAT, '<', $ARGV[1]) or die "Cannot open $ARGV[1]: $!\n"; ## 2nd file
while (my $l = <DAT>) {
    chomp $l;
    if (!defined $hash{$l}) {
        $hash{$l} = 2; ## set 2, not set 1
    } elsif ($hash{$l} == 1) {
        $hash{$l} = 3; ## common to sets 1 & 2
    } ## keys 2 and 3 are left alone, so repeated lines in file 2 do not reset them
}
close DAT;
if (defined $ARGV[2]) {
    open(DAT, '<', $ARGV[2]) or die "Cannot open $ARGV[2]: $!\n"; ## 3rd file
    while (my $l = <DAT>) {
        chomp $l;
        if (!defined $hash{$l}) {
            $hash{$l} = 4; ## set 3 only
        } elsif ($hash{$l} == 1) {
            $hash{$l} = 5; ## sets 1 & 3, not 2
        } elsif ($hash{$l} == 2) {
            $hash{$l} = 7; ## sets 2 & 3, not 1
        } elsif ($hash{$l} == 3) {
            $hash{$l} = 6; ## sets 1 & 2 & 3
        }
    }
    close DAT;
}
if (defined $ARGV[3]) {
    open(DAT, '<', $ARGV[3]) or die "Cannot open $ARGV[3]: $!\n"; ## 4th file
    while (my $l = <DAT>) {
        chomp $l;
        if (!defined $hash{$l}) {
            $hash{$l} = 15; ## set 4 only
        } elsif ($hash{$l} == 1) {
            $hash{$l} = 8; ## sets 1 & 4 only
        } elsif ($hash{$l} == 2) {
            $hash{$l} = 9; ## sets 2 & 4 only
        } elsif ($hash{$l} == 3) {
            $hash{$l} = 10; ## sets 1 & 2 & 4
        } elsif ($hash{$l} == 4) {
            $hash{$l} = 11; ## sets 3 & 4 only
        } elsif ($hash{$l} == 5) {
            $hash{$l} = 12; ## sets 1 & 3 & 4
        } elsif ($hash{$l} == 6) {
            $hash{$l} = 13; ## sets 1 & 2 & 3 & 4
        } elsif ($hash{$l} == 7) {
            $hash{$l} = 14; ## sets 2 & 3 & 4
        } ## keys 8-15 are left alone, so repeated lines in file 4 do not reset them
    }
    close DAT;
}
foreach my $k (keys %hash) {
    print "$hash{$k}\t$k\n";
}
exit;
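
For more lists than the scripts above handle, one alternative (not what the Perl scripts do) is to give each list its own bit value (1, 2, 4, 8, ...) and add up the values of the lists an element appears in. The sketch below does this with a shell loop and awk; the function name venn_keys is made up for illustration, and it assumes single-column input. Note that the resulting keys differ from the hand-assigned ones above: with bit values, 6 means lists 2 & 3 and 7 means lists 1 & 2 & 3.

```shell
# Hypothetical helper (not from the original post): venn_keys FILE1 [FILE2 ...]
# Prints "key<TAB>element", where key is the sum of the bit values
# (1 for FILE1, 2 for FILE2, 4 for FILE3, ...) of the lists containing the element.
venn_keys() {
    bit=1
    for f in "$@"; do
        # tag every line of this list with the list's bit value
        awk -v bit="$bit" '{ print bit "\t" $0 }' "$f"
        bit=$((bit * 2))
    done |
    awk -F '\t' '
        { elem = substr($0, length($1) + 2) }       # everything after the first tab
        !seen[$1, elem]++ { mask[elem] += $1 }      # count each element once per list
        END { for (e in mask) printf "%d\t%s\n", mask[e], e }
    '
}
```

As with the Perl scripts, `venn_keys w1 w2 w3 | cut -f1 | sort | uniq -c` gives the count for each region of the diagram.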

UPDATE: There is also a useful R package, VennDiagram, pointed out by @genetics_blog, who BTW has an amazing bioinformatics blog: Getting Genetics Done.