BLAST reciprocal best hit for identifying Orthologs

I want to identify the derived sites in Arabidopsis thaliana – derived sites being the sites that have been derived in the population after evolution and are different from the sites in the ancestor sequence. Reconstructing the ancestral sequence could have been a way to go but instead of venturing in that domain we, hearing experience’s of people from other labs concluded using Arabidopsis lyrata as the outgroup and a representative of the ancestral sequence.

Before I talk about orthologs and the steps I took to identify them let me briefly layout the whole picture of why we need orthologs. Now if I have a sequence where at pos. 222 I have an A in 90% Arabidopsis thaliana strains, C in 10% and A in Arabidopsis lyrata. C is my derived site in that strain. Simple enough. The next scenario is where I have A in 100% A. thaliana strains while T in A.lyrata which tells me that this is evolutionary divergence between these two species.

The situation gets complicated as the number of derived sites vs divergence between the two species increase or change and there’s a combination of differences I can have at every position.

So I am looking to identify these derived sites and plot them. Why it needs to be plotted and the root cause for doing all this will follow in another post. So coming back to how I identified the orthologs for Arabidopsis thaliana and Arabidopsis lyrata and this is where BLAST comes in.

I started using the reciprocal best hit to identify the coding sequences that were the supposed “orthologs”. Though there are different ways to identify orthologs we went with the fastest method of using BLAST reciprocal hits. The research article on Choosing BLAST options for better detection of orthologs as reciprocal best hits by Latimer K was helpful in determining ground support in favor of reciprocal best hits.

Started running blast commands with different thresholds, like e-value ranging from 1e-3 to 1e-10 and got rid of e-values for the results completely – yup no e-value threshold at all.  Why no e-value threshold? Because I knew we would be selecting only top reciprocal hits, once we identified the % of length alignment – we would any ways not consider a snippet of an alignment for an ortholog, so to grab all the hits I could I got rid of the e-value, though 1e-10 was a good threshold, if I were to do it again.

Moving on I wrote a script to get reciprocal best hits from both A.thaliana and A.lyrata BLAST results. What was interesting to see that some of the results were divided into two HSPs and the “-m 8” option in BLAST would only give part1 HSP as a top hit, next part as the second hit- now I had written my script to pick the top hit, if it was reflected respectively in both A.thaliana and A.lyrata results, reporting different HSP of the same gene as multiple alignemnts skewed my results.

Also I needed to set the filter to false for low complexity filter as some results for were divided into multiple HSPs for one specie while the other had a complete hit. So I re-ran BLAST with (no e-value parameter) low complexity filter set to false and wrote a script to count multiple HSPs for every coding region and report the aligned length for that set of reciprocal best hit.

Then I calculated the % of length aligned by using the alignment length (adding them from all HSPs) and length of A.thaliana coding sequences. Finally I came up with a file with where I had all the HSPs counted for the reciprocal hits, with their lengths added and % length calculated to see how well the alignment was. Time to make decisions where we want to cut off and consider hits above what threshold as an ortholog and chuck out other sequences because of poor alignment even though they were a reciprocal hit.


3 thoughts on “BLAST reciprocal best hit for identifying Orthologs

    1. InParanoid does not have orthologs detected for these two species published and i did not venture into installing their program to do that, though these sequences and 30 other A.thaliana accessions were aligned globally using ClustalW

  1. Its like you read my mind! You appear to know a lot about this, like you wrote
    the book in it or something. I think that you could
    do with a few pics to drive the message home a bit, but instead of that, this is wonderful blog.
    An excellent read. I will definitely be back.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s