I want to identify the derived sites in Arabidopsis thaliana – derived sites being the sites that have been derived in the population after evolution and are different from the sites in the ancestor sequence. Reconstructing the ancestral sequence could have been a way to go but instead of venturing in that domain we, hearing experience’s of people from other labs concluded using Arabidopsis lyrata as the outgroup and a representative of the ancestral sequence.
Before I talk about orthologs and the steps I took to identify them let me briefly layout the whole picture of why we need orthologs. Now if I have a sequence where at pos. 222 I have an A in 90% Arabidopsis thaliana strains, C in 10% and A in Arabidopsis lyrata. C is my derived site in that strain. Simple enough. The next scenario is where I have A in 100% A. thaliana strains while T in A.lyrata which tells me that this is evolutionary divergence between these two species.
The situation gets complicated as the number of derived sites vs divergence between the two species increase or change and there’s a combination of differences I can have at every position.
So I am looking to identify these derived sites and plot them. Why it needs to be plotted and the root cause for doing all this will follow in another post. So coming back to how I identified the orthologs for Arabidopsis thaliana and Arabidopsis lyrata and this is where BLAST comes in.
I started using the reciprocal best hit to identify the coding sequences that were the supposed “orthologs”. Though there are different ways to identify orthologs we went with the fastest method of using BLAST reciprocal hits. The research article on Choosing BLAST options for better detection of orthologs as reciprocal best hits by Latimer K was helpful in determining ground support in favor of reciprocal best hits.
Started running blast commands with different thresholds, like e-value ranging from 1e-3 to 1e-10 and got rid of e-values for the results completely – yup no e-value threshold at all. Why no e-value threshold? Because I knew we would be selecting only top reciprocal hits, once we identified the % of length alignment – we would any ways not consider a snippet of an alignment for an ortholog, so to grab all the hits I could I got rid of the e-value, though 1e-10 was a good threshold, if I were to do it again.
Moving on I wrote a script to get reciprocal best hits from both A.thaliana and A.lyrata BLAST results. What was interesting to see that some of the results were divided into two HSPs and the “-m 8” option in BLAST would only give part1 HSP as a top hit, next part as the second hit- now I had written my script to pick the top hit, if it was reflected respectively in both A.thaliana and A.lyrata results, reporting different HSP of the same gene as multiple alignemnts skewed my results.
Also I needed to set the filter to false for low complexity filter as some results for were divided into multiple HSPs for one specie while the other had a complete hit. So I re-ran BLAST with (no e-value parameter) low complexity filter set to false and wrote a script to count multiple HSPs for every coding region and report the aligned length for that set of reciprocal best hit.
Then I calculated the % of length aligned by using the alignment length (adding them from all HSPs) and length of A.thaliana coding sequences. Finally I came up with a file with where I had all the HSPs counted for the reciprocal hits, with their lengths added and % length calculated to see how well the alignment was. Time to make decisions where we want to cut off and consider hits above what threshold as an ortholog and chuck out other sequences because of poor alignment even though they were a reciprocal hit.