[BioC] pairwise alignment and homology score

Hervé Pagès hpages at fhcrc.org
Thu Jan 12 01:20:28 CET 2012


Hi,

On 01/10/2012 07:02 AM, erikafalisi at tin.it wrote:
>
> Dear List,
> we are trying to compare more than 600 amino acid sequences in clustal format (or FASTA format too) with BLOSUM62 matrix to achieve homology score. We used pairwise alignment but we aren't able to compare our sequences with BLOSUM matrix because pairwise command is used to match only two strings (and not a set of sequences) with the matrix. How can we iterate these command to perform all sequences alignments?

Assuming you've managed to load your 600 sequences in a AAStringSet
object 'myseqs', have you tried to do something along the lines of:

   scores <- matrix(integer(0), nrow=length(myseqs), ncol=length(myseqs))
   for (i in 1:length(myseqs))
     for (j in 1:i)
       scores[i, j] <- pairwiseAlignment(myseqs[i], myseqs[j],
                                         substitutionMatrix="BLOSUM62",
                                         scoreOnly=TRUE)

One problem that arises is that the input of a clustering tool like
hclust() is expected to be a 'dist' object i.e. a matrix of distances,
not a matrix of scores. Since, unlike with the distance, the highest
the score, the "closest" the sequences are, using scores as the input
of hclust() is guaranteed to produce wrong results.

I'm not sure what's the best way to transform pairwise scores into
pairwise distances though (I'm not an expert on the topic) but it
seems to me that the latter could be inferred from the former with
something like:

   dist(x, y) <- 1 - score(x, y)^2 / (score(x, x) * score(y, y))

The distance defined above looks like it could be close enough to a
true distance (in the mathematical sense) but that's pure speculation
since I don't really have a proof I could show you (the tricky part
being to prove it satisfies the triangle inequality). Anyway, as far
as hclust() is concerned, I'm not sure it matters. According to the
man page, the first argument to hclust() only needs to be
"a dissimilarity structure as produced by ‘dist’". Nothing is said
about it being a true distance.

So:

   dist <- 1 - scores^2 / (diag(scores) %*% t(diag(scores)))

Finally, plot the dendogram with:

   plot(hclust(as.dist(dist)))

> We also performed Multiple Alignment but we obtain a cramped dendogram. Is it possible to modify dendogram's sizes to make it readable?

What's a "cramped" dendogram? Not sure about modifying the size of the
dendogram. I suggest you look at the man page for hclust() and in
particular at the plot method for class 'hclust' (it has a lot of args).
Others on the list might have better suggestions about producing
nice dendogram plots.

Cheers,
H.

> Thanks,
> Dr Falisi
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list