[R] similarity matrix conversion to dissimilarity

Thu Dec 9 00:10:55 CET 2004

[replying to your personal address as well as the list; but
 I think you should subscribe to the list since this topic
 may well be pursued further]

On 08-Dec-04 Dr. Thomas Isenbarger wrote:
> I have a matrix of similarity scores that I want to convert into a 
> matrix of dissimilarity scores so that I can apply some clustering 
> methods to the data.  That is, high values in my matrix signify 
> similarity and low values (zero being the lowest) signify no 
> similarity.  What functions/options in R or its packages are available 
> for making this kind of transformation of a matrix?
> 
> Specifically, I am a molecular biologist.  I have a set of 700+ 
> nucleotide sequences I want to group into clusters based on sequence 
> similarities.  There is a wide range of sequences in the set, some of 
> which are homologous to other sequences in the set.  I want to use 
> clustering to identify these groups.
> 
> If the sequences were related and could be trimmed to the same length, I
> would do an alignment and then use phylip (or some other distance 
> method) to create a distance matrix, but since my sequences are 
> unrelated and cannot be trimmed to the same length, I am at a loss for 
> what to do.
> 
> For a set with so many unrelated sequences of different lengths, the 
> only thing I have been able to do is an all-against-all BLAST to create 
> the matrix, but this gives high scores for similarities, not high 
> scores for dissimilarities.  The only thought I had was to use the 
> reciprocal of the BLAST score as some perverse measure of distance.
> 
> I am not subscribed to the list, so can I ask for responses directly to
> my email address?

Clearly any function which "inverts" the measure of "similarity"
(i.e. decreases as "similarity" increases) could be used as a
measure of dissimilarity in general. Indeed you imply as much
yourself. There is quite a wide choice ... "reciprocal" could be one.
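
As a minimal sketch of what "quite a wide choice" can look like
in R: the matrix S below is a made-up 4 x 4 example (not real
BLAST output), and the names D_max, D_recip and D_scale are
purely illustrative. Base R's as.dist() converts a square matrix
to a "dist" object, which hclust() accepts.

```r
# Hypothetical symmetric similarity matrix (invented scores)
S <- matrix(c(100,  80,  10,   5,
               80, 100,  12,   8,
               10,  12, 100,  60,
                5,   8,  60, 100),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("seq", 1:4), paste0("seq", 1:4)))

# Three decreasing functions of similarity
D_max   <- max(S) - S       # subtract from the largest score
D_recip <- 1 / S            # reciprocal; gives Inf wherever S is 0
D_scale <- 1 - S / max(S)   # rescaled to lie between 0 and 1

# as.dist() uses the lower triangle and ignores the diagonal
d  <- as.dist(D_max)
hc <- hclust(d, method = "average")

# Cut the tree into two groups
groups <- cutree(hc, k = 2)
```

Note that the reciprocal breaks down for pairs with a score of
zero, so with real data one of the other two forms may be easier
to work with.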

However, reading between your lines, it seems that you do
not have a substantive interpretation for "dissimilarity".
Yet apparently you have one for "similarity". Otherwise, on
what basis do you claim that your similarity matrix expresses
*substantive* similarity?

But, if you can attach an interpretation (in some substantive
terms) to your measure of similarity, can you not then negate
the propositions that this expresses and obtain a measure of
dissimilarity? In that case, the function could be programmed
in R (though it may not be a function of your "similarity", and
you would need to derive it from the data).

If not, why not? Or, if your measure of "similarity" in fact
does not carry a substantive interpretation, then one could
assert that any decreasing function of "similarity" could
be used, and would be as meaningful as your measure of
"similarity". Again, this can be programmed in R.
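
One way to see how much (or how little) the choice of decreasing
function matters is sketched below, again on an invented matrix S
with illustrative names. Single-linkage and complete-linkage
clustering depend only on the rank order of the dissimilarities,
so two different decreasing transformations should yield the same
grouping; methods such as average linkage carry no such
guarantee.

```r
# Hypothetical symmetric similarity matrix (invented scores)
S <- matrix(c(100,  80,  10,   5,
               80, 100,  12,   8,
               10,  12, 100,  60,
                5,   8,  60, 100),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("seq", 1:4), paste0("seq", 1:4)))

# Two different decreasing functions of the same similarities
d1 <- as.dist(max(S) - S)
d2 <- as.dist(1 / S)

# Both transformations order the pairs in the same way
same_order <- all(rank(d1) == rank(d2))

# Single linkage uses only that ordering
g1 <- cutree(hclust(d1, method = "single"), k = 2)
g2 <- cutree(hclust(d2, method = "single"), k = 2)
same_groups <- identical(g1, g2)
```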

Again reading between your lines, it could be inferred that
in the situation you describe ("unrelated sequences" which
"cannot be trimmed to the same length"), while you can derive
a measure of similarity which matches established concepts
for similarity in your field, you cannot match the concepts
for dissimilarity.

If that is the case, R cannot help you with the conceptual
problem.

This may not appear helpful, but it is a sincere attempt
to clarify the issues.

Best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861  [NB: New number!]
Date: 08-Dec-04                                       Time: 23:10:55
------------------------------ XFMail ------------------------------