[BioC] Re: [S] Error in clustering procedure

Tue Sep 14 20:26:26 CEST 2004

Liaw, Andy wrote:
>>From: cstrato
>>
>>"Dimension reduction" brings up another important issue:
>>I had discussions with quite a few scientists who believe
>>that dimension reduction is not allowed, since you are
>>losing worthwhile information.
> 
> 
> Eh?  By this logic, we shouldn't believe any conclusions drawn in any paper
> that does not contain the rawest of raw data?  Part of data analysis is
> summarizing data into the bare essentials (have you heard of `sufficient
> statistics'? If not, it might be worth your while) and extracting useful
> information from data that contain noise.  People who make statements like
> that probably believe there's no such thing as noise in their data.  May God
> have mercy on them.
>  
I mentioned this only to show that it is still sometimes
hard to argue against this view; mentioning "sufficient
statistics" could be helpful.
> 
>>With respect to gene expression I believe that it makes
>>sense to first filter out non-variant genes to reduce the
>>number of dimensions.
>>
>>But..., these people are using hierarchical clustering
>>to cluster chemical compound libraries in "chemical space",
>>and there are no compounds to eliminate.
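
As an aside on the gene-filtering step mentioned above, here is a
minimal sketch of a variance filter. It is only an illustration:
the function name, the toy data and the cutoff value are all
hypothetical, and in practice the threshold would have to be
chosen for the data at hand.

```python
from statistics import pvariance


def filter_low_variance(expr, min_var):
    """Keep genes whose variance across samples is at least min_var.

    expr maps a gene name to its list of expression values,
    one value per sample (hypothetical layout for this sketch).
    """
    return {gene: values for gene, values in expr.items()
            if pvariance(values) >= min_var}


# Toy data, made up for illustration only.
expr = {
    "geneA": [5.0, 5.0, 5.0, 5.0],  # constant across samples
    "geneB": [1.0, 4.0, 2.0, 9.0],  # clearly varying
    "geneC": [3.0, 3.1, 2.9, 3.0],  # nearly constant
}

kept = filter_low_variance(expr, min_var=0.5)
print(sorted(kept))
```

Only the genes that pass the cutoff would then be handed to the
clustering step, which reduces the number of dimensions.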
> 
> 
> Who are `these people' now?  Seems like you're changing the subject to one
> that's probably off-topic for BioC.
>  
I would not consider this off-topic but a natural extension:
"expression profiling -> compound profiling -> compound
activity profiling -> compound structure profiling"
All these steps share the same problem: What is the best
clustering algorithm to use (if there is any)?
Furthermore, it is my belief that in the future these
steps will be analyzed together, resulting in a much deeper
understanding.

P.S.: Looking at the BioC packages, BioC is already expanding
to include proteomics analysis. It would be a natural step
for BioC to expand further to cover chemoinformatics.

> 
>>So, another question is, which method would be best to
>>cluster about one million compounds in chemical space in
>>order to be able to reduce the number of compounds used in
>>screening by selecting only representative members of
>>each cluster?
> 
> 
> There's quite a bit of work done on this subject in the computational
> chemistry literature.  The context is really quite different from gene
> expression.   Molecules are clustered based on their chemical structures
> (which are known), and those data are not measured (usually), but computed,
> so there are no measurement errors.  The goal is also quite different.  I have
> not heard of anyone trying to find `representative genes' (but I'm not
> familiar with bioinformatics--- maybe someone _would_ be interested in
> that?).
> 
> Andy
>  
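To make the representative-selection question concrete, here is a
minimal sketch of one possible approach. It is not a method
proposed by anyone in this thread: it swaps hierarchical
clustering for a single-pass "leader" clustering, and it assumes
that each compound is described by a binary fingerprint compared
by Tanimoto similarity. The function names, the toy fingerprints
and the threshold are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pick_representatives(fingerprints, threshold):
    """Single-pass leader clustering.

    A compound joins the first representative it is at least
    `threshold`-similar to; otherwise it becomes a new representative.
    Returns {representative name: [member names]}.
    """
    clusters = {}
    for name, fp in fingerprints.items():
        for rep in clusters:
            if tanimoto(fp, fingerprints[rep]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No existing representative is similar enough.
            clusters[name] = [name]
    return clusters


# Toy fingerprints, made up for illustration only.
fingerprints = {
    "cpd1": {1, 2, 3, 4},
    "cpd2": {1, 2, 3, 5},
    "cpd3": {10, 11, 12},
    "cpd4": {10, 11, 12, 13},
    "cpd5": {20, 21},
}

clusters = pick_representatives(fingerprints, threshold=0.5)
print(list(clusters))  # the representatives
```

Each compound is compared only against the current
representatives, so the full pairwise distance matrix that
hierarchical clustering needs (about 5 x 10^11 pairs for one
million compounds) is never built. The price is that the result
depends on the order in which the compounds are visited.
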
Christian
> 
>>Best regards
>>Christian
>>