[BioC] handle clustering and replicated probes in Agilent 4x44K : "philosophical" question?

Francois Pepin fpepin at cs.mcgill.ca
Sat Feb 9 00:52:11 CET 2008

Hi Daniela,

There are replicated probes (same probe id) and then there are genes 
that have several probes.

In the first case, I would simply suggest that you choose one. We 
arbitrarily choose the first one because the replicates' expression is 
basically identical for all of our probes. Averaging them would probably 
be a better way of doing it, but the advantage is likely quite small.
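
As a rough illustration of the two options (first replicate versus 
average), here is a minimal Python sketch. The function name 
collapse_replicates, the probe ids and the values are all made up for the 
example; it assumes each row is a (probe id, values per array) pair.

```python
from collections import defaultdict
from statistics import mean


def collapse_replicates(rows, how="first"):
    """Collapse rows that share a probe id into one row per probe id.

    rows: iterable of (probe_id, [value per array]) pairs (hypothetical layout).
    how:  "first" keeps the first replicate seen, "mean" averages per array.
    """
    grouped = defaultdict(list)
    for probe_id, values in rows:
        grouped[probe_id].append(values)

    collapsed = {}
    for probe_id, replicates in grouped.items():
        if how == "first":
            collapsed[probe_id] = list(replicates[0])
        elif how == "mean":
            # zip(*replicates) walks the replicates array by array
            collapsed[probe_id] = [mean(column) for column in zip(*replicates)]
        else:
            raise ValueError("how must be 'first' or 'mean'")
    return collapsed


# Made-up data: probe "A_1" is spotted twice on the array
rows = [
    ("A_1", [1.0, 2.0]),
    ("A_2", [5.0, 6.0]),
    ("A_1", [3.0, 4.0]),
]
first_only = collapse_replicates(rows, how="first")
averaged = collapse_replicates(rows, how="mean")
```

When the replicates agree closely, as they do for us, the two results 
are nearly the same.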

In the second case, those probes generally behave similarly, but they 
can also give you fairly different expression values. I usually use a 
representative probe when doing a hierarchical clustering. I don't have 
any papers to back me up, but I've found that most distance metrics give 
too much weight to genes with several probes when doing hierarchical 
clustering.
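
To make the weighting point concrete, here is a small Python sketch with 
made-up numbers. It assumes Euclidean distance between two samples and 
that all probes for a gene report the same value; gene_share and the 
gene names are hypothetical.

```python
def gene_share(sample_a, sample_b, probes_per_gene, gene):
    """Fraction of the squared Euclidean distance between two samples
    that is contributed by `gene`, when each gene appears
    probes_per_gene[g] times with identical values (an assumption)."""
    contributions = {
        g: probes_per_gene[g] * (sample_a[g] - sample_b[g]) ** 2
        for g in sample_a
    }
    return contributions[gene] / sum(contributions.values())


# Made-up log-ratios for two samples over three genes
sample_a = {"geneA": 2.0, "geneB": 0.0, "geneC": 0.0}
sample_b = {"geneA": 0.0, "geneB": 1.0, "geneC": 1.0}

# One probe per gene: geneA accounts for 4/6 of the squared distance
share_single = gene_share(
    sample_a, sample_b, {"geneA": 1, "geneB": 1, "geneC": 1}, "geneA"
)

# Three probes for geneA: its share rises to 12/14
share_triple = gene_share(
    sample_a, sample_b, {"geneA": 3, "geneB": 1, "geneC": 1}, "geneA"
)
```

The gene's biology has not changed between the two calls, only the 
number of probes, yet it goes from about 0.67 to about 0.86 of the 
distance.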

If the probes come from a differential expression analysis, choosing the 
probe with the best p-value is reasonable. If you are doing class 
discovery, then you would need to use an unbiased criterion (one that 
does not look at the class labels), such as the variance or 
interquartile range.
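
A minimal Python sketch of picking one representative probe per gene, 
either by an unsupervised score (IQR or variance) or by smallest 
p-value. The function names, gene and probe ids, values and p-values 
are all hypothetical, and the p-values are assumed to come from 
whatever differential expression analysis you already ran.

```python
from statistics import quantiles, variance


def iqr(values):
    """Interquartile range using the default method of statistics.quantiles."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


def pick_by_score(probes, score=iqr):
    """For each gene, keep the probe whose values maximise `score`.

    probes: {gene: {probe_id: [value per array]}} (hypothetical layout).
    The score never looks at class labels, so it suits class discovery.
    """
    return {
        gene: max(candidates, key=lambda pid: score(candidates[pid]))
        for gene, candidates in probes.items()
    }


def pick_by_pvalue(pvalues):
    """For each gene, keep the probe with the smallest p-value.

    pvalues: {gene: {probe_id: p-value}} (hypothetical layout).
    """
    return {
        gene: min(candidates, key=candidates.get)
        for gene, candidates in pvalues.items()
    }


# Made-up expression values over four arrays
probes = {
    "geneX": {"p1": [0.1, 0.2, 0.1, 0.2], "p2": [-2.0, 2.0, -1.5, 1.8]},
    "geneY": {"p3": [1.0, 1.0, 1.1, 0.9]},
}
by_iqr = pick_by_score(probes, score=iqr)
by_var = pick_by_score(probes, score=variance)

# Made-up p-values for the same probes
by_p = pick_by_pvalue({"geneX": {"p1": 0.04, "p2": 0.001}, "geneY": {"p3": 0.2}})
```

Selecting by p-value before a class-discovery clustering would bias the 
clustering toward the classes used in the test, which is why the 
variance or IQR is the safer choice there.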

I hope this helps,


Daniela Marconi wrote:
> Hi everybody,
> I have to come back to the issue of replicated probes in the Agilent 4 x 44K.
> Reading, for example, the answer of Gordon Smyth
> http://article.gmane.org/gmane.science.biology.informatics.conductor/13846/match=agilent+probe+replicates
> I completely agree with him that the replicated probes should be
> treated as independent when doing the analysis to select the
> differentially expressed probes.
> In fact, I think that averaging these probes (as in Feature
> Extraction software and Rosetta Resolver) before performing the
> analysis to identify differentially expressed genes may not be a safe
> solution in general (for example, because of within-array problems).
> Now the question is: after having identified a set of differentially
> expressed probes, let's say that we want to perform a hierarchical
> clustering to "visualize" the differential gene expression profile,
> adding a third class to evaluate the similarity of this new class
> to the profiles of the other two. What should we do with the
> replicates?
> In my opinion, the implicit constraint of this approach is that it
> introduces a "literature bias", because the replicated genes are those
> that are best known in the literature as central players in many
> different processes (for example p53, ER and so on). In this way we
> implicitly force the algorithm to be guided by those genes, if all (or
> most) of them appear as differentially expressed in the list.
> But, in my experience, this kind of bias is introduced anyway by
> biologists or clinicians when they go through the list of
> differentially expressed genes to decide which genes to
> focus their attention on (for validation and further investigation).
> In this case the problem is how to do it. I was thinking of selecting
> the probe with the best adjusted p-value, for example, or at least
> averaging only the probes that are identified as differentially expressed.
> The p-value in my opinion could be the best choice, but at the moment
> that is just an opinion.
> Has anyone faced this issue?
> Thank you for any help, suggestion or comment....
> Daniela
> Daniela Marconi
> PhD Student
> Physics Department
> University of Bologna
> Viale Berti Pichat 6/2
> Bologna
> Italy
> office: +39 051 2095136
