[BioC] How to do clustering

Thomas Girke thomas.girke at ucr.edu
Wed Jun 20 07:03:36 CEST 2007

```
Alex,

A similar result, but with slower computation, can be obtained by applying
the IQR function to each row like this:

apply(iris[,1:3], 1, IQR)

Thomas

On Tue 06/19/07 21:10, Martin Morgan wrote:
> Alex,
>
> > library(Biobase)
> [snip]
> > args(rowQ)
> function (imat, which)
> NULL
> > showMethods("rowQ")
> Function: rowQ (package Biobase)
> imat="ExpressionSet", which="numeric"
> imat="exprSet", which="numeric"
> imat="matrix", which="numeric"
>
> so it looks like x should be a matrix rather than a data frame.
>
> Martin
>
> "ssls sddd" <ssls.sddd at gmail.com> writes:
>
> > Hi Thomas,
> >
> > Thanks! Sorry for getting back to it late because I was out
> > of town for a couple of days.
> >
> > I like the idea of 'removing all rows with low variability across
> > samples'. I searched around and found an online tutorial
> > http://www.economia.unimi.it/projects/marray/2006/material/Lab3/MachineLearning/ML-lab.pdf
> > that does a very similar thing: it teaches how to filter out genes
> > that are not differentially expressed.
> >
> > It takes the simplistic approach of using the 75th percentile of the
> > interquartile range
> > (IQR) as the cut-off point and computes quantiles using rowQ.
> >
> > I followed their method and my code is:
> >
> > library("Biobase")
> > lowQ = rowQ(x, floor(0.25 * 49))#49 for 49 samples
> > upQ = rowQ(x, ceiling(0.75 * 49))
> > iqrs = upQ - lowQ
> > giqr = iqrs > quantile(iqrs, probs = 0.75)
> > sum(giqr)
> > xsub = x[giqr, ]
> > dim(xsub)
> >
> > But the error message is like:
> >
> > function (classes, fdef, mtable)  :
> >         unable to find an inherited method for function "rowQ", for
> > signature "data.frame", "numeric"
> >
> > Perhaps you have some experience in using 'rowQ'? If I want to use the IQR
> > function instead, how should I approach this?
> >
> > I really appreciate your help!
> >
> > Thank you very much!
> >
> > Sincerely,
> >
> > Alex
> >
> >
> >
> > On 6/13/07, Thomas Girke <thomas.girke at ucr.edu> wrote:
> >>
> >> Dear Alex,
> >>
> >> In addition to Sean's advice, I would like to point out that the
> >> sample you are giving below indicates that you are trying to pass
> >> both a column dendrogram and a row dendrogram to the heatmap function.
> >> With your matrix of 238,000 rows by 49 columns you should have only a
> >> column dendrogram, because the row dendrogram would take more than
> >> 200 GB of memory to calculate. You can still use the heatmap or
> >> heatmap.2 functions by turning off the row sorting, which is done by
> >> setting the Rowv argument to NA. In addition to this, I would consider
> >> filtering your rows in a meaningful manner down to a much smaller
> >> number, perhaps by using R's IQR function to remove all rows with very
> >> low variability. I am suggesting this because you won't see any
> >> patterns in the heatmap when you have so many rows. If the row filtering
> >> works, then you could generate a dendrogram for the row dimension as well.
> >> Remember: hclust will require ~4 GB of memory to cluster ~30,000 items
> >> and < 1 GB for 10,000 items, and pvclust, which uses hclust internally,
> >> will need even more than this.
> >>
> >> As more general advice: when working with large data sets in R, always
> >> subset your data to something very small to test out your strategy
> >> first, because this will save you a lot of time.
> >> In your case, this could be done by selecting just the first 100 rows
> >> of your matrix like this:
> >>                 my_matrix <- my_matrix[1:100, ]
> >>
> >> Once you have tested things out, just remove the '[1:100, ]' part
> >> from your script/protocol.
> >>
> >> Best,
> >>
> >> Thomas
> >>
> >>
> >> On Wed 06/13/07 06:02, Sean Davis wrote:
> >> > ssls sddd wrote:
> >> > > Dear Dr.Thomas Girke,
> >> > >
> >> > > I have one more question for you. I tried pvclust in the session of
> >> > > 'Obtain significant clusters by pvclust bootstrap analysis' for my
> >> > > data, x.
> >> > >
> >> > > But I have a problem with:
> >> > >
> >> > > heatmap(x, Rowv=dend_colored, Colv=as.dendrogram(hc),
> >> > > col=my.colorFct(), scale="row", RowSideColors=mycolhc)
> >> > >
> >> > > the error was:
> >> > >
> >> > > error in heatmap(x, Rowv = dend_colored, Colv = as.dendrogram(hc),
> >> > > col = my.colorFct(),  :
> >> > >         'x' must be a numeric matrix
> >> > >
> >> > > I ran 'x[1:3,1:3]' and it produced the following:
> >> > >
> >> > >               AIRNS_A09 AIRNS_A11 AIRNS_A12
> >> > > SNP_A-1780271   1.85642   1.50956   1.73154
> >> > > SNP_A-1780274   1.72140   1.83712   1.85948
> >> > > SNP_A-1780277   2.04241   1.53458   1.65270
> >> > >
> >> > > I think x is a numeric matrix. Where do you think I may have gone wrong?
> >> >
> >> > Try coercing the x into a matrix directly:
> >> >
> >> > heatmap(as.matrix(x), Rowv=dend_colored, Colv=as.dendrogram(hc),
> >> > col=my.colorFct(), scale="row", RowSideColors=mycolhc)
> >> >
> >> > Does this fix the problem?  You can always check the class of an object
> >> > by doing something like:
> >> >
> >> > class(x)
> >> >
> >> > which should report:
> >> >
> >> > [1] "matrix"
> >> >
> >> > Hope that helps.
> >> >
> >> > Sean
> >> >
> >>
> >> --
> >> Dr. Thomas Girke
> >> Assistant Professor of Bioinformatics
> >> Director, IIGB Bioinformatic Facility
> >> Center for Plant Cell Biology (CEPCEB)
> >> Institute for Integrative Genome Biology (IIGB)
> >> Department of Botany and Plant Sciences
> >> 1008 Noel T. Keen Hall
> >> University of California
> >> Riverside, CA 92521
> >>
> >> E-mail: thomas.girke at ucr.edu
> >> Website: http://faculty.ucr.edu/~tgirke
> >> Ph: 951-827-2469
> >> Fax: 951-827-4437
> >>
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> --
> Martin Morgan
> Bioconductor / Computational Biology
> http://bioconductor.org
>

```
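
The advice in this thread (coerce the data frame to a matrix, filter rows by IQR, and cluster only the columns) could be sketched roughly as follows. This is a minimal illustration, not the code anyone in the thread actually ran: the simulated data, its size of 1,000 rows, and the names `m`, `keep` and `msub` are assumptions made for the example, and base R's `apply(..., IQR)` is used in place of Biobase's `rowQ`.

```r
# Hypothetical stand-in for the poster's data: 1,000 rows x 49 samples,
# stored as a data frame (the case that triggers the rowQ/heatmap errors).
set.seed(1)
x <- as.data.frame(matrix(rnorm(1000 * 49), nrow = 1000))

# Coerce to a numeric matrix first, as Sean and Martin suggest.
m <- as.matrix(x)

# Per-row interquartile range with base R only (slower than rowQ).
iqrs <- apply(m, 1, IQR)

# Keep the rows whose IQR lies above the 75th percentile of all row IQRs.
keep <- iqrs > quantile(iqrs, probs = 0.75)
msub <- m[keep, , drop = FALSE]
dim(msub)

# Cluster the columns only; Rowv = NA turns off row reordering in heatmap().
hc <- hclust(dist(t(msub)))
heatmap(msub, Rowv = NA, Colv = as.dendrogram(hc), scale = "row")
```

With real data, only the first assignment to `x` would change. Once the filtered matrix is small enough (see the memory estimates quoted above), a row dendrogram could be added by dropping `Rowv = NA`.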