[BioC] problem with data processing in R

Sat Dec 12 23:25:05 CET 2009

Hi,

First of all: thanks for taking the time!

I could send some of the data and a JPEG that illustrates where I would like
to get to. Unfortunately the data is large. It is genomics data measuring
binding of several factors to specific genomic regions. I'd like to identify
clusters where different factors show similar binding behaviour.

My current dataset has data for 10 factors. I'm looking at 7000 "sites",
each represented by 40 datapoints (that is in 100bp steps from position
-2000 to +2000 relative to the sites). Each site represents a certain
genomic location and I look at the same sites for every factor.

I wonder if this can be done in a straightforward way in R?

I attached an example of the data. It is the first 100 "sites" for 5 factors
and the corresponding heatmap (for the complete set) after external
clustering.

Concerning doing the clustering in R: I have no clue how to cluster such
"multi-dimensional" data within R. I'd be glad if you could point me in the
right direction on how to approach such a task. The approaches in the
literature appear to be quite complex, need lots of CPU power, and are
written in C, I guess for speed reasons. I wonder whether R is fast enough
to accomplish such a task in a reasonable time.
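
For what it's worth, here is a minimal sketch of one way such a clustering could be set up in base R. It is only an illustration under assumptions: the data is simulated, the names factor_list and combined are made up, and the correlation distance, average linkage and k = 5 are arbitrary choices rather than recommendations. The idea is to put the profiles of all factors side by side, so that each site becomes one long row, and then cluster the rows.

```r
## Simulated stand-in for the real data (assumption): one matrix per factor,
## rows = sites, columns = 40 positions (-2000 to +2000 in 100bp steps)
set.seed(1)
n_sites <- 200
n_pos <- 40
n_factors <- 3
factor_list <- lapply(seq_len(n_factors), function(i) {
  m <- matrix(rnorm(n_sites * n_pos), n_sites, n_pos)
  rownames(m) <- paste("site", seq_len(n_sites), sep = "")
  colnames(m) <- paste("f", i, "_p", seq_len(n_pos), sep = "")
  m
})

## Concatenate the factors column-wise: one row per site,
## n_factors * n_pos columns
combined <- do.call(cbind, factor_list)

## Hierarchical clustering of the sites on a correlation-based distance
d <- as.dist(1 - cor(t(combined)))
hc <- hclust(d, method = "average")
clusters <- cutree(hc, k = 5)
table(clusters)

## Alternative that avoids the full distance matrix: k-means on the rows
km <- kmeans(combined, centers = 5, nstart = 10)
table(km$cluster)
```

With 7000 sites the distance object holds about 24.5 million pairwise values, which should fit in memory on an ordinary machine, so speed may be less of a problem than feared. Whether the resulting clusters are meaningful for this kind of data is a separate question.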

To explain the data:
It is 5 small tab-delimited files, each with data from 100 "sites" for 5
factors. The example data corresponds to the bottom of the attached heatmap,
so it shows clearly positive signals (blue color) at the center position
for factor 1 and factor 2.
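
As a rough sketch of how files like these could be read into R and combined: the layout assumed here (one file per factor, a header line, site IDs in the first column, 40 value columns) and the names files, read_factor and mats are assumptions for illustration only, and the demo files are generated on the fly rather than being the attached ones.

```r
## Create small demo files in a temporary directory (simulated data;
## the file layout is an assumption, not the real attachment format)
set.seed(2)
files <- file.path(tempdir(), paste("factor", 1:5, ".txt", sep = ""))
for (f in files) {
  m <- matrix(round(rnorm(100 * 40), 2), 100, 40,
              dimnames = list(paste("site", 1:100, sep = ""),
                              paste("p", 1:40, sep = "")))
  write.table(m, f, sep = "\t", quote = FALSE, col.names = NA)
}

## Read one tab-delimited file into a numeric matrix,
## using the first column as row names
read_factor <- function(path) as.matrix(read.delim(path, row.names = 1))
mats <- lapply(files, read_factor)
names(mats) <- sub("\\.txt$", "", basename(files))

## All files should describe the same sites in the same order
same_sites <- sapply(mats, function(m) identical(rownames(m), rownames(mats[[1]])))
stopifnot(all(same_sites))

## Prefix the column names with the factor name and bind everything
## into one matrix: rows = sites, columns = factor/position pairs
for (i in seq_along(mats)) {
  colnames(mats[[i]]) <- paste(names(mats)[i], colnames(mats[[i]]), sep = "_")
}
combined <- do.call(cbind, unname(mats))
dim(combined)
```

The combined matrix keeps one row per site, so it can be passed to a clustering function or plotted as a single heatmap.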

I hope this might help to illustrate my project a little better.

Maxim

2009/12/11 Thomas Girke <thomas.girke at ucr.edu>

> The example I sent before works, but there was a typo in the heatmap.2
> command, where mysort needs to be replaced by mydata. Like this:
>
> heatmap.2(mydata, dendrogram="none", Rowv=F, Colv=F, trace="none", key=T,
> cellnote=mysamples)
>
> In heatmap.2 you have the option to include a color bar on the left side
> that can be used to highlight clusters. See the help documentation for
> more details.
>
> The dimensions of heatmap.2 plots can be controlled as for any other plot
> in R, using the height and width arguments of the graphics device, e.g.
> x11(height=6, width=2) or pdf(...).
>
> To provide more specific help, you may want to send a simple sample data
> set that illustrates exactly what you are trying to do. Without this it is
> really hard to understand your problem.
>
> If I were you, I would perform the entire clustering procedure in R. Is
> there any good reason not to use R for this? For hierarchical clustering
> you can use the hclust function. A relatively complete list of clustering
> algorithms available in R can be found on the cluster task view page:
> http://cran.at.r-project.org/web/views/Cluster.html
>
> Thomas
>
> On Fri, Dec 11, 2009 at 10:17:22PM +0100, Maxim wrote:
> > Hi,
> >
> > what you suggested sounds interesting, but I do not understand where it
> > is going. I actually get an error doing it like this, as mysort in
> > heatmap.2 is not defined (yet).
> >
> > What did work for me in the meantime is simply to plot the complete
> > heatmap at once. What is missing in this approach is a label on the left
> > or right side of the heatmap. I'd love to have a color-coded block that
> > allows me to see where different IDs were plotted (different IDs are
> > actually clusters coming from hierarchical clustering not performed in R).
> >
> > Second, I can do it by generating individual heatmaps for each ID loaded
> > from individual files. Unfortunately, for some IDs there are a thousand
> > rows of data, for others only 50, but R's heatmap always produces
> > similarly sized maps. I'd prefer the height of the individual heatmaps to
> > follow the corresponding number of rows rather than automatic scaling.
> >
> > Is there a way to do this in R? I found an old mail in the mailing list
> > discussing this point; the result was to use TreeView/Cluster, but I
> > cannot get this to work without doing clustering (the data is clustered
> > already). Additionally, I do not know how to do batch processing in
> > TreeView.
> >
> > Maxim
> > 2009/12/10 Thomas Girke <thomas.girke at ucr.edu>
> >
> > > I am not sure if I understand every part of your problem correctly,
> > > but here is an example how something like this could be done in R.
> > > Its main idea is to keep the entire data set in one matrix and use
> > > the cell note feature of heatmap.2 for sample tracking.
> > >
> > > ## Sample matrix for demo purpose. If your
> > > y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""),
> > > paste("t", 1:5, sep="")))
> > >
> > > ## Sort each row by its values
> > > mydata <- t(apply(y, 1, sort))
> > >
> > > ## Obtain sample labels (column titles) for sorted rows
> > > mysamples <-  t(apply(y, 1, function(x) names(sort(x))))
> > >
> > > ## Plot heatmap where the sample labels are given as cell notes for
> > > ## tracking purposes
> > > library(gplots)
> > > heatmap.2(mydata, dendrogram="none", Rowv=F, Colv=F, col=redgreen(75),
> > > scale="row", trace="none", key=T, cellnote=mysamples)
> > >
> > > Thomas
> > >
> > >
> > > On Thu, Dec 10, 2009 at 03:13:31PM +0100, Maxim wrote:
> > > > Hi,
> > > >
> > > >
> > > > I'm stuck with parsing data into R for heatmap representation.
> > > >
> > > >
> > > > The data looks like:
> > > >
> > > > 1 id1 x1 x2 x3 .... x20
> > > >
> > > > 2 id1 x1 x2 x3 .... x20
> > > >
> > > > 3 id1 x1 x2 x3 .... x20
> > > >
> > > > 4 id1 x1 x2 x3 .... x20
> > > >
> > > > .........
> > > >
> > > > 348 id2 x1 x2 x3 .... x20
> > > >
> > > > 349 id2 x1 x2 x3 .... x20
> > > >
> > > > 350 id2 x1 x2 x3 .... x20
> > > >
> > > > 351 id2 x1 x2 x3 .... x20
> > > >
> > > > .........
> > > >
> > > >
> > > >
> > > > The data is sorted by ID (id1, id2, ..., id40) and I'd like to
> > > > produce 40 heatmaps from it, 1 heatmap per block of data
> > > > corresponding to a single ID. The data that has to be plotted is 20
> > > > values (x1 to x20). There are different amounts of data for the
> > > > respective IDs. In the end I'd like to have the 40 heatmaps stacked on
> > > > top of each other, sorted by ID, with heatmap heights according to the
> > > > amount (number of rows) of data. Unfortunately, the individual data
> > > > lines have to be sorted with respect to the maximum of the values x1
> > > > to x20 in individual rows. Actually this is not that important, as I
> > > > guess this might be easier to realize in the upstream Perl scripts
> > > > producing the data.
> > > >
> > > >
> > > > The data is available as data per ID in individual files or as a
> > > > sorted file with the complete dataset (as shown above).
> > > >
> > > >
> > > > Is it possible in R to break a file as above into distinct blocks
> > > > (depending on ID) and then to process it (sorting according to
> > > > maximum, heatmap)?
> > > >
> > > >
> > > > Which commands do I have to issue for the manipulation of the
> > > > data.frame? I tried the
> > > >
> > > >
> > > > I'd be glad if someone could help me find the correct direction to
> > > > solve my problem!
> > > >
> > > >
> > > > Best regards
> > > >
> > > >
> > > > Maxim
> > > >
> > > >       [[alternative HTML version deleted]]
> > > >
> > > > _______________________________________________
> > > > Bioconductor mailing list
> > > > Bioconductor at stat.math.ethz.ch
> > > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > > Search the archives:
> > > > http://news.gmane.org/gmane.science.biology.informatics.conductor
> > > >
> > >
>