[BioC] problem with data processing in R

Wed Dec 16 18:33:17 CET 2009

Maxim,

Could you provide a simple example data set along with your code. This way 
it would be easier to reproduce what you are trying to do. I don't have your 
original data set anymore that you sent in an email.

Adopting your code to a simple random example would be the best. Something 
like:

y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("loc", 1:10, sep=""), paste("pos", 1:5, sep="")))

Thomas

On Wed, Dec 16, 2009 at 04:01:18PM +0100, Maxim wrote:
> Hi,
> 
> after having addressed most of my questions I have a minor problem with the
> indexing of my data. After I visualized all of my data in heatmaps I'd like
> to plot average profiles of each cluster for each factor. I added a new
> (1st) column to my data indicating the cluster ID. Right now I  do this
> like:
> 
> myrownames <- paste(1:length(impDF[,1]), impDF[,2], impDF[,3], sep="_")
> rownames(impDF) <- myrownames
> colnames(impDF, do.NULL = TRUE, prefix = "col")
> total<-nrow(impDF)
> clusters <- subset (impDF , V1 == 1)
> clusters <- clusters[,5:45]
> clust <- as.matrix(clusters)
> a<-nrow(clust)
> sizefactor<- a/total
> col=(a/count)
> index <- matrix(1:a, nrow=count, ncol=col, byrow=T)[,-101]
> 
> Now I'd like to bind the individual factors' data together in way that I get
> a continuous profile for factors 1-5 (meanwhile 6). I do this like
> 
> g<-clust[index[1,],]
> h<-clust[index[2,],]
> i<-clust[index[3,],]
> j<-clust[index[4,],]
> k<-clust[index[5,],]
> l<-clust[index[6,],]
> 
> m<-cbind(g,h,i,j,k,l)
> 
> 
> walk <- seq(1, ncol(m), by=10)
> pos<-c()
> sig<-c()
> for (k in 1:ncol(m)) {
> pos<-append(pos, rep(walk[k], nrow(m)))
> sig<-append(sig,as.vector(m[,k]))
> }
> 
> par(mar = c(5, 4, 4, 2) + 0.3)
> avg <- apply(m, 2, mean)
> plot( avg, col=0,lty=2, lwd =1,)
> lines(avg, col=8, lwd=2, type = "l"
> 
> Question 1: the cbind portion looks a bit complicated, how can this be
> substituted by simpler code?
> Question 2: each time I do the plot, I get circles plotted, irrespective of
> what I choose for lty (therefore I plot with col=0 and then do lines(...).
> What is the explanation for this strange behavior?
> 
> Maxim
> 
> 
> 2009/12/13 Thomas Girke <thomas.girke at ucr.edu>
> 
> > Maxim,
> >
> > Certainly, the sorted heatmap example is only useful for visualization
> > purposes, while the clustering example is included to show how to perform
> > clustering on your data in R in general not to provide a final solution
> > for your research problem.
> >
> > If your intention is to cluster the final clustering results of your
> > factors then you could compute a similarity/distance metric among their
> > clustering results using a set similarity measure such as the Jaccard
> > partitioning index. This way you can cluster entire clustering data
> > sets. For hierarchical clustering results you would first need to
> > generate discrete clusters from the dendrogams. R's cutree function is
> > very useful for this. Some examples and links to related R libraries for
> > clustering partitioning results are available here:
> >
> > http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/R_BioCondManual.html#clustering_jaccard
> >
> > Similarly, you could compute membership similarities among the clusters
> > obtained for all your factors using again the Jaccard partitioning index
> > or the variation of information criterion. This could be used to cluster
> > all the groupings obtained from your clustering results ("clustering of
> > clusters").
> >
> > Since I do not know enough about your data sets or the genomics technology
> > you are
> > using this case, I am not sure if any of this makes much sense. But perhaps
> > it is useful for exploring your options...
> >
> > Thomas
> >
> >
> >
> >
> >
> >
> >
> >
> > On Sun, Dec 13, 2009 at 10:28:11PM +0100, Maxim wrote:
> > > This works nicely.
> > >
> > > It means I cluster the data for one factor and sort the other factors
> > > according to the clustering result. But what next?
> > >
> > > What I will have to do is to find (cluster together) patterns that are
> > > similar *between* different factors and not within the data of one factor
> > > only. As I mentioned the attached data was already clustered in such a
> > > manner and obviously the patterns of factor 1-3 are somewhat similar.
> > > Therefore clustering makes not much sense with this data.
> > >
> > > I'm not sure whether I might just miss some aspect you try to explain but
> > as
> > > far as I got it, there is some essential thing still missing in order to
> > > accomplish the clustering analysis. When you look at the attached heatmap
> > > you'll find clusters, where factor 1 and 2 are showing similar patterns,
> > > others for 2 and 3, again others for 1 and 2 and 3 and so forth. To sort
> > out
> > > this differences some additional trick has to be done.
> > >
> > > Despite of this my R programming skills get better and better and I think
> > I
> > > will soon substitute lots of my Perl-Code with R-Code, especially that
> > for
> > > the data container manipulation tools.
> > >
> > > Maxim
> > >
> > >
> > >
> > >
> > >
> > >
> > > 2009/12/13 Thomas Girke <thomas.girke at ucr.edu>
> > >
> > > > Great. - The remaining parts of your analysis seem to be solvable
> > > > with simple subsetting and sorting routines of data objects in R.
> > > >
> > > > Here are some more suggestions that might help you here:
> > > >
> > > > ## Accessing data components of hclust objects:
> > > > names(hr)
> > > > hr$labels; hr$order
> > > >
> > > > ## How to return object labels in the order of a hierarchical
> > clustering
> > > > result:
> > > > hr$labels[hr$order]
> > > >
> > > > ## Sorting vector to join rows of same type in a data matrix (e.g. your
> > > > factorX data sets)
> > > > sortv <- as.vector(matrix(1:500, nrow=5, ncol=100, byrow=T))
> > > > somematrix[sortv, ]
> > > >
> > > > Thomas
> > > >
> > > >
> > > > On Sun, Dec 13, 2009 at 11:38:54AM +0100, Maxim wrote:
> > > > > Thank you,
> > > > >
> > > > > that works great. I was not aware of the "image" function. To do
> > scaling
> > > > > helped definitely o improve the quality of the heatmaps. In addition
> > I
> > > > > learned a lot about handling/importing data.
> > > > >
> > > > > I have to think about the clustering itself. Your suggestions works
> > fine
> > > > to
> > > > > cluster 1 dataset. Unfortunately clustering has to be performed
> > comparing
> > > > > all 5 datasets with each other. That means a given row in dataset 1
> > has
> > > > to
> > > > > be aligned with the same rows in datasets 2-5. When I perform
> > clustering
> > > > of
> > > > > the individual datasets each set will be clustered differently.
> > > > >
> > > > > Despite of this you definitely provided a good starting point to
> > better
> > > > > understand how to exploit R for my project.
> > > > >
> > > > > Maxim
> > > > >
> > > > > 2009/12/13 Thomas Girke <thomas.girke at ucr.edu>
> > > > >
> > > > > > Below are some suggestions for importing your data into R, plotting
> > > > > > heatmaps and clustering them using hierarchical clustering. For the
> > > > > > clustering, you have to choose a proper distance measure for your
> > data
> > > > > > type and of course an efficient partitioning algorithm. The task
> > view
> > > > > > site I sent previously provides a good overview as to what is
> > > > available.
> > > > > > Clustering of 7000 objects in R is not a problem for most
> > clustering
> > > > > > methods. Many of them, including hclust, are implemented in
> > C++/Fortran
> > > > > > to run decently time and memory efficient.
> > > > > >
> > > > > > I hope this will help to get you started.
> > > > > >
> > > > > > Thomas
> > > > > >
> > > > > >
> > > > > > ## Import of data sets (appended in one data frame)
> > > > > > ## Note: a row of NA values is inserted to visually separate data
> > sets
> > > > in
> > > > > > heatmap
> > > > > > filenames <- list.files(pattern="factor*") # Requires data files
> > > > "factorX"
> > > > > > in working dir
> > > > > > impDF <- data.frame(NULL)
> > > > > > for(i in filenames) {
> > > > > >        tmp <- read.delim(i, header=FALSE)
> > > > > >        impDF <- rbind(impDF, tmp, rep(NA, dim(tmp)[2]))
> > > > > > }
> > > > > > myrownames <- paste(1:length(impDF[,1]), impDF[,2], impDF[,3],
> > sep="_")
> > > > > > rownames(impDF) <- myrownames
> > > > > > impDF <- impDF[,4:44]
> > > > > > imp <- as.matrix(impDF)
> > > > > >
> > > > > > ## Alternative import into list container (not used in following
> > > > example)
> > > > > > filenames <- list.files(pattern="factor*")
> > > > > > datalist <- lapply(filenames, function(x) read.delim(x, header=F))
> > > > > > names(datalist) <- filenames
> > > > > >
> > > > > > ## Plot heatmap with heatmap.2
> > > > > > library(gplots)
> > > > > > heatmap.2(imp, dendrogram="none", Rowv=F, Colv=F, col=redgreen(75),
> > > > > > scale="row", trace="none", key=F)
> > > > > >
> > > > > > ## Plot a separate heatmap for each data set using R's image()
> > function
> > > > > > index <- matrix(1:505, nrow=5, ncol=101, byrow=T)[,-101]
> > > > > > par(mfrow=c(1,5))
> > > > > > for(i in 1:5) {
> > > > > >        image(scale(t(imp[rev(index[i,]), ])), col=redgreen(75),
> > > > xaxt="n",
> > > > > > yaxt="n", main=i)
> > > > > > }
> > > > > >
> > > > > > ## Example for hierarchical clustering of first data set using
> > hclust
> > > > > > y <- imp[1:100, ] # Selects first 100 rows (=1st data set).
> > > > > > d <- as.dist(1-cor(t(y))) # Creates a distance matrix using Pearson
> > > > > > correlations; here you
> > > > > >                          # want to choose a distance measure that
> > is
> > > > best
> > > > > > for your data.
> > > > > > hr <- hclust(d, method="complete") # Performs hierarchical
> > clustering
> > > > > > heatmap.2(y, Rowv=as.dendrogram(hr), dendrogram="row", Colv=F,
> > > > > > col=redgreen(75), scale="row", trace="none")
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Sat, Dec 12, 2009 at 11:25:05PM +0100, Maxim wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > At first: thanks for taking the time!!
> > > > > > >
> > > > > > > I could send some of the data and a jpeg that illustates where I
> > > > would
> > > > > > like
> > > > > > > to go to. Unfortunately the data is large. It is genomics data
> > > > measuring
> > > > > > > binding of several factors to specific genomic regions. I'd like
> > to
> > > > > > identify
> > > > > > > clusters where different factors show similar binding behaviour.
> > > > > > >
> > > > > > > My current dataset has data for 10 factors. I'm looking at 7000
> > > > "sites",
> > > > > > > each represented by 40 datapoints (that is in 100bp steps from
> > > > position
> > > > > > > -2000 to +2000 relative to the sites). Each site represents a
> > certain
> > > > > > > genomic location and I look at the same sites for every factor.
> > > > > > >
> > > > > > > I wonder if this can be done straightforward in R?
> > > > > > >
> > > > > > > I attached an example of the data. It is the first 100 "sites"
> > for 5
> > > > > > factors
> > > > > > > and the corresponding heatmap (for the complete set) after
> > external
> > > > > > > clustering.
> > > > > > >
> > > > > > > Concerning doing the clustering in R: I have no clue how to do
> > > > clustering
> > > > > > > with such "mutli-dimensional" data within R. I'd be glad in case
> > you
> > > > > > could
> > > > > > > point me at the right direction how to approach such a task. The
> > > > > > approaches
> > > > > > > in the literature appear to be quite complex and need lots of CPU
> > > > power
> > > > > > and
> > > > > > > are scripted in C, I guess for speed reasons. I wonder whether R
> > is
> > > > fast
> > > > > > > enough to accomplish such a task in a reasonable time.
> > > > > > >
> > > > > > > To explain the data:
> > > > > > > It is 5 small tab-delimited files, each with data from 100
> > "sites"
> > > > for 5
> > > > > > > factors. The example data corresponds to the bottom of the
> > attached
> > > > > > heatmap,
> > > > > > > so it is having clearly positive signals (blue color) at the
> > center
> > > > > > position
> > > > > > > for factor 1 and factor 2.
> > > > > > >
> > > > > > > I hope this might help to illustrate my project a little better.
> > > > > > >
> > > > > > > Maxim
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > 2009/12/11 Thomas Girke <thomas.girke at ucr.edu>
> > > > > > >
> > > > > > > > The example I send before works but there was a typo in the
> > > > heatmap.2
> > > > > > > > command
> > > > > > > > where mysort needs to replaced by mydata. Like this:
> > > > > > > >
> > > > > > > > heatmap.2(mydata, dendrogram="none", Rowv=F, Colv=F,
> > trace="none",
> > > > > > key=T,
> > > > > > > > cellnote=mysamples)
> > > > > > > >
> > > > > > > > In heatmap.2 you have the option to include a color bar on the
> > left
> > > > > > side
> > > > > > > > that
> > > > > > > > can be used to highlight clusters. See the help documentation
> > for
> > > > more
> > > > > > > > details.
> > > > > > > >
> > > > > > > > The dimensions of heatmap.2 plots can be controlled like for
> > any
> > > > other
> > > > > > plot
> > > > > > > > in
> > > > > > > > R, using the hight and width arguments, e.g. x11(height=6,
> > width=2)
> > > > or
> > > > > > > > pdf(...).
> > > > > > > >
> > > > > > > > To provide more specific help, you may want to send a simple
> > sample
> > > > > > data
> > > > > > > > set that
> > > > > > > > illustrates what you are trying to do exactly. Without this it
> > is
> > > > > > really
> > > > > > > > hard
> > > > > > > > to understand your problem.
> > > > > > > >
> > > > > > > > If I were you then I would perform the entire clustering
> > procedure
> > > > in
> > > > > > R. Is
> > > > > > > > there any
> > > > > > > > good reason not use R for this? For hierarchical clustering you
> > can
> > > > use
> > > > > > the
> > > > > > > > hclust function.
> > > > > > > > A relatively complete list of clustering algorithms available
> > in R
> > > > can
> > > > > > be
> > > > > > > > found on
> > > > > > > > the cluster task view page:
> > > > > > > > http://cran.at.r-project.org/web/views/Cluster.html
> > > > > > > >
> > > > > > > > Thomas
> > > > > > > >
> > > > > > > > On Fri, Dec 11, 2009 at 10:17:22PM +0100, Maxim wrote:
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > what you suggested sounds interesting but actually I do not
> > > > > > understand
> > > > > > > > where
> > > > > > > > > it is going to. Actually I'll get an error doing it like this
> > as
> > > > > > mysort
> > > > > > > > in
> > > > > > > > > heatmap.2 is not defined (yet).
> > > > > > > > >
> > > > > > > > > What did work for me in meantime, is simply to plot the
> > complete
> > > > > > heatmap
> > > > > > > > at
> > > > > > > > > once. What is missing in this approach is a label on the left
> > or
> > > > > > right
> > > > > > > > side
> > > > > > > > > of the heatmap, I'd love to have a colorcoded block that
> > allows
> > > > me to
> > > > > > > > see,
> > > > > > > > > where different IDs were plotted (different IDs are actually
> > > > clusters
> > > > > > > > coming
> > > > > > > > > from hierarchical clustering not performed in R).
> > > > > > > > >
> > > > > > > > > Second I can do it by generating individual heatmaps for each
> > ID
> > > > > > loaded
> > > > > > > > from
> > > > > > > > > individual files, unfortunately for some IDs there are
> > thousand
> > > > rows
> > > > > > of
> > > > > > > > > data, for others only 50. But R's heatmap always produces
> > > > similarly
> > > > > > sized
> > > > > > > > > maps. I'd prefer to have the height of the individual
> > heatmaps
> > > > > > according
> > > > > > > > to
> > > > > > > > > the corresponding number of rows rather than automatic
> > scaling.
> > > > > > > > >
> > > > > > > > > Is there a way to do this in R? I found an old mail in the
> > > > mailing
> > > > > > list
> > > > > > > > > discussing this point, the result was to use
> > TreeView/Cluster,
> > > > but I
> > > > > > > > cannot
> > > > > > > > > get this to work without doing clustering (the data is
> > clustered
> > > > > > > > already),
> > > > > > > > > additionally I do not know how to do batch processing in
> > > > TreeView.
> > > > > > > > >
> > > > > > > > > Maxim
> > > > > > > > > 2009/12/10 Thomas Girke <thomas.girke at ucr.edu>
> > > > > > > > >
> > > > > > > > > > I am not sure if I understand every part of your problem
> > > > correctly,
> > > > > > > > > > but here is an example how something like this could be
> > done in
> > > > R.
> > > > > > > > > > Its main idea is to keep the entire data set in one matrix
> > and
> > > > use
> > > > > > > > > > the cell note feature of heatmap.2 for sample tracking.
> > > > > > > > > >
> > > > > > > > > > ## Sample matrix for demo purpose. If your
> > > > > > > > > > y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g",
> > 1:10,
> > > > > > sep=""),
> > > > > > > > > > paste("t", 1:5, sep="")))
> > > > > > > > > >
> > > > > > > > > > ## Sort each row by its values
> > > > > > > > > > mydata <- t(apply(y, 1, sort))
> > > > > > > > > >
> > > > > > > > > > ## Obtain sample labels (column titles) for sorted rows
> > > > > > > > > > mysamples <-  t(apply(y, 1, function(x) names(sort(x))))
> > > > > > > > > >
> > > > > > > > > > ## Plot heatmap where the sample labels are given as cell
> > notes
> > > > for
> > > > > > > > > > tracking purposes
> > > > > > > > > > library(gplots)
> > > > > > > > > > heatmap.2(mydata, dendrogram="none", Rowv=F, Colv=F,
> > > > > > col=redgreen(75),
> > > > > > > > > > scale="row", trace="none", key=T, cellnote=mysamples)
> > > > > > > > > >
> > > > > > > > > > Thomas
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Thu, Dec 10, 2009 at 03:13:31PM +0100, Maxim wrote:
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I'm stuck with parsing data into R for heatmap
> > > > representation.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > The data looks like:
> > > > > > > > > > >
> > > > > > > > > > > 1 id1 x1 x2 x3 .... x20
> > > > > > > > > > >
> > > > > > > > > > > 2 id1 x1 x2 x3 .... x20
> > > > > > > > > > >
> > > > > > > > > > > 3 id1 x1 x2 x3 .... x20
> > > > > > > > > > >
> > > > > > > > > > > 4 id1 x1 x2 x3 .... x20
> > > > > > > > > > >
> > > > > > > > > > > .........
> > > > > > > > > > >
> > > > > > > > > > > 348 id2 x1 x2 x3 .... x20
> > > > > > > > > > >
> > > > > > > > > > > 349 id2 x1 x2 x3 .... x20
> > > > > > > > > > >
> > > > > > > > > > > 350 id2 x1 x2 x3 .... x20
> > > > > > > > > > >
> > > > > > > > > > > 351 id2 x1 x2 x3 .... x20
> > > > > > > > > > >
> > > > > > > > > > > .........
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > The data is sorted for the IDs (id1,id2 .....id40) and I
> > like
> > > > to
> > > > > > > > produce
> > > > > > > > > > 40
> > > > > > > > > > > heatmaps thereof, 1 heatmap per data corresponding to a
> > > > single
> > > > > > ID.
> > > > > > > >  The
> > > > > > > > > > data
> > > > > > > > > > > that has to be plotted is 20 values (x1 to x20). There is
> > > > > > different
> > > > > > > > > > amounts
> > > > > > > > > > > of data for respective IDs. In the end I'd like to have
> > the
> > > > 40
> > > > > > > > heatmaps
> > > > > > > > > > > stacked on top of each other sorted by ID and heatmap
> > heights
> > > > > > > > according
> > > > > > > > > > to
> > > > > > > > > > > the amount (number of rows) of data. Unfortunately the
> > > > individual
> > > > > > > > data
> > > > > > > > > > lines
> > > > > > > > > > > have to be sorted with respect to the maximum of the
> > values
> > > > X1 to
> > > > > > x20
> > > > > > > > in
> > > > > > > > > > > individual rows. Actually this not that important as I
> > guess
> > > > this
> > > > > > > > might
> > > > > > > > > > be
> > > > > > > > > > > easier to realize in upstream Perl scripts producing the
> > > > data.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > The data is available as data per ID in individual files
> > or
> > > > as a
> > > > > > > > sorted
> > > > > > > > > > file
> > > > > > > > > > > with the complete dataset (as shown above).
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Is it possible in R to break a file as above into
> > distinct
> > > > blocks
> > > > > > > > > > (depending
> > > > > > > > > > > on ID) and then to process it (sorting according to
> > maximum,
> > > > > > > > heatmap)?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Which commands do I have to issue for the manipulation of
> > the
> > > > > > > > data.frame?
> > > > > > > > > > I
> > > > > > > > > > > tried the
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I'd be glad  if someone could help me finding the correct
> > > > > > direction
> > > > > > > > to
> > > > > > > > > > solve
> > > > > > > > > > > my problem!
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Best regards
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Maxim
> > > > > > > > > > >
> > > > > > > > > > >       [[alternative HTML version deleted]]
> > > > > > > > > > >
> > > > > > > > > > > _______________________________________________
> > > > > > > > > > > Bioconductor mailing list
> > > > > > > > i> > > Bioconductor at stat.math.ethz.ch
> > > > > > > > > > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > > > > > > > > > Search the archives:
> > > > > > > > > >
> > > > http://news.gmane.org/gmane.science.biology.informatics.conductor
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > >
> >