[BioC] Integrating Codelink data with bioconductor (using affy and limmafunctions)

Diego Díez Ruiz ddiez at iib.uam.es
Fri Apr 22 14:45:53 CEST 2005


Dear BioC users and developers:

I'm working with Codelink plattform microarray data and i want to apply
some of the knowledge currently available for other platforms as
Affymetrix o cDNA spotted microarrays to deal with this data. I have 
been using Affymetrix data  for a while and i used suscessfully the
great affy package. I also worked with spoted cDNA data and use the
also great package limma mainly. Currently this is what I have done:

1. Write a parser for exported text data from the codelink software.

2. I found convenient to create a new class for storing data from this
platform because there are no unique identifiers as in Affymetrix and so 
an exprSet object was not appropiate. I think an RGlist/MAlist-like 
definition could do the job fine so I almost cut and paste the classes 
definition in the limma package an modified it a little (any concerns 
about using other's code is wellcome. I'm willing to release this code 
under LGPL licence) to make a currently named "Codelink" class. The 
class currently stores signal values, flag values, type of spot and 
codelink identifiers. Also I make some annotation packages for the 
human10k, humanwholegenome and ratwholegenome bioarrays but i have to 
test it deeper. A remember and older thread in the BioC list about 
modifiying the exprSet definition to deal with this problem. Is there 
any plan to change it so I can use a more standard approach better that 
create a new class?

3. Created/adapted some functions for plotting: Density plots, MAplot, 
Scatter plot that let see the position of diferent type of spots in 
codelink chips: that is FIDUCIAL, DISCOVERY, POSITIVE, etc.


All that worked more or less. Data can be imported as RAW data (default)
or normalized (Codelink median normalization) and as log2 values
(default) or not. When I calculate log2 values, i initially decided to
set negative values (obtained from subtracting background to signal
sometimes) to somethign like 0.01. This works fine with
normalize.quantiles and normalize.loess from affy package but create
artifact point in a MAplot and a low intensity peak in a density plot.
Then, I saw a post from Gordon Smyth (sorry dont have link) telling that
limma make negative values NA prior to log2. I wanted to do that and
modified the functions involved. Finaly this is when I found a mayor
problem:

1) normalize.loess: I got to modify this function to allow NA values.
The function compares each array whith the others so if you have 10
arrays then each array is compared and modified (after loess
estimation) 10 times. If in each iteration I select the non NA values
created when you calculate A or M then in each iteration you have a
different subset of genes used in the normalization process. As a little
example, if you have 4 arrays with 5 genes:

	chip1	chip2	chip3	chip4
gen1	1	10	NA	5
gen2	3	NA	40	6
gen3	4	NA	NA	7
gen4	3	12	45	6
gen5	2	1	4	3

normalize.loess take a matrix as argument and do all pairwise comparisons:

chip1-chip2: used gen1,gen4,gen5 to estimate loess curve.
chip1-chip3: used gen2,gen4,gen5 to estimate loess curve.
chip1-chip4: used all genes to estimate loess curve.
chip2-chip3: used gen4,gen5 to estimate loess curve.
chip2-chip4: used gen1,gen4,gen5 to estimate loess curve.
chip3-chip4: used gen2,gen4,gen5 to estimate loess curve.

The code modified/added to allow for NA is between #:
(from normalize.loess in affy 1.5.8)

------------------------------------------------------------------------

normalize.loess <- function (mat, subset = sample(1:(dim(mat)[1]), 
min(c(5000, nrow(mat)))),
     epsilon = 10^-2, maxit = 1, log.it = TRUE, verbose = TRUE,
     span = 2/3, family.loess = "symmetric")
{
     J <- dim(mat)[2]
     II <- dim(mat)[1]
     newData <- mat
     if (log.it) {
         mat <- log2(mat)
         newData <- log2(newData)
     }
     change <- epsilon + 1
     fs <- matrix(0, II, J)
     iter <- 0
     w <- c(0, rep(1, length(subset)), 0)
     while (iter < maxit) {
         iter <- iter + 1
         means <- matrix(0, II, J)
         for (j in 1:(J - 1)) {
             for (k in (j + 1):J) {
                 y <- newData[, j] - newData[, k]
                 x <- (newData[, j] + newData[, k])/2
		#####################################
                 # Select genes that are not set to NA
                 sel <- which(!is.na(as.character(y)))
                 y <- y[sel]
                 x <- x[sel]
                 #####################################
                 index <- c(order(x)[1], subset, order(-x)[1])
                 xx <- x[index]
                 yy <- y[index]
                 aux <- loess(yy ~ xx, span = span, degree = 1,
                   weights = w, family = family.loess)
                 aux <- predict(aux, data.frame(xx = x))/J
		#####################################
                 # Apply correction to genes not NA.
                 means[sel, j] <- means[sel, j] + aux
                 means[sel, k] <- means[sel, k] - aux
		#####################################
                 if (verbose)
                   cat("Done with", j, "vs", k, " in iteration ",
                     iter, "\n")
             }
         }
         fs <- fs + means
         newData <- mat - fs
         change <- max(colMeans((means[subset, ])^2))
         if (verbose)
             cat(iter, change, "\n")
         oldfs <- fs
     }
     if ((change > epsilon) & (maxit > 1))
         warning(paste("No convergence after", maxit, "iterations.\n"))
     if (log.it) {
         return(2^newData)
     }
     else return(newData)
}

------------------------------------------------------------------------

I'm not an statitician so not sure if this could be a correct approach: 
Is this correct or I cannot do that?

2) With normalize.quantiles all seems to work fine... the function don't 
say nothing about NA values and the density plots seems correct (almost 
same density plot for all chips) but the MAplot have changed greatly 
with a great number of genes with greater values of M (in absolute 
value) that is, a wider cloud of points. I don't know why is this 
happening. I also tried to see at the C code for normalize.quantiles (at 
qnorm.c) but not found an easy answer. Hopefully I finally found that I 
can use normalizeQuantiles from limma package that allows for NA values.

By the way, is this really necesary or can I assign a low value to 
negative ones?. I think it is better to use NA but...
Something I'm going to do sonner is to import all the values obtained 
from the Codelink data (mean foreground and background, median 
foreground and background, etc) so it could be possible to calculate a 
signal value within R and then use a method that avoids negative values. 
But that could be the same in some cases to assign a small value (Or I'm 
wrong)

I'm working with a linux box (Debian Sarge) with R 2.0.1 compiled from 
sources and BioC 1.5.

Any comment on all that would be very appreciated.

Thanks in advanced!


D.



More information about the Bioconductor mailing list