[BioC] merge SMD data from different print batch

Sun Apr 2 04:31:17 CEST 2006

Dear all,

I need some advice on certain potentially ad-hoc measure that I am using
to deal with Stanford Microarray Database (SMD) data.

Problem : 
A single study in SMD often utilises different print batches - sometimes
with twice as many unique features in one batch than other batches. This
is compounded by the fact there are multiple copies for some genes.

Current solution : 
I am aware that marray has read.SMD(), limma has read.maimagenes( ...,
source="smd"), merge.MAList() and someone else proposed read.SMD2() on
the mailing list. I could be wrong here I believe these assume that the
arrays contain the same probeset but potentially disordered.

Potential solution : 
My steps are as following. Scripts to do this are given at the end.

1) First we preprocess each array with LOESS followed by scale
normalisation before combining with other arrays. This is done is the
first half of the function my.SMD.expr().

2) Next, we average the log ratios over the LUID or SUID (for old SMD
dataset) and removing redundant gene annotations. This is done in
get.SMD.expr(). This is potentially the a contentious issue.

3) Finally we merge the different arrays by using the LUID and the
average gene expression for that LUID that was calculated in step 2. 

Is this complete nonsense ? Can someone please shed some light.
Unfortunately, I did not expect SMD datasets to be this complicated.

Regards, Adai

------------------------------------------------------------------------------

get.SMD.expr <- function(input, by, summary.fn=mean){

  tmp   <- lapply( input, function(m) m$genes )
  genes <- do.call("rbind", tmp);
  good  <- !duplicated( genes[ , by] )
  genes <- genes[ good, c(by, "CLONE ID", "GENE SYMBOL", 
                              "CLUSTER ID", "ACCESSION", "GENE NAME")]
  rownames(genes) <- genes[ , by]

  cn <- sapply( input, function(m) m$target$FileName )
  M  <- matrix( nr=nrow(genes), nc=length(input),
                dimnames=list( rownames(genes), cn ) )

  for(i in 1:length(input)){

    key <- as.character( input[[i]]$genes[ , by] )
    val <- as.numeric( input[[i]]$M )
    out <- tapply( val, key, summary.fn, na.rm=TRUE )

    M[ names(out), i ] <- out
    cat(i, "\t")
  }
  cat("\n")
  return( list(M=M, genes=genes) )
}

my.SMD.expr <- function( files, source=c("smd", "smd.old") ){

  require(limma)
  raw <- list(NULL)
  for(i in 1:length(files)){
    tmp1 <- read.maimages( files=files[i], source=source )
    tmp2 <- normalizeWithinArrays( tmp1, method="loess" )
    tmp3 <- normalizeBetweenArrays( tmp2, method="scale" )

    colnames(tmp3$genes) <- toupper( colnames(tmp3$genes) )
    raw[[i]] <- tmp3
    rm(tmp1, tmp2, tmp3)
  }
  names(raw) <- files

  if(source=="smd") by <- "LUID"
  if(source=="smd.old") by <- "SUID"
  z <- get.SMD.expr( raw, by=by )
}