[BioC] limma - arrays with different GAL files

Fri Apr 23 03:32:47 CEST 2004

Hi Helen,

I have an experiment, analysed with limma, that spans two print runs
between which the layout was changed.  The approach I used was to normalise
the print runs separately (since I wanted to use print-tip normalisation,
which is obviously layout dependent) and then combine the two MA objects.

There are several approaches to combining the print runs, depending on the
uniqueness of your probes. If all your probes have unique names then you can
use a modified version of the merge function (I think Gordon was going to
implement it, but I made my own version based on merge.RGlist):

setMethod("merge", c("MAList","MAList"), definition=function(x,y,z,...) {
#  Merge MAList y into x aligning by row names - based on merge.RGlist
	genes1 <- rownames(x$M)
	if(is.null(genes1)) genes1 <- rownames(x$A)
	genes2 <- rownames(y$M)
	if(is.null(genes2)) genes2 <- rownames(y$A)
	if(is.null(genes1) || is.null(genes2)) stop("Need row names to align on")
	fields1 <- names(x)
	fields2 <- names(y)
	if(!identical(fields1,fields2)) stop("The two MALists have different elements")
	ord2 <- match(makeUnique(genes1), makeUnique(genes2))
	for (i in fields1) x[[i]] <- cbind(x[[i]],y[[i]][ord2,])
	x
})
You'd call this using
MA.Combined <- merge(MA.PrintRun1,MA.PrintRun2)

This will merge two MA objects on the rownames. However, if you have some
duplicate rownames, this method doesn't work: I think the MA values from the
second print run will all be set to NA, which is not what you want.
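
As a minimal sketch of the alignment step on its own, here is the same
match-and-cbind idea applied to plain matrices rather than real MAList
objects. The probe names, column names and values are made up for
illustration; only base R is used.

```r
# Toy log-ratio matrices standing in for the M component of two print runs
# (hypothetical probe names and values).
M1 <- matrix(c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6), nrow = 3,
             dimnames = list(c("geneA", "geneB", "geneC"),
                             c("run1.a1", "run1.a2")))
M2 <- matrix(c(1.1, 1.2, 1.3), nrow = 3,
             dimnames = list(c("geneC", "geneA", "geneD"), "run2.a1"))

# For each run-1 row, find the row of run 2 with the same name (NA if absent)
ord2 <- match(rownames(M1), rownames(M2))

# Reorder run 2 into run-1 order and bind the arrays side by side;
# drop = FALSE keeps a one-column result as a matrix
M.combined <- cbind(M1, M2[ord2, , drop = FALSE])
```

A probe that is on run 1 but not on run 2 (geneB here) ends up as NA in the
run 2 column, and a probe that is only on run 2 (geneD) is dropped, because
the result keeps the rows of run 1.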

I had the situation where ~95% of my probe names were unique, so the above
solution wasn't quite suitable. To get around this, I wrote my own function
to map array location of print run 2 onto the locations of print run 1.
This requires you to know how the arrays were laid out differently. If you
can work this out, create a vector that gives, for each spot index in the
print run 1 GAL file, the index of the same probe in the print run 2 GAL file

mapRun2ToRun1[indexInGALForPrintRun1] = indexInGALForPrintRun2

(this is the same orientation as ord2 above). Then use mapRun2ToRun1 in
place of ord2 in the above code or use
MA.Combined <- new("MAList",
  list(M=cbind(MA.PrintRun1[["M"]],MA.PrintRun2[["M"]][mapRun2ToRun1,]),
       A=cbind(MA.PrintRun1[["A"]],MA.PrintRun2[["A"]][mapRun2ToRun1,]),
       weights=cbind(MA.PrintRun1[["weights"]],
                     MA.PrintRun2[["weights"]][mapRun2ToRun1,])))
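
To make the mapping vector concrete, here is a small sketch with an invented
layout: six spots, where the two 3-spot blocks were swapped between print
runs. The probe names, the layout change and the values are all assumptions
for illustration. Entry i of the map holds the run 2 row for run 1 row i,
which is the orientation that indexing with [mapRun2ToRun1,] needs.

```r
# Hypothetical probe names in GAL order; the repeated names are why
# matching on names alone would not be enough here.
run1.probes <- c("p1", "p2", "p2", "p3", "p4", "p4")
run2.probes <- c("p3", "p4", "p4", "p1", "p2", "p2")

# Assumed layout change: the two 3-spot blocks were swapped between runs,
# so run-1 rows 1..3 sit at run-2 rows 4..6 and vice versa.
mapRun2ToRun1 <- c(4:6, 1:3)

# Sanity check: reordering run 2 by the map should reproduce run 1's order
stopifnot(identical(run2.probes[mapRun2ToRun1], run1.probes))

# Toy M matrices (one array per run); run 2's values are its own row numbers
M1 <- matrix(c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6), ncol = 1)
M2 <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 1)
M.combined <- cbind(M1, M2[mapRun2ToRun1, , drop = FALSE])
```

Checking the map against the probe names from both GAL files, as in the
stopifnot line, is a cheap way to catch a map that has been built the wrong
way round.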

As to your question on how to cope with duplicate spots on one array and not
the other, that is a little trickier. I have some suggestions, but I suspect
others may be better qualified to answer this part.
I'll assume print run1 is single spots, print run2 is duplicate spots and
that the bottom half of the array is the (duplicate) copy of the top half of
the array.

First create a dummy MA list that is just run 1 duplicated to look like run
2, and set all the duplicate values to NA, e.g.
MA1.duplicate <- rbind(MA1,MA1)
MA1.duplicate[(dim(MA1)[1]+1):(2*dim(MA1)[1]),] <- NA
Then combine runs 1 and 2, and run dupcor and lmFit. I think this should work,
but I'm not sure of the validity of this approach (anyone else like to
comment?)
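
The padding step can be sketched on a plain matrix (not an MAList) to show
the shapes involved. The values are made up, and the layout is the one
assumed above: run 2 has each probe twice, with the bottom half of the array
repeating the top half.

```r
# Run 1: 3 probes, single spots, 2 arrays (hypothetical values)
M1 <- matrix(c(0.5, -0.2, 1.0, 0.3, 0.1, -0.4), nrow = 3)
n <- nrow(M1)

# Stack run 1 on itself so it has as many rows as run 2,
# then blank out the second copy so no data is counted twice
M1.duplicate <- rbind(M1, M1)
M1.duplicate[(n + 1):(2 * n), ] <- NA

# Run 2: the same 3 probes spotted in duplicate (6 rows), 1 array
M2 <- matrix(c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6), nrow = 6)

# Both runs now have 6 rows and can be bound side by side
M.combined <- cbind(M1.duplicate, M2)
```

The combined matrix has one row per spot position of run 2; the run 1
columns carry real values in the top half and NA in the bottom half.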

As for what happens to genes that are on one array and not the other, I
think it depends on how the data from the different arrays are merged. If
you retain the probe, you could give it the data from run 1 and set all the
run 2 estimates to NA. In that case you'll get an estimate out of limma; it
will be a poorer estimate compared to the other genes (since it has fewer
degrees of freedom).  One comment is that it's possible you could have
different sequences (i.e. regions) from the same gene on the array. If this
is the case, I would treat each sequence variant separately (i.e. as
different genes, since the non-specific hyb and binding efficiency probably
vary between the sequences).

Finally I would suggest that you contact whoever printed your arrays and
find out exactly how the print runs differ, and what was spotted on the
arrays and where it came from. This should then help you decide what is the
best approach for analysing the data.

Cheers
Chris

Dr Chris Wilkinson

Research Officer (Bioinformatics)        | Visiting Research Fellow
Child Health Research Institute (CHRI)   | Microarray Analysis Group
7th floor, Clarence Rieger Building      | Room 121
Women's and Children's Hospital          | School of Applied Mathematics
72 King William Rd, North Adelaide, 5006 | The University of Adelaide, 5005

Math's Office (Room 121)        Ph: 8303 3714
CHRI   Office (CR2 52A)         Ph: 8161 6363

Christopher.Wilkinson at adelaide.edu.au

http://mag.maths.adelaide.edu.au/crwilkinson.html

> Can anyone help me with the following please?
>
> I have an experiment using two sets of arrays with different layouts and
> therefore different GAL files that I need to analyse together. A
> previous suggestion on this mailing list was to normalize and then
> combine the log ratios. Can anyone tell me what the code for combining 2
> MALists is please?
>
> Secondly one of the sets of arrays has the genes printed in duplicate
> and the other set does not. Is there a way I can use dupcor.series for
> the arrays with the duplicates and then combine them with the other set
> of arrays? (or at least take the average - without having to manually
> alter my gpr files)
>
> Failing all this, if I just combined the MALists as they are, will I
> have problems since the genes are in duplicate on some arrays with same
> IDs etc and not others?
>
> Finally....I have been told that the genes are the same on both types of
> slides - whether this means 100% the same or more or less the same, I'm
> not sure. If they are not completely identical how will the genes, that
> are only on one set of arrays, be dealt with? i.e. will they be excluded
> or will they be included in the calculations with data only from one set
> of arrays?
>
> Many thanks,
>
> Helen