[BioC] Mismatch probe handling for exon arrays

Fri Oct 13 15:56:05 CEST 2006

Hi Steven,

Steven McKinney wrote:
> The new Affy Exon chips do not have a mismatch probe for
> every perfect match probe, but rather have a collection
> of GC-varied background probes (about 50000 probes out
> of the 6 million on the chip - the "background probe 
> collection" BGP). 
> 
> Info is in, amongst others, 
> http://www.affymetrix.com/support/technical/whitepapers/exon_background_correction_whitepaper.pdf
> 
> Affy has modified their PLIER algorithm to use this
> background probe collection to perform a "PM-GCBG"
> correction, using the median BGP intensity for probes
> with the same GC content as the PM probe.
> 
> Has any implementation of the PM-GCBG idea been done for
> justRMA() or justPLIER() in R/BioC?

Certainly not for justRMA(), since RMA doesn't use MM probe values at 
all: its background correction is based on the PM intensities only.
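
For reference, the PM-GCBG idea described above (subtract from each PM 
intensity the median intensity of the background probes with the same GC 
content) can be sketched in a few lines. This is only an illustration of 
the description in the quoted text, not Affymetrix's implementation and 
not anything that exists in justRMA() or justPLIER(). Python is used 
purely for illustration, and the function names and the toy data are 
hypothetical.

```python
from collections import defaultdict
from statistics import median


def gc_count(seq):
    """Count the G and C bases in a probe sequence."""
    return sum(base in "GC" for base in seq.upper())


def gc_background_medians(bg_probes):
    """Map each GC count to the median intensity of the background
    probes that have that GC count.

    bg_probes: iterable of (sequence, intensity) pairs.
    """
    bins = defaultdict(list)
    for seq, intensity in bg_probes:
        bins[gc_count(seq)].append(intensity)
    return {gc: median(values) for gc, values in bins.items()}


def pm_gcbg(pm_probes, bg_medians):
    """Subtract the GC-matched background median from each PM intensity.

    Raises KeyError if a PM probe has a GC count with no background bin.
    """
    return [
        intensity - bg_medians[gc_count(seq)]
        for seq, intensity in pm_probes
    ]


# Toy data (made up): three background probes and two PM probes.
background = [("ATATGC", 10.0), ("ATGCAT", 14.0), ("GCGCAT", 30.0)]
perfect_match = [("TTGCAA", 100.0), ("GGCCAT", 250.0)]

medians = gc_background_medians(background)
corrected = pm_gcbg(perfect_match, medians)
```

Note that a plain subtraction like this can produce negative values when 
a PM intensity falls below its background median, so a real 
implementation would have to decide how to handle those.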

> 
> Can anyone comment on which input parameters or
> control options of these functions should be specifically
> set to allow these functions to do a reasonable job
> of normalizing/correcting exon array data?  
> 
> For example, justPLIER() has argument usemm=TRUE
> but no documentation about it - no doubt
> usemm = FALSE is appropriate for the exon arrays.
> But, will the algorithms still perform alright?
> 
> Are there newer versions of the algorithms that handle
> the exon data configuration?

The problem with the exon arrays right now has to do with the amount of 
data involved and the current paradigm we use for analyzing these data. 
Currently we hold all the data in RAM, and given R's pass-by-value 
semantics, there can be quite a bit of copying. This isn't such a 
problem with the 3' biased arrays from Affy, especially if you have a 
reasonable amount of RAM.

Unfortunately, the larger genotyping arrays and the exon arrays are so 
huge that this paradigm is really not working well anymore. Going 
forward the goal is to transition from holding the data in RAM to 
putting it all in SQLite databases, so one can work with a subset of 
data that is appropriate given the amount of RAM available. Since this 
will involve putting three things in databases (the cdf information, the 
annotation information, and the data) that will all have to play nice 
together, it is inherently a slow process.
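
To make the idea concrete, here is a minimal sketch of the "keep it in a 
database, pull out only the subset you need" approach. It is not the 
planned Bioconductor design: the table layout, column names, and 
function names are all assumptions made up for illustration, and Python's 
built-in sqlite3 module is used only because it keeps the example short.

```python
import sqlite3


def build_db(rows, path=":memory:"):
    """Create a hypothetical intensity table and load rows of
    (probeset_id, probe_id, sample, value) into it."""
    con = sqlite3.connect(path)
    con.execute(
        "CREATE TABLE intensity ("
        "probeset_id TEXT, probe_id INTEGER, sample TEXT, value REAL)"
    )
    # Index the column we subset on, so lookups avoid a full table scan.
    con.execute("CREATE INDEX idx_probeset ON intensity (probeset_id)")
    con.executemany("INSERT INTO intensity VALUES (?, ?, ?, ?)", rows)
    con.commit()
    return con


def fetch_probesets(con, probeset_ids):
    """Return only the rows belonging to the requested probesets."""
    placeholders = ",".join("?" for _ in probeset_ids)
    query = (
        "SELECT probeset_id, probe_id, sample, value FROM intensity "
        f"WHERE probeset_id IN ({placeholders}) "
        "ORDER BY probeset_id, probe_id, sample"
    )
    return con.execute(query, list(probeset_ids)).fetchall()


# Toy data (made up): two probesets measured on one sample.
rows = [
    ("ps1", 1, "sampleA", 10.0),
    ("ps1", 2, "sampleA", 12.0),
    ("ps2", 3, "sampleA", 7.0),
]
con = build_db(rows)
subset = fetch_probesets(con, ["ps2"])
```

Only the rows returned by the query are held in memory, so the working 
set is bounded by the size of the requested subset rather than by the 
size of the whole array.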

So, long story short, unless you have a 64-bit operating system and LOTS 
of RAM, the choice of algorithm for computing expression values is 
currently a moot point.

Best,

Jim

> 
> Any feedback appreciated.
> 
> 
> 
> Steven McKinney
> 
> Statistician
> Molecular Oncology and Breast Cancer Program
> British Columbia Cancer Research Centre
> 
> email: smckinney at bccrc.ca
> 
> tel: 604-675-8000 x7561
> 
> BCCRC
> Molecular Oncology
> 675 West 10th Ave, Floor 4
> Vancouver B.C. 
> V5Z 1L3
> Canada
> 

-- 
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623
