[BioC] RMA probe summarization when sizes of probesets are unequal

Fri Aug 16 18:25:50 CEST 2013

Hi Xin,

On 8/16/2013 11:50 AM, Xin Lin [guest] wrote:
> Dear all,
>
> I have a customized two-channel microarray designed by NimbleGen (one of the last they produced) based on 26981 cDNA sequences of tomato. 60-mer oligonucleotide probes were designed and multiple probes were used for each transcript. The problem is, the sizes of probe-sets are not equal -- ranging from 5 to 1 (26784 of them have 5 probes).
>
> I am now trying to use rma() in oligo for normalization and summarization. My question is, in the situation where sizes of probe-sets are unequal, how does rma() do the probe summarization? Will it be problematic if I use rma() directly? If rma() could not do the correct job, what alternative method can I use for probe summarization?

The RMA algorithm has no problem with probesets of different sizes.
The number of probes per probeset has varied since pretty much the
first Affy array, and still does today, so this has never been an issue.
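
To make that concrete, here is a minimal sketch of the median-polish
step that RMA uses for summarization, applied to one probeset at a
time. It is written in plain Python for illustration only and is not
the code that oligo runs; the function name median_polish_summary and
the toy intensities are made up, and background correction and quantile
normalization are left out.

```python
from statistics import median

def median_polish_summary(probes, max_iter=10, tol=1e-8):
    """Summarize one probeset by Tukey's median polish.

    probes: list of probes, each a list of log2 intensities (one per array).
    Returns one value per array: overall level + array effect.
    """
    resid = [list(p) for p in probes]
    n_probes, n_arrays = len(resid), len(resid[0])
    overall = 0.0
    probe_eff = [0.0] * n_probes
    array_eff = [0.0] * n_arrays
    for _ in range(max_iter):
        moved = 0.0
        # Sweep the median of each probe (row) into the probe effects.
        for i in range(n_probes):
            m = median(resid[i])
            resid[i] = [x - m for x in resid[i]]
            probe_eff[i] += m
            moved += abs(m)
        shift = median(array_eff)
        array_eff = [a - shift for a in array_eff]
        overall += shift
        # Sweep the median of each array (column) into the array effects.
        for j in range(n_arrays):
            m = median(resid[i][j] for i in range(n_probes))
            for i in range(n_probes):
                resid[i][j] -= m
            array_eff[j] += m
            moved += abs(m)
        shift = median(probe_eff)
        probe_eff = [p - shift for p in probe_eff]
        overall += shift
        if moved < tol:  # nothing left to sweep out
            break
    return [overall + a for a in array_eff]

# Hypothetical log2 intensities on three arrays.
five_probes = [[6.0, 7.0, 8.0], [6.5, 7.5, 8.5], [7.0, 8.0, 9.0],
               [7.5, 8.5, 9.5], [8.0, 9.0, 10.0]]
one_probe = [[7.0, 8.0, 9.0]]

print(median_polish_summary(five_probes))
print(median_polish_summary(one_probe))
```

The same function handles both cases. With a single probe the fit
simply returns that probe's values, so nothing breaks; there is just
nothing to average over.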

You could argue that the reliability of the summary statistic that
rma() generates depends on the number of probes that went into the
summarization. That is certainly true in a statistical sense, but five
poorly-performing probes won't give a better estimate of a transcript's
level than a single well-performing probe, so it is hard to make a
blanket statement about an entire array. All else being equal, though,
rma() will give a better estimate from five probes than from a single
probe.
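
A small simulation illustrates both halves of that argument. This is a
sketch under stated assumptions: a single array, Gaussian probe noise
on the log2 scale, and the plain median of the probes standing in for
the RMA summary. The function summary_spread and the noise levels are
invented for the example.

```python
import random
from statistics import median, pstdev

def summary_spread(n_probes, probe_sd, true_level=8.0, n_sim=2000, seed=1):
    """Spread (SD) of a median-based summary over repeated simulated probesets."""
    rng = random.Random(seed)
    summaries = []
    for _ in range(n_sim):
        probes = [rng.gauss(true_level, probe_sd) for _ in range(n_probes)]
        summaries.append(median(probes))
    return pstdev(summaries)

good_five = summary_spread(n_probes=5, probe_sd=0.3)  # five good probes
good_one = summary_spread(n_probes=1, probe_sd=0.3)   # one good probe
poor_five = summary_spread(n_probes=5, probe_sd=1.0)  # five noisy probes

print(good_five, good_one, poor_five)
```

With equally good probes, five give a tighter summary than one. Five
much noisier probes, however, end up with a wider spread than the
single good probe.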

What people tend not to worry about is that we don't usually take this
into account in downstream analyses. In other words, if you have two
probesets, one with five probes and one with a single probe, then the
summarized expression value may be more accurate for the first than for
the second. If you then compute t-statistics for these two probesets,
nothing in the test accounts for the fact that the first probeset is
likely to measure the underlying expression of the gene more accurately
than the second.
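
The point is easy to see in code. The sketch below computes an ordinary
Welch t-statistic from summarized values only; the function welch_t and
the numbers are hypothetical. The number of probes behind each value is
not an input, so it cannot influence the result.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch two-sample t-statistic computed from summarized expression values."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se

# Hypothetical summarized log2 values, three arrays per group.
# Suppose the first probeset was built from 5 probes and the second from 1.
five_probe_set = {"control": [8.1, 7.9, 8.0], "treated": [9.0, 9.2, 9.1]}
one_probe_set = {"control": [8.1, 7.9, 8.0], "treated": [9.0, 9.2, 9.1]}

t_five = welch_t(five_probe_set["control"], five_probe_set["treated"])
t_one = welch_t(one_probe_set["control"], one_probe_set["treated"])
print(t_five, t_one)
```

Both probesets get exactly the same t-statistic, even though the
summary built from five probes is presumably the more trustworthy one.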

The puma package is designed to account for this variable uncertainty
(I must confess that I have never actually used it, however), so if you
are concerned about this sort of thing, you might look at that package.

Best,

Jim

>
> I am new to microarrays and R, and I would appreciate your help very much! Thank you for your time!
>
> Xin
>
>   -- output of sessionInfo():
>
>> sessionInfo()
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
>   [1] qvalue_1.34.0                 affyPLM_1.36.0                preprocessCore_1.22.0
>   [4] gcrma_2.32.0                  affy_1.38.1                   pd.121114.slycop.tm.exp_0.0.1
>   [7] pdInfoBuilder_1.24.0          affxparser_1.32.3             oligo_1.24.0
> [10] oligoClasses_1.22.0           geneplotter_1.38.0            lattice_0.20-15
> [13] annotate_1.38.0               AnnotationDbi_1.22.6          Biobase_2.20.1
> [16] BiocGenerics_0.6.0            RColorBrewer_1.0-5            limma_3.16.6
> [19] genefilter_1.42.0             RSQLite_0.11.4                DBI_0.2-7
>
> loaded via a namespace (and not attached):
>   [1] affyio_1.28.0        BiocInstaller_1.10.2 Biostrings_2.28.0    bit_1.1-10
>   [5] codetools_0.2-8      ff_2.2-11            foreach_1.4.1        GenomicRanges_1.12.4
>   [9] grid_3.0.1           IRanges_1.18.2       iterators_1.0.6      splines_3.0.1
> [13] stats4_3.0.1         survival_2.37-4      tcltk_3.0.1          tools_3.0.1
> [17] XML_3.95-0.2         xtable_1.7-1         zlibbioc_1.6.0

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099