[BioC] Help on invariantset normalization function

Mon Jul 2 18:31:09 CEST 2012

Hi Sophie,

On 7/2/2012 10:35 AM, Sophie Lamarre wrote:
> Hello Jim,
>
> I have 151 patients in my file and 16,417 genes to normalize, not
> counting the 20 housekeeping genes.
> I want to try different normalization methods using housekeeping genes.
> The classic method is to calculate the mean of the (selected)
> housekeeping genes for each patient, and subtract this value from each
> gene of the same patient.
>
> I would like to try the invariant set method with my data file and my
> list of housekeeping genes.
> When I read the help, it said I needed two vectors: my data to
> normalize and the intensities of the housekeeping genes (which help me
> to normalize):

Ah, I see. The problem here is that you misunderstand what 
normalize.invariantset() is intended to do. It is not intended to do 
what you want, which is to use a set of housekeeping genes to normalize 
the data. Instead, this is really an internal function for 
normalize.AffyBatch.invariantset().

The idea here is to take one chip (which is what you did) and an
artificially derived 'reference' chip that contains the same number of
genes as your chip (each reference value being the mean, median, etc.
for that gene). The function then determines which genes don't change
expression between the two and fits a line on those 'invariant' genes,
which is then used to normalize your data. If your two vectors are not
the same length, you will get the error you see.
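
As a rough illustration of that idea, here is a simplified one-pass
sketch in base R. It is not the code affy actually runs: the toy data,
the median baseline, and the 0.05 rank-difference cutoff are all
assumptions made up for this example.

```r
# Toy data: 500 genes on 4 arrays sharing one underlying signal (assumed).
set.seed(42)
n_genes <- 500
true_signal <- rnorm(n_genes, mean = 8, sd = 2)
arrays <- sapply(1:4, function(i) true_signal + rnorm(n_genes, sd = 0.2))

chip <- arrays[, 1]               # the one chip to normalize
ref  <- apply(arrays, 1, median)  # artificial reference: per-gene median

# Both vectors must describe the same genes, so they must be the same length.
stopifnot(length(chip) == length(ref))

# Call a gene 'invariant' if its rank barely moves between chip and reference.
# The 0.05 cutoff is an arbitrary choice for this sketch.
rank_diff <- abs(rank(chip) - rank(ref)) / n_genes
invariant <- rank_diff < 0.05

# Fit a smooth curve on the invariant genes only, then apply it to every gene.
fit <- smooth.spline(chip[invariant], ref[invariant])
chip_normalized <- predict(fit, chip)$y
```

Note that a 20-element vector of housekeeping intensities cannot play
the role of `ref` here, because it does not line up gene by gene with
the 16,417-element data vector.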

This is quite different from what you want to do. I don't think there 
are any functions to do such a simple normalization, and quite frankly 
what you propose is neither classic nor recommended (if by classic you 
mean 'a very common and accepted method' rather than 'what people did 
way back in the past before they knew better').

To do what you propose is just a simple application of colMeans() and 
sweep().
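
A minimal sketch of that, assuming log-scale data with genes in rows and
patients in columns. The object names `expr` and `hk` and the toy values
are hypothetical stand-ins for your two matrices.

```r
# Toy data: 6 genes and 2 housekeeping genes measured on 3 patients.
set.seed(1)
patients <- paste0("pat", 1:3)
expr <- matrix(rnorm(18, mean = 10), nrow = 6,
               dimnames = list(paste0("gene", 1:6), patients))
hk <- matrix(rnorm(6, mean = 15), nrow = 2,
             dimnames = list(c("hk1", "hk2"), patients))

# One housekeeping mean per patient (i.e. per column).
hk_means <- colMeans(hk)

# Subtract each patient's housekeeping mean from all genes of that patient.
# MARGIN = 2 tells sweep() to match hk_means against the columns.
normalized <- sweep(expr, 2, hk_means, FUN = "-")
```

The columns of the two matrices have to be in the same patient order,
since sweep() matches the means to columns by position, not by name.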

Best,

Jim

>
>       Usage
>
> normalize.AffyBatch.invariantset(abatch, prd.td = c(0.003, 0.007),
>                                   verbose = FALSE,
>                                   baseline.type = c("mean","median","pseudo-mean","pseudo-median"),
>                                   type = c("separate","pmonly","mmonly","together"))
>
> normalize.invariantset(data, ref, prd.td=c(0.003,0.007))
>
>
>       Arguments
>
> abatch   an AffyBatch object.
> data     a vector of intensities on a chip (to normalize to the
>          reference).
> ref      a vector of reference intensities.
>
>
> Thank you for your help,
>
> Kind Regards,
> -- 
> Sophie LAMARRE
>
>
> On 02/07/2012 16:12, James W. MacDonald wrote:
>> Hi Sophie,
>>
>> On 7/2/2012 8:03 AM, Sophie Lamarre wrote:
>>> Hello,
>>>
>>> I am trying the invariantset normalization function (affy package) on
>>> my data:
>>>
>>>>   test_pat1 = normalize.invariantset(data_ready_to_normalize_met1[,1],
>>> +                                    bd_20hk_norm[,1],
>>> +                                    prd.td=c(0.003,0.007))
>>> Error in while ((ns.old - ns) > 50) { :
>>>     missing value where TRUE/FALSE needed
>>
>> When you do
>>
>> data_ready_to_normalize_met1[,1]
>>
>>
>> you are selecting data from only one array. It isn't possible to 
>> figure out which probesets are invariant with only one array (because 
>> the implication is that the probesets don't vary in any array).
>>
>> Is there a particular reason that you are trying to normalize just 
>> one array?
>>
>> Best,
>>
>> Jim
>>
>>
>>
>>>
>>>
>>> # My data to normalize
>>>
>>>>   data_ready_to_normalize_met1[1:5,1]
>>> [1]  5.803779 11.566477  8.583049  8.531674  9.490483
>>>
>>> # My vector containing my 20 housekeeping genes
>>>>   bd_20hk_norm[1:5,1]
>>> [1] 14.92680 15.58281 15.15885 15.09599 15.23146
>>>
>>> My session info:
>>>
>>>
>>>>   sessionInfo()
>>> R version 2.14.1 (2011-12-22)
>>> Platform: x86_64-redhat-linux-gnu (64-bit)
>>>
>>> locale:
>>>  [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C               LC_TIME=fr_FR.UTF-8
>>>  [4] LC_COLLATE=fr_FR.UTF-8     LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8
>>>  [7] LC_PAPER=C                 LC_NAME=C                  LC_ADDRESS=C
>>> [10] LC_TELEPHONE=C             LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] grid      stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>>  [1] affy_1.32.1           preprocessCore_1.16.0 gplots_2.10.1         KernSmooth_2.23-7
>>>  [5] caTools_1.13          bitops_1.0-4.1        gdata_2.8.2           gtools_2.6.2
>>>  [9] geneplotter_1.32.1    lattice_0.20-0        annotate_1.32.3       AnnotationDbi_1.16.19
>>> [13] Biobase_2.14.0        limma_3.10.3
>>>
>>> loaded via a namespace (and not attached):
>>> [1] affyio_1.22.0       BiocInstaller_1.2.1 DBI_0.2-5           IRanges_1.12.6
>>> [5] RColorBrewer_1.0-5  RSQLite_0.11.1      tools_2.14.1        xtable_1.7-0
>>> [9] zlibbioc_1.0.1
>>>
>>>
>>> I have no missing value:
>>>
>>>>   test = is.na(data_ready_to_normalize_met1[,1])
>>>>   sum(test)
>>> [1] 0
>>>
>>>
>>>
>>> Could you help me or give me an example so that I can resolve my
>>> problem?
>>>
>>> Thank you very much,
>>>
>>> Kind Regards,
>>>
>>> Sophie LAMARRE
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: 
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099