[BioC] problems about cDNA vs genomic arrays normalization

yanju yanju at liacs.nl
Tue Nov 21 13:05:08 CET 2006


Dear Jenny,

Generally, I got your point, but one thing is still not clear. You mentioned 
"you'll change your design matrix accordingly (no -1s!)".  May I know 
the reason why? I generated the design matrix like this:
    # design <- modelMatrix(targets, ref="gDNA")
    > design
            wt16 wt20 wt24
    sample1   -1    0    0
    sample2   -1    0    0
    sample3   -1    0    0
My dataset was generated on dual-channel arrays without dye swaps. How 
should I change my design matrix?

Regards,
Yanju
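
One way to read the "no -1s" remark: in a log-ratio design, the sign
records which dye carries the reference. Once log2 single-channel
intensities replace the M values, each array contributes one measurement
per gene, so the design only needs 0/1 indicators of group membership.
The sketch below uses base R only. The targets table, its column names
and the number of arrays per group are hypothetical and are not taken
from the actual experiment.

```r
# Hypothetical targets table: one row per array, giving the group of the
# cDNA sample hybridised to it (array names and group sizes are made up).
targets <- data.frame(
  Array = paste0("sample", 1:6),
  Group = c("wt16", "wt16", "wt20", "wt20", "wt24", "wt24")
)

# Treat the group as a factor and build a design with one column per
# group and no intercept, as one would for single-channel data.
group <- factor(targets$Group, levels = c("wt16", "wt20", "wt24"))
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)
rownames(design) <- targets$Array

# Every entry is 0 or 1, and each array belongs to exactly one group.
print(design)
```

A matrix of this shape could then be passed to lmFit() together with the
object holding the log2 intensities, and group comparisons defined with
makeContrasts() on the column names.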





Jenny Drnevich wrote:

> Hi Yanju,
>
>
>> After reading your explanation, I still have two questions.
>> 1. Earlier I also applied the normalizeWithinArrays() method to this 
>> dataset.  Do you think that is correct or necessary in my case?
>
>
> No, you should not do normalizeWithinArrays! This assumes that most 
> genes are not changing expression between the two samples on one 
> array, and in your case you have every reason to expect that the 
> 'expression' levels of genomic DNA will not be anything like cDNA from 
> your experimental groups, as you mentioned in your first post.
>
>
>> 2. You said "For the statistical analysis, you use the R values 
>> directly."  But normalizeBetweenArrays() generated an MAList, which 
>> contains M and A values etc. but no R values (red 
>> channel intensities).
>
>
> It's easy to convert between RGLists, which contain R and G values, 
> and MALists, which have M and A values. See 'RG.MA' and 'MA.RG' - 
> they're explained at the end of the details section of the help page 
> for 'normalizeWithinArrays'. Another thing - Are you doing a 
> background correction first? Because if you don't, and do 
> 'normalizeWithinArrays' or 'normalizeBetweenArrays' on a RGList that 
> still has the Rb and Gb items in it, a simple background subtraction 
> will be done automatically. This is not necessarily a good thing IMO 
> because a negative R or G value in either channel will cause the M & 
> A values to be lost, so that you cannot recreate the R & G values 
> again. Let's say for simplicity's sake that RG is your original RGList 
> before any pre-processing, and the genomic DNA is in the Green channel 
> on each slide. I would do something like this:
>
> RG.nobg <- backgroundCorrect(RG, method="none")
>         # or maybe pick "half" to avoid neg. values
>
> MA.nobg.Gquant <- normalizeBetweenArrays(RG.nobg,method="Gquantile")
>         # do a quantile normalization on the G / genomic values
>
> RG.nobg.Gquant <- RG.MA(MA.nobg.Gquant)
>         # convert the MAList back to a RGList
>
> MA.fake <- MA.nobg.Gquant
>         # create a MAList to manipulate
>
> MA.fake$M <- log2(RG.nobg.Gquant$R)
>         # replace the M values with the log2(R) values so you can do
>         # the analysis on them
>
> You can now proceed with the analysis as if you had Affymetrix-type 
> data. You'll have to change your design matrix accordingly (no -1s!), 
> but the rest of your analysis should be the same as you have below. It 
> gets a bit more complicated if the genomic DNA is not all in the G 
> channel - after the background correction you have to switch the R & G 
> values for the arrays that have genomic DNA in the R channel, then 
> account for the dye effect by fitting a block effect using 
> 'duplicateCorrelation'. It's very similar to the Technical 
> Replication/Randomized Block section of the limma vignette.
>
> Good luck,
> Jenny
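
The background-correction caveat in the reply above can be seen with a
few made-up numbers. The sketch below uses only base R and hypothetical
intensities (R_fg and R_bg are illustrative names, not limma objects).
The 0.5 floor mirrors what limma documents for
backgroundCorrect(method="half"), but this is not limma's code.

```r
# Hypothetical foreground and background intensities for four spots
# in one channel (made-up numbers, for illustration only).
R_fg <- c(1200, 300,  95, 640)
R_bg <- c( 100, 120, 110,  90)

# Plain subtraction: the third spot goes negative.
R_sub <- R_fg - R_bg

# log2 of a negative number is NaN (R also warns), so any M or A value
# built from that spot is lost and R or G cannot be recovered from it.
logR_sub <- suppressWarnings(log2(R_sub))

# "half"-style correction: subtract, then reset anything below 0.5 to 0.5,
# which keeps every spot on the log scale.
R_half <- pmax(R_fg - R_bg, 0.5)
logR_half <- log2(R_half)

# No correction at all simply keeps the raw foreground values.
logR_none <- log2(R_fg)

print(data.frame(R_fg, R_bg, R_sub, logR_sub, logR_half, logR_none))
```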
>
>
>
>> And then I fitted my MAList to the linear model using:
>>    design<-modelMatrix(targets, ref="gDNA")
>>    fit<-lmFit(ma.paq,design)
>> I think all my subsequent analyses are based on the M values. Finally, 
>> I used the eBayes function to compute summary statistics in order to 
>> detect the most differentially expressed genes.
>>    cont.matrix<-makeContrasts( WTvsMU=wt-mu,levels=design)
>>    fit2<-contrasts.fit(fit,cont.matrix)
>>    fit2<-eBayes(fit2)
>> So I have no idea how to use the R values directly. Was my code wrong?
>> I was not quite sure about my code or method, because in the end I 
>> got some uninterpretable results which did not meet the expectations 
>> of the biologists. That is why I am now rechecking my code and methods.  
>> Thank you again, and also Wolfgang, for your kind help.
>>
>> Kind regards,
>> Yanju
>>
>>
>>
>> Jenny Drnevich wrote:
>>
>>> Hi Yanju,
>>>
>>> I have just been working with a couple of data sets similar to yours 
>>> where a) one channel has the same reference and b) the assumptions 
>>> of few differences between sample and reference are not necessarily 
>>> upheld. In these cases I have been using the Rquantile or Gquantile 
>>> methods of normalizeBetweenArrays() in limma. These methods will do 
>>> a quantile normalization on the R or G channel indicated so they 
>>> have the "same empirical distribution across arrays, leaving the 
>>> M-values (log-ratios) unchanged." Say your reference is in the green 
>>> channel - doing a Gquantile normalization would force all the 
>>> reference values to have the same distribution, and then adjust the 
>>> R channel values accordingly. For the statistical analysis, you use 
>>> the R values directly because if you use the M values, it would be 
>>> like you never did the normalization. If the reference is not all in 
>>> the same channel, I manipulate the RGList so that they are all in 
>>> the same channel, but then I also include 'dye' as a batch effect in 
>>> the model.
>>>
>>> HTH,
>>> Jenny
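
The point that a Gquantile-style normalization leaves the M-values
unchanged while moving the log2(R) values can be checked numerically.
What follows is a base-R sketch of the idea on simulated data, not
limma's implementation: quantile_normalize() is a hypothetical helper,
the intensities are random, and the green channel is assumed to hold the
reference on every array.

```r
set.seed(1)
n_genes  <- 200
n_arrays <- 3

# Simulated reference (green) intensities, with a different overall
# scale on each array to mimic array-to-array variation.
G <- matrix(2^rnorm(n_genes * n_arrays, mean = 10, sd = 1), nrow = n_genes)
G <- sweep(G, 2, c(1, 2, 0.5), "*")

# Simulated sample (red) intensities and the resulting log-ratios.
R <- G * matrix(2^rnorm(n_genes * n_arrays, sd = 1.5), nrow = n_genes)
M <- log2(R) - log2(G)

# Hypothetical helper: force every column to share the same empirical
# distribution (the mean of the sorted columns).
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")
  target <- rowMeans(apply(x, 2, sort))
  apply(ranks, 2, function(r) target[r])
}

# Normalize the reference channel on the log scale, then shift the red
# channel by the same amount so that each log-ratio is preserved.
logG_norm <- quantile_normalize(log2(G))
logR_norm <- logG_norm + M
M_after   <- logR_norm - logG_norm

# M is unchanged, but log2(R) is not, which is why the analysis would
# have to use the adjusted R values to benefit from the normalization.
cat("M unchanged:      ", isTRUE(all.equal(M, M_after)), "\n")
cat("log2(R) unchanged:", isTRUE(all.equal(log2(R), logR_norm)), "\n")
```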
>>>
>>> At 10:32 AM 11/20/2006, yanju wrote:
>>>
>>>> Dear all,
>>>>
>>>> I have a microarray dataset derived from a common reference design.
>>>> The common reference is genomic DNA.  In the usual normalization, we
>>>> assume that a large fraction of genes is not differentially expressed,
>>>> and adjustment strategies are then used so that the log-ratios have a
>>>> median (or mean) of 0. But in my case, every spot would have the same
>>>> observed signal in the genomic channel, while the signals in the cDNA
>>>> channel vary greatly. Therefore, the strategies I just mentioned are
>>>> not suitable. I was wondering how to normalize this kind of data. Are
>>>> there any existing packages or functions for this? I look forward to
>>>> your reply.
>>>>
>>>> Regards,
>>>> Yanju
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: 
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>>
>>> Jenny Drnevich, Ph.D.
>>>
>>> Functional Genomics Bioinformatics Specialist
>>> W.M. Keck Center for Comparative and Functional Genomics
>>> Roy J. Carver Biotechnology Center
>>> University of Illinois, Urbana-Champaign
>>>
>>> 330 ERML
>>> 1201 W. Gregory Dr.
>>> Urbana, IL 61801
>>> USA
>>>
>>> ph: 217-244-7355
>>> fax: 217-265-5066
>>> e-mail: drnevich at uiuc.edu
>>
>>
>


