[BioC] Combining two datasets - help to use GeneMeta.

Robert Gentleman rgentlem at fhcrc.org
Mon Jun 12 17:24:45 CEST 2006


Hi,
  A bit, but you probably want to read the paper I referenced, as it has 
more complete details. I also ought to emphasize at the outset that 
this argument is the wrong way around. If you want to do something (such 
as joint normalization) then it is incumbent on you to state why and 
under what assumptions it is sensible. I can easily state the ones under 
which separate normalization followed by a random effects model is 
appropriate, and they are, AFAICS, a superset of those where joint 
normalization would work.

Gordon Barr wrote:
> Robert
> 
> Could you elaborate a bit on why you think it a bad idea to normalize 
> separate experiments together. If  you normalize each experiment 
> separately are you requiring the same conditions in each?

  No, essentially the opposite. Normalization together presumes that the 
conditions were essentially the same and separate normalization allows 
them to be different. When they are the same, then separate 
normalization will almost surely be a bit less efficient (in a 
statistical sense) and when they are really different joint 
normalization can be very problematic.

  Essentially the problem is that normalization presumes things (few 
genes are differentially expressed, the rank order of the expression 
values is approximately correct, etc.) that tend to hold within a single 
experiment but can be quite wrong across different experiments.
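
A minimal sketch of that point, using quantile normalization as a stand-in (the thread does not name a specific method, so this choice is an assumption, and the data and the function name `quantile_normalize` are hypothetical). Normalizing the two experiments jointly forces every array onto one shared distribution, while normalizing them separately lets each experiment keep its own.

```python
def quantile_normalize(arrays):
    """Give every array the same distribution: the mean of the sorted arrays.

    Ties are broken arbitrarily by sort order; this sketch does not average them.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean across arrays at each rank.
    reference = [sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # indices, smallest value first
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized

# Hypothetical intensities: two arrays per experiment, four probes per array.
exp1 = [[1.0, 2.0, 3.0, 4.0], [1.2, 2.1, 2.9, 4.2]]
exp2 = [[1.0, 2.0, 6.0, 8.0], [1.1, 2.2, 6.3, 7.9]]  # upper half really is higher

separate = quantile_normalize(exp1) + quantile_normalize(exp2)
joint = quantile_normalize(exp1 + exp2)

print("separate, top value per array:", [max(a) for a in separate])
print("joint,    top value per array:", [max(a) for a in joint])
```

After the joint call all four arrays hold exactly the same set of values, so the difference between the experiments in the upper half of the distribution is no longer visible at the array level.
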
  Another way of thinking of normalization is that you essentially want 
to fit a model to Y (the observed spot intensities) and correct for all 
experimental covariates, X (but none of the biological ones you intend 
to test for),
   Y = Xb + e
and then you throw away the fitted Xb and proceed to analyze the e's 
(the residuals).
Most of the methods around try to do this without requiring explicit 
statements of X, but most would undoubtedly be improved if some parts of 
X could be specified (reagent batch, slide batch, technician, day of 
week, sample handling etc).
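
A minimal sketch of the Y = Xb + e view, assuming the simplest possible X: a single categorical covariate (a slide batch label) coded as one-hot indicators. In that special case the least-squares fit Xb is just the mean of each batch, so the residuals e are the values minus their own batch mean. The data and the function name `remove_batch_effect` are hypothetical.

```python
from collections import defaultdict

def remove_batch_effect(y, batch):
    """Fit Y = Xb + e with X a one-hot batch indicator and return the residuals e.

    With only one categorical covariate, the least-squares estimate of Xb
    is the per-batch mean, so no matrix algebra is needed here.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, label in zip(y, batch):
        sums[label] += value
        counts[label] += 1
    fitted = {label: sums[label] / counts[label] for label in sums}
    return [value - fitted[label] for value, label in zip(y, batch)]

# Hypothetical log-intensities for one probe on six arrays from two slide batches.
y = [7.0, 7.4, 7.2, 9.1, 9.5, 9.3]
batch = ["A", "A", "A", "B", "B", "B"]

residuals = remove_batch_effect(y, batch)
print(residuals)
```

This only makes sense when the batch label is not confounded with the biological factor of interest. If all of batch B were one treatment group, subtracting Xb would remove the biology along with the batch.
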
  Back to the main story: since the X's are very different in two 
different experiments, there are some real problems that arise from 
assuming that they are the same.
  On the other hand, keeping them separate and then using a random 
effects model seems to be appropriate in all cases and better reflects 
our belief about the data (at least I have only encountered situations 
where experiments should be treated as random effects). This stuff works 
and is appropriate - one only hopes that sooner or later folks will 
start to realize that just because you can do something does not mean 
you should. Statistical manipulations of data are merely mathematical 
transformations; they can always be carried out. The art is in 
determining when it is sensible to do so, and for my money (and that of 
the people whose data I analyze) joint normalization makes no sense.
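
A minimal sketch of the "keep them separate, then combine" approach for a single gene, assuming each experiment has already been normalized and analyzed on its own and has produced an effect estimate with a variance. It uses the DerSimonian-Laird moment estimator for the between-experiment variance, which is one standard choice and not necessarily the model meant above; unbalanced designs may need a mixed model instead. The function name `random_effects_combine` and the numbers are hypothetical.

```python
def random_effects_combine(effects, variances):
    """Combine per-experiment effects under a random-effects model.

    Returns (pooled effect, between-experiment variance tau2, standard error),
    with tau2 from the DerSimonian-Laird moment estimator.
    """
    k = len(effects)
    w = [1.0 / v for v in variances]           # fixed-effect (inverse-variance) weights
    sw = sum(w)
    fixed = sum(wi * d for wi, d in zip(w, effects)) / sw
    # Cochran's Q: weighted spread of the experiments around the fixed-effect mean.
    q = sum(wi * (d - fixed) ** 2 for wi, d in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add tau2 to every experiment's own variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(ws * d for ws, d in zip(w_star, effects)) / sum(w_star)
    se = (1.0 / sum(w_star)) ** 0.5
    return pooled, tau2, se

# Hypothetical effect sizes for one gene from two separately normalized experiments.
pooled, tau2, se = random_effects_combine([0.8, 0.3], [0.04, 0.09])
print(pooled, tau2, se)
```

When the experiments agree, tau2 shrinks to zero and the result reduces to the usual inverse-variance average. When they disagree, tau2 grows and the pooled standard error widens, so the disagreement shows up as uncertainty.
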

   best wishes
    Robert


> 
> Thanks
> 
> Sincerely,
> 
> Gordon
> 
> Senior Research Scientist
> Developmental Psychobiology
> NYS Psychiatric Institute
> Columbia College of Physicians and Surgeons
> 1051 Riverside Drive
> New York, New York 10032
> 212-543-5694 (voice)
> 212-543-5497 (fax)
> 
> On Jun 11, 2006, at 2:23 PM, Robert Gentleman wrote:
> 
>>
>>
>> Sean Davis wrote:
>>> Sharon wrote:
>>>> Hi,
>>>>
>>>> I am trying to combine two Affy datasets (on rae230a chips) from
>>>> experiments done one year apart. In the first dataset, we have 2
>>>> strains with each strain treated and untreated.  But for the second
>>>> dataset, we have just 2 strains untreated.
>>>>
>>>> Because of unequal levels in the 2 datasets, I am not able to use
>>>> 'getdF'  in GeneMeta as it is.  Any suggestions for using 'getdF' for
>>>> this situation?  or any alternate way of combining these 2 datasets?
>>>
>>> Are these datasets really that much different that you can't just
>>> combine them?  They may be, but have you looked at affyPLM results,
>>> density plots, etc., just to be sure?  If they aren't that much
>>> different, perhaps you can just normalize them together and move on?
>>> Just asking....
>>
>>   Sorry, but that is, IMHO, a bad idea. You should never jointly
>> normalize separate experiments. Normalize separately and use a random
>> effects model for the experiments. As for how to handle different
>> levels of factors/covariates, the issue then becomes one of what can be
>> estimated from both. Once you identify that, you can set up the
>> appropriate model and then use tools like nlme and lmer (depending on
>> the model) to estimate parameters. But this will require some
>> statistical expertise and for that you will have to look locally; these
>> things are too hard to do over the internet, IMHO.
>>   There is a BioC technical report on Synthesis of microarray
>> experiments that outlines some of these details more completely.
>>
>>
>>   best wishes
>>    Robert
>>
>>>
>>> Sean
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: 
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>> --Robert Gentleman, PhD
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M2-B876
>> PO Box 19024
>> Seattle, Washington 98109-1024
>> 206-667-7700
>> rgentlem at fhcrc.org
>>
> 
> 



