[BioC] Some Genefilter questions

Robert Gentleman rgentlem at fhcrc.org
Fri Dec 1 20:02:37 CET 2006

Jenny Drnevich wrote:
> Hi Amy,
>> Jenny, just wanted to clarify what you said; you reckon if I only want to
>> remove the foreign species probesets I should do this before
>> preprocessing, but if I want to remove e.g. absent calls from my own
>> species probes I should do this after preprocessing.  Is this right?
> Yes, IMO, at least if you're doing the GCRMA background correction. With 
> the soybean data I've worked with, I've seen very large differences in 
> the GC-based background correction depending on whether the other 
> species' probesets were removed or not. Soybeans might be unusual

  I hate to say it, but I have reason to believe that this is not a 
function of the removal itself. The version of GCRMA in Bioconductor uses 
subsampling, which can be greatly affected by what you did (but the same 
effect can be achieved in other ways, without reducing the number of probes).

  It would help to know if similar effects are observed with other 
methods, particularly either RMA or VSN.

> because about 90% of the soybean probesets seem to be expressed, so that 
> throwing out the non-expressed, non-soy probesets, radically changed the 
> distribution of the values sampled for the background estimation. I 
> created a scenario where 30% of the probesets were non-expressed 
> non-soy, 35% were non-expressed soy, and the remaining 35% were 
> expressed soy. The changes in the background correction after throwing 
> out the non-soy were not as extreme, but still could have a large effect 
> (over 4 FC!!) at low expression levels. I'm not sure which is "right" 
> and which is "wrong", but I tend to agree with Jim that I don't feel 
> comfortable using other species' non-expressed probesets to estimate 
> background or normalization distributions for my target species. 
> However, RMA's background correction wasn't really affected by throwing 
> out the non-soy probesets or not.

   I don't think Jim disagrees on the background - that should be fine. 
The real question is normalization, and well, there are reasons both pro 
and con.  I personally doubt the effect is large enough to warrant the 
effort in "fixing" it, if that is indeed what is happening.

>> Also, how do I create the character vector of my parasite probesets for
>> your code?
> You said before they all start with "Pf", so you can do something 
> similar to what Robert suggested
>     parasites <- grep("Pf", geneNames(yourAffyBatchObject), value=TRUE)
> Giving the argument 'value=TRUE' will give you the gene names, instead 
> of their indices. BTW Robert - you had put "^Pf" - was the ^ a typo, or 
> does that indicate 'begins with' rather than 'anywhere'?

  The ^ means "begins with": Amy said the names "begin with" Pf, and I 
do not want to match Pf anywhere in the name. It is kind of important to 
get this right - not a typo.
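
  A minimal sketch of the difference, in base R only. The probeset names 
and the small expression matrix below are made up for illustration; they 
are not taken from any real chip, and with real data the names would come 
from the AffyBatch or the preprocessed expression set.

```r
# Hypothetical probeset IDs, invented for illustration only
ids <- c("Pf.1.1_at", "Pf.2.15_s_at", "Ag.3R.100_at",
         "AFFX-Pf-ctrl_at", "Ag.X.Pf7_at")

# "Pf" matches the pattern anywhere in the name (4 of the 5 IDs here)
anywhere <- grep("Pf", ids, value = TRUE)

# "^Pf" anchors the pattern to the start of the name (2 of the 5 IDs here)
begins_with <- grep("^Pf", ids, value = TRUE)

print(anywhere)
print(begins_with)

# Toy expression matrix (rows = probesets, columns = samples), standing in
# for the output of a preprocessing step
expr <- matrix(seq_along(ids) * 1.0, nrow = length(ids), ncol = 2,
               dimnames = list(ids, c("s1", "s2")))

# Drop only the probesets whose names begin with "Pf"
kept <- expr[!(rownames(expr) %in% begins_with), , drop = FALSE]
print(rownames(kept))
```

  Without the anchor, two probesets that merely contain "Pf" would have 
been removed as well, which is why the ^ matters.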

>> Robert, I tried subsetting after preprocessing but before analysis ... it
>> made no difference to the order of probesets, however the numbers changed
>> slightly (all the probesets had slightly higher adjusted P.values after
>> removing the parasite probes).  See below:
>> Why would the adjusted P values be higher in the second case (number of
>> parasite probes removed was about 4,000)?
> This is due to the phenomenon that Claus mentioned - by removing the 
> parasite probes, which have low variation, the average variance across 
> genes will increase, subsequently leading to smaller t-values and larger 
> raw p-values. Even though you are correcting for fewer genes, the change 
> in the variance correction can have a larger effect on the adjusted 
> p-values.

   Using some sort of attenuated p-values should help to alleviate this 
problem. But this is surprising to me - I would need to think about it 
more to say anything more informative.
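
   For what it is worth, the mechanism Jenny describes can be illustrated 
with a toy calculation. This is only a sketch under loud assumptions: the 
prior variance is taken to be the plain mean of the per-gene variances and 
the prior degrees of freedom are fixed at 4, whereas limma estimates both 
from the data. The function name moderated_p and all the numbers are 
invented for the illustration.

```r
# Toy moderated t-test: each gene's variance is shrunk toward a common value.
# ASSUMPTIONS (not limma's actual estimator): the common value is the mean of
# the per-gene variances, and the prior degrees of freedom d0 are fixed.
moderated_p <- function(diff, s2, n_per_group = 3, d0 = 4) {
  d       <- 2 * n_per_group - 2              # residual df per gene
  s0      <- mean(s2)                         # crude stand-in for the prior variance
  s2_post <- (d0 * s0 + d * s2) / (d0 + d)    # shrunken per-gene variance
  t_mod   <- diff / sqrt(s2_post * 2 / n_per_group)
  2 * pt(-abs(t_mod), df = d0 + d)            # two-sided raw p-value
}

# Invented numbers: four host genes, four low-variance parasite probesets
host_s2   <- c(0.20, 0.30, 0.25, 0.40)
host_diff <- c(1.0, 0.8, 1.2, 0.5)
par_s2    <- rep(0.01, 4)
par_diff  <- rep(0, 4)

# Raw p-values for the host genes, with and without the parasite probesets
p_all  <- moderated_p(c(host_diff, par_diff), c(host_s2, par_s2))[1:4]
p_host <- moderated_p(host_diff, host_s2)

print(rbind(with_parasite = p_all, host_only = p_host))
```

   In this toy setting every host gene gets a larger raw p-value once the 
low-variance probesets are dropped, because the value the variances are 
shrunk toward goes up. Whether that outweighs correcting for fewer tests 
depends on the data.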


> Best,
> Jenny
>> Regards,
>> Amy
>> --------------------------------------------------------------------------- 
>> > Hi,
>> >
>> > It may be worth pointing out that a related question can have a huge
>> > impact on normalization of certain glass arrays. One of the standard
>> > protocols on the Agilent 44K human arrays causes several hundred
>> > control spots to light up extremely brightly in the green channel,
>> > but remain completely off in the red channel.  If you leave these
>> > control spots in the data set when you normalize between channels
>> > (i.e., within arrays), every known normalization method breaks -- in
>> > the precise sense that it will systematically distort the comparison
>> > between the red and green channels.  If you then model the data
>> > incorporating a dye effect, you will think that almost every gene
>> > exhibits a dye bias.  On the other hand, if you remove these control
>> > spots before normalizing between channels, then modeling the dye bias
>> > suggests that it rarely exists....
>> >
>> > As for the question originally asked here, I would not expect the
>> > foreign species probes to break the normalization (unless they somehow
>> > light up in one group of samples but not in the other). So, my own bias
>> > would be to keep them for background correction and normalization, but
>> > remove them before the rest of the analysis.
>> >
>> > Best,
>> >       Kevin
>> >
>> > Jenny Drnevich wrote:
>> >> Hi Amy,
>> >>
>> >> Don't you just love it when you get one response suggesting you do one
>> >> thing (remove malarial genes after pre-processing) and another
>> >> response suggesting the opposite?  Although I think in this case
>> >> Robert was suggesting you remove them after pre-processing because it
>> >> was easier than trying to modify either the normalization code or the
>> >> cdf environment, which is what Jim pointed out to you. I ran into this
>> >> same problem with having probesets for other species on the soybean
>> >> array, which is why I used Ariel's code. I think that if you're using
>> >> a mixed species array but only put one of the species on it, then you
>> >> should remove the other species' probesets BEFORE doing the
>> >> normalization because they really have no bearing on the
>> >> transcriptome you're trying to measure. On the other hand, if you
>> >> also want to filter your species' probesets based on
>> >> presence/absence, minimum cutoff, variation, etc.*, then you should
>> >> filter these genes AFTER doing the pre-processing because these
>> >> probesets do contain information about the transcriptome, even if it
>> >> is just 'not detectably expressed'.
>> >>
>> >> Cheers,
>> >> Jenny
>> >>
>> >> * Contrary to Robert, I prefer to filter on presence/absence (using
>> >> Affy's calls) rather than variability :) I don't know if there is any
>> >> documentation on which may be "better"...
>> >>
>> -------------------------------------------
>> Amy Mikhail
>> Research student
>> University of Aberdeen
>> Zoology Building
>> Tillydrone Avenue
>> Aberdeen AB24 2TZ
>> Scotland
>> Email: a.mikhail at abdn.ac.uk
>> Phone: 00-44-1224-272880 (lab)
>>        00-44-1224-273256 (office)
> Jenny Drnevich, Ph.D.
> Functional Genomics Bioinformatics Specialist
> W.M. Keck Center for Comparative and Functional Genomics
> Roy J. Carver Biotechnology Center
> University of Illinois, Urbana-Champaign
> 330 ERML
> 1201 W. Gregory Dr.
> Urbana, IL 61801
> ph: 217-244-7355
> fax: 217-265-5066
> e-mail: drnevich at uiuc.edu 

Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
rgentlem at fhcrc.org
