[BioC] Some Genefilter questions

Fri Dec 1 15:38:46 CET 2006

Hi Claus,

Claus Mayer wrote:
> Hello,
> 
> just to throw in my own bits of wisdom: I am clearly on Robert's side in 
> this argument, i.e. normalise with ALL genes, analyse just the species 
> specific ones. When you use GCRMA, you have three main steps in the 
> algorithm:
> 
> 1)Background correction: As Robert points out, the foreign genes should 
> improve this
> 
> 2) Quantile Normalisation: Obviously the distribution across all probes 
> will change (mainly it will have more mass on the low-intensity range), 
> but that will be the case for all arrays in the same way, as the foreign 
> genes are not expected to change, so I can't see why these extra genes 
> should be harmful.

This is the only point where Robert and I (and you, for that matter) 
don't necessarily agree. I agree that the distribution will have greater 
mass in the low-intensity range, and I also agree that the expression of 
the foreign genes won't change (since their transcript won't be hybed).

However, just because the transcript isn't hybed to the chip doesn't 
mean that the intensity values of the foreign probes won't vary 
(possibly widely - without data in hand, we can't know). Rafa has shown 
that hybing yeast DNA to a human chip will result in some probes 
lighting up (but AFAIR, he didn't replicate so we don't know the 
variability of the spurious signal). Throwing a bunch of possibly noisy 
data into the mix could easily trash any signal you might have for 
low-expressing genes. Affy data are noisy enough at the low end that I 
am not completely comfortable with an assumption that the probe 
intensity values for the foreign genes will be essentially static or at 
least well behaved.
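
To make the concern concrete, here is a minimal sketch in base R. It is not the affy or gcrma code: the function `quantile_normalise` and all the simulated intensities (probe counts, means, SDs) are assumptions for illustration only, and real foreign-probe noise may look quite different. It normalises the native probes with and without a block of noisy foreign probes, then compares the across-array SD of the lowest-intensity native probes.

```r
# Minimal quantile normalisation in base R (a sketch, not the affy implementation).
quantile_normalise <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank within each array
  target <- rowMeans(apply(x, 2, sort))               # mean of the sorted columns
  out    <- apply(ranks, 2, function(r) target[r])    # map ranks onto the target
  dimnames(out) <- dimnames(x)
  out
}

set.seed(1)
n_native  <- 200   # hypothetical number of native probes
n_foreign <- 100   # hypothetical number of foreign probes
n_arrays  <- 6

# Simulated log2 intensities; the means and SDs are assumptions, not measurements.
native  <- matrix(rnorm(n_native  * n_arrays, mean = 8, sd = 2), ncol = n_arrays)
foreign <- matrix(rnorm(n_foreign * n_arrays, mean = 5, sd = 1), ncol = n_arrays)

# Normalise with all probes (then keep the native rows) versus native probes only.
all_norm    <- quantile_normalise(rbind(native, foreign))[seq_len(n_native), ]
native_norm <- quantile_normalise(native)

# Across-array SD for the 20 lowest-intensity native probes under each scheme.
low        <- order(rowMeans(native))[1:20]
sd_with    <- apply(all_norm[low, ], 1, sd)
sd_without <- apply(native_norm[low, ], 1, sd)
summary(sd_with - sd_without)
```

Whether the difference matters in practice depends entirely on how the foreign probes really behave, which is exactly what we can't know without data in hand.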

Best,

Jim

> 
> 3) Summarizing the Probesets: For each gene only the values of all probes 
> corresponding to that gene are used, so this step will not be influenced 
> by additional genes.
> 
> For the analysis it's a different thing. Obviously you want to get rid of 
> genes which are not of interest before p-value adjustment for multiple 
> testing, because you will be more conservative than necessary otherwise.
> There is also a case for not wanting them to be in the limma analysis I 
> think. The foreign genes will be less variable, as they only show 
> background noise and thus are not affected by biological variability. 
> This will reduce the average variance across all genes and as limma 
> shrinks individual gene variances towards this average the denominators 
> in the moderated t-statistics will be reduced too, thus leading to false 
> positives. I am not sure whether it will really make a big difference 
> practically, but theoretically there is certainly an issue here.
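
The shrinkage argument can be sketched numerically. limma's moderated variance for a gene is (d0 * s0^2 + d * s^2) / (d0 + d), where s^2 is the gene's own variance on d residual degrees of freedom and s0^2, d0 are the prior variance and prior degrees of freedom. In the sketch below the prior variance is crudely taken as the mean of the gene variances and d0 is fixed by hand; both are assumptions made for illustration (limma estimates them by empirical Bayes), and the simulated variances are made up.

```r
# Moderated (posterior) variance as used in limma's moderated t-statistic.
moderated_var <- function(s2, df, s2_prior, df_prior) {
  (df_prior * s2_prior + df * s2) / (df_prior + df)
}

set.seed(2)
df <- 4        # 6 arrays minus 2 group means, as in the design above
df_prior <- 4  # hypothetical prior degrees of freedom

# Simulated gene-wise variances: foreign probes assumed less variable (made-up scales).
native_s2  <- 0.25 * rchisq(500, df) / df
foreign_s2 <- 0.05 * rchisq(300, df) / df

# Crude stand-in for the prior variance: the mean gene variance (an assumption).
prior_native <- mean(native_s2)
prior_all    <- mean(c(native_s2, foreign_s2))

# Moderated variances of the native genes under each prior.
mod_native <- moderated_var(native_s2, df, prior_native, df_prior)
mod_all    <- moderated_var(native_s2, df, prior_all, df_prior)

# Every native gene's moderated variance shifts by the same amount.
summary(mod_all - mod_native)
```

A lower prior variance lowers every native gene's moderated variance by df_prior * (prior_native - prior_all) / (df_prior + df), so the moderated t-statistics grow accordingly.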
> 
> Interesting discussion anyway,
> 
> Claus
> 
> Jenny Drnevich wrote:
> 
>>Hi Amy,
>>
>>Don't you just love it when you get one response suggesting you do one 
>>thing (remove malarial genes after pre-processing) and another response 
>>suggesting the opposite?  Although I think in this case Robert was 
>>suggesting you remove them after pre-processing because it was easier than 
>>trying to modify either the normalization code or the cdf environment, 
>>which is what Jim pointed out to you. I ran into this same problem with 
>>having probesets for other species on the soybean array, which is why I 
>>used Ariel's code. I think that if you're using a mixed species array but 
>>only put one of the species on it, then you should remove the other 
>>species' probesets BEFORE doing the normalization because they really have 
>>no bearing on the transcriptome you're trying to measure. On the other 
>>hand, if you also want to filter your species' probesets based on 
>>presence/absence, minimum cutoff, variation, etc.* , then you should filter 
>>these genes AFTER doing the pre-processing because these probesets do 
>>contain information about the transcriptome, even if it is just 'not 
>>detectably expressed'.
>>
>>Cheers,
>>Jenny
>>
>>* Contrary to Robert, I prefer to filter on presence/absence (using Affy's 
>>calls) rather than variability :) I don't know if there is any 
>>documentation on which may be "better"...
>>
>>At 05:15 PM 11/29/2006, Robert Gentleman wrote:
>>
>>>Hi,
>>>
>>>Amy Mikhail wrote:
>>>
>>>>Dear Bioconductors,
>>>>
>>>>I am analysing 6 PlasmodiumAnopheles genechips, which have only Anopheles
>>>>mosquito samples hybridised to them (i.e. they are not infected
>>>>mosquitoes).  The 6 chips include 3 replicates, each consisting of two
>>>>time points.  The design matrix is as follows:
>>>>
>>>>
>>>>>design
>>>>
>>>>     M15d M43d
>>>>[1,]    1    0
>>>>[2,]    0    1
>>>>[3,]    1    0
>>>>[4,]    0    1
>>>>[5,]    1    0
>>>>[6,]    0    1
>>>>
>>>>
>>>>I have tried gcRMA (in affylmGUI), and RMA, MBEI and MAS5 (in affy).
>>>>Looking at the (BH) adjusted p values <0.05, this gave me 2, 12, 0 and 0
>>>>DE genes, respectively... far fewer than I was expecting.
>>>>
>>>>As this affy chip contains probesets for both mosquito and malaria
>>>>parasite genes, I am wondering:
>>>>
>>>>(a) if it is better to remove all the parasite probesets before my
>>>>analysis;
>>>
>>>  Yes, if you don't intend to use them, and they are not relevant to
>>>your analysis. There is no point in doing p-value corrections for tests
>>>you know are not interesting/relevant a priori.
>>>
>>>
>>>>(b) if so at what stage I should do this (before or after normalisation
>>>>and background correction, or does it matter?)
>>>
>>>  After both and prior to analysis - otherwise you are likely to need to
>>>do some serious tweaking of the normalization code.
>>>
>>>
>>>>(c) how would I filter out these probesets using genefilter (all the
>>>>parasite affy IDs begin with Pf. - could I use this prefix in the affy IDs
>>>>to filter out the probesets, and if so how?)
>>>
>>>   you don't need genefilter at all, this is a subsetting problem.
>>>  If you had an ExpressionSet you would do something like:
>>>
>>>   parasites = grep("^Pf", featureNames(myExpressionSet))
>>>
>>>   mySubset = myExpressionSet[-parasites,]
>>>
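A note on the indexing: grep() returns integer positions, so the matches have to be dropped with a negative index; negating the positions with `!` yields an all-FALSE mask and an empty result. A logical mask from grepl() also covers the case where nothing matches. The sketch below uses a plain matrix and made-up probeset IDs (all hypothetical) in place of an ExpressionSet; the same index should carry over when built from featureNames().

```r
# Hypothetical probeset IDs; real IDs on the chip will differ.
ids <- c("Ag.2L.1_at", "Pf.1.1_at", "Ag.3R.7_at", "AFFX-BioB-5_at", "Pf.4.2_s_at")
expr <- matrix(seq_len(length(ids) * 2), nrow = length(ids),
               dimnames = list(ids, c("M15d", "M43d")))

# grep() gives integer positions of the parasite probesets.
parasites <- grep("^Pf", ids)

# '!' on positions gives all FALSE, so nothing is kept.
wrong <- expr[!parasites, , drop = FALSE]

# A negative index drops the matched rows (only safe when there is a match).
by_position <- expr[-parasites, , drop = FALSE]

# A logical mask is safe even with no matches; here it also drops AFFX controls.
keep    <- !grepl("^(Pf|AFFX)", ids)
by_mask <- expr[keep, , drop = FALSE]

rownames(by_mask)
```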
>>>
>>>>Secondly, I did not add any of the polyA controls to my samples.  I would
>>>>like to know:
>>>>
>>>>(d) Do any of the bg correct / normalisation methods I tried utilise
>>>>affymetrix control probesets, and if so, how?
>>>
>>>   I doubt it.
>>>
>>>
>>>>(e) Should I also filter out the control sets - again, if so at what stage
>>>>in the analysis and what would be an appropriate code to use?
>>>>
>>>
>>>   same place as you filter the parasite genes and pretty much in the
>>>same way. They are likely to start with AFFX.
>>>
>>>
>>>>I did try the code for non-specific filtering (on my RMA dataset) from pg.
>>>>232 of the bioconductor monograph, but the reduction in the number of
>>>>probesets was quite drastic;
>>>>
>>>>
>>>>>f1 <- pOverA(0.25, log2(100))
>>>>>f2 <- function(x) (IQR(x) > 0.5)
>>>
>>>  that is a typo in the text - you probably want to filter out those
>>>with IQR below the median, not for some fixed value.
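A median-based IQR filter could look roughly like this. It is a sketch in base R on simulated values (the matrix below is made up, standing in for the RMA expression matrix); the cutoff is computed once over all probesets because a per-row filter function only ever sees one row at a time.

```r
set.seed(3)
# Simulated log2 expression matrix: 1000 probesets by 6 arrays (made-up values).
expr <- matrix(rnorm(1000 * 6, mean = 8, sd = 1), ncol = 6)

# IQR of each probeset across arrays, and the median IQR as the cutoff.
iqr    <- apply(expr, 1, IQR)
cutoff <- median(iqr)

# Keep probesets whose IQR is above the median, i.e. roughly the more variable half.
selected <- iqr > cutoff
sum(selected)

# With genefilter, the same precomputed cutoff could be used in the filter function:
#   f2 <- function(x) IQR(x) > cutoff
expr_sub <- expr[selected, , drop = FALSE]
```

This keeps about half of the probesets by construction, rather than the handful that survive a fixed cutoff of 0.5.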
>>>
>>>
>>>>>ff <- filterfun(f1, f2)
>>>>>selected <- genefilter(Baseage.transformed, ff)
>>>>>sum(selected)
>>>>
>>>>[1] 404   ###(The original no. of probesets is 22,726)###
>>>>
>>>>>Baseage.sub <- Baseage.transformed[selected, ]
>>>>
>>>>Also, I understood from the monograph that "100" was to filter out
>>>>fluorescence intensities less than this, but I am not clear if this is
>>>>from raw intensities or log2 values?
>>>
>>>  raw - 100 on the log2 scale is larger than can be represented in the
>>>image file formats used. And don't do that - it is not a good idea -
>>>filter on variability.
>>>
>>>
>>>
>>>>All the parasite probesets have raw intensities <35 .... so could I apply
>>>>this as a simple filter, and would this have to be on raw (rather than
>>>>normalised data)?
>>>
>>>  Best wishes
>>>    Robert
>>>
>>>
>>>>Apologies for the long posting...
>>>>
>>>>Looking forward to any replies,
>>>>Regards,
>>>>Amy
>>>>
>>>>
>>>>>sessionInfo()
>>>>
>>>>R version 2.4.0 (2006-10-03)
>>>>i386-pc-mingw32
>>>>
>>>>locale:
>>>>LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
>>>>States.1252;LC_MONETARY=English_United
>>>>States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>>>>
>>>>attached base packages:
>>>> [1] "tcltk"     "splines"   "tools"     "methods"   "stats"
>>>>"graphics"  "grDevices" "utils"     "datasets"  "base"
>>>>
>>>>other attached packages:
>>>>plasmodiumanophelescdf              tkWidgets                 DynDoc
>>>>     widgetTools            agahomology
>>>>              "1.14.0"               "1.12.0"               "1.12.0"
>>>>        "1.10.0"               "1.14.2"
>>>>               affyPLM                  gcrma            matchprobes
>>>>        affydata                annaffy
>>>>              "1.10.0"                "2.6.0"                "1.6.0"
>>>>        "1.10.0"                "1.6.0"
>>>>                  KEGG                     GO                  limma
>>>>     geneplotter               annotate
>>>>              "1.14.0"               "1.14.0"                "2.9.1"
>>>>        "1.12.0"               "1.12.0"
>>>>                  affy                 affyio             genefilter
>>>>        survival                Biobase
>>>>              "1.12.0"                "1.2.0"               "1.12.0"
>>>>          "2.29"               "1.12.0"
>>>>
>>>>
>>>>-------------------------------------------
>>>>Amy Mikhail
>>>>Research student
>>>>University of Aberdeen
>>>>Zoology Building
>>>>Tillydrone Avenue
>>>>Aberdeen AB24 2TZ
>>>>Scotland
>>>>Email: a.mikhail at abdn.ac.uk
>>>>Phone: 00-44-1224-272880 (lab)
>>>>       00-44-1224-273256 (office)
>>>>
>>>>_______________________________________________
>>>>Bioconductor mailing list
>>>>Bioconductor at stat.math.ethz.ch
>>>>https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>--
>>>Robert Gentleman, PhD
>>>Program in Computational Biology
>>>Division of Public Health Sciences
>>>Fred Hutchinson Cancer Research Center
>>>1100 Fairview Ave. N, M2-B876
>>>PO Box 19024
>>>Seattle, Washington 98109-1024
>>>206-667-7700
>>>rgentlem at fhcrc.org
>>>
>>
>>Jenny Drnevich, Ph.D.
>>
>>Functional Genomics Bioinformatics Specialist
>>W.M. Keck Center for Comparative and Functional Genomics
>>Roy J. Carver Biotechnology Center
>>University of Illinois, Urbana-Champaign
>>
>>330 ERML
>>1201 W. Gregory Dr.
>>Urbana, IL 61801
>>USA
>>
>>ph: 217-244-7355
>>fax: 217-265-5066
>>e-mail: drnevich at uiuc.edu
>>

-- 
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623
