[BioC] Some Genefilter questions
David K Pritchard
dpritch at u.washington.edu
Fri Dec 1 02:33:10 CET 2006
There are two sets of studies that, from what I remember, suggest the ~40% expression level. Classic Cot curve studies from several decades ago suggested roughly this level. More recently, MPSS (Massively Parallel Signature Sequencing) studies have also suggested that this is a reasonable cutoff. Based on these studies I use the same rule of thumb that you do: the median.
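As a rough illustration of the median rule of thumb, here is a minimal base-R sketch that keeps only the probesets whose IQR across arrays is above the median IQR. The matrix `expr` is simulated and all object names are made up for the example; with real data you would start from your own log2 expression matrix.

```r
set.seed(1)

# Simulated log2 expression matrix: 1000 probesets x 6 arrays (illustrative only)
expr <- matrix(rnorm(1000 * 6, mean = 8, sd = 1), nrow = 1000,
               dimnames = list(paste0("probe", 1:1000), paste0("array", 1:6)))

# Per-probeset variability across arrays
iqr <- apply(expr, 1, IQR)

# Keep the more variable half: IQR strictly above the median IQR
keep <- iqr > median(iqr)
expr_filtered <- expr[keep, , drop = FALSE]

dim(expr_filtered)
```

The cutoff here adapts to the data set, which is the practical difference from a fixed threshold on the IQR.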
On Thu, 30 Nov 2006, Robert Gentleman wrote:
> Lourdusamy A Anbarasu wrote:
>> Dear Dr. Robert,
>> You have mentioned that filtering on variability is preferred
>> to filtering on raw intensity values. I have also read your previous post on this
>> issue. For filters based on CV, are there any recommended cut-off values?
> Not really. A widely held, but AFAIK undocumented, belief is that in
> any given tissue/cell about 40% of the genome is expressed at any time.
> So, I usually choose the median - that is somewhat conservative with
> respect to the above cited statistic - but this is a personal
> preference. I have not seen any research (and I think it would be hard).
> best wishes
>> Thanks in advance.
>> Best regards,
>> On 11/30/06, Robert Gentleman <rgentlem at fhcrc.org> wrote:
>> Amy Mikhail wrote:
>> > Dear Bioconductors,
>> > I am analysing 6 PlasmodiumAnopheles genechips, which have only
>> > mosquito samples hybridised to them (i.e. they are not infected
>> > mosquitoes). The 6 chips include 3 replicates, each consisting of two
>> > time points. The design matrix is as follows:
>> > > design
>> >      M15d M43d
>> > [1,]    1    0
>> > [2,]    0    1
>> > [3,]    1    0
>> > [4,]    0    1
>> > [5,]    1    0
>> > [6,]    0    1
>> > I have tried both gcRMA (in AffyLMGUI), and RMA, MBEI and MAS5 (in affy).
>> > Looking at the (BH) adjusted p values <0.05, this gave me 2, 12, 0 and 0
>> > DE genes, respectively... far fewer than I was expecting.
>> > As this affy chip contains probesets for both mosquito and malaria
>> > parasite genes, I am wondering:
>> > (a) if it is better to remove all the parasite probesets before my analysis;
>> Yes, if you don't intend to use them, and they are not relevant to
>> your analysis. There is no point in doing p-value corrections for tests
>> you know are not interesting/relevant a priori.
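The point about unnecessary corrections can be seen with base R's `p.adjust`. The sketch below uses invented p-values (they are not from this data set): the same five p-values are BH-adjusted once on their own and once together with fifteen tests assumed to be irrelevant a priori.

```r
# Invented p-values for illustration only
p_relevant   <- c(0.001, 0.008, 0.02, 0.3, 0.6)   # tests of interest
p_irrelevant <- seq(0.2, 0.9, length.out = 15)    # tests known to be uninteresting

# BH adjustment over the relevant tests only
adj_relevant <- p.adjust(p_relevant, method = "BH")

# BH adjustment over everything; p.adjust keeps the input order,
# so the first five entries correspond to p_relevant
adj_all <- p.adjust(c(p_relevant, p_irrelevant), method = "BH")
adj_all_relevant <- adj_all[seq_along(p_relevant)]

# Number of relevant tests passing an adjusted cutoff of 0.05 in each case
n_pass_subset <- sum(adj_relevant < 0.05)
n_pass_all    <- sum(adj_all_relevant < 0.05)
```

Because the BH adjustment scales with the number of tests, dropping tests that could never be of interest leaves more of the relevant ones below the cutoff.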
>> > (b) if so at what stage I should do this (before or after normalisation
>> > and background correction, or does it matter?)
>> After both and prior to analysis - otherwise you are likely to need to
>> do some serious tweaking of the normalization code.
>> > (c) how would I filter out these probesets using genefilter (all the
>> > parasite affy IDs begin with Pf. - could I use this prefix in the affy IDs
>> > to filter out the probesets, and if so how?)
>> You don't need genefilter at all; this is a subsetting problem.
>> If you had an ExpressionSet you would do something like:
>> # grepl() returns a logical vector, so it can be negated with !
>> parasites = grepl("^Pf", featureNames(myExpressionSet))
>> mySubset = myExpressionSet[!parasites,]
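One detail worth keeping in mind when excluding features by name: `grep()` returns integer positions, whereas `grepl()` returns a logical vector, and only a logical vector can be negated with `!`. The sketch below shows both routes on a plain matrix; the probeset IDs and values are hypothetical, and the same indexing should carry over to an ExpressionSet via `featureNames()`.

```r
# Hypothetical probeset IDs and a small expression matrix
ids  <- c("Ag.2L.1_at", "Pf.1.10_at", "AFFX-BioB-5_at", "Ag.3R.7_at", "Pf.4.2_s_at")
expr <- matrix(seq_len(5 * 2), nrow = 5,
               dimnames = list(ids, c("M15d", "M43d")))

# Route 1: logical vector, negated with !
is_parasite <- grepl("^Pf", rownames(expr))
mosquito <- expr[!is_parasite, , drop = FALSE]

# Route 2: integer positions, dropped with negative indexing.
# Guard against "no matches": expr[-integer(0), ] would select zero rows.
idx <- grep("^Pf", rownames(expr))
mosquito2 <- if (length(idx) > 0) expr[-idx, , drop = FALSE] else expr

rownames(mosquito)
```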
>> > Secondly, I did not add any of the polyA controls to my samples. I would
>> > like to know:
>> > (d) Do any of the bg correct / normalisation methods I tried utilise
>> > affymetrix control probesets, and if so, how?
>> I doubt it.
>> > (e) Should I also filter out the control sets - again, if so at what stage
>> > in the analysis and what would be an appropriate code to use?
>> Same place as you filter the parasite genes and pretty much in the
>> same way. They are likely to start with AFFX.
>> > I did try the code for non-specific filtering (on my RMA dataset) from pg.
>> > 232 of the bioconductor monograph, but the reduction in the number of
>> > probesets was quite drastic;
>> > > f1 <- pOverA(0.25, log2(100))
>> > > f2 <- function(x) (IQR(x) > 0.5)
>> That is a typo in the text - you probably want to filter out those
>> with IQR below the median, not for some fixed value.
>> > > ff <- filterfun(f1, f2)
>> > > selected <- genefilter(Baseage.transformed, ff)
>> > > sum(selected)
>> > 404 ###(The original no. of probesets is 22,726)###
>> > > Baseage.sub <- Baseage.transformed[selected, ]
>> > Also, I understood from the monograph that "100" was to filter out
>> > fluorescence intensities less than this, but I am not clear if this is
>> > from raw intensities or log2 values?
>> Raw - 100 on the log2 scale is larger than can be represented in the
>> image file formats used. And don't do that - it is not a good idea -
>> filter on variability.
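For readers unsure what a filter such as `pOverA(0.25, log2(100))` tests: it passes a probeset when at least a proportion p of its values exceed A. Below is a minimal base-R sketch of that logic. It uses a hypothetical helper, `prop_over_a`, rather than genefilter itself, and assumes the data are already on the log2 scale (as RMA output is), so that the threshold `log2(100)` corresponds to a raw intensity of 100.

```r
# Hypothetical stand-in for the idea behind a "proportion over A" filter:
# returns a function that is TRUE when at least a fraction p of x exceeds A
prop_over_a <- function(p, A) {
  function(x) mean(x > A, na.rm = TRUE) >= p
}

f <- prop_over_a(0.25, log2(100))   # log2(100) is roughly 6.64

# Invented log2 intensities for two probesets across 6 arrays
x_pass <- c(5.0, 6.0, 7.0, 6.7, 5.5, 6.9)   # 3 of 6 values above the threshold
x_fail <- c(5.0, 6.0, 6.5, 6.7, 5.5, 6.2)   # 1 of 6 values above the threshold

f(x_pass)
f(x_fail)
```

This only illustrates what the intensity cutoff means; as noted above, filtering on variability is the recommended approach.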
>> > All the parasite probesets have raw intensities <35 .... so could I apply
>> > this as a simple filter, and would this have to be on raw (rather than
>> > normalised) data?
>> Best wishes
>> > Apologies for the long posting...
>> > Looking forward to any replies,
>> > Regards,
>> > Amy
>> > > sessionInfo()
>> > R version 2.4.0 (2006-10-03)
>> > i386-pc-mingw32
>> > locale:
>> > LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>> > attached base packages:
>> > "tcltk" "splines" "tools" "methods" "stats"
>> > "graphics" "grDevices" "utils" "datasets" "base"
>> > other attached packages:
>> >   plasmodiumanophelescdf  "1.14.0"
>> >   tkWidgets               "1.12.0"
>> >   DynDoc                  "1.12.0"
>> >   widgetTools             "1.10.0"
>> >   agahomology             "1.14.2"
>> >   affyPLM                 "1.10.0"
>> >   gcrma                   "2.6.0"
>> >   matchprobes             "1.6.0"
>> >   affydata                "1.10.0"
>> >   annaffy                 "1.6.0"
>> >   KEGG                    "1.14.0"
>> >   GO                      "1.14.0"
>> >   limma                   "2.9.1"
>> >   geneplotter             "1.12.0"
>> >   annotate                "1.12.0"
>> >   affy                    "1.12.0"
>> >   affyio                  "1.2.0"
>> >   genefilter              "1.12.0"
>> >   survival                "2.29"
>> >   Biobase                 "1.12.0"
>> > -------------------------------------------
>> > Amy Mikhail
>> > Research student
>> > University of Aberdeen
>> > Zoology Building
>> > Tillydrone Avenue
>> > Aberdeen AB24 2TZ
>> > Scotland
>> > Email: a.mikhail at abdn.ac.uk
>> > Phone: 00-44-1224-272880 (lab)
>> > 00-44-1224-273256 (office)
>> > _______________________________________________
>> > Bioconductor mailing list
>> > Bioconductor at stat.math.ethz.ch
>> > https://stat.ethz.ch/mailman/listinfo/bioconductor
>> > Search the archives:
>> Robert Gentleman, PhD
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M2-B876
>> PO Box 19024
>> Seattle, Washington 98109-1024
>> rgentlem at fhcrc.org
>> Lourdusamy A Anbarasu
>> Dipartimento Medicina Sperimentale e Sanita Pubblica
>> Via Scalzino 3
>> 62032 Camerino (MC)
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor