[BioC] Some Genefilter questions

Robert Gentleman rgentlem at fhcrc.org
Thu Nov 30 19:21:37 CET 2006


Hi,

Amy Mikhail wrote:
> Hi Robert and Jim,
> 
> Many thanks for your advice.  I have some more questions...
> 
> First, I tried what Robert suggested on my expression set.  However I got
> a strange result:
> 
>> load("E:\\Amy - Bioconductor analysis\\03. Base age\\Affymetrix - Base
> Age results & analysis\\Baseage - RMA normalised.RData")
>> ls()
> [1] "Data"      "eset"      "phenodata" "x"         "xy"        "y"
> 
>> parasites = grep("^Pf", featureNames(eset))
>> parasites
>    [1] 18192 18193 18194 18195 18196 18197 18198 18199 18200 18201 18202 18203
>   [13] 18204 18205 18206 18207 18208 18209 18210 18211 18212 18213 18214 18215
>   [25] 18216 18217 18218 18219 18220 18221 18222 18223 18224 18225 18226 18227
> ### this list continues until no. 4,514 ###


   You can tell by using

   featureNames(eset)[parasites]

   The values in the parasites vector are the indices of the matching
   features, not their names, so subsetting featureNames(eset) with them
   shows the corresponding IDs.

> 
> I was expecting the parasite affy IDs to be listed here, but these are (I
> think) the probeset numbers (I can't tell if they are the right ones or
> not...)?
> 
>> mossie.sub = eset[!parasites,]

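  Oops - that should have been

    mossie.sub = eset[-parasites,]

  My mistake - I keep thinking grep returns a logical vector, for some
  reason. It returns integer indices, and ! turns every nonzero index
  into FALSE, so no genes were selected.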
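
  To see why the ! version gave 0 genes, here is a sketch on a small
  made-up matrix (the object m and its row names are hypothetical; the
  rows play the role of the probe sets in eset):

```r
# Hypothetical 5 x 2 matrix; rows stand in for probe sets
m <- matrix(1:10, nrow = 5,
            dimnames = list(c("Ag.1_at", "Ag.2_at", "Pf.1_at", "Pf.2_at", "Ag.3_at"),
                            c("s1", "s2")))
parasites <- grep("^Pf", rownames(m))   # integer indices: 3 4

# '!' coerces the integers to logical (nonzero is TRUE) and negates them,
# giving FALSE FALSE, which is recycled over the rows and selects none
!parasites
nrow(m[!parasites, , drop = FALSE])

# Negative indices drop the listed rows and keep the rest
kept <- m[-parasites, , drop = FALSE]
rownames(kept)

# Caveat: if nothing matches, -integer(0) selects no rows at all,
# so a logical mask is the safer form
keep <- !(seq_len(nrow(m)) %in% parasites)
rownames(m[keep, , drop = FALSE])
```

  The logical mask gives the same result as the negative indices here,
  but it still behaves sensibly when the pattern matches nothing.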

>> mossie.sub
> Expression Set (exprSet) with
>         0 genes
>         6 samples
>         phenoData object with 3 variables and 6 cases
>         varLabels
>                 Name: short name of datasets for graphs
>                 Population: Age of adult mosquitoes (in days) included in
> the sample
>                 Replicate: Replicate number of the experiment
> 
> So now it has removed all the genes... I don't understand why this would
> happen since the subset called "parasites" only contains a fraction of the
> total number of probesets (4,514 out of 22,769).
> 
> Next, I wanted to try Jim's suggestion on the raw data.  I can follow
> Jenny's post up to:
> 
> " all you need now is your affybatch object, and a character vector of
> probe set names"
> 
> I have an affybatch object, but how do I create a character vector for the
> probesets I want to remove?
> 
> I'm still not very R-literate, so tried using the same code as previous
> except with the raw data instead of my expression set but the
> "featureNames" bit was a problem:
> 
>> parasites = grep("^Pf", featureNames(data))
> Error in function (classes, fdef, mtable)  :
>         unable to find an inherited method for function "featureNames",
> for signature "function"
> 
> Any ideas?
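
  One likely cause, judging from the ls() output above: the object in
  the workspace is called Data, with a capital D, while data in lower
  case is an R function from the utils package. That would explain the
  'for signature "function"' part of the error. As for the character
  vector, grep with value = TRUE returns the matching names rather than
  their positions. A base R sketch with made-up IDs (the ids vector is
  hypothetical, and how the probe set names are extracted from the
  AffyBatch is not shown here):

```r
# 'data' with a lower-case d is a function, not the object that was loaded
is.function(data)

# Hypothetical probe set names standing in for those of the AffyBatch
ids <- c("Ag.1_at", "Pf.1_at", "Pf.2_s_at", "AFFX-r2-Ec-bioB-3_at")

# A character vector of the probe sets to remove
to_remove <- grep("^Pf", ids, value = TRUE)
to_remove
```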
> 
> Regards,
> 
> Amy
> 
> ---------------------------------------------------------------------------
> 
>> Hi Amy,
>>
>> Amy Mikhail wrote:
>>> Dear Bioconductors,
>>>
>>> I am analysing 6 PlasmodiumAnopheles genechips, which have only
>>> Anopheles
>>> mosquito samples hybridised to them (i.e. they are not infected
>>> mosquitoes).  The 6 chips include 3 replicates, each consisting of two
>>> time points.  The design matrix is as follows:
>>>
>>>
>>>> design
>>>      M15d M43d
>>> [1,]    1    0
>>> [2,]    0    1
>>> [3,]    1    0
>>> [4,]    0    1
>>> [5,]    1    0
>>> [6,]    0    1
>>>
>>>
>>> I have tried both gcRMA (in AffyLMGUI), and RMA, MBEI and MAS5 (in
>>> affy).
>>> Looking at the (BH) adjusted p values <0.05, this gave me 2, 12, 0 and
>>> 0
>>> DE genes, respectively... much less than I was expecting.
>>>
>>> As this affy chip contains probesets for both mosquito and malaria
>>> parasite genes, I am wondering:
>>>
>>> (a) if it is better to remove all the parasite probesets before my
>>> analysis;
>> Probably. It's not the easiest thing to do. Here is a link to some code
>> you can use:
>>
>> http://article.gmane.org/gmane.science.biology.informatics.conductor/9869/match=remove+probes+cdf
>>
>> Read what Ariel and Jenny write there very closely so you don't make
>> mistakes.
>>
>>> (b) if so at what stage I should do this (before or after normalisation
>>> and background correction, or does it matter?)
>> Before doing anything, most likely, which is what the above code will do
>> for you.
>>
>>> (c) how would I filter out these probesets using genefilter (all the
>>> parasite affy IDs begin with Pf. - could I use this prefix in the affy
>>> IDs
>>> to filter out the probesets, and if so how?)
>>>
>>> Secondly, I did not add any of the polyA controls to my samples.  I
>>> would
>>> like to know:
>>>
>>> (d) Do any of the bg correct / normalisation methods I tried utilise
>>> affymetrix control probesets, and if so, how?
>> No.
>>
>>> (e) Should I also filter out the control sets - again, if so at what
>>> stage
>>> in the analysis and what would be an appropriate code to use?
>> No, there aren't enough of them to have an effect on your data.
>>
>>> I did try the code for non-specific filtering (on my RMA dataset) from
>>> pg.
>>> 232 of the bioconductor monograph, but the reduction in the number of
>>> probesets was quite drastic;
>>>
>>>
>>>> f1 <- pOverA(0.25, log2(100))
>>>> f2 <- function(x) (IQR(x) > 0.5)
>>>> ff <- filterfun(f1, f2)
>>>> selected <- genefilter(Baseage.transformed, ff)
>>>> sum(selected)
>>> [1] 404   ###(The original no. of probesets is 22,726)###
>>>
>>>> Baseage.sub <- Baseage.transformed[selected, ]
>>>
>>> Also, I understood from the monograph that "100" was to filter out
>>> fluorescence intensities less than this, but I am not clear if this is
>>> from raw intensities or log2 values?
>> It has to be data on the natural scale. The intensities for an Affy chip
>> come from a 16-bit TIFF image, which means the brightest value can be
>> 2^16 - 1, which on the log2 scale is just under 16, so you cannot even
>> have a value that approaches 100 on the log scale.
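
  To make the two scales concrete, a quick check in base R. The p_over_a
  helper below is a hypothetical stand-in written for illustration,
  assuming that pOverA(p, A) keeps a gene when at least a proportion p
  of its values exceed A; it is not the genefilter source, and the
  intensities are made up.

```r
log2(2^16)   # upper bound for a 16-bit intensity on the log2 scale
log2(100)    # the cutoff A = 100 expressed on the log2 scale

# Hypothetical stand-in for a pOverA-style filter
p_over_a <- function(p, A) function(x) mean(x > A) >= p

# Made-up log2 intensities for one probe set across 6 chips
x <- c(5.1, 6.2, 7.0, 6.9, 5.8, 6.0)

# Cutoff given on the log2 scale, matching log2-scale data such as RMA output
p_over_a(0.25, log2(100))(x)

# Cutoff of 100 applied directly to log2 data can never be exceeded
p_over_a(0.25, 100)(x)
```

  So on RMA (log2) values the cutoff needs to be log2(100), as in the
  monograph code; a bare 100 only makes sense on untransformed
  intensities.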
>>
>> Best,
>>
>> Jim
>>
>>
>>
>>> All the parasite probesets have raw intensities <35 .... so could I
>>> apply
>>> this as a simple filter, and would this have to be on raw (rather than
>>> normalised data)?
>>>
>>> Apologies for the long posting...
>>>
>>> Looking forward to any replies,
>>> Regards,
>>> Amy
>>>
>>>
>>>> sessionInfo()
>>> R version 2.4.0 (2006-10-03)
>>> i386-pc-mingw32
>>>
>>> locale:
>>> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
>>> States.1252;LC_MONETARY=English_United
>>> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>>>
>>> attached base packages:
>>>  [1] "tcltk"     "splines"   "tools"     "methods"   "stats"
>>> "graphics"  "grDevices" "utils"     "datasets"  "base"
>>>
>>> other attached packages:
>>> plasmodiumanophelescdf   "1.14.0"
>>> tkWidgets                "1.12.0"
>>> DynDoc                   "1.12.0"
>>> widgetTools              "1.10.0"
>>> agahomology              "1.14.2"
>>> affyPLM                  "1.10.0"
>>> gcrma                    "2.6.0"
>>> matchprobes              "1.6.0"
>>> affydata                 "1.10.0"
>>> annaffy                  "1.6.0"
>>> KEGG                     "1.14.0"
>>> GO                       "1.14.0"
>>> limma                    "2.9.1"
>>> geneplotter              "1.12.0"
>>> annotate                 "1.12.0"
>>> affy                     "1.12.0"
>>> affyio                   "1.2.0"
>>> genefilter               "1.12.0"
>>> survival                 "2.29"
>>> Biobase                  "1.12.0"
>>>
>>>
>>>
>>> -------------------------------------------
>>> Amy Mikhail
>>> Research student
>>> University of Aberdeen
>>> Zoology Building
>>> Tillydrone Avenue
>>> Aberdeen AB24 2TZ
>>> Scotland
>>> Email: a.mikhail at abdn.ac.uk
>>> Phone: 00-44-1224-272880 (lab)
>>>        00-44-1224-273256 (office)
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> --
>> James W. MacDonald, M.S.
>> Biostatistician
>> Affymetrix and cDNA Microarray Core
>> University of Michigan Cancer Center
>> 1500 E. Medical Center Drive
>> 7410 CCGC
>> Ann Arbor MI 48109
>> 734-647-5623
>>
>>
>> **********************************************************
>> Electronic Mail is not secure, may not be read every day, and should not
>> be used for urgent or sensitive issues.
>>
> 
> 
> 
> 

-- 
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org


