[BioC] filtering probes in affymetrix data

James W. MacDonald jmacdon at uw.edu
Thu Feb 13 23:47:23 CET 2014


How many probesets do you have left after the genefilter step (e.g., 
what does dim(eset.filt) give you)?

On Thursday, February 13, 2014 4:19:11 PM, Sabet, Julia A wrote:
> Could it be because I did the kOverA function twice, since the first time I put a 5 and I meant to put a 12?  I thought it would just get overrided but maybe it didn't?
>
> antigm <- dbGetQuery(con, "select meta_fsetid from core_mps inner join
> + featureSet on core_mps.fsetid=featureSet.fsetid where
> + featureSet.type='4';")
>> bkg <- apply(exprs(eset)[as.character(antigm[,1]),], 2, quantile,
> + probs=0.95)
>> minval <- max(bkg)
>> ind <- genefilter(eset, filterfun(kOverA(5, minval)))
>> eset.filt <- eset[ind,]
>> ind <- genefilter(eset, filterfun(kOverA(12, minval)))
>> eset.filt <- eset[ind,]
>> library(affycoretools)
> eset.filt <- getMainProbes(eset.filt)
> Error in orig[[nm]][i, , ..., drop = drop] :
>    (subscript) logical subscript too long
>
> -----Original Message-----
> From: James W. MacDonald [mailto:jmacdon at uw.edu]
> Sent: Thursday, February 13, 2014 4:05 PM
> To: Sabet, Julia A
> Cc: bioconductor at r-project.org
> Subject: Re: [BioC] filtering probes in affymetrix data
>
> OK, what do you get if you type
>
> getMainProbes
>
> at an R prompt? It shouldn't be subsetting the ExpressionSet twice like that.
>
> Jim
>
>
>
> On Thursday, February 13, 2014 3:59:38 PM, Sabet, Julia A wrote:
>> I get this:
>>
>>> traceback()
>> 3: input[ind, ]
>> 2: input[ind, ]
>> 1: getMainProbes(eset.filt)
>>>
>>
>> -----Original Message-----
>> From: James W. MacDonald [mailto:jmacdon at uw.edu]
>> Sent: Thursday, February 13, 2014 3:57 PM
>> To: Sabet, Julia A
>> Cc: bioconductor at r-project.org
>> Subject: Re: [BioC] filtering probes in affymetrix data
>>
>> What do you get when you run
>>
>> traceback()
>>
>> right after that error?
>>
>> Jim
>>
>>
>> On 2/13/2014 3:51 PM, Sabet, Julia A wrote:
>>> Thank you both.  Solved that problem.  Now when I do the next line of code, I get another error message:
>>>
>>>> eset.filt <- getMainProbes(eset.filt)
>>> Error in orig[[nm]][i, , ..., drop = drop] :
>>>      (subscript) logical subscript too long
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: James W. MacDonald [mailto:jmacdon at uw.edu]
>>> Sent: Thursday, February 13, 2014 3:44 PM
>>> To: Sabet, Julia A
>>> Cc: bioconductor at r-project.org
>>> Subject: Re: [BioC] filtering probes in affymetrix data
>>>
>>> Hi Julia,
>>>
>>> On 2/13/2014 3:23 PM, Sabet, Julia A wrote:
>>>> Thank you Jim.  I think my R version is up to date and I am making sure to use "library()".  I started the whole thing over and now I have this new error message, at an earlier step:
>>>>
>>>> library(pd.mogene.2.0.st)
>>>>> con <- db(pd.mogene.2.0.st)
>>>>> dbGetQuery(con, "select * from type_dict;")
>>>>        type                   type_id
>>>> 1     1                      main
>>>> 2     2             control->affx
>>>> 3     3             control->chip
>>>> 4     4 control->bgp->antigenomic
>>>> 5     5     control->bgp->genomic
>>>> 6     6            normgene->exon
>>>> 7     7          normgene->intron
>>>> 8     8  rescue->FLmRNA->unmapped
>>>> 9     9  control->affx->bac_spike
>>>> 10   10            oligo_spike_in
>>>> 11   11           r1_bac_spike_at
>>>>> table(dbGetQuery(con, "select type from featureSet;")[,1])
>>>>          1      2      4      7      9
>>>> 263551     18     23   5331     18
>>>>> antigm <- dbGetQuery(con, "select meta_fsetid from core_mps inner
>>>>> join
>>>> + featureSet on core_mps.fsetid=featureSet.fsetid where
>>>> + featureSet.type='4';")
>>>>> bkg <- apply(exprs(eset)[as.character(antigm[,1]),], 2, quantile,
>>>> + probs=0.95)
>>>>> library(genefilter)
>>>>> minval <- max(bkg)
>>>>> ind <- genefilter(eset, filterfun(kOverA(5, minval))) eset.filt <-
>>> The above line has a bit extra at the end that R doesn't like.
>>>
>>>> Error: unexpected symbol in "ind <- genefilter(eset, filterfun(kOverA(5, minval))) eset.filt"
>>> And this is your hint. Error messages are your friends.
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>>> ind <- genefilter(eset, filterfun(kOverA(12, minval))) eset.filt <-
>>>>> eset[ind,]
>>>> Error: unexpected symbol in "ind <- genefilter(eset, filterfun(kOverA(12, minval))) eset.filt"
>>>>> ind <- genefilter(eset, filterfun(kOverA(12, minval))) eset.filt <-
>>>> Error: unexpected symbol in "ind <- genefilter(eset, filterfun(kOverA(12, minval))) eset.filt"
>>>> Here is my sessionInfo() output:
>>>>
>>>> R version 3.0.2 (2013-09-25)
>>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>>
>>>> locale:
>>>> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
>>>> [4] LC_NUMERIC=C                           LC_TIME=English_United States.1252
>>>>
>>>> attached base packages:
>>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>>>>
>>>> other attached packages:
>>>>      [1] BiocInstaller_1.12.0                  genefilter_1.44.0                     mogene20sttranscriptcluster.db_2.13.0
>>>>      [4] org.Mm.eg.db_2.10.1                   AnnotationDbi_1.24.0                  pd.mogene.2.0.st_2.12.0
>>>>      [7] RSQLite_0.11.4                        DBI_0.2-7                             oligo_1.26.1
>>>> [10] Biostrings_2.30.1                     XVector_0.2.0                         IRanges_1.20.6
>>>> [13] Biobase_2.22.0                        oligoClasses_1.24.0                   BiocGenerics_0.8.0
>>>>
>>>> loaded via a namespace (and not attached):
>>>>      [1] affxparser_1.34.0     affyio_1.30.0         annotate_1.40.0       bit_1.1-11            codetools_0.2-8       ff_2.2-12
>>>>      [7] foreach_1.4.1         GenomicRanges_1.14.4  iterators_1.0.6       preprocessCore_1.24.0 splines_3.0.2         stats4_3.0.2
>>>> [13] survival_2.37-7       tools_3.0.2           XML_3.98-1.1          xtable_1.7-1          zlibbioc_1.8.0
>>>> I appreciate your help...
>>>> Julia
>>>>
>>>> -----Original Message-----
>>>> From: James W. MacDonald [mailto:jmacdon at uw.edu]
>>>> Sent: Thursday, February 13, 2014 12:08 PM
>>>> To: Sabet, Julia A
>>>> Cc: bioconductor at r-project.org
>>>> Subject: Re: [BioC] filtering probes in affymetrix data
>>>>
>>>> Hi Julia,
>>>>
>>>> You should always include the output from sessionInfo() with any questions, so we can see what versions you are running, and what you have loaded.
>>>>
>>>> My guess is you are using an old version of R, prior to the introduction of that function, or you forgot to do library(affycoretools).
>>>>
>>>> Best,
>>>>
>>>> Jim
>>>>
>>>> On Thursday, February 13, 2014 12:03:54 PM, Sabet, Julia A wrote:
>>>>> Thank you so much, Jim.  I did everything you recommended and everything seemed to be working and then I installed the affycoretools package and when I did:
>>>>> eset.filt <- getMainProbes(eset.filt)
>>>>>
>>>>> This error resulted:
>>>>> Error: could not find function "getMainProbes"
>>>>>
>>>>> What should I do?
>>>>> Thanks!
>>>>> Julia
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: James W. MacDonald [mailto:jmacdon at uw.edu]
>>>>> Sent: Thursday, February 13, 2014 9:36 AM
>>>>> To: Sabet, Julia A
>>>>> Cc: bioconductor at r-project.org
>>>>> Subject: Re: [BioC] filtering probes in affymetrix data
>>>>>
>>>>> Hi Julia,
>>>>>
>>>>> There are several different things you can do. I'll show you one possibility.
>>>>>
>>>>> First, note that there are multiple different control probes on
>>>>> this array that aren't intended to measure differential expression,
>>>>> and should be excluded. So first let's look at the possible types
>>>>> of
>>>>> probesets:
>>>>>
>>>>>> library(pd.mogene.2.0.st)
>>>>>> con <- db(pd.mogene.2.0.st)
>>>>>> dbGetQuery(con, "select * from type_dict;")
>>>>>         type                   type_id
>>>>> 1     1                      main
>>>>> 2     2             control->affx
>>>>> 3     3             control->chip
>>>>> 4     4 control->bgp->antigenomic
>>>>> 5     5     control->bgp->genomic
>>>>> 6     6            normgene->exon
>>>>> 7     7          normgene->intron
>>>>> 8     8  rescue->FLmRNA->unmapped
>>>>> 9     9  control->affx->bac_spike
>>>>> 10   10            oligo_spike_in
>>>>> 11   11           r1_bac_spike_at
>>>>>
>>>>> These are all the possible types of probesets, but we don't have all of them on this array. To see which ones we do have we can do:
>>>>>
>>>>>
>>>>>> table(dbGetQuery(con, "select type from featureSet;")[,1])
>>>>>           1      2      4      7      9
>>>>> 263551     18     23   5331     18
>>>>>
>>>>> So we only have these probeset types:
>>>>>
>>>>> 1     1                      main
>>>>> 2     2             control->affx
>>>>> 4     4 control->bgp->antigenomic
>>>>> 7     7          normgene->intron
>>>>> 9     9  control->affx->bac_spike
>>>>>
>>>>> And the 'main' probesets are those that we want to use for
>>>>> differential expression. Now one thing you could do is to say that
>>>>> the antigenomic probesets should give a good measure of background,
>>>>> as they are supposed to have sequences that don't exist in mice. So
>>>>> you could just extract those probesets, get some measure and use
>>>>> that as the lower limit of what you think is expressed or not.
>>>>> That's pretty naive, as a probe with higher GC content will have
>>>>> higher background than one with a lower GC content, but worrying
>>>>> about that is way beyond what I am prepared to go into.
>>>>>
>>>>> Now we can get the probeset IDs for the antigenomic probesets
>>>>>
>>>>> antigm <- dbGetQuery(con, "select meta_fsetid from core_mps inner
>>>>> join featureSet on core_mps.fsetid=featureSet.fsetid where
>>>>> featureSet.type='4';")
>>>>>
>>>>> And then extract those probesets and get a summary statistic.
>>>>>
>>>>> bkg <- apply(exprs(eset)[as.character(antigm[,1]),], 2, quantile,
>>>>> probs=0.95)
>>>>>
>>>>> Which will give us the 95th percentile of these background probes.
>>>>> You could then use the kOverA function in genefilter to filter out
>>>>> any probesets where all samples are below the background values.
>>>>> The idea being that you want to filter out any probesets unless k
>>>>> samples have expression levels >= A. So if you have 10 samples,
>>>>> where 5 are controls and 5 are treated, you would do something like
>>>>>
>>>>> minval <- max(bkg)
>>>>> ind <- genefilter(eset, filterfun(kOverA(5, minval))) eset.filt <-
>>>>> eset[ind,]
>>>>>
>>>>> You should also filter out all the non-main probesets. You can do
>>>>> that using getMainProbes() in the affycoretools package
>>>>>
>>>>> eset.filt <- getMainProbes(eset.filt)
>>>>>
>>>>> Best,
>>>>>
>>>>> Jim
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wednesday, February 12, 2014 10:16:31 PM, Sabet, Julia A wrote:
>>>>>> Hello all,
>>>>>> I am totally new to R/Bioconductor and have begun processing data from my Affymetrix Mouse Gene 2.0 ST arrays.  I normalized the data like this:
>>>>>>
>>>>>> library(pd.mogene.2.0.st)
>>>>>> eset <- rma(affyRaw)
>>>>>>
>>>>>> and added gene annotation and I am following the limma user's
>>>>>> guide, which recommends removing "probes that appear not be expressed in any of the experimental conditions."  I have read on previous posts that filtering may not be necessary.  Should I filter, and if so, how?  Using what code?
>>>>>>
>>>>>> Thank you!
>>>>>> Julia Sabet
>>>>>>
>>>>>> 	[[alternative HTML version deleted]]
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>> --
>>>>> James W. MacDonald, M.S.
>>>>> Biostatistician
>>>>> University of Washington
>>>>> Environmental and Occupational Health Sciences
>>>>> 4225 Roosevelt Way NE, # 100
>>>>> Seattle WA 98105-6099
>>>> --
>>>> James W. MacDonald, M.S.
>>>> Biostatistician
>>>> University of Washington
>>>> Environmental and Occupational Health Sciences
>>>> 4225 Roosevelt Way NE, # 100
>>>> Seattle WA 98105-6099
>>> --
>>> James W. MacDonald, M.S.
>>> Biostatistician
>>> University of Washington
>>> Environmental and Occupational Health Sciences
>>> 4225 Roosevelt Way NE, # 100
>>> Seattle WA 98105-6099
>>>
>>
>> --
>> James W. MacDonald, M.S.
>> Biostatistician
>> University of Washington
>> Environmental and Occupational Health Sciences
>> 4225 Roosevelt Way NE, # 100
>> Seattle WA 98105-6099
>>
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099



More information about the Bioconductor mailing list