[BioC] filtering probes in affymetrix data

James W. MacDonald jmacdon at uw.edu
Thu Feb 13 15:36:28 CET 2014


Hi Julia,

There are several different things you can do. I'll show you one 
possibility.

First, note that this array carries several kinds of control probesets
that aren't intended to measure differential expression and should be
excluded. So let's start by looking at the possible types of
probesets:

> library(pd.mogene.2.0.st)
> con <- db(pd.mogene.2.0.st)
> dbGetQuery(con, "select * from type_dict;")
   type                   type_id
1     1                      main
2     2             control->affx
3     3             control->chip
4     4 control->bgp->antigenomic
5     5     control->bgp->genomic
6     6            normgene->exon
7     7          normgene->intron
8     8  rescue->FLmRNA->unmapped
9     9  control->affx->bac_spike
10   10            oligo_spike_in
11   11           r1_bac_spike_at

These are all the possible types of probesets, but we don't have all of 
them on this array. To see which ones we do have we can do:


> table(dbGetQuery(con, "select type from featureSet;")[,1])

     1      2      4      7      9
263551     18     23   5331     18

So we only have these probeset types:

1     1                      main
2     2             control->affx
4     4 control->bgp->antigenomic
7     7          normgene->intron
9     9  control->affx->bac_spike
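
The same listing can be built programmatically by matching the type
codes that occur on the array against type_dict. Here is a minimal
base-R sketch; note that both query results are typed in by hand from
the output above rather than pulled from the database, and the object
names (type_dict, counts, present) are just for illustration:

```r
# type_dict as printed above (copied by hand, not queried)
type_dict <- data.frame(
  type = 1:11,
  type_id = c("main", "control->affx", "control->chip",
              "control->bgp->antigenomic", "control->bgp->genomic",
              "normgene->exon", "normgene->intron",
              "rescue->FLmRNA->unmapped", "control->affx->bac_spike",
              "oligo_spike_in", "r1_bac_spike_at"),
  stringsAsFactors = FALSE
)

# counts per type, as returned by the table() call above
counts <- c("1" = 263551, "2" = 18, "4" = 23, "7" = 5331, "9" = 18)

# keep only the types that occur on the array and attach the counts
present <- type_dict[type_dict$type %in% as.integer(names(counts)), ]
present$n <- unname(counts[as.character(present$type)])
present
```

With the real database, type_dict and counts would come from the two
dbGetQuery() calls shown earlier.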

The 'main' probesets are the ones we want to use for differential
expression. One thing you could do is treat the antigenomic probesets
as a measure of background, as they are supposed to have sequences
that don't exist in mice. So you could just extract those probesets,
compute a summary statistic, and use that as the cutoff below which
you consider a probeset not expressed. That's pretty naive, as a probe
with higher GC content will have higher background than one with a
lower GC content, but worrying about that is way beyond what I am
prepared to go into.

Now we can get the probeset IDs for the antigenomic probesets:

antigm <- dbGetQuery(con, "select meta_fsetid from core_mps inner join 
featureSet on core_mps.fsetid=featureSet.fsetid where 
featureSet.type='4';")

And then extract those probesets and get a summary statistic:

bkg <- apply(exprs(eset)[as.character(antigm[,1]),], 2, quantile, 
probs=0.95)

This gives the 95th percentile of the background probesets in each
sample. You could then use the kOverA function in genefilter to remove
probesets that are not above background in enough samples. The idea is
that a probeset is kept only if at least k samples have expression
levels > A. So if you have 10 samples, where 5 are controls and 5 are
treated, you would do something like

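library(genefilter)
# single cutoff: the largest of the per-sample background values
minval <- max(bkg)
# TRUE for probesets with at least 5 samples above the cutoff
ind <- genefilter(eset, filterfun(kOverA(5, minval)))
eset.filt <- eset[ind,]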
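
To see what this kind of filter does, here is a small base-R sketch on
made-up numbers. It assumes, as I understand kOverA, that a probeset
passes when at least k of its values are strictly larger than A. The
matrix x, its row names and all the values are invented for
illustration, and the two bg rows stand in for the antigenomic
probesets:

```r
# toy log2 expression matrix: 6 probesets x 4 samples (invented values)
x <- rbind(
  bg1   = c(3.0, 3.2, 2.9, 3.1),
  bg2   = c(3.5, 3.4, 3.6, 3.3),
  geneA = c(8.1, 7.9, 8.3, 8.0),   # above background in every sample
  geneB = c(3.1, 3.0, 7.5, 7.8),   # above background in treated only
  geneC = c(3.2, 3.1, 3.0, 3.3),   # never above background
  geneD = c(3.0, 6.9, 3.1, 3.2)    # above background in one sample
)
colnames(x) <- c("ctl1", "ctl2", "trt1", "trt2")

# per-sample 95th percentile of the background rows, then the
# largest of those values as a single conservative cutoff
bkg <- apply(x[c("bg1", "bg2"), ], 2, quantile, probs = 0.95)
minval <- max(bkg)

# k-over-A by hand: keep rows with at least k values > minval
k <- 2                       # group size in this toy design
keep <- rowSums(x > minval) >= k
x[keep, ]
```

Here geneB survives because it clears the cutoff in both treated
samples, which is the reason for setting k to the group size (5 in
the example above) rather than to the total number of samples.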

You should also filter out all the non-main probesets. You can do that
using getMainProbes() in the affycoretools package:

library(affycoretools)
eset.filt <- getMainProbes(eset.filt)

Best,

Jim




On Wednesday, February 12, 2014 10:16:31 PM, Sabet, Julia A wrote:
> Hello all,
> I am totally new to R/Bioconductor and have begun processing data from my Affymetrix Mouse Gene 2.0 ST arrays.  I normalized the data like this:
>
> library(pd.mogene.2.0.st)
> eset <- rma(affyRaw)
>
> and added gene annotation and I am following the limma user's guide, which recommends removing "probes that appear not be
> expressed in any of the experimental conditions."  I have read on previous posts that filtering may not be necessary.  Should I filter, and if so, how?  Using what code?
>
> Thank you!
> Julia Sabet
>

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099


