[BioC] Oligo package: paCalls question

Thu Sep 4 21:41:25 CEST 2014

Hi Alison,

On Thu, Sep 4, 2014 at 2:50 PM, Alison Ziesel <aziesel at emory.edu> wrote:

> Hello Jim,
>
> Thanks so much for the speedy reply, I appreciate it.
>
> At this point in my workflow I haven’t yet performed RMA. That sounds like
> mistake #1 on my part.
>
> I too had found that post by Dr. Stratowa but I confess I didn’t entirely
> understand it. I am willing to exclude absent probes at the probeset level
> rather than transcript level for this project. So something like:
>
> >library(“oligo”)
> >library(“pd.hugene.2.0.st”)
> >allchips <- read.celfiles(“my list of cel files”)
> >eset <- rma(allchips_present, target='core')
> >featureData(eset) <- getNetAffx(eset, 'transcript')
>
> >present <- paCalls(eset, “DABG”)
> >index <- apply(present, 1, function(x) sum(x < 0.01) > 5)
> >eset_present <- eset[index,]
>

Where did you get this code? It's not doing what you think.

Please note that 'DABG' is computing P/A calls at a lower level than
'PSDABG' (e.g., DAGB is computing at the probe level, and PSDABG is
computing at the probeset level), so this is even more wrong than what you
were doing before.

As an example, let's look at this using some Mouse Gene ST 2.0 arrays I
have in hand:

> dat <- read.celfiles(filenames = list.celfiles())
Loading required package: pd.mogene.2.0.st
Loading required package: RSQLite
Loading required package: DBI
Platform design info loaded.
Reading in : MM_2528.CEL
Reading in : MM_2532.CEL
Reading in : MM_2536.CEL
Reading in : MM_2583.CEL
Reading in : MM_2588.CEL
Reading in : MM_2591.CEL
> dim(paCalls(dat, "DABG"))
Computing DABG calls... OK
[1] 767405      6
> dim(paCalls(dat, "PSDABG"))
Computing DABG calls... OK
[1] 269751      6
> eset <- rma(dat)
Background correcting
Normalizing
Calculating Expression
> dim(eset)
Features  Samples
   41345        6

So the 'present' matrix that you are using above has something like 700K
rows, whereas your expression set has around 50K rows.

Let's back up a bit and clarify what you are doing, and how these arrays
work.

At the most basic level there is the probe, which is a 25-mer intended to
measure the expression of a particular gene. Why they are 25-mers is beyond
the scope of this email, but note that this is a ridiculously short chunk
of DNA to reliably hybridize to its matching sequence. Affy knows this as
well, so they use a set of these probes to measure a given transcript (or
portion thereof).

The next level is known in Affy parlance as the probe set region (PSR),
which is usually made up of 4 probes (at least on the Exon arrays), but for
the Gene ST arrays a given PSR may only contain one or two probes (note
that the Gene ST arrays came out after the Exon ST arrays, and contain the
more reliable probes that they used on the Exon arrays. The PSR comes from
the Exon array format, and was originally intended to measure an exon, or a
portion thereof.).

The PSR level is also known as the 'probeset' level, which was the original
summarization level for the Exon arrays. Since there are many (many!)
probesets on the Gene ST arrays that have only one or two probes, I would
argue against summarizing at this level. The next level is the transcript
level (or 'core' level in oligo). At this level, all probes that
interrogate any portion of a gene are combined together, and then
summarized, to give one single measure for the expression level of that
gene. This is IMO the only trustworthy level of summarization for these
arrays.

When you do paCalls with method = 'DABG', what you are doing is comparing
the intensity for each probe to some measure of background, and trying to
decide if the given probe is binding transcript at a level that is arguably
higher than background. If you use method = "PSDABG", then you are taking
all the P/A calls from the 'DABG' level, combining them at the PSR
(probeset) level, and seeing if *in aggregate* they appear to be binding
transcript at a level that is arguably higher than background. In the oligo
package, there is no way to compute P/A calls at the transcript (or core)
level, so you cannot use P/A calls to filter probesets if you have
summarized at the transcript level.

What you have done with your code above is to compute P/A calls at the
probe level, then made an index where you say at least five of the probes
have to have a p < 0.01. This is fine, I suppose, but you cannot use that
index vector for anything. In other words, the ExpressionSet called 'eset'
contains the data after summarizing the original 700k probes into about 50k
gene-level probesets. So you can't use the P/A calls for the individual
probes to subset the ExpressionSet because there isn't a one-to-one
correspondence between the two.

If you are really wedded to the idea of using P/A calls to filter your
data, you will have to switch to the xps package. Otherwise you will have
to use some other way of filtering out low/unexpressed data.

Best,

Jim

>
> If I’m understanding this correctly, the ‘index’ object should be those
> probes whose X (p-value?) is less than 0.01?
>
> Thank you again for your advice!
>
> alison
>
> On Sep 4, 2014, at 2:05 PM, James W. MacDonald <jmacdon at uw.edu> wrote:
>
> HI Alison,
>
> When you do paCalls with method = "PSDABG", you are getting P/A calls at
> the 'probeset' level (which in Affymetrix terms is the probe set region or
> PSR, which roughly corresponds to exon-level). But when you do rma with
> target = "core", you are summarizing data at the 'core' or transcript level
> (e.g., at this level all the PSRs that interrogate a given gene are
> summarized together).
>
> So you cannot use oligo to generate P/A calls at the transcript level.
> It's my understanding that you can do this with xps, but I am unfamiliar
> with that package. However, a quick search of the BioC list using 'stratowa
> pacalls' got as the first hit this message:
> https://stat.ethz.ch/pipermail/bioconductor/2014-August/060977.html
>
> Best,
>
> Jim
>
>
> On Thu, Sep 4, 2014 at 1:36 PM, Alison Ziesel <aziesel at emory.edu> wrote:
>
>> Hello all,
>>
>> I’ve got very basic familiarity with the oligo package, but I’ve ran into
>> a bit of trouble with the paCalls function. I’d like to restrict the probes
>> in my eSet object to only those called present. Here’s what I’ve done thus
>> far:
>>
>> >library(“oligo”)
>> >library("pd.hugene.2.0.st”)
>> >allchips <- read.celfiles(“my list of cel files”)
>>
>> >allchipspa <-paCalls(allchips, method=“PSDABG”, verbose=TRUE)
>> >present <- rownames(allchipspa)
>> >allchips_present <- exprs(allchips[present,])
>>
>> Error in exprs(allchips[present, ]) :
>>   error in evaluating the argument 'object' in selecting a method for
>> function 'exprs': Error in orig[[nm]][i, , ..., drop = drop] : subscript
>> out of bounds
>>
>> Clearly my last line doesn’t do what I want it to do: filter the object
>> allchips versus the list of present probes in the object present. I suppose
>> at this point I should also confirm that my object allchipspa is just the
>> present probes!  What’s the correct way to perform this filter step? I’d be
>> grateful for any insight you may have.
>>
>> > sessionInfo()
>> R version 3.1.1 (2014-07-10)
>> Platform: x86_64-apple-darwin13.1.0 (64-bit)
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>  base
>>
>> other attached packages:
>>  [1] pd.hugene.2.0.st_3.8.1 RSQLite_0.11.4         DBI_0.2-7
>> oligo_1.28.2
>>  [5] Biostrings_2.32.1      XVector_0.4.0          IRanges_1.22.10
>> Biobase_2.24.0
>>  [9] oligoClasses_1.26.0    BiocGenerics_0.10.0
>>
>> loaded via a namespace (and not attached):
>>  [1] affxparser_1.36.0     affyio_1.32.0         BiocInstaller_1.14.2
>> bit_1.1-12
>>  [5] codetools_0.2-8       ff_2.2-13             foreach_1.4.2
>>  GenomeInfoDb_1.0.2
>>  [9] GenomicRanges_1.16.4  iterators_1.0.7       preprocessCore_1.26.1
>> splines_3.1.1
>> [13] stats4_3.1.1          tools_3.1.1           zlibbioc_1.10.0
>>
>> alison
>>
>> Alison Ziesel
>> Department of Ophthalmology, Emory University
>> aziesel at emory.edu
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
>
>
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099

	[[alternative HTML version deleted]]