[BioC] Oligo package: paCalls question

Thu Sep 4 23:45:30 CEST 2014

Dear Alison,

Since you mention that you had some difficulty with getting xps running, 
could you please tell me what your problem was.

 From your sessionInfo() I see that you are using a Mac running 
x86_64-apple-darwin13.1.0. If this is Yosemite, as I assume, you know 
that it is not even available yet, and thus you need to compile root 
yourself and install xps from source.

Best regards,
Christian
_._._._._._._._._._._._._._._._._._
C.h.r.i.s.t.i.a.n   S.t.r.a.t.o.w.a
V.i.e.n.n.a           A.u.s.t.r.i.a
e.m.a.i.l:        cstrato at aon.at
_._._._._._._._._._._._._._._._._._

On 9/4/14 9:49 PM, Alison Ziesel wrote:
> Hello again Jim,
>
> I adapted that second segment of code (poorly) from the page you had linked earlier, the August 2014 commentary from Dr. Stratowa. I confess I was reaching a bit with it.
>
> Thank you for clarifying where I was going wrong; I was operating on some very poor assumptions, it seems. I have had some difficulty with getting xps running, but it did have both the P/A and I/NI call functions I was interested in using. I usually don�t employ either of those calls, but I have been given an unusual data set with very low quality RNA and high variability within groups, so I�m looking at every way I can to tamp down noise in the dataset.
>
> Thank you again for the quick and detailed responses!
>
> alison
>
>
> On Sep 4, 2014, at 3:41 PM, James W. MacDonald <jmacdon at uw.edu> wrote:
>
>> Hi Alison,
>>
>>
>> On Thu, Sep 4, 2014 at 2:50 PM, Alison Ziesel <aziesel at emory.edu> wrote:
>> Hello Jim,
>>
>> Thanks so much for the speedy reply, I appreciate it.
>>
>> At this point in my workflow I haven�t yet performed RMA. That sounds like mistake #1 on my part.
>>
>> I too had found that post by Dr. Stratowa but I confess I didn�t entirely understand it. I am willing to exclude absent probes at the probeset level rather than transcript level for this project. So something like:
>>
>>> library(�oligo�)
>>> library(�pd.hugene.2.0.st�)
>>> allchips <- read.celfiles(�my list of cel files�)
>>> eset <- rma(allchips_present, target='core')
>>> featureData(eset) <- getNetAffx(eset, 'transcript')
>>
>>> present <- paCalls(eset, �DABG�)
>>> index <- apply(present, 1, function(x) sum(x < 0.01) > 5)
>>> eset_present <- eset[index,]
>>
>> Where did you get this code? It's not doing what you think.
>>
>> Please note that 'DABG' is computing P/A calls at a lower level than 'PSDABG' (e.g., DAGB is computing at the probe level, and PSDABG is computing at the probeset level), so this is even more wrong than what you were doing before.
>>
>> As an example, let's look at this using some Mouse Gene ST 2.0 arrays I have in hand:
>>
>>> dat <- read.celfiles(filenames = list.celfiles())
>> Loading required package: pd.mogene.2.0.st
>> Loading required package: RSQLite
>> Loading required package: DBI
>> Platform design info loaded.
>> Reading in : MM_2528.CEL
>> Reading in : MM_2532.CEL
>> Reading in : MM_2536.CEL
>> Reading in : MM_2583.CEL
>> Reading in : MM_2588.CEL
>> Reading in : MM_2591.CEL
>>> dim(paCalls(dat, "DABG"))
>> Computing DABG calls... OK
>> [1] 767405      6
>>> dim(paCalls(dat, "PSDABG"))
>> Computing DABG calls... OK
>> [1] 269751      6
>>> eset <- rma(dat)
>> Background correcting
>> Normalizing
>> Calculating Expression
>>> dim(eset)
>> Features  Samples
>>     41345        6
>>
>> So the 'present' matrix that you are using above has something like 700K rows, whereas your expression set has around 50K rows.
>>
>> Let's back up a bit and clarify what you are doing, and how these arrays work.
>>
>> At the most basic level there is the probe, which is a 25-mer intended to measure the expression of a particular gene. Why they are 25-mers is beyond the scope of this email, but note that this is a ridiculously short chunk of DNA to reliably hybridize to its matching sequence. Affy knows this as well, so they use a set of these probes to measure a given transcript (or portion thereof).
>>
>> The next level is known in Affy parlance as the probe set region (PSR), which is usually made up of 4 probes (at least on the Exon arrays), but for the Gene ST arrays a given PSR may only contain one or two probes (note that the Gene ST arrays came out after the Exon ST arrays, and contain the more reliable probes that they used on the Exon arrays. The PSR comes from the Exon array format, and was originally intended to measure an exon, or a portion thereof.).
>>
>> The PSR level is also known as the 'probeset' level, which was the original summarization level for the Exon arrays. Since there are many (many!) probesets on the Gene ST arrays that have only one or two probes, I would argue against summarizing at this level. The next level is the transcript level (or 'core' level in oligo). At this level, all probes that interrogate any portion of a gene are combined together, and then summarized, to give one single measure for the expression level of that gene. This is IMO the only trustworthy level of summarization for these arrays.
>>
>> When you do paCalls with method = 'DABG', what you are doing is comparing the intensity for each probe to some measure of background, and trying to decide if the given probe is binding transcript at a level that is arguably higher than background. If you use method = "PSDABG", then you are taking all the P/A calls from the 'DABG' level, combining them at the PSR (probeset) level, and seeing if in aggregate they appear to be binding transcript at a level that is arguably higher than background. In the oligo package, there is no way to compute P/A calls at the transcript (or core) level, so you cannot use P/A calls to filter probesets if you have summarized at the transcript level.
>>
>> What you have done with your code above is to compute P/A calls at the probe level, then made an index where you say at least five of the probes have to have a p < 0.01. This is fine, I suppose, but you cannot use that index vector for anything. In other words, the ExpressionSet called 'eset' contains the data after summarizing the original 700k probes into about 50k gene-level probesets. So you can't use the P/A calls for the individual probes to subset the ExpressionSet because there isn't a one-to-one correspondence between the two.
>>
>> If you are really wedded to the idea of using P/A calls to filter your data, you will have to switch to the xps package. Otherwise you will have to use some other way of filtering out low/unexpressed data.
>>
>> Best,
>>
>> Jim
>>
>>
>>
>>
>>
>>
>>
>>
>> If I�m understanding this correctly, the �index� object should be those probes whose X (p-value?) is less than 0.01?
>>
>> Thank you again for your advice!
>>
>> alison
>>
>> On Sep 4, 2014, at 2:05 PM, James W. MacDonald <jmacdon at uw.edu> wrote:
>>
>>> HI Alison,
>>>
>>> When you do paCalls with method = "PSDABG", you are getting P/A calls at the 'probeset' level (which in Affymetrix terms is the probe set region or PSR, which roughly corresponds to exon-level). But when you do rma with target = "core", you are summarizing data at the 'core' or transcript level (e.g., at this level all the PSRs that interrogate a given gene are summarized together).
>>>
>>> So you cannot use oligo to generate P/A calls at the transcript level. It's my understanding that you can do this with xps, but I am unfamiliar with that package. However, a quick search of the BioC list using 'stratowa pacalls' got as the first hit this message:  https://stat.ethz.ch/pipermail/bioconductor/2014-August/060977.html
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>> On Thu, Sep 4, 2014 at 1:36 PM, Alison Ziesel <aziesel at emory.edu> wrote:
>>> Hello all,
>>>
>>> I�ve got very basic familiarity with the oligo package, but I�ve ran into a bit of trouble with the paCalls function. I�d like to restrict the probes in my eSet object to only those called present. Here�s what I�ve done thus far:
>>>
>>>> library(�oligo�)
>>>> library("pd.hugene.2.0.st�)
>>>> allchips <- read.celfiles(�my list of cel files�)
>>>
>>>> allchipspa <-paCalls(allchips, method=�PSDABG�, verbose=TRUE)
>>>> present <- rownames(allchipspa)
>>>> allchips_present <- exprs(allchips[present,])
>>>
>>> Error in exprs(allchips[present, ]) :
>>>    error in evaluating the argument 'object' in selecting a method for function 'exprs': Error in orig[[nm]][i, , ..., drop = drop] : subscript out of bounds
>>>
>>> Clearly my last line doesn�t do what I want it to do: filter the object allchips versus the list of present probes in the object present. I suppose at this point I should also confirm that my object allchipspa is just the present probes!  What�s the correct way to perform this filter step? I�d be grateful for any insight you may have.
>>>
>>>> sessionInfo()
>>> R version 3.1.1 (2014-07-10)
>>> Platform: x86_64-apple-darwin13.1.0 (64-bit)
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>>   [1] pd.hugene.2.0.st_3.8.1 RSQLite_0.11.4         DBI_0.2-7              oligo_1.28.2
>>>   [5] Biostrings_2.32.1      XVector_0.4.0          IRanges_1.22.10        Biobase_2.24.0
>>>   [9] oligoClasses_1.26.0    BiocGenerics_0.10.0
>>>
>>> loaded via a namespace (and not attached):
>>>   [1] affxparser_1.36.0     affyio_1.32.0         BiocInstaller_1.14.2  bit_1.1-12
>>>   [5] codetools_0.2-8       ff_2.2-13             foreach_1.4.2         GenomeInfoDb_1.0.2
>>>   [9] GenomicRanges_1.16.4  iterators_1.0.7       preprocessCore_1.26.1 splines_3.1.1
>>> [13] stats4_3.1.1          tools_3.1.1           zlibbioc_1.10.0
>>>
>>> alison
>>>
>>> Alison Ziesel
>>> Department of Ophthalmology, Emory University
>>> aziesel at emory.edu
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>>
>>> --
>>> James W. MacDonald, M.S.
>>> Biostatistician
>>> University of Washington
>>> Environmental and Occupational Health Sciences
>>> 4225 Roosevelt Way NE, # 100
>>> Seattle WA 98105-6099
>>
>>
>>
>>
>> --
>> James W. MacDonald, M.S.
>> Biostatistician
>> University of Washington
>> Environmental and Occupational Health Sciences
>> 4225 Roosevelt Way NE, # 100
>> Seattle WA 98105-6099
>
>
> 	[[alternative HTML version deleted]]
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>