[BioC] Duplicate probe coordinates with pd.hugene.2.1.st and oligo

Thu Jun 12 01:37:49 CEST 2014

Hi Steve,

On 6/11/2014 5:17 PM, Steve Piccolo wrote:
> I'm trying to process some CEL files from Affy HuGene 2.1st platform. But
> it seems there may be a problem with the pd.hugene.2.1.st package or with
> the way oligo is handling them (or with something I am doing). Below is
> the code that I am using and the output I'm getting.
>
> affyExpressionFS <- read.celfiles(celFilePath)
> xCoord = getX(affyExpressionFS, type="pm")
> yCoord = getY(affyExpressionFS, type="pm")
>
> pmSeq = pmSequence(affyExpressionFS)
>
> print(length(xCoord))
> print(length(yCoord))
> print(length(pmSeq))
> print(length(shouldUseProbes))
>
> [1] 1022045
> [1] 1022045
> [1] 1025088
> [1] 1025088
>
>
> Shouldn't the lengths of these all be identical? Also, I am seeing
> duplicate values for the x_y coordinates. For example, it is saying there
> are 8 probes with x_y coordinates of 1000_198, and the intensity values
> are different for each probe.

I think you might be conflating probe with probeset. If we look at the 
pmfeature table for the (x,y) coordinate you mention, we see this:

           fid   fsetid   atom    x   y
719881 236621 17016826 719881 1000 198
739683 236621 17026617 739683 1000 198
744589 236621 17028715 744589 1000 198
750333 236621 17031494 750333 1000 198
755872 236621 17033950 755872 1000 198
761063 236621 17036233 761063 1000 198
766702 236621 17038992 766702 1000 198
772172 236621 17041577 772172 1000 198

So you are correct that this probe is in the pmfeature table 8 times.
This is because it belongs to eight different probesets (the fsetid
column), each of which is summarized separately at the probeset level.
In other words, this single probe (fid 236621) is used eight times
when you summarize using target = "probeset".
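
As a rough check of that reading, the eight rows above can be rebuilt as a plain data frame in base R. This is only a sketch: the name 'pmf' is made up for the example, and the values are copied from the excerpt rather than queried from the annotation package.

```r
# Rows copied from the pmfeature excerpt above; 'pmf' is a name used only here.
pmf <- data.frame(
  fid    = rep(236621L, 8),
  fsetid = c(17016826L, 17026617L, 17028715L, 17031494L,
             17033950L, 17036233L, 17038992L, 17041577L),
  atom   = c(719881L, 739683L, 744589L, 750333L,
             755872L, 761063L, 766702L, 772172L),
  x      = rep(1000L, 8),
  y      = rep(198L, 8)
)

# One physical probe (one fid, one x_y location) ...
n_probes <- length(unique(pmf$fid))
n_coords <- length(unique(paste(pmf$x, pmf$y, sep = "_")))

# ... listed once for each probeset (fsetid) it belongs to.
n_probesets <- length(unique(pmf$fsetid))

print(c(rows = nrow(pmf), probes = n_probes,
        coords = n_coords, probesets = n_probesets))
```

The row count follows the number of probesets, not the number of physical probes, which is why the same x_y pair shows up repeatedly.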

If you summarize at the transcript level (target = "core") this 
particular probe (fid) is also distributed into eight different probesets.

You don't show how you are getting the intensity values, so I can't 
comment on the different values. I would bet however that you are 
looking at eight different probesets after a summarization step, rather 
than the same probe intensity eight times.

Having explained that part, note that getX() and getY() by default
return data at the 'probeset' level, which includes all the duplicated
probes. The actual query ends up being

SELECT fid, x FROM pmfeature;

and the structure of the pmfeature table is as you see above, so in
essence you are just getting the fid and x columns. pmSequence(), on the
other hand, can return sequences for either the probeset or the
transcript ('core') level, depending on its target argument. So if
you had done:

 > z <- pmSequence(pd.hugene.2.1.st, target = "probeset")
 > length(z)
[1] 1022045

you would get comparable lengths. Now why are there more sequences at 
the 'core' level? It's because there is even more sharing of the probes 
at that level. In other words, a given probe may be in even more 
probesets at the 'core' level than it was if you summarized at the 
'probeset' level.
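
A toy sketch of why the two lengths can differ, in base R. All probe and set IDs below are invented for illustration (none come from pd.hugene.2.1.st), and the two data frames are hypothetical stand-ins for the probe-to-set mappings at each level. The point is only that the count you get is the number of probe-to-set pairings, so it changes with the grouping even though the physical probes do not.

```r
# Hypothetical mappings: each row pairs a probe (fid) with a set it belongs to.
# Every ID here is made up for the example.
probeset_map <- data.frame(
  fid   = c(1L, 2L, 3L, 3L),
  setid = c("ps1", "ps1", "ps2", "ps3")
)
core_map <- data.frame(
  fid   = c(1L, 2L, 3L, 3L, 3L),
  setid = c("tc1", "tc1", "tc1", "tc2", "tc3")
)

# The same three physical probes appear in both groupings ...
n_unique_probeset <- length(unique(probeset_map$fid))
n_unique_core     <- length(unique(core_map$fid))

# ... but the number of probe-to-set rows differs, and that is what gets counted.
n_rows_probeset <- nrow(probeset_map)
n_rows_core     <- nrow(core_map)

print(c(probeset_rows = n_rows_probeset, core_rows = n_rows_core))
```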

Best,

Jim

>
> Is there something I am missing? Or could this be due to a bug?
>
>
>
>> sessionInfo()
> R version 3.1.0 (2014-04-10)
> Platform: x86_64-apple-darwin13.1.0 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
>   [1] SCAN.UPC_2.6.0      sva_3.10.0          mgcv_1.7-29
>   [4] nlme_3.1-117        corpcor_1.6.6       foreach_1.4.2
>   [7] affyio_1.32.0       affy_1.42.2         GEOquery_2.30.0
> [10] oligo_1.28.2        Biostrings_2.32.0   XVector_0.4.0
> [13] IRanges_1.22.8      oligoClasses_1.26.0 Biobase_2.24.0
> [16] BiocGenerics_0.10.0
>
> loaded via a namespace (and not attached):
>   [1] affxparser_1.36.0     BiocInstaller_1.14.2  bit_1.1-12
>   [4] codetools_0.2-8       DBI_0.2-7             ff_2.2-13
>   [7] GenomeInfoDb_1.0.2    GenomicRanges_1.16.3  grid_3.1.0
> [10] iterators_1.0.7       lattice_0.20-29       MASS_7.3-33
> [13] Matrix_1.1-3          preprocessCore_1.26.1 RCurl_1.95-4.1
> [16] splines_3.1.0         stats4_3.1.0          tools_3.1.0
> [19] XML_3.98-1.1          zlibbioc_1.10.0
>
>
> Thanks,
> -Steve
>
> -----------------------------------
> Stephen Piccolo, Ph.D.
> Postdoctoral Research Associate
>
> Affiliations:
>    Department of Pharmacology and Toxicology, University of Utah
>    Division of Computational Biomedicine, Boston University School of
> Medicine
> -----------------------------------
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099