[BioC] makecdfenv and multi-mapping probes on Mouse Gene 1.0 ST array

Mon May 12 19:03:27 CEST 2008

> On Sun, May 11, 2008 at 10:38 AM, Heidi Dvinge <heidi at ebi.ac.uk> wrote:
>> On 11 May 2008, at 14:28, Sean Davis wrote:
>>> On Sun, May 11, 2008 at 8:01 AM, Heidi Dvinge <heidi at ebi.ac.uk> wrote:
>>>> Dear all,
>>>> I'm currently looking at some Mouse Gene 1.0 ST arrays, and have used
>>>> the makecdfenv package to build a cdf environment based on the file
>>>> MoGene-1_0-st-v1.r3.cdf from the Affymetrix webpage.
>>>> That worked without any problems, but out of curiosity I tried taking
>>>> a closer look at the format of the array, to see how many probes were in
>>>> each probe set etc.
>>>> I'm aware that some probes map to multiple probe sets and are removed
>>>> when the cdfenv is produced, which seems to be the case for about 8% of
>>>> the probes. My question is how exactly this happens. I would
>>>> expect the multiple-mapping probes to be removed from all probe sets, but
>>>> this doesn't seem to be the case.
>>> I believe that the probes are kept in the first or last probeset (not
>>> sure which) seen.  Someone with a little more affy experience can
>>> comment more fully.
>> I figured it was probably something along those lines, but what's the
>> reason for not just removing them completely, instead of keeping them in
>> a 'random' probe set? Most probes that map multiple times map to > 2
>> probe sets. And in some cases it's large chunks of probe sets that
>> 'overlap', whereas in other cases it's just a few or a single probe that
>> 'jumps around'.
>
> I think this probe "removal" is a side effect of the way the original
> affy package and affy chips were designed.  Before these newer arrays,
> there were no probes that mapped to multiple probe sets, so there was
> never a mechanism for "removing" probes or even maintaining multiple
> mappings.  So, if I understand it correctly, the current behavior arises
> because there is no way to maintain the many-to-many mapping, and it is
> not optimal in any particular way.  Again, someone with more affy
> experience might have more to say.

The original use case was to retrieve the probes in a given probe set,
without further consideration. The need for alternative mappings was
nevertheless considered, and it was made possible to replace the mapping
used to process data at any given time (there is a vignette covering
that).

Regarding many-to-many associations between probes and probe sets, this is
indeed an annoying case (the original design somewhat assumed a perfect
world). Many-to-many associations are not at all impossible to represent,
but they certainly make the analysis of the data difficult. To keep things
simple, the recommendation would be "each probe goes into one probe
set"... and get rid of the rest.
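
As a rough illustration of that recommendation, here is a minimal sketch in
base R. It is not how makecdfenv or affy work internally; the data frame
"map" and its toy values are made up for the example. It shows two possible
policies: dropping multi-mapping probes from every probe set, or keeping
each probe only in the first probe set that lists it.

```r
# Toy probe-to-probe-set table (hypothetical values, not from a real CDF)
map <- data.frame(
  probe    = c(1, 2, 3, 2, 3, 4, 3, 5),
  probeset = c("A", "A", "A", "B", "B", "B", "C", "C"),
  stringsAsFactors = FALSE
)

# Count the number of distinct probe sets each probe appears in
n_sets <- tapply(map$probeset, map$probe, function(x) length(unique(x)))
multi  <- as.numeric(names(n_sets)[n_sets > 1])   # probes 2 and 3

# Policy 1: remove multi-mapping probes from every probe set
strict <- map[!(map$probe %in% multi), ]

# Policy 2: keep each probe only in the first probe set where it is listed
first_only <- map[!duplicated(map$probe), ]

# Probe sets as lists of probe identifiers under each policy
split(strict$probe, strict$probeset)
split(first_only$probe, first_only$probeset)
```

Which policy is preferable depends on the analysis; the second one keeps
more probes but makes the result depend on the order of the table.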

The package "altcdfenvs" also offers extensions to the CDF environments,
with methods and functions to work with them.

> Sean
>
>
>>>> Example with the two overlapping probe sets 10344719 and 10353008,
>>>> where "raw" is my AffyBatch, "cdf" is the raw cdf-file converted to
>>>> tab-delimited text and read into R, and "INDEX" is a unique probe
>>>> identifier (the same as index-1 in the cdf env):
>>>>> cdf[cdf$QUAL=="10344719","INDEX"]
>>>>  [1]    7543  661828  575792  962890  963940  140756  337977  510591  860722  968182  387524  386474
>>>> [13]  385518  384468 1076441 1075391  850724   51881  957657  100610  862535  506651  505601   82272
>>>> [25]   83322  692860  691810  494417  932343  689216  836826  894914  715393  421443   92496  485600
>>>> [37]  253868  352083  594288 1049892  370822  369772  416675  928371  505790  506840  135781
>>>>> cdf[cdf$QUAL=="10353008","INDEX"]
>>>>  [1]  506840  505790  928371  416675  369772  370822 1049892  485600   92496  421443  715393  894914
>>>> [13] 1073586  110809  836826  689216  932343  494417  691810   83322   82272  505601  506651  862535
>>>> [25]  100610  957657   51881  850724 1075391 1076441  384468  385518  386474  387524  968182  860722
>>>> [37]  510591  337977  140756  963940  962890  575792  661828    7543
>>>>> indexProbes(raw, genenames="10344719")
>>>> $`10344719`
>>>> [1] 692861 253869 352084 594289 135782
>>>>> indexProbes(raw, genenames="10353008")
>>>> $`10353008`
>>>>  [1]  506841  505791  928372  416676  369773  370823 1049893  485601   92497  421444  715394  894915
>>>> [13] 1073587  110810  836827  689217  932344  494418  691811   83323   82273  505602  506652  862536
>>>> [25]  100611  957658   51882  850725 1075392 1076442  384469  385519  386475  387525  968183  860723
>>>> [37]  510592  337978  140757  963941  962891  575793  661829    7544
>>>> So 10344719 and 10353008 have 47 and 44 probes respectively, 42 of which
>>>> are overlapping. In the cdf environment 10344719 appears to have the 42
>>>> overlapping probes removed, but they're still present in 10353008.
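
The counts quoted above can be checked with plain set operations. The
sketch below only re-enters the INDEX values printed in the post as two
vectors (the names ps_10344719 and ps_10353008 are chosen for the example)
and uses base R; no AffyBatch or CDF file is needed.

```r
# INDEX values for probe set 10344719, copied from the raw CDF listing above
ps_10344719 <- c(
  7543, 661828, 575792, 962890, 963940, 140756, 337977, 510591, 860722,
  968182, 387524, 386474, 385518, 384468, 1076441, 1075391, 850724, 51881,
  957657, 100610, 862535, 506651, 505601, 82272, 83322, 692860, 691810,
  494417, 932343, 689216, 836826, 894914, 715393, 421443, 92496, 485600,
  253868, 352083, 594288, 1049892, 370822, 369772, 416675, 928371, 505790,
  506840, 135781)

# INDEX values for probe set 10353008, copied from the raw CDF listing above
ps_10353008 <- c(
  506840, 505790, 928371, 416675, 369772, 370822, 1049892, 485600, 92496,
  421443, 715393, 894914, 1073586, 110809, 836826, 689216, 932343, 494417,
  691810, 83322, 82272, 505601, 506651, 862535, 100610, 957657, 51881,
  850724, 1075391, 1076441, 384468, 385518, 386474, 387524, 968182, 860722,
  510591, 337977, 140756, 963940, 962890, 575792, 661828, 7543)

length(ps_10344719)                    # 47
length(ps_10353008)                    # 44

# Probes listed in both probe sets
shared <- intersect(ps_10344719, ps_10353008)
length(shared)                         # 42

# Probes unique to 10344719, shifted by 1 to match the cdf env indexing
kept <- setdiff(ps_10344719, ps_10353008) + 1
kept                                   # 692861 253869 352084 594289 135782
```

The five values in "kept" are the ones indexProbes() reports for 10344719,
which fits the description that only its 42 shared probes were dropped.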
>>>> A similar situation is seen for e.g. the overlapping probe sets
>>>> 10461391 and 10487930 with 41 probes each, 40 of which are identical:
>>>>> cdf[cdf$QUAL=="10461391","INDEX"]
>>>>  [1]  483268 1022846  409057  703153  328783  372162  882399  569942  765746  868615  948367  413614
>>>> [13]  830931  434763  970910  600221  599171  135798    6746  455659  799186  912319  469313  145393
>>>> [25]  872191  126758  801051  774196  773146  965810  272742   19445  585800  999188 1012776  823868
>>>> [37]  156514  210874  645037  799505 1075142
>>>>> cdf[cdf$QUAL=="10487930","INDEX"]
>>>>  [1] 1075142  799505  645037  210874  156514  823868 1012776  999188  585800   19445  272742  965810
>>>> [13]  773146  774196  801051  126758  872191  145393  469313  912319  799186  839098    6746  135798
>>>> [25]  599171  600221  970910  434763  830931  413614  948367  868615  765746  569942  882399  372162
>>>> [37]  328783  703153  409057 1022846  483268
>>>>> indexProbes(raw, genenames="10461391")
>>>> $`10461391`
>>>> [1] 455660
>>>>> indexProbes(raw, genenames="10487930")
>>>> $`10487930`
>>>>  [1] 1075143  799506  645038  210875  156515  823869 1012777  999189  585801   19446  272743  965811
>>>> [13]  773147  774197  801052  126759  872192  145394  469314  912320  799187  839099    6747  135799
>>>> [25]  599172  600222  970911  434764  830932  413615  948368  868616  765747  569943  882400  372163
>>>> [37]  328784  703154  409058 1022847  483269
>>>> Any comments on this or on exactly how the cdf environment is created
>>>> would be much appreciated.
>>>> Thanks
>>>> \Heidi
>>>>> sessionInfo()
>>>> R version 2.7.0 Under development (unstable) (2008-02-12 r44439)
>>>> i386-apple-darwin8.10.1
>>>> locale:
>>>> en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>>>> attached base packages:
>>>> [1] tools     stats     graphics  grDevices utils     datasets  methods   base
>>>> other attached packages:
>>>> [1] makecdfenv_1.17.0    affy_1.17.3          preprocessCore_1.1.5 affyio_1.7.17
>>>> [5] Biobase_1.99.4
>>>> ------------<<>>------------
>>>> Heidi Dvinge
>>>> EMBL-European Bioinformatics Institute
>>>> Wellcome Trust Genome Campus
>>>> Hinxton, Cambridge
>>>> CB10 1SD
>>>> Mail: heidi at ebi.ac.uk
>>>> Phone: +44 (0) 1223 494 444
>>>> ------------<<>>------------
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor