[BioC] makecdfenv and multi-mapping probes on Mouse Gene 1.0 ST array

Heidi Dvinge heidi at ebi.ac.uk
Sun May 11 16:38:56 CEST 2008


On 11 May 2008, at 14:28, Sean Davis wrote:

> On Sun, May 11, 2008 at 8:01 AM, Heidi Dvinge <heidi at ebi.ac.uk> wrote:
>> Dear all,
>>
>> I'm currently looking at some Mouse Gene 1.0 ST arrays, and have used
>> the makecdfenv package to build a cdf environment based on the file
>> MoGene-1_0-st-v1.r3.cdf from the Affymetrix website.
>>
>> That worked without any problems, but out of curiosity I tried taking
>> a closer look at the format of the array, to see how many probes were
>> in each probe set etc.
>>
>> I'm aware that some probes map to multiple probe sets and are removed
>> when the cdfenv is produced, which seems to be the case for about 8%
>> of the probes. My question is how exactly this happens. I would
>> expect the multi-mapping probes to be removed from all probe sets,
>> but this doesn't seem to be the case.
>
> I believe that the probes are kept in the first or last probeset (not
> sure which) seen.  Someone with a little more affy experience can
> comment more fully.
>
I figured it was probably something along those lines, but what's the
reason for not just removing them completely, instead of keeping them
in a 'random' probe set? Most probes that map multiple times map to
more than 2 probe sets. In some cases it's large chunks of probe sets
that 'overlap', whereas in other cases it's just a few probes or a
single probe that 'jumps around'.
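
A minimal sketch of how the multi-mapping could be tallied from the tab-delimited cdf table. It assumes a data frame with the QUAL and INDEX columns used in the example quoted below; the toy values and the names sets_per_probe, frac_multi and shared_per_set are made up for illustration and are not part of makecdfenv or affy.

```r
# Toy stand-in for the tab-delimited cdf table; only the column names
# QUAL (probe set ID) and INDEX (probe identifier) come from the post,
# the values are invented.
cdf <- data.frame(
  QUAL  = c("A", "A", "A", "B", "B", "C", "C", "C"),
  INDEX = c(1L, 2L, 3L, 2L, 3L, 3L, 4L, 5L),
  stringsAsFactors = FALSE
)

# Number of distinct probe sets that each probe appears in
sets_per_probe <- tapply(cdf$QUAL, cdf$INDEX, function(q) length(unique(q)))

# Distribution: how many probes map to 1, 2, 3, ... probe sets
multiplicity <- table(sets_per_probe)

# Fraction of probes that map to more than one probe set
frac_multi <- mean(sets_per_probe > 1)

# Per probe set: how many of its probes are shared with another probe set
multi_idx <- as.integer(names(sets_per_probe)[sets_per_probe > 1])
shared_per_set <- tapply(cdf$INDEX %in% multi_idx, cdf$QUAL, sum)

print(multiplicity)
print(frac_multi)
print(shared_per_set)
```

On the real table, frac_multi would be the figure to compare with the roughly 8% mentioned in the original post, and shared_per_set would separate probe sets that overlap in large chunks from those that share only a probe or two.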

\Heidi

> Sean
>
>> Here is an example with the two overlapping probe sets 10344719 and
>> 10353008, where "raw" is my AffyBatch, "cdf" is the raw cdf file
>> converted to tab-delimited text and read into R, and "INDEX" is a
>> unique probe identifier (equal to the cdf env index minus 1):
>>
>>> cdf[cdf$QUAL=="10344719","INDEX"]
>>  [1]    7543  661828  575792  962890  963940  140756  337977
>> 510591  860722  968182  387524  386474
>> [13]  385518  384468 1076441 1075391  850724   51881  957657  100610
>> 862535  506651  505601   82272
>> [25]   83322  692860  691810  494417  932343  689216  836826  894914
>> 715393  421443   92496  485600
>> [37]  253868  352083  594288 1049892  370822  369772  416675  928371
>> 505790  506840  135781
>>> cdf[cdf$QUAL=="10353008","INDEX"]
>>  [1]  506840  505790  928371  416675  369772  370822 1049892
>> 485600   92496  421443  715393  894914
>> [13] 1073586  110809  836826  689216  932343  494417  691810
>> 83322   82272  505601  506651  862535
>> [25]  100610  957657   51881  850724 1075391 1076441  384468  385518
>> 386474  387524  968182  860722
>> [37]  510591  337977  140756  963940  962890  575792  661828    7543
>>> indexProbes(raw, genenames="10344719")
>> $`10344719`
>> [1] 692861 253869 352084 594289 135782
>>> indexProbes(raw, genenames="10353008")
>> $`10353008`
>>  [1]  506841  505791  928372  416676  369773  370823 1049893
>> 485601   92497  421444  715394  894915
>> [13] 1073587  110810  836827  689217  932344  494418  691811
>> 83323   82273  505602  506652  862536
>> [25]  100611  957658   51882  850725 1075392 1076442  384469  385519
>> 386475  387525  968183  860723
>> [37]  510592  337978  140757  963941  962891  575793  661829    7544
>>
>> So 10344719 and 10353008 have 47 and 44 probes respectively, 42 of
>> which overlap. In the cdf environment, 10344719 appears to have the
>> 42 overlapping probes removed, but they are still present in
>> 10353008.
>>
>> A similar situation is seen for e.g. the overlapping probe sets
>> 10461391 and 10487930 with 41 probes each, 40 of which are identical:
>>
>>> cdf[cdf$QUAL=="10461391","INDEX"]
>>  [1]  483268 1022846  409057  703153  328783  372162  882399
>> 569942  765746  868615  948367  413614
>> [13]  830931  434763  970910  600221  599171  135798    6746  455659
>> 799186  912319  469313  145393
>> [25]  872191  126758  801051  774196  773146  965810  272742   19445
>> 585800  999188 1012776  823868
>> [37]  156514  210874  645037  799505 1075142
>>> cdf[cdf$QUAL=="10487930","INDEX"]
>>  [1] 1075142  799505  645037  210874  156514  823868 1012776
>> 999188  585800   19445  272742  965810
>> [13]  773146  774196  801051  126758  872191  145393  469313  912319
>> 799186  839098    6746  135798
>> [25]  599171  600221  970910  434763  830931  413614  948367  868615
>> 765746  569942  882399  372162
>> [37]  328783  703153  409057 1022846  483268
>>> indexProbes(raw, genenames="10461391")
>> $`10461391`
>> [1] 455660
>>> indexProbes(raw, genenames="10487930")
>> $`10487930`
>>  [1] 1075143  799506  645038  210875  156515  823869 1012777
>> 999189  585801   19446  272743  965811
>> [13]  773147  774197  801052  126759  872192  145394  469314  912320
>> 799187  839099    6747  135799
>> [25]  599172  600222  970911  434764  830932  413615  948368  868616
>> 765747  569943  882400  372163
>> [37]  328784  703154  409058 1022847  483269
>>
>> Any comments on this or on exactly how the cdf environment is created
>> would be much appreciated.
>>
>> Thanks
>> \Heidi
>>
>>> sessionInfo()
>> R version 2.7.0 Under development (unstable) (2008-02-12 r44439)
>> i386-apple-darwin8.10.1
>>
>> locale:
>> en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
>>
>> attached base packages:
>> [1] tools     stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] makecdfenv_1.17.0    affy_1.17.3          preprocessCore_1.1.5 affyio_1.7.17
>> [5] Biobase_1.99.4
>>
>>
>> ------------<<>>------------
>> Heidi Dvinge
>>
>> EMBL-European Bioinformatics Institute
>> Wellcome Trust Genome Campus
>> Hinxton, Cambridge
>> CB10 1SD
>> Mail: heidi at ebi.ac.uk
>> Phone: +44 (0) 1223 494 444
>> ------------<<>>------------
>>
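
The overlap in the quoted 10344719 / 10353008 example can be checked directly from the listed indices. The sketch below copies the INDEX values and the indexProbes() output from the post; the variable names are made up for illustration. It only confirms the counts and the +1 offset between the cdf file index and the cdf env index. It does not show how makecdfenv decides which probe set keeps a shared probe.

```r
# 0-based INDEX values for probe set 10344719, copied from the cdf listing above
idx_10344719 <- c(
  7543, 661828, 575792, 962890, 963940, 140756, 337977, 510591, 860722,
  968182, 387524, 386474, 385518, 384468, 1076441, 1075391, 850724, 51881,
  957657, 100610, 862535, 506651, 505601, 82272, 83322, 692860, 691810,
  494417, 932343, 689216, 836826, 894914, 715393, 421443, 92496, 485600,
  253868, 352083, 594288, 1049892, 370822, 369772, 416675, 928371, 505790,
  506840, 135781
)

# 0-based INDEX values for probe set 10353008, copied from the cdf listing above
idx_10353008 <- c(
  506840, 505790, 928371, 416675, 369772, 370822, 1049892, 485600, 92496,
  421443, 715393, 894914, 1073586, 110809, 836826, 689216, 932343, 494417,
  691810, 83322, 82272, 505601, 506651, 862535, 100610, 957657, 51881,
  850724, 1075391, 1076441, 384468, 385518, 386474, 387524, 968182, 860722,
  510591, 337977, 140756, 963940, 962890, 575792, 661828, 7543
)

# 1-based indices reported by indexProbes(raw, genenames = "10344719") above
env_10344719 <- c(692861, 253869, 352084, 594289, 135782)

shared      <- intersect(idx_10344719, idx_10353008)  # probes in both sets
only_first  <- setdiff(idx_10344719, idx_10353008)    # unique to 10344719
only_second <- setdiff(idx_10353008, idx_10344719)    # unique to 10353008

print(length(shared))       # probes listed under both probe sets
print(length(only_first))   # probes listed only under 10344719
print(length(only_second))  # probes listed only under 10353008

# The probes left in the cdf env for 10344719 are its unique probes, shifted by 1
print(setequal(only_first + 1, env_10344719))
```

The result is consistent with the shared probes being kept in only one of the two probe sets, here 10353008, which is what Sean's reply suggests.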


