[BioC] pd.hugene.2.0.st missing normgene->exon mappings

Tue Jul 9 21:38:53 CEST 2013

Dear Jim,

In xps I use the probeset annotation file as the base file for exon arrays 
and then compare its data with the data from the pgf-file. Any differences 
are reported.

I have just checked the different files for HuGene-2_0-st. Consider, as 
an example, the following probeset_ids:
16934607
16934608
16934609
16934610

Then you will see that the transcript annotation file lists these ids as 
'normgene->exon' and 'pos_control'. However, the probeset annotation 
file lists these ids as 'main' belonging to gene EIF3D with 
transcript_cluster_id 16934583. Looking for this id in the transcript 
annotation file reveals that the number of 'total_probes' is 24. Indeed, 
the probeset annotation file lists 24 probesets, including the four 
probeset_ids mentioned above.

This means that although these four probesets are listed in the 
transcript annotation file as 'normgene->exon' the label 'main' in the 
pgf-file is correct since these probesets are part of the gene EIF3D.
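
A minimal sketch in R of the cross-check described above, using toy data 
frames instead of the real NetAffx files. The four ids and the 
transcript_cluster_id are the ones quoted in this message; the helper name 
cross_check and the column name probeset_type are assumptions made for 
illustration, so adapt them to the columns actually present in your 
annotation files.

```r
# Toy stand-in for the transcript annotation file (illustrative rows only).
transcript <- data.frame(
  transcript_cluster_id = c(16934607, 16934608, 16934609, 16934610, 16934583),
  category = c(rep("normgene->exon", 4), "main"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the probeset annotation file.
# 'probeset_type' is an assumed column name.
probeset <- data.frame(
  probeset_id = c(16934607, 16934608, 16934609, 16934610),
  transcript_cluster_id = rep(16934583, 4),
  probeset_type = rep("main", 4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up, in the probeset table, every id that the
# transcript table labels as normgene->exon.
cross_check <- function(transcript, probeset) {
  ids <- transcript$transcript_cluster_id[transcript$category == "normgene->exon"]
  probeset[probeset$probeset_id %in% ids,
           c("probeset_id", "transcript_cluster_id", "probeset_type")]
}

hits <- cross_check(transcript, probeset)
print(table(hits$probeset_type))            # types seen at the probeset level
print(unique(hits$transcript_cluster_id))   # transcript cluster(s) they belong to
```

With the real files, an id that is 'normgene->exon' in one table but 'main' 
in the other would show up in the first table() call.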

Interestingly, the pgf-file for HuGene-1_0-st has extra probesets listed 
as 'normgene->exon'. However, in this case these probesets are also 
listed as 'normgene->exon' in the probeset annotation file, i.e. these 
probesets do not belong to any transcript listed in the probeset 
annotation file.

Best regards,
Christian

On 7/9/13 8:46 PM, James W. MacDonald wrote:
> Hi Christian,
>
> That's not the issue. Instead, the issue is that the pgf file lists the
> normgene->exon probeset IDs as being 'main'. I have received a response
> from Affy stating that the qcc file lists the normgene->exon probesets
> as pos_control, but that seems orthogonal to the issue at hand.
>
>  > qcc <- read.table("HuGene-2_0-st.qcc", comment.char="#",
>  +                   stringsAsFactors=FALSE, header=TRUE)
>  > pgf <- readPgf("HuGene-2_0-st.pgf")
>  > head(qcc)
>    probeset_id  group_name probeset_name quantification_in_header
> 1    16650001 neg_control      16650001                        0
> 2    16650003 neg_control      16650003                        0
>
> ## get the positive controls (normgene->exon probesets) from the qcc file
>  > pos_cont <- qcc[qcc[,2] == "pos_control",1]
>
> ## compare to pgf file
>  > x <- pgf$probesetType[pgf$probesetId %in% pos_cont]
>  > table(x)
> x
> main
> 1626
>
> So in the pgf file, these probesets are being called 'main' instead of
> some sort of control. How do you handle this in xps? Do you use the pgf
> file?
>
> Best,
>
> Jim
>
>
>
>
> On 7/9/2013 2:06 PM, cstrato wrote:
>> Dear Jim,
>>
>> I did not really follow the discussion so I may be wrong, but if you
>> mean that there is a difference between the number of 'main' types,
>> please note that the number of 'main' for pgf, i.e. 349012, corresponds
>> to the number of 'main' in the probeset annotation file and not in the
>> transcript annotation file.
>>
>> But as I said, I may have misunderstood the problem.
>>
>> I am mainly replying because at the beginning of this year I had long
>> discussions with DevNet to make sure that the annotation files for the
>> 2.X arrays are correct, and in version na33.2 DevNet did correct
>> everything that I had found.
>>
>> Best regards,
>> Christian
>>
>>
>> On 7/9/13 7:13 PM, James W. MacDonald wrote:
>>> Hi Mark,
>>>
>>> Thanks for the heads-up. We already knew that Affy messed up the
>>> transcript and probeset annotation files (and had them fixed), but I
>>> didn't think I needed to check the others. Famous last words, no?
>>>
>>> > x <- readPgf("HuGene-2_0-st.pgf")
>>> > table(x$probesetType)
>>>
>>>               control->affx   control->affx->bac_spike
>>>                          18                         18
>>>   control->affx->ercc_spike control->affx->polya_spike
>>>                          92                         39
>>>   control->bgp->antigenomic                       main
>>>                          23                     349012
>>>            normgene->intron                   reporter
>>>                        3575                         82
>>>
>>> > y <- read.csv("HuGene-2_0-st-v1.na33.2.hg19.transcript.csv",
>>> +               comment.char = "#", stringsAsFactors=FALSE, header = TRUE)
>>> > table(y$category)
>>>
>>>               control->affx   control->affx->bac_spike
>>>                          18                         18
>>>   control->affx->ercc-spike control->affx->polya_spike
>>>                          92                         39
>>>   control->bgp->antigenomic                       main
>>>                          23                      44629
>>>              normgene->exon           normgene->intron
>>>                        1626                       3575
>>>                    reporter                     rescue
>>>                          82                       3515
>>>
>>> I'll ping Affymetrix and see what they have to say.
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>
>>> On 7/9/2013 3:29 AM, Mark Cowley wrote:
>>>> Dear Benilton, James & Bioconductors,
>>>> Thanks for providing the platform design packages for
>>>> hugene/mogene/ragene 1.0/1.1/2.0/2.1 arrays.
>>>>
>>>> I use SQL to query these packages & ultimately retain only 'main'
>>>> probes in my analysis. This works well for 1.0 and 1.1 packages, but
>>>> not for 2.0 and 2.1 ST arrays. For 2.0 and 2.1 arrays, the
>>>> normgene->exon control probes are misclassified as 'main' probes.
>>>>
>>>> evidence: the HuGene-2_0-st-v1.na33.2.hg19.transcript.csv NetAffx csv
>>>> file lists 1626 normgene->exon probes, however the pd.hugene.2.0.st
>>>> package lists 0, and assigns these 1626 probes to the 'main' category:
>>>>
>>>> # probe types:
>>>> library(pd.hugene.2.0.st)
>>>> conn <- db(pd.hugene.2.0.st)
>>>> dbGetQuery(conn,"SELECT * from type_dict")
>>>>     type                   type_id
>>>> 1     1                      main
>>>> 2     2             control->affx
>>>> 3     3             control->chip
>>>> 4     4 control->bgp->antigenomic
>>>> 5     5     control->bgp->genomic
>>>> 6     6            normgene->exon
>>>> 7     7          normgene->intron
>>>> 8     8  rescue->FLmRNA->unmapped
>>>> 9     9  control->affx->bac_spike
>>>> 10   10            oligo_spike_in
>>>> 11   11           r1_bac_spike_at
>>>>
>>>> # probe counts for each of the probe categories:
>>>> dbGetQuery(conn,"SELECT type, count(*) from featureSet GROUP BY type")
>>>>    type count(*)
>>>> 1   NA     3728
>>>> 2    1   345497
>>>> 3    2       18
>>>> 4    4       23
>>>> 5    7     3575
>>>> 6    9       18
>>>>
>>>> NB: no type 6 probes.
>>>> I've tested all 12 hu/mo/ra gene 1.0,1.1,2.0,2.1 ST packages, and see
>>>> this issue for all 2.0 and 2.1 arrays (see below)
>>>>
>>>> Can these mappings please be updated?
>>>>
>>>> PS, there's a bunch of probes with type = NA in the database. I
>>>> haven't investigated these in any detail.
>>>>
>>>> cheers,
>>>> Mark
>>>> -----------------------------------------------------
>>>> Mark Cowley, PhD
>>>>
>>>> Genome Informatics Division & the Centre for Clinical Genomics
>>>> The Kinghorn Cancer Centre, Garvan Institute of Medical Research,
>>>> Sydney, Australia
>>>> -----------------------------------------------------
>>>>
>>>> All 12 packages below:
>>>> pd.packages <- c(
>>>>    "pd.hugene.1.0.st.v1", "pd.hugene.1.1.st.v1", "pd.hugene.2.0.st", "pd.hugene.2.1.st",
>>>>    "pd.mogene.1.0.st.v1", "pd.mogene.1.1.st.v1", "pd.mogene.2.0.st", "pd.mogene.2.1.st",
>>>>    "pd.ragene.1.0.st.v1", "pd.ragene.1.1.st.v1", "pd.ragene.2.0.st", "pd.ragene.2.1.st"
>>>> )
>>>>
>>>> a <- b <- list()
>>>> for (pd.pkg.name in pd.packages) {
>>>>    try({
>>>>      require(pd.pkg.name, character.only=TRUE) || stop("Can't load the pd.package")
>>>>      conn <- db(get(pd.pkg.name))
>>>>      a[[pd.pkg.name]] <- dbGetQuery(conn, "SELECT type, count(*) from featureSet GROUP BY type")
>>>>      b[[pd.pkg.name]] <- dbGetQuery(conn, "SELECT fsetid from featureSet WHERE type = 6")[,1]
>>>>    })
>>>> }
>>>> dbGetQuery(conn,"SELECT * from type_dict")
>>>>
>>>>> a
>>>> $pd.hugene.1.0.st.v1
>>>>    type count(*)
>>>> 1   NA      227
>>>> 2    1   253002
>>>> 3    2       57
>>>> 4    4       45
>>>> 5    6     1195
>>>> 6    7     2904
>>>>
>>>> $pd.hugene.1.1.st.v1
>>>>    type count(*)
>>>> 1   NA      227
>>>> 2    1   253002
>>>> 3    2       57
>>>> 4    4       45
>>>> 5    6     1195
>>>> 6    7     2904
>>>>
>>>> $pd.hugene.2.0.st
>>>>    type count(*)
>>>> 1   NA     3728
>>>> 2    1   345497
>>>> 3    2       18
>>>> 4    4       23
>>>> 5    7     3575
>>>> 6    9       18
>>>>
>>>> $pd.hugene.2.1.st
>>>>    type count(*)
>>>> 1   NA     3728
>>>> 2    1   345497
>>>> 3    2       18
>>>> 4    4       23
>>>> 5    7     3575
>>>> 6    9       18
>>>>
>>>> $pd.mogene.1.0.st.v1
>>>>    type count(*)
>>>> 1   NA       86
>>>> 2    1   234878
>>>> 3    2       21
>>>> 4    4       45
>>>> 5    6     1324
>>>> 6    7     5222
>>>>
>>>> $pd.mogene.1.1.st.v1
>>>>    type count(*)
>>>> 1   NA       86
>>>> 2    1   234878
>>>> 3    2       21
>>>> 4    4       45
>>>> 5    6     1324
>>>> 6    7     5222
>>>>
>>>> $pd.mogene.2.0.st
>>>>    type count(*)
>>>> 1   NA      810
>>>> 2    1   263551
>>>> 3    2       18
>>>> 4    4       23
>>>> 5    7     5331
>>>> 6    9       18
>>>>
>>>> $pd.mogene.2.1.st
>>>>    type count(*)
>>>> 1   NA      810
>>>> 2    1   263551
>>>> 3    2       18
>>>> 4    4       23
>>>> 5    7     5331
>>>> 6    9       18
>>>>
>>>> $pd.ragene.1.0.st.v1
>>>>    type count(*)
>>>> 1   NA      254
>>>> 2    1   211195
>>>> 3    2       21
>>>> 4    4       45
>>>> 5    6      399
>>>> 6    7     1153
>>>>
>>>> $pd.ragene.1.1.st.v1
>>>>    type count(*)
>>>> 1   NA      254
>>>> 2    1   211195
>>>> 3    2       21
>>>> 4    4       45
>>>> 5    6      399
>>>> 6    7     1153
>>>>
>>>> $pd.ragene.2.0.st
>>>>    type count(*)
>>>> 1   NA     1071
>>>> 2    1   214018
>>>> 3    2       18
>>>> 4    4       23
>>>> 5    7     5083
>>>> 6    9       18
>>>>
>>>> $pd.ragene.2.1.st
>>>>    type count(*)
>>>> 1   NA     1071
>>>> 2    1   214018
>>>> 3    2       18
>>>> 4    4       23
>>>> 5    7     5083
>>>> 6    9       18
>>>>
>>>>> sapply(b,length)
>>>> pd.hugene.1.0.st.v1 pd.hugene.1.1.st.v1    pd.hugene.2.0.st    pd.hugene.2.1.st
>>>>                1195                1195                   0                   0
>>>> pd.mogene.1.0.st.v1 pd.mogene.1.1.st.v1    pd.mogene.2.0.st    pd.mogene.2.1.st
>>>>                1324                1324                   0                   0
>>>> pd.ragene.1.0.st.v1 pd.ragene.1.1.st.v1    pd.ragene.2.0.st    pd.ragene.2.1.st
>>>>                 399                 399                   0                   0
>>>>
>>>>> sessionInfo()
>>>> R version 3.0.0 (2013-04-03)
>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>>   [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C
>>>>   [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8
>>>>   [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8
>>>>   [7] LC_PAPER=C                 LC_NAME=C
>>>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>>> [8] base
>>>>
>>>> other attached packages:
>>>>   [1] pd.ragene.2.1.st_2.12.1   pd.ragene.2.0.st_2.12.0
>>>>   [3] pd.ragene.1.1.st.v1_3.8.0 pd.ragene.1.0.st.v1_3.8.0
>>>>   [5] pd.mogene.2.1.st_2.12.1   pd.mogene.2.0.st_2.12.0
>>>>   [7] pd.mogene.1.1.st.v1_3.8.0 pd.mogene.1.0.st.v1_3.8.0
>>>>   [9] pd.hugene.2.1.st_3.8.0    pd.hugene.1.1.st.v1_3.8.0
>>>> [11] pd.hugene.1.0.st.v1_3.8.0 pd.hugene.2.0.st_3.8.0
>>>> [13] oligo_1.24.0              Biobase_2.20.0
>>>> [15] oligoClasses_1.22.0       BiocGenerics_0.6.0
>>>> [17] RSQLite_0.11.4            DBI_0.2-7
>>>> [19] BiocInstaller_1.10.2
>>>>
>>>> loaded via a namespace (and not attached):
>>>>   [1] affxparser_1.32.1     affyio_1.28.0         Biostrings_2.28.0
>>>>   [4] bit_1.1-10            codetools_0.2-8       ff_2.2-11
>>>>   [7] foreach_1.4.1         GenomicRanges_1.12.3  IRanges_1.18.1
>>>> [10] iterators_1.0.6       preprocessCore_1.22.0 splines_3.0.0
>>>> [13] stats4_3.0.0          tools_3.0.0           zlibbioc_1.6.0
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>