[BioC] pd.hugene.2.0.st missing normgene->exon mappings

Wed Jul 10 12:31:57 CEST 2013

Hi,
I see the point you're making Christian.
I'm offline now, so cant check this, but i assume that EIF3D and the pos_control meta-probeset in question have different transcript cluster id's. it doesn't make sense for the pos_control transcript cluster Id to be tagged as 'main'. My grep -f was still running when i left work to confirm whether all normgene->exon probesets are all deployed within different real genes in the probeset csv file.

Cheers and thanks for looking into this

Mark

Sent from my iPhone

On 10/07/2013, at 7:53 AM, "cstrato" <cstrato at aon.at> wrote:

> Dear Jim,
> 
> As far as I understand it, at the transcript level 16934607 is on one hand part of the EIF3D transcript and on the other hand does serve as a positive control. To me this seems to be no contradiction, but probably only DevNet can explain.
> 
> Best regards,
> Christian
> 
> 
> On 7/9/13 11:46 PM, James W. MacDonald wrote:
>> Hi Christian,
>> 
>> Thanks for pointing that out. There is still a bit of an inconsistency
>> with the pd packages that should probably be corrected, as at the
>> probeset level e.g., 16934607 is intended to measure an exon of EIF3D,
>> whereas at the transcript level, this same probeset is intended to be a
>> positive control (and as you note below, these probes are incorporated
>> into a larger probeset intended to measure the EIF3D transcript).
>> 
>> It would be nice to be able to filter out the controls like Mark
>> attempted (and I do regularly as well).
>> 
>> Mark - I talked with Benilton Carvalho about this, and he will take a
>> look next week.
>> 
>> Best,
>> 
>> Jim
>> 
>> 
>> 
>> On 7/9/2013 3:38 PM, cstrato wrote:
>>> Dear Jim,
>>> 
>>> In xps I use as basic file for exon arrays the probeset annotation
>>> file and then compare the data to the data from the pgf-file. Any
>>> differences will be reported.
>>> 
>>> I have just checked the different files for HuGene-2_0-st. If you
>>> check as an example the following probeset_ids:
>>> 16934607
>>> 16934608
>>> 16934609
>>> 16934610
>>> 
>>> Then you will see that the transcript annotation file lists these ids
>>> as 'normgene->exon' and 'pos_control'. However, the probeset
>>> annotation file lists these ids as 'main' belonging to gene EIF3D with
>>> transcript_cluster_id 16934583. Looking for this id in the transcript
>>> annotation file reveals that the number of 'total_probes' is 24.
>>> Indeed, the probeset annotation file lists 24 probesets including the
>>> four above mentioned probeset_ids.
>>> 
>>> This means that although these four probesets are listed in the
>>> transcript annotation file as 'normgene->exon' the label 'main' in the
>>> pgf-file is correct since these probesets are part of the gene EIF3D.
>>> 
>>> Interestingly, the pgf-file for HuGene-1_0-st has extra probesets
>>> listed as 'normgene->exon'. However, in this case these probesets are
>>> also listed as 'normgene->exon' in the probeset annotation file, i.e.
>>> these probesets do not belong to any transcript listed in the probeset
>>> annotation file.
>>> 
>>> Best regards,
>>> Christian
>>> 
>>> 
>>> On 7/9/13 8:46 PM, James W. MacDonald wrote:
>>>> Hi Christian,
>>>> 
>>>> That's not the issue. Instead, the issue is that the pgf file lists the
>>>> normgene->exon probeset IDs as being 'main'. I have received a response
>>>> from Affy stating that the qcc file lists the normgene->exon probesets
>>>> as pos_control, but that seems orthogonal to the issue at hand.
>>>> 
>>>> > qcc <- read.table("HuGene-2_0-st.qcc", comment.char="#",
>>>> stringsAsFactors=F, header=T)
>>>> > pgf <- readPgf("HuGene-2_0-st.pgf")
>>>> > head(qcc)
>>>>   probeset_id  group_name probeset_name quantification_in_header
>>>> 1    16650001 neg_control      16650001                        0
>>>> 2    16650003 neg_control      16650003                        0
>>>> 
>>>> ## get the positive controls (normgene->exon probesets) from the qcc
>>>> file
>>>> > pos_cont <- qcc[qcc[,2] == "pos_control",1]
>>>> 
>>>> ## compare to pgf file
>>>> > x <- pgf$probesetType[pgf$probesetId %in% pos_cont]
>>>> > table(x)
>>>> x
>>>> main
>>>> 1626
>>>> 
>>>> So in the pgf file, these probesets are being called 'main' instead of
>>>> some sort of control. How do you handle this in xps? Do you use the pgf
>>>> file?
>>>> 
>>>> Best,
>>>> 
>>>> Jim
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On 7/9/2013 2:06 PM, cstrato wrote:
>>>>> Dear Jim,
>>>>> 
>>>>> I did not really follow the discussion so I may be wrong, but if you
>>>>> mean that there is a difference between the number of 'main' types,
>>>>> please note that number of 'main' for pgf, i.e 349012, corresponds to
>>>>> the number of 'main' in the probeset annotation file and not in the
>>>>> transcript annotation file.
>>>>> 
>>>>> But as I said, I may have misunderstood the problem.
>>>>> 
>>>>> I am mainly replying because at the beginning of this year I had long
>>>>> discussions with DevNet to make sure that the annotation files for the
>>>>> 2.X arrays are correct, and in version na33.2 DevNet did correct
>>>>> everything what I have found.
>>>>> 
>>>>> Best regards,
>>>>> Christian
>>>>> 
>>>>> 
>>>>> On 7/9/13 7:13 PM, James W. MacDonald wrote:
>>>>>> Hi Mark,
>>>>>> 
>>>>>> Thanks for the heads-up. We already knew that Affy messed up the
>>>>>> transcript and probeset annotation files (and had them fixed), but
>>>>>> didn't think I needed to check the others. Famous last words, no?
>>>>>> 
>>>>>> > x <- readPgf("HuGene-2_0-st.pgf")
>>>>>> > table(x$probesetType)
>>>>>> 
>>>>>>              control->affx   control->affx->bac_spike
>>>>>>                         18                         18
>>>>>>  control->affx->ercc_spike control->affx->polya_spike
>>>>>>                         92                         39
>>>>>>  control->bgp->antigenomic                       main
>>>>>>                         23                     349012
>>>>>>           normgene->intron                   reporter
>>>>>>                       3575                         82
>>>>>> 
>>>>>> > y <- read.csv("HuGene-2_0-st-v1.na33.2.hg19.transcript.csv",
>>>>>> comment.char = "#", stringsAsFactors=FALSE, header = TRUE)
>>>>>> > table(y$category)
>>>>>> 
>>>>>>              control->affx   control->affx->bac_spike
>>>>>>                         18                         18
>>>>>>  control->affx->ercc-spike control->affx->polya_spike
>>>>>>                         92                         39
>>>>>>  control->bgp->antigenomic                       main
>>>>>>                         23                      44629
>>>>>>             normgene->exon           normgene->intron
>>>>>>                       1626                       3575
>>>>>>                   reporter                     rescue
>>>>>>                         82                       3515
>>>>>> 
>>>>>> I'll ping Affymetrix and see what they have to say.
>>>>>> 
>>>>>> Best,
>>>>>> 
>>>>>> Jim
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 7/9/2013 3:29 AM, Mark Cowley wrote:
>>>>>>> Dear Benilton, James&  Bioconductors,
>>>>>>> Thanks for providing the platform design packages for
>>>>>>> hugene/mogene/ragene 1.0/1.1/2.0/2.1 arrays.
>>>>>>> 
>>>>>>> I use SQL to query these packages&  ultimately retain only 'main'
>>>>>>> probes in my analysis. This works well for 1.0 and 1.1 packages, but
>>>>>>> nor for 2.0 and 2.1 ST arrays. For 2.0 and 2.1 arrays, the
>>>>>>> normgene->exon control probes are misclassified as 'main' probes.
>>>>>>> 
>>>>>>> evidence: the HuGene-2_0-st-v1.na33.2.hg19.transcript.csvNetAffx csv
>>>>>>> files lists 1626 normgene->exon probes, however the pg.hugene.2.0.st
>>>>>>> package lists 0, and assigns these 1626 probes to the 'main'
>>>>>>> category:
>>>>>>> 
>>>>>>> # probe types:
>>>>>>> library(pd.hugene.2.0.st)
>>>>>>> conn<- db(pd.hugene.2.0.st)
>>>>>>> dbGetQuery(conn,"SELECT * from type_dict")
>>>>>>>    type                   type_id
>>>>>>> 1     1                      main
>>>>>>> 2     2             control->affx
>>>>>>> 3     3             control->chip
>>>>>>> 4     4 control->bgp->antigenomic
>>>>>>> 5     5     control->bgp->genomic
>>>>>>> 6     6            normgene->exon
>>>>>>> 7     7          normgene->intron
>>>>>>> 8     8  rescue->FLmRNA->unmapped
>>>>>>> 9     9  control->affx->bac_spike
>>>>>>> 10   10            oligo_spike_in
>>>>>>> 11   11           r1_bac_spike_at
>>>>>>> 
>>>>>>> # probe counts for each of the probe categories:
>>>>>>> dbGetQuery(conn,"SELECT type, count(*) from featureSet GROUP BY
>>>>>>> type")
>>>>>>>   type count(*)
>>>>>>> 1   NA     3728
>>>>>>> 2    1   345497
>>>>>>> 3    2       18
>>>>>>> 4    4       23
>>>>>>> 5    7     3575
>>>>>>> 6    9       18
>>>>>>> 
>>>>>>> NB: no type 6 probes.
>>>>>>> I've tested all 12 ho/mo/ra gene 1.0,1.1,2.0,2.1 ST packages, and see
>>>>>>> this issue for all 2.0 and 2.1 arrays (see below)
>>>>>>> 
>>>>>>> Can these mappings please be updated?
>>>>>>> 
>>>>>>> PS, there's a bunch of probes with type = NA in the database. I
>>>>>>> haven't investigated these in any detail.
>>>>>>> 
>>>>>>> cheers,
>>>>>>> Mark
>>>>>>> -----------------------------------------------------
>>>>>>> Mark Cowley, PhD
>>>>>>> 
>>>>>>> Genome Informatics Division&  the Centre for Clinical Genomics
>>>>>>> The Kinghorn Cancer Centre, Garvan Institute of Medical Research,
>>>>>>> Sydney, Australia
>>>>>>> -----------------------------------------------------
>>>>>>> 
>>>>>>> All 12 packages below:
>>>>>>> pd.packages<- c(
>>>>>>>   "pd.hugene.1.0.st.v1", "pd.hugene.1.1.st.v1", "pd.hugene.2.0.st",
>>>>>>> "pd.hugene.2.1.st",
>>>>>>>   "pd.mogene.1.0.st.v1", "pd.mogene.1.1.st.v1", "pd.mogene.2.0.st",
>>>>>>> "pd.mogene.2.1.st",
>>>>>>>   "pd.ragene.1.0.st.v1", "pd.ragene.1.1.st.v1", "pd.ragene.2.0.st",
>>>>>>> "pd.ragene.2.1.st"
>>>>>>> )
>>>>>>> 
>>>>>>> a<- b<- list()
>>>>>>> for(pd.pkg.name in pd.packages) {
>>>>>>>   try({
>>>>>>>     require(pd.pkg.name, character.only=TRUE) || stop("Can't load
>>>>>>> the
>>>>>>> pd.package")
>>>>>>>     conn<- db(get(pd.pkg.name))
>>>>>>>     a[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT type, count(*) from
>>>>>>> featureSet GROUP BY type")
>>>>>>>     b[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT fsetid from
>>>>>>> featureSet
>>>>>>> WHERE type = 6")[,1]
>>>>>>>   })
>>>>>>> }
>>>>>>> dbGetQuery(conn,"SELECT * from type_dict")
>>>>>>> 
>>>>>>>> a
>>>>>>> $pd.hugene.1.0.st.v1
>>>>>>>   type count(*)
>>>>>>> 1   NA      227
>>>>>>> 2    1   253002
>>>>>>> 3    2       57
>>>>>>> 4    4       45
>>>>>>> 5    6     1195
>>>>>>> 6    7     2904
>>>>>>> 
>>>>>>> $pd.hugene.1.1.st.v1
>>>>>>>   type count(*)
>>>>>>> 1   NA      227
>>>>>>> 2    1   253002
>>>>>>> 3    2       57
>>>>>>> 4    4       45
>>>>>>> 5    6     1195
>>>>>>> 6    7     2904
>>>>>>> 
>>>>>>> $pd.hugene.2.0.st
>>>>>>>   type count(*)
>>>>>>> 1   NA     3728
>>>>>>> 2    1   345497
>>>>>>> 3    2       18
>>>>>>> 4    4       23
>>>>>>> 5    7     3575
>>>>>>> 6    9       18
>>>>>>> 
>>>>>>> $pd.hugene.2.1.st
>>>>>>>   type count(*)
>>>>>>> 1   NA     3728
>>>>>>> 2    1   345497
>>>>>>> 3    2       18
>>>>>>> 4    4       23
>>>>>>> 5    7     3575
>>>>>>> 6    9       18
>>>>>>> 
>>>>>>> $pd.mogene.1.0.st.v1
>>>>>>>   type count(*)
>>>>>>> 1   NA       86
>>>>>>> 2    1   234878
>>>>>>> 3    2       21
>>>>>>> 4    4       45
>>>>>>> 5    6     1324
>>>>>>> 6    7     5222
>>>>>>> 
>>>>>>> $pd.mogene.1.1.st.v1
>>>>>>>   type count(*)
>>>>>>> 1   NA       86
>>>>>>> 2    1   234878
>>>>>>> 3    2       21
>>>>>>> 4    4       45
>>>>>>> 5    6     1324
>>>>>>> 6    7     5222
>>>>>>> 
>>>>>>> $pd.mogene.2.0.st
>>>>>>>   type count(*)
>>>>>>> 1   NA      810
>>>>>>> 2    1   263551
>>>>>>> 3    2       18
>>>>>>> 4    4       23
>>>>>>> 5    7     5331
>>>>>>> 6    9       18
>>>>>>> 
>>>>>>> $pd.mogene.2.1.st
>>>>>>>   type count(*)
>>>>>>> 1   NA      810
>>>>>>> 2    1   263551
>>>>>>> 3    2       18
>>>>>>> 4    4       23
>>>>>>> 5    7     5331
>>>>>>> 6    9       18
>>>>>>> 
>>>>>>> $pd.ragene.1.0.st.v1
>>>>>>>   type count(*)
>>>>>>> 1   NA      254
>>>>>>> 2    1   211195
>>>>>>> 3    2       21
>>>>>>> 4    4       45
>>>>>>> 5    6      399
>>>>>>> 6    7     1153
>>>>>>> 
>>>>>>> $pd.ragene.1.1.st.v1
>>>>>>>   type count(*)
>>>>>>> 1   NA      254
>>>>>>> 2    1   211195
>>>>>>> 3    2       21
>>>>>>> 4    4       45
>>>>>>> 5    6      399
>>>>>>> 6    7     1153
>>>>>>> 
>>>>>>> $pd.ragene.2.0.st
>>>>>>>   type count(*)
>>>>>>> 1   NA     1071
>>>>>>> 2    1   214018
>>>>>>> 3    2       18
>>>>>>> 4    4       23
>>>>>>> 5    7     5083
>>>>>>> 6    9       18
>>>>>>> 
>>>>>>> $pd.ragene.2.1.st
>>>>>>>   type count(*)
>>>>>>> 1   NA     1071
>>>>>>> 2    1   214018
>>>>>>> 3    2       18
>>>>>>> 4    4       23
>>>>>>> 5    7     5083
>>>>>>> 6    9       18
>>>>>>> 
>>>>>>>> sapply(b,length)
>>>>>>> pd.hugene.1.0.st.v1 pd.hugene.1.1.st.v1    pd.hugene.2.0.st
>>>>>>> pd.hugene.2.1.st
>>>>>>>                1195                1195
>>>>>>> 0                   0
>>>>>>> pd.mogene.1.0.st.v1 pd.mogene.1.1.st.v1    pd.mogene.2.0.st
>>>>>>> pd.mogene.2.1.st
>>>>>>>                1324                1324
>>>>>>> 0                   0
>>>>>>> pd.ragene.1.0.st.v1 pd.ragene.1.1.st.v1    pd.ragene.2.0.st
>>>>>>> pd.ragene.2.1.st
>>>>>>>                 399                 399
>>>>>>> 0                   0
>>>>>>> 
>>>>>>>> sessionInfo()
>>>>>>> R version 3.0.0 (2013-04-03)
>>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>>> 
>>>>>>> locale:
>>>>>>>  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C
>>>>>>>  [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8
>>>>>>>  [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8
>>>>>>>  [7] LC_PAPER=C                 LC_NAME=C
>>>>>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>>>> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
>>>>>>> 
>>>>>>> attached base packages:
>>>>>>> [1] parallel  stats     graphics  grDevices utils     datasets
>>>>>>> methods
>>>>>>> [8] base
>>>>>>> 
>>>>>>> other attached packages:
>>>>>>>  [1] pd.ragene.2.1.st_2.12.1   pd.ragene.2.0.st_2.12.0
>>>>>>>  [3] pd.ragene.1.1.st.v1_3.8.0 pd.ragene.1.0.st.v1_3.8.0
>>>>>>>  [5] pd.mogene.2.1.st_2.12.1   pd.mogene.2.0.st_2.12.0
>>>>>>>  [7] pd.mogene.1.1.st.v1_3.8.0 pd.mogene.1.0.st.v1_3.8.0
>>>>>>>  [9] pd.hugene.2.1.st_3.8.0    pd.hugene.1.1.st.v1_3.8.0
>>>>>>> [11] pd.hugene.1.0.st.v1_3.8.0 pd.hugene.2.0.st_3.8.0
>>>>>>> [13] oligo_1.24.0              Biobase_2.20.0
>>>>>>> [15] oligoClasses_1.22.0       BiocGenerics_0.6.0
>>>>>>> [17] RSQLite_0.11.4            DBI_0.2-7
>>>>>>> [19] BiocInstaller_1.10.2
>>>>>>> 
>>>>>>> loaded via a namespace (and not attached):
>>>>>>>  [1] affxparser_1.32.1     affyio_1.28.0         Biostrings_2.28.0
>>>>>>>  [4] bit_1.1-10            codetools_0.2-8       ff_2.2-11
>>>>>>>  [7] foreach_1.4.1         GenomicRanges_1.12.3  IRanges_1.18.1
>>>>>>> [10] iterators_1.0.6       preprocessCore_1.22.0 splines_3.0.0
>>>>>>> [13] stats4_3.0.0          tools_3.0.0           zlibbioc_1.6.0
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>    [[alternative HTML version deleted]]
>>>>>>> 
>>>>>>> _______________________________________________
>>>>>>> Bioconductor mailing list
>>>>>>> Bioconductor at r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>> Search the archives:
>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>