[BioC] package pair "hugene10stv1cdf"/"hugene10stprobeset.db"

Mon May 10 11:13:21 CEST 2010

Hi Laurent,

The help file for rma() in oligo states that the default value for
target is "core", which gives the "transcript cluster" version.

If you call rma() using target="probeset", you'll get the probeset version.
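
As a rough sketch of the two calls (not taken from the oligo documentation verbatim): summarize_gene_st() and cel_dir are hypothetical names, and the function assumes that oligo and the matching pd.hugene.1.0.st.v1 package are installed when it is actually called. It returns the chosen target together with the result, since eset@annotation alone does not record which grouping was used.

```r
# Sketch only: summarize_gene_st() and cel_dir are hypothetical names.
# oligo is needed only when the function is called, not to define it.
summarize_gene_st <- function(cel_dir, target = c("core", "probeset")) {
  # "core" gives transcript-cluster summaries (the default in oligo's rma),
  # "probeset" gives probeset-level summaries
  target <- match.arg(target)
  # Base R file listing; assumes the CEL files end in .CEL (any case)
  cel_files <- list.files(cel_dir, pattern = "\\.CEL$",
                          ignore.case = TRUE, full.names = TRUE)
  raw <- oligo::read.celfiles(cel_files)
  eset <- oligo::rma(raw, target = target)
  # Keep the grouping next to the result so it is not lost later
  list(target = target, eset = eset)
}
```

Calling summarize_gene_st("path/to/cel", "probeset") would then give the probeset version, and leaving target out gives the transcript cluster version.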

Best,

b
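
P.S. For an ExpressionSet whose grouping is unknown, the overlap checks quoted below can be wrapped in a small helper. This is only a sketch: id_overlap() is a hypothetical name and the IDs are made up, so it runs in base R without any annotation package.

```r
# id_overlap() is a hypothetical helper; the IDs below are invented.
id_overlap <- function(ids, reference) {
  ids <- unique(as.character(ids))
  reference <- unique(as.character(reference))
  matched <- ids %in% reference
  c(matched = sum(matched),
    unmatched = sum(!matched),
    fraction = if (length(ids)) mean(matched) else NA_real_)
}

# Stand-ins for featureNames(eset) and the Lkeys() of each annotation package
feature_ids <- c("7892501", "7892502", "7892503", "7892504")
transcript_keys <- c("7892501", "7892502", "7892503", "8180418")
probeset_keys <- c("7892501", "7900001")

overlap <- rbind(
  transcriptcluster = id_overlap(feature_ids, transcript_keys),
  probeset = id_overlap(feature_ids, probeset_keys)
)
# The annotation with the larger matched fraction is the likelier grouping
best_guess <- rownames(overlap)[which.max(overlap[, "fraction"])]
```

With real data, ids would be featureNames(eset) and reference would be, for example, Lkeys(hugene10sttranscriptclusterSYMBOL), as in the messages below. A high fraction is only a hint, not proof of which grouping was used.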

On Mon, May 10, 2010 at 7:37 AM, Laurent Gautier <laurent at cbs.dtu.dk> wrote:
> Hi Marc,
>
> Affymetrix possibly changing the way features are described might not be the
> only source of confusion.
>
> Using "oligo" does not appear to make things much better, as the information
> that can be obtained after running "rma()" is:
>
>> eset@annotation
> [1] "pd.hugene.1.0.st.v1"
>
> Is this the "probeset" version ? Is this the "transcript cluster" version ?
> Obviously this is of utmost importance (as the probe-level summarization
> step will use one given grouping).
>
> Going for a fishing expedition seems a bit awkward:
>
>> summary(featureNames(eset) %in% Lkeys(hugene10sttranscriptclusterSYMBOL))
>   Mode   FALSE    TRUE    NA's
> logical      40   33257       0
>
> As well, with the annotation (finally) becoming very fluid, does shipping
> any probe grouping without an associated annotation make any sense ?
>
>
>
> Laurent
>
>
> On 04/05/10 03:04, Marc Carlson wrote:
>>
>> Hi Laurent,
>>
>> The really confusing thing about the HuGene chip from Affymetrix is that
>> they changed the way they were describing their features mid-stream.  So
>> now people who work with this have to be mindful of how the probes have
>> been grouped ("probesets" or "transcript clusters"?).  Arthur Li has
>> been kind enough to furnish the project with both kinds of package as an
>> option which is why I noticed what I did earlier about the transcript
>> cluster version of the package.  But the fact that Affymetrix have
>> abandoned support for their cdf file also creates a unique problem
>> for us.  I agree with Jim that people should arguably be using oligo
>> rather than affy for analyzing this kind of chip.  But I also agree with
>> you that a friendly warning would be a great idea for this one
>> particular package.
>>
>>
>>   Marc
>>
>>
>>
>>
>> On 05/03/2010 03:00 PM, Laurent Gautier wrote:
>>
>>>
>>> Hi Marc,
>>>
>>> What I am reading translates into very little confidence in anything
>>> related to hugene 1.0ST in the bioconductor "affy" pipeline, and I
>>> really think that it should be more difficult to use it without going
>>> through steps that require one to explicitly see that this is
>>> untested/not recommended/unsafe. The CDF seems to be of uncertain
>>> quality to all, yet provided by bioconductor, and a warning message /
>>> recommendation to switch to oligo when attaching the package would be
>>> helpful, I think.
>>>
>>> Best,
>>>
>>>
>>> Laurent
>>>
>>>
>>>
>>> On 5/3/10 7:07 PM, Marc Carlson wrote:
>>>
>>>>
>>>> Hi Laurent,
>>>>
>>>> Further complicating things, the hugene10stprobeset.db package was a
>>>> contributed package.  From the DESCRIPTION file you can see that it was
>>>> contributed by Arthur Li.  You might want to ask him for more details
>>>> about this package and also about the hugene10sttranscriptcluster.db
>>>> package.  Because I note that for the hugene10sttranscriptcluster.db
>>>> package I get the following:
>>>>
>>>>
>>>> summary(Lkeys(hugene10sttranscriptclusterSYMBOL) %in%
>>>> ls(hugene10stv1cdf))
>>>>
>>>>     Mode   FALSE    TRUE    NA's
>>>>     logical     962   32295       0
>>>>
>>>> summary(ls(hugene10stv1cdf) %in%
>>>> Lkeys(hugene10sttranscriptclusterSYMBOL))
>>>>
>>>>     Mode   FALSE    TRUE    NA's
>>>>     logical      26   32295       0
>>>>
>>>>
>>>> And this looks like a closer match for what you are doing (considering
>>>> that we don't have a properly supported cdf file in this case).
>>>>
>>>> Hope this helps,
>>>>
>>>>
>>>>    Marc
>>>>
>>>>
>>>>
>>>> On 05/03/2010 09:28 AM, Laurent Gautier wrote:
>>>>
>>>>
>>>>>
>>>>> Hi James,
>>>>>
>>>>> Thanks for the clarifications. I am happy to see that Affymetrix has
>>>>> picked up the concept of alternative CDF definitions and makes it
>>>>> easier for its users.
>>>>>
>>>>> Regarding bioconductor, wouldn't it make sense to either mark packages
>>>>> as "unsupported", or better take them to a different location, making
>>>>> their download by the unaware less likely. In the present case should
>>>>> the CDF be placed outside of the main repository ?
>>>>>
>>>>> In addition, wouldn't it make sense to coordinate the release of
>>>>> probe/probeset mapping structures and annotation files (I am reading
>>>>> below that there is an annotation for revision 5 while the mapping is
>>>>> for revision 4) ?
>>>>> What about making the revision number a documented _non-exported_
>>>>> vector in the packages ?
>>>>> This way one could do for example:
>>>>>
>>>>>
>>>>>>
>>>>>> hugene10stprobeset:::revision
>>>>>>
>>>>>>
>>>>>
>>>>> [1] "r5"
>>>>> (keeping the vector non-exported circumvents the issue of a scope
>>>>> pollution whenever different packages with a variable "revision" are
>>>>> in the search path).
>>>>>
>>>>> Best,
>>>>>
>>>>>
>>>>> Laurent
>>>>>
>>>>>
>>>>>
>>>>> On 03/05/10 17:05, James W. MacDonald wrote:
>>>>>
>>>>>
>>>>>>
>>>>>> Hi Laurent,
>>>>>>
>>>>>> Laurent Gautier wrote:
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Dear List,
>>>>>>>
>>>>>>> I am noting potential issues in the package pair
>>>>>>> "hugene10stv1cdf"/"hugene10stprobeset.db", as the respective sets of
>>>>>>> probe set IDs are not overlapping:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> library(hugene10stv1cdf)
>>>>>>>> library(hugene10stprobeset.db)
>>>>>>>> summary(ls(hugene10stv1cdf) %in% Lkeys(hugene10stprobesetSYMBOL))
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>     Mode   FALSE    TRUE    NA's
>>>>>>> logical   28026    4295       0
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> summary(Lkeys(hugene10stprobesetSYMBOL) %in% ls(hugene10stv1cdf))
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>     Mode   FALSE    TRUE    NA's
>>>>>>> logical  252727    4295       0
>>>>>>>
>>>>>>> Reading closely, one can observe that "hugene10stprobeset.db" refers
>>>>>>> to a "revision 5" while the "v1" in "hugene10stv1cdf" suggests a
>>>>>>> revision 1. It is unclear to me whether this is linked to the
>>>>>>> problem, but if so then there is neither a hugene10stv5cdf nor an
>>>>>>> annotation for v1.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> It's hard to say what the 'revision 5' refers to. There is only one
>>>>>> HuGene chip, and it is the version 1. There _have_ been nine versions
>>>>>> of the annotation file released by Affy (Releases 22-30), so there is
>>>>>> no telling what 'revision 5' refers to. But certainly it doesn't
>>>>>> refer to a HuGene-1_0-st-v5 chip, as no such thing exists.
>>>>>>
>>>>>> I have a personal thesis that the Exon and Gene chips contain all
>>>>>> manner of extra sequences that Affy threw on there so they wouldn't
>>>>>> have the same problem they had with their 3'-biased chips. Namely
>>>>>> that the chips were out-of-date the minute they finished the first
>>>>>> production run because the annotations are so fluid. Now they can
>>>>>> simply take the original 32K probesets and slice-n-dice them at will
>>>>>> to make things that  match up with the genome as we know it now.
>>>>>>
>>>>>> But back to the point at hand. The problem with the hugene10stv1cdf
>>>>>> is it is based on the _unsupported_ cdf file that Affy makes
>>>>>> available. We make it available as well, for those who insist on
>>>>>> using the makecdfenv/affy pipeline, rather than the
>>>>>> pdInfoBuilder/oligo pipeline, which is what one should arguably be
>>>>>> using. Given that the data being used to create the cdf package is
>>>>>> specifically unsupported, caveat emptor.
>>>>>>
>>>>>> I note that the supported library files do contain an 'r4' in the
>>>>>> file name, so I assume, without any backing data, that this library
>>>>>> would actually hew more closely to the annotation data they supply.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Jim
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> The obligatory sessionInfo() is:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> sessionInfo()
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> R version 2.11.0 Patched (2010-04-24 r51813)
>>>>>>> i686-pc-linux-gnu
>>>>>>>
>>>>>>> locale:
>>>>>>>   [1] LC_CTYPE=en_GB.utf8       LC_NUMERIC=C
>>>>>>>   [3] LC_TIME=en_GB.utf8        LC_COLLATE=en_GB.utf8
>>>>>>>   [5] LC_MONETARY=C             LC_MESSAGES=en_GB.utf8
>>>>>>>   [7] LC_PAPER=en_GB.utf8       LC_NAME=C
>>>>>>>   [9] LC_ADDRESS=C              LC_TELEPHONE=C
>>>>>>> [11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C
>>>>>>>
>>>>>>> attached base packages:
>>>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>>>
>>>>>>> other attached packages:
>>>>>>>   [1] oligo_1.12.0                AffyCompatible_1.8.0
>>>>>>>   [3] RCurl_1.4-1                 bitops_1.0-4.1
>>>>>>>   [5] XML_2.8-1                   oligoClasses_1.10.0
>>>>>>>   [7] limma_3.4.0                 hugene10stv1cdf_2.6.0
>>>>>>>   [9] hugene10stprobeset.db_5.0.1 org.Hs.eg.db_2.4.1
>>>>>>> [11] RSQLite_0.8-4               DBI_0.2-5
>>>>>>> [13] AnnotationDbi_1.10.0        affxparser_1.20.0
>>>>>>> [15] affy_1.26.0                 Biobase_2.8.0
>>>>>>>
>>>>>>> loaded via a namespace (and not attached):
>>>>>>> [1] affyio_1.16.0         Biostrings_2.16.0     IRanges_1.6.0
>>>>>>> [4] preprocessCore_1.10.0 splines_2.11.0        tcltk_2.11.0
>>>>>>> [7] tools_2.11.0
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>>
>>>>>>> Laurent
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioconductor mailing list
>>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>> Search the archives:
>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
>