[BioC] package pair "hugene10stv1cdf"/"hugene10stprobeset.db"

Laurent Gautier laurent at cbs.dtu.dk
Mon May 10 08:37:33 CEST 2010


Hi Marc,

Affymetrix possibly changing the way features are described might not be 
the only source of confusion.

Using "oligo" does not appear to make things much better, as the 
information that can be obtained after running "rma()" is:

 > eset@annotation
[1] "pd.hugene.1.0.st.v1"

Is this the "probeset" version ? Is this the "transcript cluster" 
version ? Obviously this is of utmost importance (as the probe-level 
summarization step will use one given grouping).

Going for a fishing expedition seems a bit awkward:

 > summary(featureNames(eset) %in% Lkeys(hugene10sttranscriptclusterSYMBOL))
    Mode   FALSE    TRUE    NA's
logical      40   33257       0
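
A slightly less ad hoc version of that check would compare the feature 
names against the keys of every candidate annotation package at once. 
The following is only a minimal sketch in base R: the helper name 
"overlap_report" is made up for illustration, and the toy ID vectors 
stand in for featureNames(eset) and for the Lkeys() of the respective 
SYMBOL maps.

```r
# Hypothetical helper (not part of oligo or AnnotationDbi):
# for each candidate key set, return the fraction of feature IDs found in it.
overlap_report <- function(ids, candidates) {
  vapply(candidates, function(keys) mean(ids %in% keys), numeric(1))
}

# Toy stand-ins; with real data one would use featureNames(eset) and
# Lkeys() of the transcript cluster and probeset SYMBOL maps.
ids <- c("7892501", "7892502", "7892503", "7892504")
candidates <- list(
  transcriptcluster = c("7892501", "7892502", "7892503", "9999999"),
  probeset          = c("7892504", "1111111")
)

res <- overlap_report(ids, candidates)
print(res)

# Name of the candidate with the largest overlap
best <- names(which.max(res))
print(best)
```

This only guesses the grouping from the overlap; it does not replace 
having the grouping stated explicitly in the annotation slot.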

As well, with the annotation (finally) becoming very fluid, does 
shipping any probe grouping without an associated annotation make any 
sense ?



Laurent


On 04/05/10 03:04, Marc Carlson wrote:
> Hi Laurent,
>
> The really confusing thing about the HuGene chip from Affymetrix is that
> they changed the way they were describing their features mid-stream.  So
> now people who work with this have to be mindful of how the probes have
> been grouped ("probesets" or "transcript clusters"?).  Arthur Li has
> been kind enough to furnish the project with both kinds of package as an
> option which is why I noticed what I did earlier about the transcript
> cluster version of the package.  But the fact that Affymetrix have
> abandoned support for their cdf file also creates a unique problem
> for us.  I agree with Jim that people should arguably be using oligo
> rather than affy for analyzing this kind of chip.  But I also agree with
> you that a friendly warning would be a great idea for this one
> particular package.
>
>
>    Marc
>
>
>
>
> On 05/03/2010 03:00 PM, Laurent Gautier wrote:
>    
>> Hi Marc,
>>
>> What I am reading translates into very little confidence in anything
>> related to hugene 1.0ST in the bioconductor "affy" pipeline, and I
>> really think that it should be more difficult to use it without going
>> through steps that require one to explicitly see that this is
>> untested/not recommended/unsafe. The CDF seems to be of uncertain
>> quality to all, yet provided by bioconductor, and a warning message /
>> recommendation to switch to oligo when attaching the package would be
>> helpful, I think.
>>
>> Best,
>>
>>
>> Laurent
>>
>>
>>
>> On 5/3/10 7:07 PM, Marc Carlson wrote:
>>      
>>> Hi Laurent,
>>>
>>> Further complicating things, the hugene10stprobeset.db package was a
>>> contributed package.  From the DESCRIPTION file you can see that it was
>>> contributed by Arthur Li.  You might want to ask him for more details
>>> about this package and also about the hugene10sttranscriptcluster.db
>>> package.  Because I note that for the hugene10sttranscriptcluster.db
>>> package I get the following:
>>>
>>>
>>> summary(Lkeys(hugene10sttranscriptclusterSYMBOL) %in%
>>> ls(hugene10stv1cdf))
>>>
>>>      Mode   FALSE    TRUE    NA's
>>>      logical     962   32295       0
>>>
>>> summary(ls(hugene10stv1cdf) %in%
>>> Lkeys(hugene10sttranscriptclusterSYMBOL))
>>>
>>>      Mode   FALSE    TRUE    NA's
>>>      logical      26   32295       0
>>>
>>>
>>> And this looks like a closer match for what you are doing (considering
>>> that we don't have a properly supported cdf file in this case).
>>>
>>> Hope this helps,
>>>
>>>
>>>     Marc
>>>
>>>
>>>
>>> On 05/03/2010 09:28 AM, Laurent Gautier wrote:
>>>
>>>        
>>>> Hi James,
>>>>
>>>> Thanks for the clarifications. I am happy to see that Affymetrix has
>>>> picked up the concept of alternative CDF definitions and makes it
>>>> easier for its users.
>>>>
>>>> Regarding bioconductor, wouldn't it make sense to either mark packages
>>>> as "unsupported", or better take them to a different location, making
>>>> their download by the unaware less likely. In the present case should
>>>> the CDF be placed outside of the main repository ?
>>>>
>>>> In addition, wouldn't it make sense to coordinate the release of
>>>> probe/probeset mapping structures and annotation files (I am
>>>> reading below that the annotation is for revision 5 while the
>>>> mapping is for revision 4) ?
>>>> What about making the revision number a documented _non-exported_
>>>> vector in the packages ?
>>>> This way one could do for example:
>>>>
>>>>          
>>>>> hugene10stprobeset:::revision
>>>>>
>>>>>            
>>>> [1] "r5"
>>>> (keeping the vector non-exported circumvents the issue of a scope
>>>> pollution whenever different packages with a variable "revision" are
>>>> in the search path).
>>>>
>>>> Best,
>>>>
>>>>
>>>> Laurent
>>>>
>>>>
>>>>
>>>> On 03/05/10 17:05, James W. MacDonald wrote:
>>>>
>>>>          
>>>>> Hi Laurent,
>>>>>
>>>>> Laurent Gautier wrote:
>>>>>
>>>>>            
>>>>>> Dear List,
>>>>>>
>>>>>> I am noting potential issues in the package pair
>>>>>> "hugene10stv1cdf"/"hugene10stprobeset.db", as the respective sets of
>>>>>> probe set IDs are not overlapping:
>>>>>>
>>>>>>
>>>>>>              
>>>>>>> library(hugene10stv1cdf)
>>>>>>> library(hugene10stprobeset.db)
>>>>>>> summary(ls(hugene10stv1cdf) %in% Lkeys(hugene10stprobesetSYMBOL))
>>>>>>>
>>>>>>>                
>>>>>>      Mode   FALSE    TRUE    NA's
>>>>>> logical   28026    4295       0
>>>>>>
>>>>>>              
>>>>>>> summary(Lkeys(hugene10stprobesetSYMBOL) %in% ls(hugene10stv1cdf))
>>>>>>>
>>>>>>>                
>>>>>>      Mode   FALSE    TRUE    NA's
>>>>>> logical  252727    4295       0
>>>>>>
>>>>>> Reading closely, one can observe that "hugene10stprobeset.db" refers
>>>>>> to a "revision 5" while the "v1" in "hugene10stv1cdf" suggests a
>>>>>> revision 1. It is unclear to me whether this is linked to the
>>>>>> problem, but if so then there is neither a hugene10stv5cdf nor
>>>>>> an annotation for v1.
>>>>>>
>>>>>>              
>>>>> It's hard to say what the 'revision 5' refers to. There is only one
>>>>> HuGene chip, and it is the version 1. There _have_ been nine versions
>>>>> of the annotation file released by Affy (Releases 22-30), so there is
>>>>> no telling what 'revision 5' refers to. But certainly it doesn't
>>>>> refer to a HuGene-1_0-st-v5 chip, as no such thing exists.
>>>>>
>>>>> I have a personal thesis that the Exon and Gene chips contain all
>>>>> manner of extra sequences that Affy threw on there so they wouldn't
>>>>> have the same problem they had with their 3'-biased chips. Namely
>>>>> that the chips were out-of-date the minute they finished the first
>>>>> production run because the annotations are so fluid. Now they can
>>>>> simply take the original 32K probesets and slice-n-dice them at will
>>>>> to make things that  match up with the genome as we know it now.
>>>>>
>>>>> But back to the point at hand. The problem with the hugene10stv1cdf
>>>>> is it is based on the _unsupported_ cdf file that Affy makes
>>>>> available. We make it available as well, for those who insist on
>>>>> using the makecdfenv/affy pipeline, rather than the
>>>>> pdInfoBuilder/oligo pipeline, which is what one should arguably be
>>>>> using. Given that the data being used to create the cdf package is
>>>>> specifically unsupported, caveat emptor.
>>>>>
>>>>> I note that the supported library files do contain an 'r4' in the
>>>>> file name, so assume without any backing data that this library would
>>>>> actually hew more closely to the annotation data they supply.
>>>>>
>>>>> Best,
>>>>>
>>>>> Jim
>>>>>
>>>>>
>>>>>
>>>>>            
>>>>>> The obligatory sessionInfo() is:
>>>>>>
>>>>>>
>>>>>>              
>>>>>>> sessionInfo()
>>>>>>>
>>>>>>>                
>>>>>> R version 2.11.0 Patched (2010-04-24 r51813)
>>>>>> i686-pc-linux-gnu
>>>>>>
>>>>>> locale:
>>>>>>    [1] LC_CTYPE=en_GB.utf8       LC_NUMERIC=C
>>>>>>    [3] LC_TIME=en_GB.utf8        LC_COLLATE=en_GB.utf8
>>>>>>    [5] LC_MONETARY=C             LC_MESSAGES=en_GB.utf8
>>>>>>    [7] LC_PAPER=en_GB.utf8       LC_NAME=C
>>>>>>    [9] LC_ADDRESS=C              LC_TELEPHONE=C
>>>>>> [11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>>
>>>>>> other attached packages:
>>>>>>    [1] oligo_1.12.0                AffyCompatible_1.8.0
>>>>>>    [3] RCurl_1.4-1                 bitops_1.0-4.1
>>>>>>    [5] XML_2.8-1                   oligoClasses_1.10.0
>>>>>>    [7] limma_3.4.0                 hugene10stv1cdf_2.6.0
>>>>>>    [9] hugene10stprobeset.db_5.0.1 org.Hs.eg.db_2.4.1
>>>>>> [11] RSQLite_0.8-4               DBI_0.2-5
>>>>>> [13] AnnotationDbi_1.10.0        affxparser_1.20.0
>>>>>> [15] affy_1.26.0                 Biobase_2.8.0
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] affyio_1.16.0         Biostrings_2.16.0     IRanges_1.6.0
>>>>>> [4] preprocessCore_1.10.0 splines_2.11.0        tcltk_2.11.0
>>>>>> [7] tools_2.11.0
>>>>>>
>>>>>>              
>>>>>>>
>>>>>>>                
>>>>>> Best,
>>>>>>
>>>>>>
>>>>>> Laurent
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>>>              
>>>>>
>>>>>            
>>>>
>>>>
>>>>          
>>>
>>>        
>>
>>
>>      
>
>


