[BioC] package pair "hugene10stv1cdf"/"hugene10stprobeset.db"

Tue May 11 10:30:05 CEST 2010

Hey Laurent,

the reason I chose to use probeset, core, full and extended is because
that's how Affymetrix describes their meta-probeset files. But, of
course, I'm open to suggestions to improve the package.
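
Until the grouping is recorded in the result itself, the overlap check that Laurent describes further down the thread can be wrapped in a small helper. This is only a sketch under stated assumptions: `overlap_fraction` is a hypothetical name, and the ID vectors are made up for illustration. With real data the IDs would come from featureNames(eset) and from Lkeys() on the annotation maps, as in the quoted messages.

```r
# Hypothetical helper (not part of oligo or AnnotationDbi):
# fraction of feature IDs found in each candidate set of annotation keys.
overlap_fraction <- function(ids, key_sets) {
  vapply(key_sets, function(keys) mean(ids %in% keys), numeric(1))
}

# Toy IDs, made up for illustration only.
feature_ids <- c("7892501", "7892502", "7892503", "7892504")
key_sets <- list(
  transcriptcluster = c("7892501", "7892502", "7892503", "7892504", "7892505"),
  probeset          = c("7892501", "9000001", "9000002")
)

fractions <- overlap_fraction(feature_ids, key_sets)
print(fractions)

# The candidate with the highest overlap is the most plausible grouping.
best_match <- names(which.max(fractions))
print(best_match)
```

A high overlap only suggests which grouping was used; it does not replace recording the target in the returned object.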

b

On Tue, May 11, 2010 at 7:37 AM, Laurent Gautier <laurent at cbs.dtu.dk> wrote:
> Hi Benilton,
>
> Thanks for the information. I did miss the named argument "target" at the
> end of the signature for rma().
>
> However, one note about the documentation: now that I know of the existence of
> "target", I am still unable to infer from the help
> file for rma() that target="core" returns "transcript cluster". Would more
> explicit terms be clearer ?
>
> For example, the Affymetrix documentation refers to "probe groups" and under
> the following terms:
> """
> Probe Group - A generic term for any grouping of related GeneChip® array
> probes from the array design. On Exon Arrays, a probe group can be a probe
> set, exon cluster, or transcript cluster. On Gene Arrays, the only kind
> of probe group is the transcript cluster. NetAffx detail pages are provided
> for all probe groups of each type for Exon and Gene Arrays.
> """
>
> What about a parameter 'probe_group = c("probe set", "exon cluster",
> "transcript cluster")' ?
>
>
> Also, wouldn't the propagation of which "probe group" was used to the
> resulting expression set be helpful to end-users ?
>
>
> L.
>
>
> On 5/10/10 11:13 AM, Benilton Carvalho wrote:
>
> Hi Laurent,
>
> The help file for rma() in oligo describes that the default value for
> target is "core". Therefore, you get the "transcript cluster" version by default.
>
> If you call rma() using target="probeset", you'll get the probeset version.
>
> Best,
>
> b
>
> On Mon, May 10, 2010 at 7:37 AM, Laurent Gautier <laurent at cbs.dtu.dk> wrote:
>
>
> Hi Marc,
>
> Affymetrix possibly changing the way features are described might not be the
> only source of confusion.
>
> Using "oligo" does not appear to make things much better, as the information
> that can be obtained after running "rma()" is:
>
>
>
> eset@annotation
>
>
> [1] "pd.hugene.1.0.st.v1"
>
> Is this the "probeset" version ? Is this the "transcript cluster" version ?
> Obviously this is of utmost importance (as the probe-level summarization
> step will use one given grouping).
>
> Going for a fishing expedition seems a bit awkward:
>
>
>
> summary(featureNames(eset) %in% Lkeys(hugene10sttranscriptclusterSYMBOL))
>
>
>   Mode   FALSE    TRUE    NA's
> logical      40   33257       0
>
> As well, with the annotation (finally) becoming very fluid, does shipping
> any probe grouping without an associated annotation make any sense ?
>
>
>
> Laurent
>
>
> On 04/05/10 03:04, Marc Carlson wrote:
>
>
> Hi Laurent,
>
> The really confusing thing about the HuGene chip from Affymetrix is that
> they changed the way they were describing their features mid-stream.  So
> now people who work with this have to be mindful of how the probes have
> been grouped ("probesets" or "transcript clusters"?).  Arthur Li has
> been kind enough to furnish the project with both kinds of package as an
> option which is why I noticed what I did earlier about the transcript
> cluster version of the package.  But the fact that Affymetrix have
> abandoned support for their cdf file also creates a unique problem
> for us.  I agree with Jim that people should arguably be using oligo
> rather than affy for analyzing this kind of chip.  But I also agree with
> you that a friendly warning would be a great idea for this one
> particular package.
>
>
>   Marc
>
>
>
>
> On 05/03/2010 03:00 PM, Laurent Gautier wrote:
>
>
>
> Hi Marc,
>
> What I am reading translates into very little confidence in anything
> related to hugene 1.0ST in the bioconductor "affy" pipeline, and I
> really think that it should be more difficult to use it without going
> through steps that require one to explicitly see that this is
> untested/not recommended/unsafe. The CDF seems to be of uncertain
> quality to all, yet provided by bioconductor, and a warning message /
> recommendation to switch to oligo when attaching the package would be
> helpful, I think.
>
> Best,
>
>
> Laurent
>
>
>
> On 5/3/10 7:07 PM, Marc Carlson wrote:
>
>
>
> Hi Laurent,
>
> Further complicating things, the hugene10stprobeset.db package was a
> contributed package.  From the DESCRIPTION file you can see that it was
> contributed by Arthur Li.  You might want to ask him for more details
> about this package and also about the hugene10sttranscriptcluster.db
> package.  Because I note that for the hugene10sttranscriptcluster.db
> package I get the following:
>
>
> summary(Lkeys(hugene10sttranscriptclusterSYMBOL) %in%
> ls(hugene10stv1cdf))
>
>     Mode   FALSE    TRUE    NA's
>     logical     962   32295       0
>
> summary(ls(hugene10stv1cdf) %in%
> Lkeys(hugene10sttranscriptclusterSYMBOL))
>
>     Mode   FALSE    TRUE    NA's
>     logical      26   32295       0
>
>
> And this looks like a closer match for what you are doing (considering
> that we don't have a properly supported cdf file in this case).
>
> Hope this helps,
>
>
>    Marc
>
>
>
> On 05/03/2010 09:28 AM, Laurent Gautier wrote:
>
>
>
>
> Hi James,
>
> Thanks for the clarifications. I am happy to see that Affymetrix has
> picked up the concept of alternative CDF definitions and makes it
> easier for its users.
>
> Regarding bioconductor, wouldn't it make sense to either mark packages
> as "unsupported", or better take them to a different location, making
> their download by the unaware less likely. In the present case should
> the CDF be placed outside of the main repository ?
>
> In addition, wouldn't it make sense to coordinate the release
> of probe/probeset mapping structures and annotation files (I
> am reading below that there is annotation for revision 5 while the
> mapping is for revision 4) ?
> What about making the revision number a documented _non-exported_
> vector in the packages ?
> This way one could do for example:
>
>
>
>
> hugene10stprobeset:::revision
>
>
>
>
> [1] "r5"
> (keeping the vector non-exported circumvents the issue of a scope
> pollution whenever different packages with a variable "revision" are
> in the search path).
>
> Best,
>
>
> Laurent
>
>
>
> On 03/05/10 17:05, James W. MacDonald wrote:
>
>
>
>
> Hi Laurent,
>
> Laurent Gautier wrote:
>
>
>
>
> Dear List,
>
> I am noting potential issues in the package pair
> "hugene10stv1cdf"/"hugene10stprobeset.db", as the respective sets of
> probe set IDs are not overlapping:
>
>
>
>
>
> library(hugene10stv1cdf)
> library(hugene10stprobeset.db)
> summary(ls(hugene10stv1cdf) %in% Lkeys(hugene10stprobesetSYMBOL))
>
>
>
>
>     Mode   FALSE    TRUE    NA's
> logical   28026    4295       0
>
>
>
>
> summary(Lkeys(hugene10stprobesetSYMBOL) %in% ls(hugene10stv1cdf))
>
>
>
>
>     Mode   FALSE    TRUE    NA's
> logical  252727    4295       0
>
> Reading closely, one can observe that "hugene10stprobeset.db" refers
> to a "revision 5" while the "v1" in "hugene10stv1cdf" suggests a
> revision 1. It is unclear to me whether this is linked to the
> problem, but if so then there is no hugene10stv5cdf, nor any
> annotation for v1.
>
>
>
>
> It's hard to say what the 'revision 5' refers to. There is only one
> HuGene chip, and it is the version 1. There _have_ been nine versions
> of the annotation file released by Affy (Releases 22-30), so there is
> no telling what 'revision 5' refers to. But certainly it doesn't
> refer to a HuGene-1_0-st-v5 chip, as no such thing exists.
>
> I have a personal thesis that the Exon and Gene chips contain all
> manner of extra sequences that Affy threw on there so they wouldn't
> have the same problem they had with their 3'-biased chips. Namely
> that the chips were out-of-date the minute they finished the first
> production run because the annotations are so fluid. Now they can
> simply take the original 32K probesets and slice-n-dice them at will
> to make things that  match up with the genome as we know it now.
>
> But back to the point at hand. The problem with the hugene10stv1cdf
> is it is based on the _unsupported_ cdf file that Affy makes
> available. We make it available as well, for those who insist on
> using the makecdfenv/affy pipeline, rather than the
> pdInfoBuilder/oligo pipeline, which is what one should arguably be
> using. Given that the data being used to create the cdf package is
> specifically unsupported, caveat emptor.
>
> I note that the supported library files do contain an 'r4' in the
> file name, so assume without any backing data that this library would
> actually hew more closely to the annotation data they supply.
>
> Best,
>
> Jim
>
>
>
>
>
>
> The obligatory sessionInfo() is:
>
>
>
>
>
> sessionInfo()
>
>
>
>
> R version 2.11.0 Patched (2010-04-24 r51813)
> i686-pc-linux-gnu
>
> locale:
>   [1] LC_CTYPE=en_GB.utf8       LC_NUMERIC=C
>   [3] LC_TIME=en_GB.utf8        LC_COLLATE=en_GB.utf8
>   [5] LC_MONETARY=C             LC_MESSAGES=en_GB.utf8
>   [7] LC_PAPER=en_GB.utf8       LC_NAME=C
>   [9] LC_ADDRESS=C              LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
>   [1] oligo_1.12.0                AffyCompatible_1.8.0
>   [3] RCurl_1.4-1                 bitops_1.0-4.1
>   [5] XML_2.8-1                   oligoClasses_1.10.0
>   [7] limma_3.4.0                 hugene10stv1cdf_2.6.0
>   [9] hugene10stprobeset.db_5.0.1 org.Hs.eg.db_2.4.1
> [11] RSQLite_0.8-4               DBI_0.2-5
> [13] AnnotationDbi_1.10.0        affxparser_1.20.0
> [15] affy_1.26.0                 Biobase_2.8.0
>
> loaded via a namespace (and not attached):
> [1] affyio_1.16.0         Biostrings_2.16.0     IRanges_1.6.0
> [4] preprocessCore_1.10.0 splines_2.11.0        tcltk_2.11.0
> [7] tools_2.11.0
>
>
>
>
>
>
> Best,
>
>
> Laurent
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor