[BioC] How map probeset_id to gene_symbols or other annotation information?

Mon Aug 10 20:52:48 CEST 2009

Hi Peng,

It seems that I have to apologize for giving you a poor example.  The
error you got here is from this particular package being an extreme
case.  I will find a way to patch that for the next release, but I
seriously doubt that you will see it in anything other than the toy
example provided in the manual page here.  It is ultimately caused by
the huge amount of data in the mogene10stprobeset database, so much data
in fact, that we will have to change the way that we are querying the
underlying database when we display the results.  So thanks for
reminding me that I need to patch this up.  :)  However, in practical
usage, it is quite doubtful that you will ever run into this error
unless you are in the habit of routinely looking up more than 150,000
keys at once, so you should not let this issue scare you away.  You can
make the following example run again by simply changing the number of
mappedkeys sought to be something less than every single possible key at
once.  You can do it just like this:

    x <- mogene10stprobesetENTREZID
    # Get the probe identifiers that are mapped to an Entrez Gene ID.
    # Notice that I am reducing the number of keys down to just the
    # first 150,000 at the end of this step.
    mapped_probes <- mappedkeys(x)[1:150000]
    # Convert to a list
    xx <- as.list(x[mapped_probes])
    if (length(xx) > 0) {
      # Get the ENTREZID for the first five probes
      xx[1:5]
      # Get the first one
      xx[[1]]
    }
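
If you really do need every key, one possible workaround is to do the
lookup in several smaller batches and join the pieces afterwards.  The
sketch below is only an illustration: lookup_in_chunks() is a
hypothetical helper (it is not part of AnnotationDbi), the default batch
size is an arbitrary choice, and a plain named list stands in for the
map so that the code runs without any annotation package installed.

```r
# Hypothetical helper, not part of AnnotationDbi: look up keys in batches
# and concatenate the per-batch results into one list.
lookup_in_chunks <- function(keys, lookup, chunk_size = 50000L) {
  # Assign each key a batch number: 1, 1, ..., 2, 2, ...
  batch <- ceiling(seq_along(keys) / chunk_size)
  # Run the lookup once per batch
  pieces <- lapply(split(keys, batch), lookup)
  # Drop the batch labels so the result is named by key only
  do.call(c, unname(pieces))
}

# Toy stand-in for the real map (made-up probe names and IDs)
toy_map <- list(p1 = "101", p2 = "102", p3 = c("103", "104"),
                p4 = "105", p5 = "106")

res <- lookup_in_chunks(names(toy_map), function(k) toy_map[k],
                        chunk_size = 2L)
str(res)

# With the annotation package loaded, the same idea would look like this
# (an untested assumption on my part):
#   x  <- mogene10stprobesetENTREZID
#   xx <- lookup_in_chunks(mappedkeys(x), function(k) as.list(x[k]))
```

Each batch stays well below the number of keys that triggered the error,
and the combined list has the same shape as what as.list() returns for a
single lookup.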

To answer your other questions, a vignette is a document that gives an
overview of what a package is for, with examples.  It differs from the
manual pages, which are more terse and usually document lower-level
information such as the arguments a method takes.  If you want an even
more general description of what some common Bioconductor packages do,
you can look at our common workflows here:

http://www.bioconductor.org/docs/workflows/index.html

Most vignettes will be named in a way that clearly indicates what
package they refer to, but you can always tell because their source code
will always be included in the package source and their pdf files will
always be on the website pages that correspond to the packages.  You can
see some examples of those in here:

http://www.bioconductor.org/download/
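
From within R you can also list the vignettes that belong to a single
package, which may help when openVignette() offers a long menu.  This is
a small sketch using vignette() from the utils package; "grid" is used
here only because it ships with R, so substitute the package you care
about (for example "AnnotationDbi").  The list may be empty if your
installation does not include the built vignettes.

```r
# List the vignettes that belong to one installed package.
# "grid" is just a placeholder that is available in every R installation.
v <- vignette(package = "grid")

# The result holds a matrix with one row per vignette
v$results[, c("Item", "Title"), drop = FALSE]

# To open one of them by name (opens a viewer, so left commented out):
#   vignette("grid", package = "grid")
```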

Finally, I will always prefer that you use the annotation packages,
since that is why we provide them.  We spend a lot of effort maintaining
them and making sure that they are useful and updated twice a year.  All
of the data in them is synchronized at each release, so you can safely
cross-compare things like GO terms with the GO IDs that are associated
with the probes you are using.  The packages are also versioned and
synchronized to go with a particular release of Bioconductor, which
should help you keep your results reproducible.  If you are using
biocLite() then all of this should be "matched up" for you.
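
If you want to record which versions you actually used, something along
these lines should work.  It is only a sketch: sessionInfo() and
packageVersion() are standard R functions, and "stats" is a placeholder
for whichever package you want to check.

```r
# Record the R version and the loaded packages alongside your results
si <- sessionInfo()
print(si)

# The R version string on its own
si$R.version$version.string

# Version of a single installed package ("stats" is only a placeholder;
# use e.g. "mogene10stprobeset.db" once that package is installed)
pv <- packageVersion("stats")
pv
```

Saving this output with your analysis makes it easier to tell later
which release of the annotation data a result came from.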

  Marc

Peng Yu wrote:
> On Mon, Aug 10, 2009 at 11:52 AM, Marc Carlson<mcarlson at fhcrc.org> wrote:
>   
>> Hi Peng,
>>
>> There is in fact a lot of documentation inside of each package if you
>> know how to look for it.  One form is in the form of manual pages which
>> can be listed like this example:
>>
>> ls("package:mogene10stprobeset.db")
>>
>> And then you can read the manual pages by typing ? followed by the name
>> of the object you want to know about like this example:
>>
>> ?mogene10stprobesetENTREZID
>>
>> Finally, almost every Bioconductor package has some sort of vignette that
>> is associated with it.  In the case of the annotation packages, there
>> are three vignettes loaded with AnnotationDbi (which will always be
>> loaded before any annotation package, so they will always be there if
>> you look).  You can load a vignette by using the openVignette() command
>> like this:
>>
>> openVignette()
>>
>> And then just pick the number for the vignette that you would like to
>> read.  Reading the vignette will give a much more comprehensive overview
>> of the purpose of the package with even more examples than the manual
>> pages.  Both of these resources are critical if you want to be able to
>> use R.  I would recommend that you look at these in addition to reading
>> that R user manual that was mentioned before.
>>
>> With respect to the annotation packages, they are not simply a repeat of
>> what is in the csv files from Affymetrix.  In fact, we don't actually
>> even know where Affymetrix gets the data in those files from, nor do we
>> use most of that data in those files in building the annotation
>> packages.  Instead we go direct to the source whenever possible and get
>> most of our information from places like NCBI, the EBI etc.  The only
>> information that we get from Affymetrix is the basic probe to gene
>> mapping data (in the form of probe to entrez gene, genbank accession
>> etc.) which we then map onto the information from primary sources such
>> as NCBI etc. in order to tie the other data to the probes.  You are free
>> of course to use whichever information source you prefer, but please be
>> advised that they are probably not equivalent.
>>     
>
> Hi Marc,
>
> I ran the following example shown in ?mogene10stprobesetENTREZID. It
> doesn't provide a very meaningful error message (at the end of this
> message). Do you know what the problem might be?
>
> I also ran the following code, but I don't quite understand what the
> word 'vignette' means. Especially, what does it mean in R? Is a
> 'vignette' package documentation? Another problem is how to wisely
> choose the most relevant vignette if it shows 10 vignettes?
>
>   
>> library(mogene10stprobeset.db)
>> openVignette()
>>     
> Please select a vignette:
>
>  1: AnnotationDbi - AnnotationDbi
>  2: AnnotationDbi - Creating probe packages
>  3: AnnotationDbi - SQLForge
>  4: Biobase - An introduction to Biobase and ExpressionSets
>  5: Biobase - Bioconductor Overview
>  6: Biobase - esApply Introduction
>  7: Biobase - Notes for eSet developers
>  8: Biobase - Notes for writing introductory 'how to' documents
>  9: Biobase - quick views of eSet instances
> 10: DBI - A Common Database Interface (DBI)
>
> Based on your last advice, most of the time, it is better to use the
> annotation package rather than the affymetrix csv files, right?
>
> Regards,
> Peng
>
> $ Rscript run.R
>   
>> library(mogene10stprobeset.db)
>>     
> Loading required package: methods
> Loading required package: AnnotationDbi
> Loading required package: Biobase
>
> Welcome to Bioconductor
>
>   Vignettes contain introductory material. To view, type
>   'openVignette()'. To cite Bioconductor, see
>   'citation("Biobase")' and for packages 'citation(pkgname)'.
>
> Loading required package: DBI
>   
>> x <- mogene10stprobesetENTREZID
>> # Get the probe identifiers that are mapped to an ENTREZ Gene ID
>> mapped_probes <- mappedkeys(x)
>> # Convert to a list
>> xx <- as.list(x[mapped_probes])
>>     
> Error in sqliteExecStatement(con, statement, bind.data) :
>   RS-DBI driver: (error in statement: String or BLOB exceeded size limit)
> Calls: as.list ... dbGetQuery -> sqliteQuickSQL -> sqliteExecStatement -> .Call
> Execution halted