[BioC] mapping through org.Xx.eg.db packages

Thu Oct 6 13:58:19 CEST 2011

On Thu, Oct 6, 2011 at 7:50 AM, Iain Gallagher
<iaingallagher at btopenworld.com> wrote:
> Dear List
>
> I wonder if someone could shed some light on the following.
>
> Given a set of gene symbols I would like to retrieve different identifiers.
>
> Using the org.Xx.eg.db packages I can go about this by mapping through the EntrezIDs:
>
> # mapping through eg ids as package is eg id centric
> library(org.Hs.eg.db)
> syms <- c('ACTB', 'TNF', 'TGFB1')
> egID <- unlist(mget(syms, org.Hs.egSYMBOL2EG, ifnotfound=NA))
> ensID <- unlist(mget(egID, org.Hs.egENSEMBL, ifnotfound=NA))
>
>> ensID
>               60             71241             71242             71243
> "ENSG00000075624" "ENSG00000204490" "ENSG00000206439" "ENSG00000223952"
>            71244             71245             71246             71247
> "ENSG00000228321" "ENSG00000228849" "ENSG00000230108" "ENSG00000232810"
>             7040
> "ENSG00000105329"
>
>> egID
>  ACTB    TNF  TGFB1
>  "60" "7124" "7040"
>
> Now here I assumed that the names of the ensID object were the original EntrezIDs mapped from the symbols, but they are not: unlist() appends a running number to the name of every EntrezID that has more than one match (here 7124 becomes 71241, 71242, etc.).
>
> This has caused me some confusion since each of these names is an actual Entrez ID - just not one I'm interested in.
>
> The same can happen when mapping from any ID that ends in a numeric part (eg Ensembl ids).
>
> It is useful to return a mapping showing the original identifier, the EntrezID mapped through, and the required identifier. How could one reliably do this when mapping through e.g. Entrez IDs as in the method above (i.e. return the Entrez ID and Ensembl ID in one sweep)?
>

Hi, Iain.  Just leave out the "unlist" from your code.

> ensIDList <- mget(egID, org.Hs.egENSEMBL, ifnotfound=NA)
> ensIDList
$`60`
[1] "ENSG00000075624"

$`7124`
[1] "ENSG00000204490" "ENSG00000206439" "ENSG00000223952" "ENSG00000228321"
[5] "ENSG00000228849" "ENSG00000230108" "ENSG00000232810"

$`7040`
[1] "ENSG00000105329"
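
If a flat table of symbol, Entrez ID and Ensembl ID is wanted, one option is to repeat each key once per mapped value instead of calling unlist() with names. The following is only a sketch: ensIDList is rebuilt by hand from the output above so that it runs without the annotation package, idMap is a name made up for illustration, and it assumes that each symbol maps to exactly one Entrez ID and that the list is in the same order as egID.

```r
# Toy stand-in for mget(egID, org.Hs.egENSEMBL, ifnotfound=NA);
# the values are copied from the output shown above.
egID <- c(ACTB = "60", TNF = "7124", TGFB1 = "7040")
ensIDList <- list(
  "60"   = "ENSG00000075624",
  "7124" = c("ENSG00000204490", "ENSG00000206439", "ENSG00000223952",
             "ENSG00000228321", "ENSG00000228849", "ENSG00000230108",
             "ENSG00000232810"),
  "7040" = "ENSG00000105329"
)

# Number of Ensembl IDs mapped to each Entrez ID
n <- vapply(ensIDList, length, integer(1))

# Repeat each key once per mapped value, so no name has to be made unique.
# Assumes one Entrez ID per symbol and the same ordering in egID and ensIDList.
idMap <- data.frame(
  symbol  = rep(names(egID), times = n),
  entrez  = rep(names(ensIDList), times = n),
  ensembl = unlist(ensIDList, use.names = FALSE),
  stringsAsFactors = FALSE
)
idMap
```

With the real data, ensIDList would simply be the object returned by the mget() call above.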

Hope that helps.

Sean

> I have tried using the SQL approach:
>
> dbCon <- org.Hs.eg_dbconn()
> sqlQuery <- 'SELECT * FROM genes, gene_info, ensembl WHERE genes._id = gene_info._id = ensembl._id;'
> result <- dbGetQuery(dbCon, sqlQuery)
>
> where one could filter the 'result' object with the symbols of interest but this query takes a long time to run. I know little SQL so that might be an issue!
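
A possible reason the query is slow: as far as I know, SQLite reads a = b = c as (a = b) = c, so the WHERE clause compares a 0/1 result with ensembl._id instead of joining the three tables on _id, and the tables end up cross-joined. Below is a sketch with explicit joins and the symbol filter moved into the SQL. It runs against a small in-memory toy database, not the real one; the table and column names (genes.gene_id, gene_info.symbol, ensembl.ensembl_id, all keyed on _id) are my assumption about the org.Hs.eg.db schema and should be checked against the actual tables.

```r
library(DBI)
library(RSQLite)

# Toy in-memory database. Table and column names are ASSUMED to mirror
# the org.Hs.eg.db schema; the rows are a hand-made subset for illustration.
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "genes",
             data.frame(`_id` = 1:4, gene_id = c("60", "7124", "7040", "1"),
                        check.names = FALSE, stringsAsFactors = FALSE))
dbWriteTable(con, "gene_info",
             data.frame(`_id` = 1:4, symbol = c("ACTB", "TNF", "TGFB1", "A1BG"),
                        check.names = FALSE, stringsAsFactors = FALSE))
dbWriteTable(con, "ensembl",
             data.frame(`_id` = c(1, 2, 2, 3, 4),
                        ensembl_id = c("ENSG00000075624", "ENSG00000204490",
                                       "ENSG00000206439", "ENSG00000105329",
                                       "ENSG00000121410"),
                        check.names = FALSE, stringsAsFactors = FALSE))

# Each join condition is written out separately; the IN clause restricts
# the result to the symbols of interest before anything is returned to R.
sqlQuery <- "
  SELECT gi.symbol, g.gene_id, e.ensembl_id
  FROM genes AS g
  JOIN gene_info AS gi ON gi._id = g._id
  JOIN ensembl   AS e  ON e._id  = g._id
  WHERE gi.symbol IN ('ACTB', 'TNF', 'TGFB1')
  ORDER BY g._id, e.ensembl_id;"
result <- dbGetQuery(con, sqlQuery)
dbDisconnect(con)
result
```

If the schema assumption holds, the same query string could be passed to dbGetQuery() with the connection from org.Hs.eg_dbconn(); that connection belongs to the package and should not be disconnected by hand.
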
>
> Best
>
> iain
>
>> sessionInfo()
> R version 2.13.2 (2011-09-30)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_GB.utf8       LC_NUMERIC=C
>  [3] LC_TIME=en_GB.utf8        LC_COLLATE=en_GB.utf8
>  [5] LC_MONETARY=C             LC_MESSAGES=en_GB.utf8
>  [7] LC_PAPER=en_GB.utf8       LC_NAME=C
>  [9] LC_ADDRESS=C              LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] org.Hs.eg.db_2.4.6   RSQLite_0.9-4        DBI_0.2-5
> [4] AnnotationDbi_1.14.1 Biobase_2.10.0