[BioC] find overlaping genes in ENSEMBL gene ID list and NCBI gene ID list

Hervé Pagès hpages at fhcrc.org
Wed Mar 9 00:36:54 CET 2011


Hi Fernando,

In addition to Vince's suggestions, here is an approach where you
actually compute the mapping yourself based on the gene locations.

It uses the GenomicFeatures package to retrieve the transcript and
exon locations for each gene, and then findOverlaps() to find the
overlapping genes. All the information can be downloaded from the
"RefSeq Genes" and "Ensembl Genes" UCSC tracks. The first track is
associated with Entrez Gene IDs (NCBI) and the second track with
Ensembl Gene IDs.

   library(GenomicFeatures)
   txdb1 <- makeTranscriptDbFromUCSC("bosTau4", tablename="refGene")
   txdb2 <- makeTranscriptDbFromUCSC("bosTau4", tablename="ensGene")
   genes1 <- range(transcriptsBy(txdb1, by="gene"))  # 11668 Entrez genes
   genes2 <- range(transcriptsBy(txdb2, by="gene"))  # 25669 Ensembl genes
   mm <- findOverlaps(genes1, genes2)

   ## Entrez genes that don't overlap:
   unmapped_ezids <- names(genes1)[-unique(queryHits(mm))]  # 101 Entrez 
genes

   ## Ensembl genes that don't overlap:
   unmapped_ensids <- names(genes2)[-unique(subjectHits(mm))]  # 12876 
Ensembl genes

   ## Mapping between Entrez genes and Ensembl genes:
   map <- data.frame(ezid=names(genes1)[queryHits(mm)],
                     ensid=names(genes2)[subjectHits(mm)])

As you can see, it's not a 1-to-1 mapping:

   > head(map)
          ezid              ensid
   1 100034674 ENSBTAG00000006026
   2 100036590 ENSBTAG00000038843
   3 100048947 ENSBTAG00000009091
   4 100048949 ENSBTAG00000033312
   5 100101492 ENSBTAG00000032159
   6 100101492 ENSBTAG00000032234

Note that alternatively makeTranscriptDbFromBiomart() could be used to
make 'txdb2' but that means you would first need to make sure that the
current version of Ensembl (Ensembl 61) is also mapping the genes,
transcripts and exons of Bos taurus to bosTau4.

Hope this helps,
H.


On 03/08/2011 11:49 AM, Vincent Carey wrote:
> There are various approaches using Bioconductor.  The fundamental
> resource is the package org.Bt.eg.db, which you can acquire using
> biocLite().
>
> You can find associations between ENSEMBL ids and Entrez ids using
> mappings in that package.
>
> You may find GSEABase useful also.  For example
>
>> dput(ensex)
> c("ENSBTAG00000000005", "ENSBTAG00000000008", "ENSBTAG00000000009",
> "ENSBTAG00000000010", "ENSBTAG00000000011", "ENSBTAG00000000012",
> "ENSBTAG00000000013", "ENSBTAG00000000014", "ENSBTAG00000000015",
> "ENSBTAG00000000016", "ENSBTAG00000000018", "ENSBTAG00000000019",
> "ENSBTAG00000000020", "ENSBTAG00000000021", "ENSBTAG00000000022",
> "ENSBTAG00000000023", "ENSBTAG00000000024", "ENSBTAG00000000025",
> "ENSBTAG00000000026", "ENSBTAG00000000027")
>> e1 = GeneSet(ensex, geneIdType=ENSEMBLIdentifier("org.Bt.eg.db"))
>> e1
> setName: NA
> geneIds: ENSBTAG00000000005, ENSBTAG00000000008, ...,
> ENSBTAG00000000027 (total: 20)
> geneIdType: ENSEMBL (org.Bt.eg.db)
> collectionType: Null
> details: use 'details(object)'
>> g1 = e1
>> geneIdType(g1) = EntrezIdentifier("org.Bt.eg.db")
>> g1
> setName: NA
> geneIds: 282136, 539250, ..., 512788 (total: 21)
> geneIdType: EntrezId (org.Bt.eg.db)
> collectionType: Null
> details: use 'details(object)'
>
> This shows that there are 21 Entrez IDs associated with the 20 ENSEMBL
> ids given above.
> After converting sets to common ID type, you can use intersect,
> setdiff methods to answer some of the
> questions you pose.
>
>
>> sessionInfo()
> R version 2.13.0 Under development (unstable) (2011-03-01 r54628)
> Platform: x86_64-apple-darwin10.4.0/x86_64 (64-bit)
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats     graphics  grDevices datasets  tools     utils     methods
> [8] base
>
> other attached packages:
>   [1] GSEABase_1.13.3       graph_1.29.3          annotate_1.29.3
>   [4] org.Bt.eg.db_2.4.6    RSQLite_0.9-4         DBI_0.2-5
>   [7] AnnotationDbi_1.13.15 Biobase_2.11.9        weaver_1.17.0
> [10] codetools_0.2-8       digest_0.4.2
>
> loaded via a namespace (and not attached):
> [1] Matrix_0.999375-47 XML_3.2-0          grid_2.13.0        lattice_0.19-17
> [5] xtable_1.5-6
>
>
> On Tue, Mar 8, 2011 at 12:13 PM, Biase, Fernando<biase at illinois.edu>  wrote:
>> Hi everyone,
>>
>> I have a list of ENSEMBL gene _IDS  and a list with NCBI gene_IDs. I need to find which  ids correspond to genes in both list (overlapping genes) and each genes are in each one of them but not present in the other list (non-overlapping genes).
>> Can anyone give me some advice on this task? Or indicate a material do read?
>> In case it is relevant, the organism is Bos taurus.
>>
>> Thanks in advance,
>> Fernando
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list