[BioC] find overlaping genes in ENSEMBL gene ID list and NCBI gene ID list

Hervé Pagès hpages at fhcrc.org
Wed Mar 9 00:48:36 CET 2011


Fernando,

I forgot to mention that the code below works fine and is reasonably
fast with BioC 2.8 (current devel). However, with BioC 2.7 (current
release) it might not run or you might run into some performance issues
since some of the methods operating on GRangesList objects (like
range()) didn't exist of were too slow.

Cheers,
H.


On 03/08/2011 03:36 PM, Hervé Pagès wrote:
> Hi Fernando,
>
> In addition to Vince's suggestions, here is an approach where you
> actually compute the mapping yourself based on the gene locations.
>
> It uses the GenomicFeatures package to retrieve the transcript and
> exon locations for each gene, and then findOverlaps() to find the
> overlapping genes. All the information can be downloaded from the
> "RefSeq Genes" and "Ensembl Genes" UCSC tracks. The first track is
> associated with Entrez Gene IDs (NCBI) and the second track with
> Ensembl Gene IDs.
>
> library(GenomicFeatures)
> txdb1 <- makeTranscriptDbFromUCSC("bosTau4", tablename="refGene")
> txdb2 <- makeTranscriptDbFromUCSC("bosTau4", tablename="ensGene")
> genes1 <- range(transcriptsBy(txdb1, by="gene")) # 11668 Entrez genes
> genes2 <- range(transcriptsBy(txdb2, by="gene")) # 25669 Ensembl genes
> mm <- findOverlaps(genes1, genes2)
>
> ## Entrez genes that don't overlap:
> unmapped_ezids <- names(genes1)[-unique(queryHits(mm))] # 101 Entrez genes
>
> ## Ensembl genes that don't overlap:
> unmapped_ensids <- names(genes2)[-unique(subjectHits(mm))] # 12876
> Ensembl genes
>
> ## Mapping between Entrez genes and Ensembl genes:
> map <- data.frame(ezid=names(genes1)[queryHits(mm)],
> ensid=names(genes2)[subjectHits(mm)])
>
> As you can see, it's not a 1-to-1 mapping:
>
>  > head(map)
> ezid ensid
> 1 100034674 ENSBTAG00000006026
> 2 100036590 ENSBTAG00000038843
> 3 100048947 ENSBTAG00000009091
> 4 100048949 ENSBTAG00000033312
> 5 100101492 ENSBTAG00000032159
> 6 100101492 ENSBTAG00000032234
>
> Note that alternatively makeTranscriptDbFromBiomart() could be used to
> make 'txdb2' but that means you would first need to make sure that the
> current version of Ensembl (Ensembl 61) is also mapping the genes,
> transcripts and exons of Bos taurus to bosTau4.
>
> Hope this helps,
> H.
>
>
> On 03/08/2011 11:49 AM, Vincent Carey wrote:
>> There are various approaches using Bioconductor. The fundamental
>> resource is the package org.Bt.eg.db, which you can acquire using
>> biocLite().
>>
>> You can find associations between ENSEMBL ids and Entrez ids using
>> mappings in that package.
>>
>> You may find GSEABase useful also. For example
>>
>>> dput(ensex)
>> c("ENSBTAG00000000005", "ENSBTAG00000000008", "ENSBTAG00000000009",
>> "ENSBTAG00000000010", "ENSBTAG00000000011", "ENSBTAG00000000012",
>> "ENSBTAG00000000013", "ENSBTAG00000000014", "ENSBTAG00000000015",
>> "ENSBTAG00000000016", "ENSBTAG00000000018", "ENSBTAG00000000019",
>> "ENSBTAG00000000020", "ENSBTAG00000000021", "ENSBTAG00000000022",
>> "ENSBTAG00000000023", "ENSBTAG00000000024", "ENSBTAG00000000025",
>> "ENSBTAG00000000026", "ENSBTAG00000000027")
>>> e1 = GeneSet(ensex, geneIdType=ENSEMBLIdentifier("org.Bt.eg.db"))
>>> e1
>> setName: NA
>> geneIds: ENSBTAG00000000005, ENSBTAG00000000008, ...,
>> ENSBTAG00000000027 (total: 20)
>> geneIdType: ENSEMBL (org.Bt.eg.db)
>> collectionType: Null
>> details: use 'details(object)'
>>> g1 = e1
>>> geneIdType(g1) = EntrezIdentifier("org.Bt.eg.db")
>>> g1
>> setName: NA
>> geneIds: 282136, 539250, ..., 512788 (total: 21)
>> geneIdType: EntrezId (org.Bt.eg.db)
>> collectionType: Null
>> details: use 'details(object)'
>>
>> This shows that there are 21 Entrez IDs associated with the 20 ENSEMBL
>> ids given above.
>> After converting sets to common ID type, you can use intersect,
>> setdiff methods to answer some of the
>> questions you pose.
>>
>>
>>> sessionInfo()
>> R version 2.13.0 Under development (unstable) (2011-03-01 r54628)
>> Platform: x86_64-apple-darwin10.4.0/x86_64 (64-bit)
>>
>> locale:
>> [1] C
>>
>> attached base packages:
>> [1] stats graphics grDevices datasets tools utils methods
>> [8] base
>>
>> other attached packages:
>> [1] GSEABase_1.13.3 graph_1.29.3 annotate_1.29.3
>> [4] org.Bt.eg.db_2.4.6 RSQLite_0.9-4 DBI_0.2-5
>> [7] AnnotationDbi_1.13.15 Biobase_2.11.9 weaver_1.17.0
>> [10] codetools_0.2-8 digest_0.4.2
>>
>> loaded via a namespace (and not attached):
>> [1] Matrix_0.999375-47 XML_3.2-0 grid_2.13.0 lattice_0.19-17
>> [5] xtable_1.5-6
>>
>>
>> On Tue, Mar 8, 2011 at 12:13 PM, Biase, Fernando<biase at illinois.edu>
>> wrote:
>>> Hi everyone,
>>>
>>> I have a list of ENSEMBL gene _IDS and a list with NCBI gene_IDs. I
>>> need to find which ids correspond to genes in both list (overlapping
>>> genes) and each genes are in each one of them but not present in the
>>> other list (non-overlapping genes).
>>> Can anyone give me some advice on this task? Or indicate a material
>>> do read?
>>> In case it is relevant, the organism is Bos taurus.
>>>
>>> Thanks in advance,
>>> Fernando
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list