[BioC] how to get Entrez Gene ID from genomic coordinates

Tue Nov 19 16:22:35 CET 2013

Hi Bruna,

On Tuesday, November 19, 2013 7:07:28 AM, Bruna Marini wrote:
> Hi,
> I have a batch of genomic coordinates, and I need to find all the genes (Entrez Gene ID) covered by those coordinates.
> I am new to R, but I am able to use biomaRt in R to get various annotation of genes from gene ID lists.
> Would it be possible to retrieve my IDs from the coordinates?
>
> The coordinates are in the chr:start:end format; I need the genes from both the strand.
> es.
> 1:11401198:11694590
> 1:14877629:15246452
> 1:38065507:38258622

You could probably use biomaRt, but Steffen Durinck would know the best 
way for that. He might be along later with an example, and it may well 
be a one-liner.

You could also use an organism level package as well. Note that getting 
used to the infrastructure available in BioC will help with many other 
queries you might have, so it is worth it to learn.

Here I assume these are coordinates from the GRCh37/hg19 Homo sapiens 
genome. If not, you will have to make the necessary changes.

library(Homo.sapiens)
kg <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
mg <- GRanges(seq = rep("chr1", 3), IRanges(start = 
c(11401198,14877629,38065507), end = c(11694590,15246452,38258622)))

Normally you would have read these data in. You could do something like

dat <- scan("filename.txt", what = "c")
dat <- do.call("rbind", strsplit(dat, ":"))
mg <- GRanges(paste0("chr", dat[,1]), IRanges(dat[,2], dat[,3]))

which will replicate what I did by hand

olaps <- findOverlaps(kg, mg)
> olaps
Hits of length 6
queryLength: 23056
subjectLength: 3
  queryHits subjectHits
   <integer>   <integer>
 1      7105           2
 2      9113           3
 3      9114           3
 4     14481           3
 5     14649           3
 6     16011           1

gn <- elementMetadata(kg)[queryHits(olaps),"gene_id"]
CharacterList of length 6
[[1]] 23254
[[2]] 284654
[[3]] 284656
[[4]] 54955
[[5]] 55143
[[6]] 57540

> select(Homo.sapiens, unlist(gn), "SYMBOL", "ENTREZID")
  ENTREZID   SYMBOL
1    23254     KAZN
2   284654    RSPO1
3   284656   EPHA10
4    54955 C1orf109
5    55143    CDCA8
6    57540   PTCHD2

Or if you still want the ranges, you can just add the symbols

> smkg <- kg[queryHits(olaps),]
> elementMetadata(smkg)$Symbol <- select(Homo.sapiens, unlist(gn), "SYMBOL", "ENTREZID")[,2]
> smkg
GRanges with 6 ranges and 2 metadata columns:
         seqnames               ranges strand |         gene_id      
Symbol
            <Rle>            <IRanges>  <Rle> | <CharacterList> 
<character>
   23254     chr1 [14925213, 15444544]      + |           23254        
KAZN
  284654     chr1 [38076951, 38100595]      - |          284654       
RSPO1
  284656     chr1 [38179553, 38230824]      - |          284656      
EPHA10
   54955     chr1 [38147242, 38157888]      - |           54955    
C1orf109
   55143     chr1 [38158073, 38175391]      + |           55143       
CDCA8
   57540     chr1 [11539295, 11597640]      + |           57540      
PTCHD2

Best,

Jim

>
> Thank you for the attention,
> best regards,
>
> Bruna
>
> Bruna Marini,
> Molecular Medicine Lab
> ICGEB
> Padriciano 99, 34149 Trieste, Italy
> marini at icgeb.org
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099