[BioC] how to get Entrez Gene ID from genomic coordinates
James W. MacDonald
jmacdon at uw.edu
Tue Nov 19 16:22:35 CET 2013
Hi Bruna,
On Tuesday, November 19, 2013 7:07:28 AM, Bruna Marini wrote:
> Hi,
> I have a batch of genomic coordinates, and I need to find all the genes (Entrez Gene ID) covered by those coordinates.
> I am new to R, but I am able to use biomaRt in R to get various annotation of genes from gene ID lists.
> Would it be possible to retrieve my IDs from the coordinates?
>
> The coordinates are in the chr:start:end format; I need the genes from both the strand.
> es.
> 1:11401198:11694590
> 1:14877629:15246452
> 1:38065507:38258622
You could probably use biomaRt, but Steffen Durinck would know the best
way for that. He might be along later with an example, and it may well
be a one-liner.
You could also use an organism level package as well. Note that getting
used to the infrastructure available in BioC will help with many other
queries you might have, so it is worth it to learn.
Here I assume these are coordinates from the GRCh37/hg19 Homo sapiens
genome. If not, you will have to make the necessary changes.
library(Homo.sapiens)
kg <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
mg <- GRanges(seq = rep("chr1", 3), IRanges(start =
c(11401198,14877629,38065507), end = c(11694590,15246452,38258622)))
Normally you would have read these data in. You could do something like
dat <- scan("filename.txt", what = "c")
dat <- do.call("rbind", strsplit(dat, ":"))
mg <- GRanges(paste0("chr", dat[,1]), IRanges(dat[,2], dat[,3]))
which will replicate what I did by hand
olaps <- findOverlaps(kg, mg)
> olaps
Hits of length 6
queryLength: 23056
subjectLength: 3
queryHits subjectHits
<integer> <integer>
1 7105 2
2 9113 3
3 9114 3
4 14481 3
5 14649 3
6 16011 1
gn <- elementMetadata(kg)[queryHits(olaps),"gene_id"]
CharacterList of length 6
[[1]] 23254
[[2]] 284654
[[3]] 284656
[[4]] 54955
[[5]] 55143
[[6]] 57540
> select(Homo.sapiens, unlist(gn), "SYMBOL", "ENTREZID")
ENTREZID SYMBOL
1 23254 KAZN
2 284654 RSPO1
3 284656 EPHA10
4 54955 C1orf109
5 55143 CDCA8
6 57540 PTCHD2
Or if you still want the ranges, you can just add the symbols
> smkg <- kg[queryHits(olaps),]
> elementMetadata(smkg)$Symbol <- select(Homo.sapiens, unlist(gn), "SYMBOL", "ENTREZID")[,2]
> smkg
GRanges with 6 ranges and 2 metadata columns:
seqnames ranges strand | gene_id
Symbol
<Rle> <IRanges> <Rle> | <CharacterList>
<character>
23254 chr1 [14925213, 15444544] + | 23254
KAZN
284654 chr1 [38076951, 38100595] - | 284654
RSPO1
284656 chr1 [38179553, 38230824] - | 284656
EPHA10
54955 chr1 [38147242, 38157888] - | 54955
C1orf109
55143 chr1 [38158073, 38175391] + | 55143
CDCA8
57540 chr1 [11539295, 11597640] + | 57540
PTCHD2
Best,
Jim
>
> Thank you for the attention,
> best regards,
>
> Bruna
>
> Bruna Marini,
> Molecular Medicine Lab
> ICGEB
> Padriciano 99, 34149 Trieste, Italy
> marini at icgeb.org
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
More information about the Bioconductor
mailing list