[BioC] How to annotate genomic coordinates

Thu Nov 8 17:38:37 CET 2012

Hi Jose,

On 11/8/2012 10:28 AM, José Luis Lavín wrote:
> Dear James,
>
> To answer your questions swiftly, the features are methylation sites 
> (that's why I only have a coordinate instead of having a Start/End 
> pair) in mouse (mm9) genome and I have a list of those in the 
> following format:
>
> chr17.31246506
>
> How could I read the list so that I can input the "chr" and the 
> "coordinate" parameters into the expression you recommended?

First you need to split your data to chr and pos.

chr.pos <- do.call("rbind", strsplit(<your vector of chr17.pos data>, 
"\\."))

Note that your vector of chr.pos data may have been in as a factor, so 
you may need to wrap with an as.character(). If you don't know what I 
mean by that, you should first read 'An Introduction to R' 
(http://cran.r-project.org/doc/manuals/R-intro.html). That will be time 
well spent.

>
> x <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1))

Both the seqnames and ranges argument to GRanges() can be based on 
vectors. So if you had a matrix (called 'y') like

chr16    3423543
chr3    403992
chr18    3494202

then you can do

x <- GRanges(y[,1], IRanges(start = y[,2], width = 1))

see ?GRanges for more info.

As a side note, IIRC, methylation sites are not in general found in 
exons, but are more likely to be found upstream of a given gene. You 
might then want to use fiveUTRsByTranscript() rather than exonsBy(). See 
?exonsBy after loading the GenomicFeatures package.

>
> I 'm lookin forward to obtain a table where my "coordinate-based Id" 
> appears in a column, and the gene_name and if possible, the  
> Entrez/Ensembl Ids in two other columns :
>
> Coordinate Gene_name  Entrez_ID  Ensembl_ID
>
> Is it easy to do this in R?

Of course! Everything is easy to do in R. You should see my sweet R 
toast 'n coffee maker ;-D

But you should note that the fiveUTRByTranscript() function is 
transcript based (obvs), and so you will have multiple transcripts per 
gene. This makes things much more difficult, as a given methylation site 
may overlap multiple transcripts of the same gene. So that makes it hard 
to merge your original data with the overlapping transcripts.

You could do something like

ex <- fiveUTRsByTranscript(TxDb.Mmusculus.UCSC.mm10.knownGene, use.names 
= TRUE)
ex2 <- do.call("rbind", sapply(ex[ex %in% x], elementMetadata))$exon_id

then you could use unique Gene IDs thusly

my.trans <- select(Mus.musculus, unique(as.character(ex2)), 
c("CHR","CHRLOC","ENTREZID","ENSEMBL"), "ENTREZID")

That should at least give you a start. See where you can go on your own, 
and let us know if you get stuck.

Best,

Jim

>
> I'm still really new to R and I lack many concepts you may consider 
> basic... I'm awfully sorry
>
> Best
>
> JL
>
> 2012/11/8 James W. MacDonald <jmacdon at uw.edu <mailto:jmacdon at uw.edu>>
>
>     Hi Jose,
>
>
>     On 11/8/2012 8:19 AM, José Luis Lavín wrote:
>
>         Dear Bioconductor list,
>
>         I write you this email asking for a Bioconductor module that
>         allows me to
>         annotate genomic coordinates and get different GeneIds.
>         I'll show you an example of what I'm referring to:
>
>         I have this data:
>         Chromosome  coordinate
>         chr17              31246506
>
>
>     It depends on what that coordinate is. Is it the start of a
>     transcript? A SNP? Do you really just have a single coordinate, or
>     do you have a range? What species are we talking about here?
>
>     Depending on what your data are, you might want to use either one
>     of the TxDb packages, or a SNPlocs package. There really isn't
>     much to go on here. If I assume this is a coordinate that one
>     might think is within an exon, and if I further assume you are
>     working with H. sapiens, I could suggest something like
>
>     library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>     ex <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")
>
>     x <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1))
>
>     ex[ex %in% x]
>
>     or maybe more appropriately
>
>     names(ex)[ex %in% x]
>
>     which will give you the Gene ID, and you can go from there using
>     the org.Hs.eg.db package.
>
>     If however, your coordinate isn't in an exon, but might be in a
>     UTR, you can look at ?exonsBy to see what other sequences you can
>     extract to compare with.
>
>     If these are SNPs, then you can look at the help pages for the
>     relevant SNPlocs package.
>
>     Best,
>
>     Jim
>
>
>
>         which can also be written this way by the program that yielded
>         the result:
>         chr17.31246506
>
>         And I need to convert this data into a gene name and known
>         gene Ids, such
>         as:
>
>         Gene name  Entrez_ID  Ensembl_ID
>
>         Tff3 NM_011575 20050
>         Can you please advice me about a module able to perform this
>         ID conversion
>         using a list of  "chr17.31246506" type coordinates as input?
>
>         Thanks in advance
>
>         With best wishes
>
>
>
>         _______________________________________________
>         Bioconductor mailing list
>         Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>         https://stat.ethz.ch/mailman/listinfo/bioconductor
>         Search the archives:
>         http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>     -- 
>     James W. MacDonald, M.S.
>     Biostatistician
>     University of Washington
>     Environmental and Occupational Health Sciences
>     4225 Roosevelt Way NE, # 100
>     Seattle WA 98105-6099
>
>
>
>
> -- 
> -- 
> Dr. José Luis Lavín Trueba
>
> Dpto. de Producción Agraria
> Grupo de Genética y Microbiología
> Universidad Pública de Navarra
> 31006 Pamplona
> Navarra
> SPAIN

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099