[BioC] How to annotate genomic coordinates

Harris A. Jaffee hj at jhu.edu
Thu Nov 8 19:49:54 CET 2012


On the elementary end of all this...

If the sites are on a *file*, one per line, you could do

	y <- read.delim(filename, sep=".", header=FALSE, as.is=TRUE)

etc.

On Nov 8, 2012, at 11:38 AM, James W. MacDonald wrote:

> Hi Jose,
> 
> On 11/8/2012 10:28 AM, José Luis Lavín wrote:
>> Dear James,
>> 
>> To answer your questions swiftly, the features are methylation sites (that's why I only have a coordinate instead of having a Start/End pair) in mouse (mm9) genome and I have a list of those in the following format:
>> 
>> chr17.31246506
>> 
>> How could I read the list so that I can input the "chr" and the "coordinate" parameters into the expression you recommended?
> 
> First you need to split your data to chr and pos.
> 
> chr.pos <- do.call("rbind", strsplit(<your vector of chr17.pos data>, "\\."))
> 
> Note that your vector of chr.pos data may have been in as a factor, so you may need to wrap with an as.character(). If you don't know what I mean by that, you should first read 'An Introduction to R' (http://cran.r-project.org/doc/manuals/R-intro.html). That will be time well spent.
> 
>> 
>> x <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1))
> 
> Both the seqnames and ranges argument to GRanges() can be based on vectors. So if you had a matrix (called 'y') like
> 
> chr16    3423543
> chr3    403992
> chr18    3494202
> 
> then you can do
> 
> x <- GRanges(y[,1], IRanges(start = y[,2], width = 1))
> 
> see ?GRanges for more info.
> 
> As a side note, IIRC, methylation sites are not in general found in exons, but are more likely to be found upstream of a given gene. You might then want to use fiveUTRsByTranscript() rather than exonsBy(). See ?exonsBy after loading the GenomicFeatures package.
> 
>> 
>> I 'm lookin forward to obtain a table where my "coordinate-based Id" appears in a column, and the gene_name and if possible, the  Entrez/Ensembl Ids in two other columns :
>> 
>> Coordinate Gene_name  Entrez_ID  Ensembl_ID
>> 
>> Is it easy to do this in R?
> 
> Of course! Everything is easy to do in R. You should see my sweet R toast 'n coffee maker ;-D
> 
> But you should note that the fiveUTRByTranscript() function is transcript based (obvs), and so you will have multiple transcripts per gene. This makes things much more difficult, as a given methylation site may overlap multiple transcripts of the same gene. So that makes it hard to merge your original data with the overlapping transcripts.
> 
> You could do something like
> 
> ex <- fiveUTRsByTranscript(TxDb.Mmusculus.UCSC.mm10.knownGene, use.names = TRUE)
> ex2 <- do.call("rbind", sapply(ex[ex %in% x], elementMetadata))$exon_id
> 
> then you could use unique Gene IDs thusly
> 
> my.trans <- select(Mus.musculus, unique(as.character(ex2)), c("CHR","CHRLOC","ENTREZID","ENSEMBL"), "ENTREZID")
> 
> That should at least give you a start. See where you can go on your own, and let us know if you get stuck.
> 
> Best,
> 
> Jim
> 
> 
> 
>> 
>> I'm still really new to R and I lack many concepts you may consider basic... I'm awfully sorry
>> 
>> Best
>> 
>> JL
>> 
>> 2012/11/8 James W. MacDonald <jmacdon at uw.edu <mailto:jmacdon at uw.edu>>
>> 
>>    Hi Jose,
>> 
>> 
>>    On 11/8/2012 8:19 AM, José Luis Lavín wrote:
>> 
>>        Dear Bioconductor list,
>> 
>>        I write you this email asking for a Bioconductor module that
>>        allows me to
>>        annotate genomic coordinates and get different GeneIds.
>>        I'll show you an example of what I'm referring to:
>> 
>>        I have this data:
>>        Chromosome  coordinate
>>        chr17              31246506
>> 
>> 
>>    It depends on what that coordinate is. Is it the start of a
>>    transcript? A SNP? Do you really just have a single coordinate, or
>>    do you have a range? What species are we talking about here?
>> 
>>    Depending on what your data are, you might want to use either one
>>    of the TxDb packages, or a SNPlocs package. There really isn't
>>    much to go on here. If I assume this is a coordinate that one
>>    might think is within an exon, and if I further assume you are
>>    working with H. sapiens, I could suggest something like
>> 
>>    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>    ex <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")
>> 
>>    x <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1))
>> 
>>    ex[ex %in% x]
>> 
>>    or maybe more appropriately
>> 
>>    names(ex)[ex %in% x]
>> 
>>    which will give you the Gene ID, and you can go from there using
>>    the org.Hs.eg.db package.
>> 
>>    If however, your coordinate isn't in an exon, but might be in a
>>    UTR, you can look at ?exonsBy to see what other sequences you can
>>    extract to compare with.
>> 
>>    If these are SNPs, then you can look at the help pages for the
>>    relevant SNPlocs package.
>> 
>>    Best,
>> 
>>    Jim
>> 
>> 
>> 
>>        which can also be written this way by the program that yielded
>>        the result:
>>        chr17.31246506
>> 
>>        And I need to convert this data into a gene name and known
>>        gene Ids, such
>>        as:
>> 
>>        Gene name  Entrez_ID  Ensembl_ID
>> 
>>        Tff3 NM_011575 20050
>>        Can you please advice me about a module able to perform this
>>        ID conversion
>>        using a list of  "chr17.31246506" type coordinates as input?
>> 
>>        Thanks in advance
>> 
>>        With best wishes
>> 
>> 
>> 
>>        _______________________________________________
>>        Bioconductor mailing list
>>        Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>>        https://stat.ethz.ch/mailman/listinfo/bioconductor
>>        Search the archives:
>>        http://news.gmane.org/gmane.science.biology.informatics.conductor
>> 
>> 
>>    --     James W. MacDonald, M.S.
>>    Biostatistician
>>    University of Washington
>>    Environmental and Occupational Health Sciences
>>    4225 Roosevelt Way NE, # 100
>>    Seattle WA 98105-6099
>> 
>> 
>> 
>> 
>> -- 
>> -- 
>> Dr. José Luis Lavín Trueba
>> 
>> Dpto. de Producción Agraria
>> Grupo de Genética y Microbiología
>> Universidad Pública de Navarra
>> 31006 Pamplona
>> Navarra
>> SPAIN
> 
> -- 
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list