[BioC] How to annotate genomic coordinates

Wed Nov 14 15:25:21 CET 2012

Hi Jose,

On 11/14/2012 4:00 AM, José Luis Lavín wrote:
> Dear Valerie,
>
> I have an issue with the results of your approach. Now I get a table with
> the following fields:
>
>        ENTREZID   ENSEMBL  GENENAME
>
> 1    216285   ENSMUSG00000036602  ALX homeobox 1
>
>
> But the Identifier I converted to obtain the previously mentioned data is
> not present in that table (eg. chr10.100837476) , so there's nothing I can
> do with this results table because I can't correlate my previous
> chr.location type identifiers to the results of the table.
> I guess I've done something wrong somewhere in the code I modified so that
> I missed my "chr.coordinate" ids column. So I'll paste the code here again
> to see if any of you can tell me where I lost that column.

You didn't lose a column, you just left data behind. Using Valerie's 
example, slightly modified:

 > library(TxDb.Mmusculus.UCSC.mm9.knownGene)
 > txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene
 > gr <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1))
 > loc <- locateVariants(query=gr, subject=txdb, region=AllVariants())
 > loc <- as.data.frame(loc)
 > gn <- select(Mus.musculus, loc$GENEID, c("GENENAME","ENSEMBL"), 
"ENTREZID")
 > the.data <- merge(gn, loc, by.x = 1, by.y = 10)
 > the.data[,c(4,5,1:3)]
   seqnames    start ENTREZID            ENSEMBL
1    chr17 31245606    11307 ENSMUSG00000024030
                                               GENENAME
1 ATP-binding cassette, sub-family G (WHITE), member 1

There are other data in the data.frame that you might want to keep as well.

Best,

Jim

>
> source("http://bioconductor.org/biocLite.R")
> biocLite('Mus.musculus')
> biocLite('TxDb.Mmusculus.UCSC.mm9.knownGene')
>
> biocLite('VariantAnnotation')
> library(Mus.musculus)
> library(TxDb.Mmusculus.UCSC.mm9.knownGene)
>
> txdb<- TxDb.Mmusculus.UCSC.mm9.knownGene
> y<- read.delim("/path_to_dir/ids_noM.txt", sep=".", header=FALSE, as.is
> =TRUE)
> probes<- GRanges(seqnames=y[,1], ranges=IRanges(start=y[,2], width=1))
>
> library(VariantAnnotation)
> loc<- locateVariants(query=probes, subject=txdb, region=AllVariants())
> head(loc)
>
> entrezid<- loc$GENEID
> entrezid2<-select(Mus.musculus, keytype="ENTREZID", keys=entrezid,
> cols=c("GENENAME", "ENSEMBL"))
>
> write.table(entrezid2, file = "Annotated_Ids.csv",quote = FALSE, sep = ",",
> eol = "\n")
>
> Thank you again for your advice
>
>
> 2012/11/14 Hervé Pagès<hpages at fhcrc.org>
>
>> Hi all,
>>
>>
>> On 11/12/2012 10:24 AM, Valerie Obenchain wrote:
>>
>>> Hi Jose,
>>>
>>> Here is a slightly different approach.
>>>
>>> library(Mus.musculus)
>>> library(TxDb.Mmusculus.UCSC.**mm9.knownGene)
>>> txdb<- TxDb.Mmusculus.UCSC.mm9.**knownGene
>>>
>>> ## I assume you've figured out how to read your data into a GRanges.
>>> ## Here we use a small example.
>>> gr<- GRanges(seq = "chr17", IRanges(start = 31245606, width = 20))
>>>
>>> ## The locateVariants() function in the VariantAnnotation package will
>>> ## give you the GENEIDs for all ranges in 'query'. If the range does not
>>> ## fall in a gene (i.e., it is intergenic), then the PRECEDEID and
>>> ## FOLLOWID are provided. The LOCATION column indicates what
>>> ## region the range fell in. See ?locateVariants for details on the
>>> ## different regions and the ability to set 'upstream' and 'downstream'
>>> ## values for promoter regions.
>>> library(VariantAnnotation)
>>> loc<- locateVariants(query=gr, subject=txdb, region=AllVariants())
>>>
>>>   >  loc
>>> GRanges with 1 range and 7 metadata columns:
>>>         seqnames               ranges strand | LOCATION   QUERYID      TXID
>>> <Rle>  <IRanges>  <Rle>  |<factor>  <integer>  <integer>
>>>     [1]    chr17 [31245606, 31245625]      * |   coding         1     47716
>>>             CDSID      GENEID   PRECEDEID    FOLLOWID
>>> <integer>  <character>  <character>  <character>
>>>     [1]    184246       11307<NA>  <NA>
>>>
>> Nice! So IIUC locateVariants() is not intrinsically for variants but
>> seems to be generally useful for "locating" any set of genomic
>> positions. Shouldn't we have this (or something similar) in
>> GenomicFeatures (probably with a different name)?
>>
>> Cheers,
>> H.
>>
>>
>>
>>> ## Use the select() function to map these GENEID's to the other
>>> ## values you are interested in. The GENEID's from locateVariants()
>>> ## are Entrez ID's so we use those as our 'keytype'.
>>> entrezid<- loc$GENEID
>>> select(Mus.musculus, keytype="ENTREZID", keys=entrezid,
>>> cols=c("GENENAME", "ENSEMBL"))
>>>   >  select(Mus.musculus, keytype="ENTREZID", keys=entrezid,
>>> cols=c("GENENAME", "ENSEMBL"))
>>>     ENTREZID            ENSEMBL
>>> 1    11307 ENSMUSG00000024030
>>>                                                 GENENAME
>>> 1 ATP-binding cassette, sub-family G (WHITE), member 1
>>>
>>>
>>> Valerie
>>>
>>>
>>>
>>> On 11/12/12 06:59, José Luis Lavín wrote:
>>>
>>>> Hello all,
>>>>
>>>> First of all I want to thank everybody that gave me advice on this
>>>> subject.
>>>> trying to follow the advice, I added some modifications mixing codes from
>>>> Tim,  Harris and James , but it seems I got lost somewhere in between...
>>>> I'm sorry for bothering you all again, but I'm afraid I need some more
>>>> help.
>>>>
>>>> I need to read my ids.txt file and then iterate use each line of id
>>>> (chr.position) to perform the annotation process with it. I thought of a
>>>> "for" loop to achieve it, but I do not seem to catch the essence of R
>>>> processes and I guess I miss my tryout.
>>>> Please have a look at my "disaster" and help me with it If such thing is
>>>> possible...
>>>>
>>>> biocLite('Mus.musculus')
>>>> require(Mus.musculus)
>>>> require(TxDb.Mmusculus.UCSC.**mm9.knownGene)
>>>> txdb<- transcriptsBy(TxDb.Mmusculus.**UCSC.mm9.knownGene, 'gene')
>>>> egid<- names(txdb)
>>>> name<- unlist(mget(egid, org.Mm.egSYMBOL, ifnotfound=NA))
>>>> length(name) == length(egid) ## TRUE, OK to use
>>>> esbl<- unlist(mget(egid, org.Mm.egENSEMBL, ifnotfound=NA))
>>>> length(esbl) == length(egid) ## FALSE, do not use
>>>>
>>>> #read input table
>>>> coordTable<- read.delim(/path_to_dir/ids.**txt, sep=".", header=FALSE,
>>>> as.is
>>>> =TRUE)
>>>>
>>>> for(i in 1:length(coordTable))
>>>> {
>>>> probes<- GRanges(seq= "y[,1]", IRanges(start = y[,2], width = 1),
>>>>                    strand='*')
>>>> }
>>>>
>>>> genome(probes)<- 'mm9' ## prevents some stoopid mistakes
>>>>
>>>> geneBodyProbes<- findOverlaps(probes, txdb)
>>>> geneBodyProbes
>>>>
>>>> write.table
>>>> (geneBodyProbes,file="/path_**to_dir/trans_id.tsv",quote=**
>>>> FALSE,sep="\t")
>>>>
>>>> ## Hits of length 1
>>>> ## queryLength: 1
>>>> ## subjectLength: 21761
>>>> ##   queryHits subjectHits
>>>> ##<integer>     <integer>
>>>> ##     1         1        1641
>>>> ##
>>>>
>>>> name[subjectHits(**geneBodyProbes)]
>>>>
>>>> ##   11307  # egid
>>>> ## "Abcg1"  # name
>>>> ##
>>>>
>>>> org.Mm.egCHRLOC[['11307']]
>>>>
>>>> ##       17
>>>> ## 31057694
>>>> ##
>>>>
>>>> org.Mm.egENSEMBL[['11307']]
>>>>
>>>> ## [1] "ENSMUSG00000024030"
>>>>
>>>> promotersByGene<- flank(txdb, 1500) # or however many bases you want
>>>> promotersByGene<- flank(promotersByGene, 200, FALSE) # a little extra
>>>> promoterProbes<- findOverlaps(probes, promotersByGene)
>>>> promoterProbes
>>>>
>>>> write.table
>>>> (promoterProbes,file="/path_**to_dir/promo_trans_id.tsv",**
>>>> quote=FALSE,sep="\t")
>>>>
>>>>
>>>> ## Hits of length 0
>>>> ## queryLength: 1
>>>> ## subjectLength: 21761
>>>> ##
>>>> ## read the GRanges and GenomicFeatures vignettes for more
>>>>
>>>> write.table
>>>> (geneBodyProbes,file="/path_**to_dir/trans_id.tsv",quote=**
>>>> FALSE,sep="\t")
>>>>
>>>>
>>>> *Thanks in advance*
>>>>
>>>> JL
>>>>
>>>> 2012/11/8 Harris A. Jaffee<hj at jhu.edu>
>>>>
>>>>   On the elementary end of all this...
>>>>> If the sites are on a *file*, one per line, you could do
>>>>>
>>>>>           y<- read.delim(filename, sep=".", header=FALSE, as.is=TRUE)
>>>>>
>>>>> etc.
>>>>>
>>>>> On Nov 8, 2012, at 11:38 AM, James W. MacDonald wrote:
>>>>>
>>>>>   Hi Jose,
>>>>>> On 11/8/2012 10:28 AM, José Luis Lavín wrote:
>>>>>>
>>>>>>> Dear James,
>>>>>>>
>>>>>>> To answer your questions swiftly, the features are methylation sites
>>>>>>>
>>>>>> (that's why I only have a coordinate instead of having a Start/End
>>>>> pair) in
>>>>> mouse (mm9) genome and I have a list of those in the following format:
>>>>>
>>>>>> chr17.31246506
>>>>>>> How could I read the list so that I can input the "chr" and the
>>>>>>>
>>>>>> "coordinate" parameters into the expression you recommended?
>>>>>> First you need to split your data to chr and pos.
>>>>>>
>>>>>> chr.pos<- do.call("rbind", strsplit(<your vector of chr17.pos data>,
>>>>>>
>>>>> "\\."))
>>>>>
>>>>>> Note that your vector of chr.pos data may have been in as a factor, so
>>>>>>
>>>>> you may need to wrap with an as.character(). If you don't know what I
>>>>> mean
>>>>> by that, you should first read 'An Introduction to R' (
>>>>> http://cran.r-project.org/doc/**manuals/R-intro.html<http://cran.r-project.org/doc/manuals/R-intro.html>).
>>>>> That will be time
>>>>> well spent.
>>>>>
>>>>>> x<- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1))
>>>>>> Both the seqnames and ranges argument to GRanges() can be based on
>>>>>>
>>>>> vectors. So if you had a matrix (called 'y') like
>>>>>
>>>>>> chr16    3423543
>>>>>> chr3    403992
>>>>>> chr18    3494202
>>>>>>
>>>>>> then you can do
>>>>>>
>>>>>> x<- GRanges(y[,1], IRanges(start = y[,2], width = 1))
>>>>>>
>>>>>> see ?GRanges for more info.
>>>>>>
>>>>>> As a side note, IIRC, methylation sites are not in general found in
>>>>>>
>>>>> exons, but are more likely to be found upstream of a given gene. You
>>>>> might
>>>>> then want to use fiveUTRsByTranscript() rather than exonsBy(). See
>>>>> ?exonsBy
>>>>> after loading the GenomicFeatures package.
>>>>>
>>>>>> I 'm lookin forward to obtain a table where my "coordinate-based Id"
>>>>>> appears in a column, and the gene_name and if possible, the
>>>>> Entrez/Ensembl
>>>>> Ids in two other columns :
>>>>>
>>>>>> Coordinate Gene_name  Entrez_ID  Ensembl_ID
>>>>>>> Is it easy to do this in R?
>>>>>>>
>>>>>> Of course! Everything is easy to do in R. You should see my sweet R
>>>>>>
>>>>> toast 'n coffee maker ;-D
>>>>>
>>>>>> But you should note that the fiveUTRByTranscript() function is
>>>>>>
>>>>> transcript based (obvs), and so you will have multiple transcripts per
>>>>> gene. This makes things much more difficult, as a given methylation site
>>>>> may overlap multiple transcripts of the same gene. So that makes it
>>>>> hard to
>>>>> merge your original data with the overlapping transcripts.
>>>>>
>>>>>> You could do something like
>>>>>>
>>>>>> ex<- fiveUTRsByTranscript(TxDb.**Mmusculus.UCSC.mm10.knownGene,
>>>>>> use.names
>>>>>>
>>>>> = TRUE)
>>>>>
>>>>>> ex2<- do.call("rbind", sapply(ex[ex %in% x], elementMetadata))$exon_id
>>>>>>
>>>>>> then you could use unique Gene IDs thusly
>>>>>>
>>>>>> my.trans<- select(Mus.musculus, unique(as.character(ex2)),
>>>>>>
>>>>> c("CHR","CHRLOC","ENTREZID","**ENSEMBL"), "ENTREZID")
>>>>>
>>>>>> That should at least give you a start. See where you can go on your
>>>>>> own,
>>>>>>
>>>>> and let us know if you get stuck.
>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Jim
>>>>>>
>>>>>>
>>>>>>
>>>>>>   I'm still really new to R and I lack many concepts you may consider
>>>>>> basic... I'm awfully sorry
>>>>>> Best
>>>>>>> JL
>>>>>>>
>>>>>>> 2012/11/8 James W. MacDonald<jmacdon at uw.edu<**mailto:jmacdon at uw.edu>>
>>>>>>>
>>>>>>>      Hi Jose,
>>>>>>>
>>>>>>>
>>>>>>>      On 11/8/2012 8:19 AM, José Luis Lavín wrote:
>>>>>>>
>>>>>>>          Dear Bioconductor list,
>>>>>>>
>>>>>>>          I write you this email asking for a Bioconductor module that
>>>>>>>          allows me to
>>>>>>>          annotate genomic coordinates and get different GeneIds.
>>>>>>>          I'll show you an example of what I'm referring to:
>>>>>>>
>>>>>>>          I have this data:
>>>>>>>          Chromosome  coordinate
>>>>>>>          chr17              31246506
>>>>>>>
>>>>>>>
>>>>>>>      It depends on what that coordinate is. Is it the start of a
>>>>>>>      transcript? A SNP? Do you really just have a single coordinate, or
>>>>>>>      do you have a range? What species are we talking about here?
>>>>>>>
>>>>>>>      Depending on what your data are, you might want to use either one
>>>>>>>      of the TxDb packages, or a SNPlocs package. There really isn't
>>>>>>>      much to go on here. If I assume this is a coordinate that one
>>>>>>>      might think is within an exon, and if I further assume you are
>>>>>>>      working with H. sapiens, I could suggest something like
>>>>>>>
>>>>>>>      library(TxDb.Hsapiens.UCSC.**hg19.knownGene)
>>>>>>>      ex<- exonsBy(TxDb.Hsapiens.UCSC.**hg19.knownGene, "gene")
>>>>>>>
>>>>>>>      x<- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1))
>>>>>>>
>>>>>>>      ex[ex %in% x]
>>>>>>>
>>>>>>>      or maybe more appropriately
>>>>>>>
>>>>>>>      names(ex)[ex %in% x]
>>>>>>>
>>>>>>>      which will give you the Gene ID, and you can go from there using
>>>>>>>      the org.Hs.eg.db package.
>>>>>>>
>>>>>>>      If however, your coordinate isn't in an exon, but might be in a
>>>>>>>      UTR, you can look at ?exonsBy to see what other sequences you can
>>>>>>>      extract to compare with.
>>>>>>>
>>>>>>>      If these are SNPs, then you can look at the help pages for the
>>>>>>>      relevant SNPlocs package.
>>>>>>>
>>>>>>>      Best,
>>>>>>>
>>>>>>>      Jim
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>          which can also be written this way by the program that yielded
>>>>>>>          the result:
>>>>>>>          chr17.31246506
>>>>>>>
>>>>>>>          And I need to convert this data into a gene name and known
>>>>>>>          gene Ids, such
>>>>>>>          as:
>>>>>>>
>>>>>>>          Gene name  Entrez_ID  Ensembl_ID
>>>>>>>
>>>>>>>          Tff3 NM_011575 20050
>>>>>>>          Can you please advice me about a module able to perform this
>>>>>>>          ID conversion
>>>>>>>          using a list of  "chr17.31246506" type coordinates as input?
>>>>>>>
>>>>>>>          Thanks in advance
>>>>>>>
>>>>>>>          With best wishes
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>          ______________________________**_________________
>>>>>>>          Bioconductor mailing list
>>>>>>>          Bioconductor at r-project.org<**mailto:Bioconductor at r-project.**
>>>>>>> org<Bioconductor at r-project.org>>
>>>>>>>          https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>>>>>>          Search the archives:
>>>>>>>
>>>>>>>   http://news.gmane.org/gmane.**science.biology.informatics.**
>>>>> conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>>>
>>>>>>>      --     James W. MacDonald, M.S.
>>>>>>>      Biostatistician
>>>>>>>      University of Washington
>>>>>>>      Environmental and Occupational Health Sciences
>>>>>>>      4225 Roosevelt Way NE, # 100
>>>>>>>      Seattle WA 98105-6099
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> --
>>>>>>> Dr. José Luis Lavín Trueba
>>>>>>>
>>>>>>> Dpto. de Producción Agraria
>>>>>>> Grupo de Genética y Microbiología
>>>>>>> Universidad Pública de Navarra
>>>>>>> 31006 Pamplona
>>>>>>> Navarra
>>>>>>> SPAIN
>>>>>>>
>>>>>> --
>>>>>> James W. MacDonald, M.S.
>>>>>> Biostatistician
>>>>>> University of Washington
>>>>>> Environmental and Occupational Health Sciences
>>>>>> 4225 Roosevelt Way NE, # 100
>>>>>> Seattle WA 98105-6099
>>>>>>
>>>>>> ______________________________**_________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>>>>> Search the archives:
>>>>>>
>>>>> http://news.gmane.org/gmane.**science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>>>
>>>>>
>>>>>
>>>>
>>>> ______________________________**_________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.**science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>>
>>> ______________________________**_________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>> Search the archives:
>>> http://news.gmane.org/gmane.**science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>>
>
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099