[BioC] How to annotate genomic coordinates

Wed Nov 14 15:49:57 CET 2012

Hello,

Are you familiar with the R help pages? For any function or class object 
you can type ? followed by the name.

?locateVariants
?select

The page for locateVariants explains each of the the metadata columns in 
the result. One of the columns is 'QUERYID'. This id represents the row 
in your original query (i.e., the GRanges you entered as the 'query' 
argument). The page also explains that the result has one row for each 
range-transcript match.

gr <- GRanges(c("chr1", "chr17", "chr16"), IRanges(c(41245606, 31245606, 
11248806), width = 20))
loc <- locateVariants(query=gr, subject=txdb, region=AllVariants())

The first range hit an intergenic region and we therefore have PRECEDEID 
and FOLLOWID. The second range hit a coding region in a single 
transcript and the third range hit an intron in two different 
transcripts. Because the third range hit 2 transcripts we have 2 rows of 
data, each has a unique TXID but the same GENEID.

 > loc
GRanges with 4 ranges and 7 metadata columns:
       seqnames               ranges strand |   LOCATION   QUERYID      TXID
<Rle> <IRanges> <Rle> | <factor> <integer> <integer>
   [1]     chr1 [41245606, 41245625]      * | intergenic         1 <NA>
   [2]    chr17 [31245606, 31245625]      * |     coding         2     47716
   [3]    chr16 [11248806, 11248825]      * |     intron         3     46316
   [4]    chr16 [11248806, 11248825]      * |     intron         3     46317
           CDSID      GENEID   PRECEDEID    FOLLOWID
<integer> <character> <character> <character>
   [1] <NA> <NA>      211798      381339
   [2]    184246       11307 <NA> <NA>
   [3] <NA>       14852 <NA> <NA>
   [4] <NA>       14852 <NA> <NA>

Pull out the unique combinations of GENEID and QUERYID.

idx <- loc[ ,c("QUERYID", "GENEID")]
idx <- idx[!duplicated(idx)]
 > idx
GRanges with 3 ranges and 2 metadata columns:
       seqnames               ranges strand |   QUERYID      GENEID
<Rle> <IRanges> <Rle> | <integer> <character>
   [1]     chr1 [41245606, 41245625]      * |         1 <NA>
   [2]    chr17 [31245606, 31245625]      * |         2       11307
   [3]    chr16 [11248806, 11248825]      * |         3       14852

Remove NA GENEID's and call select().
geneid <- idx[!is.na(idx$GENEID)]
genetable <- select(Mus.musculus, keytype="ENTREZID", 
keys=geneid$GENEID, cols=c("GENENAME", "ENSEMBL"))
 > genetable
   ENTREZID            ENSEMBL
1    11307 ENSMUSG00000024030
2    14852 ENSMUSG00000062203
                                               GENENAME
1 ATP-binding cassette, sub-family G (WHITE), member 1
2                           G1 to S phase transition 1

You can use the QUERYID to map back to an original list of identifiers. 
If you just need 'chromosome.start' information you can extract that 
from the ranges in geneid. Note that the QUERYID maps back to the 
original 'query' but those same ranges are also included in the result.
id <- paste(as.character(seqnames(geneid)), ".", start(geneid), sep="")
 > data.frame(id, genetable)
               id ENTREZID            ENSEMBL
1 chr17.31245606    11307 ENSMUSG00000024030
2 chr16.11248806    14852 ENSMUSG00000062203
                                               GENENAME
1 ATP-binding cassette, sub-family G (WHITE), member 1
2                           G1 to S phase transition 1

Valerie

On 11/14/12 01:00, José Luis Lavín wrote:
> Dear Valerie,
>
> I have an issue with the results of your approach. Now I get a table 
> with the following fields:
>
>       ENTREZID   ENSEMBL  GENENAME
>
> 1    216285   ENSMUSG00000036602  ALX homeobox 1
>
>
> But the Identifier I converted to obtain the previously mentioned data 
> is not present in that table (eg. chr10.100837476) , so there's 
> nothing I can do with this results table because I can't correlate my 
> previous chr.location type identifiers to the results of the table.
> I guess I've done something wrong somewhere in the code I modified so 
> that I missed my "chr.coordinate" ids column. So I'll paste the code 
> here again to see if any of you can tell me where I lost that column.
>
> source("http://bioconductor.org/biocLite.R")
> biocLite('Mus.musculus')
> biocLite('TxDb.Mmusculus.UCSC.mm9.knownGene')
>
> biocLite('VariantAnnotation')
> library(Mus.musculus)
> library(TxDb.Mmusculus.UCSC.mm9.knownGene)
>
> txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene
> y <- read.delim("/path_to_dir/ids_noM.txt", sep=".", header=FALSE, 
> as.is <http://as.is>=TRUE)
> probes<- GRanges(seqnames=y[,1], ranges=IRanges(start=y[,2], width=1))
>
> library(VariantAnnotation)
> loc <- locateVariants(query=probes, subject=txdb, region=AllVariants())
> head(loc)
>
> entrezid <- loc$GENEID
> entrezid2<-select(Mus.musculus, keytype="ENTREZID", keys=entrezid, 
> cols=c("GENENAME", "ENSEMBL"))
>
> write.table(entrezid2, file = "Annotated_Ids.csv",quote = FALSE, sep = 
> ",", eol = "\n")
>
> Thank you again for your advice
>
>
> 2012/11/14 Hervé Pagès <hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>
>     Hi all,
>
>
>     On 11/12/2012 10:24 AM, Valerie Obenchain wrote:
>
>         Hi Jose,
>
>         Here is a slightly different approach.
>
>         library(Mus.musculus)
>         library(TxDb.Mmusculus.UCSC.mm9.knownGene)
>         txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene
>
>         ## I assume you've figured out how to read your data into a
>         GRanges.
>         ## Here we use a small example.
>         gr <- GRanges(seq = "chr17", IRanges(start = 31245606, width =
>         20))
>
>         ## The locateVariants() function in the VariantAnnotation
>         package will
>         ## give you the GENEIDs for all ranges in 'query'. If the
>         range does not
>         ## fall in a gene (i.e., it is intergenic), then the PRECEDEID and
>         ## FOLLOWID are provided. The LOCATION column indicates what
>         ## region the range fell in. See ?locateVariants for details
>         on the
>         ## different regions and the ability to set 'upstream' and
>         'downstream'
>         ## values for promoter regions.
>         library(VariantAnnotation)
>         loc <- locateVariants(query=gr, subject=txdb,
>         region=AllVariants())
>
>         > loc
>         GRanges with 1 range and 7 metadata columns:
>                seqnames               ranges strand | LOCATION  
>         QUERYID      TXID
>         <Rle> <IRanges> <Rle> | <factor> <integer> <integer>
>            [1]    chr17 [31245606, 31245625]      * |   coding        
>         1     47716
>                    CDSID      GENEID   PRECEDEID    FOLLOWID
>         <integer> <character> <character> <character>
>            [1]    184246       11307 <NA> <NA>
>
>
>     Nice! So IIUC locateVariants() is not intrinsically for variants but
>     seems to be generally useful for "locating" any set of genomic
>     positions. Shouldn't we have this (or something similar) in
>     GenomicFeatures (probably with a different name)?
>
>     Cheers,
>     H.
>
>
>
>
>         ## Use the select() function to map these GENEID's to the other
>         ## values you are interested in. The GENEID's from
>         locateVariants()
>         ## are Entrez ID's so we use those as our 'keytype'.
>         entrezid <- loc$GENEID
>         select(Mus.musculus, keytype="ENTREZID", keys=entrezid,
>         cols=c("GENENAME", "ENSEMBL"))
>         > select(Mus.musculus, keytype="ENTREZID", keys=entrezid,
>         cols=c("GENENAME", "ENSEMBL"))
>            ENTREZID            ENSEMBL
>         1    11307 ENSMUSG00000024030
>                                                        GENENAME
>         1 ATP-binding cassette, sub-family G (WHITE), member 1
>
>
>         Valerie
>
>
>
>         On 11/12/12 06:59, José Luis Lavín wrote:
>
>             Hello all,
>
>             First of all I want to thank everybody that gave me advice
>             on this
>             subject.
>             trying to follow the advice, I added some modifications
>             mixing codes from
>             Tim,  Harris and James , but it seems I got lost somewhere
>             in between...
>             I'm sorry for bothering you all again, but I'm afraid I
>             need some more
>             help.
>
>             I need to read my ids.txt file and then iterate use each
>             line of id
>             (chr.position) to perform the annotation process with it.
>             I thought of a
>             "for" loop to achieve it, but I do not seem to catch the
>             essence of R
>             processes and I guess I miss my tryout.
>             Please have a look at my "disaster" and help me with it If
>             such thing is
>             possible...
>
>             biocLite('Mus.musculus')
>             require(Mus.musculus)
>             require(TxDb.Mmusculus.UCSC.mm9.knownGene)
>             txdb<- transcriptsBy(TxDb.Mmusculus.UCSC.mm9.knownGene,
>             'gene')
>             egid<- names(txdb)
>             name<- unlist(mget(egid, org.Mm.egSYMBOL, ifnotfound=NA))
>             length(name) == length(egid) ## TRUE, OK to use
>             esbl<- unlist(mget(egid, org.Mm.egENSEMBL, ifnotfound=NA))
>             length(esbl) == length(egid) ## FALSE, do not use
>
>             #read input table
>             coordTable<- read.delim(/path_to_dir/ids.txt, sep=".",
>             header=FALSE,
>             as.is <http://as.is>
>             =TRUE)
>
>             for(i in 1:length(coordTable))
>             {
>             probes<- GRanges(seq= "y[,1]", IRanges(start = y[,2],
>             width = 1),
>                               strand='*')
>             }
>
>             genome(probes)<- 'mm9' ## prevents some stoopid mistakes
>
>             geneBodyProbes<- findOverlaps(probes, txdb)
>             geneBodyProbes
>
>             write.table
>             (geneBodyProbes,file="/path_to_dir/trans_id.tsv",quote=FALSE,sep="\t")
>
>             ## Hits of length 1
>             ## queryLength: 1
>             ## subjectLength: 21761
>             ##   queryHits subjectHits
>             ##<integer> <integer>
>             ##     1         1        1641
>             ##
>
>             name[subjectHits(geneBodyProbes)]
>
>             ##   11307  # egid
>             ## "Abcg1"  # name
>             ##
>
>             org.Mm.egCHRLOC[['11307']]
>
>             ##       17
>             ## 31057694
>             ##
>
>             org.Mm.egENSEMBL[['11307']]
>
>             ## [1] "ENSMUSG00000024030"
>
>             promotersByGene<- flank(txdb, 1500) # or however many
>             bases you want
>             promotersByGene<- flank(promotersByGene, 200, FALSE) # a
>             little extra
>             promoterProbes<- findOverlaps(probes, promotersByGene)
>             promoterProbes
>
>             write.table
>             (promoterProbes,file="/path_to_dir/promo_trans_id.tsv",quote=FALSE,sep="\t")
>
>
>             ## Hits of length 0
>             ## queryLength: 1
>             ## subjectLength: 21761
>             ##
>             ## read the GRanges and GenomicFeatures vignettes for more
>
>             write.table
>             (geneBodyProbes,file="/path_to_dir/trans_id.tsv",quote=FALSE,sep="\t")
>
>
>             *Thanks in advance*
>
>             JL
>
>             2012/11/8 Harris A. Jaffee<hj at jhu.edu <mailto:hj at jhu.edu>>
>
>                 On the elementary end of all this...
>
>                 If the sites are on a *file*, one per line, you could do
>
>                          y<- read.delim(filename, sep=".",
>                 header=FALSE, as.is <http://as.is>=TRUE)
>
>                 etc.
>
>                 On Nov 8, 2012, at 11:38 AM, James W. MacDonald wrote:
>
>                     Hi Jose,
>
>                     On 11/8/2012 10:28 AM, José Luis Lavín wrote:
>
>                         Dear James,
>
>                         To answer your questions swiftly, the features
>                         are methylation sites
>
>                 (that's why I only have a coordinate instead of having
>                 a Start/End
>                 pair) in
>                 mouse (mm9) genome and I have a list of those in the
>                 following format:
>
>                         chr17.31246506
>
>                         How could I read the list so that I can input
>                         the "chr" and the
>
>                 "coordinate" parameters into the expression you
>                 recommended?
>
>                     First you need to split your data to chr and pos.
>
>                     chr.pos<- do.call("rbind", strsplit(<your vector
>                     of chr17.pos data>,
>
>                 "\\."))
>
>                     Note that your vector of chr.pos data may have
>                     been in as a factor, so
>
>                 you may need to wrap with an as.character(). If you
>                 don't know what I
>                 mean
>                 by that, you should first read 'An Introduction to R' (
>                 http://cran.r-project.org/doc/manuals/R-intro.html).
>                 That will be time
>                 well spent.
>
>                         x<- GRanges(seq = "chr17", IRanges(start =
>                         31245606, width = 1))
>
>                     Both the seqnames and ranges argument to GRanges()
>                     can be based on
>
>                 vectors. So if you had a matrix (called 'y') like
>
>                     chr16    3423543
>                     chr3    403992
>                     chr18    3494202
>
>                     then you can do
>
>                     x<- GRanges(y[,1], IRanges(start = y[,2], width = 1))
>
>                     see ?GRanges for more info.
>
>                     As a side note, IIRC, methylation sites are not in
>                     general found in
>
>                 exons, but are more likely to be found upstream of a
>                 given gene. You
>                 might
>                 then want to use fiveUTRsByTranscript() rather than
>                 exonsBy(). See
>                 ?exonsBy
>                 after loading the GenomicFeatures package.
>
>                         I 'm lookin forward to obtain a table where my
>                         "coordinate-based Id"
>
>                 appears in a column, and the gene_name and if
>                 possible, the
>                 Entrez/Ensembl
>                 Ids in two other columns :
>
>                         Coordinate Gene_name  Entrez_ID  Ensembl_ID
>
>                         Is it easy to do this in R?
>
>                     Of course! Everything is easy to do in R. You
>                     should see my sweet R
>
>                 toast 'n coffee maker ;-D
>
>                     But you should note that the fiveUTRByTranscript()
>                     function is
>
>                 transcript based (obvs), and so you will have multiple
>                 transcripts per
>                 gene. This makes things much more difficult, as a
>                 given methylation site
>                 may overlap multiple transcripts of the same gene. So
>                 that makes it
>                 hard to
>                 merge your original data with the overlapping transcripts.
>
>                     You could do something like
>
>                     ex<-
>                     fiveUTRsByTranscript(TxDb.Mmusculus.UCSC.mm10.knownGene,
>                     use.names
>
>                 = TRUE)
>
>                     ex2<- do.call("rbind", sapply(ex[ex %in% x],
>                     elementMetadata))$exon_id
>
>                     then you could use unique Gene IDs thusly
>
>                     my.trans<- select(Mus.musculus,
>                     unique(as.character(ex2)),
>
>                 c("CHR","CHRLOC","ENTREZID","ENSEMBL"), "ENTREZID")
>
>                     That should at least give you a start. See where
>                     you can go on your
>                     own,
>
>                 and let us know if you get stuck.
>
>                     Best,
>
>                     Jim
>
>
>
>                         I'm still really new to R and I lack many
>                         concepts you may consider
>
>                 basic... I'm awfully sorry
>
>                         Best
>
>                         JL
>
>                         2012/11/8 James W. MacDonald<jmacdon at uw.edu
>                         <mailto:jmacdon at uw.edu><mailto:jmacdon at uw.edu
>                         <mailto:jmacdon at uw.edu>>>
>
>                             Hi Jose,
>
>
>                             On 11/8/2012 8:19 AM, José Luis Lavín wrote:
>
>                                 Dear Bioconductor list,
>
>                                 I write you this email asking for a
>                         Bioconductor module that
>                                 allows me to
>                                 annotate genomic coordinates and get
>                         different GeneIds.
>                                 I'll show you an example of what I'm
>                         referring to:
>
>                                 I have this data:
>                                 Chromosome  coordinate
>                                 chr17              31246506
>
>
>                             It depends on what that coordinate is. Is
>                         it the start of a
>                             transcript? A SNP? Do you really just have
>                         a single coordinate, or
>                             do you have a range? What species are we
>                         talking about here?
>
>                             Depending on what your data are, you might
>                         want to use either one
>                             of the TxDb packages, or a SNPlocs
>                         package. There really isn't
>                             much to go on here. If I assume this is a
>                         coordinate that one
>                             might think is within an exon, and if I
>                         further assume you are
>                             working with H. sapiens, I could suggest
>                         something like
>
>                             library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>                             ex<-
>                         exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")
>
>                             x<- GRanges(seq = "chr17", IRanges(start =
>                         31245606, width = 1))
>
>                             ex[ex %in% x]
>
>                             or maybe more appropriately
>
>                             names(ex)[ex %in% x]
>
>                             which will give you the Gene ID, and you
>                         can go from there using
>                             the org.Hs.eg.db package.
>
>                             If however, your coordinate isn't in an
>                         exon, but might be in a
>                             UTR, you can look at ?exonsBy to see what
>                         other sequences you can
>                             extract to compare with.
>
>                             If these are SNPs, then you can look at
>                         the help pages for the
>                             relevant SNPlocs package.
>
>                             Best,
>
>                             Jim
>
>
>
>                                 which can also be written this way by
>                         the program that yielded
>                                 the result:
>                                 chr17.31246506
>
>                                 And I need to convert this data into a
>                         gene name and known
>                                 gene Ids, such
>                                 as:
>
>                                 Gene name  Entrez_ID  Ensembl_ID
>
>                                 Tff3 NM_011575 20050
>                                 Can you please advice me about a
>                         module able to perform this
>                                 ID conversion
>                                 using a list of  "chr17.31246506" type
>                         coordinates as input?
>
>                                 Thanks in advance
>
>                                 With best wishes
>
>
>
>                                
>                         _______________________________________________
>                                 Bioconductor mailing list
>                         Bioconductor at r-project.org
>                         <mailto:Bioconductor at r-project.org><mailto:Bioconductor at r-project.org
>                         <mailto:Bioconductor at r-project.org>>
>                         https://stat.ethz.ch/mailman/listinfo/bioconductor
>                                 Search the archives:
>
>                 http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>                             --     James W. MacDonald, M.S.
>                             Biostatistician
>                             University of Washington
>                             Environmental and Occupational Health Sciences
>                             4225 Roosevelt Way NE, # 100
>                             Seattle WA 98105-6099
>
>
>
>
>                         --
>                         --
>                         Dr. José Luis Lavín Trueba
>
>                         Dpto. de Producción Agraria
>                         Grupo de Genética y Microbiología
>                         Universidad Pública de Navarra
>                         31006 Pamplona
>                         Navarra
>                         SPAIN
>
>                     --
>                     James W. MacDonald, M.S.
>                     Biostatistician
>                     University of Washington
>                     Environmental and Occupational Health Sciences
>                     4225 Roosevelt Way NE, # 100
>                     Seattle WA 98105-6099
>
>                     _______________________________________________
>                     Bioconductor mailing list
>                     Bioconductor at r-project.org
>                     <mailto:Bioconductor at r-project.org>
>                     https://stat.ethz.ch/mailman/listinfo/bioconductor
>                     Search the archives:
>
>                 http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>
>
>
>             _______________________________________________
>             Bioconductor mailing list
>             Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>             https://stat.ethz.ch/mailman/listinfo/bioconductor
>             Search the archives:
>             http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>         _______________________________________________
>         Bioconductor mailing list
>         Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>         https://stat.ethz.ch/mailman/listinfo/bioconductor
>         Search the archives:
>         http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>     -- 
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
>
>
> -- 
> -- 
> Dr. José Luis Lavín Trueba
>
> Dpto. de Producción Agraria
> Grupo de Genética y Microbiología
> Universidad Pública de Navarra
> 31006 Pamplona
> Navarra
> SPAIN