[BioC] How to annotate genomic coordinates

Hervé Pagès hpages at fhcrc.org
Wed Nov 14 02:32:58 CET 2012


Hi all,

On 11/12/2012 10:24 AM, Valerie Obenchain wrote:
> Hi Jose,
>
> Here is a slightly different approach.
>
> library(Mus.musculus)
> library(TxDb.Mmusculus.UCSC.mm9.knownGene)
> txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene
>
> ## I assume you've figured out how to read your data into a GRanges.
> ## Here we use a small example.
> gr <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 20))
>
> ## The locateVariants() function in the VariantAnnotation package will
> ## give you the GENEIDs for all ranges in 'query'. If the range does not
> ## fall in a gene (i.e., it is intergenic), then the PRECEDEID and
> ## FOLLOWID are provided. The LOCATION column indicates what
> ## region the range fell in. See ?locateVariants for details on the
> ## different regions and the ability to set 'upstream' and 'downstream'
> ## values for promoter regions.
> library(VariantAnnotation)
> loc <- locateVariants(query=gr, subject=txdb, region=AllVariants())
>
>  > loc
> GRanges with 1 range and 7 metadata columns:
>        seqnames               ranges strand | LOCATION   QUERYID      TXID
> <Rle> <IRanges> <Rle> | <factor> <integer> <integer>
>    [1]    chr17 [31245606, 31245625]      * |   coding         1     47716
>            CDSID      GENEID   PRECEDEID    FOLLOWID
> <integer> <character> <character> <character>
>    [1]    184246       11307 <NA> <NA>

Nice! So IIUC locateVariants() is not intrinsically for variants but
seems to be generally useful for "locating" any set of genomic
positions. Shouldn't we have this (or something similar) in
GenomicFeatures (probably with a different name)?

Cheers,
H.

>
>
> ## Use the select() function to map these GENEID's to the other
> ## values you are interested in. The GENEID's from locateVariants()
> ## are Entrez ID's so we use those as our 'keytype'.
> entrezid <- loc$GENEID
> select(Mus.musculus, keytype="ENTREZID", keys=entrezid,
> cols=c("GENENAME", "ENSEMBL"))
>  > select(Mus.musculus, keytype="ENTREZID", keys=entrezid,
> cols=c("GENENAME", "ENSEMBL"))
>    ENTREZID            ENSEMBL
> 1    11307 ENSMUSG00000024030
>                                                GENENAME
> 1 ATP-binding cassette, sub-family G (WHITE), member 1
>
>
> Valerie
>
>
>
> On 11/12/12 06:59, José Luis Lavín wrote:
>> Hello all,
>>
>> First of all I want to thank everybody that gave me advice on this
>> subject.
>> trying to follow the advice, I added some modifications mixing codes from
>> Tim,  Harris and James , but it seems I got lost somewhere in between...
>> I'm sorry for bothering you all again, but I'm afraid I need some more
>> help.
>>
>> I need to read my ids.txt file and then iterate use each line of id
>> (chr.position) to perform the annotation process with it. I thought of a
>> "for" loop to achieve it, but I do not seem to catch the essence of R
>> processes and I guess I miss my tryout.
>> Please have a look at my "disaster" and help me with it If such thing is
>> possible...
>>
>> biocLite('Mus.musculus')
>> require(Mus.musculus)
>> require(TxDb.Mmusculus.UCSC.mm9.knownGene)
>> txdb<- transcriptsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, 'gene')
>> egid<- names(txdb)
>> name<- unlist(mget(egid, org.Mm.egSYMBOL, ifnotfound=NA))
>> length(name) == length(egid) ## TRUE, OK to use
>> esbl<- unlist(mget(egid, org.Mm.egENSEMBL, ifnotfound=NA))
>> length(esbl) == length(egid) ## FALSE, do not use
>>
>> #read input table
>> coordTable<- read.delim(/path_to_dir/ids.txt, sep=".", header=FALSE,
>> as.is
>> =TRUE)
>>
>> for(i in 1:length(coordTable))
>> {
>> probes<- GRanges(seq= "y[,1]", IRanges(start = y[,2], width = 1),
>>                   strand='*')
>> }
>>
>> genome(probes)<- 'mm9' ## prevents some stoopid mistakes
>>
>> geneBodyProbes<- findOverlaps(probes, txdb)
>> geneBodyProbes
>>
>> write.table
>> (geneBodyProbes,file="/path_to_dir/trans_id.tsv",quote=FALSE,sep="\t")
>>
>> ## Hits of length 1
>> ## queryLength: 1
>> ## subjectLength: 21761
>> ##   queryHits subjectHits
>> ##<integer>    <integer>
>> ##     1         1        1641
>> ##
>>
>> name[subjectHits(geneBodyProbes)]
>>
>> ##   11307  # egid
>> ## "Abcg1"  # name
>> ##
>>
>> org.Mm.egCHRLOC[['11307']]
>>
>> ##       17
>> ## 31057694
>> ##
>>
>> org.Mm.egENSEMBL[['11307']]
>>
>> ## [1] "ENSMUSG00000024030"
>>
>> promotersByGene<- flank(txdb, 1500) # or however many bases you want
>> promotersByGene<- flank(promotersByGene, 200, FALSE) # a little extra
>> promoterProbes<- findOverlaps(probes, promotersByGene)
>> promoterProbes
>>
>> write.table
>> (promoterProbes,file="/path_to_dir/promo_trans_id.tsv",quote=FALSE,sep="\t")
>>
>>
>> ## Hits of length 0
>> ## queryLength: 1
>> ## subjectLength: 21761
>> ##
>> ## read the GRanges and GenomicFeatures vignettes for more
>>
>> write.table
>> (geneBodyProbes,file="/path_to_dir/trans_id.tsv",quote=FALSE,sep="\t")
>>
>>
>> *Thanks in advance*
>>
>> JL
>>
>> 2012/11/8 Harris A. Jaffee<hj at jhu.edu>
>>
>>> On the elementary end of all this...
>>>
>>> If the sites are on a *file*, one per line, you could do
>>>
>>>          y<- read.delim(filename, sep=".", header=FALSE, as.is=TRUE)
>>>
>>> etc.
>>>
>>> On Nov 8, 2012, at 11:38 AM, James W. MacDonald wrote:
>>>
>>>> Hi Jose,
>>>>
>>>> On 11/8/2012 10:28 AM, José Luis Lavín wrote:
>>>>> Dear James,
>>>>>
>>>>> To answer your questions swiftly, the features are methylation sites
>>> (that's why I only have a coordinate instead of having a Start/End
>>> pair) in
>>> mouse (mm9) genome and I have a list of those in the following format:
>>>>> chr17.31246506
>>>>>
>>>>> How could I read the list so that I can input the "chr" and the
>>> "coordinate" parameters into the expression you recommended?
>>>> First you need to split your data to chr and pos.
>>>>
>>>> chr.pos<- do.call("rbind", strsplit(<your vector of chr17.pos data>,
>>> "\\."))
>>>> Note that your vector of chr.pos data may have been in as a factor, so
>>> you may need to wrap with an as.character(). If you don't know what I
>>> mean
>>> by that, you should first read 'An Introduction to R' (
>>> http://cran.r-project.org/doc/manuals/R-intro.html). That will be time
>>> well spent.
>>>>> x<- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1))
>>>> Both the seqnames and ranges argument to GRanges() can be based on
>>> vectors. So if you had a matrix (called 'y') like
>>>> chr16    3423543
>>>> chr3    403992
>>>> chr18    3494202
>>>>
>>>> then you can do
>>>>
>>>> x<- GRanges(y[,1], IRanges(start = y[,2], width = 1))
>>>>
>>>> see ?GRanges for more info.
>>>>
>>>> As a side note, IIRC, methylation sites are not in general found in
>>> exons, but are more likely to be found upstream of a given gene. You
>>> might
>>> then want to use fiveUTRsByTranscript() rather than exonsBy(). See
>>> ?exonsBy
>>> after loading the GenomicFeatures package.
>>>>> I 'm lookin forward to obtain a table where my "coordinate-based Id"
>>> appears in a column, and the gene_name and if possible, the
>>> Entrez/Ensembl
>>> Ids in two other columns :
>>>>> Coordinate Gene_name  Entrez_ID  Ensembl_ID
>>>>>
>>>>> Is it easy to do this in R?
>>>> Of course! Everything is easy to do in R. You should see my sweet R
>>> toast 'n coffee maker ;-D
>>>> But you should note that the fiveUTRByTranscript() function is
>>> transcript based (obvs), and so you will have multiple transcripts per
>>> gene. This makes things much more difficult, as a given methylation site
>>> may overlap multiple transcripts of the same gene. So that makes it
>>> hard to
>>> merge your original data with the overlapping transcripts.
>>>> You could do something like
>>>>
>>>> ex<- fiveUTRsByTranscript(TxDb.Mmusculus.UCSC.mm10.knownGene, use.names
>>> = TRUE)
>>>> ex2<- do.call("rbind", sapply(ex[ex %in% x], elementMetadata))$exon_id
>>>>
>>>> then you could use unique Gene IDs thusly
>>>>
>>>> my.trans<- select(Mus.musculus, unique(as.character(ex2)),
>>> c("CHR","CHRLOC","ENTREZID","ENSEMBL"), "ENTREZID")
>>>> That should at least give you a start. See where you can go on your
>>>> own,
>>> and let us know if you get stuck.
>>>> Best,
>>>>
>>>> Jim
>>>>
>>>>
>>>>
>>>>> I'm still really new to R and I lack many concepts you may consider
>>> basic... I'm awfully sorry
>>>>> Best
>>>>>
>>>>> JL
>>>>>
>>>>> 2012/11/8 James W. MacDonald<jmacdon at uw.edu<mailto:jmacdon at uw.edu>>
>>>>>
>>>>>     Hi Jose,
>>>>>
>>>>>
>>>>>     On 11/8/2012 8:19 AM, José Luis Lavín wrote:
>>>>>
>>>>>         Dear Bioconductor list,
>>>>>
>>>>>         I write you this email asking for a Bioconductor module that
>>>>>         allows me to
>>>>>         annotate genomic coordinates and get different GeneIds.
>>>>>         I'll show you an example of what I'm referring to:
>>>>>
>>>>>         I have this data:
>>>>>         Chromosome  coordinate
>>>>>         chr17              31246506
>>>>>
>>>>>
>>>>>     It depends on what that coordinate is. Is it the start of a
>>>>>     transcript? A SNP? Do you really just have a single coordinate, or
>>>>>     do you have a range? What species are we talking about here?
>>>>>
>>>>>     Depending on what your data are, you might want to use either one
>>>>>     of the TxDb packages, or a SNPlocs package. There really isn't
>>>>>     much to go on here. If I assume this is a coordinate that one
>>>>>     might think is within an exon, and if I further assume you are
>>>>>     working with H. sapiens, I could suggest something like
>>>>>
>>>>>     library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>>>     ex<- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")
>>>>>
>>>>>     x<- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1))
>>>>>
>>>>>     ex[ex %in% x]
>>>>>
>>>>>     or maybe more appropriately
>>>>>
>>>>>     names(ex)[ex %in% x]
>>>>>
>>>>>     which will give you the Gene ID, and you can go from there using
>>>>>     the org.Hs.eg.db package.
>>>>>
>>>>>     If however, your coordinate isn't in an exon, but might be in a
>>>>>     UTR, you can look at ?exonsBy to see what other sequences you can
>>>>>     extract to compare with.
>>>>>
>>>>>     If these are SNPs, then you can look at the help pages for the
>>>>>     relevant SNPlocs package.
>>>>>
>>>>>     Best,
>>>>>
>>>>>     Jim
>>>>>
>>>>>
>>>>>
>>>>>         which can also be written this way by the program that yielded
>>>>>         the result:
>>>>>         chr17.31246506
>>>>>
>>>>>         And I need to convert this data into a gene name and known
>>>>>         gene Ids, such
>>>>>         as:
>>>>>
>>>>>         Gene name  Entrez_ID  Ensembl_ID
>>>>>
>>>>>         Tff3 NM_011575 20050
>>>>>         Can you please advice me about a module able to perform this
>>>>>         ID conversion
>>>>>         using a list of  "chr17.31246506" type coordinates as input?
>>>>>
>>>>>         Thanks in advance
>>>>>
>>>>>         With best wishes
>>>>>
>>>>>
>>>>>
>>>>>         _______________________________________________
>>>>>         Bioconductor mailing list
>>>>>         Bioconductor at r-project.org<mailto:Bioconductor at r-project.org>
>>>>>         https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>         Search the archives:
>>>>>
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>>     --     James W. MacDonald, M.S.
>>>>>     Biostatistician
>>>>>     University of Washington
>>>>>     Environmental and Occupational Health Sciences
>>>>>     4225 Roosevelt Way NE, # 100
>>>>>     Seattle WA 98105-6099
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> --
>>>>> Dr. José Luis Lavín Trueba
>>>>>
>>>>> Dpto. de Producción Agraria
>>>>> Grupo de Genética y Microbiología
>>>>> Universidad Pública de Navarra
>>>>> 31006 Pamplona
>>>>> Navarra
>>>>> SPAIN
>>>> --
>>>> James W. MacDonald, M.S.
>>>> Biostatistician
>>>> University of Washington
>>>> Environmental and Occupational Health Sciences
>>>> 4225 Roosevelt Way NE, # 100
>>>> Seattle WA 98105-6099
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>
>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list