[BioC] How to annotate genomic coordinates

Harris A. Jaffee hj at jhu.edu
Mon Nov 12 18:17:19 CET 2012


You don't say what the disaster was, but I'll try to push you in a better direction.

On Nov 12, 2012, at 9:59 AM, José Luis Lavín wrote:
Hello all,
> 
> First of all I want to thank everybody that gave me advice on this subject. trying to follow the advice, I added some modifications mixing codes from Tim,  Harris and James , but it seems I got lost somewhere in between...
> I'm sorry for bothering you all again, but I'm afraid I need some more help.
> 
> I need to read my ids.txt file and then iterate use each line of id (chr.position) to perform the annotation process with it. I thought of a "for" loop to achieve it, but I do not seem to catch the essence of R processes and I guess I miss my tryout. 
> Please have a look at my "disaster" and help me with it If such thing is possible...
> 
> biocLite('Mus.musculus')
> require(Mus.musculus)
> require(TxDb.Mmusculus.UCSC.mm9.knownGene)
> txdb <- transcriptsBy(TxDb.Mmusculus.UCSC.mm9.knownGene, 'gene')
> egid <- names(txdb)
> name <- unlist(mget(egid, org.Mm.egSYMBOL, ifnotfound=NA))
> length(name) == length(egid) ## TRUE, OK to use
> esbl <- unlist(mget(egid, org.Mm.egENSEMBL, ifnotfound=NA))
> length(esbl) == length(egid) ## FALSE, do not use
> 
> #read input table
> coordTable <- read.delim(/path_to_dir/ids.txt, sep=".", header=FALSE, as.is=TRUE)
> 
> for(i in 1:length(coordTable))
> {
> probes<- GRanges(seq= "y[,1]", IRanges(start = y[,2], width = 1),
>                  strand='*')
> }

If this appears to be working at all, it is because it is using an object 'y'
from sometime earlier, when it should be using coordTable, obtained from the
read.delim call.  You don't need to loop, and if you did, it would probably be
over 1:nrow(coordTable).  (Examine length(coordTable) to see what it is.)

I believe you can just do

	y <- coordTable <- read.delim(/path_to_dir/ids.txt, sep=".", header=FALSE, as.is=TRUE)
	probes<- GRanges(seqnames=y[,1], ranges=IRanges(start=y[,2], width=1))

strand defaults to "*"; see ?GRanges

> genome(probes) <- 'mm9' ## prevents some stoopid mistakes
> 
> geneBodyProbes <- findOverlaps(probes, txdb)
> geneBodyProbes
> 
> write.table (geneBodyProbes,file="/path_to_dir/trans_id.tsv",quote=FALSE,sep="\t")
> 
> ## Hits of length 1
> ## queryLength: 1
> ## subjectLength: 21761
> ##   queryHits subjectHits
> ##    <integer>   <integer>
> ##     1         1        1641
> ##
> 
> name[subjectHits(geneBodyProbes)]
> 
> ##   11307  # egid
> ## "Abcg1"  # name
> ##
> 
> org.Mm.egCHRLOC[['11307']]
> 
> ##       17
> ## 31057694
> ##
> 
> org.Mm.egENSEMBL[['11307']]
> 
> ## [1] "ENSMUSG00000024030"
> 
> promotersByGene <- flank(txdb, 1500) # or however many bases you want
> promotersByGene <- flank(promotersByGene, 200, FALSE) # a little extra
> promoterProbes <- findOverlaps(probes, promotersByGene)
> promoterProbes
> 
> write.table (promoterProbes,file="/path_to_dir/promo_trans_id.tsv",quote=FALSE,sep="\t")
> 
> ## Hits of length 0
> ## queryLength: 1
> ## subjectLength: 21761
> ##
> ## read the GRanges and GenomicFeatures vignettes for more
> 
> write.table (geneBodyProbes,file="/path_to_dir/trans_id.tsv",quote=FALSE,sep="\t")
> 
> 
> Thanks in advance
> 
> JL
> 
> 2012/11/8 Harris A. Jaffee <hj at jhu.edu>
> On the elementary end of all this...
> 
> If the sites are on a *file*, one per line, you could do
> 
>         y <- read.delim(filename, sep=".", header=FALSE, as.is=TRUE)
> 
> etc.
> 
> On Nov 8, 2012, at 11:38 AM, James W. MacDonald wrote:
> 
> > Hi Jose,
> >
> > On 11/8/2012 10:28 AM, José Luis Lavín wrote:
> >> Dear James,
> >>
> >> To answer your questions swiftly, the features are methylation sites (that's why I only have a coordinate instead of having a Start/End pair) in mouse (mm9) genome and I have a list of those in the following format:
> >>
> >> chr17.31246506
> >>
> >> How could I read the list so that I can input the "chr" and the "coordinate" parameters into the expression you recommended?
> >
> > First you need to split your data to chr and pos.
> >
> > chr.pos <- do.call("rbind", strsplit(<your vector of chr17.pos data>, "\\."))
> >
> > Note that your vector of chr.pos data may have been in as a factor, so you may need to wrap with an as.character(). If you don't know what I mean by that, you should first read 'An Introduction to R' (http://cran.r-project.org/doc/manuals/R-intro.html). That will be time well spent.
> >
> >>
> >> x <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1))
> >
> > Both the seqnames and ranges argument to GRanges() can be based on vectors. So if you had a matrix (called 'y') like
> >
> > chr16    3423543
> > chr3    403992
> > chr18    3494202
> >
> > then you can do
> >
> > x <- GRanges(y[,1], IRanges(start = y[,2], width = 1))
> >
> > see ?GRanges for more info.
> >
> > As a side note, IIRC, methylation sites are not in general found in exons, but are more likely to be found upstream of a given gene. You might then want to use fiveUTRsByTranscript() rather than exonsBy(). See ?exonsBy after loading the GenomicFeatures package.
> >
> >>
> >> I 'm lookin forward to obtain a table where my "coordinate-based Id" appears in a column, and the gene_name and if possible, the  Entrez/Ensembl Ids in two other columns :
> >>
> >> Coordinate Gene_name  Entrez_ID  Ensembl_ID
> >>
> >> Is it easy to do this in R?
> >
> > Of course! Everything is easy to do in R. You should see my sweet R toast 'n coffee maker ;-D
> >
> > But you should note that the fiveUTRByTranscript() function is transcript based (obvs), and so you will have multiple transcripts per gene. This makes things much more difficult, as a given methylation site may overlap multiple transcripts of the same gene. So that makes it hard to merge your original data with the overlapping transcripts.
> >
> > You could do something like
> >
> > ex <- fiveUTRsByTranscript(TxDb.Mmusculus.UCSC.mm10.knownGene, use.names = TRUE)
> > ex2 <- do.call("rbind", sapply(ex[ex %in% x], elementMetadata))$exon_id
> >
> > then you could use unique Gene IDs thusly
> >
> > my.trans <- select(Mus.musculus, unique(as.character(ex2)), c("CHR","CHRLOC","ENTREZID","ENSEMBL"), "ENTREZID")
> >
> > That should at least give you a start. See where you can go on your own, and let us know if you get stuck.
> >
> > Best,
> >
> > Jim
> >
> >
> >
> >>
> >> I'm still really new to R and I lack many concepts you may consider basic... I'm awfully sorry
> >>
> >> Best
> >>
> >> JL
> >>
> >> 2012/11/8 James W. MacDonald <jmacdon at uw.edu <mailto:jmacdon at uw.edu>>
> >>
> >>    Hi Jose,
> >>
> >>
> >>    On 11/8/2012 8:19 AM, José Luis Lavín wrote:
> >>
> >>        Dear Bioconductor list,
> >>
> >>        I write you this email asking for a Bioconductor module that
> >>        allows me to
> >>        annotate genomic coordinates and get different GeneIds.
> >>        I'll show you an example of what I'm referring to:
> >>
> >>        I have this data:
> >>        Chromosome  coordinate
> >>        chr17              31246506
> >>
> >>
> >>    It depends on what that coordinate is. Is it the start of a
> >>    transcript? A SNP? Do you really just have a single coordinate, or
> >>    do you have a range? What species are we talking about here?
> >>
> >>    Depending on what your data are, you might want to use either one
> >>    of the TxDb packages, or a SNPlocs package. There really isn't
> >>    much to go on here. If I assume this is a coordinate that one
> >>    might think is within an exon, and if I further assume you are
> >>    working with H. sapiens, I could suggest something like
> >>
> >>    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> >>    ex <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")
> >>
> >>    x <- GRanges(seq = "chr17", IRanges(start = 31245606, width = 1))
> >>
> >>    ex[ex %in% x]
> >>
> >>    or maybe more appropriately
> >>
> >>    names(ex)[ex %in% x]
> >>
> >>    which will give you the Gene ID, and you can go from there using
> >>    the org.Hs.eg.db package.
> >>
> >>    If however, your coordinate isn't in an exon, but might be in a
> >>    UTR, you can look at ?exonsBy to see what other sequences you can
> >>    extract to compare with.
> >>
> >>    If these are SNPs, then you can look at the help pages for the
> >>    relevant SNPlocs package.
> >>
> >>    Best,
> >>
> >>    Jim
> >>
> >>
> >>
> >>        which can also be written this way by the program that yielded
> >>        the result:
> >>        chr17.31246506
> >>
> >>        And I need to convert this data into a gene name and known
> >>        gene Ids, such
> >>        as:
> >>
> >>        Gene name  Entrez_ID  Ensembl_ID
> >>
> >>        Tff3 NM_011575 20050
> >>        Can you please advice me about a module able to perform this
> >>        ID conversion
> >>        using a list of  "chr17.31246506" type coordinates as input?
> >>
> >>        Thanks in advance
> >>
> >>        With best wishes
> >>
> >>
> >>
> >>        _______________________________________________
> >>        Bioconductor mailing list
> >>        Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> >>        https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>        Search the archives:
> >>        http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>
> >>
> >>    --     James W. MacDonald, M.S.
> >>    Biostatistician
> >>    University of Washington
> >>    Environmental and Occupational Health Sciences
> >>    4225 Roosevelt Way NE, # 100
> >>    Seattle WA 98105-6099
> >>
> >>
> >>
> >>
> >> --
> >> --
> >> Dr. José Luis Lavín Trueba
> >>
> >> Dpto. de Producción Agraria
> >> Grupo de Genética y Microbiología
> >> Universidad Pública de Navarra
> >> 31006 Pamplona
> >> Navarra
> >> SPAIN
> >
> > --
> > James W. MacDonald, M.S.
> > Biostatistician
> > University of Washington
> > Environmental and Occupational Health Sciences
> > 4225 Roosevelt Way NE, # 100
> > Seattle WA 98105-6099
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> 
> 
> 
> -- 
> -- 
> Dr. José Luis Lavín Trueba
> 
> Dpto. de Producción Agraria
> Grupo de Genética y Microbiología
> Universidad Pública de Navarra
> 31006 Pamplona
> Navarra
> SPAIN



More information about the Bioconductor mailing list