[BioC] Genome position to miRNA or gene name

Thu Jan 22 06:27:39 CET 2009

Steve Lianoglou <mailinglist.honeypot at gmail.com> writes:

> Hi,
>
>> I have got a set of human genome locations that I have found using
>> Biostrings and BSGenome alignment e.g.
>>
>> seqname	start	        end	        strand	patternID
>> chr9	95978065	95978085	+	TGAGGTAGTAGGTTGTATAGT
>> chr11	121522487	121522507	-	TGAGGTAGTAGGTTGTATAGT
>> chr22	44887296	44887316	+	TGAGGTAGTAGGTTGTATAGT
>> chr22	44888235	44888256	+	TGAGGTAGTAGGTTGTGTGGTT
>>
>> What I would like to know is whether this genome location is within a
>> known miRNA or gene.  What would the best way be to go about this?
>
>
> One way could be to grab the appropriate GTF file for Hsapiens here:
>
> ftp://ftp.ensembl.org/pub/current_gtf/
>
> It's just a tab delimited file with genome annotations. You can just
> collect the lines for miRNA annotations and see if your positions fall
> in the bounds of the known/annotated miRNA's.
>
> For example, here's one:
>
> 18   miRNA   exon   38162   38272   .   +   .   gene_id
> "ENSG00000221441" ...
>
> You can also get miRNA data from miRBase
> (http://microrna.sanger.ac.uk/sequences/ ) perhaps it's a more
> complete set of data?). The data file I have  from them doesn't have
> chromosome positions, but does have the stem- 
> loop sequence, so that would require an intermediate alignment step
> before getting at what you're after.
>
> Probably not the best way, but a way nonetheless. If you're
> comfortable using biomaRt, I'm sure there's a way to pull down the
> same annotation info and do the comparison from there.
>
> Sorry ... no R code, but hopefully it's helpful nonetheless :-)

Very roughly, this

  fl <- "ftp://ftp.sanger.ac.uk/pub/mirbase/sequences/CURRENT/genomes/hsa.gff"

seems to be the Human miRNA data base from the Sanger. It parses in to
an R data frame with

  gff <- read.table(fl)

(hey, that's cool -- pulling the data directly from ftp!). The fourth
and fifth columns are chromosomal positions; there is also information
on chromosome (column 1) and strand (column 7).

My strategy would be to loop over your aligned sequences by chromosome
and strand, and for each subset construct an IRanges object (from the
IRanges package) that contains the start and end position of all
sequenes. Suppose we have the 'start' and 'end' of each sequence on
chromosome 1

  seqs <- IRanges(start, end)

and the same for the gff data

  miRNAs <- with(gff[gff$V1 == "1" & gff$V7 == "+",], IRanges(V4, V5))

Then use 'overlap' from IRanges. Along the lines of

  overlap(miRNAs, seqs, multiple=FALSE)

A little messy to keep track of everything, but should be
computationally quite efficient, even for very large numbers of
sequences. I think also that import.gff (in rtracklayer) and the
alignment data structures in Biostrings are all based on the same
classes as the IRanges, and that with a little bit of cleverness the
software might do all the book-keeping for you. 

Martin

> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Physiology, Biophysics and Systems Biology
> Weill Medical College of Cornell University
>
> http://cbio.mskcc.org/~lianos
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793