[BioC] genes in region of miRNA genes

Mon Jun 8 15:18:47 CEST 2009

Hi list

I have some data on the chromosome, start and end positions of some microRNAs of interest:

miR         chromosome start     end
hsa-mir-572 17         10979549  10979643
hsa-mir-583 18         95440598  95440672
hsa-mir-587 19         107338693 107338788
hsa-mir-598 21         10930126  10930222
hsa-mir-599 21         100618040 100618134
hsa-mir-210 3          558089    558198
hsa-mir-141 4          6943521   6943615
hsa-mir-492 4          93752305  93752420
hsa-mir-639 11         14501355  14501452
hsa-mir-663 13         26136822  26136914
hsa-mir-503 24         133508024 133508094

I was hoping to use biomaRt to extract information for genes upstream and downstream of these miRNAs (see script below).

I have created a list in the correct form for a multi filter query using biomaRt but the following query only retrieves data for chromosome 17. I gather that looping over data is discouraged for biomaRt (presumably to prevent overloading servers) and I was wondering if there was a better way of doing this.

In the following script the allMirs table is the result of:

allMirs <- "ftp://ftp.sanger.ac.uk/pub/mirbase/sequences/CURRENT/genomes/hsa.gff"

allMirs<-read.table(allMirs)

I did, however, tidy the data outside R, removing some extraneous columns (mainly those containing only full stops) and adding column names.

The 'miRsUpInFlu.txt' table is that above.
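
For reference, a minimal sketch of how the column names could be added inside R rather than by hand. The names follow the generic nine-column GFF layout and are my own labels, and the toy lines are made up for illustration, not taken from miRBase. It also shows why it may matter to read the chromosome as text: if it comes in as a factor, cbind() keeps the integer level codes rather than the chromosome names.

```r
# Column labels for a nine-column GFF file (my own naming, not from miRBase).
gff_cols <- c("seqname", "source", "feature", "start", "end",
              "score", "strand", "frame", "attributes")

# Made-up lines standing in for the real file.
toy <- c(
  "# toy header line, skipped as a comment",
  "2\t.\tmiRNA\t1000\t1090\t.\t+\t.\tID=toy-mir-1",
  "11\t.\tmiRNA\t2000\t2080\t.\t-\t.\tID=toy-mir-2",
  "X\t.\tmiRNA\t3000\t3095\t.\t+\t.\tID=toy-mir-3"
)

con <- textConnection(toy)
gff <- read.table(con, sep = "\t", quote = "", col.names = gff_cols,
                  colClasses = c("character", "character", "character",
                                 "integer", "integer", "character",
                                 "character", "character", "character"))
close(con)

# Chromosome read as text survives cbind() unchanged.
as_text <- cbind(gff$attributes, gff$seqname)[, 2]      # "2" "11" "X"

# The same column as a factor is replaced by its level codes.
chr_factor <- factor(gff$seqname)
as_codes <- cbind(gff$attributes, chr_factor)[, 2]      # "2" "1" "3"
```

With the real file, the textConnection would be replaced by the file name or URL; the empty columns can then be dropped by name.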

#get miR chromosome coords from biomaRt

rm(list=ls())

library(biomaRt)

#read in list of miRs

mirs<-read.table('miRsUpInFlu.txt', header=T, sep='\t')
mirs<-sub('R', 'r', as.character(mirs[,1])) #correct miR labels

allMirs<-read.table('miRbaseJune2009.txt', header=T, sep='\t')

mirRow<-which(as.character(allMirs$id) %in% mirs)

mirsData<-allMirs[mirRow,]

#minor miRs (e.g. the * forms) are missing

mirRow<-cbind(as.character(mirsData$id), mirsData[,2], mirsData[,4], mirsData[,5])

#now we have a character matrix containing the miR id, chromosome, start and stop
#we have to extend the start and stop sites by 500000 in each direction
#then retrieve genes in these regions

starts<-as.numeric(mirRow[,3])
stops<-as.numeric(mirRow[,4])

limitStarts<-starts-500000#going 5'
limitStops<-stops+500000#going 3'

#this creates a matrix (rows: chromosome, start, stop; one column per miR) for list conversion
vals<-rbind(mirRow[,2], limitStarts, limitStops)

#the list conversion is required for the biomaRt query because we are using more than one filter
vals<-as.list(vals)

#generate query

db<-useMart('ensembl', dataset='hsapiens_gene_ensembl')

query<-getBM(c('hgnc_symbol', 'ensembl_transcript_id', 'chromosome_name', 'external_gene_id'), filters=c('chromosome_name', 'start', 'end'), values=vals, mart=db)
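
A minimal sketch of what the list conversion seems to produce, using the first three rows of the table above as a toy stand-in for mirsData. as.list() on a matrix returns one element per cell, so the first three elements are the chromosome, start and end of the first miR, which may be why only chromosome 17 comes back. The second half builds one "chr:start:end" string per miR; using these with a 'chromosomal_region' filter is an assumption on my part and is left commented out because I have not run it against the server.

```r
# Toy stand-in for mirsData: first three rows of the table above,
# with the chromosome kept as text.
mirsData <- data.frame(
  id    = c("hsa-mir-572", "hsa-mir-583", "hsa-mir-587"),
  chr   = c("17", "18", "19"),
  start = c(10979549, 95440598, 107338693),
  end   = c(10979643, 95440672, 107338788),
  stringsAsFactors = FALSE
)
flank <- 500000

# As in the script: rbind() gives a 3 x n matrix, and as.list() on a matrix
# returns one list element per cell, column by column.
vals <- as.list(rbind(mirsData$chr,
                      mirsData$start - flank,
                      mirsData$end + flank))
length(vals)        # 9 single values rather than 3 vectors
unlist(vals[1:3])   # chromosome, start and end of the first miR only

# One "chr:start:end" string per miR keeps each window tied to its chromosome.
regions <- sprintf("%s:%.0f:%.0f",
                   mirsData$chr,
                   pmax(mirsData$start - flank, 1),  # do not go below position 1
                   mirsData$end + flank)
regions

# Untested; assumes the mart offers a 'chromosomal_region' filter
# (listFilters(db) would show whether it does):
# genes <- getBM(attributes = c("hgnc_symbol", "ensembl_transcript_id",
#                               "chromosome_name"),
#                filters = "chromosomal_region", values = regions, mart = db)
```

If that filter is available, all regions would go to the server in a single query, which avoids looping over the miRs.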

Any help would be appreciated.

Thanks

Iain

R version 2.9.0 (2009-04-17) 
x86_64-pc-linux-gnu 

locale:
LC_CTYPE=en_GB.UTF-8;LC_NUMERIC=C;LC_TIME=en_GB.UTF-8;LC_COLLATE=en_GB.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_GB.UTF-8;LC_PAPER=en_GB.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_GB.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biomaRt_2.0.0

loaded via a namespace (and not attached):
[1] RCurl_0.94-1 XML_2.3-0