[BioC] How do I use biomaRt to get upstreamFlank Genomic Sequence for many Genomes?

Noah Dowell noahd at ucla.edu
Mon Dec 20 19:54:57 CET 2010


Hello All,

Problem:

I would like to obtain the genomic sequence upstream (~500 bp) of a specific bacterial gene, for every bacterial genome that has the gene. On EcoCyc I see that many (> 100) bacteria have the gene, but I do not know how to retrieve all of the sequences in a high-throughput manner, so I was going to use biomaRt to fetch them and pass them to alignment programs later. I have read through the vignette and tried to get getBM to work with a non-Ensembl mart, to no avail. I was also presented with an error (see below) that suggested I report it to the mailing list.

It looks like I will also have to query each of the 249 bacterial genomes in the "bacterial_mart_7" mart individually (with getLDS or getBM), which does not seem high-throughput at all. Are there any other suggestions that would let me take advantage of the large amount of bacterial genomic data for homology studies?

Thank you for your help.

Noah



Attempted Solution (for a single genome):

> bacGenome = useMart("bacterial_mart_7", dataset = "esc_20_gene")
Checking attributes ... ok
Checking filters ... ok
> 
> filters = c("external_gene_id")
> 
> attributes = c("external_gene_id","upstream_flank") 
> 
> values = list(external_gene_id = c("fis"), 500)
> seq = getBM(attributes=attributes, filters = filters, values = values, mart= bacGenome,
+ 			checkFilters= FALSE)
   V1
1 fis
Error in getBM(attributes = attributes, filters = filters, values = values,  : 
  The query to the BioMart webservice returned an invalid result: the number of columns in the result table does not equal the number of attributes in the query. Please report this to the mailing list.
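
For reference, the per-genome loop mentioned in the problem statement might be sketched roughly as follows. This is only a sketch under several assumptions: that the "bacterial_mart_7" mart is still reachable under that name, that every dataset in it accepts "external_gene_id" as a filter, and that it exposes a "gene_flank" sequence type the way the Ensembl marts do. The helper name fetch_upstream is hypothetical. It relies on biomaRt's getSequence(), which takes the flank length through its upstream argument.

```r
library(biomaRt)

# Assumption: the mart name from the attempt above is still valid.
mart <- useMart("bacterial_mart_7")

# listDatasets() returns a data frame with one row per genome;
# its 'dataset' column holds the names accepted by useDataset().
datasets <- as.character(listDatasets(mart)$dataset)

# Hypothetical helper: fetch the upstream flank of one gene from one dataset.
# Returns NULL when the query fails or the genome has no matching gene.
fetch_upstream <- function(dataset, gene = "fis", flank = 500) {
  tryCatch({
    genome <- useDataset(dataset, mart = mart)
    # Assumption: "gene_flank" is available for this dataset.
    res <- getSequence(id = gene, type = "external_gene_id",
                       seqType = "gene_flank", upstream = flank,
                       mart = genome)
    if (is.data.frame(res) && nrow(res) > 0) {
      res$dataset <- dataset  # record which genome the sequence came from
      res
    } else {
      NULL
    }
  }, error = function(e) NULL)
}

# One query per genome; rbind() skips the NULL entries.
upstream <- do.call(rbind, lapply(datasets, fetch_upstream))
```

This still issues one web-service query per genome, so it automates the loop rather than removing it.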




> sessionInfo()
R version 2.11.0 (2010-04-22) 
i386-apple-darwin9.8.0 

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rtracklayer_1.8.1 RCurl_1.3-1       bitops_1.0-4.1    biomaRt_2.4.0    

loaded via a namespace (and not attached):
[1] Biobase_2.8.0       Biostrings_2.16.0   BSgenome_1.16.0     GenomicRanges_1.0.1 IRanges_1.6.0      
[6] tools_2.11.0        XML_2.8-1          


