[BioC] help with biomaRt bioconductor - Filter upstream_flank NOT FOUND problem

Tue Aug 7 11:11:46 CEST 2012

Oops, I forgot the sessionInfo() for my previous post. Here it is:

R Under development (unstable) (2012-08-07 r60182)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=la_AU.UTF-8
  [7] LC_PAPER=C                 LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] biomaRt_2.13.2 fortunes_1.5-0

loaded via a namespace (and not attached):
[1] RCurl_1.91-1 XML_3.9-4

Wolfgang Huber wrote on 08/07/2012 11:08 AM:
> Dear Steffen / List,
> below is a more compact code example that reproduces Tom's problem. I am
> rather confused by the fact that the problem seemed to occur
> stochastically!
>
> -------------------
> library(biomaRt)
> options(error=recover)
> ensembl = useMart("ensembl")
> human = useDataset("hsapiens_gene_ensembl",mart=ensembl)
> attr = c('ensembl_gene_id','ensembl_transcript_id',
>         'external_gene_id','chromosome_name','strand','transcript_start')
> bmres = getBM(attr, 'biotype', values = 'protein_coding', human)
>
> for(id in bmres[,"ensembl_transcript_id"]){
>   sequence = getSequence(id=id, type='ensembl_transcript_id',
>                         seqType='transcript_flank',upstream = 3000,
>                         mart = human)
>   sl = with(sequence, nchar(as.character(transcript_flank)))
>   cat(id, sl, "\n")
> }
> -------------------
>
> On running this once, I got
> ...(lots of lines)
> ENST00000520540 3000
> ENST00000519310 3000
> ENST00000442920 3000
> Error in getBM(c(seqType, type), filters = c(type, "upstream_flank"),  :
>    Query ERROR: caught BioMart::Exception::Usage: Filter upstream_flank NOT FOUND
>
> On the second run, the same error occurred already in the very first
> iteration of the for-loop, for id="ENST00000539570"; on the third run, in
> the third iteration, for id="ENST00000510508".
>
> Any idea what is going on here?
>
>
> Further comments:
> - for *Steffen*: The documentation and the code of 'getSequence' do not
> seem to match each other (e.g. the description of argument 'seqType'), and
> MySQL mode is mentioned but, as far as I understand, is no longer
> supported. Perhaps some maintenance would help users here.
> - for *Tom*: Making these queries (such as getSequence) within a
> for-loop is bad practice, since it needlessly clogs the network and the
> BioMart webservers. Please use R's vector-capabilities, e.g.
>
> ------------------------
> sequence = getSequence(id=bmres[,"ensembl_transcript_id"],
>    type='ensembl_transcript_id', seqType='transcript_flank',
>    upstream = 3000, mart = human)
> sl = with(sequence, nchar(as.character(transcript_flank)))
> -------------------------
>
> Best wishes
>      Wolfgang
>
>
> Tom Hait wrote on 08/06/2012 12:37 PM:
>> Hello,
>>
>> I'm a bioinformatics student at Tel Aviv University.
>> I'm working with your biomaRt API to automate the download of FASTA
>> sequences.
>> I ran into a problem; here is my code:
>>
>> # load the biomaRt library
>> library(biomaRt)
>> # connect to the Ensembl mart and open the human data set
>> ensembl <- useMart("ensembl")
>> human <- useDataset("hsapiens_gene_ensembl", mart = ensembl)
>> # select the attributes that we want from the data set
>> attr <- c('ensembl_gene_id', 'ensembl_transcript_id',
>>           'external_gene_id', 'chromosome_name', 'strand', 'transcript_start')
>> # download the map between transcript id and transcript name
>> tmpgene <- getBM(attr, 'biotype', values = 'protein_coding', human)
>> # save as a text table (write.table separates fields by spaces by default)
>> write.table(tmpgene, "Z:/tomhait/organisms/human/transcript_names.txt",
>>             row.names = FALSE, quote = FALSE)
>> # collect all sequences with a 3000-base upstream flank, based on the
>> # second column (ensembl_transcript_id) of tmpgene
>> # output goes to "Z:/tomhait/organisms/human/mart_export_new.txt";
>> # change it to "mart_export_new.txt" to create the file in the R
>> # working directory
>> outfile <- "Z:/tomhait/organisms/human/mart_export_new.txt"
>> i <- 1
>> for (id1 in tmpgene[, 2]) {
>>   # retrieve sequence
>>   sequence <- getSequence(id = id1, type = 'ensembl_transcript_id',
>>                           seqType = 'transcript_flank', upstream = 3000,
>>                           mart = human)
>>   # check if a sequence was retrieved
>>   sLengths <- with(sequence, nchar(as.character(transcript_flank)))
>>   if (length(sLengths) > 0) {
>>     x <- sequence[, 1]
>>     # split the sequence into lines of 60 characters
>>     y <- strsplit(gsub("([[:alnum:]]{60})", "\\1 ", x), " ")[[1]]
>>     # FASTA header: the six attributes of row i, separated by "|"
>>     title <- paste(paste(">", tmpgene[i, 1], sep = ""), tmpgene[i, 2],
>>                    tmpgene[i, 3], tmpgene[i, 4], tmpgene[i, 5],
>>                    tmpgene[i, 6], sep = "|")
>>     write(title, file = outfile, ncolumns = 1, append = TRUE, sep = "")
>>     write(y, file = outfile, ncolumns = 1, append = TRUE, sep = "\n")
>>     write("\n", file = outfile, ncolumns = 1, append = TRUE, sep = "\n")
>>   }
>>   i <- i + 1
>> }
>>
>> I got the message:
>> Error in getBM(c(seqType, type), filters = c(type, "upstream_flank"),  :
>>    Query ERROR: caught BioMart::Exception::Usage: Filter upstream_flank NOT FOUND
>>
>> Could you please help me to solve this problem?
>>
>> Best Regards,
>>
>> Tom Hait.
>>
>
>
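A rough sketch of how the vectorised query quoted above could be sent in batches, with a few retries per batch. This assumes the "Filter upstream_flank NOT FOUND" error is transient, which its stochastic appearance suggests but which is not established in this thread. The names fetch_in_batches and flaky_fetch are hypothetical helpers, not part of biomaRt, and the default batch size of 500 is an arbitrary guess.

```r
# Hypothetical helper (not part of biomaRt): split ids into batches and
# retry each batch a few times if the query function throws an error.
fetch_in_batches <- function(ids, fetch, batch_size = 500, max_tries = 3,
                             wait = 0) {
  # assign each id to a batch of at most batch_size ids
  batches <- split(ids, ceiling(seq_along(ids) / batch_size))
  results <- lapply(batches, function(batch) {
    for (attempt in seq_len(max_tries)) {
      res <- tryCatch(fetch(batch), error = function(e) e)
      if (!inherits(res, "error")) return(res)
      message("attempt ", attempt, " failed: ", conditionMessage(res))
      Sys.sleep(wait)  # pause before retrying (seconds)
    }
    stop("batch failed after ", max_tries, " attempts")
  })
  # combine the per-batch data frames into one
  do.call(rbind, unname(results))
}

# Offline demonstration with a stand-in query function that fails on its
# first call only, so no network access is needed.
calls <- 0
flaky_fetch <- function(batch) {
  calls <<- calls + 1
  if (calls == 1) stop("Filter upstream_flank NOT FOUND")
  data.frame(id = batch, len = nchar(batch), stringsAsFactors = FALSE)
}
demo <- fetch_in_batches(c("a", "bb", "ccc"), flaky_fetch, batch_size = 2)
```

With the objects from the quoted code, fetch would be a small wrapper such as function(batch) getSequence(id = batch, type = 'ensembl_transcript_id', seqType = 'transcript_flank', upstream = 3000, mart = human), and ids would be bmres[, "ensembl_transcript_id"]. Retrying only hides the symptom; it does not explain why the filter is sometimes not found.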

-- 
Best wishes
	Wolfgang

Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber