[BioC] problem with biomaRt package using mart "snps", dataset "hsapiens_structvar", attribute "description"

Steffen Durinck durinck.steffen at gene.com
Fri Apr 1 01:41:20 CEST 2011


Thanks Mike, Wolfgang,

It looks like this should be an easy fix in biomaRt. We're currently
reading in the text connection as follows in biomaRt:

read.table(con, sep = "\t", header = FALSE, quote = "", comment.char = "",
           stringsAsFactors = FALSE)

If we change this to:

read.table(con, sep = "\t", header = FALSE, quote = "\"", comment.char = "",
           stringsAsFactors = FALSE)

I think it should work.  I'll fix biomaRt and provide a new dev
version within the next few days.
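
A minimal sketch of the difference between the two calls, using made-up
rows rather than real BioMart output. The sample values and the names
plain, multi, txt, broken and fixed are assumptions for illustration only,
and read.table(text = ...) stands in for the text connection that biomaRt
actually reads from:

```r
# Made-up tab-delimited text shaped like the output discussed in this thread:
# ten ordinary rows, then one row whose description has two newlines inside
# double quotes. None of these values come from the real dataset.
plain <- sprintf("%d\t%d\tesv%d\tSmith 2008 plain description",
                 1:10, 101:110, 1:10)
multi <- "269735\t349386\tesv29987\tLevy 2007 \"A title.\n\n\" PMID:0"
txt <- paste(c(plain, multi), collapse = "\n")

# Current call: quoting is disabled, so the embedded newlines end the record
# early and the leftover line has too few fields.
broken <- tryCatch(
  read.table(text = txt, sep = "\t", header = FALSE, quote = "",
             comment.char = "", stringsAsFactors = FALSE),
  error = function(e) e
)

# Proposed call: double quotes are honoured, so the newlines stay inside
# the fourth field.
fixed <- read.table(text = txt, sep = "\t", header = FALSE, quote = "\"",
                    comment.char = "", stringsAsFactors = FALSE)

print(inherits(broken, "error"))  # did the current call fail?
print(dim(fixed))                 # rows and columns from the proposed call
```

This only covers records with balanced double quotes. A description that
contains a single unmatched quote would presumably still confuse the
second call.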

Cheers,
Steffen

On Thu, Mar 31, 2011 at 3:52 PM, Wolfgang Huber <whuber at embl.de> wrote:
> Dear Mick
>
> thank you for the (almost - see below) reproducible report.
>
> The bottom line is that R's read.table does not like newline (\n) characters
> within quoted text ("): it interprets them as line ends, which messes up the
> tab-delimited table that the BioMart query returns.
>
> I suggest either of two possible solutions:
> - The BioMart dataset is modified to abstain from putting \n and other funny
> characters within quoted text
> - the biomaRt package is modified to tolerate such behaviour
>
> I am not sure how it would be possible to make the communication between
> BioMart servers and its clients such as biomaRt more robust. Is there a
> clear specification of BioMart servers' tab-delimited format and what the
> legal characters are? This would certainly be helpful for people who program
> clients.
>
> I compacted your example into the following.
>
>
>  library("biomaRt")
>  options(error=recover)
>
>  ensembl.var <- useMart("snp")
>  sv <- useDataset("hsapiens_structvar", mart=ensembl.var)
>
>  x2 <- getBM(c("chrom_start", "chrom_end",
>           "structural_variation_name", "description"),
>            filters=c("chr_name"), values=list(6), mart=sv)
>
>
> This generates the "error in scan(file, what, nmax, sep, dec, quote, skip,
> nlines, na.strings, : line 135 did not have 4 elements". You then get a menu
> from R's debugger. Enter "4" to get into the local evaluation environment of
> the getBM function just before the error is thrown. Then, type
>
>  cat(postRes, file="postRes.txt")
>
> and open the file in a text editor, e.g. emacs. Lines 133-135 are:
> 269735  349386  esv29987        Levy 2007 "The diploid genome sequence of an individual human.
>
> " PMID:17803354 [remapped from build NCBI36]
>
> Note that there are two newlines (\n) within the title of the paper, which
> probably shouldn't be there. The same is true at many other places in
> the file, wherever the Levy paper is referred to.
>
> I leave it to Steffen to decide whether he wants to modify biomaRt; and to
> you, whether you want to lobby the curators of that dataset to make the
> 'description' field more consistent.
>
> Hope this helps.
>
>        Wolfgang
>
> PS: The line from your example code
>   useMart("snps")
> resulted for me in an error message "Incorrect BioMart name, use the
> listMarts function to see which BioMart databases are available". (There is
> an extraneous "s"). Next time, please always send an exact transcript of
> what you do, to make sure the problem is not due to a typing error.
>
>
>
> On Mar/31/11 5:25 PM, mmaguire wrote:
>>
>> To whom it may concern,
>> I work in the DGVa group at EBI, which works on structural variants.
>> I ran into a problem using the R package biomaRt when attempting to
>> retrieve information from the "snps" mart "hsapiens_structvar" dataset.
>>
>> Here is the R code that I've written, with comments:
>>
>> # Testing retrieval of SVs from Biomart
>>
>> library(biomaRt)
>>
>> # Select the version "ENSEMBL  VARIATION 61 (SANGER UK)"
>> ensembl.var<- useMart("snps")
>>
>> # Select SV dataset from the chosen mart
>> sv<- useDataset("hsapiens_structvar", mart=ensembl.var)
>>
>> # Set attributes and filters for the chosen dataset and retrieve the data
>> # into a data frame
>> chr6.svs<-getBM(c("chrom_start", "chrom_end",
>> "structural_variation_name"), filters=c("chr_name"), values=list(6),
>> mart=sv)
>> # Check for returned data (brings back 65,532 rows for chromosome 6)
>> summary(chr6.svs)
>> # Write the data frame to a text file
>> write.table(  chr6.svs, file='chr6_svs_from_biomart.txt', sep="\t",
>> quote=FALSE, append=FALSE, na="", row.names=FALSE )
>>
>>
>> # Adding "description" to the vector of attributes in the above call to
>> # function "getBM()" causes the code to fail with the error given below.
>> chr6.svs<- getBM(c("chrom_start", "chrom_end",
>> "structural_variation_name", "description"), filters=c("chr_name"),
>> values=list(6), mart=sv) # Does not work
>> # Error returned by R when attempting to get the SV description attribute:
>> # Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,
>> # na.strings,  :
>> #  line 135 did not have 4 elements
>>
>> The code fails when the SV "description" attribute is added.  I think the
>> problem arises from the spaces in the "description" field, with R
>> incorrectly interpreting each space-delimited word as a vector element.  My R
>> is limited so I may be wrong.  Anyway, I can run the same query from the web
>> interface and correctly retrieve the "description" attribute.
>> I've checked this with our Biomart person, Rhoda Kinsella, and the data in
>> the Biomart looks correct and, as stated above, we can export it from the
>> web interface.
>> Any help gratefully received.
>>
>> Cheers
>>
>> Mick
>>
>>> Michael Maguire
>>> Variation Archive Bioinformatician
>>> European Bioinformatics Institute
>>> Wellcome Trust Genome Campus
>>> Hinxton
>>> Cambridge CB10 1SD
>>>
>>> Phone +44 1223 494674
>>> Email mmaguire at ebi.ac.uk
>>
>>
>>
>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> --
>
>
> Wolfgang Huber
> EMBL
> http://www.embl.de/research/units/genome_biology/huber
>
>


