[BioC] problem with biomaRt package using mart "snps", dataset "hsapiens_structvar", attribute "description"

Steffen Durinck durinck.steffen at gene.com
Wed Apr 6 16:48:17 CEST 2011


Hi Michael,

This has been fixed now in the dev version of biomaRt.
Available here:

http://bioconductor.org/packages/2.8/bioc/html/biomaRt.html

Cheers,
Steffen

On Fri, Apr 1, 2011 at 1:24 AM, mmaguire <mmaguire at ebi.ac.uk> wrote:
> Thanks, Steffen, I've forwarded the mail to Rhoda, our Biomart person.
> Apologies for the typo re "mart": copy-and-paste followed a mis-type!
>
> Cheers
>
> Mick
>> Michael Maguire
>> Variation Archive Bioinformatician
>> European Bioinformatics Institute
>> Wellcome Trust Genome Campus
>> Hinxton
>> Cambridge CB10 1SD
>>
>> Phone +44 1223 494674
>> Email mmaguire at ebi.ac.uk
>
> On Apr 1, 2011, at 12:41 AM, Steffen Durinck wrote:
>
>> Thanks Mike, Wolfgang,
>>
>> It looks like this should be an easy fix in biomaRt. We're currently
>> reading in the text connection as follows in biomaRt:
>>
>> read.table(con, sep = "\t", header = FALSE, quote = "", comment.char =
>> "", stringsAsFactors = FALSE)
>>
>> if we change this to:
>>
>> read.table(con, sep = "\t", header = FALSE, quote = "\"", comment.char
>> = "", stringsAsFactors = FALSE)
>>
>> I think it should work.  I'll fix biomaRt and provide a new dev
>> version within the next few days.
>>
>> Cheers,
>> Steffen
>>
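
A minimal sketch of the quote change proposed above, not biomaRt's actual code. The `response` string is a hypothetical stand-in for the server output (the first record is invented; the second mimics the esv29987 record quoted further down the thread). It only illustrates how the `quote` argument of read.table affects newlines inside double-quoted text.

```r
# Hypothetical stand-in for a BioMart response (not real server output):
# the second record's description holds a quoted title with two newlines.
response <- paste0(
  "100\t200\tesv1\tSmith 2009 plain description\n",
  "269735\t349386\tesv29987\tLevy 2007 \"The diploid genome sequence of an ",
  "individual human.\n\n\" PMID:17803354\n"
)

# Quoting disabled: every newline ends a record, so the tail of the title
# becomes a short line and read.table signals an error.
unquoted <- tryCatch(
  read.table(text = response, sep = "\t", header = FALSE, quote = "",
             comment.char = "", stringsAsFactors = FALSE),
  error = function(e) e
)
print(inherits(unquoted, "error"))

# Double quote as the quoting character: newlines inside the quotes
# stay part of the field instead of ending the record.
quoted <- read.table(text = response, sep = "\t", header = FALSE,
                     quote = "\"", comment.char = "",
                     stringsAsFactors = FALSE)
print(dim(quoted))
```

Note that with quoting enabled the quote characters themselves are dropped from the parsed field, and a description containing an unbalanced double quote would presumably cause a different parsing problem.
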
>> On Thu, Mar 31, 2011 at 3:52 PM, Wolfgang Huber <whuber at embl.de> wrote:
>>> Dear Mick
>>>
>>> thank you for the (almost - see below) reproducible report.
>>>
>>> The bottom line is that R's read.table does not like newline (\n) characters
>>> within quoted text ("): it interprets them as line ends, which messes up the
>>> tab-delimited table that the BioMart query returns.
>>>
>>> I suggest either of two possible solutions:
>>> - The BioMart dataset is modified to abstain from putting \n and other funny
>>> characters within quoted text
>>> - the biomaRt package is modified to tolerate such behaviour
>>>
>>> I am not sure how it would be possible to make the communication between
>>> BioMart servers and its clients such as biomaRt more robust. Is there a
>>> clear specification of BioMart servers' tab-delimited format and what the
>>> legal characters are? This would certainly be helpful for people who program
>>> clients.
>>>
>>> I compacted your example into the following.
>>>
>>>
>>>  library("biomaRt")
>>>  options(error=recover)
>>>
>>>  ensembl.var <- useMart("snp")
>>>  sv <- useDataset("hsapiens_structvar", mart=ensembl.var)
>>>
>>>  x2 <- getBM(c("chrom_start", "chrom_end",
>>>           "structural_variation_name", "description"),
>>>            filters=c("chr_name"), values=list(6), mart=sv)
>>>
>>>
>>> This generates the "error in scan(file, what, nmax, sep, dec, quote, skip,
>>> nlines, na.strings, : line 135 did not have 4 elements". You then get a menu
>>> from R's debugger. Enter "4" to get into the local evaluation environment of
>>> the getBM function just before the error is thrown. Then, type
>>>
>>>  cat(postRes, file="postRes.txt")
>>>
>>> and open the file in a text editor, e.g. emacs. Lines 133-135 are:
>>> 269735  349386  esv29987        Levy 2007 "The diploid genome sequence of an
>>> individual human.
>>>
>>> " PMID:17803354 [remapped from build NCBI36]
>>>
>>> Note that there are two newlines (\n) within the title of the paper, which
>>> probably shouldn't be there. The same is also true at many other places in
>>> the file, wherever the Levy paper is referred to.
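
The manual inspection described above can also be done programmatically. The following is a sketch under assumptions: `response` is a hypothetical stand-in for the saved query result (only the esv29987 record comes from this thread), and the expected field count of 4 matches the four attributes requested. It uses base R's count.fields to locate lines that do not have the expected number of tab-separated fields.

```r
# Hypothetical stand-in for the saved response (postRes.txt in the thread).
response <- paste0(
  "100\t200\tesv1\tSmith 2009 plain description\n",
  "269735\t349386\tesv29987\tLevy 2007 \"The diploid genome sequence of an ",
  "individual human.\n\n\" PMID:17803354\n"
)
tmp <- tempfile(fileext = ".txt")
cat(response, file = tmp)

# Count tab-separated fields per line with quoting disabled. Blank lines
# are skipped by default, so positions index the non-blank lines only.
n_fields <- count.fields(tmp, sep = "\t", quote = "", comment.char = "")

# Lines whose field count differs from the 4 requested attributes.
bad <- which(n_fields != 4)
print(n_fields)
print(bad)

unlink(tmp)
```

Any position reported in `bad` points at a fragment left over from a record that was split by an embedded newline.
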
>>>
>>> I leave it to Steffen to decide whether he wants to modify biomaRt; and to
>>> you, whether you want to lobby the curators of that dataset for more
>>> consistency in the 'description' field.
>>>
>>> Hope this helps.
>>>
>>>        Wolfgang
>>>
>>> PS: The line from your example code
>>>   useMart("snps")
>>> resulted for me in an error message "Incorrect BioMart name, use the
>>> listMarts function to see which BioMart databases are available". (There is
>>> an extraneous "s"). Next time, please always send an exact transcript of
>>> what you do, to make sure the problem is not due to a typing error.
>>>
>>>
>>>
>>> On Mar/31/11 5:25 PM, mmaguire wrote:
>>>>
>>>> To whom it may concern,
>>>> I work in the DGVa group at EBI, which works on structural variants.
>>>> I ran into a problem using the R package biomaRt when attempting to
>>>> retrieve information from the "snps" mart "hsapiens_structvar" dataset.
>>>> Here is my R code, with comments:
>>>>
>>>> # Testing retrieval of SVs from Biomart
>>>>
>>>> library(biomaRt)
>>>>
>>>> # Select the version "ENSEMBL  VARIATION 61 (SANGER UK)"
>>>> ensembl.var<- useMart("snps")
>>>>
>>>> # Select SV dataset from the chosen mart
>>>> sv<- useDataset("hsapiens_structvar", mart=ensembl.var)
>>>>
>>>> # Set attributes and filters for the chosen dataset and retrieve the data
>>>> # into a data frame
>>>> chr6.svs <- getBM(c("chrom_start", "chrom_end",
>>>>                     "structural_variation_name"),
>>>>                   filters=c("chr_name"), values=list(6), mart=sv)
>>>> # Check for returned data (brings back 65,532 rows for chromosome 6)
>>>> summary(chr6.svs)
>>>> # Write the data frame to a text file
>>>> write.table(chr6.svs, file='chr6_svs_from_biomart.txt', sep="\t",
>>>>             quote=FALSE, append=FALSE, na="", row.names=FALSE)
>>>>
>>>>
>>>> # Adding "description" to the vector of attributes in the above call to
>>>> # function "getBM()" causes the code to fail with the error given below.
>>>> chr6.svs <- getBM(c("chrom_start", "chrom_end",
>>>>                     "structural_variation_name", "description"),
>>>>                   filters=c("chr_name"), values=list(6), mart=sv) # Does not work
>>>> # Error returned by R when attempting to get the SV description attribute:
>>>> # Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,
>>>> # na.strings,  :
>>>> #  line 135 did not have 4 elements
>>>>
>>>> The code fails when the SV "description" attribute is added.  I think the
>>>> problem arises from the spaces in the "description" field, with R
>>>> incorrectly interpreting each space-delimited word as a vector element.  My R
>>>> is limited so I may be wrong.  Anyway, I can run the same query from the web
>>>> interface and correctly retrieve the "description" attribute.
>>>> I've checked this with our Biomart person, Rhoda Kinsella, and the data in
>>>> the Biomart looks correct and, as stated above, we can export it from the
>>>> web interface.
>>>> Any help gratefully received.
>>>>
>>>> Cheers
>>>>
>>>> Mick
>>>>
>>>>> Michael Maguire
>>>>> Variation Archive Bioinformatician
>>>>> European Bioinformatics Institute
>>>>> Wellcome Trust Genome Campus
>>>>> Hinxton
>>>>> Cambridge CB10 1SD
>>>>>
>>>>> Phone +44 1223 494674
>>>>> Email mmaguire at ebi.ac.uk
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>> --
>>>
>>>
>>> Wolfgang Huber
>>> EMBL
>>> http://www.embl.de/research/units/genome_biology/huber
>>>


