[BioC] problem with biomaRt package using mart "snps", dataset "hsapiens_structvar", attribute "description"

Wolfgang Huber whuber at embl.de
Fri Apr 1 00:52:09 CEST 2011


Dear Mick

Thank you for the (almost - see below) reproducible report.

The bottom line is that R's read.table does not like newline (\n) 
characters within quoted text ("): it interprets them as line ends, which 
messes up the tab-delimited table that the BioMart query returns.
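
A toy reproduction of the effect, as a sketch only: the records below are 
made up, and quote="" is my assumption about how the response is parsed, 
not something taken from the biomaRt source.

```r
# Made-up tab-delimited text; the second record has a newline inside its
# quoted text, so it is spread over two physical lines.
txt <- paste0(
  "1\t2\tsv_a\tplain text\n",
  "3\t4\tsv_b\tsome \"quoted\n",
  "text\" here\n"
)

# Assumption: the table is parsed with quoting disabled (quote = "").
# The third physical line then holds one field instead of four.
res <- tryCatch(
  read.table(text = txt, sep = "\t", quote = "", comment.char = "",
             header = FALSE, stringsAsFactors = FALSE),
  error = function(e) conditionMessage(e)
)

# res holds the error message if parsing failed, a data frame otherwise.
print(res)
```

With these settings I would expect the same kind of "did not have 4 
elements" complaint as in the real query.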

I suggest either of two possible solutions:
- the BioMart dataset is modified so that it no longer puts \n and other 
funny characters within quoted text, or
- the biomaRt package is modified to tolerate such behaviour.
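
For the second option, a client could glue continuation lines back onto 
their record before parsing. The following is a minimal sketch, not how 
biomaRt works: rejoin_records is a hypothetical helper, the second record 
is made up, and it assumes that stray newlines occur only in the last 
column.

```r
# Hypothetical helper (not part of biomaRt): merge continuation lines into
# the record they belong to. Assumes stray newlines occur only in the last
# column, so every genuine record starts with a line that contains exactly
# n_fields - 1 tabs.
rejoin_records <- function(lines, n_fields) {
  n_tabs <- nchar(lines) - nchar(gsub("\t", "", lines, fixed = TRUE))
  starts <- n_tabs == n_fields - 1L
  stopifnot(length(lines) > 0L, starts[1])
  pieces <- split(lines, cumsum(starts))
  # Drop empty continuation lines and join the rest with a single space.
  vapply(pieces, function(p) paste(p[nzchar(p)], collapse = " "),
         character(1), USE.NAMES = FALSE)
}

raw <- c(
  "269735\t349386\tesv29987\tLevy 2007 \"The diploid genome sequence of an individual human.",
  "",
  "\" PMID:17803354 [remapped from build NCBI36]",
  "1\t2\texample_sv\tmade-up second record"
)

fixed <- rejoin_records(raw, n_fields = 4)
tab <- read.table(text = fixed, sep = "\t", quote = "", comment.char = "",
                  header = FALSE, stringsAsFactors = FALSE)
dim(tab)
```

The heuristic breaks down if a newline sits in any column other than the 
last one, which is one more reason why a format specification would help.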

I am not sure how the communication between BioMart servers and their 
clients, such as biomaRt, could be made more robust. Is there a clear 
specification of the BioMart servers' tab-delimited format and of which 
characters are legal? This would certainly be helpful for people who 
program clients.

I compacted your example into the following.


   library("biomaRt")
   options(error=recover)

   ensembl.var <- useMart("snp")
   sv <- useDataset("hsapiens_structvar", mart=ensembl.var)

   x2 <- getBM(c("chrom_start", "chrom_end",
            "structural_variation_name", "description"),
             filters=c("chr_name"), values=list(6), mart=sv)


This generates the error "Error in scan(file, what, nmax, sep, dec, quote, 
skip, nlines, na.strings, : line 135 did not have 4 elements". You then 
get a menu from R's debugger. Enter "4" to get into the local evaluation 
environment of the getBM function just before the error is thrown. Then 
type

   cat(postRes, file="postRes.txt")

and open the file in a text editor, e.g. emacs. Lines 133-135 are:
269735	349386	esv29987	Levy 2007 "The diploid genome sequence of an individual human.

" PMID:17803354 [remapped from build NCBI36]

Note that there are two newlines (\n) within the title of the paper, 
which probably shouldn't be there. The same is true at many other 
places in the file, wherever the Levy paper is cited.
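
The affected places can also be found without an editor by counting tabs 
per line. A rough sketch: find_bad_lines is a hypothetical helper, and the 
file written here is made-up stand-in data rather than the real 
postRes.txt.

```r
# Hypothetical helper: return the numbers of the lines whose field count
# (tabs + 1) differs from the expected one.
find_bad_lines <- function(path, n_fields) {
  lines <- readLines(path, warn = FALSE)
  n_tabs <- nchar(lines) - nchar(gsub("\t", "", lines, fixed = TRUE))
  which(n_tabs + 1L != n_fields)
}

# Made-up stand-in for postRes.txt: the second record is broken by two
# newlines inside its quoted text.
tmp <- tempfile(fileext = ".txt")
writeLines(c("1\t2\tsv_a\tfine",
             "3\t4\tsv_b\tsome \"quoted",
             "",
             "\" text",
             "5\t6\tsv_c\tfine"), tmp)

bad <- find_bad_lines(tmp, n_fields = 4)
bad
```

Note that only the continuation lines are flagged; the line on which a 
broken record starts still has the expected number of tabs.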

I leave it to Steffen to decide whether he wants to modify biomaRt, and 
to you whether you want to lobby the curators of that dataset to make 
the 'description' field more consistent.

Hope this helps.

	Wolfgang

PS: The line from your example code
    useMart("snps")
resulted for me in an error message "Incorrect BioMart name, use the 
listMarts function to see which BioMart databases are available". (There 
is an extraneous "s"). Next time, please always send an exact transcript 
of what you do, to make sure the problem is not due to a typing error.



On Mar/31/11 5:25 PM, mmaguire wrote:
> To whom it may concern,
> I work in the DGVa group at EBI, which works on structural variants.  I ran into a problem using the R package biomaRt when attempting to retrieve information from the "snps" mart "hsapiens_structvar" dataset.
>
> Here is my R code, with comments:
>
> # Testing retrieval of SVs from Biomart
>
> library(biomaRt)
>
> # Select the version "ENSEMBL  VARIATION 61 (SANGER UK)"
> ensembl.var<- useMart("snps")
>
> # Select SV dataset from the chosen mart
> sv<- useDataset("hsapiens_structvar", mart=ensembl.var)
>
> # Set attributes and filters for the chosen dataset and retrieve the data into a data frame
> chr6.svs<-getBM(c("chrom_start", "chrom_end", "structural_variation_name"), filters=c("chr_name"), values=list(6), mart=sv)
> # Check for returned data (brings back 65,532 rows for chromosome 6)
> summary(chr6.svs)
> # Write the data frame to a text file
> write.table(  chr6.svs, file='chr6_svs_from_biomart.txt', sep="\t", quote=FALSE, append=FALSE, na="", row.names=FALSE )
>
>
> # Adding "description" to the vector of attributes in the above call to function "getBM()" causes the code to fail with the error given below.
> chr6.svs<- getBM(c("chrom_start", "chrom_end", "structural_variation_name", "description"), filters=c("chr_name"), values=list(6), mart=sv) # Does not work
> #Error returned by R when attempting to get the SV description attribute:
> # Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
> #  line 135 did not have 4 elements
>
> The code fails when the SV "description" attribute is added.  I think the problem arises from the spaces in the "description" field, with R incorrectly interpreting each space-delimited word as a vector element.  My R is limited so I may be wrong.  Anyway, I can run the same query from the web interface and correctly retrieve the "description" attribute.
> I've checked this with our Biomart person, Rhoda Kinsella, and the data in the Biomart looks correct and, as stated above, we can export it from the web interface.
> Any help gratefully received.
>
> Cheers
>
> Mick
>
>> Michael Maguire
>> Variation Archive Bioinformatician
>> European Bioinformatics Institute
>> Wellcome Trust Genome Campus
>> Hinxton
>> Cambridge CB10 1SD
>>
>> Phone +44 1223 494674
>> Email mmaguire at ebi.ac.uk

-- 


Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber


