[BioC] BioMart and Ensembl questions !!!

James W. MacDonald jmacdon at med.umich.edu
Mon Sep 21 16:55:54 CEST 2009



Rhoda Kinsella wrote:
> Hi Paul
> I'm not really sure why you get this error... I am using the following 
> version:
> 
>  > sessionInfo()
> R version 2.8.0 (2008-10-20)
> i386-apple-darwin8.11.1
> 
> locale:
> en_GB.UTF-8/en_GB.UTF-8/C/C/en_GB.UTF-8/en_GB.UTF-8
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] biomaRt_1.16.0
> 
> loaded via a namespace (and not attached):
> [1] RCurl_0.92-0 XML_1.98-1
> 
> Does anyone know why Paul is getting this error?

Yes.

In your version of biomaRt, the URI is constructed like this:

registry = getURL(paste(host, "?type=registry&requestid=biomaRt",
                 sep = ""))

whereas in the current and devel versions, the URI is constructed like this:

registry = bmRequest(paste("http://", host, ":", port,
             path, "?type=registry&requestid=biomaRt", sep = ""))

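So with the host Paul passed 
("nov2008.archive.ensembl.org/biomart/martservice"), the newer code 
appends the port and path to a host string that already contains the 
path, and requests

http://nov2008.archive.ensembl.org/biomart/martservice:80/biomart/martservice?type=registry&requestid=biomaRt

which results in a 404 error from the BioMart server.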
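
To make the difference concrete, here is a minimal sketch in base R that 
replays both paste() calls with the host string from Paul's listMarts() 
call. The port (80) and path ("/biomart/martservice") are read off the 
failing URI above and are assumed defaults; they are not copied from the 
biomaRt source. getURL() and bmRequest() are left out, so nothing is 
actually requested.

```r
# Host string exactly as passed to listMarts() in Paul's message
host  <- "nov2008.archive.ensembl.org/biomart/martservice"

# ASSUMPTION: port and path inferred from the failing URI, not from the package
port  <- 80
path  <- "/biomart/martservice"
query <- "?type=registry&requestid=biomaRt"

# Older construction: host is used verbatim
old_uri <- paste(host, query, sep = "")

# Newer construction: scheme, port and path are wrapped around host
new_uri <- paste("http://", host, ":", port, path, query, sep = "")

print(old_uri)
print(new_uri)
```

The older construction uses host verbatim, so a full martservice 
location works there; the newer one wraps host with a scheme, port and 
path, so the same string ends up containing the path twice.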

If you modify the host to "nov2008.archive.ensembl.org", you still get a 
busted URI. The only way I could get it to work is by running through 
the debugger and substituting something reasonable in after the registry 
object is created.

Best,

Jim



> Regards,
> Rhoda
> 
> 
> On 21 Sep 2009, at 14:53, Paul Leo wrote:
> 
>> Hi Rhoda,
>> Thanks, that seems to be exactly what I want, but it does not work for
>> me...
>>
>>  library(biomaRt)
>>> listMarts(host="nov2008.archive.ensembl.org/biomart/martservice")
>> Entity 'nbsp' not defined
>> Entity 'nbsp' not defined
>> Entity 'nbsp' not defined
>> Entity 'nbsp' not defined
>> Entity 'nbsp' not defined
>> Entity 'nbsp' not defined
>> Entity 'copy' not defined
>> Entity 'nbsp' not defined
>> Entity 'nbsp' not defined
>> Error in names(x) <- value :
>>  'names' attribute [2] must be the same length as the vector [0]
>>>
>>
>>
>>
>>
>> http://www.ensembl.org/info/website/archives/
>>
>>
>> Once you are there, click on the release you would like to look at and
>> then on the biomart button. This will give you the
>> URI you need to use
>> in the biomaRt package to get access to that archive. For example the 
>> release 51 archive biomart is
>> available at:
>>
>>
>> http://nov2008.archive.ensembl.org/biomart/martview/
>>
>>
>> If you then
>> plug this into biomart you can get access to the information you require:
>>
>>
>>> library(biomaRt)
>>> listMarts(host="may2009.archive.ensembl.org/biomart/martservice")
>>               biomart              version
>> 1 ENSEMBL_MART_ENSEMBL           Ensembl 54
>> 2     ENSEMBL_MART_SNP Ensembl Variation 54
>> 3    ENSEMBL_MART_VEGA              Vega 35
>> 4             REACTOME   Reactome(CSHL US)
>> 5     wormbase_current   WormBase (CSHL US)
>> 6                pride       PRIDE (EBI UK)
>>> mart=useMart("ENSEMBL_MART_ENSEMBL",
>> host="may2009.archive.ensembl.org/biomart/martservice")
>>
>>
>> etc....
>>
>>
>> I hope that helps,
>> Regards,
>> Rhoda
>>
>>
>>
>>
>>
>>
>> On 21 Sep 2009, at 14:25, Paul Leo wrote:
>>
>>> Wow, that is fairly terrible. I was surprised this thread was not
>>> followed up... did I miss something?
>>>
>>> You can't access hg18 via BioMart, only GRCh37!!
>>>
>>> 1)listMarts(archive=TRUE)   # shows marts back to 43 are there
>>>
>>> I'll start tracking back
>>>
>>>
>>> 2)mart<-
>>> useMart("ensembl_mart_51",dataset="hsapiens_gene_ensembl",archive
>>> ### WORKS FINE but is GRCh37
>>>
>>> 3)mart<-
>>> useMart("ensembl_mart_50",dataset="hsapiens_gene_ensembl",archive=TRUE)
>>>
>>> Error in value[[3L]](cond) :
>>> Request to BioMart web service failed. Verify if you are still
>>> connected to the internet.  Alternatively the BioMart web service is
>>> temporarily down.
>>> In addition: Warning message:
>>> In file(file, "r") : unable to resolve 'july2008.archive.ensembl.org'
>>>> #####  THAT's JUST BAD !
>>>
>>> 4)mart<-
>>> useMart("ensembl_mart_49",dataset="hsapiens_gene_ensembl",archive=TRUE)
>>> Checking attributes ... ok
>>> Checking filters ... ok
>>> Warning message:
>>> In bmAttrFilt("filters", mart) :
>>> biomaRt warning: looks like we're connecting to an older version of
>>> BioMart suite. Some biomaRt functions might not work.
>>>
>>> ### works, and that is NCBI36, but the attributes have old
>>> descriptions; it may still work for you (and me)
>>>
>>>
>>>
>>> I think 'july2008.archive.ensembl.org'  SHOULD BE
>>> 'jul2008.archive.ensembl.org'
>>> (three letter month name)
>>>
>>> Any way to fix that?
>>>
>>> Cheers
>>> Paul
>>>
>>> NOTE also broken in production version 2.9.2 I think
>>>
>>>> sessionInfo()
>>> R version 2.10.0 Under development (unstable) (2009-09-20 r49770)
>>> x86_64-unknown-linux-gnu
>>>
>>> locale:
>>> [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C
>>> [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8
>>> [5] LC_MONETARY=C              LC_MESSAGES=en_AU.UTF-8
>>> [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C
>>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods
>>> base
>>>
>>> other attached packages:
>>> [1] biomaRt_2.1.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] RCurl_1.2-0 XML_2.6-0
>>>
>>> -----Original Message-----
>>> From: jiayu wen <jiayu.jwen at gmail.com>
>>> To: bioconductor at stat.math.ethz.ch
>>> Subject: [BioC] BioMart and Ensembl questions
>>> Date: Tue, 1 Sep 2009 09:11:09 +0200
>>>
>>>
>>> Dear list,
>>>
>>> Over a year ago, I extracted 3'UTR sequences for about 7000
>>> genes using Biomart for my project. This is the command that I used:
>>>
>>> (my gene_list is in gene symbol)
>>>> my_mart = useMart("ensembl",dataset="hsapiens_gene_ensembl")
>>>> seq_3utr = getSequence(id = unique(gene.symbol),
>>> type="hgnc_symbol",seqType="3utr",mart = my_mart)
>>>> seq_3utr = seq_3utr[seq_3utr[,"3utr"] != "Sequence unavailable",]
>>>> # here: extract longest 3'UTR for each unique gene symbol
>>>> exportFASTA(seq_3utr, file=paste("s3utr.fa",sep=""))
>>>
>>> As my project progresses, I now need 3'UTR genomic coordinates to get
>>> phastCons conservation for some regions in the 3'UTR.
>>> To do that, I first convert hgnc_symbol back to ensembl_gene_id, then
>>> get 3'UTR coordinates using getBM like this:
>>>
>>>> s3utr = read.DNAStringSet(paste("s3utr.fa",sep=""),format="fasta")
>>>> gene_names = names(s3utr)
>>>> hgnc2ensembl  =
>>>> getBM(attributes=c("hgnc_symbol","ensembl_gene_id"),
>>> filters="hgnc_symbol", values=gene_names, mart=my_mart)
>>>> s3utr_pos  = getBM(attributes=c("ensembl_gene_id",
>>> "chromosome_name","strand","3_utr_start", "3_utr_end"),
>>> filters="ensembl_gene_id", values=as.character(hgnc2ensembl
>>> $ensembl_gene_id), mart=my_mart)
>>>> s3utr_pos = s3utr_pos[complete.cases(s3utr_pos),]
>>>
>>> By doing that, I can now only get about 5000 gene symbols with 3'UTR
>>> coordinates (converting from hgnc_symbol back to ensembl_gene_id
>>> loses about 250 genes). I was thinking it might be a version
>>> difference? So I tried to use the Ensembl archive, but it gives me the
>>> error below:
>>>
>>>> my_mart =
>>> useMart("ensembl_mart_50",dataset="hsapiens_gene_ensembl",archive=T)
>>> Error in value[[3L]](cond) :
>>>  Request to BioMart web service failed. Verify if you are still
>>> connected to the internet.  Alternatively the BioMart web service is
>>> temporarily down.
>>> In addition: Warning message:
>>> In file(file, "r") : cannot open: HTTP status was '404 Not Found'
>>>
>>> Is there any way that I can get 3'UTR coordinates for all of my gene list?
>>>
>>> Thanks for any help.
>>>
>>> Jean
>>> [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>> Rhoda Kinsella Ph.D.
>> Ensembl Bioinformatician,
>> European Bioinformatics Institute (EMBL-EBI),
>> Wellcome Trust Genome Campus,
>> Hinxton
>> Cambridge CB10 1SD,
>> UK.
>>
>>
> 

-- 
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826


