[BioC] BioMart and Ensembl questions !!!

Paul Leo p.leo at uq.edu.au
Tue Sep 22 02:41:24 CEST 2009


Just so this thread has a tidy conclusion:

With both the development and production versions of biomaRt, you can get
to NCBI36 as follows.

To choose your archive version, go to

http://www.ensembl.org/index.html (follow the BioMart tab and then "View in
archive site" at the bottom of the page). Click on the archive version
and the URL in the browser will give you the host to use below:


library(biomaRt)
listMarts(host="may2009.archive.ensembl.org",path="/biomart/martservice",archive=FALSE)
### NOTE: use archive=FALSE (I had TRUE before, which was incorrect).
### That will give you the name of the "biomart" to use,
### ENSEMBL_MART_ENSEMBL in my case.

#### Say you wanted human, then use the following (again, archive=FALSE is
### new, for when you want higher than ensembl_mart_51):

mart=useMart("ENSEMBL_MART_ENSEMBL",dataset="hsapiens_gene_ensembl",
host="may2009.archive.ensembl.org",path="/biomart/martservice",archive=FALSE)

All else is then as usual. Though watch out for that "NA" chromosome ;-)
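
As a minimal sketch of the host lookup step above: the snippet below pulls
the host out of an archive URL copied from the browser and passes it to
useMart(). archive_host() and connect_archive() are hypothetical helper
names, not biomaRt functions, and the useMart() arguments are only the ones
used in this message, so they may need adjusting for other biomaRt versions.

```r
# Hypothetical helper (not part of biomaRt): pull the host name out of an
# archive URL copied from the browser.
archive_host <- function(url) {
  no_scheme <- sub("^[a-zA-Z]+://", "", url)  # drop "http://" if present
  sub("/.*$", "", no_scheme)                  # drop everything after the host
}

# Hypothetical wrapper around the useMart() call shown above; the argument
# names are the ones used in this thread. Calling it needs biomaRt installed
# and a network connection, so it is only defined here, not run.
connect_archive <- function(url, dataset = "hsapiens_gene_ensembl") {
  biomaRt::useMart("ENSEMBL_MART_ENSEMBL", dataset = dataset,
                   host = archive_host(url),
                   path = "/biomart/martservice", archive = FALSE)
}

archive_host("http://may2009.archive.ensembl.org/biomart/martview/")
```

The last line only exercises the string handling; connect_archive() would
be called with the same URL once you are ready to query.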

Cheers
Paul







-----Original Message-----
From: Steffen at stat.Berkeley.EDU
To: Rhoda Kinsella <rhoda at ebi.ac.uk>
Cc: Paul Leo <p.leo at uq.edu.au>, bioconductor
<bioconductor at stat.math.ethz.ch>
Subject: Re: [BioC] BioMart and Ensembl questions !!!
Date: Mon, 21 Sep 2009 10:56:02 -0700 (PDT)

Hi Paul, Rhoda,

Jim's earlier suggestion should fix this.  You need to specify a value for
the chromosome name you're interested in.

fil.vals = list(1,67325000,67620000)

Then your query should return results (if there are any genes in this
region).
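
A small guard along these lines could catch the problem before the query is
sent. This is only a sketch: check_filter_values() is a hypothetical helper,
not a biomaRt function, and the filter names and values are the ones from
Paul's message.

```r
# Hypothetical helper (not part of biomaRt): refuse to run a query whose
# filter values contain NA or do not line up with the filter names.
check_filter_values <- function(filters, values) {
  if (length(filters) != length(values)) {
    stop("need exactly one value entry per filter")
  }
  has_na <- vapply(values, function(v) any(is.na(v)), logical(1))
  if (any(has_na)) {
    stop("NA value for filter(s): ", paste(filters[has_na], collapse = ", "))
  }
  invisible(TRUE)
}

a.filter <- c("chromosome_name", "start", "end")
fil.vals <- list(1, 67325000, 67620000)
check_filter_values(a.filter, fil.vals)   # passes silently

# The failing case from the quoted message would stop with an error:
# check_filter_values(a.filter, list(NA, 67325000, 67620000))
```

Once the check passes, a.filter and fil.vals can be handed to getBM() as
filters= and values= in the usual way.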

Cheers,
Steffen

> Hi Paul,
> It looks like you are using an unstable version of biomaRt (R version
> 2.10.0 Under development (unstable) (2009-09-20 r49770))
> so can you try this with the 2.9.0 version and see if that works? Let
> me know how you get on.
> Regards,
> Rhoda
>
> On 21 Sep 2009, at 15:23, Paul Leo wrote:
>
>> HI Rhoda,
>> Yes a different version is probably it . There is STILL something
>> wrong, based on your suggestions:
>>
>> library(biomaRt)
>> listMarts(host="may2009.archive.ensembl.org",path="/biomart/martservice",archive=TRUE)
>> mart=useMart("ensembl_mart_51", dataset="hsapiens_gene_ensembl",
>> host="may2009.archive.ensembl.org",path="/biomart/martservice",archive=TRUE)
>>
>> works BUT queries then fail:
>>
>> ann<-getBM(attributes =
>> c("ensembl_gene_id","external_gene_id","chromosome_name",
>> "start_position","end_position","strand","hgnc_symbol","gene_biotype"),
>> filters = a.filter, values=fil.vals, mart = mart)
>>> ann
>> [1] ensembl_gene_id  external_gene_id chromosome_name  start_position
>> [5] end_position     strand           hgnc_symbol
>> <0 rows> (or 0-length row.names)
>>
>>
>>> a.filter
>> [1] "chromosome_name" "start"           "end"
>>> fil.vals
>> [[1]]
>> [1] NA
>>
>> [[2]]
>> [1] 67325000
>>
>> [[3]]
>> [1] 67620000
>>
>>
>> I will try again tomorrow... it's late  at night in Australia....
>>
>>
>>
>> -----Original Message-----
>> From: Rhoda Kinsella <rhoda at ebi.ac.uk>
>> To: Paul Leo <p.leo at uq.edu.au>
>> Cc: bioconductor <bioconductor at stat.math.ethz.ch>
>> Subject: Re: [BioC] BioMart and Ensembl questions !!!
>> Date: Mon, 21 Sep 2009 15:10:42 +0100
>>
>> Hi Paul
>> I'm not really sure why you get this error... I am using the following
>> version:
>>
>>> sessionInfo()
>> R version 2.8.0 (2008-10-20)
>> i386-apple-darwin8.11.1
>>
>> locale:
>> en_GB.UTF-8/en_GB.UTF-8/C/C/en_GB.UTF-8/en_GB.UTF-8
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] biomaRt_1.16.0
>>
>> loaded via a namespace (and not attached):
>> [1] RCurl_0.92-0 XML_1.98-1
>>
>> Does anyone know why Paul is getting this error?
>> Regards,
>> Rhoda
>>
>>
>> On 21 Sep 2009, at 14:53, Paul Leo wrote:
>>
>>> HI Rhoda ,
>>> Thanks, that seems to be exactly what I want, but it does not work for
>>> me...
>>>
>>> library(biomaRt)
>>>> listMarts(host="nov2008.archive.ensembl.org/biomart/martservice")
>>> Entity 'nbsp' not defined
>>> Entity 'nbsp' not defined
>>> Entity 'nbsp' not defined
>>> Entity 'nbsp' not defined
>>> Entity 'nbsp' not defined
>>> Entity 'nbsp' not defined
>>> Entity 'copy' not defined
>>> Entity 'nbsp' not defined
>>> Entity 'nbsp' not defined
>>> Error in names(x) <- value :
>>> 'names' attribute [2] must be the same length as the vector [0]
>>>>
>>>
>>>
>>>
>>>
>>> http://www.ensembl.org/info/website/archives/
>>>
>>>
>>> Once you are there, click on the release you would like to look at
>>> and
>>> then on the biomart button. This will give you the
>>> URI you need to use
>>> in the biomaRt package to get access to that archive. For example
>>> the release 51 archive biomart is
>>> available at:
>>>
>>>
>>> http://nov2008.archive.ensembl.org/biomart/martview/
>>>
>>>
>>> If you then
>>> plug this into biomart you can get access to the information you
>>> require:
>>>
>>>
>>>> library(biomaRt)
>>>> listMarts(host="may2009.archive.ensembl.org/biomart/martservice")
>>>              biomart              version
>>> 1 ENSEMBL_MART_ENSEMBL           Ensembl 54
>>> 2     ENSEMBL_MART_SNP Ensembl Variation 54
>>> 3    ENSEMBL_MART_VEGA              Vega 35
>>> 4             REACTOME   Reactome(CSHL US)
>>> 5     wormbase_current   WormBase (CSHL US)
>>> 6                pride       PRIDE (EBI UK)
>>>> mart=useMart("ENSEMBL_MART_ENSEMBL",
>>> host="may2009.archive.ensembl.org/biomart/martservice")
>>>
>>>
>>> etc....
>>>
>>>
>>> I hope that helps,
>>> Regards,
>>> Rhoda
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 21 Sep 2009, at 14:25, Paul Leo wrote:
>>>
>>>> Wow, that is fairly terrible. I was surprised this thread was not
>>>> followed up... did I miss something?
>>>>
>>>> You can't access hg18 via BioMart, only GRCh37!!
>>>>
>>>> 1)listMarts(archive=TRUE)   # shows marts back to 43 are there
>>>>
>>>> I'll start tracking back
>>>>
>>>>
>>>> 2)mart<-
>>>> useMart("ensembl_mart_51",dataset="hsapiens_gene_ensembl",archive
>>>> ### WORKS FINE but is GRCh37
>>>>
>>>> 3)mart<-
>>>> useMart
>>>> ("ensembl_mart_50",dataset="hsapiens_gene_ensembl",archive=TRUE)
>>>>
>>>> Error in value[[3L]](cond) :
>>>> Request to BioMart web service failed. Verify if you are still
>>>> connected to the internet.  Alternatively the BioMart web service is
>>>> temporarily down.
>>>> In addition: Warning message:
>>>> In file(file, "r") : unable to resolve
>>>> 'july2008.archive.ensembl.org'
>>>>> #####  THAT's JUST BAD !
>>>>
>>>> 4)mart<-
>>>> useMart
>>>> ("ensembl_mart_49",dataset="hsapiens_gene_ensembl",archive=TRUE)
>>>> Checking attributes ... ok
>>>> Checking filters ... ok
>>>> Warning message:
>>>> In bmAttrFilt("filters", mart) :
>>>> biomaRt warning: looks like we're connecting to an older version of
>>>> BioMart suite. Some biomaRt functions might not work.
>>>>
>>>> . ### works but that is NCBI36 but the attributes have old
>>>> descriptions
>>>> but may work for you (and me)
>>>>
>>>>
>>>>
>>>> I think 'july2008.archive.ensembl.org'  SHOULD BE
>>>> 'jul2008.archive.ensembl.org'
>>>> (three letter month name)
>>>>
>>>> Any way to fix that?
>>>>
>>>> Cheers
>>>> Paul
>>>>
>>>> NOTE also broken in production version 2.9.2 I think
>>>>
>>>>> sessionInfo()
>>>> R version 2.10.0 Under development (unstable) (2009-09-20 r49770)
>>>> x86_64-unknown-linux-gnu
>>>>
>>>> locale:
>>>> [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C
>>>> [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8
>>>> [5] LC_MONETARY=C              LC_MESSAGES=en_AU.UTF-8
>>>> [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C
>>>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods
>>>> base
>>>>
>>>> other attached packages:
>>>> [1] biomaRt_2.1.0
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] RCurl_1.2-0 XML_2.6-0
>>>>
>>>> -----Original Message-----
>>>> From: jiayu wen <jiayu.jwen at gmail.com>
>>>> To: bioconductor at stat.math.ethz.ch
>>>> Subject: [BioC] BioMart and Ensembl questions
>>>> Date: Tue, 1 Sep 2009 09:11:09 +0200
>>>>
>>>>
>>>> Dear list,
>>>>
>>>> A little over a year ago, I extracted 3'UTR sequences for about 7000
>>>> genes using Biomart for my project. This is the command that I used:
>>>>
>>>> (my gene_list is in gene symbol)
>>>>> my_mart = useMart("ensembl",dataset="hsapiens_gene_ensembl")
>>>>> seq_3utr = getSequence(id = unique(gene.symbol),
>>>> type="hgnc_symbol",seqType="3utr",mart = my_mart)
>>>>> seq_3utr = seq_3utr[seq_3utr[,"3utr"] != "Sequence unavailable",]
>>>>> # here: extract longest 3'UTR for each unique gene symbol
>>>>> exportFASTA(seq_3utr, file=paste("s3utr.fa",sep=""))
>>>>
>>>> As my project goes, I now need 3'UTR genomic coordinates to get
>>>> phastcons conservation for some regions in 3'UTR.
>>>> To do that, I first convert hgnc_symbol back to ensembl_gene_id,
>>>> then
>>>>
>>>> get 3'UTR coordinates using getBM like this:
>>>>
>>>>> s3utr = read.DNAStringSet(paste("s3utr.fa",sep=""),format="fasta")
>>>>> gene_names = names(s3utr)
>>>>> hgnc2ensembl  =
>>>>> getBM(attributes=c("hgnc_symbol","ensembl_gene_id"),
>>>> filters="hgnc_symbol", values=gene_names, mart=my_mart)
>>>>> s3utr_pos  = getBM(attributes=c("ensembl_gene_id",
>>>> "chromosome_name","strand","3_utr_start", "3_utr_end"),
>>>> filters="ensembl_gene_id", values=as.character(hgnc2ensembl
>>>> $ensembl_gene_id), mart=my_mart)
>>>>> s3utr_pos = s3utr_pos[complete.cases(s3utr_pos),]
>>>>
>>>> By doing that, now I can only get about 5000 gene symbols with 3'UTR
>>>> coordinates (converting from hgnc_symbol back to ensembl_gene_id
>>>> loses about 250 genes). I was thinking it might be a version
>>>> difference? So I tried to use the Ensembl archive but it gives me the
>>>> error below:
>>>>
>>>>> my_mart =
>>>> useMart("ensembl_mart_50",dataset="hsapiens_gene_ensembl",archive=T)
>>>> Error in value[[3L]](cond) :
>>>> Request to BioMart web service failed. Verify if you are still
>>>> connected to the internet.  Alternatively the BioMart web service is
>>>> temporarily down.
>>>> In addition: Warning message:
>>>> In file(file, "r") : cannot open: HTTP status was '404 Not Found'
>>>>
>>>> Is there any way that I can get 3'UTR coordinates for all of my gene
>>>> list?
>>>>
>>>> Thanks for any help.
>>>>
>>>> Jean
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>> Rhoda Kinsella Ph.D.
>>> Ensembl Bioinformatician,
>>> European Bioinformatics Institute (EMBL-EBI),
>>> Wellcome Trust Genome Campus,
>>> Hinxton
>>> Cambridge CB10 1SD,
>>> UK.
>>>
>>>
>>
>>
>
>


