[BioC] BioMart and Ensembl questions !!!

jiayu wen jiayu.jean.wen at gmail.com
Tue Sep 22 12:06:56 CEST 2009


Hi Paul and others,

Thanks for responding to my questions. I will try these suggestions.

Jean

On Sep 22, 2009, at 2:41 AM, Paul Leo wrote:

>
> Just so this thread has a tidy conclusion:
>
> With both the development and production versions, to get to NCBI36 use
> the following.
>
> Choose your archive version at
>
> http://www.ensembl.org/index.html (follow the BioMart tab and then "View in
> archive site" at the bottom of the page). Click on the archive version
> and the URL in the browser will give you the host to use below:
>
>
> library(biomaRt)
> listMarts(host="may2009.archive.ensembl.org",path="/biomart/martservice",archive=FALSE)
> ### NOTE: use archive=FALSE (I had TRUE before, which was incorrect).
> ### That will give you the name of the "biomart" to use:
> ### ENSEMBL_MART_ENSEMBL in my case.
>
> ### Say you wanted human; then use the following (again, archive=FALSE is
> ### new, if you want higher than ensembl_mart_51):
>
> mart=useMart("ENSEMBL_MART_ENSEMBL",dataset="hsapiens_gene_ensembl",
> host="may2009.archive.ensembl.org",path="/biomart/martservice",archive=FALSE)
>
> All else is then as usual. Though watch out for that "NA" chromosome ;-)
>
> Cheers
> Paul
>
>
>
>
>
>
>
> -----Original Message-----
> From: Steffen at stat.Berkeley.EDU
> To: Rhoda Kinsella <rhoda at ebi.ac.uk>
> Cc: Paul Leo <p.leo at uq.edu.au>, bioconductor
> <bioconductor at stat.math.ethz.ch>
> Subject: Re: [BioC] BioMart and Ensembl questions !!!
> Date: Mon, 21 Sep 2009 10:56:02 -0700 (PDT)
>
> Hi Paul, Rhoda,
>
> Jim's earlier suggestion should fix this.  You need to specify a value for
> the chromosome name you're interested in.
>
> fil.vals = list(1,67325000,67620000)
>
> Then your query should return results (if there are any genes in this
> region).
>
> Cheers,
> Steffen
>
>> Hi Paul,
>> It looks like you are using an unstable version of biomaRt (R version
>> 2.10.0 Under development (unstable) (2009-09-20 r49770))
>> so can you try this with the 2.9.0 version and see if that works? Let
>> me know how you get on.
>> Regards,
>> Rhoda
>>
>> On 21 Sep 2009, at 15:23, Paul Leo wrote:
>>
>>> Hi Rhoda,
>>> Yes, a different version is probably it. There is STILL something
>>> wrong, based on your suggestions:
>>>
>>> library(biomaRt)
>>> listMarts(host="may2009.archive.ensembl.org",path="/biomart/martservice",archive=TRUE)
>>> mart=useMart("ensembl_mart_51", dataset="hsapiens_gene_ensembl",
>>> host="may2009.archive.ensembl.org",path="/biomart/martservice",archive=TRUE)
>>>
>>> works BUT queries then fail:
>>>
>>> ann<-getBM(attributes =
>>> c("ensembl_gene_id","external_gene_id","chromosome_name",
>>> "start_position","end_position","strand","hgnc_symbol","gene_biotype"),
>>> filters = a.filter, values=fil.vals, mart = mart)
>>>> ann
>>> [1] ensembl_gene_id  external_gene_id chromosome_name  start_position
>>> [5] end_position     strand           hgnc_symbol
>>> <0 rows> (or 0-length row.names)
>>>
>>>
>>>> a.filter
>>> [1] "chromosome_name" "start"           "end"
>>>> fil.vals
>>> [[1]]
>>> [1] NA
>>>
>>> [[2]]
>>> [1] 67325000
>>>
>>> [[3]]
>>> [1] 67620000
>>>
>>>
>>> I will try again tomorrow... it's late  at night in Australia....
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Rhoda Kinsella <rhoda at ebi.ac.uk>
>>> To: Paul Leo <p.leo at uq.edu.au>
>>> Cc: bioconductor <bioconductor at stat.math.ethz.ch>
>>> Subject: Re: [BioC] BioMart and Ensembl questions !!!
>>> Date: Mon, 21 Sep 2009 15:10:42 +0100
>>>
>>> Hi Paul
>>> I'm not really sure why you get this error... I am using the following
>>> version:
>>>
>>>> sessionInfo()
>>> R version 2.8.0 (2008-10-20)
>>> i386-apple-darwin8.11.1
>>>
>>> locale:
>>> en_GB.UTF-8/en_GB.UTF-8/C/C/en_GB.UTF-8/en_GB.UTF-8
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] biomaRt_1.16.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] RCurl_0.92-0 XML_1.98-1
>>>
>>> Does anyone know why Paul is getting this error?
>>> Regards,
>>> Rhoda
>>>
>>>
>>> On 21 Sep 2009, at 14:53, Paul Leo wrote:
>>>
>>>> Hi Rhoda,
>>>> Thanks, that seems exactly like what I want, but it does not work for
>>>> me...
>>>>
>>>> library(biomaRt)
>>>>> listMarts(host="nov2008.archive.ensembl.org/biomart/martservice")
>>>> Entity 'nbsp' not defined
>>>> Entity 'nbsp' not defined
>>>> Entity 'nbsp' not defined
>>>> Entity 'nbsp' not defined
>>>> Entity 'nbsp' not defined
>>>> Entity 'nbsp' not defined
>>>> Entity 'copy' not defined
>>>> Entity 'nbsp' not defined
>>>> Entity 'nbsp' not defined
>>>> Error in names(x) <- value :
>>>> 'names' attribute [2] must be the same length as the vector [0]
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>> http://www.ensembl.org/info/website/archives/
>>>>
>>>>
>>>> Once you are there, click on the release you would like to look at and
>>>> then on the biomart button. This will give you the URI you need to use
>>>> in the biomaRt package to get access to that archive. For example,
>>>> the release 51 archive biomart is available at:
>>>>
>>>>
>>>> http://nov2008.archive.ensembl.org/biomart/martview/
>>>>
>>>>
>>>> If you then plug this into biomaRt you can get access to the information
>>>> you require:
>>>>
>>>>
>>>>> library(biomaRt)
>>>>> listMarts(host="may2009.archive.ensembl.org/biomart/martservice")
>>>>             biomart              version
>>>> 1 ENSEMBL_MART_ENSEMBL           Ensembl 54
>>>> 2     ENSEMBL_MART_SNP Ensembl Variation 54
>>>> 3    ENSEMBL_MART_VEGA              Vega 35
>>>> 4             REACTOME   Reactome(CSHL US)
>>>> 5     wormbase_current   WormBase (CSHL US)
>>>> 6                pride       PRIDE (EBI UK)
>>>>> mart=useMart("ENSEMBL_MART_ENSEMBL",
>>>> host="may2009.archive.ensembl.org/biomart/martservice")
>>>>
>>>>
>>>> etc....
>>>>
>>>>
>>>> I hope that helps,
>>>> Regards,
>>>> Rhoda
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 21 Sep 2009, at 14:25, Paul Leo wrote:
>>>>
>>>>> Wow, that is fairly terrible. I was surprised this thread was not
>>>>> followed up... did I miss something?
>>>>>
>>>>> You can't access hg18 via BioMart, only GRCh37!!
>>>>>
>>>>> 1)listMarts(archive=TRUE)   # shows marts back to 43 are there
>>>>>
>>>>> I'll start tracking back
>>>>>
>>>>>
>>>>> 2)mart<-
>>>>> useMart("ensembl_mart_51",dataset="hsapiens_gene_ensembl",archive
>>>>> ### WORKS FINE but is GRCh37
>>>>>
>>>>> 3)mart<-useMart("ensembl_mart_50",dataset="hsapiens_gene_ensembl",archive=TRUE)
>>>>>
>>>>> Error in value[[3L]](cond) :
>>>>> Request to BioMart web service failed. Verify if you are still
>>>>> connected to the internet.  Alternatively the BioMart web service is
>>>>> temporarily down.
>>>>> In addition: Warning message:
>>>>> In file(file, "r") : unable to resolve 'july2008.archive.ensembl.org'
>>>>>> #####  THAT's JUST BAD !
>>>>>
>>>>> 4)mart<-useMart("ensembl_mart_49",dataset="hsapiens_gene_ensembl",archive=TRUE)
>>>>> Checking attributes ... ok
>>>>> Checking filters ... ok
>>>>> Warning message:
>>>>> In bmAttrFilt("filters", mart) :
>>>>> biomaRt warning: looks like we're connecting to an older version of
>>>>> BioMart suite. Some biomaRt functions might not work.
>>>>>
>>>>> ### Works, and that is NCBI36, but the attributes have old
>>>>> descriptions. It may still work for you (and me).
>>>>>
>>>>>
>>>>>
>>>>> I think 'july2008.archive.ensembl.org'  SHOULD BE
>>>>> 'jul2008.archive.ensembl.org'
>>>>> (three letter month name)
>>>>>
>>>>> Any way to fix that?
>>>>>
>>>>> Cheers
>>>>> Paul
>>>>>
>>>>> NOTE also broken in production version 2.9.2 I think
>>>>>
>>>>>> sessionInfo()
>>>>> R version 2.10.0 Under development (unstable) (2009-09-20 r49770)
>>>>> x86_64-unknown-linux-gnu
>>>>>
>>>>> locale:
>>>>> [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C
>>>>> [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8
>>>>> [5] LC_MONETARY=C              LC_MESSAGES=en_AU.UTF-8
>>>>> [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C
>>>>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
>>>>>
>>>>> attached base packages:
>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>
>>>>> other attached packages:
>>>>> [1] biomaRt_2.1.0
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] RCurl_1.2-0 XML_2.6-0
>>>>>
>>>>> -----Original Message-----
>>>>> From: jiayu wen <jiayu.jwen at gmail.com>
>>>>> To: bioconductor at stat.math.ethz.ch
>>>>> Subject: [BioC] BioMart and Ensembl questions
>>>>> Date: Tue, 1 Sep 2009 09:11:09 +0200
>>>>>
>>>>>
>>>>> Dear list,
>>>>>
>>>>> Over a year ago, I extracted 3'UTR sequences for about 7000
>>>>> genes using BioMart for my project. These are the commands that I used:
>>>>>
>>>>> (my gene_list is in gene symbol)
>>>>>> my_mart = useMart("ensembl",dataset="hsapiens_gene_ensembl")
>>>>>> seq_3utr = getSequence(id = unique(gene.symbol),
>>>>> type="hgnc_symbol",seqType="3utr",mart = my_mart)
>>>>>> seq_3utr = seq_3utr[seq_3utr[,"3utr"] != "Sequence unavailable",]
>>>>>> # here: extract longest 3'UTR for each unique gene symbol
>>>>>> exportFASTA(seq_3utr, file=paste("s3utr.fa",sep=""))
>>>>>
>>>>> As my project has progressed, I now need 3'UTR genomic coordinates to
>>>>> get phastCons conservation for some regions in the 3'UTR.
>>>>> To do that, I first convert hgnc_symbol back to ensembl_gene_id, then
>>>>> get 3'UTR coordinates using getBM like this:
>>>>>
>>>>>> s3utr = read.DNAStringSet(paste("s3utr.fa",sep=""),format="fasta")
>>>>>> gene_names = names(s3utr)
>>>>>> hgnc2ensembl = getBM(attributes=c("hgnc_symbol","ensembl_gene_id"),
>>>>> filters="hgnc_symbol", values=gene_names, mart=my_mart)
>>>>>> s3utr_pos  = getBM(attributes=c("ensembl_gene_id",
>>>>> "chromosome_name","strand","3_utr_start", "3_utr_end"),
>>>>> filters="ensembl_gene_id",
>>>>> values=as.character(hgnc2ensembl$ensembl_gene_id), mart=my_mart)
>>>>>> s3utr_pos = s3utr_pos[complete.cases(s3utr_pos),]
>>>>>
>>>>> By doing that, I can now get only about 5000 gene symbols with 3'UTR
>>>>> coordinates (converting from hgnc_symbol back to ensembl_gene_id
>>>>> loses about 250 genes). I was thinking it might be a version
>>>>> difference, so I tried to use the Ensembl archive, but it gives me the
>>>>> error below:
>>>>>
>>>>>> my_mart = useMart("ensembl_mart_50",dataset="hsapiens_gene_ensembl",archive=T)
>>>>> Error in value[[3L]](cond) :
>>>>> Request to BioMart web service failed. Verify if you are still
>>>>> connected to the internet.  Alternatively the BioMart web service is
>>>>> temporarily down.
>>>>> In addition: Warning message:
>>>>> In file(file, "r") : cannot open: HTTP status was '404 Not Found'
>>>>>
>>>>> Is there any way that I can get 3'UTR coordinates for my whole gene
>>>>> list?
>>>>>
>>>>> Thanks for any help.
>>>>>
>>>>> Jean
>>>>> [[alternative HTML version deleted]]
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at stat.math.ethz.ch
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>>
>>>>
>>>> Rhoda Kinsella Ph.D.
>>>> Ensembl Bioinformatician,
>>>> European Bioinformatics Institute (EMBL-EBI),
>>>> Wellcome Trust Genome Campus,
>>>> Hinxton
>>>> Cambridge CB10 1SD,
>>>> UK.
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
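
The quoted thread notes that the failing host 'july2008.archive.ensembl.org' should probably use a three-letter month, as 'nov2008' and 'may2009' do. A minimal sketch of building a host name that way follows. The helper archive_host is hypothetical (it is not part of biomaRt), and the naming pattern is only an assumption drawn from the hosts mentioned in the thread, so check the Ensembl archive list for the real host.

```r
# Hypothetical helper, not a biomaRt function: builds an archive host name
# from a year and a month number, using a lower-case three-letter month.
# Assumption: archive hosts follow the <mon><year>.archive.ensembl.org
# pattern seen in the thread (nov2008, may2009).
archive_host <- function(year, month) {
  stopifnot(month %in% 1:12)
  # month.abb is the base R constant c("Jan", "Feb", ..., "Dec")
  paste0(tolower(month.abb[month]), year, ".archive.ensembl.org")
}

host_jul2008 <- archive_host(2008, 7)
host_may2009 <- archive_host(2009, 5)
print(host_jul2008)
print(host_may2009)
```

The resulting string would be passed as the host argument to listMarts or useMart, as in Paul's summary.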

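The empty getBM result in the quoted thread came from an NA in the chromosome_name filter value, which Steffen's fil.vals = list(1,67325000,67620000) corrects. A small sketch of catching that before querying is below. The helper check_filter_values is hypothetical and not part of biomaRt; it only inspects the filter and value lists and sends no query.

```r
# Hypothetical helper, not a biomaRt function: stop early when any filter
# has a missing (NA or empty) value, instead of getting back zero rows.
check_filter_values <- function(filters, values) {
  stopifnot(length(filters) == length(values))
  is_missing <- vapply(values,
                       function(v) length(v) == 0 || anyNA(v),
                       logical(1))
  if (any(is_missing)) {
    stop("missing value for filter(s): ",
         paste(filters[is_missing], collapse = ", "))
  }
  invisible(values)
}

a.filter <- c("chromosome_name", "start", "end")

# Corrected values from the thread: chromosome 1 plus the region bounds.
fil.vals <- list(1, 67325000, 67620000)
check_filter_values(a.filter, fil.vals)

# The original values, with NA for the chromosome, are rejected.
bad.vals <- list(NA, 67325000, 67620000)
res <- tryCatch(check_filter_values(a.filter, bad.vals),
                error = function(e) conditionMessage(e))
print(res)
```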

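The original question leaves the "extract longest 3'UTR for each unique gene symbol" step as a placeholder. One way it might be done in base R is sketched here on a toy data frame. The column names "3utr" and "hgnc_symbol" follow the code in the thread, while the gene names and sequences are invented for illustration and are not real data.

```r
# Toy stand-in for the getSequence() result in the thread; the gene names
# and sequences are made up purely for illustration.
seq_3utr <- data.frame(
  `3utr` = c("ACGT", "ACGTACGT", "GG", "TTTAAA"),
  hgnc_symbol = c("GENE_A", "GENE_A", "GENE_B", "GENE_B"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Sort by gene symbol, longest sequence first within each symbol, then
# keep the first row for every symbol.
utr_len <- nchar(seq_3utr[, "3utr"])
sorted <- seq_3utr[order(seq_3utr$hgnc_symbol, -utr_len), ]
longest <- sorted[!duplicated(sorted$hgnc_symbol), ]
print(longest)
```

Ties are broken by original row order here; a real analysis might prefer a specific transcript instead.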
