[BioC] BioMart and Ensembl questions !!!

Paul Leo p.leo at uq.edu.au
Mon Sep 21 15:53:56 CEST 2009


Hi Rhoda,
Thanks, that seems to be exactly what I want, but it does not work for
me...

> library(biomaRt)
> listMarts(host="nov2008.archive.ensembl.org/biomart/martservice")
Entity 'nbsp' not defined
Entity 'nbsp' not defined
Entity 'nbsp' not defined
Entity 'nbsp' not defined
Entity 'nbsp' not defined
Entity 'nbsp' not defined
Entity 'copy' not defined
Entity 'nbsp' not defined
Entity 'nbsp' not defined
Error in names(x) <- value : 
  'names' attribute [2] must be the same length as the vector [0]
> 




http://www.ensembl.org/info/website/archives/


Once you are there, click on the release you would like to look at and
then on the BioMart button. This will give you the URI you need to use
in the biomaRt package to get access to that archive. For example, the
release 51 archive BioMart is available at:


http://nov2008.archive.ensembl.org/biomart/martview/


If you then plug this into biomaRt, you can get access to the
information you require:


> library(biomaRt)
> listMarts(host="may2009.archive.ensembl.org/biomart/martservice")
               biomart              version
1 ENSEMBL_MART_ENSEMBL           Ensembl 54
2     ENSEMBL_MART_SNP Ensembl Variation 54
3    ENSEMBL_MART_VEGA              Vega 35
4             REACTOME   Reactome(CSHL US) 
5     wormbase_current   WormBase (CSHL US)
6                pride       PRIDE (EBI UK)
> mart=useMart("ENSEMBL_MART_ENSEMBL",
host="may2009.archive.ensembl.org/biomart/martservice")


etc....
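
The archive hosts shown above (nov2008, may2009) appear to follow one pattern: a lowercase three-letter month, then the four-digit year, then .archive.ensembl.org. A minimal sketch of building such a host string in R, assuming that pattern holds for the release you are after (archive_host is a hypothetical helper for illustration, not part of biomaRt):

```r
# Hypothetical helper (not part of biomaRt): build an Ensembl archive
# host string from a release year and month number.
# Assumption: archive hosts are named <mon><year>.archive.ensembl.org,
# with <mon> a lowercase three-letter English month abbreviation.
archive_host <- function(year, month) {
  stopifnot(month >= 1, month <= 12)
  mon <- tolower(month.abb[month])  # month.abb is a base R constant: "Jan", "Feb", ...
  paste0(mon, year, ".archive.ensembl.org/biomart/martservice")
}

# Host strings for the two archives mentioned in this thread.
print(archive_host(2008, 11))
print(archive_host(2009, 5))

# Usage with biomaRt (needs network access, so left commented out):
# library(biomaRt)
# listMarts(host = archive_host(2009, 5))
```

The resulting string can be passed as the host argument in the same way as the literal host names in the listMarts and useMart calls above. Whether a given month and year actually has an archive still has to be checked on the archives page.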


I hope that helps,
Regards,
Rhoda






On 21 Sep 2009, at 14:25, Paul Leo wrote:

> Wow, that is fairly terrible. I was surprised this thread was not
> followed up... did I miss something?
> 
> You can't access hg18 via BioMart, only GRCh37!!
> 
> 1) listMarts(archive=TRUE)   # shows that marts back to 43 are there
> 
> I'll start tracking back
> 
> 
> 2)mart<-
> useMart("ensembl_mart_51",dataset="hsapiens_gene_ensembl",archive
> ### WORKS FINE but is GRCh37
> 
> 3)mart<-
> useMart("ensembl_mart_50",dataset="hsapiens_gene_ensembl",archive=TRUE)
> 
> Error in value[[3L]](cond) : 
>  Request to BioMart web service failed. Verify if you are still
> connected to the internet.  Alternatively the BioMart web service is
> temporarily down.
> In addition: Warning message:
> In file(file, "r") : unable to resolve 'july2008.archive.ensembl.org'
> > #####  THAT's JUST BAD !
> 
> 4)mart<-
> useMart("ensembl_mart_49",dataset="hsapiens_gene_ensembl",archive=TRUE)
> Checking attributes ... ok
> Checking filters ... ok
> Warning message:
> In bmAttrFilt("filters", mart) :
>  biomaRt warning: looks like we're connecting to an older version of
> BioMart suite. Some biomaRt functions might not work.
> 
> ### Works, but that is NCBI36 and the attributes have old descriptions;
> it may work for you (and me), though.
> 
> 
> 
> I think 'july2008.archive.ensembl.org'  SHOULD BE
> 'jul2008.archive.ensembl.org'
> (three letter month name)
> 
> Any way to fix that?
> 
> Cheers
> Paul
> 
> NOTE also broken in production version 2.9.2 I think
> 
> > sessionInfo()
> R version 2.10.0 Under development (unstable) (2009-09-20 r49770) 
> x86_64-unknown-linux-gnu 
> 
> locale:
> [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
> [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
> [5] LC_MONETARY=C              LC_MESSAGES=en_AU.UTF-8   
> [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
> [9] LC_ADDRESS=C               LC_TELEPHONE=C            
> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] biomaRt_2.1.0
> 
> loaded via a namespace (and not attached):
> [1] RCurl_1.2-0 XML_2.6-0  
> 
> -----Original Message-----
> From: jiayu wen <jiayu.jwen at gmail.com>
> To: bioconductor at stat.math.ethz.ch
> Subject: [BioC] BioMart and Ensembl questions
> Date: Tue, 1 Sep 2009 09:11:09 +0200
> 
> 
> Dear list,
> 
> A little over a year ago, I extracted 3'UTR sequences for about 7000
> genes using BioMart for my project. This is the command that I used:
> 
> (my gene_list is in gene symbol)
> > my_mart = useMart("ensembl",dataset="hsapiens_gene_ensembl")
> > seq_3utr = getSequence(id = unique(gene.symbol),  
> type="hgnc_symbol",seqType="3utr",mart = my_mart)
> > seq_3utr = seq_3utr[seq_3utr[,"3utr"] != "Sequence unavailable",]
> > # (here: extract the longest 3'UTR for each unique gene symbol)
> > exportFASTA(seq_3utr, file=paste("s3utr.fa",sep=""))
> 
> As my project progresses, I now need 3'UTR genomic coordinates to get
> phastCons conservation scores for some regions in the 3'UTR.
> To do that, I first convert hgnc_symbol back to ensembl_gene_id, then
> get 3'UTR coordinates using getBM like this:
> 
> > s3utr = read.DNAStringSet(paste("s3utr.fa",sep=""),format="fasta")
> > gene_names = names(s3utr)
> > hgnc2ensembl = getBM(attributes=c("hgnc_symbol","ensembl_gene_id"),
>     filters="hgnc_symbol", values=gene_names, mart=my_mart)
> > s3utr_pos = getBM(attributes=c("ensembl_gene_id",
>     "chromosome_name","strand","3_utr_start","3_utr_end"),
>     filters="ensembl_gene_id",
>     values=as.character(hgnc2ensembl$ensembl_gene_id), mart=my_mart)
> > s3utr_pos = s3utr_pos[complete.cases(s3utr_pos),]
> 
> Doing that, I can now only get 3'UTR coordinates for about 5000 gene
> symbols (converting from hgnc_symbol back to ensembl_gene_id loses
> about 250 genes). I was thinking it might be a version difference, so
> I tried to use the Ensembl archive, but it gives me the error below:
> 
> > my_mart =  
> useMart("ensembl_mart_50",dataset="hsapiens_gene_ensembl",archive=T)
> Error in value[[3L]](cond) :
>   Request to BioMart web service failed. Verify if you are still  
> connected to the internet.  Alternatively the BioMart web service is  
> temporarily down.
> In addition: Warning message:
> In file(file, "r") : cannot open: HTTP status was '404 Not Found'
> 
> Is there any way that I can get 3'UTR coordinates for my whole gene list?
> 
> Thanks for any help.
> 
> Jean
> [[alternative HTML version deleted]]
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> 

Rhoda Kinsella Ph.D.
Ensembl Bioinformatician,
European Bioinformatics Institute (EMBL-EBI),
Wellcome Trust Genome Campus, 
Hinxton
Cambridge CB10 1SD,
UK.


