[BioC] BioMart and Ensembl questions !!!

James W. MacDonald jmacdon at med.umich.edu
Mon Sep 21 16:48:27 CEST 2009


Hi Paul,

Paul Leo wrote:
> Hi Rhoda,
> Yes, a different version is probably it. There is STILL something
> wrong, based on your suggestions:
> 
> library(biomaRt)
> listMarts(host="may2009.archive.ensembl.org",path="/biomart/martservice",archive=TRUE)
> mart=useMart("ensembl_mart_51", dataset="hsapiens_gene_ensembl",
> host="may2009.archive.ensembl.org",path="/biomart/martservice",archive=TRUE)
> 
> works BUT queries then fail:
> 
> ann<-getBM(attributes =
> c( "ensembl_gene_id","external_gene_id","chromosome_name","start_position","end_position","strand","hgnc_symbol","gene_biotype"), filters = a.filter, values=fil.vals, mart = mart)
>> ann
> [1] ensembl_gene_id  external_gene_id chromosome_name  start_position  
> [5] end_position     strand           hgnc_symbol     
> <0 rows> (or 0-length row.names)
> 
> 
>> a.filter
> [1] "chromosome_name" "start"           "end"            
>> fil.vals
> [[1]]
> [1] NA
> 
> [[2]]
> [1] 67325000
> 
> [[3]]
> [1] 67620000

How do you expect a chromosome name of NA to retrieve anything? This 
won't work regardless of the mart.
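
One way to catch this before the query goes out is to check the filter values for NA first. The following is only a minimal sketch: check_filter_values() is a hypothetical helper written for this example, not a biomaRt function, and it assumes the filters and values are passed in the same order as in your code.

```r
# Hypothetical helper (not part of biomaRt): stop early if any filter
# value is NA, instead of sending a query that can only return 0 rows.
check_filter_values <- function(filters, values) {
  stopifnot(length(filters) == length(values))
  has_na <- vapply(values, function(v) anyNA(v), logical(1))
  if (any(has_na)) {
    stop("NA value supplied for filter(s): ",
         paste(filters[has_na], collapse = ", "))
  }
  invisible(TRUE)
}

a.filter <- c("chromosome_name", "start", "end")
fil.vals <- list(NA, 67325000, 67620000)

# The NA chromosome is reported by name.
result <- tryCatch(check_filter_values(a.filter, fil.vals),
                   error = function(e) conditionMessage(e))
print(result)

# With a real chromosome the check passes silently.
fil.vals[[1]] <- 1
check_filter_values(a.filter, fil.vals)
```

Calling something like this just before getBM() would have pointed at chromosome_name straight away.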

## first try with 'regular' mart...

 > library(biomaRt)
 > att <- c("ensembl_gene_id", "external_gene_id",
 + "chromosome_name", "start_position", "end_position", "strand", "hgnc_symbol")
 > a.filter <- c("chromosome_name","start","end")
 > fil.vals <- list(NA, 67325000,67620000)
 > mart <- useMart("ensembl","hsapiens_gene_ensembl")
Checking attributes ... ok
Checking filters ... ok
 > getBM(att, a.filter, fil.vals, mart)
[1] ensembl_gene_id  external_gene_id chromosome_name  start_position
[5] end_position     strand           hgnc_symbol
<0 rows> (or 0-length row.names)

## can't retrieve with an NA chromosome. try chromosome 1

 > fil.vals[[1]] <- 1
 > getBM(att, a.filter, fil.vals, mart)
  ensembl_gene_id external_gene_id chromosome_name start_position end_position
1 ENSG00000152763            WDR78               1       67278574     67390570
2 ENSG00000198160            MIER1               1       67390640     67454302
3 ENSG00000116704          SLC35D1               1       67469679     67520080
4 ENSG00000203963         C1orf141               1       67557859     67594220
   strand hgnc_symbol
1     -1       WDR78
2      1       MIER1
3     -1     SLC35D1
4     -1    C1orf141

## now let's try an archive...

 > mart <- useMart("ensembl_mart_51", dataset="hsapiens_gene_ensembl",
 + host="may2009.archive.ensembl.org", path="/biomart/martservice", archive=TRUE)
Checking attributes ... ok
Checking filters ... ok
 >
 > getBM(att, a.filter, fil.vals, mart)
  ensembl_gene_id external_gene_id chromosome_name start_position end_position
1 ENSG00000221076      AL389925.10               1       67505872     67505998
2 ENSG00000203963         C1orf141               1       67330447     67366808
3 ENSG00000210924       AL133320.8               1       67340919     67341049
4 ENSG00000162594            IL23R               1       67404757     67498236
5 ENSG00000210928      AL109843.25               1       67434415     67434511
6 ENSG00000210936      AL389925.10               1       67505870     67505998
7 ENSG00000081985          IL12RB2               1       67545635     67635171
8 ENSG00000217598      RP4-763G1.1               1       67516323     67517050
9 ENSG00000221733      AL109843.25               1       67477711     67477798
   strand hgnc_symbol
1     -1
2     -1    C1orf141
3     -1
4      1       IL23R
5      1
6     -1
7      1     IL12RB2
8      1
9     -1
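
Note that the two marts do not return the same genes for the same window. A quick way to see what overlaps is to merge the two results on ensembl_gene_id. This is only a sketch: the data frames below are typed in by hand from a subset of the output above (in practice they would be the two getBM() results), and the names current, archive and shared are made up for the example.

```r
# Subset of the rows returned by the 'regular' mart (copied from above).
current <- data.frame(
  ensembl_gene_id = c("ENSG00000152763", "ENSG00000198160",
                      "ENSG00000116704", "ENSG00000203963"),
  start_position  = c(67278574, 67390640, 67469679, 67557859),
  stringsAsFactors = FALSE
)

# Subset of the rows returned by the may2009 archive mart (copied from above).
archive <- data.frame(
  ensembl_gene_id = c("ENSG00000221076", "ENSG00000203963",
                      "ENSG00000162594", "ENSG00000081985"),
  start_position  = c(67505872, 67330447, 67404757, 67545635),
  stringsAsFactors = FALSE
)

# Keep only genes present in both results, with one start column per mart.
shared <- merge(current, archive, by = "ensembl_gene_id",
                suffixes = c(".current", ".archive"))

# Difference in start coordinate between the two marts.
shared$shift <- shared$start_position.current - shared$start_position.archive
print(shared)
```

Only C1orf141 (ENSG00000203963) shows up in both, and its start position differs between the two marts, so the same coordinate window is not picking out the same stretch of sequence in each.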


Best,

Jim



> 
> 
> I will try again tomorrow... it's late  at night in Australia....
> 
> 
> 
> -----Original Message-----
> From: Rhoda Kinsella <rhoda at ebi.ac.uk>
> To: Paul Leo <p.leo at uq.edu.au>
> Cc: bioconductor <bioconductor at stat.math.ethz.ch>
> Subject: Re: [BioC] BioMart and Ensembl questions !!!
> Date: Mon, 21 Sep 2009 15:10:42 +0100
> 
> Hi Paul
> I'm not really sure why you get this error... I am using the following  
> version:
> 
>  > sessionInfo()
> R version 2.8.0 (2008-10-20)
> i386-apple-darwin8.11.1
> 
> locale:
> en_GB.UTF-8/en_GB.UTF-8/C/C/en_GB.UTF-8/en_GB.UTF-8
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] biomaRt_1.16.0
> 
> loaded via a namespace (and not attached):
> [1] RCurl_0.92-0 XML_1.98-1
> 
> Does anyone know why Paul is getting this error?
> Regards,
> Rhoda
> 
> 
> On 21 Sep 2009, at 14:53, Paul Leo wrote:
> 
>> Hi Rhoda,
>> Thanks, that seems to be exactly what I want, but it does not work for
>> me...
>>
>>  library(biomaRt)
>>> listMarts(host="nov2008.archive.ensembl.org/biomart/martservice")
>> Entity 'nbsp' not defined
>> Entity 'nbsp' not defined
>> Entity 'nbsp' not defined
>> Entity 'nbsp' not defined
>> Entity 'nbsp' not defined
>> Entity 'nbsp' not defined
>> Entity 'copy' not defined
>> Entity 'nbsp' not defined
>> Entity 'nbsp' not defined
>> Error in names(x) <- value :
>>  'names' attribute [2] must be the same length as the vector [0]
>>
>>
>>
>> http://www.ensembl.org/info/website/archives/
>>
>>
>> Once you are there, click on the release you would like to look at and
>> then on the biomart button. This will give you the
>> URI you need to use
>> in the biomaRt package to get access to that archive. For example  
>> the release 51 archive biomart is
>> available at:
>>
>>
>> http://nov2008.archive.ensembl.org/biomart/martview/
>>
>>
>> If you then
>> plug this into biomart you can get access to the information you  
>> require:
>>
>>
>>> library(biomaRt)
>>> listMarts(host="may2009.archive.ensembl.org/biomart/martservice")
>>               biomart              version
>> 1 ENSEMBL_MART_ENSEMBL           Ensembl 54
>> 2     ENSEMBL_MART_SNP Ensembl Variation 54
>> 3    ENSEMBL_MART_VEGA              Vega 35
>> 4             REACTOME   Reactome(CSHL US)
>> 5     wormbase_current   WormBase (CSHL US)
>> 6                pride       PRIDE (EBI UK)
>>> mart=useMart("ENSEMBL_MART_ENSEMBL",
>> host="may2009.archive.ensembl.org/biomart/martservice")
>>
>>
>> etc....
>>
>>
>> I hope that helps,
>> Regards,
>> Rhoda
>>
>>
>>
>>
>>
>>
>> On 21 Sep 2009, at 14:25, Paul Leo wrote:
>>
>>> Wow, that is fairly terrible. I was surprised this thread was not
>>> followed up... did I miss something?
>>>
>>> You can't access hg18 via BioMart, only GRCh37!!
>>>
>>> 1)listMarts(archive=TRUE)   # shows marts back to 43 are there
>>>
>>> I'll start tracking back
>>>
>>>
>>> 2)mart<-
>>> useMart("ensembl_mart_51",dataset="hsapiens_gene_ensembl",archive
>>> ### WORKS FINE but is GRCh37
>>>
>>> 3)mart<-
>>> useMart 
>>> ("ensembl_mart_50",dataset="hsapiens_gene_ensembl",archive=TRUE)
>>>
>>> Error in value[[3L]](cond) :
>>> Request to BioMart web service failed. Verify if you are still
>>> connected to the internet.  Alternatively the BioMart web service is
>>> temporarily down.
>>> In addition: Warning message:
>>> In file(file, "r") : unable to resolve 'july2008.archive.ensembl.org'
>>>> #####  THAT's JUST BAD !
>>> 4)mart<-
>>> useMart 
>>> ("ensembl_mart_49",dataset="hsapiens_gene_ensembl",archive=TRUE)
>>> Checking attributes ... ok
>>> Checking filters ... ok
>>> Warning message:
>>> In bmAttrFilt("filters", mart) :
>>> biomaRt warning: looks like we're connecting to an older version of
>>> BioMart suite. Some biomaRt functions might not work.
>>>
>>> ### works, but that is NCBI36 and the attributes have old
>>> descriptions; it may still work for you (and me)
>>>
>>>
>>>
>>> I think 'july2008.archive.ensembl.org'  SHOULD BE
>>> 'jul2008.archive.ensembl.org'
>>> (three letter month name)
>>>
>>> Any way to fix that?
>>>
>>> Cheers
>>> Paul
>>>
>>> NOTE also broken in production version 2.9.2 I think
>>>
>>>> sessionInfo()
>>> R version 2.10.0 Under development (unstable) (2009-09-20 r49770)
>>> x86_64-unknown-linux-gnu
>>>
>>> locale:
>>> [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C
>>> [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8
>>> [5] LC_MONETARY=C              LC_MESSAGES=en_AU.UTF-8
>>> [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C
>>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods
>>> base
>>>
>>> other attached packages:
>>> [1] biomaRt_2.1.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] RCurl_1.2-0 XML_2.6-0
>>>
>>> -----Original Message-----
>>> From: jiayu wen <jiayu.jwen at gmail.com>
>>> To: bioconductor at stat.math.ethz.ch
>>> Subject: [BioC] BioMart and Ensembl questions
>>> Date: Tue, 1 Sep 2009 09:11:09 +0200
>>>
>>>
>>> Dear list,
>>>
>>> Just over a year ago, I extracted 3'UTR sequences for about 7000
>>> genes using BioMart for my project. This is the command that I used:
>>>
>>> (my gene_list is in gene symbol)
>>>> my_mart = useMart("ensembl",dataset="hsapiens_gene_ensembl")
>>>> seq_3utr = getSequence(id = unique(gene.symbol),
>>> type="hgnc_symbol",seqType="3utr",mart = my_mart)
>>>> seq_3utr = seq_3utr[seq_3utr[,"3utr"] != "Sequence unavailable",]
>>>> here: extract longest 3'UTR for each unique gene symbol
>>>> exportFASTA(seq_3utr, file=paste("s3utr.fa",sep=""))
>>> As my project progresses, I now need 3'UTR genomic coordinates to get
>>> phastCons conservation for some regions in the 3'UTR.
>>> To do that, I first convert hgnc_symbol back to ensembl_gene_id, then
>>> get 3'UTR coordinates using getBM like this:
>>>
>>>
>>>> s3utr = read.DNAStringSet(paste("s3utr.fa",sep=""),format="fasta")
>>>> gene_names = names(s3utr)
>>>> hgnc2ensembl  =
>>>> getBM(attributes=c("hgnc_symbol","ensembl_gene_id"),
>>> filters="hgnc_symbol", values=gene_names, mart=my_mart)
>>>> s3utr_pos  = getBM(attributes=c("ensembl_gene_id",
>>> "chromosome_name","strand","3_utr_start", "3_utr_end"),
>>> filters="ensembl_gene_id", values=as.character(hgnc2ensembl
>>> $ensembl_gene_id), mart=my_mart)
>>>> s3utr_pos = s3utr_pos[complete.cases(s3utr_pos),]
>>> By doing that, I can now only get about 5000 gene symbols with 3'UTR
>>> coordinates (converting from hgnc_symbol back to ensembl_gene_id
>>> loses about 250 genes). I was thinking it might be a version
>>> difference? So I tried to use the Ensembl archive, but it gives me the
>>> error below:
>>>
>>>> my_mart =
>>> useMart("ensembl_mart_50",dataset="hsapiens_gene_ensembl",archive=T)
>>> Error in value[[3L]](cond) :
>>>  Request to BioMart web service failed. Verify if you are still
>>> connected to the internet.  Alternatively the BioMart web service is
>>> temporarily down.
>>> In addition: Warning message:
>>> In file(file, "r") : cannot open: HTTP status was '404 Not Found'
>>>
>>> Is there any way that I can get 3'UTR coordinates for all of my gene
>>> list?
>>>
>>> Thanks for any help.
>>>
>>> Jean
>>> [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>> Rhoda Kinsella Ph.D.
>> Ensembl Bioinformatician,
>> European Bioinformatics Institute (EMBL-EBI),
>> Wellcome Trust Genome Campus,
>> Hinxton
>> Cambridge CB10 1SD,
>> UK.
>>
>>
> 
> 

-- 
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826


