[BioC] "start" filter in biomaRt

Mon Aug 25 23:43:30 CEST 2008

Hi Steffen,

Thanks!

...Tao

----- Original Message ----
From: "steffen at stat.Berkeley.EDU" <steffen at stat.Berkeley.EDU>
To: "Shi, Tao" <shidaxia at yahoo.com>
Cc: bioconductor at stat.math.ethz.ch
Sent: Monday, August 25, 2008 12:33:29 PM
Subject: Re: [BioC] "start" filter in biomaRt

Hi Tao,

If you add the strand attribute (see below), you'll notice that the first
gene lies on the reverse strand.  In Ensembl, start_position and
end_position are always given in forward-strand coordinates.  Because this
gene is on the reverse strand, its actual start is 1029881 (its
end_position), which is beyond 1000000, and this is why it was returned.

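library(biomaRt)
mart = useMart("ensembl", dataset="hsapiens_gene_ensembl")
# same query as yours, plus the strand attribute
g = getBM(attributes = c("ensembl_gene_id","start_position","end_position","strand"),
          filters = c("chromosome_name","start"),
          values = list(17,1000000), mart = mart)
# sort by start_position and show the first 10 genes
ord = order(g[,2])
g = g[ord,]
g[1:10,]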
     ensembl_gene_id start_position end_position strand
1135 ENSG00000159842         853510      1029881     -1
1442 ENSG00000205899        1120603      1121504      1
690  ENSG00000184811        1129707      1151031      1
1134 ENSG00000209456        1156460      1156529     -1
452  ENSG00000108953        1194595      1250267     -1
451  ENSG00000167193        1272190      1306294     -1
1133 ENSG00000197879        1314232      1342633     -1
450  ENSG00000132376        1344622      1366719     -1
689  ENSG00000174238        1368037      1412835     -1
449  ENSG00000167703        1424448      1478880     -1
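
If what you want is a plain ">=" on the forward-strand start_position, one
option is to filter the result locally after getBM() returns.  The following
is only a sketch: it rebuilds the first three rows of the table above by hand
so it runs without a mart connection, and the column name strand_aware_start
is a made-up name for illustration, not something biomaRt returns.

```r
# Rows copied by hand from the getBM() output above (first three genes only).
g <- data.frame(
  ensembl_gene_id = c("ENSG00000159842", "ENSG00000205899", "ENSG00000184811"),
  start_position  = c(853510, 1120603, 1129707),
  end_position    = c(1029881, 1121504, 1151031),
  strand          = c(-1, 1, 1),
  stringsAsFactors = FALSE
)

# Strand-aware start (hypothetical column name): for a reverse-strand gene
# the gene begins at end_position, following the explanation above.
g$strand_aware_start <- ifelse(g$strand == 1, g$start_position, g$end_position)

# Plain ">=" on the forward-strand coordinate, applied locally.
g_fwd <- g[g$start_position >= 1000000, ]
print(g_fwd$ensembl_gene_id)
```

With the real query result you would skip the data.frame() step and apply
the last two statements directly to g.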

cheers,
Steffen

> Hi list,
>
> In the following code, I'm using 'biomaRt' to retrieve all the genes that
> start beyond 1000000 on chromosome 17.  However, I'm not expecting the
> first gene which starts at 853510 to be in the resulting list.  It seems
> the "start" filter is not just simple ">=".   More explanations, please!
>
> Thanks!
>
> ...Tao
>
>
>
>> library(biomaRt)
> Loading required package: RCurl
>> mart = useMart("ensembl", dataset="hsapiens_gene_ensembl")
> Checking attributes and filters ... ok
>> tmp2 <-
>> getBM(attributes=c("ensembl_gene_id","start_position","end_position"),filters
>> = c("chromosome_name", "start"), values=list("17", "1000000"), mart =
>> mart)
>> tmp2[order(tmp2$start_position),][1:10,]
>      ensembl_gene_id start_position end_position
> 1135 ENSG00000159842         853510      1029881
> 1442 ENSG00000205899        1120603      1121504
> 690  ENSG00000184811        1129707      1151031
> 1134 ENSG00000209456        1156460      1156529
> 452  ENSG00000108953        1194595      1250267
> 451  ENSG00000167193        1272190      1306294
> 1133 ENSG00000197879        1314232      1342633
> 450  ENSG00000132376        1344622      1366719
> 689  ENSG00000174238        1368037      1412835
> 449  ENSG00000167703        1424448      1478880
>> sessionInfo()
> R version 2.7.0 (2008-04-22)
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
> States.1252;LC_MONETARY=English_United
> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] biomaRt_1.14.0 RCurl_0.9-3
>
> loaded via a namespace (and not attached):
> [1] XML_1.95-2
>>
>