[BioC] unexpected genes names list using getBM{biomaRt}

Tue Dec 8 17:52:58 CET 2009

Hi Ramzi,

When using multiple filters with more than one value for each filter, the
BioMart webservice won't match up the different values for each filter.

For example the following query:

getBM(c("hgnc_symbol","chromosome_name","start_position","end_position"),filters=c("chromosome_name","start","end"),values=list(c(11,2),c(30576841,86479831),c(30576891,86479881)),mart=mart)

in XML:

<?xml version='1.0' encoding='UTF-8'?><!DOCTYPE Query><Query 
virtualSchemaName = 'default' uniqueRows = '1' count = '0'
datasetConfigVersion = '0.6' requestid= "biomaRt"> <Dataset name =
'hsapiens_gene_ensembl'><Attribute name = 'hgnc_symbol'/><Attribute name =
'chromosome_name'/><Attribute name = 'start_position'/><Attribute name =
'end_position'/><Filter name = 'chromosome_name' value = '11,2' /><Filter
name = 'start' value = '30576841,86479831' /><Filter name = 'end' value =
'30576891,86479881' /></Dataset></Query>

will return hgnc_symbols from genes on chromosome 11 between position
30576841 and 30576891  but probably also genes on chromosome 11 between
position 30576841 and 86479881 (and all other combinations).

For this type of query you'll either have to query region by region at
once, or even better query all hgnc_symbols and their positions first and
then select in R which genes fall in your regions of interest.
You could use the following query to do this.

mart=useMart("ensembl", dataset="hsapiens_gene_ensembl")
allGenePositions =
getBM(c("hgnc_symbol","ensembl_gene_id","chromosome_name","start_position","end_position"),
mart=mart)

Cheers,
Steffen

Ramzi TEMANNI wrote:
Hi James,
I found out that I'm using UCSC hg19 ref but Biomart uses GRCh37 ref and
that's why the output generates many genes.
i'm trying to find out a way to convert coordinates but seems I have to
align again using the GRCh37 ref.

I don't think that is the problem, as UCSC hg19 is based on GRCh37. I
think there might be a problem on the Biomart server side. As far as I can
tell, the query string produced by biomaRt is correct, but the Biomart
server returns more than I think it should.

In fact, if you do the query for each chromosome individually, you either
get one or zero genes returned. It's only if you do the query for more
than one chromosomal region at a time that you get these extra genes:

> getBM("hgnc_symbol", c("chromosome_name","start","end"),
list(t.cpd[3,1], t.cpd[3,2], t.cpd[3,2]+50), enseembl)
[1] hgnc_symbol
<0 rows> (or 0-length row.names)
> getBM("hgnc_symbol", c("chromosome_name","start","end"),
list(t.cpd[4,1], t.cpd[4,2], t.cpd[4,2]+50), enseembl)
 hgnc_symbol
1      MPPED2
> getBM("hgnc_symbol", c("chromosome_name","start","end"),
list(t.cpd[5,1], t.cpd[5,2], t.cpd[5,2]+50), enseembl)
 hgnc_symbol
1       REEP1
> tmp <- getBM("hgnc_symbol", c("chromosome_name","start","end"),
list(t.cpd[4:5,1], t.cpd[4:5,2], t.cpd[4:5,2]+50), enseembl)
> dim(tmp)
[1] 946   1
> head(tmp)
 hgnc_symbol
1       ZFP91
2        CNTF
3    C11orf76
4     RPLP0P2
5    C11orf64
6      OR4D8P
> grep("REEP1", tmp[,1], value=T)
[1] "REEP1"

The XMLQuery that is sent to the Biomart server looks like this (if Rhoda
Kinsella or Steffen Durinck are following this at all).

Browse[2]> xmlQuery
[1] "<?xml version='1.0' encoding='UTF-8'?><!DOCTYPE Query><Query
virtualSchemaName = 'default' uniqueRows = '1' count = '0'
datasetConfigVersion = '0.6' requestid= \"biomaRt\"> <Dataset name =
'hsapiens_gene_ensembl'><Attribute name = 'hgnc_symbol'/><Filter name =
'chromosome_name' value = '11,2' /><Filter name = 'start' value =
'30576841,86479831' /><Filter name = 'end' value = '30576891,86479881'
/></Dataset></Query>"

Best,

Jim

Regards,
Ramzi
----------------------------------------------------------------
On Mon, Dec 7, 2009 at 4:32 PM, Ramzi TEMANNI <ramzi.temanni at gmail.com
<mailto:ramzi.temanni at gmail.com>> wrote:
   Hi James,
   Thanks for your reply,
   As i have short reads of 50b, I've modified the code accordigly to
   your suggestion:
   gn.m1<-getBM(attributes= c("hgnc_symbol"),
          filters=c("chromosome_name","
   start",*"end"*),
             values=list(t.cpd[1:10,1],as.numeric(t.cpd[1:10,2]),*as.numeric(t.cpd[1:10,2])+50*),
   mart=ensembl)
   But still having more genes than expected:
   hgnc_symbol
   1      UBE2G1
   2      DUSP5P
   3   HIST3H2BA
   4     ZNF847P
   5     ACTBP11
   6     ZC3H11B
   ........
   8488     PPP2R2C
   8489        WFS1
   8490    SNORD73A
   8491     SNORA24
   8492     SNORA26
   Did I miss something  ?
   Thanks for the remark regarding the position, you are right ! I did
   not notice that, have to get back to the bowtie alignment file and
   see what is the reason behind that. I'm using bowtie with the last
   hg19 ref.
   Thanks again for your comment.     Best Regards,
   Ramzi
   ----------------------------------------------------------------
   On Mon, Dec 7, 2009 at 4:20 PM, James W. MacDonald
   <jmacdon at med.umich.edu <mailto:jmacdon at med.umich.edu>> wrote:
       Hi Ramzi,
       Ramzi TEMANNI wrote:
           Hi,
           I want to extract the gene names knowing the chromosome and
           the position for
           each genes:
               t.cpd[1:10,1:2]
                CHR.M1 POS.M1
            [1,] "12"   "140059033"
            [2,] "19"   "164634640"
            [3,] "10"   "32347784"
            [4,] "11"   "30576841"
            [5,] "2"    "86479831"
            [6,] "12"   "237019866"
            [7,] "4"    "76487174"
            [8,] "20"   "136121868"
            [9,] "2"    "6255547"
           [10,] "1"    "67658137"
           i use the following commands:
           library(biomaRt)
           mart = useMart("ensembl")
           ensembl = useDataset("hsapiens_gene_ensembl", mart = mart)
           gn.m1<-getBM(attributes= c("hgnc_symbol"),
                 filters=c("chromosome_name","start"),
                 values=list(t.cpd[1:10,1],t.cpd[1:10,2]), mart=ensembl)
           I'm expecting having a list of 10 genes names, but instead i
           get 8652 genes:
           hgnc_symbol
           1      OR2M1P
           2      OR2L1P
           3   HSD17B7P1
           4     OR14L1P
           5       OR2W5
           6       VN1R5
           ......
           8649        WFS1
           8650    SNORD73A
           8651     SNORA24
           8652     SNORA26
           Did I miss something ?
       Yes. You are giving the start position, but not the end. Without
       explicitly telling the Biomart server where to stop looking for
       genes, where do you think it will stop by default?
       Also, several of your coordinates are nonsensical. For instance,
       chr12 is only 133851859 bases long, chr20 is 63025520 bases
       long, etc.
       Best,
       Jim
           Thanks in advance for your help
           Best Regards,
           Ramzi
           ----------------------------------------------------------------
                  [[alternative HTML version deleted]]
           _______________________________________________
           Bioc-sig-sequencing mailing list
           Bioc-sig-sequencing at r-project.org
           <mailto:Bioc-sig-sequencing at r-project.org>
           https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
       --         James W. MacDonald, M.S.
       Biostatistician
       Douglas Lab
       University of Michigan
       Department of Human Genetics
       5912 Buhl
       1241 E. Catherine St.
       Ann Arbor MI 48109-5618
       734-615-7826
       **********************************************************
       Electronic Mail is not secure, may not be read every day, and
       should not be used for urgent or sensitive issues

-- 
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
**********************************************************
Electronic Mail is not secure, may not be read every day, and should not
be used for urgent or sensitive issues
_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing