[BioC] biomaRt manual

Steffen Durinck durincks at mail.nih.gov
Thu Mar 29 18:03:23 CEST 2007


Hi Weiwei,

There are duplicates because Ensembl maps everything to the transcript 
level.  If you add ensembl_transcript_id to your query, the reason 
becomes easier to see. For example:

getBM(attributes=c("affy_hg_u95a", "entrezgene","ensembl_transcript_id"), filters="affy_hg_u95a", values="32864_at", mart=human)

gives:

  affy_hg_u95a entrezgene ensembl_transcript_id
1     32864_at       6736       ENST00000383070
2     32864_at       6736       ENST00000327563

that is, the same entrezgene ID with different transcript identifiers.  Note 
that if you have questions about the content of the Ensembl database or web 
service, you can also contact them directly at helpdesk at ensembl.org.  The 
biomaRt package only provides an interface between their web services and R, 
so we have little control over the data their web service returns.
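
If you only need one row per probe/gene pair, one option is to collapse the 
repeated rows after the query. The following is a minimal sketch, not a 
biomaRt feature: the data frame t1_example is a hypothetical stand-in built 
by hand from rows 1-2 and 12-14 of your t1 output, so it runs without a 
connection to Ensembl.

```r
# Hypothetical stand-in for a getBM() result (rows copied from the t1 output
# quoted below); a real query would return this from the web service.
t1_example <- data.frame(
  affy_hg_u95a = c("32864_at", "32864_at",
                   "35929_s_at", "35929_s_at", "35929_s_at"),
  entrezgene   = c(6736, 6736, 64591, 64591, NA),
  stringsAsFactors = FALSE
)

# unique() on a data frame keeps the first occurrence of each distinct row,
# which removes the per-transcript repeats.
probe2gene <- unique(t1_example)
print(probe2gene)
```

The NA row for 35929_s_at is still there after this step. My assumption is 
that it comes from a transcript without an Entrez Gene mapping; adding 
ensembl_transcript_id to the real query should show whether that is the case.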

Best,
Steffen

Weiwei Shi wrote:
> Here is another question:
>   
>> length(unique(ids2))
>>     
> [1] 12558
>   
>> length(ids2)
>>     
> [1] 12558
>   
>> head(ids2)
>>     
> [1] "31307_at"   "31308_at"   "31309_r_at" "31310_at"   "31311_at"
> [6] "31312_at"
>   
>> t1 <- getBM(attributes=c("affy_hg_u95a", "entrezgene"), filters="affy_hg_u95a", values=(ids2), mart=human)
>> dim(t1)
>>     
> [1] 26360     2
>   
>> t1[1:20,]
>>     
>    affy_hg_u95a entrezgene
> 1      32864_at       6736
> 2      32864_at       6736
> 3      41214_at       6192
> 4      41214_at       6192
> 5      31534_at       7544
> 6      31534_at       7544
> 7      36367_at      83259
> 8      36367_at      83259
> 9      36367_at      83259
> 10     36367_at      83259
> 11      1199_at         NA
> 12   35929_s_at      64591
> 13   35929_s_at      64591
> 14   35929_s_at         NA
>
> Please look at rows 12-14.
> Why are there so many duplicates? Why are rows 12-14 inconsistent
> with each other?
>
> Thanks to all the "hardworking" people for the previous prompt
> replies. I am in China now, and it should be about 6 am in the US.
>
> Cheers,
>
> Weiwei
>
>
>
> On 3/29/07, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>   
>> On Thursday 29 March 2007 07:28, James W. MacDonald wrote:
>>     
>>> Hi Weiwei,
>>>
>>> Weiwei Shi wrote:
>>>       
>>>> Sorry :) When I was composing the following email, I did not realize
>>>> there were already a couple of replies. I read the manual carefully, but
>>>> I still have some questions like this:
>>>>
>>>> For example,
>>>>
>>>>         
>>>>> getBM(attributes=c("affy_hg_u95a", "entrezgene"), filters="affy_hg_u95a",
>>>>> values=head(ids2), mart=human)
>>>>>           
>>>>   affy_hg_u95a entrezgene
>>>> 1     31308_at         NA
>>>> 2     31310_at       2741
>>>> 3     31312_at       9312
>>>>
>>>>         
>>>>> head(ids2)
>>>>>           
>>>> [1] "31307_at"   "31308_at"   "31309_r_at" "31310_at"   "31311_at"
>>>> [6] "31312_at"
>>>>
>>>>         
>>>>> getBM(attributes=c("affy_hg_u95a", "entrezgene"), filters="affy_hg_u95a",
>>>>> values="31307_at", mart=human)
>>>>>           
>>>> NULL
>>>>
>>>> I am confused by "NULL" and "NA". What is the difference between
>>>> them?
>>>>         
>>> Steffen Durinck will know better, but I believe NULL means that Ensembl
>>> doesn't think that probeset maps to anything (e.g., there is nothing
>>> available), and NA means that there is no Entrez Gene ID for that probeset.
>>>
>>> For instance, if you pull the Entrez Gene ID for 31307_at from the
>>> hgu95aENTREZID environment, it lists 9594, but if you search Entrez Gene
>>> for that ID it says it has been discontinued.
>>>
>>>       
>>>> Another question is how to make >8000 queries faster, although I have
>>>> read some of the previous posts on this.
>>>>         
>> Make sure that you really need to make 8000 queries.  It is much faster to
>> make one or a few large queries than to make many small ones.
>>
>> Sean
>>
>>     
>
>
>   


-- 
Steffen Durinck, Ph.D.

Oncogenomics Section
Pediatric Oncology Branch
National Cancer Institute, National Institutes of Health
URL: http://home.ccr.cancer.gov/oncology/oncogenomics/

Phone: 301-402-8103
Address:
Advanced Technology Center,
8717 Grovemont Circle
Gaithersburg, MD 20877


