[BioC] BiomaRt return value

Sun Nov 22 23:12:58 CET 2009

Hi Tony

thanks for these good ideas. Both of these you could implement in a 
small wrapper function around getBM. Once you find that this is a 
stable, generally useful function, we'd be happy to accept your patch 
for the biomaRt package!
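
A minimal sketch of such a wrapper, not part of biomaRt: the names pad_with_na and getBM_padded are hypothetical, and the code assumes the filter name is also a valid attribute name (true for "ensembl_peptide_id", but not for every filter). It always requests the filter column and then left-joins the result onto the input IDs, so IDs without a hit show up as a row of NA values.

```r
# Hypothetical helper (not a biomaRt function): pad a getBM()-style result
# so that every input ID has at least one row.
pad_with_na <- function(result, ids, id_col) {
  # One-column data frame holding the query IDs, plus their input order
  query <- data.frame(ids, stringsAsFactors = FALSE)
  names(query) <- id_col
  query$.order <- seq_along(ids)
  # Left join: IDs without a hit keep a row whose other columns are NA
  out <- merge(query, result, by = id_col, all.x = TRUE, sort = FALSE)
  # Restore the input order and drop the bookkeeping column
  out <- out[order(out$.order), , drop = FALSE]
  out$.order <- NULL
  rownames(out) <- NULL
  out
}

# Hypothetical wrapper around getBM(); needs biomaRt and a network connection.
# Assumption: the filter name can also be requested as an attribute.
getBM_padded <- function(attributes, filters, values, mart) {
  attrs <- union(filters, attributes)
  res <- biomaRt::getBM(attributes = attrs, filters = filters,
                        values = values, mart = mart)
  pad_with_na(res, unique(values), filters)
}
```

With the objects from the quoted example below, a call might look like getBM_padded(c("entrezgene", "ensembl_gene_id"), "ensembl_peptide_id", emblID, ensembl). The padding step is kept separate from the query so that it can be checked without contacting the BioMart server.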

Btw, ENSP00000045065 is a valid protein sequence ID with many hits for 
it in Google, and indeed in the search box at http://www.ebi.ac.uk. The 
fact that the hsapiens_gene_ensembl mart does not know a mapping of it 
to an extant gene name could have all sorts of reasons, historical or 
scientific, which you could explore at the EBI website.
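
To see at a glance which query IDs came back without any mapping, a comparison of the input against the returned ID column is enough. This is only a sketch: unmapped_ids is a hypothetical helper name, and the small data frame stands in for a real getBM() result (its "mapped" column is placeholder data, not real annotation).

```r
# Hypothetical helper: which query IDs are absent from a getBM()-style result?
unmapped_ids <- function(result, ids, id_col) {
  setdiff(ids, result[[id_col]])
}

# Stand-in for a getBM() result; the 'mapped' column is placeholder data
mock_result <- data.frame(ensembl_peptide_id = "ENSP00000158762",
                          mapped = TRUE,
                          stringsAsFactors = FALSE)
query <- c("ENSP00000045065", "ENSP00000158762")

missing <- unmapped_ids(mock_result, query, "ensembl_peptide_id")
print(missing)
```

IDs listed this way are the ones worth looking up by hand at the EBI website.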

	Best wishes
	Wolfgang

Tony Chiang wrote:
> Hi Steffen, Sean, Wolfgang,
> 
> I have a question about the return value of the getBM() function. It is a
> data frame, and in the examples that I have seen, if we want to map from
> Ensembl IDs to Entrez Gene IDs, we still have to request the Ensembl IDs as
> an attribute as well, so that we know what has mapped to what. Example code
> follows in case my explanation is not clear:
> 
> ################
> library(biomaRt)
> ensembl = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
> filters = listFilters(ensembl)
> attributes = listAttributes(ensembl)
> ## Here are my IDs from STRING (each is "9606." plus an Ensembl peptide ID)
> test = c("9606.ENSP00000045065", "9606.ENSP00000158762",
>          "9606.ENSP00000174653", "9606.ENSP00000202967",
>          "9606.ENSP00000204517", "9606.ENSP00000212015",
>          "9606.ENSP00000220616", "9606.ENSP00000222008",
>          "9606.ENSP00000222390", "9606.ENSP00000223051")
> ## Strip the "9606." prefix to get the bare Ensembl peptide IDs
> emblID = sapply(strsplit(test, "\\."), function(x) x[2])
> ## And the code I am using for the mapping is:
> getBM(attributes = c("ensembl_peptide_id", "entrezgene", "ensembl_gene_id",
>                      "hgnc_automatic_gene_name"),
>       filters = "ensembl_peptide_id", values = emblID, mart = ensembl)
> ##################
> 
> So I guess I have two questions. First, would it be a good idea to always
> return the input values in the output data frame, so that we would not have
> to request the redundant attribute ("ensembl_peptide_id" in my example)?
> Second, if you run the code, you will see that ENSP00000045065 did not map
> at all, so I assume that it is not a valid ensembl_peptide_id (this is a bit
> strange, since I am using Ensembl IDs). Is there some way to make that more
> transparent, maybe a row of NA values? I realize that these are not terrible
> things to work around, but would it not make sense to have this? If not,
> please let me know.
> 
> Cheers,
> --Tony
> 
>> sessionInfo()
> R version 2.10.0 Patched (2009-10-27 r50222)
> x86_64-apple-darwin9.8.0
> 
> locale:
> [1] en_US.utf-8/en_US.utf-8/C/C/en_US.utf-8/en_US.utf-8
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] biomaRt_2.2.0
> 
> loaded via a namespace (and not attached):
> [1] RCurl_1.2-1 XML_2.6-0
> 

--
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber/contact