[BioC] matching Entrez-IDs to Affy probesets using biomaRt

Marc Carlson mcarlson at fhcrc.org
Fri Mar 14 21:48:56 CET 2014


Hi Naomi,

I don't have an answer for what biomaRt is doing here (although I bet 
that they will have some kind of explanation).  But if you just need to 
do some quick annotation, there is also a Bioconductor package for that 
platform that you can use, called 'rat2302.db':

library(rat2302.db)
length(keys(rat2302.db, keytype='PROBEID'))

This shows that it has 31099 probeset ids, the same number as on your array.

Then to annotate some probes you could do it like this:

probes <- head(keys(rat2302.db, keytype='PROBEID'))
select(rat2302.db, keys=probes, columns=c('SYMBOL','GENENAME'), 
keytype='PROBEID')
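
Since your question was really about Entrez IDs, the same select() call 
can be asked for the 'ENTREZID' column, and the result can be tallied to 
see how many probesets have no ID or more than one.  What follows is only 
a rough sketch: summarize_mapping() is a hypothetical helper I am making 
up here (it is not part of AnnotationDbi or biomaRt), and I am assuming 
the table has one row per probeset/ID pair, with NA for probesets that 
have no match.

```r
# Hypothetical helper (not from any package): summarise a two-column
# probeset-to-gene table such as one returned by select() or getBM().
summarize_mapping <- function(map, probe_col, id_col) {
  # Keep each distinct probeset/ID pair once and drop rows with no ID
  pairs <- unique(map[!is.na(map[[id_col]]), c(probe_col, id_col)])
  ids_per_probe <- table(as.character(pairs[[probe_col]]))
  c(rows            = nrow(map),
    probes          = length(unique(map[[probe_col]])),
    probes_with_id  = length(ids_per_probe),
    probes_multi_id = sum(ids_per_probe > 1),
    unique_ids      = length(unique(pairs[[id_col]])))
}

# Toy table: p1 matches two IDs, p3 matches none
toy <- data.frame(PROBEID  = c('p1', 'p1', 'p2', 'p3'),
                  ENTREZID = c('10', '11', '10', NA),
                  stringsAsFactors = FALSE)
toy_summary <- summarize_mapping(toy, 'PROBEID', 'ENTREZID')
print(toy_summary)

# The same tally on the real chip package, if it is installed
if (requireNamespace('rat2302.db', quietly = TRUE)) {
  library(rat2302.db)
  map <- select(rat2302.db, keys = keys(rat2302.db, keytype = 'PROBEID'),
                columns = 'ENTREZID', keytype = 'PROBEID')
  print(summarize_mapping(map, 'PROBEID', 'ENTREZID'))
}
```

The 'rows' count will be larger than the 'probes' count whenever a 
probeset matches several IDs, which is the kind of discrepancy you were 
seeing between dim() and length(unique()).  The same helper should work 
on your matchFeature table if you pass 'affy_rat230_2' and 'entrezgene' 
as the column names.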


And just in case you are currently only acclimated to biomaRt, you can 
learn more about how to use this package here:

http://www.bioconductor.org/packages/devel/bioc/vignettes/AnnotationDbi/inst/doc/IntroToAnnotationPackages.pdf



Marc



On 03/13/2014 02:01 PM, Naomi Altman wrote:
> After my premature posting yesterday, I am a bit hesitant to ask, but I am
> puzzled by what I am getting from biomaRt.  (To avoid clutter, I added
> the sessionInfo at the end of the message.)
>
> I used ReadAffy() to read in a rat dataset and called it CELdata.
>
> CELdata
> AffyBatch object
> size of arrays=834x834 features (19 kb)
> cdf=Rat230_2 (31099 affyids)
> number of samples=8
> number of genes=31099
> annotation=rat2302
> notes=
>
> features=featureNames(CELdata)
>> length(features)
> [1] 31099
>> sum(is.na(features))
> [1] 0
>
> I use features to query biomaRt for the Entrez-ids.  I got back only 18882 rows (and actually fewer probesets, because some probesets are matched to 2 Entrez-ids).  On the other hand, some of the Affy-ids that were returned did not match anything, so I am not sure why they were returned.
>
> matchFeature=getBM(attributes=c('affy_rat230_2','entrezgene'), filters ='affy_rat230_2', values = features, mart = ensembl)
>> dim(matchFeature)
> [1] 18882     2
>> sum(!is.na(matchFeature$affy_rat230_2))
> [1] 18882
>> sum(!is.na(matchFeature$entrezgene))
> [1] 17814
>
>
> I then use the non-missing Entrez-ids to query biomaRt for the Affy-ids.  I got back only 18249 rows (presumably because some Entrez-ids are matched to 2 probesets).  Nothing is missing.
>
>
>
> matchEntrez=getBM(attributes=c('affy_rat230_2','entrezgene'), filters ='entrezgene', values = matchFeature[!is.na(matchFeature[,2]),2], mart = ensembl)
>
>> dim(matchEntrez)
> [1] 18249     2
>> sum(!is.na(matchEntrez[,1]))
> [1] 18249
>> sum(!is.na(matchEntrez[,2]))
> [1] 18249
>
>
> I am pretty sure that the discrepancies in the counts have to do with
> how getBM is handling multiple matches.
>
> length(unique(matchFeature[,1]))
> [1] 16851
>> length(unique(matchEntrez[,1]))
> [1] 16143
>> length(unique(matchFeature[,2]))
> [1] 13738
>> length(unique(matchEntrez[,2]))
> [1] 13737
>> length(unique(matchFeature[!is.na(matchFeature[,2]),1]))
> [1] 16142
>
>
>
> In any case, I seem to be missing about 13000 probesets.  Surely there
> cannot be that many probesets on the array with no Entrez-id?
>
> Thanks for any help you can provide.
>
> Naomi Altman
>
>
>> sessionInfo()
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] rat2302cdf_2.13.0    hgu95av2cdf_2.13.0   AnnotationDbi_1.24.0 biomaRt_2.18.0
> [5] edgeR_3.4.2          limma_3.18.13        affy_1.40.0          Biobase_2.22.0
> [9] BiocGenerics_0.8.0
>
> loaded via a namespace (and not attached):
>    [1] affyio_1.30.0         BiocInstaller_1.12.0  DBI_0.2-7             IRanges_1.20.7
>    [5] preprocessCore_1.24.0 RCurl_1.95-4.1        RSQLite_0.11.4        stats4_3.0.2
>    [9] tools_3.0.2           XML_3.98-1.1          zlibbioc_1.8.0
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


