[BioC] Error in Data.frame for mogene10stv1cdf and mogene10sttranscriptcluster.db

Mon Jun 30 15:32:45 CEST 2014

Hi Joel,

On 6/28/2014 7:06 AM, Joel Ma wrote:
> Hi
>
> I was trying to make a data.frame of geneIDs, symbols and names for my eset, but the cdf and .db did not match up.
>
>> library(mogene10sttranscriptcluster.db)
>> library(mogene10stv1cdf)
>> geneIDs <- ls(mogene10stv1cdf)
>> geneSymbols <- unlist(as.list(mogene10sttranscriptclusterSYMBOL))
>> geneNames <- unlist(as.list(mogene10sttranscriptclusterGENENAME))
>> genelist <- data.frame(GeneID=geneIDs,GeneSymbol=geneSymbols,GeneName=geneNames,"</a>",sep="")
>
> Error in data.frame(GeneID = geneIDs, GeneSymbol = geneSymbols, GeneName = geneNames,  :
>    arguments imply differing number of rows: 34760, 35556, 1
>
>> summary(geneIDs)
>     Length     Class      Mode
>      34760 character character
>> summary(geneNames)
>     Length     Class      Mode
>      35556 character character
>
> What should I do?

First, please note that there is no reason to expect that all of the 
probesets for this (or any) array will actually measure a known gene, 
with a Gene ID, symbol, and gene name. There are many different controls 
on this array (which, being controls, should not be something that you 
would want to annotate, even if you could).

Nor should you expect that all of the things being measured by this 
array will have a symbol or a gene name. Affy has put all manner of 
non-translated content on the Gene ST arrays, which you can find out by 
looking at their marketing materials.

Plus, you are using very old methods of accessing data from the 
annotation package, which I would not recommend using. The methods for 
accessing data were updated for a good reason.

In addition, I wouldn't personally use the affy/makecdfenv pipeline for 
analyzing these data. The oligo and xps packages are much better suited. 
I would do something like this:

library(oligo)
eset <- rma(read.celfiles(list.celfiles()))
<shameless plug>
library(affycoretools)
eset <- getMainProbes(eset)
</shameless plug>
annot <- select(hugene10sttranscriptcluster.db, featureNames(eset), 
c("ENTREZID","SYMBOL","GENENAME"))

At this point you may get a message informing you that there have been 
some one-to-many mappings (I haven't analyzed hugene10st arrays for a 
while, so don't remember if this is true or not). This means that some 
of the probesets may measure transcript abundance for things that are 
annotated with more than one Gene ID, symbol or gene name. This would 
have been masked the way you did things before (you would get an NA for 
all three instead).

How you deal with this duplication is up to you. I tend to be naive 
about it and just take the first of the dups:

annot <- annot[!duplicated(annot),]

Others have argued in the past that this is suboptimal, and that you 
should instead randomly choose the annotation to use. Alternatively you 
could do something like

annotlst <- split(annot[,-1], annot[,1])

annot <- sapply(annot, function(x) if(nrow(x) == 1) return(x) else 
return(apply(x, 2, paste, collapse = " | "))

Which will cause duplicated IDs or symbols to be concatenated with a 
pipe symbol. The above is untested; sometimes sapply will return a 
transposed matrix, so you might have to fiddle with the code to get what 
you want.

Best,

Jim

>
> Cheers
>
> Joel
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099