[BioC] Error in Data.frame for mogene10stv1cdf and mogene10sttranscriptcluster.db
James W. MacDonald
jmacdon at uw.edu
Mon Jun 30 15:32:45 CEST 2014
On 6/28/2014 7:06 AM, Joel Ma wrote:
> I was trying to make a data.frame of geneIDs, symbols and names for my eset, but the cdf and .db did not match up.
>> geneIDs <- ls(mogene10stv1cdf)
>> geneSymbols <- unlist(as.list(mogene10sttranscriptclusterSYMBOL))
>> geneNames <- unlist(as.list(mogene10sttranscriptclusterGENENAME))
>> genelist <- data.frame(GeneID=geneIDs,GeneSymbol=geneSymbols,GeneName=geneNames,"</a>",sep="")
> Error in data.frame(GeneID = geneIDs, GeneSymbol = geneSymbols, GeneName = geneNames, :
> arguments imply differing number of rows: 34760, 35556, 1
> Length Class Mode
> 34760 character character
> Length Class Mode
> 35556 character character
> What should I do?
First, please note that there is no reason to expect that all of the
probesets for this (or any) array will actually measure a known gene,
with a Gene ID, symbol, and gene name. There are many different controls
on this array (which, being controls, should not be something that you
would want to annotate, even if you could).
Nor should you expect that all of the things being measured by this
array will have a symbol or a gene name. Affy has put all manner of
non-translated content on the Gene ST arrays, which you can find out by
looking at their marketing materials.
Plus, you are using very old methods of accessing data from the
annotation package, which I would not recommend using. The methods for
accessing data were updated for a good reason.
In addition, I wouldn't personally use the affy/makecdfenv pipeline for
analyzing these data. The oligo and xps packages are much better suited.
I would do something like this:
eset <- rma(read.celfiles(list.celfiles()))
eset <- getMainProbes(eset)
annot <- select(hugene10sttranscriptcluster.db, featureNames(eset),
At this point you may get a message informing you that there have been
some one-to-many mappings (I haven't analyzed hugene10st arrays for a
while, so don't remember if this is true or not). This means that some
of the probesets may measure transcript abundance for things that are
annotated with more than one Gene ID, symbol or gene name. This would
have been masked the way you did things before (you would get an NA for
all three instead).
How you deal with this duplication is up to you. I tend to be naive
about it and just take the first of the dups:
annot <- annot[!duplicated(annot),]
Others have argued in the past that this is suboptimal, and that you
should instead randomly choose the annotation to use. Alternatively you
could do something like
annotlst <- split(annot[,-1], annot[,1])
annot <- sapply(annot, function(x) if(nrow(x) == 1) return(x) else
return(apply(x, 2, paste, collapse = " | "))
Which will cause duplicated IDs or symbols to be concatenated with a
pipe symbol. The above is untested; sometimes sapply will return a
transposed matrix, so you might have to fiddle with the code to get what
> [[alternative HTML version deleted]]
> Bioconductor mailing list
> Bioconductor at r-project.org
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
James W. MacDonald, M.S.
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
More information about the Bioconductor