[R] FW: Selecting undefined column of a data frame (was [BioC] read.phenoData vs read.AnnotatedDataFrame)

Prof Brian Ripley ripley at stats.ox.ac.uk
Fri Aug 3 22:05:15 CEST 2007


I've since seen your followup, so a more detailed explanation may help.
The path through the code for your argument list does not go where you 
quoted, and there is a reason for it.

Generally when you extract in R and ask for a non-existent index you get 
NA or NULL as the result (and no warning), e.g.

> y <- list(x=1, y=2)
> y[["z"]]
NULL
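
The NA case is not shown above. A minimal sketch, using an illustrative
named vector x that is not part of the original thread, suggests the same
pattern for atomic vectors and for both list extractors:

```r
# Named atomic vector: asking for a missing name yields NA, with no warning.
x <- c(a = 1, b = 2)
x["z"]

# List: asking for a missing name yields NULL under both [[ and $.
y <- list(x = 1, y = 2)
y[["z"]]
y$z
```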

Because data frames 'must' have (column) names, they are a partial 
exception and when the result is a data frame you get an error if it would 
contain undefined columns.
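
As a sketch of that exception, the single-index form foo["FileName"] asks
for a data frame result, so the unknown name is an error.  The tryCatch()
wrapper here is only an illustrative way to capture the message without
stopping the session:

```r
foo <- data.frame(Filename = c("a", "b"))

# The result would be a data frame with an undefined column, so this fails;
# capture the error message instead of halting.
res <- tryCatch(foo["FileName"], error = function(e) conditionMessage(e))
res

# The correctly spelled name returns a one-column data frame.
ok <- foo["Filename"]
is.data.frame(ok)
```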

But in the case of foo[, "FileName"], the result is a single column and so 
will not have a name: there seems no reason for it to behave differently from

> foo[["FileName"]]
NULL
> foo$FileName
NULL

which similarly select a single column.  At one time they were different 
in R, for no documented reason.
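
Since [[ and $ return NULL for an unknown name, one way to make a
misspelling fail loudly is to check the name against names() before
extracting.  The following is only a sketch under that assumption:
get_column is a hypothetical helper, not part of base R or Bioconductor.

```r
# Hypothetical helper (not in base R): extract one column by exact name,
# stopping with an informative error if the name is not present.
get_column <- function(df, name) {
  stopifnot(is.data.frame(df), is.character(name), length(name) == 1L)
  if (!name %in% names(df)) {
    stop("column '", name, "' not found; available columns: ",
         paste(names(df), collapse = ", "), call. = FALSE)
  }
  df[[name]]
}

foo <- data.frame(Filename = c("a", "b"))

# Correct spelling: returns the column's values.
get_column(foo, "Filename")

# Misspelling: raises an error, captured here so the message can be inspected.
msg <- tryCatch(get_column(foo, "FileName"),
                error = function(e) conditionMessage(e))
msg
```

With a check of this kind, a call such as ReadAffy(filenames = ...) would
presumably stop at the misspelled name rather than silently receive NULL.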


On Fri, 3 Aug 2007, Prof Brian Ripley wrote:

> You are reading the wrong part of the code for your argument list:
>
>>  foo["FileName"]
> Error in `[.data.frame`(foo, "FileName") : undefined columns selected
>
> [.data.frame is one of the most complex functions in R, and does many 
> different things depending on which arguments are supplied.
>
>
> On Fri, 3 Aug 2007, Steven McKinney wrote:
>
>> Hi all,
>> 
>> What are current methods people use in R to identify
>> mis-spelled column names when selecting columns
>> from a data frame?
>> 
>> Alice Johnstone recently tackled this issue
>> (see [BioC] posting below).
>> 
>> Due to a mis-spelled column name ("FileName"
>> instead of "Filename") which produced no warning,
>> Alice spent a fair amount of time tracking down
>> this bug.  With my fumbling fingers I'll be tracking
>> down such a bug soon too.
>> 
>> Is there any options() setting, or debug technique
>> that will flag data frame column extractions that
>> reference a non-existent column?  It seems to me
>> that the "[.data.frame" extractor used to throw an
>> error if given a mis-spelled variable name, and I
>> still see lines of code in "[.data.frame" such as
>> 
>> if (any(is.na(cols)))
>>            stop("undefined columns selected")
>> 
>> 
>> 
>> In R 2.5.1 a NULL is silently returned.
>> 
>>> foo <- data.frame(Filename = c("a", "b"))
>>> foo[, "FileName"]
>> NULL
>> 
>> Has something changed so that the code lines
>> if (any(is.na(cols)))
>>            stop("undefined columns selected")
>> in "[.data.frame" no longer work properly (if
>> I am understanding the intention properly)?
>> 
>> If not, could  "[.data.frame" check an
>> options() variable setting (say
>> warn.undefined.colnames) and throw a warning
>> if a non-existent column name is referenced?
>> 
>> 
>> 
>> 
>>> sessionInfo()
>> R version 2.5.1 (2007-06-27)
>> powerpc-apple-darwin8.9.1
>> 
>> locale:
>> en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
>> 
>> attached base packages:
>> [1] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"
>> 
>> other attached packages:
>>     plotrix         lme4       Matrix      lattice
>>     "2.2-3"  "0.99875-4" "0.999375-0"     "0.16-2"
>>> 
>> 
>> 
>> 
>> Steven McKinney
>> 
>> Statistician
>> Molecular Oncology and Breast Cancer Program
>> British Columbia Cancer Research Centre
>> 
>> email: smckinney +at+ bccrc +dot+ ca
>> 
>> tel: 604-675-8000 x7561
>> 
>> BCCRC
>> Molecular Oncology
>> 675 West 10th Ave, Floor 4
>> Vancouver B.C.
>> V5Z 1L3
>> Canada
>> 
>> 
>> 
>> 
>> -----Original Message-----
>> From: bioconductor-bounces at stat.math.ethz.ch on behalf of Johnstone, Alice
>> Sent: Wed 8/1/2007 7:20 PM
>> To: bioconductor at stat.math.ethz.ch
>> Subject: Re: [BioC] read.phenoData vs read.AnnotatedDataFrame
>> 
>> For interest sake, I have found out why I wasn't getting my expected
>> results when using read.AnnotatedDataFrame
>> Turns out the error was made in the ReadAffy command, where I specified
>> the filenames to be read from my AnnotatedDataFrame object.  There was a
>> typo error with a capital N ($FileName) rather than lowercase n
>> ($Filename) as in my target file..whoops.  However this meant the
>> filename argument was ignored without the error message(!) and instead
>> of using the information in the AnnotatedDataFrame object (which
>> included filenames, but not alphabetically) it read the .cel files in
>> alphabetical order from the working directory - hence the wrong file was
>> given the wrong label (given by the order of Annotated object) and my
>> comparisons were confused without being obvious as to why or where.
>> Our solution: specify that filename is as.character so assignment of
>> file to target is correct(after correcting $Filename) now that using
>> read.AnnotatedDataFrame rather than read.phenoData.
>> 
>> Data<-ReadAffy(filenames=as.character(pData(pd)$Filename),phenoData=pd)
>> 
>> Hurrah!
>> 
>> It may be beneficial to others, that if the filename argument isn't
>> specified, that filenames are read from the phenoData object if included
>> here.
>> 
>> Thanks!
>> 
>> -----Original Message-----
>> From: Martin Morgan [mailto:mtmorgan at fhcrc.org]
>> Sent: Thursday, 26 July 2007 11:49 a.m.
>> To: Johnstone, Alice
>> Cc: bioconductor at stat.math.ethz.ch
>> Subject: Re: [BioC] read.phenoData vs read.AnnotatedDataFrame
>> 
>> Hi Alice --
>> 
>> "Johnstone, Alice" <Alice.Johnstone at esr.cri.nz> writes:
>> 
>>> Using R2.5.0 and Bioconductor I have been following code to analyse
>>> Affymetrix expression data: 2 treatments vs control.  The original
>>> code was run last year and used the read.phenoData command, however
>>> with the newer version I get the error message Warning messages:
>>> read.phenoData is deprecated, use read.AnnotatedDataFrame instead The
>>> phenoData class is deprecated, use AnnotatedDataFrame (with
>>> ExpressionSet) instead
>>> 
>>> I use the read.AnnotatedDataFrame command, but when it comes to the
>>> end of the analysis the comparison of the treatment to the controls
>>> gets mixed up compared to what you get using the original
>>> read.phenoData ie it looks like the 3 groups get labelled wrong and so
>>> the comparisons are different (but they can still be matched up).
>>> My questions are,
>>> 1) do you need to set up your target file differently when using
>>> read.AnnotatedDataFrame - what is the standard format?
>> 
>> I can't quite tell where things are going wrong for you, so it would
>> help if you can narrow down where the problem occurs.  I think
>> read.AnnotatedDataFrame should be comparable to read.phenoData. Does
>> 
>>> pData(pd)
>> 
>> look right? What about
>> 
>>> pData(Data)
>> 
>> and
>> 
>>> pData(eset.rma)
>> 
>> ? It's not important but pData(pd)$Target is the same as pd$Target.
>> Since the analysis is on eset.rma, it probably makes sense to use the
>> pData from there to construct your design matrix
>> 
>>> targs<-factor(eset.rma$Target)
>>> design<-model.matrix(~0+targs)
>>> colnames(design)<-levels(targs)
>> 
>> Does design look right?
>> 
>>> I have three columns sample, filename and target.
>>> 2) do you need to use a different model matrix to what I have?
>>> 3) do you use a different command for making the contrasts?
>> 
>> Depends on the question! If you're performing the same analysis as last
>> year, then the model matrix and contrasts have to be the same!
>> 
>>> I have included my code below if that is of any assistance.
>>> Many Thanks!
>>> Alice
>>> 
>>> 
>>> 
>>> ##Read data
>>> pd<-read.AnnotatedDataFrame("targets.txt",header=T,row.name="sample")
>>> Data<-ReadAffy(filenames=pData(pd)$FileName,phenoData=pd)
>>> ##normalisation
>>> eset.rma<-rma(Data)
>>> ##analysis
>>> targs<-factor(pData(pd)$Target)
>>> design<-model.matrix(~0+targs)
>>> colnames(design)<-levels(targs)
>>> fit<-lmFit(eset.rma,design)
>>> cont.wt<-makeContrasts("treatment1-control","treatment2-control",
>>>                        levels=design)
>>> fit2<-contrasts.fit(fit,cont.wt)
>>> fit2.eb<-eBayes(fit2)
>>> testconts<-classifyTestsF(fit2.eb,p.value=0.01)
>>> topTable(fit2.eb,coef=2,n=300)
>>> topTable(fit2.eb,coef=1,n=300)
>>> 
>>>
>>> 
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> 
>> --
>> Martin Morgan
>> Bioconductor / Computational Biology
>> http://bioconductor.org
>> 
>> ______________________________________________
>> R-help at stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide 
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>> 
>
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


