[R] FW: Selecting undefined column of a data frame (was [BioC] read.phenoData vs read.AnnotatedDataFrame)

Fri Aug 3 21:54:45 CEST 2007

Thanks Prof Ripley,

I used double indexing (if I understand the doc correctly)
so my call was

> foo[, "FileName"]

I traced through each line of `[.data.frame`
following the sequence of commands executed
for my call.

In the code section

    if (missing(i)) {
        if (missing(j) && drop && length(x) == 1L)
            return(.subset2(x, 1L))
        y <- if (missing(j))
            x
        else .subset(x, j)
        if (drop && length(y) == 1L)
            return(.subset2(y, 1L)) ## This returns a result before undefined columns check is done.  Is this intended?
        cols <- names(y)
        if (any(is.na(cols)))
            stop("undefined columns selected")
        if (any(duplicated(cols)))
            names(y) <- make.unique(cols)
        nrow <- .row_names_info(x, 2L)
        if (drop && !mdrop && nrow == 1L)
            return(structure(y, class = NULL, row.names = NULL))
        else return(structure(y, class = oldClass(x), row.names = .row_names_info(x,
            0L)))
    }

the function returned at
    if (drop && length(y) == 1L)
        return(.subset2(y, 1L))
before reaching the check on column names.

Shouldn't the check on column names
        cols <- names(y)
        if (any(is.na(cols)))
            stop("undefined columns selected")
occur before
    if (drop && length(y) == 1L)
        return(.subset2(y, 1L))
rather than after?
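
To make the question concrete, here is a minimal sketch of the reordering
I have in mind.  select_cols() is a hypothetical stand-in, not the real
`[.data.frame`: it mimics only the missing(i) branch quoted above, with the
check on names moved ahead of the early return.

```r
## Hypothetical helper, NOT the real `[.data.frame`: it covers only the
## missing(i) case, with the names check done before the drop return.
select_cols <- function(x, j, drop = TRUE) {
    y <- .subset(x, j)                 # list subsetting; unknown names come back as NA
    cols <- names(y)
    if (any(is.na(cols)))
        stop("undefined columns selected")
    if (drop && length(y) == 1L)
        return(.subset2(y, 1L))        # single column: return the bare vector
    attr(y, "row.names") <- .row_names_info(x, 0L)
    class(y) <- oldClass(x)
    y
}

foo <- data.frame(Filename = c("a", "b"))
ok  <- select_cols(foo, "Filename")    # the column itself
err <- tryCatch(select_cols(foo, "FileName"),
                error = function(e) conditionMessage(e))
```

With this ordering the mis-spelled name stops with "undefined columns
selected" instead of falling through to the early return.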

-----Original Message-----
From: Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk]
Sent: Fri 8/3/2007 12:25 PM
To: Steven McKinney
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] FW: Selecting undefined column of a data frame (was [BioC] read.phenoData vs read.AnnotatedDataFrame)

You are reading the wrong part of the code for your argument list:

>  foo["FileName"]
Error in `[.data.frame`(foo, "FileName") : undefined columns selected

[.data.frame is one of the most complex functions in R, and does many 
different things depending on which arguments are supplied.
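
As a rough illustration of how much the result depends on the form of the
call, the sketch below tries four ways of asking for the same mis-spelled
column.  try_msg() is a hypothetical convenience wrapper introduced only
for this example; the matrix-style result is left unstated because it is
the case under discussion and may differ between R versions.

```r
foo <- data.frame(Filename = c("a", "b"))

## Hypothetical wrapper: return the value, or the error message if it fails.
try_msg <- function(expr) tryCatch(expr, error = function(e) conditionMessage(e))

single       <- try_msg(foo["FileName"])    # list-style: "undefined columns selected"
dollar       <- foo$FileName                # NULL, no error
double_br    <- foo[["FileName"]]           # NULL, no error
matrix_style <- try_msg(foo[, "FileName"])  # NULL in R 2.5.1, as reported below
```

Note that the $ and [[ forms return NULL silently for a name that is not
present, so they give no protection against a mis-spelled column name.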

On Fri, 3 Aug 2007, Steven McKinney wrote:

> Hi all,
>
> What are current methods people use in R to identify
> mis-spelled column names when selecting columns
> from a data frame?
>
> Alice Johnstone recently tackled this issue
> (see [BioC] posting below).
>
> Due to a mis-spelled column name ("FileName"
> instead of "Filename") which produced no warning,
> Alice spent a fair amount of time tracking down
> this bug.  With my fumbling fingers I'll be tracking
> down such a bug soon too.
>
> Is there any options() setting, or debug technique
> that will flag data frame column extractions that
> reference a non-existent column?  It seems to me
> that the "[.data.frame" extractor used to throw an
> error if given a mis-spelled variable name, and I
> still see lines of code in "[.data.frame" such as
>
> if (any(is.na(cols)))
>            stop("undefined columns selected")
>
>
>
> In R 2.5.1 a NULL is silently returned.
>
>> foo <- data.frame(Filename = c("a", "b"))
>> foo[, "FileName"]
> NULL
>
> Has something changed so that the code lines
> if (any(is.na(cols)))
>            stop("undefined columns selected")
> in "[.data.frame" no longer work properly (if
> I am understanding the intention properly)?
>
> If not, could  "[.data.frame" check an
> options() variable setting (say
> warn.undefined.colnames) and throw a warning
> if a non-existent column name is referenced?
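
Until something like that exists, one possible workaround is an explicit
check in user code.  A minimal sketch, assuming a hypothetical helper
strict_col() (not part of base R) that extracts one column by exact name
and fails loudly if the name is absent:

```r
## Hypothetical helper (not in base R): extract a single column by exact
## name, stopping with an error instead of returning NULL when it is missing.
strict_col <- function(df, name) {
    stopifnot(is.data.frame(df), is.character(name), length(name) == 1L)
    if (!name %in% names(df))
        stop("undefined column selected: ", name, call. = FALSE)
    df[[name]]
}

foo  <- data.frame(Filename = c("a", "b"))
good <- strict_col(foo, "Filename")
bad  <- tryCatch(strict_col(foo, "FileName"),
                 error = function(e) conditionMessage(e))
```

The %in% test is an exact, case-sensitive comparison, so "FileName" is
rejected even though "Filename" exists.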
>
>
>
>
>> sessionInfo()
> R version 2.5.1 (2007-06-27)
> powerpc-apple-darwin8.9.1
>
> locale:
> en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
>
> attached base packages:
> [1] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"
>
> other attached packages:
>     plotrix         lme4       Matrix      lattice
>     "2.2-3"  "0.99875-4" "0.999375-0"     "0.16-2"
>>
>
>
>
> Steven McKinney
>
> Statistician
> Molecular Oncology and Breast Cancer Program
> British Columbia Cancer Research Centre
>
> email: smckinney +at+ bccrc +dot+ ca
>
> tel: 604-675-8000 x7561
>
> BCCRC
> Molecular Oncology
> 675 West 10th Ave, Floor 4
> Vancouver B.C.
> V5Z 1L3
> Canada
>
>
>
>
> -----Original Message-----
> From: bioconductor-bounces at stat.math.ethz.ch on behalf of Johnstone, Alice
> Sent: Wed 8/1/2007 7:20 PM
> To: bioconductor at stat.math.ethz.ch
> Subject: Re: [BioC] read.phenoData vs read.AnnotatedDataFrame
>
> For interest's sake, I have found out why I wasn't getting my expected
> results when using read.AnnotatedDataFrame.
> It turns out the error was in the ReadAffy command, where I specified
> the filenames to be read from my AnnotatedDataFrame object.  There was a
> typo, with a capital N ($FileName) rather than the lowercase n
> ($Filename) used in my target file... whoops.  This meant the
> filenames argument was ignored without any error message(!), and instead
> of using the information in the AnnotatedDataFrame object (which
> included the filenames, but not in alphabetical order) it read the .cel
> files in alphabetical order from the working directory.  Hence the wrong
> file was given the wrong label (taken from the order of the
> AnnotatedDataFrame object) and my comparisons were confused, without it
> being obvious why or where.
> Our solution: pass the filenames through as.character so that files are
> assigned to targets correctly (after correcting $Filename), now that we
> are using read.AnnotatedDataFrame rather than read.phenoData.
>
> Data<-ReadAffy(filenames=as.character(pData(pd)$Filename),phenoData=pd)
>
> Hurrah!
>
> It may benefit others if, when the filenames argument isn't specified,
> the filenames were read from the phenoData object whenever it includes
> them.
>
> Thanks!
>
> -----Original Message-----
> From: Martin Morgan [mailto:mtmorgan at fhcrc.org]
> Sent: Thursday, 26 July 2007 11:49 a.m.
> To: Johnstone, Alice
> Cc: bioconductor at stat.math.ethz.ch
> Subject: Re: [BioC] read.phenoData vs read.AnnotatedDataFrame
>
> Hi Alice --
>
> "Johnstone, Alice" <Alice.Johnstone at esr.cri.nz> writes:
>
>> Using R 2.5.0 and Bioconductor I have been following code to analyse
>> Affymetrix expression data: 2 treatments vs control.  The original
>> code was run last year and used the read.phenoData command; however,
>> with the newer version I get the warning messages:
>> read.phenoData is deprecated, use read.AnnotatedDataFrame instead
>> The phenoData class is deprecated, use AnnotatedDataFrame (with
>> ExpressionSet) instead
>>
>> I use the read.AnnotatedDataFrame command, but when it comes to the
>> end of the analysis the comparison of the treatment to the controls
>> gets mixed up compared to what you get using the original
>> read.phenoData, i.e. it looks like the 3 groups get labelled wrong and so
>> the comparisons are different (but they can still be matched up).
>> My questions are,
>> 1) do you need to set up your target file differently when using
>> read.AnnotatedDataFrame - what is the standard format?
>
> I can't quite tell where things are going wrong for you, so it would
> help if you can narrow down where the problem occurs.  I think
> read.AnnotatedDataFrame should be comparable to read.phenoData. Does
>
>> pData(pd)
>
> look right? What about
>
>> pData(Data)
>
> and
>
>> pData(eset.rma)
>
> ? It's not important but pData(pd)$Target is the same as pd$Target.
> Since the analysis is on eset.rma, it probably makes sense to use the
> pData from there to construct your design matrix
>
>> targs<-factor(eset.rma$Target)
>> design<-model.matrix(~0+targs)
>> colnames(design)<-levels(targs)
>
> Does design look right?
>
>> I have three columns sample, filename and target.
>> 2) do you need to use a different model matrix to what I have?
>> 3) do you use a different command for making the contrasts?
>
> Depends on the question! If you're performing the same analysis as last
> year, then the model matrix and contrasts have to be the same!
>
>> I have included my code below if that is of any assistance.
>> Many Thanks!
>> Alice
>>
>>
>>
>> ##Read data
>> pd<-read.AnnotatedDataFrame("targets.txt",header=T,row.name="sample")
>> Data<-ReadAffy(filenames=pData(pd)$FileName,phenoData=pd)
>> ##normalisation
>> eset.rma<-rma(Data)
>> ##analysis
>> targs<-factor(pData(pd)$Target)
>> design<-model.matrix(~0+targs)
>> colnames(design)<-levels(targs)
>> fit<-lmFit(eset.rma,design)
>> cont.wt<-makeContrasts("treatment1-control","treatment2-control",levels=design)
>> fit2<-contrasts.fit(fit,cont.wt)
>> fit2.eb<-eBayes(fit2)
>> testconts<-classifyTestsF(fit2.eb,p.value=0.01)
>> topTable(fit2.eb,coef=2,n=300)
>> topTable(fit2.eb,coef=1,n=300)
>>
>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> --
> Martin Morgan
> Bioconductor / Computational Biology
> http://bioconductor.org
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595