[R] two questions for R beginners

John Sorkin jsorkin at grecc.umaryland.edu
Wed Mar 3 21:10:19 CET 2010


Bill,
The points you make are well taken; one needs to know when to stop. 

I would suggest standardizing the methods used to refer to elements of a matrix and a dataframe and going no further. Why do I say this? A beginner, and even a more experienced R user, probably envisions a dataframe and a matrix as having the same structure, but not the same contents. Both appear to be multi-dimensional structures that can store data, albeit data of different types. A matrix stores values of one type, typically numerical values; a dataframe stores data of mixed types. This being the case, it makes sense to assume that 
A%*%B will work when A and B are matrices, 
but C%*% D will not work when C and D are dataframes. 
This is quite logical and intuitive. It is an extension of the truism that one can perform the arithmetic operation 2*3, but can't perform the operation "Bill"*"John" (I use quotes to indicate that the names are proper names and not variable names). Although one can reasonably expect that there are certain operations that one can perform on matrices but not on dataframes (and conversely), the apparent similarity in structure of the two objects makes one assume (incorrectly at this time) that the syntax used to access elements of a matrix and a dataframe should be the same. I submit that having similar syntax for accessing elements of the two structures will help users learn R. It will not cause them to assume that one can perform exactly the same operations on the two structures.
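
As a rough illustration of the point above, here is a minimal sketch in R. The objects m and d and the column names "a" and "b" are made up for this example. It shows which subscripting forms are already shared by a matrix and a dataframe, and where the two diverge.

```r
# Hypothetical example objects: a 2x2 integer matrix and a data frame
# holding the same values under the same column names.
m <- matrix(1:4, nrow = 2, dimnames = list(NULL, c("a", "b")))
d <- as.data.frame(m)

# Two-index subscripting is already shared by both structures.
m[2, "b"]   # element in row 2, column "b" of the matrix
d[2, "b"]   # same element from the data frame
m[, "a"]    # column "a" of the matrix, as a vector
d[, "a"]    # column "a" of the data frame, as a vector

# '$' works on the data frame but signals an error on the matrix.
d$a
dollar_on_matrix <- tryCatch(m$a, error = function(e) conditionMessage(e))

# Matrix multiplication works on matrices but signals an error on data frames.
prod_m <- m %*% m
mult_on_df <- tryCatch(d %*% d, error = function(e) conditionMessage(e))
```

The two tryCatch() calls capture the error messages as character strings, so the script runs to the end instead of stopping at the first unsupported operation.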

I apologize to other members of the listserver for the length of this subthread. It appears that I have lost the argument, and have not convinced those who would need to make the changes to allow matrices and dataframes to have similar syntax for addressing elements of the respective structures. I do not expect I will be adding any additional comments to this thread, but will continue to follow contributions other people make. Perhaps I will learn that I am not the only person who feels that the syntax should be consistent, but given what I have read so far, I doubt it. I thank everyone who has contributed to the discussion.
John

John David Sorkin M.D., Ph.D.
Chief, Biostatistics and Informatics
University of Maryland School of Medicine Division of Gerontology
Baltimore VA Medical Center
10 North Greene Street
GRECC (BT/18/GR)
Baltimore, MD 21201-1524
(Phone) 410-605-7119
(Fax) 410-605-7913 (Please call phone number above prior to faxing)

>>> "William Dunlap" <wdunlap at tibco.com> 3/3/2010 1:15 PM >>>
If R made
   matrix$columnName
mean the same as
   matrix[, "columnName"]
(a vector) so matrices looked more like data.frames,
would we also want the following to work
as they do with data.frames?
   with(matrix, log(columnName)) # log of that column as a vector
   matrix["columnName"] # 1-column matrix
   matrix[["columnName"]] # vector equivalent of that 1-column matrix 
   lm(responseColumn~predictorColumn, data=matrix)
   eval(quote(columnName), envir=matrix)
The last 2 bump into the rule allowing envir to be
a frame number (a 1x1 matrix is currently taken
as a frame number).
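
For reference, a small sketch of how the constructs listed above behave today. The objects df and mat and the columns x and y are invented for illustration, and the matrix results reflect current behaviour only, not the proposal.

```r
# Hypothetical data frame and a matrix holding the same numbers.
df <- data.frame(x = c(1, 2, 4, 8), y = c(1.1, 1.9, 4.2, 7.8))
mat <- as.matrix(df)

# The data-frame idioms:
with(df, log(x))                # log of column x, as a vector
df["x"]                         # one-column data frame
df[["x"]]                       # column x as a vector
fit <- lm(y ~ x, data = df)     # x and y are looked up in df
eval(quote(x), envir = df)      # column x, found by name in df

# The same subscripts on the matrix:
mat[, "x"]                      # the usual way to get column x
mat["x"]                        # NA: no element of the vector is named "x"
res_sub <- tryCatch(mat[["x"]], error = function(e) "error")

# A numeric 'envir' is interpreted as a frame number, so the matrix
# is never searched for a column called x.
res_eval <- tryCatch(eval(quote(x), envir = mat), error = function(e) "error")
```

Here res_sub and res_eval both end up holding the string "error", because the matrix forms signal an error rather than returning a column.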

Perhaps the print methods for data.frame and matrix
should announce the class of the object being printed.
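
A sketch of what that could look like from user code, without touching the print methods themselves. The helper print_with_class() is hypothetical and not part of R; class(), is.matrix() and is.data.frame() are the existing ways to check.

```r
# Hypothetical example objects that print almost identically.
m <- matrix(1:4, nrow = 2, dimnames = list(NULL, c("a", "b")))
d <- as.data.frame(m)

# Existing ways to find out what kind of object one has.
class(d)
is.matrix(m)
is.data.frame(m)

# Hypothetical helper (not part of R): print the class before the object.
print_with_class <- function(x) {
  cat("<", paste(class(x), collapse = "/"), ">\n", sep = "")
  print(x)
  invisible(x)
}

print_with_class(m)
print_with_class(d)
```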

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com  

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Patrick Burns
> Sent: Wednesday, March 03, 2010 2:44 AM
> To: r-help at r-project.org 
> Subject: Re: [R] two questions for R beginners
> 
> I think Duncan's example of a list that is
> a matrix is a compelling argument not to do
> the change.
> 
> A matrix that is a list with both names and
> dimnames *is* probably rare (but certainly
> imaginable).  A matrix that is a list is not
> so rare, and the proposed double meaning of
> '$' would certainly be confusing in that case.
> 
> Pat
> 
> 
> On 02/03/2010 17:55, Duncan Murdoch wrote:
> > On 02/03/2010 11:53 AM, William Dunlap wrote:
> >> > -----Original Message-----
> >> > From: r-help-bounces at r-project.org
> >> > [mailto:r-help-bounces at r-project.org] On Behalf Of John Sorkin
> >> > Sent: Tuesday, March 02, 2010 3:46 AM
> >> > To: Karl Ove Hufthammer; r-help at stat.math.ethz.ch
> >> > Subject: Re: [R] two questions for R beginners
> >> >
> >> > Please take what follows not as an ad hominem statement, but
> >> > rather as an attempt to improve what is already an excellent
> >> > program, that has been built as a result of many, many hours of
> >> > dedicated work by many, many unpaid, unsung volunteers.
> >> >
> >> > It troubles me a bit that when a confusing aspect of R is
> >> > pointed out the response is not to try to improve the language so as
> >> > to avoid the confusion, but rather to state that the confusion is
> >> > inherent in the language. I understand that to make changes that
> >> > would avoid the confusing aspect of the language that has been
> >> > discussed in this thread would take time and effort by an R wizard
> >> > (which I am not), time and effort that would not be compensated in
> >> > the traditional sense. This does not mean that we should not
> >> > acknowledge the confusion. If we want R to be the de facto lingua
> >> > franca of statistical analysis doesn't it make sense to strive for
> >> > syntax that is as straightforward and consistent as possible?
> >>
> >> Whenever one changes the language that way old code
> >> will break.
> > I think in this case not much code would break. Mostly when people have
> > a matrix M and ask for M$column they'll get an error; the proposal is
> > that they'll get the requested column. (It is possible to have a list
> > with names that is also a matrix with dimnames, but I think that is a
> > pretty unusual construction.) But I haven't been convinced that the
> > proposal is a net improvement to the language.
> > Duncan Murdoch
> >
> >> The developers can, with a lot of effort,
> >> fix their own code, and perhaps even user-written code
> >> on CRAN, but code that thousands of users have written
> >> will break. There is a lot of code out there that was
> >> written by trial and error and by folks who no longer
> >> work at an institution: the code works but no one knows
> >> exactly why it works. Telling folks they need to change
> >> that code because we have a cleaner but different syntax
> >> now is not good. Why would one spend time writing a
> >> package that might stop working when R is "upgraded"?
> >>
> >> I think the solution is not to change current semantics
> >> but to write functions that behave better and encourage
> >> users to use them, gradually abandoning the old constructs.
> >>
> >> Bill Dunlap
> >> Spotfire, TIBCO Software
> >> wdunlap tibco.com
> >> >
> >> > Again, please understand that my comment is made with deepest
> >> > respect for the many people who have unselfishly contributed to the
> >> > R project. Many thanks to each and every one of you.
> >> >
> >> > John
> >> >
> >> > >>> Karl Ove Hufthammer <karl at huftis.org> 3/2/2010 4:00 AM >>>
> >> > On Mon, 01 Mar 2010 10:00:07 -0500 Duncan Murdoch
> >> > <murdoch at stats.uwo.ca> wrote:
> >> > > Suppose X is a dataframe or a matrix. What would you expect to
> >> > > get from X[1]? What about as.vector(X), or as.numeric(X)?
> >> >
> >> > All this of course depends on the type of object one is speaking of.
> >> > There are plenty of surprises available, and it's best to use the
> >> > most logical way of extracting. E.g., to extract the top-left
> >> > element of a 2D structure (data frame or matrix), use 'X[1,1]'.
> >> >
> >> > Luckily, R provides some shortcuts. For example, you can write
> >> > 'X[2,3]' on a data frame, just as if it was a matrix, even though
> >> > the underlying structure is completely different. (This doesn't
> >> > work on a normal list; there you have to type the whole 'X[[2]][3]'.)
> >> >
> >> > The behaviour of the 'as.' functions may sometimes be surprising,
> >> > at least for me. For example, 'as.data.frame' on a named vector
> >> > gives a single-column data frame, instead of a single-row data frame.
> >> >
> >> > (I'm not sure what's the recommended way of converting a named
> >> > vector to a row data frame, but 'as.data.frame(t(X))' works, even
> >> > though both 'X' and 't(X)' look like a row of numbers.)
> >> >
> >> > > The point is that a dataframe is a list, and a matrix isn't.
> >> > > If users don't understand that, then they'll be confused
> >> > > somewhere. Making matrices more list-like in one respect will just
> >> > > move the confusion elsewhere. The solution is to understand the
> >> > > difference.
> >> >
> >> > My main problem is not understanding the difference, which is
> >> > easy, but knowing which type of object I have when I get the output
> >> > of a function in a package. If I know the object is a named vector or a
> >> > matrix with column names, it's easy enough to type
> >> > 'X[,"colname"]', and if it's a data frame one may use the shortcut
> >> > 'X$colname'.
> >> >
> >> > Usually, it *is* documented what the return value of a function
> >> > is, but just looking at the output is much faster, and *usually*
> >> > gives the correct answer.
> >> >
> >> > For example, 'mean' applied on a data frame gives a named
> >> > vector, not a data frame, which is somewhat surprising (given that
> >> > the columns of a data frame may be of different types, while the
> >> > elements of a vector may not). (And yes, I know that it's
> >> > *documented* that it returns a named vector.) On the other hand,
> >> > perhaps it is surprising that 'mean' works on data frames at all. :-)
> >> >
> >> > --
> >> > Karl Ove Hufthammer
> >> >
> >> > ______________________________________________
> >> > R-help at r-project.org mailing list
> >> > https://stat.ethz.ch/mailman/listinfo/r-help
> >> > PLEASE do read the posting guide
> >> > http://www.R-project.org/posting-guide.html
> >> > and provide commented, minimal, self-contained, reproducible code.
> >> >
> >> > Confidentiality Statement:
> >> > This email message, including any attachments, is for
> >> > th...{{dropped:6}}
> 
> -- 
> Patrick Burns
> pburns at pburns.seanet.com 
> http://www.burns-stat.com 
> (home of 'The R Inferno' and 'A Guide for the Unwilling S User')
> 
> 

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help 
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html 
and provide commented, minimal, self-contained, reproducible code.
