[R] flexible approach to subsetting data

David Winsemius dwinsemius at comcast.net
Tue Jul 23 23:59:51 CEST 2013


On Jul 23, 2013, at 2:00 PM, David Carlson wrote:

> Actually the ".0" on the first variable is not needed.
> 
> You could modify the reshape() call to search for the base
> name of each variable so you would not need to change the code
> if the number of replications changes:
> 
> reshape(df5,  direction="long", v.names=c("dose", "resp"), 
> 	varying=list(dose=grepl("dose", names(df5)),
> 	resp=grepl("resp", names(df5)) )
>      )
> 

That's really elegant and much more "elastic". (I hadn't realized that a logical vector would be accepted.) It is also possible to just use 'grep', which would instead construct vectors of column numbers as the list elements of 'varying'. I've wondered for years whether the help page description of 'varying' could be improved. It currently says:

"varying : 
names of sets of variables in the wide format that correspond to single variables in long format (‘time-varying’). This is canonically a list of vectors of variable names, but it can optionally be a matrix of names, or a single vector of names. In each case, the names can be replaced by indices which are interpreted as referring to names(data). See ‘Details’ for more details and options."

I wondered if it might say instead:

"a list of sets of variables in the wide format that each correspond to single variables in long format (‘time-varying’). This is canonically a list of vectors of column names or numbers, but it can optionally be a matrix of names, or a single vector of names. In each case, the names can be replaced by numeric or logical indices which are interpreted as extracting from names(data). See ‘Details’ for more details and options."
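
A quick check of that wording, as a sketch against the df5 example quoted further down. The helper name to_long is hypothetical and exists only for this illustration. Names, column numbers from grep(), and logical vectors from grepl() should all be accepted as the list elements of 'varying':

```r
# Example data from earlier in the thread
df5 <- data.frame(dose.0 = c(40,50,60,50), resp.0 = c(40,50,60,50),
                  dose.1 = c(1,2,1,2),     resp.1 = c(1,2,1,2)+3,
                  dose.2 = c(2,1,2,1),     resp.2 = c(1,2,1,2)+3,
                  dose.3 = c(3,3,3,3),     resp.3 = c(1,2,1,2)+3)

# Hypothetical helper: the same reshape() call each time, only 'varying' changes
to_long <- function(varying) {
  reshape(df5, direction = "long", v.names = c("dose", "resp"),
          varying = varying)
}

# Character vectors of column names
by_name    <- to_long(list(dose = grep("dose", names(df5), value = TRUE),
                           resp = grep("resp", names(df5), value = TRUE)))
# Integer column positions
by_number  <- to_long(list(dose = grep("dose", names(df5)),
                           resp = grep("resp", names(df5))))
# Logical vectors, one element per column of df5
by_logical <- to_long(list(dose = grepl("dose", names(df5)),
                           resp = grepl("resp", names(df5))))
```

With df5 as above, each call should give 16 rows (4 rows by 4 time points) with matching 'dose' and 'resp' columns.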

But the help page also says that it can be a single vector of names, and in that case there is a further promise of an attempt at automagic splitting. Unfortunately the magic is often unsuccessful:

> reshape(df5,  direction="long", 
+ 	varying=c("dose", "resp")
+      )
Error in guess(varying) : 
  failed to guess time-varying variables from their names

# Seems like it should have been possible:
> df5
  dose.0 resp.0 dose.1 resp.1 dose.2 resp.2 dose.3 resp.3
1     40     40      1      4      2      4      3      4
2     50     50      2      5      1      5      3      5
3     60     60      1      4      2      4      3      4
4     50     50      2      5      1      5      3      5
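
A possible explanation, offered as my reading of the error rather than something the help page spells out: "dose" and "resp" are not column names and contain no "." for the splitter to work on. A sketch that hands over the full wide-format column names instead, relying on the default sep = ".":

```r
# Example data from earlier in the thread
df5 <- data.frame(dose.0 = c(40,50,60,50), resp.0 = c(40,50,60,50),
                  dose.1 = c(1,2,1,2),     resp.1 = c(1,2,1,2)+3,
                  dose.2 = c(2,1,2,1),     resp.2 = c(1,2,1,2)+3,
                  dose.3 = c(3,3,3,3),     resp.3 = c(1,2,1,2)+3)

# Pass every wide column name; reshape() splits each one at the "."
# into a variable name ("dose", "resp") and a time suffix.
long5 <- reshape(df5, direction = "long", varying = names(df5))

names(long5)         # should include dose, resp, time and id
unique(long5$time)   # the suffixes are used as the time values
```

This still adapts to the number of replications, since nothing but names(df5) is supplied, but it depends on every varying column carrying a suffix.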

-- 
David.

> -------------------------------------
> David L Carlson
> Associate Professor of Anthropology
> Texas A&M University
> College Station, TX 77840-4352
> 
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of David
> Winsemius
> Sent: Tuesday, July 23, 2013 1:12 PM
> To: David Winsemius
> Cc: R help; Andrea Lamont
> Subject: Re: [R] flexible approach to subsetting data
> 
> 
> On Jul 23, 2013, at 10:49 AM, David Winsemius wrote:
> 
>> 
>> On Jul 23, 2013, at 10:01 AM, Adams, Jean wrote:
>> 
>>> Check out the reshape() function of the reshape package. Here's one of the
>>> examples from ?reshape.
>>> 
>>> Jean
>>> 
>>> 
>>> library(reshape)   # No, at least not for the reshape-function
>> 
>> The reshape function is from the 'stats' package, which ships with R. The
>> 'reshape' and 'reshape2' packages were written (at least in part) because
>> the 'reshape'-function was so difficult to understand.
>> 
>> If you do choose to use the reshape2 package, which is well-respected and
>> often extremely helpful, the function you will want to start with is 'melt'.
>> 
>> 
>>> long <- reshape(wide, direction="long")
>> 
>> I don't think this example will be particularly helpful since the initial
>> direction is "long" (from "wide") and more input would be needed.
> 
> Here's a dataset to experiment with
> 
> df5 <- data.frame(dose.0 = c(40,50,60,50), resp.0 = c(40,50,60,50),
>                   dose.1 = c(1,2,1,2), resp.1 = c(1,2,1,2)+3,
>                   dose.2 = c(2,1,2,1), resp.2 = c(1,2,1,2)+3,
>                   dose.3 = c(3,3,3,3), resp.3 = c(1,2,1,2)+3 )
> 
> Notice that you would need to add the ".0" to the column names
> 
> reshape(df5,  direction="long", 
>              v.names=c("dose", "resp"), 
>               varying=list(dose=c(1,3,5,7), resp=c(2,4,6,8) )
>        )  # succeeds
> 
> 
> 
> So perhaps you could use a similar call (after appending the ".0"s)
> with:
> 
>  varying=list(sim=seq(1,810,by=4),
>               X1= seq(2,810,by=4),
>               X2= seq(3,810,by=4),
>               X3= seq(4,810,by=4)
>               )
> 
>> 
>> 
>>> wide
>>> long
>>> 
>>> 
>>> 
>>> On Tue, Jul 23, 2013 at 9:35 AM, Andrea Lamont <alamont082 at gmail.com> wrote:
>>> 
>>>> Hello:
>>>> 
>>>> I am running a simulation study and am stuck with a subsetting problem.
>>>> 
>>>> Here is the basic issue:
>>>> I generated data and am running a simulation that uses multiple imputation.
>>>> For each generated dataset, I used multiple imputation. The resultant
>>>> dataset is in wide form where each imputation is recorded as a separate
>>>> column (though the different simulations are stacked). Here is an example
>>>> of what it looks like:
>>>> 
>>>> sim   X1   X2   X3   sim.1   X1.1   X2.1   X3.1
>>>> 1     #    #    #    #       #      #      #
>>>> 1     #    #    #    #       #      #      #
>>>> 1     #    #    #    #       #      #      #
>>>> 2     #    #    #    #       #      #      #
>>>> 2     #    #    #    #       #      #      #
>>>> 2     #    #    #    #       #      #      #
>>>> 
>>>> sim refers to the simulated/generated dataset. X1-X3 are the values for the
>>>> first imputed dataset, X1.1-X3.1 are the values for the second imputed
>>>> dataset.
>>>> 
>>>> The problem is that I want the data to be in long format, like this:
>>>> 
>>>> sim  m  X1  X2  X3
>>>> 1    1  #   #   #
>>>> 1    2  #   #   #
>>>> 2    1  #   #   #
>>>> 2    2  #   #   #
>>>> 
>>>> where m is the imputation number.
>>>> This will allow me to do cleaner calculations (e.g. X3-X1).
>>>> 
>>>> I know I can subset the data manually - e.g. [,1:10] and save this to
>>>> separate datasets then rbind; however, I'm looking for a more flexible
>>>> approach to do this. This manual approach would be quite tedious as the
>>>> number of imputations (and therefore number of columns) increases (with
>>>> only 10 imputations, there are roughly 810 columns). Also, I would like to
>>>> avoid having to recode each time I change the number of imputations.
>>>> 
>>>> The same is true for the reshape function, which would require naming
>>>> a huge number of columns and edits each time 'm' changes.
>> 
>> If the columns are named regularly, then 'reshape' will attempt to split
>> properly without an explicit naming. Details and a better description of the
>> problem might allow more specific answers to emerge. The fact that the first
>> instances have no numeric indicators may be a problem for the algorithm.
>> 
>> Why not post dput(head( dfrm[ ,1:12]))
>> 
>> -- 
>> David.
>> 
>>>> 
>>>> 
>>>> Is there a flexible way to approach this? I'm inclined to use a for loop,
>>>> but know that 1) this is generally inefficient and 2) I am having trouble
>>>> with the coding regardless.
>>>> 
>>>> Any suggestions are appreciated.
>>>> 
>>>> Thanks,
>>>> Andrea
>>>> 
> 
> 
> David Winsemius
> Alameda, CA, USA
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible
> code.
> 

David Winsemius
Alameda, CA, USA


