[R] flexible approach to subsetting data

David Winsemius dwinsemius at comcast.net
Tue Jul 23 20:12:21 CEST 2013


On Jul 23, 2013, at 10:49 AM, David Winsemius wrote:

> 
> On Jul 23, 2013, at 10:01 AM, Adams, Jean wrote:
> 
>> Check out the reshape() function of the reshape package.  Here's one of the
>> examples from ?reshape.
>> 
>> Jean
>> 
>> 
>> library(reshape)   # No,  at least not for the reshape-function
> 
> The reshape function is in the 'stats' package, which ships with base R and is loaded by default. The 'reshape' and 'reshape2' packages were written (at least in part) because the 'reshape'-function was so difficult to understand.
> 
> If you do choose to use the reshape2 package, which is well-respected and often extremely helpful, the function you will want to start with is 'melt'.
> 
> 
>> long <- reshape(wide, direction="long")
> 
> I don't think this example will be particularly helpful, since the direction needed here is "long" (from "wide") to begin with, and more input would be needed.

Here's a dataset to experiment with:

df5 <- data.frame(dose.0 = c(40,50,60,50),resp.0=c(40,50,60,50), 
 dose.1 = c(1,2,1,2), resp.1=c(1,2,1,2)+3, 
 dose.2 = c(2,1,2,1), resp.2=c(1,2,1,2)+3,
 dose.3 = c(3,3,3,3), resp.3=c(1,2,1,2)+3 )

Notice that you would need to add the ".0" suffix to the first set of column names in your data.

reshape(df5, direction="long",
        v.names=c("dose", "resp"),
        varying=list(dose=c(1,3,5,7), resp=c(2,4,6,8))
        )  # succeeds
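
Because the names in df5 follow a regular "variable.time" pattern, the index lists can probably be skipped. This is only a sketch on the same toy data, and it assumes every column is time-varying; the name long5 is just for illustration. Passing all the names as 'varying' together with sep="." asks reshape() to guess the variable names and time values from the column names:

```r
# Same toy data as above: names follow the regular "<variable>.<time>" pattern
df5 <- data.frame(dose.0 = c(40,50,60,50), resp.0 = c(40,50,60,50),
                  dose.1 = c(1,2,1,2),     resp.1 = c(1,2,1,2) + 3,
                  dose.2 = c(2,1,2,1),     resp.2 = c(1,2,1,2) + 3,
                  dose.3 = c(3,3,3,3),     resp.3 = c(1,2,1,2) + 3)

# Let reshape() split each name at the "." instead of listing column indices
long5 <- reshape(df5, direction = "long",
                 varying = names(df5),  # every column is time-varying here
                 sep = ".")

nrow(long5)          # 4 original rows x 4 time points = 16 rows
unique(long5$time)   # time values recovered from the name suffixes
```

With this form nothing has to be edited when more "dose.k"/"resp.k" pairs are added, as long as the naming stays regular.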



So perhaps you could use a similar call (after appending the ".0" suffixes) with:

  varying=list(sim=seq(1,810,by=4),
               X1= seq(2,810,by=4),
               X2= seq(3,810,by=4),
               X3= seq(4,810,by=4)
               )
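
Note that the index vectors in 'varying' must all have the same length, so the upper limit has to match the actual number of columns. A way to avoid hard-coding the indices at all might look like the following. It is a sketch on made-up data: the data frame 'wide' and its values are invented to mimic the layout described in the original question, and it assumes the only irregularity is the missing ".0" on the first block of columns.

```r
# Hypothetical stand-in for the simulation output: two imputations, wide layout
wide <- data.frame(sim   = c(1, 1, 2, 2), X1   = 1:4,   X2   = 5:8,   X3   = 9:12,
                   sim.1 = c(1, 1, 2, 2), X1.1 = 11:14, X2.1 = 15:18, X3.1 = 19:22)

# Append ".0" to every name that lacks a numeric ".k" suffix
no_suffix <- !grepl("\\.[0-9]+$", names(wide))
names(wide)[no_suffix] <- paste0(names(wide)[no_suffix], ".0")

# All columns are varying; reshape() splits the names at "." on its own
long <- reshape(wide, direction = "long",
                varying = names(wide), sep = ".",
                timevar = "m")

long$m <- long$m + 1          # number the imputations 1, 2, ... instead of 0, 1, ...
long$diff <- long$X3 - long$X1  # the kind of calculation the long format makes easy
```

The same lines should run unchanged when the number of imputations changes, since nothing refers to a specific column count.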
               
> 
> 
>> wide
>> long
>> 
>> 
>> 
>> On Tue, Jul 23, 2013 at 9:35 AM, Andrea Lamont <alamont082 at gmail.com> wrote:
>> 
>>> Hello:
>>> 
>>> I am running a simulation study and am stuck with a subsetting problem.
>>> 
>>> Here is the basic issue:
>>> I generated data and am running a simulation that uses multiple imputation.
>>> For each generated dataset, I used multiple imputation.  The resultant
>>> dataset is in wide form, where each imputation is recorded as a separate
>>> column (though the different simulations are stacked).  Here is an example
>>> of what it looks like:
>>> 
>>> sim   X1   X2   X3   sim.1   X1.1    X2.1    X3.1
> 
>>> 1         #    #     #        #           #          #         #
>>> 1         #    #     #        #           #          #         #
>>> 1         #    #     #        #           #          #         #
>>> 2         #    #     #        #           #          #         #
>>> 2         #    #     #        #           #          #         #
>>> 2         #    #     #        #           #          #         #
>>> 
>>> sim refers to the simulated/generated dataset. X1-X3 are the values for the
>>> first imputed dataset, X1.1-X3.1 are the values for the second imputed
>>> dataset.
>>> 
>>> The problem is that I want the data to be in long format, like this:
>>> 
>>> sim m X1 X2 X3
>>> 1  1   #   #    #
>>> 1  2   #   #    #
>>> 2  1   #   #    #
>>> 2  2   #   #    #
>>> 
>>> where m is the imputation number.
>>> This will allow me to do cleaner calculations (e.g. X3-X1).
>>> 
>>> I know I can subset the data manually - e.g. [,1:10] and save this to
>>> separate datasets then  rbind; however, I'm looking for a more flexible
>>> approach to do this.  This manual approach would be quite tedious as the
>>> number of imputations (and therefore the number of columns) increases (with
>>> only 10 imputations, there are roughly 810 columns). Also, I would like to
>>> avoid having to recode each time I change the number of imputations.
>>> 
>>> The same is true for the reshape function, which would require naming
>>> a huge number of columns and edits each time 'm' changes.
> 
> If the columns are named regularly, then 'reshape' will attempt to split properly without an explicit naming. Details and a better description of the problem might allow more specific answers to emerge. The fact that the first instances have no numeric indicators may be a problem for the algorithm. 
> 
> Why not post the output of dput(head( dfrm[ ,1:12])) ?
> 
> -- 
> David.
> 
>>> 
>>> 
>>> Is there a flexible way to approach this? I'm inclined to use a for loop,
>>> but 1) I know this is generally inefficient and 2) I am having trouble
>>> with the coding regardless.
>>> 
>>> Any suggestions are appreciated.
>>> 
>>> Thanks,
>>> Andrea
>>> 


David Winsemius
Alameda, CA, USA


