[R] flexible approach to subsetting data

arun smartpink111 at yahoo.com
Thu Jul 25 05:53:48 CEST 2013


Hi,
It works for me on a small dataset:
rt<- structure(list(sim = c(1L, 1L, 1L, 2L, 2L, 2L), txt.y.obs = c(5L, 
4L, 3L, 6L, 7L, 9L), cont.y.obs = c(4L, 3L, 9L, 4L, 8L, 6L), 
    ID = 1:6, obs.txt = c(5L, 2L, 4L, 8L, 4L, 7L), TE = c(5L, 
    7L, 4L, 3L, 5L, 8L), X1 = c(1L, 1L, 1L, 2L, 2L, 2L), sim.1 = c(4L, 
    7L, 5L, 3L, 5L, 9L), txt.y.obs.1 = c(3L, 5L, 7L, 9L, 5L, 
    4L), cont.y.obs.1 = c(3L, 4L, 8L, 9L, 4L, 5L), ID.1 = 1:6, 
    obs.txt.1 = c(7L, 1L, 4L, 5L, 8L, 6L), TE.1 = c(5L, 6L, 3L, 
    4L, 9L, 10L), X1.1 = c(6L, 4L, 3L, 8L, 5L, 6L)), .Names = c("sim", 
"txt.y.obs", "cont.y.obs", "ID", "obs.txt", "TE", "X1", "sim.1", 
"txt.y.obs.1", "cont.y.obs.1", "ID.1", "obs.txt.1", "TE.1", "X1.1"
), class = "data.frame", row.names = c(NA, -6L))



rtr<-reshape(rt,  direction="long",
varying=list(
sim=grepl("sim", names(rt)),
txt.y.obs=grepl("txt.y.obs", names(rt)),
cont.y.obs=grepl("cont.y.obs", names(rt)),
ID=grepl("ID", names(rt)),
obs.txt=grepl("obs.txt", names(rt)),
TE=grepl("TE", names(rt)),
X1=grepl("X1", names(rt))),
v.names=
c("sim","txt.y.obs","cont.y.obs","ID","obs.txt", "TE", "X1"),
timevar="imputation")



#Using a bigger dataset:
set.seed(48)
rtNew<- as.data.frame(matrix(sample(1:50,405*5,replace=TRUE),ncol=405))
colnames(rtNew)<-paste0(gsub("\\d+","",colnames(rtNew)),1:81)
colnames(rtNew)[-c(1:81)]<-paste(colnames(rtNew)[-c(1:81)],rep(1:4,each=81),sep=".")
res<- reshape(rtNew,direction="long",varying=list(V1=grepl("V1",names(rtNew)),
V2=grepl("V2",names(rtNew)),V3=grepl("V3",names(rtNew)),V4=grepl("V4",names(rtNew)),
V5=grepl("V5",names(rtNew)),V6=grepl("V6",names(rtNew)),V7=grepl("V7",names(rtNew))),
v.names=c("V1","V2","V3","V4","V5","V6","V7"),timevar="imputation")
#works
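
One caveat about the patterns above: grepl() does unanchored regular-expression matching, so "V1" also matches V10 through V19. In the 405-column example each of the seven patterns happens to select the same number of columns (55), so reshape() runs, but columns from different variables end up in the same group. A pattern such as "V8" would select only V8, V80 and V81 in each block, and the unequal counts would then trigger the "'varying' arguments must be the same length" error. A minimal sketch of anchoring the pattern, assuming the repeats are always named <base>.<number> (the helper name match_var and the column names are made up for illustration):

```r
# Hypothetical column names: 12 base variables, each with one repeat (".1")
nms <- c(paste0("V", 1:12), paste0("V", 1:12, ".1"))

# Unanchored pattern: "V1" also matches V10, V11, V12 and their repeats
loose <- grepl("V1", nms)
nms[loose]

# Anchored pattern: the base name, optionally followed by ".<digits>"
# (illustrative helper, not part of base R)
match_var <- function(base, nms) {
  grepl(paste0("^", base, "(\\.[0-9]+)?$"), nms)
}
strict <- match_var("V1", nms)
nms[strict]

# Number of columns selected by each approach
c(loose = sum(loose), strict = sum(strict))
```

Note that a "." inside a base name (as in "txt.y.obs") is still treated as "any character" by this helper; that is usually harmless but is worth keeping in mind.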

#When I closed the list() parenthesis in the wrong place (after timevar):

reshape(rtNew,direction="long",varying=list(V1=grepl("V1",names(rtNew)),
V2=grepl("V2",names(rtNew)),V3=grepl("V3",names(rtNew)),V4=grepl("V4",names(rtNew)),
V5=grepl("V5",names(rtNew)),V6=grepl("V6",names(rtNew)),V7=grepl("V7",names(rtNew)),
v.names=c("V1","V2","V3","V4","V5","V6","V7"),timevar="imputation"))
#Error in reshapeLong(data, idvar = idvar, timevar = timevar, varying = varying,  : 
 # 'varying' arguments must be the same length
Your code looks fine with respect to closing brackets, though.
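
As far as I can tell, the misplaced parenthesis produces this particular message because v.names and timevar become two extra elements of the 'varying' list, and their lengths differ from the column selections. A small sketch on made-up data (the data frame d and the list name bad are hypothetical):

```r
# Made-up data frame: two variables, each with one repeat
d <- data.frame(a = 1:3, b = 4:6, a.1 = 7:9, b.1 = 10:12)

# list() closed too late, so v.names and timevar become elements of 'varying'
bad <- list(a = grepl("^a", names(d)),
            b = grepl("^b", names(d)),
            v.names = c("a", "b"),
            timevar = "imputation")

# Logical selections count the columns they pick; character vectors count as-is
lens <- vapply(bad, function(x) {
  if (is.character(x)) length(x) else sum(x)
}, integer(1))
lens

# The lengths differ, so reshape() stops with an error
res <- try(reshape(d, direction = "long", varying = bad), silent = TRUE)
inherits(res, "try-error")
```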
A.K.




----- Original Message -----
From: Andrea Lamont <alamont082 at gmail.com>
To: David Carlson <dcarlson at tamu.edu>
Cc: R help <r-help at r-project.org>
Sent: Wednesday, July 24, 2013 9:41 PM
Subject: Re: [R] flexible approach to subsetting data

Hi, all:

I have a follow-up question.

I have 81 variables in my dataset (all of which are repeated).  reshape()
gives me an error whenever more than six variables are used. The
error message is this:

Error in reshapeLong(data, idvar = idvar, timevar = timevar, varying = varying,  :
  'varying' arguments must be the same length

I have tested the lengths of all the variables, and they are all equal.
Further, when I mix up the variables used in the reshape function, it
works -- so long as I keep the number of variables used at six or fewer. As soon
as I add the seventh variable (regardless of what it is), I receive this
error.


#This works:
rtr<-reshape(rt,  direction="long",
varying=list(
sim=grepl("sim", names(rt)),
txt.y.obs=grepl("txt.y.obs", names(rt)),
cont.y.obs=grepl("cont.y.obs", names(rt)),
ID=grepl("ID", names(rt)),
obs.txt=grepl("obs.txt", names(rt)),
TE=grepl("TE", names(rt))),
v.names=
c("sim","txt.y.obs","cont.y.obs","ID","obs.txt", "TE"),
timevar="imputation")



#The addition of one more variable creates an error. The problem is not
#with X1.
rtr<-reshape(rt,  direction="long",
varying=list(
sim=grepl("sim", names(rt)),
txt.y.obs=grepl("txt.y.obs", names(rt)),
cont.y.obs=grepl("cont.y.obs", names(rt)),
ID=grepl("ID", names(rt)),
obs.txt=grepl("obs.txt", names(rt)),
TE=grepl("TE", names(rt)),
X1=grepl("X1", names(rt))),
v.names=
c("sim","txt.y.obs","cont.y.obs","ID","obs.txt", "TE", "X1"),
timevar="imputation")
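
A quick way to investigate an error like this is to count how many columns each pattern selects; a name that is a prefix of another name (for example "X1" and "X10") selects extra columns and makes the lengths unequal. The sketch below uses made-up data, not the real 81-variable dataset, and assumes every repeat is named <base>.<number>. It also builds the 'varying' list from the column names, so nothing needs retyping when the number of variables or imputations changes:

```r
# Hypothetical wide data: 3 variables, 2 imputations; "X1" is a prefix of "X10"
wide <- data.frame(sim = c(1, 1, 2), X1 = 1:3, X10 = 4:6,
                   sim.1 = c(1, 1, 2), X1.1 = 7:9, X10.1 = 10:12)

# Diagnostic: how many columns does each unanchored pattern select?
vars <- c("sim", "X1", "X10")
sapply(vars, function(v) sum(grepl(v, names(wide))))

# Group columns by base name (the name with a trailing ".<digits>" removed),
# keeping the original column order
base <- sub("\\.[0-9]+$", "", names(wide))
varying <- split(names(wide), factor(base, levels = unique(base)))

long <- reshape(wide, direction = "long", varying = varying,
                v.names = names(varying), timevar = "imputation")
long
```

Passing column names rather than patterns avoids the substring problem entirely, at the cost of relying on the naming convention stated above.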




On Tue, Jul 23, 2013 at 5:00 PM, David Carlson <dcarlson at tamu.edu> wrote:

> Actually the ".0" on the first variable is not needed.
>
> You could modify the reshape() call to search for the base
> name of each variable so you would not need to change the code
> if the number of replications changes:
>
> reshape(df5,  direction="long", v.names=c("dose", "resp"),
>         varying=list(dose=grepl("dose", names(df5)),
>         resp=grepl("resp", names(df5)) )
>       )
>
> -------------------------------------
> David L Carlson
> Associate Professor of Anthropology
> Texas A&M University
> College Station, TX 77840-4352
>
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of David Winsemius
> Sent: Tuesday, July 23, 2013 1:12 PM
> To: David Winsemius
> Cc: R help; Andrea Lamont
> Subject: Re: [R] flexible approach to subsetting data
>
>
> On Jul 23, 2013, at 10:49 AM, David Winsemius wrote:
>
> >
> > On Jul 23, 2013, at 10:01 AM, Adams, Jean wrote:
> >
> >> Check out the reshape() function of the reshape package. Here's one of the
> >> examples from ?reshape.
> >>
> >> Jean
> >>
> >>
> >> library(reshape)   # No, at least not for the reshape-function
> >
> > The reshape function is from the 'stats' package, which ships with base R.
> > The 'reshape' and 'reshape2' packages were written (at least in part)
> > because the 'reshape'-function was so difficult to understand.
> >
> > If you do choose to use the reshape2 package, which is well-respected and
> > often extremely helpful, the function you will want to start with is 'melt'.
> >
> >
> >> long <- reshape(wide, direction="long")
> >
> > I don't think this example will be particularly helpful since the initial
> > direction is "long" (from "wide") and more input would be needed.
>
> Here's a dataset to experiment with
>
> df5 <- data.frame(dose.0 = c(40,50,60,50), resp.0=c(40,50,60,50),
>  dose.1 = c(1,2,1,2), resp.1=c(1,2,1,2)+3,
>  dose.2 = c(2,1,2,1), resp.2=c(1,2,1,2)+3,
>  dose.3 = c(3,3,3,3), resp.3=c(1,2,1,2)+3 )
>
> Notice that you would need to add the ".0" to the column names
>
> reshape(df5,  direction="long",
>               v.names=c("dose", "resp"),
>                varying=list(dose=c(1,3,5,7), resp=c(2,4,6,8) )
>         )  # succeeds
>
>
>
> So perhaps you could use a similar call (after appending the ".0"s) with:
>
>   varying=list(sim=seq(1,810,by=4),
>                X1= seq(2,810,by=4),
>                X2= seq(3,810,by=4),
>                X3= seq(4,810,by=4)
>                )
>
> >
> >
> >> wide
> >> long
> >>
> >>
> >>
> >> On Tue, Jul 23, 2013 at 9:35 AM, Andrea Lamont <alamont082 at gmail.com> wrote:
> >>
> >>> Hello:
> >>>
> >>> I am running a simulation study and am stuck with a subsetting problem.
> >>>
> >>> Here is the basic issue:
> >>> I generated data and am running a simulation that uses multiple imputation.
> >>> For each generated dataset, I used multiple imputation. The resultant
> >>> dataset is in wide form, where each imputation is recorded as a separate
> >>> column (though the different simulations are stacked). Here is an example
> >>> of what it looks like:
> >>>
> >>> sim   X1   X2   X3   sim.1   X1.1   X2.1   X3.1
> >>> 1     #    #    #    #       #      #      #
> >>> 1     #    #    #    #       #      #      #
> >>> 1     #    #    #    #       #      #      #
> >>> 2     #    #    #    #       #      #      #
> >>> 2     #    #    #    #       #      #      #
> >>> 2     #    #    #    #       #      #      #
> >>>
> >>> sim refers to the simulated/generated dataset. X1-X3 are the values for the
> >>> first imputed dataset, X1.1-X3.1 are the values for the second imputed
> >>> dataset.
> >>>
> >>> The problem is that I want the data to be in long format, like this:
> >>>
> >>> sim m X1 X2 X3
> >>> 1  1   #   #    #
> >>> 1  2   #   #    #
> >>> 2  1   #   #    #
> >>> 2  2   #   #    #
> >>>
> >>> where m is the imputation number.
> >>> This will allow me to do cleaner calculations (e.g. X3-X1).
> >>>
> >>> I know I can subset the data manually - e.g. [,1:10] and save this to
> >>> separate datasets then rbind; however, I'm looking for a more flexible
> >>> approach to do this.  This manual approach would be quite tedious as the
> >>> number of imputations (and therefore number of columns) increased (with
> >>> only 10 imputations, there are roughly 810 columns). Also, I would like to
> >>> avoid having to recode each time I change the number of imputations.
> >>>
> >>> The same is true for the reshape function, which would require naming
> >>> a huge number of columns and edits each time 'm' changes.
> >
> > If the columns are named regularly, then 'reshape' will attempt to split
> > properly without an explicit naming. Details and a better description of
> > the problem might allow more specific answers to emerge. The fact that the
> > first instances have no numeric indicators may be a problem for the
> > algorithm.
>
> >
> > Why not post dput(head( dfrm[ ,1:12]))
> >
> > --
> > David.
> >
> >>>
> >>>
> >>> Is there a flexible way to approach this? I'm inclined to use a for loop,
> >>> but know that 1) this is generally inefficient and 2) am having trouble
> >>> with the coding regardless.
> >>>
> >>> Any suggestions are appreciated.
> >>>
> >>> Thanks,
> >>> Andrea
> >>>
>
>
> David Winsemius
> Alameda, CA, USA
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible
> code.
>
>


-- 
Andrea Lamont, MA
Clinical-Community Psychology
University of South Carolina
Barnwell College
Columbia, SC 29208

