[R] How to sum values across multiple variables using a wildcard?

Tue Feb 21 02:32:07 CET 2006

Thanks Gabor, Simon and Marc...I got this to work with the grep() and
rowSums examples  you provided.

Mark

On 2/20/06, Marc Schwartz <MSchwartz at mn.rr.com> wrote:
> On Mon, 2006-02-20 at 18:41 -0600, mtb954 at gmail.com wrote:
> > I have a dataframe called "data" with 5 records (in rows) each of
> > which has been scored on each of many variables (in columns).
> >
> > Five of the variables are named var1, var2, var3, var4, var5 using
> > headers. The other variables are named using other conventions.
> >
> > I can create a new variable called var6 with the value 15 for each
> > record with this code:
> >
> > > var6=var1+var2+var3+var4+var5
> >
> > but this is tedious for my real dataset with dozens of variables. I
> > would rather use a wildcard to add up all the variables that begin
> > with "Var" like this pseudocode:
> >
> > > Var6=sum(var*)
> >
> > Any suggestions for implementing this in R? Thanks! Mark
>
> Here is one approach using grep().
>
> Given a data frame called MyDF with the following structure:
>
> > str(MyDF)
> `data.frame':   10 obs. of  20 variables:
>  $ other4 : num  -0.869  0.376 -2.022  0.619 -0.129 ...
>  $ var8   : num  -0.380  1.428 -1.075 -0.796 -0.588 ...
>  $ var4   : num  -0.0850 -0.7335 -0.5019 -1.1633 -0.0197 ...
>  $ other9 : num   0.0210 -0.6455  0.0289  1.2405 -1.3359 ...
>  $ var10  : num   0.647 -0.798  0.180  1.135 -0.258 ...
>  $ other2 : num   0.1332 -0.2227  0.0423  0.6881  2.0304 ...
>  $ other10: num  0.811 2.166 0.569 0.302 0.669 ...
>  $ var1   : num  -0.774 -1.812 -1.230 -0.969  0.245 ...
>  $ var2   : num  -0.0538  0.3712  0.8222 -0.8025 -0.6914 ...
>  $ other6 : num  0.871 0.291 2.079 1.098 1.025 ...
>  $ other1 : num  -0.5130  0.1358  0.8744  0.0997  1.7458 ...
>  $ var9   : num   0.664 -0.456  0.415  2.090 -0.283 ...
>  $ other3 : num  -0.425 -0.283  0.706 -1.879 -0.828 ...
>  $ other7 : num   0.100  0.177  0.570 -0.631 -1.009 ...
>  $ var3   : num   1.446 -0.862  0.184  1.077  0.146 ...
>  $ var5   : num   0.402 -0.498 -0.906  0.641  1.690 ...
>  $ var6   : num   0.892 -0.242  0.561  0.530 -0.291 ...
>  $ other5 : num  -1.210  0.815 -1.284 -0.152  0.329 ...
>  $ other8 : num  -0.265 -1.278  1.152  0.232 -1.189 ...
>  $ var7   : num  -0.616 -0.994 -0.263  1.626 -1.372 ...
>
>
> Note that the column names are either var* or other*.
>
> Using grep() we get the indices of the column names that contain "other"
> plus one or more following characters, where the "other" begins the
> word:
>
> > grep("\\bother.", names(MyDF))
>  [1]  1  4  6  7 10 11 13 14 18 19
>
> See ?regexp for more information. Note that I use "\\b" to being the
> search at the starting word boundary and then "." to require that there
> be following characters. Thus, this would not match " other1" or
> " other".
>
> You can then use the following to subset the data frame MyDF and sum the
> rows for the requested columns:
>
> > rowSums(MyDF[, grep("\\bother.", names(MyDF))])
>         1         2         3         4         5         6         7
> -1.344893  1.531417  2.715234  1.616971  1.307379  4.655568  4.638446
>         8         9        10
> -2.640485 -2.226270 -2.158248
>
>
> You could use grep("other", names(MyDF)), but this would also get
> "other" if it appears anywhere in the name. For example:
>
> > grep("other", "Thisotherone")
> [1] 1
>
> It just depends upon your naming schema and how strict you need to be in
> the search.
>
> HTH,
>
> Marc Schwartz
>
>
>