[R] concatenating columns in data.frame

Avi Gross @v|gro@@ @end|ng |rom ver|zon@net
Fri Jul 2 05:41:06 CEST 2021


Micha,

Others have provided ways in standard R so I will contribute a somewhat odd solution using the dplyr and related packages in the tidyverse including a sample data.frame/tibble I made. It requires newer versions of R and other  packages as it uses some fairly esoteric features including "the big bang" and the new ":=" operator and more.

You can use your own data with whatever columns you need, of course.

The goal is to have umpteen columns in the data that you want to add an additional columns to an existing tibble that is the result of concatenating the rowwise contents of a dynamically supplied vector of column names in quotes. First we need something to work with so here is a sample:

#--start
# load required packages, or a bunch at once!
library(tidyverse)

# Pick how many rows you want. For a demo, 3 is plenty N <- 3

# Make a sample tibble with N rows and the following 4 columns mydf <- tibble(alpha = 1:N, 
               beta=letters[1:N],
               gamma = N:1,
               delta = month.abb[1:N])

# show the original tibble
print(mydf)
#--end

In flat text mode, here is the output:

> print(mydf)
# A tibble: 3 x 4
alpha beta  gamma delta
<int> <chr> <int> <chr>
  1     1 a         3 Jan  
2     2 b         2 Feb  
3     3 c         1 Mar

Now I want to make a function that is used instead of the mutate verb. I made a weird one-liner that is a tad hard to explain so first let me mention the requirements.

It will take a first argument that is a tibble and in a pipeline this would be passed invisibly.
The second required argument is a vector or list containing the names of the columns as strings. A column can be re-used multiple times.
The third optional argument is what to name the new column with a default if omitted.
The fourth optional argument allows you to choose a different separator than "" if you wish.

The function should be usable in a pipeline on both sides so it should also return the input tibble with an extra column to the output.

Here is the function:

my_mutate <- function(df, columns, colnew="concatenated", sep=""){
  df %>%
    mutate( "{colnew}" := paste(!!!rlang::syms(columns), sep = sep )) }

Yes, the above can be done inline as a long one-liner:

my_mutate <- function(df, columns, colnew="concatenated", sep="") mutate(df, "{colnew}" := paste(!!!rlang::syms(columns), sep = sep ))

Here are examples of it running:


> choices <- c("beta", "delta", "alpha", "delta") mydf %>% 
> my_mutate(choices, "me2")
# A tibble: 3 x 5
alpha beta  gamma delta me2     
<int> <chr> <int> <chr> <chr>   
  1     1 a         3 Jan   aJan1Jan
2     2 b         2 Feb   bFeb2Feb
3     3 c         1 Mar   cMar3Mar
> mydf %>% my_mutate(choices, "me2",":")
# A tibble: 3 x 5
alpha beta  gamma delta me2        
<int> <chr> <int> <chr> <chr>      
  1     1 a         3 Jan   a:Jan:1:Jan
2     2 b         2 Feb   b:Feb:2:Feb
3     3 c         1 Mar   c:Mar:3:Mar
> mydf %>% my_mutate(c("beta", "beta", "gamma", "gamma", "delta", 
> "alpha"))
# A tibble: 3 x 5
alpha beta  gamma delta concatenated
<int> <chr> <int> <chr> <chr>       
  1     1 a         3 Jan   aa33Jan1    
2     2 b         2 Feb   bb22Feb2    
3     3 c         1 Mar   cc11Mar3    
> mydf %>% my_mutate(list("beta", "beta", "gamma", "gamma", "delta", 
> "alpha"))
# A tibble: 3 x 5
alpha beta  gamma delta concatenated
<int> <chr> <int> <chr> <chr>       
  1     1 a         3 Jan   aa33Jan1    
2     2 b         2 Feb   bb22Feb2    
3     3 c         1 Mar   cc11Mar3    
> mydf %>% my_mutate(columns=list("alpha", "beta", "gamma", "delta", 
> "gamma", "beta", "alpha"),
                     +                    sep="/*/",
                     +                    colnew="NewRandomNAME"
                     +                    )
# A tibble: 3 x 5
alpha beta  gamma delta NewRandomNAME              
<int> <chr> <int> <chr> <chr>                      
  1     1 a         3 Jan   1/*/a/*/3/*/Jan/*/3/*/a/*/1
2     2 b         2 Feb   2/*/b/*/2/*/Feb/*/2/*/b/*/2
3     3 c         1 Mar   3/*/c/*/1/*/Mar/*/1/*/c/*/3

Does this meet your normal need? Just to show it works in a pipeline, here is a variant:

mydf %>%
  tail(2) %>%
  my_mutate(c("beta", "beta"), "betabeta") %>%
  print() %>%
  my_mutate(list("alpha", "betabeta", "gamma"),
            "buildson", 
            "&")

The above only keeps the last two lines of the tibble, makes a double copy of "beta" under a new name, prints the intermediate result, continues to make another concatenation using the variable created earlier then prints the result:

Here is the run:

> mydf %>%
  +   tail(2) %>%
  +   my_mutate(c("beta", "beta"), "betabeta") %>%
  +   print() %>%
  +   my_mutate(list("alpha", "betabeta", "gamma"),
                +             "buildson", 
                +             "&")
# A tibble: 2 x 5
alpha beta  gamma delta betabeta
<int> <chr> <int> <chr> <chr>   
  1     2 b         2 Feb   bb      
2     3 c         1 Mar   cc      
# A tibble: 2 x 6
alpha beta  gamma delta betabeta buildson
<int> <chr> <int> <chr> <chr>    <chr>   
  1     2 b         2 Feb   bb       2&bb&2  
2     3 c         1 Mar   cc       3&cc&1  

As to how the darn function works, that was a learning experience for me to build using features I have not had occasion to use. If anyone remains interested, read on. 

The following needs newish features:

	"{colnew}" := SOMETHING

The colon-equals operator in newer R/dplyr can be sort of used in an odd way that allows the name of the variable to be in quotes and in brackets akin to the way glue() does it. The variable colnew is evaluated and substituted so the name used for the column is now dynamic.

The function does a paste using this:

	!!!rlang::syms(columns)

The problem is paste() wants multiple arguments and we have a single argument that is either a vector or another kind of vector called a list. The trick is to convert the vector into symbols then use "!!!" to convert something like 'c("alpha", "beta", "gamma")' into something more like ' "alpha", "beta", "gamma" ' so that paste sees them as multiple arguments to concatenate in vector fashion.

And, the function is not polished but I am sure you can all see some of what is needed like checking the arguments for validity, including not having a name for the new column that clashes with existing column names, doing something sane if no columns to concatenate are offered and so on.

Just showing a different approach. The base R methods are fine.

- Avi

-----Original Message-----
From: R-help <r-help-bounces using r-project.org> On Behalf Of Micha Silver
Sent: Thursday, July 1, 2021 10:36 AM
To: R-help using r-project.org
Subject: [R] concatenating columns in data.frame

I need to create a new data.frame column as a concatenation of existing character columns. But the number and name of the columns to concatenate needs to be passed in dynamically. The code below does what I want, but seems very clumsy. Any suggestions how to improve?


df = data.frame("A"=sample(letters, 10), "B"=sample(letters, 10), "C"=sample(letters,10), "D"=sample(letters, 10))

# Which columns to concat:

use_columns = c("D", "B")


UpdateCombo = function(df, use_columns) {
     use_df = df[, use_columns]
     combo_list = lapply(1:nrow(use_df), function(r) {
     r_combo = paste(use_df[r,], collapse="_")
     return(data.frame("Combo" = r_combo))
     })
     combo = do.call(rbind, combo_list)

     names(combo) = "Combo"

     return(combo)

}


combo_col = UpdateCombo(df, use_columns)

df_combo = do.call(cbind, list(df, combo_col))


Thanks


--
Micha Silver
Ben Gurion Univ.
Sde Boker, Remote Sensing Lab
cell: +972-523-665918

______________________________________________
R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list