[R] concatenating columns in data.frame

Jeff Newmiller jdnewmil using dcn.davis.ca.us
Sat Jul 3 01:59:21 CEST 2021


I am very agnostic about tidyverse/base R. However, the complexity of 
setting up NSE functions is often simply not needed, and I encounter so 
many people who simply disregard base R as being too outdated so that they 
never learn how simple solutions in R can be. The contrast between your 
solution and Bert's was... perhaps informative, but a nuclear bomb where 
an axe was sufficient.
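
For concreteness, a base R version of the task can be as short as the sketch below. It only illustrates the kind of simple solution meant here, using the sample data from the original question. It is not a quotation of any code posted earlier in the thread, and the seed is an arbitrary addition for reproducibility.

```r
# Sample data modeled on the original question: four character columns.
set.seed(1)  # arbitrary seed, only so the example is reproducible
df <- data.frame(A = sample(letters, 10), B = sample(letters, 10),
                 C = sample(letters, 10), D = sample(letters, 10))

# Columns to concatenate, supplied dynamically as a character vector.
use_columns <- c("D", "B")

# df[use_columns] is a list of columns; do.call() hands each one to paste()
# as a separate argument, so paste() works row-wise across them.
df$Combo <- do.call(paste, c(df[use_columns], sep = "_"))
```

The new column is added in place, so no intermediate data frame has to be built and bound back on.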

On Fri, 2 Jul 2021, Avi Gross via R-help wrote:

> I know what you mean Jeff. Yes I am very familiar with base R techniques. What I had hoped for was to do two things that some of the other methods mentioned do that ended up bringing two data.frames together as part of the solution.
>
> Much of what I used is now standard R. I was looking at the selection helpers now commonly used in dplyr, such as starts_with(), that let you dynamically choose which columns to work with. Sadly, they seem to work at the top level of a verb but not easily within a call to something like paste(...), where they are not evaluated in the way I want.
>
> But the odd method I tried can also be used in standard R with a bit of work. You can create a function without using dplyr that takes your df and uses it to concatenate and end with something like:
>
> df$new_col <- do_something(df, selected_cols)
>
> That too adds a column without the need to merge larger structures explicitly.
>
> But your other point is a tad religious in a sense. I happen to prefer learning a core language first then looking at enhancement opportunities. But at some point, if teaching someone new who wants to focus on getting a job done simply but not necessarily repeatedly or in some ideal way, it is best to do things in a way that their mind flows better.
>
> Many things in the tidyverse are redundant with base R or just "fix" inconsistencies like making sure the first argument is always the same. But many add substantially to doing things in a more step-by-step manner.
>
> I do not worship the base language as it first came out or even as it has evolved. I do like to know what choices I have and pick and choose among them as needed. Of course a forum like this is more about base R than otherwise and I acknowledge that. Still, the ":=" operator is at least recognized by the base R parser, even though packages such as rlang and data.table supply its meaning. There is a new pipeline operator "|>" in base R as of version 4.1.0. Some ideas, good or otherwise, do get in eventually.
>
> I started doing graphs using base R as in the plot() command. It was adequate but I wanted better. So I learned about Lattice and various packages and eventually ggplot. I can now do things I barely imagined before and am still learning that there is much more I can do with packages underneath much of the magic and also additional packages layered above it, in some sense. So I do not approach that with an either-or mentality either.
>
> Note I am not really talking about just R. I have similar issues with other languages I program in such as Python. None of them were created fully-formed and many had to add huge amounts to adapt to additional wants and needs. Base R for me is often inadequate. But so what?
>
> The task being asked for in this thread in isolation, indeed may not be done any better using packages. However, if it is part of a larger set of tasks that can be pipelined, it may well be and I personally was wondering if there was a way in dplyr. There probably is a much better way than I assembled if I only knew about it, and if not, they may add this kind of indirection in a future release if deemed worthy of doing. I have gone back to programs I did years ago with humungous amounts of code using what I knew then and reducing it drastically now that I can tell a function to select say all my column names that end in .orig and apply a set of functions to them with output going to the base name followed by .mean and .sd and so on. All that can often be done in one or two lines of code where previously I had to do 18 near repetitions of each part and then another and another. That used a limited form of dynamism.
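
To make the pattern described above concrete, here is a minimal base R sketch of a name-driven summary of that kind. The data and column names are invented for illustration, and the packages mentioned above offer their own ways of expressing the same thing.

```r
# Hypothetical data: two measurement columns whose names end in ".orig".
dat <- data.frame(id = 1:4,
                  height.orig = c(150, 160, 170, 180),
                  weight.orig = c(50, 60, 70, 80))

# Select columns by name pattern instead of listing them by hand.
orig_cols  <- grep("\\.orig$", names(dat), value = TRUE)
base_names <- sub("\\.orig$", "", orig_cols)

# Apply each summary function to every selected column and name the
# results <base>.mean and <base>.sd.
summaries <- c(
  setNames(lapply(dat[orig_cols], mean), paste0(base_names, ".mean")),
  setNames(lapply(dat[orig_cols], sd),   paste0(base_names, ".sd"))
)
summary_row <- as.data.frame(summaries)
```

Adding another ".orig" column to dat requires no change to the code, which is the limited form of dynamism being described.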
>
> Be that as it may I think the requester has enough info and we can move on.
>
> -----Original Message-----
> From: Jeff Newmiller <jdnewmil using dcn.davis.ca.us>
> Sent: Friday, July 2, 2021 1:03 AM
> To: Avi Gross <avigross using verizon.net>; Avi Gross via R-help <r-help using r-project.org>; R-help using r-project.org
> Subject: Re: [R] concatenating columns in data.frame
>
> I use parts of the tidyverse frequently, but this post is the best argument I can imagine for learning base R techniques.
>
> On July 1, 2021 8:41:06 PM PDT, Avi Gross via R-help <r-help using r-project.org> wrote:
>> Micha,
>>
>> Others have provided ways in standard R so I will contribute a somewhat
>> odd solution using the dplyr and related packages in the tidyverse
>> including a sample data.frame/tibble I made. It requires newer versions
>> of R and other packages as it uses some fairly esoteric features
>> including "the big bang" operator ("!!!"), the ":=" operator and more.
>>
>> You can use your own data with whatever columns you need, of course.
>>
>> The goal is to take a tibble with umpteen columns and add one
>> additional column that is the result of concatenating, row by row,
>> the contents of a dynamically supplied vector of column names in
>> quotes. First we need something to work with so here is a sample:
>>
>> #--start
>> # load required packages, or a bunch at once!
>> library(tidyverse)
>>
>> # Pick how many rows you want. For a demo, 3 is plenty
>> N <- 3
>>
>> # Make a sample tibble with N rows and the following 4 columns
>> mydf <- tibble(alpha = 1:N,
>>                beta  = letters[1:N],
>>                gamma = N:1,
>>                delta = month.abb[1:N])
>>
>> # show the original tibble
>> print(mydf)
>> #--end
>>
>> In flat text mode, here is the output:
>>
>>> print(mydf)
>> # A tibble: 3 x 4
>>   alpha beta  gamma delta
>>   <int> <chr> <int> <chr>
>> 1     1 a         3 Jan
>> 2     2 b         2 Feb
>> 3     3 c         1 Mar
>>
>> Now I want to make a function that is used instead of the mutate verb.
>> I made a weird one-liner that is a tad hard to explain so first let me
>> mention the requirements.
>>
>> It will take a first argument that is a tibble and in a pipeline this
>> would be passed invisibly.
>> The second required argument is a vector or list containing the names
>> of the columns as strings. A column can be re-used multiple times.
>> The third optional argument is what to name the new column with a
>> default if omitted.
>> The fourth optional argument allows you to choose a different separator
>> than "" if you wish.
>>
>> The function should be usable in a pipeline on both sides so it should
>> also return the input tibble with the extra column added.
>>
>> Here is the function:
>>
>> my_mutate <- function(df, columns, colnew="concatenated", sep=""){
>>   df %>%
>>     mutate( "{colnew}" := paste(!!!rlang::syms(columns), sep = sep ))
>> }
>>
>> Yes, the above can be done inline as a long one-liner:
>>
>> my_mutate <- function(df, columns, colnew="concatenated", sep="")
>> mutate(df, "{colnew}" := paste(!!!rlang::syms(columns), sep = sep ))
>>
>> Here are examples of it running:
>>
>>
>>> choices <- c("beta", "delta", "alpha", "delta")
>>> mydf %>% my_mutate(choices, "me2")
>> # A tibble: 3 x 5
>>   alpha beta  gamma delta me2
>>   <int> <chr> <int> <chr> <chr>
>> 1     1 a         3 Jan   aJan1Jan
>> 2     2 b         2 Feb   bFeb2Feb
>> 3     3 c         1 Mar   cMar3Mar
>>> mydf %>% my_mutate(choices, "me2",":")
>> # A tibble: 3 x 5
>>   alpha beta  gamma delta me2
>>   <int> <chr> <int> <chr> <chr>
>> 1     1 a         3 Jan   a:Jan:1:Jan
>> 2     2 b         2 Feb   b:Feb:2:Feb
>> 3     3 c         1 Mar   c:Mar:3:Mar
>>> mydf %>% my_mutate(c("beta", "beta", "gamma", "gamma", "delta", "alpha"))
>> # A tibble: 3 x 5
>>   alpha beta  gamma delta concatenated
>>   <int> <chr> <int> <chr> <chr>
>> 1     1 a         3 Jan   aa33Jan1
>> 2     2 b         2 Feb   bb22Feb2
>> 3     3 c         1 Mar   cc11Mar3
>>> mydf %>% my_mutate(list("beta", "beta", "gamma", "gamma", "delta", "alpha"))
>> # A tibble: 3 x 5
>>   alpha beta  gamma delta concatenated
>>   <int> <chr> <int> <chr> <chr>
>> 1     1 a         3 Jan   aa33Jan1
>> 2     2 b         2 Feb   bb22Feb2
>> 3     3 c         1 Mar   cc11Mar3
>>> mydf %>% my_mutate(columns=list("alpha", "beta", "gamma", "delta",
>> +                                 "gamma", "beta", "alpha"),
>> +                    sep="/*/",
>> +                    colnew="NewRandomNAME"
>> +                    )
>> # A tibble: 3 x 5
>>   alpha beta  gamma delta NewRandomNAME
>>   <int> <chr> <int> <chr> <chr>
>> 1     1 a         3 Jan   1/*/a/*/3/*/Jan/*/3/*/a/*/1
>> 2     2 b         2 Feb   2/*/b/*/2/*/Feb/*/2/*/b/*/2
>> 3     3 c         1 Mar   3/*/c/*/1/*/Mar/*/1/*/c/*/3
>>
>> Does this meet your normal need? Just to show it works in a pipeline,
>> here is a variant:
>>
>> mydf %>%
>>  tail(2) %>%
>>  my_mutate(c("beta", "beta"), "betabeta") %>%
>>  print() %>%
>>  my_mutate(list("alpha", "betabeta", "gamma"),
>>            "buildson",
>>            "&")
>>
>> The above keeps only the last two rows of the tibble, makes a double
>> copy of "beta" under a new name, prints the intermediate result,
>> continues with another concatenation that uses the column created
>> earlier, then prints the result.
>>
>> Here is the run:
>>
>>> mydf %>%
>> +   tail(2) %>%
>> +   my_mutate(c("beta", "beta"), "betabeta") %>%
>> +   print() %>%
>> +   my_mutate(list("alpha", "betabeta", "gamma"),
>> +             "buildson",
>> +             "&")
>> # A tibble: 2 x 5
>>   alpha beta  gamma delta betabeta
>>   <int> <chr> <int> <chr> <chr>
>> 1     2 b         2 Feb   bb
>> 2     3 c         1 Mar   cc
>> # A tibble: 2 x 6
>>   alpha beta  gamma delta betabeta buildson
>>   <int> <chr> <int> <chr> <chr>    <chr>
>> 1     2 b         2 Feb   bb       2&bb&2
>> 2     3 c         1 Mar   cc       3&cc&1
>>
>> As to how the darn function works, that was a learning experience for
>> me to build using features I have not had occasion to use. If anyone
>> remains interested, read on.
>>
>> The following needs newish features:
>>
>> 	"{colnew}" := SOMETHING
>>
>> The colon-equals operator, as interpreted by newer versions of dplyr
>> and rlang, can be used in an odd way that allows the name of the
>> variable to be in quotes and in curly braces akin to the way glue()
>> does it. The variable colnew is evaluated and substituted so the name
>> used for the column is now dynamic.
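
For comparison, base R can also take the new column's name from a variable. The sketch below uses a plain data.frame as a stand-in for the tibble above. The name "me2" is borrowed from the examples, and the choice of columns to paste is arbitrary.

```r
# Base data frame standing in for the tibble used above.
mydf <- data.frame(alpha = 1:3, beta = letters[1:3])

# The new column's name is held in a variable, as with "{colnew}" above.
colnew <- "me2"

# [[<- accepts the name as a string, so no special operator is needed.
mydf[[colnew]] <- paste(mydf$beta, mydf$alpha, sep = "")
```

The trade-off is that this form is an assignment rather than a step in a pipeline, which is the gap the glue-style name is meant to fill.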
>>
>> The function does a paste using this:
>>
>> 	!!!rlang::syms(columns)
>>
>> The problem is paste() wants multiple arguments and we have a single
>> argument that is either an atomic vector or another kind of vector
>> called a list. The trick is to convert the strings into symbols then
>> use "!!!" to splice them, turning something like
>> 'c("alpha", "beta", "gamma")' into something more like
>> ' alpha, beta, gamma ' so that paste sees the columns as multiple
>> arguments to concatenate in vector fashion.
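
The difference between one list argument and several spliced arguments can be seen without any packages. In the sketch below, do.call() plays roughly the role that "!!!" plays above. The data frame and the pieces variable are illustrative only.

```r
mydf <- data.frame(alpha = 1:3, beta = letters[1:3], gamma = 3:1)
columns <- c("beta", "gamma", "beta", "alpha")  # a name may repeat

# Look up each requested column by name, one at a time, so that repeated
# names yield repeated columns.
pieces <- lapply(columns, function(nm) mydf[[nm]])

# One list argument: paste() converts each list element to a single string,
# giving one result per requested column rather than one per row.
as_one_argument <- paste(pieces)

# Spliced: each element becomes its own argument, so paste() works row-wise.
spliced <- do.call(paste, c(pieces, sep = ""))
```

Here as_one_argument has four elements (one per requested column) while spliced has three (one per row), which is the behaviour wanted.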
>>
>> And, the function is not polished but I am sure you can all see some of
>> what is needed like checking the arguments for validity, including not
>> having a name for the new column that clashes with existing column
>> names, doing something sane if no columns to concatenate are offered
>> and so on.
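
A rough idea of what such checks could look like is sketched below in base R. The helper add_concat_col() is hypothetical and is not code from this thread. The particular checks and error messages are assumptions about what "sane" might mean here.

```r
# add_concat_col() is a hypothetical base R helper, not code from this thread.
add_concat_col <- function(df, columns, colnew = "concatenated", sep = "") {
  columns <- unlist(columns)  # accept a list or a character vector of names
  stopifnot(is.data.frame(df),
            is.character(colnew), length(colnew) == 1,
            is.character(sep), length(sep) == 1)
  if (length(columns) == 0) {
    stop("no columns to concatenate were supplied")
  }
  if (!is.character(columns)) {
    stop("columns must be given as names")
  }
  missing_cols <- setdiff(columns, names(df))
  if (length(missing_cols) > 0) {
    stop("unknown column(s): ", paste(missing_cols, collapse = ", "))
  }
  if (colnew %in% names(df)) {
    stop("column '", colnew, "' already exists")
  }
  # Fetch columns one by one so repeated names are honoured, then splice
  # them into paste() as separate arguments.
  pieces <- lapply(columns, function(nm) df[[nm]])
  df[[colnew]] <- do.call(paste, c(pieces, sep = sep))
  df  # return the augmented data frame so the call can sit in a pipeline
}
```

Because the data frame is the first argument and the return value, a call such as add_concat_col(mydf, c("beta", "alpha"), "ba") can be chained in the same way as the function above.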
>>
>> Just showing a different approach. The base R methods are fine.
>>
>> - Avi
>>
>> -----Original Message-----
>> From: R-help <r-help-bounces using r-project.org> On Behalf Of Micha Silver
>> Sent: Thursday, July 1, 2021 10:36 AM
>> To: R-help using r-project.org
>> Subject: [R] concatenating columns in data.frame
>>
>> I need to create a new data.frame column as a concatenation of existing
>> character columns. But the number and name of the columns to
>> concatenate needs to be passed in dynamically. The code below does what
>> I want, but seems very clumsy. Any suggestions how to improve?
>>
>>
>> df = data.frame("A"=sample(letters, 10), "B"=sample(letters, 10),
>> "C"=sample(letters,10), "D"=sample(letters, 10))
>>
>> # Which columns to concat:
>>
>> use_columns = c("D", "B")
>>
>>
>> UpdateCombo = function(df, use_columns) {
>>     use_df = df[, use_columns]
>>     combo_list = lapply(1:nrow(use_df), function(r) {
>>     r_combo = paste(use_df[r,], collapse="_")
>>     return(data.frame("Combo" = r_combo))
>>     })
>>     combo = do.call(rbind, combo_list)
>>
>>     names(combo) = "Combo"
>>
>>     return(combo)
>>
>> }
>>
>>
>> combo_col = UpdateCombo(df, use_columns)
>>
>> df_combo = do.call(cbind, list(df, combo_col))
>>
>>
>> Thanks
>>
>>
>> --
>> Micha Silver
>> Ben Gurion Univ.
>> Sde Boker, Remote Sensing Lab
>> cell: +972-523-665918
>>
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> --
> Sent from my phone. Please excuse my brevity.
>

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil using dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                       Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k


