[R] concatenating columns in data.frame

Micha Silver t@v|b@r @end|ng |rom gm@||@com
Sat Jul 3 13:26:08 CEST 2021


Again thanks for carrying on this thread with your additional, 
informative comments, as well as the welcome humor.



On 7/3/21 2:59 AM, Jeff Newmiller wrote:
> I am very agnostic about tidyverse/base R. However, the complexity of 
> setting up NSE functions is often simply not needed, and I encounter 
> so many people who simply disregard base R as being too outdated so 
> that they never learn how simple solutions in R can be. The contrast 
> between your solution and Bert's was... perhaps informative, but a 
> nuclear bomb where an axe was sufficient.
>
> On Fri, 2 Jul 2021, Avi Gross via R-help wrote:
>
>> I know what you mean Jeff. Yes I am very familiar with base R 
>> techniques. What I had hoped for was to do two things that some of 
>> the other methods mentioned do that ended up bringing two data.frames 
>> together as part of the solution.
>>
>> Much of what I used is now standard R. I was looking at the accessory 
>> functions now commonly used in dplyr that let you dynamically select 
>> which columns to work with like starts_with() to choose. Sadly, they 
>> seem to work on a top-level but not easily within a call to something 
>> like paste(...) where they are not evaluated in the way I want.
>>
>> But the odd method I tried can also be used in standard R with a bit 
>> of work. You can create a function without using dplyr that takes 
>> your df and uses it to concatenate and end with something like:
>>
>> df$new_col <- do_something(df, selected_cols)
>>
>> That too adds a column without the need to merge larger structures 
>> explicitly.
>>
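For illustration, a minimal base R sketch of what such a helper might look like. The name concat_cols and its sep argument are made up here (standing in for do_something above), and the sketch assumes every selected column exists in the data frame.

```r
# Hypothetical helper: paste the chosen columns together row by row.
concat_cols <- function(df, selected_cols, sep = "") {
  # df[selected_cols] is a list of columns; do.call() hands each one to
  # paste() as a separate argument, so the pasting is vectorised over rows.
  # unname() keeps a column called "sep" from clashing with paste()'s argument.
  do.call(paste, c(unname(as.list(df[selected_cols])), sep = sep))
}

df <- data.frame(A = c("a", "b"), B = c("x", "y"), C = c("p", "q"),
                 stringsAsFactors = FALSE)
df$new_col <- concat_cols(df, c("C", "A"), sep = "_")
print(df)
```

Because the result is an ordinary character vector, it can be assigned straight into the data frame without any rbind/cbind step.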
>> But your other point is a tad religious in a sense. I happen to 
>> prefer learning a core language first then looking at enhancement 
>> opportunities. But at some point, if teaching someone new who wants 
>> to focus on getting a job done simply but not necessarily repeatedly 
>> or in some ideal way, it is best to do things in a way that their 
>> mind flows better.
>>
>> Many things in the tidyverse are redundant with base R or just "fix" 
>> inconsistencies like making sure the first argument is always the 
>> same. But many add substantially to doing things in a more 
>> step-by-step manner.
>>
>> I do not worship the base language as it first came out or even as it 
>> has evolved. I do like to know what choices I have and pick and 
>> choose among them as needed. Of course a forum like this is more 
>> about base R than otherwise and I acknowledge that. Still, the ":=" 
>> operator is now base R. There is a new pipeline operator "|>" in base 
>> R. Some ideas, good or otherwise, do get in eventually.
>>
>> I started doing graphs using base R as in the plot() command. It was 
>> adequate but I wanted better. So I learned about Lattice and various 
>> packages and eventually ggplot. I can now do things I barely imagined 
>> before and am still learning that there is much more I can do with 
>> packages underneath much of the magic and also additional packages 
>> layered above it, in some sense. So I do not approach that with an 
>> either-or mentality either.
>>
>> Note I am not really talking about just R. I have similar issues with 
>> other languages I program in such as Python. None of them were 
>> created fully-formed and many had to add huge amounts to adapt to 
>> additional wants and needs. Base R for me is often inadequate. But so 
>> what?
>>
>> The task being asked for in this thread in isolation, indeed may not 
>> be done any better using packages. However, if it is part of a larger 
>> set of tasks that can be pipelined, it may well be and I personally 
>> was wondering if there was a way in dplyr. There probably is a much 
>> better way than I assembled if I only knew about it, and if not, they 
>> may add this kind of indirection in a future release if deemed worthy 
>> of doing. I have gone back to programs I did years ago with humungous 
>> amounts of code using what I knew then and reducing it drastically 
>> now that I can tell a function to select say all my column names that 
>> end in .orig and apply a set of functions to them with output going 
>> to the base name followed by .mean and .sd and so on. All that can 
>> often be done in one or two lines of code where previously I had to 
>> do 18 near repetitions of each part and then another and another. 
>> That used a limited form of dynamism.
>>
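The kind of dynamic selection described above can also be sketched in base R. The data and column names below are invented for the example, and only mean and sd are applied; this is one possible way to do it, not the code referred to in the message.

```r
# Toy data; the column names are invented for this example.
dat <- data.frame(height.orig = c(1, 2, 3),
                  weight.orig = c(10, 20, 60),
                  id = 1:3)

# Pick every column whose name ends in ".orig" and derive the base names.
orig_cols <- grep("\\.orig$", names(dat), value = TRUE)
base_names <- sub("\\.orig$", "", orig_cols)

# Apply each summary function to each selected column and name the
# results <base>.mean and <base>.sd.
funs <- list(mean = mean, sd = sd)
summ <- unlist(lapply(names(funs), function(f) {
  setNames(vapply(dat[orig_cols], funs[[f]], numeric(1)),
           paste(base_names, f, sep = "."))
}))
print(summ)
```

Adding another summary only means adding an entry to funs, which is where the saving over repeated near-identical lines comes from.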
>> Be that as it may I think the requester has enough info and we can 
>> move on.
>>
>> -----Original Message-----
>> From: Jeff Newmiller <jdnewmil using dcn.davis.ca.us>
>> Sent: Friday, July 2, 2021 1:03 AM
>> To: Avi Gross <avigross using verizon.net>; Avi Gross via R-help 
>> <r-help using r-project.org>; R-help using r-project.org
>> Subject: Re: [R] concatenating columns in data.frame
>>
>> I use parts of the tidyverse frequently, but this post is the best 
>> argument I can imagine for learning base R techniques.
>>
>> On July 1, 2021 8:41:06 PM PDT, Avi Gross via R-help 
>> <r-help using r-project.org> wrote:
>>> Micha,
>>>
>>> Others have provided ways in standard R so I will contribute a somewhat
>>> odd solution using the dplyr and related packages in the tidyverse
>>> including a sample data.frame/tibble I made. It requires newer versions
>>> of R and other packages as it uses some fairly esoteric features
>>> including "the big bang" and the new ":=" operator and more.
>>>
>>> You can use your own data with whatever columns you need, of course.
>>>
>>> The goal: given a tibble with umpteen columns, add one additional
>>> column that is the result of concatenating, row by row, the contents
>>> of a dynamically supplied vector of column names in quotes. First we
>>> need something to work with, so here is a sample:
>>>
>>> #--start
>>> # load required packages, or a bunch at once!
>>> library(tidyverse)
>>>
>>> # Pick how many rows you want. For a demo, 3 is plenty
>>> N <- 3
>>>
>>> # Make a sample tibble with N rows and the following 4 columns
>>> mydf <- tibble(alpha = 1:N,
>>>                beta = letters[1:N],
>>>                gamma = N:1,
>>>                delta = month.abb[1:N])
>>>
>>> # show the original tibble
>>> print(mydf)
>>> #--end
>>>
>>> In flat text mode, here is the output:
>>>
>>>> print(mydf)
>>> # A tibble: 3 x 4
>>>   alpha beta  gamma delta
>>>   <int> <chr> <int> <chr>
>>> 1     1 a         3 Jan
>>> 2     2 b         2 Feb
>>> 3     3 c         1 Mar
>>>
>>> Now I want to make a function that is used instead of the mutate verb.
>>> I made a weird one-liner that is a tad hard to explain so first let me
>>> mention the requirements.
>>>
>>> It will take a first argument that is a tibble and in a pipeline this
>>> would be passed invisibly.
>>> The second required argument is a vector or list containing the names
>>> of the columns as strings. A column can be re-used multiple times.
>>> The third optional argument is what to name the new column with a
>>> default if omitted.
>>> The fourth optional argument allows you to choose a different separator
>>> than "" if you wish.
>>>
>>> The function should be usable in a pipeline on both sides, so it should
>>> also return the input tibble with the extra column added.
>>>
>>> Here is the function:
>>>
>>> my_mutate <- function(df, columns, colnew="concatenated", sep=""){
>>>   df %>%
>>>     mutate( "{colnew}" := paste(!!!rlang::syms(columns), sep = sep ))
>>> }
>>>
>>> Yes, the above can be done inline as a long one-liner:
>>>
>>> my_mutate <- function(df, columns, colnew="concatenated", sep="")
>>> mutate(df, "{colnew}" := paste(!!!rlang::syms(columns), sep = sep ))
>>>
>>> Here are examples of it running:
>>>
>>>
>>>> choices <- c("beta", "delta", "alpha", "delta")
>>>> mydf %>% my_mutate(choices, "me2")
>>> # A tibble: 3 x 5
>>>   alpha beta  gamma delta me2
>>>   <int> <chr> <int> <chr> <chr>
>>> 1     1 a         3 Jan   aJan1Jan
>>> 2     2 b         2 Feb   bFeb2Feb
>>> 3     3 c         1 Mar   cMar3Mar
>>>> mydf %>% my_mutate(choices, "me2",":")
>>> # A tibble: 3 x 5
>>>   alpha beta  gamma delta me2
>>>   <int> <chr> <int> <chr> <chr>
>>> 1     1 a         3 Jan   a:Jan:1:Jan
>>> 2     2 b         2 Feb   b:Feb:2:Feb
>>> 3     3 c         1 Mar   c:Mar:3:Mar
>>>> mydf %>% my_mutate(c("beta", "beta", "gamma", "gamma", "delta",
>>> +                      "alpha"))
>>> # A tibble: 3 x 5
>>>   alpha beta  gamma delta concatenated
>>>   <int> <chr> <int> <chr> <chr>
>>> 1     1 a         3 Jan   aa33Jan1
>>> 2     2 b         2 Feb   bb22Feb2
>>> 3     3 c         1 Mar   cc11Mar3
>>>> mydf %>% my_mutate(list("beta", "beta", "gamma", "gamma", "delta",
>>> +                         "alpha"))
>>> # A tibble: 3 x 5
>>>   alpha beta  gamma delta concatenated
>>>   <int> <chr> <int> <chr> <chr>
>>> 1     1 a         3 Jan   aa33Jan1
>>> 2     2 b         2 Feb   bb22Feb2
>>> 3     3 c         1 Mar   cc11Mar3
>>>> mydf %>% my_mutate(columns=list("alpha", "beta", "gamma", "delta",
>>> +                                 "gamma", "beta", "alpha"),
>>> +                    sep="/*/",
>>> +                    colnew="NewRandomNAME"
>>> +                    )
>>> # A tibble: 3 x 5
>>>   alpha beta  gamma delta NewRandomNAME
>>>   <int> <chr> <int> <chr> <chr>
>>> 1     1 a         3 Jan   1/*/a/*/3/*/Jan/*/3/*/a/*/1
>>> 2     2 b         2 Feb   2/*/b/*/2/*/Feb/*/2/*/b/*/2
>>> 3     3 c         1 Mar   3/*/c/*/1/*/Mar/*/1/*/c/*/3
>>>
>>> Does this meet your normal need? Just to show it works in a pipeline,
>>> here is a variant:
>>>
>>> mydf %>%
>>>  tail(2) %>%
>>>  my_mutate(c("beta", "beta"), "betabeta") %>%
>>>  print() %>%
>>>  my_mutate(list("alpha", "betabeta", "gamma"),
>>>            "buildson",
>>>            "&")
>>>
>>> The above keeps only the last two rows of the tibble, makes a double
>>> copy of "beta" under a new name, prints the intermediate result, then
>>> makes another concatenation using the variable created earlier and
>>> prints the final result.
>>>
>>> Here is the run:
>>>
>>>> mydf %>%
>>> +   tail(2) %>%
>>> +   my_mutate(c("beta", "beta"), "betabeta") %>%
>>> +   print() %>%
>>> +   my_mutate(list("alpha", "betabeta", "gamma"),
>>> +             "buildson",
>>> +             "&")
>>> # A tibble: 2 x 5
>>>   alpha beta  gamma delta betabeta
>>>   <int> <chr> <int> <chr> <chr>
>>> 1     2 b         2 Feb   bb
>>> 2     3 c         1 Mar   cc
>>> # A tibble: 2 x 6
>>>   alpha beta  gamma delta betabeta buildson
>>>   <int> <chr> <int> <chr> <chr>    <chr>
>>> 1     2 b         2 Feb   bb       2&bb&2
>>> 2     3 c         1 Mar   cc       3&cc&1
>>>
>>> As to how the darn function works, that was a learning experience for
>>> me to build using features I have not had occasion to use. If anyone
>>> remains interested, read on.
>>>
>>> The following needs newish features:
>>>
>>>     "{colnew}" := SOMETHING
>>>
>>> The colon-equals operator, as used by newer dplyr/rlang, lets the name
>>> on the left-hand side be a quoted string with the variable in braces,
>>> akin to the way glue() does it. The variable colnew is evaluated and
>>> substituted, so the name used for the new column is dynamic.
>>>
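For comparison, a small base R sketch of the same idea of a column name chosen at run time. It uses `[[<-` rather than ":=", so it is an analogue under that assumption and not what mutate() does internally; the data frame is a toy one.

```r
colnew <- "me2"   # column name decided at run time

df <- data.frame(beta = c("a", "b"), delta = c("Jan", "Feb"),
                 stringsAsFactors = FALSE)

# df[[colnew]] uses the *value* of colnew as the column name,
# whereas df$colnew would create a column literally called "colnew".
df[[colnew]] <- paste(df$beta, df$delta, sep = "")
print(df)
```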
>>> The function does a paste using this:
>>>
>>>     !!!rlang::syms(columns)
>>>
>>> The problem is that paste() wants multiple arguments and we have a
>>> single argument that is either a character vector or a list. The trick
>>> is to convert the names into symbols with rlang::syms() and then use
>>> "!!!" to splice them, turning something like 'c("alpha", "beta",
>>> "gamma")' into something more like 'alpha, beta, gamma' so that paste()
>>> sees the columns as separate arguments to concatenate in vector fashion.
>>>
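To make the splicing step concrete, here is a rough base R analogue that builds the same kind of call by hand and evaluates it against the data. It is only a sketch of the idea; rlang's actual mechanism differs in detail, and the data frame below is a toy one.

```r
df <- data.frame(alpha = 1:3, beta = letters[1:3], stringsAsFactors = FALSE)
columns <- c("beta", "alpha")

# Turn each name into a symbol, then assemble the call
# paste(beta, alpha, sep = ":") from its pieces.
args <- lapply(columns, as.name)
cl <- as.call(c(list(as.name("paste")), args, list(sep = ":")))
print(cl)

# Evaluate the call with the data frame supplying the columns.
result <- eval(cl, df)
print(result)
```

The symbols are looked up in df when the call is evaluated, which is why each column arrives at paste() as its own argument.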
>>> And, the function is not polished but I am sure you can all see some of
>>> what is needed like checking the arguments for validity, including not
>>> having a name for the new column that clashes with existing column
>>> names, doing something sane if no columns to concatenate are offered
>>> and so on.
>>>
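A hedged sketch of the checks listed above, written as a separate base R helper. The name check_concat_args and the choice to stop with an error in every case are assumptions made for illustration; a different policy (for example, warning and overwriting) would be equally defensible.

```r
# Hypothetical checker: returns TRUE invisibly or stops with a message.
check_concat_args <- function(df, columns, colnew) {
  columns <- unlist(columns)   # accept a list or a character vector
  if (!is.data.frame(df)) stop("df must be a data.frame or tibble")
  if (length(columns) == 0) stop("no columns to concatenate were supplied")
  if (!is.character(columns)) stop("columns must be given as strings")
  missing_cols <- setdiff(columns, names(df))
  if (length(missing_cols) > 0)
    stop("unknown column(s): ", paste(missing_cols, collapse = ", "))
  if (!is.character(colnew) || length(colnew) != 1)
    stop("colnew must be a single string")
  if (colnew %in% names(df))
    stop("column '", colnew, "' already exists")
  invisible(TRUE)
}

df <- data.frame(alpha = 1:2, beta = c("a", "b"), stringsAsFactors = FALSE)
ok <- check_concat_args(df, c("beta", "alpha"), "combo")
print(ok)
```

Such a helper could be called as the first line of the function so that bad input fails early with a readable message.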
>>> Just showing a different approach. The base R methods are fine.
>>>
>>> - Avi
>>>
>>> -----Original Message-----
>>> From: R-help <r-help-bounces using r-project.org> On Behalf Of Micha Silver
>>> Sent: Thursday, July 1, 2021 10:36 AM
>>> To: R-help using r-project.org
>>> Subject: [R] concatenating columns in data.frame
>>>
>>> I need to create a new data.frame column as a concatenation of existing
>>> character columns. But the number and name of the columns to
>>> concatenate needs to be passed in dynamically. The code below does what
>>> I want, but seems very clumsy. Any suggestions how to improve?
>>>
>>>
>>> df = data.frame("A"=sample(letters, 10), "B"=sample(letters, 10),
>>> "C"=sample(letters,10), "D"=sample(letters, 10))
>>>
>>> # Which columns to concat:
>>>
>>> use_columns = c("D", "B")
>>>
>>>
>>> UpdateCombo = function(df, use_columns) {
>>>     use_df = df[, use_columns]
>>>     combo_list = lapply(1:nrow(use_df), function(r) {
>>>         r_combo = paste(use_df[r,], collapse="_")
>>>         return(data.frame("Combo" = r_combo))
>>>     })
>>>     combo = do.call(rbind, combo_list)
>>>
>>>     names(combo) = "Combo"
>>>
>>>     return(combo)
>>>
>>> }
>>>
>>>
>>> combo_col = UpdateCombo(df, use_columns)
>>>
>>> df_combo = do.call(cbind, list(df, combo_col))
>>>
>>>
>>> Thanks
>>>
>>>
>>> -- 
>>> Micha Silver
>>> Ben Gurion Univ.
>>> Sde Boker, Remote Sensing Lab
>>> cell: +972-523-665918
>>>
>>> ______________________________________________
>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>> -- 
>> Sent from my phone. Please excuse my brevity.
>>
>
> --------------------------------------------------------------------------- 
>
> Jeff Newmiller                        The     .....       ..... Go Live...
> DCN:<jdnewmil using dcn.davis.ca.us>        Basics: ##.#. ##.#.  Live Go...
>                                       Live:   OO#.. Dead: OO#.. Playing
> Research Engineer (Solar/Batteries            O.O#.       #.O#. with
> /Software/Embedded Controllers)               .OO#.       .OO#. rocks...1k

-- 
Micha Silver
Ben Gurion Univ.
Sde Boker, Remote Sensing Lab
cell: +972-523-665918


