[R] Split

Thu Sep 24 03:20:29 CEST 2020

Thank you again for your help and for giving me the opportunity to choose
the most efficient method. For a small data set there is no discernible
difference between the approaches. I will carry out a comparison using
the large data set.
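
For that comparison, something along these lines might serve as a starting
point. It is only a sketch: the function name split_third_field, the row
count and the synthetic rows are placeholders I made up, and the awk body is
a condensed form of the logic in the script quoted below, not the script
itself. date +%s only resolves whole seconds, so bash's time keyword may be
a better fit for short runs.

```shell
#!/bin/bash
# Sketch only: split_third_field, the row count and the synthetic rows are
# illustrative assumptions, not part of the script quoted below.

# Condensed form of the awk logic: split field 3 on "_" when it is present.
split_third_field() {
    awk '{
        if (index($3, "_")) { split($3, F, "_"); print $1, $2, 1, F[1], F[2] }
        else                { print $1, $2, 0, $3, "NULL" }
    }' "$1"
}

# Build a synthetic input file (no header row); even rows carry an "_".
big_file=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 200000; i++)
                 print "A" i, "B" i, (i % 2 ? "NONE" : "cf_" i) }' > "$big_file"

# Time one pass in whole seconds; only the number of output rows is kept.
start=$(date +%s)
rows=$(split_third_field "$big_file" | wc -l)
end=$(date +%s)
echo "rows: $rows, elapsed: $((end - start)) s"

rm -f "$big_file"
```

The same synthetic file could be read into R with read.table() so that the
R approaches are timed on identical input.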

On Wed, Sep 23, 2020 at 11:52 AM LMH <lmh_users-groups using molconn.com> wrote:
>
> Below is a script in bash that uses the awk tokenizer to do the work.
>
> This assumes that your input and output delimiter is space. The number of consecutive delimiters in
> the input is not important. This also assumes that the input file does not have a header row. That
> is easy to modify if you want. I always keep header rows in my data files as I think that removing
> them is asking for trouble down the road.
>
> I added a NULL for cases where there is no value for the last field. You could use "." if you want.
>
> You should be able to find out how to run this from inside R if you want. You will, of course, need a
> bash environment to run this, so if you are not on Linux you will need Cygwin or something similar.
>
> This should be very fast, but let me know if it needs to be faster. If the X1_X2 variant occurs less
> frequently than not, then we should switch the order in which the logic evaluates the options.
>
> LMH
>
>
> #! /bin/bash
>
> # input filename
> input_file=$1
>
> # output filename
> output_file=$2
>
> # make sure the input file exists
> if [ ! -f "$input_file" ]; then
>    echo "$input_file cannot be found" >&2
>    exit 1
> fi
>
> # create the output file
> touch "$output_file"
>
> # make sure the output was created
> if [ ! -f "$output_file" ]; then
>    echo "$output_file was not created" >&2
>    exit 1
> fi
>
> # write the header row (replaces any previous contents of the output file)
> echo "ID1 ID2 Y1 X1 X2" > "$output_file"
>
> # character to find in the third token
> look_for='_'
>
> # process with awk
> # if the 3rd token contains '_'
> #   split the third token on '_' into F[1] and F[2]
> #   print the first two tokens, the indicator value of 1, and the split fields F[1] and F[2]
> # otherwise,
> #   print the first two tokens, the indicator value of 0, the 3rd token, and NULL
>
> awk -v find_char="$look_for" '{ if ($3 ~ find_char) { split($3, F, find_char)
>                                                        print $1, $2, "1", F[1], F[2]
>                                                      }
>                                 else { print $1, $2, "0", $3, "NULL" }
>                               }' "$input_file" >> "$output_file"
>
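
Regarding the header row and the "." filler mentioned above, a variant could
look roughly like this. It is a sketch under my own assumptions: the function
name split_with_header is hypothetical, the input is assumed to have exactly
one header line, and "." replaces NULL as the filler.

```shell
#!/bin/bash
# Sketch only: split_with_header is a hypothetical name, and the "." filler
# and header handling are assumptions layered on the quoted awk logic.

split_with_header() {
    awk -v find_char="_" -v filler="." '
        NR == 1 { print $1, $2, "Y1", "X1", "X2"; next }   # replace the header row
        index($3, find_char) {                               # field 3 contains "_"
            split($3, F, find_char)
            print $1, $2, 1, F[1], F[2]
            next
        }
        { print $1, $2, 0, $3, filler }                      # no "_": pad with filler
    ' "$1"
}

# The sample data from the original question, header row included.
sample=$(mktemp)
printf '%s\n' 'ID1 ID2 text' 'A1 B1 NONE' 'A1 B1 cf_12' 'A1 B1 NONE' \
              'A2 B2 X2_25' 'A2 B3 fd_15' > "$sample"

split_with_header "$sample"
rm -f "$sample"
```

On the five sample rows this should reproduce the desired table from the
original question, with single spaces as the output delimiter.
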
> Val wrote:
> > Thank you all for the help!
> >
> > LMH, yes, I would like to see the alternative. I am using this for a
> > large data set, and if the alternative is more efficient than this
> > then I would be happy.
> >
> > On Tue, Sep 22, 2020 at 6:25 PM Bert Gunter <bgunter.4567 using gmail.com> wrote:
> >>
> >> To be clear, I think Rui's solution is perfectly fine and probably better than what I offer below. But just for fun, I wanted to do it without the lapply().  Here is one way. I think my comments suffice to explain.
> >>
> >>> ## which are the  non "_" indices?
> >>> wh <- grep("_",F1$text, fixed = TRUE, invert = TRUE)
> >>> ## paste "_." to these
> >>> F1[wh,"text"] <- paste(F1[wh,"text"],".",sep = "_")
> >>> ## Now strsplit() and unlist() them to get a vector
> >>> z <- unlist(strsplit(F1$text, "_"))
> >>> ## now cbind() to the data frame
> >>> F1 <- cbind(F1, matrix(z, ncol = 2, byrow = TRUE))
> >>> F1
> >>   ID1 ID2   text    1  2
> >> 1  A1  B1 NONE_. NONE  .
> >> 2  A1  B1  cf_12   cf 12
> >> 3  A1  B1 NONE_. NONE  .
> >> 4  A2  B2  X2_25   X2 25
> >> 5  A2  B3  fd_15   fd 15
> >>> ## You can change the names of the 2 columns yourself
> >>
> >> Cheers,
> >> Bert
> >>
> >> Bert Gunter
> >>
> >> "The trouble with having an open mind is that people keep coming along and sticking things into it."
> >> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >>
> >>
> >> On Tue, Sep 22, 2020 at 12:19 PM Rui Barradas <ruipbarradas using sapo.pt> wrote:
> >>>
> >>> Hello,
> >>>
> >>> A base R solution with strsplit, like in your code.
> >>>
> >>> F1$Y1 <- +grepl("_", F1$text)
> >>>
> >>> tmp <- strsplit(as.character(F1$text), "_")
> >>> tmp <- lapply(tmp, function(x) if(length(x) == 1) c(x, ".") else x)
> >>> tmp <- do.call(rbind, tmp)
> >>> colnames(tmp) <- c("X1", "X2")
> >>> F1 <- cbind(F1[-3], tmp)    # remove the original column
> >>> rm(tmp)
> >>>
> >>> F1
> >>> #  ID1 ID2 Y1   X1 X2
> >>> #1  A1  B1  0 NONE  .
> >>> #2  A1  B1  1   cf 12
> >>> #3  A1  B1  0 NONE  .
> >>> #4  A2  B2  1   X2 25
> >>> #5  A2  B3  1   fd 15
> >>>
> >>>
> >>> Note that cbind dispatches on F1, an object of class "data.frame".
> >>> Therefore it's the method cbind.data.frame that is called and the result
> >>> is also a df, though tmp is a "matrix".
> >>>
> >>>
> >>> Hope this helps,
> >>>
> >>> Rui Barradas
> >>>
> >>>
> >>> At 20:07 on 22/09/20, Rui Barradas wrote:
> >>>> Hello,
> >>>>
> >>>> Something like this?
> >>>>
> >>>>
> >>>> F1$Y1 <- +grepl("_", F1$text)
> >>>> F1 <- F1[c(1, 2, 4, 3)]
> >>>> F1 <- tidyr::separate(F1, text, into = c("X1", "X2"), sep = "_", fill = "right")
> >>>> F1
> >>>>
> >>>>
> >>>> Hope this helps,
> >>>>
> >>>> Rui Barradas
> >>>>
> >>>> At 19:55 on 22/09/20, Val wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> I am trying to create new columns based on the string content of
> >>>>> another column. First I want to identify rows that contain a particular
> >>>>> string. If a row contains it, I want to split the string and create two
> >>>>> variables.
> >>>>>
> >>>>> Here is my sample of data.
> >>>>> F1<-read.table(text="ID1  ID2  text
> >>>>> A1 B1   NONE
> >>>>> A1 B1   cf_12
> >>>>> A1 B1   NONE
> >>>>> A2 B2   X2_25
> >>>>> A2 B3   fd_15  ",header=TRUE,stringsAsFactors=F)
> >>>>>
> >>>>> If the variable "text" contains "_", I want to create an indicator
> >>>>> variable as shown below.
> >>>>>
> >>>>> F1$Y1 <- ifelse(grepl("_", F1$text),1,0)
> >>>>>
> >>>>>
> >>>>> Then I want to split that string into two parts, the text before "_"
> >>>>> and the text after "_", and create two variables as shown below.
> >>>>> x1 <- strsplit(as.character(F1$text), "_", fixed = TRUE)
> >>>>>
> >>>>> My problem is how to combine this with the original data frame. The
> >>>>> desired output is shown below.
> >>>>>
> >>>>>
> >>>>> ID1 ID2 Y1   X1 X2
> >>>>> A1  B1   0 NONE  .
> >>>>> A1  B1   1   cf 12
> >>>>> A1  B1   0 NONE  .
> >>>>> A2  B2   1   X2 25
> >>>>> A2  B3   1   fd 15
> >>>>>
> >>>>> Any help?
> >>>>> Thank you.
> >>>>>
> >>>>> ______________________________________________
> >>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>> PLEASE do read the posting guide
> >>>>> http://www.R-project.org/posting-guide.html
> >>>>> and provide commented, minimal, self-contained, reproducible code.
> >>>>>
>