[R] splitting a string column into multiple columns faster

arun smartpink111 at yahoo.com
Sat Jun 8 05:27:10 CEST 2013


Hi,
I tried it on a 1e5-row dataset:

l1 <- letters[1:10]
s1 <- sapply(seq_along(l1), function(i) paste(rep(l1[i], 3), collapse = ""))
set.seed(24)
# Build 1e5 strings such as "ccc12_ggg2_jjj14": three letter+number fields joined by "_"
x1 <- data.frame(x = paste(paste0(sample(s1, 1e5, replace = TRUE), sample(1:15, 1e5, replace = TRUE)),
                           paste0(sample(s1, 1e5, replace = TRUE), sample(1:15, 1e5, replace = TRUE)),
                           paste0(sample(s1, 1e5, replace = TRUE), sample(1:15, 1e5, replace = TRUE)),
                           sep = "_"),
                 stringsAsFactors = FALSE)
# Strip the letters, then parse what is left as "_"-separated columns
system.time(resNew <- data.frame(x = x1,
                                 read.table(text = gsub("[A-Za-z]", "", x1[, 1]), sep = "_", header = FALSE),
                                 stringsAsFactors = FALSE))
#   user  system elapsed 
#  2.712   0.016   2.732 

head(resNew)
#                  x V1 V2 V3
#1  ccc12_ggg2_jjj14 12  2 14
#2  ccc7_ddd15_aaa11  7 15 11
#3 hhh12_ddd14_fff12 12 14 12
#4  fff11_bbb15_aaa6 11 15  6
#5   ggg12_ccc9_ggg8 12  9  8
#6   jjj8_eee12_eee4  8 12  4

A.K.
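
For anyone following along, here is a minimal sketch of a different base-R route: pull out the runs of digits directly with gregexpr() and regmatches() instead of deleting the letters first. This is a swapped-in technique, not the method timed above, and no speed claim is made for it. The helper name extract_numbers is hypothetical, and the sketch assumes every string contains the same number of numeric fields.

```r
# Hypothetical helper: collect every run of digits in each string
# and return them as an integer matrix, one row per input string.
# Assumes all strings hold the same number of numeric fields.
extract_numbers <- function(s) {
  s <- as.character(s)
  hits <- regmatches(s, gregexpr("[0-9]+", s))
  n_fields <- length(hits[[1]])
  # Guard against rows with a different number of fields
  stopifnot(all(vapply(hits, length, integer(1)) == n_fields))
  matrix(as.integer(unlist(hits)), ncol = n_fields, byrow = TRUE)
}

m <- extract_numbers(c("ccc12_ggg2_jjj14", "ccc7_ddd15_aaa11"))
m[1, ]  # 12 2 14
m[2, ]  # 7 15 11
```

The result is a plain integer matrix, so it can be attached to the original column with cbind() or data.frame() if a data frame is wanted.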


----- Original Message -----
From: arun <smartpink111 at yahoo.com>
To: Dimitri Liakhovitski <dimitri.liakhovitski at gmail.com>
Cc: R help <r-help at r-project.org>
Sent: Friday, June 7, 2013 11:00 PM
Subject: Re: [R] splitting a string column into multiple columns faster

Hi,
Maybe this helps:

res<-data.frame(x=x,read.table(text=gsub("[A-Za-z]","",x[,1]),sep="_",header=FALSE),stringsAsFactors=FALSE)
res
#               x V1 V2 V3
#1 aaa1_bbb1_ccc3  1  1  3
#2 aaa2_bbb3_ccc2  2  3  2
#3 aaa3_bbb2_ccc1  3  2  1
A.K.
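
The one-liner above does two things at once, so a step-by-step sketch of the same idea may be easier to read. It uses the three example strings from the original question; the intermediate names digits_only and nums are introduced here for illustration only.

```r
# Example data from the original question
x <- data.frame(x = c("aaa1_bbb1_ccc3", "aaa2_bbb3_ccc2", "aaa3_bbb2_ccc1"),
                stringsAsFactors = FALSE)

# Step 1: delete every letter, leaving only digits and the "_" separators
digits_only <- gsub("[A-Za-z]", "", x$x)
digits_only  # "1_1_3" "2_3_2" "3_2_1"

# Step 2: let read.table() parse the remaining text as "_"-separated fields
nums <- read.table(text = digits_only, sep = "_", header = FALSE)

# Step 3: put the parsed columns next to the original strings
res <- cbind(x, nums)
```

Note that this relies on letters being the only non-numeric characters besides the separator; any other stray character would end up in the parsed fields.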

----- Original Message -----
From: Dimitri Liakhovitski <dimitri.liakhovitski at gmail.com>
To: r-help <r-help at r-project.org>
Cc: 
Sent: Friday, June 7, 2013 9:24 PM
Subject: [R] splitting a string column into multiple columns faster

Hello!

I have a column in my data frame that I have to split: I have to distill
the numbers from the text. Below is my example and my solution.

x<-data.frame(x=c("aaa1_bbb1_ccc3","aaa2_bbb3_ccc2","aaa3_bbb2_ccc1"))
x
library(stringr)
out<-as.data.frame(str_split_fixed(x$x,"aaa",2))
out2<-as.data.frame(str_split_fixed(out$V2,"_bbb",2))
out3<-as.data.frame(str_split_fixed(out2$V2,"_ccc",2))
result<-cbind(x,out2[1],out3)
result
My problem is that str_split_fixed is relatively slow. My real data frame
has over 80,000 rows, so it takes almost 30 seconds to run just one line
(like out<-... above).
And it's even slower because I have to do it step by step, many times.

Is there any way to do it by specifying all 3 delimiters at once
("aaa","_bbb","_ccc") and then split it in one go into a data frame with
several columns?
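
One possible reading of "all 3 delimiters at once" is a single regular expression that matches any of them. The sketch below uses base R's strsplit() with the alternation "aaa|_bbb|_ccc". It assumes each string starts with "aaa" and contains each delimiter exactly once, so every split yields an empty first piece followed by three values. The column names aaa, bbb and ccc are chosen here for illustration.

```r
x <- data.frame(x = c("aaa1_bbb1_ccc3", "aaa2_bbb3_ccc2", "aaa3_bbb2_ccc1"),
                stringsAsFactors = FALSE)

# Split on any of the three delimiters in one pass.
# Because each string begins with "aaa", the first piece is always "".
parts <- strsplit(x$x, "aaa|_bbb|_ccc")

# Stack the pieces into a matrix and drop the empty first column
mat <- do.call(rbind, parts)[, -1, drop = FALSE]

# Convert to integers and label the columns (labels are illustrative)
mat <- matrix(as.integer(mat), nrow = nrow(mat),
              dimnames = list(NULL, c("aaa", "bbb", "ccc")))

out <- data.frame(x, mat)
```

If some rows might be missing a delimiter, the pieces would have unequal lengths and rbind() would recycle values, so a length check on parts would be worth adding before stacking.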

Thanks a lot for any pointers!

-- 
Dimitri Liakhovitski
