[R] splitting strings efficiently

Martin Morgan mtmorgan at fhcrc.org
Sun Jan 8 23:26:34 CET 2012


On 01/08/2012 11:37 AM, jim holtman wrote:
> Just a quick followup to the previous post using 4M entries:  (20
> seconds would seem like a reasonable time for the operation)
>
>>   ip<- "123.456.789.321"  ## example data
>>   df<- data.frame(ip = rep(ip, 4e6), stringsAsFactors=FALSE)
>>   system.time(x<- strsplit(df$ip, '\\.'))

or, if the IP addresses really are repeated many times, split only the
distinct values and index the result back to full length

## stringsAsFactors defaults to FALSE since R 4.0.0, so request a factor
df <- data.frame(ip=rep(ip, 4e6), stringsAsFactors=TRUE)  ## df$ip is a factor

 > system.time(x <- local({
+     ip0 <- strsplit(levels(df$ip), "\\.")
+     ip0[match(df$ip, levels(df$ip))]
+ }))
    user  system elapsed
   0.352   0.000   0.352

although the speed-up in this example is the best case, since there is
only one distinct address to split.
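If df$ip is a character vector rather than a factor, the same idea can
probably be expressed with unique() and match(). This is only a minimal
sketch and has not been timed here; split_unique is a hypothetical helper
name, and any gain depends on how few distinct addresses the data contain.

```r
## Hypothetical helper: split each distinct string once, then expand
## the result back to the original length with match().
split_unique <- function(x, pattern = "\\.") {
    x <- as.character(x)           # also accepts a factor
    ux <- unique(x)                # distinct values only
    parts <- strsplit(ux, pattern) # one split per distinct value
    parts[match(x, ux)]            # one list element per input element
}

## Made-up example data with one repeated address
ip <- c("10.0.0.1", "192.168.1.20", "10.0.0.1")
x <- split_unique(ip)
```

The result should be the same list that strsplit(ip, "\\.") returns, so
the helper can be swapped in without changing downstream code.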

Martin

>     user  system elapsed
>    19.47    0.12   20.86
>>   str(x)
> List of 4000000
>   $ : chr [1:4] "123" "456" "789" "321"
>   $ : chr [1:4] "123" "456" "789" "321"
>   $ : chr [1:4] "123" "456" "789" "321"
>   $ : chr [1:4] "123" "456" "789" "321"
>   $ : chr [1:4] "123" "456" "789" "321"
>   $ : chr [1:4] "123" "456" "789" "321"
>   $ : chr [1:4] "123" "456" "789" "321"
>   $ : chr [1:4] "123" "456" "789" "321"
>   $ : chr [1:4] "123" "456" "789" "321"
>
>
>
>
> On Sun, Jan 8, 2012 at 8:11 AM, Enrico Schumann<enricoschumann at yahoo.de>  wrote:
>>
>> Hi Andrew,
>>
>> you can use strsplit for a character vector; you do not have to call it for
>> every element data$ComputerName[i].
>>
>> If I understand correctly, maybe something like this helps
>>
>>> ip<- "123.456.789.321"  ## example data
>>> df<- data.frame(ip = rep(ip, 9), stringsAsFactors=FALSE)
>>> df
>>                ip
>> 1 123.456.789.321
>> 2 123.456.789.321
>> 3 123.456.789.321
>> 4 123.456.789.321
>> 5 123.456.789.321
>> 6 123.456.789.321
>> 7 123.456.789.321
>> 8 123.456.789.321
>> 9 123.456.789.321
>>
>>>
>>> res<- unlist(strsplit(df[["ip"]], "\\."))
>>> ii<- seq(1, nrow(df)*4, by = 4)
>>> res[ii]   ## A
>> [1] "123" "123" "123" "123" "123" "123" "123"
>> [8] "123" "123"
>>> res[ii+1] ## B
>> [1] "456" "456" "456" "456" "456" "456" "456"
>> [8] "456" "456"
>>> res[ii+2] ## C
>> [1] "789" "789" "789" "789" "789" "789" "789"
>> [8] "789" "789"
>>> res[ii+3] ## D
>> [1] "321" "321" "321" "321" "321" "321" "321"
>> [8] "321" "321"
>>
>>
>> Regards,
>> Enrico
>>
>>
>> Am 08.01.2012 11:06, schrieb Andrew Roberts:
>>
>>> Folks,
>>>
>>> I have a data frame with 4861469 rows that contains an ip address
>>> xxx.xxx.xxx.xxx as one of the columns. I want to assign a site to each
>>> row based on IP ranges. To do this I have a function to split the ip
>>> address as character into class A,B,C and D components. It works but is
>>> horribly inefficient in terms of speed. I can't quite see how one of the
>>> l/s/m/t/apply functions could be brought to bear on the problem. Does
>>> anyone have any thoughts?
>>>
>>> for(i in 1:4861469)
>>>     {
>>>     lst<-unlist(strsplit(data$ComputerName[i], "\\."))
>>>     data$IPA[i]<-lst[[1]]
>>>     data$IPB[i]<-lst[[2]]
>>>     data$IPC[i]<-lst[[3]]
>>>     data$IPD[i]<-lst[[4]]
>>>     rm(lst)
>>>     }
>>>
>>> Andrew
>>>
>>> Andrew Roberts
>>> Children's Orthopaedic Surgeon
>>> RJAH, Oswestry, UK
>>>
>>>         [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>> --
>> Enrico Schumann
>> Lucerne, Switzerland
>> http://nmof.net/
>>
>>
>
>
>
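For the original question quoted above (filling the IPA, IPB, IPC and IPD
columns), the per-row loop can likely be replaced by one strsplit() call
followed by a reshape into a four-column matrix. A rough sketch, assuming
every address has exactly four dot-separated parts; the small data frame
is made up and only stands in for the real 'data':

```r
## Made-up data frame standing in for the original 'data'
data <- data.frame(ComputerName = c("10.0.0.1", "192.168.1.20"),
                   stringsAsFactors = FALSE)

## Split the whole column in one call
parts <- strsplit(data$ComputerName, "\\.")

## Guard the assumption that each address has exactly four parts
stopifnot(all(lengths(parts) == 4L))

## One row per address, one column per component
m <- matrix(unlist(parts), ncol = 4, byrow = TRUE)

data$IPA <- m[, 1]
data$IPB <- m[, 2]
data$IPC <- m[, 3]
data$IPD <- m[, 4]
```

The components stay character; wrap them in as.integer() if the site
ranges are to be compared numerically.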


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793


