[R] splitting very long character string

Marc Schwartz marc_schwartz at comcast.net
Wed Nov 1 17:05:16 CET 2006


On Wed, 2006-11-01 at 16:47 +0100, Arne.Muller at sanofi-aventis.com wrote:
> Hello,
> 
> I've a very long character array (>500k characters) that I need to split
> by '\n', resulting in an array of about 60k numbers. The help on
> strsplit says to use perl=TRUE to get better performance, but still it
> takes several minutes to split this string.
> 
> The massive string is the return value of a call to
> xmlElementsByTagName from the XML library and looks like this:
> 
> ...
> 12345
> 564376
> 5674
> 6356656
> 5666
> ...
> 
> I've to read about a hundred of these files and was wondering whether
> there's a more efficient way to turn this string into an array of
> numerics. Any ideas?
> 
> 	thanks a lot for your help
> 	and kind regards,
> 
> 	Arne
> 

> Vec <- sample(c(0:9, "\n"), 500000, replace = TRUE)

> str(Vec)
 chr [1:500000] "7" "0" "9" "6" "5" "3" "1" "9" ...

> table(Vec)
Vec
   \n     0     1     2     3     4     5     6     7     8     9
45432 45723 45641 45526 45460 45284 45378 45392 45374 45314 45476


> sink("Vec.txt")
> cat(Vec)
> sink()

First 10 lines of Vec.txt:

7 0 9 6 5 3 1 9 8 1 8 3 4 2 
 1 2 2 
 3 7 7 6 8 3 4 7 4 
 9 2 1 9 8 7 2 0 9 4 3 
 9 3 5 2 2 5 8 0 5 4 5 6 1 5 8 7 4 1 2 8 3 2 6 4 9 4 1 6 8 5 0 8 8 8 5 3 0 5 3 5 4 8 5 4 3 
 9 
 5 3 6 5 8 9 7 6 9 
 5 8 
 2 4 6 
 5 
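
The spaces in Vec.txt come from cat(), which by default puts a single space between the elements it prints. A minimal sketch of writing the same kind of vector without them, using cat()'s own file and sep arguments rather than sink(). The names small, tmp and res are only for illustration, and the data is a toy example, not the real file:

```r
# Illustrative only: a tiny vector of single digits and newlines
small <- c("1", "2", "\n", "3", "4", "5", "\n", "6", "\n")

tmp <- tempfile(fileext = ".txt")

# sep = "" suppresses the default space between elements;
# file = writes straight to disk, so no sink()/sink() pair is needed
cat(small, file = tmp, sep = "")

# Same scan() call as in the example, reading one number per line
res <- scan(tmp, sep = "\n", quiet = TRUE)
unlink(tmp)

print(res)
```

With a single long string, as in the original question, cat(x, file = tmp) writes it as is, since there are no elements to separate.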

> system.time(Vec.Split <- scan("Vec.txt", sep = "\n"))
Read 41276 items
[1] 0.180 0.004 0.186 0.000 0.000

> str(Vec.Split)
 num [1:41276] 7.10e+13 1.22e+02 3.78e+08 9.22e+10 9.35e+44 ...

> sprintf("%.0f", Vec.Split[1:10])
 [1] "70965319818342"
 [2] "122"
 [3] "377683474"
 [4] "92198720943"
 [5] "935225805456158720742405574866620654670577664"
 [6] "9"
 [7] "536589769"
 [8] "58"
 [9] "246"
[10] "5"
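
If the temporary file is not wanted, something along these lines might also work. It is only a sketch under a few assumptions: big is a made-up stand-in for the single long string returned by the XML call, the text argument of scan() is only available in more recent versions of R, and fixed = TRUE in strsplit() treats "\n" as a literal string instead of a regular expression. I have not timed either variant on the real data.

```r
# Illustrative stand-in for the one long string of newline-separated numbers
set.seed(1)
nums <- sample(1:1e7, 60000, replace = TRUE)
big  <- paste(nums, collapse = "\n")

# Option 1: let scan() read directly from the string
# (the text argument is not available in old versions of R)
v1 <- scan(text = big, sep = "\n", quiet = TRUE)

# Option 2: literal split; fixed = TRUE avoids the regular-expression engine
v2 <- as.numeric(strsplit(big, "\n", fixed = TRUE)[[1]])

# Both should give the same numeric vector
print(length(v1))
print(isTRUE(all.equal(v1, v2)))
```

Wrapping either call in system.time() on one of the real files would show whether it is fast enough for about a hundred of them.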


Does that help?

Marc Schwartz


