[R] Parsing question, partly comma separated partly underscore separated string

Tue Mar 8 04:45:57 CET 2011

Thanks to Gabor Grothendieck and Dennis Murphy I can now solve the first
part of my problem and already impress my colleagues with the
R program below (I know it could be written in a smarter way, but I am
learning). It reads my partly comma-separated, partly underscore-separated
string and cleans it up in a very neat way.

Regardless of my inability to write tight code, I moved on to the
second part of my quest: putting it all into a loop so I can process
my approximately 100 .txt files in /usr2/username/data/. I got
started with list.files() and my loop is more or less working, but I
got stuck on the last cbind part.

Is there a friendly R-hacker out there who would be willing to take a
look at my loop below?

Thanks,
Eric

####################################################
##                                                ##
##   The answer to the first part of my question  ##
##                                                ##
####################################################

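# Read the single long line from the file.
Line <- readLines("/usr2/efail/data/example.txt")
# Split into the header (first element) and one element per ZZ_ record.
s <- strsplit(Line, "ZZ_")[[1]]
# Drop everything after the word BLOCK, then the literal "@9z.svg" suffix.
s2 <- sub("BLOCK.*", "BLOCK", s)
s3 <- sub("@9z.svg", "", s2, fixed = TRUE)
# Turn the underscore separators into commas.
s4 <- gsub("_", ",", s3)
# Header fields; the subject ID is the first one.
s5 <- read.table(textConnection(s4[1]), sep = ",")
# Records: skip the header element and read the rest as a table.
DF <- read.table(textConnection(s4), skip = 1, sep = ",", as.is = TRUE)
# The block number increases after each row flagged BLOCK; run counts rows within a block.
DF$block <- head(cumsum(c("", DF$V8) == "BLOCK") + 1, -1)
DF$run <- ave(DF$block, DF$block, FUN = seq_along)
DF$V8 <- NULL
names(DF) <- c("ImgNam", "Tx", "Ty", "Treatment", "x", "y", "Y", "BLOCK", "RUN")
DF$ID <- s5$V1
DF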
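
The block and run lines are the least obvious part of the code above, so here is a small sketch of what they do. The vector v8 is made up for illustration and stands in for the DF$V8 column, where "BLOCK" flags the last row of each block.

```r
# Made-up marker column: "BLOCK" flags the last row of each block.
v8 <- c("", "", "BLOCK", "", "BLOCK", "")

# Prepend "" so the counter increases on the row *after* each marker,
# then drop the extra trailing element again with head(..., -1).
block <- head(cumsum(c("", v8) == "BLOCK") + 1, -1)

# Within each block, number the rows 1, 2, 3, ...
run <- ave(block, block, FUN = seq_along)

print(data.frame(v8, block, run))
```

With this input, block comes out as 1 1 1 2 2 3 and run as 1 2 3 1 2 1.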

#####################################
##                                 ##
##     The PARTLY WORKING loop     ##
##                                 ##
#####################################

# pattern is a regular expression, so escape the dot and anchor it at the end.
fname <- list.files("/usr2/efail/data", pattern = "\\.txt$", full.names = TRUE,
                    recursive = TRUE, ignore.case = TRUE)

for (sp in seq_along(fname)) {
  # Passing the path directly avoids leaving a file connection open per file.
  Line <- readLines(fname[sp])
  s <- strsplit(Line, "ZZ_")[[1]]
  s2 <- sub("BLOCK.*", "BLOCK", s)
  s3 <- sub("@9z.svg", "", s2, fixed = TRUE)
  s4 <- gsub("_", ",", s3)
  s5 <- read.table(textConnection(s4[1]), sep = ",")
  DF <- read.table(textConnection(s4), skip = 1, sep = ",", as.is = TRUE)
  DF$block <- head(cumsum(c("", DF$V8) == "BLOCK") + 1, -1)
  DF$run <- ave(DF$block, DF$block, FUN = seq_along)
  DF$V8 <- NULL
  names(DF) <- c("ImgNam", "Tx", "Ty", "Treatment", "x", "y", "Y", "BLOCK", "RUN")
  DF$ID <- s5$V1
  FINAL.DF <- cbind(DF… ## This is where I got stuck.
}
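
For what it is worth, here is a minimal sketch of one way the stuck step might be completed. It assumes the goal is one data frame with the rows of all files stacked on top of each other, which would call for rbind rather than cbind. The helper name parse_line, the temporary directory and the two toy files are made up for illustration; the parsing steps are the ones from the code above. It uses the text argument of read.table, which is available in current versions of R.

```r
# Hypothetical helper wrapping the parsing steps: one raw line in,
# one data frame out.
parse_line <- function(Line) {
  s  <- strsplit(Line, "ZZ_", fixed = TRUE)[[1]]
  s2 <- sub("BLOCK.*", "BLOCK", s)
  s3 <- sub("@9z.svg", "", s2, fixed = TRUE)
  s4 <- gsub("_", ",", s3, fixed = TRUE)
  hdr <- read.table(text = s4[1], sep = ",", as.is = TRUE)   # header fields
  DF  <- read.table(text = s4[-1], sep = ",", as.is = TRUE)  # one row per record
  DF$block <- head(cumsum(c("", DF$V8) == "BLOCK") + 1, -1)
  DF$run   <- ave(DF$block, DF$block, FUN = seq_along)
  DF$V8    <- NULL
  names(DF) <- c("ImgNam", "Tx", "Ty", "Treatment", "x", "y", "Y", "BLOCK", "RUN")
  DF$ID <- hdr$V1
  DF
}

# Two made-up input files in a temporary directory (stand-ins for the real data).
demo_dir <- file.path(tempdir(), "parse_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(paste0("S01,Exp,",
                  "ZZ_373_462_488_TRT@9z.svg,592,820,3.35,",
                  "ZZ_032_288_436_CON@9z.svg,332,878,3.66,BLOCK@9z.svg,322,842,32.96,",
                  "ZZ_004_088_403_CON@9z.svg,606,908,3.32,BLOCK@9z.svg,542,840,4.44,"),
           file.path(demo_dir, "a.txt"))
writeLines("S02,Exp,ZZ_345_230_662_TRT@9z.svg,632,844,2.47,BLOCK@9z.svg,320,852,4.04,",
           file.path(demo_dir, "b.txt"))

fname <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# Parse every file into its own data frame, then stack them by rows.
pieces   <- lapply(fname, function(f) parse_line(readLines(f, n = 1)))
FINAL.DF <- do.call(rbind, pieces)
print(FINAL.DF)
```

Collecting the pieces in a list and calling rbind once at the end avoids growing FINAL.DF inside the loop. For the real data, demo_dir would be replaced by the actual data directory.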

On Mon, Mar 7, 2011 at 8:18 AM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> On Sun, Mar 6, 2011 at 10:13 PM, Eric Fail <eric.fail at gmx.com> wrote:
>> Dear R-list,
>>
>> I have a partly comma separated partly underscore separated string that I am trying to parse into R.
>>
>> Furthermore I have a bunch of them, and they are quite long. I have now spent most of my Sunday trying to figure this out and thought I would try the list to see if someone here would be able to get me started.
>>
>> My data structure looks like this,
>>
>> (in an example.txt file)
>> Subject ID,ExperimentName,2010-04-23,32:34:23,Version 0.4, 640 by 960  pixels, On Device M, M, 3.2.4,ZZ_373_462_488_TRT@9z.svg,592,820,3.35,ZZ_032_288_436_CON@9z.svg,332,878,3.66,ZZ_384_204_433_TRT@9z.svg,334,824,3.28,ZZ_365_575_683_TRT@9z.svg,598,878,3.50,ZZ_005_480_239_CON@9z.svg,630,856,8.03,ZZ_030_423_394_CON@9z.svg,98,846,4.09,ZZ_033_596_398_CON@9z.svg,636,902,3.28,ZZ_263_064_320_TRT@9z.svg,570,894,1.26,BLOCK@9z.svg,322,842,32.96,ZZ_004_088_403_CON@9z.svg,606,908,3.32,ZZ_703_546_434_CON@9z.svg,624,934,2.58,ZZ_712_348_543_CON@9z.svg,20,828,5.36,ZZ_005_48_239_CON@9z.svg,580,830,4.36,ZZ_310_444_623_TRT@9z.svg,586,806,0.08,ZZ_030_423_394_CON@9z.svg,350,854,3.84,ZZ_340_382_539_TRT@9z.svg,570,894,1.26,BLOCK@9z.svg,542,840,4.44,ZZ_345_230_662_TRT@9z.svg,632,844,2.47,ZZ_006_335_309_CON@9z.svg,96,930,3.63,ZZ_782_346_746_TRT@9z.svg,306,850,2.58,ZZ_334_200_333_TRT@9z.svg,304,842,3.34,ZZ_383_506_726_TRT@9z.svg,622,884,3.84,ZZ_294_360_448_TRT@9z.svg,90,858,3.56,ZZ_334_335_473_TRT@9z.svg,570,894,1.26,BLOCK@9z.svg,320,852,4.04,
>> (end of example.txt file)
>>
>> The above is approximately 5% of the length of a full file, and I have about 100 of them. Please note that the strings end with a comma.
>>
>> I am trying to parse it into something like this
>>
>> ID ImgNam BLOCK RUN Tx Ty Treatment x y Y
>> Subject ID 373 1 1 462 488 TRT 592 820 3.35
>> Subject ID 32 1 2 288 436 CON 332 878 3.66
>> Subject ID 384 1 3 204 433 TRT 334 824 3.28
>> Subject ID 365 1 4 575 683 TRT 598 878 3.5
>> Subject ID 5 1 5 480 239 CON 630 856 8.03
>> Subject ID 30 1 6 423 394 CON 98 846 4.09
>> Subject ID 33 1 7 596 398 CON 636 902 3.28
>> Subject ID 263 1 8 64 320 TRT 570 894 1.26
>> Subject ID 4 2 1 88 403 CON 606 908 3.32
>> Subject ID 703 2 2 546 434 CON 624 934 2.58
>> Subject ID 712 2 3 348 543 CON 20 828 5.36
>> Subject ID 5 2 4 48 239 CON 580 830 4.36
>> Subject ID 310 2 5 444 623 TRT 586 806 0.08
>> Subject ID 30 2 6 423 394 CON 350 854 3.84
>> Subject ID 340 2 7 382 539 TRT 570 894 1.26
>> Subject ID 345 3 1 230 662 TRT 632 844 2.47
>> Subject ID 6 3 2 335 309 CON 96 930 3.63
>> Subject ID 782 3 3 346 746 TRT 306 850 2.58
>> Subject ID 334 3 4 200 333 TRT 304 842 3.34
>> Subject ID 383 3 5 506 726 TRT 622 884 3.84
>> Subject ID 294 3 6 360 448 TRT 90 858 3.56
>> Subject ID 334 3 7 335 473 TRT 570 894 1.26
>>
>> I could do it in Excel, but it would take me a week, and it would be stupid. If someone could please help me get started I would very much appreciate it. It would not only benefit me, but my colleagues would see the benefit of R and the R-list in particular.
>>
>
> Try this.  We split the line by ZZ_ giving s and remove the junk after
> the word BLOCK giving s2.  Then we remove @9z.svg giving s3 and
> convert each _ to , giving s4.  We then read it into a data frame
> using comma as the separator, calculate the block and run columns,
> remove one junk column and assign column names.
>
>> Line <- "Subject ID,ExperimentName,2010-04-23,32:34:23,Version 0.4, 640 by 960  pixels, On Device M, M, 3.2.4,ZZ_373_462_488_TRT@9z.svg,592,820,3.35,ZZ_032_288_436_CON@9z.svg,332,878,3.66,ZZ_384_204_433_TRT@9z.svg,334,824,3.28,ZZ_365_575_683_TRT@9z.svg,598,878,3.50,ZZ_005_480_239_CON@9z.svg,630,856,8.03,ZZ_030_423_394_CON@9z.svg,98,846,4.09,ZZ_033_596_398_CON@9z.svg,636,902,3.28,ZZ_263_064_320_TRT@9z.svg,570,894,1.26,BLOCK@9z.svg,322,842,32.96,ZZ_004_088_403_CON@9z.svg,606,908,3.32,ZZ_703_546_434_CON@9z.svg,624,934,2.58,ZZ_712_348_543_CON@9z.svg,20,828,5.36,ZZ_005_48_239_CON@9z.svg,580,830,4.36,ZZ_310_444_623_TRT@9z.svg,586,806,0.08,ZZ_030_423_394_CON@9z.svg,350,854,3.84,ZZ_340_382_539_TRT@9z.svg,570,894,1.26,BLOCK@9z.svg,542,840,4.44,ZZ_345_230_662_TRT@9z.svg,632,844,2.47,ZZ_006_335_309_CON@9z.svg,96,930,3.63,ZZ_782_346_746_TRT@9z.svg,306,850,2.58,ZZ_334_200_333_TRT@9z.svg,304,842,3.34,ZZ_383_506_726_TRT@9z.svg,622,884,3.84,ZZ_294_360_448_TRT@9z.svg,90,858,3.56,ZZ_334_335_473_TRT@9z.svg,570,894,1.26,BLOCK@9z.svg,320,852,4.04,"
>>
>> s <- strsplit(Line, "ZZ_")[[1]]
>> s2 <- sub("BLOCK.*", "BLOCK", s)
>> s3 <- sub("@9z.svg", "", s2)
>> s4 <- gsub("_", ",", s3)
>> DF <- read.table(textConnection(s4), skip = 1, sep = ",", as.is = TRUE)
>> DF$block <- head(cumsum(c("", DF$V8) == "BLOCK")+1, -1)
>> DF$run <- ave(DF$block, DF$block, FUN = seq_along)
>> DF$V8 <- NULL
>> names(DF) <- c("ImgNam", "Tx", "Ty", "Treatment", "x", "y", "Y", "BLOCK", "RUN")
>> DF
>   ImgNam  Tx  Ty Treatment   x   y    Y BLOCK RUN
> 1     373 462 488       TRT 592 820 3.35     1   1
> 2      32 288 436       CON 332 878 3.66     1   2
> 3     384 204 433       TRT 334 824 3.28     1   3
> 4     365 575 683       TRT 598 878 3.50     1   4
> 5       5 480 239       CON 630 856 8.03     1   5
> 6      30 423 394       CON  98 846 4.09     1   6
> 7      33 596 398       CON 636 902 3.28     1   7
> 8     263  64 320       TRT 570 894 1.26     1   8
> 9       4  88 403       CON 606 908 3.32     2   1
> 10    703 546 434       CON 624 934 2.58     2   2
> 11    712 348 543       CON  20 828 5.36     2   3
> 12      5  48 239       CON 580 830 4.36     2   4
> 13    310 444 623       TRT 586 806 0.08     2   5
> 14     30 423 394       CON 350 854 3.84     2   6
> 15    340 382 539       TRT 570 894 1.26     2   7
> 16    345 230 662       TRT 632 844 2.47     3   1
> 17      6 335 309       CON  96 930 3.63     3   2
> 18    782 346 746       TRT 306 850 2.58     3   3
> 19    334 200 333       TRT 304 842 3.34     3   4
> 20    383 506 726       TRT 622 884 3.84     3   5
> 21    294 360 448       TRT  90 858 3.56     3   6
> 22    334 335 473       TRT 570 894 1.26     3   7
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>