[R] extracting values from txt file that follow user-supplied quote

Bert Gunter gunter.berton at gene.com
Wed Jun 6 20:35:10 CEST 2012


I think 1 GB is small enough that this can be done easily and
efficiently in R. The key is: regular expressions are your friend.

I shall assume that the text file has been read into R as a single
character string, named "mystring". The code below could easily be
modified to work on a vector of strings if the file is read in line
by line. Alternatively, paste(yourvec, collapse = "") could be used to
collapse such a vector into a single string, which would be necessary
if, for instance, the key info could be broken up over several
lines/elements of the vector.
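As a minimal sketch of that collapse step (the two-element vector here
is just a stand-in for what readLines() would return for your file):

```r
# Stand-in for lines <- readLines("yourfile.txt"); note the keyword is
# split across the two lines, which is why collapsing matters.
lines <- c("abc qdxp", "Rxt123 def")
mystring <- paste(lines, collapse = "")
mystring   # "abc qdxpRxt123 def" -- keyword now intact
```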

Here is a small reproducible example of one way to do this, in which
the keyword string to search for is "qdxpRxt" and the next 3
characters that follow it are what you wish to extract.

### Create the example: 30 random-length chunks of letters and spaces,
### each followed by the keyword and a random integer in 0-999
test <- lapply(sample(1:20, 30, replace = TRUE), function(n)
  paste(sample(c(letters, LETTERS, rep(" ", 12)), n, replace = TRUE),
        collapse = ""))
mystring <- paste(lapply(test, function(x)
  paste(x, "qdxpRxt", floor(runif(1, 0, 1000)), sep = "")),
  collapse = "")

## Extract the strings of interest: replace each "anything + keyword +
## next 3 characters" with just those 3 characters plus a "%&" marker,
## then split on the marker
extracted <- strsplit(gsub(".*?qdxpRxt(.{3})", "\\1%&", mystring), "%&")

Note that this is not quite right yet. The extracted strings might
include characters, not just numbers; the strings have to be converted
from character to numeric; and I assumed that "%&" would not occur in
the extracted strings and could be used as a split string. I leave the
fixups to you.
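For illustration, here is one possible set of fixups -- a sketch, not
the only way, and it assumes the values following the keyword are plain
integers -- using regmatches()/gregexpr() rather than the "%&" marker
trick, so the marker-collision worry goes away:

```r
# Hypothetical input; the keyword "qdxpRxt" is from the example above.
mystring <- "junk qdxpRxt417 more junk qdxpRxt9 end"

# Find every occurrence of keyword-plus-digits, then strip the keyword
# and convert what remains to numeric.
m <- regmatches(mystring, gregexpr("qdxpRxt[0-9]+", mystring))[[1]]
values <- as.numeric(sub("qdxpRxt", "", m))
values   # 417 9
```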

Note that I make no claims about the efficiency of this approach
vis-à-vis external tools like gawk -- of which there are many -- nor
do I deny that any of several R string-manipulation packages might do
the job just as easily. I only wanted to point out that it appears to
be straightforward in vanilla R, since R already has a regular
expression engine built in.
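
For the original large-file problem, the same idea vectorizes: read all
the lines at once, keep only those containing the key phrase, and take
the fixed-width field. This is a sketch against an in-memory stand-in;
with the real file you would use lines <- readLines("MCR.out") instead
(the path and the columns 70-78 come from the original post, and the
whole file must fit in memory):

```r
# Build one stand-in line with the value in columns 70-78, as in MCR.out.
line  <- paste0(strrep(" ", 30), "PERCENT DISCREPANCY =",
                strrep(" ", 18), "    -0.01")
lines <- c("some header text", line, "other model output")

# Keep only the matching lines, then extract the fixed-width field.
hits <- grep("PERCENT DISCREPANCY =", lines, fixed = TRUE, value = TRUE)
pd   <- as.numeric(substr(hits, 70, 78))
pd   # -0.01
```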

Cheers,
Bert


On Wed, Jun 6, 2012 at 10:34 AM, Rainer Schuermann
<rainer.schuermann at gmx.net> wrote:
> R may not be the best tool for this.
> Did you look at gawk? It is also available for Windows:
> http://gnuwin32.sourceforge.net/packages/gawk.htm
>
> Once gawk has written a new file that only contains the lines / data you want, you could use R for the next steps.
> You can also run gawk from within R with the system() command.
>
> Rgds,
> Rainer
>
>
> On Wednesday 06 June 2012 09:54:15 emorway wrote:
>> useRs-
>>
>> I'm attempting to scan a more than 1 GB text file and read and store the
>> values that follow a specific key-phrase that is repeated multiple times
>> throughout the file.  A snippet of the text file I'm trying to read is
>> attached.  The text file is a dumping ground for various aspects of the
>> performance of the model that generates it.  Thus, the location of
>> information I'm wanting to extract from the file is not in a fixed position
>> (i.e. it does not always appear in a predictable location, like line 1000,
>> or 2000, etc.).  Rather, the desired values always follow a specific phrase:
>> "   PERCENT DISCREPANCY ="
>>
>> One approach I took was the following:
>>
>> library(R.utils)
>>
>> txt_con<-file(description="D:/MCR_BeoPEST - Copy/MCR.out",open="r")
>> #The above will need to be altered if one desires to test code on the
>> attached txt file, which will run much quicker
>> system.time(num_lines<-countLines("D:/MCR_BeoPEST - Copy/MCR.out"))
>> #elapsed time on the full 1 GB file was about 55 seconds on a 3.6 GHz Xeon
>> num_lines
>> #14405247
>>
>> pd<-numeric(0)  #accumulator; pd must exist before c(pd,...) below
>> system.time(
>> for(i in 1:num_lines){
>>   txt_line<-readLines(txt_con,n=1)
>>   if (length(grep("    PERCENT DISCREPANCY =",txt_line))) {
>>     pd<-c(pd,as.numeric(substr(txt_line,70,78)))
>>   }
>> }
>> )
>> #Time took about 5 minutes
>>
>> The inefficiencies in this approach arise due to reading the file twice
>> (first to get num_lines, then to step through each line looking for the
>> desired text).
>>
>> Is there a way to speed this process up through the use of a ?scan  ?  I
>> wasn't able to get anything working, but what I had in mind was to scan through
>> the more than 1Gb file and when the keyphrase (e.g.  "     PERCENT
>> DISCREPANCY =  ") is encountered, read and store the next 13 characters
>> (which will include some white spaces) as a numeric value, then resume the
>> scan until the key phrase is encountered again and repeat until the
>> end-of-the-file marker is encountered.  Is such an approach even possible or
>> is line-by-line the best bet?
>>
>> http://r.789695.n4.nabble.com/file/n4632558/MCR.out MCR.out
>>
>>
>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.



-- 

Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm


