[R] Drop matching lines from readLines

Bert Gunter gunter.berton at gene.com
Thu Oct 14 17:55:41 CEST 2010


If I understand correctly, the poster knows what regex pattern matches
the bad lines, in which case (modulo memory capacity -- but 200 MB should
not be a problem, I think) isn't

cleanData <- dirtyData[!grepl("errorPatternregex", dirtyData)]

sufficient?
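A minimal sketch of the full round trip the original poster describes
(the file name and errorPatternregex are placeholders for the real file
and whatever regex matches the bad lines):

```r
# Read the raw file, drop lines matching the error pattern,
# and write the cleaned lines back out.
dirtyData <- readLines("base_file.txt")
cleanData <- dirtyData[!grepl("errorPatternregex", dirtyData)]
writeLines(cleanData, "base_file.txt")
```

Writing back to the same file is fine here because readLines has already
pulled everything into memory; with a file too large for that, write to a
new file instead.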

Cheers,
Bert

On Thu, Oct 14, 2010 at 4:05 AM, Mike Marchywka <marchywka at hotmail.com> wrote:
>
> ----------------------------------------
>> From: santosh.srinivas at gmail.com
>> To: r-help at r-project.org
>> Date: Thu, 14 Oct 2010 11:27:57 +0530
>> Subject: [R] Drop matching lines from readLines
>>
>> Dear R-group,
>>
>> I have some noise in my text file (coding issues!) ... I imported a 200 MB
>> text file using readLines and
>> used grep to find the lines with the error.
>>
>> What is the easiest way to drop those lines? I plan to write back the
>> "cleaned" data set to my base file.
>
> Generally for text processing I've been using utilities external to R,
> although there may be R alternatives that work better for you. You
> mention grep; I've suggested sed as a general way to fix formatting problems,
> and there is also a utility called "uniq" on Linux or Cygwin.
> I have gotten into the habit of using these for a variety of data
> manipulation tasks, and only feed clean data into R.
>
> $ echo -e 'a bc\na bc'
> a bc
> a bc
>
> $ echo -e 'a bc\na bc' | uniq
> a bc
>
> $ uniq --help
> Usage: uniq [OPTION]... [INPUT [OUTPUT]]
> Filter adjacent matching lines from INPUT (or standard input),
> writing to OUTPUT (or standard output).
>
> With no options, matching lines are merged to the first occurrence.
>
> Mandatory arguments to long options are mandatory for short options too.
>   -c, --count           prefix lines by the number of occurrences
>   -d, --repeated        only print duplicate lines
>   -D, --all-repeated[=delimit-method]  print all duplicate lines
>                         delimit-method={none(default),prepend,separate}
>                         Delimiting is done with blank lines
>   -f, --skip-fields=N   avoid comparing the first N fields
>   -i, --ignore-case     ignore differences in case when comparing
>   -s, --skip-chars=N    avoid comparing the first N characters
>   -u, --unique          only print unique lines
>   -z, --zero-terminated  end lines with 0 byte, not newline
>   -w, --check-chars=N   compare no more than N characters in lines
>       --help     display this help and exit
>       --version  output version information and exit
>
> A field is a run of blanks (usually spaces and/or TABs), then non-blank
> characters.  Fields are skipped before chars.
>
> Note: 'uniq' does not detect repeated lines unless they are adjacent.
> You may want to sort the input first, or use `sort -u' without `uniq'.
> Also, comparisons honor the rules specified by `LC_COLLATE'.
>
>>
>> Thanks.
>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
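For the original problem -- dropping the matching lines entirely rather
than merging duplicates -- the same external-utility approach works with
grep -v. A sketch, where the file names and the pattern 'ERROR' are
placeholders for the real data:

```shell
# Build a small dirty file for illustration.
printf 'good line\nERROR bad line\nanother good line\n' > dirty.txt

# Keep only the lines that do NOT match the error pattern.
grep -v 'ERROR' dirty.txt > clean.txt

cat clean.txt
# good line
# another good line
```

As with uniq, this streams through the file, so it handles inputs far
larger than 200 MB without loading them into memory.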



-- 
Bert Gunter
Genentech Nonclinical Biostatistics