[R] iterators : checkFunc with ireadLines

Laurent Rhelp LaurentRHelp using free.fr
Wed May 27 10:56:42 CEST 2020


I installed Raku on my PC to test your solution.

The command raku -e '.put for lines.grep( / ^^N053 | ^^N163 /, :p );' 
Laurents.txt works fine when I run it at the bash prompt, but when I 
run it through pipe() in R as you suggest, nothing ends up in lines 
after lines <- read.table(i).

Ivan's solution has the same problem: the command grep -E 
'^(N053|N163)' test.txt works fine at the bash prompt, but not as 
i <- pipe("grep -E '^(N053|N163)' test.txt"); lines <- read.table(i)

Maybe this is because I am working on MS Windows?
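
One possible explanation, offered as an assumption rather than a diagnosis: on Windows, pipe() normally hands the command to cmd.exe, which does not treat single quotes as quoting characters, so the | inside '^(N053|N163)' may be read as a shell pipe instead of reaching grep. A minimal sketch that builds the command with shQuote(), whose type argument selects the quoting style. The helper name build_grep_cmd is hypothetical, and grep is assumed to be installed and on the PATH:

```r
# Hypothetical helper: build a grep command line with quoting chosen for
# the target shell. shQuote(type = "cmd") wraps each argument in double
# quotes (the style cmd.exe understands); type = "sh" uses single quotes.
build_grep_cmd <- function(pattern, file,
                           windows = .Platform$OS.type == "windows") {
  type <- if (windows) "cmd" else "sh"
  paste("grep -E", shQuote(pattern, type = type), shQuote(file, type = type))
}

# Show both variants regardless of the platform this runs on
cmd_unix <- build_grep_cmd("^(N053|N163)", "test.txt", windows = FALSE)
cmd_win  <- build_grep_cmd("^(N053|N163)", "test.txt", windows = TRUE)
print(cmd_unix)
print(cmd_win)

# Usage (not run here; assumes grep is available and test.txt exists):
# i <- pipe(build_grep_cmd("^(N053|N163)", "test.txt"))
# lines <- read.table(i)
```

Whether this resolves the empty result depends on which shell R actually invokes and on which grep is found first on the PATH, so it is worth printing the command and trying it in a cmd.exe window first.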

thx
LP




On 24/05/2020 at 04:34, William Michels wrote:
> Hi Laurent,
>
> Seeking to give you an "R-only" solution, I thought the read.fwf()
> function might be useful (to read-in your first column of data, only).
> However Jeff is correct that this is a poor strategy, since read.fwf()
> reads the entire file into R (documented in "Fixed-width-format
> files", Section 2.2: R Data Import/Export Manual).
>
> Jeff has suggested a number of packages, as well as using a database.
> Ivan Krylov has posted answers using grep, awk and perl (perl5--to
> disambiguate). [In point of fact, the R Data Import/Export Manual
> suggests using perl]. Similar to Ivan, I've posted code below using
> the Raku programming language (the language formerly known as Perl6).
> Regexes are claimed to be more readable, but are currently very slow
> in Raku. However on the plus side, the language is designed to handle
> Unicode gracefully:
>
>> # pipe() using raku-grep on Laurent's data (sep=mult whitespace):
>> con_obj1 <- pipe(paste("raku -e '.put for lines.grep( / ^^N053 | ^^N163 /, :p );' ", "Laurents.txt"), open="rt");
>> p6_import_a <- scan(file=con_obj1, what=list("","","","","","","","","",""), flush=TRUE, multi.line=FALSE, quiet=TRUE);
>> close(con_obj1);
>> as.data.frame(sapply(p6_import_a, t), stringsAsFactors=FALSE);
>    V1   V2        V3        V4        V5        V6        V7        V8        V9       V10
> 1  2 N053 -0.014083 -0.004741  0.001443 -0.010152 -0.012996 -0.005337 -0.008738 -0.015094
> 2  4 N163 -0.054023 -0.049345 -0.037158  -0.04112 -0.044612 -0.036953 -0.036061 -0.044516
>> # pipe() using raku-grep "starts-with" to find genbankID ( >3GB TSV file)
>> # "lines[0..5]" restricts raku to reading first 6 lines!
>> # change "lines[0..5]" to "lines" to run raku code on whole file:
>> con_obj2 <- pipe(paste("raku -e '.put for lines[0..5].grep( *.starts-with(q[A00145]), :p);' ", "genbankIDs_3GB.tsv"), "rt");
>> p6_import_b <- read.table(con_obj2, sep="\t");
>> close(con_obj2)
>> p6_import_b
>    V1     V2       V3          V4 V5
> 1  4 A00145 A00145.1 IFN-alpha A NA
>> # unicode test using R's system() function:
>> try(system("raku -ne '.grep( /  你好  |  こんにちは  |  مرحبا  |  Привет  /, :v ).put;'  hello_7lang.txt", intern = TRUE, ignore.stderr = FALSE))
> [1] ""                    ""                    ""                    "你好 Chinese"
> [5] "こんにちは Japanese" "مرحبا Arabic"        "Привет Russian"
> [special thanks to Brad Gilbert, Joseph Brenner and others on the
> perl6-users mailing list. All errors above are my own.]
>
> HTH, Bill.
>
> W. Michels, Ph.D.
>
>
>
>
> On Fri, May 22, 2020 at 4:48 AM Laurent Rhelp <LaurentRHelp using free.fr> wrote:
>> Hi Ivan,
>>     Indeed, it is a good idea. I am on MS Windows, but I can use the
>> bash shell that comes with Git. I will see how to do that with the Unix
>> command-line tools.
>>
>>
>> On 20/05/2020 at 09:46, Ivan Krylov wrote:
>>> Hi Laurent,
>>>
>>> I am not saying this will work every time and I do recognise that this
>>> is very different from a more general solution that you had envisioned,
>>> but if you are on a Unix-like system or have the relevant utilities
>>> installed and on the %PATH% on Windows, you can filter the input file
>>> line-by-line using a pipe and an external program:
>>>
>>> On Sun, 17 May 2020 15:52:30 +0200
>>> Laurent Rhelp <LaurentRHelp using free.fr> wrote:
>>>
>>>> # sensors to keep
>>>> sensors <-  c("N053", "N163")
>>> # filter on the beginning of the line
>>> i <- pipe("grep -E '^(N053|N163)' test.txt")
>>> # or:
>>> # filter on the beginning of the given column
>>> # (use $2 for the second column, etc.)
>>> i <- pipe("awk '($1 ~ \"^(N053|N163)\")' test.txt")
>>> # or:
>>> # since your message is full of Unicode non-breaking spaces, I have to
>>> # bring in heavier machinery to handle those correctly;
>>> # only this solution manages to match full column values
>>> # (here you can also use $F[1] for second column and so on)
>>> i <- pipe("perl -CSD -F'\\s+' -lE \\
>>>    'print join qq{\\t}, @F if $F[0] =~ /^(N053|N163)$/' \\
>>>    test.txt
>>> ")
>>> lines <- read.table(i) # closes i when done
>>>
>>> The downside of this approach is having to shell-escape the command
>>> lines, which can become complicated, and choosing between use of regular
>>> expressions and more wordy programs (Unicode whitespace in the input
>>> doesn't help, either).
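
The shell-escaping problem described above can also be avoided by filtering inside R, at the cost of speed. A minimal sketch, assuming the sensor name sits at the start of each line and is followed by whitespace; the function name filter_lines and its chunk_size argument are hypothetical and not part of any package, and the demonstration values are made up:

```r
# Hypothetical helper: read a file in chunks and keep only the lines
# matching a regular expression, so the whole file is never held in
# memory at once (only the matching lines are accumulated).
filter_lines <- function(path, pattern, chunk_size = 10000L) {
  con <- file(path, open = "rt")
  on.exit(close(con))
  kept <- character(0)
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(chunk) == 0L) break          # end of file reached
    kept <- c(kept, chunk[grepl(pattern, chunk)])
  }
  kept
}

# Small demonstration on a temporary file (made-up values)
tmp <- tempfile(fileext = ".txt")
writeLines(c("N022 0.1 0.2", "N053 0.3 0.4",
             "N100 0.5 0.6", "N163 0.7 0.8"), tmp)

# sensors to keep, turned into an anchored alternation
sensors <- c("N053", "N163")
pattern <- paste0("^(", paste(sensors, collapse = "|"), ")[[:space:]]")

matched <- filter_lines(tmp, pattern, chunk_size = 2L)
result  <- read.table(text = matched, stringsAsFactors = FALSE)
print(result)
```

No external program or shell is involved, so the same code should behave identically on Windows and on Unix-like systems. If the input contains Unicode non-breaking spaces, [[:space:]] may not match them in every locale, and the pattern would need adjusting.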
>>>
>>
>> --
>> This email has been checked for viruses by Avast antivirus software.
>> https://www.avast.com/antivirus
>>
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.


