[Rd] Fwd: Process substitution and read.table/scan

Elena Grassi grassi.e at gmail.com
Sat Apr 6 11:44:23 CEST 2013


Maybe after all this is more appropriate on r-devel, so I'm forwarding it here.

---------- Forwarded message ----------
From: Elena Grassi <grassi.e at gmail.com>
Date: Wed, Apr 3, 2013 at 2:19 PM
Subject: Process substitution and read.table/scan
To: r-help at r-project.org


Hello, I asked the same question on Stack Overflow
(http://stackoverflow.com/questions/15784373/process-substitution) but
did not completely understand the issue, so I'm reporting it here:

"
I looked around for an explanation of what puzzles me and only found this:
http://stackoverflow.com/questions/4274171/do-some-programs-not-accept-process-substitution-for-input-files

which partially helps, but I would really like to understand the
full story. I noticed that some of my R scripts give different (i.e.
wrong) results when I use process substitution.

I tried to pinpoint the problem with a test case:

This script:

#!/usr/bin/Rscript

args <- commandArgs(TRUE)
file <- args[1]
cat(file)
cat("\n")
data <- read.table(file, header=F)
cat(mean(data$V1))
cat("\n")

with an input file generated in this way:

$ for i in `seq 1 10`; do echo $i >> p; done
$ for i in `seq 1 500`; do cat p >> test; done

leads me to this:

$ ./mean.R test
test
5.5

$ ./mean.R <(cat test)
/dev/fd/63
5.501476

Further tests reveal that some lines are lost, but I would like to
understand why. Does read.table use seek? (scan gives the same
results.)

P.S. With a smaller test file (100), an error is reported:

$ ./mean.R <(cat test3)
/dev/fd/63
Error in read.table(file, header = F) : no lines available in input
Execution halted
"

Other notes: with a modified script that uses scan the results are the same.
Printing the whole data.frame results in 5001 lines in the first case
(which is correct) and only 3050 with process substitution.
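
The reported numbers can be checked against one hypothesis with a little
arithmetic. The sketch below assumes (this is an assumption, not
something established in this message) that the first 4096 bytes of the
pipe are consumed before parsing starts, and that the printed line
counts include one header line. Under those assumptions it rebuilds the
test data in memory, drops the first 4096 bytes, and computes the row
count and mean of what remains.

```r
# Test file layout from the message: the numbers 1..10, one per line,
# repeated 500 times (5000 rows in total).
block <- paste0(1:10, "\n")
block_bytes <- sum(nchar(block))   # 9 * 2 + 3 = 21 bytes per block
stream <- paste(rep(block, 500), collapse = "")

# ASSUMPTION (hypothetical, not confirmed in this thread): the first
# 4096 bytes of the pipe are read and discarded before parsing begins.
lost_bytes <- 4096
remainder <- substring(stream, lost_bytes + 1)

# Parse what is left, the way scan would see it (blank lines are skipped).
values <- scan(text = remainder, quiet = TRUE)

print(length(values))                        # 3049 rows
cat(format(mean(values), digits = 7), "\n")  # 5.501476
```

Under these assumptions the sketch gives 3049 rows (3050 printed lines
with a header) and a mean that prints as 5.501476, which matches the
figures reported above. If test3 was built with 100 repetitions it would
be 2100 bytes, smaller than the assumed chunk, which would leave nothing
to read.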

I checked the read.table source code and saw that it moves around in
the file to check column types and so on. I thought this might explain
the problem, but I would prefer an error message to a result computed
from partial data. Then someone on Stack Overflow pointed me to fifo(),
which solves the problem (i.e. the mean is reported correctly even with
process substitution), and so I'm even more puzzled: does fifo() allow
seeking and peeking around a named pipe?
I'm willing to read the relevant code to understand what's really
happening (and even to help, if someone thinks this issue could be a
small bug), but I would really appreciate some pointers.
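
For reference, the fifo() variant mentioned above can be sketched
roughly as follows. This is a minimal sketch, not the exact code from
the Stack Overflow answer: the helper name mean_via_fifo and the
blocking = TRUE setting are assumptions made for illustration, and it
relies on fifo() being available on the platform (see
capabilities("fifo")).

```r
#!/usr/bin/Rscript

# Hypothetical helper (name chosen for illustration): open the path as
# an explicit fifo connection and hand the open connection to
# read.table, instead of passing the file name.
mean_via_fifo <- function(path) {
  # blocking = TRUE is an assumption made here so that reads wait for
  # the writing process instead of returning early.
  con <- fifo(path, open = "r", blocking = TRUE)
  on.exit(close(con))
  data <- read.table(con, header = FALSE)
  mean(data$V1)
}

# Same command-line handling as the original test script.
args <- commandArgs(TRUE)
if (length(args) >= 1) {
  cat(args[1], "\n")
  cat(mean_via_fifo(args[1]), "\n")
}
```

It would be invoked the same way as the original script, for example
./mean.R <(cat test).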

Here is the sessionInfo() output and other possibly relevant details:
> sessionInfo()
R version 3.0.0 beta (2013-03-23 r62384)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
 [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8
 [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8
 [7] LC_PAPER=C                LC_NAME=C
 [9] LC_ADDRESS=C              LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

$ uname -a
Linux femto 3.6-trunk-amd64 #1 SMP Debian 3.6.9-1~experimental.1 x86_64 GNU/Linux

I use the Debian R package: r-base-core, 3.0.0~20130324-1

Thanks,
Elena Grassi

P.S.
I started on R-help since this could be of general interest;
sorry if that was a bad call.
--
$ pom

