[R] readBin with connection of unknown compression type

William Dunlap wdunlap at tibco.com
Sat Oct 13 00:32:00 CEST 2018


I would like to use readBin to read parts of a compressed binary file whose
compression type is not known (e.g., a *.RData file, which may be compressed
with gzip, bzip2, or xz, or not compressed at all).

If I use
    con <- file("theFile", "r")
to create the connection then the compression type is detected;
summary(con)$class will give the compression type (or "file" if no
compression type is recognized).
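
A minimal, self-contained sketch of that detection, assuming a small
bzip2-compressed scratch file (the file and its contents are made up for
illustration and are not the files from the postscript):

```r
# Hypothetical scratch file; the name and contents are illustrative only.
f <- tempfile(fileext = ".bz2")
out <- bzfile(f, "wb")
writeBin(1:4, out)
close(out)

# Opening in text mode ("r") lets file() inspect the file's magic bytes.
con <- file(f, "r")
detected <- summary(con)$class
close(con)
print(detected)
```

For an uncompressed file the same code should report "file".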

However, I cannot use that connection in readBin because the decompressed
data that con produces is treated as text, not binary.  E.g., with the files
produced by the code in the postscript I get

> con <- file(df["bz", "binaryFile"], "r")
> dput(summary(con))
list(description = "C:\\Users\\wdunlap\\AppData\\Local\\Temp\\RtmpAlyXT6\\file472c73e710cd/binary.bz",
    class = "bzfile", mode = "r", text = "text", opened = "opened",
    `can read` = "yes", `can write` = "no")
> readBin(con, what="raw", n=8)
Error in readBin(con, what = "raw", n = 8) :
  can only read from a binary connection
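
The text/binary distinction shows up in summary(con)$text.  As a rough
sketch, assuming an uncompressed scratch file (made up for illustration),
the open mode alone decides whether readBin accepts the connection:

```r
# Hypothetical scratch file holding four integers in binary form.
f <- tempfile()
writeBin(as.integer(c(2, 3, 5, 7)), f)   # writeBin also accepts a file name

# Text mode: the connection is flagged as text and readBin refuses it.
con_text <- file(f, "r")
text_flag <- summary(con_text)$text
text_result <- tryCatch(readBin(con_text, what = "raw", n = 8),
                        error = function(e) conditionMessage(e))
close(con_text)

# Binary mode: the connection is flagged as binary and readBin works.
con_bin <- file(f, "rb")
bin_flag <- summary(con_bin)$text
bin_result <- readBin(con_bin, what = "integer", n = 4)
close(con_bin)

print(c(text_flag, bin_flag))
print(bin_result)
```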

I can read compressed text files with scan(file(.., "r")) and I don't have
to tell it what sort of compression was used:
> con <- file(df["bz", "textFile"], "r")
> scan(con, what="integer", n=4)
Read 4 items
[1] "2" "3" "5" "7"

I can read binary files with unknown compression by saving the class of the
connection returned by file(filename, "r"), mapping that to one of file,
bzfile, xzfile, or gzfile, and reopening the compressed file with "rb".  E.g.,

myBinaryFile <- function(filename) {
    con <- file(filename, "r")
    class <- summary(con)$class
    close(con)
    # rely on the class of a connection also being the name of its creator function
    con <- getFunction(class)(filename, "rb")
    con
}
> lapply(bn(df$binaryFile), FUN=function(f) {
+     con <- myBinaryFile(f)
+     on.exit(close(con))
+     tryCatch(readBin(con, what="raw", n=12),
+              error=function(e) conditionMessage(e))
+ })
$`binary.gz`
 [1] 02 00 00 00 03 00 00 00 05 00 00 00

$binary.bz
 [1] 02 00 00 00 03 00 00 00 05 00 00 00

$binary.xz
 [1] 02 00 00 00 03 00 00 00 05 00 00 00

$binary.uncompressed
 [1] 02 00 00 00 03 00 00 00 05 00 00 00

Is this repeated opening of the file required to read binary files of
unknown compression type, or did I miss a way to use readBin() with just
one call to a connection-creating function?
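
One candidate worth testing, going by the description of gzfile() in
?connections (which says it can also open for reading uncompressed files and
those compressed by bzip2, xz or lzma), is to call gzfile(filename, "rb")
regardless of how the file was written.  The sketch below is an assumption
to be verified, not a confirmed answer; the helper readFirstInts and the
scratch files are made up for illustration.

```r
vals <- as.integer(c(2, 3, 5, 7))
makers <- list(gz = gzfile, bz = bzfile, xz = xzfile, uncompressed = file)

# Write the same integers through each connection creator.
files <- vapply(names(makers), FUN.VALUE = NA_character_, FUN = function(nm) {
    f <- tempfile(pattern = paste0("binary_", nm, "_"))
    con <- makers[[nm]](f, "wb")
    on.exit(close(con))
    writeBin(vals, con)
    f
})

# Hypothetical helper: always open with gzfile() in binary read mode.
readFirstInts <- function(filename, n) {
    con <- gzfile(filename, "rb")
    on.exit(close(con))
    readBin(con, what = "integer", n = n)
}

results <- lapply(files, readFirstInts, n = 4)
print(results)
```

If every element of results matches vals, a single gzfile() call would
avoid the open, close, and reopen sequence in myBinaryFile.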

Bill Dunlap
TIBCO Software
wdunlap tibco.com

Code to produce compressed binary and text files:
    tdata <- as.integer(c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
        43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109,
        113, 127, 131, 137))
    dir.create(tdir <- tempfile())
    bn <- function(filename) structure(filename, names=basename(filename))
    df <- data.frame(conMaker = I(list(gz = gzfile, bz = bzfile, xz = xzfile,
        uncompressed = file)))
    df$binaryFile <- vapply(rownames(df), FUN.VALUE=NA_character_, FUN=function(nm) {
        con <- df[[nm, "conMaker"]]( file <- file.path(tdir, paste(sep=".", "binary", nm)), "wb")
        on.exit(close(con))
        writeBin(tdata, con)
        file
    })
    df$textFile <- vapply(rownames(df), FUN.VALUE=NA_character_, FUN=function(nm) {
        con <- df[[nm, "conMaker"]]( file <- file.path(tdir, paste(sep=".", "text", nm)), "wt")
        on.exit(close(con))
        cat(tdata, sep="\n", file=con)
        file
    })
