[R] Accelerating binRead

Michael Sumner mdsumner at gmail.com
Mon Sep 19 00:41:53 CEST 2016


Thanks Henrik, that's it. FWIW I found this old post too; I am still
surprised this doesn't seem to get used much(?). It's a "neat trick" for
reading row-wise binary data without compiled code.

http://cyclemumner.blogspot.com.au/2010/06/read-las-data-with-r.html?m=1
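
For reference, a minimal self-contained sketch of the trick, assuming
little-endian records of (4-byte integer, 4-byte float, 2-byte integer) as in
Henrik's outline below. The temporary file, the sample values and the variable
names are made up for illustration; this is not the code from the post above.

```r
# Write three records of (4-byte int, 4-byte float, 2-byte int) to a temp file.
path <- tempfile(fileext = ".bin")
i_in <- c(1L, 2L, 3L)
f_in <- c(3.5, 13.25, 45.75)   # chosen to be exactly representable as floats
s_in <- c(10L, 20L, 30L)
con <- file(path, "wb")
for (k in seq_along(i_in)) {
  writeBin(i_in[k], con, size = 4L, endian = "little")
  writeBin(f_in[k], con, size = 4L, endian = "little")
  writeBin(s_in[k], con, size = 2L, endian = "little")
}
close(con)

# Read the whole file as raw bytes and lay it out with one column per record.
bytes <- readBin(path, what = "raw", n = file.size(path))
rec_len <- 4L + 4L + 2L
stopifnot(length(bytes) %% rec_len == 0L)
m <- matrix(bytes, nrow = rec_len)
K <- ncol(m)

# Decode each field in one vectorised call; n must be given explicitly,
# because readBin() reads a single value by default.
dat <- data.frame(
  i = readBin(as.vector(m[1:4, ]),  what = "integer", size = 4L, n = K,
              endian = "little"),
  f = readBin(as.vector(m[5:8, ]),  what = "double",  size = 4L, n = K,
              endian = "little"),
  s = readBin(as.vector(m[9:10, ]), what = "integer", size = 2L, n = K,
              endian = "little")
)
```

Each field costs one readBin() call however many records the file holds,
which is where the gain over a per-value loop comes from.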

Also you should look at Paul Murrell's hexView package, and associated R
Journal paper.

Cheers, Mike

On Mon, 19 Sep 2016, 02:20 Henrik Bengtsson <henrik.bengtsson at gmail.com>
wrote:

> I second Mike's proposal - it works, e.g.
>
> https://github.com/HenrikBengtsson/affxparser/blob/5bf1a9162904c56d59c4735a8d7eb427e4f085e4/R/readCcg.R#L535-L583
>
> Here's an outline. Say each row consists of a tuple (iiii=4-byte
> integer, ffff=4-byte float, ss=2-byte integer) so that the
> byte-by-byte content of your file looks like this:
>
>   iiiiffffss
>   iiiiffffss
>   iiiiffffss
>   ...
>   iiiiffffss
>
> Then read this as raw bytes (file_size can also be a very large
> number in case it's unknown):
>
>   raw <- readBin(con, what="raw", n=file_size)
>
> Turn into a (4+4+2)-by-K raw matrix:
>
>   raw <- matrix(raw, nrow=4+4+2)
>
> so that your raw bytes have the following layout:
>
>   iii ... i
>   iii ... i
>   iii ... i
>   iii ... i
>   fff ... f
>   fff ... f
>   fff ... f
>   fff ... f
>   sss ... s
>   sss ... s
>
> Then extract the three submatrices of interest:
>
>   iiii <- raw[1:4,]
>   ffff <- raw[5:8,]
>   ss <- raw[9:10,]
>
> Here you can discard raw, i.e. rm(list="raw").
>
> Since R stores matrices in a column-by-column order internally, your
> bytes are already in the proper order.  Finally, re-read these with
> appropriate readBin() settings (n is needed because readBin() reads
> only one value by default), e.g.
>
>   i <- readBin(iiii, what="integer", size=4L, n=ncol(iiii))
>   f <- readBin(ffff, what="double", size=4L, n=ncol(ffff))
>   s <- readBin(ss, what="integer", size=2L, n=ncol(ss))
>
> Put into a K-by-3 data.frame:
>
>   data <- data.frame(i=i, f=f, s=s)
>
> /Henrik
>
> On Sun, Sep 18, 2016 at 8:02 AM, Philippe de Rochambeau <phiroc at free.fr>
> wrote:
> > I would gladly examine your example, Mike.
> > Cheers,
> > Philippe
> >
> >> On 18 Sep 2016, at 16:05, Michael Sumner <mdsumner at gmail.com> wrote:
> >>
> >>
> >>
> >>> On Sun, 18 Sep 2016, 19:04 Philippe de Rochambeau <phiroc at free.fr> wrote:
> >>> Please find below code that attempts to read ints, longs and floats
> >>> from a binary file (a simplification of my original program).
> >>> Please disregard the R inefficiencies, such as using rbind, for now.
> >>> I’ve also included Java code to generate the binary file.
> >>> The output shows that, at one point, anInt becomes undefined.
> >>> Unfortunately, I couldn’t find the correct R function to determine whether
> >>> anInt is undefined or not, as is.null, is.nan, and is.infinite don’t work.
> >>> Any help would be much appreciated.
> >>> Many thanks in advance.
> >>> Philippe
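
On the "undefined" anInt: a minimal sketch, assuming the empty value is a
zero-length vector, which is what readBin() returns once the connection is
exhausted (the "Integer,0" row in the output below points the same way). In
that case is.null() is FALSE and is.nan()/is.infinite() return logical(0), so
length() is the thing to test. The temporary file and the read_ints() helper
are made up for illustration.

```r
# Hypothetical helper: read 4-byte big-endian ints until the file runs out.
read_ints <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  out <- integer(0)
  repeat {
    x <- readBin(con, what = "integer", size = 4L, n = 1L, endian = "big")
    if (length(x) == 0L) break   # zero-length result: nothing left to read
    out <- c(out, x)             # growing a vector is slow; fine for a demo
  }
  out
}

# Write three big-endian ints to a temp file, then read them back.
path <- tempfile(fileext = ".bin")
writeBin(1:3, path, size = 4L, endian = "big")
vals <- read_ints(path)
```

The same length() check would work as the exit condition of a while loop that
reads one record per iteration.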
> >>>
> >>> ———————
> >>> [1] "anInt = 1"
> >>> [1] "is.null  FALSE"
> >>> [1] "is.nan  FALSE"
> >>> [1] "is.infinite  FALSE"
> >>> [1] "aLong = 2"
> >>> [1] "aFloat = 3.44440007209778"
> >>> [1] "--------------------------"
> >>> [1] "anInt = 2"
> >>> [1] "is.null  FALSE"
> >>> [1] "is.nan  FALSE"
> >>> [1] "is.infinite  FALSE"
> >>> [1] "aLong = 22"
> >>> [1] "aFloat = 13.4644002914429"
> >>> [1] "--------------------------"
> >>> [1] "anInt = 3"
> >>> [1] "is.null  FALSE"
> >>> [1] "is.nan  FALSE"
> >>> [1] "is.infinite  FALSE"
> >>> [1] "aLong = 55"
> >>> [1] "aFloat = 45.4444007873535"
> >>> [1] "--------------------------"
> >>> [1] "anInt = "
> >>> [1] "is.null  FALSE"
> >>> [1] "is.nan  "
> >>> [1] "is.infinite  "
> >>> [1] "aLong = "
> >>> [1] "aFloat = "
> >>> [1] "--------------------------"
> >>>      [,1]      [,2]      [,3]
> >>> [1,] 1         2         3.4444
> >>> [2,] 2         22        13.4644
> >>> [3,] 3         55        45.4444
> >>> [4,] Integer,0 Integer,0 Numeric,0
> >>> >
> >>>
> >>> -----------
> >>>
> >>>
> >>> —————————————————————
> >>>
> >>> readFile <- function(inputPath) {
> >>>   URL <- file(inputPath, "rb")
> >>>   PLT <- matrix(nrow=0, ncol=3)
> >>>   counte <- 0
> >>>   max <- 4
> >>>   while (counte < max) {
> >>>     anInt <- readBin(con=URL, what=integer(), size=4, n=1, endian="big")
> >>>     print(paste("anInt =", anInt))
> >>>     #if (! (anInt == 0)) { print(paste("empty int")); break }
> >>>     print(paste("is.null ", is.null(anInt)))
> >>>     print(paste("is.nan ", is.nan(anInt)))
> >>>     print(paste("is.infinite ", is.infinite(anInt)))
> >>>     aLong <- readBin(URL, integer(), size=8, n=1, endian="big")
> >>>     print(paste("aLong =", aLong))
> >>>     aFloat <- readBin(URL, numeric(), size=4, n=1, endian="big")
> >>>     print(paste("aFloat =", aFloat))
> >>>     print("--------------------------")
> >>>     PLT <- rbind(PLT, list(anInt, aLong, aFloat))
> >>>     counte <- counte + 1
> >>>   } # end while
> >>>   close(URL)
> >>>   PLT
> >>> }
> >>> fichier <- "/Users/philippe/Desktop/datatests/data0.bin"
> >>> PLT2 <- readFile(fichier)
> >>> print(PLT2)
> >>> —————————————————————
> >>>
> >>> import java.io.*;
> >>>
> >>> public class Main {
> >>>
> >>>         Main() {
> >>>                 writeData();
> >>>         }
> >>>
> >>>         public static void main(String[] args) {
> >>>                 new Main();
> >>>         }
> >>>
> >>>         public void writeData() {
> >>>
> >>>                 final String path = "/Users/philippe/Desktop/datatests/data0.bin";
> >>>
> >>>                 DataOutputStream dos;
> >>>                 try {
> >>>                         dos = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(path)));
> >>>                         // big endian write! ("high byte first"), see
> >>>                         // https://docs.oracle.com/javase/7/docs/api/java/io/DataOutputStream.html
> >>>                         dos.writeInt(1);
> >>>                         dos.writeLong(2L);
> >>>                         dos.writeFloat(3.4444F);
> >>>
> >>>                         dos.writeInt(2);
> >>>                         dos.writeLong(22L);
> >>>                         dos.writeFloat(13.4644F);
> >>>
> >>>                         dos.writeInt(3);
> >>>                         dos.writeLong(55L);
> >>>                         dos.writeFloat(45.4444F);
> >>>
> >>>                         dos.close();
> >>>                 } catch (FileNotFoundException e) {
> >>>                         e.printStackTrace();
> >>>                 } catch (IOException ioe) {
> >>>                         ioe.printStackTrace();
> >>>                 }
> >>>
> >>>         }
> >>>
> >>> }
> >>>
> >>>
> >>> —————————————————————
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> > On 17 Sep 2016, at 20:45, Philippe de Rochambeau <phiroc at free.fr> wrote:
> >>> >
> >>> > Hi Jim,
> >>> > this is exactly the answer I was looking for. Many thanks. I didn’t know
> >>> > R had a pack function, as in Perl.
> >>> > To answer your earlier question, I am trying to update legacy code
> >>> > to read a binary file of unknown size, over a network, slice it up into
> >>> > rows each containing an integer, an integer, a long, a short, a float and a
> >>> > float, and stuff the rows into a matrix.
> >>
> >>
> >>
> >> It's possible to read all rows fast as raw(), then parse in a
> vectorised way with matrix indexing to group the bytes appropriately. There
> is an example on the mailing list somewhere, but otherwise I can show an
> example if that's of interest.
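
A sketch of what that could look like for the row layout described above,
assuming 26-byte little-endian records of int, int, 8-byte long (stored as low
dword then high dword, as in PLTreader further down), short, float, float. The
helper names u32() and decode_plt() and the test record are hypothetical. The
unsigned 32-bit halves are rebuilt from unsigned 16-bit reads, because
readBin() only honours signed=FALSE for sizes 1 and 2.

```r
rec_len <- 26L  # 4 + 4 + 8 + 2 + 4 + 4 bytes (assumed layout)

# Hypothetical helper: decode a 4 x K raw matrix as K unsigned 32-bit values.
u32 <- function(m) {
  h <- readBin(as.vector(m), what = "integer", size = 2L, signed = FALSE,
               n = 2L * ncol(m), endian = "little")
  h <- matrix(h, nrow = 2L)      # row 1: low 16 bits, row 2: high 16 bits
  h[1L, ] + 65536 * h[2L, ]
}

# Hypothetical helper: turn a raw vector of whole records into a K x 6 matrix.
decode_plt <- function(bytes) {
  stopifnot(length(bytes) %% rec_len == 0L)
  m <- matrix(bytes, nrow = rec_len)   # one column per record
  K <- ncol(m)
  field <- function(rows, what, size) {
    readBin(as.vector(m[rows, , drop = FALSE]), what = what, size = size,
            n = K, endian = "little")
  }
  lo <- u32(m[9:12, , drop = FALSE])
  hi <- u32(m[13:16, , drop = FALSE])
  cbind(periodIndex = field(1:4, "integer", 4L),
        eventId     = field(5:8, "integer", 4L),
        eventDate   = (hi * 2^32 + lo) / 1000,
        repNum      = field(17:18, "integer", 2L),
        exp         = field(19:22, "double", 4L),
        loss        = field(23:26, "double", 4L))
}

# One made-up record, repeated twice, built in memory with writeBin().
rec <- c(writeBin(1L, raw(), size = 4L, endian = "little"),
         writeBin(7L, raw(), size = 4L, endian = "little"),
         as.raw(c(255, 255, 255, 255)),   # low dword  = 2^32 - 1
         as.raw(c(1, 0, 0, 0)),           # high dword = 1
         writeBin(2L, raw(), size = 2L, endian = "little"),
         writeBin(1.5, raw(), size = 4L, endian = "little"),
         writeBin(2.25, raw(), size = 4L, endian = "little"))
plt <- decode_plt(c(rec, rec))
```

For a real file, bytes would come from a single readBin(con, what="raw", n=...)
call on the connection.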
> >>
> >>
> >> Cheers, Mike
> >>
> >>
> >>> > Best regards,
> >>> > Philippe
> >>> >
> >>> >> On 17 Sep 2016, at 20:38, jim holtman <jholtman at gmail.com> wrote:
> >>> >>
> >>> >> Here is an example of how to do it:
> >>> >>
> >>> >> x <- 1:10  # integer values
> >>> >> xf <- seq(1.0, 2, by = 0.1)  # floating point
> >>> >>
> >>> >> setwd("d:/temp")
> >>> >>
> >>> >> # create file to write to
> >>> >> output <- file('integer.bin', 'wb')
> >>> >> writeBin(x, output)  # write integer
> >>> >> writeBin(xf, output)  # write reals
> >>> >> close(output)
> >>> >>
> >>> >>
> >>> >> library(pack)
> >>> >> library(readr)
> >>> >>
> >>> >> # read all the data at once
> >>> >> allbin <- read_file_raw('integer.bin')
> >>> >>
> >>> >> # decode the data into a list
> >>> >> (result <- unpack("V V V V V V V V V V d d d d d d d d d d", allbin))
> >>> >>
> >>> >>
> >>> >>
> >>> >>
> >>> >> Jim Holtman
> >>> >> Data Munger Guru
> >>> >>
> >>> >> What is the problem that you are trying to solve?
> >>> >> Tell me what you want to do, not how you want to do it.
> >>> >>
> >>> >> On Sat, Sep 17, 2016 at 11:04 AM, Ismail SEZEN <sezenismail at gmail.com> wrote:
> >>> >> I noticed the same issue but didn't care much :)
> >>> >>
> >>> >> On Sat, Sep 17, 2016, 18:01 jim holtman <jholtman at gmail.com> wrote:
> >>> >> Your example was not reproducible.  Also how do you "break" out of the
> >>> >> "while" loop?
> >>> >>
> >>> >>
> >>> >>
> >>> >> On Sat, Sep 17, 2016 at 8:05 AM, Philippe de Rochambeau <phiroc at free.fr> wrote:
> >>> >>
> >>> >>> Hello,
> >>> >>> the following function, which stores numeric values extracted from a
> >>> >>> binary file into an R matrix, is very slow, especially when the file
> >>> >>> is several MB in size.
> >>> >>> Should I rewrite the function in inline C or in C/C++ using Rcpp? If the
> >>> >>> latter, how do you "readBin" in Rcpp (I’m a total Rcpp newbie)?
> >>> >>> Many thanks.
> >>> >>> Best regards,
> >>> >>> phiroc
> >>> >>>
> >>> >>>
> >>> >>> -------------
> >>> >>>
> >>> >>> # inputPath is something like http://myintranet/getData?pathToFile=/usr/lib/xxx/yyy/data.bin
> >>> >>>
> >>> >>> PLTreader <- function(inputPath){
> >>> >>>        URL <- file(inputPath, "rb")
> >>> >>>        PLT <- matrix(nrow=0, ncol=6)
> >>> >>>        compteurDePrints = 0
> >>> >>>        compteurDeLignes <- 0
> >>> >>>        maxiPrints = 5
> >>> >>>        displayData <- FALSE
> >>> >>>        while (TRUE) {
> >>> >>>                periodIndex <- readBin(URL, integer(), size=4, n=1, endian="little") # int (4 bytes)
> >>> >>>                eventId <- readBin(URL, integer(), size=4, n=1, endian="little") # int (4 bytes)
> >>> >>>                dword1 <- readBin(URL, integer(), size=4, signed=FALSE, n=1, endian="little") # int
> >>> >>>                dword2 <- readBin(URL, integer(), size=4, signed=FALSE, n=1, endian="little") # int
> >>> >>>                # readBin() ignores signed=FALSE for size=4, so map a negative value back to unsigned
> >>> >>>                if (dword1 < 0) {
> >>> >>>                        dword1 = dword1 + 2^32;
> >>> >>>                }
> >>> >>>                eventDate = (dword2*2^32 + dword1)/1000
> >>> >>>                repNum <- readBin(URL, integer(), size=2, n=1, endian="little") # short (2 bytes)
> >>> >>>                exp <- readBin(URL, numeric(), size=4, n=1, endian="little") # float (4 bytes, strangely enough, would expect 8)
> >>> >>>                loss <- readBin(URL, numeric(), size=4, n=1, endian="little") # float (4 bytes)
> >>> >>>                PLT <- rbind(PLT, c(periodIndex, eventId, eventDate, repNum, exp, loss))
> >>> >>>        } # end while
> >>> >>>        close(URL)
> >>> >>>        return(PLT)
> >>> >>> }
> >>> >>>
> >>> >>> ----------------
> >>> >>>        [[alternative HTML version deleted]]
> >>> >>>
> >>> >>> ______________________________________________
> >>> >>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>> >>> and provide commented, minimal, self-contained, reproducible code.
>
-- 
Dr. Michael Sumner
Software and Database Engineer
Australian Antarctic Division
203 Channel Highway
Kingston Tasmania 7050 Australia
