[R] Accelerating binRead

Henrik Bengtsson henrik.bengtsson at gmail.com
Sun Sep 18 18:20:01 CEST 2016


I second Mike's proposal - it works, e.g.
https://github.com/HenrikBengtsson/affxparser/blob/5bf1a9162904c56d59c4735a8d7eb427e4f085e4/R/readCcg.R#L535-L583

Here's an outline. Say each row consists of a tuple (iiii = 4-byte
integer, ffff = 4-byte float, ss = 2-byte integer), so that the
byte-by-byte content of your file looks like this:

  iiiiffffss
  iiiiffffss
  iiiiffffss
  ...
  iiiiffffss

Then read this as raw bytes (file_size can also be a very large
number in case it's unknown):

  raw <- readBin(con, what="raw", n=file_size)
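
If the size really is unknown (e.g. when reading from a network
connection, as in the original question), one alternative to passing a
huge n is to read in chunks until nothing is left. A minimal sketch;
read_all_raw() and chunk_size are hypothetical names, not part of base
R. For a local file, file.size(path) gives the exact size instead.

```r
# Hypothetical helper: read every remaining byte from an open binary
# connection, chunk_size bytes at a time.
read_all_raw <- function(con, chunk_size = 65536L) {
  chunks <- list()
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break          # nothing left to read
    chunks[[length(chunks) + 1L]] <- chunk
  }
  if (length(chunks)) unlist(chunks) else raw(0)
}

# Usage on a temporary file with made-up content
path <- tempfile(fileext = ".bin")
bytes <- as.raw(rep_len(0:255, 1000))
writeBin(bytes, path)

con <- file(path, "rb")
raw_all <- read_all_raw(con, chunk_size = 64L)
close(con)
```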

Turn into a (4+4+2)-by-K raw matrix:

  raw <- matrix(raw, nrow=4+4+2)

so that your raw bytes have the following layout:

  iii ... i
  iii ... i
  iii ... i
  iii ... i
  fff ... f
  fff ... f
  fff ... f
  fff ... f
  sss ... s
  sss ... s
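
To see this layout on something small, here is a sketch with three
made-up 10-byte records whose byte values are simply 1..30 (purely
illustrative data):

```r
# Three hypothetical 10-byte records; byte values 1..30 stand in for real data
bytes <- as.raw(1:30)
m <- matrix(bytes, nrow = 4 + 4 + 2)

# matrix() fills column by column, so each column holds one complete record
rec2 <- m[, 2]          # all 10 bytes of the second record
int_bytes <- m[1:4, ]   # the "iiii" bytes of every record, one column each
print(m)
```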

Then extract the three submatrices of interest:

  iiii <- raw[1:4,]
  ffff <- raw[5:8,]
  ss <- raw[9:10,]

Here you can discard raw, i.e. rm(list="raw").

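Since R stores matrices in column-by-column order internally, your
bytes are already in the proper order.  Finally, re-read these with
appropriate readBin() settings. Note that n must be given, since
readBin() reads only one value by default, and that endian= is needed
if the file's byte order differs from your platform's, e.g.

  K <- ncol(iiii)
  i <- readBin(iiii, what="integer", size=4L, n=K)
  f <- readBin(ffff, what="double", size=4L, n=K)
  s <- readBin(ss, what="integer", size=2L, n=K)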
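
Fields wider than R's 32-bit integer, such as the 8-byte long in the
original question, need a little more care. One possible approach,
sketched here under the assumption that the values are non-negative,
little-endian and below 2^53 (so that they fit exactly in a double), is
to read each value as four unsigned 16-bit pieces and combine them. The
bytes below are made up for illustration:

```r
# Three hypothetical 8-byte little-endian values: 1, 2^32 + 5 and 2^32 - 1
bytes <- as.raw(c(0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
                  0x05, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00,
                  0xff, 0xff, 0xff, 0xff, 0x00, 0x00, 0x00, 0x00))
llll <- matrix(bytes, nrow = 8)
K <- ncol(llll)

# signed=FALSE is only supported for sizes 1 and 2, hence 16-bit pieces
pieces <- readBin(llll, what = "integer", size = 2L, n = 4L * K,
                  signed = FALSE, endian = "little")
pieces <- matrix(pieces, nrow = 4)   # one column of four pieces per value

# Weight the pieces by 2^0, 2^16, 2^32, 2^48 and sum within each column
value <- colSums(pieces * 2^(16 * (0:3)))
```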

Put into a K-by-3 data.frame:

  data <- data.frame(i=i, f=f, s=s)
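
The whole outline can be tried end to end on a small temporary file.
This is only a sketch: the file, the values and the names i_in, f_in
and s_in are made up, and both writing and reading use the platform's
native byte order (for a file written by Java's DataOutputStream you
would pass endian="big" to readBin). n is passed explicitly because
readBin() reads a single value by default.

```r
# Write K hypothetical records of (4-byte int, 4-byte float, 2-byte int)
path <- tempfile(fileext = ".bin")
i_in <- 1:5
f_in <- c(0.5, 1.5, 2.25, -3, 100)      # exactly representable as floats
s_in <- c(10L, -20L, 30L, 40L, 32767L)

con <- file(path, "wb")
for (k in seq_along(i_in)) {
  writeBin(i_in[k], con, size = 4L)
  writeBin(f_in[k], con, size = 4L)
  writeBin(s_in[k], con, size = 2L)
}
close(con)

# Read everything as raw bytes and reshape to one column per record
raw <- readBin(path, what = "raw", n = file.size(path))
raw <- matrix(raw, nrow = 4 + 4 + 2)
K <- ncol(raw)

# Decode each field for all records at once
i <- readBin(raw[1:4, ],  what = "integer", size = 4L, n = K)
f <- readBin(raw[5:8, ],  what = "double",  size = 4L, n = K)
s <- readBin(raw[9:10, ], what = "integer", size = 2L, n = K)

data <- data.frame(i = i, f = f, s = s)
print(data)
```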

/Henrik

On Sun, Sep 18, 2016 at 8:02 AM, Philippe de Rochambeau <phiroc at free.fr> wrote:
> I would gladly examine your example, Mike.
> Cheers,
> Philippe
>
>> On 18 Sep 2016, at 16:05, Michael Sumner <mdsumner at gmail.com> wrote:
>>
>>
>>
>>> On Sun, 18 Sep 2016, 19:04 Philippe de Rochambeau <phiroc at free.fr> wrote:
>>> Please find below code that attempts to read ints, longs and floats from a binary file (which is a simplification of my original program).
>>> Please disregard the R inefficiencies, such as using rbind, for now.
>>> I’ve also included Java code to generate the binary file.
>>> The output shows that, at one point, anInt becomes undefined. Unfortunately, I couldn’t find the correct R function to determine whether anInt is undefined or not, as is.null, is.nan, and is.infinite don’t work.
>>> Any help would be much appreciated.
>>> Many thanks in advance.
>>> Philippe
>>>
>>> ———————
>>> [1] "anInt = 1"
>>> [1] "is.null  FALSE"
>>> [1] "is.nan  FALSE"
>>> [1] "is.infinite  FALSE"
>>> [1] "aLong = 2"
>>> [1] "aFloat = 3.44440007209778"
>>> [1] "--------------------------"
>>> [1] "anInt = 2"
>>> [1] "is.null  FALSE"
>>> [1] "is.nan  FALSE"
>>> [1] "is.infinite  FALSE"
>>> [1] "aLong = 22"
>>> [1] "aFloat = 13.4644002914429"
>>> [1] "--------------------------"
>>> [1] "anInt = 3"
>>> [1] "is.null  FALSE"
>>> [1] "is.nan  FALSE"
>>> [1] "is.infinite  FALSE"
>>> [1] "aLong = 55"
>>> [1] "aFloat = 45.4444007873535"
>>> [1] "--------------------------"
>>> [1] "anInt = "
>>> [1] "is.null  FALSE"
>>> [1] "is.nan  "
>>> [1] "is.infinite  "
>>> [1] "aLong = "
>>> [1] "aFloat = "
>>> [1] "--------------------------"
>>>      [,1]      [,2]      [,3]
>>> [1,] 1         2         3.4444
>>> [2,] 2         22        13.4644
>>> [3,] 3         55        45.4444
>>> [4,] Integer,0 Integer,0 Numeric,0
>>> >
>>>
>>> -----------
>>>
>>>
>>> —————————————————————
>>>
>>> readFile <- function(inputPath) {
>>>   URL <- file(inputPath, "rb")
>>>   PLT <- matrix(nrow=0, ncol=3)
>>>   counte <- 0
>>>   max <- 4
>>>   while (counte < max) {
>>>     anInt <- readBin(con=URL, what=integer(), size=4, n=1, endian="big")
>>>     print(paste("anInt =", anInt))
>>>     #if (! (anInt == 0)) { print(paste("empty int")); break }
>>>     print(paste("is.null ", is.null(anInt)))
>>>     print(paste("is.nan ", is.nan(anInt)))
>>>     print(paste("is.infinite ", is.infinite(anInt)))
>>>     aLong <- readBin(URL, integer(), size=8, n=1, endian="big")
>>>     print(paste("aLong =", aLong))
>>>     aFloat <- readBin(URL, numeric(), size=4, n=1, endian="big")
>>>     print(paste("aFloat =", aFloat))
>>>     print("--------------------------")
>>>     PLT <- rbind(PLT, list(anInt, aLong, aFloat))
>>>     counte <- counte + 1
>>>   } # end while
>>>   close(URL)
>>>   PLT
>>> }
>>> fichier <- "/Users/philippe/Desktop/datatests/data0.bin"
>>> PLT2 <- readFile(fichier)
>>> print(PLT2)
>>> —————————————————————
>>>
>>> import java.io.*;
>>>
>>> public class Main {
>>>
>>>         Main() {
>>>                 writeData();
>>>         }
>>>
>>>         public static void main(String[] args) {
>>>                 new Main();
>>>         }
>>>
>>>         public void writeData() {
>>>
>>>                 final String path = "/Users/philippe/Desktop/datatests/data0.bin";
>>>
>>>                 DataOutputStream dos;
>>>                 try {
>>>                         dos = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(path)));
>>>                         // big endian write! ("high byte first") , see https://docs.oracle.com/javase/7/docs/api/java/io/DataOutputStream.html
>>>                         dos.writeInt(1);
>>>                         dos.writeLong(2L);
>>>                         dos.writeFloat(3.4444F);
>>>
>>>                         dos.writeInt(2);
>>>                         dos.writeLong(22L);
>>>                         dos.writeFloat(13.4644F);
>>>
>>>                         dos.writeInt(3);
>>>                         dos.writeLong(55L);
>>>                         dos.writeFloat(45.4444F);
>>>
>>>                         dos.close();
>>>                 } catch (FileNotFoundException e) {
>>>                         e.printStackTrace();
>>>                 } catch (IOException ioe) {
>>>                         ioe.printStackTrace();
>>>                 }
>>>
>>>         }
>>>
>>> }
>>>
>>>
>>> —————————————————————
>>>
>>>
>>>
>>>
>>>
>>>
>>> > On 17 Sep 2016, at 20:45, Philippe de Rochambeau <phiroc at free.fr> wrote:
>>> >
>>> > Hi Jim,
>>> > this is exactly the answer I was looking for. Many thanks. I didn’t know R had a pack function, as in Perl.
>>> > To answer your earlier question, I am trying to update legacy code to read a binary file of unknown size over a network, slice it up into rows each containing an integer, an integer, a long, a short, a float and a float, and stuff the rows into a matrix.
>>
>>
>>
>> It's possible to read all rows fast as raw(), then parse in a vectorised way with matrix indexing to group the bytes appropriately. There is an example on the mailing list somewhere, but otherwise I can show an example if that's of interest.
>>
>>
>> Cheers, Mike
>>
>>
>>> > Best regards,
>>> > Philippe
>>> >
>>> >> On 17 Sep 2016, at 20:38, jim holtman <jholtman at gmail.com> wrote:
>>> >>
>>> >> Here is an example of how to do it:
>>> >>
>>> >> x <- 1:10  # integer values
>>> >> xf <- seq(1.0, 2, by = 0.1)  # floating point
>>> >>
>>> >> setwd("d:/temp")
>>> >>
>>> >> # create file to write to
>>> >> output <- file('integer.bin', 'wb')
>>> >> writeBin(x, output)  # write integer
>>> >> writeBin(xf, output)  # write reals
>>> >> close(output)
>>> >>
>>> >>
>>> >> library(pack)
>>> >> library(readr)
>>> >>
>>> >> # read all the data at once
>>> >> allbin <- read_file_raw('integer.bin')
>>> >>
>>> >> # decode the data into a list
>>> >> (result <- unpack("V V V V V V V V V V d d d d d d d d d d", allbin))
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> Jim Holtman
>>> >> Data Munger Guru
>>> >>
>>> >> What is the problem that you are trying to solve?
>>> >> Tell me what you want to do, not how you want to do it.
>>> >>
>>> >> On Sat, Sep 17, 2016 at 11:04 AM, Ismail SEZEN <sezenismail at gmail.com> wrote:
>>> >> I noticed the same issue but didn't care much :)
>>> >>
>>> >> On Sat, Sep 17, 2016, 18:01 jim holtman <jholtman at gmail.com> wrote:
>>> >> Your example was not reproducible.  Also how do you "break" out of the
>>> >> "while" loop?
>>> >>
>>> >>
>>> >> Jim Holtman
>>> >> Data Munger Guru
>>> >>
>>> >> What is the problem that you are trying to solve?
>>> >> Tell me what you want to do, not how you want to do it.
>>> >>
>>> >> On Sat, Sep 17, 2016 at 8:05 AM, Philippe de Rochambeau <phiroc at free.fr> wrote:
>>> >>
>>> >>> Hello,
>>> >>> the following function, which stores numeric values extracted from a
>>> >>> binary file, into an R matrix, is very slow, especially when the said file
>>> >>> is several MB in size.
>>> >>> Should I rewrite the function in inline C or in C/C++ using Rcpp? If the
>>> >>> latter case is true, how do you « readBin »  in Rcpp (I’m a total Rcpp
>>> >>> newbie)?
>>> >>> Many thanks.
>>> >>> Best regards,
>>> >>> phiroc
>>> >>>
>>> >>>
>>> >>> -------------
>>> >>>
>>> >>> # inputPath is something like http://myintranet/getData?pathToFile=/usr/lib/xxx/yyy/data.bin
>>> >>>
>>> >>> PLTreader <- function(inputPath){
>>> >>>        URL <- file(inputPath, "rb")
>>> >>>        PLT <- matrix(nrow=0, ncol=6)
>>> >>>        compteurDePrints = 0
>>> >>>        compteurDeLignes <- 0
>>> >>>        maxiPrints = 5
>>> >>>        displayData <- FALSE
>>> >>>        while (TRUE) {
>>> >>>                periodIndex <- readBin(URL, integer(), size=4, n=1,
>>> >>> endian="little") # int (4 bytes)
>>> >>>                eventId <- readBin(URL, integer(), size=4, n=1,
>>> >>> endian="little") # int (4 bytes)
>>> >>>                dword1 <- readBin(URL, integer(), size=4, signed=FALSE,
>>> >>> n=1, endian="little") # int
>>> >>>                dword2 <- readBin(URL, integer(), size=4, signed=FALSE,
>>> >>> n=1, endian="little") # int
>>> >>>                if (dword1 < 0) {
>>> >>>                        dword1 = dword1 + 2^32-1;
>>> >>>                }
>>> >>>                eventDate = (dword2*2^32 + dword1)/1000
>>> >>>                repNum <- readBin(URL, integer(), size=2, n=1,
>>> >>> endian="little") # short (2 bytes)
>>> >>>                exp <- readBin(URL, numeric(), size=4, n=1,
>>> >>> endian="little") # float (4 bytes, strangely enough, would expect 8)
>>> >>>                loss <- readBin(URL, numeric(), size=4, n=1,
>>> >>> endian="little") # float (4 bytes)
>>> >>>                PLT <- rbind(PLT, c(periodIndex, eventId, eventDate,
>>> >>> repNum, exp, loss))
>>> >>>        } # end while
>>> >>>        return(PLT)
>>> >>>        close(URL)
>>> >>> }
>>> >>>
>>> >>> ----------------
>>> >>>        [[alternative HTML version deleted]]
>>> >>>
>>> >>> ______________________________________________
>>> >>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> >>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> >>> and provide commented, minimal, self-contained, reproducible code.
>>> >>
>>> >
>>> >
>>>
>>>
>>
>> --
>> Dr. Michael Sumner
>> Software and Database Engineer
>> Australian Antarctic Division
>> 203 Channel Highway
>> Kingston Tasmania 7050 Australia
>>
>


