[R] Accelerating binRead

Michael Sumner mdsumner at gmail.com
Sun Sep 18 16:05:59 CEST 2016


On Sun, 18 Sep 2016, 19:04 Philippe de Rochambeau <phiroc at free.fr> wrote:

> Please find below code that attempts to read ints, longs and floats from a
> binary file (which is a simplification of my original program).
> Please disregard the R inefficiencies, such as using rbind, for now.
> I’ve also included Java code to generate the binary file.
> The output shows that, at one point, anInt becomes undefined.
> Unfortunately, I couldn’t find the correct R function to determine whether
> anInt is undefined or not, as is.null, is.nan, and is.infinite don’t work.
> Any help would be much appreciated.
> Many thanks in advance.
> Philippe
>
> ———————
> [1] "anInt = 1"
> [1] "is.null  FALSE"
> [1] "is.nan  FALSE"
> [1] "is.infinite  FALSE"
> [1] "aLong = 2"
> [1] "aFloat = 3.44440007209778"
> [1] "--------------------------"
> [1] "anInt = 2"
> [1] "is.null  FALSE"
> [1] "is.nan  FALSE"
> [1] "is.infinite  FALSE"
> [1] "aLong = 22"
> [1] "aFloat = 13.4644002914429"
> [1] "--------------------------"
> [1] "anInt = 3"
> [1] "is.null  FALSE"
> [1] "is.nan  FALSE"
> [1] "is.infinite  FALSE"
> [1] "aLong = 55"
> [1] "aFloat = 45.4444007873535"
> [1] "--------------------------"
> [1] "anInt = "
> [1] "is.null  FALSE"
> [1] "is.nan  "
> [1] "is.infinite  "
> [1] "aLong = "
> [1] "aFloat = "
> [1] "--------------------------"
>      [,1]      [,2]      [,3]
> [1,] 1         2         3.4444
> [2,] 2         22        13.4644
> [3,] 3         55        45.4444
> [4,] Integer,0 Integer,0 Numeric,0
>
> —————————————————————
>
> readFile <- function(inputPath) {
>   URL <- file(inputPath, "rb")
>   PLT <- matrix(nrow=0, ncol=3)
>   counte <- 0
>   max <- 4
>   while (counte < max) {
>     anInt <- readBin(con=URL, what=integer(), size=4, n=1, endian="big")
>     print(paste("anInt =", anInt))
>     #if (! (anInt == 0)) { print(paste("empty int")); break }
>     print(paste("is.null ", is.null(anInt)))
>     print(paste("is.nan ", is.nan(anInt)))
>     print(paste("is.infinite ", is.infinite(anInt)))
>     aLong <- readBin(URL, integer(), size=8, n=1, endian="big")
>     print(paste("aLong =", aLong))
>     aFloat <- readBin(URL, numeric(), size=4, n=1, endian="big")
>     print(paste("aFloat =", aFloat))
>     print("--------------------------")
>     PLT <- rbind(PLT, list(anInt, aLong, aFloat))
>     counte <- counte + 1
>   } # end while
>   close(URL)
>   PLT
> }
> fichier <- "/Users/philippe/Desktop/datatests/data0.bin"
> PLT2 <- readFile(fichier)
> print(PLT2)
> —————————————————————
>
> import java.io.*;
>
> public class Main {
>
>         Main() {
>                 writeData();
>         }
>
>         public static void main(String[] args) {
>                 new Main();
>         }
>
>         public void writeData() {
>
>                 final String path = "/Users/philippe/Desktop/datatests/data0.bin";
>
>                 DataOutputStream dos;
>                 try {
>                         dos = new DataOutputStream(
>                                 new BufferedOutputStream(new FileOutputStream(path)));
>                         // big endian write! ("high byte first"), see
>                         // https://docs.oracle.com/javase/7/docs/api/java/io/DataOutputStream.html
>                         dos.writeInt(1);
>                         dos.writeLong(2L);
>                         dos.writeFloat(3.4444F);
>
>                         dos.writeInt(2);
>                         dos.writeLong(22L);
>                         dos.writeFloat(13.4644F);
>
>                         dos.writeInt(3);
>                         dos.writeLong(55L);
>                         dos.writeFloat(45.4444F);
>
>                         dos.close();
>                 } catch (FileNotFoundException e) {
>                         e.printStackTrace();
>                 } catch (IOException ioe) {
>                         ioe.printStackTrace();
>                 }
>
>         }
>
> }
>
>
> —————————————————————
>
> > On 17 Sep 2016, at 20:45, Philippe de Rochambeau <phiroc at free.fr> wrote:
> >
> > Hi Jim,
> > this is exactly the answer I was looking for. Many thanks. I didn’t know
> > R had a pack function, as in Perl.
> > To answer your earlier question, I am trying to update legacy code to
> > read a binary file of unknown size, over a network, slice it up into rows
> > each containing an integer, an integer, a long, a short, a float and a
> > float, and stuff the rows into a matrix.
>
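
On the question above about detecting the "undefined" anInt: when readBin()
runs out of bytes it returns a zero-length vector (integer(0)). That is why
is.null() gives FALSE, and why is.nan() and is.infinite() print nothing.
Testing length() is one way to detect it. A minimal sketch, assuming a
temporary file holding three big-endian ints (the path and values are made up
for illustration and are not from the original program):

```r
# Write three big-endian 4-byte integers to a temporary file (illustrative data).
path <- tempfile(fileext = ".bin")
out <- file(path, "wb")
writeBin(1:3, out, size = 4, endian = "big")
close(out)

con <- file(path, "rb")
values <- integer(0)
repeat {
  anInt <- readBin(con, what = integer(), size = 4, n = 1, endian = "big")
  # At end of file readBin() returns integer(0), so test the length
  # rather than is.null(), is.nan() or is.infinite().
  if (length(anInt) == 0) break
  values <- c(values, anInt)
}
close(con)
print(values)
```

The same length() test could serve as the exit condition of a while loop that
reads one record at a time.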


It's possible to read all rows fast as raw(), then parse in a vectorised
way with matrix indexing to group the bytes appropriately. There is an
example on the mailing list somewhere, but otherwise I can show an example
if that's of interest.
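
As a rough sketch of the raw() approach described above, assuming the 16-byte
big-endian record layout of the simplified example (a 4-byte int, an 8-byte
long, a 4-byte float). The sample bytes are built in memory for illustration,
and names such as rec_size, hi and lo are made up here. The 8-byte long is
read as two 4-byte halves and combined as a double, which is exact only up to
2^53:

```r
# Build three illustrative records in memory: int32, int64, float32 (big-endian).
ints   <- c(1L, 2L, 3L)
longs  <- c(2, 22, 55)
floats <- c(3.4444, 13.4644, 45.4444)
con <- rawConnection(raw(0), "wb")
for (i in seq_along(ints)) {
  writeBin(ints[i], con, size = 4, endian = "big")
  writeBin(0L, con, size = 4, endian = "big")                    # high half of the long
  writeBin(as.integer(longs[i]), con, size = 4, endian = "big")  # low half of the long
  writeBin(floats[i], con, size = 4, endian = "big")
}
bytes <- rawConnectionValue(con)
close(con)
# For a real file: bytes <- readBin(path, raw(), n = file.size(path))

rec_size <- 16L
n <- length(bytes) %/% rec_size
# One column per record, so each row range selects one field across all records.
m <- matrix(bytes[seq_len(n * rec_size)], nrow = rec_size)

anInt  <- readBin(as.vector(m[1:4, ]),   integer(), n = n, size = 4, endian = "big")
hi     <- readBin(as.vector(m[5:8, ]),   integer(), n = n, size = 4, endian = "big")
lo     <- readBin(as.vector(m[9:12, ]),  integer(), n = n, size = 4, endian = "big")
aFloat <- readBin(as.vector(m[13:16, ]), numeric(), n = n, size = 4, endian = "big")

# The low half is read as signed; add 2^32 to negative values to recover it.
aLong <- hi * 2^32 + ifelse(lo < 0, lo + 2^32, lo)

result <- cbind(anInt, aLong, aFloat)
print(result)
```

Each field is decoded with a single readBin() call over all records, so there
is no per-row loop and no repeated rbind().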


Cheers, Mike


> > Best regards,
> > Philippe
> >
> >> On 17 Sep 2016, at 20:38, jim holtman <jholtman at gmail.com> wrote:
> >>
> >> Here is an example of how to do it:
> >>
> >> x <- 1:10  # integer values
> >> xf <- seq(1.0, 2, by = 0.1)  # floating point
> >>
> >> setwd("d:/temp")
> >>
> >> # create file to write to
> >> output <- file('integer.bin', 'wb')
> >> writeBin(x, output)  # write integer
> >> writeBin(xf, output)  # write reals
> >> close(output)
> >>
> >>
> >> library(pack)
> >> library(readr)
> >>
> >> # read all the data at once
> >> allbin <- read_file_raw('integer.bin')
> >>
> >> # decode the data into a list
> >> (result <- unpack("V V V V V V V V V V d d d d d d d d d d", allbin))
> >>
> >>
> >>
> >>
> >> Jim Holtman
> >> Data Munger Guru
> >>
> >> What is the problem that you are trying to solve?
> >> Tell me what you want to do, not how you want to do it.
> >>
> >> On Sat, Sep 17, 2016 at 11:04 AM, Ismail SEZEN <sezenismail at gmail.com> wrote:
> >> I noticed the same issue but didn't care much :)
> >>
> >> On Sat, Sep 17, 2016, 18:01 jim holtman <jholtman at gmail.com> wrote:
> >> Your example was not reproducible.  Also how do you "break" out of the
> >> "while" loop?
> >>
> >>
> >> Jim Holtman
> >> Data Munger Guru
> >>
> >> What is the problem that you are trying to solve?
> >> Tell me what you want to do, not how you want to do it.
> >>
> >> On Sat, Sep 17, 2016 at 8:05 AM, Philippe de Rochambeau <phiroc at free.fr> wrote:
> >>
> >>> Hello,
> >>> the following function, which stores numeric values extracted from a
> >>> binary file into an R matrix, is very slow, especially when the said
> >>> file is several MB in size.
> >>> Should I rewrite the function in inline C or in C/C++ using Rcpp? If
> >>> the latter, how do you "readBin" in Rcpp (I’m a total Rcpp
> >>> newbie)?
> >>> Many thanks.
> >>> Best regards,
> >>> phiroc
> >>>
> >>>
> >>> -------------
> >>>
> >>> # inputPath is something like
> >>> # http://myintranet/getData?pathToFile=/usr/lib/xxx/yyy/data.bin
> >>>
> >>> PLTreader <- function(inputPath){
> >>>        URL <- file(inputPath, "rb")
> >>>        PLT <- matrix(nrow=0, ncol=6)
> >>>        compteurDePrints = 0
> >>>        compteurDeLignes <- 0
> >>>        maxiPrints = 5
> >>>        displayData <- FALSE
> >>>        while (TRUE) {
> >>>                periodIndex <- readBin(URL, integer(), size=4, n=1,
> >>> endian="little") # int (4 bytes)
> >>>                eventId <- readBin(URL, integer(), size=4, n=1,
> >>> endian="little") # int (4 bytes)
> >>>                dword1 <- readBin(URL, integer(), size=4, signed=FALSE,
> >>> n=1, endian="little") # int
> >>>                dword2 <- readBin(URL, integer(), size=4, signed=FALSE,
> >>> n=1, endian="little") # int
> >>>                if (dword1 < 0) {
> >>>                        dword1 = dword1 + 2^32  # map signed 32-bit value to unsigned
> >>>                }
> >>>                eventDate = (dword2*2^32 + dword1)/1000
> >>>                repNum <- readBin(URL, integer(), size=2, n=1,
> >>> endian="little") # short (2 bytes)
> >>>                exp <- readBin(URL, numeric(), size=4, n=1,
> >>> endian="little") # float (4 bytes, strangely enough, would expect 8)
> >>>                loss <- readBin(URL, numeric(), size=4, n=1,
> >>> endian="little") # float (4 bytes)
> >>>                PLT <- rbind(PLT, c(periodIndex, eventId, eventDate,
> >>> repNum, exp, loss))
> >>>        } # end while
> >>>        close(URL)
> >>>        return(PLT)
> >>> }
> >>>
> >>> ----------------
> >>>        [[alternative HTML version deleted]]
> >>>
> >>> ______________________________________________
> >>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide
> >>> http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
> >>
> >
>

-- 
Dr. Michael Sumner
Software and Database Engineer
Australian Antarctic Division
203 Channel Highway
Kingston Tasmania 7050 Australia
