[Rd] How to handle INT8 data

Gabriel Becker gmbecker at ucdavis.edu
Fri Jan 20 19:16:14 CET 2017


I, again, can't speak for R-core, so I may be wrong about any of this and
they are welcome to correct me, but it seems unlikely that they would
integrate a package that defines 64 bit integers into the core of R
without making the changes necessary to provide 64 bit integers as a
fundamental (atomic vector) type. I know this has come up before and they
have been reluctant to make those changes.

As Pete points out, they could "simply" change integers in R to always be
64 bit, though that would make all* (to an extent) integer vectors in R
take up twice as much memory as they do now.
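
As a rough illustration of that memory cost, here is a minimal sketch. It
assumes a 64 bit integer element would occupy the same 8 bytes that a double
does today; that is my assumption for the comparison, since base R has no
such type to measure directly.

```r
# Compare the footprint of a 32 bit integer vector with a double vector
# of the same length. The double stands in for a hypothetical 64 bit
# integer type, which base R does not provide.
n <- 1e6
int_bytes <- as.numeric(object.size(integer(n)))  # 4 bytes per element
dbl_bytes <- as.numeric(object.size(numeric(n)))  # 8 bytes per element
ratio <- dbl_bytes / int_bytes
print(ratio)
```

The ratio comes out very close to 2, the small difference being the fixed
per-vector header.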

I should also mention that even if R-core did take up this cause, it
wouldn't happen quickly enough for what you probably need. I would guess we
would be talking months or year(s) (i.e. the next non-patch R version at
the earliest, and more likely the one after that, >1yr out).

One pragmatic solution (other than factors, which is what I would
probably do) would be to distribute your data only as an R data package
that depends on csvread or similar.
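
For what it's worth, the character/factor route can be sketched with base R
alone. This is only an illustration under assumptions: the inline text
mirrors the res.csv example quoted below, except that I changed the second
identifier by one so the two rows are distinct, and the `lookup` table is
made up for the join.

```r
# Hypothetical data: two identifiers that differ only in the last digit.
csv_text <- paste(
  '"col1";"col2"',
  '"-1311071933951566764";"toto"',
  '"-1311071933951566765";"tata"',
  sep = "\n"
)

# Default parsing converts the identifiers to doubles, which cannot hold
# all the digits, so the two distinct identifiers compare equal.
as_double <- read.table(text = csv_text, sep = ";", header = TRUE)
collapsed <- as_double$col1[1] == as_double$col1[2]

# Reading both columns as character keeps every digit.
as_text <- read.table(text = csv_text, sep = ";", header = TRUE,
                      colClasses = c("character", "character"))

# Joins on the character key match exactly (lookup is a made-up table).
lookup <- data.frame(col1 = "-1311071933951566765", label = "second id",
                     stringsAsFactors = FALSE)
joined <- merge(as_text, lookup, by = "col1")

# A factor stores one integer code per row plus the unique strings.
id_factor <- factor(as_text$col1)
```

Here `collapsed` shows why the numeric route is unsafe for joins, while
`joined` contains only the row whose identifier matches exactly.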

~G

On Fri, Jan 20, 2017 at 10:05 AM, Nicolas Paris <nicolas.paris at aphp.fr>
wrote:

> Hi,
>
> I do have < INT_MAX.
> This looks attractive, but since they are unique identifiers, storing
> them as factors is likely to be counter-productive (a string
> version plus an int32 for each).
>
> I was looking at https://cran.r-project.org/web/packages/csvread/index.html
> This looks like a good fit for my needs.
> Is there any chance such an external package for int64 would be integrated
> into core?
>
>
> On 20 Jan 2017 at 18:57, Gabriel Becker wrote:
> > How many unique identifiers do you have?
> >
> > If they are large (in terms of bytes) but you don't have that many of them
> > (e.g. the total possible number you'll ever have is < INT_MAX), you could
> > store them as factors. You get the speed of integers but the labeling of
> > full "precision" strings.  Factors are fast for joins.
> >
> > ~G
> >
> > On Fri, Jan 20, 2017 at 9:47 AM, Nicolas Paris <nicolas.paris at aphp.fr> wrote:
> >
> >     Well, I definitely cannot use them as numeric, because joining is the
> >     main purpose of those identifiers.
> >
> >     As for the int64 and bit64 packages, they are not a solution, because I
> >     am releasing a dataset for external users. I cannot ask them to install
> >     a package in order to use it.
> >
> >     I have to be very careful when releasing the data. If a user just uses
> >     the read.csv functions, by default they cast the identifiers to numeric.
> >
> >     $ more res.csv
> >     "col1";"col2"
> >     "-1311071933951566764";"toto"
> >     "-1311071933951566764";"tata"
> >
> >
> >     > read.table("res.csv",sep=";",header=T)
> >                col1 col2
> >     1 -1.311072e+18 toto
> >     2 -1.311072e+18 tata
> >
> >     >sapply(read.table("res.csv",sep=";",header=T),class)
> >          col1      col2
> >     "numeric"  "factor"
> >
> >     > read.table("res.csv",sep=";",header=T,colClasses="character")
> >     col1 col2
> >     1 -1311071933951566764 toto
> >     2 -1311071933951566764 tata
> >
> >     Am I condemned to provide an R script with the data in order to use the
> >     dataset?
> >
> >     On 20 Jan 2017 at 18:29, Murray Stokely wrote:
> >     > 2^53 == 2^53+1
> >     > TRUE
> >     >
> >     > Which makes joining or grouping data sets with 64 bit identifiers
> >     > problematic.
> >     >
> >     > Murray (mobile)
> >     >
> >     > On Jan 20, 2017 9:15 AM, "Nicolas Paris" <nicolas.paris at aphp.fr> wrote:
> >     >
> >     >     On 20 Jan 2017 at 18:09, Murray Stokely wrote:
> >     >     > The lack of 64 bit integer support causes lots of problems when
> >     >     > dealing with certain types of data where the loss of precision
> >     >     > from coercing to 53 bits with double is unacceptable.
> >     >
> >     >     Hello Murray,
> >     >     Do you mean that, e.g., -1311071933951566764 loses precision
> >     >     during the as.numeric(-1311071933951566764) call?
> >     >     Thanks,
> >     >     >
> >     >     > Two packages were developed to deal with this: int64 and bit64.
> >     >     >
> >     >     > You may need to find archival versions of these packages if
> >     >     > they've fallen off CRAN.
> >     >     >
> >     >     > Murray (mobile phone)
> >     >     >
> >     >     > On Jan 20, 2017 7:20 AM, "Gabriel Becker" <gmbecker at ucdavis.edu> wrote:
> >     >     >
> >     >     >     I am not on R-core, so cannot speak to future plans to
> >     >     >     internally support int8 (though my impression is that there
> >     >     >     aren't any, at least none that are close to fruition).
> >     >     >
> >     >     >     The standard way of dealing with whole numbers too big to
> >     >     >     fit in an integer is to put them in a numeric (double down in
> >     >     >     C land). This can represent integers up to 2^53 without loss
> >     >     >     of precision; see
> >     >     >     http://stackoverflow.com/questions/1848700/biggest-integer-that-can-be-stored-in-a-double
> >     >     >     This is how long vector indices are (currently) implemented
> >     >     >     in R. If it's good enough for indices it's probably good
> >     >     >     enough for whatever you need them for.
> >     >     >
> >     >     >     Hope that helps.
> >     >     >
> >     >     >     ~G
> >     >     >
> >     >     >
> >     >     >     On Fri, Jan 20, 2017 at 6:33 AM, Nicolas Paris <nicolas.paris at aphp.fr> wrote:
> >     >     >
> >     >     >     > Hello R users,
> >     >     >     >
> >     >     >     > I have to deal with int8 data in R. AFAIK R only handles
> >     >     >     > int4, with the `as.integer` function [1]. I wonder:
> >     >     >     > 1. What is the best approach to handle int8?
> >     >     >     >    `as.character`? `as.numeric`?
> >     >     >     > 2. Is there any plan to handle int8 in the future? As you
> >     >     >     >    might know, int4 is too small to deal with the earth's
> >     >     >     >    population right now.
> >     >     >     >
> >     >     >     > Thanks for your ideas,
> >     >     >     >
> >     >     >     > int8 eg:
> >     >     >     >
> >     >     >     >      human_id
> >     >     >     > ----------------------
> >     >     >     >  -1311071933951566764
> >     >     >     >  -4708675461424073238
> >     >     >     >  -6865005668390999818
> >     >     >     >   5578000650960353108
> >     >     >     >  -3219674686933841021
> >     >     >     >  -6469229889308771589
> >     >     >     >   -606871692563545028
> >     >     >     >  -8199987422425699249
> >     >     >     >   -463287495999648233
> >     >     >     >   7675955260644241951
> >     >     >     >
> >     >     >     > reference:
> >     >     >     > 1. https://www.r-bloggers.com/r-in-a-64-bit-world/
> >     >     >     >
> >     >     >     > --
> >     >     >     > Nicolas PARIS
> >     >     >     >
> >     >     >     > ______________________________________________
> >     >     >     > R-devel at r-project.org mailing list
> >     >     >     > https://stat.ethz.ch/mailman/listinfo/r-devel
> >     >     >     >
> >     >     >
> >     >     >
> >     >     >
> >     >     >     --
> >     >     >     Gabriel Becker, PhD
> >     >     >     Associate Scientist (Bioinformatics)
> >     >     >     Genentech Research
> >     >     >
> >     >     >
> >     >     >
> >     >
> >     >     --
> >     >     Nicolas PARIS
> >     >
> >     >
> >
> >     --
> >     Nicolas PARIS
> >
> >
> >
> >
> > --
> > Gabriel Becker, PhD
> > Associate Scientist (Bioinformatics)
> > Genentech Research
>
> --
> Nicolas PARIS
> Responsable R & D
> WIND - PACTE, Hôpital Rothschild ( RTH )
> Courriel : nicolas.paris at aphp.fr
> Tel : 01 48 04 21 07
>



-- 
Gabriel Becker, PhD
Associate Scientist (Bioinformatics)
Genentech Research



