[Rd] How to handle INT8 data

Nicolas Paris nicolas.paris at aphp.fr
Fri Jan 20 19:05:31 CET 2017


Hi, 

I do have fewer than INT_MAX unique identifiers.
The factor approach looks attractive, but since the identifiers are all
unique, storing them as a factor is likely to be counter-productive (each
one would need both its string label and an int32 code).

I was looking at https://cran.r-project.org/web/packages/csvread/index.html
This looks like a good fit for my needs.
Is there any chance that such an external int64 package would be
integrated into core?
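
A minimal sketch of the character-based workaround discussed in this
thread, assuming the identifiers only need to be compared and joined, never
used in arithmetic. The temporary file, the second identifier and the
`labels` table are made up for illustration; only base R functions
(`read.table`, `merge`) are used.

```r
# Hypothetical sample file in the same layout as the res.csv quoted below.
csv <- tempfile(fileext = ".csv")
writeLines(c('"col1";"col2"',
             '"-1311071933951566764";"toto"',
             '"-1311071933951566765";"tata"'), csv)

# Default import: the identifiers are converted to double, so two distinct
# identifiers that differ only in the last digit can no longer be told apart.
as_num <- read.table(csv, sep = ";", header = TRUE)
collapsed <- as_num$col1[1] == as_num$col1[2]

# Forcing character columns keeps every digit of the identifier.
as_chr <- read.table(csv, sep = ";", header = TRUE,
                     colClasses = "character")

# Hypothetical second table keyed on the same identifiers.
labels <- data.frame(col1 = "-1311071933951566765",
                     label = "second id",
                     stringsAsFactors = FALSE)

# Joining on character keys matches the exact identifier only.
joined <- merge(as_chr, labels, by = "col1")
```

The trade-off is memory and speed: character keys are larger and slower to
compare than integers, but no extra package is required on the user's side.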


On 20 Jan 2017 at 18:57, Gabriel Becker wrote:
> How many unique identifiers do you have?
> 
> If they are large (in terms of bytes) but you don't have that many of them
> (e.g., the total possible number you'll ever have is < INT_MAX), you could
> store them as factors. You get the speed of integers but the labeling of
> full "precision" strings. Factors are fast for joins.
> 
> ~G
> 
> On Fri, Jan 20, 2017 at 9:47 AM, Nicolas Paris <nicolas.paris at aphp.fr> wrote:
> 
>     Well, I definitely cannot use them as numeric, because joining is the
>     main purpose of those identifiers.
> 
>     The int64 and bit64 packages are not a solution, because I am
>     releasing a dataset for external users. I cannot ask them to install a
>     package in order to use it.
> 
>     I have to be very careful when releasing the data. If a user just uses
>     the read.csv family of functions, the identifiers are cast to numeric
>     by default.
> 
>     $ more res.csv
>     "col1";"col2"
>     "-1311071933951566764";"toto"
>     "-1311071933951566764";"tata"
> 
> 
>     > read.table("res.csv",sep=";",header=T)
>                col1 col2
>     1 -1.311072e+18 toto
>     2 -1.311072e+18 tata
> 
>     > sapply(read.table("res.csv",sep=";",header=T),class)
>          col1      col2
>     "numeric"  "factor"
> 
>     > read.table("res.csv",sep=";",header=T,colClasses="character")
>                       col1 col2
>     1 -1311071933951566764 toto
>     2 -1311071933951566764 tata
> 
>     Am I condemned to provide an R script with the data so that users can
>     work with the dataset?
> 
>     On 20 Jan 2017 at 18:29, Murray Stokely wrote:
>     > 2^53 == 2^53+1
>     > TRUE
>     >
>     > Which makes joining or grouping data sets with 64-bit identifiers
>     > problematic.
>     >
>     > Murray (mobile)
>     >
>     > On Jan 20, 2017 9:15 AM, "Nicolas Paris" <nicolas.paris at aphp.fr> wrote:
>     >
>     >     On 20 Jan 2017 at 18:09, Murray Stokely wrote:
>     >     > The lack of 64-bit integer support causes lots of problems
>     >     > when dealing with certain types of data where the loss of
>     >     > precision from coercing to 53 bits with double is unacceptable.
>     >
>     >     Hello Murray,
>     >     Do you mean that, e.g., -1311071933951566764 loses precision
>     >     when passed through as.numeric(-1311071933951566764)?
>     >     Thanks,
>     >     >
>     >     > Two packages were developed to deal with this:  int64 and bit64.
>     >     >
>     >     > You may need to find archival versions of these packages if
>     >     > they've fallen off CRAN.
>     >     >
>     >     > Murray (mobile phone)
>     >     >
>     >     > On Jan 20, 2017 7:20 AM, "Gabriel Becker" <gmbecker at ucdavis.edu> wrote:
>     >     >
>     >     >     I am not on R-core, so cannot speak to future plans to
>     >     >     internally support int8 (though my impression is that there
>     >     >     aren't any, at least none that are close to fruition).
>     >     >
>     >     >     The standard way of dealing with whole numbers too big to
>     >     >     fit in an integer is to put them in a numeric (double down
>     >     >     in C land). This can represent integers up to 2^53 without
>     >     >     loss of precision (see
>     >     >     http://stackoverflow.com/questions/1848700/biggest-integer-that-can-be-stored-in-a-double).
>     >     >     This is how long vector indices are (currently) implemented
>     >     >     in R. If it's good enough for indices it's probably good
>     >     >     enough for whatever you need them for.
>     >     >
>     >     >     Hope that helps.
>     >     >
>     >     >     ~G
>     >     >
>     >     >
>     >     >     On Fri, Jan 20, 2017 at 6:33 AM, Nicolas Paris <nicolas.paris at aphp.fr> wrote:
>     >     >
>     >     >     > Hello R users,
>     >     >     >
>     >     >     > I have to deal with int8 data in R. AFAIK R only handles
>     >     >     > int4 with the `as.integer` function [1]. I wonder:
>     >     >     > 1. What is the best approach to handle int8? `as.character`?
>     >     >     > `as.numeric`?
>     >     >     > 2. Is there any plan to handle int8 in the future? As you
>     >     >     > might know, int4 is too small to deal with the earth's
>     >     >     > population right now.
>     >     >     >
>     >     >     > Thanks for your ideas,
>     >     >     >
>     >     >     > int8 eg:
>     >     >     >
>     >     >     >      human_id
>     >     >     > ----------------------
>     >     >     >  -1311071933951566764
>     >     >     >  -4708675461424073238
>     >     >     >  -6865005668390999818
>     >     >     >   5578000650960353108
>     >     >     >  -3219674686933841021
>     >     >     >  -6469229889308771589
>     >     >     >   -606871692563545028
>     >     >     >  -8199987422425699249
>     >     >     >   -463287495999648233
>     >     >     >   7675955260644241951
>     >     >     >
>     >     >     > reference:
>     >     >     > 1. https://www.r-bloggers.com/r-in-a-64-bit-world/
>     >     >     >
>     >     >     > --
>     >     >     > Nicolas PARIS
>     >     >     >
>     >     >     > ______________________________________________
>     >     >     > R-devel at r-project.org mailing list
>     >     >     > https://stat.ethz.ch/mailman/listinfo/r-devel
>     >     >     >
>     >     >
>     >     >
>     >     >
>     >     >     --
>     >     >     Gabriel Becker, PhD
>     >     >     Associate Scientist (Bioinformatics)
>     >     >     Genentech Research
>     >     >
>     >     >
>     >     >
>     >
>     >     --
>     >     Nicolas PARIS
>     >
>     >
> 
>     --
>     Nicolas PARIS
> 
> 
> 
> 
> --
> Gabriel Becker, PhD
> Associate Scientist (Bioinformatics)
> Genentech Research

-- 
Nicolas PARIS
Head of R & D
WIND - PACTE, Hôpital Rothschild ( RTH )
Email: nicolas.paris at aphp.fr
Tel: 01 48 04 21 07


