[Rd] How to handle INT8 data

Gabriel Becker gmbecker at ucdavis.edu
Fri Jan 20 18:57:04 CET 2017


How many unique identifiers do you have?

If they are large (in terms of bytes) but you don't have that many of them
(e.g. the total possible number you'll ever have is < INT_MAX), you could
read them in as character and store them as factors. You get the speed of
integers but the labeling of full "precision" strings.  Factors are fast
for joins.
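
A minimal sketch of the idea, assuming the ids have already been read in as
character (for instance with colClasses = "character"). The data frame names
and the score column are made up for illustration; the id values are taken
from the examples in this thread.

```r
# Doubles cannot tell neighbouring integers apart above 2^53.
2^53 == 2^53 + 1   # TRUE

# Hypothetical tables: the 64-bit ids are kept as strings, never as numeric.
ids_a <- data.frame(
  human_id = c("-1311071933951566764", "5578000650960353108"),
  name     = c("toto", "tata"),
  stringsAsFactors = FALSE
)
ids_b <- data.frame(
  human_id = c("5578000650960353108", "-1311071933951566764"),
  score    = c(10, 20),
  stringsAsFactors = FALSE
)

# Use one shared set of levels so the integer codes mean the same id in both tables.
lv <- union(ids_a$human_id, ids_b$human_id)
ids_a$human_id <- factor(ids_a$human_id, levels = lv)
ids_b$human_id <- factor(ids_b$human_id, levels = lv)

joined <- merge(ids_a, ids_b, by = "human_id")

# The labels keep every digit, while the stored values are integer codes.
as.character(joined$human_id)
typeof(ids_a$human_id)
```

The shared levels matter: factors built separately on each table can assign
different integer codes to the same id.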

~G

On Fri, Jan 20, 2017 at 9:47 AM, Nicolas Paris <nicolas.paris at aphp.fr>
wrote:

> Well I definitely cannot use them as numeric because join is the main
> reason of those identifiers.
>
> About int64 and bit64 packages, it's not a solution, because I am
> releasing a dataset for external users. I cannot ask them to install a
> package in order to exploit them.
>
> I have to be very careful when releasing the data. If a user just uses
> read.csv functions, they by default cast the identifiers as numeric.
>
> $ more res.csv
> "col1";"col2"
> "-1311071933951566764";"toto"
> "-1311071933951566764";"tata"
>
>
> > read.table("res.csv",sep=";",header=T)
>            col1 col2
> 1 -1.311072e+18 toto
> 2 -1.311072e+18 tata
>
> >sapply(read.table("res.csv",sep=";",header=T),class)
>      col1      col2
> "numeric"  "factor"
>
> > read.table("res.csv",sep=";",header=T,colClasses="character")
> col1 col2
> 1 -1311071933951566764 toto
> 2 -1311071933951566764 tata
>
> Am I condemned to provide an R script with the data in order to exploit the
> dataset ?
>
> On 20 Jan 2017 at 18:29, Murray Stokely wrote:
> > 2^53 == 2^53+1
> > TRUE
> >
> > Which makes joining or grouping data sets with 64 bit identifiers
> > problematic.
> >
> > Murray (mobile)
> >
> > On Jan 20, 2017 9:15 AM, "Nicolas Paris" <nicolas.paris at aphp.fr> wrote:
> >
> >     On 20 Jan 2017 at 18:09, Murray Stokely wrote:
> >     > The lack of 64 bit integer support causes lots of problems when
> >     > dealing with certain types of data where the loss of precision from
> >     > coercing to 53 bits with double is unacceptable.
> >
> >     Hello Murray,
> >     Do you mean that e.g. -1311071933951566764 loses precision during the
> >     as.numeric(-1311071933951566764) process ?
> >     Thanks,
> >     >
> >     > Two packages were developed to deal with this:  int64 and bit64.
> >     >
> >     > You may need to find archival versions of these packages if they've
> >     > fallen off CRAN.
> >     >
> >     > Murray (mobile phone)
> >     >
> >     > On Jan 20, 2017 7:20 AM, "Gabriel Becker" <gmbecker at ucdavis.edu> wrote:
> >     >
> >     >     I am not on R-core, so cannot speak to future plans to internally
> >     >     support int8 (though my impression is that there aren't any, at
> >     >     least none that are close to fruition).
> >     >
> >     >     The standard way of dealing with whole numbers too big to fit in an
> >     >     integer is to put them in a numeric (double down in C land). This can
> >     >     represent integers up to 2^53 without loss of precision (see
> >     >     http://stackoverflow.com/questions/1848700/biggest-integer-that-can-be-stored-in-a-double).
> >     >     This is how long vector indices are (currently) implemented in R. If
> >     >     it's good enough for indices it's probably good enough for whatever
> >     >     you need them for.
> >     >
> >     >     Hope that helps.
> >     >
> >     >     ~G
> >     >
> >     >
> >     >     On Fri, Jan 20, 2017 at 6:33 AM, Nicolas Paris <nicolas.paris at aphp.fr>
> >     >     wrote:
> >     >
> >     >     > Hello r users,
> >     >     >
> >     >     > I have to deal with int8 data with R. AFAIK R only handles int4
> >     >     > with the `as.integer` function [1]. I wonder:
> >     >     > 1. what is the best approach to handle int8 ? `as.character` ?
> >     >     > `as.numeric` ?
> >     >     > 2. is there any plan to handle int8 in the future ? As you might
> >     >     > know, int4 is too small to deal with earth population right now.
> >     >     >
> >     >     > Thanks for your ideas,
> >     >     >
> >     >     > int8 eg:
> >     >     >
> >     >     >      human_id
> >     >     > ----------------------
> >     >     >  -1311071933951566764
> >     >     >  -4708675461424073238
> >     >     >  -6865005668390999818
> >     >     >   5578000650960353108
> >     >     >  -3219674686933841021
> >     >     >  -6469229889308771589
> >     >     >   -606871692563545028
> >     >     >  -8199987422425699249
> >     >     >   -463287495999648233
> >     >     >   7675955260644241951
> >     >     >
> >     >     > reference:
> >     >     > 1. https://www.r-bloggers.com/r-in-a-64-bit-world/
> >     >     >
> >     >     > --
> >     >     > Nicolas PARIS
> >     >     >
> >     >     > ______________________________________________
> >     >     > R-devel at r-project.org mailing list
> >     >     > https://stat.ethz.ch/mailman/listinfo/r-devel
> >     >     >
> >     >
> >     >
> >     >
> >     >     --
> >     >     Gabriel Becker, PhD
> >     >     Associate Scientist (Bioinformatics)
> >     >     Genentech Research
> >     >
> >     >
> >     >
> >
> >     --
> >     Nicolas PARIS
> >
> >
>
> --
> Nicolas PARIS
>



-- 
Gabriel Becker, PhD
Associate Scientist (Bioinformatics)
Genentech Research
