[R] 'format' behaviour in a 'apply' call depending on 'options(digits = K)'

Mathieu Basille basille.web at ase-research.org
Thu Aug 1 20:40:36 CEST 2013


Ista, you were right with the integer vs. double issue: I just found this 
out while filing a bug to the R Bugzilla. You can find the bug report here:

https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=15411

Please let me know if it does not seem to cover all your comments; I'll add 
more details to the bug report.
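
For anyone following along, here is a minimal sketch of the integer vs. double point, assuming a small stand-in data frame (the name 'df' and its three rows are made up for illustration). The 'id' column is stored as integer, but 'apply' passes a data frame through 'as.matrix', and mixing it with the double columns 'x' and 'y' yields a double matrix, so 'format' sees a double inside 'apply'. Whether the double path pads the result may depend on the R version, so no output is shown for it.

```r
# Hypothetical small data frame mirroring the simplified example in the thread
df <- data.frame(x = rnorm(3), y = rnorm(3), id = 99994:99996)

typeof(df$id)        # the column on its own is an integer vector
m <- as.matrix(df)   # apply() converts a data frame with as.matrix()
typeof(m)            # double and integer columns together give a double matrix

old <- options(digits = 4)   # the setting under which the padding was reported
int_out <- format(df$id[2], scientific = FALSE)             # integer path
dbl_out <- format(as.double(df$id[2]), scientific = FALSE)  # double path, as inside apply()
options(old)                 # restore the previous 'digits' setting

# Compare the two results; in the thread only the double path showed padding
print(c(integer = int_out, double = dbl_out))
```

Comparing 'nchar(int_out)' with 'nchar(dbl_out)' is a quick way to check whether a given installation shows the extra leading space.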

Let's see now how this one turns out...
Mathieu.


On 08/01/2013 02:08 PM, Ista Zahn wrote:
> Hi Mathieu,
>
> I don't have a full explanation, but here are some additional observations:
>
>> options(digits = 4)
>>
>> ## Simplified example
>> df2 <- data.frame(x = rnorm(21), y = rnorm(21), id = 99990:100010)
>> apply(df2, 1, function(dfi) format(dfi["id"], scientific = FALSE))
>  [1] "99990"  "99991"  "99992"  "99993"  "99994"  " 99995" " 99996" " 99997" " 99998" " 99999" "100000" "100001" "100002" "100003"
> [15] "100004" "100005" "100006" "100007" "100008" "100009" "100010"
>>
>> ## Based on magnitude of id (ids from 99995 to 99999 get padded regardless of position)
>> df2 <- data.frame(x = rnorm(21), y = rnorm(21), id = 100010:99990)
>> apply(df2, 1, function(dfi) format(dfi["id"], scientific = FALSE))
>  [1] "100010" "100009" "100008" "100007" "100006" "100005" "100004" "100003" "100002" "100001" "100000" " 99999" " 99998" " 99997"
> [15] " 99996" " 99995" "99994"  "99993"  "99992"  "99991"  "99990"
>>
>> ## The issue is that formatting a double leads to the originally noted behavior.
>> ## The apply version coerces df2 to a matrix of type double which is why this
>> ## happens there as well.
>>
>> for(i in 1:nrow(df2)) print(format(df2[i, "id"], scientific=FALSE))
> [1] "100010"
> [1] "100009"
> [1] "100008"
> [1] "100007"
> [1] "100006"
> [1] "100005"
> [1] "100004"
> [1] "100003"
> [1] "100002"
> [1] "100001"
> [1] "100000"
> [1] "99999"
> [1] "99998"
> [1] "99997"
> [1] "99996"
> [1] "99995"
> [1] "99994"
> [1] "99993"
> [1] "99992"
> [1] "99991"
> [1] "99990"
>> for(i in 1:nrow(df2)) print(format(as.double(df2[i, "id"]), scientific=FALSE))
> [1] "100010"
> [1] "100009"
> [1] "100008"
> [1] "100007"
> [1] "100006"
> [1] "100005"
> [1] "100004"
> [1] "100003"
> [1] "100002"
> [1] "100001"
> [1] "100000"
> [1] " 99999"
> [1] " 99998"
> [1] " 99997"
> [1] " 99996"
> [1] " 99995"
> [1] "99994"
> [1] "99993"
> [1] "99992"
> [1] "99991"
> [1] "99990"
>
> Best,
> Ista
>
> On Thu, Aug 1, 2013 at 11:31 AM, Mathieu Basille
> <basille.web at ase-research.org> wrote:
>> This problem does not seem to be widespread, but it affects at least two
>> users (both on Linux, maybe a hint here?). To me, it looks like a bug (whether
>> it is an R bug or an OS-related bug, I don't know). Should I forward it to R-devel,
>> or some other place where R gurus may have a chance to look at it?
>>
>> Mathieu.
>>
>>
>> On 07/30/2013 02:34 PM, arun wrote:
>>
>>> Hi Mathieu
>>> yes, the original problem occurs on my system too. I am using R 3.0.1 on
>>> Linux Mint 15. I guess the default is trim=FALSE, but it still looks
>>> very strange, especially in apply(), as the padding only starts from " 99995"
>>> onwards.
>>>
>>> sessionInfo()
>>> R version 3.0.1 (2013-05-16)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>>    [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
>>>    [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
>>>    [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8
>>>    [7] LC_PAPER=C                 LC_NAME=C
>>>    [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] stringr_0.6.2  reshape2_1.2.2
>>>
>>> loaded via a namespace (and not attached):
>>> [1] plyr_1.8    tools_3.0.1
>>>
>>> ----- Original Message -----
>>> From: Mathieu Basille <basille.web at ase-research.org>
>>> To: arun <smartpink111 at yahoo.com>
>>> Cc: R help <r-help at r-project.org>
>>> Sent: Tuesday, July 30, 2013 2:29 PM
>>> Subject: Re: [R] 'format' behaviour in a 'apply' call depending on
>>> 'options(digits = K)'
>>>
>>> Thanks Arun for your answer. 'trim = TRUE' does indeed solve the symptoms
>>> of the problem, and this is the solution I'm currently using. However, it
>>> does not help to understand what the problem is, and what is the cause of
>>> it.
>>>
>>> Can you confirm that the original problem also occurs on your computer
>>> (and what is your OS)? It would be interesting since David is not able to
>>> reproduce the problem with Mac OS X.
>>> Mathieu.
>>>
>>>
>>> On 07/30/2013 02:15 PM, arun wrote:
>>>>
>>>> Hi,
>>>> Try using trim=TRUE in format():
>>>> options(digits=4)
>>>>
>>>> df2 <- data.frame(x = rnorm(110000), y = rnorm(110000), id = 1:110000)
>>>> df2$id2 <- apply(df2, 1, function(dfi) format(dfi["id"], trim = TRUE,
>>>>                                               scientific = FALSE))
>>>> df2$id2[99990:100010]
>>>> #  [1] "99990"  "99991"  "99992"  "99993"  "99994"  "99995"  "99996"  "99997"
>>>> #  [9] "99998"  "99999"  "100000" "100001" "100002" "100003" "100004" "100005"
>>>> # [17] "100006" "100007" "100008" "100009" "100010"
>>>>
>>>>
>>>> id2 <- format(1:110000, scientific = FALSE, trim = TRUE)
>>>> id2[99990:100010]
>>>> #  [1] "99990"  "99991"  "99992"  "99993"  "99994"  "99995"  "99996"  "99997"
>>>> #  [9] "99998"  "99999"  "100000" "100001" "100002" "100003" "100004" "100005"
>>>> # [17] "100006" "100007" "100008" "100009" "100010"
>>>> A.K.
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: Mathieu Basille <basille.web at ase-research.org>
>>>> To: David Winsemius <dwinsemius at comcast.net>
>>>> Cc: r-help at r-project.org
>>>> Sent: Tuesday, July 30, 2013 2:07 PM
>>>> Subject: Re: [R] 'format' behaviour in a 'apply' call depending on
>>>> 'options(digits = K)'
>>>>
>>>> Thanks David for your interest. I have to admit that your answer puzzles
>>>> me even more than before. It seems that the underlying problem is way
>>>> beyond my R skills...
>>>>
>>>> The generation of id2 is indeed quite demanding, especially compared to a
>>>> simple 'as.character' call. Anyway, since it seems to be system specific,
>>>> here is the sessionInfo() that I forgot to attach to my first message:
>>>>
>>>> R version 3.0.1 (2013-05-16)
>>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>>       [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C
>>>>       [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8
>>>>       [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8
>>>>       [7] LC_PAPER=C                 LC_NAME=C
>>>>       [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>
>>>> In brief: the latest stable R available under Debian Testing... Hopefully
>>>> this can help track down the problem.
>>>> Mathieu.
>>>>
>>>>
>>>> On 07/30/2013 01:58 PM, David Winsemius wrote:
>>>>>
>>>>>
>>>>> On Jul 30, 2013, at 9:01 AM, Mathieu Basille wrote:
>>>>>
>>>>>> Dear list,
>>>>>>
>>>>>> Here is a simple example in which the behaviour of 'format' does not
>>>>>> make sense to me. I have read the documentation and searched the archives,
>>>>>> but nothing pointed me in the right direction to understand this behaviour.
>>>>>> Let's start with a simple data frame:
>>>>>>
>>>>>> df1 <- data.frame(x = rnorm(110000), y = rnorm(110000), id = 1:110000)
>>>>>>
>>>>>> Let's now create a new variable 'id2' which is the character
>>>>>> representation of 'id'. Note that I use 'scientific = FALSE' to ensure that
>>>>>> long numbers such as 100,000 are not formatted using their scientific
>>>>>> representation (in this case 1e+05):
>>>>>>
>>>>>> df1$id2 <- apply(df1, 1, function(dfi) format(dfi["id"], scientific =
>>>>>> FALSE))
>>>>>>
>>>>>> Let's have a look at part of the result:
>>>>>>
>>>>>> df1$id2[99990:100010]
>>>>>> [1] "99990"  "99991"  "99992"  "99993"  "99994"  "99995"  "99996"
>>>>>> [8] "99997"  "99998"  "99999"  "100000" "100001" "100002" "100003"
>>>>>> [15] "100004" "100005" "100006" "100007" "100008" "100009" "100010"
>>>>>
>>>>>
>>>>> Some formatting processes are carried out by system functions. In this
>>>>> case I am unable to reproduce the problem with the same code on Mac OS
>>>>> 10.7.5 / R 3.0.1 Patched:
>>>>>
>>>>>> df1$id2[99990:100010]
>>>>>
>>>>>  [1] "99990"  "99991"  "99992"  "99993"  "99994"  "99995"  "99996"  "99997"
>>>>>  [9] "99998"  "99999"  "100000" "100001" "100002" "100003" "100004" "100005"
>>>>> [17] "100006" "100007" "100008" "100009" "100010"
>>>>>
>>>>> (I did notice that generation of the id2 variable seemed to take an
>>>>> inordinately long time.)
>>>>>
>>>>> -- David.
>>>>>>
>>>>>>
>>>>>> So far, so good. Let's now play with the 'digits' option:
>>>>>>
>>>>>> options(digits = 4)
>>>>>> df2 <- data.frame(x = rnorm(110000), y = rnorm(110000), id = 1:110000)
>>>>>> df2$id2 <- apply(df2, 1, function(dfi) format(dfi["id"], scientific =
>>>>>> FALSE))
>>>>>> df2$id2[99990:100010]
>>>>>> [1] "99990"  "99991"  "99992"  "99993"  "99994"  " 99995" " 99996"
>>>>>> [8] " 99997" " 99998" " 99999" "100000" "100001" "100002" "100003"
>>>>>> [15] "100004" "100005" "100006" "100007" "100008" "100009" "100010"
>>>>>>
>>>>>> Notice the extra leading space from 99995 to 99999? To make sure it
>>>>>> only happened there:
>>>>>>
>>>>>> df2$id2[which(df1$id2 != df2$id2)]
>>>>>> [1] " 99995" " 99996" " 99997" " 99998" " 99999"
>>>>>>
>>>>>> And just to make sure it only occurs in an 'apply' call, here is the
>>>>>> same directly on a numeric vector:
>>>>>>
>>>>>> id2 <- format(1:110000, scientific = FALSE)
>>>>>> id2[99990:100010]
>>>>>> [1] " 99990" " 99991" " 99992" " 99993" " 99994" " 99995" " 99996"
>>>>>> [8] " 99997" " 99998" " 99999" "100000" "100001" "100002" "100003"
>>>>>> [15] "100004" "100005" "100006" "100007" "100008" "100009" "100010"
>>>>>>
>>>>>> Here the leading spaces are for every number, which makes sense to me.
>>>>>> Is there anything I'm misinterpreting in the behaviour of 'format'?
>>>>>> Thanks in advance for any hint,
>>>>>> Mathieu.
>>>>>>
>>>>>>
>>>>>> PS: Some background for this question. It all comes from an Rmd
>>>>>> document that knitr consistently failed to process, while the R code was
>>>>>> fine using batch or interactive R. knitr uses 'options(digits = 4)' as
>>>>>> opposed to the R default of 'options(digits = 7)', which made one of my
>>>>>> functions throw an error with knitr, but not with batch or interactive R. I
>>>>>> managed to solve the problem using 'trim = TRUE' in 'format', but I still do
>>>>>> not understand what's going on...
>>>>>> If you're interested, see here for more details on the original
>>>>>> problem:
>>>>>> http://stackoverflow.com/questions/17866230/knitr-vs-interactive-r-behaviour/17872176
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> ~$ whoami
>>>>>> Mathieu Basille, PhD
>>>>>>
>>>>>> ~$ locate --details
>>>>>> University of Florida \\
>>>>>> Fort Lauderdale Research and Education Center
>>>>>> (+1) 954-577-6314
>>>>>> http://ase-research.org/basille
>>>>>>
>>>>>> ~$ fortune
>>>>>> « The point is to say everything, and I lack the words
>>>>>> And I lack the time, and I lack the boldness. »
>>>>>> -- Paul Éluard
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
>>>>>> http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>>
>>>>> David Winsemius
>>>>> Alameda, CA, USA
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
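
As a recap of the 'trim = TRUE' workaround discussed in this thread, here is a minimal sketch that skips the row-wise 'apply' call altogether, assuming the ids are plain integers as in the examples above. The vector name 'ids' and the 'sprintf' variant are illustrations, not something proposed in the thread.

```r
old <- options(digits = 4)   # the knitr setting mentioned in the thread

ids <- 99990:100010          # integer ids, as in the simplified example

# Vectorised call with trim = TRUE: no padding and no per-row apply()
id_trim <- format(ids, scientific = FALSE, trim = TRUE)

# Alternative for integer ids: "%d" never pads and does not consult 'digits'
id_sprintf <- sprintf("%d", ids)

options(old)                 # restore the previous 'digits' setting

print(identical(id_trim, id_sprintf))
```

Working on the column directly also avoids the coercion to a double matrix that 'apply' performs on a data frame, and it should be much faster than formatting one row at a time.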


