[Rd] split() - unexpected sorting of results

Rui Barradas ruipbarradas at sapo.pt
Sat Oct 21 06:35:29 CEST 2017


Hello,

To sort numbers that have been converted to character strings, there is
the stringr package, with functions str_sort and str_order.

library(stringr)

set.seed(2447)

x <- sample(11L)
sort(as.character(x))
#[1] "1"  "10" "11" "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"

str_sort(as.character(x), numeric = TRUE)
#[1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11"

str_order(as.character(x), numeric = TRUE)
#[1]  1  4 11  8  6  5  3 10  9  7  2

i <- str_order(as.character(x), numeric = TRUE)
as.character(x)[i]
#[1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11"


Unfortunately, this does not solve the OP's question: factor(),
as.factor(), split() and others use the base R sort order by default,
and that default can only be changed by changing their sources.
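
That said, a caller who controls the grouping vector can sidestep the
default without touching any sources by passing the levels explicitly.
A minimal sketch in base R, assuming every key parses as a number (the
names keys, lev and res are illustrative only):

```r
# Character keys; the default level order would be "1", "10", "11", "2", ...
keys <- as.character(1:11)

# Order the distinct keys by their numeric value (assumes all keys are numeric)
lev <- unique(keys)
lev <- lev[order(as.numeric(lev))]

# Explicit levels override the default sort inside factor(), and hence split()
f <- factor(keys, levels = lev)
res <- split(1:11, f)
names(res)
```

With the levels fixed this way, the list returned by split() follows
the numeric order of the keys rather than their character order.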

Hope this helps,

Rui Barradas

On 21-10-2017 00:32, Hervé Pagès wrote:
> Hi,
>
> On 10/20/2017 12:53 PM, Peter Meissner wrote:
>> Thanks, for the explanation.
>>
>> Still, I think this is surprising behaviour which might be handled
>> better.
>
> Maybe a little surprising, but no more than:
>
>  > x <- sample(11L)
>
>  > sort(x)
>   [1]  1  2  3  4  5  6  7  8  9 10 11
>
>  > sort(as.character(x))
>   [1] "1"  "10" "11" "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"
>
> The fact that sort(), as.factor(), split() and many other things behave
> consistently with respect to the underlying order of character vectors
> avoids other even bigger surprises.
>
> Also note that the underlying order of character vectors actually
> depends on your locale. One way to guarantee consistent results across
> platforms/locales is by explicitly specifying the levels when making
> a factor e.g.
>
>    f <- factor(x, levels=unique(x))
>    split(1:11, f)
>
> This is particularly sensible when writing unit tests.
>
> Cheers,
> H.
>
>>
>> Best, Peter
>>
>> On 20.10.2017 at 9:49 PM, "Iñaki Úcar" <i.ucar86 at gmail.com> wrote:
>>
>>> Hi Peter,
>>>
>>> 2017-10-20 21:33 GMT+02:00 Peter Meissner <retep.meissner at gmail.com>:
>>>> Hey,
>>>>
>>>> I found this - for me - quite surprising and puzzling behaviour of
>>>> split().
>>>>
>>>>
>>>> split(1:11, as.character(1:11))
>>>> split(1:11, 1:11)
>>>>
>>>>
>>>> When splitting by numerics everything works as expected - sorting of
>>>> input == sorting of output -- but when using a character vector
>>>> everything gets re-sorted alphabetically.
>>>>
>>>>
>>>> Although there are some references in the help files to what happens
>>>> when using split, I did not find any note on this - for me - rather
>>>> unexpected behaviour.
>>>
>>> As the documentation states,
>>>
>>>         f: a ‘factor’ in the sense that ‘as.factor(f)’ defines the
>>>            grouping, or a list of such factors in which case their
>>>            interaction is used for the grouping.
>>>
>>> And, in fact,
>>>
>>> > as.factor(1:11)
>>>   [1] 1  2  3  4  5  6  7  8  9  10 11
>>> Levels: 1 2 3 4 5 6 7 8 9 10 11
>>>
>>> > as.factor(as.character(1:11))
>>>   [1] 1  2  3  4  5  6  7  8  9  10 11
>>> Levels: 1 10 11 2 3 4 5 6 7 8 9
>>>
>>> Regards,
>>> Iñaki
>>>
>>>> I would like it best if the order of split results stayed the same
>>>> no matter the input (sorting of input == sorting of output).
>>>>
>>>> If that is not possible, a note of caution in the help pages and
>>>> maybe an example might be valuable.
>>>>
>>>>
>>>> Best, Peter
>>>>
>>>> ______________________________________________
>>>> R-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>
>>>
>>
>
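
For the OP's original wish (order of output == order of input), the
explicit-levels idea quoted above also works with character keys. A
small sketch with made-up keys; the last line relies on
sort(method = "radix") collating as in the C locale, as described in
?sort, and the names keys, res and lev_c are illustrative only:

```r
keys <- c("10", "2", "1", "2", "10")

# Levels in order of first appearance, independent of the locale
f <- factor(keys, levels = unique(keys))
res <- split(seq_along(keys), f)
names(res)

# If a sorted but locale-independent order is wanted instead,
# the radix method collates as in the C locale
lev_c <- sort(unique(keys), method = "radix")
lev_c
```

Either set of levels gives the same result on every platform, which is
what matters for unit tests.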


