[Rd] R string comparisons may vary with platform (plain text)

Mon Nov 24 15:36:20 CET 2014

The 'stringi' package claims to behave consistently across platforms. It
exposes much of the functionality of the ICU library and will attempt to
install ICU when it is not already present.
The function 'stri_sort' accepts collation options through its
'opts_collator' argument, which can be built with 'stri_opts_collator'.
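
A minimal sketch of what I mean, assuming 'stringi' is installed; the
sample vector and the choice of the "en_US" locale are only for
illustration, not something taken from the original report:

```r
library(stringi)

x <- c("b", "A", "a", "B")

# Collation goes through ICU with an explicitly named locale, so the
# session's LC_COLLATE setting is not consulted.
opts <- stri_opts_collator(locale = "en_US")
sorted_default <- stri_sort(x, opts_collator = opts)

# Same locale, but ask ICU to place upper case before lower case.
opts_upper <- stri_opts_collator(locale = "en_US", uppercase_first = TRUE)
sorted_upper <- stri_sort(x, opts_collator = opts_upper)

print(sorted_default)
print(sorted_upper)
```

Results may still depend on the ICU version that 'stringi' was built
against; 'stri_info()' should report which one is in use.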

On Sun, Nov 23, 2014 at 5:15 PM, Martin Morgan <mtmorgan at fredhutch.org>
wrote:

>
> For many scientific applications one is really dealing with ASCII
> characters and LC_COLLATE="C", even if the user is running in non-C
> locales. What robust approaches (if any?) are available to write code that
> sorts in a locale-independent way? The Note in ?Sys.setlocale is not overly
> optimistic about setting the locale within a session.
>
> Martin Morgan
>
>
> On 11/23/2014 03:44 AM, Prof Brian Ripley wrote:
>
>> On 23/11/2014 09:39, peter dalgaard wrote:
>>
>>>
>>> On 23 Nov 2014, at 01:05, Henrik Bengtsson <hb at biostat.ucsf.edu> wrote:
>>>>
>>>> On Sat, Nov 22, 2014 at 12:42 PM, Duncan Murdoch
>>>> <murdoch.duncan at gmail.com> wrote:
>>>>
>>>>> On 22/11/2014, 2:59 PM, Stuart Ambler wrote:
>>>>>
>>>>>> A colleague's R program behaved differently when I ran it, and we
>>>>>> thought we traced it probably to different results from string
>>>>>> comparisons as below, with different R versions.  However the
>>>>>> platforms also differed.  A friend ran it on a few machines and found
>>>>>> that the comparison behavior didn't correlate with R version, but
>>>>>> rather with platform.
>>>>>>
>>>>>> I wonder if you've seen this.  If it's not some setting I'm unaware
>>>>>> of, maybe someone should look into it.  Sorry I haven't taken the time
>>>>>> to read the source code myself.
>>>>>>
>>>>>
>>>>> Looks like a collation order issue.  See ?Comparison.
>>>>>
>>>>
>>>> With the oddity that both platforms use what look like similar locales:
>>>>
>>>> LC_COLLATE=en_US.UTF-8
>>>> LC_COLLATE=en_US.utf8
>>>>
>>>
>>> It's the sort of thing that I've tried to wrap my mind around multiple
>>> times and failed, but have a look at
>>>
>>> http://stackoverflow.com/questions/19967555/postgres-collation-differences-osx-v-ubuntu
>>>
>>>
>>> which seems to be essentially the same issue, just for Postgres. If you
>>> have the stamina, also look into the python question that it links to.
>>>
>>> As I understand it, there are two potential reasons: Either the two
>>> platforms are not using the same collation table for en_US, or at least
>>> one of them is not fully implementing the Unicode Collation Algorithm.
>>>
>>
>> And I have seen both with R.  At the very least, check if ICU is being
>> used (capabilities("ICU") in current R, maybe not in some of the
>> obsolete versions seen in this thread).
>>
>> As a further possibility, there are choices in the UCA (in R, see
>> ?icuSetCollate) and ICU can be compiled with different default choices.
>> It is not clear to me what (if any) difference ICU versions make, but in
>> R-devel extSoftVersion() reports that.
>>
>>
>>> In general, collation is a minefield: Some languages have the same
>>> letters in different order (e.g. Estonian with Z between S and T);
>>> accented characters sort with the unaccented counterpart in some
>>> languages but as separate characters in others; some locales sort ABab,
>>> others AaBb, yet others aAbB; sometimes punctuation is ignored,
>>> sometimes not; sometimes multiple characters count as one, etc.
>>>
>> As ?Comparison has long said.
>>
>>
>>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
