[R] Symbol/String comparison in R

Jeff Newmiller jdnewmil using dcn.davis.ca.us
Thu Apr 14 18:44:04 CEST 2022


I encourage others to read Kristjan's clarifications, which are quoted 
beneath Timothy's reply below, as I very nearly missed them. I get results 
similar to his on Windows 10 (21H1, with R 4.1.2) and Ubuntu (20.04, with 
R 4.1.3), both of which report Sys.getlocale("LC_COLLATE") as "C.UTF-8", 
which non-definitive sources (e.g. [1]) say should sort in raw code point 
order.
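
For anyone who wants to check this on their own machine, here is a minimal 
sketch (an illustration only, not output from the systems mentioned above). 
It uses base R and explicitly requests the "C" collation, so the 
comparisons are made byte by byte rather than depending on whatever locale 
the session started with. The variable names are my own.

```r
# Report the collation locale currently in effect (differs per system).
old_collate <- Sys.getlocale("LC_COLLATE")
print(old_collate)

# Switch to the "C" locale, where strings are compared byte by byte.
Sys.setlocale("LC_COLLATE", "C")

upper_first <- "A" < "a"                 # byte 0x41 versus byte 0x61
digits_cmp  <- "1040" <= "12000"         # decided at the 2nd character
prefix_cmp  <- "raining" <= "raining x"  # a prefix sorts before its extension
sorted_c    <- sort(c("b", "A", "a", "B"))

print(c(upper_first, digits_cmp, prefix_cmp))
print(sorted_c)

# Restore the original collation setting.
Sys.setlocale("LC_COLLATE", old_collate)
```

If the same comparisons give different answers before the call to 
Sys.setlocale(), then the session's collation is not plain code point 
order, whatever name the locale reports.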

Re: the OP's reference to the "local" function... do learn to read the help 
for functions, as in ?local. That function has nothing to do with the 
concept of a locale (with an e at the end), for which other replies have 
suggested reading ?Sys.setlocale or ?icuSetCollate. The latter, by the way, 
indicates that its functionality is OPTIONAL and should not control 
collation until it is invoked to activate it, so it could be regarded as 
informative but tangential unless the documentation is out of date.
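
To make the local/locale distinction concrete, here is a small sketch using 
only base R. The names scratch_value, result, collate and has_icu are made 
up for the illustration, and the printed locale and ICU status will vary 
between installations.

```r
# local() evaluates an expression in a new, temporary environment.
result <- local({
  scratch_value <- 21   # exists only inside this block
  scratch_value * 2
})
print(result)

# The temporary variable is not visible in the calling environment.
print(exists("scratch_value", inherits = FALSE))

# Locale settings are queried with Sys.getlocale(), not with local().
collate <- Sys.getlocale("LC_COLLATE")
print(collate)

# Whether this build of R is able to use ICU for collation at all.
has_icu <- capabilities("ICU")
print(has_icu)
```

Nothing in the first half touches collation, which is why trying local() 
gives no insight into how strings are ordered.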

Timothy, at some point it would behoove you to try to answer the questions 
that are asked instead of going off on tangents.

- Posting in HTML is interfering with communication, as what you see is
not what we see (e.g. "quotes" in "?abaca? < ?acaba?" below).
- OP seems to have used the term symbol to indicate what I might refer to 
as "glyph", a visual representation. They are apparently not familiar with 
the meta-language handling in R that refers to syntactic language elements 
as symbols. So
    a<b
was never included in the question. Therefore, issues around R symbols T 
vs t are tangents that divert from the question.
- OP was clear from the beginning that they are familiar with the ASCII 
code point table, so pointing them at it missed the mark by a mile.
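
For readers unsure of the distinction drawn above, a rough sketch in base R 
of an R symbol versus a character string, plus one way to look at the code 
points the OP was asking about. All variable names are invented for the 
example.

```r
# A symbol is a language object that names a variable; a string is data.
sym <- quote(a)   # the symbol `a`, captured without evaluating it
str <- "a"        # a character vector holding the one-letter string

print(is.symbol(sym))
print(is.character(str))

# Comparing variables uses their values; comparing strings ignores them.
a <- 5
b <- 5
vars_cmp    <- a == b      # compares the values stored in a and b
strings_cmp <- "a" == "b"  # compares the two strings themselves
print(c(vars_cmp, strings_cmp))

# Code points of the characters, as integers, in hex, and as raw bytes.
cp <- utf8ToInt("aA")
print(cp)
print(as.hexmode(cp))
print(charToRaw("aA"))
```

The 61 and 41 shown by charToRaw() are hexadecimal, which is where the OP's 
"raw values" come from. Whether "a" < "A" follows those numbers depends on 
the collation in use, not on anything to do with symbols.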

The upside of this latest post is that you re-introduced the OP's off-list 
clarification comments to the mailing list thread. Kristjan, do use 
reply-all if you don't want to get lost in off-list conversations with 
an audience of 1, and do read the Posting Guide (e.g. re plain text).

[1] https://community.hpe.com/t5/HP-UX-General/Difference-between-C-utf8-and-en-us-utf8-points/td-p/4418194#.YlhHXujMLb0

On Thu, 14 Apr 2022, Ebert,Timothy Aaron wrote:

> These outcomes are correct. It is an element-wise comparison with the leftmost comparison taking precedence.
> ?abaca? < ?acaba?
> TRUE
> ?abaca? < ?acabammmmm?
> TRUE
>
> This is important because it makes clear sort order: What is the right order for the numbers 1 through 10 sorted as character?
>
> Another fun game is to explain
> F > T
> FALSE
> f > t
> error
> ?f?>?t?
> FALSE
>
> I can also change the first outcome
> F<-4
> T<-2
> F>T
> TRUE
>
> The key is to know when I am comparing variables versus strings, and that T and F are shortcuts R provides by default for TRUE and FALSE that can be reset by the user.
>
> Tim
>
> From: Kristjan Kure <kristjan.kure.1 using gmail.com>
> Sent: Thursday, April 14, 2022 8:00 AM
> To: Ebert,Timothy Aaron <tebert using ufl.edu>
> Cc: Bert Gunter <bgunter.4567 using gmail.com>; R-help <r-help using r-project.org>
> Subject: Re: [R] Symbol/String comparison in R
>
> Thank you for your response. This is the current status:
>
> I am looking for the fundamental reason behind these comparisons:
> "1040" <= "12000" # returns true
> "1040" <= "10000" # returns false
> "a" < "A" # returns true
> "A" < "a" # returns false
> "raining" <= "raining x" #true
>
> Feedback so far:
> 1) Bert: "lexicographic", "locale"
> 2) Timothy: https://en.wikipedia.org/wiki/ASCII
>
> My comments:
> 1) "Lexicographic" - The phrase lexicographic order means alphabetical order. This will help me only when comparing:
> "a" < "b" # I suppose it returns true, bluntly because (1 < 2)? Position 1 - A, position 2 - B
> "b" < "a" # I suppose it returns false, bluntly because (2 < 1)? Position 2 - B, position 1 - A
>
> 2) Checking the alphabet or ASCII table won't help me understand why "a" < "A" returns true.
>
> 3) Letter "A" has smaller values compared to "a" (Checking oct, dec, hex values in https://en.wikipedia.org/wiki/ASCII). On the other hand, if alphabetical order is different
> in country X the whole ASCII table is obsolete?
>
> 4) "Locale" - I understand the order of letters can be different between locales/alphabets. Still, it does not help with "a" < "A" comparison.
> Tried to use local() function in RStudio - did not get additional insight. Or is there any local table somewhere listing all symbols, lowercase, and uppercase symbols/letters?
>
> I understand it might be a rare occasion for this type of comparison, but I still want to understand the why. Also, some functions might return strings instead of numbers
> and then it might be helpful to understand what is really going on.
>
> If no one can answer how these comparisons fundamentally work should this kind of string comparison return NaN in R?
> Best regards,
> Kristjan
>
> On Thu, Apr 14, 2022 at 1:20 PM Ebert,Timothy Aaron <tebert using ufl.edu> wrote:
> For some issues it can be useful to learn by experiment. It gives you experience and shows you what sorts of error messages you can expect. In the console type things like this:
> a>B
> gives an error
> "a">"B"
> FALSE
>
>
> -----Original Message-----
> From: R-help <r-help-bounces using r-project.org> On Behalf Of Bert Gunter
> Sent: Wednesday, April 13, 2022 10:00 PM
> To: Kristjan Kure <kristjan.kure.1 using gmail.com>
> Cc: R-help <r-help using r-project.org>
> Subject: Re: [R] Symbol/String comparison in R
>
> "I was not able to find answers to my questions (tried Google, Stack Overflow, etc). Please correct me if anything is wrong here."
>
> R has an extensive Help system. That should always be your first place to look. In this case, ?"<" (at the R prompt) brings you to the Help page for comparisons (as would ?Comparison, but only if the 'c' is in upper case, unfortunately). Among lots of other stuff, it says:
>
> "Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use: see locales." ... (+ lots more).
>
> Incidentally, rseek.org and rdrr.io are another couple of good places to look for R documentation.
>
>
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Wed, Apr 13, 2022 at 5:10 PM Kristjan Kure <kristjan.kure.1 using gmail.com> wrote:
>>
>> Hi!
>>
>> Sorry, I am a beginner in R.
>>
>> I was not able to find answers to my questions (tried Google, Stack
>> Overflow, etc). Please correct me if anything is wrong here.
>>
>> When comparing symbols/strings in R - raw numeric values are compared
>> symbol by symbol starting from left? If raw numeric values are not
>> used is there an ASCII / Unicode table where symbols have
>> values/ranking/order and R compares those values?
>>
>> *2) Comparing symbols*
>> Letter "a" raw value is 61, letter "b" raw value is 62? Is this correct?
>>
>> # Raw value for "a" = 61
>> a_raw <- charToRaw("a")
>> a_raw
>>
>> # Raw value for "b" = 62
>> b_raw <- charToRaw("b")
>> b_raw
>>
>> # equals TRUE
>> "a" < "b"
>>
>> Ok, so 61 is less than 62 so it's TRUE. Is this correct?
>>
>> *3) Comparing strings #1*
>> "1040" <= "12000"
>>
>> raw_1040 <- charToRaw("1040")
>> raw_1040
>> #31 *30* (comparison happens with the second symbol) 34 30
>>
>> raw_12000 <- charToRaw("12000")
>> raw_12000
>> #31 *32* (comparison happens with the second symbol) 30 30 30
>>
>> The symbol in the second position is 30 and it's less than 32. Equals
>> to true. Is this correct?
>>
>> *4) Comparing strings #2*
>> "1040" <= "10000"
>>
>> raw_1040 <- charToRaw("1040")
>> raw_1040
>> #31 30 *34*  (comparison happens with third symbol) 30
>>
>> raw_10000 <- charToRaw("10000")
>> raw_10000
>> #31 30 *30*  (comparison happens with third symbol) 30 30
>>
>> The symbol in the third position is 34 is greater than 30. Equals to false.
>> Is this correct?
>>
>> *5) Problem - Why does this equal FALSE?* *"A" < "a"*
>>
>> 41 < 61 # FALSE?
>>
>> # Raw value for "A" = 41
>> A_raw <- charToRaw("A")
>> A_raw
>>
>> # Raw value for "a" = 61
>> a_raw <- charToRaw("a")
>> a_raw
>>
>> Why is capitalized "A" not less than lowercase "a"? Based on raw
>> values it should be. What am I missing here?
>>
>> Thanks
>> Kristjan
>>
>>         [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> 	[[alternative HTML version deleted]]

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil using dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                       Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k


