[R] Confusing behavior when using gsub to insert unicode character (minimal working example provided)

Ista Zahn istazahn at gmail.com
Thu May 29 23:23:27 CEST 2014


Hi Thomas,

On Thu, May 29, 2014 at 9:15 AM, Thomas Stewart
<tgs.public.mail at gmail.com> wrote:
> Thanks to Ista Zahn, I was able to find a work around solution.  The key
> seems to be that string1 needs to be encoded as UTF-8 prior to being passed
> to gsub.  For whatever reason,
>
> Encoding(string1) <- "UTF-8"
>
> does not change the encoding on my Windows machine.

Right, because "ASCII strings will never be marked with a declared
encoding" (read ?Encoding again).
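
A quick way to see this, as a minimal sketch (the variable names x and y
are only for illustration and are not from your example):

```r
x <- "text X"            # pure ASCII
Encoding(x) <- "UTF-8"   # silently ignored for ASCII strings
Encoding(x)              # "unknown"

y <- "text \u2265"       # contains a non-ASCII character
Encoding(y)              # "UTF-8"
```

So the assignment is not failing on Windows specifically; there is simply
nothing to mark until the string contains a non-ASCII character.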

> The work around:  I
> paste an obvious UTF-8 character "\u00A0" to the start of the string, send
> the string through gsub, then remove the "\u00A0" character from the output.
>
> string1 <- "\u00A0text X"; string1
> Encoding(string1)
> new_string1 <- gsub("X","\u2265",string1); new_string1
> new_string2 <- substring(new_string1,2); new_string2
>
> If you know of a less hackish way to accomplish this, I'm interested to
> hear it.

Why not just set the encoding after the fact, as I suggested?

string1 <- "X"; string1
new_string1 <- gsub("X","\u2265",string1); new_string1
Encoding(new_string1) <- "UTF-8"; new_string1
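
To check that the marking took effect, something like the following may
help. It is only a sketch: it reuses your "text X" input, and the byte
values in the comment are the ones from your own charToRaw() output.

```r
string1 <- "text X"
new_string1 <- gsub("X", "\u2265", string1)

# Declare the result as UTF-8; this changes the mark, not the bytes.
Encoding(new_string1) <- "UTF-8"

Encoding(new_string1)    # "UTF-8"
charToRaw(new_string1)   # 74 65 78 74 20 e2 89 a5
```

The bytes are the same before and after; only the declared encoding
differs, which is what affects how the string is printed.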

Best,
Ista
> However, this work around is sufficient for now.
>
> Thanks,
> -tgs
>
>
> On Wed, May 28, 2014 at 10:25 PM, Thomas Stewart <tgs.public.mail at gmail.com>
> wrote:
>
>> Can anyone help me understand the following behavior?
>>
>> I want to replace the letter 'X' in the string 'text X' with '≥' (\u2265).
>> The output from gsub is not what I expect.  It gives: "text ≥".
>>
>> Now, suppose I want to replace the character '≤' in the string 'text ≤'
>> with '≥'.  Then, gsub gives the expected, desired output.
>>
>> What am I missing?
>>
>> Thanks for any insight.
>> -tgs
>>
>> Minimal Working Example:
>>
>> string1 <- "text X"; string1
>> new_string1 <- gsub("X","\u2265",string1); new_string1
>>
>> string2 <- "text \u2264"; string2
>> new_string2 <- gsub("\u2264","\u2265",string2); new_string2
>>
>> charToRaw(new_string1)
>> charToRaw(new_string2)
>>
>> sessionInfo()
>>
>> ## OUTPUT
>>
>> > string1 <- "text X"; string1
>> [1] "text X"
>>
>> > new_string1 <- gsub("X","\u2265",string1); new_string1
>> [1] "text ≥"
>>
>> > string2 <- "text \u2264"; string2
>> [1] "text ≤"
>>
>> > new_string2 <- gsub("\u2264","\u2265",string2); new_string2
>> [1] "text ≥"
>>
>> > charToRaw(new_string1)
>> [1] 74 65 78 74 20 e2 89 a5
>>
>> > charToRaw(new_string2)
>> [1] 74 65 78 74 20 e2 89 a5
>>
>> > sessionInfo()
>> R version 3.0.2 (2013-09-25)
>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>
>> locale:
>> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
>> [4] LC_NUMERIC=C                           LC_TIME=English_United States.1252
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> loaded via a namespace (and not attached):
>> [1] tools_3.0.2
>>
>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


