[R] pasteFromExcel

David Winsemius dwinsemius at comcast.net
Tue Oct 21 22:23:33 CEST 2014


To the List;

I replied privately with a screenshot and an .xlsx worked example since it seemed the major issue was how Excel handled currency formatting for character strings created n R, ... not really an on-topic subject matter for this list. I know Dan also reads the R-SIG-Insurance list and it may be more on-topic over there.

-- 
David.

On Oct 21, 2014, at 6:03 AM, Dan Murphy wrote:

> Sure, that is one pattern to try to detect, but there are many more
> (e.g., cannot have multiple '$' or '€'). For speed, I'm looking for a
> *single* expression to detect valid currency strings in one grep.
> 
> The one shown for US works. For euros, it might suffice to replace '$'
> with '€' but I cannot test that in my location. Can you? I.e.,
> 
> Change currencypattern to
> 
> currencypattern <-
>  "^\\€?\\-?([1-9]{1}[0-9]{0,2}(\\,\\d{3})*(\\.\\d{0,2})?|[1-9]{1}\\d{0,}(\\.\\d{0,2})?|0(\\.\\d{0,2})?|(\\.\\d{1,2}))€|^\\-?\\€?([1-9]{1}\\d{0,2}(\\,\\d{3})*(\\.\\d{0,2})?|[1-9]{1}\\d{0,}(\\.\\d{0,2})?|0(\\.\\d{0,2})?|(\\.\\d{1,2}))€|^\\€?\\(([1-9]{1}\\d{0,2}(\\,\\d{3})*(\\.\\d{0,2})?|[1-9]{1}\\d{0,}(\\.\\d{0,2})?|0(\\.\\d{0,2})?|(\\.\\d{1,2}))\\)€"
> 
> Does Excel display euro values per the format within this test vector?
> 
> x <- c("1,234.00", "12,34.00", "€1,000", "(124)", "$(123)", "(€123)",
> "  1,000   ", "NA")
> 
> and does grep yield the correct answer? It should, but better to test
> it than assume.
> 
> grep(currencypattern, trim(x))
> [1] 1 3 4 5 7  # correct answer
> 
> I suppose my biggest holdup is knowing how Excel formats currencies in
> other denominations. Maybe there's a way for me to test euro,
> sterling, etc. in my location, but I haven't discovered it yet. :(
> 
> Again, thanks for your help.
> 
> -Dan
> 
> On Mon, Oct 20, 2014 at 3:39 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>> 
>> On Oct 20, 2014, at 2:34 PM, Dan Murphy wrote:
>> 
>>> Good ideas, David.
>>> 
>>> 1) By "confirm that MS Excel honors that OutDec" I mean that, in a
>>> location (France? others?) where options("OutDec") is a comma, does MS
>>> Excel format numbers that way when displaying currencies with decimal
>>> places? I have no way of knowing if that is true in all OutDec = ","
>>> locales.
>>> 
>>> 2) I wish it were as simple as just removing unwanted "adornments."
>>> The issue is that such "adornments" must be in their proper places for
>>> the character string to represent a currency value, or a numeric value
>>> for that matter. If I add one more comma to your first element in the
>>> wrong place, it should not translate to a valid numeric, but it does
>>> with your gsub, which would be a bug if that were in pasteFromExcel:
>>>> gsub(rmchar, "", c("$1,0,00", "1,200", "800"))
>>> [1] "1000" "1200" "800"
>> 
>> If you wanted to restrict the substitutions to only the commas that were succeeded by three digits then this succeeds:
>> 
>> gsub("(\\,)(\\d{3,3})", "\\2", c("1,000,000,000.00") )
>> [1] "1000000000.00"
>> 
>> You should also take a look at formatC which has provisions for output using commas.
>> 
>> -
>> david.
>> 
>> 
>>> 
>>> When I originally looked into this I believed I couldn't be the first
>>> one asking that question .. and I wasn't. There are many hits for
>>> regular expressions that purport to successfully identify well-formed
>>> *US dollar* currency strings. The expression in pasteFromExcel is
>>> based on http://stackoverflow.com/questions/354044/what-is-the-best-u-s-currency-regex.
>>> 
>>> I'm curious if anyone has come across -- and tested -- a similar
>>> regular expression in other places that might have use for
>>> pasteFromExcel.
>>> 
>>> This is how pasteFromExcel uses its currency regular expression (the
>>> first ugly assignment is what I'm looking for in other locales around
>>> the world -- maybe there's a Regular Expression mailing list out
>>> there):
>>> 
>>> currencypattern <-
>>> "^\\$?\\-?([1-9]{1}[0-9]{0,2}(\\,\\d{3})*(\\.\\d{0,2})?|[1-9]{1}\\d{0,}(\\.\\d{0,2})?|0(\\.\\d{0,2})?|(\\.\\d{1,2}))$|^\\-?\\$?([1-9]{1}\\d{0,2}(\\,\\d{3})*(\\.\\d{0,2})?|[1-9]{1}\\d{0,}(\\.\\d{0,2})?|0(\\.\\d{0,2})?|(\\.\\d{1,2}))$|^\\$?\\(([1-9]{1}\\d{0,2}(\\,\\d{3})*(\\.\\d{0,2})?|[1-9]{1}\\d{0,}(\\.\\d{0,2})?|0(\\.\\d{0,2})?|(\\.\\d{1,2}))\\)$"
>>> 
>>> # Here's a test vector
>>> x <- c("1,234.00", "12,34.00", "$1,000", "(124)", "$(123)", "($123)",
>>> "  1,000   ", "NA")
>>> 
>>> # grep will tell you whether elements of x, trimmed of
>>> beginning/ending whitespace, match the currencypattern
>>> grep(currencypattern, trim(x))
>>> [1] 1 3 4 5 7  # correct answer
>>> 
>>> *Now* one may remove unwanted characters from the well-formed strings.
>>> And deal with the "negatives" of course .. and NAs. See how that's
>>> done in excelRio.r in the excelRio package on github:
>>> https://github.com/trinostics/excelRio
>>> 
>>> Thanks for your interest.
>>> 
>>> 
>>> On Mon, Oct 20, 2014 at 10:56 AM, David Winsemius
>>> <dwinsemius at comcast.net> wrote:
>>>> 
>>>> On Oct 20, 2014, at 10:29 AM, Dan Murphy wrote:
>>>> 
>>>>> Nice.
>>>>> So if someone were to offer a currency regular expression that works
>>>>> in their locale, I should also ask them to give me the results of
>>>>> Sys.getlocale("LC_MONETARY")
>>>>> and
>>>>> options("OutDec")
>>>>> and confirm that MS Excel honors that OutDec.
>>>> 
>>>> I'm not sure we can know what you mean by "confirm that MS Excel honors that OutDec." The result of options("OutDec") was intended for you to determine what character not to remove from a monetary value in an R workspace. If the assumption is that all values will be in the same unit and that the user is not doing any currency conversions then:
>>>> 
>>>>> decsep <- options("OutDec")
>>>>> rmchar <- paste0( "[$£€", c(".", ",")[!c(".", ",") %in% decsep], "]" )
>>>>> gsub(rmchar, "", c("$1,000", "1,200", "800"))
>>>> [1] "1000" "1200" "800"
>>>> 
>>>> 
>>>>> Thank you, David.
>>>>> -Dan
>>>>> 
>>>>> On Mon, Oct 20, 2014 at 10:04 AM, David Winsemius
>>>>> <dwinsemius at comcast.net> wrote:
>>>>>> 
>>>>>> On Oct 19, 2014, at 11:18 PM, Dan Murphy wrote:
>>>>>> 
>>>>>>> To Users of Excel:
>>>>>>> 
>>>>>>> Following advice from Brian and Markus, I created an RMarkdown "vignette"
>>>>>>> that shows an example of how the pasteFromExcel function in the excelRio
>>>>>>> package on github could be used by an actuary to transfer a triangle from
>>>>>>> Excel to R. See today's post at http://trinostics.blogspot.com/
>>>>>>> 
>>>>>>> Unfortunately, if you are located outside the US, the demonstrated
>>>>>>> functionality will not work for you because the currency regex implemented
>>>>>>> assumes the dollar sign ($) and comma/decimal punctuation of the form
>>>>>>> 999,999.00.
>>>>>>> 
>>>>>>> If anyone is interested in contributing currency regex expressions that
>>>>>>> work in your locale, I would be happy to try to incorporate them in the
>>>>>>> package. If anyone knows how best to determine the user's locale (might
>>>>>>> "timezone" suffice?), I'd appreciate that help too.
>>>>>>> 
>>>>>> 
>>>>>> ?Sys.getlocale   # perhaps "LC_MONETARY"
>>>>>> 
>>>>>> ?options   # look for OutDec
>>>>>> 
>>>>>> 
>>>>>>>    [[alternative HTML version deleted]]
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> David Winsemius
>>>>>> Alameda, CA, USA
>>>>>> 
>>>> 
>>>> David Winsemius
>>>> Alameda, CA, USA
>>>> 
>> 
>> David Winsemius
>> Alameda, CA, USA
>> 

David Winsemius
Alameda, CA, USA



More information about the R-help mailing list