[R] removing specified length of text after a period in dataframe of char's

Wed Dec 7 15:40:23 CET 2011

Hi Sarah,

this is a neat solution. Thanks very much for your help, and your
patience with my poorly posed questions. I've learned a lot from your
approach.

best regards,
Aidan

On Wed, Dec 7, 2011 at 1:40 PM, Sarah Goslee <sarah.goslee at gmail.com> wrote:
> Hi,
>
> If you really wanted precision (significant figures) rather than decimal places,
> it would be easy: format() handles that, I believe.
>
> Your original email said you'd been reading about regular expressions;
> continuing that reading will lead you to the meaning of the cryptic ^
> and all the \.
>
> As for the final ., you're right: I didn't think about having nothing
> following the decimal place. It's much easier to do in two steps:
>
>> testdata <- data.frame(values=c("10,000.0", "5.321", "1.1"), digits=c(0, 1, 2))
>> intermediate <- apply(testdata, 1, function(x)sub(paste("(^.*\\.\\d{", x[2], "})(\\d*)", sep=""), "\\1", x[1]))
>> intermediate
> [1] "10,000." "5.3"     "1.1"
>> sub("\\.$", "", intermediate)
> [1] "10,000" "5.3"    "1.1"
>
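
For reference, the two steps above could be wrapped into one helper along these lines. This is only a sketch: the function name truncate_decimals and its arguments are made up for illustration and do not appear in the thread, and it assumes the values are already character strings.

```r
# Hypothetical helper (name and arguments are illustrative, not from the thread):
# truncate each string to `digits` decimal places without rounding,
# then drop a period that is left with nothing after it.
truncate_decimals <- function(values, digits) {
  stopifnot(length(values) == length(digits))
  out <- mapply(function(v, d) {
    # Same pattern as in the apply() call: keep everything up to and
    # including d digits after the period, drop any remaining digits.
    pattern <- paste0("(^.*\\.\\d{", d, "})(\\d*)")
    sub(pattern, "\\1", v)
  }, as.character(values), digits, USE.NAMES = FALSE)
  # Second step: remove a trailing period (the digits = 0 case).
  sub("\\.$", "", out)
}

result <- truncate_decimals(c("10,000.0", "5.321", "1.1"), c(0, 1, 2))
result
# [1] "10,000" "5.3"    "1.1"
```

Like the original, this does not pad with zeros, so "1.1" stays "1.1" even when two digits are requested.
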
> Sarah
> On Wed, Dec 7, 2011 at 8:20 AM, Aidan Corcoran
> <aidan.corcoran11 at gmail.com> wrote:
>> Hi Sarah,
>>
>> apologies for the excess. A smaller example:
>>
>> f<-structure(list(c("GDP per capita (LCU)", "Ratio to EZ GDP Per Cap"
>> ), `2005` = c(32128, 0.1), `2009` = c(52163, 0.1), `2010` = c(63100,
>> 0.1), `2011` = c(72461, 0.1), `2012` = c(81313, 0.1)), .Names = c("",
>> "2005", "2009", "2010", "2011", "2012"), row.names = 3:4, class = c("cast_df",
>> "data.frame"))
>>
>> nam2<-
>> structure(list(var1 = c("GDP per capita (LCU)", "Ratio to EZ GDP Per Cap"
>> ), digi = c(0, 1)), .Names = c("var1", "digi"), row.names = c("98",
>> "110"), class = "data.frame")
>>
>> I'm trying to place a thousand separator in the numbers in the table f:
>>
>>> f
>>                             2005    2009    2010    2011    2012
>> 3    GDP per capita (LCU) 32128.0 52163.0 63100.0 72461.0 81313.0
>> 4 Ratio to EZ GDP Per Cap     0.1     0.1     0.1     0.1     0.1
>>
>> and also have precision given by variable digi:
>>
>>> nam2
>>                       var1 digi
>> 98     GDP per capita (LCU)    0
>> 110 Ratio to EZ GDP Per Cap    1
>>
>> Calling format() as follows
>>  hi<-format(f,big.mark=",",scientific=F)
>> gives me the comma, but now I'm not sure how to get the precision.
>>
>> Your answer seems to be doing what I want, although when I changed the
>> testdata slightly
>>>testdata[1,1]<-10000
>>>   hi<-format(testdata,big.mark=",",scientific=F)
>>> hi
>>    values digits
>> 1 10,000.0      0
>> 2      5.3      1
>> 3      1.1      2
>>> apply(hi, 1, function(x)sub(paste("(^.*\\.\\d{", x[2], "})(\\d*)", sep=""), "\\1", x[1]))
>>         1          2          3
>>  "10,000." "     5.3" "     1.1"
>> The decimal appears to be left behind in 10,000.
>>
>> Unfortunately your approach is a bit too advanced for me, so I can't
>> adapt it. Perhaps you could recommend somewhere where I could read up
>> on what the caret and other symbols mean in your paste call?
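
As a rough guide to the symbols asked about here, the pattern can be built for a single digit count and read piece by piece. The variable names digits and pattern below are illustrative only; the pattern itself is the one from the paste() call.

```r
# Build the pattern for one decimal place, as the paste() call does.
digits <- 1
pattern <- paste("(^.*\\.\\d{", digits, "})(\\d*)", sep = "")

# Reading the pattern left to right:
#   ^        anchors the match at the start of the string
#   .*       any run of characters (the integer part, commas included)
#   \\.      a literal period; the backslash is doubled because it sits
#            inside an R string
#   \\d{1}   exactly one digit (the part to keep)
#   (\\d*)   any further digits (the part to drop)
# In the replacement, "\\1" stands for whatever the first pair of
# parentheses matched, so only that part survives.
kept <- sub(pattern, "\\1", "5.321")
kept
# [1] "5.3"
```

With digits set to 0 the first group ends at the period itself, which is why "10,000." keeps its trailing period.
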
>>
>> thanks for your help!
>>
>> Aidan
>>
>> On Wed, Dec 7, 2011 at 12:05 PM, Sarah Goslee <sarah.goslee at gmail.com> wrote:
>>> Hi,
>>>
>>> Example data is crucial, but small simple example data is even better.
>>> I'm too lazy to figure out which bits I need from your data, so here's
>>> a simple example of one way to approach your question. You could
>>> use gsub() in very much the same manner if you need more complex
>>> output.
>>>
>>>> testdata <- data.frame(values=c(2.0, 5.3, 1.1), digits=c(0, 1, 2))
>>>> testdata
>>>  values digits
>>> 1    2.0      0
>>> 2    5.3      1
>>> 3    1.1      2
>>> # a nice way that works on numbers
>>>> apply(testdata, 1, function(x)sprintf(paste("%0.", x[2], "f", sep=""), x[1]))
>>> [1] "2"    "5.3"  "1.10"
>>>
>>> # a messy way that works on strings
>>>> apply(testdata, 1, function(x)sub(paste("(^.*\\.\\d{", x[2], "})(\\d*)", sep=""), "\\1", x[1]))
>>> [1] "2"   "5.3" "1.1"
>>>
>>> Also note that the second method will not add zeros to pad out the
>>> end. If you need that, I'd consider rearranging the order of your
>>> steps so that you can use sprintf().
>>>
>>> Someone else might have a more flexible way too; I'd be interested to see it.
>>> Unfortunately I don't think sprintf() has a way to insert a thousands separator,
>>> or that would be a one-step solution.
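
One possible one-step route, offered as a sketch rather than as what the thread settled on: base R's formatC() takes both a digits argument and big.mark. The variable names are illustrative, and this assumes the values are still numeric, before any call to format().

```r
# Numeric values and the number of decimal places wanted for each.
values <- c(10000, 5.321, 1.1)
digits <- c(0, 1, 2)

# formatC() with format = "f" gives fixed-point output; big.mark adds the
# thousands separator. mapply() pairs each value with its own digit count.
formatted <- mapply(
  function(v, d) formatC(v, format = "f", digits = d, big.mark = ","),
  values, digits, USE.NAMES = FALSE
)
formatted
# [1] "10,000" "5.3"    "1.10"
```

Note that this rounds rather than truncates, and pads with zeros like sprintf() does, so it behaves differently from the sub() approach on the last value.
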
>>>
>>> Sarah
>>>
>>> On Wed, Dec 7, 2011 at 6:05 AM, Aidan Corcoran
>>> <aidan.corcoran11 at gmail.com> wrote:
>>>>  Dear all,
>>>>
>>>>  I'm trying to remove some text after the period (a decimal point) in
>>>> the data frame 'hi', below. This is one step in formatting a table. So
>>>> I would like e.g.
>>>> "2.0" to become "2"
>>>> and "5.3" to be "5.3",
>>>> where the variable digordered contains the number of digits after the
>>>> decimal that I would like to display, in the same order in which the
>>>> variables appear in hi. If it makes it easier to use, this info is
>>>> also contained in the dataframe nam2. The reason the numbers are
>>>> recorded as characters is because I used format to get a thousand
>>>> separator, which I also need.
>>>>
>>>> The string manipulation functions in R generally don't seem to work
>>>> with matrices or data frames, so e.g.   regexpr("\\.",  hi[1,2]) works
>>>> but not regexpr("\\.", hi). Finding the location of the period and
>>>> then using substring was the approach I was thinking of taking, but
>>>> this would seem to need for loops here. I was wondering if anyone
>>>> knows any easier ways.
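
A small sketch of the point about loops: regexpr() works on a whole character vector at once, so it can be applied to each column of a data frame with sapply() instead of an explicit for loop. The data frame hi_toy below is made-up stand-in data, not the 'hi' from this message.

```r
# Toy character data frame standing in for 'hi' (values are illustrative).
hi_toy <- data.frame(a = c("2.0", "5.3"),
                     b = c("1,110.0", "3.5"),
                     stringsAsFactors = FALSE)

# regexpr() is vectorised over one column ...
regexpr("\\.", hi_toy$a)

# ... and sapply() runs it over every column, giving a matrix of the
# positions of the first period in each cell.
pos <- sapply(hi_toy, function(col) as.integer(regexpr("\\.", col)))
pos
```

The same column-wise pattern works for sub() or substring() once the positions, or a suitable regular expression, are known.
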
>>>>
>>>> Thanks very much for any help!
>>>>
>>>> Aidan
>>>>
>>>>
>>>> digordered<-  c(0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1)
>>>> f<-structure(list(c("GDP (LCU,bn)", "GDP ($, bn)", "GDP per capita (LCU)",
>>>> "Ratio to EZ GDP Per Cap", "Share of World GDP (Intl $, %)",
>>>> "Real GDP Growth (%)", "Population (mn)", "Unemployment Rate (%)",
>>>> "Ratio of Employed/Unemployed", "PPP Exchange Rate",
>>>> "Nominal Exchange Rate (LCU per $)",
>>>> "Inflation (%)", "Main Lending Rate to Private Sector (%)",
>>>> "Claims on Central Gov",
>>>> "Claims on Private Sector", "Bank Assets", "Regulator Capital to RWA",
>>>> "Tier 1 Capital to RWA", "Return on Equity", "Liquid Assets to ST Liabilities"
>>>> ), `2005` = c(35662, 809, 32128, 0.1, 4.3, 9, 1110, 3.5, NA,
>>>> 14.7, 44.1, 4, 10.8, 7, 15, 22835, NA, NA, NA, NA), `2009` = c(61240,
>>>> 1265, 52163, 0.1, 5.2, 6.8, 1174, NA, NA, 16.8, 48.4, 10.9, 12.2,
>>>> 14, 31, 47180, 13.6, 9, 10.8, 42.8), `2010` = c(75122, 1632,
>>>> 63100, 0.1, 5.5, 10.1, 1191, NA, NA, 18.5, 45.7, 12, NA, 15,
>>>> 39, 56787, 14.7, 9.9, 10.5, 41.1), `2011` = c(87455, 1843, 72461,
>>>> 0.1, 5.7, 7.8, 1207, NA, NA, 19.6, NA, 10.6, NA, NA, NA, NA,
>>>> 13.5, 9.3, 14.3, 35.8), `2012` = c(99459, 2013, 81313, 0.1, 5.9,
>>>> 7.5, 1223, NA, NA, 20.5, NA, 8.6, NA, NA, NA, NA, NA, NA, NA,
>>>> NA)), .Names = c("", "2005", "2009", "2010", "2011", "2012"), row.names = c(NA,
>>>> 20L), class = c("cast_df", "data.frame"))
>>>>
>>>>  hi<-format(f,big.mark=",",scientific=F)
>>>>  regexpr("\\.",  hi) #don't know how to get location of "." in a dataframe of chars
>>>>
>>>>
>>>> nam2<-  structure(list(var1 = c("GDP (LCU,bn)", "GDP ($, bn)",
>>>> "GDP per capita (LCU)",
>>>> "Ratio to EZ GDP Per Cap", "GDP per capita (Intl $)",
>>>> "EU GDP per capita (Intl $)",
>>>> "Share of World GDP (Intl $, %)", "Real GDP Growth (%)", "Population (mn)",
>>>> "Unemployment Rate (%)", "Ratio of Employed/Unemployed", "Employment (1000s)",
>>>> "Unemployment (1000s)", "PPP Exchange Rate",
>>>> "Nominal Exchange Rate (LCU per $)",
>>>> "Inflation (%)", "Main Lending Rate to Private Sector (%)",
>>>> "Claims on Central Gov",
>>>> "Claims on Private Sector", "Bank Assets", "Regulator Capital to RWA",
>>>> "Tier 1 Capital to RWA", "Return on Equity", "Liquid Assets to ST Liabilities",
>>>> "Reserves"), digi = c(0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0,
>>>> 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0)), .Names = c("var1", "digi"
>>>> ), row.names = c("96", "97", "98", "110", "99", "100", "101",
>>>> "102", "103", "111", "112", "104", "105", "106", "107", "108",
>>>> "109", "114", "115", "113", "119", "120", "121", "122", "116"
>>>> ), class = "data.frame")