[R] OT: computing percentage changes with negative and zero values?

Tue Feb 16 18:06:13 CET 2010

Dear all
I need to compute percentage changes of my data, but unfortunately
they contain both negative and zero values, and I am quite unsure
how to proceed. Searching the internet, I found that many people have
run into similar issues, with no obvious solution available.

Over the last couple of weeks I have tried every data transformation
that I could think of. Below I illustrate the issues I ran into, using
a dummy example:
> x$var
 [1]  0.43 -0.79  0.69  0.76  0.00 -1.51 -0.71  0.80  1.17  1.58  1.48 -1.83 -0.88  1.44 -0.72 -0.22  1.89 -1.27 -0.76
[20]  1.33

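- raw data: percentage variations of the original data (containing
negative and zero values) break down in two cases. When the base value
is negative, the sign of the change is inverted (e.g. going from -0.79
to 0.69 is reported as -1.873), and when the base value is 0 the
result is infinite.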
> x[, "raw"] <- c(NA, diff(x$var) / x[1:19,"var"])
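
A minimal sketch of these two failure modes, in base R only. The
vector `v` is simply a copy of x$var, introduced here for illustration:

```r
v <- c(0.43, -0.79, 0.69, 0.76, 0.00, -1.51, -0.71, 0.80, 1.17, 1.58,
       1.48, -1.83, -0.88, 1.44, -0.72, -0.22, 1.89, -1.27, -0.76, 1.33)
n <- length(v)
raw <- c(NA, diff(v) / v[-n])   # same formula as the "raw" column

# Row 3: the value rises from -0.79 to 0.69, but the negative base
# flips the sign of the computed change.
raw[3]

# Row 6: the base (row 5) is exactly 0, so the ratio is infinite.
raw[6]
```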

- raw data with abs denominator: compared to the above, this improves
the handling of the signs, but it still fails around zero, and in some
cases gives unexpected results (see [1]).
> x[, "raw abs"] <- c(NA, diff(x$var) / abs(x[1:19,"var"]))
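
As a rough check (base R only, with `v` again a copy of x$var), the
sketch below confirms that the sign now follows the direction of the
move wherever the base is non-zero, that a zero base still gives an
infinite result, and that a base close to zero inflates the figure:

```r
v <- c(0.43, -0.79, 0.69, 0.76, 0.00, -1.51, -0.71, 0.80, 1.17, 1.58,
       1.48, -1.83, -0.88, 1.44, -0.72, -0.22, 1.89, -1.27, -0.76, 1.33)
n <- length(v)
raw_abs <- c(NA, diff(v) / abs(v[-n]))   # same formula as "raw abs"

# Wherever the base is non-zero, the sign matches the direction of change.
ok <- v[-n] != 0
signs_agree <- all(sign(raw_abs[-1][ok]) == sign(diff(v)[ok]))
signs_agree

# A zero base (row 5) still yields an infinite value in row 6.
raw_abs[6]

# A small base (-0.22 in row 16) turns the move to 1.89 into a huge change.
raw_abs[17]
```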

- raw data + constant: add a constant to the data to make them
strictly positive, then compute the deltas. This solves the negative
and zero value problems, but I am not sure whether it introduces some
bias along the way.
> x[, "raw +cst"] <- c(NA, diff((2 + x$var)) / (2 + x[1:19,"var"]))
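
One way to look at the possible bias, as a sketch only: since
diff(k + v) equals diff(v), the shifted change is diff(v) / (k + v),
so its size depends directly on the constant k. The helper name
`shift_change` and the second constant (3) are illustrative choices,
not part of the computation above:

```r
v <- c(0.43, -0.79, 0.69, 0.76, 0.00, -1.51, -0.71, 0.80, 1.17, 1.58,
       1.48, -1.83, -0.88, 1.44, -0.72, -0.22, 1.89, -1.27, -0.76, 1.33)

# Relative change of the series after shifting it by a constant k.
shift_change <- function(v, k) {
  c(NA, diff(k + v) / (k + v[-length(v)]))
}

k2 <- shift_change(v, 2)   # the "raw +cst" column
k3 <- shift_change(v, 3)   # same data, slightly larger constant

# Row 13 (-1.83 to -0.88): the base 2 - 1.83 = 0.17 is tiny, so the
# change is very large with k = 2 and much smaller with k = 3.
c(k2[13], k3[13])
```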

- log, car::box.cox.powers: both transformations require strictly
positive input, so both involve adding a constant to the original data.
> x[, "log"] <- c(NA, diff(log(2 + x$var)) / log(2 + x[1:19,"var"]))
> require(car)
> x1 <- box.cox.powers(2 + x$var); x1$lambda
> x[, "box cox"] <- c(NA, diff(box.cox(2 + x$var, x1$lambda)) / box.cox(2 + x[1:19,"var"], x1$lambda))
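
A point worth checking, sketched in base R: log(2 + x) is negative
whenever 2 + x < 1, so dividing by it can flip the sign again. For
comparison only, and as an assumption on my part rather than part of
the computation above, the sketch also takes the plain log difference,
which maps back exactly to the "raw +cst" change via exp(d) - 1:

```r
v <- c(0.43, -0.79, 0.69, 0.76, 0.00, -1.51, -0.71, 0.80, 1.17, 1.58,
       1.48, -1.83, -0.88, 1.44, -0.72, -0.22, 1.89, -1.27, -0.76, 1.33)
n <- length(v)
lg <- log(2 + v)

# Same formula as the "log" column: change in log divided by the log.
log_ratio <- c(NA, diff(lg) / lg[-n])

# Row 7: the value rises from -1.51 to -0.71, but log(2 - 1.51) < 0,
# so the reported change is negative.
c(lg[6], log_ratio[7])

# Plain log difference (no division by the log itself).
dlog <- diff(lg)

# exp(dlog) - 1 reproduces the relative change of the shifted series.
all.equal(exp(dlog) - 1, diff(2 + v) / (2 + v[-n]))
```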

- sqrt: very similar to the above, but the results are a bit different
(and apparently better).
> x[, "sqrt"] <- c(NA, diff(sqrt(2 + x$var)) / sqrt(2 + x[1:19,"var"]))

- exp: the exponential transformation introduces too much variability,
and distributes it unevenly (my actual data contain values bigger
than 5), so the variations can quickly reach astronomical levels.
> x[, "exp"] <- c(NA, diff(exp(x$var)) / exp(x[1:19,"var"]))
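
A short sketch of why this column grows so fast: algebraically,
(exp(b) - exp(a)) / exp(a) = exp(b - a) - 1, so the "exp" column
depends only on the absolute difference between consecutive values
and grows exponentially with it. Base R only, with `v` a copy of x$var:

```r
v <- c(0.43, -0.79, 0.69, 0.76, 0.00, -1.51, -0.71, 0.80, 1.17, 1.58,
       1.48, -1.83, -0.88, 1.44, -0.72, -0.22, 1.89, -1.27, -0.76, 1.33)
n <- length(v)

exp_change <- diff(exp(v)) / exp(v[-n])   # the "exp" column, without the NA
identity_form <- expm1(diff(v))           # exp(diff) - 1, computed stably

# The two agree up to floating-point error.
all.equal(exp_change, identity_form)

# An absolute jump of 5 would already give a change of roughly 147.
expm1(5)
```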

- atan transformation: this is a home-grown solution, which ensures
that values from -Inf to +Inf are mapped into the interval from 0 to
pi. Again, I am not sure what bias this might introduce.
> mytan <- function(x) .5*pi + atan(x)
> x[, "mytan"] <- c(NA, diff(mytan(x$var)) / mytan(x[1:19,"var"]))
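
A quick sanity check of the claimed range, as a sketch: the limits of
mytan are 0 and pi, zero maps to pi/2, and the function is increasing,
so the ordering of the data is preserved and the denominator is always
positive for finite input:

```r
mytan <- function(x) .5 * pi + atan(x)

v <- c(0.43, -0.79, 0.69, 0.76, 0.00, -1.51, -0.71, 0.80, 1.17, 1.58,
       1.48, -1.83, -0.88, 1.44, -0.72, -0.22, 1.89, -1.27, -0.76, 1.33)

# Limits and midpoint: 0, pi/2, pi.
limits <- mytan(c(-Inf, 0, Inf))
limits

# Strictly increasing on the (distinct) sample values.
increasing <- all(diff(mytan(sort(v))) > 0)

# All transformed values lie strictly between 0 and pi.
in_range <- all(mytan(v) > 0 & mytan(v) < pi)
c(increasing, in_range)
```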

The resulting data frame:
> round(x, 3)
     var    raw raw abs raw +cst    log   sqrt box cox    exp  mytan
1   0.43     NA      NA       NA     NA     NA      NA     NA     NA
2  -0.79 -2.837  -2.837   -0.502 -0.785 -0.294  -0.840 -0.705 -0.544
3   0.69 -1.873   1.873    1.223  4.191  0.491   6.289  3.393  1.411
4   0.76  0.101   0.101    0.026  0.026  0.013   0.038  0.073  0.021
5   0.00 -1.000  -1.000   -0.275 -0.317 -0.149  -0.407 -0.532 -0.293
6  -1.51   -Inf    -Inf   -0.755 -2.029 -0.505  -1.591 -0.779 -0.628
7  -0.71 -0.530   0.530    1.633 -1.357  0.623  -1.517  1.226  0.630
8   0.80 -2.127   2.127    1.171  3.043  0.473   4.631  3.527  1.355
9   1.17  0.462   0.462    0.132  0.121  0.064   0.185  0.448  0.084
10  1.58  0.350   0.350    0.129  0.105  0.063   0.169  0.507  0.059
11  1.48 -0.063  -0.063   -0.028 -0.022 -0.014  -0.035 -0.095 -0.012
12 -1.83 -2.236  -2.236   -0.951 -2.421 -0.779  -1.450 -0.963 -0.804
13 -0.88 -0.519   0.519    5.588 -1.064  1.567  -1.124  1.586  0.698
14  1.44 -2.636   2.636    2.071  9.902  0.753  16.643  9.176  1.985
15 -0.72 -1.500  -1.500   -0.628 -0.800 -0.390  -0.870 -0.885 -0.626
16 -0.22 -0.694   0.694    0.391  1.336  0.179   1.679  0.649  0.430
17  1.89 -9.591   9.591    1.185  1.356  0.478   2.333  7.248  0.960
18 -1.27 -1.672  -1.672   -0.812 -1.232 -0.567  -1.115 -0.958 -0.749
19 -0.76 -0.402   0.402    0.699 -1.684  0.303  -1.841  0.665  0.381
20  1.33 -2.750   2.750    1.685  4.592  0.639   7.559  7.085  1.711
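
All columns above except "raw abs" follow the same pattern: transform,
take the difference, and divide by the previous transformed value. A
small helper makes it easy to try further transformations. The name
`rel_change` is hypothetical, and this is only a sketch in base R:

```r
v <- c(0.43, -0.79, 0.69, 0.76, 0.00, -1.51, -0.71, 0.80, 1.17, 1.58,
       1.48, -1.83, -0.88, 1.44, -0.72, -0.22, 1.89, -1.27, -0.76, 1.33)

# Relative change of f(v): diff(f(v)) divided by the previous f(v).
rel_change <- function(v, f = identity) {
  y <- f(v)
  c(NA, diff(y) / y[-length(y)])
}

res <- data.frame(
  var  = v,
  raw  = rel_change(v),
  cst  = rel_change(v, function(z) 2 + z),
  sqrt = rel_change(v, function(z) sqrt(2 + z)),
  exp  = rel_change(v, exp)
)
round(head(res, 4), 3)
```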

As you can see, I am quite unsure how to proceed. My actual data
represent financial EPS (earnings per share) forecasts, ranging
from -1 to 5, so they have a "natural zero point" (see David Winsemius'
comments in [2]). However, I need to compute percentage variations,
since I am primarily interested in the evolution of the forecasts (for
a given company), while EPS data for two different companies are not
necessarily comparable. The percentage data would subsequently be used
in statistical analyses (regression, etc.).

Please advise
Liviu

[1] http://sci.tech-archive.net/Archive/sci.stat.math/2006-04/msg00544.html
[2] http://sci.tech-archive.net/Archive/sci.stat.math/2006-04/msg00548.html