[Rd] Understanding svd usage and its necessity in generalized inverse calculation

Thu Dec 6 00:58:42 CET 2012

Dear R-devel:

I could use some advice about matrix calculations and steps that might
make for faster computation of generalized inverses. It appears in
some projects there is a bottleneck at the use of svd in calculation
of generalized inverses.

Here's some Rprof output I need to understand.

>   summaryRprof("Amelia.out")
$by.self
                             self.time self.pct total.time total.pct
"La.svd"                        150.34    27.66     164.82     30.32
"emfred"                         40.90     7.52     535.62     98.53
"<Anonymous>"                    32.92     6.06     122.16     22.47
"amsweep"                        26.70     4.91     475.94     87.55
"%*%"                            26.06     4.79      26.06      4.79
"as.matrix"                      23.32     4.29      37.34      6.87
"structure"                      22.30     4.10      30.68      5.64
"t"                              18.64     3.43      25.34      4.66
"matrix"                         14.66     2.70      15.02      2.76
"deparse"                        14.52     2.67      33.58      6.18
"strsplit"                        9.90     1.82       9.90      1.82
"match"                           9.80     1.80      22.94      4.22
"conditionMessage"                8.92     1.64      14.50      2.67
"svd"                             8.70     1.60     185.88     34.19
"mpinv"                           8.40     1.55     221.56     40.76
"c"                               8.16     1.50       8.16      1.50
".deparseOpts"                    7.92     1.46      12.42      2.28
"$"                               6.94     1.28       6.94      1.28
"t.default"                       6.70     1.23       6.70      1.23
"diag"                            5.66     1.04       8.96      1.65
"cbind"                           5.48     1.01       5.48      1.01
"rbind"                           5.36     0.99      10.84      1.99
"am.inv"                          4.88     0.90      10.76      1.98
".Call"                           4.70     0.86       4.70      0.86
[snip]

$by.total
                             total.time total.pct self.time self.pct
"amelia.default"                 543.58     99.99      0.04     0.01
"amelia"                         543.58     99.99      0.00     0.00
"emarch"                         536.94     98.77      0.18     0.03
"emfred"                         535.62     98.53     40.90     7.52
"amsweep"                        475.94     87.55     26.70     4.91
"mpinv"                          221.56     40.76      8.40     1.55
"svd"                            185.88     34.19      8.70     1.60
"La.svd"                         164.82     30.32    150.34    27.66
"try"                            161.52     29.71      0.50     0.09
"tryCatch"                       161.04     29.62      3.38     0.62
"tryCatchList"                   157.06     28.89      1.06     0.19
"tryCatchOne"                    156.00     28.70      3.20     0.59
"<Anonymous>"                    122.16     22.47     32.92     6.06
"as.matrix"                       37.34      6.87     23.32     4.29
"deparse"                         33.58      6.18     14.52     2.67
"structure"                       30.68      5.64     22.30     4.10
"%*%"                             26.06      4.79     26.06     4.79
"t"                               25.34      4.66     18.64     3.43
"match"                           22.94      4.22      9.80     1.80
"%in%"                            21.74      4.00      2.44     0.45
"simpleError"                     16.72      3.08      0.62     0.11
"matrix"                          15.02      2.76     14.66     2.70
"conditionMessage"                14.50      2.67      8.92     1.64
"mode"                            14.44      2.66      1.76     0.32
"doTryCatch"                      13.94      2.56      2.72     0.50
".deparseOpts"                    12.42      2.28      7.92     1.46

I *Think* this means that a bottlleneck here is svd, which is being
called by this function that calculates generalized inverses:

## Moore-Penrose Inverse function (aka Generalized Inverse)
##   X:    symmetric matrix
##   tol:  convergence requirement
mpinv <- function(X, tol = sqrt(.Machine$double.eps)) {
  s <- svd(X)
  e <- s$d
  e[e > tol] <- 1/e[e > tol]
  s$v %*% diag(e,nrow=length(e)) %*% t(s$u)
}

That is from the Amerlia package, which we like to use very much.

Basically, I wonder if I should use a customized generalized inverse
or svd calculator to make this faster.

Why bother, you ask?  We have many people who do multiple imputation
for missing data and the psychologists are persuaded that they now
ought to collect 50 or 100 imputations (gulp).  The users want to
include many many variables in these models, and so we see iterations
in the 100s for each imputation.  That makes for some super long
running jobs, and I've been profiling the multiple imputation
algorithms in various packages to see what I can do to make them
faster.

Question: Is the usage of svd really necessary for generalized
inverse? Can I employ some alternative to get faster?  The literature
on this is pretty specialized.  I mean to say, "will one of you math
professors please tell me what is safe or unsafe".

The cholesky decomposition is fastest for the well conditioned matrix,
but not suitable to all matrices.  svd is the gold standard for
accuracy, but it is slowest. In the past, I've chased speedups like
this and sometimes find a happy answer, but just as often I wish I had
asked an expert before wandering off on a fool's errand.

There's plenty of literature about it. See, for example, this article:

Smoktunowicz, A., & Wróbel, I. (2012). Numerical aspects of computing
the Moore-Penrose inverse of full column rank matrices. BIT Numerical
Mathematics, 52(2), 503–524. doi:10.1007/s10543-011-0362-0

Which seems to say that a method proposed by Byers and Xu (2008) is as
about as good as, but faster than, svd.

It appears to me work on this traces back to a speedup obtained by
Courrieu, whose method takes about one-half as much time as the svd.

P. Courrieu. Fast Computation of Moore-Penrose Inverse matrices.
Neural Information
Processing-Letters and Reviews, 8:25–29, 2005.

Katsikis, Casilos N. and Dimitrios Pappas. 2008. Fast computing of the
Moore-Penrose Inverse Matrix. Electronic Journal of Linear Algebra 17:
637-650.
http://celc.cii.fc.ul.pt/iic/ela/ela-articles/articles/vol17_pp337-350.pdf

Toutounian, F., & Ataei, A. (2009). A new method for computing
Moore-Penrose inverse matrices. J. Comput. Appl. Math., 228(1),
412–417. doi:10.1016/j.cam.2008.10.008

I see in the MASS package there is a ginv function, it uses the svd,
just as the mpinv in Amelia does.

Maybe I should ignore this, or leave it up to some BLAS implementation?

pj

-- 
Paul E. Johnson
Professor, Political Science      Assoc. Director
1541 Lilac Lane, Room 504      Center for Research Methods
University of Kansas                 University of Kansas
http://pj.freefaculty.org               http://quant.ku.edu