[R] What exactly is a dgCMatrix class. There are so many attributes.
Martin Maechler
maechler at stat.math.ethz.ch
Sat Oct 21 19:13:45 CEST 2017
>>>>> David Winsemius <dwinsemius at comcast.net>
>>>>> on Sat, 21 Oct 2017 09:05:38 -0700 writes:
>> On Oct 21, 2017, at 7:50 AM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
>>
>>>>>>> C W <tmrsg11 at gmail.com>
>>>>>>> on Fri, 20 Oct 2017 15:51:16 -0400 writes:
>>
>>> Thank you for your responses. I guess I don't feel
>>> alone. I don't find that the documentation goes into much detail.
>>
>>> I also find it surprising that,
>>
>>>> object.size(train$data)
>>> 1730904 bytes
>>
>>>> object.size(as.matrix(train$data))
>>> 6575016 bytes
>>
>>> the dgCMatrix actually takes less memory, though it
>>> *looks* like the opposite.
>>
>> to whom?
>>
>> The whole idea of these sparse matrix classes in the 'Matrix'
>> package (and everywhere else in applied math, CS, ...) is that
>> 1. they need much less memory and
>> 2. matrix arithmetic with them can be much faster because it is based on
>> sophisticated sparse matrix linear algebra, notably the
>> sparse Cholesky decomposition for solve() etc.
>>
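
As a small illustration of point 2 (a sketch, not from the original message; the tridiagonal test matrix is just an illustrative choice):

```r
library(Matrix)

n <- 1000
## sparse symmetric positive definite tridiagonal matrix (a dsCMatrix)
A <- bandSparse(n, k = c(0, 1),
                diagonals = list(rep(4, n), rep(-1, n - 1)),
                symmetric = TRUE)
b <- rep(1, n)
x <- solve(A, b)   # dispatches to a sparse Cholesky solver, not dense LU
max(abs(as.numeric(A %*% x) - b))   # residual, essentially 0
```
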
>> Of course the efficiency only applies if most of the
>> matrix entries _are_ 0.
>> You can measure the "sparsity" or rather the "density", of a
>> matrix by
>>
>> nnzero(A) / length(A)
>>
>> where length(A) == nrow(A) * ncol(A) as for regular matrices
>> (but it does *not* suffer from integer overflow)
>> and nnzero(.) is a simple utility from Matrix
>> which, very efficiently for sparseMatrix objects, gives the
>> number of nonzero entries of the matrix.
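
For instance (a small made-up matrix, not from the thread):

```r
library(Matrix)
A <- sparseMatrix(i = c(1, 3, 5), j = c(2, 2, 4), x = c(10, 20, 30),
                  dims = c(5, 4))
nnzero(A)               # 3 nonzero entries
length(A)               # 20  (= 5 * 4)
nnzero(A) / length(A)   # density = 0.15
```
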
>>
>> All of these classes are formally defined classes and have
>> therefore help pages. Here ?dgCMatrix-class, which then points
>> to ?CsparseMatrix-class (and I forget if RStudio really helps
>> you find these ..; in emacs ESS they are found nicely via the usual key)
>>
>> To get started, you may further look at ?Matrix _and_ ?sparseMatrix
>> (and possibly the Matrix package vignettes -- though they need
>> work -- I'm happy for collaborators there !)
>>
>> Bill Dunlap's comment applies indeed:
>> In principle all these matrices should work like regular numeric
>> matrices, just faster and with a smaller memory footprint if they are
>> really sparse (and not just formally of a sparseMatrix class)
>> ((and there are quite a few more niceties in the package))
>>
>> Martin Maechler
>> (here, maintainer of 'Matrix')
>>
>>
>>> On Fri, Oct 20, 2017 at 3:22 PM, David Winsemius <dwinsemius at comcast.net>
>>> wrote:
>>
>>>>> On Oct 20, 2017, at 11:11 AM, C W <tmrsg11 at gmail.com> wrote:
>>>>>
>>>>> Dear R list,
>>>>>
>>>>> I came across dgCMatrix. I believe this class is associated with sparse
>>>>> matrix.
>>>>
>>>> Yes. See:
>>>>
>>>> help('dgCMatrix-class', pack=Matrix)
>>>>
>>>> If Martin Maechler happens to respond to this you should listen to him
>>>> rather than anything I write. Much of what the Matrix package does appears
>>>> to be magical to one such as I.
>>>>
[............]
>>>>> data(agaricus.train, package='xgboost')
>>>>> train <- agaricus.train
>>>>> names( attributes(train$data) )
>>>> [1] "i" "p" "Dim" "Dimnames" "x" "factors"
>>>> "class"
>>>>> str(train$data)
>>>> Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
>>>> ..@ i : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
>>>> ..@ p : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991
>>>> ...
>>>> ..@ Dim : int [1:2] 6513 126
>>>> ..@ Dimnames:List of 2
>>>> .. ..$ : NULL
>>>> .. ..$ : chr [1:126] "capshape=bell" "capshape=conical"
>>>> "capshape=convex" "capshape=flat" ...
>>>> ..@ x : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
>>>> ..@ factors : list()
>>>>
>>>>> Where is the data, is it in $p, $i, or $x?
>>>>
>>>> So the "data" (meaning the values of the sparse matrix) are in the @x
>>>> leaf. The values all appear to be the number 1. The @i leaf is the sequence
>>>> of row locations for the values entries while the @p items are somehow
>>>> connected with the columns (I think, since 127 and 126=number of columns
>>>> from the @Dim leaf are only off by 1).
>>
>> You are right David.
>>
>> well, they follow sparse matrix standards which (like C) start
>> counting at 0.
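
Concretely (a tiny made-up example, not the xgboost data): @p has ncol+1 entries of 0-based pointers, and the nonzeros of column j sit at positions (p[j]+1):p[j+1] of @x, with their 0-based row numbers in @i:

```r
library(Matrix)
A <- Matrix(c(0, 2, 0,
              0, 0, 3,
              1, 0, 0), nrow = 3, byrow = TRUE, sparse = TRUE)
A@p   # 0 1 2 3 : 0-based column pointers, length ncol + 1
A@i   # 2 0 1   : 0-based row indices, stored column by column
A@x   # 1 2 3   : the nonzero values themselves
## reconstruct column j "by hand":
j <- 1
idx <- (A@p[j] + 1):A@p[j + 1]
cbind(row = A@i[idx] + 1, value = A@x[idx])   # row 3, value 1
```
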
>>
>>>>
>>>> Doing this: > colSums(as.matrix(train$data))
>>
>> The above colSums() again is "very" inefficient:
>> All such R functions have smartly defined Matrix methods that
>> directly work on sparse matrices.
> I did get an error with colSums(train$data)
>> colSums(train$data)
> Error in colSums(train$data) :
> 'x' must be an array of at least two dimensions
The same problem C.W. saw with head().
It all works after, e.g., calling str() on train$data.
But I am still puzzled, because head() is similar to str():
both are S3 generics (in "utils"), but str()'s UseMethod() I
think sees that the class belongs to package "Matrix" and hence
attaches it {not just *loads* it -- hence, importing etc. does not matter},
but head() does not.
Even more curiously, colSums() *also* attaches Matrix but
still fails -- yet it works on a 2nd call.
Example 1, in a fresh R session:

> data(agaricus.train, package="xgboost")
> M <- agaricus.train$data
> methods(str)
[1] str.data.frame* str.Date* str.default* str.dendrogram* str.logLik* str.POSIXt*
# see '?methods' for accessing help and source code
> str(M)
Loading required package: Matrix <<<<<<<<< SEE ! <<<<<<<<<<<<<<<<<
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
..@ i : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
..@ p : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991 ...
..@ Dim : int [1:2] 6513 126
..@ Dimnames:List of 2
.. ..$ : NULL
.. ..$ : chr [1:126] "capshape=bell" "capshape=conical" "capshape=convex" "capshape=flat" ...
..@ x : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
..@ factors : list()
>
> head(M)
6 x 126 sparse Matrix of class "dgCMatrix"
[[ suppressing 126 column names ‘capshape=bell’, ‘capshape=conical’, ‘capshape=convex’ ... ]]
[1,] . . 1 . . . . . . 1 1 . . . . . . . . . 1 . . . . . . . . 1 . . . 1 . 1 . . . 1 1 . . . . . . . . . . . 1 . .
................
................
................


See, str() is a nice generic function ==> it attaches Matrix (see
the message where I have added '<<<<<<<<< SEE ! <<<<.........'),
but, strangely, head() does not, as we know.
Now, the curious colSums() behavior:
Example 2, in a fresh R session:

> data(agaricus.train, package='xgboost')
> M <- agaricus.train$data
> cm <- colSums(M) ## first time, loads Matrix but then fails !!
Loading required package: Matrix
Error in colSums(M) : 'x' must be an array of at least two dimensions
> cm <- colSums(M) ## 2nd time, works because the Matrix methods are all there
> str(cm)
Named num [1:126] 369 3 2934 2539 644 ...
 attr(*, "names")= chr [1:126] "capshape=bell" "capshape=conical" "capshape=convex" "capshape=flat" ...
>

> Which, as it turned out, was due to my having not yet loaded pkg:Matrix. Perhaps the xgboost package only imports certain functions from pkg:Matrix and colSums is not one of them. This resembles the errors I get when I try to use grid package functions on ggplot2 objects. Since ggplot2 is built on top of grid I am always surprised when this happens, and after a headslap and explicitly loading pkg:grid I continue on my stumbling way.
> library(Matrix)
> colSums(train$data) # no error
>> Note that as.matrix(M) can "blow up" your R, when the matrix M
>> is really large and sparse such that its dense version does not
>> even fit in your computer's RAM.
> I did know that, so I first calculated whether the dense matrix version of that object would fit in my RAM space and it fit easily so I proceeded.
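
That back-of-the-envelope calculation is simple enough to script (a sketch; the dense_bytes helper and the 1e4 x 1e4 example are made up, not from the thread):

```r
library(Matrix)
## a dense double matrix needs 8 bytes per cell; sparse storage only
## pays per nonzero entry (plus the index vectors)
dense_bytes <- function(M) prod(dim(M)) * 8
M <- rsparsematrix(1e4, 1e4, density = 0.001)
dense_bytes(M) / 2^20     # ~763 MiB if converted with as.matrix() ...
object.size(M)            # ... versus roughly a megabyte stored sparsely
```
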
> I find the TsparseMatrix indexing easier for my more naive notion of sparsity, although thinking about it now, I think I can see that the CsparseMatrix more closely resembles the "folded vector" design of dense R matrices. I will sometimes coerce CsparseMatrix objects to TsparseMatrix objects if I am working on the "inner" indices. I should probably stop doing that.
Well, it depends on whether speed and efficiency are the only important
issues.
The triplet representation (<==> TsparseMatrix) is of course
much easier to understand and explain than the column-compressed
one (CsparseMatrix) -- but the latter is the one that is
efficiently used in the C-level libraries for matrix
multiplication, Cholesky, etc.
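
The two layouts are easy to compare side by side (a small sketch, not from the thread):

```r
library(Matrix)
Cm <- Matrix(c(1, 0,    # column-major input: col 1 = (1,0),
               0, 0,    #                     col 2 = (0,0),
               0, 2),   #                     col 3 = (0,2)
             nrow = 2, sparse = TRUE)   # dgCMatrix
Tm <- as(Cm, "TsparseMatrix")           # dgTMatrix
## triplet form: one explicit (i, j, x) entry per nonzero, 0-based
cbind(i = Tm@i, j = Tm@j, x = Tm@x)
## column-compressed form: @j is replaced by ncol + 1 column pointers;
## the repeated "1" marks the empty column 2
Cm@p   # 0 1 1 2
```
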
> I sincerely hope my stumbling efforts have not caused any delays.
Not at all; thank you, David, for all your help on R-help !!!
Martin
> 
> David.
[..................]
> David Winsemius
> Alameda, CA, USA
> 'Any technology distinguishable from magic is insufficiently advanced.' Gehm's Corollary to Clarke's Third Law
ok.... given your other statement, it may be that Matrix *is*
sufficiently advanced ;) :)