[R] What exactly is a dgCMatrix-class? There are so many attributes.

Sat Oct 21 18:05:38 CEST 2017

> On Oct 21, 2017, at 7:50 AM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
> 
>>>>>> C W <tmrsg11 at gmail.com>
>>>>>>    on Fri, 20 Oct 2017 15:51:16 -0400 writes:
> 
>> Thank you for your responses.  I guess I don't feel
>> alone. I don't find that the documentation goes into any detail.
> 
>> I also find it surprising that,
> 
>>> object.size(train$data)
>> 1730904 bytes
> 
>>> object.size(as.matrix(train$data))
>> 6575016 bytes
> 
>> the dgCMatrix actually takes less memory, though it
>> *looks* like the opposite.
> 
> to whom?
> 
> The whole idea of these sparse matrix classes in the 'Matrix'
> package (and everywhere else in applied math, CS, ...) is that
> 1. they need  much less memory   and
> 2. matrix arithmetic with them can be much faster because it is based on
>   sophisticated sparse matrix linear algebra, notably the
>   sparse Cholesky decomposition for solve() etc.
> 
> Of course the efficiency only applies if most of the
> matrix entries _are_ 0.
> You can measure the  "sparsity" or rather the  "density", of a
> matrix by
> 
>  nnzero(A) / length(A)
> 
> where length(A) == nrow(A) * ncol(A)  as for regular matrices
> (but it does *not* suffer integer overflow)
> and nnzero(.) is a simple utility from Matrix
> which -- very efficiently for sparseMatrix objects -- gives the
> number of nonzero entries of the matrix.
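
A minimal sketch of that density calculation, using a made-up 4 x 5 matrix (the name A and its entries are illustrative assumptions, not taken from the agaricus data):

```r
library(Matrix)

# Hypothetical 4 x 5 sparse matrix with three nonzero entries.
A <- sparseMatrix(i = c(1, 3, 4), j = c(1, 2, 5),
                  x = c(2, 5, 7), dims = c(4, 5))

nnzero(A)                          # number of nonzero entries: 3
length(A)                          # nrow(A) * ncol(A): 20
density <- nnzero(A) / length(A)   # 3 / 20 = 0.15
```

A density close to 0 is the case where the sparse classes are expected to pay off; near 1 the sparse storage brings no saving.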
> 
> All of these classes are formally defined classes and therefore
> have help pages.  Here  ?dgCMatrix-class  which then points
> to  ?CsparseMatrix-class  (and I forget if RStudio really helps
> you find these ..; in Emacs ESS they are found nicely via the usual key)
> 
> To get started, you may further look at  ?Matrix _and_  ?sparseMatrix
> (and possibly the Matrix package vignettes --- though they need
> work -- I'm happy for collaborators there !)
> 
> Bill Dunlap's comment applies indeed:
> In principle all these matrices should work like regular numeric
> matrices, just faster with less memory footprint if they are
> really sparse (and not just formally of a sparseMatrix class)
>  ((and there are quite a few more niceties in the package))
> 
> Martin Maechler
> (here, maintainer of 'Matrix')
> 
> 
>> On Fri, Oct 20, 2017 at 3:22 PM, David Winsemius <dwinsemius at comcast.net>
>> wrote:
> 
>>>> On Oct 20, 2017, at 11:11 AM, C W <tmrsg11 at gmail.com> wrote:
>>>> 
>>>> Dear R list,
>>>> 
>>>> I came across dgCMatrix. I believe this class is associated with sparse
>>>> matrix.
>>> 
>>> Yes. See:
>>> 
>>> help('dgCMatrix-class', package = 'Matrix')
>>> 
>>> If Martin Maechler happens to respond to this you should listen to him
>>> rather than anything I write. Much of what the Matrix package does appears
>>> to be magical to one such as I.
>>> 
>>>> 
>>>> I see there are 8 attributes to train$data. I am confused why there are so
>>>> many; some are vectors. What do they do?
>>>> 
>>>> Here's the R code:
>>>> 
>>>> library(xgboost)
>>>> data(agaricus.train, package='xgboost')
>>>> data(agaricus.test, package='xgboost')
>>>> train <- agaricus.train
>>>> test <- agaricus.test
>>>> attributes(train$data)
>>>> 
>>> 
>>> I got a bit of an annoying surprise when I did something similar. It
>>> appeared to me that I did not need to load the xgboost library since all
>>> that was being asked was "where is the data" in an object that should be
>>> loaded from that library using the `data` function. The last command asking
>>> for the attributes filled up my console with a 100K length vector (actually
>>> 2 of such vectors). The `str` function returns a more useful result.
>>> 
>>>> data(agaricus.train, package='xgboost')
>>>> train <- agaricus.train
>>>> names( attributes(train$data) )
>>> [1] "i"        "p"        "Dim"      "Dimnames" "x"        "factors"  "class"
>>>> str(train$data)
>>> Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
>>> ..@ i       : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
>>> ..@ p       : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991 ...
>>> ..@ Dim     : int [1:2] 6513 126
>>> ..@ Dimnames:List of 2
>>> .. ..$ : NULL
>>> .. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=flat" ...
>>> ..@ x       : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
>>> ..@ factors : list()
>>> 
>>>> Where is the data, is it in $p, $i, or $x?
>>> 
>>> So the "data" (meaning the values of the sparse matrix) are in the @x
>>> leaf. The values all appear to be the number 1. The @i leaf is the sequence
>>> of row locations for the value entries while the @p items are somehow
>>> connected with the columns (I think, since 127 and 126=number of columns
>>> from the @Dim leaf are only off by 1).
> 
> You are right David.
> 
> well, they follow sparse matrix standards which (like C) start
> counting at 0.
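
To make the 0-based layout concrete, here is a sketch on a hypothetical 3 x 3 matrix M (the values are invented for illustration). It assumes the usual compressed-column convention: in R's 1-based indexing, the entries of column j sit at positions p[j] + 1 through p[j + 1] of @i and @x.

```r
library(Matrix)

# Hypothetical 3 x 3 matrix: column 1 has two entries, column 2 none, column 3 two.
M <- sparseMatrix(i = c(1, 3, 2, 3), j = c(1, 1, 3, 3),
                  x = c(10, 20, 30, 40), dims = c(3, 3))

M@i   # 0-based row index of each stored value: 0 2 1 2
M@p   # 0-based column pointers, length ncol + 1: 0 2 2 4
M@x   # the stored values, column by column: 10 20 30 40

# Recover the stored entries of column 3 by hand from the three slots.
j <- 3
idx  <- seq_len(M@p[j + 1] - M@p[j]) + M@p[j]   # positions in @i and @x: 3 4
rows <- M@i[idx] + 1L                           # back to 1-based rows: 2 3
vals <- M@x[idx]                                # 30 40
```

This is also why @p has one more element than there are columns: consecutive differences of @p give the number of stored entries in each column.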
> 
>>> 
>>> Doing this > colSums(as.matrix(train$data))
> 
> The above colSums() again is "very" inefficient:
> All such R functions  have smartly defined  Matrix methods that
> directly work on sparse matrices.

I did get an error with colSums(train$data)

> colSums(train$data)
Error in colSums(train$data) : 
  'x' must be an array of at least two dimensions

Which, as it turned out, was due to my not yet having loaded pkg:Matrix. Perhaps the xgboost package imports only certain functions from pkg:Matrix and colSums is not one of them. This resembles the errors I get when I try to use grid package functions on ggplot2 objects. Since ggplot2 is built on top of grid, I am always surprised when this happens, and after a headslap and explicitly loading pkg:grid I continue on my stumbling way.

library(Matrix)
colSums(train$data)   # no error
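
If attaching Matrix is not wanted, one alternative is to call the sparse method through the package namespace. This is only a sketch; the small matrix M below is a made-up stand-in for train$data.

```r
# Hypothetical 3 x 2 sparse matrix standing in for train$data.
M <- Matrix::sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 2),
                          x = c(1, 1, 1), dims = c(3, 2))

# Matrix exports its own colSums generic, so the namespace-qualified call
# reaches the sparse method without library(Matrix).
cs <- Matrix::colSums(M)   # 2 1
```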

> Note that  as.matrix(M)  can "blow up" your R, when the matrix M
> is really large and sparse such that its dense version does not
> even fit in your computer's RAM.

I did know that, so I first calculated whether the dense matrix version of that object would fit in my RAM space and it fit easily so I proceeded. 
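
A rough sketch of that kind of pre-check, assuming the dense version would be stored as doubles at 8 bytes each. The helper name dense_bytes is hypothetical, and the estimate ignores the object header and dimnames. For the object in this thread the arithmetic is 6513 * 126 * 8 = 6565104 bytes, close to the 6575016 bytes reported by object.size() above.

```r
library(Matrix)

# Hypothetical helper: approximate bytes needed by the dense double version of M.
# prod() returns a double, so large dimensions do not overflow integer range.
dense_bytes <- function(M) prod(dim(M)) * 8

# Made-up 1000 x 500 matrix with only two stored entries.
M <- sparseMatrix(i = c(1, 500), j = c(1, 200), x = c(1, 1), dims = c(1000, 500))

est <- dense_bytes(M)            # 1000 * 500 * 8 = 4e+06 bytes
sparse_size <- object.size(M)    # actual size of the sparse object, for comparison
```

Comparing est with the available RAM before calling as.matrix() avoids allocating a dense copy that cannot fit.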

I find the TsparseMatrix indexing easier for my more naive notion of sparsity, although thinking about it now, I think I can see that the CsparseMatrix more closely resembles the "folded vector" design of dense R matrices. I will sometimes coerce CMatrix objects to TMatrix objects if I am working on the "inner" indices. I should probably stop doing that.
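
For comparison, a sketch of that coercion on a hypothetical matrix (values invented for illustration). The triplet form stores one (row, column, value) record per entry, again 0-based, in the @i, @j and @x slots.

```r
library(Matrix)

# Hypothetical 3 x 3 column-compressed matrix.
M <- sparseMatrix(i = c(1, 3, 2, 3), j = c(1, 1, 3, 3),
                  x = c(10, 20, 30, 40), dims = c(3, 3))

# Coerce to the triplet representation via the virtual class name.
TM <- as(M, "TsparseMatrix")

# One record per stored entry, shifted to R's 1-based indexing.
triplets <- data.frame(row = TM@i + 1L, col = TM@j + 1L, value = TM@x)
```

The triplet form is easier to read entry by entry, but it makes a full copy of the index data, which is one reason to stay with the column-compressed form for large objects.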

I sincerely hope my stumbling efforts have not caused any delays.

-- 
David.

> 
>>> cap-shape=bell                cap-shape=conical
>>> 369                                3
>>> cap-shape=convex                   cap-shape=flat
>>> 2934                             2539
>>> cap-shape=knobbed                 cap-shape=sunken
>>> 644                               24
>>> cap-surface=fibrous              cap-surface=grooves
>>> 1867                                4
>>> cap-surface=scaly               cap-surface=smooth
>>> 2607                             2035
>>> cap-color=brown                   cap-color=buff
>>> 1816
>>> # now snipping the rest of that output.
>>> 
>>> 
>>> 
>>> Now this makes me think that the @p vector gives you the cumulative sum of
>>> number of items per column:
>>> 
>>>> all( cumsum( colSums(as.matrix(train$data)) ) == train$data@p[-1] )
>>> [1] TRUE
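
One caveat on that check, sketched on a hypothetical matrix: the cumulative column sums match @p[-1] for the agaricus data only because every stored value there is 1. In general, diff(M@p) counts the stored entries per column rather than summing them, and no dense conversion is needed to get it.

```r
library(Matrix)

# Hypothetical matrix whose stored values are not all 1.
M <- sparseMatrix(i = c(1, 3, 2, 3), j = c(1, 1, 3, 3),
                  x = c(10, 20, 30, 40), dims = c(3, 3))

counts <- diff(M@p)        # stored entries per column: 2 0 2
sums   <- colSums(M)       # column sums of the values: 30 0 70
nz     <- colSums(M != 0)  # nonzero count per column: 2 0 2

all(counts == nz)          # TRUE here, since no explicit zeros are stored
```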
>>> 
>>>> 
>>>> Thank you very much!
>>>> 
>>>>      [[alternative HTML version deleted]]
>>> 
>>> Please read the Posting Guide. Your code was not mangled in this instance,
>>> but HTML code often arrives in an unreadable mess.
>>> 
>>>> 
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>> 
>>> David Winsemius
>>> Alameda, CA, USA
>>> 
>>> 'Any technology distinguishable from magic is insufficiently advanced.'
>>> -Gehm's Corollary to Clarke's Third Law

David Winsemius
Alameda, CA, USA

'Any technology distinguishable from magic is insufficiently advanced.'   -Gehm's Corollary to Clarke's Third Law