[R] What exactly is a dgCMatrix-class. There are so many attributes.

Sat Oct 21 16:50:47 CEST 2017

>>>>> C W <tmrsg11 at gmail.com>
>>>>>     on Fri, 20 Oct 2017 15:51:16 -0400 writes:

    > Thank you for your responses.  I guess I don't feel
    > alone. I don't find the documentation go into any detail.

    > I also find it surprising that,

    >> object.size(train$data)
    > 1730904 bytes

    >> object.size(as.matrix(train$data))
    > 6575016 bytes

    > the dgCMatrix actually takes less memory, though it
    > *looks* like the opposite.

to whom?

The whole idea of these sparse matrix classes in the 'Matrix'
package (and everywhere else in applied math, CS, ...) is that
1. they need  much less memory   and
2. matrix arithmetic with them can be much faster because it is based on
   sophisticated sparse matrix linear algebra, notably the
   sparse Cholesky decomposition for solve() etc.

Of course the efficiency only applies if most of the
matrix entries _are_ 0.
You can measure the  "sparsity" or rather the  "density", of a
matrix by

  nnzero(A) / length(A)

where length(A) == nrow(A) * ncol(A), as for regular matrices
(but without integer overflow when that product is very large),
and nnzero(.) is a simple utility from Matrix
which -- very efficiently for sparseMatrix objects -- gives the
number of nonzero entries of the matrix.
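
As a small illustration of the density measure, here is a sketch on a
hypothetical 4 x 5 matrix with three nonzero entries (the matrix is made
up for the example):

```r
library(Matrix)

# Hypothetical 4 x 5 example with 3 nonzero entries
A <- sparseMatrix(i = c(1, 3, 4), j = c(1, 2, 5), x = c(2, 5, 7),
                  dims = c(4, 5))

nnz     <- nnzero(A)        # number of nonzero entries: 3
len     <- length(A)        # nrow(A) * ncol(A): 20
density <- nnz / len        # 3 / 20 = 0.15
density
```

For the train$data object quoted further down, the str() output gives
143286 stored entries in a 6513 x 126 matrix, so the same ratio would be
about 0.17.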

All of these classes are formally defined classes and therefore
have help pages.  Here, see  ?dgCMatrix-class  which then points
to  ?CsparseMatrix-class  (and I forget if RStudio really helps
you find these ..; in Emacs ESS they are found nicely via the usual key)

To get started, you may further look at  ?Matrix _and_  ?sparseMatrix
(and possibly the Matrix package vignettes --- though they need
 work -- I'm happy for collaborators there !)
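
A short sketch of how one might reach these help pages and inspect the
class definition from the R prompt. The help calls are wrapped in
interactive() so the snippet also runs in a script; the inspection calls
are standard 'methods' utilities:

```r
library(Matrix)

if (interactive()) {
  # Help pages of formal (S4) classes live under the topic "<name>-class"
  help("dgCMatrix-class", package = "Matrix")
  class?dgCMatrix                    # shortcut for the same topic
  help("sparseMatrix", package = "Matrix")
}

# Inspecting the class without opening a help page
slots  <- slotNames("dgCMatrix")            # names of the slots
is_csp <- extends("dgCMatrix", "CsparseMatrix")
is_sp  <- extends("dgCMatrix", "sparseMatrix")
slots
```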

Bill Dunlap's comment applies indeed:
In principle all these matrices should work like regular numeric
matrices, just faster and with a smaller memory footprint if they are
really sparse (and not just formally of a sparseMatrix class)
  ((and there are quite a few more niceties in the package))
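
To illustrate "work like regular numeric matrices", a minimal sketch on a
hypothetical 3 x 3 matrix, comparing a few operations against the dense
equivalent:

```r
library(Matrix)

# Hypothetical 3 x 3 sparse matrix and its dense counterpart
A <- sparseMatrix(i = c(1, 2, 3), j = c(1, 3, 2), x = c(4, 5, 6),
                  dims = c(3, 3))
D <- as.matrix(A)

v <- c(1, 2, 3)
Av   <- A %*% v        # matrix-vector product, result is a Matrix object
row2 <- A[2, ]         # indexing as usual; gives a plain numeric vector
AtA  <- crossprod(A)   # t(A) %*% A, computed with sparse methods
B    <- 2 * A + A      # arithmetic keeps the result sparse

all(as.vector(Av) == as.vector(D %*% v))
all(as.matrix(AtA) == crossprod(D))
```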

Martin Maechler
(here, maintainer of 'Matrix')

    > On Fri, Oct 20, 2017 at 3:22 PM, David Winsemius <dwinsemius at comcast.net>
    > wrote:

    >> > On Oct 20, 2017, at 11:11 AM, C W <tmrsg11 at gmail.com> wrote:
    >> >
    >> > Dear R list,
    >> >
    >> > I came across dgCMatrix. I believe this class is associated with sparse
    >> > matrix.
    >> 
    >> Yes. See:
    >> 
    >> help('dgCMatrix-class', package = 'Matrix')
    >> 
    >> If Martin Maechler happens to respond to this you should listen to him
    >> rather than anything I write. Much of what the Matrix package does appears
    >> to be magical to one such as I.
    >> 
    >> >
    >> > I see there are 8 attributes to train$data, I am confused why are there so
    >> > many, some are vectors, what do they do?
    >> >
    >> > Here's the R code:
    >> >
    >> > library(xgboost)
    >> > data(agaricus.train, package='xgboost')
    >> > data(agaricus.test, package='xgboost')
    >> > train <- agaricus.train
    >> > test <- agaricus.test
    >> > attributes(train$data)
    >> >
    >> 
    >> I got a bit of an annoying surprise when I did something similar. It
    >> appeared to me that I did not need to load the xgboost library since all
    >> that was being asked was "where is the data" in an object that should be
    >> loaded from that library using the `data` function. The last command asking
    >> for the attributes filled up my console with a 100K length vector (actually
    >> 2 of such vectors). The `str` function returns a more useful result.
    >> 
    >> > data(agaricus.train, package='xgboost')
    >> > train <- agaricus.train
    >> > names( attributes(train$data) )
    >> [1] "i"        "p"        "Dim"      "Dimnames" "x"        "factors"  "class"
    >> > str(train$data)
    >> Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
    >> ..@ i       : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
    >> ..@ p       : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991 ...
    >> ..@ Dim     : int [1:2] 6513 126
    >> ..@ Dimnames:List of 2
    >> .. ..$ : NULL
    >> .. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=flat" ...
    >> ..@ x       : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
    >> ..@ factors : list()
    >> 
    >> > Where is the data, is it in $p, $i, or $x?
    >> 
    >> So the "data" (meaning the values of the sparse matrix) are in the @x
    >> leaf. The values all appear to be the number 1. The @i leaf is the sequence
    >> of row locations for the value entries while the @p items are somehow
    >> connected with the columns (I think, since 127 and 126=number of columns
    >> from the @Dim leaf are only off by 1).

You are right, David.

Well, both @i and @p follow sparse matrix standards which (like C) start
counting at 0.
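
A small sketch of how the three slots fit together, using a hypothetical
3 x 4 matrix (made up for illustration). It shows the 0-based @i and @p
and one way to rebuild ordinary 1-based (row, column, value) triplets
from them:

```r
library(Matrix)

# Hypothetical 3 x 4 example:
#      [,1] [,2] [,3] [,4]
# [1,]   10    .    .   40
# [2,]    .    .   30    .
# [3,]   20    .    .   50
M <- sparseMatrix(i = c(1, 3, 2, 1, 3),
                  j = c(1, 1, 3, 4, 4),
                  x = c(10, 20, 30, 40, 50),
                  dims = c(3, 4))

M@x         # stored values, column by column: 10 20 30 40 50
M@i         # 0-based row index of each value:  0  2  1  0  2
M@p         # 0-based column offsets, length ncol(M) + 1: 0 2 2 3 5
diff(M@p)   # number of stored entries per column: 2 0 1 2

# Rebuild 1-based (row, col, value) triplets from the slots
trip <- data.frame(row = M@i + 1L,
                   col = rep(seq_len(ncol(M)), diff(M@p)),
                   x   = M@x)
trip
```

So @p always has one more element than there are columns, starts at 0,
and ends at the total number of stored entries, which is why 127 and 126
differ by one in the str() output quoted above.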

    >> 
    >> Doing this > colSums(as.matrix(train$data))

The above colSums(as.matrix(.)) is again "very" inefficient:
all such R functions have smartly defined Matrix methods that
work directly on sparse matrices, so colSums(train$data) suffices.

Note that  as.matrix(M)  can "blow up" your R, when the matrix M
is really large and sparse such that its dense version does not
even fit in your computer's RAM.
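
A minimal sketch of the difference, assuming a randomly generated
stand-in for train$data (hypothetical; any sparse matrix would do):

```r
library(Matrix)

set.seed(42)
# Hypothetical stand-in for train$data: 1000 x 50, about 5% nonzero
M <- rsparsematrix(1000, 50, density = 0.05)

cs_sparse <- colSums(M)              # Matrix method, no dense copy is made
cs_dense  <- colSums(as.matrix(M))   # densifies first; avoid for large M

all.equal(as.vector(cs_sparse), as.vector(cs_dense))
```

Both give the same column sums; only the second one has to allocate the
full nrow(M) * ncol(M) dense matrix on the way.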

    >> cap-shape=bell                cap-shape=conical
    >> 369                                3
    >> cap-shape=convex                   cap-shape=flat
    >> 2934                             2539
    >> cap-shape=knobbed                 cap-shape=sunken
    >> 644                               24
    >> cap-surface=fibrous              cap-surface=grooves
    >> 1867                                4
    >> cap-surface=scaly               cap-surface=smooth
    >> 2607                             2035
    >> cap-color=brown                   cap-color=buff
    >> 1816
    >> # now snipping the rest of that output.
    >> 
    >> 
    >> 
    >> Now this makes me think that the @p vector gives you the cumulative sum of
    >> number of items per column:
    >> 
    >> > all( cumsum( colSums(as.matrix(train$data)) ) == train$data@p[-1] )
    >> [1] TRUE
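
A brief sketch of why that check comes out TRUE here: @p[-1] holds the
cumulative *count* of stored entries per column, which coincides with the
cumulative column *sums* only when every stored value is 1, as in
train$data. The matrix below is a hypothetical counterexample with values
other than 1:

```r
library(Matrix)

# Hypothetical 2 x 3 matrix whose stored values are not all 1
M <- sparseMatrix(i = c(1, 2, 2), j = c(1, 1, 3), x = c(5, 7, 9),
                  dims = c(2, 3))

sums_match   <- all(cumsum(colSums(M))      == M@p[-1])  # FALSE: sums, not counts
counts_match <- all(cumsum(colSums(M != 0)) == M@p[-1])  # TRUE: counts of nonzeros
sums_match
counts_match
```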
    >> 
    >> >
    >> > Thank you very much!
    >> >
    >> >       [[alternative HTML version deleted]]
    >> 
    >> Please read the Posting Guide. Your code was not mangled in this instance,
    >> but HTML code often arrives in an unreadable mess.
    >> 
    >> >
    >> > ______________________________________________
    >> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
    >> > https://stat.ethz.ch/mailman/listinfo/r-help
    >> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
    >> > and provide commented, minimal, self-contained, reproducible code.
    >> 
    >> David Winsemius
    >> Alameda, CA, USA
    >> 
    >> 'Any technology distinguishable from magic is insufficiently advanced.'
    >> -Gehm's Corollary to Clarke's Third Law
