[R] What exactly is a dgCMatrix-class? There are so many attributes.

Martin Maechler maechler at stat.math.ethz.ch
Sat Oct 21 18:27:37 CEST 2017


>>>>> C W <tmrsg11 at gmail.com>
>>>>>     on Fri, 20 Oct 2017 16:01:06 -0400 writes:

    > Subsetting using [] vs. head(), gives different results.
    > R code:

    >> head(train$data, 5)
    > [1] 0 0 1 0 0

The above is surprising ... and points to a bug somewhere.
It is different (and correct) after you do

   require(Matrix)

but I think something like that should happen
semi-automatically.

As I have just noticed, it is even worse if you get the data from
xgboost without loading the xgboost package, which you can do (and
which is also more efficient!):

If you start R, and then do

   data(agaricus.train, package='xgboost')

   loadedNamespaces() # contains neither "xgboost" nor "Matrix"

so it is no wonder that

   head(agaricus.train$data)

does not find head()'s "Matrix" method [which _is_ exported by Matrix
via exportMethods(.)].
But even more curiously, even after I do

    loadNamespace("Matrix")

methods(head) does show the "Matrix" method,
but head() *still* does not call it.  There's a bug
somewhere, and I suspect it is in R's data() or in the methods
package, or somewhere else, rather than in 'Matrix'.
But that will be another thread on R-devel or R's bugzilla.

Martin
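
For anyone who wants to reproduce this, a minimal sketch of the
sequence described above, run in a fresh R session (the exact
behaviour may of course vary with the R and Matrix versions in use):

    ## fresh session: neither xgboost nor Matrix is attached
    data(agaricus.train, package = 'xgboost')
    loadedNamespaces()            # reportedly contains neither "xgboost" nor "Matrix"

    head(agaricus.train$data, 5)  # plain head(); the "Matrix" method is not found

    loadNamespace("Matrix")       # load the namespace without attaching it
    methods(head)                 # now lists the "Matrix" method ...
    head(agaricus.train$data, 5)  # ... which is reportedly still not called

    require(Matrix)               # attach the package instead
    head(agaricus.train$data, 5)  # now the first rows come back correctly,
                                  # as a sparse matrix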


    >> train$data[1:5, 1:5]
    > 5 x 5 sparse Matrix of class "dgCMatrix"
    >      cap-shape=bell cap-shape=conical cap-shape=convex
    > [1,]              .                 .                1
    > [2,]              .                 .                1
    > [3,]              1                 .                .
    > [4,]              .                 .                1
    > [5,]              .                 .                1
    >      cap-shape=flat cap-shape=knobbed
    > [1,]              .                 .
    > [2,]              .                 .
    > [3,]              .                 .
    > [4,]              .                 .
    > [5,]              .                 .

    > On Fri, Oct 20, 2017 at 3:51 PM, C W <tmrsg11 at gmail.com> wrote:

    >> Thank you for your responses.
    >> 
    >> I guess I don't feel alone. I don't find that the documentation goes
    >> into any detail.
    >> 
    >> I also find it surprising that,
    >> 
    >> > object.size(train$data)
    >> 1730904 bytes
    >> 
    >> > object.size(as.matrix(train$data))
    >> 6575016 bytes
    >> 
    >> the dgCMatrix actually takes less memory, though it *looks* like the
    >> opposite.
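
The sizes are roughly what a back-of-the-envelope calculation
predicts. A sketch, assuming the 6513 x 126 dimensions and 143286
non-zero entries shown further down in this thread (the reported
figures also include dimnames and object headers, hence are a bit
larger):

    ## dense matrix: one 8-byte double per cell
    6513 * 126 * 8
    ## [1] 6565104   (close to the 6575016 bytes reported above)

    ## dgCMatrix: an 8-byte double (@x) and a 4-byte integer row index (@i)
    ## per non-zero entry, plus one 4-byte column pointer (@p) per column + 1
    143286 * 8 + 143286 * 4 + (126 + 1) * 4
    ## [1] 1719940   (close to the 1730904 bytes reported above)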
    >> 
    >> Cheers!
    >> 
    >> On Fri, Oct 20, 2017 at 3:22 PM, David Winsemius <dwinsemius at comcast.net>
    >> wrote:
    >> 
    >>> 
    >>> > On Oct 20, 2017, at 11:11 AM, C W <tmrsg11 at gmail.com> wrote:
    >>> >
    >>> > Dear R list,
    >>> >
    >>> > I came across dgCMatrix. I believe this class is associated with sparse
    >>> > matrix.
    >>> 
    >>> Yes. See:
    >>> 
    >>> help('dgCMatrix-class', pack=Matrix)
    >>> 
    >>> If Martin Maechler happens to respond to this, you should listen to him
    >>> rather than to anything I write. Much of what the Matrix package does
    >>> appears to be magical to one such as I.
    >>> 
    >>> >
    >>> > I see there are 8 attributes to train$data. I am confused why there
    >>> > are so many, and some are vectors. What do they do?
    >>> >
    >>> > Here's the R code:
    >>> >
    >>> > library(xgboost)
    >>> > data(agaricus.train, package='xgboost')
    >>> > data(agaricus.test, package='xgboost')
    >>> > train <- agaricus.train
    >>> > test <- agaricus.test
    >>> > attributes(train$data)
    >>> >
    >>> 
    >>> I got a bit of an annoying surprise when I did something similar. It
    >>> appeared to me that I did not need to load the xgboost library, since all
    >>> that was being asked was "where is the data" in an object that should be
    >>> loaded from that library using the `data` function. The last command,
    >>> asking for the attributes, filled my console with a vector of length
    >>> about 100K (actually two such vectors). The `str` function returns a
    >>> more useful result.
    >>> 
    >>> > data(agaricus.train, package='xgboost')
    >>> > train <- agaricus.train
    >>> > names( attributes(train$data) )
    >>> [1] "i"        "p"        "Dim"      "Dimnames" "x"        "factors"
    >>> "class"
    >>> > str(train$data)
    >>> Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
    >>>   ..@ i       : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
    >>>   ..@ p       : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991 ...
    >>>   ..@ Dim     : int [1:2] 6513 126
    >>>   ..@ Dimnames:List of 2
    >>>   .. ..$ : NULL
    >>>   .. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=flat" ...
    >>>   ..@ x       : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
    >>>   ..@ factors : list()
    >>> 
    >>> > Where is the data? Is it in $p, $i, or $x?
    >>> 
    >>> So the "data" (meaning the values of the sparse matrix) are in the @x
    >>> leaf. The values all appear to be the number 1. The @i leaf is the sequence
    >>> of row locations for the values entries while the @p items are somehow
    >>> connected with the columns (I think, since 127 and 126=number of columns
    >>> from the @Dim leaf are only off by 1).
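
A tiny hand-made example may make that layout concrete; this is purely
illustrative, with made-up values, using Matrix::sparseMatrix() to
build a small dgCMatrix and then reading off its slots:

    library(Matrix)

    ## a 4 x 3 matrix with four non-zero entries (toy values)
    m <- sparseMatrix(i = c(1, 3, 2, 4),      # row positions (1-based here)
                      j = c(1, 1, 2, 3),      # column positions
                      x = c(10, 20, 30, 40))  # the stored values
    str(m)
    ## @x : the non-zero values, stored column by column:    10 20 30 40
    ## @i : their row indices, 0-based:                       0  2  1  3
    ## @p : where each column starts within @x/@i (0-based),
    ##      with the total number of non-zeros appended:      0  2  3  4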
    >>> 
    >>> Doing this:
    >>> 
    >>> > colSums(as.matrix(train$data))
    >>>      cap-shape=bell   cap-shape=conical
    >>>                 369                   3
    >>>    cap-shape=convex      cap-shape=flat
    >>>                2934                2539
    >>>   cap-shape=knobbed    cap-shape=sunken
    >>>                 644                  24
    >>> cap-surface=fibrous cap-surface=grooves
    >>>                1867                   4
    >>>   cap-surface=scaly  cap-surface=smooth
    >>>                2607                2035
    >>>     cap-color=brown      cap-color=buff
    >>>                1816
    >>> # now snipping the rest of that output.
    >>> 
    >>> 
    >>> 
    >>> Now this makes me think that the @p vector gives you the cumulative
    >>> sum of the number of items per column:
    >>> 
    >>> > all( cumsum( colSums(as.matrix(train$data)) ) == train$data@p[-1] )
    >>> [1] TRUE
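
That is exactly the compressed-sparse-column bookkeeping: in 0-based
terms, positions p[j] up to p[j+1] - 1 of @i and @x hold column j's
row indices and values. A small sketch of pulling one column out by
hand, with the Matrix package loaded (purely illustrative; in practice
train$data[, j] does this for you):

    j    <- 1                         # first column; assumes it has
                                      # at least one non-zero entry
    M    <- train$data
    idx  <- (M@p[j] + 1):M@p[j + 1]   # its entries in @i / @x (1-based in R)
    rows <- M@i[idx] + 1              # 1-based row numbers of the non-zeros
    vals <- M@x[idx]                  # their values (all 1 in this data set)

    all(M[rows, j] == vals)           # sanity check against ordinary subsetting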
    >>> 
    >>> >
    >>> > Thank you very much!
    >>> >
    >>> 
    >>> Please read the Posting Guide. Your code was not mangled in this
    >>> instance, but HTML code often arrives in an unreadable mess.
    >>> 
    >>> 
    >>> David Winsemius
    >>> Alameda, CA, USA
    >>> 
    >>> 'Any technology distinguishable from magic is insufficiently advanced.'
    >>> -Gehm's Corollary to Clarke's Third Law
    >>> 
