[R] What exactly is a dgCMatrix-class? There are so many attributes.

Fri Oct 20 21:22:26 CEST 2017

> On Oct 20, 2017, at 11:11 AM, C W <tmrsg11 at gmail.com> wrote:
> 
> Dear R list,
> 
> I came across dgCMatrix. I believe this class is associated with sparse
> matrix.

Yes. See:

 help('dgCMatrix-class', package = 'Matrix')

If Martin Maechler happens to respond to this you should listen to him rather than anything I write. Much of what the Matrix package does appears to be magical to one such as I.

> 
> I see there are 8 attributes to train$data, I am confused why are there so
> many, some are vectors, what do they do?
> 
> Here's the R code:
> 
> library(xgboost)
> data(agaricus.train, package='xgboost')
> data(agaricus.test, package='xgboost')
> train <- agaricus.train
> test <- agaricus.test
> attributes(train$data)
> 

I got a bit of an annoying surprise when I did something similar. It appeared to me that I did not need to load the xgboost library, since all that was being asked was "where is the data" in an object that can be loaded from that package with the `data` function. The last command, asking for the attributes, filled up my console with a vector of more than 100K elements (actually two such vectors). The `str` function returns a more useful result.

> data(agaricus.train, package='xgboost')
> train <- agaricus.train
> names( attributes(train$data) )
[1] "i"        "p"        "Dim"      "Dimnames" "x"        "factors"  "class"   
> str(train$data)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
  ..@ p       : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991 ...
  ..@ Dim     : int [1:2] 6513 126
  ..@ Dimnames:List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=flat" ...
  ..@ x       : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
  ..@ factors : list()

> Where is the data, is it in $p, $i, or $x?

So the "data" (meaning the values of the sparse matrix) are in the @x slot. The values all appear to be the number 1. The @i slot holds the row location of each of those values (0-based, according to the class help page), while the @p items are connected with the columns (I think, since its length of 127 is exactly one more than the 126 columns given in the @Dim slot).
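
The three slots are easier to read on a tiny example. This is only a sketch on a made-up 3 x 4 matrix (not the agaricus data), built with `sparseMatrix` from the Matrix package; the values and positions are my own invention for illustration.

```r
library(Matrix)  # recommended package that ships with R

# Hypothetical toy matrix with 4 nonzero entries:
#   (row 2, col 1) = 2, (row 1, col 3) = 5, (row 3, col 3) = 7, (row 2, col 4) = 3
m <- sparseMatrix(i = c(2, 1, 3, 2),
                  j = c(1, 3, 3, 4),
                  x = c(2, 5, 7, 3),
                  dims = c(3, 4))

class(m)   # "dgCMatrix"

m@x        # 2 5 7 3    the stored values, column by column
m@i        # 1 0 2 1    0-based row index of each stored value
m@p        # 0 1 1 3 4  one entry per column plus a leading 0
```

Column 2 is empty, which shows up as a repeated value in @p (1 followed by 1) rather than as anything in @i or @x.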

Doing this:

> colSums(as.matrix(train$data))
                  cap-shape=bell                cap-shape=conical 
                             369                                3 
                cap-shape=convex                   cap-shape=flat 
                            2934                             2539 
               cap-shape=knobbed                 cap-shape=sunken 
                             644                               24 
             cap-surface=fibrous              cap-surface=grooves 
                            1867                                4 
               cap-surface=scaly               cap-surface=smooth 
                            2607                             2035 
                 cap-color=brown                   cap-color=buff 
                            1816  
# now snipping the rest of that output.

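Since every stored value is 1, each column sum is also the number of entries stored for that column. This makes me think that the @p vector gives you the cumulative sum of the number of entries per column (after a leading 0):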

> all( cumsum( colSums(as.matrix(train$data)) ) == train$data@p[-1] )
[1] TRUE
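
If that reading of @p is right, then `diff` of @p should give the per-column counts without converting to a dense matrix, and the column of every stored value can be recovered with `rep`. A minimal sketch on a made-up 3 x 4 matrix (again not the agaricus data; the names `nnz_per_col`, `j` and `rebuilt` are just mine):

```r
library(Matrix)

# Hypothetical toy matrix; column 2 is deliberately left empty
m <- sparseMatrix(i = c(2, 1, 3, 2),
                  j = c(1, 3, 3, 4),
                  x = c(2, 5, 7, 3),
                  dims = c(3, 4))

# Number of stored entries in each column
nnz_per_col <- diff(m@p)                  # 1 0 2 1

# 1-based column index for each element of @x
j <- rep(seq_len(ncol(m)), nnz_per_col)   # 1 3 3 4

# Rebuild a dense matrix from the slots alone (@i is 0-based, so add 1)
rebuilt <- matrix(0, nrow = nrow(m), ncol = ncol(m))
rebuilt[cbind(m@i + 1L, j)] <- m@x

all(rebuilt == as.matrix(m))              # TRUE
```

On the mushroom data the analogous call would be diff(train$data@p), which avoids building the 6513 x 126 dense matrix.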

> 
> Thank you very much!
> 
> 	[[alternative HTML version deleted]]

Please read the Posting Guide. Your code was not mangled in this instance, but HTML code often arrives in an unreadable mess.


David Winsemius
Alameda, CA, USA

'Any technology distinguishable from magic is insufficiently advanced.'   -Gehm's Corollary to Clarke's Third Law