Martin Maechler
maechler at stat.math.ethz.ch
Sat Oct 21 18:27:37 CEST 2017
>>>>>> C W <tmrsg11 at gmail.com>
>>>>> on Fri, 20 Oct 2017 16:01:06 -0400 writes:
> Subsetting using [] vs. head(), gives different results.
> R code:
>> head(train$data, 5)
> [1] 0 0 1 0 0
The above is surprising ... and points to a bug somewhere.
It is different (and correct) after you do
require(Matrix)
but I think something like that should happen
semi-automatically.
As I just see, it is even worse if you get the data from xgboost
without loading the xgboost package, which you can do (and is
also more efficient !):
If you start R, and then do
data(agaricus.train, package='xgboost')
loadedNamespaces() # does not contain "xgboost" nor "Matrix"
so, no wonder
head(agaricus.train $ data)
does not find head()'s "Matrix" method [which _is_ exported by Matrix
via exportMethods(.)].
But even more curiously, even after I do
loadNamespace("Matrix")
methods(head)
now does show the "Matrix" method,
but then head() *still* does not call it. There's a bug
somewhere and I suspect it's in R's data() or methods package or
?? rather than in 'Matrix'.
But that will be another thread on R-devel or R's bugzilla.
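
A minimal sketch of the workaround mentioned above (attaching Matrix before calling head()), assuming the Matrix package is installed. The small matrix `m` is a made-up stand-in for agaricus.train$data, not the real data set:

```r
library(Matrix)  # attach the package, as require(Matrix) does above

# Hypothetical 6 x 3 sparse matrix standing in for agaricus.train$data
m <- sparseMatrix(i = c(1, 3, 6), j = c(1, 2, 3),
                  x = rep(1, 3), dims = c(6, 3))

h <- head(m, 2)  # first two rows, returned as a sparse matrix
class(h)
dim(h)
```

With Matrix attached, head() should return a two-row sparse matrix rather than the first elements of a plain vector.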
Martin
>> train$data[1:5, 1:5]
> 5 x 5 sparse Matrix of class "dgCMatrix"
>      cap-shape=bell cap-shape=conical cap-shape=convex
> [1,]              .                 .                1
> [2,]              .                 .                1
> [3,]              1                 .                .
> [4,]              .                 .                1
> [5,]              .                 .                1
>      cap-shape=flat cap-shape=knobbed
> [1,]              .                 .
> [2,]              .                 .
> [3,]              .                 .
> [4,]              .                 .
> [5,]              .                 .
On Fri, Oct 20, 2017 at 3:51 PM, C W wrote:
>> Thank you for your responses.
>>
>> I guess I don't feel alone. I don't find the documentation go into any
>> detail.
>>
>> I also find it surprising that,
>>
>> > object.size(train$data)
>> 1730904 bytes
>>
>> > object.size(as.matrix(train$data))
>> 6575016 bytes
>>
>> the dgCMatrix actually takes less memory, though it *looks* like the
>> opposite.
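
A rough illustration of why the sparse form is smaller, assuming the Matrix package is installed. The matrix `s` below is invented for the example: a dense matrix stores every cell as an 8-byte double, while a dgCMatrix stores only the nonzero values plus their index vectors.

```r
library(Matrix)

# Hypothetical 1000 x 100 matrix with 500 nonzero entries (0.5% filled)
set.seed(1)
idx <- sample(1000 * 100, 500)   # distinct linear positions of the nonzeros
s <- sparseMatrix(i = (idx - 1) %% 1000 + 1,
                  j = (idx - 1) %/% 1000 + 1,
                  x = rep(1, 500), dims = c(1000, 100))

sparse_bytes <- object.size(s)             # nonzeros plus index vectors only
dense_bytes  <- object.size(as.matrix(s))  # all 100,000 cells as doubles
sparse_bytes
dense_bytes
```

The same reasoning fits the numbers quoted above: 6513 x 126 cells at 8 bytes each is about 6.57 MB for the dense copy.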
>>
>> Cheers!
>>
On Fri, Oct 20, 2017 at 3:22 PM, David Winsemius wrote:
>>
>>>
> On Oct 20, 2017, at 11:11 AM, C W wrote:
>>> >
>>> > Dear R list,
>>> >
>>> > I came across dgCMatrix. I believe this class is associated with sparse
>>> > matrix.
>>>
>>> Yes. See:
>>>
>>> help('dgCMatrix-class', package = 'Matrix')
>>>
>>> If Martin Maechler happens to respond to this you should listen to him
>>> rather than anything I write. Much of what the Matrix package does appears
>>> to be magical to one such as I.
>>>
>>> >
>>> > I see there are 8 attributes to train$data, I am confused why are there
>>> so
>>> > many, some are vectors, what do they do?
>>> >
>>> > Here's the R code:
>>> >
>>> > library(xgboost)
>>> > data(agaricus.train, package='xgboost')
>>> > data(agaricus.test, package='xgboost')
>>> > train <- agaricus.train
>>> > test <- agaricus.test
>>> > attributes(train$data)
>>> >
>>>
>>> I got a bit of an annoying surprise when I did something similar. It
>>> appeared to me that I did not need to load the xgboost library, since all
>>> that was being asked was "where is the data" in an object that should be
>>> loaded from that library using the `data` function. The last command, asking
>>> for the attributes, filled up my console with a 100K length vector (actually
>>> 2 such vectors). The `str` function returns a more useful result.
>>>
>>> > data(agaricus.train, package='xgboost')
>>> > train <- agaricus.train
>>> > names( attributes(train$data) )
>>> [1] "i" "p" "Dim" "Dimnames" "x" "factors" "class"
>>> > str(train$data)
>>> Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
>>> ..@ i : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
>>> ..@ p : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991 ...
>>> ..@ Dim : int [1:2] 6513 126
>>> ..@ Dimnames:List of 2
>>> .. ..$ : NULL
>>> .. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=flat" ...
>>> ..@ x : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
>>> ..@ factors : list()
>>>
>>> > Where is the data, is it in $p, $i, or $x?
>>>
>>> So the "data" (meaning the values of the sparse matrix) are in the @x
>>> leaf. The values all appear to be the number 1. The @i leaf is the sequence
>>> of row locations for the value entries, while the @p items are somehow
>>> connected with the columns (I think, since 127 and 126 = number of columns
>>> from the @Dim leaf are only off by 1).
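
To make the roles of the three slots concrete, here is a small sketch on a made-up 4 x 3 matrix, assuming the Matrix package is installed. It follows the standard compressed sparse column layout used by dgCMatrix; the matrix and its values are purely illustrative.

```r
library(Matrix)

# Hypothetical 4 x 3 matrix:
#      [,1] [,2] [,3]
# [1,]    5    .    .
# [2,]    .    .    7
# [3,]    6    .    .
# [4,]    .    .    8
m <- sparseMatrix(i = c(1, 3, 2, 4), j = c(1, 1, 3, 3),
                  x = c(5, 6, 7, 8), dims = c(4, 3))

m@x  # nonzero values, stored column by column:          5 6 7 8
m@i  # 0-based row index of each stored value:           0 2 1 3
m@p  # 0-based start of each column in @x, then total:   0 2 2 4
```

The @p slot has one more entry than there are columns because its last element is the total number of stored values, which would explain the 127 versus 126 noted above.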
>>>
>>> Doing this > colSums(as.matrix(train$data))
>>> cap-shape=bell cap-shape=conical
>>> 369 3
>>> cap-shape=convex cap-shape=flat
>>> 2934 2539
>>> cap-shape=knobbed cap-shape=sunken
>>> 644 24
>>> cap-surface=fibrous cap-surface=grooves
>>> 1867 4
>>> cap-surface=scaly cap-surface=smooth
>>> 2607 2035
>>> cap-color=brown cap-color=buff
>>> 1816
>>> # now snipping the rest of that output.
>>>
>>>
>>>
>>> Now this makes me think that the @p vector gives you the cumulative sum
>>> of number of items per column:
>>>
>>> > all( cumsum( colSums(as.matrix(train$data)) ) == train$data@p[-1] )
>>> [1] TRUE
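
A caveat on the check above, sketched on an invented matrix and assuming the Matrix package is installed: @p accumulates the number of stored entries per column, not their sum. The two coincide in the agaricus data only if every stored value is 1, as appears to be the case there. With other values, diff() on @p still gives the per-column counts, while colSums() gives something different.

```r
library(Matrix)

# Hypothetical 3 x 3 matrix whose nonzero values are not all 1
m <- sparseMatrix(i = c(1, 2, 3), j = c(1, 1, 3),
                  x = c(2, 3, 4), dims = c(3, 3))

counts <- diff(m@p)                    # stored entries per column: 2 0 1
nz     <- colSums(as.matrix(m) != 0)   # same counts from the dense copy
sums   <- colSums(as.matrix(m))        # sums of values per column: 5 0 4

all(counts == nz)               # TRUE
all(cumsum(sums) == m@p[-1])    # FALSE here, because the values are not all 1
```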
>>>
>>> >
>>> > Thank you very much!
>>> >
>>>
>>> Please read the Posting Guide. Your code was not mangled in this
>>> instance, but HTML code often arrives in an unreadable mess.
>>>
>>> >
>>> > ______________________________________________
>>> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> > https://stat.ethz.ch/mailman/listinfo/r-help
>>> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> > and provide commented, minimal, self-contained, reproducible code.
>>>
>>> David Winsemius
>>> Alameda, CA, USA
>>>
>>> 'Any technology distinguishable from magic is insufficiently advanced.'
>>> -Gehm's Corollary to Clarke's Third Law
>>>
