[Rd] tm and e1071 question

Ingo Feinerer feinerer at logic.at
Sat Dec 5 10:43:39 CET 2009


On Fri, Dec 04, 2009 at 02:21:52PM +0100, Achim Zeileis wrote:
> I would like to use the svm function of the e1071 package for text
> classification tasks. Preprocessing can be carried out by using the
> excellent tm text mining package.

:-)

> TermDocumentMatrix and DocumentTermMatrix objects of the package tm
> are currently implemented based on the sparse matrix data structures
> provided by the slam package.
>
> Unfortunately, the svm function of the e1071 package accepts only sparse
> matrices of class Matrix provided by the Matrix package, or of class
> matrix.csr as provided by the package SparseM.
>
> In order to train an SVM with a DocumentTermMatrix object the latter
> must be converted to a matrix.csr sparse matrix structure. However, none
> of the publicly available packages on CRAN provides such a conversion
> function. It is quite straightforward to write the conversion function,
> but it would be much more comfortable to pass slam sparse matrix objects
> directly to the svm function.

You are right. For small matrices, as(as.matrix(m), "Matrix") will
work, at the cost of building the dense matrix first. In addition, there
is some experimental, unpublished code in the slam package for
conversion to the Matrix format (located in slam/work/Matrix.R):

setAs("simple_triplet_matrix", "dgTMatrix",
      function(from) {
          new("dgTMatrix",
              i = as.integer(from$i - 1L),
              j = as.integer(from$j - 1L),
              x = from$v,
              Dim = c(from$nrow, from$ncol),
              Dimnames = from$dimnames)
      })

setAs("simple_triplet_matrix", "dgCMatrix",
      function(from) {
          ind <- order(from$j, from$i)
          new("dgCMatrix",
              i = from$i[ind] - 1L,
              p = c(0L, cumsum(tabulate(from$j[ind], from$ncol))),
              x = from$v[ind],
              Dim = c(from$nrow, from$ncol),
              Dimnames = from$dimnames)
      })

which then allows:

class(m) <- "simple_triplet_matrix"
as(m, "dgTMatrix")
as(m, "dgCMatrix")
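
Since svm also accepts matrix.csr, the same idea could be applied to the
SparseM format. The following is only a minimal sketch and not part of
slam or SparseM: the function name stm_to_csr is hypothetical, it assumes
the triplet matrix has at least one non-zero entry and no duplicated
(i, j) pairs, and the dimnames are dropped because matrix.csr has no
slot for them.

```r
library(slam)
library(SparseM)

## Hypothetical helper (not provided by any package): convert a
## simple_triplet_matrix to a SparseM matrix.csr object.
stm_to_csr <- function(from) {
    ## CSR stores entries row by row, so sort by row, then by column.
    ind <- order(from$i, from$j)
    new("matrix.csr",
        ra = as.numeric(from$v[ind]),          # non-zero values
        ja = as.integer(from$j[ind]),          # 1-based column indices
        ## 1-based row pointers: where each row starts in ra/ja
        ia = c(1L, cumsum(tabulate(from$i[ind], from$nrow)) + 1L),
        dimension = as.integer(c(from$nrow, from$ncol)))
}

## Small example matrix, made up for illustration only.
m <- simple_triplet_matrix(i = c(1L, 2L, 2L), j = c(2L, 1L, 3L),
                           v = c(5, 7, 9), nrow = 2L, ncol = 3L)
csr <- stm_to_csr(m)

## Possible use with e1071 (labels is an assumed response vector):
## model <- e1071::svm(x = stm_to_csr(m), y = labels)
```

For a DocumentTermMatrix, class(m) <- "simple_triplet_matrix" as above
should be enough before calling the helper.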

> Do you plan to add slam sparse matrix support to the e1071 package?

I cannot answer this, since I am not directly involved in either the
e1071 or the slam package.

Best regards, Ingo Feinerer

-- 
Ingo Feinerer
Vienna University of Technology
http://www.dbai.tuwien.ac.at/staff/feinerer


