[R] SVM. How to use categorical attributes?

Wed Mar 28 15:56:53 CEST 2012

Sorry -- I should add that I'm pointing you to shogun because I suspect
their bag-of-words-like kernel uses the kernel trick, so you wouldn't
have to map all of your data explicitly into some huge feature space
that would exhaust your memory.

I'm not 100% sure they have what you're looking for, but as I said ...
it's worth checking out.
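
To make the "no explicit feature space" point concrete, here is a small
base-R sketch (not shogun code). It assumes a set-based bag of words, i.e.
0/1 word presence, and reuses the two toy sentences from Ulrich's example
quoted below; the variable names are made up for illustration. It builds the
explicit indicator matrix once, only to check that its linear kernel equals
the pairwise word-overlap count:

```r
# Two toy documents, already split into words (from the quoted example).
docs <- list(
  c("how", "to", "grow", "tree"),
  c("where", "to", "go", "weekend", "cinema")
)

# Explicit feature map: one 0/1 column per vocabulary word.
# This is the matrix that gets huge for a realistic vocabulary.
vocab <- sort(unique(unlist(docs)))
X <- t(sapply(docs, function(d) as.integer(vocab %in% d)))

# Linear kernel on the explicit features: X %*% t(X).
K_explicit <- tcrossprod(X)

# Same values computed directly from the word sets, without ever building X.
idx <- seq_along(docs)
K_implicit <- outer(idx, idx, Vectorize(function(i, j) {
  length(intersect(docs[[i]], docs[[j]]))
}))

# Both routes give the same kernel matrix.
stopifnot(all(K_explicit == K_implicit))
```

With real data you would skip X entirely and compute only the kernel values.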

-steve

On Wed, Mar 28, 2012 at 9:54 AM, Steve Lianoglou
<mailinglist.honeypot at gmail.com> wrote:
> Hi,
>
> These suggestions still require you to explicitly compute your feature
> space or kernel matrix first, which might kill you memory-wise.
>
> You might consider taking a look at the shogun toolbox:
>
> http://www.shogun-toolbox.org/
>
> With some digging, I'm pretty sure you'll find a bag-of-words type of
> kernel there (it's related to the spectrum kernel, which you can find
> by searching the code base for something like "commword") ... you
> might consider posting to their mailing list after you give it the
> "good old college try" of sorting this out for yourself for a bit.
>
> The R interface to the toolbox is a bit ... alien, though. I'm working
> on making a nicer one but it's not quite ready for public consumption.
>
> -steve
>
>
> On Wed, Mar 28, 2012 at 7:38 AM, Ulrich Bodenhofer
> <bodenhofer at bioinf.jku.at> wrote:
>> Alex,
>>
>> To avoid the memory issue, you can directly use a "bag of words" kernel
>> (which corresponds to using the linear kernel on the sparse bag of words
>> matrix Steve suggested). Here is a little toy example of how this is done
>> for two samples:
>>
>> > x1 <- c("how", "to", "grow", "tree")
>> > x2 <- c("where", "to", "go", "weekend", "cinema")
>> > k12 <- length(intersect(x1, x2))
>> > k12
>> [1] 1
>>
>> If you run this for every pair of samples (additionally exploiting the
>> symmetry of the resulting matrix), you will get an L x L matrix of kernel
>> values (where L is the number of samples) without having to store the
>> large bag of words matrix. That's exactly one of the beauties of
>> SVMs, in my humble opinion.
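
The pairwise loop described above could look roughly like this in base R.
It is only a sketch: the third document and the names `bow_kernel` and
`gram_matrix` are invented for illustration, and each sample is assumed to be
a character vector of words already.

```r
# Toy samples, each a character vector of words (third one is made up).
docs <- list(
  c("how", "to", "grow", "tree"),
  c("where", "to", "go", "weekend", "cinema"),
  c("grow", "tree", "weekend")
)

# Kernel value for one pair: number of distinct words both samples share.
bow_kernel <- function(a, b) length(intersect(a, b))

# Build the L x L kernel matrix, computing only the upper triangle
# (including the diagonal) and mirroring it to exploit symmetry.
gram_matrix <- function(docs, kernel) {
  n <- length(docs)
  K <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in i:n) {
      K[i, j] <- kernel(docs[[i]], docs[[j]])
      K[j, i] <- K[i, j]
    }
  }
  K
}

K <- gram_matrix(docs, bow_kernel)
```

A matrix like K can then be handed to an SVM that accepts precomputed
kernels; as far as I know, ksvm() in the kernlab package takes one wrapped in
as.kernelMatrix(), but check its documentation for the details.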
>>
>> Just as a side note: the result above is 1 because there is one overlap in
>> the two bags of words, the word "to". Maybe it is a good idea to remove such
>> unspecific words first and, moreover, to do word stemming, as is the
>> standard in analyses like the one you are aiming at.
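
As a rough illustration of the stop-word point, the same two sentences can be
filtered before the kernel is computed. The stop-word list here is a tiny
made-up one and `remove_stopwords` is a hypothetical helper; a real analysis
would use a proper list. Stemming needs an extra package (SnowballC's
wordStem() is one I know of), so it is left out of this sketch.

```r
# Hypothetical, deliberately tiny stop-word list for illustration only.
stopwords <- c("to", "how", "where", "the", "a")

# Drop stop words from a character vector of words.
remove_stopwords <- function(words) words[!words %in% stopwords]

x1 <- c("how", "to", "grow", "tree")
x2 <- c("where", "to", "go", "weekend", "cinema")

# Overlap before and after filtering the unspecific words.
k12_raw      <- length(intersect(x1, x2))
k12_filtered <- length(intersect(remove_stopwords(x1), remove_stopwords(x2)))
```

After filtering, the only shared word ("to") is gone, so the two sentences no
longer look similar just because of a filler word.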
>>
>> Best regards,
>> Ulrich
>
>
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
