[R] Can any R package handle factor levels not in the training set?

Bert Gunter gunter.berton at gene.com
Tue Jan 13 18:22:02 CET 2015


Folks:

I believe this discussion would be better moved to a statistical
discussion forum, like stats.stackexchange.com, as it appears to be
all about statistical issues, not R. I do not understand how you can
possibly expect to predict behavior in new categories for which you
have no prior information. Perhaps I misunderstand, or perhaps there
are appropriate ways to do this in your subject matter area that
discussion on a statistical forum would uncover. If you find any, you
might then come back to R (see CRAN's task views:
http://cran.r-project.org/web/views/ or simply search using a search
engine) to see whether and how such methodology is implemented in R.

Cheers,
Bert

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
Clifford Stoll




On Tue, Jan 13, 2015 at 8:59 AM, HelponR <suncertain at gmail.com> wrote:
> Thanks for your reply. But I cannot control the data.
> I am dealing with real-world stream data. It is very normal for the test
> data (the data you apply the model to for prediction) to have new values
> that were not seen in the training data.
> If I coded this myself, I would give a random guess or just the intercept in
> such a situation. But it seems most R packages return an error and exit.
>
> On Mon, Jan 12, 2015 at 6:08 PM, Richard M. Heiberger <rmh at temple.edu>
> wrote:
>
>> You need to define the levels of the training set to include all
>> levels that you might see.
>> Something like this:
>>
>> > A <- factor(letters[1:5])
>> > B <- factor(letters[c(1,3,5,7,9)])
>> > A
>> [1] a b c d e
>> Levels: a b c d e
>> > B
>> [1] a c e g i
>> Levels: a c e g i
>> > training <- factor(A, levels=unique(c(levels(A), levels(B))))
>> > training
>> [1] a b c d e
>> Levels: a b c d e g i
>> >
>>
>> In the future please "provide commented, minimal, self-contained,
>> reproducible code."
>>
>> On Mon, Jan 12, 2015 at 9:00 PM, HelponR <suncertain at gmail.com> wrote:
>> > It looks like gbm and glm both have this issue.
>> >
>> > I wonder if any R package is immune to this?
>> >
>> > In reality, it is very normal for test data to contain values unseen in the
>> > training data. It looks like I have to give up R?
>> >
>> > Thanks!
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>>
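
A minimal sketch of the fallback mentioned in the thread (predicting "just an
intercept" for values unseen in training), assuming a plain lm() fit. The toy
data, the names train, test, fit, pred and unseen, and the choice of the
overall training mean as the fallback are illustrative assumptions, not
something posted by the participants. Whether such a fallback is statistically
defensible is exactly the question raised in the reply above.

```r
# Toy training data: a factor x with levels a..e and a numeric response y
set.seed(1)
train <- data.frame(
  x = factor(rep(letters[1:5], each = 4)),
  y = rnorm(20)
)
fit <- lm(y ~ x, data = train)

# New data containing values "g" and "i" that never occurred in training
test <- data.frame(x = c("a", "c", "e", "g", "i"), stringsAsFactors = FALSE)

# Re-encode with the levels stored in the fitted model;
# values outside those levels become NA instead of triggering an error
test$x <- factor(test$x, levels = fit$xlevels$x)
unseen <- is.na(test$x)

# predict() returns NA for the rows whose factor value is NA
pred <- predict(fit, newdata = test)

# Assumed fallback: use the overall training mean (what an
# intercept-only model would predict) for the unseen levels
pred[unseen] <- mean(train$y)
pred
```

The same recoding step could be applied before calling predict() on other
model types, but how each package treats NA predictors differs, so that
should be checked per package.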


