[R] Random Forest weighting

Wed Dec 3 23:24:28 CET 2008

Folks,

I have a question about weighting in Random Forest (RF). I know that several
earlier emails in this group have raised this issue, but I did not find an
answer to my question.

I am working on a dataset (dataset1) of 4 million records that can be
reduced to a dataset (dataset2) of approximately 1500 unique records, with
frequency counts that add up to the 4 million records. Because of size
issues, I cannot work with dataset1 in R, so I am working with dataset2.

Each record contains 14 comorbidity (Yes / No) variables and whether or not
the patient chose a particular drug; I am using RF to understand the
comorbidity drivers of drug adoption (a yes/no classification).

At full dataset level (dataset1), the drug adoption incidence is ~11%. At
the reduced dataset dataset2 level, the drug adoption incidence increases to
~38%.
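
To make the two incidence figures concrete, here is a minimal sketch with
made-up toy numbers (not my real data). The column names adopt and freq are
hypothetical; freq stands for the frequency count of each unique record.

```r
# Toy stand-in for dataset2: one row per unique record, with its count
# in the full data. All values here are invented for illustration.
d2 <- data.frame(
  adopt = c("Yes", "No", "Yes", "No"),
  freq  = c(10, 200, 5, 85)
)

# Incidence at the reduced level: every unique record counts once
unweighted <- mean(d2$adopt == "Yes")

# Incidence at the full level: every unique record counts freq times
weighted <- sum(d2$freq[d2$adopt == "Yes"]) / sum(d2$freq)
```

The gap between the two numbers comes only from ignoring freq, which I
believe is the same effect that moves my incidence from ~11% to ~38%.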

My question is: if I am using the reduced dataset (dataset2), how should
I inform RF that the adoption incidence at the full dataset level was 11%?
Should that be used as a class prior with classwt=c(Yes=.11, No=.89)? My
understanding is that RF does not allow case weighting.
Or can this be handled with the sampsize argument through oversampling?
What proportions should one use for this (e.g., sampsize=c(Yes=100,
No=100))?
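
For reference, the two calls I have in mind look roughly like the sketch
below. It assumes the randomForest package from CRAN and uses simulated data
in place of dataset2; the names x, y and comorb1..comorb14 are hypothetical.
I list the classwt and sampsize values in the order of levels(y), on the
assumption that they are matched by position rather than by name.

```r
library(randomForest)  # CRAN package, assumed to be installed

set.seed(1)
n <- 1500

# Simulated stand-in for dataset2: 14 Yes/No comorbidity flags
m <- matrix(sample(c("No", "Yes"), n * 14, replace = TRUE),
            nrow = n,
            dimnames = list(NULL, paste0("comorb", 1:14)))
x <- as.data.frame(m, stringsAsFactors = TRUE)

# Simulated outcome with roughly 38% "Yes", as in the reduced dataset
y <- factor(sample(c("No", "Yes"), n, replace = TRUE, prob = c(0.62, 0.38)),
            levels = c("No", "Yes"))

# Option 1: class priors from the full dataset (No = .89, Yes = .11)
rf_wt <- randomForest(x, y, ntree = 50, classwt = c(0.89, 0.11))

# Option 2: stratified sampling, 100 cases per class for each tree
rf_bal <- randomForest(x, y, ntree = 50, strata = y,
                       sampsize = c(100, 100))
```

Neither call uses the per-record frequency counts, which is the part I am
unsure how to handle.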

I would appreciate any feedback or pointers to any earlier thread that I may
have overlooked.

Regards,

Raghu