[R] Data Issues with ctree/glm and controlling classification parameters

Sun May 1 00:34:03 CEST 2016

Hi,

I have a dataset read in as:

mydata <- read.csv("data.csv", header = TRUE)

It contains the variable 'y' (binary, 0 or 1) and another variable 'weight'
(numeric, taking fractional values between 0 and 1).

1>
I first want to apply ctree() to mydata, but I don't want to use the
'weight' variable in the tree-building process. Can you please suggest how
to do this? Please note, I *don't* want to delete/remove this variable from
mydata.
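
For illustration, here is a minimal sketch of one way this might be approached: build the formula from every column except 'y' and 'weight', so that mydata itself is left untouched. It assumes the partykit package provides ctree() (the post does not say which package is used), and the simulated data and the columns x1 and x2 are hypothetical stand-ins for the real file.

```r
# Hypothetical stand-in for read.csv("data.csv"): x1 and x2 are invented columns.
set.seed(1)
n <- 200
mydata <- data.frame(x1 = rnorm(n), x2 = rnorm(n), weight = runif(n))
mydata$y <- factor(rbinom(n, 1, plogis(mydata$x1)))  # factor => classification tree

# Use every column except the response and 'weight' as a predictor.
predictors <- setdiff(names(mydata), c("y", "weight"))
tree_formula <- reformulate(predictors, response = "y")  # y ~ x1 + x2

# Assumption: ctree() comes from partykit; skip the fit if it is not installed.
if (requireNamespace("partykit", quietly = TRUE)) {
  fit <- partykit::ctree(tree_formula, data = mydata)
  print(fit)
}
```

Because only the formula changes, 'weight' stays available in mydata for later steps.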

2>
Another question: Say I split mydata into train (80%) and test (20%) as:
d <- sort(sample(nrow(mydata), nrow(mydata) * 0.8))
train <- mydata[d, ]
test <- mydata[-d, ]

Then, I perform a weighted glm (essentially, logistic regression) on train as:
#Build GLM model on train data
model <- glm(y ~ ., data = train, weights = train$weight, family = binomial)  # (A)
#Apply model on test
score <- predict(model, type = 'response', test)  # (B)
#Get classification for each observation in test as 'positive' or 'negative'
classify <- performance(score, "tpr", "fpr")  # (C)

My questions here are:
2a> Again, how do I proceed if I don't want to use the variable 'weight' as
a regressor in the glm() call in (A) above (but do want to use all other
variables in train)?
2b> In steps (B) and (C), how do I control the classification rule? For
example, R might classify observations with a model-fitted probability
> 0.5 as 'positive' and <= 0.5 as 'negative'. Is there a way to change this
threshold to, say, 0.75 instead of whatever R might be using (I used 0.5
only as an example)?
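
As a rough sketch of both points, using base R only: the formula `y ~ . - weight` drops 'weight' from the regressors while still using it as the case weight, and since predict() with type = "response" returns fitted probabilities rather than classes, the cutoff can be applied by hand. The simulated data, the columns x1 and x2, and the names threshold and predicted_class are hypothetical. The quasibinomial family is an assumption made here to avoid the non-integer-successes warning that binomial raises with fractional weights; the coefficient estimates are the same under either family.

```r
# Hypothetical data standing in for mydata; x1 and x2 are invented columns.
set.seed(42)
n <- 500
mydata <- data.frame(x1 = rnorm(n), x2 = rnorm(n), weight = runif(n))
mydata$y <- rbinom(n, 1, plogis(0.8 * mydata$x1 - 0.5 * mydata$x2))

# 80/20 split, as in the post.
d <- sort(sample(nrow(mydata), nrow(mydata) * 0.8))
train <- mydata[d, ]
test <- mydata[-d, ]

# (A) '. - weight' uses all columns except 'weight' as regressors;
#     'weights = weight' is looked up in 'data', so it is still the case weight.
model <- glm(y ~ . - weight, data = train, weights = weight,
             family = quasibinomial)

# (B) Fitted probabilities on the test set; no threshold is applied here.
score <- predict(model, newdata = test, type = "response")

# (C) Apply the chosen cutoff explicitly.
threshold <- 0.75
predicted_class <- ifelse(score > threshold, "positive", "negative")
print(table(predicted = predicted_class, actual = test$y))
```

Changing `threshold` is then the only step needed to try a different classification rule.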

Thank you in advance for your help.
-Preetam
-- 
Preetam Pal
(+91)-9432212774
M-Stat 2nd Year, Statistics Division,
Indian Statistical Institute, Kolkata.
Room No. N-114, C.V.Raman Hall, B.H.O.S.
