[R] Training a model using glm

Mohan Radhakrishnan radhakrishnan.mohan at gmail.com
Thu Sep 18 09:53:40 CEST 2014


Thanks Max and Dennis. Based on the syntax change I got the result for the
PCA part also.
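
The syntax change boils down to using bare column names in the formula and letting data = supply them. A minimal base-R sketch of why this matters, using glm() rather than caret and synthetic data (train_df, test_df and their columns are made-up names, not the AlzheimerDisease objects):

```r
# Synthetic data, for illustration only (not the AlzheimerDisease data).
set.seed(1)
train_df <- data.frame(x = rnorm(30))
train_df$y <- factor(ifelse(train_df$x + rnorm(30) > 0, "Control", "Impaired"))
test_df <- data.frame(x = rnorm(10))

# Formula with $ indexing: the terms point at train_df itself,
# so predict() cannot substitute the columns of newdata.
fit_dollar <- glm(train_df$y ~ train_df$x, family = binomial)
pred_dollar <- suppressWarnings(predict(fit_dollar, newdata = test_df))
length(pred_dollar)   # one value per training row, not per test row

# Formula with bare names plus data =: predict() looks the names up in newdata.
fit_names <- glm(y ~ x, family = binomial, data = train_df)
pred_names <- predict(fit_names, newdata = test_df)
length(pred_names)    # one value per test row
```

In base R the first predict() call only warns that 'newdata' had a different number of rows than the variables found, which is the same message reported at the start of this thread.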

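# Keep only the IL_* predictor columns (no outcome) for the PCA step
training2 <- training[, grepl("^IL", names(training))]
test2 <- testing[, grepl("^IL", names(testing))]

# Fit PCA on the training predictors; keep components explaining 80% of variance
preProc <- preProcess(training2, method = "pca", thresh = 0.8)
trainpca <- predict(preProc, training2)
testpca <- predict(preProc, test2)

# Put the outcome in the data frame so the formula needs no $ indexing
trainpca$diagnosis <- training$diagnosis
modelFitpca <- train(diagnosis ~ ., method = "glm", data = trainpca)

# confusionMatrix() takes the predictions first, then the observed classes
confusionMatrix(predict(modelFitpca, testpca), test1$diagnosis)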
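
For anyone without caret at hand, the PCA steps above can be roughly sketched in base R with prcomp() and glm(). This is only an approximation of what preProcess(method = "pca", thresh = 0.8) does (center, scale, keep enough components to reach the variance threshold); the data are synthetic and every name below is hypothetical:

```r
# Synthetic data, for illustration only; column names are made up.
set.seed(42)
n <- 120
z <- rnorm(n)                                   # shared latent signal
X <- data.frame(IL_a = z + rnorm(n, sd = 0.5),
                IL_b = z + rnorm(n, sd = 0.5),
                IL_c = rnorm(n),
                IL_d = rnorm(n))
y <- factor(ifelse(z + rnorm(n) > 0, "Control", "Impaired"))
in_train <- seq_len(90)                         # simple split for the sketch

# Fit PCA on the training predictors only (centered and scaled).
pca <- prcomp(X[in_train, ], center = TRUE, scale. = TRUE)

# Keep the fewest components whose cumulative variance reaches 80%.
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(cum_var >= 0.8)[1]

# Project both sets with the training rotation, then attach the outcome.
train_pc <- data.frame(predict(pca, X[in_train, ])[, 1:k, drop = FALSE],
                       diagnosis = y[in_train])
test_pc <- data.frame(predict(pca, X[-in_train, ])[, 1:k, drop = FALSE])

# Logistic regression on the components, with bare names in the formula.
fit <- glm(diagnosis ~ ., family = binomial, data = train_pc)
prob <- predict(fit, newdata = test_pc, type = "response")  # P(second level)
pred <- factor(ifelse(prob > 0.5, levels(y)[2], levels(y)[1]),
               levels = levels(y))

# Rows: predicted class, columns: observed class.
table(Predicted = pred, Observed = y[-in_train])
```

The point of the sketch is that the rotation is estimated on the training rows only and then reused for the test rows, which is what predict(preProc, test2) does in the caret version.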


Mohan

On Thu, Sep 18, 2014 at 12:43 PM, Mohan Radhakrishnan <
radhakrishnan.mohan at gmail.com> wrote:

> Oh. I understand now. There is nothing wrong with the logic. It is the
> syntax.
>
> > library(AppliedPredictiveModeling)
> Warning message:
> package ‘AppliedPredictiveModeling’ was built under R version 3.1.1
> > set.seed(3433)
> > data(AlzheimerDisease)
> > adData = data.frame(diagnosis,predictors)
> > inTrain = createDataPartition(adData$diagnosis, p = 3/4)[[1]]
> > training = adData[ inTrain,]
> > testing = adData[-inTrain,]
> > training1 <- training[,grepl("^IL|^diagnosis",names(training))]
> > test1 <- testing[,grepl("^IL|^diagnosis",names(testing))]
> > modelFit <- train(diagnosis ~ .,method="glm",data=training1)
> > confusionMatrix(test1$diagnosis,predict(modelFit, test1))
>
> Confusion Matrix and Statistics
>
>           Reference
> Prediction Impaired Control
>   Impaired        2      20
>   Control         9      51
>
>                Accuracy : 0.6463
>                  95% CI : (0.533, 0.7488)
>     No Information Rate : 0.8659
>     P-Value [Acc > NIR] : 1.00000
>
>                   Kappa : -0.0702
>  Mcnemar's Test P-Value : 0.06332
>
>             Sensitivity : 0.18182
>             Specificity : 0.71831
>          Pos Pred Value : 0.09091
>          Neg Pred Value : 0.85000
>              Prevalence : 0.13415
>          Detection Rate : 0.02439
>    Detection Prevalence : 0.26829
>       Balanced Accuracy : 0.45006
>
>        'Positive' Class : Impaired
>
>
> Thanks,
>
> Mohan
>
> On Thu, Sep 18, 2014 at 12:21 AM, Max Kuhn <mxkuhn at gmail.com> wrote:
>
>> You have not shown all of your code and it is difficult to diagnose the
>> issue.
>>
>> I assume that you are using the data from:
>>
>>    library(AppliedPredictiveModeling)
>>    data(AlzheimerDisease)
>>
>> If so, there is example code to analyze these data in that package. See
>> ?scriptLocation.
>>
>> We have no idea how you got to the `training` object (package versions
>> would be nice too).
>>
>> I suspect that Dennis is correct. Try using more normal syntax without
>> the $ indexing in the formula. I wouldn't say it is (absolutely) wrong but
>> it doesn't look right either.
>>
>> Max
>>
>>
>> On Wed, Sep 17, 2014 at 2:04 PM, Mohan Radhakrishnan <
>> radhakrishnan.mohan at gmail.com> wrote:
>>
>>> Hi Dennis,
>>>
>>> Why is there that warning? I think my syntax is right, isn't it? So can
>>> the warning be ignored?
>>>
>>> Thanks,
>>> Mohan
>>>
>>> On Wed, Sep 17, 2014 at 9:48 PM, Dennis Murphy <djmuser at gmail.com>
>>> wrote:
>>>
>>> > No reproducible example (i.e., no data) supplied, but the following
>>> > should work in general, so I'm presuming this maps to the caret
>>> > package as well. Thoroughly untested.
>>> >
>>> > library(caret)    # something you failed to mention
>>> >
>>> > ...
>>> > # presumably a logistic regression
>>> > modelFit <- train(diagnosis ~ ., data = training1)
>>> > confusionMatrix(test1$diagnosis,
>>> >                 predict(modelFit, newdata = test1, type = "response"))
>>> >
>>> > For GLMs, there are several types of possible predictions. The default
>>> > for predict.glm() is "link", which returns values on the scale of the
>>> > linear predictor. caret may have a different syntax, so you should check
>>> > its help pages re the supported predict methods.
>>> >
>>> > Hint: If a function takes a data = argument, you don't need to specify
>>> > the variables as components of the data frame - the variable names are
>>> > sufficient. You should also do some reading to understand why the
>>> > model formula I used is correct if you're modeling one variable as
>>> > response and all others in the data frame as covariates.
>>> >
>>> > Dennis
>>> >
>>> > On Tue, Sep 16, 2014 at 11:15 PM, Mohan Radhakrishnan
>>> > <radhakrishnan.mohan at gmail.com> wrote:
>>> > > I answered this question, which was part of the online course,
>>> > > correctly by executing some commands and guessing.
>>> > >
>>> > > But I didn't get the gist of this approach though my R code works.
>>> > >
>>> > > I have a training and test dataset.
>>> > >
>>> > >> nrow(training)
>>> > > [1] 251
>>> > >> nrow(testing)
>>> > > [1] 82
>>> > >> head(training1)
>>> > >    diagnosis    IL_11    IL_13    IL_16   IL_17E IL_1alpha      IL_3     IL_4
>>> > > 6   Impaired 6.103215 1.282549 2.671032 3.637051 -8.180721 -3.863233 1.208960
>>> > > 10  Impaired 4.593226 1.269463 3.476091 3.637051 -7.369791 -4.017384 1.808289
>>> > > 11  Impaired 6.919778 1.274133 2.154845 4.749337 -7.849364 -4.509860 1.568616
>>> > > 12  Impaired 3.218759 1.286356 3.593860 3.867347 -8.047190 -3.575551 1.916923
>>> > > 13  Impaired 4.102821 1.274133 2.876338 5.731246 -7.849364 -4.509860 1.808289
>>> > > 16  Impaired 4.360856 1.278484 2.776394 5.170380 -7.662778 -4.017384 1.547563
>>> > >          IL_5       IL_6 IL_6_Receptor     IL_7     IL_8
>>> > > 6  -0.4004776  0.1856864   -0.51727788 2.776394 1.708270
>>> > > 10  0.1823216 -1.5342758    0.09668586 2.154845 1.701858
>>> > > 11  0.1823216 -1.0965412    0.35404039 2.924466 1.719944
>>> > > 12  0.3364722 -0.3987186    0.09668586 2.924466 1.675557
>>> > > 13  0.0000000  0.4223589   -0.53219115 1.564217 1.691393
>>> > > 16  0.2623643  0.4223589    0.18739989 1.269636 1.705116
>>> > >
>>> > >
>>> > > The testing dataset is similar, with 13 columns; only the number of
>>> > > rows differs.
>>> > >
>>> > >
>>> > > training1 <- training[,grepl("^IL|^diagnosis",names(training))]
>>> > >
>>> > > test1 <- testing[,grepl("^IL|^diagnosis",names(testing))]
>>> > >
>>> > > modelFit <- train(training1$diagnosis ~ training1$IL_11 +
>>> > > training1$IL_13 + training1$IL_16 + training1$IL_17E +
>>> > > training1$IL_1alpha + training1$IL_3 + training1$IL_4 +
>>> > > training1$IL_5 + training1$IL_6 + training1$IL_6_Receptor +
>>> > > training1$IL_7 + training1$IL_8, method="glm", data=training1)
>>> > >
>>> > > confusionMatrix(test1$diagnosis,predict(modelFit, test1))
>>> > >
>>> > > I get this error when I run the above command to get the confusion
>>> > > matrix.
>>> > >
>>> > > 'newdata' had 82 rows but variables found have 251 rows
>>> > >
>>> > > I thought this was simple. I train a model using the training
>>> > > dataset and predict using the test dataset and get the accuracy.
>>> > >
>>> > > Am I missing the obvious here ?
>>> > >
>>> > > Thanks,
>>> > >
>>> > > Mohan
>>> > >
>>> > >         [[alternative HTML version deleted]]
>>> > >
>>> > > ______________________________________________
>>> > > R-help at r-project.org mailing list
>>> > > https://stat.ethz.ch/mailman/listinfo/r-help
>>> > > PLEASE do read the posting guide
>>> > > http://www.R-project.org/posting-guide.html
>>> > > and provide commented, minimal, self-contained, reproducible code.
>>> >
>>>
>>>
>>
>>
>
