[R] How do I extract the scoring equations for neural networks and support vector machines?
jude.ryan at ubs.com
Tue May 12 21:35:08 CEST 2009
Sorry for these multiple postings.
I solved the problem using na.omit() to drop records with missing values
for the time being. I will worry about imputation, etc. later.
I calculated the sum of squared errors for 3 models, linear regression,
neural networks, and support vector machines. This is the first run.
Without doing any parameter tuning on the SVM or playing around with the
number of nodes in the hidden layer of the neural network, I found that
the SVM had the lowest sum of squared errors, followed by neural
networks, with regression being last. This probably indicates that the
data has non-linear patterns.
I have a couple of questions.
1) Besides sum of squared errors, are there any other metrics that can
be used to compare these 3 models? AIC, BIC, etc., can be used for
regressions, but I am not sure whether they can be used for SVMs and
neural networks.
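A minimal sketch of model-agnostic metrics, assuming the models are compared on a hold-out sample rather than on the training data. The helper name fit_metrics and the simulated data are hypothetical. Only lm() is fitted here, but the same helper should work on the vector returned by predict() for any model with a numeric response.

```r
# Hypothetical helper: error metrics computed only from actual and
# predicted values, so it does not depend on the type of model.
fit_metrics <- function(actual, predicted) {
  err <- actual - as.vector(predicted)
  c(SSE  = sum(err^2),
    RMSE = sqrt(mean(err^2)),
    MAE  = mean(abs(err)),
    R2   = 1 - sum(err^2) / sum((actual - mean(actual))^2))
}

# Simulated data (an assumption, not the coreaff data)
set.seed(42)
n <- 300
d <- data.frame(x = runif(n, -2, 2))
d$y <- sin(2 * d$x) + 0.5 * d$x + rnorm(n, sd = 0.2)

train <- d[1:200, ]
test  <- d[201:300, ]

lin <- lm(y ~ x, data = train)
metrics_lm <- fit_metrics(test$y, predict(lin, newdata = test))
print(round(metrics_lm, 3))
```

Because the metrics are computed out of sample, a more flexible model is not rewarded merely for fitting the training data more closely.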
2) Is there any easy way to extract the scoring equations for SVMs and
neural networks? Using the R objects I can always score new data
manually but the model will need to be implemented in a production
environment. When the model gets implemented in production (could be the
mainframe) I will need equations that can be coded in any language
(COBOL or SAS on the mainframe). Also, getting the scoring equations for
all 3 models will let me create an ensemble model where the predicted
value could be the average of the predictions from the SVM, neural
network and linear regression. If the ensemble model has the smallest
sum of squared errors this would be the model I would use.
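The averaging idea can be sketched as follows, assuming a hold-out sample is available and using simulated data in place of the real data. Only lm() and nnet() are combined here. An SVM prediction vector could be added as a further column in the same way.

```r
library(nnet)

# Simulated data (an assumption, not the coreaff data)
set.seed(7)
n <- 400
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 2 * d$x1 + d$x2^2 + rnorm(n, sd = 0.3)
train <- d[1:300, ]
test  <- d[301:400, ]

m_lm <- lm(y ~ x1 + x2, data = train)
m_nn <- nnet(y ~ x1 + x2, data = train, size = 3, decay = 1e-3,
             linout = TRUE, maxit = 500, trace = FALSE)

# One column of hold-out predictions per model
preds <- cbind(lm   = predict(m_lm, newdata = test),
               nnet = as.vector(predict(m_nn, newdata = test)))

# Ensemble = simple average of the individual predictions
preds <- cbind(preds, ensemble = rowMeans(preds))

# Sum of squared errors for each column on the hold-out sample
sse <- colSums((test$y - preds)^2)
print(sse)
```

Whether the ensemble beats the individual models depends on the data, so the comparison is best made on records that none of the models were trained on.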
I have SAS Enterprise Miner as well and can get a scoring equation for
the neural network (I don't have SVM), but the scoring code that SAS EM
generates sucks and I would much rather extract a scoring equation from
R. I am using nnet() for the neural network.
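A minimal sketch of how the scoring equation of an nnet() fit could be reproduced by hand, assuming a single hidden layer with linout = TRUE and skip = TRUE. The weight vector in fit$wts follows the order that summary() prints: for each hidden unit the bias and then the input weights, and for the output the bias, the hidden-unit weights, and then the skip-layer input weights. The helper name score_nnet and the simulated data are hypothetical.

```r
library(nnet)

# Hypothetical helper: score a numeric matrix X from the raw weight
# vector of an nnet fit with one hidden layer, linout = TRUE, skip = TRUE.
# Columns of X must be in the same order as the terms in the formula.
score_nnet <- function(wts, X, n_hidden) {
  p <- ncol(X)
  sigmoid <- function(z) 1 / (1 + exp(-z))
  n_to_hidden <- n_hidden * (p + 1)
  # Column j holds the bias and input weights of hidden unit j
  W_h <- matrix(wts[seq_len(n_to_hidden)], nrow = p + 1)
  # Remaining weights: output bias, hidden->output, input->output (skip)
  w_o <- wts[-seq_len(n_to_hidden)]
  H <- sigmoid(cbind(1, X) %*% W_h)
  as.vector(w_o[1] +
            H %*% w_o[2:(n_hidden + 1)] +
            X %*% w_o[(n_hidden + 2):(n_hidden + 1 + p)])
}

# Simulated data (an assumption, not the coreaff data)
set.seed(1)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + sin(2 * d$x1) + rnorm(n, sd = 0.1)

fit <- nnet(y ~ x1 + x2, data = d, size = 2, decay = 1e-3,
            linout = TRUE, skip = TRUE, maxit = 500, trace = FALSE)

manual <- score_nnet(fit$wts, as.matrix(d[, c("x1", "x2")]), n_hidden = 2)
print(max(abs(manual - as.vector(predict(fit, newdata = d)))))
```

The body of score_nnet uses only sums, products and the logistic function, so it should translate to SAS or COBOL once the weights are exported. As far as I know, nnet's compiled code treats the logistic as exactly 0 or 1 for very large arguments, so tiny differences from predict() are possible when hidden units saturate.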
Thanks in advance,
Jude Ryan
________________________________
From: Ryan, Jude
Sent: Tuesday, May 12, 2009 1:23 PM
To: 'r-help at r-project.org'
Cc: juderyan61 at yahoo.com
Subject: FW: neural network not using all observations
As a follow-up to my email below:
The input data frame to nnet() has dimensions:
> dim(coreaff.trn.nn)
[1] 5088 8
And the predictions from the neural network (35 records are dropped -
see email below for more details) have dimensions:
> pred <- predict(coreaff.nn1)
> dim(pred)
[1] 5053 1
So, the following line of R code does not work as the dimensions are
different.
> sum((coreaff.trn.nn$hh.iast.y - predict(coreaff.nn1))^2)
Error: dims [product 5053] do not match the length of object [5088]
In addition: Warning message:
In coreaff.trn.nn$hh.iast.y - predict(coreaff.nn1) :
longer object length is not a multiple of shorter object length
While:
> dim(pred)
[1] 5053 1
> tail(pred)
[,1]
5083 664551.9
5084 552170.6
5085 684834.3
5086 1215282.5
5087 1116302.2
5088 658112.1
shows that the last row of pred is 5,088, which corresponds to the
dimension of coreaff.trn.nn, the input data frame to the neural network.
I tried using row() to identify the 35 records that were dropped (or not
scored). The code I tried was:
> coreaff.trn.nn.subset <- coreaff.trn.nn[row(coreaff.trn.nn) ==
+     row(pred), ]
Error in row(coreaff.trn.nn) == row(pred) : non-conformable arrays
But I am not doing something right. pred has dimension = 1 and row()
requires an object of dimension = 2. So using cbind() I bound a column
of sequence numbers to pred to make the dimension = 2 but that did not
help.
Basically, if I can identify the 5,053 records that the neural network
made predictions for, in the data frame of 5,088 records
(coreaff.trn.nn) used by the neural network, then I can compare the
predictions to the actual values, and compare the predictive power of
the neural network to the predictive power of the linear regression
model.
Any idea how I can extract the 5,053 records that the neural network
made predictions for from the data frame (5,088 records) used to train
the neural network?
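One way to recover the scored records, sketched on simulated data and assuming the records were dropped because of missing values in the model variables (the default na.action for the formula interface is normally na.omit). complete.cases() flags the rows that survive, and the row names on pred, visible in the tail(pred) output above, offer a second route.

```r
library(nnet)

# Simulated data with a few missing predictor values (an assumption)
set.seed(3)
n <- 120
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + d$x1 - 0.5 * d$x2 + rnorm(n, sd = 0.2)
d$x2[c(5, 40, 77)] <- NA

fit <- nnet(y ~ x1 + x2, data = d, size = 2, linout = TRUE,
            maxit = 200, trace = FALSE)
pred <- predict(fit)   # fitted values for the rows actually used

# Rows with no missing value in any variable used by the model
used    <- complete.cases(d[, c("y", "x1", "x2")])
scored  <- d[used, ]
dropped <- d[!used, ]

# Dimensions now agree, so the sum of squared errors can be computed
sse <- sum((scored$y - as.vector(pred))^2)
print(c(scored = nrow(scored), dropped = nrow(dropped), sse = sse))

# Alternative: index the data frame by the row names carried on pred
scored_by_name <- d[rownames(pred), ]
```

Inspecting the dropped data frame shows which variable carries the missing values.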
Thanks in advance,
Jude
________________________________
From: Ryan, Jude
Sent: Tuesday, May 12, 2009 11:11 AM
To: 'r-help at r-project.org'
Cc: juderyan61 at yahoo.com
Subject: neural network not using all observations
I am exploring neural networks (adding non-linearities) to see if I can
get more predictive power than a linear regression model I built. I am
using the function nnet and following the example of Venables and
Ripley, in Modern Applied Statistics with S, on pages 246 to 249. I have
standardized variables (z-scores) such as assets, age and tenure. I have
other variables that are binary (0 or 1). In max_acc_ownr_nwrth_n_med
for example, the variable has a value of 1 if the client's net worth is
above the median net worth and a value of 0 otherwise. These are derived
variables I created and variables that the regression algorithm has found
to be predictive. A regression on the same variables shown below gives
me an R-Square of about 0.12. I am trying to increase the predictive
power of this regression model with a neural network being careful to
avoid overfitting.
Similar to Venables and Ripley, I used the following code:
> library(nnet)
> dim(coreaff.trn.nn)
[1] 5088 8
> head(coreaff.trn.nn)
  hh.iast.y WC_Total_Assets all_assets_per_hh         age      tenure
1   3059448      -0.4692186        -0.4173532 -0.06599001 -1.04747935
2   4899746       3.4854334         4.0111164 -0.06599001 -0.72540200
3    727333      -0.2677357        -0.4177944 -0.30136473 -0.40332465
4    443138      -0.5295170        -0.6999646 -0.14444825 -1.04747935
5    484253      -0.6112205        -0.7306664  0.64013414  0.07979137
6    799054       0.6580506         1.1763114  0.24784295  0.07979137
  max_acc_ownr_liq_asts_n_med max_acc_ownr_nwrth_n_med max_acc_ownr_ann_incm_n_med
1                           0                        1                           0
2                           1                        1                           1
3                           1                        1                           1
4                           0                        0                           0
5                           1                        0                           0
6                           0                        1                           1
> coreaff.nn1 <- nnet(hh.iast.y ~ WC_Total_Assets + all_assets_per_hh +
+     age + tenure + max_acc_ownr_liq_asts_n_med +
+     max_acc_ownr_nwrth_n_med + max_acc_ownr_ann_incm_n_med,
+     coreaff.trn.nn, size = 2, decay = 1e-3,
+     linout = T, skip = T, maxit = 1000, Hess = T)
# weights: 26
initial value 12893652845419998.000000
iter 10 value 6352515847944854.000000
final value 6287104424549762.000000
converged
> summary(coreaff.nn1)
a 7-2-1 network with 26 weights
options were - skip-layer connections linear output units decay=0.001
     b->h1     i1->h1     i2->h1     i3->h1     i4->h1     i5->h1     i6->h1     i7->h1
 -21604.84   -2675.80   -5001.90   -1240.16    -335.44  -12462.51  -13293.80   -9032.34
     b->h2     i1->h2     i2->h2     i3->h2     i4->h2     i5->h2     i6->h2     i7->h2
 210841.52   47296.92   58100.43  -13819.10   -9195.80  117088.99  131939.57  106994.47
       b->o      h1->o      h2->o      i1->o      i2->o      i3->o      i4->o      i5->o      i6->o      i7->o
 1115190.67  894123.33 -417269.57   89621.84  170268.12   44833.63   59585.05  112405.30  437581.05  244201.69
> sum((hh.iast.y - predict(coreaff.nn1))^2)
Error: object "hh.iast.y" not found
So I try:
> sum((coreaff.trn.nn$hh.iast.y - predict(coreaff.nn1))^2)
Error: dims [product 5053] do not match the length of object [5088]
In addition: Warning message:
In coreaff.trn.nn$hh.iast.y - predict(coreaff.nn1) :
longer object length is not a multiple of shorter object length
Doing a little debugging:
> pred <- predict(coreaff.nn1)
> dim(pred)
[1] 5053 1
> dim(coreaff.trn.nn)
[1] 5088 8
So it looks like pred has 5,053 rows while the input dataset has 5,088
records, which means the neural network is dropping 35 records. Does
anyone have any idea why it would do this? It is most probably because
those 35 records are "bad" data, a pretty common occurrence in the real
world.
Does anyone know how I can identify the dropped records? If I can do
this I can get the dimensions of the input dataset to be 5,053 and then:
> sum((coreaff.trn.nn$hh.iast.y - predict(coreaff.nn1))^2)
would work.
A summary of my dataset is:
> summary(coreaff.trn.nn)
   hh.iast.y         WC_Total_Assets       all_assets_per_hh
 Min.   :       0   Min.   :-6.970e-01   Min.   :-8.918e-01
 1st Qu.:  565520   1st Qu.:-5.387e-01   1st Qu.:-6.147e-01
 Median :  834164   Median :-3.160e-01   Median :-3.718e-01
 Mean   : 1060244   Mean   : 2.948e-13   Mean   : 3.204e-12
 3rd Qu.: 1207181   3rd Qu.: 1.127e-01   3rd Qu.: 1.891e-01
 Max.   :45003160   Max.   : 1.332e+01   Max.   : 4.011e+00

      age                 tenure           max_acc_ownr_liq_asts_n_med
 Min.   :-4.617e+00   Min.   :-1.209e+00   Min.   :0.0000
 1st Qu.:-4.583e-01   1st Qu.:-7.254e-01   1st Qu.:0.0000
 Median : 9.093e-02   Median :-2.423e-01   Median :0.0000
 Mean   :-1.884e-11   Mean   :-3.302e-12   Mean   :0.4951
 3rd Qu.: 5.617e-01   3rd Qu.: 5.629e-01   3rd Qu.:1.0000
 Max.   : 5.818e+00   Max.   : 4.267e+00   Max.   :1.0000
 NA's   : 3.500e+01

 max_acc_ownr_nwrth_n_med   max_acc_ownr_ann_incm_n_med
 Min.   :0.0                Min.   :0.0000
 1st Qu.:0.0                1st Qu.:0.0000
 Median :0.5                Median :0.0000
 Mean   :0.5                Mean   :0.3634
 3rd Qu.:1.0                3rd Qu.:1.0000
 Max.   :1.0                Max.   :1.0000
Since I am writing this post, I have a few other questions.
I know I can compare 2 regression models using:
anova(model1, model2)
Will this work if one of the models is a regression model and the other
model is a neural network? I have not reached the point in building a
neural network to try this yet. If not, is there any other way I can
compare the performance of a regression model and neural network? If not
I may have to resort to programming to do this. I can probably use
predict() to get one vector for the regression model and another for the
neural network and then compare these predictions against the actual
value.
Is there any R package that can produce lift charts (ROC curves, gains
tables, etc.), K-S statistic, etc., that can be used to quantify the
performance of a predictive model (as done in database marketing)? If
so, such a package can be used to compare a regression model and a
neural network.
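A gains table for a continuous response can also be built in base R without any extra package. The sketch below is a minimal version under assumptions: the helper name gains_table and the simulated data are hypothetical, and the response is taken to be non-negative so that cumulative shares are meaningful. Records are ranked by predicted value, cut into equal-sized groups, and the share of the actual response captured is accumulated from the top group down.

```r
# Hypothetical helper: gains table from actual and predicted values.
gains_table <- function(actual, predicted, groups = 10) {
  ord    <- order(predicted, decreasing = TRUE)   # best predictions first
  bucket <- ceiling(seq_along(ord) * groups / length(ord))
  total  <- tapply(actual[ord], bucket, sum)      # actual response per group
  data.frame(group      = seq_len(groups),
             n          = as.vector(table(bucket)),
             actual_sum = as.vector(total),
             cum_share  = as.vector(cumsum(total) / sum(total)))
}

# Simulated data (an assumption, not the coreaff data)
set.seed(11)
n <- 200
x <- rnorm(n)
y <- exp(1 + 0.8 * x + rnorm(n, sd = 0.5))        # positive response
m <- lm(log(y) ~ x)

g <- gains_table(y, exp(predict(m)))
print(g)
```

Running the same helper on the predictions of the regression and of the neural network gives two tables that can be compared group by group. Packages such as ROCR cover ROC-style curves, but they target binary outcomes rather than a continuous response like this one.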
Another question I have is can any of the neural network packages in R
(nnet, AMORE, neural, neuralnet, or others I do not know about) do
variable selection (the way the regression methods do)? Or must I do
this manually looking at the weights and pruning the network by
eliminating weights close to zero (at all the layers in the network)?
Thanks in advance,
Jude
___________________________________________
Jude Ryan
Director, Client Analytical Services
Strategy & Business Development
UBS Financial Services Inc.
1200 Harbor Boulevard, 4th Floor
Weehawken, NJ 07086-6791
Tel. 201-352-1935
Fax 201-272-2914
Email: jude.ryan at ubs.com