[R] Problem while predicting in regression trees

Muhammad Bilal Muhammad2.Bilal at live.uwe.ac.uk
Mon May 9 21:32:46 CEST 2016


Hi Bill,


Many thanks for highlighting the issue. Prediction worked once I used tr_m instead of tr_m$finalModel. I'm extremely grateful for the insight.


Thanks to all who gave me earlier guidance as well.


--
Muhammad Bilal
Research Fellow and Doctoral Researcher,
Bristol Enterprise, Research, and Innovation Centre (BERIC),
University of the West of England (UWE),
Frenchay Campus,
Bristol,
BS16 1QY

muhammad2.bilal at live.uwe.ac.uk


________________________________
From: William Dunlap <wdunlap at tibco.com>
Sent: 09 May 2016 20:27:14
To: Muhammad Bilal
Cc: Max Kuhn; r-help at r-project.org
Subject: Re: [R] Problem while predicting in regression trees

Why are you predicting from tr_m$finalModel instead of from tr_m?
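
A minimal sketch of why the error message names 'sectorHospitals', assuming that the formula interface of train() expands factor predictors into dummy columns in the way model.matrix() does. The toy data frame and its values are made up for illustration, and only base R is used.

```r
# Toy stand-in for the PFI data; the values are made up for illustration.
toy <- data.frame(
  sector = factor(c("Defense", "Hospitals", "Schools", "Hospitals")),
  capital_value = c(6.7, 5.8, 21.8, 24.2)
)

# Expand the factor into dummy columns and drop the intercept column.
x <- model.matrix(~ sector + capital_value, data = toy)[, -1]
dummy_names <- colnames(x)
print(dummy_names)

# A tree fitted on these columns looks for "sectorHospitals", which is not
# a column of the raw data frame.
print("sectorHospitals" %in% names(toy))
```

Under that assumption, tr_m$finalModel only knows the expanded column names, whereas predict(tr_m, newdata = testPFI) lets caret apply the same expansion to the test data first.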

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Mon, May 9, 2016 at 11:46 AM, Muhammad Bilal <Muhammad2.Bilal at live.uwe.ac.uk> wrote:
Please find the sample dataset attached along with R code pasted below to reproduce the issue.


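#Loading the required packages
library(caTools)     # sample.split
library(caret)       # train, trainControl
library(rpart.plot)  # prp

#Loading the data frame
pfi <- read.csv("pfi_data.csv")

#Splitting the data into training and test sets
#(sample.split expects the outcome vector, not the whole data frame)
split <- sample.split(pfi$project_delay, SplitRatio = 0.7)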
trainPFI <- subset(pfi, split == TRUE)
testPFI <- subset(pfi, split == FALSE)

#Cross validating the decision trees
tr.control <- trainControl(method="repeatedcv", number=20)
cp.grid <- expand.grid(.cp = (0:10)*0.001)
tr_m <- train(project_delay ~ project_lon + project_lat + project_duration + sector + contract_type + capital_value, data = trainPFI, method="rpart", trControl=tr.control, tuneGrid = cp.grid)

#Displaying the train results
tr_m

#Fetching the best tree
best_tree <- tr_m$finalModel

#Plotting the best tree
prp(best_tree)

#Using the best tree to make predictions [This command raises the error]
best_tree_pred <- predict(best_tree, newdata = testPFI)

#Calculating the SSE
best_tree_pred.sse <- sum((best_tree_pred - testPFI$project_delay)^2)

#Displaying the SSE
best_tree_pred.sse

...


Many Thanks and


Kind Regards



--
Muhammad Bilal
Research Fellow and Doctoral Researcher,
Bristol Enterprise, Research, and Innovation Centre (BERIC),
University of the West of England (UWE),
Frenchay Campus,
Bristol,
BS16 1QY

muhammad2.bilal at live.uwe.ac.uk


________________________________
From: Max Kuhn <mxkuhn at gmail.com>
Sent: 09 May 2016 17:22:22
To: Muhammad Bilal
Cc: Bert Gunter; r-help at r-project.org
Subject: Re: [R] Problem while predicting in regression trees

It is extremely difficult to tell what the issue might be without a reproducible example.

The only thing that I can suggest is to use the non-formula interface to `train` so that you can avoid creating dummy variables.
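
A hedged sketch of what the non-formula interface might look like, assuming the caret and rpart packages are installed. The toy data, the column selection in `predictors`, and the names `fit_xy`, `train_df`, and `test_df` are illustrative stand-ins, not the original PFI data or code.

```r
library(caret)  # the "rpart" method also needs the rpart package installed

set.seed(100)

# Toy stand-in for the PFI data; all values are made up for illustration.
n <- 200
toy <- data.frame(
  sector = factor(sample(c("Defense", "Hospitals", "Schools"), n, replace = TRUE)),
  capital_value = runif(n, 1, 100)
)
toy$project_delay <- ifelse(toy$sector == "Hospitals", 50, 0) + rnorm(n, sd = 5)

# Simple random 70/30 split of the rows.
train_idx <- sample(n, 140)
train_df <- toy[train_idx, ]
test_df <- toy[-train_idx, ]

# Non-formula interface: pass the predictors as a data frame and the
# outcome as a vector, so the factor column is handed over as a factor.
predictors <- c("sector", "capital_value")
fit_xy <- train(
  x = train_df[, predictors],
  y = train_df$project_delay,
  method = "rpart",
  trControl = trainControl(method = "cv", number = 5),
  tuneGrid = expand.grid(cp = (0:10) * 0.001)
)

# Predict from the train object, using the same predictor columns.
pred <- predict(fit_xy, newdata = test_df[, predictors])
```

With this form the sector column keeps its original name in the fitted model, so no 'sectorHospitals' style columns should be involved.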

On Mon, May 9, 2016 at 11:23 AM, Muhammad Bilal <Muhammad2.Bilal at live.uwe.ac.uk> wrote:
Hi Bert,

Thanks for the response.

I checked the datasets; however, the Hospitals level appears in both of them. See the output below:

> sqldf("SELECT sector, count(*) FROM trainPFI GROUP BY sector")
            sector count(*)
1          Defense        9
2        Hospitals      101
3          Housing       32
4           Others       99
5 Public Buildings       39
6          Schools      148
7      Social Care       10
8   Transportation       27
9            Waste       26
> sqldf("SELECT sector, count(*) FROM testPFI GROUP BY sector")
            sector count(*)
1          Defense        5
2        Hospitals       47
3          Housing       11
4           Others       44
5 Public Buildings       18
6          Schools       69
7      Social Care        9
8   Transportation        8
9            Waste       12

Anything else to try?

--
Muhammad Bilal
Research Fellow and Doctoral Researcher,
Bristol Enterprise, Research, and Innovation Centre (BERIC),
University of the West of England (UWE),
Frenchay Campus,
Bristol,
BS16 1QY

muhammad2.bilal at live.uwe.ac.uk


________________________________________
From: Bert Gunter <bgunter.4567 at gmail.com>
Sent: 09 May 2016 01:42:39
To: Muhammad Bilal
Cc: r-help at r-project.org
Subject: Re: [R] Problem while predicting in regression trees

It seems that the data that you used for prediction contained a level
"Hospitals" for the sector factor that did not appear in the training
data (or maybe it's the other way round). Check this.

Cheers,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Sun, May 8, 2016 at 4:14 PM, Muhammad Bilal
<Muhammad2.Bilal at live.uwe.ac.uk> wrote:
> Hi All,
>
> I have the following script, which raises an error at the last command. I am new to R and need some clarification on what is going wrong.
>
> #Creating the training and testing data sets
> splitFlag <- sample.split(pfi_v3$project_delay, SplitRatio = 0.7)
> trainPFI <- subset(pfi_v3, splitFlag==TRUE)
> testPFI <- subset(pfi_v3, splitFlag==FALSE)
>
>
> #Structure of the trainPFI data frame
>> str(trainPFI)
> *******
> 'data.frame': 491 obs. of  16 variables:
>  $ project_id             : int  1 2 3 6 7 9 10 12 13 14 ...
>  $ project_lat            : num  51.4 51.5 52.2 51.9 52.5 ...
>  $ project_lon            : num  -0.642 -1.85 0.08 -0.401 -1.888 ...
>  $ sector                 : Factor w/ 9 levels "Defense","Hospitals",..: 4 4 4 6 6 6 6 6 6 6 ...
>  $ contract_type          : chr  "Turnkey" "Turnkey" "Turnkey" "Turnkey" ...
>  $ project_duration       : int  1826 3652 121 730 730 790 522 819 998 372 ...
>  $ project_delay          : int  -323 0 -60 0 0 0 -91 0 0 7 ...
>  $ capital_value          : num  6.7 5.8 21.8 24.2 40.7 10.7 70 24.5 60.5 78 ...
>  $ project_delay_pct      : num  -17.7 0 -49.6 0 0 0 -17.4 0 0 1.9 ...
>  $ delay_type             : Ord.factor w/ 9 levels "7 months early & beyond"<..: 1 5 3 5 5 5 2 5 5 6 ...
>
> library(caret)
> library(e1071)
>
> set.seed(100)
>
> tr.control <- trainControl(method="cv", number=10)
> cp.grid <- expand.grid(.cp = (0:10)*0.001)
>
> #Fitting the model using regression tree
> tr_m <- train(project_delay ~ project_lon + project_lat + project_duration + sector + contract_type + capital_value, data = trainPFI, method="rpart", trControl=tr.control, tuneGrid = cp.grid)
>
> tr_m
>
> CART
> 491 samples
> 15 predictor
> No pre-processing
> Resampling: Cross-Validated (10 fold)
> Summary of sample sizes: 443, 442, 441, 442, 441, 442, ...
> Resampling results across tuning parameters:
>   cp     RMSE      Rsquared
>   0.000  441.1524  0.5417064
>   0.001  439.6319  0.5451104
>   0.002  437.4039  0.5487203
>   0.003  432.3675  0.5566661
>   0.004  434.2138  0.5519964
>   0.005  431.6635  0.5577771
>   0.006  436.6163  0.5474135
>   0.007  440.5473  0.5407240
>   0.008  441.0876  0.5399614
>   0.009  441.5715  0.5401718
>   0.010  441.1401  0.5407121
> RMSE was used to select the optimal model using  the smallest value.
> The final value used for the model was cp = 0.005.
>
> #Fetching the best tree
> best_tree <- tr_m$finalModel
>
> All of the commands above worked fine.
>
> However, the following command raises an error when the fitted model is used to make predictions:
> best_tree_pred <- predict(best_tree, newdata = testPFI)
> Error in eval(expr, envir, enclos) : object 'sectorHospitals' not found
>
> Can someone advise me on how to resolve this issue?
>
> Any help will be highly appreciated.
>
> Many Thanks and
>
> Kind Regards
>
> --
> Muhammad Bilal
> Research Fellow and Doctoral Researcher,
> Bristol Enterprise, Research, and Innovation Centre (BERIC),
> University of the West of England (UWE),
> Frenchay Campus,
> Bristol,
> BS16 1QY
>
> muhammad2.bilal at live.uwe.ac.uk
>
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
