[R] CART for 0/1 data

Martin Wegmann wegmann at biozentrum.uni-wuerzburg.de
Mon Sep 26 12:09:36 CEST 2005


Hello Robert, 

I tried over the weekend and managed, thanks to your help, to get my whole 
0/1 matrix into the model. However, when I call summary(tree1) I get:

> summary(tree1)
Classification tree:
tree(formula = factor(dat$loc) ~ sp0.m)
Variables actually used in tree construction:
[1] "sp0.m.Pni"         "sp0.m.Bbe"
[3] "sp0.m.Arid"         "sp0.m.Pca"
Number of terminal nodes:  5
Residual mean deviance:  4.714 = 150.8 / 32
Misclassification error rate: 0.8649 = 32 / 37

The plot() output looks reasonable considering that only 5 locations are 
plotted.

However, I have 400 variables (in the 0/1 matrix sp0.m) and 50 locations 
(dat$loc), not just 5 terminal nodes. Is there a way to force tree() to 
keep splitting so that all locations, and the species that separate them, 
appear in the plotted tree?
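
One possible explanation, offered as a sketch rather than a diagnosis: 
tree-growing functions stop splitting once nodes become small, so with 
roughly one row per location the default settings leave only a handful of 
terminal nodes. (The summary above also suggests that only 37 rows entered 
the fit: the misclassification rate is 32 / 37, and 150.8 / 32 has 
32 = 37 - 5 in the denominator.) The example below uses rpart rather than 
tree, on simulated data; the names sp, dat, n_loc, n_sp and n_leaves are 
made up for illustration. In the tree package the corresponding settings 
should be the mincut, minsize and mindev arguments of tree.control().

```r
library(rpart)  # recommended package that ships with R

set.seed(1)
n_loc <- 20   # hypothetical number of locations (one row each)
n_sp  <- 40   # hypothetical number of species (0/1 columns)

# Simulated presence/absence matrix, NOT the data from this thread
sp <- matrix(rbinom(n_loc * n_sp, 1, 0.5), nrow = n_loc,
             dimnames = list(NULL, paste0("sp", seq_len(n_sp))))

# One data frame: factor response plus one column per species
dat <- data.frame(loc = factor(paste0("loc", seq_len(n_loc))), sp)

# Default stopping rules (minsplit = 20, minbucket = 7, cp = 0.01)
fit_default <- rpart(loc ~ ., data = dat, method = "class")

# Relaxed stopping rules: allow splits down to single observations
fit_full <- rpart(loc ~ ., data = dat, method = "class",
                  control = rpart.control(minsplit = 2, minbucket = 1,
                                          cp = 0, xval = 0))

# Count terminal nodes in a fitted rpart object
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

n_leaves(fit_default)
n_leaves(fit_full)
```

A tree grown this way has about one location per leaf, so it describes the 
sample rather than predicting anything; it is useful only for seeing which 
species separate the locations.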

regards, Martin


On Friday 23 September 2005 19:19, Dave Roberts wrote:
> Martin,
>
>      Sorry, I don't think I read your message carefully enough.
>
>      When you say the error message is "+", that would seem to indicate
> that you still had an unclosed parenthesis and that the function was
> looking for more input.
>
>      Using a smaller data set (160 samples, 169 rows, only 5 classes) it
> did work fine for me.  pa = presence/absence dataframe, opt.5$clustering
> = cluster IDs.
>
> *********************************************************************
>
>  > test <- tree(factor(opt.5$clustering)~pa)
>  > test
>
> node), split, n, deviance, yval, (yprob)
>        * denotes terminal node
>
>   1) root 160 371.000 3 ( 0.23750 0.08750 0.57500 0.07500 0.02500 )
>     2) pa.symore < 0.5 79 216.500 1 ( 0.48101 0.17722 0.15190 0.13924 0.05063 )
>       4) pa.artarb < 0.5 42 123.600 2 ( 0.07143 0.33333 0.26190 0.23810 0.09524 )
>         8) pa.macgri < 0.5 31  75.280 2 ( 0.09677 0.45161 0.00000 0.32258 0.12903 )
>     .        .         .
>     .        .         .
>     .        .         .
>     3) pa.symore > 0.5 81  10.780 3 ( 0.00000 0.00000 0.98765 0.01235 0.00000 )
>       6) pa.carrss < 0.5 11   6.702 3 ( 0.00000 0.00000 0.90909 0.09091 0.00000 ) *
>       7) pa.carrss > 0.5 70   0.000 3 ( 0.00000 0.00000 1.00000 0.00000 0.00000 ) *
>
> ************************************************************************
>
> I'll try again with a larger dataset and see if it's a memory limitation.
>
> Dave Roberts
>
> Martin Wegmann wrote:
> > On Friday 23 September 2005 17:08, Dave Roberts wrote:
> >>Martin,
> >>
> >>     If the data are actually coded 0/1, the tree function would
> >>probably interpret them as integers and try a regression instead of a
> >>classification.  If the dependent variable is called "var", try
> >
> > Thanks, but I think I provided too little information.
> > My dependent variable is the location, given as names (I could
> > transform them to numbers from 1 - n). The independent variables consist
> > of 0/1 data (species).
> > If I do
> > tree(locations~factor(species1)+factor(species2)+.....+factor(speciesn),
> > sp_data)
> > I receive the same results as without the factor() part.
> > BTW just a subset of the locations is displayed, which is pretty weird
> > considering that I included all locations in the analysis.
> >
> > Martin
> >
> >>x <- tree(factor(var)~species)
> >>
> >>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>David W. Roberts                                     office 406-994-4548
> >>Professor and Head                                      FAX 406-994-3190
> >>Department of Ecology                         email droberts at montana.edu
> >>Montana State University
> >>Bozeman, MT 59717-3460
> >>
> >>Martin Wegmann wrote:
> >>>Dear R-user,
> >>>
> >>>I tried to generate a classification / regression tree with an
> >>>absence/presence matrix of species (400) in different locations (50) to
> >>>visualise species which are important for splitting up two locations.
> >>>Rpart and tree did not work for more than 10 species, which is logical
> >>> due to the limited number of locations (n=50). However, the error prompt
> >>> is a "+" and no specific message, but I am pretty sure that I did not
> >>> enter a wrong character by mistake.
> >>>Is it allowed at all to use 0/1 data for this statistical technique and,
> >>>if yes, is there a way or a different method to use all 400 species
> >>> entries? Otherwise I would apply a PCA beforehand, but I would prefer to
> >>> have the raw species information.
> >>>
> >>>using R 2.1.1-1 (debian repos.)
> >>>
> >>>regards, Martin
> >>
> >>______________________________________________
> >>R-help at stat.math.ethz.ch mailing list
> >>https://stat.ethz.ch/mailman/listinfo/r-help
> >>PLEASE do read the posting guide!
> >>http://www.R-project.org/posting-guide.html

-- 
Martin Wegmann

DLR - German Aerospace Center
German Remote Sensing Data Center
@
Dept.of Geography
Remote Sensing and Biodiversity Unit
&&
Dept. of Animal Ecology and Tropical Biology
University of Wuerzburg
Am Hubland
97074 Würzburg

phone: +49-(0)931 - 888 4797
mobile: +49-(0)175 2091725
fax:   +49-(0)931 - 888 4961
http://www.biota-africa.org
http://www.biogis.de
