Description
The vignette for rpart implies that the partitions take the smallest subtree T for which R_alpha(T) is minimised.
However, I have found that this doesn't happen for unbalanced data, which suggests to me that the loss/risk function used here is accuracy rather than the Gini index.
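For reference, the cost-complexity criterion I am referring to (from the pruning discussion in the vignette, as I read it) is

$R_\alpha(T) = R(T) + \alpha\,|T|$

where R(T) is the risk of the subtree T, |T| is its number of terminal nodes, and alpha is the complexity penalty that cp controls.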
Reproducible example:
library(dplyr)
library(rpart)
df <- data.frame(lung_cancer = as.factor(c(1,1,0,0,0,0,0,0,0,0)),
                 smoking = c(1,0,1,1,0,0,0,0,0,0))
tree <- rpart(lung_cancer~smoking, data=df, control=rpart.control(minbucket=3, minsplit=4, cp=0.000000000001), parms=list(split="gini") )
print(tree)
Now, eyeballing this, it should split the data at smoking = 0.5, giving one terminal node with one 1 and two 0's and another with one 1 and six 0's; this has lower impurity (by either the Gini or the information index) than the root node of two 1's and eight 0's (see the hand calculation after the output below).
However, it doesn't: it produces zero splits, even as cp -> 0.
n= 10
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 10 2 0 (0.8000000 0.2000000) *
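To double-check the impurity arithmetic by hand (a minimal sketch; gini() is a helper I define here, not an rpart function):

gini <- function(counts) 1 - sum((counts / sum(counts))^2)
gini(c(8, 2))                                    # root: eight 0's, two 1's -> 0.32
(3/10) * gini(c(2, 1)) + (7/10) * gini(c(6, 1))  # weighted impurity of the proposed split -> ~0.305

So the candidate split does reduce the Gini impurity relative to the root, which is why I expected rpart to take it.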
I've observed this only seems to happen on unbalanced data (i.e. where the class prediction is the same on all terminal nodes), which leads me to believe the risk function R is accuracy, not an impurity measure as implied and presumably intended.
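For comparison, here is a variant of the same call with made-up data (not from my example above) where the candidate split changes the predicted class in one child, so the misclassification loss would drop from 4 at the root to 0; I would expect rpart to keep that split:

# variant with illustrative data: smoking separates the classes perfectly,
# so the two children predict different classes and the loss changes (4 -> 0)
df2 <- data.frame(lung_cancer = as.factor(c(1,1,1,1,0,0,0,0,0,0)),
                  smoking = c(1,1,1,1,0,0,0,0,0,0))
tree2 <- rpart(lung_cancer ~ smoking, data = df2,
               control = rpart.control(minbucket = 3, minsplit = 4, cp = 1e-12),
               parms = list(split = "gini"))
print(tree2)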