suspected bug: rpart won't split unbalanced data even with infinitely small CP #70

@ilchef

The rpart vignette implies that the fitted tree is the smallest subtree T for which R_α(T) is minimised.

However, I have found that this doesn't happen for unbalanced data, which suggests to me that the loss/risk function used here is misclassification error (accuracy) rather than the Gini index.

Reproducible example:

library(dplyr)
library(rpart)
df <- data.frame(lung_cancer=as.factor(c(1,1,0,0,0,0,0,0,0,0)), smoking=c(1,0,1,1,0,0,0,0,0,0))
tree <- rpart(lung_cancer~smoking, data=df, control=rpart.control(minbucket=3, minsplit=4, cp=0.000000000001), parms=list(split="gini") )
print(tree)

Eyeballing this, it should split at smoking = 0.5, giving one terminal node with one 1 and two 0s and another with one 1 and six 0s. This has lower impurity (by either the Gini or the information index) than the root node of two 1s and eight 0s.

However, it doesn't: it produces 0 splits, even as cp -> 0.
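
As a rough check of the impurity claim, here is a base-R sketch that computes the weighted Gini impurity before and after the proposed split. The helper `gini` is hypothetical (it is not an rpart function) and assumes the usual 1 - sum(p^2) definition; it says nothing about how rpart computes things internally.

```r
# Hypothetical helper, not part of rpart: Gini impurity of a label vector,
# assuming the standard definition 1 - sum(p_k^2).
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Same data as in the reproducible example.
lung_cancer <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
smoking     <- c(1, 0, 1, 1, 0, 0, 0, 0, 0, 0)

# Proposed split at smoking = 0.5.
non_smokers <- lung_cancer[smoking < 0.5]   # one 1 and six 0s
smokers     <- lung_cancer[smoking >= 0.5]  # one 1 and two 0s

root_gini  <- gini(lung_cancer)
split_gini <- (length(non_smokers) * gini(non_smokers) +
               length(smokers) * gini(smokers)) / length(lung_cancer)

print(c(root = root_gini, split = split_gini))
```

The root impurity comes out at 0.32 and the weighted impurity after the split at about 0.305, so by the Gini criterion the split is a (small) improvement.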

n= 10

node), split, n, loss, yval, (yprob)
* denotes terminal node

1) root 10 2 0 (0.8000000 0.2000000) *

I've observed that this only seems to happen on unbalanced data (i.e. when the predicted class is the same in all terminal nodes), which leads me to believe the risk function R is accuracy, not an impurity measure as implied and presumably intended.
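
For comparison, the same split can be scored by misclassification count in base R. This is only arithmetic on the example data, under the assumption that each node predicts its majority class; the helper `node_loss` is hypothetical and is not taken from rpart.

```r
# Hypothetical helper, not part of rpart: number of misclassified rows
# when a node predicts its majority class.
node_loss <- function(y) length(y) - max(table(y))

# Same data as in the reproducible example.
lung_cancer <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
smoking     <- c(1, 0, 1, 1, 0, 0, 0, 0, 0, 0)

# Loss at the root versus summed loss of the two nodes split at smoking = 0.5.
root_loss  <- node_loss(lung_cancer)
split_loss <- node_loss(lung_cancer[smoking < 0.5]) +
              node_loss(lung_cancer[smoking >= 0.5])

print(c(root = root_loss, split = split_loss))
```

Both values are 2, matching the loss of 2 printed for the root node: the split does not change the number of misclassified rows, because both children still predict class 0. That is consistent with the risk being measured as misclassification error, though it does not prove it.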
