Description
The vignette for rpart implies that the partitions take the smallest subtree T for which R_alpha(T) is minimised.
However, I have found that this doesn't happen for unbalanced data, which suggests to me that the loss/risk function used here is accuracy rather than the Gini index.
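For reference, the cost-complexity criterion I am referring to (from the pruning discussion in the vignette, as I read it) is

$R_\alpha(T) = R(T) + \alpha\,|T|$

where R(T) is the risk of the subtree T, |T| is its number of terminal nodes, and alpha is the complexity penalty that cp controls.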
Reproducible example:
library(dplyr)
library(rpart)
df <- data.frame(lung_cancer = as.factor(c(1,1,0,0,0,0,0,0,0,0)),
                 smoking = c(1,0,1,1,0,0,0,0,0,0))
tree <- rpart(lung_cancer~smoking, data=df, control=rpart.control(minbucket=3, minsplit=4, cp=0.000000000001), parms=list(split="gini") )
print(tree)
Now, eyeballing this, it should split the data at smoking = 0.5, giving one terminal node with one 1 and two 0's and another with one 1 and six 0's; this has lower impurity (by either the Gini or the information index) than the root node of two 1's and eight 0's (see the hand calculation after the output below).
However, it doesn't: it produces zero splits, even as cp -> 0.
n= 10
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 10 2 0 (0.8000000 0.2000000) *
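To double-check the impurity arithmetic by hand (a minimal sketch; gini() is a helper I define here, not an rpart function):

gini <- function(counts) 1 - sum((counts / sum(counts))^2)
gini(c(8, 2))                                    # root: eight 0's, two 1's -> 0.32
(3/10) * gini(c(2, 1)) + (7/10) * gini(c(6, 1))  # weighted impurity of the proposed split -> ~0.305

So the candidate split does reduce the Gini impurity relative to the root, which is why I expected rpart to take it.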
I've observed this only seems to happen on unbalanced data (i.e. where the class prediction is the same on all terminal nodes), which leads me to believe the risk function R is accuracy, not an impurity measure as implied and presumably intended.
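For comparison, here is a variant of the same call with made-up data (not from my example above) where the candidate split changes the predicted class in one child, so the misclassification loss would drop from 4 at the root to 0; I would expect rpart to keep that split:

# variant with illustrative data: smoking separates the classes perfectly,
# so the two children predict different classes and the loss changes (4 -> 0)
df2 <- data.frame(lung_cancer = as.factor(c(1,1,1,1,0,0,0,0,0,0)),
                  smoking = c(1,1,1,1,0,0,0,0,0,0))
tree2 <- rpart(lung_cancer ~ smoking, data = df2,
               control = rpart.control(minbucket = 3, minsplit = 4, cp = 1e-12),
               parms = list(split = "gini"))
print(tree2)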