Decision trees are a straightforward and powerful machine learning algorithm. However, they can give misleading results when the data is imbalanced. Today, we’re going to talk about how to cope with imbalanced data by building three decision trees.
To illustrate classification with decision trees, we use the Red Wine Quality Data Set from the UCI Machine Learning Repository.
We build three decision trees based on three quality distributions: original, binned, and oversampled. Then we discuss how imbalanced data affects a decision tree’s performance.
Original – Decision Tree
To avoid overfitting, we check the misclassification rate on both the training data and the test data to see where the two diverge. Then we use the parameters at that point to build the decision tree.
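As a rough sketch of that selection step, the snippet below picks the tree size at which the test error stops improving, while the training error keeps falling. The function name `pick_tree_size` and the error values are hypothetical, made up purely for illustration; they are not results from the wine data. Using the lowest test error as the divergence point is one simple reading of the rule, not the only one.

```python
def pick_tree_size(sizes, test_err):
    """Return the tree size with the lowest test error.

    Assumes `sizes` is sorted from the simplest tree to the most complex,
    so ties are resolved in favor of the simpler tree.
    """
    best_index = 0
    for i in range(1, len(sizes)):
        if test_err[i] < test_err[best_index]:
            best_index = i
    return sizes[best_index]


# Made-up error curves, for illustration only.
sizes = [1, 2, 3, 4, 5, 6]                          # e.g. tree depth
train_err = [0.45, 0.40, 0.36, 0.31, 0.25, 0.18]    # keeps falling
test_err = [0.46, 0.42, 0.39, 0.40, 0.43, 0.47]     # turns back up

for size, tr, te in zip(sizes, train_err, test_err):
    print(f"size={size}  train={tr:.2f}  test={te:.2f}  gap={te - tr:.2f}")

chosen = pick_tree_size(sizes, test_err)
print("chosen size:", chosen)
```

Past the chosen size, the gap between the two curves widens, which is the sign that the tree has started to fit noise in the training data.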
Binned – Decision Tree
Binning quality classes 3, 4, and 5 into low quality and quality classes 6, 7, and 8 into high quality makes the distribution more balanced than the original. The low class has 744 cases and the high class has 855 cases. Then we repeat the same process.
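The binning rule itself is small enough to sketch in a few lines. The helper name `bin_quality` and the sample scores below are assumptions for illustration; the real counts come from the full data set.

```python
from collections import Counter


def bin_quality(score):
    """Map quality scores 3-5 to 'low' and 6-8 to 'high'."""
    if score in (3, 4, 5):
        return "low"
    if score in (6, 7, 8):
        return "high"
    raise ValueError(f"unexpected quality score: {score}")


# A handful of made-up scores, not the actual wine data.
sample_scores = [5, 6, 5, 7, 3, 6, 8, 4, 5, 6]
binned = [bin_quality(s) for s in sample_scores]
counts = Counter(binned)
print(counts["low"], counts["high"])
```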
Oversampled – Decision Tree
Since the distribution is so imbalanced, the minority classes could be neglected while building a decision tree. To illustrate, quality class 3 has only 8 cases. Even if a node could classify class 3 well, the tree may stop splitting before it gets there because of the child node size limit (every child node must contain at least 10 cases). In this case, the minority classes are neglected. To improve on that, we can oversample: randomly pick cases from each minority class and replicate them in the data until the minority classes approach the majority class in count. In theory, this improves the sensitivity of the minority classes, though it may cost some overall accuracy. Then we repeat the same process.
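Random oversampling as described above can be sketched with the standard library alone. This is a minimal version under a few assumptions: the function name `oversample` is hypothetical, every class is grown to exactly the size of the largest class, and the rows are stand-in integers rather than real wine records.

```python
import random
from collections import Counter


def oversample(rows, labels, seed=0):
    """Replicate randomly chosen minority rows until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)

    # Group the rows by class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    target = max(len(group) for group in by_class.values())

    new_rows, new_labels = [], []
    for label, group in by_class.items():
        # Draw the missing rows with replacement from the same class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            new_rows.append(row)
            new_labels.append(label)
    return new_rows, new_labels


# Made-up data: integers stand in for wine records.
rows = list(range(13))
labels = [5] * 6 + [6] * 5 + [3] * 2
new_rows, new_labels = oversample(rows, labels)
print(Counter(labels))
print(Counter(new_labels))
```

In practice the replication is usually applied to the training split only, so duplicated rows do not leak into the test set.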
Between original data and binned data
Based on accuracy, the binned tree performs better overall than the original tree, by almost 15 percent.
Between original data and oversampled data
Though the overall accuracy of the oversampled tree is worse than that of the original tree, the sensitivity of the minority classes is significantly improved. Which tree is better really depends on the purpose of the model. If the model is built for a wine collecting company that wants to detect low quality wine to avoid losing money on it, I would say the model is significantly improved. However, if the model is built for general wine classification, I would say the original model is better based on accuracy. Either way, oversampling did improve on the original model in some respect.
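To make the trade-off between accuracy and sensitivity concrete, here is a small sketch that computes both from true and predicted labels. The function names and the labels are made up for illustration and are not predictions from any of the three trees.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of all cases that were classified correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def sensitivity_per_class(y_true, y_pred):
    """For each class, the share of its true cases that were found."""
    totals, hits = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return {c: hits[c] / totals[c] for c in totals}


# Made-up labels: class 3 is the minority and is never predicted.
y_true = [3, 3, 5, 5, 5, 5, 6, 6, 6, 6]
y_pred = [5, 5, 5, 5, 5, 6, 6, 6, 6, 5]

print(accuracy(y_true, y_pred))               # 0.6
print(sensitivity_per_class(y_true, y_pred))  # {3: 0.0, 5: 0.75, 6: 0.75}
```

In this toy example the accuracy looks acceptable while the sensitivity of class 3 is zero, which is exactly the situation where accuracy alone hides how the minority class is treated.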