Logistic regression is widely used to solve classification problems. Today, I am going to use it to predict whether people buy car insurance, using a Car Insurance dataset from Kaggle.
The goal of this task is to build a predictive model that identifies customers who are likely to purchase car insurance, in order to help business decision makers. In this article, I demonstrate two different approaches: logistic regression and a decision tree. Then I discuss their performance, compare them to decide which approach is better, and interpret the results to support business decisions.
From the table above, we can see 6 categorical attributes, 2 temporal attributes, and 11 numeric attributes. The categorical attributes are “Job”, “Marital”, “Education”, “Communication”, “LastContactMonth”, and “Outcome”. The temporal attributes are “CallStart” and “CallEnd”, and the rest are numeric attributes.
Since we want to use a regression model to predict “CarInsurance”, represented by 1 (buy) and 0 (not buy), all the data has to be converted to numeric form. For the categorical attributes, we can create dummy variables. For the temporal attributes, we can take their difference, the length of the promo call, as a numeric variable to predict “CarInsurance”.
Since logistic regression and decision trees don’t require standardization, we can skip it and focus on data formatting, dummy variables, and variable transformation.
Missing Values and Empty Rows:
From the head of the table above, we notice that every second row is empty and has to be removed. Also, more than 2,000 rows contain missing values, which is too many to drop, so I fill the missing values with “NotFound”.
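The cleaning step above can be sketched in a few lines. This is only an illustration in plain Python, not the code used for the analysis: the miniature `RAW` table and its values are made up, and the assumption that an empty row shows up as a line of bare commas is mine.

```python
import csv
import io

# Hypothetical miniature of the raw file: every second row is empty and
# some cells are blank. Column names follow the article; values are made up.
RAW = """Id,Job,Education,Outcome
1,management,tertiary,
,,,
2,technician,,failure
,,,
3,,secondary,success
"""

def load_clean_rows(text, fill="NotFound"):
    """Drop fully empty rows, then fill the remaining blank cells with `fill`."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        if all((value or "").strip() == "" for value in row.values()):
            continue  # an empty separator row, not a customer
        rows.append({key: (value if (value or "").strip() else fill)
                     for key, value in row.items()})
    return rows

clean_rows = load_clean_rows(RAW)
for row in clean_rows:
    print(row)
```

Filling with a placeholder keeps every customer in the data and lets “NotFound” become its own category when the dummy variables are created.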
Temporal Variable Transformation:
To transform the temporal variables “CallStart” and “CallEnd”, I subtract “CallStart” from “CallEnd” to derive the duration of the call, which might influence a customer’s decision.
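As a rough sketch of this transformation in plain Python: the `HH:MM:SS` time format and the handling of calls that cross midnight are assumptions, and `call_duration_seconds` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

def call_duration_seconds(call_start, call_end, fmt="%H:%M:%S"):
    """Return CallEnd minus CallStart in seconds.

    Assumes both values are clock times in `fmt` (an assumption about the
    dataset) and that an end before the start means the call crossed midnight.
    """
    start = datetime.strptime(call_start, fmt)
    end = datetime.strptime(call_end, fmt)
    if end < start:
        end += timedelta(days=1)  # assumption: call ran past midnight
    return int((end - start).total_seconds())

print(call_duration_seconds("13:45:20", "13:46:30"))  # 70
```

The resulting number replaces the two time columns as a single numeric predictor.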
Lastly, we have to create dummy variables for the logistic regression model, since regression models cannot take categorical variables directly and need them encoded as numbers.
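To make the idea of dummy variables concrete, here is a minimal sketch in plain Python. The `make_dummies` helper and the three sample customers are hypothetical. Dropping the first level as the reference category is a common convention that avoids perfect collinearity with the intercept, not necessarily what was done in this analysis.

```python
def make_dummies(rows, column, drop_first=True):
    """Replace a categorical column with 0/1 indicator columns."""
    levels = sorted({row[column] for row in rows})
    # With drop_first, the first level becomes the reference category.
    kept = levels[1:] if drop_first else levels
    encoded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        for level in kept:
            new_row[f"{column}_{level}"] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded

# Made-up sample customers for illustration only.
customers = [
    {"Marital": "single", "Age": 30},
    {"Marital": "married", "Age": 45},
    {"Marital": "divorced", "Age": 52},
]
encoded = make_dummies(customers, "Marital")
for row in encoded:
    print(row)
```

Here “divorced” is the reference level, so a divorced customer has 0 in both `Marital_married` and `Marital_single`.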
For this task, I split the data in a 75-25 proportion: 3,000 instances form the training set and 1,000 instances form the testing set.
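A 75-25 holdout split can be sketched as follows in plain Python. The shuffling and the fixed seed are assumptions made for reproducibility, and the integers stand in for the 4,000 customer rows.

```python
import random

def train_test_split(rows, train_fraction=0.75, seed=42):
    """Shuffle a copy of the rows and cut it into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed: an assumption
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

data = list(range(4000))  # placeholder for the 4,000 instances
train, test = train_test_split(data)
print(len(train), len(test))  # 3000 1000
```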
First, we fit a base model with all variables to assess the statistical significance of each one.
The table shows that more than half of the variables are not statistically significant in this model. Hence, variable selection is needed. For this task, I perform backward selection. Backward selection starts from the full model and, at each step, removes the one variable whose removal lowers AIC the most, stopping when no removal lowers AIC any further.
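The loop behind backward selection can be sketched like this in plain Python. It is a generic illustration, not the routine used for the analysis: `aic_of` is a hypothetical callback that would fit a model on the given variables and return its AIC, and the toy version below uses made-up log-likelihood gains in place of a real model fit.

```python
def backward_select(variables, aic_of):
    """Greedy backward elimination driven by AIC.

    `aic_of` is a caller-supplied (hypothetical) function that fits a model
    on a tuple of variable names and returns its AIC.
    """
    current = tuple(variables)
    current_aic = aic_of(current)
    while len(current) > 1:
        # Score every model that drops exactly one variable.
        candidates = [tuple(v for v in current if v != dropped)
                      for dropped in current]
        best = min(candidates, key=aic_of)
        best_aic = aic_of(best)
        if best_aic >= current_aic:
            break  # no single removal lowers AIC any further
        current, current_aic = best, best_aic
    return current, current_aic

# Toy stand-in for a model fit: made-up log-likelihood gain per variable.
TOY_GAIN = {"CallDuration": 40.0, "Outcome": 15.0, "Age": 0.3, "Balance": 0.1}

def toy_aic(variables):
    log_likelihood = -100.0 + sum(TOY_GAIN[v] for v in variables)
    n_parameters = len(variables) + 1  # +1 for the intercept
    return 2 * n_parameters - 2 * log_likelihood

selected, selected_aic = backward_select(TOY_GAIN, toy_aic)
print(selected, selected_aic)
```

In the toy run, the two weak variables are dropped because each costs more in the AIC penalty (2 per parameter) than it gains in fit.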
After backward selection, only two variables with a p-value greater than 0.05 remain. To finish the selection manually, we take out the variable with the greatest p-value, one at a time.
To determine whether a logistic model is statistically significant, we can check its chi-square p-value. If the p-value is lower than a given significance level (normally 0.05), the model is statistically meaningful. Since the chi-square p-value of this model is effectively 0, we can reject the null hypothesis in favor of the alternative hypothesis, which states that the model is statistically significant.
As shown, the distributions in the two confusion matrices are similar, and the two accuracies (83.86% and 80.60%) are consistent. Moreover, to rule out a lucky or unlucky split, we can perform a 10 x 10-fold cross-validation, that is, 10-fold cross-validation repeated 10 times. The resulting error rate, 0.1674781, is pretty close to the result from the holdout test. Therefore, we can conclude that our model is solid.
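For readers unfamiliar with repeated cross-validation, here is a minimal sketch in plain Python. The `error_of` callback is hypothetical: it would train a model on the training part and return its error rate on the held-out fold. A majority-class stand-in and made-up labels replace the logistic model and the real data.

```python
import random

def repeated_kfold_error(rows, error_of, k=10, repeats=10, seed=0):
    """Average held-out error over `repeats` rounds of k-fold cross-validation."""
    rng = random.Random(seed)
    errors = []
    for _ in range(repeats):
        shuffled = list(rows)
        rng.shuffle(shuffled)  # a fresh random partition each round
        folds = [shuffled[i::k] for i in range(k)]
        for i, held_out in enumerate(folds):
            training = [r for j, fold in enumerate(folds) if j != i for r in fold]
            errors.append(error_of(training, held_out))
    return sum(errors) / len(errors)

def majority_error(training, held_out):
    """Stand-in model: always predict the most common training label."""
    guess = 1 if sum(training) * 2 >= len(training) else 0
    return sum(1 for y in held_out if y != guess) / len(held_out)

labels = [1] * 40 + [0] * 60  # made-up 0/1 outcomes, not the real data
cv_error = repeated_kfold_error(labels, majority_error)
print(cv_error)
```

Because every row is held out exactly once per round, the averaged error depends much less on one particular split than a single holdout test does.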
To obtain the best decision tree, I use an iterative method, going from parent size 500 down to parent size 1, to find the sweet spot with the best accuracy. This approach also helps prevent the tree from overfitting. After finding the best configuration, we can go further and use the prune function in R to strengthen it.
As a result, the accuracy on the training set is 84.43% and the accuracy on the testing set is 80.50%. The tree might suffer from a little overfitting, since the two differ by almost 4 percentage points, but overall the tree model seems OK: getting 4 out of 5 right in the holdout test is not bad.
Before the comparison, we have to remind ourselves what matters most in this analysis. For business use, I would say money is everything. In this sense, evaluating the models on accuracy alone is misleading, because accuracy is not the factor most closely tied to making money.
Of course, we would like a model with 100% accuracy. However, that is not likely to happen in the real world, so sometimes we have to weigh other factors. In this case, precision helps, since the overall accuracies of the two models are too close to pick the better one. For example, a company generally only invests money in customers it thinks will buy its product. Precision measures exactly that: the share of customers predicted to buy who actually buy, in other words the accuracy on the invested customers only. Here, the precision of the logistic regression is about 5% higher than that of the decision tree. If revenue scales with the share of targeted customers who actually buy, this suggests the logistic model could bring in about 5% more revenue than the decision tree.
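The difference between precision and accuracy is easy to see in code. This is a small sketch in plain Python with made-up labels and predictions, not the results of either model.

```python
def precision_and_accuracy(actual, predicted):
    """Compute precision and accuracy from 0/1 labels and 0/1 predictions."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # predicted buy, bought
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # predicted buy, did not
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, accuracy

# Made-up example: 1 = buys car insurance, 0 = does not.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
precision, accuracy = precision_and_accuracy(actual, predicted)
print(precision, accuracy)
```

In this example, the company would invest in three customers and two of them buy, so precision is 2/3 even though overall accuracy is 5/8.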
The ROC curve is another tool for model selection. The curve depicts the relationship between the true positive rate and the false positive rate, which gives us an idea of which model is best. For example, if a model is 100% accurate, the curve hugs the left and upper edges of the plot, making the AUC 1. If a model is no better than a random guess, the curve is the diagonal, making the AUC 0.5. Here, the AUC of the logistic regression model is higher than that of the decision tree model, so the logistic regression model is better.
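AUC can also be computed without drawing the curve, because it equals the probability that a randomly chosen buyer gets a higher score than a randomly chosen non-buyer. The sketch below uses that pairwise definition in plain Python. The scores are made up, and the quadratic loop is chosen for clarity rather than speed.

```python
def auc(actual, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

print(auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))  # perfect ranking: 1.0
print(auc([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5]))  # uninformative scores: 0.5
```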
Based on the analysis, we recommend the logistic model, which is the more profitable one, to predict whether a customer buys car insurance. Besides that, the decision tree model also shows us which factors influence customer behavior the most.
Take the top 3 as an example. The most important factor in buying car insurance is the call duration. Based on this finding, a company could write a phone marketing SOP like “If a customer answers, do everything to please them and promote.” The second is the outcome of the last marketing campaign. This indicates two things. First, marketing campaigns are important and a company should keep working on them. Second, their importance is far lower than that of the call duration. A decision maker who sees this could adjust the funding structure to maximize the company’s profit. The third factor is the last contact month. Knowing that, we can use the coefficient table from the logistic model to find which months are statistically relevant to customer behavior, then study them to improve sales.