Crowdsourcing is a modern way to collect data, made possible by the Internet. Crowdsourced data is prevalent in OpenStreetMap. Today, we are going to analyze the Crowdsourced Mapping data set from the UCI repository.
Since the target is categorical, we will use classification algorithms instead of regression. First, we will run a clustering analysis to reveal the natural pattern in the satellite image data, then discuss the clustering result. Second, we will use several supervised learning techniques (KNN, decision tree, and random forest) to classify the satellite image data, then compare their performance to determine which model is best for classifying satellite images.
As we can see, the target variable ‘class’ is a categorical variable with multiple classes, and the predictor variables are all numeric but on different scales. Hence, we need to do two things before applying any machine learning algorithm. The first is to understand the data distribution. The second is to normalize the predictors for algorithms that depend on distances, such as k-nearest neighbors.
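To make the normalization step concrete, here is a minimal sketch using z-score standardization. The choice of z-scores (rather than, say, min-max scaling) and the toy values are assumptions for illustration; the original analysis does not say which scaling it used.

```python
from statistics import mean, pstdev

def standardize(rows):
    """Rescale each column to zero mean and unit variance (z-scores)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 so a constant column does not cause division by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical predictors on very different scales (not from the dataset).
raw = [[0.1, 1000.0], [0.2, 3000.0], [0.3, 5000.0]]
scaled = standardize(raw)
```

After scaling, both columns contribute comparably to a Euclidean distance, which is what KNN relies on.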
The class distribution is not encouraging for classification because the data is imbalanced, and imbalanced data can lead to inaccurate predictions. For example, a decision tree may ignore the minority class because of its greedy approach. Since the algorithm always chooses the best split (the one producing the most homogeneous nodes), the majority class dominates the split decisions. In the end, the minority class might simply be ignored because of parameter limits such as the minimum number of instances required to split a node. KNN accuracy can also be affected by imbalanced data. Suppose we set k=5 and the minority class has only two instances in the dataset. If we classify a new instance that actually belongs to the minority class, it will never be assigned to its real class under a uniform majority vote, because the class has too few instances to win the vote (Figure 1).
Therefore, we will randomly oversample the minority classes, compare the results on the imbalanced data and on the oversampled data, and then use our judgment to decide which result better reflects reality.
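The k=5 argument above can be checked with a small sketch. The `knn_predict` helper and the toy points are hypothetical and use a plain uniform vote; they are not the classifier used in the analysis.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k):
    """Classify `query` by a uniform majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two minority points and four majority points.
train = [
    ((0.0, 0.0), "minority"),
    ((0.1, 0.0), "minority"),
    ((1.0, 0.0), "majority"),
    ((1.0, 1.0), "majority"),
    ((0.0, 1.0), "majority"),
    ((1.5, 0.5), "majority"),
]
query = (0.05, 0.0)  # sits right between the two minority points

pred_k1 = knn_predict(train, query, k=1)  # "minority"
pred_k5 = knn_predict(train, query, k=5)  # "majority": 2 minority votes lose to 3
```

With only two minority instances, at most two of the five votes can go to the minority class, so it cannot win no matter where the query lies.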
In this plot, we can see the characteristics of the data and form some expectations about the classification result. Water, impervious, farm, and grass are fairly well separated, which is good for classification. However, forest almost completely overlaps farm, grass, and orchard. Orchard also appears very sparse in the plot. This suggests that it will be difficult to distinguish farm, grass, and orchard from forest, and that orchard will be seriously misclassified.
The elbow method finds the best k by locating the elbow in an error-vs-k plot. The idea is to find the last value of k before the model starts fitting noise. To illustrate, the total error is the sum of the distances from every data point to its corresponding centroid. While clustering, we want to minimize the total error. However, the total error keeps dropping until k equals the size of the whole data set (every point is its own centroid), so we cannot simply minimize it. Instead, we look for the point beyond which increasing k no longer decreases the total error steeply, because past that point the extra clusters are only fitting noise in the data. To see the natural pattern of this data, we use the elbow method to find the most natural k, and the elbow occurs at k=7. Knowing k, we can apply PCA to visualize the natural pattern of the digitized satellite data.
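The total error described above can be sketched as follows. The cluster assignments are given by hand rather than found by k-means, and the points are hypothetical. Many k-means implementations report the sum of squared distances instead, but this sketch follows the definition in the text.

```python
from math import dist

def centroid(points):
    """Mean position of a list of equally sized coordinate tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def total_error(clusters):
    """Sum of distances from every point to the centroid of its own cluster."""
    return sum(
        dist(point, centroid(cluster))
        for cluster in clusters
        for point in cluster
    )

# Hypothetical 2-D points forming two obvious groups.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

error_k1 = total_error([points])                  # one cluster for everything
error_k2 = total_error([points[:2], points[2:]])  # the two natural groups
error_k4 = total_error([[p] for p in points])     # every point is its own centroid
```

The error falls sharply from k=1 to k=2 and reaches zero at k=4, which is why the elbow, not the minimum, is the useful signal.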
Since the data is obtained by digitizing satellite images of the landscape, knowing this pattern is quite important. We could use the clustering result as feedback for deeper investigation. For instance, it could help to improve the digitizing algorithm, to find human errors in the labels, or to discover unknown changes in remote landscapes.
The plot shows that the KNN classifier reaches its best accuracy, 0.6433, at k=7 with distance weighting and at k=5 with uniform weighting. Now that we know the best parameters, we can build the best KNN classifier and assess its performance.
The overall KNN accuracy is 0.6433. The classification matrix shows how the KNN classifier classifies the test-set instances. The classifier has some difficulty with forest and orchard: their recall is only 0.54 and 0.34, respectively. This might be caused by the sparseness we saw in the PCA visualization. It is worth mentioning that the precision for orchard and water is quite high, so we can rely on predictions of these two classes. Finally, the average area under the curve (AUC) is 0.87. (We will discuss AUC further in the conclusion.)
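To show how a class can have high precision but low recall, as orchard does here, the sketch below derives both metrics from a confusion matrix. The `precision_recall` helper and the counts are hypothetical and are not the results reported above.

```python
def precision_recall(matrix, labels):
    """Per-class precision and recall from a confusion matrix.

    matrix[i][j] counts instances whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    report = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        actual = sum(matrix[i])                    # row total: all true members
        predicted = sum(row[i] for row in matrix)  # column total: all predictions
        report[label] = {
            "precision": true_positive / predicted if predicted else 0.0,
            "recall": true_positive / actual if actual else 0.0,
        }
    return report

# Hypothetical counts for three classes.
labels = ["forest", "orchard", "water"]
matrix = [
    [50, 0, 0],  # true forest
    [20, 5, 0],  # true orchard: most are mistaken for forest
    [2, 0, 23],  # true water
]
report = precision_recall(matrix, labels)
```

In this toy matrix every orchard prediction is correct (precision 1.0), yet only 5 of the 25 true orchard instances are found (recall 0.2).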
To find the best parameters for the decision tree model, we try two split criteria, Gini impurity and entropy, and loop through the number of nodes (the model complexity). Since we want to avoid overfitting, we look for the point where testing accuracy and cross-validation accuracy diverge. This point occurs at 47 nodes with Gini impurity and at 51 nodes with entropy. The accuracy of the latter is slightly higher than that of the former. Therefore, the best parameters for the decision tree classifier are criterion='entropy' and max_leaf_nodes=51.
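For readers unfamiliar with the two split criteria, here is a minimal sketch of how each one scores the labels in a node. The formulas are the standard definitions; the node contents are hypothetical.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over class proportions."""
    total = len(labels)
    return -sum(
        (count / total) * log2(count / total)
        for count in Counter(labels).values()
    )

# Hypothetical node contents.
pure_node = ["water"] * 8
mixed_node = ["forest"] * 4 + ["orchard"] * 4

gini_pure, entropy_pure = gini(pure_node), entropy(pure_node)
gini_mixed, entropy_mixed = gini(mixed_node), entropy(mixed_node)
```

Both measures are zero for a pure node and largest for an evenly mixed one, so a split that lowers either is producing more homogeneous children.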
The overall decision tree accuracy is 0.6067. The classification matrix shows that almost all orchard instances were classified as forest, and both the precision and the recall for orchard are 0. This may illustrate the weakness of the decision tree algorithm in classifying minority groups. However, the classification report shows that the model is still good at predicting most groups (all except grass and orchard), and its predictions for impervious (precision: 0.8) and water (precision: 0.95) are trustworthy. Lastly, the AUC of this model is 0.81.
To keep the random forest model from becoming too complex to generalize to new data, we loop through the number of trees in the forest to find the best value. According to the plot, the best number of trees for the random forest model is 32.
The overall random forest accuracy is 0.64. The classification matrix and classification report show results similar to those of the decision tree. This is because a random forest is a collection of decision trees that classify through a voting mechanism. It inherits the characteristics of the decision tree and, in theory, achieves better accuracy than a single tree. As a result, the random forest classifier is still unable to distinguish orchard from the other classes. Lastly, the AUC in the ROC curve plot is 0.88.
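The voting mechanism can be sketched in a few lines. The `majority_vote` helper and the tree predictions are hypothetical. Note that scikit-learn's random forest averages the trees' predicted class probabilities rather than counting hard votes, so this is a simplification of the idea rather than that library's exact procedure.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for one test instance.
votes = ["forest", "orchard", "forest", "farm", "forest"]
winner = majority_vote(votes)  # "forest"
```

If most individual trees ignore a minority class such as orchard, the vote will ignore it too, which matches the behavior described above.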
After running several classifiers, we notice that the minority group, orchard, is ignored because of how the algorithms work. To avoid suffering from the imbalanced data, we can oversample the minority classes to make them noticeable to the classifiers. For example, in Figure 1, the new instance will not be assigned to group 1 when k=5. However, if we oversample group 1 to match the size of group 2, the result changes, and in this example the accuracy increases (see Figure 2). As another example, in the decision tree algorithm, the minimum split limit is very likely to cause the minority to be ignored. Suppose the limit is ‘don’t split if a node contains fewer than 20 instances’. In this case, if the minority class has only 15 instances in the training data, the trained tree will never split a node to classify the minority. Hence, we need to oversample.
My way of oversampling is to randomly pick an instance from each minority group and add it to the data, repeating until the count of each minority group equals that of the majority (see Figure 3).
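The oversampling procedure described above can be sketched as follows. The `oversample` helper, its `seed` parameter, and the toy data are assumptions for illustration, not the author's original code.

```python
import random
from collections import Counter

def oversample(rows, labels, seed=0):
    """Duplicate random minority rows until every class matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    new_rows, new_labels = list(rows), list(labels)
    for cls, count in counts.items():
        # Candidate rows to copy: the original members of this class.
        pool = [row for row, label in zip(rows, labels) if label == cls]
        for _ in range(target - count):
            new_rows.append(rng.choice(pool))  # sample with replacement
            new_labels.append(cls)
    return new_rows, new_labels

# Hypothetical imbalanced data: 6 forest rows, 2 orchard rows, 1 water row.
rows = [[i] for i in range(9)]
labels = ["forest"] * 6 + ["orchard"] * 2 + ["water"]

balanced_rows, balanced_labels = oversample(rows, labels)
balanced_counts = Counter(balanced_labels)
```

A common precaution is to apply this only to the training split, so that duplicated rows never appear in both the training and the test data.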
That said, the overall accuracy might drop when we train the model on oversampled data. To illustrate, if a data set has an 80% majority and we blindly classify every data point into the majority group, we get 80% overall accuracy no matter what. If we then oversample the data set and retrain the model on the oversampled data, the overall accuracy is very likely to fall below 80%.
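The 80% example above is easy to verify. This sketch uses hypothetical labels and shows why overall accuracy alone can hide a useless minority prediction.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels: 80% majority, 20% minority.
y_true = ["majority"] * 80 + ["minority"] * 20

# "Blind" classifier: always predict the most common class.
most_common_class = Counter(y_true).most_common(1)[0][0]
y_blind = [most_common_class] * len(y_true)

blind_accuracy = accuracy(y_true, y_blind)  # 0.8
minority_recall = sum(
    t == p == "minority" for t, p in zip(y_true, y_blind)
) / y_true.count("minority")                # 0.0
```

The blind classifier scores 0.8 on accuracy while never finding a single minority instance, so a lower accuracy after oversampling is not necessarily a worse model.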
Classification with Oversampling
With oversampled data, the KNN classifier reaches its best accuracy, 0.62, when k equals 29 and the weight is uniform. The accuracy is slightly lower than without oversampling, which matches what we just discussed.
Using oversampled data, the overall accuracy is 0.62, a slight drop of 2.33 percentage points (from 0.6433). In the classification report, however, the recall on orchard improves by 21 percentage points (from 0.34 to 0.55), while the recall on forest falls by 22 percentage points (from 0.54 to 0.32). This is because oversampling brings the count of orchard to the same magnitude as the majority, so the classifier can no longer ignore it (Figure 2). Conversely, it also makes the majority relatively sparse, and as a result the recall on the majority drops.
The best parameters for the decision tree model on oversampled data are criterion='entropy' and max_leaf_nodes=58.
The decision tree model trained on oversampled data has an overall accuracy of 0.63, which is better than on the original data (0.6067). It sacrifices some forest recall to raise the recall on orchard to 0.36 (from 0). On the downside, the precision of this model for grass is only 0.34, so only about a third of its grass predictions can be trusted. Finally, the average AUC is 0.82.
The minimum validation error occurs when the number of estimators is 78.
On oversampled data, the overall random forest accuracy is 0.66, which is 2 percentage points higher than on the imbalanced data. The classification matrix shows that, after oversampling the minority, the random forest model is able to capture the orchard group. However, the recall on orchard is only 0.15; that is, only 15 out of every 100 orchard instances are classified correctly.
Before we draw our conclusion, we have to understand how the ROC curve works and what AUC values mean.
ROC/AUC: The idea of this plot is to find the best trade-off between sensitivity and specificity. In a ROC curve, the y-axis is sensitivity and the x-axis is 1 minus specificity. The best scenario is a value of 1 for both sensitivity and specificity, in which case the curve traces a square in the ROC plot and the area under the curve is 1. However, in most real-world problems this is impossible, so we have to focus on what really matters in practice. To illustrate, suppose a hospital is testing for HIV. We do not care about specificity that much, because even if some people falsely test positive, they do no harm to society from the perspective of disease control. What we really care about is finding the people who actually have HIV. Therefore, we can use the ROC curve plot to assess models and pick one with higher sensitivity.
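As a rough illustration of how a ROC curve and its AUC are computed, the sketch below sweeps a threshold over classifier scores for a binary problem. The `roc_points` and `auc` helpers and the toy scores are hypothetical. For a multi-class problem like this one, an average AUC is presumably obtained by repeating this per class (one class versus the rest) and averaging, which is an assumption about the original analysis.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; each step admits more predicted positives.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

# Hypothetical binary labels (1 = positive) and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(y_true, scores)  # x is 1 - specificity, y is sensitivity
area = auc(curve)                   # 8/9, about 0.889
```

Each point on the curve is one possible operating threshold, so choosing a model "with higher sensitivity" amounts to choosing a point further up the y-axis and accepting the false positive rate that comes with it.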
In general, both accuracy and AUC (see Table 1) indicate that the random forest classifier trained on oversampled data, with 0.66 accuracy, is the best at classifying landscape data from the Landsat time-series satellite imagery and the crowdsourced data. However, we have to consider other real-world concerns. For example, suppose a navigation system company wants to use the model to update its maps. It does want a model with high overall accuracy, but when safety is at stake, it has to compromise on accuracy. Because people trust navigation systems so much, a car could be driven into the sea because of bad navigation, and no navigation company wants that, since such an accident would directly jeopardize the company. Therefore, in the navigation business, the focus might shift from overall accuracy to the recall on the water group, and the company might choose the decision tree model trained on oversampled data, since it has the highest overall accuracy among the models with the highest recall on water. Another example of such a compromise is national park administration. In North America, there are many national parks with distinct land cover. Each national park administration will only care about the precision and recall on the particular land cover types found in its own park.
In conclusion, for general-purpose use, we suggest the model with the best accuracy, which is the random forest classifier trained on oversampled data. For a specific purpose like those mentioned above, we suggest looking into the classification report or the ROC curve to find the most appropriate model for that use.
- scikit-plot: https://github.com/reiinakano/scikit-plot