PUBG Final Placement Prediction

PlayerUnknown’s Battlegrounds (PUBG) is one of the most popular first-person shooting games in the world. It was introduced to the world in March 2017, and within two years it accumulated over 40 million sales worldwide. The goals of this study are to 1) predict the PUBG final placement in solo mode; 2) find the winning patterns needed to master the game; and 3) derive applicable patterns from the game for use in the real world. The method used in this study is to cross-test eight models, Lasso Regression, Ridge Regression, Regression Tree, Random Forest, Bagging (RT), Gradient Boosting, Ada Boosting and Neural Network, on the processed and feature-selected data to find the most predictive one. The results demonstrate that the Neural Network model is the most predictive for PUBG final placement, with an RMSE of 0.0577 and an explained variance of 0.9619. The findings of this study suggest that the Neural Network model is the best predictive model for the PUBG game, and that the winning strategy is applicable in both the game and real life.



Introduction

Background
PUBG is one of the most popular first-person shooting games in the world. It was introduced to the world in March 2017, and within two years it accumulated over 40 million sales worldwide. Part of PUBG’s inspiration came from the Japanese movie Battle Royale, which pits players against one another in an all-out fight to determine the last person standing. A match starts with 100 players dropped onto an isolated island; players have to find weapons and supplies, kill enemies (the other players), and fight their way to being the last survivor. During the game, at regular intervals a blue circle (the life zone) shrinks, forcing the remaining players into the same area to fight each other. For each match, players can choose among three modes, solo-mode, dual-mode and group-mode, with team sizes of 1, 2 and 4, respectively.

Goal
As the gaming industry has grown in recent years, prediction in gaming has become more and more important. In this project, the aim is threefold. First, we want to predict the final placement in solo mode based on the stats data from the match. Predicting final placement could benefit the industry through video-game lotteries. Second, we want to find the key factors for winning the game. The analysis of PUBG data could help players understand what they need to do to place higher. Third, we want to derive applicable patterns from the game to apply in the real world. Since games are often metaphors for life, we are interested in which patterns found in the game could be applied in real life.


Methodology

Source
The dataset is from a Kaggle competition. It has 1,336,190 instances and 29 attributes; 28 attributes are numeric and 1 is categorical. Except for the target (winPlacePerc), all attributes fall into 8 categories based on domain knowledge: i. ID/Type (Id, groupId, matchId, matchType); ii. Game Stats (matchDuration, maxPlace, numGroups); iii. Destroy (vehicleDestroys); iv. Teamwork (assists, revives); v. Kill/Damage (DBNOs, kills, headshotKills, roadKills, killStreaks, teamKills, killPlace, damageDealt); vi. Rank (killPoints, winPoints, rankPoints); vii. Equip (boosts, heals, weaponsAcquired); viii. Distance (walkDistance, rideDistance, swimDistance, longestKill).
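
A rough sketch of how the dataset can be loaded and inspected with pandas; the file name train_V2.csv is an assumption about the local copy of the Kaggle data.

```python
# Load the Kaggle PUBG data and check its size and attribute types.
import pandas as pd

df = pd.read_csv("train_V2.csv")          # path/file name assumed
print(df.shape)                           # number of instances and attributes
print(df.dtypes.value_counts())           # numeric vs. categorical (matchType) columns
```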


EDA
From the dataset description, the five-number summary shows the distributions of all attributes. It also indicates that we need to normalize the data, since the scales of the attributes vary a lot. Although the five-number summary reveals some of the distributional shape, it is hard to comprehend all the attributes from it alone. To get a better look at the distributions, we need to visualize the target and the predictors.
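
The five-number summary can be pulled directly from pandas, for example as in this minimal sketch (assuming the data frame df loaded above):

```python
# Five-number summary of every numeric attribute; the wide spread of scales
# motivates the min-max normalization applied later in the pipeline.
summary = df.describe().T[["min", "25%", "50%", "75%", "max"]]
print(summary)
```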

The target we are trying to predict is roughly uniform. On the predictor side, many attributes have interesting distributions: several, such as the number of enemies knocked down and the number of teammate revives, collapse into a single bar at zero, while match duration is bimodal and damage dealt decays roughly exponentially.
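
A quick way to see these shapes is to plot histograms of the target and a few predictors; the column names below follow the Kaggle schema and are assumptions about which fields were inspected.

```python
# Histograms of the target and selected predictors.
import matplotlib.pyplot as plt

cols = ["winPlacePerc", "matchDuration", "damageDealt", "DBNOs", "revives"]
df[cols].hist(bins=50, figsize=(12, 6))
plt.tight_layout()
plt.show()
```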

Besides the distributions, the correlation matrix also gives good insight into the data. From the heatmap below, we can find three important things: 1) kills are necessary to win the game (killPlace, kills); 2) equipment is important (weaponsAcquired, boosts, heals); 3) walking distance is extremely important. These three points are discussed later.
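
A sketch of how the correlations and the heatmap can be produced (seaborn is assumed to be available; this is not necessarily the exact plotting code used):

```python
# Correlation of numeric attributes with the target, plus a full heatmap.
import seaborn as sns
import matplotlib.pyplot as plt

corr = df.select_dtypes("number").corr()
print(corr["winPlacePerc"].sort_values(ascending=False).head(10))

plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.show()
```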


Preprocessing
Before the ID/Type attributes (Id, groupId, matchId, matchType) were removed, the solo-mode data was extracted using the matchType attribute. The extracted data has 536,761 instances. A 60/40 split is applied to hold out the test data. Since cheating is common in video games, data points with zero movement distance but a final placement above 50% are removed based on domain experience; in total, 132 data points were removed. Also, min-max normalization is applied to all independent variables, because the scales of the predictors vary widely. As a result, the training set has shape (321924, 24) and the testing set has shape (210705, 24).
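
A minimal sketch of this preprocessing pipeline, assuming the Kaggle column names; the exact cheater-removal rule and the random seed are assumptions based on the description above.

```python
# Preprocessing: extract solo mode, drop IDs, split, remove suspected cheaters, normalize.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

solo = df[df["matchType"] == "solo"].drop(columns=["Id", "groupId", "matchId", "matchType"])

train, test = train_test_split(solo, test_size=0.4, random_state=42)

# Remove suspected cheaters: players with zero movement but a top-half placement.
moved = train["walkDistance"] + train["rideDistance"] + train["swimDistance"]
train = train[~((moved == 0) & (train["winPlacePerc"] > 0.5))]

X_train, y_train = train.drop(columns=["winPlacePerc"]), train["winPlacePerc"]
X_test, y_test = test.drop(columns=["winPlacePerc"]), test["winPlacePerc"]
feature_names = X_train.columns

# Min-max normalization, fit on the training data only.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```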

Feature Selection
For model generation, three feature selection methods are performed: a low-variance filter, stepwise selection and wrapper selection. The low-variance filter uses the VarianceThreshold function from scikit-learn. The stepwise and wrapper selections use 5 models to select features. The selected features are shown in the table below.
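
A sketch of two of these selection styles with scikit-learn; the variance threshold, the Lasso base estimator and the number of features kept are illustrative assumptions rather than the exact settings used.

```python
# Low-variance filter plus a wrapper-style selection via recursive feature elimination.
from sklearn.feature_selection import VarianceThreshold, RFE
from sklearn.linear_model import Lasso

# Drop near-constant features.
vt = VarianceThreshold(threshold=0.001)
X_train_lv = vt.fit_transform(X_train)
print("Low-variance filter kept:", list(feature_names[vt.get_support()]))

# Wrapper selection: recursively eliminate features using a base model.
rfe = RFE(estimator=Lasso(alpha=0.001), n_features_to_select=10)
rfe.fit(X_train, y_train)
print("Wrapper (RFE) kept:", list(feature_names[rfe.support_]))
```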


Model Generation
In this study, 8 models, Lasso Regression, Ridge Regression, Regression Tree, Random Forest, Bagging (RT), Gradient Boosting, Ada Boosting and Neural Network, are candidates for predicting the final placement. To obtain the best predictive model, a two-stage model selection strategy is applied. In the first stage, all the models are built with default settings using 10-fold cross-validation to see whether they fit the nature of the PUBG data, using all attributes as well as the low-variance filter, stepwise selection and wrapper selection feature sets. As a result, three models perform consistently well: Random Forest, Gradient Boosting and Neural Network, as shown in the table below. The RMSE and explained variance scores using the low-variance filter are also visualized.
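
A sketch of the first stage, comparing the eight candidates with default settings under 10-fold cross-validation (shown here on the low-variance feature set; the exact scoring setup is an assumption):

```python
# Stage 1: cross-validate all eight candidate models with default settings.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (RandomForestRegressor, BaggingRegressor,
                              GradientBoostingRegressor, AdaBoostRegressor)
from sklearn.neural_network import MLPRegressor

models = {
    "Lasso": Lasso(),
    "Ridge": Ridge(),
    "Regression Tree": DecisionTreeRegressor(),
    "Random Forest": RandomForestRegressor(),
    "Bagging (RT)": BaggingRegressor(),          # bagged regression trees by default
    "Gradient Boosting": GradientBoostingRegressor(),
    "Ada Boosting": AdaBoostRegressor(),
    "Neural Network": MLPRegressor(max_iter=500),
}

for name, model in models.items():
    scores = cross_val_score(model, X_train_lv, y_train,
                             scoring="neg_root_mean_squared_error", cv=10)
    print(f"{name}: mean CV RMSE = {-scores.mean():.4f}")
```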

In the second stage, we implemented a two-step parameter-tuning process on the best models derived from the first stage: first we look for the interval where the optimum might lie by testing each parameter individually, then we feed the candidate intervals into a grid search to obtain the best configuration. As a result, for Random Forest, the best configuration is {'n_estimators': 5, 'max_depth': 20, 'min_samples_split': 100, 'bootstrap': False}. For Gradient Boosting, the best configuration is {'n_estimators': 150, 'loss': 'huber', 'subsample': 1, 'max_features': 0.6}. For Neural Network, the best configuration is {'activation': 'relu', 'solver': 'adam', 'alpha': 0.0005, 'learning_rate_init': 0.001}, and its architecture is a single hidden layer with 170 nodes.
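
A sketch of the second stage for the Neural Network, where the grid simply wraps the reported best values with a few neighbors; the exact grids tested are not given in the report.

```python
# Stage 2: grid search over candidate intervals found in the first tuning step.
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

param_grid = {
    "hidden_layer_sizes": [(130,), (150,), (170,)],
    "activation": ["relu"],
    "solver": ["adam"],
    "alpha": [0.0001, 0.0005, 0.001],
    "learning_rate_init": [0.001, 0.005],
}
search = GridSearchCV(MLPRegressor(max_iter=500), param_grid,
                      scoring="neg_root_mean_squared_error", cv=10, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)
```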


Result

Random Forest with its best configuration reaches an RMSE of 0.0621 and an explained variance of 0.9554. Gradient Boosting with its best configuration reaches an RMSE of 0.0642 and an explained variance of 0.9525. Neural Network with its best configuration and architecture reaches an RMSE of 0.0577 and an explained variance of 0.9619. As a result, the Neural Network outperforms Random Forest and Gradient Boosting on both the training and the testing data: its RMSE is roughly 0.5 percentage points lower than Random Forest's and 0.7 points lower than Gradient Boosting's, and its explained variance is roughly 0.5 points higher than Random Forest's and 1 point higher than Gradient Boosting's.
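
These test-set scores can be reproduced along the following lines (a sketch, reusing the fitted grid search from the previous step):

```python
# Evaluate the tuned model on the held-out test set with the two reported metrics.
import numpy as np
from sklearn.metrics import mean_squared_error, explained_variance_score

best_nn = search.best_estimator_
pred = best_nn.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, pred))
evs = explained_variance_score(y_test, pred)
print(f"RMSE: {rmse:.4f}, explained variance: {evs:.4f}")
```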


Discussion

Model
In general, Random Forest, Gradient Boosting and Neural Network all perform very well in predicting the final placement. The differences in their RMSEs indicate that the error from these three models will not exceed one place (up or down) on the game ranking board. Also, their explained variances suggest that the variation in the target is captured very well.

However, the Neural Network is still the best among them. The reason might lie in the intrinsic designs of the algorithms. Random Forest is built from a collection of regression trees whose votes are combined to improve predictive accuracy and control overfitting. Gradient Boosting is a boosting model that uses gradient descent on the derivative of the loss function to optimize the ensemble. A Neural Network learns its weights through forward and backward propagation, with an optimizer applied to the derivative of the loss function, and here it delivers the best prediction result. In data science there is no free lunch: besides understanding the mechanisms of machine learning algorithms, most of the time when we enter a new field we have to try different kinds of models to find out which one dominates prediction in that field (such as gaming data).


Game Strategy
In this study, different kinds of feature selection are performed. Across all runs, the three most commonly selected features are 'killPlace', 'walkDistance' and 'kills', chosen 11, 10 and 8 times, respectively. Also, from the correlation analysis, 'kills', 'killPlace', 'weaponsAcquired', 'heals', 'boosts' and 'walkDistance' have high correlations with the target attribute. This suggests that in a PUBG match, skill, equipment and movement are extremely important. The attributes 'kills' and 'killPlace' indicate that players have to practice shooting to sharpen their killing skills. The attributes 'weaponsAcquired', 'heals' and 'boosts' suggest that players need to obtain as many items as possible in order to survive and kill. Also, 'walkDistance' suggests that good players know that staying on the move gives them a better chance to survive.
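
As an illustration of how such a tally can be computed, the snippet below counts feature occurrences across selection runs; selected_sets is a hypothetical placeholder for the actual feature lists returned by each selection run above.

```python
# Count how often each feature is chosen across the different selection runs.
from collections import Counter

selected_sets = [
    ["killPlace", "walkDistance", "kills", "boosts"],   # hypothetical example run
    ["killPlace", "walkDistance", "weaponsAcquired"],   # hypothetical example run
    # ... one list per selection run
]
counts = Counter(f for s in selected_sets for f in s)
print(counts.most_common(3))
```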


Back Into Real Life
People say "life is a game", and the PUBG strategies from this study are indeed applicable to our lives. First, the 'killPlace' and 'kills' features are like KPIs at work. In the game, although players can blow up a jeep or do many other crazy things, only kills count toward the credit. We can reflect this at work and in life: we should perform in the right spots to earn the credit, since the rest does not count. Second, the features 'weaponsAcquired', 'heals' and 'boosts' are the resources in life. To succeed in life, people need to know where the resources are, how to get them, and to acquire as many as possible in case they are needed someday. Third, the feature 'walkDistance' tells us to keep moving. Keeping on the move in life helps us discover new things and widen our horizons in order to live a better life.


Conclusion

All in all, we performed feature selection and cross-tested all the models. The result is that the Neural Network is the best model for predicting PUBG final placement, achieving the lowest RMSE and the highest explained variance on the full feature set. Next, game strategy matters within the game: to win a PUBG match, sharpened skills, equipment gathering and a keep-moving strategy are necessary. Lastly, the strategy from PUBG also matters in life: to succeed, people should put their effort into the things that count (like the KPIs in companies), obtain as many resources as possible, and keep moving to widen their horizons.


Future Work

First, SVR was originally planned as a candidate as well, but every time it ran, the computer froze, so I had to remove it from the candidate list. The reason might be that the training data is too large (over 300 thousand instances). In the future, I should rent a cloud machine such as an AWS EC2 instance to solve this data-size problem. Second, this study only covers solo-mode prediction and analysis. Although the prediction and game strategy are robust and interesting, it would be better if dual-mode and group-mode could be analyzed as well. Since humans are social beings, analysis of dual-mode and group-mode could help us find the patterns of people helping each other. I hope this can be done in the future.