Regression models are widely used to predict continuous variables. Beyond prediction, a regression model also shows how the input variables relate to the response. By analyzing these relationships, we can find underlying patterns that help us make decisions. Today, I will demonstrate how to run a regression analysis and interpret its results, using house data from King County, Washington, from May 2014 through May 2015.
The data file has 21 columns, but not all of them are relevant here. The response variable of interest is “Price”, and we want to know which predictors are significant in determining house prices.
Before running any regression, we have to check for multicollinearity. To do so, we can inspect the correlation matrix of the independent variables. If two independent variables are highly correlated with each other, we have a multicollinearity problem, and we should remove one of them. Otherwise, multicollinearity can cause problems in our analysis, such as inaccurate slope estimates.
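The check described above can be sketched roughly as follows. This is only an illustration: the 0.8 cutoff, the helper name highly_correlated_pairs, and the synthetic data are my assumptions, not values taken from the King County analysis.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.8):
    """Return (col_a, col_b, r) for each pair with |r| > threshold, strongest first."""
    corr = df.corr()  # Pearson correlation matrix
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip the diagonal
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)

# Synthetic stand-in for the predictors (made-up numbers, not the real data file).
rng = np.random.default_rng(0)
sqft_above = rng.uniform(500, 3000, size=200)
toy = pd.DataFrame({
    "sqft_living": sqft_above + rng.uniform(0, 300, size=200),  # built to track sqft_above
    "sqft_above": sqft_above,
    "bedrooms": rng.integers(1, 6, size=200).astype(float),     # unrelated column
})

pairs = highly_correlated_pairs(toy)
print(pairs)
```

Each flagged pair is a candidate for dropping one of its two variables; which one to drop remains a judgment call.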
In the plot above, the correlation between sqft_living (the 4th variable) and sqft_above (the 11th) is the highest, judging by the color scale, so we have to eliminate one of them to solve the multicollinearity problem. Since sqft_living is also highly correlated with more of the other independent variables, it is the better one to remove, because dropping it keeps more independent information in the remaining variables. In the same way, we then remove sqft_above and sqft_lot15. The final result plot is shown below.
In addition, since regression is sensitive to missing values, we have to check whether there are any before fitting the model. The result shown below tells us that there are no missing values in the dataset.
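A missing-value check like this one can be sketched in a few lines. The helper name missing_report and the tiny table are made up for illustration; the real check would run on the full data file.

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Count missing values in each column."""
    return df.isna().sum()

# Made-up rows with one deliberately missing price, just to show the output shape.
toy = pd.DataFrame({
    "price": [300000.0, 450000.0, np.nan],
    "bedrooms": [3, 3, 2],
})

report = missing_report(toy)
print(report)
print("any missing:", bool(report.sum() > 0))
```

If every count is zero, as it is for the dataset used here, no rows need to be dropped or imputed.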
We also need to check that every variable we use has a linear meaning, so that it makes sense in a regression. In the table below, I describe the numeric meaning of each variable that remains after the previous steps. All in all, only zipcode has no linear meaning (a zip code is a label, so a larger number does not mean more of anything), and we should remove it before running the regression.
To build a multiple regression model, three automatic selection methods are available: stepwise, forward, and backward. For this task I will use backward elimination. Forward selection is cheaper to compute but handles multicollinearity poorly. In contrast, backward elimination may take more time to compute, but because it starts from the full model and drops variables one at a time, it can alleviate multicollinearity problems. Since the dataset is small enough to handle, I choose backward elimination to avoid problems caused by multicollinearity.
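To make the idea concrete, here is a minimal sketch of p-value-based backward elimination. It is not the exact procedure of any particular statistics package: the function names, the synthetic data, and the use of a normal approximation for the p-values (reasonable only when there are many more rows than predictors) are all assumptions of this sketch.

```python
import math
import numpy as np

def ols_pvalues(X, y):
    """Fit OLS with an intercept and return two-sided p-values for the slopes.

    Assumption: p-values use a normal approximation to the t distribution.
    """
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])       # prepend the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = n - A.shape[1]                       # residual degrees of freedom
    sigma2 = resid @ resid / dof               # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)      # covariance of the coefficients
    z = beta / np.sqrt(np.diag(cov))
    p = np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])
    return p[1:]                               # drop the intercept's p-value

def backward_eliminate(X, y, names, alpha=0.05):
    """Start from the full model; drop the least significant predictor until all p <= alpha."""
    keep = list(range(X.shape[1]))
    while keep:
        p = ols_pvalues(X[:, keep], y)
        worst = int(np.argmax(p))
        if p[worst] <= alpha:
            break
        keep.pop(worst)
    return [names[i] for i in keep]

# Synthetic data (hypothetical names): y depends on x1 and x2; "noise" is unrelated.
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

selected = backward_eliminate(X, y, ["x1", "x2", "noise"])
print(selected)
```

The loop refits the model after every removal, which is why backward elimination costs more than a single fit.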
The overall p-value indicates that we can reject the null hypothesis H0: b1 = b2 = … = bk = 0 in favor of the alternative hypothesis that at least one coefficient is different from zero.
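For reference, the overall p-value comes from the standard F test for a linear regression. The notation below is mine, not part of the regression output: n is the number of observations, k the number of predictors, SSR the regression sum of squares, and SSE the residual sum of squares.

```latex
F = \frac{\mathrm{SSR}/k}{\mathrm{SSE}/(n-k-1)}, \qquad
\mathrm{SSR} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \qquad
\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
```

Under H0 and the usual assumption of normally distributed errors, F follows an F distribution with k and n - k - 1 degrees of freedom, so a large F gives a small overall p-value.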
In the result of the backward method, the coefficients of all the remaining variables are significantly different from zero at the .05 level. The variables are bathrooms, floors, waterfront, view, condition, grade, sqft_basement, yr_built, yr_renovated, lat, and sqft_living15.
Using the variables above, we can create visualizations that tell an interesting story or give some insight into this data, and present them to the general public.
The importance graph above shows how much each variable contributes to the model. We can also study the variables and interpret them in practical terms. Take the top 3 as examples:
1. The grade provided by the King County auditor plays a very important role in house price. Because these grades greatly influence the housing market in King County, the auditors deserve close oversight so that grades cannot be used to manipulate the market illegally.
2. Latitude suggests there may be a class divide between the north and the south of the county. In the “Latitude vs Price” plot, we can see that house prices in south King County are lower than in north King County. Knowing this could help with decisions about city development.
3. The “Waterfront vs Price” plot shows that people in King County value houses near the water more than houses away from it, and this is reflected in the price.