Principal component analysis (PCA) is an important dimensionality reduction technique in machine learning. Today, I will demonstrate how to perform PCA and how to interpret the rotated loading matrix, using Brewer's Friend recipe data.
The site Brewer's Friend allows users to share their home-brew beer recipes, and this dataset contains a selection of the recipes uploaded so far. We are interested in discovering the underlying patterns in this brewing data. To introduce the terminology and process of brewing, Figure 1 shows a general flow chart of the brewing process.
In my analysis, I will explain how I cleaned the data for PCA, demonstrate sample adequacy, determine how many principal components to extract, analyze the components, and conclude the analysis.
Before running PCA on the Brewer's data, let's get to know the data through exploratory analysis.
Categorical Variable Distribution:
In the bar chart, we can see that the distribution of recipe styles is highly imbalanced. Imbalanced data causes several problems in classification and clustering; to address the imbalance, we might have to use sampling techniques or bin the data.
Continuous Variable Distribution:
The five-number summary of a continuous variable gives its minimum, 1st quartile, median, 3rd quartile, and maximum. To compare the continuous variables with one another, we min-max normalized the data for the box plot (a linear rescaling that does not change the shape of each distribution). From the plot, first, many of the variables are skewed, so before running certain models, such as regression, we might have to eliminate outliers. Next, some pairs have almost identical distributions, for instance 'Size.L.' and 'BoilSize'. Such near-identical distributions hint at multicollinearity, which jeopardizes predictive models by producing anomalous beta coefficients. Finally, the plot shows that some variables, such as 'PitchRate' and 'Efficiency', are approximately normally distributed.
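The min-max step above can be sketched in a few lines. The analysis itself was done in R; this is an equivalent Python/pandas version on a toy stand-in frame (the values are made up, only the column names mirror the real data):

```python
import pandas as pd

# Toy stand-in for two brewing columns; real values differ.
df = pd.DataFrame({
    "Size.L.":  [20.0, 21.0, 1000.0, 19.5],
    "BoilSize": [24.0, 25.5, 1200.0, 23.0],
})

# Min-max normalize each column to [0, 1]. This is a monotone linear
# transform, so the shape of each distribution (and its skew) is unchanged.
normalized = (df - df.min()) / (df.max() - df.min())
```

Because the transform is linear, a skewed variable stays skewed after scaling; normalization only puts all variables on a comparable axis for the box plot.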
Linear Relationship & Correlation:
The scatterplot matrices illustrate the linear relationship between each pair of variables. First, we can assess the relationship between the target variable ('FG') and the independent variables. At this stage, we are looking for strong linear relationships, which guarantee adequate linearity for regression-based analyses. In these scatterplot matrices, we find that 'OG' and 'BoilGravity' are good predictors of 'FG'. Second, if two independent variables are highly correlated, multicollinearity is likely. In the correlation matrix, several pairs have high correlation: 'BoilSize'/'Size.L.', 'OG'/'FG', 'BoilGravity'/'OG', and 'FG'/'BoilGravity'.
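Those pairwise correlations come straight from a Pearson correlation matrix; a minimal Python sketch (the real analysis used R, and these toy values are invented for illustration):

```python
import pandas as pd

# Toy stand-in for three gravity-related columns.
df = pd.DataFrame({
    "OG":          [1.050, 1.062, 1.045, 1.070],
    "FG":          [1.012, 1.016, 1.010, 1.018],
    "BoilGravity": [1.044, 1.055, 1.039, 1.062],
})

corr = df.corr()  # Pearson correlation matrix

# Flag off-diagonal pairs whose absolute correlation suggests
# multicollinearity (0.8 is a common, somewhat arbitrary cutoff).
high = (corr.abs() > 0.8) & (corr.abs() < 1.0)
```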
The data set contains 73,861 cases and 23 variables, including numeric measurements taken during the brewing process and categorical variables describing brewing methods and provider information. Since PCA does not handle categorical data well, I first removed all categorical variables, then removed all cases with missing values. Removing the missing values is acceptable here because the remaining data set still has 20,371 cases, which I deem enough to find the latent variables. The descriptions of the resulting set are listed in Table 1.
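The two cleaning steps can be sketched as follows, again as a Python/pandas equivalent of the R workflow, on a tiny invented frame:

```python
import pandas as pd

# Tiny stand-in for the raw data (the real set is 73,861 x 23).
raw = pd.DataFrame({
    "Style": ["IPA", "Stout", "IPA"],   # categorical
    "OG":    [1.055, 1.070, None],      # numeric, with a missing value
    "FG":    [1.012, 1.018, 1.010],
    "ABV":   [5.6, 6.8, 5.2],
})

# 1) Keep only numeric variables -- PCA does not handle categoricals well.
numeric = raw.select_dtypes(include="number")

# 2) Drop every case with any missing value.
clean = numeric.dropna()
```

Dropping rows outright is defensible only because so many complete cases remain; with a smaller data set, imputation would be the safer choice.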
Sample size is one concern when deciding whether the data is suitable for principal component analysis. Since factor analysis is unstable for large data sets like this cleaned one (20,371 instances), we can start from the assumption that principal component analysis is the better tool for finding the underlying patterns in this data set. To test that assumption, we can look at two things: Bartlett's test of sphericity and the Kaiser-Meyer-Olkin (KMO) test.
Bartlett's test of sphericity tests whether the correlation matrix is an identity matrix. Its null hypothesis is that the variables are uncorrelated; the alternative is that at least some variables are correlated. If the p-value from Bartlett's test is less than the 0.05 significance level, we can reject the null hypothesis and accept the alternative, which indicates there is correlation to exploit. The KMO test, on the other hand, measures sampling adequacy by computing the MSA (measure of sampling adequacy) score. A higher MSA score indicates higher stability in principal component analysis.
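Both statistics can be computed directly from the correlation matrix. A minimal Python sketch (the analysis used R; this is an illustrative reimplementation of the textbook formulas, run on invented toy data):

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(X):
    """Bartlett's test of sphericity: H0 is that the correlation
    matrix is an identity matrix (nothing to exploit)."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    stat = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    dof = p * (p - 1) / 2
    return stat, chi2.sf(stat, dof)  # chi-square statistic, p-value

def kmo_overall(X):
    """Overall KMO measure of sampling adequacy (0 to 1; > 0.5 acceptable)."""
    R = np.corrcoef(X, rowvar=False)
    S = np.linalg.inv(R)
    # Anti-image (negative partial) correlations.
    Q = -S / np.sqrt(np.outer(np.diag(S), np.diag(S)))
    np.fill_diagonal(R, 0.0)
    np.fill_diagonal(Q, 0.0)
    return (R**2).sum() / ((R**2).sum() + (Q**2).sum())

# Toy data: five variables sharing one latent driver, so correlation exists.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 1))
X = latent + 0.5 * rng.normal(size=(500, 5))

stat, p_value = bartlett_sphericity(X)
msa = kmo_overall(X)
```

The intuition behind KMO: if raw correlations stay large after partialling out the other variables, the correlation structure is driven by shared factors rather than pairwise quirks, and PCA has something real to extract.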
In the test results, the p-value from Bartlett's test (Table 2) is essentially 0, and the overall MSA from the KMO test (Table 3) is 0.62, above the 0.5 threshold. Both indicate that principal component analysis is appropriate for this data set.
Determine the Number of Principal Components
Now that we are confident the data is appropriate for principal component analysis, we have to determine how many principal components (as few as possible) are needed to represent the data set. Here I use three different methods to judge the best number of principal components. The first is the cumulative proportion of variance: since the goal is latent variables, I aim for 60% to 80% of the total variance. In the results (Table 4), 4 PCs cover 61.49% of the total variance, and 7 PCs cover 83.76%. From the cumulative-proportion perspective, 4, 5, 6, or 7 is a reasonable number.
The second is the Kaiser criterion, which retains every PC with an eigenvalue greater than one. One way to apply it is to look at the scree plot and keep the PCs whose variance exceeds one. By this method (see Figure 2), 4 PCs is the number we should use.
The third is finding the knee in the scree plot (Figure 2). The idea is to locate the point beyond which additional components explain little further variance and mostly fit noise. Here, the knee occurs between 4 and 5, which tells us we should use only 4 PCs to represent the data. Combining these three methods, we can determine that the number of PCs is 4.
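The first two criteria can be checked numerically from the eigenvalues of the correlation matrix. A Python sketch on synthetic data shaped like the cleaned brewing set (13 variables; the latent structure and noise level here are invented, not fitted to the real data):

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy stand-in: 13 variables driven by 4 latent factors plus noise.
latent = rng.normal(size=(1000, 4))
weights = rng.normal(size=(4, 13))
X = latent @ weights + 0.8 * rng.normal(size=(1000, 13))

# PCA on the correlation matrix (i.e., on standardized variables):
# its eigenvalues are the variances plotted in a scree plot.
eigvals = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]

kaiser_k = int((eigvals > 1).sum())            # Kaiser: keep eigenvalues > 1
cum_prop = np.cumsum(eigvals) / eigvals.sum()  # cumulative proportion of variance
k_60 = int(np.argmax(cum_prop >= 0.60)) + 1    # smallest k covering >= 60%
```

The knee criterion remains visual: plot `eigvals` against component index and look for the bend where the curve flattens.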
Analyze the Components
In this section I present the analysis of the PCs obtained from psych::principal in R. Table 5 below gives the loadings between the 4 PCs and their related indicators. A visualization of the PCA is shown in Figure 3.
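By default, psych::principal applies a varimax rotation before reporting loadings, which is what makes each component load strongly on a small, interpretable set of indicators. The rotation can be sketched in Python as follows (an illustrative reimplementation of the standard varimax algorithm, not the psych code; the loading matrix below is made up):

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Orthogonally rotate a (variables x components) loading matrix
    to maximize the varimax criterion."""
    p, k = loadings.shape
    R = np.eye(k)  # accumulated rotation matrix
    crit = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        u, s, vt = np.linalg.svd(
            loadings.T @ (L**3 - (gamma / p) * L * (L**2).sum(axis=0))
        )
        R = u @ vt
        if s.sum() < crit * (1 + tol):  # converged: no further improvement
            break
        crit = s.sum()
    return loadings @ R

# Hypothetical 4-variable, 2-component loading matrix.
A = np.array([[0.8, 0.1],
              [0.7, 0.2],
              [0.1, 0.9],
              [0.2, 0.8]])
rotated = varimax(A)
```

Because the rotation is orthogonal, each variable's communality (its squared row norm) is unchanged; only how that variance is split across components moves, pushing loadings toward a simple structure.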
Gravity in brewing:
Size in brewing:
Strengths of brewing:
Fermentation in brewing:
Moreover, it is worth mentioning that 'Efficiency' and 'MashThickness' are missing from our components. The reasons are, first, that mash efficiency is a target measurement (one that brewers try to reach) in brewing; second, that mash thickness is the thickness of the wort, which brewers set by rule of thumb. Hence, both variables are expected to be uncorrelated with the others, which we can verify by checking their distributions (Figure 4 and Figure 5).
In this study, principal component analysis was conducted on thirteen brewing variables. The data came from the 'Brewer's Friend' website, which lets brewers share their recipes. The original data set was cleaned down to thirteen numeric variables for PCA. Sample adequacy was tested with two methods, Bartlett's test of sphericity and the Kaiser-Meyer-Olkin (KMO) test; both results indicate that the sample is stable enough to conduct principal component analysis.
Using the Kaiser criterion, the scree-plot knee, and the cumulative proportion of variance, four components were extracted for further investigation. After rotating the loading matrix, the indicators loaded onto four components, which are interpreted and named as: (1) gravity in brewing, (2) size in brewing, (3) strength in brewing, and (4) fermentation in brewing. These interpretations help us understand the nature of brewing, such as the gravity level throughout the brewing process, beer strength as measured by ABV, IBU, and color, and what facilitates fermentation, informing future brewing work.