Brewer’s Analysis

Principal Component Analysis (PCA) is an important dimensionality-reduction technique in machine learning. In this post, I will demonstrate how to perform PCA and how to interpret the rotated component matrix using the Brewer’s Friend recipe data.



Introduction

The site Brewer’s Friend allows users to share their recipes for home-brewed beer, and this dataset contains a selection of the recipes uploaded so far. We are interested in discovering the underlying patterns in this brewing data. To introduce the terminology and the overall process, Figure 1 shows a general flow chart of brewing.

In my analysis, I will explain how I cleaned the data for PCA, demonstrate the sampling adequacy, determine how many principal components to extract, analyze the components, and conclude with a summary.

 


Exploratory Analysis

Before running PCA on the Brewer’s data, let’s get to know the data through exploratory analysis.

Categorical Variable Distribution:

The bar chart shows that the distribution of recipe styles is highly imbalanced. Imbalanced data causes several problems in classification and clustering, so to address the imbalance we might have to use sampling techniques or bin the styles into broader groups.
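As a rough sketch of that check in R (assuming the data frame is named recipes and the style column is named Style, as in the Brewer’s Friend export), the style counts can be tabulated and plotted directly:

    # Count recipes per style; the long tail shows the imbalance
    style_counts <- sort(table(recipes$Style), decreasing = TRUE)

    # Bar chart of the 20 most common styles
    barplot(head(style_counts, 20), las = 2, cex.names = 0.6,
            main = "Top 20 recipe styles", ylab = "Number of recipes")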

Continuous Variable Distribution:

The five-number summary of each continuous variable gives its minimum, first quartile, median, third quartile, and maximum. To compare the continuous variables with one another more easily, the box plot uses min-max normalized data (which does not change the shape of any distribution). The plot shows, first, that many variables are skewed, so before fitting certain models such as regression we might have to remove outliers. Second, some pairs have nearly identical distributions, for instance ‘Size.L.’ and ‘BoilSize’; nearly identical distributions hint at multicollinearity, which jeopardizes predictive models by producing anomalous beta coefficients. Finally, some variables, such as ‘PitchRate’ and ‘Efficiency’, are roughly normally distributed.
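A minimal sketch of the min-max scaling behind the box plot, assuming the numeric columns live in the same recipes data frame; the rescaling maps every variable onto [0, 1] without changing its shape:

    # Min-max normalize each numeric variable onto [0, 1]
    min_max <- function(x) {
      (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
    }

    num_vars   <- recipes[sapply(recipes, is.numeric)]
    num_scaled <- as.data.frame(lapply(num_vars, min_max))

    # Side-by-side box plots on a common 0-1 scale
    boxplot(num_scaled, las = 2, cex.axis = 0.7,
            main = "Min-max normalized continuous variables")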

Linear Relationship & Correlation:

The scatterplot matrices illustrate the linear relationship between each pair of variables. First, we can assess the relationship between the target variable (FG) and the independent variables; here we are looking for strong linear correlations, which guarantee adequate linearity for regression-based analyses. In these scatterplot matrices, ‘OG’ and ‘BoilGravity’ look useful for predicting ‘FG’. Second, if two independent variables are highly correlated with each other, multicollinearity is likely. In the correlation matrix, several pairs show high correlation: ‘BoilSize’/‘Size.L.’, ‘OG’/‘FG’, ‘BoilGravity’/‘OG’, and ‘FG’/‘BoilGravity’.
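These checks can be reproduced along the following lines (a sketch that reuses num_vars from the previous snippet and the column names quoted above):

    # Scatterplot matrix: FG against the candidate predictors
    pairs(num_vars[, c("FG", "OG", "BoilGravity", "Size.L.", "BoilSize")],
          pch = ".", main = "Scatterplot matrix")

    # Pearson correlation matrix, rounded for readability
    round(cor(num_vars, use = "pairwise.complete.obs"), 2)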

 


Data Cleaning

The data set contains up to 73,861 cases and 23 variables, including numeric measurements from the brewing process and categorical variables describing brewing methods and the contributors. Since PCA does not handle categorical data well, I first removed all categorical variables and then removed all rows with missing values. Dropping incomplete rows is acceptable here because the remaining data set still has 20,371 cases, which I consider enough to uncover the latent variables. The variables in the resulting set are described in Table 1.
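A minimal sketch of this cleaning step, assuming the raw export is a CSV (the file name recipeData.csv is an assumption) and that the numeric columns are parsed as numeric:

    # Load the raw export, keep only numeric variables, drop incomplete rows
    raw     <- read.csv("recipeData.csv", stringsAsFactors = FALSE)  # assumed file name
    numeric <- raw[sapply(raw, is.numeric)]
    clean   <- na.omit(numeric)

    dim(clean)  # about 20,371 rows and 13 columns should remain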

 


Sample Adequacy

Sample size is one of the concerns when judging whether the data are suitable for principal component analysis. Since factor analysis can be unstable on a data set as large as our cleaned set (20,371 cases), we start from the assumption that principal component analysis is the better way to find the underlying patterns here. To test that assumption, we can look at two diagnostics: Bartlett’s test of sphericity and the Kaiser-Meyer-Olkin (KMO) test.

Bartlett’s test of sphericity checks whether the correlation matrix is an identity matrix. Its null hypothesis is that the variables are uncorrelated; the alternative is that at least some of them are correlated. If the p-value from Bartlett’s test is below the 0.05 significance level, we reject the null hypothesis, which indicates there is correlation to exploit. The KMO test, on the other hand, measures sampling adequacy by computing the MSA (measure of sampling adequacy) score; a higher MSA score indicates that principal component analysis will be more stable.

Bartlett’s test:

KMO test:

In the test results, the p-value of Bartlett’s test (Table 2) is essentially 0 and the overall MSA from the KMO test (Table 3) is 0.62, above the usual 0.5 threshold. Both indicate that principal component analysis is appropriate for this data set.
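Both diagnostics are available in the psych package; a sketch of the calls, assuming the cleaned numeric data frame is named clean:

    library(psych)

    R <- cor(clean)                       # correlation matrix of the 13 variables
    cortest.bartlett(R, n = nrow(clean))  # Bartlett's test of sphericity
    KMO(R)                                # overall and per-variable MSA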

 


Determine the Number of Principal Components

Now that we know the data are appropriate for principal component analysis, we have to determine how many principal components (as few as possible) are needed to represent the data set. I use three different criteria to judge the best number. The first is the cumulative proportion of variance: since the components stand in for latent variables, I aim for 60% to 80% of the total variance. In the results (Table 4), 4 PCs cover 61.49% of the total variance and 7 PCs cover 83.76%. From the cumulative-proportion perspective, 4, 5, 6, or 7 components would be reasonable.
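The cumulative proportions can be read off a standardized PCA; a sketch in base R (the variables are standardized because they sit on very different scales):

    # PCA on standardized variables; summary() reports the cumulative proportion
    pca <- prcomp(clean, center = TRUE, scale. = TRUE)
    summary(pca)  # read the "Cumulative Proportion" row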

The second is the Kaiser criterion: retain every PC with an eigenvalue greater than one. One way to apply it is to look at the scree plot and count the components whose variance exceeds one. By this criterion (see Figure 2), 4 PCs is the number we should use.

The third is finding the knee in the scree plot (Figure 2). The idea is to find the point beyond which additional components explain little further variance and mostly fit noise. Here the knee falls between 4 and 5, which tells us to use only 4 PCs to represent the data. All three methods therefore agree that the number of PCs is 4.
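The eigenvalues and the scree plot needed for the other two criteria come from the same pca object as in the previous sketch:

    # Eigenvalues (component variances); the Kaiser rule keeps those > 1
    eigenvalues <- pca$sdev^2
    sum(eigenvalues > 1)

    # Scree plot with a reference line at eigenvalue = 1
    plot(eigenvalues, type = "b", xlab = "Component",
         ylab = "Eigenvalue (variance)", main = "Scree plot")
    abline(h = 1, lty = 2)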

 


Analyze the Components

In this section I present the analysis of the PCs obtained from psych::principal in R. Table 5 below gives the loadings of the indicators on the 4 PCs, and Figure 3 visualizes the PCA.
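The rotated components come from psych::principal; a sketch of the call behind Table 5, assuming a varimax rotation (the function's default), since the post does not state which rotation was applied:

    library(psych)

    # Extract and rotate 4 components from the cleaned numeric data
    pc <- principal(clean, nfactors = 4, rotate = "varimax", scores = TRUE)

    # Loadings as reported in Table 5 (small loadings suppressed for readability)
    print(pc$loadings, cutoff = 0.4, sort = TRUE)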

Gravity in brewing:

Since this component contains original gravity, boil gravity, and final gravity, it can be interpreted as the “gravity in brewing” aspect. The PC1 formula is:

    PC_{1} = 0.989 \times OG + 0.980 \times BoilGravity + 0.972 \times FG

OG, BoilGravity, and FG are all positively related. They group together because they all track the gravity level as it changes over the brewing process, from the original reading through the boil to the final reading.

Size in brewing:

This component consists of the volume of beer brewed and the volume of wort in the boil, so we can consider it the “size in brewing” aspect. The PC2 formula is:

    PC_{2} = 0.990 \times Size.L. + 0.989 \times BoilSize

The batch size and the boil size are highly related: because boiling the wort evaporates water, the finished beer has a somewhat smaller volume that depends on the amount of wort boiled.

Strengths of brewing:

This component is made up of alcohol by volume, international bittering units, and color. Hence we can interpret it as the “strength of brewing” aspect. The PC3 formula is:

    PC_{3} = 0.843 \times ABV + 0.692 \times IBU + 0.533 \times Color

ABV, IBU, and Color all contribute positively to this component: ABV contributes the most, IBU the second most, and Color the least. This component helps reveal the underlying pattern of how strong a beer is.

Fermentation in brewing:

Since this component comprises the pitch rate (the amount of yeast added to the cooled wort), the primary fermentation temperature, and the boil time, we can interpret it as the “fermentation in brewing” aspect. The PC4 formula is:

    PC_{4} = 0.778 \times PitchRate - 0.700 \times PrimaryTemp + 0.499 \times BoilTime

PitchRate and BoilTime contribute positively to this component, while PrimaryTemp contributes negatively. From a brewing perspective, the amount of yeast pitched (PitchRate), a lower primary temperature (below 20 ºC), and the boil (BoilTime), which stops the enzyme activity that converts complex sugars brewer’s yeast cannot ferment into simpler ones, all shape how fermentation proceeds.

It is also worth mentioning that ‘Efficiency’ and ‘MashThickness’ do not appear in any of the components. The reasons are, first, that mash efficiency is a target measurement that brewers try to reach, and second, that mash thickness is a wort thickness that brewers set by rule of thumb. Hence both variables are expected to be largely uncorrelated with the other variables, which we can verify by checking their distributions (Figure 4 and Figure 5).
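The low-correlation claim can also be checked numerically; a quick sketch, reusing the clean data frame:

    # Correlations of Efficiency and MashThickness with every other variable
    round(cor(clean)[, c("Efficiency", "MashThickness")], 2)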

 


Conclusion

In this study, principal component analysis was conducted on thirteen brewing variables. The data came from the Brewer’s Friend website, which lets brewers share their recipes. The original data set was cleaned down to thirteen numeric variables for PCA. Sampling adequacy was tested with two methods, Bartlett’s test of sphericity and the Kaiser-Meyer-Olkin (KMO) test, and both results indicate that the sample is stable enough to conduct principal component analysis.

Using the Kaiser rule, the scree-plot knee, and the cumulative proportion of variance, four components were extracted for further investigation. After rotating the loading matrix, the indicators loaded on the four components were interpreted and named: (1) gravity in brewing, (2) size in brewing, (3) strength in brewing, and (4) fermentation in brewing. These interpretations help us understand the nature of brewing, such as how the gravity level changes throughout the process, how beer strength is weighted across ABV, IBU, and color, and what facilitates fermentation, and they can inform future brewing work.