Forecasting is essential in modern society. People forecast crime rates, stock prices, the weather, and more. **Jena** is a German university city and the second largest city in Thuringia. In this article, I am going to use Jena's weather data to forecast its temperature in two phases. The first phase is exploratory data analysis, and the second focuses mainly on applying an LSTM and tuning its hyperparameters.

**Exploratory Data Analysis**

**Dataset**

All attributes in this dataset are numeric, and their scales differ considerably. For instance, the values in the p (mbar) column are around 1000, while the max.wv (m/s) column seems to have values lower than 1.

#### Statistical Description

First, there are no missing values in this dataset, since every entry in the count row is 70073. Second, the five-number summary gives a basic idea of how each attribute is distributed. Take the sh (g/kg) column as an example: its min and max are about 0.5 and 18 respectively, so for a symmetric distribution we would expect the 50th percentile to sit near the middle of that range, around 9 to 10. However, the 50th percentile is 6, much lower than expected. This indicates that the distribution of sh (g/kg) is right-skewed. Third, besides the five-number summary, the standard deviation gives insight into how widely these attributes spread. Even though the p (mbar) and T (degC) columns have close standard deviations (8.360583 and 8.420257), their ranges, approximately 100 and 60, show that p (mbar) spreads over a wider interval than T (degC).
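The skewness reasoning above can be sketched in a few lines of Python. This is only an illustration of the heuristic (median compared with the midpoint of the range), not the code used for the analysis; the function names `five_number_summary` and `skew_hint` and the sample values are made up for this example and are not taken from the Jena data.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

def skew_hint(values):
    """Rough heuristic: compare the median with the midpoint of the range."""
    low, _, median, _, high = five_number_summary(values)
    midpoint = (low + high) / 2
    if median < midpoint:
        return "right-skewed"
    if median > midpoint:
        return "left-skewed"
    return "roughly symmetric"

# Made-up sample (NOT the Jena data): many small values, a few large ones.
sample = [0.5, 2, 3, 4, 5, 6, 6, 7, 8, 12, 18]
print(five_number_summary(sample))
print(skew_hint(sample))
```

The heuristic is crude, since a single outlier can move the midpoint a lot, so it is best read together with the distribution plots.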

#### Distribution Plots

The plots above depict how each attribute is distributed. Columns T, p, Tpot, rho, and Tdew are roughly bell-shaped. Columns wv and max.wv each appear as virtually a single bar, which indicates that the values of these two attributes are highly concentrated. Column VPdef has an exponential-like distribution. Columns VPmax, VPact, sh, and H2OC are right-skewed, and column rh is left-skewed. Column wd has a bimodal distribution.

#### Correlation

For the dependent variable, the correlation matrix plot shows how much each independent variable contributes to predicting T. Attributes p, wv, max.wv, and wd seem to contribute little to predicting the target variable. In contrast, attributes Tpot, Tdew, VPmax, VPact, VPdef, sh, and H2OC have a strong positive relationship with T. Attribute rho has a strong negative relationship with the target variable, and attribute rh has a moderate negative relationship with it.

For the independent variables, the correlation matrix plot helps to spot multicollinearity. In the plot, the independent variables Tpot, VPmax, Tdew, VPact, sh, and H2OC are highly correlated (with correlations of almost 1). Since multicollinearity can cause unstable beta coefficients and inaccurate predictions in multiple linear regression, it is worth being aware of here as well, even though a neural network is a far more complex, nonlinear model.
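As a rough sketch of how such a check could be automated, the snippet below computes Pearson correlations by hand and lists the column pairs whose absolute correlation exceeds a threshold. The helper names, the threshold of 0.95, and the toy values are assumptions made for this illustration; only the column names are borrowed from the dataset.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def highly_correlated_pairs(columns, threshold=0.95):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs

# Made-up toy values (NOT the Jena data), chosen to mimic the pattern above.
toy = {
    "T": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Tpot": [2.1, 3.0, 4.2, 5.0, 6.1],  # almost linear in T
    "rho": [5.0, 4.1, 3.0, 2.2, 1.0],   # almost linear in T, but decreasing
    "wd": [3.0, 1.0, 4.0, 1.0, 5.0],    # only weakly related to the others
}
for name_a, name_b, r in highly_correlated_pairs(toy):
    print(name_a, name_b, r)
```

Using the absolute value means that strongly negative pairs, such as T and rho here, are flagged as well.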

#### Time Series Plot

In the Temperature Trend plot above, it is noticeable that the minimum temperature rose from 2009 to 2017, with 2014 as the change point. Before 2014, the minimum temperature spans about 20 degrees Celsius, from -23 to -3; after 2014, it spans a little over 10 degrees Celsius, from -13 to 0. The trends of the maximum and average temperatures did not change much.
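A yearly minimum, mean, and maximum like the ones read off the trend plot could be computed along these lines. This is a minimal sketch: the function name, the timestamp format `%d.%m.%Y %H:%M:%S`, and the four readings are assumptions for illustration, not values from the actual dataset.

```python
from collections import defaultdict
from datetime import datetime

def yearly_temperature_stats(records, fmt="%d.%m.%Y %H:%M:%S"):
    """Group (timestamp, temperature) pairs by year and summarise each year.

    `fmt` is an assumed timestamp format; adjust it to match the real file.
    """
    by_year = defaultdict(list)
    for stamp, temp in records:
        year = datetime.strptime(stamp, fmt).year
        by_year[year].append(temp)
    return {
        year: {
            "min": min(temps),
            "mean": sum(temps) / len(temps),
            "max": max(temps),
        }
        for year, temps in sorted(by_year.items())
    }

# Made-up readings (NOT the Jena data).
records = [
    ("01.01.2009 00:10:00", -20.0),
    ("01.07.2009 12:00:00", 30.0),
    ("15.01.2015 06:00:00", -10.0),
    ("15.07.2015 14:00:00", 32.0),
]
stats = yearly_temperature_stats(records)
for year, summary in stats.items():
    print(year, summary)
```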

**Predicting Model**

**Summary of the final model**

The final model has 4 layers. The first is an LSTM with 64 units and 20,224 learnable parameters. The second is an LSTM with 64 units and 33,024 learnable parameters. The third is an LSTM with 32 units and 12,416 learnable parameters. The last layer is the output layer, which is connected to every unit of the previous LSTM layer and has 33 learnable parameters. In total, the model has 65,697 parameters.
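These counts can be checked with the usual parameter formula for a standard LSTM layer (four gates, each with an input kernel, a recurrent kernel, and a bias). The sketch below assumes 14 input features, a number inferred from the 20,224 parameters of the first layer rather than stated in the text, and a single output unit. Under those assumptions the second layer works out to 33,024, which agrees with the stated total of 65,697.

```python
def lstm_params(input_dim, units):
    """Parameters of a standard LSTM layer: 4 gates, each with an input
    kernel (input_dim x units), a recurrent kernel (units x units), and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Parameters of a fully connected layer: weights plus biases."""
    return input_dim * units + units

n_features = 14  # assumption: inferred from the first layer's 20,224 parameters

layer_counts = [
    lstm_params(n_features, 64),  # first LSTM layer
    lstm_params(64, 64),          # second LSTM layer
    lstm_params(64, 32),          # third LSTM layer
    dense_params(32, 1),          # output layer with one unit
]
print(layer_counts, sum(layer_counts))
```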

This model does not use dropout. The reason is explained in the Journey of hyper-parameter and architecture section.

**Journey of hyper-parameter and architecture**

**Model 1**

In Model 1, I tried the example model from Keras to test whether it compiles; the architecture and parameter configuration are as above. After 10 epochs of training, the MAE had not changed and stayed at 0.4332. I realized that this is because the softmax function computes a probability distribution over n events, which makes it suitable for classification rather than regression. For weather prediction, where the target is continuous, we should use an activation function that can produce unbounded real-valued output, such as a linear one.
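One way to see why the MAE could not move: if the output layer has a single unit (an assumption here, based on the final model's 33-parameter output layer), softmax normalises over just that one value and always returns 1.0, whatever the weights are. The sketch below uses made-up targets and raw outputs to show that the resulting MAE is the same for any raw output.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw outputs."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mae(predictions, targets):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)

def predict_with_softmax(raw_outputs):
    """Apply softmax to each sample's single-unit output."""
    return [softmax([raw])[0] for raw in raw_outputs]

# Made-up scaled targets and two very different sets of raw network outputs.
targets = [0.2, 0.5, 0.9]
raw_a = [-3.0, 0.0, 7.5]
raw_b = [10.0, -2.0, 0.3]

print(predict_with_softmax(raw_a))  # every prediction is 1.0
mae_a = mae(predict_with_softmax(raw_a), targets)
mae_b = mae(predict_with_softmax(raw_b), targets)
print(mae_a, mae_b)  # identical, so training cannot reduce the error
```

With a linear activation, the raw output is passed through unchanged, so the error does depend on the weights and gradient descent has something to work with.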

**Model 2**

In Model 2, I changed the activation function to linear, which suits regression problems with continuous targets. The result suggested that it is a proper choice, since the MAE dropped during training. After 30 epochs, the model showed no overfitting, and the validation MAE had fallen from 0.0621 to 0.0201.

**Model 3**

In Model 3, I changed the optimizer from rmsprop to adam with the default learning rate of 0.001. Compared with Model 2, the adam optimizer is more effective than the rmsprop optimizer here. The plot shows that the MAE of Model 2 is around 0.04 at about epoch 5, whereas the MAE of Model 3 is half that (0.02) at the same epoch. Moreover, the model shows no overfitting after 30 epochs. This indicates that we can push the model to learn more by increasing its complexity.

**Model 4**

In Model 4, the model architecture was changed. The sizes of the first and second LSTM layers were doubled to 64. It turned out that increasing the complexity did not help as much as I expected. In the first 30 epochs, the MAE dropped only slightly faster than in the previous model. However, the MAE was still declining in the second 30 epochs. This implies that the model might just be bouncing around the minimum because the learning rate is too large to reach it.
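The "bouncing around the minimum" effect can be illustrated on a toy problem. The sketch below runs plain fixed-step gradient descent on f(x) = |x| with one step size and with half of it. It is not Adam and not the LSTM, and the function name and step sizes are made up for illustration. It only shows that a fixed step cannot settle closer to the minimum than the step itself allows.

```python
def descend(step_size, steps=40, start=1.0):
    """Minimise f(x) = |x| with a fixed step size.

    Returns the loss |x| after every step.
    """
    x = start
    losses = []
    for _ in range(steps):
        gradient = (x > 0) - (x < 0)  # sign of x; 0 exactly at the minimum
        x -= step_size * gradient
        losses.append(abs(x))
    return losses

large = descend(step_size=0.3)
small = descend(step_size=0.15)

# Once near the minimum, each run oscillates within a band set by its step size.
print("large step, last losses:", [round(v, 3) for v in large[-4:]])
print("small step, last losses:", [round(v, 3) for v in small[-4:]])
```

The smaller step needs more iterations to get close to the minimum, but it then oscillates in a narrower band.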

**Model 5 (Best)**

In Model 5, the learning rate was halved relative to the previous model, to 0.0005. The expectation was a slower learning curve but a lower final MAE. As a result, the MAE improved from 0.009 to 0.0084. Nevertheless, the second plot shows that the model overfits, as the train MAE and test MAE diverge.
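The divergence between the two curves can also be detected programmatically instead of by eye. The following is a minimal sketch under stated assumptions: the function name, the `patience` parameter, and both curves are invented for illustration and are not the recorded training history.

```python
def first_divergence_epoch(train_mae, val_mae, patience=3):
    """Return the first epoch (1-based) at which, since the last validation
    improvement, training MAE has improved `patience` times while validation
    MAE has not improved. Return None if that never happens."""
    best_val = float("inf")
    stalled = 0
    for epoch, (train, val) in enumerate(zip(train_mae, val_mae), start=1):
        train_improved = epoch > 1 and train < train_mae[epoch - 2]
        if val < best_val:
            best_val = val
            stalled = 0
        elif train_improved:
            stalled += 1
            if stalled >= patience:
                return epoch
    return None

# Made-up curves (NOT the recorded history): training keeps improving,
# validation bottoms out at epoch 5 and then drifts upward.
train_curve = [0.050, 0.030, 0.020, 0.015, 0.012, 0.010, 0.009, 0.008]
val_curve = [0.052, 0.033, 0.024, 0.020, 0.019, 0.020, 0.021, 0.022]

print(first_divergence_epoch(train_curve, val_curve))
```

A check like this is essentially a simplified form of early stopping: training could be halted, or the best weights restored, once the validation curve stops following the training curve.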

**Model 6**

In Model 6, since the previous model overfit, I added a dropout rate of 0.5 to the second and third LSTM layers. As a result, this model has almost the same MAE as the best model (no dropout). Moreover, it showed no improvement in the Kaggle competition.

**Result, Observations, Conclusions**

**Result**

**Observation**

There are some points from the tuning process that I would like to share.

1) The size of an LSTM layer determines its number of learnable parameters, which in turn determines the complexity of an LSTM sequential model.
2) Normally, dropout helps prevent a model from overfitting. However, in my experiment, it did not do much to reduce overfitting in this problem.
3) Tuning the learning rate is a tradeoff between time and accuracy. From Model 4 to Model 5, the learning rate was halved. Model 5 took longer to reach the same MAE but was able to reach a lower MAE.
4) A linear activation function is suitable for predicting continuous output, but softmax is not.
5) The adam optimizer seems to perform better than the rmsprop optimizer in this case. It is better to try as many optimizers as possible to find the best one for the task at hand.

**Conclusion**

The model improved slightly with each version, through a lower learning rate, larger LSTM layers, and a more suitable activation function and optimizer. It was trained for more than 200 epochs. In the first 30 epochs, the validation MAE dropped from 0.05 to 0.01. In the last 200 epochs, the validation MAE dropped from 0.01 to 0.0084.

**Reaction and Reflection**

First, constructing the architecture and tuning the hyperparameters is not easy. To cope with that, I changed the model architecture, tuned the hyperparameters, analyzed the results, and then judged whether I was on the right track toward the optimal solution. It is like dropping someone in the middle of the ocean and asking them to find the most beautiful island. Next, it is a time-consuming task. Drawing on my previous experience, I trained each model for at least 100 epochs to see the whole picture. Last, textbook problems are not real-world problems. Although we have been taught the theories and concepts for preventing overfitting, the model with dropout performed worse than the model without dropout in this real-life case. This shows that different problems may not be solved by the same solution, because the real world is too complicated. The only remedy is to practice more and gain more experience in predictive modeling. After all, data science is half art and half science, and the art part is all about experience.