Preprocessing Data

Often we can’t process data directly because the ranges of attributes differ wildly, as with height and weight. And even after we solve that, we may still have to deal with the effect of minor observation errors. In this post, I demonstrate two preprocessing techniques that cope with these problems: normalization and smoothing.



Normalization

Suppose a hospital measured the age and body-fat percentage of 18 randomly selected adults, with the following results:

The following steps walk through the preprocessing:

1 – Using visualization techniques to understand distributions

Explanation:

Comparing the two boxplots, the distribution of age is flatter and more spread out than the distribution of %fat, because its box (Q1 to Q3) is wider than %fat’s. Also, the median and mean of age are greater than the median and mean of %fat, which suggests the age distribution sits to the right of the %fat distribution. To verify this, we can check the density plots.
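As a sketch of what this step could look like in code, assuming pandas and matplotlib are available (plus scipy, which pandas uses for density plots) — the age/%fat values below are illustrative placeholders, not the actual 18-person survey data:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative placeholder values, NOT the actual survey results.
df = pd.DataFrame({
    "age": [23, 27, 39, 41, 47, 50, 52, 54, 56, 58, 60, 61],
    "fat": [9.5, 7.8, 31.4, 25.9, 27.4, 31.2, 34.6, 28.8, 33.4, 34.1, 41.2, 35.7],
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df.boxplot(ax=axes[0])               # compare box (Q1-Q3) widths and medians
df.plot(kind="density", ax=axes[1])  # verify the location and spread of each variable
axes[0].set_title("Boxplot")
axes[1].set_title("Density")
plt.show()
```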


2 – The three most common ways to normalize data

Min-max normalization

With min-max normalization, the output range can be fully customized (new maximum minus new minimum). In the process, we first shift every point by subtracting the minimum value. Then we divide each point by the original range (maximum minus minimum) to normalize the points into [0, 1]. After that, we multiply each point by the new range (new maximum minus new minimum) and, lastly, shift the data by adding the new minimum to each point. In formula form: v' = (v − min) / (max − min) × (new_max − new_min) + new_min. In this case, however, we stop after normalizing the data into [0, 1].
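A minimal NumPy sketch of the steps above; the input array is illustrative, and new_min/new_max default to the [0, 1] case used here:

```python
import numpy as np

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    values = np.asarray(values, dtype=float)
    old_min, old_max = values.min(), values.max()
    # shift to zero, scale into [0, 1], then rescale and shift to the new range
    scaled = (values - old_min) / (old_max - old_min)
    return scaled * (new_max - new_min) + new_min

print(min_max_normalize([23, 27, 39, 41, 47, 50]))  # every result falls in [0, 1]
```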

Z-score normalization

Theoretically, the range of z-score normalization is (−∞, ∞); for a given dataset it is [the lowest z-score, the highest z-score]. In this method, we first calculate the mean and standard deviation (SD) of the distribution. Second, we subtract the mean from every data point, then divide the result by the SD to derive how many SDs each point deviates from the mean: z = (v − mean) / SD. Therefore, the lowest and highest values of the derived data are the lowest and highest z-scores.
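A minimal z-score sketch with NumPy; the input array is illustrative, and np.std defaults to the population SD (ddof=0), which matches the description above:

```python
import numpy as np

def z_score_normalize(values):
    values = np.asarray(values, dtype=float)
    # each result is the number of SDs the point lies from the mean
    return (values - values.mean()) / values.std()

print(z_score_normalize([23, 27, 39, 41, 47, 50]))
```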

Normalization by decimal scaling

The range of decimal-scaling normalization is always within (−1, 1), i.e. [−1+ε, 1−ε], since we divide each value by 10^j, where j is the smallest integer such that the maximum absolute value of the derived data is less than 1: v' = v / 10^j.
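A minimal decimal-scaling sketch; the input array is illustrative, and j is found by increasing the exponent until the condition holds:

```python
import numpy as np

def decimal_scaling_normalize(values):
    values = np.asarray(values, dtype=float)
    # smallest j such that max(|v|) / 10**j < 1
    j = 0
    while np.abs(values).max() / (10 ** j) >= 1:
        j += 1
    return values / (10 ** j)

print(decimal_scaling_normalize([5, 10, 204, 215]))  # j = 3, results lie in (-1, 1)
```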


3 – Understand the relationship between variables

 

Explanation:

Based on this plot, we can first see that the relationship between these two variables is positive: as age goes up, %fat goes up. Second, the correlation looks fairly strong, because the points don’t stray far from the fitted line. It is worth mentioning, though, that the spread is wider for ages under 30, which means the model is less reliable for people under 30.
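A minimal sketch of drawing such a scatter plot with a fitted line, assuming seaborn and pandas are available; the age/%fat pairs are illustrative placeholders, not the survey data:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Illustrative placeholder values, NOT the actual survey results.
df = pd.DataFrame({
    "age": [23, 27, 39, 41, 47, 50, 52, 56],
    "fat": [9.5, 7.8, 31.4, 25.9, 27.4, 31.2, 34.6, 33.4],
})
sns.regplot(x="age", y="fat", data=df)  # scatter plus a fitted regression line
plt.show()
```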

The correlation matrix tells the same story: the value 0.82 for the age-to-fat pair indicates the relationship between them is positive (it lies in (0, 1]) and strong (its absolute value is above 0.7).
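The matrix itself can be computed with pandas’ corr(), which uses Pearson correlation by default; the data frame below is the same illustrative one as above:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [23, 27, 39, 41, 47, 50, 52, 56],
    "fat": [9.5, 7.8, 31.4, 25.9, 27.4, 31.2, 34.6, 33.4],
})
print(df.corr())  # symmetric matrix; the off-diagonal entry is the age-fat correlation
```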

 


Smoothing

Suppose a group of 12 sales price records has been sorted as follows:

[5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

We can easily tell that the distribution of this data is skewed. To reduce the effect of minor observation errors, we need to smooth it. Here I use two common ways to do so:

  1. equal-depth partitioning with 4 values per bin
  2. equal-width partitioning with 3 bins

Explanation:

  1. Equal-depth binning partitions the sorted data into bins containing an equal number of values, and the values in each bin are smoothed to that bin’s mean, median, or boundaries. In this case (smoothing by mean), the 3 derived bins have equal frequency (4 values each) and the attribute takes 3 discrete values (9.75, 38.75, 145.75). The histograms above indicate that equal-depth smoothing can reduce the influence of noisy data (outliers), but it also changes the distribution.
  2. Equal-width binning divides the range of the data by the bin count to get intervals of equal width, here (215 − 5) / 3 = 70. Each data point falling into an interval is then smoothed to that bin’s mean, median, or boundaries. In this case, smoothing by mean results in 3 discrete values (29.56, 92, 209.5), and the distribution remains almost the same. Both schemes are sketched in code after this list.
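A minimal sketch of both smoothing schemes on the 12 sorted prices, using plain NumPy; “smoothing by mean” replaces each value with the mean of its bin:

```python
import numpy as np

prices = np.array([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215], dtype=float)

# Equal-depth: 3 bins of 4 values each, taken in sorted order.
depth_bins = prices.reshape(3, 4)
smoothed_depth = np.repeat(depth_bins.mean(axis=1), 4)
print(smoothed_depth)   # 9.75, 38.75, 145.75, each repeated 4 times

# Equal-width: 3 intervals of width (215 - 5) / 3 = 70.
edges = np.linspace(prices.min(), prices.max(), 4)    # [5, 75, 145, 215]
idx = np.clip(np.digitize(prices, edges) - 1, 0, 2)   # interval index of each point
smoothed_width = np.array([prices[idx == i].mean() for i in idx])
print(smoothed_width)   # 29.56 (x9), 92 (x1), 209.5 (x2)
```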