A Peek Into Data

To illustrate how to understand your data, we use Fisher’s famous iris data set from the UCI Machine Learning Repository as an example. It consists of measurements of iris flowers, specifically the sepal length, sepal width, petal length, and petal width of 150 specimens. There are 50 specimens from each of three species: Setosa, Versicolour, and Virginica.


First of all, we always check the head of the data to get a first idea of what we are going to process.
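As a minimal sketch of that first look, the snippet below reads the header and the first few rows of a CSV file using only the Python standard library. The helper name `head` and the file name `iris.csv` are assumptions for illustration; the original post does not say how the data was loaded.

```python
import csv
from itertools import islice


def head(path, n=5):
    """Return the header row and the first n data rows of a CSV file.

    Assumes the file starts with a header line (hypothetical layout).
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)            # first line: column names
        rows = list(islice(reader, n))   # next n lines: data rows
    return header, rows
```

For example, `header, rows = head("iris.csv")` would return the column names and the first five records of a local copy of the data set, which is usually enough to check the column order and the value types.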

After that, since we want to understand our data, this article demonstrates three basic ways to do so:

1 – Using plots to see relationships


How does it work? Let’s say we want to learn from this dataset. By looking at a scatter plot, we can get a rough idea of which kinds of algorithm could work on it. For example, if we want to know how well a classifier would work on the variables sepal_length and sepal_width, the plot suggests that a classification algorithm on these two variables might not be very successful. Although it would separate the black dots, it could not reliably separate the green dots from the red dots, since those two groups overlap. In contrast, a classifier might work well on petal_length and petal_width, since the classes are clearly separated from each other.
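The overlap we read off the scatter plot can also be checked roughly in numbers. The sketch below is a crude stand-in for the plot, not a replacement: for one variable at a time, it computes each class's minimum and maximum and reports which pairs of classes have intersecting ranges. The function names and the `species` column name are assumptions; rows are expected as dictionaries, such as those produced by `csv.DictReader`.

```python
from collections import defaultdict


def class_ranges(rows, feature, label="species"):
    """Map each class to the (min, max) of one feature.

    rows: iterable of dicts; `label` is an assumed column name.
    """
    values = defaultdict(list)
    for row in rows:
        values[row[label]].append(float(row[feature]))
    return {cls: (min(v), max(v)) for cls, v in values.items()}


def overlapping_pairs(ranges):
    """Return the pairs of classes whose [min, max] intervals intersect."""
    classes = sorted(ranges)
    pairs = []
    for i, a in enumerate(classes):
        for b in classes[i + 1:]:
            lo_a, hi_a = ranges[a]
            lo_b, hi_b = ranges[b]
            if lo_a <= hi_b and lo_b <= hi_a:
                pairs.append((a, b))
    return pairs
```

A pair that shows up for every variable is likely to be hard to separate, while a class that never appears in any pair is separable with a single threshold. Range overlap is sensitive to single extreme points, so the scatter plot remains the better guide.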

2 – Observing attributes’ distribution


From the sepal_length and sepal_width diagrams, we can see from their shapes that the distributions are roughly symmetric and close to normal. The shapes also show that the spread (standard deviation) of sepal_length is greater than that of sepal_width. The petal_length and petal_width diagrams, on the other hand, show two noticeably bimodal distributions.
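To make these observations reproducible without a plotting library, here is a small sketch that bins a variable into equal-width intervals, draws the counts as a text histogram, and leaves the spread comparison to `statistics.stdev`. The helper names and the default of 10 bins are assumptions, not the settings behind the diagrams in this post.

```python
import statistics


def histogram(values, bins=10):
    """Count values in equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0   # fall back to 1.0 if all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value would land one past the end, so clamp it
        # into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


def text_histogram(values, bins=10):
    """Return one line of '#' marks per bin, for a quick look at the shape."""
    return ["#" * count for count in histogram(values, bins)]


def spread(values):
    """Sample standard deviation, used to compare how wide two variables are."""
    return statistics.stdev(values)
```

Two separate clusters of long bars with short or empty bars between them hint at a bimodal variable, and comparing `spread(a)` with `spread(b)` puts a number on which variable deviates more.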

3 – Outlier detection


A box plot is a great tool for detecting outliers, but sometimes the data can deceive us. For example, when we check the sepal_length variable overall, there is no outlier. However, if we examine sepal_length by class, we find an outlier in the virginica class. Thus, before analyzing the data, we might want to take a closer look at this data point to see whether there is anything abnormal in the dataset. Next, examining petal_length the same way, we find no outliers overall, but we spot two outliers by class: one in the setosa class and the other in the versicolor class.
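The difference between the overall check and the per-class check can be sketched with the usual box plot rule, which flags points more than 1.5 times the interquartile range beyond the quartiles. The function names below are assumptions, and the quartile method (`inclusive`) is one of several conventions, so the flagged points may differ slightly from what a particular plotting tool shows.

```python
import statistics
from collections import defaultdict


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]


def outliers_by_group(pairs, k=1.5):
    """Apply the same rule within each group.

    pairs: iterable of (label, value) tuples.
    """
    groups = defaultdict(list)
    for label, value in pairs:
        groups[label].append(value)
    return {label: iqr_outliers(vals, k) for label, vals in groups.items()}


# Made-up numbers (not iris measurements) that reproduce the effect:
# 3.0 is unremarkable overall but far from the rest of group "a".
toy = [("a", v) for v in (1.0, 1.1, 1.2, 1.3, 1.4, 3.0)]
toy += [("b", v) for v in (2.0, 3.0, 4.0, 5.0, 6.0, 7.0)]
overall = iqr_outliers([v for _, v in toy])
per_group = outliers_by_group(toy)
```

In the toy data, `overall` is empty while `per_group["a"]` contains 3.0, the same pattern as the one described above: a point can look ordinary against the whole variable and still be unusual within its own class.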