In data science, database plays a big role since so many time it is where we extract the data. Besides the SQL queries to databases, knowing how to build a database is quite important. In this article, I will demonstrate how to build a database from zero to one including functional dependency, normalization, entity-relationship diagram, physical model, and lastly play around with the database a little bit with a few queries.
In this demonstration, I am gonna use three made-up invoices to create a transaction database.
One Big Spread Sheet
Once we have all data from the invoices, we could put every value in a big spread sheet. This could help us to comb through the relationship between entities.
From the spread sheet, we have to identify columns which are derived values. This is important because no database stores redundancy. Take the last columns above for example, the last column is “Extended Amount” which is the calculation of “Del Qty” * (“Retail Price“- “Discount“). As long as we have “Del Qty“, “Retail Price” and “Discount“, “Extended Amount” is derivable. Therefore, in the database, we don’t need the last column.
Functional Dependencies and Normalization
Since we have the big spread sheet, we could use F.D.s to split the sheet to make sure all split tables are in 3NF.
1NF – atomicity
To be in 1NF, we have to make every cell in the sheet atomic.
2NF – no partial dependency
Partial dependency means a relation has attributes determined by part of one candidate key. The easiest way to split the table eliminating partial dependency is to ask ourselves, for each attribute, if we know a certain primary key, can we determine this attribute. For example, if we know customer ID, we are able to know customer name, address, phone, etc.
While splitting, we have to leave connections between tables. How to leave connections depends on its cardinalities. In short, if two tables have 1-to-1 relationship, we could arbitrary leave one primary key to the other table. If two tables have 1-to-many relationship, the primary key of lower cardinality goes to the table with higher cardinaltiy. If two tables have many-to-many relationship, we take the primary keys from both tables to create a relational table as the connection.
By this fashion, the big table will be split like: (Orange for PK, grey for FK)
3NF – no transitive dependency
Transitive dependency refers to a situation like A determines C because A determines B, and B determines C. By the same splitting fashion of cardinality, the tables are split like:
Logical Model and Physical Model
After splitting the big table in to tables in 3NF and figuring out its relationships, we could use Oracle SQL Developer to draw a logical model.
With setting the domains of attributes, we could right click the logical model to engineer a physical model in Oracle SQL Developer Browser.
Once we have the physical model, the DDL could be generate to create the database by the DDL preview functionality in Oracle. The last step is to populate data from the split tables.
- How many parts did Jim Shoe purchase in November, 2006?
- What was the total amount of all purchases by Speed Racer?
Building a relational database is about following steps. First, gather the data from a real-life source. It could be invoices, surveys or information from consulting. Second, put all information in a big sheet and split the sheet by using normalization rules. Third, create the logical model and physical model based on the connection left in previous step. Last, execute the DDL derived from the physical model, populate the data, and test your database.
It is worth to mention that even though we might have some idea of how goal looks like, we should constantly consult clients to make sure we, as developer, are on the right direction to develop the right product.