Now that we covered the very basics of database concepts and the terminology (if you’ve read part 1 of this series), we need to speak about getting your data in order…
I’ve talked previously about data anomalies, which can occur while inserting, updating, or deleting data in tables. To prevent this, relations that could introduce anomalies need to be split into smaller, well-structured relations.
This process is called normalisation: If a relation is normalised, rows can be inserted, deleted, or modified without creating anomalies.
Normalisation principles
In general, a well-formed relation will not contain more than one theme (e.g. Employee
does not contain information about Department
).
- To be a well-formed relation, every determinant must be a candidate key.
- Any not well-formed relation should be broken into two or more well-formed one.
Normalisation steps
There are several so called Normal Forms (a measurement of how well-formed relations are) defined but 3NF is sufficient for the majority of business cases:
Table with multivalued attributes
↓ Remove multivalued attributes
First Normal Form (1NF)
↓ Remove partial dependencies
Second Normal Form (2NF)
↓ Remove transitive dependencies
Third Normal Form (3NF)
First Normal Form (1NF)
A table does not contain multivalued attributes (more than one value in one cell, e.g. 101; 207
). Each row of a given column holds only a single value. This is called an atomic value since it can’t be split (from the Greek word atomos for uncuttable).
When we look at this example for non-1NF table…
EmployeeName | Department |
---|---|
Eric, Olivia | Hospitation |
…we see that Eric, Olivia
is a multivalued attribute which needs to be split into separate rows in order to be 1NF-conform:
EmployeeName | Department |
---|---|
Eric | Hospitation |
Olivia | Hospitation |
Second Normal Form (2NF)
- 1NF, and
- every non-key attribute is fully functionally dependent (see Part 1 for a refresher) on the entire primary key.
Let’s look at our Employee example again:
EmployeeId | Name | City | PostalCode | Department | DepartmentHead |
---|---|---|---|---|---|
101 | Eric | Cologne | 50672 | Hospitation | Olivia |
This table is representing information about two different themes: the employee and the department. It must therefore be normalized into 2NF by splitting Department
and DepartmentHead
into a second table that holds only information about the Department, leaving its primary key DepartmentId
as foreign key in the Employee
table, as we did in the previous post:
Table Employee
:
EmployeeId | Name | City | PostalCode | DepartmentId |
---|---|---|---|---|
101 | Eric | Cologne | 50672 | 1 |
102 | Daniela | Berlin | 10178 | 2 |
103 | Thomas | Cologne | 50667 | 2 |
104 | Olivia | Munich | 80638 | 1 |
Table Department
:
DepartmentId | DepartmentName | DepartmentHead |
---|---|---|
1 | Hospitation | Olivia |
2 | Engineering | Thomas |
Third Normal Form (3NF)
- 1NF, and
- 2NF, and
- no transitive dependencies (i.e. no functional dependencies on non-primary key attributes).
When a none-prime attribute depends on other non-prime attributes instead of depending on the prime attributes / primary key, a _transitive dependency- exists. Or in simpler words: an attribute within an entity depends on a non-prime attribute, which in turn depends on the primary key.
In our Employee
example, City
is transitive dependent on PostalCode
, which in turn depends on the primary key EmployeeId
:
EmployeeId | Name | City | PostalCode | DepartmentId |
---|---|---|---|---|
101 | Eric | Cologne | 50672 | 1 |
Let’s move City
and its PostalCode
to a new table, where the latter becomes the primary key. PostalCode
remains in the Employee
table as foreign key:
Table Employee
:
EmployeeId | Name | PostalCode | DepartmentId |
---|---|---|---|
101 | Eric | 50672 | 1 |
102 | Daniela | 10178 | 2 |
103 | Thomas | 50667 | 2 |
104 | Olivia | 80638 | 1 |
Table Department
:
DepartmentId | DepartmentName | DepartmentHead |
---|---|---|
1 | Hospitation | Olivia |
2 | Engineering | Thomas |
Table City
:
PostalCode | City |
---|---|
10178 | Berlin |
50672 | Cologne |
80638 | Munich |
General strategy
By breaking partial and transitive dependencies up into smaller relations, we can move from a lower Normal Form to a higher one.
But sometimes you might want or even need to do the opposite:
Denormalisation
Normalising relations (the process of breaking them apart into many component relations) may significantly increase the complexity of your data structure and denormalisation can be a good idea if you’re working with huge sets of data.
The benefits of denormalisation are:
- Simplicity
- Improved performance when querying the database
Of course, there are also costs of denormalisation:
- Risk of modification anomalies
- Potentially redundant data
- More storage space is required
In general though, normalising your data is an essential tool to avoid anomalies, and the need to denormalise might never arise in your project.
Grab a pen and some paper…
The process of data normalisation was already a hard topic to grasp. So for the next part, let’s do some drawing in:
Up next: Database Basics, Part 3: Data Models and Database Design
Previous topic:
Database Basics, Part 1: What are Relational Databases?
General credits and further readings: