As a software developer trainee, I did have an okay grasp of databases. But then enter Ruby on Rails’s Active Record and suddenly there seems to be quite a lot of magic and the mystery of “relational databases.” It seemed to me that the concept of relational databases was something that was inherent to Active Record, and I wanted to understand how to use them. My instructor pointed out that not Active Records but a type of database provides that functionality, namely relational databases. And to understand them, he recommended to start at the beginning: databases.
Welcome to Databases Basics
In the following blog posts I’m going to give you a basic overview of the concepts of non-relational and primarily relational databases.
If you know a bit about programming, maybe are a trainee, have ever seen a spreadsheet (the ones you find in Excel & Co), and have perhaps dabbled with a database before, (and are interested in relational databases, of course) then the next blog posts are for you.
This series is not meant to be the one-stop-shop source of all knowledge about databases but when you’ve finished reading it, I hope to have been able to shed some light into the darkness you might hold as a beginner surrounding the topic of databases, and you will be better prepared to vanquish the rest of those shadows yourself.
What is the purpose of a database?
It stores data, provides an organizational structure, and mechanisms to retrieve (query) and manipulate the data.
Though there are several types of databases, in the context of this article I want to distinguish between two types: Relational and non-relational databases.
Non-relational databases
These are essentially lists as you might write them on a piece of paper.
- Each row contains all the information needed to stand on its own.
- This may lead to redundancy: information that is present in multiple rows (e.g. a customer’s name and address when they made multiple purchases in a online shop).
- Lists tend to contain information on more than one theme or business concept (e.g. an employee’s home address, their department’s manager’s phone number, their current project’s budget, etc.).
- Redundancy and multiple themes in a single list might eventually lead to modification problems:
- Insertion anomalies: information in one row might contradict information in another row (e.g. when a customer moves but only enters the current address in a new order and not by changing it in their account data).
- Update anomalies: information that is present in multiple rows gets updated in only one row (e.g. when the phone number shared by multiple employees changes).
- Deletion anomalies: a row’s data gets deleted but contains attributes that shouldn’t be deleted (e.g. an employee leaves the company, the corresponding row gets deleted, but since they were the only one working on that project, information about the project is lost).
Example for a non-relational table
Entity Employee
:
EmployeeId | Name | City | PostalCode | Department | DepartmentHead |
---|---|---|---|---|---|
101 | Eric | Cologne | 50672 | Hospitation | Olivia |
102 | Daniela | Berlin | 10178 | Engineering | Thomas |
103 | Thomas | Cologne | 50667 | Engineering | Thomas |
104 | Olivia | Munich | 80638 | Hospitation | Olivia |
Relational databases
These are built around the concept of informational themes or business concepts (e.g. Employee, Product, ConcertTicket).
- A conventional list is being broken-up into several parts. Each informational theme is stored in its own table. E.g. a “Project list” might be divided into a “Customer”, a “Project”, and a “ProjectManager” table.
- Each row of a table represents an instance of the entity. (Bear with me, we’ll come to explanations of terms very shortly.)
- Each column of a table in a relational database therefore represents an attribute of an entity.
Example for relational tables
Entity Employee
:
EmployeeId | Name | City | PostalCode | DepartmentId |
---|---|---|---|---|
101 | Eric | Cologne | 50672 | 1 |
102 | Daniela | Berlin | 10178 | 2 |
103 | Thomas | Cologne | 50667 | 2 |
104 | Olivia | Munich | 80638 | 1 |
Entity Department
:
DepartmentId | DepartmentName | DepartmentHead |
---|---|---|
1 | Hospitation | Olivia |
2 | Engineering | Thomas |
So instead of having a column (an attribute) called Department
filled with the actual name of the department for each instance of the entity Employee
, we reference the DepartmentId
.
- When we query (retrieve) the data for Daniela, both tables will be joined together using matched pairs of data values. This works like a look-up.
The result of such a query would look like this:
EmployeeId | Name | City | PostalCode | Department | DepartmentHead |
---|---|---|---|---|---|
102 | Daniela | Berlin | 10178 | Engineering | Thomas |
One of the big advantages of relational tables is that we could easily rename the department “Hospitation” to “Gastronomy” by change the single entity 1
of table Department
instead of multiple Employee
entities.
Why use something that’s more complex?
Although relational tables come with more complexity than plain lists, as seen above, they have some clear advantages over lists:
- Data redundancy is minimized,
- complex relationships among topics can be preserved, and
- they allow partial data entry (
null
values).
For these reasons, I’m going to focus on relational databases in this series.
Database terms explained
While we have established the basic database concepts, you might have come across some unfamiliar terms. Sit down and let me explain…
Database
A database is a self-describing collection of records (in the case of a relational database: of related records). This means that the database itself contains the definition of its structure (which tables there are, what data types the columns hold, etc.).
The data describing the structure is called metadata.
Entity
An entity represents a single thematical unit within a database (e.g. “Employee” or “Product”), which has multiple attributes (e.g. “Email address” or “Product description”) to describe it. Think of it as a building blueprint or a cake recipe.
e.g. “An employee has a name and an email address.”
Instance of an entity
An instance on an entity is a specific occurrence of an entity. You can think of it as one specific house built according to the blueprint or one baked cake following the recipe.
e.g. “John Smith with the email address ‘john@smith.com’” is an instance of the entity “Employee”.
Relation
A two-dimensional table (having rows and columns) is called a relation if it has the following characteristics:
- A row contains data about a single instance of an entity (e.g. a single employee)
- A column contains data about attributes of that entity (e.g only the email address of the employee, nothing else)
- A cell holds a single value (not multiple separated by space, comma, semicolon, etc.)
- Each column has a unique name
- All values in one column are of the same data type (e.g. integers or strings)
- Each row’s data is unique (no identical row entries)
Database systems components
User (can also be an application)
↕
Database application (the software helping the user interact with the database)
↕
Database Management System
↕
Database
Database Management System (DBMS)
This is the intermediary between the database (the table data itself, if you will) and the database application querying the data.
- It manages and controls the DB activities
- Creates, processes and administers the DB it controls
- Writes, reads and modifies the DB’s data
- Enforces rules and constraints
- Controls concurrency
- Provides security
- Performs data backup and recovery
Structured Query Language (SQL)
SQL is an international standard language for creating, processing, and querying databases. It is implemented by the majority of database and has been invented in the 1970’s by IBM.[^1]
SQL is not a programming language but rather a data sub-language that is used to
- define and manage database structures,
- create user accounts, manage their permissions, etc., and
- manipulate, update and retrieve data (query).
I’m not going to cover how to use SQL in this series. There are many great tutorials out there, like the one on TutorialRepublic.
Key
A key is one (or more) columns of a relation whose values are used to identify a row.
A unique key is a value that’s unique for each row (e.g. EmployeeId
).
A non-unique key is a value that may be shared among several rows and identifies a set of rows (e.g. DepartmentId
)
There are several types of keys:
Candidate key
The candidate key is a unique key that has the potential to become the primary key, since it could identify (determine) the value for the other fields in the row.
Primary key (PK)
The primary key is a candidate key which is chosen to be the main key for the relation.
If you know the PK (e.g. EmployeeId
), you are able to uniquely identify a single row within a table.
Composite key
The composite key is composed of two or more attributes.
For a key to become unique, it must often become a composite key (e.g. train “WI2031” and the date and time).
Surrogate key
A surrogate key is a unique, numeric value that is added to a relation to serve as the primary key. It often gets used in place of a composite primary key, and is usually hidden in queries since it has no meaning to users.
Foreign key (FK)
A foreign key is a primary key from a table (A) that is place into another table (B). It is called a foreign key in the table that receives it (B).
Referential integrity
Along with the concept of foreign keys comes the concept of referential integrity, which states that every value of a foreign key must match a value of an existing primary key. Otherwise there’s no reference.
null
values
A null
value means that for a specific attribute no data exists: null
is ambiguous in that a column’s data is unknown, hasn’t been determined yet, or is not appropriate for the specific row. Depending on its use, it may have entirely different implications.
Determinant and functional dependency
An attribute that can be used to determine values that are assigned to other attributes in the same row is called a determinant. All primary or candidate keys are determinants - but not all determinants are primary or candidate keys.
In our Employee example, EmployeeId
determines the value of the other fields: there could be multiple employees with the same Name
in the company, DepartmentId
is not unique to the employee, and the Email
might be empty.
EmployeeId
is a determinant, candidate key, and a primary key at the same time:
- It is a candidate key because when we search the database for
104
, only Olivia’s data is retrieved. Another candidate key would be the combination ofName
andDepartmentId
but only until the company hires an employee with the same name for that department. - It is the primary key because it is the easiest column to use as the primary reference to this table of all possible candidate keys. Since the every employee gets another ID,
EmployeeId
is guaranteed to be unique to this table.
A candidate / primary key functionally determines all other non-key attributes in a row (e.g. EmployeeId
can be used to retrieve Name
and Email
).
Intermission ahead!
Now that you’re equipped with the necessary vocabulary, it’s time to take a short breath, drink a cup of tea and then head on over to the topic of…
Database Basics, Part 2: Data Normalisation
General credits and further readings:
Adrienne Watt’s “Database Design - 2nd Edition” book
Dr. Daniel Soper’s “Database Lessons” videos
[^1] based on the paper A Relational Model of Data for Large Shared Data Banks by Edgar F. Codd from 1970