Web applications can be split into two main parts:
Since a web application back-end is concerned with storing and retrieving data, often for many thousands or even millions of users, it is natural that a database is one of the major back-end components.
In this article, I’ll explain what a relational database is, and how to properly design your database to store your app’s information. I’ll also explain what database normalization entails with practical examples.
A database stores data in an organised way so that it can be searched and retrieved later. The database is made up of one or more tables. A table is much like a spreadsheet, in that it’s made up of rows and columns. All rows have the same columns, and each column contains the data itself. If it helps, think of a table as being like a spreadsheet in Excel or Google Sheets.
Data can be inserted, retrieved, updated, and deleted from a table. The word, created, is generally used instead of inserted, so, collectively, these four functions are affectionately abbreviated as CRUD.
A relational database is a type of database that organises data into tables, and creates links between these tables, based on defined relationships. These relationships enable you to retrieve and combine data from one or more tables with a single query.
But to truly understand a relational database, you need to make one yourself! Let’s get started by getting some real data with which we can work.
Let’s start with some Twitter data. I searched Twitter for #databases and took the following sample of ten tweets:
Here’s what each column name means:
This is all real data; you can search Twitter and actually find these tweets.
This is good. The data is all in one place; so it’s easy to find, right? Not exactly. There are a couple problems with this table.
Duplicates are problematic because it makes the CRUD operations more challenging. For example, it would take longer time to retrieve data because time would be wasted going through duplicate rows. Also, updating data would be an issue; if a user changes their Twitter handle, we would need to find every duplicate and update it.
Repetitive data is a problem. We can fix this problem by splitting Table 1 into separate tables. Let’s proceed with first resolving the repetition across columns issue.
As noted above, the
following_username columns in Table 1 are repetitive. This repetition occurred because I was trying to express the follow relationship between users. Let’s improve on Table 1‘s design by splitting it up into two tables: one just for the following relationships and one for the rest of the information.
Because @Brett_Englebert follows @RealSkipBayless, the
following table will express that relationship by storing @Brett_Englebert as the
from_user and @RealSkipBayless as the
to_user. Let’s go ahead and split table 1 into these two tables:
Table 2: The
Table 3: The
This is looking better. Now in the
users table, there is only one column with Twitter handles. In the
following table, the is only one Twitter handle per field in the
Edgar F. Codd, the computer scientist who laid down the theoretical basis of relational databases, called this step of removing repetitive data across columns the first normal form (1NF) in database normalization.
Database normalization is a database design method that avoids data duplication and gets rid of undesired aspects like insertion, update, and deletion Anomalies. Using relationships, database normalization rules break up larger tables into smaller ones. Normalization serves the dual purposes of removing unnecessary (repetitive) data and ensuring logical data storage.
In the first step of removing repetitive data across columns, we employed the first normal form(1NF)
When all attributes in a relation is a single valued attribute, has a unique name for each attribute or column, then we can say that relation is in its first normal form.
Now that we have fixed repetitions across columns, we need to fix repetitions across rows. Since the users @AndyRyder5 and @Brett_Englebert each tweeted twice, their information is duplicated in the
users table. This indicates that we need to pull out the tweets and place them in their own table.
As before, “text” stores the tweet itself. Since the “created_at” column stores the timestamp of the tweet, it makes sense to pull it into this table as well. I also include a reference to the “username” column so we know who published the tweet. Here is the result of placing the tweets in their own table:
Table 4: The tweets table
Table 5: The users table
After the split, the
users table has unique rows for users and their Twitter handles.
Edgar F. Codd called this step of removing repetitive data across rows the second normal form (2NF).
The next stage of normalizing a database is called second normal form (2NF). 2NF advances the first normal form (1NF). The 2NF makes sure that every piece of information with a many-to-many link is sorted and stored in a separate table.
A table must be in the first normal form and also free of partial dependencies in order to be in the Second Normal form.
When an attribute in a table depends solely on a portion of a composite primary key rather than the entire primary key, this is known as partial dependence.
We can separate the table, remove the property that is generating partial dependency (
text), and relocate it to another table (
tweets) where it will fit in nicely to eliminate partial dependency.
Data can be inserted, retrieved, updated, and deleted from a table.
So far, table 1 has been split into three new tables:
users. But how is this useful? Repetitive data has been removed, but now the data is spread out across three independent tables. In order to retrieve the data, we need to draw meaningful links between the tables. This way we can express queries like, “what a user has tweeted and who a user is following”.
The way to draw links between tables is to first give each row in a table a unique identifier, termed a primary key, and then reference that primary key in the other table to which you want to link.
We’ve actually already done this in
tweets. In the
users table, the primary key is the
username column because no two users will have the same Twitter handle. In
tweets, we reference this key in the
username column so we know who tweeted what. Since it is a reference, the
username column in
tweets is called a foreign key. In this way, the
username key links the
tweets tables together.
But, is the
username column the best idea for a primary key for the users table?
On one hand, it’s a natural key — it makes sense to search using a Twitter handle instead of assigning each user some numerical ID and searching on that. On the other hand, what if a user wants to change his Twitter handle? That could cause errors if the primary key and all referencing foreign keys aren’t updated accurately, errors that could be avoided if a constant numerical ID was used. Ultimately the choice depends on your system. If you want to give your users the ability to change their username, it’s better to add a numerical auto-incrementing
id column to
users and use that as the primary key. Otherwise,
username should do just fine. I’ll continue to use
username as the primary key for
Let’s move on to
tweets. A primary key should uniquely identify each row, so what should the primary key be here? The
created_at field won’t work because if two users tweet at the exact same time, their tweets would have an identical timestamp. The
text has the same problem in that if two users both tweet “Hello world,” we couldn’t distinguish between the rows. The
username column is the foreign key that defines the link with the
users so let’s not mess with that. Since the other columns are not good candidates, it makes sense here to add a numerical auto-incrementing
id column and use that as the primary key.
Table 6: The tweets table with an
Finally, let’s add a primary key to the
following table. In this table, neither the
from_user column nor the
to_user column uniquely identifies each row on its own. However
to_user together do, since they represent a single relationship. A primary key can be defined on more than one column, so we’ll use both these columns as the primary key for the
As for the foreign key,
to_user are each foreign keys since they can be used to define a link with the users table. If we query for a Twitter handle on the
from_user column, we’ll get all the users he follows. Correspondingly, if we query for a Twitter handle on the
to_user column, we’ll get all the users following him.
We’ve accomplished a lot so far. We removed repetitions across columns and rows by separating data into three different tables, and then we chose meaningful primary keys to link the tables together. This entire process is called normalization and its output a data that is clean and organized according to the relational model. The consequence of this organization is that rows will appear in the database only once moving forward, which in turn make the CRUD operations easier.
The figure below diagrams the finalized database schema. The three tables are linked together and the the primary keys are highlighted.
SQLite is common in embedded systems. iOS and Android let developers use SQLite to manage their app’s private database. Google Chrome uses SQLite to store your browsing history, cookies, and your thumbnails on the “Most visited” page.
Once you’ve downloaded and set up an RDBMS on your system, the next step is to create a database and tables inside of it in order to insert and manage your relational data. The way you do this is with Structured Query Language (SQL), which is the standard language for working with RDBMSs.
Here’s a brief overview of common SQL statements that are relevant to the example Twitter data above.
SQL is pretty similar to regular English sentences. There are small variations in SQL between each RDBMS vendor, termed SQL dialects, but the differences are not dramatic enough that you can’t easily transfer your SQL knowledge from one to the other.
In this article, we learned how to design a relational database. We took a collection of data and organised it into related tables. We learnt about normalizing database with the first and second normal forms.
We also briefly looked at RDBMS solutions and SQL. So get started by downloading an RDBMS and normalizing some of your data into a relational database today.
Preview Image Source: FindIcons.com/Barry Mieny