Web applications can be split into two main parts:

Since a web application back-end is concerned with storing and retrieving data, often for many thousands or even millions of users, it is natural that a database is one of the major back-end components.

In this article, I’ll explain what a relational database is, and how to properly design your database to store your app’s information. I’ll also explain what database normalization entails with practical examples.

A database stores data in an organised way so that it can be searched and retrieved later. The database is made up of one or more tables. A table is much like a spreadsheet, in that it’s made up of rows and columns. All rows have the same columns, and each column contains the data itself. If it helps, think of a table as being like a spreadsheet in Excel or Google Sheets.

Data can be inserted, retrieved, updated, and deleted from a table. The word, created, is generally used instead of inserted, so, collectively, these four functions are affectionately abbreviated as CRUD.

A relational database is a type of database that organises data into tables, and creates links between these tables, based on defined relationships. These relationships enable you to retrieve and combine data from one or more tables with a single query.

But to truly understand a relational database, you need to make one yourself! Let’s get started by getting some real data with which we can work.

Let’s start with some Twitter data. I searched Twitter for #databases and took the following sample of ten tweets:

Table 1

Here’s what each column name means:

This is all real data; you can search Twitter and actually find these tweets.

This is good. The data is all in one place; so it’s easy to find, right? Not exactly. There are a couple problems with this table.

Duplicates are problematic because it makes the CRUD operations more challenging. For example, it would take longer time to retrieve data because time would be wasted going through duplicate rows. Also, updating data would be an issue; if a user changes their Twitter handle, we would need to find every duplicate and update it.

Repetitive data is a problem. We can fix this problem by splitting Table 1 into separate tables. Let’s proceed with first resolving the repetition across columns issue.

As noted above, the username and following_username columns in Table 1 are repetitive. This repetition occurred because I was trying to express the follow relationship between users. Let’s improve on Table 1‘s design by splitting it up into two tables: one just for the following relationships and one for the rest of the information.

Because @Brett_Englebert follows @RealSkipBayless, the following table will express that relationship by storing @Brett_Englebert as the from_user and @RealSkipBayless as the to_user. Let’s go ahead and split table 1 into these two tables:

Table 2: The following table

Table 3: The users table

This is looking better. Now in the users table, there is only one column with Twitter handles. In the following table, the is only one Twitter handle per field in the to_user column.

Edgar F. Codd, the computer scientist who laid down the theoretical basis of relational databases, called this step of removing repetitive data across columns the first normal form (1NF) in database normalization. 

Database normalization is a database design method that avoids data duplication and gets rid of undesired aspects like insertion, update, and deletion Anomalies. Using relationships, database normalization rules break up larger tables into smaller ones. Normalization serves the dual purposes of removing unnecessary (repetitive) data and ensuring logical data storage.

In the first step of removing repetitive data across columns, we employed the first normal form(1NF)

When all attributes in a relation is a single valued attribute, has a unique name for each attribute or column, then we can say that relation is in its first normal form. 

Now that we have fixed repetitions across columns, we need to fix repetitions across rows. Since the users @AndyRyder5 and @Brett_Englebert each tweeted twice, their information is duplicated in the users table. This indicates that we need to pull out the tweets and place them in their own table.

As before, “text” stores the tweet itself. Since the “created_at” column stores the timestamp of the tweet, it makes sense to pull it into this table as well. I also include a reference to the “username” column so we know who published the tweet. Here is the result of placing the tweets in their own table:

Table 4: The tweets table

Table 5: The users table

After the split, the users table has unique rows for users and their Twitter handles.

Edgar F. Codd called this step of removing repetitive data across rows the second normal form (2NF).

The next stage of normalizing a database is called second normal form (2NF). 2NF advances the first normal form (1NF). The 2NF makes sure that every piece of information with a many-to-many link is sorted and stored in a separate table.

A table must be in the first normal form and also free of partial dependencies in order to be in the Second Normal form.

When an attribute in a table depends solely on a portion of a composite primary key rather than the entire primary key, this is known as partial dependence.

We can separate the table, remove the property that is generating partial dependency (text), and relocate it to another table (tweets) where it will fit in nicely to eliminate partial dependency.

Data can be inserted, retrieved, updated, and deleted from a table.

So far, table 1 has been split into three new tables: following), tweets, and users. But how is this useful? Repetitive data has been removed, but now the data is spread out across three independent tables. In order to retrieve the data, we need to draw meaningful links between the tables. This way we can express queries like, “what a user has tweeted and who a user is following”.

The way to draw links between tables is to first give each row in a table a unique identifier, termed a primary key, and then reference that primary key in the other table to which you want to link.

We’ve actually already done this in users and tweets. In the users table, the primary key is the username column because no two users will have the same Twitter handle. In tweets, we reference this key in the username column so we know who tweeted what. Since it is a reference, the username column in tweets is called a foreign key. In this way, the username key links the users and tweets tables together.

But, is the username column the best idea for a primary key for the users table?

On one hand, it’s a natural key — it makes sense to search using a Twitter handle instead of assigning each user some numerical ID and searching on that. On the other hand, what if a user wants to change his Twitter handle? That could cause errors if the primary key and all referencing foreign keys aren’t updated accurately, errors that could be avoided if a constant numerical ID was used. Ultimately the choice depends on your system. If you want to give your users the ability to change their username, it’s better to add a numerical auto-incrementing id column to users and use that as the primary key. Otherwise, username should do just fine. I’ll continue to use username as the primary key for users

Let’s move on to tweets. A primary key should uniquely identify each row, so what should the primary key be here? The created_at field won’t work because if two users tweet at the exact same time, their tweets would have an identical timestamp. The text has the same problem in that if two users both tweet “Hello world,” we couldn’t distinguish between the rows. The username column is the foreign key that defines the link with the users so let’s not mess with that. Since the other columns are not good candidates, it makes sense here to add a numerical auto-incrementing id column and use that as the primary key.

Table 6: The tweets table with an id column

Finally, let’s add a primary key to the following table. In this table, neither the from_user column nor the to_user column uniquely identifies each row on its own. However from_user and to_user together do, since they represent a single relationship. A primary key can be defined on more than one column, so we’ll use both these columns as the primary key for the following table.

As for the foreign key, from_user and to_user are each foreign keys since they can be used to define a link with the users table. If we query for a Twitter handle on the from_user column, we’ll get all the users he follows. Correspondingly, if we query for a Twitter handle on the to_user column, we’ll get all the users following him.

We’ve accomplished a lot so far. We removed repetitions across columns and rows by separating data into three different tables, and then we chose meaningful primary keys to link the tables together. This entire process is called normalization and its output a data that is clean and organized according to the relational model. The consequence of this organization is that rows will appear in the database only once moving forward, which in turn make the CRUD operations easier.

The figure below diagrams the finalized database schema. The three tables are linked together and the the primary keys are highlighted.

SQLite is common in embedded systems. iOS and Android let developers use SQLite to manage their app’s private database. Google Chrome uses SQLite to store your browsing history, cookies, and your thumbnails on the “Most visited” page.

Once you’ve downloaded and set up an RDBMS on your system, the next step is to create a database and tables inside of it in order to insert and manage your relational data. The way you do this is with Structured Query Language (SQL), which is the standard language for working with RDBMSs.

Here’s a brief overview of common SQL statements that are relevant to the example Twitter data above. 

SQL is pretty similar to regular English sentences. There are small variations in SQL between each RDBMS vendor, termed SQL dialects, but the differences are not dramatic enough that you can’t easily transfer your SQL knowledge from one to the other.

In this article, we learned how to design a relational database. We took a collection of data and organised it into related tables. We learnt about normalizing database with the first and second normal forms.

We also briefly looked at RDBMS solutions and SQL. So get started by downloading an RDBMS and normalizing some of your data into a relational database today.

Preview Image Source: FindIcons.com/Barry Mieny

source