A relational database stores data in multiple tables. Each table has columns describing the type of information, and each row is an instance of that data. Codd’s 12 Rules apply to relational databases, describing how you build them and use them. In addition to those rules, there are additional normalizing ‘forms’ that apply to data storage and queries on that data, and make access to it more efficient and less-error prone. We’ll look at those shortly.
Why is Normalization Important?
Without normalizing, data is likely to be difficult to retrieve (what if there are two different values?), expensive to fetch in terms of computing resources (what if there are ten values stored in a single column?) and difficult to update without causing inconsistencies. The data may have dependencies, and updates may only affect part of it, not all. You don’t create relational databases without normalization.
Normalizing the data means making it conform to Codd’s ‘Forms,’ or at least some of them (typically the first three). One way it does this is by eliminating redundant information. Let’s look closer at normalizing.
An Example of Normalization
An FAA-like organization is collecting statistics on flights. We start with a spreadsheet of flights collected from airports. It’s produced every day, and for every flight there is one row per passenger.
Each row holds this information:
1. Flight ID, e.g., ABC123
2. Take-off Airport short code (e.g., LHR (London Heathrow)) and address.
3. Time and date of departure.
4. Destination airport. Short code and address.
5. Aircraft ID and model, details, airline name etc.
6. Passenger details.
If there are 20 flights from an airport in a day, then that airport address is stored 20 times the number of passengers in the table. Likewise, if there are five flights to a destination airport in a day, its address appears five times the number of passengers. This is massively inefficient.
We’ll put the spreadsheet into a database table and, by normalizing it, change the single spreadsheet into a number of related tables. Multiple items stored in a single column in the spreadsheet are separated into fields in a table. (For example, column 5 in the spreadsheet has aircraft ID, mode and airline name.)
So, what tables do we need?
The flight table is the main one. It has these fields in it.
1. Flight ID – Primary Key
3. Aircraft ID
4. Take-off Airport ID
5. Destination Airport ID
I’ve isolated airport information and details into its own table (airports). That way, the airport ID (its key) can be stored in either the flight take-off or destination columns. If we need to make a change to an airport—perhaps it’s been allocated a new zip code or name change—we only have to change it one place in the airports table, not in every row (in either column) in the spreadsheet.
Likewise, we split the aircraft column in the spreadsheet into an aircraft table. This holds all information about the aircraft and the airline.
We now have four tables:
1. Flight Table
2. Airport Table
3. Aircraft Table
4. Passenger Table
The passenger table holds all passengers flying that day, with one row for passenger details: name, address, and so on. It has a flight ID per row.
This satisfies the first three normal forms for all of the tables. There’s no repeating element or group of elements (first normal form). There’s no non-prime attribute dependent on a subset of the candidate key (second normal form); and the other fields in the table provide information about the key, the whole key and nothing but the key (third normal form).
There are other normalization forms such as Boyce–Codd normal form and Fourth normal form, but these are for more complicated database structures than we have here.
Problems with Normalization
There is a trade-off between normalization and efficiency. One example is with table joins. For instance, if you want to get passenger totals per airport, you need to join the flights, airports and passengers tables.
In a join between tables, the server returns a table comprised of every combination of two, three or more tables. The number of rows in the join result grow exponentially with each extra table. If a join is done in a table using an unindexed column, then every row in that table is read, and that’s very inefficient.
For example, say we want to calculate how many passengers came into a specific airport between the hours of 1:00 A.M. and 8:00 A.M. We have 100,000 rows in the flight table, and 200 of them are for a given airport; if the airport ID column is not indexed, then the query has to read all 100,000 rows and compare the date/time 100,000 times. With indexing on the ID column, only 200 comparisons are done; if the date/time column is also indexed, then no comparisons need be done, and the query returns just the matching rows.
Denormalizing is making changes to the table structures to increase the read performance of the database. Typically, it reduces the need for expensive joins and searches. If a query needs to do a lot of reading, totalling, and so on, it can stress the database server through higher memory use and greater processor use, reducing performance.
An Example of Denormalizing
We’ll return to the flights example. In order to display a list of airports with the associated number of flights and passenger totals, you’ll have to do a query to list the airports, then query each airport to return the number of flights, and then query the passenger tables. You can do this with a query and grouping, but it could be quicker if we stored the number of passengers per flight in the runway table; then a simpler query would list what we need without any joins.
I found a list of over 52,000 airports on ourairports.com and imported the names into an airports table, and gave it a primary key column with Identity (i.e., it’s auto-incrementing). Here’s the gist of the airports table. There’s also an index on ID:
CREATE TABLE [dbo].[airports]( [Name] [nvarchar](255) NULL, [id] [int] IDENTITY(1,1) NOT NULL, [numflights] [int] NULL, PRIMARY KEY CLUSTERED
Flights is similar, with just an ID column and an airportid column with an index on it.
Next, I wrote a C# program to insert between 10 and 100 flights per airport into a flight table; 20 minutes later, it had inserted 2.9 million records and 24 million in the passenger table. Each flight had a random number between 50 and 250 passengers. That took an hour or two to create!
On my PC, this query to list the airports with the most passengers took 132 seconds to run despite all involved columns being indexed.
select a.name, count(p.passengerid) as 'Count' from airports a join flights f on a.id = f.airportid join passengers p on f.id= p.flightid group by a.name,f.id,f.airportId order by count(f.id) desc
I then added an extra column (passengercount) in the flights table, and updated it with the grouped total. The original query took 132 seconds, but after I updated passengercount, the revised query took 35 seconds—over three times as fast. But to update this field in the flights table took 40 seconds; in workflow terms, we need to do the update much less often (say, once a day). The passenger query may be run several times.
Denormalizing can speed up queries with large amounts of data, but it comes with a couple of costs. One big downside: You increase the size of tables with extra data. The other cost is that updates or inserts take longer, because you have to write the denormalized data into different tables (for instance, summing up the passenger counts per flight). Denormalizing can also reduce the load on the database server by making queries less memory-intensive.
With mostly static tables, where changes rarely (or never) occur, the update/inserts costs of denormalizing are insignificant. Adding in an extra column to a table that may have 20 or more is likewise not that big a sacrifice. In data-warehousing, which is concerned with reporting on data, denormalizing is a very useful technique. Check it out!