Normalization is not that complicated. In fact, the beauty of these rules is that you may not know you’re already applying them in your database design.
To demonstrate, over the course of this multi-part series, we’ll be covering the three generally accepted levels of normalization (first, second, and third normal form), followed by Boyce-Codd through 5NF for more advanced and specialized cases.
PART ONE: 1NF and the Philosophy of Normalization
First Normal Form (1NF) (aka EACH TABLE IS ONE THING)
Now, what do we mean by “ONE THING”? Well… just that. One thing. One concept. One idea. One “thing” we’re storing data about.
That thing could be Customer. Inventory. Staff. Order. Project. Timesheet. Basically, a NOUN (person, place, or thing, for those of you who haven’t watched Schoolhouse Rock in a while). And this NOUN needs to be the only thing the table is about. In other words, the “Customer” table doesn’t need the salesperson’s address, and the “Inventory” table doesn’t need the name of the customer who ordered that particular item.
The Customer table needs to have information about the customers. The Inventory table needs to have information on the items you carry. Each table is ONE THING.
“That seems pretty obvious,” you might say, “but what’s this got to do with 1NF?”
Everything. That’s the goal of 1NF at its core.
So what should you do to make each table “ONE THING”?
- Give each record a unique identifier (primary key)
- Take out the Russian Nesting Dolls (accidental tables-within-tables)
- Take out the TINY Russian Nesting Dolls (accidental columns-within-columns)
The unique identifier concept is fairly simple: Essentially, give each record an individual, non-duplicated value (for simplicity’s sake, let’s make it a number) so you can easily reference it in your queries. In many cases, you can point out these unique identifiers by a column name containing “id”.
If you doubt the power of unique identifiers, which of the two lines below would be easier to type multiple times in a query?
where store_id = 3
where store_name = ‘Ralph, Jerry, and Tim’s Super Massive Super Awesome Olympic Swimming Pool-Sized Bargain Bin‘
Yeah, that’s what I thought.
And even if the number 3 sounds like too much work to add to your table, you can convert your primary key to what’s called an identity column, which will AUTOMATICALLY add the unique identifier each time you add a new record. Technology is amazing, isn’t it?
Now, the two Russian Nesting Doll concepts may seem a bit odd at first. And you might be thinking, If we already established the ‘EACH TABLE IS ONE THING’ rule, why are tables-within-tables and columns-within-columns a thing? If that thought crossed your mind, then you are well on your way to making 1NF a subconscious practice.
Let’s see an example of these nesting dolls:
Order_id Item_id Item_name Customer_id Salesperson_name_and_rank
If this table design baffles you, it should. If it makes you cringe, it should. Because this table is NOT, in fact, “ONE THING”. It is multiple things (Order, Item, AND Salesperson), and that contradicts the principle of 1NF.
So let’s break this down.
Item_id and Item_name are a table within a table. Each Item_name is going to match its respective item_id no matter what order it’s a part of, so why not make these columns their own table?
Thus, Item is now its own separate table or “THING”.
Meanwhile, our original table now looks like this:
Order_id Customer_id Salesperson_name_and_rank
(And by the way, Customer_id would be what’s considered a foreign key, a quick and easy reference to the primary key/unique identifier of a related table—in this case, Customer. Remember, unique identifiers are what’s saving you from having to write Billy Bob Joe Johnson thousands and thousands of times in your records and queries)
Now, let’s look at the last column: Salesperson_name_and_rank. Again, you might be asking, “Why would someone intentionally add this to their table?” Because this column contains TWO pieces of data, both of which are probably related to each other in a way similar to Item_id and Item_name.
So, since this is two columns’ worth of data, let’s split them up:
Order_id Customer_id Salesperson_name Salesperson_rank
“Hold on! This just created the same nesting doll problem from earlier! Shouldn’t we move the salesperson columns to their own table as well?”
Good catch! And let’s add a unique identifier of Salesperson_id, just to come full circle on our 1NF adventure:
Salesperson_id Salesperson_name Salesperson_rank
Congratulations! You just used 1NF to turn one horrible table into three decent tables!
Salesperson_id Salesperson_name Salesperson_rank
Now, before we move on, I wanted to bring up another purpose of normalization that you may or may not have discovered on your own earlier: One of the big objectives of normalization is to make updating records virtually painless. If you need to update a record (for example, if “Ralph, Jerry, and Tim’s Super Massive Super Awesome Olympic Swimming Pool-Sized Bargain Bin” became “Ralph, Jerry, and Tam‘s Super Massive Super Awesome Olympic Swimming Pool-Sized Bargain Bin“), the average person would much rather update that record in only one place, rather than in fifty, five hundred, or even fifty thousand places. Not just because of the amount of time. it would take to find all fifty thousand instances of that store name, but with each new occurrence of that name, there’s also the increased probability of typos, which should be the bane of every developer’s existence. With typos, out of all those fifty thousand records, you could make twenty different typos, thus creating twenty-one variants of the same record. Depending on how many times you make the same typo and how many records each variant has if you queried your table using the correct spelling of the name (or a single misspelled variant), you could end up with hundreds of records missing from your results and/or calculations.
“store_id = 3” seems like a much nicer option now.
PART 2: 2NF, 3NF, and a Metaphor about Pirates
Second Normal Form (2NF) (aka: EACH COLUMN IS ABOUT THE ONE THING)
Again, seems fairly obvious. So obvious, in fact, that in our earlier 1NF example, the resulting tables were elevated to second normal form, on top of first normal form.
“Wait, what?! How did that happen so quickly?!”
Let’s demonstrate with another example:
Customer_id Customer_name Customer_address SalesRep_name
“Wait… what’s SalesRep_name doing in the Customer table? It’s not a foreign key, so shouldn’t that be in the SalesRep table?”
Bingo. That’s 2NF in a nutshell.
Remember in the previous example how we moved the Item and Salesperson columns to their own separate tables? By just doing that, our tables automatically became 2NF-compliant! And you didn’t even have to do anything extra!
Third Normal Form (3NF) (aka: ALL PIRATES MUST ANSWER TO THE CAPTAIN)
“Now, what on EARTH do you mean by that?”
In more technical terms, “All columns must depend on the primary key.”
So in each table, we have our primary key, the captain. The captain calls the shots, and he shouldn’t have anyone working behind his back plotting a mutiny. That’s where 3NF comes in.
Much like pirates secretly plotting a mutiny, 3NF violations can hide very well from the untrained eye.
Store_id Store_name Address City State Zip County Manager_id
On the surface, this table seems pretty solid. After all, each one of these columns describes a trait of this individual store. The name, address, even a foreign key pointing to the Manager or Employee table. These sailors seem to be getting along pretty well, right?
Not quite. Take a look at the fields between Address and Manager_id: City, State, Zip, and County. Are these fields really determined by the individual store ID? Or is there someone else trying to be the captain?
“City, state, and zip code are pretty closely linked, right? You should be able to tell the city and state, as well as the county, just from the zip code.”
Exactly. The zip code 92262 is associated with Palm Springs, CA, in San Bernardino County. The zip code 44114 is associated with Cleveland, OH, in Cuyahoga County. And the zip code 80905 is associated with Colorado Springs, CO, in El Paso County.
Since these fields are more determined by zip code than they are by the individual store ID, it seems we found who was planning to stage that mutiny.
So, let’s put these guys in their own table, where Zip can be the captain of his own crew:
ZipCode City State County
And since the stores still need the second half of their addresses, we can include a foreign key referencing the ZipCode table:
Store_id Store_name Address ZipCode Manager_id
Because the zip code relies on the individual store ID, everything seems to be fine on this ship… for now.
Now, in the vast majority of cases, your table design would be squeaky clean at this point, since 3NF is the generally accepted standard for normalization. So if you would consider yourself done here, pat yourself on the back! You’ve earned it!
However, there are some special cases that require even more normalization, far beyond what’s generally considered obvious. Those are Boyce-Codd Normal Form (BCNF, also considered “3NF+”), Fourth Normal Form (4NF), and Fifth Normal Form (5NF). We’ll be discussing things like multiple primary keys and primary/foreign key combos, so if you’re not quite ready to jump that far, you can stop here if you want.
If you’re still here, however, then without further ado…
PART THREE: BCNF, 4NF, 5NF, ad infinitum
Boyce-Codd Normal Form (BCNF) (aka NO ONE TELLS THE CAPTAIN WHAT TO DO)
Now, this one’s a little tricky to convert to common sense terms, simply because we’re now getting into the special cases, where nine times out of ten, you will probably not have to deal with tables that require BCNF. However, for those of you who do have to deal with these special cases, here’s a high-level explanation of what BCNF actually is:
Any part of the primary key or candidate key should not depend on a non-key attribute.
Now, what do we mean by that? Well, back in 3NF, we established that all columns must depend on the primary key. That is, we can’t have any columns that depend on an attribute that isn’t the primary key—such as how City, State, and County depended solely on ZipCode in the previous example.
Nine times out of ten, that’s good enough.
But some dudes named Boyce and Codd (among others, probably) created an upgrade to the “traditional” 3NF.
Better. Faster. Stronger. More data integrity.
And virtually indistinguishable from 3NF, except for one significant detail:
Not only should all attributes depend on the primary key and no other column(s), but any part of the primary key/candidate key should not depend on an attribute that is not, in itself, a key.
“But what does that MEAN?”
Hold on, I’m getting there.
First, we need to briefly cover some more advanced design concepts, just so you understand the terminology being used here.
Up until now, we’ve been discussing the primary key as if it’s only one column. And, in many cases, it is. However, it is possible for two or even more columns to serve as the single unique identifier for the table.
Here’s a quick example:
Store_num Item_num Num_in_stock
This table shows how many of each item are in stock at each store. “Store_num” alone can’t be the primary key, since each store has multiple items. And “Item_num” alone can’t be the primary key, since each item can be available in different stores. However, both “Store_num” and “Item_num” combined determine the number of items in stock at each store.
A primary key that is made up of two attributes (such as Store_num and Item_num combined) is often referred to as a composite key or even a super key.
Another assumption we made in the earlier examples is that only one attribute or set of attributes could be the primary key—which, to review, is the unique identifier for all records in the table. Depending on the individual table contents or design, however, that may not always be the case, and multiple attributes could uniquely identify each record if set as the primary key (a popular but not very practical example is including employee SSNs along with employee IDs; both uniquely identify individual employees, but one could get you into some serious trouble if a certain substance hit the fan). Another example, in some cases, would be Employee_name, if you have a small company where no two employees have the same name.
Attributes that could serve as unique identifiers for each record but are not considered the primary key are called candidate keys.
Now, back to our regularly scheduled program: BCNF.
Let’s say you have the following Orders table:
Order_id Cust_id Num_of_items Order_date
On the surface, it looks fine. It adheres to 3NF, since there doesn’t seem to be anyone plotting a mutiny here. But let’s take a closer look at the column dependencies:
Cust_id depends on Order_id. Looks good.
Num_of_items depends on Order_id. Looks good.
Order_date depends on Order_id
Wait. That doesn’t make sense. I don’t think that’s how time works.
In the example for 3NF, the ZipCode column still depended on the unique Store_id. However, in this case, Order_date doesn’t seem to depend on Order_id… but Order_id seems to depend on Order_date.
Which means we now have the primary key, which every attribute in this table should depend on… depending on an attribute that’s not even a candidate key, since it’s possible to have more than one order in a day. This is part of the gray area that separates 3NF from BCNF.
So how would you solve this issue (if you choose to solve it at all)?
Order_date would need to go into its own separate table with Order_id, since the objective is to get rid of any dependencies the primary key would have with attributes that weren’t candidate keys.
Order_id Cust_id Num_of_items
Does it seem a little counterproductive? At times, it definitely complicates things more than it helps, which is why there are some drawbacks to using BCNF in the real world. However, if you’re ever in a situation where you need to know what BCNF entails, well, there you go.
Fourth Normal Form (4NF) (aka MAKE COLUMNS RELATE TO EACH OTHER)
Compared to BCNF, this one is much easier to explain. And this rule gives us a brief return to the “common sense” track.
Let’s learn about a concept called Multi-Value Dependency.
You need a table with at least three columns. At least one of these columns needs to be the primary/composite key, and at least two other columns depend on this primary/composite key.
But they have absolutely nothing to do with each other.
Emp_id Emp_dept Favorite_fast_food
Now, unless we’re trying to discover a correlation between an employee’s department and what their favorite fast food restaurant is, these attributes are completely independent of each other, yet they both depend on the primary key (“Emp_id”). An employee can work in multiple departments, and an employee can also have multiple favorite fast-food restaurants.
The object of 4NF is to eliminate these multi-value dependencies, and that can easily be done by putting each unrelated attribute into its own table with “Emp_id”:
Fifth Normal Form (5NF) (aka ALL TABLES MUST BE BROKEN DOWN TO THE MOLECULAR LEVEL)
Put very simply, 5NF exists when all of the above rules are followed, and the table cannot be split any further without losing data and without reusing the same primary key in the subsequent tables.
“But haven’t we been doing that this entire time?”
With a few exceptions, yes, that’s what ended up happening with our previous examples.
“So why does it need to be a special rule? I thought this was the section for ‘special cases’?”
There comes a point in every field of study where theory exceeds practicality. Where the discussion turns to abstract concepts instead of things you actually care about. Where “infinity” becomes an acceptable answer. This is the theoretical point where absolutely nothing else can be done to improve the data integrity of this table.
“What’s the point of that?”
I have no idea.
So, there you have it, the six generally accepted forms of database normalization. Will you use all of these in your database design career? Unlikely. But whether or not you intend to think about 5NF every day at the office for the rest of your life, I hope this guide has helped you better understand the concept of database normalization.
If you have any questions or need a helping hand in your database environment, the VirtualDBA team is always ready and willing to help. Please don’t hesitate to reach out to any one of our knowledgeable consultants, since we’re always available to help, day or night.