Design pattern: subkeys (the zip code)

(Wayne Dick and Tom Jewett)

One of the major goals of relational database design is to prevent unnecessary duplication of data. In fact, this is one of the main reasons for using a relational database instead of a “flat file” that stores all information in one table. Sometimes we will design a class that seems to be correct, only to find out in the relation scheme or in the table itself that we have a problem.

Example: Almost every personal productivity program today includes some sort of contact manager. A “contact” is a person who could be a business associate or simply a friend or family member. Many of these programs have a very simplistic one-table model for the contact information, which probably looks something like this (ignoring phone numbers for the moment):

Contact class diagram

Other views of this diagram: Large image - Data dictionary (text)

• It may not be obvious that this model has a problem, until you look at the Contacts table with some typical data filled in:

Contacts
firstNamelastNamestreetzipCodecitystate
GeorgeBarnes1254 Bellflower90840Long BeachCA
SusanNoble1515 Palo Verde90840Long BeachCA
ErwinStar17022 Brookhurst92708Fountain ValleyCA
AliceBuck3884 Atherton90836Long BeachCA
FrankBorders10200 Slater92708Fountian ValleyCA
HannaDiedrich1699 Studebaker90840Long BeachCA

• Notice the repeated information in the city and state attributes. This is not only redundant data; it might also be inconsistent data. (Can you spot the “typo” above?)

Functional dependencies, subkeys, and lossless join decomposition

To understand why we have a problem, we first have to understand the concept of a functional dependency (FD), which is simply a more formal term for the super key property. If X and Y are sets of attributes, then the notation X→Y is read “X functionally determines Y” or “Y is functionally dependent on X.” This means that if I’m given a table filled with data plus the value of the attributes in X, then I can uniquely determine the value of the attributes in Y.

• A super key always functionally determines all of the other attributes in a relation (as well as itself). This is a “good” FD. A “bad” FD happens when we have an attribute or set of attributes that are a super key for some of the other attributes in the relation, but not a super key for the entire relation. We call this set of attributes a subkey of the relation.

• In our example above, the zipCode is a subkey of the Contacts table. It is not a super key for the entire table, but it functionally determines the city and state. (If you know the zip code, you can always find the city and state, although you might need all nine digits instead of the five we show here.) The opposite is not true, because many cities have more than one zip code, like Long Beach in this example. We can show this in the relation scheme:

Contact relation scheme

Other views of this diagram: Large image - Data dictionary (text)

• There is a very simple 3-step way to fix the problem with the relation scheme.

1. Remove all of the attributes that are dependent on the subkey. Put them into a new scheme. In this example, the dependent attributes are the city and state.

2. Duplicate the subkey attribute set in the new scheme, where it becomes the primary key of the new scheme. In this example, the sole subkey attribute is the zipCode.

3. Leave a copy of the subkey attribute set in the original scheme, where it is now a foreign key. It is no longer a subkey, because you’ve gotten rid of the attributes that were functionally dependent on it, and you’ve made it the primary key of its own table. The revised model will have a many-to-one relationship between the original scheme and the new one:

Revised contact relation scheme

Other views of this diagram: Large image - Data dictionary (text)

• The new Contacts table will look like the old one, minus the city and state fields. The new ZipLocations table, shown below, contains only one row per zip code. Joining this table to the Contacts (on matching zipCode pk-fk pairs) will produce the same information that was in the original table. What we have done is formally called lossless join decomposition of the original table.

Zip locations
zipCodecitystate
90840Long BeachCA
90836Long BeachCA
92708Fountain ValleyCA

Subkeys and normalization

Normalization means following a procedure or set of rules to insure that a database is well designed. Most normalization rules are meant to eliminate redundant data (that is, unnecessary duplicate data) in the database. Subkeys always result in redundant data, so we need to eliminate them using the procedure outlined above.

• If there are no subkeys in any of the tables in your database, you have a well-designed model according to what is usually called third normal form, or 3NF. Actually, 3NF permits subkeys in some very exceptional circumstances that we won’t discuss here; the strict no-subkey form is formally known as Boyce-Codd normal form, or BCNF.

• Some textbooks use the terms partial FDs and transitive FDs. Both of these are subkeys—the first where the subkey is part of a primary key, the second where is isn’t. Both can be eliminated by the procedure that we’ve shown here.

Correcting the UML class diagram

When we find a subkey in a relation scheme or table, we also know that the original UML class was badly designed. The problem, always, is that we have actually placed two conceptually different classes in a single class definition.

• In this example, a zipCode is not just an attribute of the Contact class. It is part of a ZipLocation class, which we can describe as “a geographical location whose boundaries have been uniquely identified by the U.S Postal Service for mail delivery.”

• The zipCode is an external key, created by the USPS for the convenience of its sorting machinery (not the postal customers). The ZipLocation class has the additional attributes of the city and state where it is located; in fact, it also has the attributes needed to precisely describe its boundaries, although we certainly do not need to represent these in our database. The geographical boundaries would form the “real” descriptive CK if they were included. As always, we need to describe the association between ZipLocations and Contacts:

“Each Contact lives in one and only one ZipLocation”

“Each ZipLocation is home to zero or more Contacts”

• As with all one-to-many associations, the association itself identifies which Contact lives in which ZipLocation. If we had started with this class diagram, we would have produced exactly the same relation scheme that we developed with the normalization process above!

Revised contact class diagram

Other views of this diagram: Large image - Data dictionary (text)