Design pattern: subkeys (the zip code)
(Wayne Dick and Tom Jewett)
One of the major goals of relational database design is to prevent unnecessary duplication of data. In fact, this is one of the main reasons for using a relational database instead of a “flat file” that stores all information in one table. Sometimes we will design a class that seems to be correct, only to find out in the relation scheme or in the table itself that we have a problem.
Example: Almost every personal productivity program today includes some sort of contact manager. A “contact” is a person who could be a business associate or simply a friend or family member. Many of these programs have a very simplistic one-table model for the contact information, which probably looks something like this (ignoring phone numbers for the moment):
• It may not be obvious that this model has a problem, until you look at the Contacts table with some typical data filled in:
|George||Barnes||1254 Bellflower||90840||Long Beach||CA|
|Susan||Noble||1515 Palo Verde||90840||Long Beach||CA|
|Erwin||Star||17022 Brookhurst||92708||Fountain Valley||CA|
|Alice||Buck||3884 Atherton||90836||Long Beach||CA|
|Frank||Borders||10200 Slater||92708||Fountian Valley||CA|
|Hanna||Diedrich||1699 Studebaker||90840||Long Beach||CA|
• Notice the repeated information in the city and state attributes. This is not only redundant data; it might also be inconsistent data. (Can you spot the “typo” above?)
Functional dependencies, subkeys, and lossless join decomposition
To understand why we have a problem, we first have to understand the concept of a functional dependency (FD), which is simply a more formal term for the super key property. If X and Y are sets of attributes, then the notation X→Y is read “X functionally determines Y” or “Y is functionally dependent on X.” This means that if I’m given a table filled with data plus the value of the attributes in X, then I can uniquely determine the value of the attributes in Y.
• A super key always functionally determines all of the other attributes in a relation (as well as itself). This is a “good” FD. A “bad” FD happens when we have an attribute or set of attributes that are a super key for some of the other attributes in the relation, but not a super key for the entire relation. We call this set of attributes a subkey of the relation.
• In our example above, the zipCode is a subkey of the Contacts table. It is not a super key for the entire table, but it functionally determines the city and state. (If you know the zip code, you can always find the city and state, although you might need all nine digits instead of the five we show here.) The opposite is not true, because many cities have more than one zip code, like Long Beach in this example. We can show this in the relation scheme:
• There is a very simple 3-step way to fix the problem with the relation scheme.
• The new Contacts table will look like the old one, minus the city and state fields. The new ZipLocations table, shown below, contains only one row per zip code. Joining this table to the Contacts (on matching zipCode pk-fk pairs) will produce the same information that was in the original table. What we have done is formally called lossless join decomposition of the original table.
Subkeys and normalization
Normalization means following a procedure or set of rules to insure that a database is well designed. Most normalization rules are meant to eliminate redundant data (that is, unnecessary duplicate data) in the database. Subkeys always result in redundant data, so we need to eliminate them using the procedure outlined above.
• If there are no subkeys in any of the tables in your database, you have a well-designed model according to what is usually called third normal form, or 3NF. Actually, 3NF permits subkeys in some very exceptional circumstances that we won’t discuss here; the strict no-subkey form is formally known as Boyce-Codd normal form, or BCNF.
• Some textbooks use the terms partial FDs and transitive FDs. Both of these are subkeys—the first where the subkey is part of a primary key, the second where is isn’t. Both can be eliminated by the procedure that we’ve shown here.
Correcting the UML class diagram
When we find a subkey in a relation scheme or table, we also know that the original UML class was badly designed. The problem, always, is that we have actually placed two conceptually different classes in a single class definition.
• In this example, a zipCode is not just an attribute of the Contact class. It is part of a ZipLocation class, which we can describe as “a geographical location whose boundaries have been uniquely identified by the U.S Postal Service for mail delivery.”
• The zipCode is an external key, created by the USPS for the convenience of its sorting machinery (not the postal customers). The ZipLocation class has the additional attributes of the city and state where it is located; in fact, it also has the attributes needed to precisely describe its boundaries, although we certainly do not need to represent these in our database. The geographical boundaries would form the “real” descriptive CK if they were included. As always, we need to describe the association between ZipLocations and Contacts:
• As with all one-to-many associations, the association itself identifies which Contact lives in which ZipLocation. If we had started with this class diagram, we would have produced exactly the same relation scheme that we developed with the normalization process above!
Copyright © 2002–2006, by Tom Jewett. Links to this site are welcome and encouraged. Individual copies may be printed for non-commercial classroom or personal use; however, this material may not be reposted to other web sites or newsgroups, or included in any printed or electronic publication, whether modified or not, without specific permission from the author.