Structured Programming

Jeff decides it's time to talk about intelligently managing data, noting that structured languages are particularly weak at data management. He then shares his notions of good database design.

February 01, 1993
URL:http://drdobbs.com/structured-programming/184408947

Figure 1

Figure 2

Figure 3

FEB93: STRUCTURED PROGRAMMING

STRUCTURED PROGRAMMING

Shoplifting in Reverse

Jeff Duntemann KG7JF

I don't know whether to confess this one or not: I've been reverse shoplifting again, and I just can't stop myself. I go into CompUSA or Bizmart or someplace like that with three or four copies of the magazine I publish under my arm. I pretend to browse the computer magazines in the magazine section, and then when nobody's looking I slip the three or four copies of PC Techniques I brought in onto a prominent place on the magazine rack and then nonchalantly make my escape.

It's not that I'm trying to sell magazines this way, or even get attention; a dozen copies left anonymously on newsstands around Phoenix obviously isn't going to do anything for my profile or my market share. I think what I'm really doing is tormenting the people who maintain the computerized inventory systems into adopting new modes of thought. See, when a customer takes one of my gray-market (mauve-market? raw umber-market?) magazines to the front counter, the clerk waves it over the laser scanner, and the cash-register terminal protests that the number is not on file. No problem--the clerk pounds the UPC number off the bar code into the terminal, collects the customer's $4.95, (which is marked on the cover) and moves on to the next guy in line.

No, the good stuff happens at the end of the month, when the inventory boys are trying to settle all their accounts. After updating their master inventory database with the code numbers of everything that hadn't quite gotten logged last time, they match the UPC against the ISSN index to find the name PC Techniques, and start scratching their heads.

"Hey, Charlie, I can't figure this one. We didn't order any copies of this magazine PC Techniques. The distributor didn't send us any. The distributor doesn't even carry it."

"So?" Charlie asks. "What's the problem?"

"We sold four copies."

Charlie chews on his lip for a second. $19.80 in revenue had come in, and has to be credited to something the firm had purchased wholesale to sell. But there was no record in the inventory file, no record in the invoice file, no record anywhere.

"So fix it," Charlie says, and walks away.

They say that whatever isn't impossible is inevitable. Bosh. Whatever isn't against the physical laws is inevitable, and I'm real picky about what I consider a physical law. It's a good idea to keep that in mind as you design any system, but in particular a database-oriented application that not only has to store data fed in from the outside world, but also make sense of it.

That's the essential difference between storing data in a file and storing it in a database. The data in the database has to make sense; that is, there are certain requirements about how items in one record relate to another record, and so on. The more complex the data is, and the more separate places (files, machines, networks, and so on) the data lives, the more likely you're going to encounter the "impossible." And when the impossible happens, you know that your boss is simply going to look at you and say, "Fix it. "Sympathy isn't part of the deal.

This real-world wisdom gets thoroughly lost in many discussions of database design, especially in the academic world, where far too many instructors have never implemented anything real in their lives. In the midst of arcane discussions of referential integrity, many-to-many joins, and theta selects, people forget that if somebody can reverse-shoplift, they will, and the system you're building had better be able to handle it.

It's time to talk about intelligently managing data. The structured languages (Pascal, Modula, Ada, and, well, OK, C) are particularly weak at data management, which has always puzzled me, since the vast majority of all programming in the commercial world is, at bottom, data-management programming. If you know nothing about data management, this would be a good time to start. If you do, hey, follow along and stop me if I say anything marginal.

One-to-Many

One file does not a database make. If everything you're storing fits in one file, you've got what they call a "flat file," and while some folks say "flat-file database," that's kind of like saying "honest congressman." The two halves of the term are mutually exclusive. To be a database, there must be at least two separate groups of data records that relate to one another in some well-defined way. Each group shares a common record structure.

Let's call each group of records a table, and to tie things more clearly to the real world, let's assume for this discussion that a table is also a disk file. A table is laid out very much like a Pascal file of records, and the term "record" is conveniently used to describe one row of the table. Some people call rows "tuples," which is an ugly and unnecessary piece of jargon. In most people's estimation, a row is a record, and that's jargon (and confusion) enough. The fields of a table line up vertically, in a way reminiscent of a spreadsheet, and are not surprisingly called the table's "columns." Figure 1 sums it up.

This is still just a file. To make it a database, there must be a second table with some defined relationship to the first table. For example, if you've got a table of contact names, you're going to want a table of contact addresses as well. That's what I've shown in Figure 2.

To many experienced programmers, something like this is painfully obvious. But I will admit, it took me a long time to realize that contact addresses should not be wedged into the same file as contact names. Why not? Simply because a contact may have more than one address. Okay--we'll have two separate address areas for each contact in the contact file. Well then, what about my friend George Ewing WA8WTE, who has four addresses?

How can a guy have four addresses?

I dunno. How could a guy reverse shoplift?

Just fix it. In this case, we put all the addresses out as a separate table. To be useful, there has to be a well-defined relationship between the two tables, or we'd never be able to associate an address with a contact. So what can we say about the Addresses table? Just this: That for every record in the Addresses table, there is an associated name in the Contacts table. To make the relationship unambiguous, we make sure that there is a code field in each one of the Addresses records that matches a code field in one of the Contacts records.

Why not just use some sort of last-name/first-name appellation for the code? Well, there are two Tom Campbells in my contacts database, and more Mike Smiths than I care to think about. Give each person a unique ID code. You don't have to tell them that they're a number and not a name. It'll be your secret.

Figure 2 makes a number of notable points. One is that there can be any number of Mike Smiths in the Contacts table, because each one has a unique code number. Another is that a given Mike Smith's four addresses can be anywhere in the Addresses table. They don't have to be adjacent to one another, nor in any particular place. All that matters is that each one of Mike's addresses is correctly tagged with Mike's ID code number. Finally, the ID code numbers are arbitrary. There is no connection at all between a contact record's position in the Contacts file and its ID code number. That is, our Mike-Smith-with-four-addresses is not necessarily the 174th record in the table.

The ID codes point up perhaps the most significant relationship between the two tables. There may be any number of addresses belonging to contact ID 174 in the Addresses table. However, there may only be one contact ID 174 in the Contacts table. This relationship is called a one-to-many relationship. That is, for each one record in the Contacts table there may be many address records in the Addresses table. In drawing relationships between tables on paper, people use the crowfoot-like symbol I've included between the tables. The crowfoot indicates the "many" end of the one-to-many relationship.

Multiple Relationships

It's worth stepping back for a moment and thinking about this concept in design terms. The most blatant advantage to setting up a Contacts database this way is that it eliminates the need for wasted empty address fields. Many flat files have two or sometimes even three separate address areas in each record. For people who have only one address, this means a significant number of bytes in each record are just landfill.

Following from that is virtually unfettered flexibility. A contact can have one, two, four, or seventeen addresses. Or none--suppose you once had a contact but he's now dead. It might pay to remember him in some circumstances--say, as the original holder of a patent you're researching--but you sure don't want to send him any mail. Basically, a database like this can handle any number of addresses. There is no "impossible." And that's always important in the design phase of any system.

Things get interesting when you continue with the design and consider phone numbers. It shouldn't surprise you that some people have more than one phone number. What surprises me sometimes is how many phone numbers a single person can have. A guy I know has a home phone, a home modem line, a work phone, two work modem lines, a cellular phone, and a pager (and he's thinking of setting up his own BBS...). We can add a third table to the database to contain phone numbers. That's a no-brainer. As with addresses, each contact in the Contacts table can have any number of phone numbers in the Phones table.

Alas, that's not enough. You gotta know where the phones are, because ol' Mike Smith gets around. You may know he's at work, but which phone number in Phones is at his work address? There's a missing relationship here. As shown in Figure 3, it's another one-to-many relationship. There can be multiple phones at any given address, but each phone can have only one address. It's also true that each phone number in Phones can belong to only one contact in Contacts. So Phones has a relationship with Addresses, and an entirely separate (if similar) relationship with Contacts.

To ID or Not to ID?

You've probably guessed by now that anytime there's a one-to-many relationship, the "one" end of the relationship has to be unique in its table. Addresses are usually unique within a limited geographical area, but are not necessarily unique--and the broader you range, the more likely you are to hit a duplicate. (How many #17 Maple Streets do you suspect there are in the country?) More to the point, an address is a biggish item compared to an ID code. To be unique in most (but not all!) cases, you have to consider the location, address, city, state, and zip code fields as a single aggregate field. If you're going to be doing a lot of searches or sorts, the computational burden of examining, comparing, and swapping around whole addresses can be brutal.

The rule of thumb is pretty simple: Anytime that you must access a record in a table uniquely, assign the record a unique key that you control. Do not assume that any given data that comes from outside your system will be unique.

A lot of naive designers have been stung on this one: Surely the federally assigned Social Security number is unique for each American citizen! True, I've never heard of duplicates occurring. But there's a real-world snag: No one is legally required to give you their Social Security number unless you're some sort of government agency--with the legal corollary that you have no recourse against a person who hands you a made-up number. And a made-up number is not necessarily unique. If someone can zoom you, they will.

Good design maxim: If you have to ID uniquely, assign the ID within the system.

There is a term for a field in a record that uniquely identifies that record within its table: primary key. In Figure 3, the primary-key fields are shaded. The Contacts table and the Addresses table each have a primary-key field. The Phones table does not.

Well, why should it?

Keep straight in your mind what primary keys do: They allow us to stake down the "one" portion of a one-to-many relationship. The Contacts table is on the "one" end of two such relationships so far, and if this is any significant business system, you can bet there will be a lot more. The Addresses table is on the "one" end of a one-to-many relationship with Phones: We have to be able to relate each phone number uniquely to one address. Is there any one-to-many relationship keyed to phone numbers? That is, can you think of any class of whatevers that we must relate to one phone number? A phone number can have a set of attributes attached to it (type of service, listed/unlisted, perhaps a baud rate), but each phone number will have only one such set. There's no "many" related to phone numbers, and hence no need to give the Phones table a primary key.

The Notion of "Countables"

Once you've gotten a fair grasp of the idea of a one-to-many relationship, it's time to consider the notion of what I call countables. Any reasonable database system contains numerous countables. In any system, the countables are those entities that are independently enumeratable. There is no absolutely reliable one-to-one correspondence between distinct countables. In other words, nothing guarantees that for each name in your Contacts table there will be only one or two addresses or phone numbers. So people, addresses, and phone numbers are each countables.

Design maxim: Each countable entity must reside in its own table.

Identifying the countables in your system is something that should be done fairly early in the design stage. I've always done it using a variation on the "stepwise-refinement" method used to design procedures and short, simple programs. It works this way: Look at your system spec from a height and identify the large countables within it. Then look carefully at each of those large countables and try to identify any smaller countables inside it.

Let's take as an example a simple system for handling mail-order books selling. It's easy to identify several large countables in such a system: the customers, the inventory, the orders, and the moneystuff. Let's look for the smaller countables in each.

We've already broken down the notion of a customer about as far as it needs to go. A customer table, an address table, and a phone-number table are about all we need there.

Inventory is stuff to be sold, so you obviously need a table with a catalog number (that you generate) as its primary key. But the books come from somewhere, so you'll have as another countable the vendors from whom you buy the books. And inside that countable, just as with customers, you'll have distinct countables for vendor addresses and phone numbers.

Is that far enough for inventory? It may be for now. However, before too long you might want to get a little fancier in tracking your sales and what drives them. This might imply that the prices you charge for your books becomes another countable, since a single book could have a cover price, a preferred-customer discount price, a summer-readers' sale price, and a blowout inventory-clearance price. The Prices table would contain information such as when the price became valid, when it expires, and the like.

Orders are countables that should have a table of their own, with an order number that you generate as a primary key. Each order has at least one but probably more items, which is another countable.

"Moneystuff" is the collective term I use for invoices and payments. When a customer owes you money, you send a series of increasingly obnoxious invoices, each of which should have its own record in some permanently retained table. In most cases, the customer eventually sends you some sort of payment, or multiple payments, that are applied against the customer's account balance. Invoices and payments are thus separate countables, each with a table of their own.

Deciding What Goes in a Table

Identifying your countables should be done well before you sit down and try to enumerate the fields that need to be stored in a given record. You need to peg the big database picture before you can make any informed decisions on what goes where. Once you feel confident that you have all your countables counted out and identified, you need to think about precisely what information goes in each table, and how that information should be divided into fields.

This has always been a kind of a painful process for me, because I'm a pack rat. I have milk jugs full of ceramic tube sockets out in the garage, some of which have been with me since I was a teenager. (I haven't been a teenager since 1971.) I feel the same way about data. Why fail to store something when you can store it...just in case you need it someday?

Maybe it's better to have it and not need it than need it and not have it. You'll have to decide for your own circumstances. It is true that it's always better to be aware of your choices than to realize down the road that you just never considered the possibility of having to remember that some customer is allergic to mink musk. That means you should consider every possible item of relevant data and make a keep/throw decision on each one.

So brainstorm. Sit down with a pad (paper or virtual) and blurt out everything that might ever go into each of your tables. For a moment at least, be truly paranoid about forgetting anything, in the interest of getting it all down.

After you're through brainstorming, walk around the block, sober up a little bit, have a couple of Triscuits, review your constraints (Is this system going to have to run on bottom-feeder machines with minimal memory and hard-disk capacity? Is the intelligence or dedication of the users not exactly off the top of the charts? Do I have to deliver this thing by next weekend?) and start scratching off the unnecessary.

It may take a few passes through your brainstorming list. Once the level of agony in deciding on each item becomes unbearable, you've probably got a pretty reasonable list of keepable fields.

Breaking Up Huge Tables

In an ideal world, there'd be no one-to-one relationships between tables. If there's only one record in A for each record in B, A and B are really the same table. Then again, when you consider your constraints, you may find that some of your big tables exceed a maximum record size for the data manager, or are just too ungainly to keep in one piece. When that happens, you have to consider the frequency with which some data is going to be needed. Are you going to have to look up the customer's hat size all that often? Stack-rank the fields in an oversized table by how often you're going to have to read or write them. Percolate the most frequently accessed fields to the top of the stack, and then pick a reasonable place to break the table in two, leaving the often-used stuff in one table and the rarely used stuff in the other table. Don't forget that you'll need a primary key of some sort to pin down the connection between records in the two tables.

If your files are going to be kept on a network server, it helps to remember that networking is still slow compared to the speed of a fast hard disk, and if you're hauling a couple of thousand bytes of near-trivia in from the server every time you want to look up a customer's credit balance, you're wasting precious time and cable bandwidth.

The Essence of Database Design

In summary, high-level database design goes pretty much like this:

Identify your countables. This includes the smaller countables within the larger countables. In general, each countable becomes a database table.
Define the relationships between the countables. Most database relationships are one-to-many. Decide what logic links two tables, and what sort of primary key should carry that logic. Any key that must be unique should be generated within the system. Beware of making assumptions about data that you do not control!
Brainstorm all possible fields for each identified table.
Reconsider your constraints and eliminate all unnecessary fields from the brainstorm list for each table.
Decide if any single table is too large to leave in one piece. For any that are, stack-rank all fields by frequency of access and migrate all frequently accessed fields to one table, with the rarely accessed fields going to another table. Link the two parts of the broken table by a unique primary key.

And never ever forget, at any level of the design, that anything that can happen, no matter how weird, is likely to happen someday. Design accordingly. It's a crazier world than you think.