Modeling, Data Semantics, and Natural Language
Previous columns in this space suggested the use of formal methods and modeling tools, particularly when your projects involve many objects and complex database schemas.
The objective of modeling is to complete a rigorous design with quality checks before you build a physical system. When you use a modeling tool, you create an abstract representation of the real world that defines relationships, structure, rules, and other information. Last month I discussed three types of modeling that will be important technologies in the new millennium. This trinity of modeling techniques includes object modeling, content modeling, and data modeling, using the Unified Modeling Language (UML), the eXtensible Markup Language (XML), and Object-Role Modeling (ORM), respectively. To build some Web applications, you might use two or three modeling techniques. For example, there are e-commerce working groups using UML and XML to create new electronic data interchange (EDI) standards. Developers creating multitier applications might use UML for middle-tier components and ORM to produce database schemas. UML and ORM share concepts such as an object-centric methodology and conceptual modeling, but ORM is a tool for designing databases, not application classes.
Because developing systems is a challenge, the software industry has produced a variety of tools to automate analysis and design. None has gained wider support than the UML adopted by the Object Management Group. The UML is an object-oriented methodology that combines separate work by three well-known experts: Grady Booch, Ivar Jacobson, and Jim Rumbaugh. It standardizes modeling language and notation, not a particular method. UML supports several different views of a system -- class diagrams, behavior diagrams, use-case diagrams, and implementation diagrams. Use-case diagrams let UML users define how actors participate in an interaction with the system. UML users can capture the system's dynamic behavior by using activity diagrams, collaboration diagrams, sequence diagrams, and state diagrams. To document the lower-level details of a system, UML users can develop component diagrams and deployment diagrams. The UML is also extensible to support new modeling concepts by the use of stereotypes and patterns. The UML is a powerful solution for application object modeling, but developers doing realtime and database applications find it doesn't address all of their needs. For those purposes, developers often turn to data modeling and ORM.
To model with ORM you can use Visio Enterprise, a product that also supports UML. If you're building a multitier application, you can use ORM for the database tier and UML for middle-tier objects or components. Of specific interest is Visio Enterprise's ORM tool known as VisioModeler. The term data model can be confusing, so I'll discuss that before launching into an explanation of data modeling.
Classifying DBMSs by Data Models
Developers today are building e-portals and Web sites that use new and legacy databases. To acquire data from different sources, you need to know that not all database engines use a common logical data model, such as E.F. Codd's relational model. In this context, data model refers to the core organization of data managed by a DBMS.
In approximate chronological order, industry and academia developed technology for managing hierarchical, CODASYL, relational, object, and object-relational databases. The hierarchical model is similar to the tree structure of DOS and UNIX directories. The Committee on Data Systems Languages (CODASYL) network model is similar to a doubly linked list, whereas the model for an object DBMS (ODBMS) is analogous to the persistent version of application objects. SQL databases are the closest thing we have to an implementation of Codd's relational model, but many SQL products are moving to the object-relational model as described in my March 1998 column. We also use the term data model to classify the operations used to access data in a database. The object database model is based on operations on objects and navigating by using object references or object IDs. The CODASYL model is based on operations on sets and procedural, record-by-record navigation. The relational model is based on nonprocedural operations on relations.
Classifying databases on the basis of language and data models is a practice that dates back about 30 years ago to E.F. Codd's writings. Codd wrote that database managers "may be classified by the data model with which the user interacts and the language provided to the user for expressing this interaction." In an Intelligent Enterprise magazine retrospective on Codd's writings, Chris Date wrote "The data model can be trees, nets, or relations; the language level can be low (Codd also calls this level procedural'), intermediate (algebra based), or high (calculus based)." Another difference in data models is the extent to which they provide data independence by isolating programs and users from implementation details and physical disk organization. Table 1 illustrates varying degrees of transparency offered by different data models.
Database developers also use the term data model to refer to a model of the objects or entities contained in an application domain (universe of discourse). They produce that model using methodologies such as Entity-Relationship Diagramming and ORM.
Why Should I Care About Modeling?
The public became aware of computer models after the publication of Eugene Burdick's The 460, but it was software quality-assurance research that contributed to the growth of software tools and modeling. Barry Boehm, Ray Wolverton, and others produced definitive studies that showed correcting errors at analysis and design time was easier and less costly than repairing problems later in the development cycle. Let's look at two major reasons to use modeling.
Reduce software costs and create more reliable software. The emergence of SQL also contributed to the need for modeling and better design technologies. Poorly designed SQL databases can lead to data-integrity problems and long-running queries. Web-site designers think in terms of seconds when describing slow-loading pages, but SQL problems can turn query execution time from seconds to minutes, and even hours. SQL users can also experience query anomalies when databases are not normalized. When you build large-scale applications, you must deal with issues such as index design, normalization, and the sheer complexity of some schemas. If you do a poor job on design, you'll find that a database-enabled Web site or application is not responsive.
Produce efficient SQL queries that maintain data integrity and return correct answers. There are other reasons, but creating reliable software and robust databases should be enough motivation for using modeling tools. For prime-time database design, you can use Visio Enterprise -- the Swiss army knife of software engineering tools. Visio Enterprise 5.0 supports a variety of design and modeling technologies, including UML 1.2, ORM, Booch, Rumbaugh, Nassi/Schneiderman, Shlaer-Mellor, Jacobson, Microsoft Repository 2.0, Platinum Erwin, and UML Navigator. It also includes diagramming tools for the Microsoft Data Warehousing Framework, Novell Directory Services, and Windows NT (2000) Domain and Active Directory services.
Data Modeling and ORM
Data modeling is a process of defining the objects or entities in an application domain, and the relationship between them. ORM is a data-modeling methodology that evolved from pioneering work in semantic modeling, such as J.R. Abrial's "Data Semantics" (Data Management Systems, North-Holland Publishers, 1974). Semantic modeling describes the semantics of the real world, representing real-world entities as concrete or abstract objects. ORM is a derivative of Professor G.M. Nijssen's Information Analysis Methodology (NIAM), known today as the Natural-language Information Analysis Method (NIAM). Nijssen was the coauthor, with Dr. Terry Halpin, of the 1989 edition of Conceptual Schema and Relational Database Design: A Fact-Oriented Approach (Prentice Hall). Dr. Halpin extended NIAM and formalized it as ORM, which he explains in the second edition of Conceptual Schema... (WytLytPub, 1999). He also contributed to the design of VisioModeler, which lets you use diagrams and sentences to systematically describe relationships and constraints. Using VisioModeler, you create an ORM model by expressing a series of facts, which are descriptions of objects playing roles. Example 1 is a simple example of a fact and several instances of the fact.
Business Rules
ORM provides a means of articulating rules that comprise the core logic of business processes, and including those rules in a rigorous system design. Business rules are the formal and informal knowledge that guides the structure, operation, and control of the enterprise. GUIDE (the IBM user society) sponsored an interesting business-rules project that published its report on the Web (see " Online"). The GUIDE report addresses structural rules, event responses, action assertions, and other aspects of defining business rules.
Y2K highlighted the problem of business processes for which rules are documented only in algorithms and code -- meaning they are accessible only to programmers. To avoid repeating that mistake today, your organization should foster an information exchange between developers and the key people who understand its business processes. ORM is a tool for communication between developers and users. It lets you define business rules and relationships by expressing them as easily understood facts. Because ORM models express facts in English statements ( Example 1), they are readable by domain experts who can provide feedback about the accuracy and efficacy of the new system.
Conceptual, Logical, and Physical Models
When modeling the domain of an application, or its universe of discourse, you can operate VisioModeler in a graphical mode augmented by text displays. VisioModeler provides different views of information: a conceptual view, logical view, and physical view. The modeling process follows a progression of developing the conceptual schema (or model), then the logical schema (model), and finally the physical database schema. It also supports reverse engineering such as creating a logical model from a physical Oracle database.
The conceptual model includes a high-level description of facts, such as Example 1. Halpin describes the conceptual schema as "a general design plan of the universe of discourse." VisioModeler validates the conceptual model before deriving a logical model. The logical schema or logical model includes details such as structures for data and operations such as using an index. Although logical data models can be of several types (hierarchical, network, relational, object-relational), VisioModeler does not generate models for, or reverse engineer, prerelational (hierarchical and network) databases. VisioModeler supports logical modeling using relational or IDEF1X techniques and generates fully normalized logical models. VisioModeler validates the logical model before transforming it into a database schema.
Natural Language or Graphical Modeling?
VisioModeler puts a graphical interface over natural-language modeling techniques. Instead of generating a text file containing Formal Object-Role Modeling Language (FORML) statements, you typically use diagrams to create a conceptual model. Using point-and-click techniques, you add objects, predicates, and constraints to a model. You can edit an individual fact by right-clicking on its predicate and using the Fact Editor.
Figure 1 is the graphical representation of several facts about Article objects, Title objects, Author objects, and so on. VisioModeler provides an option to verbalize facts, such as the fact shown in Figure 2. The fact in Figure 2 is the verbal equivalent of the fact shown at the top of the ORM diagram in Figure 1.
FORML sentences use object types to describe real-world objects, or sets of data that fit into a logical category such as Surname. When you define an object, you specify whether it is an entity or a value. When defining objects in sentences, you include the name of an object type, a reference mode, and a value. Reference mode is how we commonly refer to an object. For example, we use a name for a street and a code for a postal zone. In Example 2, President is the entity-type, surname is the reference mode, and Lincoln is the value.
The facts in an ORM model include predicates and a definition of simple or composite objects playing roles. VisioModeler uses role connectors to link objects to the roles they play. The arity of a fact refers to the number of roles an object plays. Objects can play one role (unary predicate), two roles (binary predicate), three roles (ternary predicate), and so on. The predicates in Figure 1 are all unary.
Constraints and Cardinality
After you add objects and predicates to a model, you can apply constraints on facts and enter example data, which enables VisioModeler to analyze the model and suggest constraints. The VisioModeler constraints palette is shown on the lower right side of Figure 1. You use it to add constraints such as mandatory roles, uniqueness constraints, primary reference constraints, index constraints, and so on. To express cardinality, you place arrows over role connectors that define one-to-one, one-to-many, and many-to-many relationships. In Figure 1, the arrow over the connector between Article and Title indicates a single-role uniqueness constraint. In other words, Article has a one-to-one relationship with Title. If you use the Fact Editor to view the constraints for the fact at the top of Figure 1, you'll see the information shown in Figure 3.
Using the Logical Model
After validating your conceptual mode, you can generate a logical model such as that in Figure 4, an example that uses relational notation. Working with the logical model diagram, you can examine details such as table contents, types, primary keys, and table relationships (foreign keys). After you've validated the logical model, you select the appropriate database driver for your target DBMS and generate SQL Data Definition Language (DDL) statements. VisioModeler has drivers for desktop databases (Microsoft Access and Corel Paradox) and client-server databases (Oracle, Informix, IBM DB2, Sybase, and Microsoft SQL Server). It can also generate SQL for other databases for which you have an ODBC driver.
As SQL databases evolve to include object-relational capabilities, modeling tools must evolve to support SQL3 features. VisioModeler lets you use advanced features of products such as Informix Dynamic Server and IBM DB2 Universal Database. You can divide some entities into subtypes and use object-relational types such as collections, opaque types, and named row types.
ORM or UML? Graphical Modeling?
Because VisioModeler supports both UML and ORM, we sometimes face the obvious question "Should I create an ORM or UML model?" ORM represents mature semantic modeling technology. It's a powerful tool for data modeling to design SQL databases (relational and object-relational). UML's roots are in object modeling, so you might prefer it for projects that use an object database. Terry Halpin succinctly described the applicability of ORM and UML for data modeling: "Although the UML facilitates software modeling, its object-oriented approach is arguably less than ideal for developing and validating conceptual data models with domain experts."
Not everyone is convinced a graphical tool is the best solution for modeling. Joan C. Nordbotten of the University of Bergen (Norway) and Martha Crosby of the University of Hawaii have done research on the use of graphic modeling tools. By studying eye movements and verbal protocols, they found that graphical styles influenced both model interpretation and reading strategy, and that students had increased error rates as models became more graphical.
The conclusion I draw from their research is that when you engage designers and subject-matter experts (domain experts) in a design review, you should provide a verbal expression of facts, as well as their graphical representation. Luckily, VisioModeler and FORML provide that capability.
(Get the source code for this article here.)
Ken consults and teaches Expert Series seminars. He is the author of Database Magic with Ken North (Prentice Hall) and Windows Multi-DBMS Programming (Wiley).