A Working Example
An enticing aspect of this technique is that it is not theoretical. There is no need to wait for tools or guess at the details of how it works: all of the components and artifacts for a working data warehouse application are available for review and a test drive, and all the tools are either open source or available free for the purpose of making a prototype.
Oracle Warehouse Builder (www.oracle.com/technology/products/warehouse) makes it easy to model a data warehouse using the CWM. OWB hides all the complexity and ensures a correct model, which you can then export into XMI format. With OWB, you can also generate the DDL required to build the data warehouse. OWB is well documented and is excellent for generating your first data warehouse database model.
Once you're familiar with a correctly built CWM data warehouse in XMI format, as produced by OWB, it is relatively straightforward to create data warehouses with MofEditor (www.fing.edu.uy/inco/ens/aplicaciones/MofPlaza/web/mofplaza/mofeditor.htm), an open-source tool that lets you graphically create models based on MOF components. MofEditor will not help you generate the DDL for a data warehouse, but the technique described in this article can be used to aid in the creation of DDL, and that code would be reusable for every dimension and fact table that is subsequently needed.
I will illustrate the technique by showing how to transform a log file into a bulk load file to populate a fact table. To provide interesting transformations, some of the fields from the log file will be transformed into their respective dimension keys before being placed in the bulk load file.
Figure 1 is a portion of the sample model. It shows two fields, state and units, being transformed from the source log file into a batch load file for filling a fact table. The fields of the log file are listed vertically on the left, the fields of the resulting batch load file are listed vertically on the right, and the transformations run horizontally from left to right.

Every Transformation has a TransformationUse, which indicates the kind of work that is done in the transformation, although it doesn't provide details of how to do the work. In this example, the transformation sampleStateToBatch has a relationship to a TransformationUse object called "lookup." (The lookup node is another part of the diagram, connected by a relationship line.) The transformation sampleUnitsToBatch is connected to a TransformationUse called passThru. This is how the model indicates the different kinds of work performed in the transformations: in the state transformation, the state is looked up in a dimension table, and the key for that state is placed into the batch load file; in the passThru transformation, the units are reformatted into a normalized string and placed into the batch load file, and no lookup is performed.

There is a DataObjectSet node between the Transformations and the source fields because a transformation can have more than one source field. Similarly, any transformation can have more than one target field, so there is a target DataObjectSet associated with each Transformation. All of this is well documented in the CWM specification.
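To make the two TransformationUse kinds concrete, here is a minimal sketch of what the generated transformation code might do at runtime. The class name, field names, dimension keys, and record layout are all invented for illustration; the real code is generated from the model as described later.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the two kinds of work named by TransformationUse.
// The dimension contents and output format are assumptions for illustration.
public class TransformKinds {

    // "lookup": the raw state value is replaced by its dimension key.
    static int lookupState(Map<String, Integer> stateDim, String rawState) {
        Integer key = stateDim.get(rawState.trim().toUpperCase());
        if (key == null) {
            throw new IllegalArgumentException("Unknown state: " + rawState);
        }
        return key;
    }

    // "passThru": the raw units value is only normalized; no lookup occurs.
    static String passThruUnits(String rawUnits) {
        return String.format("%010d", Long.parseLong(rawUnits.trim()));
    }

    public static void main(String[] args) {
        Map<String, Integer> stateDim = new HashMap<>();
        stateDim.put("CA", 6);   // pretend dimension keys
        stateDim.put("NY", 36);

        // One log record becomes one batch-load record: key plus normalized units.
        String batchLine = lookupState(stateDim, " ca ") + "|" + passThruUnits("42");
        System.out.println(batchLine);  // 6|0000000042
    }
}
```

The point of the model is that only these two kinds of routines need to be hand-written; every state-like and units-like field reuses them.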
MofEditor is capable of exporting any MOF model into XMI format. Once the MOF model is in XMI format, whether it is a model of a database or a set of transformations, Netbeans Metadata Repository (MDR) can import the model and provide both a graphical interface and a programmatic interface to the model.
The MDR programmatic interface makes it easy to answer a number of questions including what set of transformations operate on a particular file; what set of fields exist in a particular file; what TransformationUse (kind of transformation) is used for a particular transformation; and what database columns are affected by a particular log file field.
The MDR browser provides a mechanism for moving around in a model, regardless of the number of tools used in creating it. It can be used to look at the whole model or any part of it, hiding complexity as needed. It is a great tool for finding semantic errors in the model.
The programming interface to the model is in Java using the Java Metadata Interface (JMI) spec (java.sun.com/products/jmi/). However, the resulting code does not have to be in Java: The technique is equally effective in producing C#, DDL, XML, and HTML.
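As a hedged illustration of producing something other than Java, once dimension names have been read out of the model, emitting DDL is plain string work. The column layout below is an assumption invented for this sketch; in practice the model supplies the attributes of each dimension.

```java
import java.util.List;

// Hypothetical sketch: turn dimension names pulled from the model into DDL.
// The table naming convention and columns are invented for illustration.
public class DdlEmitter {

    static String dimensionDdl(String dimName) {
        String table = dimName.toLowerCase() + "_dim";
        return "CREATE TABLE " + table + " (\n"
             + "  " + table + "_key INTEGER PRIMARY KEY,\n"
             + "  " + dimName.toLowerCase() + "_name VARCHAR(64)\n"
             + ");";
    }

    public static void main(String[] args) {
        // Stand-ins for the names returned by dim.getName() from the model.
        for (String dim : List.of("State", "Product")) {
            System.out.println(dimensionDdl(dim));
        }
    }
}
```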
Example 1 shows how easy it is to get a list of dimensions out of a model. The API is similar for almost any information that you would like out of the model.
// connect to the repository
MDRepository rep = MDRManager.getDefault().getDefaultRepository();
if (rep == null) {
    throw new Exception("MDRManager returned a null repository");
}
...
public Collection getDimensions(DwDesignPackage extent) {
    RefPackage olap = extent.refPackage("Olap");
    DimensionClass dc = (DimensionClass) olap.refClass("Dimension");
    return dc.refAllOfClass();
}
...
DwDesignPackage extent = mdr.getExtent();
for (Iterator iter1 = (mdr.getDimensions(extent)).iterator(); iter1.hasNext();) {
    Dimension dim = (Dimension) iter1.next();
    ...
    String dn = dim.getName();
The second step in the technique is to write some code in whatever language is convenient to implement the desired transformations. Do not implement the code for all of the transformations; implement the code only for the unique kinds of transformations.
Write the code with the model in mind so that it follows what has been modeled. You can then test the code and break it apart into template files so that the code can be regenerated using the model and the JMI API. If you can generate the working code you just built, you can generate the working code for the rest of the model. It's easy to see how this technique can speed up development on applications with repeating components. Indeed, in some instances, I have used the technique to generate the code for over 75 percent of the application.
Example 2 is a portion of template code. Available electronically (at www.ddj.com/code/) is a complete working ETL application with all of the code, templates, models, and DDL, which can be examined to obtain a deeper understanding of the process.
private HashMap build<%TfmName%>Lookup() throws Exception {
    CustomTransform customTransform = new CustomTransform();
    <%DimName%>ManagerOlapFactory <%dimName%>MOFactory = new <%DimName%>ManagerOlapFactory();
    <%dimName%>MOFactory.setProperties(properties);
    <%DimName%>ManagerOlap <%dimName%>ManagerOlap = <%dimName%>MOFactory.create<%DimName%>ManagerOlap();
    <%DimName%>Olap <%dimName%>Olap[] = <%dimName%>ManagerOlap.loadAll();
    HashMap lookup = new HashMap();
    Class[] preformatArray = {Object.class};
    boolean preformatSet = false;
    Method preformat = null;
    try {
        preformat = customTransform.getClass().getDeclaredMethod(
            "pre<%TfmName%>", preformatArray);
        preformatSet = true;
    } catch (NoSuchMethodException nsme) {
        preformatSet = false;
    }
I've used <% and %> to bracket the variable parts of the templates. This character sequence was selected because I didn't expect to encounter it in my regular code. Any other unique character sequence would work equally well.
Once you master the API for accessing the model, generating the original code from the model with the templates is relatively easy work.
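The expansion step itself can be very small. The sketch below is an assumption about how a generator might work, not the article's actual generator: it scans for the <%Name%> markers used in Example 2 and replaces each with a value drawn from the model (here, a plain map standing in for JMI calls).

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical template expander: the marker syntax follows Example 2,
// but this implementation is a sketch invented for illustration.
public class TemplateExpander {

    private static final Pattern MARKER = Pattern.compile("<%(\\w+)%>");

    static String expand(String template, Map<String, String> values) {
        Matcher m = MARKER.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String value = values.get(m.group(1));
            if (value == null) {
                throw new IllegalStateException("No value for marker " + m.group(1));
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String template = "private HashMap build<%TfmName%>Lookup() { ... }";
        // TfmName would come from the Transformation's name in the model.
        System.out.println(expand(template, Map.of("TfmName", "SampleStateToBatch")));
        // private HashMap buildSampleStateToBatchLookup() { ... }
    }
}
```

Running the same template once per Transformation found in the model yields one generated method per dimension lookup.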
Some of the other tools I used in conjunction with this technique are: JEdit, which is great for looking at XMI code; Cognos Framework Manager, which reads CWM models directly; Mondrian, which is a good open-source OLAP reporting tool that can be configured from XML files built from the technique in this article; and SQL2JAVA, which eases building the database persistence layer for Java.