Dr. Dobb's | Software and the Core Description Process

Software and the Core Description Process

PSICAT is an open-source, cross-platform Java tool for creating, viewing, and editing geological core description diagrams.

September 05, 2007
URL:http://drdobbs.com/jvm/software-and-the-core-description-proces/201804201

Josh is IT manager for the Antarctic Geological Drilling (ANDRILL) project. He can be contacted at [email protected].

Core description is the process of documenting the cylindrical rock samples, called "cores," that result from scientific drilling. As the cores come out of the ground, they are examined and described by a team of scientists (sedimentologists) who specialize in identifying and interpreting the "stories" in the rock. The end product of this process is a series of diagrams that graphically represent the core description. These diagrams are crucial as they are the primary record of what was recovered during drilling and provide the scaffolding upon which all further scientific analysis is built.

Traditionally, the core description diagrams are created first by sketching them in a field notebook, then having some poor soul—usually a graduate student—draft them up in a graphics application such as CorelDRAW or Adobe Illustrator. This creates nice, publication-quality diagrams, but sufferes from two major faults:

There is a large amount of duplicated effort because each diagram is drawn twice—once by hand, then again in digital form.
The final product is a collection of static images that cannot be easily manipulated, analyzed, or searched. Answering questions such as "what percentage of the sediment was sand?" requires visually reviewing each diagram and manually recording the features of interest.

This was the case with ANDRILL (www.andrill .org), a drilling project in the Antarctic on which I am the IT manager.

After mulling over the problem of improving the current core description process, I came to the conclusion that the data encoded in the diagram was the key. The sedimentologists were drawing diagrams that visually represented some data (depth, grain size, rock type, and the like) without ever capturing that data itself. It's akin to drawing a bar chart without keeping the data being charted. Is the data lost? Not really; it's just not in a format that can be used to calculate a mean or standard deviation. Being able to access this core description data directly, instead of through its graphical representation, would let the scientists search for features of interest and perform further analyses more easily.

Data Capture

I considered three approaches to capturing the data represented in the core description diagrams:

Eliminate the diagramming altogether and have the scientists enter the data directly into spreadsheets.
Post-process the diagram images and extract the data using image analysis.
Provide an interactive drawing environment to capture the data as the diagram is drawn.

The scientists rejected the first approach because it was too different from the drawing process they were familiar with, and it eliminated the diagrams—the de facto standard in conveying the core description. The second approach was promising because it did not require changing the existing process, and it held the potential to work with existing diagrams and those from future expeditions. Ultimately, I rejected this approach because of its inherent inaccuracy. There was too much variability, in both structure and style, between the few example diagrams I was provided to expect any reasonable success rate with image-analysis techniques.

I settled on providing an interactive drawing environment and capturing the core description data as the diagram is drawn. This required the development of a custom drawing environment, no small task, but preserved the diagrams and drawing process the scientists were familiar with. It also had the potential to simplify the description process because the environment could be customized specifically to drawing core-description diagrams. The software could provide specialized tools for drawing geological features (like lithostratigraphic intervals) instead of forcing the scientist to represent these features with simple graphical objects (like rectangles and circles). The software and scientist could work at the same semantic level.

Before I could begin work on the software, which would ultimately be called "Paleontological Stratigraphic Interval Construction and Analysis Tool" (PSICAT), I spent several months working directly with scientists to learn the specialized vocabulary, concepts, and requirements of core description. I reverse-engineered dozens of actual core description diagrams to determine what data was encoded in them and how the data was visually represented. I also observed the scientists actually describing core. Once I had a handle on the types of data and tasks involved, I began work on modeling these in software.

Data Model

Core description diagrams like Figure 1 display a wealth of different data. Many of the data types share common properties. For example, lithostratigraphic intervals and stratigraphic units each have an associated top and a base depth. Each data type also has unique properties—stratigraphic units have identifiers associated with them, whereas intervals do not. To accommodate these various data types and those that have not yet been identified, PSICAT (portal.chronos.org/psicat-site) required a flexible model and representation scheme for the data.

[Click image to view at full size]

Figure 1: An example core description diagram.

Each high-level data type in PSICAT, such as lithostratigraphic intervals, stratigraphic units, and sedimentary structures, is represented by Model objects, which are associative arrays of key-value pairs. The associative array data structure was chosen for its simplicity and expressiveness—most, if not all, high-level data types can be represented as a collection of named properties. Each Model has two implicitly defined properties: an id, which uniquely identifies the Model; and a type, which identifies the high-level data type of the Model. Beyond these two implicit properties, Models can have an arbitrary number of other properties. Models are also hierarchical in nature and may contain nested Models, indicating a parent-child relationship.

This data model is flexible enough to be serialized to many different representations. PSICAT currently records the data model as an XML file using something like Example 1.

<models>
    <model id="0" type="Project" parent="">
       <property name="name">Sample Project</property>
    </model>
    <model id="1" type="Interval" parent="0">
       <property name="depth.top">0.00</property>
       <property name="depth.base">0.58</property>
          ...
    </model>
    <model id="2" type="Interval" parent="0">
       <property name="depth.top">0.58</property>
       <property name="depth.base">1.12</property>
          ...
    </model>
    ...
</models>

Example 1: PSICAT records the data model as an XML file.

However, there are two oddities in Example 1. First, if the data model is hierarchical and XML is hierarchical, why is the XML representation of the data model flat? During the development of PSICAT, I ran into a situation where deeply nested model trees were overflowing the stack because of too much recursion when parsing the data. In response, I switched to this flat structure, then "treeified" the data model after parsing.

The second is less an oddity and more a limitation of the data model and XML serialization. In Example 1, no type information is persisted about the individual properties. This means the value "number" is just as valid as "1.12" for the depth.base property. This can cause problems when a property value represents a data type other than a character string—a number, for instance. To address this, PSICAT applies a mapping action where it maps the generic Model object to a specialized subclass of Model based on the value of the type property. This specialized Model is still backed by the associative array but can define helper methods; for instance, double getBaseDepth() and void setBaseDepth(double depth), for accessing the property values using native data types.

Software Architecture

Core description is a complex process to model in software. It involves data capture, visualization, analysis, collaboration, revision, and publication, with different features required for each step. To complicate things further, I wanted PSICAT to be useful to the broader geoscientific community, rather than just a one-off, custom solution developed for ANDRILL. Since this is an ambitious and complicated undertaking, I needed a flexible, extensible software architecture that would let me reuse functionality common to all drilling projects, while still supporting the development of custom features for specific groups or tasks.

As a Java developer, I looked no further than Eclipse (www.eclipse.org) for my architecture solution. Known as a Java IDE, Eclipse is a full-blown platform of tools and frameworks for developing and managing software. Built on OSGi technology, Eclipse takes the concept of a modular architecture to the extreme.

Traditional applications, such as web browsers, provide well-defined interfaces and extension mechanisms that you can plug into and add new functionality. The host application is fully functional; the plug-ins simply augment it with new features. Eclipse, on the other hand, is built entirely of plug-ins. There is no host application per se, just a small engine that loads and runs plug-ins. All of the end-user functionality (editing and compiling Java source code) is implemented as a collection of collaborating plug-ins.

This pure plug-in approach offers many advantages, including a high-degree of flexibility and reuse. Eclipse can be easily customized to a specific task, such as developing Java code, developing web applications, or managing remote databases, simply by virtue of which plug-ins are included/removed. If a plug-in already exists to perform a particular function, it can be reused as is. The most potent advantage of this approach to application development is that it lets applications be extended in ways not initially envisioned.

Implementation

Eclipse is not only a model for modular, extensible applications, it also provides the Eclipse Rich Client Platform (RCP) on which to build them. RCP combines Equinox, an engine for loading and running plug-ins, and a collection of plug-ins that provide a framework for developing cross-platform GUI applications. RCP integrates with other Eclipse technologies so that applications can leverage frameworks like the Update Manager for managing software updates and the Graphical Editing Framework (GEF) for building graphical editors. Building on RCP let me focus on solving problems specific to PSICAT, rather than reinventing common application components like editors and wizards.

Like Eclipse, PSICAT is a collection of plug-ins that collaborate to provide useful functionality. These plug-ins can be organized into five distinct layers:

The first layer consists of Eclipse technologies: RCP, Update Manager, and GEF.
The second layer defines the data model (the Model objects) and provides services for managing diagrams, resources, and configuration.
The third layer adds the views, editors, and wizards that make up PSICAT. This and the previous two layers provide the scaffolding for data capture and visualization.
The fourth layer, where most of the core description-specific functionality is introduced, consists of two basic types of plug-ins: service plug-ins and column plug-ins. Service plug-ins add new features, such as image export and searching. Column plug-ins add the code for capturing/displaying new types of data on the diagram.
The final layer adds project-, group-, or task-specific customizations to PSICAT.

Noteworthy Features

Using PSICAT for core description offers many advantages over the previous approach of sketching the diagrams by hand and drafting them in a generic graphics application like Adobe Illustrator:

PSICAT lets users export all or a subset of the captured core description data as an Excel spreadsheet. In this format, the data is available for all manner of analysis and plotting, letting users answer questions like "What percentage of the hole is sand?" more easily than if they had just the images.
PSICAT includes searching capabilities so users can quickly find areas of interest in the core without resorting to looking at each diagram. PSICAT currently provides an experimental searching interface that lets users specify areas of interest with natural-language queries such as "sections of diamictite containing symbol pyrite in 0-500m" and "sections of symbol shell or symbol fragmented shell." Work is currently underway on a Google-like search of the written description data.
Often it is useful to plot external datasets, such as physical core properties captured by a multisensor core logging system, alongside the core description. PSICAT lets users import external datasets and integrate them directly into the core description diagrams.
The core-description process often involves more than just the initial characterization of the core. It also includes derived descriptions, such as summary diagrams, which show the general trends. Creating these summary diagrams requires relogging the whole core at a less-detailed scale to show only the important features. This represents a significant amount of work. To address this, PSICAT includes a feature, which processes the core description data with a sophisticated set of rules to automatically summarize it in a matter of seconds. The summarized data can be edited, analyzed, searched, and plotted just like any other core description data in PSICAT.

Conclusion

PSICAT is a free, open-source, cross-platform tool for creating, viewing, and editing core description diagrams. The merit of PSICAT's approach to the core description process was proven during the October 2006-January 2007 Antarctic field season where it was used to log the nearly 1300 meters of core drilled as part of the ANDRILL McMurdo Ice Shelf expedition. PSICAT performed well under continuous daily use in the field and offered many time-saving and science-enabling features.

The development of PSICAT is ongoing. It is being used again in Antarctica on ANDRILL's Southern McMurdo Sound Expedition during the September 2007-December 2007 field season. We are focusing on improving the user experience with PSICAT based on observations made during the first field testing. I also plan to offer more import/export options, more data column types, and improved search functionality.