Dr. Dobb's | Evolving a Domain-Specific Language

Evolving a Domain-Specific Language

After designing and using a DSL for a while, you will inevitably want to change it

May 31, 2007
URL:http://drdobbs.com/evolving-a-domain-specific-language/199800215

Steve Cook is a software architect for Microsoft, Gareth Jones is a professor of management at Texas A&M University, Stuart Kent is a program manager at Microsoft, and Alan Cameron Wills is technical director of TriReme International. This article was excerpted from their book Domain-Specific Development with Visual Studio DSL Tools. Copyright (c) 2007 Addison Wesley Professional. All rights reserved.

After designing and using a Doman-Specific Language (DSL) for a while, you will inevitably want to change it, and you will not want to lose the language instances that you have already created.

Changes in a DSL take one of the following forms:

Additions. New properties in a class or relationship; new classes; new relationships. The deserializer will take care of these, filling in default values when reading objects from older files.
Deletions. Properties, classes, or relationships removed. In this case, the deserializer will ignore the deleted elements.
Refactoring. A structure is replaced by something that represents the same information in a different way. For example, we might replace the Issue State Transition by a class with a pair of relationships on either side. Here, the deserializer will see some additions and some deletions but will not be able to do the translation automatically. Changes of property and relationship types come under the refactoring heading.

Of these, the language author needs to worry about refactorings. The following tactics mitigate the pain of change to some extent:

Separate refactorings. Identify and localize each refactoring, and plan separately how you will deal with each one. For example, one version change of the Issue State DSL might introduce a domain class to replace the Transition relationship, and might also introduce multiple issue types per project. As far as possible, we should keep those changes independent of each other.
Don't delete anything. Instead of deleting the old structure, just add the new one; for example, have both the Transition relationship and the domain class replacing it coexist in the same model. Deprecate the old structure by removing its tools from the toolbox, and any merge directives or connection builders that might deal with its creation, so that new instances can't be created. Then use one of the tactics that follow this one.
Ensure the tools work with both structures. During a migration period, ensure that the generating templates and other tools can work with both structures as much as possible. If the change is prompted by a changed requirement in what the tools generate, at least allow language users to edit and work with the old instances until they need to generate material from them.
Provide translation tools. One way is to provide a command that scans a store and updates the model in place. It is possible to make this happen on opening a file of an older version. Another convenient method of translating instance models is to write a generating template that creates the serialized form of the new version of the DSL from models of the older version. XSLT can also be used to translate one model file into another, given that they are both XML files.
Validate required fields. If you translate on opening an old file, make sure that your validation methods check any new fields so that the user is prompted to provide the required extra information.
Delete obsolete domain classes when you're sure all the instances are updated.
Publish the update plan to your language users so that they know when older structures will be obsolete.

Migration between language versions is an important and common problem. Having a forgiving file reader (deserializer) certainly relieves some of the pain, but you could do more. It is possible to conceive of tools that automate some of the above tactics. For example, one could maintain the history of changes to a domain model between versions and then use this information to generate a tool that does the conversion automatically.

What Makes a Good DSL?

A good DSL -- one that helps the people who are using it -- makes it easy to clearly represent information for a particular purpose. The following are some characteristics of a good DSL:

Well-defined scope. There is an easy-to-state single purpose for the DSL -- for example, "to describe the permitted sequence of states of issues in a project."
Minimality. There are no more concepts and relationships than are required for the ideas that need to be described with the DSL. If you can see a way to reduce the number without losing information, do so.
Familiarity. If there is a notation already in use in the domain -- railroad tracks? electrical circuits? -- base your notation on that. If the type of information is normally associated with a particular style of diagram -- statecharts -- use the conventions familiar there. Adapt the notation where you need to.
Bring out the important bits. The most important things should be the most obvious and easiest to change: names of states, transitions between them. Less important detail -- for example, the target namespace of the generated code -- can be left to the properties window. If you need to print out the full detail for some reviews, write a report template.
Moderate amount of redundancy. Ensure that when the user has made an important distinction, it was intentional. For example, we flag an error if the user has not marked a state with no incoming transitions as "start." On the other hand, users don't like to have to say everything twice, so don't overdo it.
Use the syntactic space. Most of the diagrams a user can draw should be meaningful. A language that is heavy with complex constraints probably doesn't fit the conceptual space well and is difficult to remember how to use. The big advantage of a DSL is that the notation guides you through useful things to say.
Graphs are not syntax trees. If you have an existing text notation, it is unlikely that its syntax tree will be the best graphical notation. See the example on regular expressions in the next section.
Allow for customization. Your DSL will not cover all the cases that users will require. Provide for further customization through handwritten code, preferably without forcing the users to alter generating templates.
Use the users' language. The language should be expressed in its users' terms, talking about their requirements -- rather than its implementations. Talk about issues rather than tables. Where it's difficult to avoid bringing in some implementation concerns, try to package them up in a single domain class, or by exploiting categories in the properties window. For example, in the DSL designer you'll find that properties that are mostly concerned with the form of the generated code are always gathered together under a "Code" category.

To illustrate the last few points further, we will consider the development of a different example.

Appropriate Notation: An Example with Regular Expressions

As software designers, we are always strongly aware of the tree-structured nature of just about everything we deal with. In designing a DSL notation, there is always an inclination to represent this tree directly in the syntax. However, this is not always the most usable option.

Regular expressions are an example of a textual DSL. Regular expressions have a very compact textual notation, and while very powerful for those who have learned the notation, occasional users can find them opaque. The goal in this example is to create a DSL in which regular expressions can be constructed with a graphical notation. The expected benefits include:

The toolbox guides construction of the expression.
A pictorial model is easier to follow.
Validation checks can be applied.
Graphical presentation should be easier to learn and understand.

Reminder about Regular Expressions

Regular expressions can seem very arcane to the uninitiated, but the basic idea is simple. Suppose you are processing some text -- let's say, an HTML file; and you want to find the next html element (between < and >); and you want to extract the tag and the parameter pairs of the element and get them into separate variables. So you call:

foreach (Match m in
   Regex.Match(yourHtmlString, theRegularExpression))
{ ... and each m is a match to part of the string ... }

The regular expression contains a sequence of characters that you expect to find in the string, and certain characters (parentheses, * + ? [ ] and one or two others) play special roles. * signifies a repetition of zero or many of what went immediately before, so that < * matches a < followed by any number of spaces, including none. + is similar, but insists on at least one occurrence. Square brackets match any single character in the range defined within, so that [A-Z]+ matches any sequence of at least one capital letter. Parentheses demarcate a match that you wish to capture into a variable, so that ([A-Za-z]+) should return to you a word made of one or more alphabetics. (?:...)* repeatedly performs the matches within the parentheses without capturing the whole thing to a variable. "|" specifies alternative matches. (?...) matches a pattern that a later ${name} must match in the same way -- so that the use of quote in the following example ensures that an opening quotation is matched with the same kind of closing mark. This regular expression:

< *([A-Za-z]+) +(?:([A-Za-z]+) *= *(?<quote>"|')([^"']*)${quote} *)*/?>

matches, for example:

< table bgcolor= "#ffddff" border="1' >

as illustrated in Figure 1.

< *([A-Za-z]+) +(?:([A-Za-z]+) *= *(?<quote>"|')([^"']*)${quote} *)*/?>

Figure 1: Interpretation of a regular expression.

The objective of this DSL is to make a more accessible notation for regular expressions.

Candidate Notations

There are several notations that might be made to work, each of which takes a rather different approach. One of our major principles is to work from instances, so in the following we use an example that teases out the notational distinctions:

<a (?:b[cd]*e|[a-z]+)z.

Candidate Notation 1: Regexp Tree

This notation (Figure 2) directly represents the abstract syntax tree of the regular expression. A sequence is represented by a vertical box, iterations are shown as a "*" in a circle, and alternatives are shown as a "|" in a circle.

Figure 2: Regexp Tree notation

The difficulty is that it isn't very easy to follow how it matches up to a given input string. For one thing, you have to go back and forth between nodes and their descendants to follow the matching process.

Candidate Notation 2: Railroad Tracks

"Railroad tracks" (Figure 3) have been used for presenting parsing paths for many years (notably in the syntax definition of Pascal). Match by pushing a counter around as you follow the string; replicate the counter on going through a branch (small filled rectangle); delete a counter when it fails to match; the whole thing matches if you get a counter to the end.

Figure 3: Railroad track notation.

The big drawback to this notation is that there are a lot of possible graphs that don't correspond to regular expressions. While it is possible to write validations, contraventions are not always easy for the naive user to spot, and it would be very irritating to constantly run into obscure error flags. In general, one of the expectations of a DSL is that it helps you to make statements that make sense within the domain.

The particular difficulty is that it allows you to make non-nested loops, and it can be quite difficult, depending on the layout of the graph, to see whether the constraint has been contravened or not, as illustrated in Figure 3 (which is invalid).

Figure 4: Invalid railroad track

Candidate Notation 3: Nested Boxes

This is a compromise solution in which arrows represent sequence, and nesting represents containment (Figure 5). The rule here is that paths can only converge or diverge at the branch points on either side of a box that contains the alternative paths. Each node otherwise just has at most one arrow entering and one leaving. There are also loop boxes with ports marked * or +.

Figure 5: Nested box notation.

This works just as well with the counter semantics while at the same time disallowing spaghetti branches. At the exit of a + or * loop box, you can move the counter around the box boundary back to the entry port. If the entry port is * or ?, you can avoid the content altogether by moving around the boundary straight to the exit. (The entry ports are decorated with arrows lined up with the box boundary to suggest this behavior.)

Candidate Notation 4: Nested Paths

This is another counter-following notation, but there are two kinds of link. Each node only has one entering link (Figure 6). To match a string to the expression, you follow the Next links, matching each character to the character in a box; if you match the last node to the string, the match has succeeded. If you come to a diamond or circle, you must first follow its Parts links, and match that (by getting to the last link) -- if there are several Parts links, split the counters and a match to any one will do. This notation is a bit more difficult to follow than the nested boxes, but (unlike candidate 2) all of the graphs you can draw make sensible regular expressions, and (unlike candidate 1) sequences of matches are represented by following Next links rather than working through a fan of sibling links so that you can understand the graph by pushing counters around it.

Figure 6: Nested path notation.

Graphs Are Not Syntax Trees

Nested Boxes (or its pragmatic variant, Nested Paths) seem to be the best notation, but notice how far those notations are from the initial direct representation of the regular expression syntax tree. The first candidate would be the easiest to generate regular expressions from, but the better notations help a user to understand the concepts.

The same consideration will frequently apply when taking any existing text syntax -- typically an XML file -- and creating a diagrammatic representation of it. The syntax tree of the existing language is often not the best option.