Evolving a Domain-Specific Language

After designing and using a DSL for a while, you will inevitably want to change it


May 31, 2007
URL:http://drdobbs.com/evolving-a-domain-specific-language/199800215

Steve Cook is a software architect for Microsoft, Gareth Jones is a professor of management at Texas A&M University, Stuart Kent is a program manager at Microsoft, and Alan Cameron Wills is technical director of TriReme International. This article was excerpted from their book Domain-Specific Development with Visual Studio DSL Tools. Copyright (c) 2007 Addison Wesley Professional. All rights reserved.


After designing and using a Doman-Specific Language (DSL) for a while, you will inevitably want to change it, and you will not want to lose the language instances that you have already created.

Changes in a DSL take one of the following forms:

Of these, the language author needs to worry about refactorings. The following tactics mitigate the pain of change to some extent:

Migration between language versions is an important and common problem. Having a forgiving file reader (deserializer) certainly relieves some of the pain, but you could do more. It is possible to conceive of tools that automate some of the above tactics. For example, one could maintain the history of changes to a domain model between versions and then use this information to generate a tool that does the conversion automatically.

What Makes a Good DSL?

A good DSL -- one that helps the people who are using it -- makes it easy to clearly represent information for a particular purpose. The following are some characteristics of a good DSL:

To illustrate the last few points further, we will consider the development of a different example.

Appropriate Notation: An Example with Regular Expressions

As software designers, we are always strongly aware of the tree-structured nature of just about everything we deal with. In designing a DSL notation, there is always an inclination to represent this tree directly in the syntax. However, this is not always the most usable option.

Regular expressions are an example of a textual DSL. Regular expressions have a very compact textual notation, and while very powerful for those who have learned the notation, occasional users can find them opaque. The goal in this example is to create a DSL in which regular expressions can be constructed with a graphical notation. The expected benefits include:

Reminder about Regular Expressions

Regular expressions can seem very arcane to the uninitiated, but the basic idea is simple. Suppose you are processing some text -- let's say, an HTML file; and you want to find the next html element (between < and >); and you want to extract the tag and the parameter pairs of the element and get them into separate variables. So you call:

foreach (Match m in
   Regex.Match(yourHtmlString, theRegularExpression))
{ ... and each m is a match to part of the string ... }

The regular expression contains a sequence of characters that you expect to find in the string, and certain characters (parentheses, * + ? [ ] and one or two others) play special roles. * signifies a repetition of zero or many of what went immediately before, so that < * matches a < followed by any number of spaces, including none. + is similar, but insists on at least one occurrence. Square brackets match any single character in the range defined within, so that [A-Z]+ matches any sequence of at least one capital letter. Parentheses demarcate a match that you wish to capture into a variable, so that ([A-Za-z]+) should return to you a word made of one or more alphabetics. (?:...)* repeatedly performs the matches within the parentheses without capturing the whole thing to a variable. "|" specifies alternative matches. (?...) matches a pattern that a later ${name} must match in the same way -- so that the use of quote in the following example ensures that an opening quotation is matched with the same kind of closing mark. This regular expression:

< *([A-Za-z]+) +(?:([A-Za-z]+) *= *(?<quote>"|')([^"']*)${quote} *)*/?>

matches, for example:

< table bgcolor= "#ffddff" border="1' >

as illustrated in Figure 1.

< *([A-Za-z]+) +(?:([A-Za-z]+) *= *(?<quote>"|')([^"']*)${quote} *)*/?>

Figure 1: Interpretation of a regular expression.

The objective of this DSL is to make a more accessible notation for regular expressions.

Candidate Notations

There are several notations that might be made to work, each of which takes a rather different approach. One of our major principles is to work from instances, so in the following we use an example that teases out the notational distinctions:

<a (?:b[cd]*e|[a-z]+)z.

Candidate Notation 1: Regexp Tree

This notation (Figure 2) directly represents the abstract syntax tree of the regular expression. A sequence is represented by a vertical box, iterations are shown as a "*" in a circle, and alternatives are shown as a "|" in a circle.

Figure 2: Regexp Tree notation

The difficulty is that it isn't very easy to follow how it matches up to a given input string. For one thing, you have to go back and forth between nodes and their descendants to follow the matching process.

Candidate Notation 2: Railroad Tracks

"Railroad tracks" (Figure 3) have been used for presenting parsing paths for many years (notably in the syntax definition of Pascal). Match by pushing a counter around as you follow the string; replicate the counter on going through a branch (small filled rectangle); delete a counter when it fails to match; the whole thing matches if you get a counter to the end.

Figure 3: Railroad track notation.

The big drawback to this notation is that there are a lot of possible graphs that don't correspond to regular expressions. While it is possible to write validations, contraventions are not always easy for the naive user to spot, and it would be very irritating to constantly run into obscure error flags. In general, one of the expectations of a DSL is that it helps you to make statements that make sense within the domain.

The particular difficulty is that it allows you to make non-nested loops, and it can be quite difficult, depending on the layout of the graph, to see whether the constraint has been contravened or not, as illustrated in Figure 3 (which is invalid).

Figure 4: Invalid railroad track

Candidate Notation 3: Nested Boxes

This is a compromise solution in which arrows represent sequence, and nesting represents containment (Figure 5). The rule here is that paths can only converge or diverge at the branch points on either side of a box that contains the alternative paths. Each node otherwise just has at most one arrow entering and one leaving. There are also loop boxes with ports marked * or +.

Figure 5: Nested box notation.

This works just as well with the counter semantics while at the same time disallowing spaghetti branches. At the exit of a + or * loop box, you can move the counter around the box boundary back to the entry port. If the entry port is * or ?, you can avoid the content altogether by moving around the boundary straight to the exit. (The entry ports are decorated with arrows lined up with the box boundary to suggest this behavior.)

Candidate Notation 4: Nested Paths

This is another counter-following notation, but there are two kinds of link. Each node only has one entering link (Figure 6). To match a string to the expression, you follow the Next links, matching each character to the character in a box; if you match the last node to the string, the match has succeeded. If you come to a diamond or circle, you must first follow its Parts links, and match that (by getting to the last link) -- if there are several Parts links, split the counters and a match to any one will do. This notation is a bit more difficult to follow than the nested boxes, but (unlike candidate 2) all of the graphs you can draw make sensible regular expressions, and (unlike candidate 1) sequences of matches are represented by following Next links rather than working through a fan of sibling links so that you can understand the graph by pushing counters around it.

Figure 6: Nested path notation.

Graphs Are Not Syntax Trees

Nested Boxes (or its pragmatic variant, Nested Paths) seem to be the best notation, but notice how far those notations are from the initial direct representation of the regular expression syntax tree. The first candidate would be the easiest to generate regular expressions from, but the better notations help a user to understand the concepts.

The same consideration will frequently apply when taking any existing text syntax -- typically an XML file -- and creating a diagrammatic representation of it. The syntax tree of the existing language is often not the best option.

Terms of Service | Privacy Statement | Copyright © 2024 UBM Tech, All rights reserved.