Brian is a software engineer at the Whitehead Institute for Biomedical Research. He can be contacted at [email protected].
Graph structures are used in applications ranging from representations of molecules in chemistry to analysis of computer network topology. Graph canonicalization is the process of forming a unique text representation for a graph, thereby enabling quick text-based searching without resorting to time-consuming isomorphism algorithms. In this article, I introduce a generic method for graph canonicalization that you can adapt to different graph representations.
The SMILES chemistry language was created to solve problems in chemoinformatics, including representation of molecules for database storage (see "SMILES: A Chemical Language and Information-System," by D. Weininger, Journal of Chemical Information and Computer Sciences, 1988). One of the main problems with database searches for duplicate compounds is that the same molecular graph can be input in many different ways ("six chemists, seven representations"). One of the properties of SMILES is the so-called "canonical smiles output," which is a unique representation for a molecular graph. This technique was used to produce unique database keys for quick discovery of duplicated compounds. This canonical graph representation is also used to produce consistent depictions of molecules. Daylight Chemical Information Systems (http://www.daylight.com/smiles/) created and maintains the SMILES language.
The SMILES algorithm is adapted to solve the canonicalization problem for graphs where the vertices and edges are labeled with any label that can be sorted, such as arbitrary text strings. Graphs that have attributes for vertices and edges are referred to as "labeled" or "colored" graphs. Figure 1 shows two such labeled graphs, both equivalent but represented differently. Here, I present a Python implementation that forms a unique ordering of nodes and edges for arbitrarily labeled graphs.
Algorithm
The graph representation used is straightforward. A graph contains vertices and edges, and a vertex can be traversed through an edge to its adjacent neighbors; see Listing One. Both the vertex and the edge are objects that have attributes and methods. For computation purposes, all attributes are stored in arrays such that the first vertex's equivalence class is stored in the equivalence class array position 0, and so on. Once properties have been computed, they are assigned back as attributes of the vertex and edge objects.
If all the vertices in a given graph have unique properties, the solution to the graph canonicalization is trivialyou simply sort the vertices, then use the sorted vertices to uniquely sort the edges of the graph. A problem arises, however, when two different vertices have the same label or properties. This happens in chemistry where two different atoms may have the same atomic properties, or could be as simple as a molecular graph of DNA that is composed of repeating copies of the same subunits. In these cases, the only difference between two atoms in a molecular graph may be the surrounding topology. In other words, the only difference between two vertices may be the vertices to which they are connected, and a different technique is required to create a unique representation for such graphs. Listing Two is a relatively complete graph implementation using extensions of basic Python types.
The canonicalization algorithm computes three new attributesan equivalence class, symmetry class, and symmetry orderfor each vertex in the graph. The equivalence class is a unique integer value for each vertex or edge label; however, if two vertices or edges have the same label, then they belong to the same equivalence class. Given the equivalence classes, a symmetry class is calculated using the topology or structure of the graph. If two vertices have the same symmetry class, then they are topologically indistinguishable from each other. In other words, two members of the same symmetry class have the same surrounding symmetry in the graph; see Figure 1(b). A symmetry order is then computed that generates a unique order for every vertex in the graph. This unique order breaks ties between two vertices with the same symmetry class and ensures that no matter how the vertices in a graph are input, the canonical ordering for vertices with the same surrounding topology will be the same. In other words, a symmetry order ensures a unique representation for the graph. Again in Figure 1(b), the algorithm needs to determine whether the edges A-b-C should be output before the edge A-d-C, even though their respective end vertices have the same equivalence class and the same symmetry class. Finally, outputting nodes and edges in order of their symmetry orders generates a unique graph representation. Example 1 shows the basic steps of the algorithm.
Computing the Equivalence Class
With this background, the algorithm is deceptively simple. The first step is assigning an equivalence class to each vertex. This is done by sorting the set of unique vertex labels, then assigning the equivalence class for a vertex as the sorted position of the vertices label plus one. The same process is used to generate an equivalence class for each edge; see Listing Three. (I use the term "node" for coding, and "vertex" for discussion of the algorithm.)
Computing the Symmetry Class
Computing symmetry classes is an iterative procedure. At every step, a new symmetry class is generated for each vertex and the process continues until the symmetry classes stabilize for all vertices. At each iteration, the goal is to take identical symmetry classes and determine if the surrounding topology of the corresponding vertices are different. If the algorithm determines a difference in topology, these ties are broken and a new symmetry class is added to the graph. Each iteration propagates these changes in symmetry classes to their surrounding vertices. An oversimplification is that the algorithm asks the question, "Are the neighboring vertices at a distance of one different?" and, "Are the vertices neighboring vertices at a distance of two different?"continuing until these differences cannot be found. The source code for the canonicalization class is available electronically; see "Resource Center," page 5.
To determine if neighboring topologies are different between two vertices, the current symmetry orders are sorted and replaced with their corresponding primes. Because prime numbers cannot be factored, the products of the corresponding primes of the vertices' neighbors are guaranteed to be unique for different sets of neighbors. The same is done for generating the corresponding primes from an edge's equivalence class, although this is done only once since the edge values are static.
In brief, the algorithm is: Initialize the symmetry class by setting it to the vertices equivalence class. The iterative procedure is then: First, the current symmetry classes are ranked and each node is marked with the symmetry class's corresponding prime number. For example, the lowest current symmetry class is assigned 2, the third lowest is assigned 5, and so forth. I don't use 0 or 1 as primes. The symmetry class ranking is detailed in the rank method of GenerateSymmetryClasses. Second, a neighbor value for each vertex is computed by multiplying the products of the corresponding prime of its neighbors and the corresponding prime of the edge traversed to reach the respective neighbor. The neighbor value is used to break ties in the symmetry classes. If two nodes have the same symmetry class but different neighbor values, then they are not topologically the same. The vertex with the largest neighbor value has its symmetry class incremented by one and all vertices with larger symmetry classes are also incremented by one. This is done for all symmetry class ties. This procedure is repeated until the new symmetry classes are equivalent to the previous iteration's symmetry classes, i.e., no new symmetry classes are added to the graph. The final set of symmetry classes is called an invariant in that no matter how the graph is input, each vertex with the same surrounding topology will have the same invariant symmetry class. This is detailed in the findInvariant method GenerateSymmetryClasses.
Figure 2 is a contrived example of determining the symmetry class using this algorithm using a simple graph that stabilizes in one iteration. I start off with the labeled graph in Figure 2(a) and, by sorting the node labels and edges labels, create the initial symmetry classes in Figure 2(b). Now there are two nodes with a symmetry class of 1, so I use their neighbors to see if these nodes are topologically distinct. The symmetry classes are converted to their corresponding primes, Figure 2(c). Then by analyzing their neighboring symmetry classes, I can determine if these two nodes are topologically equivalent. Since in Figure 2(d) they aren't equivalent, I generate a new symmetry class and increment all higher classes by one resulting in Figure 2(e). This process iterates back to generating the new corresponding primes until the symmetry classes converge. In this case, there are no symmetry class ties, so ordering in Figure 2(e) is the stable ordering, and no more iterations are required.
Computing the Symmetry Order
As an initial pass, the symmetry class of a vertex yields symmetry orders. Since the definition of symmetry order is a unique number for each vertex, symmetry class ties must be broken. Because the symmetry class is invariant with respect to topology, ties can be broken arbitrarily. Invariance with respect to topology means that if two nodes have the same symmetry class, then it makes no difference which node comes first in the output. The symmetry orders are broken by finding the lowest repeated symmetry class and arbitrarily adding 1 to one of the ties and all symmetry classes of higher value. This is repeated until no ties remain. The result is called "invariant partitioning," in which the order of the symmetry orders defines a canonical ordering of the graph's nodes. The details are shown in the findInvaritantPartitioning method of GenerateSymmetryClasses (see the canonicalization class, available electronically).
If edges are not labeled, then the initial pass in creating the symmetry orders is all that is necessary. The vertices in the graph can be output in the order of their symmetry orders, and the output order for edges can be determined by the symmetry orders of their endpoints. For example, an edge with symmetry class endpoints of (0, 3) is output before one with endpoints (1, 3).
Determining the order when edges are labeled is slightly trickier. In Figure 3, the two vertices labeled "A" have exactly the same symmetry class and the vertices labeled "C" also have exactly the same symmetry class. However, because the edges are labeled, we need to choose which edge comes first. In Figure 3, I need to distinguish edge "b" from edge "d." A simple traversal algorithm solves this problem by traversing the graph starting from the lowest symmetry order, proceeding to the next lowest symmetry order through the edge with the lowest equivalence class and continuing until all vertices and edges have been visited. The order in which an edge and vertex then visited determines the final symmetry order. Listing Four shows the recursive procedure.
Using this traversal algorithm yields an extended benefit; the traversal itself can be used to form the output representation. Daylight uses this technique for generating the canonical SMILES output. For example, the representation of benzene in the SMILES language is "c1ccccc1". The lowercase "c" represents a carbon in the aromatic state and the number 1 represents a closure, which indicates that the two atoms are connected. To see an implementation of the SMILES language, or if you are interested in chemoinformatics, go http://frowns.sourceforge.net/ or http://www.daylight.com/.
Conclusion
Given a large enough graph, the product of a vertex's neighboring corresponding primes can overflow a 32-bit integer. One of Python's features is that it automatically converts overflowing integers to arbitrarily long integers. Implementations in statically typed languages will have to take this potential into account. Isospectral graphs will not form a canonical ordering. I cannot say how frequent these graphs are but I have not encountered them during the analysis of several millions graphs averaging about 500 vertices each.
The examples included in the code (available electronically) have a couple of utilities to read/write simple types of graphs, including a bare-bones GraphML (http://graphml.graphdrawing.org/) parser and formatter that you can adapt as needed. The current implementation of the algorithm has only been tested with undirected graphs but should be easily modified to support directed graphs.
DDJ
Listing One
for node in graph: # for every node in g traverse to their # adjacentNeighbors and edges for adjacentNode, traversalEdge in node.neighbors(): print adjacentNode, traversalEdge
Listing Two
"""One of the best ways to store graphs in python is with a dictionary object. g = {} Nodes in a graph are the keys in a dictionary and the edges are the values. To traverse through the nodes in a graph for n in g: # do something with the node to traverse through key and edge values for n,e in g.items() to traverse through edge values for n in g.values() """ from types import DictType class Graph(DictType): def nodes(self): return self.keys() def edges(self): _edges = {} for neighbor in self.values(): for edge in neighbor.values(): _edges[edge] = 1 return _edges.keys() def iternodes(self): """iterate through the nodes""" return self.iterkeys() def neighbors(self, node): """(node) -> return the (node, edge) pairs for a given node""" return self[node].items() def iterneighbors(self): """iterate through the neighbors of a node""" return self.iteritems() def addNode(self, node): self[node] = self.get(node, {}) node.setParent(self) def addEdge(self, n1, n2, edge): # assumes n1 and n2 are already in a node try: assert not self[n1].has_key(n2) assert not self[n2].has_key(n1) self[n1][n2] = edge self[n2][n1] = edge except KeyError: raise GraphError("One of the edge nodes is not in the graph") from weakref import proxy class Base: def __init__(self, label): self.label = label self.equiv_class = -1 self.symorder = -1 self.symclass = -1 self.parent = None def setParent(self, parent): self.parent = proxy(parent) def __hash__(self): return id(self) def __cmp__(self, other): return cmp(self.label, other.label) def __str__(self): return self.label def __repr__(self): return "%s(%s)"%(self.__class__.__name__, `self.label`) class Node(Base): def neighbors(self): return self.parent.neighbors(self) class Edge(Base): pass
Listing Three
"""computes equivalence classes for atoms GenerateEquivClasses(graph) """ def GenerateEquivClasses(graph): """(graph) -> given a graph, generate the equivalence classes for the nodes and edges of the graph by assigning a unique number to uniquely labeled nodes and edges. The nodes and edges must be sortable""" # sort and rank the Nodes and Edges for labeledObject in [graph.nodes(), graph.edges()]: # sort the objects based on their label attribute labeledObject.sort(lambda x,y: cmp(x.label, y.label)) last = None rank = -1 for ob in labeledObject: if ob.label != last: # increment the rank rank = rank + 1 ob.equiv_class = rank last = ob.label
Listing Four
"""Recursively traverse a graph building up a canonical representation. Each vertex of a Molecule or Graph must have a attribute 'symorder' which is a unique number. This number guarantees only one traversal for the graph. Additionally each edge must have an attribute equiv_class which is a unique value for each different type of bond. This guarantees proper canonicalization of edges as well as vertices. usage generateSymmetryOrders(graph) """ def _traverse(node, prevNode, visitedNodes, visitedEdges, nodes, edges): visitedNodes[node] = 1 nodes.append(node) edgeIndex = 0 edgesToTraverse = [] for onode, edge in node.neighbors(): if prevNode is not None and onode is prevNode: # we are traversing back the way we came! so don't... pass elif visitedNodes.has_key(onode): # a closure! # traverse.addClosure(node, onode, edge) edges.append(edge) visitedEdges[edge] = 1 else: edgesToTraverse.append((onode.symorder, edge.equiv_class, edgeIndex, onode, edge)) edgeIndex += 1 if not edgesToTraverse: # dead end, return return edgesToTraverse.sort() for symorder, edgeEclass, index, onode, oedge in edgesToTraverse: if visitedNodes.has_key(onode): # somehow, we've seen this node so skip it continue edges.append(oedge) visitedEdges[oedge] = 1 _traverse(onode, node, visitedNodes, visitedEdges, nodes, edges) def _get_lowest_symorder(nodes): best = nodes[0] for node in nodes[1:]: if node.symorder < best.symorder: best = node return best def generateSymmetryOrders(graph): """(graph) -> traverse the symmetry classes in order of the smallest edge equiv_classes to generate symmetry orders""" node = _get_lowest_symorder(graph.nodes()) visitedNodes = {} visitedEdges = {} nodesUsed = [] edgesUsed = [] _traverse(node, None, visitedNodes, visitedEdges, nodesUsed, edgesUsed) i = 0 # set the symmetry orders of the nodes and edges for n in nodesUsed: n.symorder = i i += 1 i = 0 for edge in edgesUsed: edge.symorder = i i += 1