Finding the Shortest Path
A path through a graph is a traversal of consecutive nodes along a sequence of edges. Clearly, the node at the end of one edge in the sequence must also be the node at the beginning of the next edge in the sequence. The length of the path is the number of edges that are traversed along the path. A path does not have a cycle in it.
In this particular problem, the nodes are actors and the edges are the "in a movie with," "directed by," "costarred," and other such relationships.
Here, I'm looking for a path from "Kevin Bacon" to some other actor that has a length less than six. Actually, what I would really like is the shortest path (most direct relationship) within the set of all paths between those two actors.
The advantage of SQL as a database language is that it is a declarative, set-oriented language. I tell the SQL engine what I want to get back and it figures out how to do it. An optimizer looks at statistics, indexes, table sizes, and dozens of other things to get an execution plan. The optimizer is usually smarter than human beings.
But suddenly, this is not the approach we want with a graph. When I specify a rule for a path, I have to know the length (n) of the path I am looking for before I can write the query. Then when I run this query, I get all the paths in the set. Looking for this set can be a combinatorial explosion.
What I really wanted was the shortest path between my two actors. This means I have to start with paths of length=1, length=2, and so forth, until I either find a path or have searched all the edges in the graph. Example 2 is a template for a simple, two-edge path starting at Kevin Bacon.
SELECT M1.end_actor_id, M1.relationship_name M1.end_actor_id, M2.relationship_name, M2.end_actor_name FROM MovieCasts AS M1, MovieCasts AS M2 WHERE M1.begin_actor_name = 'Kevin Bacon' AND M1.end_actor_id = M2.begin_actor_id AND M2.end_actor_id = 'Joe Celko';
To extend this pattern, we alter the table by adding a column for the next relationship and one for the next actor. The important part is to also add a predicate that prevents cycles. That is, the next step in the path cannot already be in the unaltered table.
Notice in Example 3 that you have to redo the query for every path length for every pair of actors. If there is no relation between the two actors, you waste a lot of time and storage space, assuming that the problem doesn't overflow the limits of your database engine. Luckily, there's a tool that handles this kind of problem.
AND M1.end_actor_id = M2.begin_actor_id; AND M2.end_actor_id = M3.begin_actor_id ... AND Mn.end_actor_id = 'Joe Celko' AND Mn.begin_actor_id NOT IN ('Kevin Bacon', M2.end_actor_id, M3.end_actor_id, ..);