Finding the nearest position within a set of lines is not a difficult task--it's done every time you click the mouse in a drawing program or GIS (geographical information system). But when large amounts of data that have to be batch processed are involved, the situation changes. I recently was faced with the problem of about 1,000,000 lines that were exported from a GIS. The lines were more or less touching each other, thus representing a pipe network, but there were no logical links between them. Furthermore, some lines ended not near the end of another line, but instead their end touched another line (call it "parent") far away from its endpoints. It was necessary to split the parent line and create a new link at this position.
I had already a module implementing a "find nearest point in all lines" in a straightforward linear manner, but it turned out that the task of finding the nearest neighbor for all lines took several days on a high-end desktop machine. In contrast to point-to-point search where algorithms with logarithmic search times exist (see A Template for the Nearest Neighbor Problem by Larry Andrews), I don't know of such implementations for point-to-line search. This may have its cause in the fact that--in contrast to point-to-point search--on each single line there are infinite possible matching points. So what to do?
In this article, I present an optimized algorithm that reduces the search time to a fraction, when searching the nearest position within a set of lines. The algorithm is based on dividing the whole possible spatial area in quadrants and then building indices for each quadrant. All searches are based on using a threshold for the maximum search distance to the line found (the "maximum search width", which is 10 meters in my examples). If there is no line within this distance, nothing will be found. The solution consists of the classes described in Table 1. For comparison, the source code of the different versions of linear search discussed here is also provided. The sample code was tested under Windows, Linux and Solaris using VC++ 6.0 and gcc 3.3.3. The measurements were made using Windows XP
Sample Data
The sample main() function creates 1,000,000 lines of 100 meter length for sample data, as in Figure 1. Note, that some line ends are near to the end of another line, whereas other line ends are near the center of another line. Compared to the sample data, in typical GIS-applications lines vary much in their length. Also in typical data, lines (i.e. pipes, cables, streets etc.) mostly are represented by polygons with many points. But this is only a specialization of the problem where all polygons will have to be divided into their straight lines. By my experience though, straight lines going thorough the whole area are very rare.