Loading ...
Global Do...
News & Politics
7
0
Try Now
Log In
Pricing
Dijkstra’s Algorithm On–Line: An Empirical Case Study from Public Railroad Transport1 Frank Schulz and Dorothea Wagner and Karsten Weihe University of Konstanz, Department of Computer & Information Science Traffic information systems are among the most prominent real–world applications of Dijkstra’s algorithm for shortest paths. We consider the scenario of a central information server in the realm of public railroad transport on wide–area networks. Such a system has to process a large number of on–line queries for optimal travel connections in real time. In practice, this problem is usually solved by heuristic variations of Dijkstra’s algorithm, which do not guarantee an optimal result. We report results from a pilot study, in which we focused on the travel time as the only optimization criterion. In this study, various speed–up techniques for Dijkstra’s algorithm were analysed empirically. This analysis was based on the timetable data of all German trains and on a “snapshot” of half a million customer queries.2 1. INTRODUCTION 1.1 Problem. From a theoretical viewpoint, the problem of finding a shortest path from one node to another one in a graph with edge lengths is satisfactorily solved. In fact, the Fibonacci–heap implementation of Dijkstra’s algorithm requires O(m+n log n) time, where n is the number of nodes and m the number of edges [6]. However, various practical application scenarios impose restrictions that make this sort of algorithm impractical. For instance, many scenarios impose a strict limitation on space consumption.3 1 c©ACM, 2000. This is the author’s version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Journal of Experimental Algorithmics, Volume 5(12), 2000, http://www.jea.acm.org 2With special courtesy of the TLC Transport-, Informatik- und Logistik–Consulting GmbH/EVA– Fahrplanzentrum, a subsidiary of the Deutsche Bahn AG. 3To give a concrete example: if a traffic information system is to be distributed on CD–Rom or to be run on an embedded system, a naive implementation of Dijkstra’s algorithm would typically Address: University of Konstanz, Box D188, 78457 Konstanz, Germany {schulz,wagner}@fmi.uni–konstanz.de, weihek@acm.org http://www.fmi.uni-konstanz.de/˜{schulz,wagner,weihe} 2 · F. Schulz and D. Wagner and K. Weihe In this paper, we consider a different scenario: space consumption is not an issue, but the system has to answer a potentially infinite number of customer queries for optimal travel connections on–line. The real–time restrictions are soft, which basi- cally means that the average response time is more important than the maximum response time. The concrete scenario we have in mind is a central server for public railroad transport, which has to process a large number of queries (e.g. a server that is directly accessible by customers through terminals in the train stations or through a web interface). Assumption 1.1. Our computational study is based on two further simplifying assumptions: (1) The time tables are identical every day. (2) The length of a travel connection is exclusively defined as the travel time (to be specified in greater detail in Definition 1.2 below). To achieve Assumption 1.1(1), we simply unite the time tables of all days. Due to this assumption, a customer query need not specify a date. Definition 1.2. A customer query (or simply query for short) is a triple con- sisting of two stations, A and B, and an earliest departure time t. Such a query asks for a connection from A to B that does not start before t and arrives B as early as possible. Algorithmic problems of this kind are usually approached heuristically in practice, which means that an optimal result is not guaranteed. Definition 1.3. An algorithm for a shortest–path problem is called distance– preserving if it produces an optimal connection for every input instance. Distance–preserving algorithms are not in wide use in traffic information sys- tems, because the average response time of such an algorithm was perceived to be unacceptable. In a new long–term project, we plan to investigate the question to what extent distance–preserving variants of Dijkstra’s algorithm have become com- petitive on contemporary computer technology. Here we give an experience report from a pilot study, in which we focused on the most fundamental kind of query, namely the one specified in Definition 1.2 (subject to Assumption 1.1(1)). This scenario is an example of a general problem in the design of practical algo- rithms, which we discussed in [22]: computational studies based on artificial (e.g. random) data do not make much sense, because the characteristics of the real–world data are crucial for the success or failure of an algorithmic approach in a concrete use scenario. Hence, experiments on real–world data are the method of choice. 1.2 Related work. Various textbooks address speed–up techniques for Dijkstra’s algorithm but have no concrete applications in mind, notably [2] and [14]. In [21], Chapter 4, a brief, introductory survey of selected techniques is given (with a strong bias towards the use scenario discussed here). Most work from the scientific side addresses the single–source variant, where a spanning tree of shortest paths from a designated exceed the available space. Dijkstra’s Algorithm On–Line · 3 root to all other nodes is to be found. Moreover, the main aspect addressed in work like [5] is the choice of the data structure for the priority queue. In Section 3 below (paragraph on the “search horizon”), we will see that the scenario considered in this paper requires algorithmic approaches that are fundamentally different, and Section 4 will show that the choice of the priority queue is indeed a marginal aspect in view of the goal of our project. On the other hand, most application–oriented work in this field is commercial, not scientific, and there is only a small number of publications. In fact, we are not aware of any publication especially about algorithms for wide–area railroad traffic information systems. Some scientific work has been done on local public transport. For example, [17] gives some insights into the state of the art. However, local public transport is quite different from wide–area public transport, because the timetables are very regular, and the most powerful speed–up techniques are based on the strict periodicity of the trains, buses, ferries, etc. In contrast, our experience is that the timetables of the national European train companies are not regular enough to gain a significant profit from these techniques.4 On the other hand, private transport has been extensively investigated in view of wide–area networks. Roughly speaking, this means “routing planners” for cars on city and country maps [9; 1; 4; 20; 12; 18]. This problem is different to ours in that it is two–dimensional, whereas train timetables induce time as a third dimension: due to the lack of periodicity, the earliest departure time is significant in our scenario. In contrast, temporal aspects do not play any role in the work quoted above.5 So it is not surprising that the research on routing planners has focused on purely geometric techniques. Some of the application–oriented work quoted above is based on hierarchical approaches. This is also one of our strategies (see Section 3.2, strategy “selections of stations”). For example, in [19] this approach is viewed from a database perspective without any specific scenario in mind (though with a bias towards road maps). For completeness, we mention the work on variants of Dijkstra’s algorithm that are intended to efficiently cope with large data in secondary memory. Chapter 9 of [3] gives an introduction to theoretical and practical aspects. As mentioned above (“problem” section in the Introduction), the problems caused by the slow access to secondary memory are beyond the scope of our paper. Since our main contribution is a computational study, the work on the method- ology of this sort of study (e.g. [7; 8; 11; 10; 13; 15; 16]) is also related. We will come back to methodological issues in Section 2. 1.3 Contribution of the paper. We implemented and evaluated various distance–preserving speed–up techniques for Dijkstra’s algorithm. The study is based on all train data (winter period 1996/97) of the Deutsche Bahn AG, the national railroad and train company of Germany. 4This experience is supported by personal communications with people from the industry. 5In principle, temporal aspects would also be relevant for the private transport, for example, the distinction between “rush hours” and other times of the day. However, this sort of temporal aspect is quite different from ours and, to our knowledge, not scientifically invesitigated either. 4 · F. Schulz and D. Wagner and K. Weihe The processed queries are a “snapshot” of the central Hafas6 server of the Deutsche Bahn AG, in which all queries of customers of all ticket offices in Germany were recorded over several hours. In particular, an input instance consists of a fixed and a variable part: —Fixed : all input instances include the same set of time tables. —Variable: the query, that is, departure station, arrival station, and earliest de- parture time, varies from input instance to input instance. The result of this snapshot comprises more than half a million queries, which might suffice for a representative analysis of this particular service (assuming that the typical query profile of customers does not vary dramatically from day to day). Due to the above–mentioned insight that the periodicity of the timetables is not a promising base for algorithmic approaches, the question is particularly interesting whether geometric techniques like those in routing planners are successful, although the scenario has geometric and temporal characteristics. We will see that this question can indeed be answered in the affirmative. 1.4 Overview. In Section 2 we will discuss the general rationale of our experimental set–up. Then, in Section 3, we will briefly summarize the algorithmic techniques that we evaluated. Finally, in Section 4, we will present and discuss the results of the computational study. 2. EMPIRICAL METHODOLOGY As mentioned in the Introduction, the question to be answered by our computa- tional study is very specific: Are distance–preserving variants of Dijkstra’s algorithm (in the sense of Definition 1.3) competitive on contemporary computer technology in the use scenario addressed in this paper? The notion “competitive” is here used in a very informal sense, which is probably the sense of interest from an applied viewpoint. Roughly speaking, we regard an algorithm as competitive if it would probably not become a bottleneck operation in a typical central server. Such a fuzzy notion does not allow far–reaching conclusions. We believe that our conclusions (which we will give in detail in Section 4 and in summary in Section 5) are modest enough to be safely based on our usage of competitiveness. The question addressed in our paper is quite different from the questions typically addressed in methodological work such as [7; 8; 11; 10; 13; 15; 16]. In fact, to our knowledge, the general focus in the literature is on systematic performance comparisons of different algorithms without or with only loose relation to concrete applications. The overwhelming majority of computational studies in theoretical computer science primarily addresses this kind of question. The method of choice here is the evaluation of the individual algorithms on artificial (mainly random) 6Hafas is a trademark of the Hacon Ingenieurgesellschaft mbH, Hannover, Germany. Dijkstra’s Algorithm On–Line · 5 input data, which is a good base for the systematic variation of parameters, and a statistical analysis of the computational results. In contrast, the question we address in this paper is tightly bound to a concrete application scenario. The profile of the input data is very specific, but a sufficient specification of this profile in terms of parameters such as the number of nodes/edges or the average/maximum node degree does not seem to be in our reach.7 Thus, there is no base for the creation of artificial, yet realistic data. On the other hand, a snapshot of half a million real customer queries, which is the base of our study, is certainly sufficient for a profound answer to the above question. Hence, we will restrict our attention to real–world data. Note that the question addressed in this paper is binary (yes/no). As mentioned in the Introduction, the real–time constraints are rather soft. This means that the overall average response time, taken over the total set of all queries, is the most important piece of information to decide upon yes or no. However, beyond this very specific question we also regard an estimation of the scope of the result as important. For instance, such a large body of response times, if well analysed, should allow a reliable prediction how the average response time changes if the average geographic distance of departure and arrival station changes. To give a concrete example: the percentage of long–distance queries is significantly higher for the web service than for the ticket office service, from which our snapshot was taken. To estimate the effect of variations like this, we will also consider the interdepen- dence between response times and two parameters, the Euclidean distance between departure and arrival station and the minimal total travel time computed in re- sponse to the respective query. In Section 4 we will see that this dependency is surprisingly strong, which means that we may assume a high level of confidence for predictions. As mentioned in the Introduction, we implemented and evaluated various speed– up techniques for Dijkstra’s algorithm. One notable exception is the bidirected– search technique. This technique does not really apply in our scenario, because the target node of the search depends on the arrival time and is thus not known in advance (see Section 3, paragraph on “bidirected search,” for details). However, one can imagine a variation of the bidirected–search technique, which does not rely on the exact arrival time but on an upper bound. To answer the question whether this class of modified bidirected–search techniques is potentially competitive at all, we implemented and evaluated an idealized variant, which indeed does rely on the exact arrival time and thus yields a lower bound on the potential performance of this class of speed–up techniques. 7Nonetheless, a comparison of the algorithms from Section 3 based on random data and a conven- tional statistical analysis as advocated by some of the above–mentioned papers may be of interest in its own right. However, such an analysis does not contribute to our specific question and is thus beyond the scope of this paper. 6 · F. Schulz and D. Wagner and K. Weihe 3. ALGORITHMS 3.1 Fundamental Algorithmic Approach 3.1.1 Train graph.. From Assumption 1.1 recall our restriction to time tables that do not vary from day to day. Due to this restriction, we only need a train network of 24 hours. The arrival or departure of a train at a station will be called an event. The train graph contains one node for every event. Two events v and w are connected by a directed edge (v, w) if v represents the departure of a train at some station and w represents the very next arrival of this train at some other station. On the other hand, two successive events at the same station are connected by an edge (in positive time direction), which means that every station is represented in the graph by a cycle through all of its events (the cycle is closed by a turn–around edge at midnight). In each case, the length of an edge is the time difference (in minutes) of the events represented by its endnodes. Obviously, a query then amounts to finding a shortest path from the earliest event at the departure station not before the earliest departure time to an arbitrary arrival event at the arrival station. The data contains 933, 066 events on 6, 961 stations. Consequently, there are 933, 066 · 3/2 = 1, 399, 599 edges in the graph. 3.1.2 Priority queue.. Dijkstra’s algorithm relies on a priority queue, which man- ages the events on the current “frontier line” of the search. As mentioned above, the best general worst–case bound, O(m+ n log n), is obtained from Fibonacci heaps, where n is again the number of events and m the number of edges. We do not use a Fibonacci heap but a normal heap (also often called a 2–heap), which yields an O((n +m) log n) bound [6]. Since m ∈ O(n) in train graphs, both bounds reduce to O(n log n). As an alternative to heaps, we also implemented the Dial variant as described in [2]. Basically, this means that the priority queue is realized by an array of buckets with a cyclically moving array index. The events of the frontier line are distributed among the buckets, and it is guaranteed that the very next non–empty bucket after the current array index always contains the candidates to be processed next. The asymptotic worst–case bound of the Dial implementation of Dijkstra’s algorithm is O(m+ nL), where L is the largest length of an edge. 3.1.3 Search horizon.. Of course, we deviate from the “textbook version” of Di- jkstra’s algorithm in that we do not compute the distance of every event from the source event but terminate the algorithm immediately once the first (and thus optimal) event at the arrival station is processed. The most fundamental distance– preserving speed–up technique for our scenario is then a reduction of the search to a (hopefully) small part of the graph, which contains all relevant events. Figs. 1– 4 demonstrate that such a reduction is crucial. In fact, they reveal that for the majority of all queries only small fractions of the total area and time horizon are relevant. To our knowledge, some commercial implementations remove events and edges from the graph (more or less heuristically, i.e. losing optimality of distances) before the search itself takes place. In contrast, we aim at an evaluation of distance– Dijkstra’s Algorithm On–Line · 7 0 5000 10000 15000 20000 25000 30000 35000 0 100 200 300 400 500 600 700 800 Fig. 1. The frequency distribution histogram of the queries from the snapshot according to the Euclidean distance between the departure station and the arrival station (granularity: 15 kilometers). 0 5000 10000 15000 20000 25000 30000 0 100 200 300 400 500 600 700 800 Fig. 2. Like Fig. 1 except that the abscissa now denotes the minimal travel time in minutes (granularity: 20 minutes). preserving strategies, so our approach is quite different. First of all, we apply the amortization technique discussed in [21] to obtain a sublinear expected run time per query.8 The only obstacle to sublinearity is the initialization of all events with infinite distance labels in the beginning of the textbook algorithm; in fact, Figs. 1–4 8The term “sublinear expected run time” is here used in a rather informal, experience–oriented way and not to be understood in a formal probabilistic sense. 8 · F. Schulz and D. Wagner and K. Weihe 0 5000 10000 15000 20000 25000 0 50 100 150 200 Fig. 3. Like Fig. 1 except that the abscissa now denotes the number of edges in the shortest path (granularity: 10 edges). 0 50 100 150 200 250 300 350 400 450 500 0 200 400 600 800 1000 1200 Fig. 4. For each minimal travel time (granularity: 20 minutes) the total CPU times of all queries yielding this value of the travel time are summed up to reveal which range of travel times con- tributes most of the total CPU time required for all queries. strongly suggest that on average the main loop of the algorithm only processes a very small fraction of the graph until the arrival station is seen and the algorithm terminates. As described in [21], every event is given an additional time stamp, which stores the number of the query in which it was reached in the main loop. Whenever an event is reached, its time stamp is updated accordingly. This update properly increases the time stamp if and only if this is the first time that the node is seen Dijkstra’s Algorithm On–Line · 9 in the current search. This is exactly the situation in which the current distance label should be infinity. Hence, if the time stamp is increased, the distance label is regarded as infinite, otherwise the value of the distance label is taken as is. Consequently, there is no need for an expensive initialization phase, and no event outside the “search horizon” of the main loop is hit at all. It might be worth mentioning an equivalent strategy to avoid the initialization step: An additional data structure, e.g. a linked list, is used to store the nodes with non-infinite distance labels. Then, all non-infinite distance labels can be reset to infinity after each query. 3.2 Algorithmic Variations The good performance of the following additional techniques relies on the general design decisions introduced in Section 3.1. All of these techniques are independent of each other in the sense that an arbitrary selection of them may be applied simultaneously. 3.2.1 Angle restriction.. This technique additionally relies on the coordinates associated with the individual stations. In a preprocessing step, we apply Dijkstra’s algorithm to each event to compute shortest paths from this event to all other stations.9 The results are not stored (this would require too much space) but only used to compute two values α(v, w) and β(v, w) for each edge (v, w) that corresponds to a train hop. These 2 ·n/2 values (n again the number of events) are stored and then used in the on–line system. More specifically, these values are to be interpreted as angles in the plane. See Fig. 5. Let (v, w) be an edge from some event v to some other event w, and let sv and sw be the stations to which events v and w belong, respectively. Then the values α(v, w) and β(v, w) stored for edge (v, w) span a circle sector with center sv in the plane, where, say, α(v, w) is the right leg. The values α(v, w) and β(v, w) are determined such that this circle sector has the following meaning: if the shortest path from event v to some event u at some station s′ contains (v, w), then s′ has to be in this circle sector. Subject to this condition, the difference of the angles, (β(v, w)−α(v, w)) mod 2π, is minimum. Clearly, the brute–force application of Dijkstra’s algorithm in the preprocessing allows one to compute this specific, narrowest possible circle sector for each edge. Consequently, edge (v, w) may be ignored by the search if the arrival station is not in the circle sector of this edge. The restriction of the search to edges whose circle sectors contain the arrival station is the strategy that we will call “angle restriction” in the following. Proposition 3.1. Due to the specific construction of the values α(·, ·) and β(·, ·), the strategy “angle restriction” is distance–preserving in the sense of Definition 1.3. 3.2.2 Selection of stations.. The basic idea behind this technique is similarly implemented in various routing planners for the private transport [9; 1; 4; 12]. 9This preprocessing takes several hours, which is absolutely acceptable in the practical setting from which this research originated, namely generating time tables once for every winter/summer period. Hence, there is no need to optimize the preprocessing. 10 · F. Schulz and D. Wagner and K. Weihe ( , ) α ( , ) β s’ u v w at v w at s v 1 at s w 2 Fig. 5. The circle sector spanned by α(v, w) and β(v, w) contains all stations s′ such that there is an event u at s′ whose shortest path from v starts with (v, w). 0 20000 40000 60000 80000 100000 120000 140000 160000 0 5000 10000 15000 20000 25000 30000 35000 40000 45000 50000 Fig. 6. The relation of the number of edges hit without (abscissa) and with (ordinate) strategy angle selection (granularity: 500 edges). In general, a certain small set of nodes (i.e. events) is selected, which means events in our setting. However, we will only select sets of nodes such that either every node from a station s is selected, or none of the nodes from station s. More specifically, in the data available to us, every station is assigned an “importance number,” which is intended to rank its degree of “centrality” in the railroad network. For our computational study, we select the 225 stations in the highest categories (see Fig. 10). These stations induce 95, 423 events in total, which constitute our selection of events. Let G = (V,E) denote the (directed) train graph. In particular, V is the set of all events. Moreover, let Ṽ be the set of selected nodes. The strategy is based on what we will call replaceable paths in the following. Let v, w ∈ V be two events, which may or may not belong to Ṽ . A path from v to w in G will be called replaceable, if Dijkstra’s Algorithm On–Line · 11 no node of the path (except for the endnodes, v and w, themselves) belongs to Ṽ . Based on this notion, we construct an auxiliary directed graph G′ = (V,E′) on V . For v, w ∈ V , there is a directed edge (v, w) ∈ E ′ if, and only if, there is a replaceable path from v to w in G. The length `(v, w) of an edge (v, w) ∈ E ′ in G′ is then defined as the minimum length of a replaceable path from v to w in G. The auxiliary graph G′ and these edge weights are also constructed once and for all in a preprocessing step (which only takes a few minutes). Each query is then answered by the computation of a shortest path in the auxiliary graph G′, and this path corresponds to a shortest path in the train graph G. We implement this general approach as follows. First of all, note that it is not necessary to reconstruct the path in G that corresponds to the shortest path computed in G′. In fact, what we really want to have from the computation is a sequence of trains and the stations where to change train. This data can be attached to the edges of the auxiliary graph in an even more compact (less redundant) and thus more efficiently evaluable fashion than to the edges of the train graph. As mentioned above, we select a set of stations, and the events of these stations are the selected events. Clearly, there is a trade–off. Roughly speaking, the smaller the number of selected stations is, the larger the resulting connected components are and, even worse, the larger the number of selected stations neighboring to a component. Since the number of edges depends on the latter number in a quadratic fashion, an improvement of performance due to a rigorous reduction of stations is soon outweighed by the tremendous increase in the number of edges. It has turned out that in our setting, a minor refinement of this strategy is sufficient to overcome this trade–off. For this, let u, v, and w be three selected events such that edges (u, v), (v, w), and (u,w) exist in the auxiliary graph. If `(u, v)+`(v, w) ≤ `(u,w), then edge (u,w) is dropped in the auxiliary graph. Again, the optimal distance is preserved. The number of edges grows only moderately after this modification, so quite a small set of selected stations becomes feasible. The selected stations induce 95, 423 events in total, which means that the number of events is approximately reduced by a factor of 10, and the number of stations is reduced by a factor of 31. This discrepancy between these two factors is not surprising, because central stations are typically met by more trains than other stations. Proposition 3.2. The strategy “selection of stations” is distance–preserving in the sense of Definition 1.3. 3.2.3 Goal–directed search.. This strategy is based on one real number b(s) for each station s, which must be a lower bound on the minimal travel time from s to the arrival station. Let (v, w) ∈ E be an edge, and let sv and sw be the stations of v and w, respectively. The values b(sv) and b(sw) must additionally satisfy b(sv)− b(sw) ≤ `(v, w). The length `(v, w) of (v, w) is then replaced by the modified length `′(v, w) := `(v, w)− b(sv) + b(sw) ≥ 0. It can be shown that the shortest paths with respect to the length function `′(·, ·) are identical with the shortest paths with respect to the original length function `(·, ·). Hence, we may alternately compute a shortest path with respect to `′(·, ·). The hope is that the “frontier” of the search horizon approaches the arrival station faster when the modified edge lengths are used. See [14] for a more systematic 12 · F. Schulz and D. Wagner and K. Weihe discussion. Roughly speaking, the tighter the lower bounds s(·) are, the larger the speed–up effect will be. The base of our lower bounds is the maximum speed of all trains as a transformation factor. The lower bound for an arbitrary station with respect to a given target station is the Euclidean distance of these two stations divided by this factor. These values are computable “on the fly” during the search. Proposition 3.3. The strategy “goal–directed search” is distance–preserving in the sense of Definition 1.3. 3.2.4 Bidirected search.. This strategy “merges” two applications of Dijkstra’s algorithm into one loop. More specifically, one search starts from the source event and goes forward as usual, whereas the other search starts from the target event and traverses every edge in backward direction. The algorithm is finished once an event was labeled permanently by both searches. A shortest path from the source event to the target event is then easily constructed from the results of the two searches up to this moment. We refer to [2; 14] for details. This strategy is not directly applicable in our scenario, because the target event is not known until the search is finished. Nonetheless, one can imagine variations of this strategy, in which the backward search indeed starts from the arrival station, however, not from the optimal arrival time but from an upper bound on this value. To evaluate the general potential of this strategy in view of our application sce- nario, we implemented an idealized variant. More specifically, we indeed use the exact optimal arrival time as a start point for the backward search. Clearly, this gives a lower bound on the performance of all of these variations of bidirected search.10 Since we only consider an idealized strategy here, it does not make sense to evaluate the CPU times achieved by this strategy. Hence, we will restrict the evaluation of this strategy to operation counts (see Section 4). 3.2.5 Combination of strategies.. In principle, all of these strategies can be freely combined. In that, the strategy “selection of stations” can be combined with the strategy “angle restrictions” in two ways, namely the angle restrictions can be computed for the auxiliary graph, or they can be computed for the train graph and simply taken over for the auxiliary graph. 4. ANALYSIS OF THE ALGORITHMIC PERFORMANCE The experiments were performed on a SUN Sparc Enterprise 5000, and the code was written in C++ using the GNU compiler. Tables 1 and 2 present a summarizing comparison of all combinations of strate- gies. Since every node in the train graph has a degree of three, the total number of algorithmic steps is asymptotically dominated by the number of operations inside the priority queue. In other words, these operations are representative operation counts in the sense of [2], Sect. 18. More specifically, for a heap the number of exchange operations is representative, whereas for the Dial variant the number of cyclic increments of the moving array index plus the number of the bucket- operations “insert” and “remove” is representative. The average total number per 10Except for pathological cases that are imagineable but might hardly occur in reality. Dijkstra’s Algorithm On–Line · 13 query of these operations are listed in the last part of Tables 1 and 2. As men- tioned in Section 2, we regard these average numbers as the most important piece of information in view of our central yes/no question. For both implementations of the priority queue, the CPU times imply the same strong ranking of combinations of strategies. Figs. 9–12 might give a visual impres- sion why this ranking is so unambiguous. Moreover, the discrepancy between the heap and the Dial implementation in terms of CPU times also decreases roughly from row to row. Our experience with several versions of the code is that the exact CPU times are strongly sensitive to the details of the implementation, but the general tendency was maintained and seems to be reliable. In particular, the main question raised in this paper (whether distance–preserving techniques in the sense of Definition 1.3 are competitive) can be safely answered in the affirmative at least for the restriction of the problem to the total travel time as the only optimization criterion. However, a detailed look at the results is more insightful (and necessary to con- clude predictions for scenarios with different profiles as discussed in Section 2). Fig. 6 shows that there is a very strong correlation between the number of edges hit with/without strategy “angle restriction.” On the other hand, Fig. 7 shows a detailed analysis of one particular, exemplary combination of strategies. A com- parison of the diagrams in Fig. 7 reveals an interesting effect, which is also found in the analogous diagrams of the other combinations: the CPU times of both the heap and the Dial implementation “look” linear in the number of nodes hit by the search.11 This correlation is so strong and the variance is so small that corresponding diagrams in Fig. 7 look almost identical. Figure 8 reveals the cause. In other words, in both cases the operations on the priority queue come close to constant time on average, even when the average is taken over each query separately. Finally, the bidirected–search technique required, on average, 5606 nodes, 9497 edges, an operation count of 7531 for the Dial implementation, and an operation count of 45826 for the heap implementation. As mentioned in the paragraph on bidirected search in Section 3, it does not make sense to take CPU times here, because an idealized algorithm was implemented. Since one can safely assume that the above results for this idealized algorithm are a lower bound on the real performance of this technique in our scenario, they give some evidence that bidirected search would be a reasonable choice (assuming that tight upper bounds on the exact arrival times can be found efficiently), but might not be the favorite choice. In Section 2, we raised the question whether the results allow reliable predictions for similar scenarios such as a web server, where a larger fraction of all queries asks for long–distance connections. The small variance encourages us to conclude (with caution) that our results are scalable in an approximately linear fashion to 11From the work on this study we gained the experience that the difference between Θ(n) and Θ(n log n) may be very small and hidden by the “noise” of the data. Hence, the claim that one can observe a linear behavior does not really contradict the claim that the asymptotic complexity is Θ(n logn). In other words, the above claim that the run time be linear is to be taken with a pinch of salt. 14 · F. Schulz and D. Wagner and K. Weihe Selection of stations Angle restriction CPU heap CPU Dial no no 0.310 0.103 no yes 0.036 0.018 yes no 0.027 0.012 yes yes (train) 0.007 0.005 yes yes (auxiliary) 0.005 0.003 Selection of stations Angle restriction Nodes Edges no no 17576 31820 no yes 4744 10026 yes no 2114 3684 yes yes (train) 1140 2737 yes yes (auxiliary) 993 2378 Selection of stations Angle restriction Ops. heap Ops. Dial no no 246255 23866 no yes 24526 3334 yes no 26304 3660 yes yes (train) 4973 1197 yes yes (auxiliary) 3191 932 Table 1. A summary of all computational results for the individual combinations of the techniques “selection of stations” and “angle restrictions.” The entries “train” and “auxiliary” in column #2 refer to the graph in which the angles were computed (see the last paragraph in Section 3). The columns #3–#4 give the average over all queries of the snapshot. More specifically, the first table gives the average raw CPU times, the second table the average number of nodes and edges hit by the search, and the third table the average operation counts. scenarios like that. As announced in the Introduction, the choice of the data structure for the priority queue seems to be a detail of minor relevance. The difference between the heap and the Dial implementation looks significant at first glance, however, this seems to be one of the aspects which are not stable but highly dependent on implementation details. 5. CONCLUSION AND OUTLOOK The outcome of this study suggests that geometric speed–up techniques are a good basis for the computation of provably optimal connections in railroad traffic infor- mation systems. The question raised in this paper is answered for the total travel time: the best combinations of strategies are by far faster than is currently needed in practice. The success of geometry–based strategies is a bit surprising, because the underlying data is not purely geometric in nature. Another surprising outcome of the study is that both the normal heap and the Dial data structure only require an (amortized) constant time per operation, whereas the worst–case bound is logarithmic for heaps and even linear for Dials. Note that no amortization over a set of queries must be applied to obtain a con- stant time per operation; the variance is small enough that the average run time per operation within a single query can essentially be regarded as bounded by a Dijkstra’s Algorithm On–Line · 15 Selection of stations Angle restriction CPU heap CPU Dial no no 0.204 0.072 no yes 0.022 0.013 yes no 0.019 0.010 yes yes (train) 0.006 0.004 yes yes (auxiliary) 0.005 0.003 Selection of stations Angle restriction Nodes Edges no no 11072 20046 no yes 3415 7120 yes no 1437 2465 yes yes (train) 856 2002 yes yes (auxiliary) 754 1765 Selection of stations Angle restriction Ops. heap Ops. Dial no no 151274 15274 no yes 16990 2430 yes no 18123 2595 yes yes (train) 3502 961 yes yes (auxiliary) 2282 751 Table 2. Like Table 1 except that now the strategy “goal–directed search” is additionally applied in all cases. constant. Due to the fact that the variance is negligible, a conventional statistical analysis would not make any sense. The choice of algorithms (and of minor details of their implementation) is justified a posteriori in view of our main question. Implementing and comparing further speed–up strategies is certainly an interesting topic in its own right. However, the goal of our research effort is different: to go far beyond the restricted scenario of Assumption 1.1 and to see which algorithms will stand this ultimate test. The minimal travel time is certainly an empirical research topic in its own right, not only because it is the most fundamental objective in practice. However, a practical algorithm must consider further criteria and restrictions. For example, the ticket costs and the number of train changes are also important objectives. Moreover, certain trains do not operate every day (contrasting Assumption 1.1(1)), and certain kinds of tickets are not valid for all trains, so it should be possible to exclude train connections in a query. A satisfactory compromise must be found between the speed of the algorithm and the quality of the result. Thus, the problem is not purely technical anymore but also involves “business rules,” which are usually very informal. In the future, an extensive requirements analysis will be necessary, which means that the work will be no longer purely “algorithmical” in nature. Such an analysis must be very detailed because otherwise there is no hope to match the real problem. Unfortunately, there is high evidence that the general problems addressed in [22] will become virulent here: a sufficiently simple formal model that captures all relevant details does not seem to be in our reach, and many details are “volatile” 16 · F. Schulz and D. Wagner and K. Weihe in the sense that they may change time and again in unforeseen ways. Future research will show whether these conflicting criteria can be simultaneously fulfilled satisfactorily. REFERENCES [1] R. Agrawal and H. Jagadish. Algorithms for searching massive graphs. IEEE Transact. Knowledge and Data Eng., 6:225–238, 1994. [2] R.K. Ahuja, T.L. Magnanti, and J.B. Orlin. Network Flows. Prentice–Hall, 1993. [3] S. Albers, A. Crauser, and K. Mehlhorn. Algorithmen für sehr große Datenmengen. MPI Saarbrücken, 1997. 12 [4] A. Car and A. Frank. Modelling a hierarchy of space applied to large road networks. In Proc. Int. Worksh. Adv. Research Geogr. Inform. Syst. (IGIS ’94), pages 15–24, 1994. [5] B.V. Cherkassky, A.V. Goldberg, and T. Radzik. Shortest paths algorithms: Theory and experimental evaluation. Mathematical Programming, 73:129–174, 1996. [6] T.H. Cormen, C.E. Leiserson, and R.L. Rivest. Introduction to Algorithms. MIT Press and McGraw–Hill, 1994. [7] J. Hooker. Needed: An empirical science of algorithms. Operations Research, 42:201–212, 1994. [8] J. Hooker. Testing heuristics: We have it all wrong. Journal of Heuristics, 1:33–42, 1995. [9] K. Ishikawa, M. Ogawa, S. Azume, and T. Ito. Map navigation software of the electro multivi- sion of the ’91 toyota soarer. In IEEE Int. Conf. Vehicle Navig. Inform. Syst. (VNIS ’91), pages 463–473, 1991. [10] H.F. Jackson, P.T. Boggs, S.G. Nash, and S. Powell. Guidelines for reporting results of com- putational experiments—report of the ad hoc committee. Mathematical Programming, 49:413–425, 1991. [11] D.S. Johnson. A theoretician’s guide to the experimental analysis of algorithms, 1996. http://www.research.att.com/~dsj/papers/exper.ps. [12] S. Jung and S. Pramanik. Hiti graph model of topographical road maps in navigation systems. In Proc. 12th IEEE Int. Conf. Data Eng., pages 76–84, 1996. [13] C.Y. Lee, J. Bard, M.L. Pinedo, and W. Wilhelm. Guidelines for reporting computational results in iie transactions. IEE Transactions, 25:121–123, 1993. [14] T. Lengauer. Combinatorial Algorithms for Integrated Circuit Layout. Wiley, 1990. [15] C. McGeoch. Analyzing algorithms by simulation: Variance reduction techniques and simu- lation speedups. ACM Computing Surveys, 24(2):195–212, 1992. [16] C. McGeoch. Toward an experimental method of algorithm simulation. INFORMS Journal on Computing, 8(1):1–15, 1996. [17] T. Preuss and J.-H. Syrbe. An integrated traffic information system. In Proc. 6th Int. Conf. Appl. Computer Networking in Architecture, Construction, Design, Civil Eng., and Ur- ban Planning (europIA ’97), 1997. [18] J. Shapiro, J. Waxman, and D. Nir. Level graphs and approximate shortest path algorithms. Network, 22:691–717, 1992. [19] S. Shekhar, A. Fetterer, and G. Goyal. Materialization trade–offs in hierarchical shortest path algorithms. In Proc. Int. Symp. Large Spatial Databases, Springer Lecture Notes in Computer Science 1262, pages 94–111, 1997. [20] S. Shekhar, A. Kohli, and M. Coyle. Path computation algorithms for advanced traveler information system (atis). In Proc. 9th IEEE Int. Conf. Data Eng., pages 31–39, 1993. [21] K. Weihe. Reuse of algorithms — still a challenge to object–oriented programming. In Proc. 12th ACM Symp. Object–Oriented Programming, Systems, Languages, and Applications (OOPSLA ’97), pages 34–48, 1997. 12Available from http://www.mpi-sb.mpg.de/units/ag1/extcomp.html. The lecture notes are in German, however, the chapter on shortest paths (Chapter 9) is in English. Dijkstra’s Algorithm On–Line · 17 0 20000 40000 60000 80000 100000 120000 140000 160000 180000 0 5000 10000 15000 20000 25000 30000 0 5000 10000 15000 20000 25000 30000 35000 40000 45000 0 200 400 600 800 1000 1200 0 50000 100000 150000 200000 250000 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0 200 400 600 800 1000 1200 0 50000 100000 150000 200000 250000 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0 200 400 600 800 1000 1200 Fig. 7. An exemplary sequence of diagrams for one particular combination of strategies: “angle restriction” is applied, but the other strategies are not. First column: the frequency distribution histogram of all queries in the snapshot according to (a) the number of nodes met by the search (granularity: 500 nodes), (b) the CPU time for the heap implementation and (c) for the Dial implementation (granularity: 10 milliseconds). Second column: the average of (a) the number of nodes met and (b/c) the CPU times for the heap/Dial variant taken over all queries with roughly the same resulting total travel times (granularity: 20 minutes). The strong resemblance of the diagrams in each row nicely demonstrates the linear behavior of both priority–queue implemen- tations. [22] K. Weihe, U. Brandes, A. Liebers, M. Müller-Hannemann, D. Wagner, and T. Willhalm. Empirical design of geometric algorithms. In Proc. 15th ACM Symp. Comp. Geometry (SCG ’99), pages 86–94, 1999. 18 · F. Schulz and D. Wagner and K. Weihe 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0 5000 10000 15000 20000 25000 30000 35000 40000 45000 50000 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0 5000 10000 15000 20000 25000 30000 35000 40000 45000 50000 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0 5000 10000 15000 20000 25000 30000 35000 40000 45000 50000 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0 5000 10000 15000 20000 25000 30000 35000 40000 45000 50000 Fig. 8. The relation between the number of nodes hit by the search (abscissa) and the cpu time in seconds for the heap (left) and the Dial (right) implementation. The first row shows one point for every query, the second row condenses this information to histograms (granularity: 500 nodes). Dijkstra’s Algorithm On–Line · 19 Fig. 9. The left picture shows the edges hit by Dijkstra’s algorithm from Berlin Main East Station until the arrival station Frankfurt/Main Main Station is reached. In the right picture, the strategy “angle selection” was applied to the same query. 20 · F. Schulz and D. Wagner and K. Weihe Fig. 10. The left picture shows the 225 stations selected for the study on strategy “selection of stations,” and the right picture shows the edges hit by this strategy for the same query as in Figure 9. In the right picture, the original train graph is shown in the background, whereas the highlighted edges are the edges of the auxiliary graph hit by the search. Dijkstra’s Algorithm On–Line · 21 Fig. 11. Both pictures refer to the same query as in Fig. 9. The strategies “selection of stations” and “angle restriction” are applied simultaneously with angles computed from the train graph (left), and with angles computed from the auxiliary graph itself (right). Like in the previous picture, the original train graph is shown in the background, and the highlighted edges are the edges of the auxiliary graph hit by the search. 22 · F. Schulz and D. Wagner and K. Weihe Fig. 12. Both pictures refer to the same query as in Fig. 9. They show the results of the strategy “goal–directed search” and “bidirected search,” respectively.