Fix the O(n^3) performance issue in `is_connected` utility function #397

Wavetrace · 2024-11-12T13:49:58Z

The is_connected() function is implemented in terms of is_rechable(), effectively checking if each vertex is reachable from each other vertex in a nested loop. This quickly becomes terribly slow as the number of vertices and edges grow, becoming roughly unusable with a square connected graph of size 20x20.

This request fixes the issue by

using a single DFS walk for undirected graphs
employing the Tarjan's SCC algorithm for directed graphs

This reduces the algorithm complexity from cubic to linear and makes it possible to use it on large graphs. See also the commit message in 80ae041

This change is fully backward compatible with the previous implementation.

It also adds a new is_connected.cpp test. Feel free to run this test on the previous implementation. This test passes without the new algorithm, but picks the size of 20x20 square to be considerably slow. This test also accepts an integer as argv[1], the size of the square, so it is possible to observe the slow-down growing as the size increases (the default is 20).

Please see an excerpt of my measurement below (I can upload this program as well). The previous implementation quickly becomes impracticably slow.

I note, that is_connected is not documented and is not widely used. Still I consider it to be important to fully adhere to the interface specified in comment and preserve the compatibility, as it could have been used in non-public projects. In short one important decision here is to avoid any breaking changes.

I foresee a few concerns with this request, which I'll be happy to discuss and address if possible:

the graph_utility.hpp header file seems to be designed as low-dependency. Including < vector > and <strong_components.hpp> there seems to violate that choice. But I think the practical aspect of this is reasonably minimal; while fixing the issue fully preserving the backward compatibility is important in practice.
the colormap is not used for directed graphs. strong_components do not accept the colormap and pass it to DFS. I don't see this as a big concern. This can be addressed in 2 ways: updating SCC interface to pass the colormap to DFS, or (probably better) provide the version of is_connected with a single argument and no colormap.
the version for directed graphs seems to now require VertexIndexGraph concept. (I think) This can be fixed by using map instead of vector for the component map (comp_map) in case the graph is not VertexIndexGraph. I'm not sure, how important is this, but this seems to be a deviation from the full backward compatibility.

I consider a bit of follow-up work on this, which can also be done in a separate pull request:

provide single-argument is_connected(const Graph& g). This will require including the colormap header into the graph_utility
document is_connected (and other utility functions like is_reachable)
add examples of the above

Measurement:

=== Testing size 1
= Testing undirected, connected
time: undirected, connected - prev implementation - elapsed     3.299e-06s
time: undirected, connected - new  implementation - elapsed     1.179e-06s
= Testing undirected, not connected
time: undirected, not connected - prev implementation - elapsed         8.62e-07s
time: undirected, not connected - new  implementation - elapsed         5.39e-07s
= Testing directed, connected
time: directed, connected - prev implementation - elapsed       8.22e-07s
time: directed, connected - new  implementation - elapsed       2.636e-06s
= Testing directed, not connected
time: directed, not connected - prev implementation - elapsed   1.64e-06s
time: directed, not connected - new  implementation - elapsed   1.555e-06s

...
=== Testing size 10
= Testing undirected, connected
time: undirected, connected - prev implementation - elapsed     0.0409115s
time: undirected, connected - new  implementation - elapsed     5.703e-06s
= Testing undirected, not connected
time: undirected, not connected - prev implementation - elapsed         0.000392494s
time: undirected, not connected - new  implementation - elapsed         4.8e-06s
= Testing directed, connected
time: directed, connected - prev implementation - elapsed       0.0240339s
time: directed, connected - new  implementation - elapsed       7.925e-06s
= Testing directed, not connected
time: directed, not connected - prev implementation - elapsed   0.000261854s
time: directed, not connected - new  implementation - elapsed   6.209e-06s

...
=== Testing size 24
= Testing undirected, connected
time: undirected, connected - prev implementation - elapsed     16.458s
time: undirected, connected - new  implementation - elapsed     6.1249e-05s
= Testing undirected, not connected
time: undirected, not connected - prev implementation - elapsed         0.0312753s
time: undirected, not connected - new  implementation - elapsed         6.9342e-05s
= Testing directed, connected
time: directed, connected - prev implementation - elapsed       7.45684s
time: directed, connected - new  implementation - elapsed       4.9894e-05s
= Testing directed, not connected
time: directed, not connected - prev implementation - elapsed   0.01372s
time: directed, not connected - new  implementation - elapsed   0.000215756s

Adhere to the specification in comment: // Is the undirected graph connected? // Is the directed graph strongly connected? Instead of checking if each vertex is reachable from each vertex (which is O(n^3)), consider different approaches for directed and undirected graphs: For an undirected graph check if each vertex was reachable with a single DFS walk. This runs in O(N) (modulo back edges). For a directed graph run Tarjan SCC algorithm and check if all vertices end up in a single component. This runs in O(N+E). The speed-up is considerable, e.g. for a 25x25 square connected graph it is about: time: undirected, connected - prev implementation - elapsed 16.458s time: undirected, connected - new implementation - elapsed 6.1249e-05s time: directed, connected - prev implementation - elapsed 7.45684s time: directed, connected - new implementation - elapsed 4.9894e-05s For similar 80x80 graph it's about 0.000563113s, while it just takes forever with the previous O(n^3) approach.

Adhere to the specification in function comment: // Is the undirected graph connected? // Is the directed graph strongly connected? Instead of checking if each vertex is reachable from each vertex (which is O(n^3)), consider different approaches for directed and undirected graphs: For an undirected graph check if each vertex was reachable with a single DFS walk. This runs in O(N) of a single connected component (modulo its back edges). For a directed graph run Tarjan SCC algorithm and check if all vertices end up in a single component. This runs in O(N + E). The speed-up is considerable, e.g. for a 25x25 square connected graph it is about: time: undirected, connected - prev implementation - elapsed 16.458s time: undirected, connected - new implementation - elapsed 6.1249e-05s time: directed, connected - prev implementation - elapsed 7.45684s time: directed, connected - new implementation - elapsed 4.9894e-05s For similar 80x80 graph it's about 0.000563113s, while it just takes forever with the previous O(n^3) approach.

…backward compatibility with the previous implementation

Wavetrace · 2024-11-12T13:59:46Z

I've also opened #398 to track this.

This resolves a build error due to boost::detail::get preferred without the using declaration: ../../../boost/pending/detail/disjoint_sets.hpp:58:24: error: no matching function for call to 'get(long unsigned int*&, int&)' 58 | assert(i == get(p, i)); ../../../boost/graph/reverse_graph.hpp:561:7: note: candidate: 'template<class E> E boost::detail::get(boost::detail::underlying_edge_desc_map_type<E>, const boost::detail::reverse_graph_edge_descriptor<EdgeDesc>&)' 561 | E get(underlying_edge_desc_map_type< E > m, | ^~~ ../../../boost/graph/reverse_graph.hpp:561:7: note: template argument deduction/substitution failed: ../../../boost/pending/detail/disjoint_sets.hpp:58:24: note: mismatched types 'boost::detail::underlying_edge_desc_map_type<E>' and 'long unsigned int*' 58 | assert(i == get(p, i)); | ~~~^~~~~~

jeremy-murphy · 2024-11-14T03:14:42Z

I only have one question: how does someone with almost no activity history in GitHub know so much about Boost.Graph? :)

Wavetrace · 2024-11-14T10:41:14Z

I only have one question: how does someone with almost no activity history in GitHub know so much about Boost.Graph? :)

A great question! Well, I read the book as soon at it was published ;) And Alexander Stepanov's recommendation meant a lot to me, I know him personally and appreciate his inventions a lot. I also ended up rolling up my own graph library before, because Boost was not welcome in my organization.

As of GitHub, I hope to publish a few things soon, including the small project, where I encountered this error.

One reason I actually hit it is that I go straight to the code first and resort to the documentation when the code doesn't work as I'd expect.

Thanks for looking into this!

Wavetrace · 2024-11-14T11:20:58Z

For the record, this function was added in 86ef707 on Feb 6, 2001 and not modified since then except for style.

jeremy-murphy · 2024-11-15T01:51:40Z

I only have one question: how does someone with almost no activity history in GitHub know so much about Boost.Graph? :)

A great question! Well, I read the book as soon at it was published ;) And Alexander Stepanov's recommendation meant a lot to me, I know him personally and appreciate his inventions a lot. I also ended up rolling up my own graph library before, because Boost was not welcome in my organization.

Interesting. Did you also work with him? I'm also a great admirer of his work and managed to visit him once for a chat; unfortunately, Palo Alto is a long way from Australia.

Anyway, my question was fairly flippant, so let me address an actual concern: maybe this function shouldn't exist at all?
Although the meaning of "connected" is unambiguous for undirected graphs, it's not so for directed graphs, and although we would preserve the original meaning of strongly connected, it still leaves every new user to discover what is meant by "connected" in the directed case. Also, what happens when someone wants "connected" to mean weakly, or unilaterally, connected in the directed case? It sits uncomfortably with me. It perhaps has to be a parameter to is_connected so that the user can specify it at each call. What do you think of this situation? Let's resolve this design question before we move on to matters of code.

As you pointed out, the function is undocumented and not used anywhere in the code base that I can see, so we are free to reinvent it.

Wavetrace added 5 commits November 11, 2024 13:34

Added test for is_connected function

f79ef79

Clear the colormap in is_connected (undirected case) to preserve the …

c412416

…backward compatibility with the previous implementation

Formatted the code, observed the line lenght limit

4639591

Wavetrace mentioned this pull request Nov 12, 2024

is_connected() is too slow #398

Open

jeremy-murphy self-assigned this Nov 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix the O(n^3) performance issue in `is_connected` utility function #397

Fix the O(n^3) performance issue in `is_connected` utility function #397

Wavetrace commented Nov 12, 2024 •

edited

Loading

Wavetrace commented Nov 12, 2024

jeremy-murphy commented Nov 14, 2024

Wavetrace commented Nov 14, 2024

Wavetrace commented Nov 14, 2024

jeremy-murphy commented Nov 15, 2024

Fix the O(n^3) performance issue in is_connected utility function #397

Are you sure you want to change the base?

Fix the O(n^3) performance issue in is_connected utility function #397

Conversation

Wavetrace commented Nov 12, 2024 • edited Loading

Wavetrace commented Nov 12, 2024

jeremy-murphy commented Nov 14, 2024

Wavetrace commented Nov 14, 2024

Wavetrace commented Nov 14, 2024

jeremy-murphy commented Nov 15, 2024

Fix the O(n^3) performance issue in `is_connected` utility function #397

Fix the O(n^3) performance issue in `is_connected` utility function #397

Wavetrace commented Nov 12, 2024 •

edited

Loading