Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the O(n^3) performance issue in is_connected utility function #397

Open
wants to merge 6 commits into
base: develop
Choose a base branch
from

Conversation

Wavetrace
Copy link

@Wavetrace Wavetrace commented Nov 12, 2024

The is_connected() function is implemented in terms of is_rechable(), effectively checking if each vertex is reachable from each other vertex in a nested loop. This quickly becomes terribly slow as the number of vertices and edges grow, becoming roughly unusable with a square connected graph of size 20x20.

This request fixes the issue by

  • using a single DFS walk for undirected graphs
  • employing the Tarjan's SCC algorithm for directed graphs

This reduces the algorithm complexity from cubic to linear and makes it possible to use it on large graphs. See also the commit message in 80ae041

This change is fully backward compatible with the previous implementation.

It also adds a new is_connected.cpp test. Feel free to run this test on the previous implementation. This test passes without the new algorithm, but picks the size of 20x20 square to be considerably slow. This test also accepts an integer as argv[1], the size of the square, so it is possible to observe the slow-down growing as the size increases (the default is 20).

Please see an excerpt of my measurement below (I can upload this program as well). The previous implementation quickly becomes impracticably slow.

I note, that is_connected is not documented and is not widely used. Still I consider it to be important to fully adhere to the interface specified in comment and preserve the compatibility, as it could have been used in non-public projects. In short one important decision here is to avoid any breaking changes.

I foresee a few concerns with this request, which I'll be happy to discuss and address if possible:

  • the graph_utility.hpp header file seems to be designed as low-dependency. Including < vector > and <strong_components.hpp> there seems to violate that choice. But I think the practical aspect of this is reasonably minimal; while fixing the issue fully preserving the backward compatibility is important in practice.
  • the colormap is not used for directed graphs. strong_components do not accept the colormap and pass it to DFS. I don't see this as a big concern. This can be addressed in 2 ways: updating SCC interface to pass the colormap to DFS, or (probably better) provide the version of is_connected with a single argument and no colormap.
  • the version for directed graphs seems to now require VertexIndexGraph concept. (I think) This can be fixed by using map instead of vector for the component map (comp_map) in case the graph is not VertexIndexGraph. I'm not sure, how important is this, but this seems to be a deviation from the full backward compatibility.

I consider a bit of follow-up work on this, which can also be done in a separate pull request:

  • provide single-argument is_connected(const Graph& g). This will require including the colormap header into the graph_utility
  • document is_connected (and other utility functions like is_reachable)
  • add examples of the above

Measurement:

=== Testing size 1
= Testing undirected, connected
time: undirected, connected - prev implementation - elapsed     3.299e-06s
time: undirected, connected - new  implementation - elapsed     1.179e-06s
= Testing undirected, not connected
time: undirected, not connected - prev implementation - elapsed         8.62e-07s
time: undirected, not connected - new  implementation - elapsed         5.39e-07s
= Testing directed, connected
time: directed, connected - prev implementation - elapsed       8.22e-07s
time: directed, connected - new  implementation - elapsed       2.636e-06s
= Testing directed, not connected
time: directed, not connected - prev implementation - elapsed   1.64e-06s
time: directed, not connected - new  implementation - elapsed   1.555e-06s

...
=== Testing size 10
= Testing undirected, connected
time: undirected, connected - prev implementation - elapsed     0.0409115s
time: undirected, connected - new  implementation - elapsed     5.703e-06s
= Testing undirected, not connected
time: undirected, not connected - prev implementation - elapsed         0.000392494s
time: undirected, not connected - new  implementation - elapsed         4.8e-06s
= Testing directed, connected
time: directed, connected - prev implementation - elapsed       0.0240339s
time: directed, connected - new  implementation - elapsed       7.925e-06s
= Testing directed, not connected
time: directed, not connected - prev implementation - elapsed   0.000261854s
time: directed, not connected - new  implementation - elapsed   6.209e-06s

...
=== Testing size 24
= Testing undirected, connected
time: undirected, connected - prev implementation - elapsed     16.458s
time: undirected, connected - new  implementation - elapsed     6.1249e-05s
= Testing undirected, not connected
time: undirected, not connected - prev implementation - elapsed         0.0312753s
time: undirected, not connected - new  implementation - elapsed         6.9342e-05s
= Testing directed, connected
time: directed, connected - prev implementation - elapsed       7.45684s
time: directed, connected - new  implementation - elapsed       4.9894e-05s
= Testing directed, not connected
time: directed, not connected - prev implementation - elapsed   0.01372s
time: directed, not connected - new  implementation - elapsed   0.000215756s

Adhere to the specification in comment:
// Is the undirected graph connected?
// Is the directed graph strongly connected?

Instead of checking if each vertex is reachable from each vertex (which is O(n^3)),
consider different approaches for directed and undirected graphs:

For an undirected graph check if each vertex was reachable with a single DFS walk.
This runs in O(N) (modulo back edges).

For a directed graph run Tarjan SCC algorithm and check if all vertices end up in
a single component. This runs in O(N+E).

The speed-up is considerable, e.g. for a 25x25 square connected graph it is about:

time: undirected, connected - prev implementation - elapsed 	16.458s
time: undirected, connected - new  implementation - elapsed 	6.1249e-05s
time: directed, connected - prev implementation - elapsed 	7.45684s
time: directed, connected - new  implementation - elapsed 	4.9894e-05s

For similar 80x80 graph it's about 0.000563113s, while it just takes forever
with the previous O(n^3) approach.
Adhere to the specification in function comment:
// Is the undirected graph connected?
// Is the directed graph strongly connected?

Instead of checking if each vertex is reachable from each vertex (which is O(n^3)),
consider different approaches for directed and undirected graphs:

For an undirected graph check if each vertex was reachable with a single DFS walk.
This runs in O(N) of a single connected component (modulo its back edges).

For a directed graph run Tarjan SCC algorithm and check if all vertices end up in
a single component. This runs in O(N + E).

The speed-up is considerable, e.g. for a 25x25 square connected graph it is about:

time: undirected, connected - prev implementation - elapsed 	16.458s
time: undirected, connected - new  implementation - elapsed 	6.1249e-05s
time: directed, connected - prev implementation - elapsed 	7.45684s
time: directed, connected - new  implementation - elapsed 	4.9894e-05s

For similar 80x80 graph it's about 0.000563113s, while it just takes forever
with the previous O(n^3) approach.
…backward compatibility with the previous implementation
@Wavetrace
Copy link
Author

I've also opened #398 to track this.

This resolves a build error due to boost::detail::get preferred without the using declaration:

../../../boost/pending/detail/disjoint_sets.hpp:58:24: error: no matching function for call to 'get(long unsigned int*&, int&)'
   58 |         assert(i == get(p, i));
../../../boost/graph/reverse_graph.hpp:561:7: note: candidate: 'template<class E> E boost::detail::get(boost::detail::underlying_edge_desc_map_type<E>, const boost::detail::reverse_graph_edge_descriptor<EdgeDesc>&)'
  561 |     E get(underlying_edge_desc_map_type< E > m,
      |       ^~~
../../../boost/graph/reverse_graph.hpp:561:7: note:   template argument deduction/substitution failed:
../../../boost/pending/detail/disjoint_sets.hpp:58:24: note:   mismatched types 'boost::detail::underlying_edge_desc_map_type<E>' and 'long unsigned int*'
   58 |         assert(i == get(p, i));
      |                     ~~~^~~~~~
@jeremy-murphy jeremy-murphy self-assigned this Nov 14, 2024
@jeremy-murphy
Copy link
Contributor

I only have one question: how does someone with almost no activity history in GitHub know so much about Boost.Graph? :)

@Wavetrace
Copy link
Author

I only have one question: how does someone with almost no activity history in GitHub know so much about Boost.Graph? :)

A great question! Well, I read the book as soon at it was published ;) And Alexander Stepanov's recommendation meant a lot to me, I know him personally and appreciate his inventions a lot. I also ended up rolling up my own graph library before, because Boost was not welcome in my organization.

As of GitHub, I hope to publish a few things soon, including the small project, where I encountered this error.

One reason I actually hit it is that I go straight to the code first and resort to the documentation when the code doesn't work as I'd expect.

Thanks for looking into this!

@Wavetrace
Copy link
Author

For the record, this function was added in 86ef707 on Feb 6, 2001 and not modified since then except for style.

@jeremy-murphy
Copy link
Contributor

I only have one question: how does someone with almost no activity history in GitHub know so much about Boost.Graph? :)

A great question! Well, I read the book as soon at it was published ;) And Alexander Stepanov's recommendation meant a lot to me, I know him personally and appreciate his inventions a lot. I also ended up rolling up my own graph library before, because Boost was not welcome in my organization.

Interesting. Did you also work with him? I'm also a great admirer of his work and managed to visit him once for a chat; unfortunately, Palo Alto is a long way from Australia.

Anyway, my question was fairly flippant, so let me address an actual concern: maybe this function shouldn't exist at all?
Although the meaning of "connected" is unambiguous for undirected graphs, it's not so for directed graphs, and although we would preserve the original meaning of strongly connected, it still leaves every new user to discover what is meant by "connected" in the directed case. Also, what happens when someone wants "connected" to mean weakly, or unilaterally, connected in the directed case? It sits uncomfortably with me. It perhaps has to be a parameter to is_connected so that the user can specify it at each call. What do you think of this situation? Let's resolve this design question before we move on to matters of code.

As you pointed out, the function is undocumented and not used anywhere in the code base that I can see, so we are free to reinvent it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants