- Extract interesting subgraphs from one large graph, or set of many graphs
- Could be looking for particular patterns
- Could be looking for any frequent patterns (given a min support)
- Analyzing communication networks (internet, cell phone networks)
- Analyzing social networking communities
- Analyzing the structure of the web
- Identifying common interaction networks between biomolecules
- Detecting similarities in the chemical structures of compounds
- Vertices and edges
- Adjacent vertices are connected by an edge; neighbors
- A labeled graph has labels associated with the vertices and possibly the edges (similar to labels in clustering or classification)
- A subgraph is a subset of vertices and edges from a larger graph
- A connected subgraph is a subgraph in which there is path between any two vertices (i.e. no disconnected components)
- Graph mining is typically interested in identifying connected subgraphs
- There are many different ways of drawing or constructing graphs with the same underlying structure
- On a mathematical level, we can say that two graphs with the same structure are isomorphic
- Formally:
- Consider two graphs, G = (V, E) and G' = (V', E')
- G and G' are isomorphic if there exists a function, phi, that can map vertices from G' to G
- The mapping must preserve all edges; (U', V') is an edge in G' iff (phi(U'), phi(V')) is an edge in G
- The mapping must preserve all vertex labels; for any vertex U' in G', the label of U' must match the label of phi(U')
- In other words, mapping function phi must preserve all edge adjacencies and all vertex labels
- phi should be injective, meaning that it maps each vertex from G' to a unique vertex in G
- phi should be surjective, meaning that all vertices in G have a mapping from G'
- If phi is only injective, then we say that G' is subgraph isomorphic to G
- G' can be mapped "into" G; G' is isomorpic to a subgraph of G
- Given a set (database) of graphs, D, and some other graph, G, the support of G (sup(G)) is simply answering how many graphs in D contain G
- Despite the simple question, finding frequent graphs in a database has an enormous search space; an efficient algorithm is needed
- Decompose to two problems:
- Problem 1: Systematically generate candidate graphs to search for
- A good approach is the "edge growth mechanism", where we start with an empty graph and systematically add more edges
- The key is to not produce or evaluate duplicate candidates (prune the search tree); this requires graph isomorphism checking
- Problem 2: Count the support of candidate graphs in the database, which involves subgraph isomorphism checking
- Problem 1: Systematically generate candidate graphs to search for
- The preferred edge growth mechanism is rightmost path extension
- Order the vertices according to a DFS traversal
- Edges found during the DFS traversal are considered forward edges; all others are backward edges
- The rightmost path is the path from the start vertex to the very last found vertex
- To generate new paths from a graph G there are two options:
- We can add backward edges from the rightmost vertex to some other vertex on the rightmost path
- We can add forward edges from any vertex on the rightmost path
- General procedure:
- Define a total ordering for all possible edge extensions (i.e. adding new edges)
- First try backward extensions (i.e. #1 above), preferring to connect to vertices closer to the root
- Then try forward extensions (i.e. #2 above), preferring to start the edge with vertices farther from the root
- Note: we are never allowed to have more than a single edge between any two vertices
- Graph isomorphisms can be pruned by checking canonical codes of graphs
- If we generate duplicate (i.e. isomorphic) candidate graphs, we only need to keep one of them; think of it as assigning the graphs a rank, and then keeping only the highest ranked candidate
- General procedure:
- Order the vertices according to a DFS traversal
- Order the edges of the graph by following edges between consecutive vertices
- Visit each vertex in DFS ordering
- Build a DFS code tuple for each edge incident with that vertex: <vertex_1, vertex_2, label(vertex_1), label(vertex_2), label(edge(vertex_1, vertex_2))>
- First order all the backward edges incident with that vertex (they have a higher ordering)
- Then order all the forward edges incident with that vertex (they have a lower ordering)
- Move on to the next vertex
- This ordered set of DFS code tuples is known as the DFS code of the graph
- We can compare two graphs based on their DFS codes (equation in book)
- The graph with the smallest DFS code is considered canonical and is the preferred representation of the graph
- For the purposes of the gSpan algorithm, we don't use the DFS codes to check if two graphs are equal; rather, for a given graph G, we check if it is canonical, and if not, then prune that branch of the search tree
- Recursive algorithm that operates on a set of DFS tuples (the DFS code of a graph), a graph database, and a minimum support
- Initial call is made with an empty DFS code
- At each function call:
- All possible extensions of the the DFS code (i.e. ways to extend the graph) are generated, and the support of the extension calculated
- Uses rightmost path extension as described above
- Generates extensions based on the graph database
- During the first call (with the empty DFS code), the set of possible extensions is essentially every unique edge from every graph in the database
- Each possible extension of the DFS code is examined
- If the possible extension is frequent, we generate a new DFS code consisting of the union of the original code and the possible extension
- Prunes off any branches of the search tree for subgraphs that aren't frequent
- If the new DFS code (with the extension) is canonical, then recursively call gSpan, using the new DFS code
- Prunes off any branches of the search tree for subgraphs that will be examined/evaluated via another search path (with the canonical version of the graph)
- All possible extensions of the the DFS code (i.e. ways to extend the graph) are generated, and the support of the extension calculated