Frequent Subgraphs

Ben Polk

CSCI 550 - Fall 2020

Graph Mining

Extract interesting subgraphs from one large graph, or set of many graphs
Could be looking for particular patterns
Could be looking for any frequent patterns (given a min support)

Applications

Analyzing communication networks (internet, cell phone networks)
Analyzing social networking communities
Analyzing the structure of the web
Identifying common interaction networks between biomolecules
Detecting similarities in the chemical structures of compounds

Graph Basics

Vertices and edges
Adjacent vertices are connected by an edge; neighbors
A labeled graph has labels associated with the vertices and possibly the edges (similar to labels in clustering or classification)

Subgraphs

A subgraph is a subset of vertices and edges from a larger graph
A connected subgraph is a subgraph in which there is path between any two vertices (i.e. no disconnected components)
Graph mining is typically interested in identifying connected subgraphs

Isomorphism

There are many different ways of drawing or constructing graphs with the same underlying structure
On a mathematical level, we can say that two graphs with the same structure are isomorphic
Formally:
- Consider two graphs, G = (V, E) and G' = (V', E')
- G and G' are isomorphic if there exists a function, phi, that can map vertices from G' to G
  - The mapping must preserve all edges; (U', V') is an edge in G' iff (phi(U'), phi(V')) is an edge in G
  - The mapping must preserve all vertex labels; for any vertex U' in G', the label of U' must match the label of phi(U')
  - In other words, mapping function phi must preserve all edge adjacencies and all vertex labels
- phi should be injective, meaning that it maps each vertex from G' to a unique vertex in G
- phi should be surjective, meaning that all vertices in G have a mapping from G'
- If phi is only injective, then we say that G' is subgraph isomorphic to G
  - G' can be mapped "into" G; G' is isomorpic to a subgraph of G

Graph Mining Methodology

Subgraph Support

Given a set (database) of graphs, D, and some other graph, G, the support of G (sup(G)) is simply answering how many graphs in D contain G
Despite the simple question, finding frequent graphs in a database has an enormous search space; an efficient algorithm is needed
Decompose to two problems:
- Problem 1: Systematically generate candidate graphs to search for
  - A good approach is the "edge growth mechanism", where we start with an empty graph and systematically add more edges
  - The key is to not produce or evaluate duplicate candidates (prune the search tree); this requires graph isomorphism checking
- Problem 2: Count the support of candidate graphs in the database, which involves subgraph isomorphism checking

Candidate Generation (Rightmost Path Extension)

The preferred edge growth mechanism is rightmost path extension
- Order the vertices according to a DFS traversal
- Edges found during the DFS traversal are considered forward edges; all others are backward edges
- The rightmost path is the path from the start vertex to the very last found vertex
To generate new paths from a graph G there are two options:
1. We can add backward edges from the rightmost vertex to some other vertex on the rightmost path
2. We can add forward edges from any vertex on the rightmost path
General procedure:
- Define a total ordering for all possible edge extensions (i.e. adding new edges)
- First try backward extensions (i.e. #1 above), preferring to connect to vertices closer to the root
- Then try forward extensions (i.e. #2 above), preferring to start the edge with vertices farther from the root
- Note: we are never allowed to have more than a single edge between any two vertices

Isomorphism Checking (Canonical Codes)

Graph isomorphisms can be pruned by checking canonical codes of graphs
If we generate duplicate (i.e. isomorphic) candidate graphs, we only need to keep one of them; think of it as assigning the graphs a rank, and then keeping only the highest ranked candidate
General procedure:
- Order the vertices according to a DFS traversal
- Order the edges of the graph by following edges between consecutive vertices
  - Visit each vertex in DFS ordering
  - Build a DFS code tuple for each edge incident with that vertex: <vertex_1, vertex_2, label(vertex_1), label(vertex_2), label(edge(vertex_1, vertex_2))>
  - First order all the backward edges incident with that vertex (they have a higher ordering)
  - Then order all the forward edges incident with that vertex (they have a lower ordering)
  - Move on to the next vertex
- This ordered set of DFS code tuples is known as the DFS code of the graph
We can compare two graphs based on their DFS codes (equation in book)
The graph with the smallest DFS code is considered canonical and is the preferred representation of the graph
For the purposes of the gSpan algorithm, we don't use the DFS codes to check if two graphs are equal; rather, for a given graph G, we check if it is canonical, and if not, then prune that branch of the search tree

gSpan Algorithm

Recursive algorithm that operates on a set of DFS tuples (the DFS code of a graph), a graph database, and a minimum support
- Initial call is made with an empty DFS code
At each function call:
- All possible extensions of the the DFS code (i.e. ways to extend the graph) are generated, and the support of the extension calculated
  - Uses rightmost path extension as described above
  - Generates extensions based on the graph database
  - During the first call (with the empty DFS code), the set of possible extensions is essentially every unique edge from every graph in the database
- Each possible extension of the DFS code is examined
- If the possible extension is frequent, we generate a new DFS code consisting of the union of the original code and the possible extension
  - Prunes off any branches of the search tree for subgraphs that aren't frequent
- If the new DFS code (with the extension) is canonical, then recursively call gSpan, using the new DFS code
  - Prunes off any branches of the search tree for subgraphs that will be examined/evaluated via another search path (with the canonical version of the graph)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

2020-11-18a.md

2020-11-18a.md

Frequent Subgraphs

Ben Polk

CSCI 550 - Fall 2020

Graph Mining

Applications

Graph Basics

Subgraphs

Isomorphism

Graph Mining Methodology

Subgraph Support

Candidate Generation (Rightmost Path Extension)

Isomorphism Checking (Canonical Codes)

gSpan Algorithm

Files

2020-11-18a.md

Latest commit

History

2020-11-18a.md

File metadata and controls

Frequent Subgraphs

Ben Polk

CSCI 550 - Fall 2020

Graph Mining

Applications

Graph Basics

Subgraphs

Isomorphism

Graph Mining Methodology

Subgraph Support

Candidate Generation (Rightmost Path Extension)

Isomorphism Checking (Canonical Codes)

gSpan Algorithm