Skip to content

Latest commit

 

History

History
104 lines (89 loc) · 6.48 KB

2020-11-18a.md

File metadata and controls

104 lines (89 loc) · 6.48 KB

Frequent Subgraphs

Ben Polk

CSCI 550 - Fall 2020



Graph Mining

  • Extract interesting subgraphs from one large graph, or set of many graphs
  • Could be looking for particular patterns
  • Could be looking for any frequent patterns (given a min support)

Applications

  • Analyzing communication networks (internet, cell phone networks)
  • Analyzing social networking communities
  • Analyzing the structure of the web
  • Identifying common interaction networks between biomolecules
  • Detecting similarities in the chemical structures of compounds

Graph Basics

  • Vertices and edges
  • Adjacent vertices are connected by an edge; neighbors
  • A labeled graph has labels associated with the vertices and possibly the edges (similar to labels in clustering or classification)

Subgraphs

  • A subgraph is a subset of vertices and edges from a larger graph
  • A connected subgraph is a subgraph in which there is path between any two vertices (i.e. no disconnected components)
  • Graph mining is typically interested in identifying connected subgraphs

Isomorphism

  • There are many different ways of drawing or constructing graphs with the same underlying structure
  • On a mathematical level, we can say that two graphs with the same structure are isomorphic
  • Formally:
    • Consider two graphs, G = (V, E) and G' = (V', E')
    • G and G' are isomorphic if there exists a function, phi, that can map vertices from G' to G
      • The mapping must preserve all edges; (U', V') is an edge in G' iff (phi(U'), phi(V')) is an edge in G
      • The mapping must preserve all vertex labels; for any vertex U' in G', the label of U' must match the label of phi(U')
      • In other words, mapping function phi must preserve all edge adjacencies and all vertex labels
    • phi should be injective, meaning that it maps each vertex from G' to a unique vertex in G
    • phi should be surjective, meaning that all vertices in G have a mapping from G'
    • If phi is only injective, then we say that G' is subgraph isomorphic to G
      • G' can be mapped "into" G; G' is isomorpic to a subgraph of G

Graph Mining Methodology

Subgraph Support

  • Given a set (database) of graphs, D, and some other graph, G, the support of G (sup(G)) is simply answering how many graphs in D contain G
  • Despite the simple question, finding frequent graphs in a database has an enormous search space; an efficient algorithm is needed
  • Decompose to two problems:
    • Problem 1: Systematically generate candidate graphs to search for
      • A good approach is the "edge growth mechanism", where we start with an empty graph and systematically add more edges
      • The key is to not produce or evaluate duplicate candidates (prune the search tree); this requires graph isomorphism checking
    • Problem 2: Count the support of candidate graphs in the database, which involves subgraph isomorphism checking

Candidate Generation (Rightmost Path Extension)

  • The preferred edge growth mechanism is rightmost path extension
    • Order the vertices according to a DFS traversal
    • Edges found during the DFS traversal are considered forward edges; all others are backward edges
    • The rightmost path is the path from the start vertex to the very last found vertex
  • To generate new paths from a graph G there are two options:
    1. We can add backward edges from the rightmost vertex to some other vertex on the rightmost path
    2. We can add forward edges from any vertex on the rightmost path
  • General procedure:
    • Define a total ordering for all possible edge extensions (i.e. adding new edges)
    • First try backward extensions (i.e. #1 above), preferring to connect to vertices closer to the root
    • Then try forward extensions (i.e. #2 above), preferring to start the edge with vertices farther from the root
    • Note: we are never allowed to have more than a single edge between any two vertices

Isomorphism Checking (Canonical Codes)

  • Graph isomorphisms can be pruned by checking canonical codes of graphs
  • If we generate duplicate (i.e. isomorphic) candidate graphs, we only need to keep one of them; think of it as assigning the graphs a rank, and then keeping only the highest ranked candidate
  • General procedure:
    • Order the vertices according to a DFS traversal
    • Order the edges of the graph by following edges between consecutive vertices
      • Visit each vertex in DFS ordering
      • Build a DFS code tuple for each edge incident with that vertex: <vertex_1, vertex_2, label(vertex_1), label(vertex_2), label(edge(vertex_1, vertex_2))>
      • First order all the backward edges incident with that vertex (they have a higher ordering)
      • Then order all the forward edges incident with that vertex (they have a lower ordering)
      • Move on to the next vertex
    • This ordered set of DFS code tuples is known as the DFS code of the graph
  • We can compare two graphs based on their DFS codes (equation in book)
  • The graph with the smallest DFS code is considered canonical and is the preferred representation of the graph
  • For the purposes of the gSpan algorithm, we don't use the DFS codes to check if two graphs are equal; rather, for a given graph G, we check if it is canonical, and if not, then prune that branch of the search tree

gSpan Algorithm

  • Recursive algorithm that operates on a set of DFS tuples (the DFS code of a graph), a graph database, and a minimum support
    • Initial call is made with an empty DFS code
  • At each function call:
    • All possible extensions of the the DFS code (i.e. ways to extend the graph) are generated, and the support of the extension calculated
      • Uses rightmost path extension as described above
      • Generates extensions based on the graph database
      • During the first call (with the empty DFS code), the set of possible extensions is essentially every unique edge from every graph in the database
    • Each possible extension of the DFS code is examined
    • If the possible extension is frequent, we generate a new DFS code consisting of the union of the original code and the possible extension
      • Prunes off any branches of the search tree for subgraphs that aren't frequent
    • If the new DFS code (with the extension) is canonical, then recursively call gSpan, using the new DFS code
      • Prunes off any branches of the search tree for subgraphs that will be examined/evaluated via another search path (with the canonical version of the graph)