Commit

writing souffle autoindex post
sowmith1999 committed Jul 6, 2024
1 parent ce6c7a9 commit ad088b3
Showing 9 changed files with 179 additions and 51 deletions.
57 changes: 41 additions & 16 deletions content/blog/souffle-auto-index/index.md
@@ -7,29 +7,54 @@ draft = false
math = true
+++
<!-- # Auto Indexing in Datalog -->
## Introduction
[Datalog](https://en.wikipedia.org/wiki/Datalog) rules are relational queries, each involving joins over one or more relations. Here is a simple example of a Datalog rule:

Datalog rules consist of many queries that involve joins over multiple relations. Evaluating them means looking up values in those relations; for the queries to run in a practical amount of time, you need indexes that make the look-ups faster and avoid linear scans of the tables.
```prolog
path(x, y) :- path(x, z), edge(z, y).
```
The rule above is the recursive step of transitive closure over a graph, where `path` and `edge` are relations. It says: if there is a path from `x` to `z` and there is an edge from `z` to `y`, then there is a path from `x` to `y`. Evaluating it requires a join over the `path` and `edge` relations. [Larger programs](https://github.com/harp-lab/brouhaha/blob/master/analyze.slog) can have hundreds of such rules, each involving joins over multiple relations.

For these queries to be performant, you need indexes to make the value look-ups faster and avoid linear scans of the tables.
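
To make the difference concrete, here is a minimal sketch of the join from the rule above, once with a linear scan and once with an index. It is only an illustration: the function names are made up for this post, and a Python dictionary stands in for the B-tree index that a real Datalog engine would use.

```python
from collections import defaultdict

def join_with_scan(path, edge):
    """Nested-loop join: for every path tuple, scan every edge tuple."""
    return {(x, y) for (x, z) in path for (z2, y) in edge if z == z2}

def join_with_index(path, edge):
    """Index edge on its first column, then probe the index once per path tuple."""
    index = defaultdict(list)  # assumption: a dict stands in for a real index
    for z, y in edge:
        index[z].append(y)
    return {(x, y) for (x, z) in path for y in index.get(z, [])}

edge = {(1, 2), (2, 3), (3, 4)}
path = set(edge)  # start with path(x, y) :- edge(x, y)

# Both strategies derive the same new tuples; only the look-up cost differs.
print(join_with_scan(path, edge) == join_with_index(path, edge))
```

The scan touches every `edge` tuple for every `path` tuple, while the indexed version only touches the `edge` tuples whose first column matches.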

One of the best ways to run a Datalog program well is to have a strong indexing plan, so that every search is covered by an index. But indexes cost a lot of memory and compute to build and maintain. We need a middle ground: leave little to no performance on the table while using as few indexes as possible.
How do you go about creating the indexes? Is there a general approach that is optimal? What considerations do you need to take into account when creating them?

This is the paper that sets up the [Minimum Index Selection Problem](https://souffle-lang.github.io/vldb19.html). Its goal is: given the set of searches performed over a relation, find the minimum number of indexes such that every search is covered by an index.

### What does the problem look like?
### What does a sample problem look like?
- **Input**
- A relation R has 3 columns that are used in different searches: `x, y, z`.
- The input to the problem is the set of searches used on the relation: `{x}, {x,y}, {x,z}, {x, y, z}`
- A relation R has n columns, 3 of which are used in different searches: `x, y, z`.
- The input to the problem is the set of searches used on the relation: `{x}, {x,y}, {x,z}, {x, y, z}`
- To make these searches fast, you need to create indexes on the relation over these columns. Naively, 4 indexes are needed, one per search.
- **Problem**
- Here, How do you get rid of some searches, Idea is, if you have a search \( S_{1} \)
```python
def foo():
    return 1
```
Well, Another snippet
```cpp
// Returns true if bit i of num is set.
bool getBit(int num, int i) {
    return ((num & (1 << i)) != 0);
}
```
- Can you get away with fewer indexes without any linear scans of the table?
- **Observations/Notes**
- "For example, the index \( \ell = x \prec y \prec z \) covers three primitive searches: \( S_1 = \sigma_{x=v_1} \), \( S_2 = \sigma_{x=v'_1, y=v'_2} \), and \( S_3 = \sigma_{x=v''_1, y=v''_2, z=v''_3} \)".
- Searches can share an index if they share a common prefix. For example, for the searches `{x}, {x,y}`, you can create one index \(x \prec y\) and use it for both.
- Taking this a bit further, `{x}, {x, y}, {x, y, z}` can share one index, \(x \prec y \prec z \), while `{x,z}` needs a separate index. Another possibility is that `{x}`, `{x,z}` and `{x, y, z}` share the index \(x \prec z \prec y \), and `{x,y}` gets a separate index.
- Intuitively, you can get away with fewer indexes than searches just by finding the longest common prefixes among the searches.
- **Solution**
- Finding the longest shared prefixes among the searches, so that a minimum number of indexes covers all of them, is a simple statement of the Minimum Index Selection Problem (MISP).
- As the paper shows, this problem can be modelled as the Minimum Chain Cover Problem (MCCP), which can be solved in polynomial time.
- Hence MISP, too, can be solved in polynomial time.
- As the paper does, we will look at the problem and the solution in more detail in the following sections.
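
To get a feel for the chain cover view, here is a rough sketch that computes a minimum chain cover for the sample searches using a textbook reduction to maximum bipartite matching. This is my own illustration and not Soufflé's implementation; the function name `min_chain_cover` and the augmenting-path matching are choices made for this post.

```python
def min_chain_cover(searches):
    """Split searches (sets of columns) into the fewest chains under proper subset."""
    searches = [frozenset(s) for s in searches]
    n = len(searches)
    # succ[i] lists every search that strictly contains search i.
    succ = [[j for j in range(n) if searches[i] < searches[j]] for i in range(n)]
    match_right = [-1] * n  # match_right[j] = i means "i is followed by j"

    def try_augment(i, seen):
        # Standard augmenting-path step for bipartite matching.
        for j in succ[i]:
            if j in seen:
                continue
            seen.add(j)
            if match_right[j] == -1 or try_augment(match_right[j], seen):
                match_right[j] = i
                return True
        return False

    for i in range(n):
        try_augment(i, set())

    # Each matched pair links a search to its successor in a chain.
    next_of = {i: j for j, i in enumerate(match_right) if i != -1}
    chains = []
    for start in (i for i in range(n) if match_right[i] == -1):
        chain = [searches[start]]
        while start in next_of:
            start = next_of[start]
            chain.append(searches[start])
        chains.append(chain)
    return chains

chains = min_chain_cover([{"x"}, {"x", "y"}, {"x", "z"}, {"x", "y", "z"}])
print(len(chains))  # number of indexes needed for the sample problem
```

Every matched pair merges two searches into the same chain, so the number of chains is the number of searches minus the size of the matching. For the sample problem that is 4 - 2 = 2 indexes.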
## Details
### Definitions
- **Primitive Search** : A primitive search is like a SQL `SELECT` statement that returns the tuples satisfying a condition. For example, an equality check on a column, \( \sigma_{x=v} \), is a primitive search.
- **Index** : An index here refers to a clustered B-tree index that covers a search predicate. The index \( \ell = x \prec y \prec z \) uses \( x \) followed by \( y \) followed by \( z \) as its key, and covers searches whose attributes form a prefix of that key.
- **Search Chain** : "A sequence of \( k \) searches \(S_1, S_2, \ldots, S_k \) form a search chain if each search \( S_i \) is a proper subset of its immediate successor \( S_{i+1} \). As a result, all search in the same search chain can be covered by a single index."
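
The definitions above can be sketched in a few lines of Python. This is a small illustration under my own naming (`index_from_chain` and `covers` are hypothetical helpers, not from the paper): a chain is turned into an index by listing columns in the order the chain introduces them, and an index covers a search when the search's columns are exactly a prefix of the index key.

```python
def index_from_chain(chain):
    """Build a column order from a search chain (each search a subset of the next)."""
    order = []
    for search in chain:
        # Append the columns this search adds; sorted only to make the order deterministic.
        order.extend(sorted(set(search) - set(order)))
    return order

def covers(index, search):
    """An index covers a search if the search's columns form a prefix of the key."""
    return set(index[:len(search)]) == set(search)

chain = [{"x"}, {"x", "y"}, {"x", "y", "z"}]
index = index_from_chain(chain)
print(index)
print(covers(index, {"x", "y"}), covers(index, {"x", "z"}))
```

The search `{x, z}` is not a prefix of the key built from this chain, which is why it needs a second index in the sample problem.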

### Content
Hopefully, I did a good job of explaining the problem we are trying to solve and giving some intuition for how the paper solves it.

#### Why not just create an index for each search?
- Well, you can, but that is super expensive in both memory and compute, and you can do much better.

#### Why not just look through all the searches and figure out the minimal set?
- Well, again you can, but this borders on impossible due to the sheer number of possible combinations: something like \( 2^{m!} \), where \( m \) is the number of columns used for searches over a relation.
- With \( m \) attributes involved, there are \( m! \) possible permutations of the attributes, and each of those \( m! \) permutations can either be picked or not, hence \( 2^{m!} \) candidate sets of indexes.
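
To see how quickly this blows up, here is a tiny sketch that evaluates \( 2^{m!} \) for small \( m \). The helper name `candidate_index_sets` is made up for this post.

```python
import math

def candidate_index_sets(m):
    """Number of subsets of the m! possible column orderings."""
    return 2 ** math.factorial(m)

for m in range(1, 6):
    print(m, math.factorial(m), candidate_index_sets(m))
```

Already at \( m = 3 \) there are \( 2^{6} = 64 \) candidate sets, and at \( m = 4 \) there are \( 2^{24} \), which is more than 16 million.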


- Well, this is an image.
![joins](join.svg)

66 changes: 66 additions & 0 deletions layouts/about/single.html
@@ -0,0 +1,66 @@
{{- define "main" }}

<article class="post-single">
<header class="post-header">
{{ partial "breadcrumbs.html" . }}
<h1 class="post-title entry-hint-parent">
{{ .Title }}
{{- if .Draft }}
<span class="entry-hint" title="Draft">
<svg xmlns="http://www.w3.org/2000/svg" height="35" viewBox="0 -960 960 960" fill="currentColor">
<path
d="M160-410v-60h300v60H160Zm0-165v-60h470v60H160Zm0-165v-60h470v60H160Zm360 580v-123l221-220q9-9 20-13t22-4q12 0 23 4.5t20 13.5l37 37q9 9 13 20t4 22q0 11-4.5 22.5T862.09-380L643-160H520Zm300-263-37-37 37 37ZM580-220h38l121-122-18-19-19-18-122 121v38Zm141-141-19-18 37 37-18-19Z" />
</svg>
</span>
{{- end }}
</h1>
{{- if .Description }}
<div class="post-description">
{{ .Description }}
</div>
{{- end }}
{{- if not (.Param "hideMeta") }}
<div class="post-meta">
{{- partial "post_meta.html" . -}}
{{- partial "translation_list.html" . -}}
{{- partial "edit_post.html" . -}}
{{- partial "post_canonical.html" . -}}
</div>
{{- end }}
</header>
{{- $isHidden := (.Param "cover.hiddenInSingle") | default (.Param "cover.hidden") | default false }}
{{- partial "cover.html" (dict "cxt" . "IsSingle" true "isHidden" $isHidden) }}
{{- if (.Param "ShowToc") }}
{{- partial "toc.html" . }}
{{- end }}

{{- if .Content }}
<div class="post-content">
{{- if not (.Param "disableAnchoredHeadings") }}
{{- partial "anchored_headings.html" .Content -}}
{{- else }}{{ .Content }}{{ end }}
</div>
{{- end }}

<footer class="post-footer">
{{- $tags := .Language.Params.Taxonomies.tag | default "tags" }}
<ul class="post-tags">
{{- range ($.GetTerms $tags) }}
<li><a href="{{ .Permalink }}">{{ .LinkTitle }}</a></li>
{{- end }}
</ul>
{{- if (.Param "ShowPostNavLinks") }}
{{- partial "post_nav_links.html" . }}
{{- end }}
{{- if (and site.Params.ShowShareButtons (ne .Params.disableShare true)) }}
{{- partial "share_icons.html" . -}}
{{- end }}
</footer>

{{- if (.Param "comments") }}
{{- partial "comments.html" . }}
{{- end }}
</article>

{{- end }}{{/* end main */}}
