Safe GiST Index access #190

rapodaca · 2021-09-07T16:56:21Z

Thank you for this excellent project! I've started building extensions with pgx and writing about it here.

I noticed a pull request (#107) that mentions GIN and GiST.

Is the GiST index type currently available through pgx? If not, what would need to be put in place? If it is possible to use GIST, where can I find sample code and/or documentation?

Hoverbear · 2021-09-08T21:44:39Z

Hi @rapodaca ! There currently aren't nice APIs for that bit of functionality yet. :(

If you do a cargo doc --open on your local extension and navigate to pgx-pg-sys in the crates it builds, you could probably use the search function to find the relevant function symbols etc, but it's not going to be nice to use.

Did you have something in particular you were hoping to do with it? Maybe some code you wish worked to help drive designs around possibly adding this? (If you'd like to add it, that'd also be great!)

rapodaca · 2021-09-10T21:57:06Z

Thanks for the documentation tip. I can see things like GISTENTRY and GISTEntryVector. I'll have a closer look.

I'd like to build an extension that can search chemical structures. These are essentially graphs that can be queried kind of like strings: by exact match (exact structure) or contains match (substructure/subgraph).

The search is usually done in two passes:

A screening step that can return false positives. This is implemented with a Bloom filter than can be represented as a bit string.
A node-by-node search. This is the expensive part in which Ullmann or VF2 algorithms are used.

I'm very experienced with the steps above (having written a lot of low-level code to do it), but very new to Postgres extensions, so I'm not sure how structure search would translate into Postgres indexes and types.

There are examples of the kind of extension I'd like to build. A popular one is here:

https://www.rdkit.org/docs/Cartridge.html

It seems that GiST is the kind of index I want for this, based in part on how the above extension works. However, if btree could be made to work, I'd be interested in that as well.

I'm not really clear on how a GiST index works. I was hoping it would be possible to play around with them in the context of some simple types before solving the main problem. So I any pointer to a simple "Hello World" thing I try out (if that's possible) would help a lot.

Hoverbear · 2021-09-13T16:11:52Z

That sounds really fun! Are you going to be using petgraph and such?

I noticed @Horusiath created https://github.com/Horusiath/gevel, maybe that's a good place to take a look for this?

Horusiath · 2021-09-13T19:47:47Z

@Hoverbear thanks for mentioning. I'm working on adding the last bits to gevel from time to time, but 2/3 features for GiST indexes are there. Feel free to use them for reference. This however help with inspecting the indexes not building them.

@rapodaca if you want to build your own indexed data type, be prepared for a bit of work. One thing is knowing how Postgres GiST works. Another is defining a set of several core functions that are necessary for Postgres to plugin your own indexing logic. I've started some work on that for my data type (vector versions), but it's also not finished.

In short you should define several functions:

define a set of operator functions that can utilize index like equals, greater or lower than but also GiST specific eg. above, below, overlaps or contains.
union which composes higher-level key out of lower level ones. Eg. range union of [1..3) and [2..5) will be [1..5).
consistent to check if given index key satisfies the query for an operator from SQL statement.
penalty used to determine where it's least expensive to insert a new tuple, given several possible pages that could fit this role.
picksplit used when given index page is getting too big - in that case we need to split it in two and determine which tuples are about to go into which of two newly created pages.
compress/decompress (optional) functions which allow us to use different data type to be used to represent index keys on branch pages and different one on the leaf pages. Eg. Postgres GiST for tsvector uses bloom-like structure for internal representation to save space.

Hoverbear added the enhancement New feature or request label Sep 8, 2021

Hoverbear changed the title ~~GiST Indexes~~ Safe GiST Index access Sep 8, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Safe GiST Index access #190

Safe GiST Index access #190

rapodaca commented Sep 7, 2021

Hoverbear commented Sep 8, 2021

rapodaca commented Sep 10, 2021

Hoverbear commented Sep 13, 2021

Horusiath commented Sep 13, 2021 •

edited

Loading

Safe GiST Index access #190

Safe GiST Index access #190

Comments

rapodaca commented Sep 7, 2021

Hoverbear commented Sep 8, 2021

rapodaca commented Sep 10, 2021

Hoverbear commented Sep 13, 2021

Horusiath commented Sep 13, 2021 • edited Loading

Horusiath commented Sep 13, 2021 •

edited

Loading