Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Safe GiST Index access #190

Open
rapodaca opened this issue Sep 7, 2021 · 4 comments
Open

Safe GiST Index access #190

rapodaca opened this issue Sep 7, 2021 · 4 comments
Labels
enhancement New feature or request

Comments

@rapodaca
Copy link

rapodaca commented Sep 7, 2021

Thank you for this excellent project! I've started building extensions with pgx and writing about it here.

I noticed a pull request (#107) that mentions GIN and GiST.

Is the GiST index type currently available through pgx? If not, what would need to be put in place? If it is possible to use GIST, where can I find sample code and/or documentation?

@Hoverbear
Copy link
Contributor

Hi @rapodaca ! There currently aren't nice APIs for that bit of functionality yet. :(

If you do a cargo doc --open on your local extension and navigate to pgx-pg-sys in the crates it builds, you could probably use the search function to find the relevant function symbols etc, but it's not going to be nice to use.

Did you have something in particular you were hoping to do with it? Maybe some code you wish worked to help drive designs around possibly adding this? (If you'd like to add it, that'd also be great!)

@Hoverbear Hoverbear added the enhancement New feature or request label Sep 8, 2021
@Hoverbear Hoverbear changed the title GiST Indexes Safe GiST Index access Sep 8, 2021
@rapodaca
Copy link
Author

Thanks for the documentation tip. I can see things like GISTENTRY and GISTEntryVector. I'll have a closer look.

I'd like to build an extension that can search chemical structures. These are essentially graphs that can be queried kind of like strings: by exact match (exact structure) or contains match (substructure/subgraph).

The search is usually done in two passes:

  1. A screening step that can return false positives. This is implemented with a Bloom filter than can be represented as a bit string.
  2. A node-by-node search. This is the expensive part in which Ullmann or VF2 algorithms are used.

I'm very experienced with the steps above (having written a lot of low-level code to do it), but very new to Postgres extensions, so I'm not sure how structure search would translate into Postgres indexes and types.

There are examples of the kind of extension I'd like to build. A popular one is here:

https://www.rdkit.org/docs/Cartridge.html

It seems that GiST is the kind of index I want for this, based in part on how the above extension works. However, if btree could be made to work, I'd be interested in that as well.

I'm not really clear on how a GiST index works. I was hoping it would be possible to play around with them in the context of some simple types before solving the main problem. So I any pointer to a simple "Hello World" thing I try out (if that's possible) would help a lot.

@Hoverbear
Copy link
Contributor

That sounds really fun! Are you going to be using petgraph and such?

I noticed @Horusiath created https://github.com/Horusiath/gevel, maybe that's a good place to take a look for this?

@Horusiath
Copy link
Contributor

Horusiath commented Sep 13, 2021

@Hoverbear thanks for mentioning. I'm working on adding the last bits to gevel from time to time, but 2/3 features for GiST indexes are there. Feel free to use them for reference. This however help with inspecting the indexes not building them.

@rapodaca if you want to build your own indexed data type, be prepared for a bit of work. One thing is knowing how Postgres GiST works. Another is defining a set of several core functions that are necessary for Postgres to plugin your own indexing logic. I've started some work on that for my data type (vector versions), but it's also not finished.

In short you should define several functions:

  • define a set of operator functions that can utilize index like equals, greater or lower than but also GiST specific eg. above, below, overlaps or contains.
  • union which composes higher-level key out of lower level ones. Eg. range union of [1..3) and [2..5) will be [1..5).
  • consistent to check if given index key satisfies the query for an operator from SQL statement.
  • penalty used to determine where it's least expensive to insert a new tuple, given several possible pages that could fit this role.
  • picksplit used when given index page is getting too big - in that case we need to split it in two and determine which tuples are about to go into which of two newly created pages.
  • compress/decompress (optional) functions which allow us to use different data type to be used to represent index keys on branch pages and different one on the leaf pages. Eg. Postgres GiST for tsvector uses bloom-like structure for internal representation to save space.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants