pgday-austin-20161112.slide

Experimenting with Phrase Text Search and New RUM Index

Galvanize
November 12, 2016
Austin, Texas, USA

John Scott
CTO of RJ2 Technologies

jmscott@rj2tech.com
https://github.com/jmscott/talk

* "Facts", Immutability and Blobs

- a fact is always in the past tense
- a sha digest is the key to the pdf
- the pdf blob does not need to be stored in postgres
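
A minimal sketch of how this might look in a schema. The table and column names are hypothetical, not the ones used in the demo: only the digest and the extracted text live in postgres, while the pdf bytes stay outside the database.

 -- hypothetical table: the pdf itself is stored outside of postgres
 CREATE TABLE pdf_text (
     digest  text PRIMARY KEY,   -- sha digest of the pdf, the only link to the blob
     txt     text NOT NULL       -- plain text extracted from the pdf
 );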

* What is Full Text Search?

- Documents are plain old text, in database encoding
- Supports Many Text Encodings
- UTF-8 Encoding of Text Simplifies Things
- Only Two Datatypes Implement Text Search

* Who Wrote Full Text Search?
- Postgres Pro, a Russian Company
- Oleg Bartunov
- Teodor Sigaev
- Artur Zakirov

* Think of Full Text Search as a Toolkit to Build Text Search Engines

- relational model excellent for correlating text with analytic data
- chemical equations
- gene sequences
- text fields in spreadsheet
- Elasticsearch still better for text-only search

* Source of the PDF Documents for the Demo

- arxiv.org (Cornell, won the MacArthur Genius Award)
- Theory of Computing Blog Aggregator (http://feedworld.net/toc/)
- Electronic Colloquium on Computational Complexity (eccc.hpi-web.de)
- Many academic web sites surfed over the past 10 years

* Statistics of PDF Documents for Today

.html pgday-austin-20161112/stat.html

* Example of "Google" Style Search
 NEURAL CRYPTOGRAPHY

* Full Text Search Data Types
tsvector

 sorted list of lexemes (words) in the text, with optional positions for proximity ranking

tsquery

 boolean combinations of lexemes and the "followed by" <-> operator for phrases

* TSVector Data Type

.code pgday-austin-20161112/tsv1.code

Positions of Lexemes Needed for Proximity Ranking

.code pgday-austin-20161112/tsv2.code

Maps (Stem) Words to Lexemes and Removes Stop Words

.code pgday-austin-20161112/tsv3.code
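
A small example of stemming and stop word removal, taken from the PostgreSQL documentation. The english configuration is an assumption here; other configurations give different lexemes.

 SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
 -- 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4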

* TSQuery Data Type

Boolean AND, OR, NOT with Grouping of Lexemes

.code pgday-austin-20161112/tsq2.code

Rearranges nested operators into a logically equivalent formulation

.code pgday-austin-20161112/tsq3.code

<-> is the "Followed By" Operator.
<6> Means Exactly 6 Words Apart.
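
A minimal sketch of phrase matching, assuming the english configuration and PostgreSQL 9.6 or later. `phraseto_tsquery` builds the "followed by" chain from plain text, and a skipped stop word widens the distance.

 SELECT to_tsvector('english', 'fatal error')
        @@ to_tsquery('english', 'fatal <-> error');       -- true
 SELECT to_tsvector('english', 'error is not fatal')
        @@ to_tsquery('english', 'fatal <-> error');       -- false
 SELECT phraseto_tsquery('english', 'the cats ate the rats');
 -- 'cat' <-> 'ate' <2> 'rat'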

* TSQuery Data Type

Lexemes in a tsquery can be labeled with * to specify prefix matching

.code pgday-austin-20161112/tsq4.code

Match any word in a tsvector that begins with "super"
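
A minimal sketch of prefix matching. The simple configuration is an assumption here, chosen so that stemming does not interfere with the prefix.

 SELECT to_tsvector('simple', 'superconductors and supernovae')
        @@ to_tsquery('simple', 'super:*');                -- true
 SELECT to_tsvector('simple', 'conductors and novae')
        @@ to_tsquery('simple', 'super:*');                -- false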

* How to Query Text with the Boolean @@ Operator

TRUE

.code pgday-austin-20161112/q1.code

FALSE

.code pgday-austin-20161112/q2.code

* Extracting Matching Snippets for a "Headline"

.code pgday-austin-20161112/head1.code

Only Highlights Keywords (for Now)

Not Smart About Logical Structure of the Query (yet)

Very, Very Expensive, So Be Sure to Push It to the Target List!
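
One way to keep the cost down, sketched with a hypothetical table pdf_page(digest, txt, tsv): rank and limit in a subquery, then call `ts_headline` only in the outer target list, so it runs on ten rows and not on every match.

 SELECT digest, ts_headline('english', txt, q) AS headline
 FROM (
     -- pdf_page and its columns are hypothetical
     SELECT digest, txt, q
     FROM pdf_page, to_tsquery('english', 'neural & cryptography') AS q
     WHERE tsv @@ q
     ORDER BY ts_rank_cd(tsv, q) DESC
     LIMIT 10
 ) AS top_ten;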

* Text Search Indexes - Speed Up the Query

GIN - Typical Inverted Index - In Production for Many Years

.code pgday-austin-20161112/idx1.code

RUM - Still in Beta

.code pgday-austin-20161112/idx1.code

* 
.image pgday-austin-20161112/rum.png _ 830

* RUM Index, Still in Beta

- Lexeme Positions Stored in the Index
- Rank Matches with Query Distance Operator that Understands Logical Queries
- Allows Timestamps Materialized in the Index
- Much Smarter Optimizer to Exploit Phrase Structure
- Query Fragments can be Indexed!
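
A minimal sketch of ranking through the index, following the README of the rum extension. The table pdf_page and its columns are hypothetical. The <=> operator returns a distance, so the smallest values are the best matches.

 CREATE EXTENSION rum;
 CREATE INDEX pdf_page_rum ON pdf_page USING rum (tsv rum_tsvector_ops);
 SELECT digest,
        tsv <=> to_tsquery('english', 'neural <-> cryptography') AS dist
 FROM pdf_page
 WHERE tsv @@ to_tsquery('english', 'neural <-> cryptography')
 ORDER BY tsv <=> to_tsquery('english', 'neural <-> cryptography')
 LIMIT 10;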

* RUM Index for Document Classification!

.code pgday-austin-20161112/rum1.code

BOOM

.code pgday-austin-20161112/rum2.code

* Dictionary and Text Configuration

- Map "Words" onto Lexemes - Stemming
- Can Search Multiple Dictionaries in a Particular Order
- "simple" is exact word, usually the final dictionary in the search
- Loaded as Extension into Shared Memory (9.6)
- Common Misspellings can be in a Dictionary
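
A small example of what a dictionary does, using the built-in english_stem and simple dictionaries. The first two results are from the PostgreSQL documentation; the third assumes the default simple dictionary with no stop word file.

 SELECT ts_lexize('english_stem', 'stars');    -- {star}
 SELECT ts_lexize('english_stem', 'a');        -- {}  (stop word)
 SELECT ts_lexize('simple', 'Stars');          -- {stars}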