-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathpgday-austin-20161112.slide
150 lines (92 loc) · 3.49 KB
/
pgday-austin-20161112.slide
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
Experimenting with Phrase Text Search and New RUM Index
Galvanize
November 12, 2016
Austin, Texas, USA
John Scott
CTO of RJ2 Technologies
https://github.com/jmscott/talk
* "Facts", Immutablility and Blobs
- a fact is always in the past tense
- a sha digest is the key to the pdf
- the pdf blob does not need to be stored in postgres
* What is Full Text Search?
- Documents are plain old text, in database encoding
- Supports Many Text Encoding
- UTF-8 Encoding of Text Simplifies
- Only Two Datatypes Implement Text Search
* Who Wrote Full Text Search?
- Postgres Pro, a Russian Company
- Oleg Bartunov
- Teodor Sigaev
- Artur Zakirov
* Think of Full Text Search as a Toolkit to Build Text Search Engines
- relational model excellent for correlating text with analytic data
- chemical equations
- gene sequences
- text fields in spreadsheet
- elastic search still better for text only search
* Source of the PDF Documents for the Demo
- arvix.org (Cornell, won the MacArthur Genius Award)
- Theory of Computing Blog Aggregator (http://feedworld.net/toc/)
- Electronic Colloquim on Computational Complexity eccc.hpi-web.de
- Many academic web sites surfed for past 10 years.
* Statistics of PDF Documents for Today
.html pgday-austin-20161112/stat.html
* Example of "Google" Style Search
NEURAL CRYPTOGRAPHY
* Full Text Search Data Types
tsvector
sorted list of lexemes (words) in the text, with optional positions for proximity ranking
tsquery
boolean combinations of lexemes and the "followed by" <-> operator for phrases
* TSVector Data Type
.code pgday-austin-20161112/tsv1.code
Positions of Lexemes Needed for Proximity Ranking
.code pgday-austin-20161112/tsv2.code
Maps (Stem) Words to Lexemes and Removes Stop Words
.code pgday-austin-20161112/tsv3.code
* TSQuery Data Type
Boolean AND, OR, NOT with Grouping of Lexemes
.code pgday-austin-20161112/tsq2.code
Rearranges nested operators into a logically equivalent formulation
.code pgday-austin-20161112/tsq3.code
<-> is "Followed By" Operator.
<6> Means Exactly 6 Words Apart
* TSQuery Data Type
Lexemes in a tsquery can be labeled with * to specify prefix matching
.code pgday-austin-20161112/tsq4.code
Match any word in a tsvector that begins with "super"
* How to Query Text with the Boolean @@ Operator
TRUE
.code pgday-austin-20161112/q1.code
FALSE
.code pgday-austin-20161112/q2.code
* Extracting Matching Snippets for a "Headline"
.code pgday-austin-20161112/head1.code
Only Highlights Keywords (for Now)
Not Smart About Logical Structure of the Query (yet)
Very, Very Expensive, So Be Sure Push to Target List!
* Text Search Indexes - Speed Up the Query
GIN - Typical Inverted Index - Production Many Years
.code pgday-austin-20161112/idx1.code
RUM - Still in Beta
.code pgday-austin-20161112/idx1.code
*
.image pgday-austin-20161112/rum.png _ 830
* RUM Index, Still in Beta
- Lexeme Positions Stored in the Index
- Rank Matches with Query Distance Operator that Understands Logical Queries
- Allows Timestamps Materialized in the Index
- Much Smarter Optimzer to Exploit Phrase Structure
- Query Fragments can be Indexed!
* RUM Index for Document Classification!
.code pgday-austin-20161112/rum1.code
BOOM
.code pgday-austin-20161112/rum2.code
* Dictionary and Text Configuration
- Map "Words" onto Lexemes - Stemming
- Can Search Multiple Dictionary in Particular Order
- "simple" is exact word, usually final disction in the search
- Loaded as Extension into Shared Memory (9.6)
- Common Mispellings can be in a Dictionary