sql: add Bloom filters for ATXs and malicious identities #6332
base: develop
Conversation
Force-pushed from aa86c6c to 234c6c5
Force-pushed from 234c6c5 to 0b11181
Codecov Report
Additional details and impacted files:

@@ Coverage Diff @@
##           develop   #6332   +/- ##
========================================
  Coverage     81.8%    81.8%
========================================
  Files          312      314     +2
  Lines        34606    34890   +284
========================================
+ Hits         28318    28563   +245
- Misses        4452     4484    +32
- Partials      1836     1843     +7

View full report in Codecov by Sentry.
I was also looking into leveraging Bloom filters after realizing that most of the identities are not malicious, so this index scan always has to complete fully but return nothing (#6326). Apparently SQLite itself has support for Bloom filters, but I am not sure when the query optimizer picks them. One thing to note is that this query was added wrongly; the way it is written will make the situation with ATX handling worse if it hasn't been released yet. There should be a more optimal approach that doesn't pay the cost all the time but only when it is actually needed; for example, when equivocation happens, update all ATXs in the equivocation set.
One other thing is that I will likely get rid of the ATX warmup and the long tortoise loading; adding more time to startup seems like a poor choice. Maybe consider loading this Bloom filter in the background and using it only once loaded (if the SQLite Bloom filter is hard to use).
That's a good idea, thanks, will do so.
Updating all the identities in the equivocation set when one of them becomes malicious also makes sense to me, I had similar thoughts too, but this should probably be done in a separate PR |
Switched to background loading of the Bloom filters |
btw, do the ATX optimizations speed things up noticeably for you? I mean, not just measuring how long atxs.Has took before/now, but in terms of syncv2 rate or resource usage. The reason why I am asking is that it doesn't appear to be the largest hotspot in ATX processing. This is the distribution of vfs read syscalls by SQLite on v1.6.8; they don't hit disk because of the large memory on my computer, but this is the largest slowdown after verification. It is far from optimal; I would rework that to use smarter access, just changing how/when SQLite is used:
That list just addresses "mistakes" and doesn't introduce any additional complexity, and once those are addressed, maybe adding more complexity will not be necessary. But if adding the Bloom filters helps significantly in the short term, I won't object to adding them.
// NewDBBloomFilter creates a new Bloom filter for a database table.
// tableName is the name of the table, idColumn is the name of the column that contains
// the IDs, filter is an optional SQL expression that selects the rows to include in the
// filter, and falsePositiveRate is the desired false positive rate.
func NewDBBloomFilter(
It would be great to document how to use it, the tradeoffs, etc. For example, why not set falsePositiveRate=0.0, the extraCoef usage, and so on.

"filter is an optional SQL expression that selects the rows to include in the filter"

It doesn't take an argument filter. Should this part be removed?
if bf.minSize > 0 && size < bf.minSize {
	size = bf.minSize
}
NIT, suggested change:

if bf.minSize > 0 {
	size = max(size, bf.minSize)
}
var bs []byte
nRows, err := db.Exec(bf.loadSQL(), nil, func(stmt *Statement) bool {
	l := stmt.ColumnLen(0)
	if cap(bs) < l {
		bs = make([]byte, l)
	} else {
		bs = bs[:l]
	}
	stmt.ColumnBytes(0, bs)
	f.Add(bs)
How about using bytes.Buffer to avoid manual size work; suggested change:

var buf bytes.Buffer
nRows, err := db.Exec(bf.loadSQL(), nil, func(stmt *Statement) bool {
	buf.Reset()
	buf.ReadFrom(stmt.ColumnReader(0))
	f.Add(buf.Bytes())
bf.mtx.Lock()
bf.f = f
bf.mtx.Unlock()
bf.logger.Info("done loading Bloom filter", zap.String("name", bf.name), zap.Int("rows", nRows))
Could it be a problem if new ATXs are inserted after the db.Exec finished but before bf.f = f? I.e., could these ATXs added "in between" be "lost" to the filter? Perhaps it's worth unit-testing such a case, wdyt?
@@ -1979,6 +1980,8 @@ func (app *App) setupDBs(ctx context.Context, lg log.Log) error {
	}
	{
		warmupLog := app.log.Zap().Named("warmup")
		atxs.StartBloomFilter(app.db, warmupLog)
The warmup of the Bloom filters will run in parallel with the atxsdata warmup, which also does a lot of SQL reads. Wouldn't that hurt performance? How about starting the Bloom filter warmup after the atxsdata warmup has finished?
// Contains verifies that the ID exists within the specified set.
func Contains(db Executor, name string, id []byte) (bool, error) {
	if set, ok := db.(IDSetCollection); ok {
		return set.Contains(name, id)
	}
	return false, ErrNoSet
}
When would it be used with something that does not implement IDSetCollection? Why have this function instead of calling set.Contains?
@@ -0,0 +1,212 @@
// Package expr proviedes a simple SQL expression parser and builder.
Suggested change:

// Package expr provides a simple SQL expression parser and builder.
	"github.com/spacemeshos/go-spacemesh/common/types"
	"github.com/spacemeshos/go-spacemesh/sql"
)

const (
	// Bloom filter size is < 234 KiB while below 100k identities.
Do you mean identities in general or malicious ones? Suggested change:

// Bloom filter size is < 234 KiB while below 100k malicious identities.
ids, err := EquivocationSet(db, nodeID)
if err != nil {
	return fmt.Errorf("get equivocation set for %v: %w", nodeID, err)
}
for _, id := range ids {
	sql.AddToSet(db, "malicious", id[:])
}
Based on your discussion with @dshulyak, shouldn't this be just:

sql.AddToSet(db, "malicious", id[:])

provided SetMalicious is called for every ID in the equivocation set? Note: calling SetMalicious is not yet implemented for ATX V2 AFAIR:

go-spacemesh/activation/malfeasance2.go, lines 13 to 16 in 83b0499:

func (p *MalfeasancePublisher) Publish(ctx context.Context, id types.NodeID, proof wire.Proof) error {
	// TODO(mafa): implement me
	return nil
}
You have a very fast disk, don't you? The results could be very different on a slow one, where the disk would quickly become the limiting factor. Wdyt?
One other thing: if you are focusing on this part that runs the atxs.Has check, it makes more sense to refactor it so that the caller always checks locally whether the ATX exists and doesn't redo the work in the fetcher. More concretely, the ATX handler always needs to load the previous ATX for validation; if it tries to load it and sql.ErrNotFound is raised, you call that handler with the previous ATX ID, otherwise you don't. In my opinion it is better to fix the "logic" rather than adding more workarounds.
Converted to draft for now because more testing may be needed to justify the use of the Bloom filters, and there are more pressing issues right now.
Motivation

atxs.Has and identities.IsMalicious require database access, and a lot of such queries are made when fetching ATXs, returning false in the majority of cases. A Bloom filter can be used to avoid database access in most cases.

Description

This adds Bloom filters that are initialized on startup and updated as new ATXs and malicious identities are added. A ~114 MiB Bloom filter has a 1% false positive rate for 100M ATXs, and a 234 KiB Bloom filter has a 0.01% false positive rate for 10K malicious identities. False positives don't mean that the check will yield an incorrect result; they just incur a database query, which is always done when the Bloom filter gives a positive result.
About 2 min on an old Xeon (E5-2696) machine is needed to load the filters during startup. This appears to be a worthwhile tradeoff. The filter false positive rate and expected size values aren't expected to change often, so I didn't add config values for them just yet.
UPD: will update the PR so that the Bloom filters are loaded in the background and only used when they're ready, falling back to the old "always query the DB" behavior while the filters are not ready yet.
The rqlite/sql SQLite parser/stringifier dependency introduced in the code would of course not be justified if it were only intended for the Bloom filters, as the necessary SQL could be hardcoded instead. But there are several places where SQL is processed or dynamically generated in go-spacemesh, and the intent is to use the new sql/expr package in several other places too, extending it as needed:

- the sql/builder package

In the case of things like Bloom filters, writing out all the queries explicitly may sound like a good "less magic" approach, and that may be subject to discussion, but repeated SQL queries for "mostly the same" thing do cause issues; e.g., here we have a bug in equivocation set handling for malicious identities which resulted from not all related SQL queries being updated correctly: #6331 (to be fixed soon in a separate PR, w/o dynamic SQL).
The intent for the sql/expr package is to hide most of the rqlite/sql functionality we don't need and provide a simple, minimalistic interface for the dynamic SQL needs of the codebase instead. The idea is not to use rqlite/sql directly in other go-spacemesh packages. sql/expr has been extracted from the syncv2 code and thus has slightly more functionality than the Bloom filters use.

Test Plan
Verify on a mainnet node
TODO