Skip to content

Commit

Permalink
Update and stabilize defrag tests (#1762)
Browse files Browse the repository at this point in the history
A number of tests related to defrag have had stability problems.

One reason for stability issues is that the tests are run sequentially
on the same server, and it takes jemalloc some time to release freed
pages. This is especially noticed immediately after a flushall. Over a
period of 10 seconds, it is observable that the "fragmentation bytes"
can decrease by several MB. Another reason is that there's no
standardization between tests. For each test, people have been
independently hacking/tweaking the success criteria, without addressing
underlying issues.

This update revamps all of the defrag tests:
* A fresh server is started for each test. Running each test in
isolation improves stability.
* A uniform function `log_frag` is now used for debug logging
* A uniform function `perform_defrag_test` ensures that each test is
written and executed in a uniform fashion. Limits are imposed to ensure
that the defrag results are consistent/reproducible. The intent is to
eliminate failures do to various tweaks to values in individual tests.
* Latency is tested much more strictly for most tests, reflecting the
recent improvements to defrag latency.
* The test `defrag edge case` has been removed. This test attempted to
create N pages with EXACTLY equal fragmentation in an attempt to confuse
the defrag logic. It's unlikely that this test was performing correctly,
and had questionable value.
* Tests for hash/list/set/zset/stream have been separated and
standardized. It was unlikely that the old test was performing properly
as none of the actual data structures were fragmented!

It's noted that pubsub doesn't appear to be defragging correctly. The
old test was based on deletion of strings (only) which doesn't actually
reflect what happens when a pubsub channel is removed. The test has been
reduced to only check that pubsub is not damaged during defrag - but
doesn't test for defrag efficacy. This isn't likely a significant issue
as it would be unlikely to create many thousands of pubsub channels and
then have associated fragmentation issues.
#1774

Resolves: #1746

---------

Signed-off-by: Jim Brunner <[email protected]>
Co-authored-by: Viktor Söderqvist <[email protected]>
  • Loading branch information
JimB123 and zuiderkwast authored Feb 28, 2025
1 parent 6156590 commit 3f6581b
Show file tree
Hide file tree
Showing 2 changed files with 514 additions and 642 deletions.
8 changes: 6 additions & 2 deletions tests/support/test.tcl
Original file line number Diff line number Diff line change
Expand Up @@ -124,15 +124,19 @@ proc assert_refcount_morethan {key ref} {

# Wait for the specified condition to be true, with the specified number of
# max retries and delay between retries. Otherwise the 'elsescript' is
# executed.
proc wait_for_condition {maxtries delay e _else_ elsescript} {
# executed. If 'debugscript' is provided, it is executed after failure of
# the confition (before the retry delay).
proc wait_for_condition {maxtries delay e _else_ elsescript {_debug_ ""} {debugscript ""}} {
while {[incr maxtries -1] >= 0} {
set errcode [catch {uplevel 1 [list expr $e]} result]
if {$errcode == 0} {
if {$result} break
} else {
return -code $errcode $result
}
if {$_debug_ == "debug"} {
uplevel 1 $debugscript
}
after $delay
}
if {$maxtries == -1} {
Expand Down
Loading

0 comments on commit 3f6581b

Please sign in to comment.