Update and stabilize defrag tests (#1762)

A number of tests related to defrag have had stability problems. One reason for stability issues is that the tests are run sequentially on the same server, and it takes jemalloc some time to release freed pages. This is especially noticed immediately after a flushall. Over a period of 10 seconds, it is observable that the "fragmentation bytes" can decrease by several MB. Another reason is that there's no standardization between tests. For each test, people have been independently hacking/tweaking the success criteria, without addressing underlying issues. This update revamps all of the defrag tests: * A fresh server is started for each test. Running each test in isolation improves stability. * A uniform function `log_frag` is now used for debug logging * A uniform function `perform_defrag_test` ensures that each test is written and executed in a uniform fashion. Limits are imposed to ensure that the defrag results are consistent/reproducible. The intent is to eliminate failures do to various tweaks to values in individual tests. * Latency is tested much more strictly for most tests, reflecting the recent improvements to defrag latency. * The test `defrag edge case` has been removed. This test attempted to create N pages with EXACTLY equal fragmentation in an attempt to confuse the defrag logic. It's unlikely that this test was performing correctly, and had questionable value. * Tests for hash/list/set/zset/stream have been separated and standardized. It was unlikely that the old test was performing properly as none of the actual data structures were fragmented! It's noted that pubsub doesn't appear to be defragging correctly. The old test was based on deletion of strings (only) which doesn't actually reflect what happens when a pubsub channel is removed. The test has been reduced to only check that pubsub is not damaged during defrag - but doesn't test for defrag efficacy. This isn't likely a significant issue as it would be unlikely to create many thousands of pubsub channels and then have associated fragmentation issues. #1774 Resolves: #1746 --------- Signed-off-by: Jim Brunner <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]>
valkey-io · Feb 28, 2025 · 3f6581b · 3f6581b
1 parent 6156590
commit 3f6581b
Show file tree

Hide file tree

Showing 2 changed files with 514 additions and 642 deletions.
diff --git a/tests/support/test.tcl b/tests/support/test.tcl
@@ -124,15 +124,19 @@ proc assert_refcount_morethan {key ref} {
 
 # Wait for the specified condition to be true, with the specified number of
 # max retries and delay between retries. Otherwise the 'elsescript' is
-# executed.
-proc wait_for_condition {maxtries delay e _else_ elsescript} {
+# executed. If 'debugscript' is provided, it is executed after failure of
+# the confition (before the retry delay).
+proc wait_for_condition {maxtries delay e _else_ elsescript {_debug_ ""} {debugscript ""}} {
     while {[incr maxtries -1] >= 0} {
         set errcode [catch {uplevel 1 [list expr $e]} result]
         if {$errcode == 0} {
             if {$result} break
         } else {
             return -code $errcode $result
         }
+        if {$_debug_ == "debug"} {
+            uplevel 1 $debugscript
+        }
         after $delay
     }
     if {$maxtries == -1} {