Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix LRU crash when getting too many random lua scripts #1310

Open
wants to merge 1 commit into
base: unstable
Choose a base branch
from

Conversation

sungming2
Copy link

@sungming2 sungming2 commented Nov 15, 2024

Problem

Valkey stores scripts in a dictionary (lua_scripts) keyed by their SHA1 hashes, but it needs a way to know which scripts are least recently used. It uses an LRU list (lua_scripts_lru_list) to keep track of scripts in usage order. When the list reaches a maximum length, Valkey evicts the oldest scripts to free memory in both the list and dictionary. The problem here is that the sds from the LRU list can be pointing to already freed/moved memory by active defrag that the sds in the dictionary used to point to. It results in assertion error at this line

Solution

If we duplicate the sds when adding it to the LRU list, we can create an independent copy of the script identifier (sha). This duplication ensures that the sha string in the LRU list remains stable and unaffected by any defragmentation that could alter or free the original sds. In addition, dictUnlink doesn't require exact pointer match(ref) so this change makes sense to unlink the right dictEntry with the copy of the sds.

Reproduce

To reproduce it with tcl test:

  1. Disable je_get_defrag_hint in defrag.c to trigger defrag often
  2. Execute test script
start_server {tags {"auth external:skip"}} {

    test {Regression for script LRU crash} {
        r config set activedefrag yes
        r config set active-defrag-ignore-bytes 1
        r config set active-defrag-threshold-lower 0
        r config set active-defrag-threshold-upper 1
        r config set active-defrag-cycle-min 99
        r config set active-defrag-cycle-max 99

        for {set i 0} {$i < 100000} {incr i} {
            r eval "return $i" 0
        }
        after 5000;
    }
}

Crash info

Crash report:

=== REDIS BUG REPORT START: Cut & paste starting from here ===
14044:M 12 Nov 2024 14:51:27.054 # === ASSERTION FAILED ===
14044:M 12 Nov 2024 14:51:27.054 # ==> eval.c:556 'de' is not true

------ STACK TRACE ------

Backtrace:
/usr/bin/redis-server 127.0.0.1:6379 [cluster](luaDeleteFunction+0x148)[0x723708]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](luaCreateFunction+0x26c)[0x724450]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](evalCommand+0x2bc)[0x7254dc]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](call+0x574)[0x5b8d14]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](processCommand+0xc84)[0x5b9b10]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](processCommandAndResetClient+0x11c)[0x6db63c]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](processInputBuffer+0x1b0)[0x6dffd4]
/usr/bin/redis-server 127.0.0.1:6379 [cluster][0x6bd968]
/usr/bin/redis-server 127.0.0.1:6379 [cluster][0x659634]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](amzTLSEventHandler+0x194)[0x6588d8]
/usr/bin/redis-server 127.0.0.1:6379 [cluster][0x750c88]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](aeProcessEvents+0x228)[0x757fa8]
/usr/bin/redis-server 127.0.0.1:6379 [cluster](redisMain+0x478)[0x7786b8]
/lib64/libc.so.6(__libc_start_main+0xe4)[0xffffa7763da4]
/usr/bin/redis-server 127.0.0.1:6379 [cluster][0x5ad3b0]

Defrag info:

mem_fragmentation_ratio:1.18
mem_fragmentation_bytes:47229992
active_defrag_hits:20561
active_defrag_misses:5878518
active_defrag_key_hits:77
active_defrag_key_misses:212
total_active_defrag_time:29009

Test:

Run the test script to push 100,000 scripts to ensure the LRU list keeps 500 maximum length without any crash.

27489:M 14 Nov 2024 20:56:41.583 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.583 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
27489:M 14 Nov 2024 20:56:41.584 * LRU List length: 500
[ok]: Regression for script LRU crash (6811 ms)
[1/1 done]: unit/test (7 seconds)

Copy link
Member

@enjoy-binbin enjoy-binbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks. This make sense to me, btw, can we do something in defrag.c to resume this pointer?

please fix the DCO btw

Copy link

codecov bot commented Nov 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.69%. Comparing base (86f33ea) to head (81e4cd1).
Report is 1 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1310      +/-   ##
============================================
+ Coverage     70.55%   70.69%   +0.14%     
============================================
  Files           115      115              
  Lines         63158    63158              
============================================
+ Hits          44561    44652      +91     
+ Misses        18597    18506      -91     
Files with missing lines Coverage Δ
src/eval.c 57.02% <100.00%> (ø)

... and 13 files with indirect coverage changes

@sungming2 sungming2 force-pushed the fix-lua-lru-crash branch 2 times, most recently from 02c4673 to 6222d74 Compare November 15, 2024 09:26
@sungming2
Copy link
Author

Good catch, thanks. This make sense to me, btw, can we do something in defrag.c to resume this pointer?

please fix the DCO btw

Thanks for the comment. Are you saying about tracking references during defragmentation? if not, can you elaborate on your questions please?

@@ -203,6 +203,7 @@ void scriptingInit(int setup) {
* sha with the dictionary, so free fn is not set. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment need to update

@enjoy-binbin
Copy link
Member

Are you saying about tracking references during defragmentation?

yean, something like that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants