
Catched SIGSEGV from scm_send_msg (/root/src/event-driven-servers/mavis/spawnd_accepted.c:422), pid: 378104: Connection refused #142

Open
zel-ah opened this issue Feb 5, 2025 · 16 comments


@zel-ah

zel-ah commented Feb 5, 2025

Hello,
sometimes I get errors like :
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: scm_send_msg: sendmsg: Connection refused
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: scm_send_msg (/root/src/event-driven-servers/mavis/spawnd_accepted.c:422), pid: 378104: Connection refused
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: Catched SIGSEGV
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: Apologies ... this shouldn't have happened. Please verify that
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: you are running the most current version from
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: https://github.com/MarcJHuber/event-driven-servers/
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: The version you're currently running is 144c6de
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: If this issue persists even with the most recent version:
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: Reconfigure with --debug, recompile and reinstall. Then send
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: a bug report via GitHub issues at
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: https://github.com/MarcJHuber/event-driven-servers/issues
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: and include the backtraces.
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: Please do NOT mail bug reports to the private mail address of
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: the author, unless you have a prior permit for that.
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: Thank you.
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: EXECINFO: backtrace start
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: EXECINFO: 10 [0x62198f040335]
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: EXECINFO: 9 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b) [0x72234de2a28b]
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: EXECINFO: 8 /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x72234de2a1ca]
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: EXECINFO: 7 [0x62198f057e64]
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: EXECINFO: 6 /usr/local/lib/libmavis.so(scm_main+0x103) [0x72234e883bd3]
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: EXECINFO: 5 /usr/local/lib/libmavis.so(spawnd_main+0x976) [0x72234e883806]
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: EXECINFO: 4 /usr/local/lib/libmavis.so(io_main+0x2c) [0x72234e875fdc]
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: EXECINFO: 3 /usr/local/lib/libmavis.so(io_poll+0xf1) [0x72234e8752e1]
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: EXECINFO: 2 /usr/local/lib/libmavis.so(spawnd_accepted+0x967) [0x72234e880c97]
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: EXECINFO: 1 /lib/x86_64-linux-gnu/libc.so.6(+0x45320) [0x72234de45320]
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: EXECINFO: 0 /usr/local/lib/libmavis.so(catchsegv+0x105) [0x72234e884825]
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: EXECINFO: backtrace end
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: GDB: running: "(printf 'bt#012q#012';sleep 3)|gdb -n -q -p $CRASHPID 2>/dev/null"
Feb 5 06:15:53 ams-aaa1 tac_plus-ng[378100]: GDB: backtrace start
Feb 5 06:15:56 ams-aaa1 tac_plus-ng[378100]: GDB: backtrace end
Feb 5 06:15:56 ams-aaa1 systemd[1]: tac_plus-ng.service: Main process exited, code=exited, status=69/UNAVAILABLE
Feb 5 06:15:56 ams-aaa1 systemd[1]: tac_plus-ng.service: Failed with result 'exit-code'.
Feb 5 06:15:56 ams-aaa1 systemd[1]: tac_plus-ng.service: Unit process 345581 (tac_plus-ng) remains running after unit stopped.
Feb 5 06:15:56 ams-aaa1 systemd[1]: tac_plus-ng.service: Unit process 378102 (tac_plus-ng) remains running after unit stopped.
Feb 5 06:15:56 ams-aaa1 systemd[1]: tac_plus-ng.service: Consumed 6.869s CPU time, 236.2M memory peak, 0B memory swap peak.
Feb 5 06:15:56 ams-aaa1 systemd[1]: tac_plus-ng.service: Scheduled restart job, restart counter is at 7.

@MarcJHuber
Owner

Hi,

thanks for reporting. This apparently happens somewhere in the error path after spawnd_accepted.c:422. I really can't read much more from the syslog output; compiling with --debug or --ggdb and capturing the SEGV with a debugger would really be helpful.

For starters, could you please share some details about your setup? E.g. is tracking configured, or might this be the result of a high-usage situation?

Thanks,

Marc

@MarcJHuber
Owner

Hi,

I've just committed 443023f; no idea whether it fixes the issue, this is just based on guessing.

Thanks,

Marc

@zel-ah
Author

zel-ah commented Feb 10, 2025

Hi, thanks for the update, I'll keep an eye on it. A possible cause could be heavy load: I have more than 1500 devices on my network. There may also be a memory leak, since after two days of operation the consumption increased from 4 MB to 300+ MB (Memory: 332.5M (peak: 340.2M)).

@MarcJHuber
Owner

Hi,

that memory leak is clearly unexpected. Could you please share some details about your configuration (mavis modules used, e.g.)?

Thanks,

Marc

@zel-ah
Author

zel-ah commented Feb 11, 2025

Hi,
I use only tac_plus-ng without external modules, with this config:

```
#!/usr/local/sbin/tac_plus-ng
id = spawnd {
    listen { port = 49 }
    spawn = {
        instances min = 1
        instances max = 30
    }
    background = yes
}

id = tac_plus-ng {
    log authzlog { destination = /var/log/tac_plus/%Y/authorization.%Y-%m.log }
    log authclog { destination = /var/log/tac_plus/%Y/authentication.%Y-%m.log }
    log acctlog { destination = /var/log/tac_plus/%Y/accounting.%Y-%m.log }
    accounting log = acctlog
    authentication log = authclog
    authorization log = authzlog
    host IPv4only {
        address = 0.0.0.0/0
        key = "XXXXX"
    }

    profile noc_admin {
        script {
            if (service == shell) {
                if (cmd == "") set priv-lvl = 15
                permit
            }
        }
    }
    profile ro_user {
        script {
            if (service == shell) {
                if (cmd == "") {
                    set priv-lvl = 15
                    permit
                }
                if (cmd =~ /^show/) {
                    set priv-lvl = 15
                    permit
                }
                if (cmd =~ /^export/) {
                    set priv-lvl = 15
                    permit
                }
                if (cmd =~ /^exit/) {
                    set priv-lvl = 15
                    permit
                }
                if (cmd =~ /^terminal/) {
                    set priv-lvl = 15
                    permit
                }
            }
        }
    }
}
```

@MarcJHuber
Owner

Hi,

thanks, that should make it easier to find what's causing this. Might take some time, nevertheless.

Thanks,

Marc

@MarcJHuber
Owner

Hi,

do logs actually get written to /var/log/tac_plus/...? That location being non-writable would be a fairly good excuse for memory filling up.

Just guessing.

Thanks,

Marc

@zel-ah
Author

zel-ah commented Feb 12, 2025

Hello,
This location is writable and logs are written. The strange thing is that the logs often contain messages like: "Possibly no single-connection support. Context feature may or may not work." but on the equipment side, single-connection is enabled in the settings.

@MarcJHuber
Owner

Hi,

thanks, so I'll look further into finding that memory leak :-(

The "single-connection" warning should go away after defining "single-connection = yes" at host or realm level.
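A minimal sketch of where that could go, assuming the host block from the config shared above (placement at realm level is also possible):

```
host IPv4only {
    address = 0.0.0.0/0
    key = "XXXXX"
    single-connection = yes
}
```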

Thanks,

Marc

@zel-ah
Author

zel-ah commented Feb 21, 2025

Hi,
"single-connection = yes" didn't help.

But I noticed that after a restart the memory is not freed:

```
root@aaa1:~/src/event-driven-servers# service tac_plus-ng status
● tac_plus-ng.service - TACACS+ NG Service
     Loaded: loaded (/etc/systemd/system/tac_plus-ng.service; enabled; preset: enabled)
     Active: active (running) since Fri 2025-02-21 06:45:14 UTC; 29s ago
   Main PID: 3176288 (tac_plus-ng)
      Tasks: 3 (limit: 9443)
     Memory: 421.5M (peak: 422.2M)
        CPU: 170ms
     CGroup: /system.slice/tac_plus-ng.service
             ├─1474064 "tac_plus-ng: 4 connections left, dying when idle"
             ├─3176288 "tac_plus-ng: 3 connections, ac" /etc/tacacs-ng/tac_plus-ng.conf -d 16
             └─3176289 "tac_plus-ng: 3 connections"

root@aaa1:~/src/event-driven-servers# service tac_plus-ng status
● tac_plus-ng.service - TACACS+ NG Service
     Loaded: loaded (/etc/systemd/system/tac_plus-ng.service; enabled; preset: enabled)
     Active: active (running) since Fri 2025-02-21 06:45:14 UTC; 1min 31s ago
   Main PID: 3176288 (tac_plus-ng)
      Tasks: 2 (limit: 9443)
     Memory: 419.0M (peak: 422.2M)
        CPU: 470ms
     CGroup: /system.slice/tac_plus-ng.service
             ├─3176288 "tac_plus-ng: 8 connections, ac" /etc/tacacs-ng/tac_plus-ng.conf -d 16
             └─3176289 "tac_plus-ng: 8 connections"
```

@MarcJHuber
Owner

Hi,

please forcefully "kill -9 3176288 3176289" (or whatever the process IDs are now); 1474064 and 3176288/3176289 don't look related to me. After that, please restart tac_plus-ng (and make certain to be at the current GIT level). Does the issue still occur after that? If yes, how long does it take until memory usage becomes non-sane? Do you see any relation to the number of requests served?

Thanks,

Marc

@zel-ah
Author

zel-ah commented Feb 21, 2025

Hi,
I'm using the latest version.
If I restart while there are active sessions, the memory is not freed. But if I stop the service, then kill -9 the process holding the remaining sessions ("tac_plus-ng: connections left, dying when idle") and start it again, it starts with the memory freed.

@zel-ah
Author

zel-ah commented Feb 21, 2025

Hi,
I also noticed strange behavior of the switches:
C9200L-24P-4G 17.12.04 CAT9K_LITE_IOSXE
They always maintain 2-3 active connections to the server; in fact, when restarting the process, it waits for their connections to complete. But if I configure single-connection on them, they stop working with TACACS.

@MarcJHuber
Owner

Hi,

I simply can't test that, due to lack of suitable hardware. This may or may not be an issue with IOS-XE, or with tac_plus-ng. In general, if restarting tac_plus-ng causes an issue with devices, it looks more like a problem with the device implementation.

Cheers,

Marc

@zel-ah
Author

zel-ah commented Feb 21, 2025

Hi,
Regarding the IOS-XE device, perhaps you didn't quite understand me: when single-connection is configured on the switch, the switch stops working with the TACACS server, while the server itself works fine.
Thanks for the updates, I've stopped seeing SIGSEGV lately.
Now the main problem is the memory leak, and it also appears on other tacacs server instances that don't have IOS-XE devices in service, though the leak doesn't happen as quickly there.

@MarcJHuber
Owner

Hi,

I see no issues with single connection on a C8000v platform running 17.12.04. I'd guess that that's the same TACACS+ client implementation. You could perhaps isolate the issue by enabling TACACS debugging on your switch.

I'm still looking into the memory leak issue. For now, mitigating it with "retire limit = <max. number of connections to be served>" and "connection timeout = " should work.
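A hedged sketch of that mitigation, assuming it goes into the tac_plus-ng section of the config shared earlier in this thread; the values here are placeholders to tune, not recommendations:

```
id = tac_plus-ng {
    retire limit = 10000        # worker exits (and frees its memory) after serving this many connections
    connection timeout = 600    # placeholder value in seconds; drops idle connections so workers can retire
    ...
}
```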

Cheers,

Marc
