Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dummy-ups: relax error handling to prevent premature driver termination #2132

Merged
merged 3 commits into from
Oct 26, 2023
Merged

dummy-ups: relax error handling to prevent premature driver termination #2132

merged 3 commits into from
Oct 26, 2023

Conversation

desertwitch
Copy link
Contributor

This presents a possible follow up for the useful revisions of #2118.

It is currently not possible to utilize the dummy-ups driver's repeater function within a local multi-UPS ups.conf file, particularly having upsdrvctl collectively start all drivers via e.g. /usr/sbin/upsdrvctl -u root start or within a systemd environment.

The reason for this is that the dummy-ups driver at present terminates prematurely upon encountering any kind of error at repeater mode startup, in particular a connectivity one due to the driver it should repeat not being 100% up yet or upsd not being up yet - as encountered during collective upsdrvctl startup. This proceeds to raise an erroneous return code from upsdrvctl (Driver failed to start (exit status=1)) possibly breaking the user's entire NUT startup depending on their service scripts.

[ups]
driver = usbhid-ups
port = auto

[qnapups]
driver = dummy-ups
port = [email protected]

Such a configuration would - at present - require to first start ups and afterwards qnapups via individual command.
It is a configuration commonly used by users with QNAP, Synology etc. devices which require certain UPS names to function.

The current error handling in dummy-ups complicates such a setup and could also pose a theoretical problem in a networked NUT environment when e.g. a NUT installation is restarted after a reboot (due to a power loss situation). When the server comes back up and starts NUT, but individual network devices (switches, ...) have not recovered yet, a dummy-ups driver will terminate encountering the temporary (but later recovered from) connectivity problem.

I therefore propose the following changes, as I believe that we could allow dummy-ups to attempt to recover from any such not-necessarily fatal errors (with the fatal ones already specifically defined and causing driver termination - such as too old upsd, bad configuration etc.). While the user would still be notified of the errors occurring, the driver would not terminate in such a situation.

In general I assume most users utilizing dummy-ups to have to a degree RTFM and also bring enough technical proficiency to read and understand the error messages in the log if they see their dummy-ups setup not working as expected.

Here's such an example configuration and respective log from my testing system with a SNMP driver and a dummy-ups repeater configured locally and invoked via /usr/sbin/upsdrvctl -u root start:

[ups]
driver = snmp-ups
port = 192.168.0.100
pollfreq = 15
community = public
snmp_version = v2c

[qnapups]
driver = dummy-ups
port = [email protected]
Oct 25 06:24:36 Tower snmp-ups[17731]: Startup successful
Oct 25 06:24:36 Tower root: Error: Connection failure: Connection refused
Oct 25 06:24:36 Tower dummy-ups[17732]: Error: Connection failure: Connection refused
Oct 25 06:24:36 Tower root: Error: Driver not connected
Oct 25 06:24:36 Tower dummy-ups[17732]: Error: Driver not connected

== THE DUMMY-UPS DRIVER WOULD TERMINATE HERE IN CURRENT MASTER ==
== BELOW LOG MESSAGES DUE TO CHANGES INTRODUCED IN PULL REQUEST ==

Oct 25 06:24:37 Tower dummy-ups[17739]: Startup successful
Oct 25 06:24:37 Tower root: Network UPS Tools - UPS driver controller 2.8.0.1
Oct 25 06:24:38 Tower upsd[17745]: listening on 0.0.0.0 port 3493
Oct 25 06:24:38 Tower upsd[17745]: Connected to UPS [qnapups]: dummy-ups-qnapups
Oct 25 06:24:38 Tower dummy-ups[17739]: sock_connect: enabling asynchronous mode (auto)
Oct 25 06:24:38 Tower snmp-ups[17731]: sock_connect: enabling asynchronous mode (auto)
Oct 25 06:24:38 Tower upsd[17745]: Connected to UPS [ups]: snmp-ups-ups
Oct 25 06:24:38 Tower upsd[17745]: Found 2 UPS defined in ups.conf
Oct 25 06:24:38 Tower upsd[17746]: Startup successful
Oct 25 06:24:38 Tower upsd[17746]: Data for UPS [qnapups] is stale - check driver
Oct 25 06:24:38 Tower upsmon[17749]: Startup successful
Oct 25 06:24:38 Tower upsd[17746]: User [email protected] logged into UPS [ups]
Oct 25 06:24:42 Tower upsd[17746]: UPS [qnapups] data is no longer stale

== THE DRIVER RECOVERS TO NORMAL OPERATION FROM THE INITIAL CONNECTIVITY PROBLEM ==

@jimklimov
Copy link
Member

Seems useful, thanks. I just wonder if for the sake of a principle of least surprise, this behavior should be an option (e.g. some ups.conf section flag like linger) so it would just warn or by default abort as it did before. I suppose there can be people relying on existing behavior of whatever (cue xkcd on keyboards) :)

The early driver-startup aborts became less of a problem on systems with systemd or SMF, where the dead driver would be retried (and each has its dedicated service instance, so less of a use-case for one upsdrvctl start to rule them all).

Another common hiccup with dummy-ups is lack of upsd when it starts. Especially in service frameworks with diligently defined dependencies, having all driver units running may be marked as prerequisite for upsd to begin starting; however the relay dummy-ups may require the upsd and the "actual" driver like snmp-ups in your example, to activate. So far this has often caused clumsy workarounds (e.g. systemd drop-in configuration snippets to cut the dependency loop for certain [email protected] instances) and maybe reduced the attractiveness of such relays. An option to just let it start and wait for both the data server and an "actual" device driver behind it to appear would be surely useful.

On a note that seems related - before NUT v2.8.0 release, upsd itself also had a sub-par approach where it supported reloading (to learn or forget changes in ups.conf), but refused to start if no devices were configured at the moment. Some embedded systems wanted NUT data server always running, even if to authoritatively respond that it does not know of any hardware to monitor right now, so the #766 happened :)

@desertwitch
Copy link
Contributor Author

A linger flag sounds like a sensible solution, but we're talking driver-specific, right? I think that'd make the most sense. So basically when linger is enabled the driver emits the errors at startup but doesn't make them fatal, unless they are really fatal errors. Default behavior we keep at driver termination on any error at repeater startup as to not break existing workarounds. If I got that all right, I'll get right on it.

@desertwitch desertwitch marked this pull request as draft October 25, 2023 15:08
@jimklimov
Copy link
Member

Yep, like that and just for this driver (well, maybe later would also jump into clone*drivers that are similar).

Also as this is a notable behavior change, it warrants a driver version bump, as well as notes in NEWS.adoc and in the man page (for the new option).

@jimklimov jimklimov added enhancement service/daemon start/stop General subject for starting and stopping NUT daemons (drivers, server, monitor); also BG/FG/Debug labels Oct 25, 2023
@jimklimov jimklimov added this to the 2.8.1 milestone Oct 25, 2023
@desertwitch desertwitch marked this pull request as ready for review October 26, 2023 09:33
@desertwitch
Copy link
Contributor Author

I hope this looks good to you and implements the changes as discussed. 👍

@jimklimov jimklimov merged commit 0a61dea into networkupstools:master Oct 26, 2023
@jimklimov
Copy link
Member

Thanks! Merged :)

@desertwitch desertwitch deleted the dummyups-cerror branch October 26, 2023 14:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement service/daemon start/stop General subject for starting and stopping NUT daemons (drivers, server, monitor); also BG/FG/Debug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants