-
-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix PcpMmvWriter race condition #133
base: main
Are you sure you want to change the base?
Conversation
Fixes a race condition where metric updates can be lost if the happen while `PcpMmvWriter` is being reset. Previously, updates were dropped if the writer was not in a started state. If this happened while the the writer was executing `start()`, then it's possible the updated value would be lost if the old metric value had already been written. The issue is addressed here by blocking the metric update when the writer is in a `STARTING` state, until the writer has been fully started. If the writer is in a `STARTED` state, the update is applied immediately. If the writer is in a `STOPPED` state, the update is dropped immediately. Resolves performancecopilot#132
If the old code threw an exception starting, it would remaing in a stopped state. This replicates that behavior. If the start process fails, it reverts back to `STOPPED` and does not remain in a `STARTING` state.
if (state == State.STARTED) { | ||
doUpdateMetric(name, value); | ||
} else if (state == State.STARTING) { | ||
if (stateMonitor.enterWhenUninterruptibly(isStarted, Duration.ofSeconds(10))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just channeling some things we observed during my Aconex days, where any of the synchronization overhead for metric updates caused grief. Particularly during startup with lots of metrics (this is why the QuiescentRegistryListener is there, just to help soften the blow of so many new metrics turning up during startup.)
Can we make the block wait time configurable rather than hard-coded? (leave the user to control how much they value this data loss over synchronization overhead?)
I presume with this tricky concurrency, no easy path to automate some unit tests here? (Man I hate doing that too, just asking...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tallpsmith Would you like me to add the configuration to MonitoringViewProperties
? Do you think 10 seconds is a reasonable default? Note that the duration is the maximum time it will wait. I would be surprised if it ever actually waited that long.
Reliable, cross-system concurrency testing is a can of worms. I can see if I can get anything to work tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tallpsmith Just added a test that I think should be reliable. I confirmed it fails on the old code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tallpsmith Just pushed another commit that makes the wait time configurable. Let me know if it's what you had in mind, and, if so, if you'd also like parfait.sh
updated.
Adds the config property `parfait.writer.wait` that can be used to adjust the amount of time an update metric operation will wait for the writer to start.
Fixes a race condition where metric updates can be lost if the happen while
PcpMmvWriter
is being reset. Previously, updates were dropped if the writer was not in a started state. If this happened while the the writer was executingstart()
, then it's possible the updated value would be lost if the old metric value had already been written.The issue is addressed here by blocking the metric update when the writer is in a
STARTING
state, until the writer has been fully started. If the writer is in aSTARTED
state, the update is applied immediately. If the writer is in aSTOPPED
state, the update is dropped immediately.Resolves #132