fix PcpMmvWriter race condition #133

Open
wants to merge 4 commits into main
Conversation

pwinckles (Contributor)

Fixes a race condition where metric updates can be lost if they happen while `PcpMmvWriter` is being reset. Previously, updates were dropped if the writer was not in a started state. If an update arrived while the writer was executing `start()`, the new value could be lost if the old metric value had already been written.

The issue is addressed here by blocking a metric update while the writer is in the `STARTING` state until the writer has fully started. If the writer is in the `STARTED` state, the update is applied immediately; if it is in the `STOPPED` state, the update is dropped immediately.

Resolves #132
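For reference, here is a minimal sketch of the state-gated update path described above, assuming Guava's Monitor and Guard (which the diff excerpt further down also uses). The class shape, field names, and the doUpdateMetric stub are illustrative assumptions, not the actual PcpMmvWriter code.

import java.time.Duration;

import com.google.common.util.concurrent.Monitor;

class StateGatedWriter {
    enum State { STOPPED, STARTING, STARTED }

    private volatile State state = State.STOPPED;

    private final Monitor stateMonitor = new Monitor();
    private final Monitor.Guard isStarted = new Monitor.Guard(stateMonitor) {
        @Override
        public boolean isSatisfied() {
            return state == State.STARTED;
        }
    };

    void updateMetric(String name, Object value) {
        if (state == State.STARTED) {
            // Fast path: the writer is fully started, apply the update directly.
            doUpdateMetric(name, value);
        } else if (state == State.STARTING) {
            // start() is in flight: wait (bounded) for it to finish so the new value
            // is not clobbered by the old value being copied into the mapped file.
            // The guard is re-checked when start() leaves the monitor (see the
            // start() sketch after the commit note below).
            if (stateMonitor.enterWhenUninterruptibly(isStarted, Duration.ofSeconds(10))) {
                try {
                    doUpdateMetric(name, value);
                } finally {
                    stateMonitor.leave();
                }
            }
        }
        // STOPPED: the update is dropped, matching the previous behaviour.
    }

    private void doUpdateMetric(String name, Object value) {
        // write the value into the memory-mapped file (omitted in this sketch)
    }
}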

If the old code threw an exception while starting, it would remain in a
stopped state. This replicates that behavior: if the start process
fails, the writer reverts to `STOPPED` and does not remain in a
`STARTING` state.
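Continuing the hypothetical sketch above, a start() method along the following lines would give the revert-on-failure behaviour this commit describes; the structure is an assumption, not the actual code in the PR.

    void start() {
        state = State.STARTING;
        try {
            // ... map the file, write headers and current metric values ...
            stateMonitor.enter();
            try {
                // Transition under the monitor so blocked updateMetric() calls
                // re-evaluate the isStarted guard and proceed.
                state = State.STARTED;
            } finally {
                stateMonitor.leave();
            }
        } catch (RuntimeException e) {
            // A failed start reverts to STOPPED instead of lingering in STARTING.
            state = State.STOPPED;
            throw e;
        }
    }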
if (state == State.STARTED) {
    doUpdateMetric(name, value);
} else if (state == State.STARTING) {
    if (stateMonitor.enterWhenUninterruptibly(isStarted, Duration.ofSeconds(10))) {
tallpsmith (Contributor)

I'm just channeling some things we observed during my Aconex days, where any synchronization overhead on metric updates caused grief, particularly during startup with lots of metrics (this is why the QuiescentRegistryListener is there, just to help soften the blow of so many new metrics turning up during startup).

Can we make the block wait time configurable rather than hard-coded? (Leaving the user to control how much they value this data loss against the synchronization overhead.)

I presume with this tricky concurrency there's no easy path to automating some unit tests here? (Man, I hate doing that too, just asking...)

pwinckles (Author)

@tallpsmith Would you like me to add the configuration to MonitoringViewProperties? Do you think 10 seconds is a reasonable default? Note that the duration is the maximum time it will wait. I would be surprised if it ever actually waited that long.

Reliable, cross-system concurrency testing is a can of worms. I can see if I can get anything to work tomorrow.

pwinckles (Author)

@tallpsmith Just added a test that I think should be reliable. I confirmed it fails on the old code.

pwinckles (Author)

@tallpsmith Just pushed another commit that makes the wait time configurable. Let me know if it's what you had in mind, and, if so, if you'd also like parfait.sh updated.

Adds the config property `parfait.writer.wait`, which can be used to
adjust how long a metric update operation will wait for the writer to
start.
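For illustration, one way the new property could be set, assuming it is read as a plain system property and that the value is a duration in milliseconds (both assumptions; the actual format is whatever MonitoringViewProperties accepts):

// Hypothetical usage; the millisecond value format is an assumption.
System.setProperty("parfait.writer.wait", "5000"); // wait at most ~5 seconds for start()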
Successfully merging this pull request may close these issues.

Race condition causes metric updates to be lost