Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OXY-1514: only track sampled spans and through a setting track all spans. #107

Merged
merged 3 commits into from
Feb 17, 2025

Conversation

fisherdarling
Copy link
Collaborator

We're seeing high lock contention when removing spans from the LiveReferenceSet. Currently,
we sample all spans that are created, even if those spans are unsampled. The purpose of this
is to capture graceful restart timeout issues which sometimes occur in our services. By fetching
the list of all spans in memory when a graceful restart hangs, we can figure out what is preventing
the restart from occuring within its timeout.

In this commit we only track sampled spans. Most services which use foundations sample at a 1% rate,
meaning we effectively remove 99% of locks. This allows light debugability in all running services, at
any time you can get a list of traced spans. Although those spans should end up in jaeger/oltp at some
point1. To make sure we can still determine the root cause of graceful restart issues, we add a setting
which forces all spans to be sampled (what happened previously).

Footnotes

  1. to be fair, sometimes some traces will remain in memory forever and never be dropped, so even
    at 1% this can be useful.

Verified

This commit was signed with the committer’s verified signature.
eamanu Emmanuel Arias
We're seeing high lock contention when removing spans from the `LiveReferenceSet`. Currently,
we sample all spans that are created, even if those spans are unsampled. The purpose of this
is to capture graceful restart timeout issues which sometimes occur in our services. By fetching
the list of all spans in memory when a graceful restart hangs, we can figure out what is preventing
the restart from occuring within its timeout.

In this commit we only track _sampled_ spans. Most services which use foundations sample at a 1% rate,
meaning we effectively remove 99% of locks. This allows light debugability in all running services, at
any time you can get a list of traced spans. Although those spans should end up in jaeger/oltp at some
point[^1]. To make sure we can still determine the root cause of graceful restart issues, we add a setting
which forces all spans to be sampled (what happened previously).

[^1]: to be fair, sometimes some traces will remain in memory forever and never be dropped, so even
at 1% this can be useful.
@fisherdarling fisherdarling force-pushed the fisher/OXY-1514 branch 6 times, most recently from b39cbbd to e293c45 Compare February 17, 2025 12:54

Verified

This commit was signed with the committer’s verified signature.
eamanu Emmanuel Arias
Some spans might not have a context, so we should avoid a panic and simply skip them.
@fisherdarling fisherdarling merged commit 8872a66 into main Feb 17, 2025
17 checks passed
@bobrik
Copy link
Collaborator

bobrik commented Feb 18, 2025

Can we get a release with this as well?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants