Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Continuous Profiling should be disabled by default until tiflash can be safely profiled #160

Open
mornyx opened this issue Aug 25, 2022 · 3 comments
Assignees

Comments

@mornyx
Copy link
Collaborator

mornyx commented Aug 25, 2022

See: pingcap/tiflash#5687

Continuous Profiling has been enabled by default since v6.1.0, so TiDB/TiKV/TiFlash will continue to trigger CPU profiling by default since v6.1.0. The CPU profiler of TiKV/TiFlash is pprof-rs. pprof-rs registers a signal handler during CPU profiling and periodically triggers the SIGPROF signal. After the SIGPROF signal is triggered, the signal handler is dispatched to the business thread for execution to sample the call stack of the current thread. Signal handler also makes system calls through glibc during execution, so it has a chance to affect errno. After the signal handler is executed, the business thread may get the wrong errno. However, errno has been protected in the current version of pprof-rs, so there is still no key evidence that the errno was modified by pprof-rs. The current judgment is based on two facts:

  1. The problem only occurs while profiling is on
  2. pprof-rs has a very similar issue

So, Continuous Profiling should be disabled by default until tiflash can be safely profiled.

@mornyx
Copy link
Collaborator Author

mornyx commented Aug 25, 2022

/assign

@breezewish
Copy link
Member

Is it possible to avoid "Signal handler also makes system calls"?

@mornyx
Copy link
Collaborator Author

mornyx commented Aug 29, 2022

Is it possible to avoid "Signal handler also makes system calls"?

There are currently two choices for stack backtracking: libunwind-based and frame-pointer-based.

In the past we were based on libunwind by default. In this case, it is unavoidable to affect errno. Here is an example:

fn main() {
    unsafe {
        println!("before backtrace: {}", *libc::__errno_location());
    }
    let _ = backtrace::Backtrace::new();
    unsafe {
        println!("after backtrace: {}", *libc::__errno_location());
    }
}

The output is:

before backtrace: 0
after backtrace: 2

Now we are based on frame-pointer by default, which ideally does not need to depend on system calls. Unfortunately, we cannot guarantee that all dynamic link libraries have frame pointers turned on, so a protection mechanism needs to be introduced. This protection mechanism needs to depend on system calls.

However, since the root cause of the problem on profiling on TiFlash is still unclear, we cannot determine whether the signal handler affects errno for the same reason.

And, I may try to introduce eBPF in the near future to improve the performance and stability of profiling, refer to tikv/tikv#13336.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants