perf data idle time #393

Open
prim opened this issue May 16, 2022 · 9 comments
@prim

prim commented May 16, 2022

speedscope provides a great time order view.

But for perf data, when the program is idle, speedscope treats that time as if it were consumed by the next sampled call stack. This is wrong; the program should be shown as idle during that time.


$ cat test.c

#include <stdio.h>
#include <unistd.h>
#include <time.h>
#include <pthread.h>
#include <sys/time.h>
#include <sys/select.h>   /* for select() and fd_set */
#include <sys/types.h>

void fselect() {
    fd_set rfds;
    struct timeval tv;
    int retval;

    FD_ZERO(&rfds);
    FD_SET(0, &rfds);
    tv.tv_sec = 2;
    tv.tv_usec = 0;
    retval = select(1, &rfds, NULL, NULL, &tv);

    if (retval == -1)
        perror("select()");
    else if (retval)
        printf("Data is available now.\n");
        /* FD_ISSET(0, &rfds) will be true. */
    else
        printf("No data within two seconds.\n");
}

void timed(int microseconds) {
    int sec, usec, diff;
    struct timeval tv;
    gettimeofday(&tv, NULL);
    // printf("begin seconds : %ld micro seconds : %ld\n", tv.tv_sec, tv.tv_usec);
    sec = tv.tv_sec;
    usec = tv.tv_usec;

    while (1) {
        gettimeofday(&tv, NULL);
        diff = (tv.tv_sec - sec) * 1000000 + tv.tv_usec - usec;
        if (diff > microseconds) {
            // printf("end seconds : %ld micro seconds : %ld\n", tv.tv_sec, tv.tv_usec);
            break;
        }
    }
}

void a() { timed(500000); }
void b() { timed(500000); }
void c() { timed(500000); }
void d() { timed(500000); }
void e() { timed(500000); }

void func(void);   /* forward declaration: fthread calls func before its definition */

void *fthread(void *args) {
    func();
    return NULL;
}

void func() {
    while (1) {
        a();
        b();
        fselect();
        c();
        d();
        e();
        printf("sleep begin\n");
        sleep(1.5);  /* sleep() takes an unsigned int, so this actually sleeps 1 second */
        printf("sleep end\n");
    }
}

int main() {
    pthread_t thread;
    pthread_create(&thread, NULL, fthread, NULL);

    func();
    return 0;
}

$ gcc test.c -O0 -g -lpthread && ./a.out

$ perf record -F 99 -p 320355 -g -e cpu-clock 
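
To load this into speedscope, the recorded perf.data presumably still needs to be dumped to text with perf script (the text output is what speedscope's perf importer reads), e.g.:

$ perf script -i perf.data > profile.txt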

The result now looks like this:

[screenshot: current speedscope rendering]

A more correct result should look like this:

[screenshot: expected rendering with idle time]

@jlfwong
Owner

jlfwong commented May 16, 2022

Hi @prim!

Thanks for providing clear repro instructions along with what you expect the output to be. I'll take a look next time I sit down to work on speedscope.

@DanielThomas

DanielThomas commented Jul 28, 2022

I see the same thing; here's a thread with just 189 samples over two minutes:

grpc-default-wo  4615/4802  [001] 52473.278291:   20408163 cpu-clock:
grpc-default-wo  4615/4802  [002] 52475.706807:   20408163 cpu-clock:
grpc-default-wo  4615/4802  [001] 52476.645550:   20408163 cpu-clock:
grpc-default-wo  4615/4802  [001] 52477.972046:   20408163 cpu-clock:
grpc-default-wo  4615/4802  [000] 52478.278147:   20408163 cpu-clock:
grpc-default-wo  4615/4802  [000] 52479.278122:   20408163 cpu-clock:
...
grpc-default-wo  4615/4802  [002] 52588.703835:   20408163 cpu-clock:
grpc-default-wo  4615/4802  [000] 52589.193597:   20408163 cpu-clock:
grpc-default-wo  4615/4802  [002] 52589.846662:   20408163 cpu-clock:
grpc-default-wo  4615/4802  [002] 52590.111962:   20408163 cpu-clock:
grpc-default-wo  4615/4802  [000] 52590.622132:   20408163 cpu-clock:
grpc-default-wo  4615/4802  [002] 52590.683375:   20408163 cpu-clock:
grpc-default-wo  4615/4802  [002] 52591.581311:   20408163 cpu-clock:
grpc-default-wo  4615/4802  [000] 52592.213928:   20408163 cpu-clock:
grpc-default-wo  4615/4802  [000] 52592.866970:   20408163 cpu-clock:

[screenshot: Screen Shot 2022-07-28 at 9.55.35 am]

But it looks like it ran the entire time:

[screenshot: Screen Shot 2022-07-28 at 9.56.54 am]

Perf records either at a frequency or at an interval of event counts, so you don't necessarily have a fixed rate of samples.

But for clock/cpu-clock you can probably count the frequency of events at the lowest time resolution you're presenting and use that to estimate wall time (IIRC this was sampled at 99Hz). It'd still only be an estimate of the maximum time that method could have spent, but at least it would be proportional to the count of samples.
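
A minimal sketch of that estimate in Python, assuming the per-thread sample timestamps (in seconds) have already been parsed out of the perf script output; the 99 Hz default and the helper name are illustrative only:

from collections import Counter

def estimate_cpu_seconds(timestamps, freq_hz=99):
    # Each cpu-clock sample is taken to represent roughly 1/freq_hz seconds of
    # on-CPU time, so a fully busy second should contain about freq_hz samples.
    period = 1.0 / freq_hz
    per_second = Counter(int(ts) for ts in timestamps)
    # Cap at 1.0: a single thread can't be more than 100% busy within one second.
    return {sec: min(1.0, count * period) for sec, count in sorted(per_second.items())}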

@jlfwong
Owner

jlfwong commented Jul 31, 2022

@prim @DanielThomas I started trying to investigate this by updating the test environment to be able to generate the perf output for the source file you provided @prim. I was able to get it to build and run and output a .perf file, but then perf chokes on generating the output.

You can see the progress and the output I'm getting here: #401 (comment)

So, at the moment, I'm stuck. Ideally, if you can identify what I'm doing wrong and figure out how to fix the dockerized output generation, that would be great. Failing that, if you can provide the full output of one of your profiles with the incorrect display in speedscope, that would be another way for me to make progress.

Thanks!

@DanielThomas

Here's a system wide profile that demonstrates the problem:

profile.out.gz

The DiscoveryClient thread 2614953 is a good choice; here's when the threads are actually on-CPU:

[screenshot: Screen Shot 2022-08-01 at 10.22.03 am]

G1 Conc#0 also shows a related issue: for threads that have no idle gaps, or are only on-CPU once during the profile, the timeline doesn't reflect the entire profile duration, so you can't compare timelines when switching threads/processes.

@jlfwong
Owner

jlfwong commented Aug 5, 2022

Hmm, okay, interesting. I think I understand the high-level problem, but I want to write it out to make sure I do.

When a thread is idle because the OS scheduler isn't scheduling it (either because it's sleeping or because it's blocked on a certain kind of syscall like select), perf doesn't report any samples about what that thread is doing during that time.

This is problematic because speedscope uses these samples to infer what's happening in continuous time. This isn't specific to perf import, but is a general problem with converting sampling profiles to flame graphs -- you can't know what happened between samples, but you still need to attribute that time to some combination of the stack before the gap and the stack after the gap. Speedscope opts to attribute half the time to the sample before the gap and half to the sample after the gap:

appendSampleWithTimestamp(stack: FrameInfo[], timestamp: number) {
  if (this.pendingSample) {
    if (timestamp < this.pendingSample.centralTimestamp) {
      throw new Error('Timestamps received out of order')
    }
    const endTimestamp = (timestamp + this.pendingSample.centralTimestamp) / 2
    this.appendSampleWithWeight(
      this.pendingSample.stack,
      endTimestamp - this.pendingSample.startTimestamp,
    )
    this.pendingSample = {stack, startTimestamp: endTimestamp, centralTimestamp: timestamp}
  } else {
    this.pendingSample = {stack, startTimestamp: timestamp, centralTimestamp: timestamp}
  }
}

build(): Profile {
  if (this.pendingSample) {
    if (this.samples.length > 0) {
      this.appendSampleWithWeight(
        this.pendingSample.stack,
        this.pendingSample.centralTimestamp - this.pendingSample.startTimestamp,
      )
    } else {
      // There is only a single sample. In this case, units will be meaningless,
      // so we'll append with a weight of 1 and also clear any value formatter
      this.appendSampleWithWeight(this.pendingSample.stack, 1)
      this.setValueFormatter(new RawValueFormatter())
    }
  }
  this.totalWeight = Math.max(
    this.totalWeight,
    this.weights.reduce((a, b) => a + b, 0),
  )
  this.sortGroupedCallTree()
  return this
}
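
For example, given samples on one thread at t = 1.00 s, t = 1.01 s, and t = 3.01 s, the middle sample's stack would be charged from 1.005 s to 2.01 s -- about a full second of weight -- even if the thread was actually off-CPU for most of that 2-second gap.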

I'm not sure what the appropriate resolution here is. There are, I believe, two separate issues.

  1. How do we detect that this issue is occurring at all? One option is to estimate the sampling frequency by looking at adjacent sample times, and then if we see gaps that are 10x longer than that (or some other factor we tune), we know that something is weird (see the sketch after this list).

  2. If we detect it, how do we re-attribute the time? Isn't it the case that some blocking syscalls take a thread off-CPU, but you'd still want to attribute that time to the stack that made the syscall?
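
A minimal sketch of option 1, assuming the per-thread sample timestamps (in seconds) have already been extracted from the perf output; the 10x factor and the helper name are illustrative only:

import statistics

def find_suspicious_gaps(timestamps, factor=10):
    # Gaps between consecutive samples that are much longer than the typical
    # sampling interval suggest the thread was off-CPU (or not being sampled).
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if not deltas:
        return []
    typical = statistics.median(deltas)  # robust estimate of the sampling interval
    return [(timestamps[i], timestamps[i + 1])
            for i, d in enumerate(deltas) if d > factor * typical]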

@prim
Author

prim commented Aug 9, 2022

I haven't checked the documentation or the code; the following are my guesses.

The output of perf script is a sequence of samples. The first line of each sample consists of the thread name, thread ID, timestamp, sample period, and sampled event, followed by N lines of call stack.

grpc-default-wo 4615/4802 [001] 52473.278291: 20408163 cpu-clock:

Therefore, we can tell that the sampling interval of @DanielThomas's output is 0.20408163 seconds, which is 5 samples per second.


I use the little script below to process the output of perf script and let speedscope render what I think is correct. Please focus on the with_idle function.

speedscope also has problems when thread names contain spaces; replace_space_in_filename handles this ("filename" is actually the thread name, sorry).

# encoding: utf8

def split_line(line):
    line = line.strip()
    line = line.replace("\t", "")
    for i in xrange(8):
        line = line.replace("  ", " ")
    ps = line.rsplit(" ", 4)
    return ps

def format_time(begin, now):
    n = int(now - begin)
    minute = n / 60
    second = n % 60
    return "%02d:%02d" % (minute, second)

class Thread(object):

    def __init__(self, id, interval):
        self.id = id
        self.interval = interval
        self.except_amount = int(1 / interval)
        self.stats = {}
        self.min_ts = None
        self.max_ts = None
        self.ts_list = []
        self.cpu_usage = {}

    def handle(self, ps):
        ts = float(ps[2][:-1])
        second = int(ts)
        self.stats[second] = self.stats.get(second, 0) + 1
        if self.min_ts is None:
            self.min_ts = ts
        self.max_ts = ts

    def output(self):
        print "thread", self.id
        print "last seconds", self.max_ts - self.min_ts
        M = self.stats
        e = self.except_amount
        for sec, n in sorted(M.items()):
            r = n * 1.0 / e
            # print sec, n, e, r
            # if r > 0.3 and 0:
            #     print self.id, format_time(self.min_ts, sec), n, e, r
            self.cpu_usage[sec] = r

class PerfScript(object):

    def __init__(self, file_path):
        self.file_path = file_path
        self.thread_names = {}

        self.replace_space_in_filename()
        self.init()

    def replace_space_in_filename(self):
        m = {}
        with open(self.file_path, "r") as rfile:
            for line in rfile:
                if "cpu-clock" in line:
                    ps = split_line(line)
                    filename = ps[0]
                    if " " in filename:
                        m[filename] = filename.replace(" ", "-")

        with open(self.file_path, "rb") as rfile:
            content = rfile.read()

        for k, v in m.items():
            content = content.replace(k, v)

        with open(self.file_path, "wb") as wfile:
            wfile.write(content)

    def init(self):
        with open(self.file_path, "r") as rfile:
            for line in rfile:
                if "cpu-clock" in line:
                    ps = split_line(line)
                    thread_id = ps[1]
                    dt = int(ps[3]) / 1000000000.
                    self.event_ps = ps
                    self.interval = dt
                    self.thread_names[thread_id] = ps[0]

    def with_idle(self, out_file_path):
        lines = []
        dt = self.interval
        thread_last_ts = {}

        with open(self.file_path, "r") as rfile:
            for line in rfile:
                if "cpu-clock" in line:
                    ps = split_line(line)
                    ts = float(ps[2][:-1])
                    thread_id = ps[1]
                    last_ts = thread_last_ts.get(thread_id)
                    if last_ts is not None:
                        while True:
                            diff = ts - last_ts
                            # print "T", diff, dt, last_ts, ts
                            if diff > 0.0 and diff > 2 * dt:
                                last_ts += dt
                                # print "A", last_ts
                                self.event_ps[0] = self.thread_names[thread_id]
                                self.event_ps[1] = thread_id
                                self.event_ps[2] = str(last_ts) + ":"
                                lines.append(" ".join(self.event_ps) + "\n")
                                # lines.append("\t 0x0 [unknown] ([unknown])\n")
                                lines.append("\n")
                            else:
                                break
                    last_ts = ts
                    thread_last_ts[thread_id] = last_ts

                lines.append(line)

        with open(out_file_path, "w+") as wfile:
            wfile.write("".join(lines))

    def stat(self):
        threads = {}
        with open(self.file_path, "r") as rfile:
            for line in rfile:
                if "cpu-clock" in line:
                    ps = split_line(line)
                    thread_id = ps[1]
                    thread = threads.get(thread_id)
                    if thread is None:
                        thread = threads[thread_id] = Thread(thread_id, self.interval)
                    thread.handle(ps)

        for k, v in threads.items():
            v.output()

        ts = set()
        for k, v in threads.items():
            ts = ts.union(v.cpu_usage.keys())

        for k in sorted(ts):
            s = 0
            for _, v in threads.items():
                s += v.cpu_usage.get(k, 0)
            print "total cpu", k, s
            continue  # the rest of this loop body below is skipped
            if s < 0.3:
                continue
            print "total cpu", k, s
            for _, v in threads.items():
                print v.id, v.cpu_usage.get(k, 0)
            print

p = PerfScript("perf.script")
p.with_idle("perf.script.idle")
p.stat()

I'm not sure whether other sampling profilers also skip sampling when the program is idle, like perf does.
I think some will and some won't.
Maybe we just need to handle perf specially?


In fact, I'm trying to use speedscope to analyze CPU spikes in my program and locate the cause of high CPU consumption at a given moment.

Like the stat function in the Python script above, I think it is possible to infer the program's CPU utilization at a point in time from the number of perf samples, and then directly tell the user when CPU utilization was high.
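
For example, if the effective sampling rate is 49 Hz, a thread that produced 25 samples within a given second was presumably on-CPU for roughly 25 / 49 ≈ 51% of that second.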

@DanielThomas

DanielThomas commented Aug 9, 2022

  1. For perf, at least for cpu and cpu-clock events, no detection is required; we know it's a statistical sampling profiler.
  2. In terms of attributing time: my profile was taken at 49 Hz - for CPU-busy periods, if you bucket the timestamps by millisecond you'll get around that number. Here's a period/thread where you can see the 20.40 ms spacing of the samples:
G1 Conc#0 2609559/2609647 [064] 14958254.393007:   20408163 cpu-clock:
G1 Conc#0 2609559/2609647 [064] 14958254.413415:   20408163 cpu-clock:
G1 Conc#0 2609559/2609647 [064] 14958254.433823:   20408163 cpu-clock:
G1 Conc#0 2609559/2609647 [064] 14958254.454231:   20408163 cpu-clock:
G1 Conc#0 2609559/2609647 [064] 14958254.474639:   20408163 cpu-clock:
G1 Conc#0 2609559/2609647 [064] 14958254.495048:   20408163 cpu-clock:
G1 Conc#0 2609559/2609647 [064] 14958254.515456:   20408163 cpu-clock:
G1 Conc#0 2609559/2609647 [064] 14958254.535885:   20408163 cpu-clock:
G1 Conc#0 2609559/2609647 [064] 14958254.556292:   20408163 cpu-clock:
G1 Conc#0 2609559/2609647 [064] 14958254.576701:   20408163 cpu-clock:
G1 Conc#0 2609559/2609647 [064] 14958254.597145:   20408163 cpu-clock:
G1 Conc#0 2609559/2609647 [078] 14958254.618787:   20408163 cpu-clock:
G1 Conc#0 2609559/2609647 [078] 14958254.639195:   20408163 cpu-clock:
G1 Conc#0 2609559/2609647 [078] 14958254.659603:   20408163 cpu-clock:
G1 Conc#0 2609559/2609647 [078] 14958254.680011:   20408163 cpu-clock:
G1 Conc#0 2609559/2609647 [078] 14958254.700419:   20408163 cpu-clock:
G1 Conc#0 2609559/2609647 [078] 14958254.720827:   20408163 cpu-clock:
G1 Conc#0 2609559/2609647 [078] 14958254.741235:   20408163 cpu-clock:
G1 Conc#0 2609559/2609647 [078] 14958254.761644:   20408163 cpu-clock:
G1 Conc#0 2609559/2609647 [078] 14958254.782052:   20408163 cpu-clock:

Rather than estimate the sampling frequency, I'd just look at the smallest period seen between samples on any individual thread and use that as the minimum time resolution for all threads. Off-CPU time is an entirely different concern; just having on-CPU time work correctly would be hugely helpful on its own.
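
A minimal sketch of that heuristic, again assuming samples have already been grouped into sorted per-thread timestamp lists (the grouping itself is not shown):

def minimum_sample_period(samples_by_thread):
    # samples_by_thread maps a thread id to its sorted sample timestamps (seconds).
    # The smallest gap seen on any single thread is used as the rendering
    # resolution for all threads.
    deltas = [b - a
              for timestamps in samples_by_thread.values()
              for a, b in zip(timestamps, timestamps[1:])]
    return min(deltas) if deltas else None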

@prim
Author

prim commented Aug 15, 2022

Oh, I was wrong. 20408163 refers to 0.020408163 seconds.
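
(That is, 20408163 is the sample period in nanoseconds: ≈ 20.4 ms per sample, or about 49 samples per second, matching the 49 Hz figure above.)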

@milianw

milianw commented Aug 5, 2023

Perf has the ability to report context switches too, via perf record --switch-events. If you take those into account, you can visualize off-CPU time properly.

If you additionally do perf record -e sched:sched_switch ... you can also get a call graph to attribute the off-CPU time to (which is very useful; it will point to e.g. sleep, lock contention, synchronous I/O, page faults, ...).
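
Combining that with the earlier recording command might look something like (untested):

$ perf record --switch-events -e cpu-clock -e sched:sched_switch -g -p 320355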
