[wip] Reintroduce vhost_user statistics #1022
base: master
Conversation
This reverts commit 77c0d57. Conflicts: src/lib/virtio/net_device.lua
@lukego What would be the acceptable cost (in %) of adding vhost_user statistics?
This seems to indicate that the current
@eugeneia Sorry for the slow feedback. I think the overhead should be less than 1%. The issue for me is that I don't understand why a simple change like bumping a few counters would cost more than this. I suspect there is an interesting answer but I don't know what it is. This is really a case where we need better tooling for "zooming in" on performance issues. In a perfect world the information that I would like to have available from CI would include:
In principle we could produce the trace dumps with the JIT dump facility.
I have done some ad-hoc testing with the JIT dump. Here is what IR is added for each counter update:
Any hints on how I can find out which region of mcode belongs to an IR region?
Is there a loop in that trace? Is this IR snippet from before the loop (first iteration) or after the loop (subsequent iterations reusing work from the first)? (Can you gist the full trace?) If you have run traceprof then could you show the before+after on that too? Just to see if e.g. the hotspot has moved from the loop of a trace to a side-trace, which would be slower.

The PMU data does not quite seem to add up. If this code is only using 1% more cycles in total then it should not account for more than a 1% slowdown. However, maybe I am not interpreting the info right or the measurements are not precise enough.
Matching up IR with mcode is laborious and heuristic in my experience. One tip is to turn on the more verbose dump options. However, the IR probably contains all of the answers. It is just a bit less familiar, and I haven't memorized all the opcodes, so I tend to use the mcode as a crutch. One could instead refer to the IR docs.
Just observations from glancing at the IR:
This gist has both dumps (master vs vhost_stats). Wasn't aware of traceprof, will give it a try. While the dumps were generated with the profiler enabled, the PMU measurements were collected without the profiler on.
Oh hey look at the last three lines of the IR:
This is actually exactly the code that we want: load a u64 value from memory, add to it, and store it back.
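For reference, the source-level operation behind that load/add/store pattern is just an in-place 64-bit increment. A minimal sketch (assuming a bare FFI struct rather than Snabb's actual core.counter module, whose shared-memory details differ):

```lua
-- Sketch of a counter bump (assumption: a counter is a struct with a single
-- uint64_t field; Snabb's real counters live in shared memory, but the
-- arithmetic is the same idea).
local ffi = require("ffi")

ffi.cdef[[ struct counter { uint64_t c; }; ]]

local function add (ctr, value)
   -- load the u64, add (default 1), store it back -- the three IR ops above
   ctr.c = ctr.c + (value or 1)
end

local txpackets = ffi.new("struct counter")
local txbytes   = ffi.new("struct counter")

add(txpackets)       -- one more packet
add(txbytes, 1500)   -- plus its length in bytes
print(tonumber(txpackets.c), tonumber(txbytes.c))  --> 1  1500
```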
Traceprof should be very helpful here. It will tell us how much time was spent in each trace and whether that is in the sequential part of the trace or the looping part (which is usually better optimized).
The reason traceprof is important is that the same function can be compiled many times as different traces, each with different IR and mcode. It can even be compiled twice in the same trace, separately for the first iteration and the subsequent ones. Traceprof tells us how much CPU-time each of these implementations is actually consuming. This tells us which traces we should be looking at, and also how well we are keeping the CPU occupied with loops (which tend to be compiled the best).
Traceprof confirmed that we are looking at the right trace:
But the IR I referenced is indeed outside of the loop.
Groovy. So on both branches there is a trace starting at the same place.

btw: Interesting to see the other hot trace too. On the master branch this is trace 35 and it is a side-trace branching from root trace 18.
Ok, so the difference in the LOOP IR is really just:
Which seems very efficient for two counters (number of packets and size; I removed the mcast/bcast counters for more isolation). So does this point to a cache issue of sorts? Looking at the experiment I kicked off yesterday, using meta-methods for counters, the results seem to confirm that the problem is not with the code generated.
This IR code does look fairly innocent. Looks like it bumps two counters, the first by exactly 1 (sounds like the packets counter) and the second by a variable amount (sounds like the bytes counter). Could still be that this innocent-looking addition does impact the generated code. Maybe the JIT is running out of registers and needs to spill temporary values onto the stack or recalculate them. I think it is worth comparing the machine code too.

btw: I usually run the JIT dump with a few extra options enabled.

However, it could also be time to take a step back and look more carefully at the metrics. How much % slowdown are you seeing on average on the l2fwd benchmarks? Can you account for this with the PMU counters? (How are you taking the PMU measurements?)
Good tip!

I got the PMU measurements using an ad-hoc patch.
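The ad-hoc patch itself is not preserved above. Purely as a sketch of this kind of instrumentation, and assuming lib.pmu exposes is_available() and profile(fn, event_list, aux) as described in its README and that core.app exposes breathe() (both assumptions worth verifying against the tree), a measurement harness might look roughly like this:

```lua
-- Rough sketch of PMU instrumentation (not the actual patch from this thread).
-- Assumptions: lib.pmu provides is_available() and profile(fn, event_list, aux),
-- and core.app (the engine) exposes breathe(); verify both against the tree.
local pmu    = require("lib.pmu")
local engine = require("core.app")

local function measure_breaths (nbreaths)
   if not pmu.is_available() then
      print("PMU counters not available on this machine/kernel")
      return
   end
   pmu.profile(function ()
                  for _ = 1, nbreaths do
                     engine.breathe()   -- one iteration of the engine loop
                  end
               end,
               {},                      -- default event set (cycles, instructions, ...)
               {breath = nbreaths})     -- normalize the report per breath
end

measure_breaths(1e5)  -- presumes the engine has already been configured with an app network
```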
So a few quick-fire thoughts for now: if adding the counters produces 30 additional instructions per packet then this seems plausible to account for the performance hit.

Second pass

Just an out-of-the-box idea... I am starting to think that the best way to make LuaJIT performance simple and understandable is to factor code as multiple small loops ("o-o-o") rather than one big one ("O"). This way we see each loop separately in traceprof, we can read their code in isolation, JIT issues like side-traces are localized, and maybe we get better code (less register pressure, fewer "barriers" forcing accumulators to be committed to memory, etc).

On the one hand it would be quite a task to rewrite the whole virtio module this way, e.g. one loop to collect the descriptors, one loop to copy the payloads, one loop to calculate the offloaded checksums, etc. This could turn out to be a wasted effort if the coding style did not actually help. On the other hand it may be straightforward to factor this new counter functionality as an isolated loop. Could consider something like this?

function pull_and_count ()
-- get the link that we will output packets to & remember the index before we added any
local l = self.output.output
local index = l.write
-- run the normal pull logic to add packets to the link
pull()
-- bump counters as a post-processing step across the new packets added to the link
while index ~= l.write do
count(l.packets[index])
index = (index + 1) % link.max
end
end

(Operating on packets after they have been transmitted on the link is probably not sound practice in a multiprocess context, since we don't own them anymore after we transmit them, but that is a detail we could deal with.)

Side-trace risks

Related issue: there are two potentially unbiased branches in this Lua code:

local counters = self.owner.shm
counter.add(counters.txbytes, tx_p.length)
counter.add(counters.txpackets)
if ethernet:is_mcast(tx_p.data) then
counter.add(counters.txmcast)
end
if ethernet:is_bcast(tx_p.data) then
counter.add(counters.txbcast)
end

This will force the JIT to generate separate code ("side traces") for multicast and broadcast packets, and this will make performance sensitive to the mix of unicast/multicast/broadcast traffic. This is likely being missed by our standard performance tests since those use uniform packet contents. I would recommend rewriting these to be branch-free (or alternatively making the test suite more sophisticated so that it can observe and report the effect of traffic mix on performance).
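As an illustration of the branch-free idea, a minimal sketch (not code from this branch; it assumes counter.add() accepts an explicit 0 increment and that core.counter is the module path) could turn the multicast test into an arithmetic 0/1 value so the trace never branches:

```lua
-- Hypothetical branch-free variant (illustration only, not the PR's code).
-- The low bit of the first destination-MAC byte is 1 for multicast frames,
-- so band() yields 0 or 1 and we can always add it -- no if/then, hence no
-- side-trace. txbcast would need a similar arithmetic treatment.
local band    = require("bit").band
local counter = require("core.counter")  -- assumed module path

local function add_tx_counters (counters, tx_p)
   counter.add(counters.txbytes, tx_p.length)
   counter.add(counters.txpackets)
   counter.add(counters.txmcast, band(tx_p.data[0], 1))  -- adds 0 or 1 unconditionally
end
```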
Relatedly: I am not really confident to dive into the details here because the code+tprof+trace all need to come from the same instance in order to fit together. If I took these puzzle pieces from multiple different executions then I would never manage to piece them together. We really need to come up with a standard way to capture all the data needed for performance analysis in one place... I also think the PMU information would be much easier to interpret if we captured it separately for each app. So much to do... one step at a time :).
At least the IR/mcode/traceprof dumps are always from a single run. By the way, I was seeing a huge slowdown when using PMU, is that expected? Could also be related to the CPU pinning. I agree that we need a test case for a mix of multicast/broadcast/unicast packets; the reason why I threw out the “branch-free patch” is that I couldn’t measure a positive impact from it. Will ponder more on the suggestions in this thread and report back. :-)
@eugeneia There should not be any overhead from PMU nor from CPU pinning. Unless perhaps you have pinned several processes to the same core?
Generally I am on board with this philosophy, i.e. be conservative about optimizations and don't add complicated code unless you are sure it is worth the trouble. However, this case seems different for two reasons. First is that this is the single longest and hottest trace in the whole application, so any surprises are likely to have severe impact. Second is that adding workload-sensitive branches is definitely reducing our performance test coverage, i.e. this will cause new side-traces and machine code to be generated and we have no visibility of the impact because our tests are not exercising those paths.

(Over time I expect that we will develop programming styles that make all of these issues more manageable. I am still hopeful that writing multiple small loops will be more robust than single large ones as we have here in virtio. This is already the effect of breaking functionality into multiple apps, but it may make sense within apps too. Have to get some more experience to really make sense of it all.)

In the future we could develop more sophisticated testing tools to deal with such issues. Simplest might be to test many randomized traffic workloads. Fancier (daydreaming...) might be a load tester that uses reinforcement learning to find an application's pain-points by searching for workloads that cause excessive load (e.g. causing packet drops or making a bits-per-cycle metric drop).
Yep that was it... :-) I agree with the rationale for avoiding the branches until we are at least able to test the branches... good thinking!
Frustratingly, my first couple attempts at this seem to make matters worse: https://hydra.snabb.co/build/818389/download/2/report.html#split-by-benchmark |
Given the speedup of the vhost_user code due to #1001 I want to reintroduce the vhost_user statistics counters from #931. I gave this a [wip] tag since the performance regression caused by this change is still not acceptable. See the Rmarkdown report for the regression: https://hydra.snabb.co/build/551126/download/2/report.html#overall-summary
Cc @lukego