
[wip] Reintroduce vhost_user statistics #1022

Open · wants to merge 3 commits into master

Conversation

eugeneia
Member

Given the speedup of the vhost_user code due to #1001, I want to reintroduce the vhost_user statistics counters from #931. I gave this a [wip] tag since the performance regression caused by this change is still not acceptable.

See the Rmarkdown report for the regression: https://hydra.snabb.co/build/551126/download/2/report.html#overall-summary

Cc @lukego

This reverts commit 77c0d57.

# Conflicts:
#	src/lib/virtio/net_device.lua
@eugeneia
Member Author

@lukego What would be the acceptable cost (in %) of adding vhost_user statistics?

model.tables(aov(score ~ snabb, read.csv("https://hydra.snabb.co/job/max/counters-experiments/benchmark-csv/latest/download-by-type/file/CSV")), type="means")
Tables of means
Grand mean

7.864391 

 snabb 
snabb
          master vhost_user-stats 
           8.038            7.691

This seems to indicate that the current vhost_user-stats branch costs us about 4-5% of total performance (8.038 vs. 7.691), which seems like too much.

@lukego
Member

lukego commented Sep 26, 2016

@eugeneia Sorry for the slow feedback. I think the overhead should be less than 1%.

The issue for me is that I don't understand why a simple change like bumping a few counters would cost more than this. I suspect there is an interesting answer but I don't know what it is. This is really a case where we need better tooling for "zooming in" on performance issues.

In a perfect world the information that I would like to have available from CI would include:

  • How do the relevant traces look on both branches? (Compare IR and machine code for the vhost_user loops to see what is different.)
  • How are the metrics instructions-per-packet and instructions-per-cycle impacted? If we are executing more instructions then there is likely a JIT issue; if our instructions are executing more slowly there is likely a CPU issue (like #558 (comment)).
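
For illustration, a minimal sketch of the two derived metrics, assuming the raw instruction, cycle and packet totals for a run have already been collected (e.g. via lib.pmu and the link counters; the variable names here are illustrative):

-- Illustrative only: `instructions`, `cycles` and `packets` are assumed to
-- come from a PMU measurement and the link statistics of the same run.
local ipp = instructions / packets  -- instructions-per-packet: rises => likely a JIT/code-generation issue
local ipc = instructions / cycles   -- instructions-per-cycle: falls => likely a CPU issue (stalls, cache misses)
print(("instructions/packet: %.1f  instructions/cycle: %.2f"):format(ipp, ipc))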

In principle we could produce the trace dumps with -jdump, identify the relevant ones with -tprof, and see the detailed metrics with enhanced timeline logs (#1011). In practice we have not yet worked out how to tie these tools together with the CI jobs and it is inconvenient to have to run them manually.

@eugeneia
Member Author

eugeneia commented Sep 26, 2016

I have done some ad-hoc testing using lib.pmu and so far it looks like vhost_user-stats is using ~3% more instructions and ~1% more cycles in total (~2% more instructions per cycle). This correlates well with the performance impact we are seeing, I guess? (edit: scratch that nonsense)

Here is what IR is added for each counter.add:

> 0329 >  p32 HREFK  0132  "shm" @3
> 0330 >  tab HLOAD  0329
> 0331 >  p32 UREFC  net_device.lua:<line>  #0  
> 0332 >  tab ULOAD  0331
> 0333    int FLOAD  0332  tab.hmask
> 0334 >  int EQ     0333  +15 
> 0335    p32 FLOAD  0332  tab.node
> 0336 >  p32 HREFK  0335  "add" @13
> 0337 >  fun HLOAD  0336
> 0338    int FLOAD  0330  tab.hmask
> 0339 >  int EQ     0338  +7  
> 0340    p32 FLOAD  0330  tab.node
> 0341 >  p32 HREFK  0340  "txbytes" @6
> 0342 >  cdt HLOAD  0341
> 0343 >  fun EQ     0337  counter.lua:<line>
> 0344 >  nil PROF   
> 0345    u16 FLOAD  0342  cdata.ctypeid
> 0346 >  int EQ     0345  +1281
> 0347    p64 ADD    0342  +8  
> 0348    u64 XLOAD  0347  
> 0350    u64 ADD    0348  0311
> 0352    u64 XSTORE 0347  0350
> 0353 >  nil PROF   
> 0354 >  p32 HREFK  0340  "txpackets" @4
> 0355 >  cdt HLOAD  0354
> 0356 >  nil PROF   
> 0357    u16 FLOAD  0355  cdata.ctypeid
> 0358 >  int EQ     0357  +1281
> 0359    p64 ADD    0355  +8  
> 0360    u64 XLOAD  0359  
> 0362    u64 ADD    0360  +1  
> 0364    u64 XSTORE 0359  0362

Any hints on how I can find out which region of mcode belongs to an IR region?

@lukego
Member

lukego commented Sep 26, 2016

Is there a loop in that trace? Is this IR snippet from before the loop (first iteration) or after the loop (subsequent iterations reusing work from the first)? (Can you gist the full trace?)

If you have run traceprof then could you show the before+after on that too? Just to see if e.g. the hotspot has moved from the loop of a trace to a side-trace which would be slower.

The PMU data does not quite seem to add up. If this code is only using 1% more cycles in total then it should not account for more than a 1% slowdown. However maybe I am not interpreting the info right or the measurements are not precise enough.

@lukego
Member

lukego commented Sep 26, 2016

Matching up IR with mcode is laborious and heuristic in my experience. One tip is to dump with +r to see register assignments that will match in both IR and mcode. I would love to automate this matching.

However the IR probably contains all of the answers. Just that it is a bit less familiar and I haven't memorized all the opcodes and so I tend to use the mcode as a crutch. Could instead refer to the IR docs.
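
For reference, a sketch of enabling that dump from inside a script (the flag string and output file name are illustrative; the same thing can be requested on the command line with -jdump=+r):

-- Turn on the LuaJIT trace dump with register/slot assignments ("+r")
-- and snapshots ("+s"), writing to an illustrative output file.
local dump = require("jit.dump")
dump.on("+rs", "vhost_stats.dump")
-- ... run the workload ...
dump.off()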

@lukego
Member

lukego commented Sep 26, 2016

Just observations from glancing at the IR:

  • There are a bunch of hashtable lookups (HREFK) and type checks and so on. This would be fine if it is happening once, on the first packet, before the trace spins in the loop. However if we are doing this on every packet then it will add up.
  • There are PROF IR nodes that I suspect are due to the LuaJIT profiler (-jp). I would suggest disabling the profiler, at least when taking PMU measurements, because it can impact the compiled code. (One reason I use traceprof instead is that it does not interact with code generation.)
  • Could try representing counters in a more efficient way e.g. as FFI objects (method call to add) or as closures (call to add). This could reduce the number of instructions required to bump a counter and make it less crucial for the JIT to "hoist" them outside of the inner loop. (On timeline I have a "factory" function that returns a closure that you call to log an event.)
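
As one concrete illustration of the closure variant (a hypothetical sketch, not the existing counter API; real counters live in shm-mapped files, which is glossed over here):

-- Hypothetical "counter factory": returns closures over a preallocated
-- uint64 cell, so the hot path is a single call through an upvalue with
-- no hashtable lookups or method dispatch.
local ffi = require("ffi")

local function counter_factory ()
   local cell = ffi.new("uint64_t[1]")
   local function add (n) cell[0] = cell[0] + (n or 1) end
   local function read () return cell[0] end
   return add, read
end

local add_txpackets, read_txpackets = counter_factory()
add_txpackets()      -- bump by 1
add_txpackets(42)    -- bump by 42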

@eugeneia
Member Author

This gist has both dumps (master vs vhost_stats). I wasn't aware of traceprof, will give it a try. While the dumps were generated with the profiler enabled, the PMU measurements were collected without the profiler on.

@lukego
Member

lukego commented Sep 26, 2016

Oh hey look at the last three lines of the IR:

> 0360    u64 XLOAD  0359  
> 0362    u64 ADD    0360  +1  
> 0364    u64 XSTORE 0359  0362

This is actually exactly the code that we want. Load a u64 value from memory (XLOAD), add one (ADD), store the result (XSTORE). (Okay, even better would be to have only one ADD that keeps the accumulated value in a register until the loop terminates, but that might be asking too much.) The rest of the IR instructions look mostly like overhead that we want to eliminate (e.g. more efficient function call) or move outside of the hot code path (help the JIT to "hoist" the work out of the loop).
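
For illustration, a sketch of the "keep the accumulated value in a register" idea, assuming a per-breath transmit loop roughly like the one in this trace (the names npackets, packets and self.owner.shm are illustrative):

-- Accumulate the per-breath totals in plain Lua locals, which the JIT can
-- keep in registers, and commit them to the counters once after the loop.
local txpackets, txbytes = 0, 0
for i = 1, npackets do
   local p = packets[i]
   txpackets = txpackets + 1
   txbytes = txbytes + p.length
   -- ... existing per-packet transmit work ...
end
counter.add(self.owner.shm.txpackets, txpackets)
counter.add(self.owner.shm.txbytes, txbytes)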

@lukego
Member

lukego commented Sep 26, 2016

Traceprof should be very helpful here. It will tell us how much time was spent in each trace and whether that is in the sequential part of the trace or the looping part (which is usually better optimized).

@lukego
Member

lukego commented Sep 26, 2016

The reason traceprof is important is that the same function can be compiled many times as different traces each with different IR and mcode. It can even be compiled twice in the same trace, separately for the first iteration and the subsequent ones. Traceprof tells us how much CPU-time each of these implementations is actually consuming. This tells us which traces we should be looking at, and also how well we are keeping the CPU occupied with loops (which tend to be compiled the best.)

@eugeneia
Member Author

Traceprof confirmed that we are looking at the right trace:

# with stats
traceprof report (recorded 36737/36737 samples):
 26% TRACE  81:LOOP          ->loop    virtq_device.lua:57
 19% TRACE  40      (21/4)   ->21      link.lua:80
  7% TRACE  21               ->loop    flooding.lua:29
  4% TRACE  23:LOOP          ->loop    synth.lua:36
  3% TRACE  89:LOOP          ->loop    virtq_device.lua:57
  2% TRACE  21:LOOP          ->loop    flooding.lua:29
  1% TRACE  42      (29/0)   ->35      app.lua:329
  1% TRACE  88:LOOP          ->loop    basic_apps.lua:80
  1% TRACE  38      (23/15)  ->28      synth.lua:35
# master
traceprof report (recorded 36275/36275 samples):
 20% TRACE  35      (18/4)   ->18      link.lua:80
 18% TRACE  75:LOOP          ->loop    virtq_device.lua:57
  7% TRACE  18               ->loop    flooding.lua:29
  4% TRACE  17:LOOP          ->loop    synth.lua:36
  3% TRACE  85:LOOP          ->loop    virtq_device.lua:57
  2% TRACE  18:LOOP          ->loop    flooding.lua:29
  1% TRACE  83:LOOP          ->loop    basic_apps.lua:80
  1% TRACE  32      (17/15)  ->23      synth.lua:35

But the IR I referenced is indeed outside of the loop. (Lines below -- LOOP -- are inside the loop, right?)

@lukego
Member

lukego commented Sep 26, 2016

Groovy. So on both branches there is a trace starting at virtq_device.lua:57 that is hot inside its loop. So it makes sense to compare the IR code for this trace on each branch, starting from the --LOOP-- line as you say. Hopefully this code will be more streamlined than what we saw above because it can reuse the work that was done on the first iteration (function lookups, type checks, etc). LuaJIT does this kind of optimization very well.

btw: Interesting to see the other hot trace too. On master branch this is trace 35 and it is a side-trace branching from root trace 18 (18/4). We can see below that trace 18 is flooding.lua:29. So it sounds like the flooding bridge module has an issue like lukego/blog#8 i.e. instead of staying inside a tightly compiled loop it is branching to a side-trace and repeating setup work.

@eugeneia
Member Author

eugeneia commented Sep 27, 2016

Ok, so the difference in the LOOP IR is really just:

0631    u64 XLOAD  0383                                       
0632    u64 ADD    0631  +1
0633    u64 XSTORE 0383  0632
0634    u64 XLOAD  0393
0635    u64 ADD    0634  0616
0636    u64 XSTORE 0393  0635

Which seems very efficient for two counters (number of packets and size, I removed the mcast/bcast for more isolation). So does this point to a cache issue of sorts?

Looking at the experiment I kicked off yesterday, using meta-methods for counters, the results seem to confirm that the problem is not with the code generated.
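
For reference, a hypothetical sketch of what the FFI-object variant can look like (ffi.metatype with an add method; the struct layout and names are illustrative, not the actual experiment):

-- counter.add becomes a method call on a cdata object, which avoids the
-- hashtable lookups and type guards visible in the IR above.
local ffi = require("ffi")
ffi.cdef[[ typedef struct { uint64_t c; } counter_t; ]]
local counter_t = ffi.metatype("counter_t", {
   __index = {
      add = function (self, n) self.c = self.c + (n or 1) end
   }
})

local txpackets = counter_t()
txpackets:add()     -- bump by 1
txpackets:add(64)   -- bump by 64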

@lukego
Member

lukego commented Sep 27, 2016

This IR code does look fairly innocent. Looks like it bumps two counters, the first by exactly 1 (sounds like the packets counter) and the second by a variable amount (sounds like the bytes counter).

Could still be that this innocent-looking addition does impact the generated code. Maybe the JIT is running out of registers and needs to spill temporary values onto the stack or recalculate them. I think it is worth comparing the machine code too.

btw: I usually run the JIT dump with -jdump=+r to show where the values are allocated (which register or which stack position). This also helps to line up IR with mcode because the register names match.

However, could also be time to take a step back and look more carefully at the metrics. How much % slowdown are you seeing on average on the l2fwd benchmarks? Can you account for this with the PMU counters? (How are you taking the PMU measurements?)

@eugeneia
Member Author

Good tip using +r; I think I found the machine code, which is much longer than the IR (the lines without a matching IR instruction appear only in the dump with statistics). I can’t really tell what the remaining instructions do, but this might support the “running out of registers” theory?

IR                                     MCODE
                                       0bc9da82  mov r8d, [rsp+0x10]
0631 r14      u64 XLOAD  0383          0bc9da87  mov r14, [rdx+0x8]
0632 r14      u64 ADD    0631  +1      0bc9da8b  add r14, +0x01    
0633          u64 XSTORE 0383  0632    0bc9da8f  mov [rdx+0x8], r14
0634 r14      u64 XLOAD  0393          0bc9da93  mov r14, [rcx+0x8]
0635 r14      u64 ADD    0634  0616    0bc9da97  add r14, r10      
0636          u64 XSTORE 0393  0635    0bc9da9a  mov [rcx+0x8], r14
                                       0bc9da9e  mov r14d, 0x2e          
                                       0bc9daa4  cmp r14d, r10d          
                                       0bc9daa7  cmovl r14d, r10d        
                                       0bc9daab  xor r10d, r10d          
                                       0bc9daae  add r14d, +0x04         
                                       0bc9dab2  jo 0x0bc90058  ->50      
                                       0bc9dab8  add r14d, +0x05         
                                       0bc9dabc  jo 0x0bc90058  ->50      
                                       0bc9dac2  xorps xmm6, xmm6        
                                       0bc9dac5  cvtsi2sd xmm6, r14d     
                                       0bc9daca  mulsd xmm6, [0x40047880]
                                       0bc9dad3  cvttsd2si r14, xmm6     
                                       0bc9dad8  test r14, r14           
                                       0bc9dadb  jns 0x0bc9daeb          
                                       0bc9dadd  addsd xmm6, [0x4178d7b8]
                                       0bc9dae6  cvttsd2si r14, xmm6     
                                       0bc9daeb  add r14, [r11+0x8]      
                                       0bc9daef  mov [r11+0x8], r14      
                                       0bc9daf3  mov [r13+0x2800], r10w  
                                       0bc9dafb  mov r14, [r9+0x8]       
                                       0bc9daff  mov [r9+r14*8+0x18], r13
                                       0bc9db04  add r14, +0x01          
                                       0bc9db08  mov [r9+0x8], r14       

I got the PMU measurements using this ad-hoc patch and running

DOCKERFLAGS=-t SNABB_PCI_INTEL0=soft SNABB_PCI_INTEL1=soft sudo -E lock scripts/dock.sh "NFV_PMU=yes program/snabbnfv/dpdk_bench.sh"
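
For context, a minimal sketch of what an ad-hoc lib.pmu measurement can look like (assuming lib.pmu's is_available/profile entry points; the actual patch linked above may differ):

-- Run the engine under the PMU and print the counter report afterwards.
local pmu = require("lib.pmu")
local engine = require("core.app")
local avail, why = pmu.is_available()
if avail then
   pmu.profile(function () engine.main({duration = 10}) end)
else
   print("PMU not available: " .. tostring(why))
end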

@lukego
Member

lukego commented Sep 28, 2016

So a few quick-fire thoughts for now:

If adding the counters produces ~30 additional instructions per packet then it seems plausible that this accounts for the performance hit.

Second pass

Just an out of the box idea...

I am starting to think that the best way to make LuaJIT performance simple and understandable is to factor code as multiple small loops ("o-o-o") rather than one big one ("O"). This way we see each loop separately in traceprof, we can read their code in isolation, JIT issues like side-traces are localized, and maybe we get better code (less register pressure, fewer "barriers" forcing accumulators to be committed to memory, etc).

On the one hand it would be quite a task to rewrite the whole virtio module this way e.g. one loop to collect the descriptors, one loop to copy the payloads, one loop to calculate the offloaded checksums, etc. This could turn out to be a wasted effort if the coding style did not actually help. On the other hand it may be straightforward to factor this new counter functionality as an isolated loop.

Could consider something like this?

function pull_and_count ()
   -- get the link that we will output packets to & remember the index before we added any
   local l = self.output.output
   local index = l.write
   -- run the normal pull logic to add packets to the link
   pull()
   -- bump counters as a post-processing step across the new packets added to the link
   while index ~= l.write do
      count(l.packets[index])
      index = (index + 1) % link.max
   end
end

(Operating on packets after they have been transmitted on the link is probably not sound practice in a multiprocess context, since we don't own them anymore after we transmit them, but that is a detail we could deal with.)

Side-trace risks

Related issue - there are two potentially unbiased branches in this Lua code:

local counters = self.owner.shm
counter.add(counters.txbytes, tx_p.length)
counter.add(counters.txpackets)
if ethernet:is_mcast(tx_p.data) then
   counter.add(counters.txmcast)
end
if ethernet:is_bcast(tx_p.data) then
   counter.add(counters.txbcast)
end

This will force the JIT to generate separate code ("side traces") for multicast and broadcast packets and this will make performance sensitive to the mix of unicast/multicast/broadcast traffic. This is likely being missed by our standard performance tests since those use uniform packet contents. I would recommend rewriting these to be branch-free (or alternatively making the test suite more sophisticated so that it can observe and report the effect of traffic mix on performance).
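
As one possible illustration of the branch-free suggestion (a sketch covering the multicast counter only; it assumes, as in the snippet above, that tx_p.data points at the Ethernet header):

-- The multicast bit is the least-significant bit of the first
-- destination-MAC octet, so it can be added to the counter directly
-- instead of being branched on. (Broadcast would need a similar
-- arithmetic test against ff:ff:ff:ff:ff:ff.)
local band = require("bit").band
local mcast_bit = band(tx_p.data[0], 1)   -- 1 for multicast (and broadcast), else 0
counter.add(counters.txmcast, mcast_bit)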

@lukego
Member

lukego commented Sep 28, 2016

Relatedly: I am not really confident to dive into the details here because the code+tprof+trace all need to come from the same instance in order to fit together. If I took these puzzle pieces from multiple different executions then I would never manage to piece them together.

We really need to come up with a standard way to capture all the data needed for performance analysis in one place... I also think the PMU information would be much easier to interpret if we captured it separately for each app. So much to do... one step at a time :).

@eugeneia
Member Author

At least the IR/mcode/traceprof dumps are always from a single run. By the way, I was seeing a huge slowdown when using PMU; is that expected? It could also be related to numactl pinning to a specific core.

I agree that we need a test case for a mix of multicast/broadcast/unicast packets; the reason why I threw out the “branch-free patch” is that I couldn’t measure its positive impact.

Will ponder more on the suggestions in this thread and report back. :-)

@lukego
Member

lukego commented Sep 30, 2016

@eugeneia There should not be any overhead from PMU nor from CPU pinning. Unless perhaps you have pinned several processes to the same core?

the reason why I threw out the “branch-free patch” is that I couldn’t measure its positive impact.

Generally I am on board with this philosophy i.e. be conservative about optimizations and don't add complicated code unless you are sure it is worth the trouble. However, this case seems different for two reasons. First is that this is the single longest and hottest trace in the whole application and so any surprises are likely to have severe impact. Second is that adding workload-sensitive branches is definitely reducing our performance test coverage i.e. this will cause new side-traces and machine code to be generated and we have no visibility of the impact because our tests are not exercising those paths.

(Over time I expect that we will develop programming styles that make all of these issues more manageable. I am still hopeful that writing multiple small loops will be more robust than single large ones as we have here in virtio. This is already the effect of breaking functionality into multiple apps, but it may make sense within apps too. Have to get some more experience to really make sense of it all.)

In the future we could develop more sophisticated testing tools to deal with such issues. Simplest might be to test many randomized traffic workloads. Fancier (daydreaming...) might be a load tester that uses reinforcement learning to find an application's pain-points by searching for workloads that cause excessive load (e.g. cause packet drops or making a bits-per-cycle metric drop).

@eugeneia
Member Author

Unless perhaps you have pinned several processes to the same core?

Yep that was it... :-)

I agree with the rationale for avoiding the branches until we are at least able to test the branches... good thinking!

@eugeneia
Member Author

eugeneia commented Nov 4, 2016

On the one hand it would be quite a task to rewrite the whole virtio module this way e.g. one loop to collect the descriptors, one loop to copy the payloads, one loop to calculate the offloaded checksums, etc. This could turn out to be a wasted effort if the coding style did not actually help. On the other hand it may be straightforward to factor this new counter functionality as an isolated loop.

Frustratingly, my first couple attempts at this seem to make matters worse: https://hydra.snabb.co/build/818389/download/2/report.html#split-by-benchmark

dpino added a commit to dpino/snabb that referenced this pull request Mar 29, 2018