
Tests for observation loss and resource leakage #183

Open
waltermi opened this issue Oct 23, 2017 · 16 comments


@waltermi

In the ECN plugin, the "ecn.negotiation" condition is missing from some measurements when a large number of workers is running. Running fewer workers reduces the number of measurements with missing conditions. The problem also occurred with some self-made chains.

@irl
Member

irl commented Oct 25, 2017

This makes me think that flows are being dropped within PATHspider, or that packets are being dropped and never captured. Unless you can find a place in PATHspider where this is occurring when it shouldn't, this is not a bug. Just use a sensible number of workers.

@irl irl closed this as completed Oct 25, 2017
@britram britram changed the title Missing conditions with too many workers running Tests for observation loss and resource leakage Oct 27, 2017
@britram britram added this to the Release 2.0 _Argyroneta aquatica_ milestone Oct 27, 2017
@britram
Contributor

britram commented Oct 27, 2017

This framing may not have been the best way to ask the question, but it does seem that results are load-dependent, and that the load dependency itself depends on input size, which suggests resource leakage. We don't have a good baseline or calibration for giving people who are trying to use PATHspider guidance on what "a sensible number of workers" is, and we should have that if we're going to close issues on that basis.

So, I think what we need here is some idea of how many records go missing under which network, CPU, and memory conditions with what number of workers, which probably means profiling observer loss for a set of possible conditions on relatively constrained DO nodes.
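
As a starting point for that profiling, here's a rough sketch of how a run could be scored by comparing the target list against what actually lands in the results file. Not tested against the current output format: the NDJSON layout, the "dip"/"conditions" field names, and the CSV-ish target file are assumptions, so adjust to whatever PATHspider actually writes.

```python
#!/usr/bin/env python3
# Hypothetical loss-scoring helper: compares the input target list against
# the results file and reports how many targets produced no record at all,
# plus how many records came back with an empty conditions list.
# Field names ("dip", "conditions") are assumptions about the NDJSON output.

import json
import sys


def score_loss(targets_path, results_path):
    # Target lines may be plain addresses or CSV (address first); take field 0.
    with open(targets_path) as f:
        targets = {line.split(',')[0].strip() for line in f if line.strip()}

    seen = set()
    no_conditions = 0
    with open(results_path) as f:
        for line in f:
            record = json.loads(line)
            seen.add(record.get("dip"))
            if not record.get("conditions"):
                no_conditions += 1

    missing = targets - seen
    print(f"targets: {len(targets)}, results: {len(seen)}, "
          f"missing entirely: {len(missing)}, empty conditions: {no_conditions}")


if __name__ == "__main__":
    score_loss(sys.argv[1], sys.argv[2])
```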

@britram britram reopened this Oct 27, 2017
@britram
Contributor

britram commented Oct 27, 2017

(I may have cycles for this in December, but not before.)

@mirjak
Contributor

mirjak commented Oct 27, 2017

Agree, we should probably add some more logging/tracking capabilities here as well...

@irl
Member

irl commented Oct 27, 2017

Logging/tracking is #155 probably. @mirjak if you have specific requests for logging, please add them there. Otherwise we can just go nuts adding logging (if we have a decent logging setup, logging should be near-zero cost).

@irl
Member

irl commented Oct 27, 2017

Let's say the criterion for closing this is running benchmarks and documenting sensible worker counts for different setups? (Also making sure our defaults are not way off for most users.)

@britram
Contributor

britram commented Oct 27, 2017

SGTM. I'd suggest running these on a smallish DO box (~2 GB RAM) since this is one of the places we'd like it to run. We should also explicitly check short runs (100-1000 targets) against long runs (1,000,000 targets), since the latter will exhibit any resource leaks (@waltermi has reported this behavior, but it's unclear whether it's just in the traceroute branch at this point).

@britram
Contributor

britram commented Oct 27, 2017

n.b. if this is targeting 2.0 and assigned to me, it will cause 2.0 to drop late.

@irl irl modified the milestones: Release 2.0 _Argyroneta aquatica_, Release 2.1 _Latrodectus hasselti_ Oct 27, 2017
@mirjak
Contributor

mirjak commented Oct 30, 2017

I didn't check what's logged so far, but knowing that there was a test but no observation would be really helpful here.

@irl
Member

irl commented Oct 30, 2017

@mirjak: Does #184 look good for that?

@mirjak
Contributor

mirjak commented Oct 30, 2017

Yes, however, I'm not sure whether not_observed should be a condition or just something to log... thanks!

@britram
Contributor

britram commented Oct 30, 2017 via email

@britram
Contributor

britram commented Oct 30, 2017 via email

@irl
Member

irl commented Oct 31, 2017

The not_observed condition is now included in the ECN plugin, which should make some automated benchmarking possible, as we just need to analyse the output for unobserved flows.
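
The automated check could be as simple as the sketch below; the exact condition string ("pathspider.not_observed") and the NDJSON "conditions" field are assumptions, so adjust to whatever #184 actually emits.

```python
# Rough sketch: count how many result records carry the not_observed condition.
# The condition string and field name are assumptions, not the confirmed output.

import json


def count_unobserved(results_path, marker="pathspider.not_observed"):
    total = unobserved = 0
    with open(results_path) as f:
        for line in f:
            record = json.loads(line)
            total += 1
            if any(marker in c for c in record.get("conditions", [])):
                unobserved += 1
    return total, unobserved


total, unobserved = count_unobserved("/tmp/results")
print(f"{unobserved}/{total} flows unobserved")
```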

@irl
Member

irl commented Jan 10, 2018

pspdr measure -w X -i eth0 ecn --connect tcp < /tmp/targetsp > /tmp/results

Rough numbers:

For 2320 jobs on a DO 2 GB instance, 20 workers gives 20 losses; with more than 30 workers, all but around 300 are lost.

It looks like DO has changed their networking, as you now get a NATed IPv4 address, which may be impacting performance. This could also be the Meltdown/Spectre microcode updates killing the CPU, as I've seen with other cloud providers.

With the same task on my desktop (2x Intel [email protected], 24GB RAM, DSL connection), 2321 jobs with 100 workers only gives me 23 losses.
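
If anyone wants to reproduce these numbers elsewhere, a rough sweep around the command above could look like the sketch below. Paths, the interface name, the worker counts, and line-per-record output are assumptions; "losses" here just means targets that produced no result record at all, not flows tagged not_observed.

```python
# Sketch of a worker-count sweep over `pspdr measure ... ecn --connect tcp`.
# File paths and interface are placeholders; adjust for the machine under test.

import subprocess


def run_sweep(targets="/tmp/targets", iface="eth0",
              worker_counts=(10, 20, 30, 50, 100)):
    with open(targets) as f:
        n_targets = sum(1 for line in f if line.strip())

    for w in worker_counts:
        out = f"/tmp/results-{w}"
        with open(targets) as stdin, open(out, "w") as stdout:
            subprocess.run(
                ["pspdr", "measure", "-w", str(w), "-i", iface,
                 "ecn", "--connect", "tcp"],
                stdin=stdin, stdout=stdout, check=True)
        with open(out) as f:
            n_results = sum(1 for line in f if line.strip())
        print(f"workers={w}: {n_targets - n_results} of {n_targets} targets lost")


run_sweep()
```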

@irl
Member

irl commented Aug 26, 2018

This is a duplicate of #198

@irl irl unassigned britram Dec 7, 2018