Redesign katip-elasticsearch internals #72

Open
MichaelXavier opened this issue Oct 11, 2017 · 11 comments

@MichaelXavier
Collaborator

Discussed a bit in other channels. Tagging in @bitemyapp

katip-elasticsearch is extremely naive and has some bad UX to boot. Right now it has a configurable pool of worker threads. They pull from a TBMQueue and then do individual indexDocument calls. This is not ideal because:

  1. The user already configures a queue size for the scribe overall. It's confusing to set that and then also be limited by this seemingly superfluous second queue with its own independent size.
  2. indexDocument is slow and hard on the elasticsearch server. It should be bulk indexing.

I think the right way to go is:

  1. Try to drop this concept of a worker pool. You won't need to beat up your ES server from multiple threads if you're actually bulk indexing.
  2. See if we can get away with no worker threads in general. The scribe itself is an independent thread.
  3. I feel like we should accumulate logs and flush every X seconds or Y messages (configurable, with defaults), or when the scribe is finalizing. That probably means we could use a faster/more appropriate data structure than TBMQueue. The main challenge, I think, is the X seconds part: we'd need to spin off an Async for flushing, but I'm not quite seeing how best to design it to flush when full or after a certain amount of inactivity. A rough sketch of one possible shape is below.
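Roughly, I'm picturing something like the sketch below. It's only a sketch, not katip-elasticsearch code: BufferedScribe, startBufferedScribe, and the flushBulk argument are made-up names, and the actual bulk call is left abstract.

-- Accumulate payloads and flush when the batch fills up or the logger goes
-- quiet for a while, instead of one indexDocument call per message.
import           Control.Concurrent.Async (Async, async)
import           Control.Concurrent.STM
import           Control.Monad            (forever)
import           Data.Aeson               (Value)
import           System.Timeout           (timeout)

data BufferedScribe = BufferedScribe
  { bsQueue  :: TBQueue Value  -- bounded buffer the scribe writes into
  , bsWorker :: Async ()       -- a single flusher thread instead of a worker pool
  }

startBufferedScribe
  :: Int                 -- Y: flush after this many messages
  -> Int                 -- X: flush after this much inactivity, in microseconds
  -> ([Value] -> IO ())  -- bulk flush action (eventually a Bloodhound bulk call)
  -> IO BufferedScribe
startBufferedScribe maxItems maxWaitMicros flushBulk = do
  q <- newTBQueueIO (fromIntegral maxItems)
  worker <- async $ forever $ do
    -- Block until at least one item arrives, then keep draining until the
    -- batch is full or the queue has been quiet for maxWaitMicros.
    first <- atomically (readTBQueue q)
    rest  <- drainWithTimeout q (maxItems - 1) maxWaitMicros
    flushBulk (first : rest)
  pure (BufferedScribe q worker)

-- Pull up to n more items, waiting at most micros for each one before giving
-- up and flushing what we already have.
drainWithTimeout :: TBQueue Value -> Int -> Int -> IO [Value]
drainWithTimeout q n micros = go n
  where
    go 0 = pure []
    go k = do
      mv <- timeout micros (atomically (readTBQueue q))
      case mv of
        Nothing -> pure []
        Just v  -> (v :) <$> go (k - 1)

Finalizing would then be: stop accepting writes, let the worker drain and do one last flush, and only then return.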
@bitemyapp
Contributor

I'm taking a crack at this but don't let it stop you investigating separately.

@bitemyapp
Contributor

bitemyapp commented Oct 18, 2017

Here's an initial benchmark for the current state of it:

benchmarking ES logging/log 10 messages
time                 26.03 ms   (22.81 ms .. 29.18 ms)
                     0.964 R²   (0.935 R² .. 0.991 R²)
mean                 31.28 ms   (29.77 ms .. 32.62 ms)
std dev              3.121 ms   (2.288 ms .. 4.148 ms)
variance introduced by outliers: 40% (moderately inflated)
benchmarking ES logging/log 100 messages
time                 243.4 ms   (223.9 ms .. 262.9 ms)
                     0.998 R²   (0.989 R² .. 1.000 R²)
mean                 250.5 ms   (246.6 ms .. 257.3 ms)
std dev              6.501 ms   (125.5 μs .. 7.910 ms)
variance introduced by outliers: 16% (moderately inflated)
benchmarking ES logging/log 10000 Messages
time                 2.741 s    (2.221 s .. 3.874 s)
                     0.980 R²   (0.965 R² .. 1.000 R²)
mean                 2.464 s    (2.318 s .. 2.594 s)
std dev              212.0 ms   (0.0 s .. 224.6 ms)
variance introduced by outliers: 22% (moderately inflated)

However this includes the creation of the http manager, scribe, and log environment. Will amend in a moment. I'm also going to knock down the final volume from 10,000 to 1,000.
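For comparison, keeping that setup out of the measured part with criterion would look something like the sketch below. It's only a sketch: mkEsLogEnv and logMessages' are stand-ins, not necessarily what my branch does.

import Control.Monad  (void)
import Criterion.Main (bench, bgroup, defaultMain, whnfIO)
import Katip          (LogEnv, closeScribes)

main :: IO ()
main = do
  -- Build the http manager, ES scribe, and LogEnv once, outside the benchmarks.
  le <- mkEsLogEnv
  defaultMain
    [ bgroup "ES logging"
        [ bench "log 10 messages"   (whnfIO (logMessages' le 10))
        , bench "log 100 messages"  (whnfIO (logMessages' le 100))
        , bench "log 1000 messages" (whnfIO (logMessages' le 1000))
        ]
    ]
  -- Flush whatever is still queued once all the benchmarks are done.
  void (closeScribes le)

-- Stand-ins for the real setup and logging code.
mkEsLogEnv :: IO LogEnv
mkEsLogEnv = error "create the manager, ES scribe, and LogEnv here"

logMessages' :: LogEnv -> Int -> IO ()
logMessages' _le _n = error "log n messages with runKatipT here"

The closeScribes at the end is what flushes anything still sitting in the queue.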

Update: I'm getting the wrong number of documents for the changed version that doesn't spin up the scribe for each benchmark. Here's my branch: https://github.com/bitemyapp/katip/tree/bitemyapp/bulk-elasticsearch-scribe

@MichaelXavier any ideas what I might be doing wrong?

@bitemyapp
Contributor

Benchmark katip-elasticsearch-bench: RUNNING...
benchmarking ES logging/log 10 messages
time                 1.291 ms   (1.266 ms .. 1.339 ms)
                     0.988 R²   (0.981 R² .. 0.994 R²)
mean                 1.247 ms   (1.211 ms .. 1.292 ms)
std dev              129.5 μs   (95.14 μs .. 170.0 μs)
variance introduced by outliers: 73% (severely inflated)
benchmarking ES logging/log 100 messages
time                 14.13 ms   (13.34 ms .. 15.05 ms)
                     0.972 R²   (0.951 R² .. 0.987 R²)
mean                 11.48 ms   (10.24 ms .. 12.33 ms)
std dev              2.520 ms   (1.640 ms .. 3.559 ms)
variance introduced by outliers: 85% (severely inflated)
benchmarking ES logging/log 1000 Messages
time                 123.4 ms   (120.1 ms .. 128.7 ms)
                     0.998 R²   (0.995 R² .. 1.000 R²)
mean                 122.9 ms   (121.5 ms .. 124.6 ms)
std dev              2.157 ms   (1.538 ms .. 2.879 ms)
variance introduced by outliers: 11% (moderately inflated)

I'm getting output like that, but my document count only increases by a couple of documents, even after forcing an index refresh.

@MichaelXavier
Collaborator Author

Hmm, normally if docs aren't showing up I'd assume it was due to not closing scribes, but if I'm following your bench correctly it looks like you're doing that. Although I wonder if you're doing it too eagerly: replicateM_ repeats $ bracket mkLogEnv closeScribes would open and close/flush the scribe 10, 100, or 1000 times, which I'm not sure you want. Don't you want bracket mkLogEnv closeScribes $ \le -> replicateM_ repeats $ ...? Not only does opening and closing every iteration seem like it would thrash your performance badly, it also seems like it would keep performance from shining once you move to bulk inserts.
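Spelled out, the two shapes I mean (sketch only; mkLogEnv and logOnce stand in for whatever the bench actually uses):

import Control.Exception (bracket)
import Control.Monad     (replicateM_)
import Katip             (LogEnv, closeScribes)

-- Every iteration pays for opening the LogEnv and then closing/flushing the scribe.
openClosePerIteration :: IO LogEnv -> (LogEnv -> IO ()) -> Int -> IO ()
openClosePerIteration mkLogEnv logOnce repeats =
  replicateM_ repeats (bracket mkLogEnv closeScribes logOnce)

-- Open once, log `repeats` times, close/flush once at the very end.
openOnce :: IO LogEnv -> (LogEnv -> IO ()) -> Int -> IO ()
openOnce mkLogEnv logOnce repeats =
  bracket mkLogEnv closeScribes $ \le -> replicateM_ repeats (logOnce le)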

Back to the question of doc counts not increasing as expected: this is still strange. I'd expect that every time you close the scribe, it would delay for a bit to shuttle any remaining log messages through to the end. The ES scribe should shut down, and then the thread managing that scribe should shut down. My next step would probably be to toss some traces into the ES scribe and see what's actually happening, restricting it to the smallest test size that exhibits the behavior.

@bitemyapp
Contributor

Don't you want bracket mkLogEnv closeScribes $ \le -> replicateM_ repeats $ ...?

I did that before and it also exhibited weird behavior. I'll try to narrow it down.

@bitemyapp
Contributor

-- Assumes the usual imports: Control.Exception (bracket), Control.Monad (forM_),
-- qualified Data.Text as T, and Katip.
logMessages :: IO LogEnv -> Int -> IO ()
logMessages mkLogEnv repeats =
  -- Open the LogEnv once, log `repeats` messages, then close/flush the scribes.
  bracket mkLogEnv closeScribes $ \le ->
    forM_ [1 .. repeats] $ \i -> runKatipT le $
      logMsg "ns" InfoS ("This goes to elasticsearch: " <> logStr (T.pack (show i)))

With that code, I seem to only be getting the first benchmark's 10 documents in a single batch or something.

[screenshot from 2017-10-19 14-42-54]

@bitemyapp
Contributor

@MichaelXavier Fixed it. I was running closeScribes in the bracket, but the scribe was only being created once. Derp.
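Roughly, one way to line those up (there are others, like only closing once at the very end) is to have the bracketed setup build a fresh LogEnv and scribe every time, so closeScribes only ever closes a scribe that that run created. A sketch, with made-up names and the katip 0.5-era registerScribe signature, which may differ in other versions:

{-# LANGUAGE OverloadedStrings #-}
import Katip (LogEnv, Scribe, defaultScribeSettings, initLogEnv, registerScribe)

-- Build a brand new LogEnv with its own ES scribe, so each
-- bracket mkFreshLogEnv closeScribes run owns the scribe it closes.
mkFreshLogEnv :: IO Scribe -> IO LogEnv
mkFreshLogEnv makeEsScribe = do
  scribe <- makeEsScribe                 -- e.g. the katip-elasticsearch scribe
  le     <- initLogEnv "bench" "test"
  registerScribe "es" scribe defaultScribeSettings le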

@bitemyapp
Contributor

After a single run I got:

benchmarking ES logging/log 10 messages
time                 162.1 μs   (157.6 μs .. 166.5 μs)
                     0.991 R²   (0.987 R² .. 0.994 R²)
mean                 152.7 μs   (148.7 μs .. 156.4 μs)
std dev              12.53 μs   (10.51 μs .. 15.39 μs)
variance introduced by outliers: 73% (severely inflated)
benchmarking ES logging/log 100 messages
time                 228.8 μs   (226.8 μs .. 231.2 μs)
                     0.996 R²   (0.994 R² .. 0.998 R²)
mean                 221.6 μs   (216.1 μs .. 226.3 μs)
std dev              16.62 μs   (12.86 μs .. 22.17 μs)
variance introduced by outliers: 68% (severely inflated)
benchmarking ES logging/log 1000 Messages
time                 988.9 μs   (970.0 μs .. 1.013 ms)
                     0.985 R²   (0.958 R² .. 0.996 R²)
mean                 1.041 ms   (1.017 ms .. 1.092 ms)
std dev              118.4 μs   (63.82 μs .. 202.4 μs)
variance introduced by outliers: 78% (severely inflated)

That came with a total of 6,035 docs inserted, which shouldn't be possible even with Criterion re-runs, because 5 documents is not a whole insertion run. But it's closer.

@bitemyapp
Contributor

bitemyapp commented Oct 19, 2017

[screenshot from 2017-10-19 15-02-30]

[screenshot from 2017-10-19 15-02-25]

I'm assuming the load shedding is kicking in at this point.

Update: increasing the queue size means that closeScribes sits for a long time after the benchmarks finish, draining a few hundred to a thousand docs per second. It also means the benchmarks themselves don't mean a lot if I'm closing the scribes afterward.

@bitemyapp
Contributor

benchmark katip-elasticsearch-bench: RUNNING...
benchmarking ES logging/log 10 messages
time                 24.24 ms   (22.09 ms .. 25.80 ms)
                     0.985 R²   (0.977 R² .. 0.996 R²)
mean                 22.60 ms   (22.08 ms .. 23.24 ms)
std dev              1.334 ms   (1.004 ms .. 1.830 ms)
variance introduced by outliers: 24% (moderately inflated)
benchmarking ES logging/log 100 messages
time                 247.6 ms   (216.6 ms .. 284.2 ms)
                     0.986 R²   (0.977 R² .. 1.000 R²)
mean                 226.3 ms   (217.3 ms .. 240.8 ms)
std dev              15.10 ms   (5.074 ms .. 21.72 ms)
variance introduced by outliers: 15% (moderately inflated)
benchmarking ES logging/log 1000 Messages
time                 2.595 s    (2.379 s .. 2.819 s)
                     0.999 R²   (0.996 R² .. 1.000 R²)
mean                 2.488 s    (2.446 s .. 2.520 s)
std dev              47.20 ms   (0.0 s .. 53.82 ms)
variance introduced by outliers: 19% (moderately inflated)
Doc inserted count was: 21220

[screenshot from 2017-10-19 15-14-41]

I think the benchmark is good to go now. I'll get started on the bulk API version.
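For the record, the bulk call with Bloodhound looks roughly like the sketch below (names and signatures are from memory of the 0.x API and may differ between versions and the V1/V5 modules); it's more or less what would fill the flush slot in the batching sketch further up.

import qualified Data.Vector as V
import           Data.Aeson  (Value)
import           Database.Bloodhound

-- Turn a batch of (DocId, payload) pairs into a single _bulk request instead
-- of one indexDocument round trip per message.
flushBulk :: BHEnv -> IndexName -> MappingName -> [(DocId, Value)] -> IO ()
flushBulk bhEnv ixName mapping docs = do
  let ops = V.fromList [ BulkIndex ixName mapping did doc | (did, doc) <- docs ]
  _reply <- runBH bhEnv (bulk ops)
  pure ()

Batch up the queued messages and hand them to something like that, and the 1000-message case should collapse into a handful of HTTP requests.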

@MichaelXavier
Collaborator Author

Nice! I've got to imagine that the 1000 messages benchmark is going to get demolished.
