Redesign katip-elasticsearch internals #72

Open
MichaelXavier opened this issue Oct 11, 2017 · 11 comments

@MichaelXavier
Collaborator

Discussed a bit in other channels. Tagging in @bitemyapp

katip-elasticsearch is extremely naive and has some bad UX to boot. Right now it has a configurable pool of worker threads. They pull from a TBMQueue and then do individual indexDocument calls. This is not ideal because:

  1. The user already configures a queue size for the scribe overall. It's confusing to set that and then also be limited by this seemingly superfluous second queue with its own independent size.
  2. indexDocument is slow and hard on the elasticsearch server. It should be bulk indexing.

I think the right way to go is:

  1. Try to drop this concept of a worker pool. You won't need to beat up your ES server from multiple threads if you're actually bulk indexing.
  2. See if we can get away with no worker threads in general. The scribe itself is an independent thread.
  3. I feel like we should accumulate logs and flush every X seconds or Y messages (configurable, with defaults), or when the scribe is finalizing. That probably means we could use a faster/more appropriate data structure than TBMQueue. The main challenge, I think, is the X seconds part: we'd need to spin off an Async for flushing, but I'm not quite seeing how best to design it to flush when full or after a certain amount of inactivity. A rough sketch of one possible shape is below.
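Roughly, I'm picturing something like the sketch below. It's only a sketch, not katip-elasticsearch code: BufferedScribe, startBufferedScribe, and the flushBulk argument are made-up names, and the actual bulk call is left abstract.

-- Accumulate payloads and flush when the batch fills up or the logger goes
-- quiet for a while, instead of one indexDocument call per message.
import           Control.Concurrent.Async (Async, async)
import           Control.Concurrent.STM
import           Control.Monad            (forever)
import           Data.Aeson               (Value)
import           System.Timeout           (timeout)

data BufferedScribe = BufferedScribe
  { bsQueue  :: TBQueue Value  -- bounded buffer the scribe writes into
  , bsWorker :: Async ()       -- a single flusher thread instead of a worker pool
  }

startBufferedScribe
  :: Int                 -- Y: flush after this many messages
  -> Int                 -- X: flush after this much inactivity, in microseconds
  -> ([Value] -> IO ())  -- bulk flush action (eventually a Bloodhound bulk call)
  -> IO BufferedScribe
startBufferedScribe maxItems maxWaitMicros flushBulk = do
  q <- newTBQueueIO (fromIntegral maxItems)
  worker <- async $ forever $ do
    -- Block until at least one item arrives, then keep draining until the
    -- batch is full or the queue has been quiet for maxWaitMicros.
    first <- atomically (readTBQueue q)
    rest  <- drainWithTimeout q (maxItems - 1) maxWaitMicros
    flushBulk (first : rest)
  pure (BufferedScribe q worker)

-- Pull up to n more items, waiting at most micros for each one before giving
-- up and flushing what we already have.
drainWithTimeout :: TBQueue Value -> Int -> Int -> IO [Value]
drainWithTimeout q n micros = go n
  where
    go 0 = pure []
    go k = do
      mv <- timeout micros (atomically (readTBQueue q))
      case mv of
        Nothing -> pure []
        Just v  -> (v :) <$> go (k - 1)

Finalizing would then be: stop accepting writes, let the worker drain and do one last flush, and only then return.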
@bitemyapp
Contributor

I'm taking a crack at this but don't let it stop you investigating separately.

@bitemyapp
Contributor

bitemyapp commented Oct 18, 2017

Here's an initial benchmark for the current state of it:

benchmarking ES logging/log 10 messages
time                 26.03 ms   (22.81 ms .. 29.18 ms)
                     0.964 R²   (0.935 R² .. 0.991 R²)
mean                 31.28 ms   (29.77 ms .. 32.62 ms)
std dev              3.121 ms   (2.288 ms .. 4.148 ms)
variance introduced by outliers: 40% (moderately inflated)
benchmarking ES logging/log 100 messages
time                 243.4 ms   (223.9 ms .. 262.9 ms)
                     0.998 R²   (0.989 R² .. 1.000 R²)
mean                 250.5 ms   (246.6 ms .. 257.3 ms)
std dev              6.501 ms   (125.5 μs .. 7.910 ms)
variance introduced by outliers: 16% (moderately inflated)
benchmarking ES logging/log 10000 Messages
time                 2.741 s    (2.221 s .. 3.874 s)
                     0.980 R²   (0.965 R² .. 1.000 R²)
mean                 2.464 s    (2.318 s .. 2.594 s)
std dev              212.0 ms   (0.0 s .. 224.6 ms)
variance introduced by outliers: 22% (moderately inflated)

However this includes the creation of the http manager, scribe, and log environment. Will amend in a moment. I'm also going to knock down the final volume from 10,000 to 1,000.
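For comparison, keeping that setup out of the measured part with criterion would look something like the sketch below. It's only a sketch: mkEsLogEnv and logMessages' are stand-ins, not necessarily what my branch does.

import Control.Monad  (void)
import Criterion.Main (bench, bgroup, defaultMain, whnfIO)
import Katip          (LogEnv, closeScribes)

main :: IO ()
main = do
  -- Build the http manager, ES scribe, and LogEnv once, outside the benchmarks.
  le <- mkEsLogEnv
  defaultMain
    [ bgroup "ES logging"
        [ bench "log 10 messages"   (whnfIO (logMessages' le 10))
        , bench "log 100 messages"  (whnfIO (logMessages' le 100))
        , bench "log 1000 messages" (whnfIO (logMessages' le 1000))
        ]
    ]
  -- Flush whatever is still queued once all the benchmarks are done.
  void (closeScribes le)

-- Stand-ins for the real setup and logging code.
mkEsLogEnv :: IO LogEnv
mkEsLogEnv = error "create the manager, ES scribe, and LogEnv here"

logMessages' :: LogEnv -> Int -> IO ()
logMessages' _le _n = error "log n messages with runKatipT here"

The closeScribes at the end is what flushes anything still sitting in the queue.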

Update: I'm getting the wrong number of documents for the changed version that doesn't spin up the scribe for each benchmark. Here's my branch: https://github.com/bitemyapp/katip/tree/bitemyapp/bulk-elasticsearch-scribe

@MichaelXavier any ideas what I might be doing wrong?

@bitemyapp
Contributor

Benchmark katip-elasticsearch-bench: RUNNING...
benchmarking ES logging/log 10 messages
time                 1.291 ms   (1.266 ms .. 1.339 ms)
                     0.988 R²   (0.981 R² .. 0.994 R²)
mean                 1.247 ms   (1.211 ms .. 1.292 ms)
std dev              129.5 μs   (95.14 μs .. 170.0 μs)
variance introduced by outliers: 73% (severely inflated)
benchmarking ES logging/log 100 messages
time                 14.13 ms   (13.34 ms .. 15.05 ms)
                     0.972 R²   (0.951 R² .. 0.987 R²)
mean                 11.48 ms   (10.24 ms .. 12.33 ms)
std dev              2.520 ms   (1.640 ms .. 3.559 ms)
variance introduced by outliers: 85% (severely inflated)
benchmarking ES logging/log 1000 Messages
time                 123.4 ms   (120.1 ms .. 128.7 ms)
                     0.998 R²   (0.995 R² .. 1.000 R²)
mean                 122.9 ms   (121.5 ms .. 124.6 ms)
std dev              2.157 ms   (1.538 ms .. 2.879 ms)
variance introduced by outliers: 11% (moderately inflated)

I'm getting output like that, but my document count only increases by a couple of documents, even after forcing an index refresh.

@MichaelXavier
Collaborator Author

Hmm, normally if docs aren't showing up I'd assume it was due to not closing scribes, but if I'm following your bench correctly it looks like you're doing that. Although I wonder if you're doing it too eagerly: replicateM_ repeats $ bracket mkLogEnv closeScribes would open and close/flush the scribe 10, 100, or 1000 times, which I'm not sure you want. Don't you want bracket mkLogEnv closeScribes $ \le -> replicateM_ repeats $ ...? Not only does opening and closing every iteration seem like it would thrash your performance badly, it also seems like it would keep performance from shining once you move to bulk inserts.
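Spelled out, the two shapes I mean (sketch only; mkLogEnv and logOnce stand in for whatever the bench actually uses):

import Control.Exception (bracket)
import Control.Monad     (replicateM_)
import Katip             (LogEnv, closeScribes)

-- Every iteration pays for opening the LogEnv and then closing/flushing the scribe.
openClosePerIteration :: IO LogEnv -> (LogEnv -> IO ()) -> Int -> IO ()
openClosePerIteration mkLogEnv logOnce repeats =
  replicateM_ repeats (bracket mkLogEnv closeScribes logOnce)

-- Open once, log `repeats` times, close/flush once at the very end.
openOnce :: IO LogEnv -> (LogEnv -> IO ()) -> Int -> IO ()
openOnce mkLogEnv logOnce repeats =
  bracket mkLogEnv closeScribes $ \le -> replicateM_ repeats (logOnce le)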

Back to the question of doc counts not increasing as expected: this is still strange. I'd expect that every time you close the scribe, it would delay for a bit to shuttle any remaining log messages through to the end. The ES scribe should shut down, and then the thread managing that scribe should shut down. My next step would probably be to toss some traces into the ES scribe and see what's actually happening, restricting it to the smallest test size that exhibits the behavior.

@bitemyapp
Contributor

Don't you want bracket mkLogEnv closeScribes $ \le -> replicateM_ repeats $ ...?

I did that before and it also exhibited weird behavior. I'll try to narrow it down.

@bitemyapp
Contributor

-- Assumes the usual imports: Control.Exception (bracket), Control.Monad (forM_),
-- qualified Data.Text as T, and Katip.
logMessages :: IO LogEnv -> Int -> IO ()
logMessages mkLogEnv repeats =
  -- Open the LogEnv once, log `repeats` messages, then close/flush the scribes.
  bracket mkLogEnv closeScribes $ \le ->
    forM_ [1 .. repeats] $ \i -> runKatipT le $
      logMsg "ns" InfoS ("This goes to elasticsearch: " <> logStr (T.pack (show i)))

With that code, I seem to only be getting the first benchmark's 10 documents in a single batch or something.

[screenshot from 2017-10-19 14-42-54]

@bitemyapp
Contributor

@MichaelXavier Fixed it. I was running closeScribes in the bracket, but the scribe was only being created once. Derp.
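Roughly, one way to line those up (there are others, like only closing once at the very end) is to have the bracketed setup build a fresh LogEnv and scribe every time, so closeScribes only ever closes a scribe that that run created. A sketch, with made-up names and the katip 0.5-era registerScribe signature, which may differ in other versions:

{-# LANGUAGE OverloadedStrings #-}
import Katip (LogEnv, Scribe, defaultScribeSettings, initLogEnv, registerScribe)

-- Build a brand new LogEnv with its own ES scribe, so each
-- bracket mkFreshLogEnv closeScribes run owns the scribe it closes.
mkFreshLogEnv :: IO Scribe -> IO LogEnv
mkFreshLogEnv makeEsScribe = do
  scribe <- makeEsScribe                 -- e.g. the katip-elasticsearch scribe
  le     <- initLogEnv "bench" "test"
  registerScribe "es" scribe defaultScribeSettings le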

@bitemyapp
Contributor

After a single run I got:

benchmarking ES logging/log 10 messages
time                 162.1 μs   (157.6 μs .. 166.5 μs)
                     0.991 R²   (0.987 R² .. 0.994 R²)
mean                 152.7 μs   (148.7 μs .. 156.4 μs)
std dev              12.53 μs   (10.51 μs .. 15.39 μs)
variance introduced by outliers: 73% (severely inflated)
benchmarking ES logging/log 100 messages
time                 228.8 μs   (226.8 μs .. 231.2 μs)
                     0.996 R²   (0.994 R² .. 0.998 R²)
mean                 221.6 μs   (216.1 μs .. 226.3 μs)
std dev              16.62 μs   (12.86 μs .. 22.17 μs)
variance introduced by outliers: 68% (severely inflated)
benchmarking ES logging/log 1000 Messages
time                 988.9 μs   (970.0 μs .. 1.013 ms)
                     0.985 R²   (0.958 R² .. 0.996 R²)
mean                 1.041 ms   (1.017 ms .. 1.092 ms)
std dev              118.4 μs   (63.82 μs .. 202.4 μs)
variance introduced by outliers: 78% (severely inflated)

That came with a total of 6,035 docs inserted, which shouldn't be possible even with Criterion re-runs, because 5 documents is not a whole insertion run. But it's closer.

@bitemyapp
Contributor

bitemyapp commented Oct 19, 2017

[screenshot from 2017-10-19 15-02-30]

[screenshot from 2017-10-19 15-02-25]

I'm assuming the load shedding is kicking in at this point.

Update: increasing the queue size means that closeScribes sits for a long time after the benchmarks finish, draining a few hundred to a thousand docs per second. It also means the benchmarks themselves don't mean a lot if I'm closing the scribes afterward.

@bitemyapp
Contributor

benchmark katip-elasticsearch-bench: RUNNING...
benchmarking ES logging/log 10 messages
time                 24.24 ms   (22.09 ms .. 25.80 ms)
                     0.985 R²   (0.977 R² .. 0.996 R²)
mean                 22.60 ms   (22.08 ms .. 23.24 ms)
std dev              1.334 ms   (1.004 ms .. 1.830 ms)
variance introduced by outliers: 24% (moderately inflated)
benchmarking ES logging/log 100 messages
time                 247.6 ms   (216.6 ms .. 284.2 ms)
                     0.986 R²   (0.977 R² .. 1.000 R²)
mean                 226.3 ms   (217.3 ms .. 240.8 ms)
std dev              15.10 ms   (5.074 ms .. 21.72 ms)
variance introduced by outliers: 15% (moderately inflated)
benchmarking ES logging/log 1000 Messages
time                 2.595 s    (2.379 s .. 2.819 s)
                     0.999 R²   (0.996 R² .. 1.000 R²)
mean                 2.488 s    (2.446 s .. 2.520 s)
std dev              47.20 ms   (0.0 s .. 53.82 ms)
variance introduced by outliers: 19% (moderately inflated)
Doc inserted count was: 21220

[screenshot from 2017-10-19 15-14-41]

I think the benchmark is good to go now. I'll get started on the bulk API version.
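For the record, the bulk call with Bloodhound looks roughly like the sketch below (names and signatures are from memory of the 0.x API and may differ between versions and the V1/V5 modules); it's more or less what would fill the flush slot in the batching sketch further up.

import qualified Data.Vector as V
import           Data.Aeson  (Value)
import           Database.Bloodhound

-- Turn a batch of (DocId, payload) pairs into a single _bulk request instead
-- of one indexDocument round trip per message.
flushBulk :: BHEnv -> IndexName -> MappingName -> [(DocId, Value)] -> IO ()
flushBulk bhEnv ixName mapping docs = do
  let ops = V.fromList [ BulkIndex ixName mapping did doc | (did, doc) <- docs ]
  _reply <- runBH bhEnv (bulk ops)
  pure ()

Batch up the queued messages and hand them to something like that, and the 1000-message case should collapse into a handful of HTTP requests.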

@MichaelXavier
Collaborator Author

Nice! I've got to imagine that the 1000 messages benchmark is going to get demolished.
