
Mark partition as busy when a new batch is sent to it #281

Open
wants to merge 10 commits into master

Conversation

isra17
Contributor

@isra17 isra17 commented May 29, 2017

I happened to stumble onto a bug where the DBWorker keeps sending requests to the spider until the backend queue is empty. It seems to go as follows:

  1. DBWorker sets the spider's partition as ready.
  2. DBWorker sends a new batch; feed.counter = 256.
  3. Spider receives the new batch and sends its new offset; spider.offset = 256.
  4. DBWorker receives the offset; since spider.offset <= feed.counter, it keeps the partition marked as ready.
  5. Spider is busy scraping.
  6. DBWorker sends a new batch to the spider's partition; feed.counter = 512.
  7. DBWorker keeps sending new batches; feed.counter = 1024.
  8. The spider finally has room for new requests, downloads the next requests and sends its new offset; spider.offset = 512.
  9. DBWorker now sets the partition as busy, but by that time the lag between the spider offset and the feed counter can be huge.

I guess crawling slowly makes this worse, since a single batch can take a few minutes to process, giving the DBWorker plenty of time to overload the feed.
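To make the race clearer, here is a minimal sketch (hypothetical names, not the actual Frontera code) of the readiness check the steps above describe: the partition counts as ready whenever the spider offset is not ahead of the feed counter, so the lag between the two is never bounded.

class NaivePartitionState(object):
    """Illustrative only -- not the real Frontera implementation."""

    def __init__(self):
        self.feed_counter = 0    # requests the DBWorker has pushed to the partition
        self.spider_offset = 0   # requests the spider has acknowledged so far

    def on_batch_sent(self, size):
        self.feed_counter += size          # partition stays "ready" regardless of lag

    def on_spider_offset(self, offset):
        self.spider_offset = offset

    def is_ready(self):
        # Ready as long as the spider has not reported more than was sent,
        # i.e. almost always: the lag (feed_counter - spider_offset) is ignored.
        return self.spider_offset <= self.feed_counter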

My fix here is to mark a partition as busy as soon as messages are sent to it. This way, the worker waits for an update of the spider offset before marking the partition as ready again if needed. This should work well with a bigger MAX_NEXT_REQUESTS value on the worker, to ensure the queue is never empty.
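Compared with the sketch above, a rough sketch of the behaviour this PR aims for (simplified, hypothetical names; the threshold value is an assumption, not the exact code in the diff): sending a batch marks the partition busy, and only a spider offset update showing the spider has caught up marks it ready again.

MAX_NEXT_REQUESTS = 512  # assumed worker-side setting used as the allowed lag


class BusyOnSendPartitionState(object):
    """Simplified sketch of the proposed fix, not the code in this diff."""

    def __init__(self):
        self.feed_counter = 0
        self.spider_offset = 0
        self.ready = True

    def on_batch_sent(self, size):
        self.feed_counter += size
        self.ready = False               # busy until the spider reports back

    def on_spider_offset(self, offset):
        self.spider_offset = offset
        # Ready again only once the unconsumed backlog is small enough.
        self.ready = (self.feed_counter - offset) <= MAX_NEXT_REQUESTS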

PS: Is there any IRC channel where the maintainers and other Frontera users hang out? I tried #scrapy, but it didn't seem like the right place for Frontera discussions.

This prevents a spider from being overwhelmed by requests when it takes too long to process a batch. The DB Worker will be able to send a new batch once a crawler asks for more requests and sends the offset message that updates its state on the DB Worker side.
@codecov-io

codecov-io commented May 29, 2017

Codecov Report

Merging #281 into master will decrease coverage by 0.12%.
The diff coverage is 48%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #281      +/-   ##
==========================================
- Coverage   70.16%   70.04%   -0.13%     
==========================================
  Files          68       68              
  Lines        4720     4723       +3     
  Branches      632      635       +3     
==========================================
- Hits         3312     3308       -4     
- Misses       1272     1279       +7     
  Partials      136      136
Impacted Files Coverage Δ
frontera/worker/db.py 63.63% <100%> (+0.45%) ⬆️
frontera/core/messagebus.py 67.3% <100%> (+0.64%) ⬆️
frontera/contrib/messagebus/zeromq/__init__.py 80.11% <43.47%> (-4.23%) ⬇️
frontera/contrib/backends/hbase.py 70.55% <0%> (-0.76%) ⬇️
frontera/__init__.py
...apy_recording/scrapy_recording/spiders/__init__.py 100% <0%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@sibiryakov
Member

@isra17 thanks for the contribution!
We don't have an IRC or other chat channel, because there isn't that much demand.

I'm not sure I understand what the problem is:

  • the loss of some requests generated by the DBW between steps 7 and 8?
    or
  • an incorrect partition status being set because of a wrong sequence/timing of the offset exchange?

A.

@isra17
Contributor Author

isra17 commented May 30, 2017

The issue is that until the spider finishes its current batch, the DBWorker will just keep sending new ones. In my case, the DBWorker has time to flush the entire backend queue into the message bus before the spider has the opportunity to mark itself as busy. This gets annoying when the spider ends up with a few hours' worth of work waiting in the message bus.

@sibiryakov
Member

sibiryakov commented Jun 7, 2017

The issue is that until the spider finishes its current batch, the DBWorker will just keep sending new ones. In my case, the DBWorker has time to flush the entire backend queue into the message bus before the spider has the opportunity to mark itself as busy.

The idea behind the code you're trying to modify is that the DBW always sends some amount of requests in advance. When there are pauses between batches, needed for passing the states (ready -> busy -> ready) and for waiting for a batch to finish (when a batch is 95% done, the spider is mostly idle, waiting for its longest request), the crawling speed decreases. With some requests always available in the queue, the spider has a chance to get new requests whenever there is space in its internal queue.
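As a rough illustration of that lookahead idea (hypothetical names and cap, not Frontera's actual API): the worker keeps sending batches as long as the unconsumed backlog per partition stays below some bound, so the spider's internal queue never runs dry while waiting for the ready/busy handshake.

def should_send_batch(feed_counter, spider_offset, max_in_flight=2 * 256):
    """Send another batch while the per-partition backlog stays under a cap."""
    in_flight = feed_counter - spider_offset   # requests sent but not yet consumed
    return in_flight < max_in_flight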

This gets annoying when the spider ends up with a few hours' worth of work waiting in the message bus.

I don't understand this. Is the spider waiting because a) messages with batches were lost in ZMQ, or b) the busy status was set incorrectly and didn't change for a long time, even when the spider was already ready?

This is a pretty tough topic to discuss asynchronously/remotely, so please contact me on Skype (alexander.sibiryakov) so we can save some time.

A.

@isra17
Contributor Author

isra17 commented Jul 7, 2017

@sibiryakov I refactored the PR to keep track of the offset, as discussed on Skype. Let me know if anything is missing.

Member

@sibiryakov sibiryakov left a comment


looks good, but needs some work

.travis.yml Outdated
@@ -3,6 +3,7 @@ python: 2.7
 branches:
   only:
   - master
+  - busy-partitions
Member


this probably isn't needed

    def mark_busy(self, partition_id):
        self.ready_partitions.discard(partition_id)

    def set_spider_offset(self, partition_id, offset):
        self.partitions_offset[partition_id] = offset
Member


this variable isn't defined

Contributor Author

@isra17 isra17 Jul 11, 2017


Ha, good catch. Fixed.

        self.ready_partitions.discard(partition_id)

        partitions = []
        for partition_id, last_offset in self.partitions_offset.items():
Member


Here, if a partition doesn't exist (yet) in this dict, it will not be returned as available, which is wrong.
What if the producer offsets end up being lower than the consumer offsets (because of a DB worker restart)?

Contributor Author


Line 86 does create the keys for each partition. In the worst case, a new partition will first send an offset message, which creates its key in partitions_offset. As for the negative offset (producer counter lower than the consumer offset), I don't think anything will break. From my understanding, when a DBWorker restarts, a spider will have a big stale offset, but the partition should still be marked as ready, and the spider's next offset message will fix it.
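A hypothetical sketch of that restart case (illustrative names and threshold, not the code in this PR): after a DB worker restart its counter starts from zero, so the stored spider offset can temporarily exceed it; treating that case as ready lets the spider's next offset message bring the two numbers back in sync.

def is_partition_ready(feed_counter, spider_offset, max_lag=256):
    if spider_offset > feed_counter:
        # Stale offset from before the DB worker restart; don't block the partition.
        return True
    return (feed_counter - spider_offset) <= max_lag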

Member


Ok!
