Mark partition as busy when a new batch is sent to it #281
base: master
Conversation
This prevents a spider from being overwhelmed by requests when it takes too long to process a batch. The DB Worker will be able to send a new batch once a crawler requests more requests and sends the offset message updating its state to the DB Worker.
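For context, a minimal sketch of that flow in Python; only the names mark_busy, set_spider_offset, ready_partitions and partitions_offset come from the diff reviewed below, everything else is assumed and is not Frontera's actual DB worker code.

class PartitionState(object):
    def __init__(self, partition_ids):
        # partitions currently allowed to receive a batch
        self.ready_partitions = set(partition_ids)
        # last offset each spider reported for its partition
        self.partitions_offset = dict((pid, 0) for pid in partition_ids)

    def mark_busy(self, partition_id):
        # called right after the DB worker pushes a batch to this partition,
        # so no further batches are produced for it in the meantime
        self.ready_partitions.discard(partition_id)

    def set_spider_offset(self, partition_id, offset):
        # called when the spider's offset message arrives; the partition can
        # now be considered for a new batch again
        self.partitions_offset[partition_id] = offset
        self.ready_partitions.add(partition_id)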
Codecov Report
@@            Coverage Diff             @@
##           master     #281      +/-   ##
==========================================
- Coverage   70.16%   70.04%   -0.13%
==========================================
  Files          68       68
  Lines        4720     4723       +3
  Branches      632      635       +3
==========================================
- Hits         3312     3308       -4
- Misses       1272     1279       +7
  Partials      136      136
Continue to review full report at Codecov.
@isra17 thanks for the contribution! I'm not sure I understand what the problem is.
A.
The issue is that until the Spider finishes its current batch, the DBWorker will just keep sending new ones. In my case, the DBWorker has time to flush the entire backend queue into the message bus before the spider has the opportunity to mark itself as busy. This gets annoying when the spider ends up with a few hours' worth of work waiting in the message bus.
The idea behind the code you're trying to modify is that the DBW always sends some amount of requests in advance. When there are pauses between batches, required for passing the states (ready -> busy -> ready) and for waiting for a batch to finish (when a batch is 95% finished the spider is mostly idle, waiting for its longest request), the crawling speed decreases. With some amount of requests always available in the queue, the spider has a chance to get requests whenever there is space in its internal queue.
I don't understand this. Is the spider waiting because a) messages with batches were lost in ZMQ, or b) the busy status was incorrectly set and wasn't changing for a long time, even when the spider was already ready? This is a pretty tough topic to discuss asynchronously/remotely, so please contact me on Skype (alexander.sibiryakov) so we can save some time. A.
@sibiryakov I refactored the PR to keep track of the offset as discussed on Skype. Let me know if anything is missing.
looks good, but needs some work
.travis.yml (Outdated)
@@ -3,6 +3,7 @@ python: 2.7
 branches:
   only:
     - master
+    - busy-partitions
this probably isn't needed
def mark_busy(self, partition_id):
    self.ready_partitions.discard(partition_id)

def set_spider_offset(self, partition_id, offset):
    self.partitions_offset[partition_id] = offset
this variable isn't defined
Ha, good catch, fixed.
self.ready_partitions.discard(partition_id)

partitions = []
for partition_id, last_offset in self.partitions_offset.items():
Here, if a partition doesn't exist (yet) in this dict, it will not be returned as available, which is wrong.
What if the producer offsets become smaller than the consumer offsets (because of a DB worker restart)?
Line 86 does create the keys for each partition. In the worst case, a new partition will first send an offset message and create a key in partitions_offset. As for the negative offset, I don't think anything will break? From my understanding, when a DBWorker restarts, a spider will have a big invalid offset, but it should be marked as ready, and its next message will fix it.
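A hedged sketch of the check under discussion, built around the partitions_offset loop quoted above; the function name and the feed_offsets argument are assumptions, not the actual Frontera code.

def get_ready_partitions(self, feed_offsets):
    # feed_offsets: partition_id -> offset of the last message the DB worker
    # produced for that partition (hypothetical; the real code keeps this state elsewhere)
    partitions = []
    for partition_id, last_offset in self.partitions_offset.items():
        # a partition missing from partitions_offset is never returned, which is
        # the reviewer's concern; per the reply, keys are pre-created for each partition.
        # After a DB worker restart the produced offset can be smaller than the
        # spider's reported offset, so this check still passes and the next
        # offset message from the spider corrects the state.
        if last_offset >= feed_offsets.get(partition_id, 0):
            partitions.append(partition_id)
    return partitions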
Ok!
I happened to stumble onto this bug where the DBWorker keeps sending requests to the spider until the queue is empty. It seems to go as follows:

- feed.counter = 256
- spider.offset = 256
- spider.offset <= feed.counter, so the partition is kept as ready
- feed.counter = 512
- feed.counter = 1024
- spider.offset = 512

I guess crawling slowly makes this worse, since a single batch can take a few minutes to process, leaving the DBWorker some time to overload the feed.
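A toy reproduction of that runaway behaviour in plain Python; feed_counter and spider_offset mirror the trace above, and none of this is Frontera code.

BATCH_SIZE = 256
feed_counter = 0     # messages produced by the DB worker
spider_offset = 0    # last offset the spider managed to report back

def partition_is_ready():
    # the behaviour traced above: the partition counts as ready as long as the
    # spider's reported offset does not exceed what the worker already produced
    return spider_offset <= feed_counter

# the spider is still busy with its first batch, so spider_offset never advances
for _ in range(4):
    if partition_is_ready():
        feed_counter += BATCH_SIZE   # the worker keeps sending batches

print(feed_counter, spider_offset)   # e.g. 1024 0 -> the feed gets flooded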
My fix here is to mark a partition as busy when messages are sent to it. This way, the worker will wait for an update of the spider offset before marking the partition as ready again, if needed. This should work well with a bigger MAX_NEXT_REQUESTS value on the worker to ensure the queue is never empty.

PS: Is there any IRC channel where the maintainers and other Frontera users are hanging out? I tried #scrapy but it didn't seem to be the right place for Frontera discussions.
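Not from the thread, but for concreteness: MAX_NEXT_REQUESTS is a standard Frontera setting, and the value below is purely illustrative.

# settings.py sketch for the DB worker side (illustrative value only): with
# partitions marked busy after each send, a larger batch helps keep the
# spider's queue from draining while it waits for the next one.
MAX_NEXT_REQUESTS = 512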