AWS SQS FIFO messages being processed out of order or in duplicate #78
Comments
Another test, now with the
Everything seems to be expected given the current implementation:
Btw, awesome job on the bug report. Everything was clear!
@josevalim thanks for the quick feedback. Glad you appreciated the somewhat extensive report 😄
Indeed I believe it would be. In my specific scenario it does make sense to guarantee processing order and, if a message fails for some reason, retry it after the visibility timeout, blocking the processing of other messages with the same group id. My latest checks tell me that I can achieve that by retrieving only a single message at a time, but that is definitely slower than working with a batch in memory and significantly increases the number of API requests.
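To make that concrete, the single-message variant boils down to capping the producer's receive size. A rough sketch of just the producer section, with a placeholder queue URL and assuming the usual broadway_sqs producer options:

```elixir
# Sketch only: pulling one message per ReceiveMessage call preserves per-group
# ordering when a message fails, at the cost of throughput and many more API calls.
[
  producer: [
    module:
      {BroadwaySQS.Producer,
       queue_url: "https://sqs.eu-west-1.amazonaws.com/000000000000/my-queue.fifo",
       max_number_of_messages: 1,
       wait_time_seconds: 10},
    concurrency: 1
  ]
]
```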
As for this point, it does indeed seem to work fine with partition_by set to match the message_group_id that the messages were assigned in SQS.
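Concretely, the idea is something along these lines (a sketch only; MyApp.SQSPartitioner is just an illustrative name, and the exact key under which the group id shows up in the metadata depends on which attributes the producer requests and how the SQS response is parsed):

```elixir
defmodule MyApp.SQSPartitioner do
  @moduledoc """
  Sketch of a partition_by function keyed on the SQS message group id, so that
  Broadway routes every message of a given group to the same processor.
  Wire it up with `partition_by: &MyApp.SQSPartitioner.by_group_id/1`.
  """

  def by_group_id(%Broadway.Message{metadata: metadata}) do
    group_id =
      case metadata do
        # Only present if the producer requests the attribute; the key name may
        # differ ("MessageGroupId" vs "message_group_id"), so inspect the
        # metadata to confirm.
        %{attributes: %{} = attrs} ->
          attrs["MessageGroupId"] || attrs["message_group_id"]

        _ ->
          nil
      end

    # partition_by expects a non-negative integer.
    :erlang.phash2(group_id)
  end
end
```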
Perfect. I am afraid, however, that we don’t have anyone with this particular need on our side, so I don’t see us implementing this any time soon. Also, keep in mind this choice can have a cascading effect on the system. SQS seems happy to keep sending messages for a message group even if previous ones were not yet acked. This means one failure will cause future messages to continue arriving, and all of them will then have to be failed and wait for the visibility timeout, potentially for several minutes. If SQS provides no control over this (such as a limit on the number of in-flight messages for a given partition), perhaps going down this route is not recommended after all.
My understanding of SQS FIFO is a bit different: I thought multiple consumers can receive messages, just not for the same group id during the visibility timeout period. If more than one message is received for the same group by a consumer, that process should ensure processing order in memory too. Once all the messages are acknowledged, another consumer can grab messages for that same group if more were enqueued in the meantime. So I do think SQS has control over the message flow for a given partition (message group id), or at least those were my conclusions from the docs and the PoCs I did when I chose to go with SQS FIFO for my use case, though I admit that I might have misunderstood or made some mistake while testing.
Totally get that. I will see what I can do for my specific use case. Who knows? If it makes sense, I might find some time to propose some changes that could help others with similar use cases 😉
Just sharing an AWS article that helped me understand this topic in the past, just in case:
I've been using broadway_sqs to consume AWS SQS FIFO queues and I noticed some unexpected behaviours when processing the messages, since sometimes they were processed out of order or more than once. Initially I didn't have the Broadway partition_by configured, and once I did, things seemed to improve, but I can still see some double processing and out-of-order processing occurring. For example, looking at the logs below – organized by process identifier to help readability – we can see that:
Before setting up the partition_by, the behaviour was even more awkward, with different consumers handling messages from the same message_group_id:
My understanding is that AWS SQS FIFO queues, using the message_group_id, should guarantee message order within the same message group identifier, and that once a message has been received, no other consumer can receive the same message during its visibility timeout.
I'll leave the code for my SQS producer here:
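A simplified sketch along those lines, with a placeholder module name, queue URL, and concurrency values rather than the real setup, looks like this:

```elixir
defmodule MyApp.FifoPipeline do
  # Illustrative sketch only: names and values are placeholders, not the actual
  # configuration behind the logs above.
  use Broadway

  alias Broadway.Message

  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      producer: [
        module:
          {BroadwaySQS.Producer,
           queue_url: "https://sqs.eu-west-1.amazonaws.com/000000000000/my-queue.fifo",
           # Request the group id attribute so it shows up in message metadata.
           attribute_names: [:message_group_id],
           wait_time_seconds: 10},
        concurrency: 1
      ],
      processors: [
        default: [concurrency: 4]
      ],
      # Keep all messages of the same group on the same processor.
      partition_by: &__MODULE__.partition_by_group/1
    )
  end

  def partition_by_group(%Message{metadata: metadata}) do
    group_id =
      case metadata do
        # Key name may differ depending on how the SQS response is parsed.
        %{attributes: %{} = attrs} -> attrs["MessageGroupId"] || attrs["message_group_id"]
        _ -> nil
      end

    :erlang.phash2(group_id)
  end

  @impl true
  def handle_message(_processor, %Message{} = message, _context) do
    # Business logic omitted.
    message
  end
end
```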
Am I misinterpreting the behaviour that should be expected? Has anyone experienced the same behaviour?