
[BUG] vMQTTAgentTask deadlock on re-subscribing after MQTT disconnection #81

Closed
grdSTM opened this issue Dec 15, 2022 · 4 comments
Labels
bug Something isn't working

Comments

@grdSTM
Contributor

grdSTM commented Dec 15, 2022

A user investigating vMQTTAgentTask behavior across MQTT broker disconnections ran into a deadlock in xLockSubCtx().

The test disconnections were triggered by manipulating MQTT_KEEP_ALIVE_TIMEOUT, i.e. temporarily disabling handleKeepAlive() in the client code so the broker drops the connection.
 
What happens is that:
- At power-on, the MQTT client connects to the broker and subscribes to update-accepted/rejected/delta topics.
- After the 1st broker disconnect-reconnect event, the topics are re-subscribed.
- After the 2nd broker disconnect-reconnect event, the topics are already present because the session is persistent, so no re-subscription is required.
 
At this point, xLockSubCtx() attempts to re-take a mutex that is already held and was never released, which leaves the vMQTTAgentTask deadlocked.

A quick fix in the code works around this situation.

See the patch they made over the v202205.00 version: https://github.com/FreeRTOS/iot-reference-stm32u5/compare/main...grdSTM:lab-iot-reference-stm32u5:resubDlWa?expand=1

As far as I understand, the house-keeping of subscription locks is intended to be handled by MQTTAgent_CancelAll() callbacks.

Edit: The issue was detected in a port relying on an alternative implementation of the MQTT transport interface. Compared with the Wi-Fi driver of the iot-reference-stm32u5 project, it is possible that pending TCP segments are handled differently when the TCP connection is closed.

Is the above patch the proper fix, or are there preferred alternatives?

Thanks!

@grdSTM grdSTM added the bug Something isn't working label Dec 15, 2022
@archigup
Member

archigup commented Dec 15, 2022

Hi, we're looking into this. Thanks for bringing this up!

@AniruddhaKanhere
Member

Hello @grdSTM,

Thanks a lot for taking the time to report the bug to us. We appreciate it :)

Trying to follow along, so let me know if I missed something or misunderstood.

In case of clean session (After the 1st broker disconnect-reconnect event), the mutex is relinquished here. After which the agent does its job in the command loop here.

After this, the broker is disconnected from the client and we reacquire the mutex here.
Following this, we restart from the top of the loop, where MQTT_Connect is called. This time, however, the clean session flag is set to false and we call prvHandleResubscribe here, which does NOT try to take the mutex and just relinquishes it in case of a failure.

I am not sure that I am seeing a deadlock here. Can you help me find my mistake? Did I miss something or overlook some case?

Thank you for reporting the bug to us.

Thanks,
Aniruddha

@gmarcolinosr

gmarcolinosr commented Feb 21, 2023

Hello @AniruddhaKanhere,

sorry to jump abruptly into the discussion; I stumbled into the deadlock issue some time ago and reported it to @grdSTM.
Referring to your comment, what I have seen is that after the 2nd disconnect-reconnect, the session-present flag is set to true because the session is persistent. That means prvHandleResubscribe() is not called.

The temporary solution we adopted is to explicitly handle the case (xSessionPresent == true) by unlocking the mutex; see the patch made over the v202205.00 version:

https://github.com/FreeRTOS/iot-reference-stm32u5/compare/main...grdSTM:lab-iot-reference-stm32u5:resubDlWa?expand=1
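For readers who don't want to open the diff, the shape of the workaround can be sketched roughly as follows. xLockSubCtx(), prvHandleResubscribe(), MQTT_Connect, and xSessionPresent appear in the discussion above; xUnlockSubCtx() and the surrounding loop structure are assumptions made for this sketch, and the actual patched code in the reference project differs:

```
/* Pseudocode sketch of the reconnect path in the agent task. */
xStatus = MQTT_Connect( &xMqttContext, &xConnectInfo,
                        NULL, ulTimeoutMs, &xSessionPresent );

if( xStatus == MQTTSuccess )
{
    if( xSessionPresent == true )
    {
        /* Broker kept the session: the old subscriptions are still valid,
         * so prvHandleResubscribe() is never called. The subscription-context
         * mutex taken around the disconnect must be released here, otherwise
         * the next xLockSubCtx() call deadlocks the agent task. */
        xUnlockSubCtx();
    }
    else
    {
        /* Clean session: re-subscribe. This path already releases the
         * mutex on completion or failure. */
        prvHandleResubscribe();
    }
}
```

The key point is simply that every path out of the reconnect step releases the mutex exactly once, including the session-present path that skips re-subscription.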

@AniruddhaKanhere
Member

The above PR was merged. Thank you for the contribution @grdSTM and @gmarcolinosr!

I shall be closing the issue now.
