Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bluetooth: hci_driver: Fix deadlock in MPSL workq #18953

Merged
merged 2 commits into from
Jan 13, 2025

Conversation

PavelVPV
Copy link
Contributor

@PavelVPV PavelVPV commented Nov 19, 2024

This PR fixes deadlock in MPSL workq caused by bt_buf_get_rx with
K_FOREVER.

Previously, calling bt_buf_get_rx with K_FOREVER could block
indefinitely if the requested pool had no available buffers. This
blocking affected the MPSL work queue, preventing it from processing
other work items, including critical timeslot events like
MPSL_TIMESLOT_SIGNAL_CANCELLED and MPSL_TIMESLOT_SIGNAL_BLOCKED.

The issue becomes more severe if a background flash operation coincides
with this scenario. Flash operations often execute on the system work
queue (sysworkq), which can delay host work items responsible for
freeing buffers in the RX pool. In such cases:

  • Flash operations stall, waiting for a timeslot event from the MPSL
    work queue.
  • The MPSL work queue remains blocked by bt_buf_get_rx.
  • This results in a deadlock, causing the flash operation to timeout and
    triggering a warning.

This PR modifies the HCI driver to call bt_buf_get_rx with
K_NO_WAIT. If no buffer is immediately available, it relies on a
callback to notify when a buffer is freed. This change ensures the MPSL
work queue remains unblocked, allowing other work items to execute while
waiting for a buffer.

@PavelVPV PavelVPV added DNM CI-all-test Run All integration tests labels Nov 19, 2024
@github-actions github-actions bot added manifest changelog-entry-required Update changelog before merge. Remove label if entry is not needed or already added. labels Nov 19, 2024
@NordicBuilder
Copy link
Contributor

NordicBuilder commented Nov 19, 2024

The following west manifest projects have changed revision in this Pull Request:

Name Old Revision New Revision Diff

All manifest checks OK

Note: This message is automatically posted and updated by the Manifest GitHub Action.

@NordicBuilder
Copy link
Contributor

NordicBuilder commented Nov 19, 2024

CI Information

To view the history of this post, clich the 'edited' button above
Build number: 6

Inputs:

Sources:

sdk-nrf: PR head: e434180985eba1d76383ab3ddfaacce90ebfc0f8

more details

sdk-nrf:

PR head: e434180985eba1d76383ab3ddfaacce90ebfc0f8
merge base: 658662845a8e4d4cb677800c8d36acfac9176c90
target head (main): 0e43ec203ca7d66c6e4e0a07da76a1ea9f0e99b3
Diff

Github labels

Enabled Name Description
ci-disabled Disable the ci execution
ci-all-test Run all of ci, no test spec filtering will be done
ci-force-downstream Force execution of downstream even if twister fails
ci-run-twister Force run twister
ci-run-zephyr-twister Force run zephyr twister
List of changed files detected by CI (1)
subsys
│  ├── bluetooth
│  │  ├── controller
│  │  │  │ hci_driver.c

Outputs:

Toolchain

Version: 11349092be
Build docker image: docker-dtr.nordicsemi.no/sw-production/ncs-build:11349092be_912848a074

Test Spec & Results: ✅ Success; ❌ Failure; 🟠 Queued; 🟡 Progress; ◻️ Skipped; ⚠️ Quarantine

  • ◻️ Toolchain - Skipped: existing toolchain is used
  • ✅ Build twister - Skipped: Skipping Build & Test as it succeeded in a previous run: 5
  • ✅ Integration tests
    • ✅ test-sdk-audio
    • ✅ desktop52_verification - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ test-fw-nrfconnect-boot
    • ✅ test-fw-nrfconnect-apps - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ test_ble_nrf_config - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ test-fw-nrfconnect-ble_mesh - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ test-fw-nrfconnect-ble_samples
    • ✅ test-fw-nrfconnect-chip - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ test-fw-nrfconnect-nfc - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ test-fw-nrfconnect-nrf-iot_libmodem-nrf - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ test-fw-nrfconnect-nrf-iot_serial_lte_modem - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ test-fw-nrfconnect-nrf-iot_zephyr_lwm2m - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ test-fw-nrfconnect-nrf-iot_samples - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ doc-internal - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ test-fw-nrfconnect-nrf-iot_thingy91 - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ test-fw-nrfconnect-nrf_crypto - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ test-fw-nrfconnect-rpc - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ test-fw-nrfconnect-rs - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ test-fw-nrfconnect-fem - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ test-fw-nrfconnect-tfm - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ test-fw-nrfconnect-thread - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ test-fw-nrfconnect-zigbee - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ test-sdk-find-my - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ test-fw-nrfconnect-nrf-iot_mosh - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ test-fw-nrfconnect-nrf-iot_positioning - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ test-sdk-sidewalk - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ test-sdk-wifi - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ test-low-level - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ test-fw-nrfconnect-nrf-iot_nrf_provisioning - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ test-sdk-pmic-samples - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ test-sdk-mcuboot - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ test-sdk-dfu - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ test-fw-nrfconnect-ps - Skipped: Job was skipped as it succeeded in a previous run
    • ✅ test-secdom-samples-public - Skipped: Job was skipped as it succeeded in a previous run

Note: This message is automatically posted and updated by the CI

@PavelVPV PavelVPV force-pushed the hci_driver_async_buf_get branch 2 times, most recently from a53bc48 to 6fea7f3 Compare November 20, 2024 06:42
@NordicBuilder
Copy link
Contributor

You can find the documentation preview for this PR at this link. It will be updated about 10 minutes after the documentation build succeeds.

Note: This comment is automatically posted by the Documentation Publish GitHub Action.

@PavelVPV PavelVPV changed the title DNM: bluetooth: hci_driver: Fix deadlock in MPSL workq bluetooth: hci_driver: Fix deadlock in MPSL workq Nov 20, 2024
@PavelVPV PavelVPV added the DNM label Nov 20, 2024
@PavelVPV
Copy link
Contributor Author

DNM until upstream PR is merged and downstreamed: zephyrproject-rtos/zephyr#81646

@@ -15,7 +15,6 @@ config DESKTOP_BT
select BT_SETTINGS
select BT_SIGNING
select BT_SMP
imply BT_CONN_TX_NOTIFY_WQ if SOC_FLASH_NRF_RADIO_SYNC_MPSL
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was removed for testing as the fix makes BT_CONN_TX_NOTIFY_WQ feature unnecessary. It will be reverted once upstream PR is merged.

west.yml Outdated
@@ -68,8 +70,9 @@ manifest:
# https://developer.nordicsemi.com/nRF_Connect_SDK/doc/latest/zephyr/introduction/index.html
# https://developer.nordicsemi.com/nRF_Connect_SDK/doc/latest/zephyr/guides/modules.html
- name: zephyr
remote: pavelvpv
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added for testing with the upstream fix. Will be reverted.

@PavelVPV PavelVPV marked this pull request as ready for review November 20, 2024 09:48
@PavelVPV PavelVPV requested review from a team as code owners November 20, 2024 09:48
@PavelVPV
Copy link
Contributor Author

Moved from the draft state as no more work is needed for the change and would like to get feedback on the change.

Copy link
Contributor

@rugeGerritsen rugeGerritsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this nice change, and also for taking the time to describe how and what problem it solves.

I believe we should add a changelog entry and update known issues. I suggest to do that in a separate PR.

subsys/bluetooth/controller/hci_driver.c Outdated Show resolved Hide resolved
subsys/bluetooth/controller/hci_driver.c Show resolved Hide resolved
subsys/bluetooth/controller/hci_driver.c Outdated Show resolved Hide resolved
subsys/bluetooth/controller/hci_driver.c Show resolved Hide resolved
Copy link

This pull request has been marked as stale because it has been open (more than) 30 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this pull request will automatically be closed in 7 days. Note, that you can always re-open a closed pull request at any time.

@github-actions github-actions bot added the Stale label Dec 28, 2024
@github-actions github-actions bot closed this Jan 4, 2025
@NordicBuilder
Copy link
Contributor

NordicBuilder commented Jan 7, 2025

Memory footprint analysis revealed the following potential issues

sample.matter.template.debug[nrf7002dk/nrf5340/cpuapp]: High ROM usage: 911938[B] - link (cc: @kkasperczyk-no @ArekBalysNordic @markaj-nordic)
sample.matter.template.release[nrf7002dk/nrf5340/cpuapp]: High ROM usage: 820770[B] - link (cc: @kkasperczyk-no @ArekBalysNordic @markaj-nordic)

Note: This message is automatically posted and updated by the CI (latest/sdk-nrf/PR-18953/5)

@PavelVPV PavelVPV force-pushed the hci_driver_async_buf_get branch from df0447e to 56d0b96 Compare January 7, 2025 15:03
@github-actions github-actions bot removed the Stale label Jan 8, 2025
@PavelVPV PavelVPV force-pushed the hci_driver_async_buf_get branch from 56d0b96 to 552ef10 Compare January 13, 2025 07:23
This commit fixes deadlock in MPSL workq caused by bt_buf_get_rx with
K_FOREVER.

Previously, calling bt_buf_get_rx with K_FOREVER could block
indefinitely if the requested pool had no available buffers. This
blocking affected the MPSL work queue, preventing it from processing
other work items, including critical timeslot events like
MPSL_TIMESLOT_SIGNAL_CANCELLED and MPSL_TIMESLOT_SIGNAL_BLOCKED.

The issue becomes more severe if a background flash operation coincides
with this scenario. Flash operations often execute on the system work
queue (sysworkq), which can delay host work items responsible for
freeing buffers in the RX pool. In such cases:

- Flash operations stall, waiting for a timeslot event from the MPSL
  work queue.
- The MPSL work queue remains blocked by bt_buf_get_rx.
- This results in a deadlock, causing the flash operation to timeout and
  triggering a warning.

This commit modifies the HCI driver to call bt_buf_get_rx with
K_NO_WAIT. If no buffer is immediately available, it relies on a
callback to notify when a buffer is freed. This change ensures the MPSL
work queue remains unblocked, allowing other work items to execute while
waiting for a buffer.

Signed-off-by: Pavel Vasilyev <[email protected]>
After replacing K_FOREVER with K_NO_WAIT when requesting a buffer from
the host pool, it became possible to call `hci_driver_receive_process`
directly from the SDC callback instead of going through
`receive_signal_raise` as `hci_driver_receive_process` is now
non-blocking.

Signed-off-by: Pavel Vasilyev <[email protected]>
@PavelVPV PavelVPV force-pushed the hci_driver_async_buf_get branch from 552ef10 to e434180 Compare January 13, 2025 08:35
@rugeGerritsen rugeGerritsen merged commit aec6383 into nrfconnect:main Jan 13, 2025
12 checks passed
@alwa-nordic
Copy link
Contributor

Approved. I'm not sure whether or not the second commit e434180 is a good idea or not, but I'm fine with it. I assume it's done for performance.

It does raise the complexity slightly, as there are now multiple stack usage scenarios when invoking hci_driver_receive_process. This could make it harder to determine the needed stack sizes.

I assume the SDC callback is also on the MPSL work thread, so there is still only one thread that can call into hci_driver_receive_process.

PavelVPV added a commit to PavelVPV/sdk-nrf that referenced this pull request Jan 15, 2025
Add a note about fixed issue where a flash operation executed on the
system workqueue might result in `-ETIMEDOUT` if there is an active
Bluetooth LE connection.

Ref: nrfconnect#18953

Signed-off-by: Pavel Vasilyev <[email protected]>
PavelVPV added a commit to PavelVPV/sdk-nrf that referenced this pull request Jan 15, 2025
Add a note about fixed issue where a flash operation executed on the
system workqueue might result in `-ETIMEDOUT` if there is an active
Bluetooth LE connection.

Ref: nrfconnect#18953

Signed-off-by: Pavel Vasilyev <[email protected]>
nordicjm pushed a commit that referenced this pull request Jan 15, 2025
Add a note about fixed issue where a flash operation executed on the
system workqueue might result in `-ETIMEDOUT` if there is an active
Bluetooth LE connection.

Ref: #18953

Signed-off-by: Pavel Vasilyev <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
changelog-entry-required Update changelog before merge. Remove label if entry is not needed or already added. CI-all-test Run All integration tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants