Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update to hyper 1.0 #3371

Merged
merged 8 commits into from
Feb 4, 2025
Merged

Update to hyper 1.0 #3371

merged 8 commits into from
Feb 4, 2025

Conversation

jarhodes314
Copy link
Contributor

Proposed changes

This updates nearly all our usage of hyper to 1.0, including through axum. It also updates rustls to 0.23.

I've not tested it properly, although I'm not expecting massive breaking changes. There is currently 1 failing unit test that I will look into further. I believe the functionality is working as intended, but the check the test performs to clarify the error is of the expected type isn't working. Additionally there is some new flakiness in the tedge_mqtt_bridge tests that I would like to understand better (but I think if the issue is a real bug rather than a test issue, it boils down to "when we put the bridge under a frankly ridiculous amount of load, things go a bit wonky").

Currently we do still have a dependency on hyper 0.14 through rumqttd, although this only used for testing. I have updated rumqttc to the latest version using a git dependency (pinned to a commit hash to ensure we don't receive unexpected breaking changes).

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
  • Documentation Update (if none of the other choices apply)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue

Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA (in all commits with git commit -s)
  • I ran cargo fmt as mentioned in CODING_GUIDELINES
  • I used cargo clippy as mentioned in CODING_GUIDELINES
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

@reubenmiller
Copy link
Contributor

@jarhodes314 FYI: I think we're going to have to stick to the ring crypto backend as aws-lc-rs does not support all of our targets.

@jarhodes314
Copy link
Contributor Author

@jarhodes314 FYI: I think we're going to have to stick to the ring crypto backend as aws-lc-rs does not support all of our targets.

Annoyingly, part of the dependency on aws-lc-rs is through rumqttc. I've got it working locally, I'm just trying to patch it on a fork so I can create a PR against rumqttc to fix the issue properly. Once I've got a fork, I'll point the git dependency at that so this is at least unblocked for testing.

@jarhodes314
Copy link
Contributor Author

Now created bytebeamio/rumqtt#941, I'll update this PR with the forked dependency to unblock the build.

Copy link
Contributor

github-actions bot commented Jan 30, 2025

Robot Results

✅ Passed ❌ Failed ⏭️ Skipped Total Pass % ⏱️ Duration
560 0 3 560 100 1h33m54.914218s

Copy link
Contributor

@didier-wenzek didier-wenzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So many breaking API changes here and there :-( Thank you for tackling this.

About upgrading rumqttc, this PR is the way to go compared to #3359. Indeed, even if currently depending on fork of the main branch (bytebeamio/rumqtt#941), the changes are more future proof.


impl zeroize::Zeroize for PrivateKey {
fn zeroize(&mut self) {
self.0 .0.zeroize()
// FIXME I can't be implemented with rustls 0.23 due to immutability
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One can submit a PR on rustls_pki_types

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is related to security and a removal of functionality, we should create a ticket to track it (e.g. and create a PR as @didier-wenzek suggested) before merging so we don't forget about this.

Copy link
Contributor

@Bravo555 Bravo555 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good, but 3 things are left:

  • hyper 1.5 -> 1.6
  • can CryptoProvider::install_default() be removed?
  • most important: handling Zeroize, which we can no longer guarantee because CertificateDer and PrivateKeyDer are immutable. Making a PR to upstream is an option, but it could take some time. In the meantime, is zeroizing necessary so that we can't replace it with a noop? We're not really testing for it anywhere, so I'm not sure what exactly it provides us.

Comment on lines 8 to 9
use rustls::crypto;
let _ = crypto::CryptoProvider::install_default(crypto::ring::default_provider());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: why do we need this snippet in the tests?

When I remove it, the tests still work. It's a bit of a pain to always have to call this in tests (also because cryptography should be somewhat abstracted and caller shouldn't have to care about it, so we don't always necessarily know when we'll have to call into rustls).

So because it still works without the snippet, I reckon it either doesn't call into ServerConfig/ClientConfig builders or some other default is used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, having a closer look at this test, it seems like we're not using HTTPS at all (mockito doesn't support HTTPS as far as I can tell, and we certainly don't do anything to it that looks like it would enable HTTPS). @didier-wenzek can you elaborate on whether this lack of HTTPS in the test is an error or whether I'm just misunderstanding something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirm that there is no more test over HTTPS, as you said because mockito doesn't support HTTPS. There was a test against https://httpbin.org, but this has been removed because subject to flackiness. See #1985

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't get any test failures when commenting out crates/extensions/c8y_auth_proxy/src/server.rs:1250, so I doubt it's because HTTPS isn't used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect because we now only have ring enabled and not aws-lc-rs, rustls does this for us automatically, which would explain why I previously ran into errors and now don't.

hyper-rustls = { version = "0.24", default-features = false, features = [
"tokio-runtime",
hyper = { version = "1.5", default-features = false }
hyper-rustls = { version = "0.27.5", default-features = false, features = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: why 1.5? cargo check works for me with hyper 1.6

Suggested change
hyper-rustls = { version = "0.27.5", default-features = false, features = [
hyper = { version = "1.6", default-features = false }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite know, but we're already using 1.6 anyway according to the lockfile

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, my bad, indeed there's 1.6 in the lockfile, just my Dependi extension took the older 0.14 version used for compat with rumqttd and displayed me a "not up to date mark", so we're in the clear.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: http-body and hyper-util can be upgraded to latest patch version using cargo upgrade

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now done

Comment on lines 4760 to +4770
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c546c80d6be4bc6a00c0f01730c08df82eaa7a7a61f11d656526506112cc1709"
dependencies = [
"thiserror-impl",
"thiserror-impl 1.0.61",
]

[[package]]
name = "thiserror"
version = "2.0.11"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d452f284b73e6d76dd36758a0c8684b1d5be31f92b89d07fd5822175732206fc"
dependencies = [
"thiserror-impl 2.0.11",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: Seems that hyper & friends are using a new major release of thiserror. Our crates and some other dependencies still use 1.0, so 2 versions are pulled, but because thiserror basically does macros, it shouldn't have major codegen implications.

Copy link
Contributor

@didier-wenzek didier-wenzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay to merge this PR despite there are still a failing tests (acceptor_rejects_untrusted_client_certificates and bridge_disconnect_while_sending), because:

  • This PR is going in the correct direction, i.e. leveraging the HEAD of rumqttc to upgrade to rustls 0.23 despite conflicting constraints. Reverting to the next released version of rumqttc will be feasible with no major difficulties.
  • This PR spreads a lot of small but non-trivial changes across all the creates depending on TLS. It would be good to avoid conflicts.

Follow-up tasks:

@reubenmiller are you okay merging this PR soon?

match self {
Self::Small(bytes) => (bytes.clone().into(), Some(bytes)),
Self::Small(bytes) => (
SyncBody::new(Full::new(bytes.clone()).map_err(|i| match i {})),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get how matching the Infallible error can help to make an impl http_body::Body() from a Result.

What am I missing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because Infallible is an uninhabited type (i.e. the enum has no variants, so a value of the type can never exist), match i {} is valid and evaluates to !, so this can be cast to whatever implmentation is necessary (in this case axum::Error).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

match i {} evaluates to !

This is what I was missing.

@@ -4,7 +4,7 @@ test-threads = "num-cpus"

[[profile.default.overrides]]
# Remove once https://github.com/thin-edge/thin-edge.io/issues/3021 is resolved
filter = 'test(bridge_reconnects_successfully_after_local_connection_interrupted) or test(bridge_reconnects_successfully_after_cloud_connection_interrupted)'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When has been these two flaky tests resolved? If so it would be good to close #3021.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure, but I suspect it was with the reliability improvements you made in #3122. I've not seen the flakiness locally myself in cargo-nextest and I have seen quite a lot of flaky tests in recent weeks, I believe due to the strain it puts my work laptop under.

@jarhodes314
Copy link
Contributor Author

@didier-wenzek I don't quite understand this, you've linked to a PR against rumqttc but suggested this doesn't exist?

Fix the two tests that have been ignored here.

Yep, the bridge_disconnect one is the main new source of flakiness in the bridge tests. I will reduce the number of messages it sends because I think the issue is to do with rumqttd getting overloaded. I don't quite understand how (I presume a change in rumqttc may have changed how it sends messages:

Here's a passing example:

running 1 test
[2025-01-31T14:57:50Z INFO  tedge_mqtt_bridge::health] MQTT bridge connected to cloud broker
[2025-01-31T14:57:50Z INFO  tedge_mqtt_bridge] Bridge cloud connection subscribing to [Filter = s/ds, Qos = AtLeastOnce]
[2025-01-31T14:57:50Z INFO  tedge_mqtt_bridge::health] MQTT bridge connected to local broker
[2025-01-31T14:57:50Z INFO  tedge_mqtt_bridge] Bridge local connection subscribing to [Filter = c8y/s/us, Qos = AtLeastOnce]
[2025-01-31T14:57:50Z INFO  tedge_mqtt_bridge] Bridge local connection received ack for unknown pkid=2
[2025-01-31T14:57:51Z ERROR tedge_mqtt_bridge::health] MQTT bridge failed to connect to cloud broker: Mqtt state: Connection closed by peer abruptly
[2025-01-31T14:57:51Z INFO  tedge_mqtt_bridge::health] MQTT bridge connected to cloud broker
[2025-01-31T14:57:51Z INFO  tedge_mqtt_bridge] Bridge cloud connection subscribing to [Filter = s/ds, Qos = AtLeastOnce]
[2025-01-31T14:57:51Z INFO  tedge_mqtt_bridge] Bridge local connection received ack for unknown pkid=3
[2025-01-31T14:57:51Z INFO  tedge_mqtt_bridge] Bridge local connection received ack for unknown pkid=4
test bridge_disconnect_while_sending ... ok

And a failing example:

running 1 test
test bridge_disconnect_while_sending ... FAILED

failures:

---- bridge_disconnect_while_sending stdout ----
[2025-01-31T14:56:40Z INFO  tedge_mqtt_bridge::health] MQTT bridge connected to cloud broker
[2025-01-31T14:56:40Z INFO  tedge_mqtt_bridge] Bridge cloud connection subscribing to [Filter = s/ds, Qos = AtLeastOnce]
[2025-01-31T14:56:40Z INFO  tedge_mqtt_bridge::health] MQTT bridge connected to local broker
[2025-01-31T14:56:40Z INFO  tedge_mqtt_bridge] Bridge local connection subscribing to [Filter = c8y/s/us, Qos = AtLeastOnce]
[2025-01-31T14:56:40Z INFO  tedge_mqtt_bridge] Bridge local connection received ack for unknown pkid=2
[2025-01-31T14:56:40Z ERROR tedge_mqtt_bridge::health] MQTT bridge failed to connect to cloud broker: Mqtt state: Connection closed by peer abruptly
[2025-01-31T14:56:40Z INFO  tedge_mqtt_bridge::health] MQTT bridge connected to cloud broker
[2025-01-31T14:56:40Z INFO  tedge_mqtt_bridge] Bridge cloud connection subscribing to [Filter = s/ds, Qos = AtLeastOnce]
[2025-01-31T14:56:40Z ERROR tedge_mqtt_bridge::health] MQTT bridge failed to connect to local broker: Mqtt state: Connection closed by peer abruptly
[2025-01-31T14:56:40Z INFO  tedge_mqtt_bridge::health] MQTT bridge connected to local broker
[2025-01-31T14:56:40Z INFO  tedge_mqtt_bridge] Bridge local connection subscribing to [Filter = c8y/s/us, Qos = AtLeastOnce]
thread 'bridge_disconnect_while_sending' panicked at crates/extensions/tedge_mqtt_bridge/tests/bridge.rs:232:45:
called `Result::unwrap()` on an `Err` value: timed-out waiting for received message

Caused by:
    deadline has elapsed
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

The thing that sticks out to me is MQTT bridge failed to connect to local broker: Mqtt state: Connection closed by peer abruptly. The "cloud" version of this is intended, and is part of the test. The fact it happens for the "local" broker too indicates to me that rumqttd is the cause of this issue as the local broker should remain connected, and I'm fairly confident in that assertion.

This does lead to the question of why do we have a new issue with rumqttd, a dependency that hasn't updated. My suspicions are that the updated rumqttc has changed how it publishes the messages, and rumqttd isn't coping with this. It definitely seems to be related to the amount of load the system is under. If the test manages to publish the message and then the broker restarts/does something else that drops queued messages, this would explain the non-forwarding of messages. I'm trying to understand this further, but it's proving a bit difficult as enabling more logs changes the performance characteristics sufficiently to not cause the flakiness to appear.

The fact that it's (mostly, but not entirely) that test may also prove useful in understanding what's going on here. I believe I've seen extremely occasional issues (1 in some thousands of runs) with bridge_many_messages too, although that sends half the number of messages and the bridge connection remains in-tact.

@didier-wenzek
Copy link
Contributor

@didier-wenzek I don't quite understand this, you've linked to a PR against rumqttc but suggested this doesn't exist?

Oops, I failed to noticed that is was not a link to your fork.

@@ -31,7 +31,6 @@ use tracing::log;
static ALLOCATOR: Cap<alloc::System> = Cap::new(alloc::System, usize::MAX);

fn main() -> anyhow::Result<()> {
clap_complete::CompleteEnv::with_factory(TEdgeCli::command).complete();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, must've accidentally deleted this when removing the CryptoProvider stuff.

I've created #3373 to address the fact this went undetected.

@reubenmiller
Copy link
Contributor

I'm okay to merge this PR despite there are still a failing tests (acceptor_rejects_untrusted_client_certificates and bridge_disconnect_while_sending), because:

  • This PR is going in the correct direction, i.e. leveraging the HEAD of rumqttc to upgrade to rustls 0.23 despite conflicting constraints. Reverting to the next released version of rumqttc will be feasible with no major difficulties.
  • This PR spreads a lot of small but non-trivial changes across all the creates depending on TLS. It would be good to avoid conflicts.

Follow-up tasks:

@reubenmiller are you okay merging this PR soon?

In principle, yes I'm ok with this PR merging soon), we just need to make sure we create all of the required follow up tickets prior to merging (so we don't forget about it).

@reubenmiller reubenmiller added refactoring Developer value dependencies Pull requests that update a dependency file and removed refactoring Developer value labels Feb 3, 2025
@jarhodes314
Copy link
Contributor Author

Now created the follow up tickets and linked to the relevant one from the FIXME I created in the code re zeroize. I'll set to work on creating that PR today.

@jarhodes314 jarhodes314 requested review from didier-wenzek and removed request for rina23q February 3, 2025 11:46
Copy link
Contributor

@didier-wenzek didier-wenzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. Thank you for tackling this.

@jarhodes314 jarhodes314 added this pull request to the merge queue Feb 4, 2025
Merged via the queue into thin-edge:main with commit 073d9fb Feb 4, 2025
33 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies Pull requests that update a dependency file
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants