Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add flags to control telemetry behaviors #25849

Merged
merged 4 commits into from
Jan 18, 2025

Conversation

waynr
Copy link
Contributor

@waynr waynr commented Jan 16, 2025

This PR is meant to address #25813 by providing two new options for the influxdb3 serve subcommand:

  • --disable-telemetry-upload
    • prevents the server from pushing any of the telemetry it collects to any telemetry endpoint
  • --telemetry-endpoint
    • allows the telemetry upload endpoint to be specified.
    • mostly included for the sake of CLI integration tests to validate the default TelemetryStore initialization behavior, ie so we can assert on the "Initializing TelemetryStore with upload enabled for {telemetry_endpoint}." debug log message without accidentally pushing to the real telemetry server in CI

I also considered including some kind of Rust feature flag that would be off by default but which we could enable for release builds; this feature flag would give us local dev builds of the server with telemetry uploads disabled for local testing to prevent unknowingly pushing bogus telemetry data, but I wanted to get the team's input before putting that effort in.

I think this hypothetical off-by-default feature flag is useful in addition to the CLI flag because the latter requires us to explicitly do something to turn telemetry off in dev environments whereas the former doesn't.

@waynr waynr requested a review from a team January 16, 2025 21:47
@waynr waynr force-pushed the feat/add-flag-to-disable-telemetry branch from a99774b to 3753cff Compare January 16, 2025 21:52
Copy link
Member

@pauldix pauldix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, but I'll wait for Praveen to look. I'm ok with having it disabled on features, as long as we're sure any actual public builds and Docker images has it enabled.

@@ -97,3 +97,4 @@ test_helpers.workspace = true
tonic.workspace = true
tower.workspace = true
test-log.workspace = true
predicates = "3.1.3"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this? We generally want all our dependencies in the workspace.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a library used by the assert_cmd crate to assert on CLI stdout with code that looks like this (snipped from this PR, there are other examples where the inverse logic is applied with the .not() method):

    // validate we get a debug message indicating upload disabled
    AssertCmd::cargo_bin("influxdb3")
        .unwrap()
        .args(serve_args)
        .arg("-vv")
        .arg("--disable-telemetry-upload")
        .timeout(std::time::Duration::from_millis(200))
        .assert()
        .failure()
        .stdout(predicates::str::contains(expected_disabled));

I added it using cargo add --dev -p ... and didn't noticed that it wasn't added to the workspace -- I'll do that now!

@waynr
Copy link
Contributor Author

waynr commented Jan 16, 2025

I'm ok with having it disabled on features, as long as we're sure any actual public builds and Docker images has it enabled.

Okay, I'll take a look at adding that after I get back from a late lunch!

@waynr waynr force-pushed the feat/add-flag-to-disable-telemetry branch 2 times, most recently from 9dbf454 to f6728fc Compare January 17, 2025 01:10
@waynr
Copy link
Contributor Author

waynr commented Jan 17, 2025

Okay I've added the telemetry opt-in feature flag and verified that it works as expected with a local build:

% cargo build --workspace --features "influxdb3/release_default"
   Compiling influxdb3 v0.1.0 (/home/wayne/projects/github.com/influxdata/influxdb/influxdb3)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 5.60s

% ./target/debug/influxdb3 serve -vv --host-id="meowmeowmeow" --http-bind='0.0.0.0:8086' --object-store=file --data-dir $PWD/tmp-data-dir_no-license_1
2025-01-17T01:07:41.498496Z  INFO influxdb3::commands::serve: InfluxDB 3 Core server starting writer_id=meowmeowmeow git_hash=v2.5.0-14272-gb8a94488b5693503731ec01c3e5b5a47255df516-dirty version=0.1.0 uuid=01f21a8d-6bef-4cf4-9b84-33c443c8d7eb num_cpus=12
2025-01-17T01:07:41.498551Z DEBUG influxdb3::commands::serve: build configuration build_malloc_conf=
2025-01-17T01:07:41.498760Z  INFO influxdb3_clap_blocks::object_store: Object Store db_dir="/home/wayne/projects/github.com/influxdata/influxdb/tmp-data-dir_no-license_1" object_store_type="Directory"
2025-01-17T01:07:41.499037Z  INFO influxdb3::commands::serve: Creating shared query executor num_threads=12
2025-01-17T01:07:41.502096Z DEBUG datafusion_execution::memory_pool::pool: Created new GreedyMemoryPool(pool_size=8589934592)
2025-01-17T01:07:41.523637Z  INFO influxdb3::commands::serve: catalog initialized instance_id="eda38d95-df71-4127-8383-dc757ff037d7"
2025-01-17T01:07:41.525658Z DEBUG influxdb3_wal::object_store: >>> replaying
2025-01-17T01:07:41.526019Z  INFO influxdb3::commands::serve: setting up background mem check for query buffer
2025-01-17T01:07:41.526079Z DEBUG influxdb3::commands::serve: setting up background buffer checker mem_threshold_bytes=23298950349
2025-01-17T01:07:41.526174Z  INFO influxdb3::commands::serve: setting up telemetry store
2025-01-17T01:07:41.526266Z DEBUG influxdb3::commands::serve: Initializing TelemetryStore with upload enabled for https://telemetry.v3.influxdata.com.

And without the feature flag:

% cargo build --workspace
   Compiling influxdb3_telemetry v0.1.0 (/home/wayne/projects/github.com/influxdata/influxdb/influxdb3_telemetry)
   Compiling influxdb3_wal v0.1.0 (/home/wayne/projects/github.com/influxdata/influxdb/influxdb3_wal)
   Compiling influxdb3_catalog v0.1.0 (/home/wayne/projects/github.com/influxdata/influxdb/influxdb3_catalog)
   Compiling influxdb3_py_api v0.1.0 (/home/wayne/projects/github.com/influxdata/influxdb/influxdb3_py_api)
   Compiling influxdb3_cache v0.1.0 (/home/wayne/projects/github.com/influxdata/influxdb/influxdb3_cache)
   Compiling influxdb3_write v0.1.0 (/home/wayne/projects/github.com/influxdata/influxdb/influxdb3_write)
   Compiling influxdb3_processing_engine v0.1.0 (/home/wayne/projects/github.com/influxdata/influxdb/influxdb3_processing_engine)
   Compiling influxdb3_server v0.1.0 (/home/wayne/projects/github.com/influxdata/influxdb/influxdb3_server)
   Compiling influxdb3 v0.1.0 (/home/wayne/projects/github.com/influxdata/influxdb/influxdb3)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 17.86s

% ./target/debug/influxdb3 serve -vv --host-id="meowmeowmeow" --http-bind='0.0.0.0:8086' --object-store=file --data-dir $PWD/tmp-data-dir_no-license_1
2025-01-17T01:16:15.027988Z  INFO influxdb3::commands::serve: InfluxDB 3 Core server starting writer_id=meowmeowmeow git_hash=v2.5.0-14272-gb8a94488b5693503731ec01c3e5b5a47255df516-dirty version=0.1.0 uuid=5cbabe73-d85a-4388-837a-762f482707a8 num_cpus=12
2025-01-17T01:16:15.028067Z DEBUG influxdb3::commands::serve: build configuration build_malloc_conf=
2025-01-17T01:16:15.028386Z  INFO influxdb3_clap_blocks::object_store: Object Store db_dir="/home/wayne/projects/github.com/influxdata/influxdb/tmp-data-dir_no-license_1" object_store_type="Directory"
2025-01-17T01:16:15.028856Z  INFO influxdb3::commands::serve: Creating shared query executor num_threads=12
2025-01-17T01:16:15.032499Z DEBUG datafusion_execution::memory_pool::pool: Created new GreedyMemoryPool(pool_size=8589934592)
2025-01-17T01:16:15.056524Z  INFO influxdb3::commands::serve: catalog initialized instance_id="eda38d95-df71-4127-8383-dc757ff037d7"
2025-01-17T01:16:15.058756Z DEBUG influxdb3_wal::object_store: >>> replaying
2025-01-17T01:16:15.059306Z  INFO influxdb3::commands::serve: setting up background mem check for query buffer
2025-01-17T01:16:15.059405Z DEBUG influxdb3::commands::serve: setting up background buffer checker mem_threshold_bytes=23298950349
2025-01-17T01:16:15.059554Z  INFO influxdb3::commands::serve: setting up telemetry store
2025-01-17T01:16:15.059701Z DEBUG influxdb3::commands::serve: Initializing TelemetryStore with upload disabled.

@waynr
Copy link
Contributor Author

waynr commented Jan 17, 2025

Hmmm, maybe it's too confusing to use both CLI options and build flags for this since we would have to enable the feature flag for the CLI options to have any effect 🤔 I think I'll go ahead and remove the feature flag.

@waynr waynr force-pushed the feat/add-flag-to-disable-telemetry branch from f6728fc to 8e8d446 Compare January 17, 2025 01:20
Copy link
Contributor

@praveen-influx praveen-influx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall 👍 - just left couple of clarifications queries

.arg("--disable-telemetry-upload")
.timeout(std::time::Duration::from_millis(500))
.assert()
.failure()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it failure() because process starts and hangs around till 500 millis and then shuts down? Just wondered if it could've used the TestServer to spin up the server and got hold of the stdout for your assertions. If you'd already looked into it and it didn't work for you that's fine, but it has the mechanism built in to kill the process after the test without timeout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it failure() because process starts and hangs around till 500 millis and then shuts down?

I added the failure() call just to be clear in this test case that we expect a non-zero exit code when the process is killed using the timeout method. It's not strictly necessary for this test case (we only really care about checking the stdout), just me trying to be explicit.

Just wondered if it could've used the TestServer to spin up the server and got hold of the stdout for your assertions.

I did look through the TestSever code a little, and it looked to me like it was intended more for API-level integration testing than what I'm doing here, which is asserting on the CLI's stdout. Looking at it again, it's still not obvious how to get the stdout from it without modifying it non-trivially to preserve stdout.

it has the mechanism built in to kill the process after the test without timeout.

Yeah, I see the it has the mechanism built in to kill the process after the test without timeout using the kill method, but that would require additional boilerplate to have the test case sleep 500 ms before killing it (assuming we also added more code to preserve and retrieve the TestServer stdout).

To be clear, the timeout method isn't used to kill the process after the test -- the test happens after we've killed the process. We kill it after a constant 500 ms because that's what I thought would reliably be enough time to reach the stdout debug messages that are being asserted on (a too small timeout could cause this test to be flaky since we would be less likely to reach that debug message). I know it's not ideal asserting on stdout to test whether the telemetry store is running in but I think the only other option would be to actually set up a listening http server and point it at that, and test that something reached it from the TelemetryStore (or not as the test case might call for), but that seemed like it might no be worth the additional effort...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did think the test happens after killing the process as well with the timeout - just wasn't sure if there was easy wins in extending TestServer.

.timeout(std::time::Duration::from_millis(500))
.assert()
.failure()
.stdout(predicates::str::contains(expected_disabled).not());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using predicates is fine but I've been following the existing pattern of using assert_contains! and assert_not_contains! macros from test_helpers just to keep it consistent. If those macros are lacking in functionality for your use case, definitely keep predicates in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I now see the run* functions used elsewhere in this file and how they are using those assertions you mentioned. But those test cases seem to be running short-lived subcommands rather than the long-lived serve subcommand. I like the assert_cmd abstractions for their succinctness in a case like this -- particularly the timeout method to avoid extra boilerplate around waiting a fixed time before killing the process. Using predicates just seemed a natural extension of using the assert_cmd crate, which I noticed was already being used here for its cargo_bin command constructor. I'll look into using the other assertions though!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea - it'd have had to wait till server started fully. I think we already loop through in TestServer till it hits the last line to capture the address. So, you could just extend TestServer to hold all the lines it's seen during the startup and access it in the tests. When TestServer::spawn().await completes, that means server is fully up. Then you can just loop through and check the line you wanted to assert is there.

See if that works, if not go ahead and merge this PR - the point about TestServer is not really critical, I just wanted to check if you'd seen it and still thought it's not fit for your test.

Copy link
Contributor Author

@waynr waynr Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the value in adding more code to TestServer when all we need to accomplish the task at hand is std::process::Command? I see the specific value of TestServer where it's used -- it abstracts over a running server while executing API-driven test cases and displaying logs in the test case if TEST_LOG is set. That makes sense. But it's super heavy-handed when all that's necessary for these new test cases is std::process::Command. Is it just the assert_cmd helpers that you don't like? It would still be simpler to remove those (then add more code to TestServer that is), but the test case code becomes more verbose in doing so.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, it's the timeouts that stood out. The initial thought was TestServer already abstracts the server starting part already and it also hooks into server logs and then the assertions stood out as well. But as I said, both aren't critical they're part of test infra, we can ship it.

@@ -60,7 +60,7 @@ pub const DEFAULT_DATA_DIRECTORY_NAME: &str = ".influxdb3";
/// The default bind address for the HTTP API.
pub const DEFAULT_HTTP_BIND_ADDR: &str = "0.0.0.0:8181";

pub const DEFAULT_TELMETRY_ENDPOINT: &str = "https://telemetry.v3.influxdata.com";
pub const DEFAULT_TELEMETRY_ENDPOINT: &str = "https://telemetry.v3.influxdata.com";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@waynr waynr force-pushed the feat/add-flag-to-disable-telemetry branch from 8e8d446 to 3328494 Compare January 17, 2025 21:18
@praveen-influx praveen-influx self-requested a review January 17, 2025 23:12
@waynr waynr merged commit d800d8e into main Jan 18, 2025
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants