feat: add flags to control telemetry behaviors #25849

waynr · 2025-01-16T21:46:04Z

This PR is meant to address #25813 by providing two new options for the influxdb3 serve subcommand:

--disable-telemetry-upload
- prevents the server from pushing any of the telemetry it collects to any telemetry endpoint
--telemetry-endpoint
- allows the telemetry upload endpoint to be specified.
- mostly included for the sake of CLI integration tests to validate the default TelemetryStore initialization behavior, ie so we can assert on the "Initializing TelemetryStore with upload enabled for {telemetry_endpoint}." debug log message without accidentally pushing to the real telemetry server in CI

I also considered including some kind of Rust feature flag that would be off by default but which we could enable for release builds; this feature flag would give us local dev builds of the server with telemetry uploads disabled for local testing to prevent unknowingly pushing bogus telemetry data, but I wanted to get the team's input before putting that effort in.

I think this hypothetical off-by-default feature flag is useful in addition to the CLI flag because the latter requires us to explicitly do something to turn telemetry off in dev environments whereas the former doesn't.

pauldix

Looks good to me, but I'll wait for Praveen to look. I'm ok with having it disabled on features, as long as we're sure any actual public builds and Docker images has it enabled.

pauldix · 2025-01-16T21:53:48Z

influxdb3/Cargo.toml

@@ -97,3 +97,4 @@ test_helpers.workspace = true
 tonic.workspace = true
 tower.workspace = true
 test-log.workspace = true
+predicates = "3.1.3"


What's this? We generally want all our dependencies in the workspace.

It's a library used by the assert_cmd crate to assert on CLI stdout with code that looks like this (snipped from this PR, there are other examples where the inverse logic is applied with the .not() method):

// validate we get a debug message indicating upload disabled AssertCmd::cargo_bin("influxdb3") .unwrap() .args(serve_args) .arg("-vv") .arg("--disable-telemetry-upload") .timeout(std::time::Duration::from_millis(200)) .assert() .failure() .stdout(predicates::str::contains(expected_disabled));

I added it using cargo add --dev -p ... and didn't noticed that it wasn't added to the workspace -- I'll do that now!

waynr · 2025-01-16T22:00:05Z

I'm ok with having it disabled on features, as long as we're sure any actual public builds and Docker images has it enabled.

Okay, I'll take a look at adding that after I get back from a late lunch!

waynr · 2025-01-17T01:18:15Z

Okay I've added the telemetry opt-in feature flag and verified that it works as expected with a local build:

% cargo build --workspace --features "influxdb3/release_default"
   Compiling influxdb3 v0.1.0 (/home/wayne/projects/github.com/influxdata/influxdb/influxdb3)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 5.60s

% ./target/debug/influxdb3 serve -vv --host-id="meowmeowmeow" --http-bind='0.0.0.0:8086' --object-store=file --data-dir $PWD/tmp-data-dir_no-license_1
2025-01-17T01:07:41.498496Z  INFO influxdb3::commands::serve: InfluxDB 3 Core server starting writer_id=meowmeowmeow git_hash=v2.5.0-14272-gb8a94488b5693503731ec01c3e5b5a47255df516-dirty version=0.1.0 uuid=01f21a8d-6bef-4cf4-9b84-33c443c8d7eb num_cpus=12
2025-01-17T01:07:41.498551Z DEBUG influxdb3::commands::serve: build configuration build_malloc_conf=
2025-01-17T01:07:41.498760Z  INFO influxdb3_clap_blocks::object_store: Object Store db_dir="/home/wayne/projects/github.com/influxdata/influxdb/tmp-data-dir_no-license_1" object_store_type="Directory"
2025-01-17T01:07:41.499037Z  INFO influxdb3::commands::serve: Creating shared query executor num_threads=12
2025-01-17T01:07:41.502096Z DEBUG datafusion_execution::memory_pool::pool: Created new GreedyMemoryPool(pool_size=8589934592)
2025-01-17T01:07:41.523637Z  INFO influxdb3::commands::serve: catalog initialized instance_id="eda38d95-df71-4127-8383-dc757ff037d7"
2025-01-17T01:07:41.525658Z DEBUG influxdb3_wal::object_store: >>> replaying
2025-01-17T01:07:41.526019Z  INFO influxdb3::commands::serve: setting up background mem check for query buffer
2025-01-17T01:07:41.526079Z DEBUG influxdb3::commands::serve: setting up background buffer checker mem_threshold_bytes=23298950349
2025-01-17T01:07:41.526174Z  INFO influxdb3::commands::serve: setting up telemetry store
2025-01-17T01:07:41.526266Z DEBUG influxdb3::commands::serve: Initializing TelemetryStore with upload enabled for https://telemetry.v3.influxdata.com.

And without the feature flag:

% cargo build --workspace
   Compiling influxdb3_telemetry v0.1.0 (/home/wayne/projects/github.com/influxdata/influxdb/influxdb3_telemetry)
   Compiling influxdb3_wal v0.1.0 (/home/wayne/projects/github.com/influxdata/influxdb/influxdb3_wal)
   Compiling influxdb3_catalog v0.1.0 (/home/wayne/projects/github.com/influxdata/influxdb/influxdb3_catalog)
   Compiling influxdb3_py_api v0.1.0 (/home/wayne/projects/github.com/influxdata/influxdb/influxdb3_py_api)
   Compiling influxdb3_cache v0.1.0 (/home/wayne/projects/github.com/influxdata/influxdb/influxdb3_cache)
   Compiling influxdb3_write v0.1.0 (/home/wayne/projects/github.com/influxdata/influxdb/influxdb3_write)
   Compiling influxdb3_processing_engine v0.1.0 (/home/wayne/projects/github.com/influxdata/influxdb/influxdb3_processing_engine)
   Compiling influxdb3_server v0.1.0 (/home/wayne/projects/github.com/influxdata/influxdb/influxdb3_server)
   Compiling influxdb3 v0.1.0 (/home/wayne/projects/github.com/influxdata/influxdb/influxdb3)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 17.86s

% ./target/debug/influxdb3 serve -vv --host-id="meowmeowmeow" --http-bind='0.0.0.0:8086' --object-store=file --data-dir $PWD/tmp-data-dir_no-license_1
2025-01-17T01:16:15.027988Z  INFO influxdb3::commands::serve: InfluxDB 3 Core server starting writer_id=meowmeowmeow git_hash=v2.5.0-14272-gb8a94488b5693503731ec01c3e5b5a47255df516-dirty version=0.1.0 uuid=5cbabe73-d85a-4388-837a-762f482707a8 num_cpus=12
2025-01-17T01:16:15.028067Z DEBUG influxdb3::commands::serve: build configuration build_malloc_conf=
2025-01-17T01:16:15.028386Z  INFO influxdb3_clap_blocks::object_store: Object Store db_dir="/home/wayne/projects/github.com/influxdata/influxdb/tmp-data-dir_no-license_1" object_store_type="Directory"
2025-01-17T01:16:15.028856Z  INFO influxdb3::commands::serve: Creating shared query executor num_threads=12
2025-01-17T01:16:15.032499Z DEBUG datafusion_execution::memory_pool::pool: Created new GreedyMemoryPool(pool_size=8589934592)
2025-01-17T01:16:15.056524Z  INFO influxdb3::commands::serve: catalog initialized instance_id="eda38d95-df71-4127-8383-dc757ff037d7"
2025-01-17T01:16:15.058756Z DEBUG influxdb3_wal::object_store: >>> replaying
2025-01-17T01:16:15.059306Z  INFO influxdb3::commands::serve: setting up background mem check for query buffer
2025-01-17T01:16:15.059405Z DEBUG influxdb3::commands::serve: setting up background buffer checker mem_threshold_bytes=23298950349
2025-01-17T01:16:15.059554Z  INFO influxdb3::commands::serve: setting up telemetry store
2025-01-17T01:16:15.059701Z DEBUG influxdb3::commands::serve: Initializing TelemetryStore with upload disabled.

waynr · 2025-01-17T01:20:21Z

Hmmm, maybe it's too confusing to use both CLI options and build flags for this since we would have to enable the feature flag for the CLI options to have any effect 🤔 I think I'll go ahead and remove the feature flag.

praveen-influx

Looks good overall 👍 - just left couple of clarifications queries

praveen-influx · 2025-01-17T11:14:50Z

influxdb3/tests/server/cli.rs

+        .arg("--disable-telemetry-upload")
+        .timeout(std::time::Duration::from_millis(500))
+        .assert()
+        .failure()


Is it failure() because process starts and hangs around till 500 millis and then shuts down? Just wondered if it could've used the TestServer to spin up the server and got hold of the stdout for your assertions. If you'd already looked into it and it didn't work for you that's fine, but it has the mechanism built in to kill the process after the test without timeout.

Is it failure() because process starts and hangs around till 500 millis and then shuts down?

I added the failure() call just to be clear in this test case that we expect a non-zero exit code when the process is killed using the timeout method. It's not strictly necessary for this test case (we only really care about checking the stdout), just me trying to be explicit.

Just wondered if it could've used the TestServer to spin up the server and got hold of the stdout for your assertions.

I did look through the TestSever code a little, and it looked to me like it was intended more for API-level integration testing than what I'm doing here, which is asserting on the CLI's stdout. Looking at it again, it's still not obvious how to get the stdout from it without modifying it non-trivially to preserve stdout.

it has the mechanism built in to kill the process after the test without timeout.

Yeah, I see the it has the mechanism built in to kill the process after the test without timeout using the kill method, but that would require additional boilerplate to have the test case sleep 500 ms before killing it (assuming we also added more code to preserve and retrieve the TestServer stdout).

To be clear, the timeout method isn't used to kill the process after the test -- the test happens after we've killed the process. We kill it after a constant 500 ms because that's what I thought would reliably be enough time to reach the stdout debug messages that are being asserted on (a too small timeout could cause this test to be flaky since we would be less likely to reach that debug message). I know it's not ideal asserting on stdout to test whether the telemetry store is running in but I think the only other option would be to actually set up a listening http server and point it at that, and test that something reached it from the TelemetryStore (or not as the test case might call for), but that seemed like it might no be worth the additional effort...

I did think the test happens after killing the process as well with the timeout - just wasn't sure if there was easy wins in extending TestServer.

praveen-influx · 2025-01-17T11:17:06Z

influxdb3/tests/server/cli.rs

+        .timeout(std::time::Duration::from_millis(500))
+        .assert()
+        .failure()
+        .stdout(predicates::str::contains(expected_disabled).not());


I think using predicates is fine but I've been following the existing pattern of using assert_contains! and assert_not_contains! macros from test_helpers just to keep it consistent. If those macros are lacking in functionality for your use case, definitely keep predicates in.

Ah, I now see the run* functions used elsewhere in this file and how they are using those assertions you mentioned. But those test cases seem to be running short-lived subcommands rather than the long-lived serve subcommand. I like the assert_cmd abstractions for their succinctness in a case like this -- particularly the timeout method to avoid extra boilerplate around waiting a fixed time before killing the process. Using predicates just seemed a natural extension of using the assert_cmd crate, which I noticed was already being used here for its cargo_bin command constructor. I'll look into using the other assertions though!

Yea - it'd have had to wait till server started fully. I think we already loop through in TestServer till it hits the last line to capture the address. So, you could just extend TestServer to hold all the lines it's seen during the startup and access it in the tests. When TestServer::spawn().await completes, that means server is fully up. Then you can just loop through and check the line you wanted to assert is there.

See if that works, if not go ahead and merge this PR - the point about TestServer is not really critical, I just wanted to check if you'd seen it and still thought it's not fit for your test.

What is the value in adding more code to TestServer when all we need to accomplish the task at hand is std::process::Command? I see the specific value of TestServer where it's used -- it abstracts over a running server while executing API-driven test cases and displaying logs in the test case if TEST_LOG is set. That makes sense. But it's super heavy-handed when all that's necessary for these new test cases is std::process::Command. Is it just the assert_cmd helpers that you don't like? It would still be simpler to remove those (then add more code to TestServer that is), but the test case code becomes more verbose in doing so.

Not really, it's the timeouts that stood out. The initial thought was TestServer already abstracts the server starting part already and it also hooks into server logs and then the assertions stood out as well. But as I said, both aren't critical they're part of test infra, we can ship it.

praveen-influx · 2025-01-17T11:17:25Z

influxdb3/src/commands/serve.rs

@@ -60,7 +60,7 @@ pub const DEFAULT_DATA_DIRECTORY_NAME: &str = ".influxdb3";
 /// The default bind address for the HTTP API.
 pub const DEFAULT_HTTP_BIND_ADDR: &str = "0.0.0.0:8181";

-pub const DEFAULT_TELMETRY_ENDPOINT: &str = "https://telemetry.v3.influxdata.com";
+pub const DEFAULT_TELEMETRY_ENDPOINT: &str = "https://telemetry.v3.influxdata.com";


waynr requested a review from a team January 16, 2025 21:47

waynr force-pushed the feat/add-flag-to-disable-telemetry branch from a99774b to 3753cff Compare January 16, 2025 21:52

pauldix reviewed Jan 16, 2025

View reviewed changes

waynr force-pushed the feat/add-flag-to-disable-telemetry branch 2 times, most recently from 9dbf454 to f6728fc Compare January 17, 2025 01:10

waynr force-pushed the feat/add-flag-to-disable-telemetry branch from f6728fc to 8e8d446 Compare January 17, 2025 01:20

praveen-influx reviewed Jan 17, 2025

View reviewed changes

waynr added 3 commits January 17, 2025 11:50

feat: add hidden CLI option to disable sending telemetry

c551e1a

feat: --telemetry-host option to facilitate CLI testing

725e0e4

test: validate telemetry flags via stdout in CLI tests

d640a93

waynr mentioned this pull request Jan 17, 2025

chore: add cargo check to circleci config #25859

Closed

waynr force-pushed the feat/add-flag-to-disable-telemetry branch from 8e8d446 to 3328494 Compare January 17, 2025 21:18

chore: remove explicit predicates dependency

3328494

praveen-influx self-requested a review January 17, 2025 23:12

praveen-influx approved these changes Jan 17, 2025

View reviewed changes

waynr merged commit d800d8e into main Jan 18, 2025
13 checks passed

waynr mentioned this pull request Jan 18, 2025

Disable telemetry during CI test runs #25813

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add flags to control telemetry behaviors #25849

feat: add flags to control telemetry behaviors #25849

waynr commented Jan 16, 2025

pauldix left a comment

pauldix Jan 16, 2025

waynr Jan 16, 2025

waynr commented Jan 16, 2025

waynr commented Jan 17, 2025

waynr commented Jan 17, 2025

praveen-influx left a comment

praveen-influx Jan 17, 2025

waynr Jan 17, 2025

praveen-influx Jan 17, 2025

praveen-influx Jan 17, 2025

waynr Jan 17, 2025

praveen-influx Jan 17, 2025

waynr Jan 17, 2025 •

edited

Loading

praveen-influx Jan 17, 2025

praveen-influx Jan 17, 2025

feat: add flags to control telemetry behaviors #25849

feat: add flags to control telemetry behaviors #25849

Conversation

waynr commented Jan 16, 2025

pauldix left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

waynr commented Jan 16, 2025

waynr commented Jan 17, 2025

waynr commented Jan 17, 2025

praveen-influx left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

waynr Jan 17, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

waynr Jan 17, 2025 •

edited

Loading