Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachprod: prometheus query & list --spark #138252

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

srosenberg
Copy link
Member

@srosenberg srosenberg commented Jan 4, 2025

Previously, prometheus Query and QueryRange
was available only to roachtest; i.e., outside
of roachprod. This PR refactors the code s.t.
both can easily query prometheus.

Inspired by this new roachprod trick, we add
a new list option, namely roachprod list --spark.
When enabled, a sparkline is rendered directly in the
terminal. (As of this time, only iterm2 is supported.)
The sparkline graph denotes the cluster aggregate
throughput of committed KV transactions, i.e.,
txn_commits.

Epic: none
Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@srosenberg srosenberg force-pushed the sr/roachprod_prometheus_query branch 3 times, most recently from a8a7623 to 4026f8f Compare January 4, 2025 17:58
Copy link

blathers-crl bot commented Jan 4, 2025

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@srosenberg srosenberg force-pushed the sr/roachprod_prometheus_query branch 3 times, most recently from c3a3d97 to 62645c7 Compare January 5, 2025 00:49
Previously, prometheus `Query` and `QueryRange`
was available only to roachtest; i.e., outside
of roachprod. This PR refactors the code s.t.
both can easily query prometheus.

Inspired by this new roachprod trick, we add
a new list option, namely `roachprod list --spark`.
When enabled, a sparkline is rendered directly in the
terminal. (As of this time, only `iterm2` is supported.)
The sparkline graph denotes the cluster aggregate
throughput of committed KV transactions, i.e.,
`txn_commits`.

Epic: none
Release note: None
@srosenberg srosenberg force-pushed the sr/roachprod_prometheus_query branch from 62645c7 to 646a40a Compare January 5, 2025 02:09
@srosenberg srosenberg changed the title WIP roachprod: prometheus query & list --spark Jan 5, 2025
@srosenberg srosenberg marked this pull request as ready for review January 5, 2025 02:10
@srosenberg srosenberg requested a review from a team as a code owner January 5, 2025 02:10
@srosenberg srosenberg requested review from herkolategan, DarrylWong, golgeek and kvoli and removed request for a team January 5, 2025 02:10
@srosenberg
Copy link
Member Author

Below is an example of roachprod list either via --spark or export ROACHPROD_SPARK_LINE=true.

roachprod_list_sparkline

Copy link
Contributor

@golgeek golgeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks amazing 😻

points := prometheus.Values(series)
img, err := sparkline.DrawSparkline(points, 250, 35)
if img != nil && err == nil {
if s, err := sparkline.ITermEncodePNGToString(img, ""); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Wouldn't it be better to have a sparkline.DrawAsItermString(points, 250, 35) that combines both calls above?

for _, labelValToSeries := range m {
for clusterName, series := range labelValToSeries {
points := prometheus.Values(series)
img, err := sparkline.DrawSparkline(points, 250, 35)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: sparkline.Draw() should suffice

Copy link
Contributor

@DarrylWong DarrylWong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool!

@@ -200,6 +201,10 @@ func initListCmdFlags(listCmd *cobra.Command) {
"cost", "c", os.Getenv("ROACHPROD_COST_ESTIMATES") == "true",
"Show cost estimates",
)
listCmd.Flags().BoolVarP(&listSparkline,
"spark", "s", os.Getenv("ROACHPROD_SPARK_LINE") == "true",
"Show sparkline",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This flag description could be fleshed out a bit. Something mentioning: Show sparkline graph of txn_commits over last 5 minutes. Currently only supported on iterm?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing, I'll add some text to describe the flag.

)

const (
clusterQueryTemplate = "sum(rate(txn_commits{cluster=~\"$CLUSTERS\"}[5m])) by (cluster)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why txn_commits vs say cpu or qps? Mostly just curious on the envisioned use.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The choice is definitely not entirely random :) Allow me to explain. CPU utilization isn't a good proxy for what the DB might be doing under the hood; e.g., an increase/decrease in transaction throughput may not move the needle wrt cpu, if it's mostly IO-bound. Also, a cpu has many cores, and we don't (yet) have recording rules in prometheus; i.e., the above query would be more expensive. By qps, I suppose you mean the throughput of executed sql statements? We could do that. I chose the KV layer because technically there could be no sql, and yet there are KV transactions; e.g., range splits, other system tasks, like cluster initialization, would all be reflected in that metric.

In summary, txt_commits seems like a good indicator of the amount of KV work a cluster might be doing or not. (Incidentally, it's also usually linearly correlated with the gRPC throughput.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation! Makes sense to me!

Copy link
Collaborator

@herkolategan herkolategan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Eye Candy!

@kvoli
Copy link
Collaborator

kvoli commented Jan 9, 2025

Don't wait up for me to review, working through a bit of a backlog atm and this already appears more than well reviewed by test eng :)

@kvoli kvoli removed their request for review January 9, 2025 14:56
Copy link
Contributor

@nameisbhaskar nameisbhaskar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some minor comments.

@@ -14,6 +14,7 @@ import (
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/roachtestutil/task"
"github.com/cockroachdb/cockroach/pkg/roachprod/logger"
"github.com/cockroachdb/cockroach/pkg/roachprod/prometheus"
prominstaller "github.com/cockroachdb/cockroach/pkg/roachprod/prometheus/prominstaller"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: prominstaller is redundant.

@@ -0,0 +1,53 @@
// Copyright 2024 The Cockroach Authors.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Copyright 2024 The Cockroach Authors.
// Copyright 2025 The Cockroach Authors.

"nit"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lol, stuck in the past :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants