-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachprod: prometheus query & list --spark #138252
base: master
Are you sure you want to change the base?
roachprod: prometheus query & list --spark #138252
Conversation
a8a7623
to
4026f8f
Compare
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
c3a3d97
to
62645c7
Compare
Previously, prometheus `Query` and `QueryRange` was available only to roachtest; i.e., outside of roachprod. This PR refactors the code s.t. both can easily query prometheus. Inspired by this new roachprod trick, we add a new list option, namely `roachprod list --spark`. When enabled, a sparkline is rendered directly in the terminal. (As of this time, only `iterm2` is supported.) The sparkline graph denotes the cluster aggregate throughput of committed KV transactions, i.e., `txn_commits`. Epic: none Release note: None
62645c7
to
646a40a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks amazing 😻
points := prometheus.Values(series) | ||
img, err := sparkline.DrawSparkline(points, 250, 35) | ||
if img != nil && err == nil { | ||
if s, err := sparkline.ITermEncodePNGToString(img, ""); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Wouldn't it be better to have a sparkline.DrawAsItermString(points, 250, 35)
that combines both calls above?
for _, labelValToSeries := range m { | ||
for clusterName, series := range labelValToSeries { | ||
points := prometheus.Values(series) | ||
img, err := sparkline.DrawSparkline(points, 250, 35) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: sparkline.Draw()
should suffice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool!
@@ -200,6 +201,10 @@ func initListCmdFlags(listCmd *cobra.Command) { | |||
"cost", "c", os.Getenv("ROACHPROD_COST_ESTIMATES") == "true", | |||
"Show cost estimates", | |||
) | |||
listCmd.Flags().BoolVarP(&listSparkline, | |||
"spark", "s", os.Getenv("ROACHPROD_SPARK_LINE") == "true", | |||
"Show sparkline", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This flag description could be fleshed out a bit. Something mentioning: Show sparkline graph of txn_commits over last 5 minutes. Currently only supported on iterm
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thing, I'll add some text to describe the flag.
) | ||
|
||
const ( | ||
clusterQueryTemplate = "sum(rate(txn_commits{cluster=~\"$CLUSTERS\"}[5m])) by (cluster)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why txn_commits vs say cpu or qps? Mostly just curious on the envisioned use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The choice is definitely not entirely random :) Allow me to explain. CPU utilization isn't a good proxy for what the DB might be doing under the hood; e.g., an increase/decrease in transaction throughput may not move the needle wrt cpu, if it's mostly IO-bound. Also, a cpu has many cores, and we don't (yet) have recording rules in prometheus; i.e., the above query would be more expensive. By qps, I suppose you mean the throughput of executed sql statements? We could do that. I chose the KV layer because technically there could be no sql, and yet there are KV transactions; e.g., range splits, other system tasks, like cluster initialization, would all be reflected in that metric.
In summary, txt_commits
seems like a good indicator of the amount of KV work a cluster might be doing or not. (Incidentally, it's also usually linearly correlated with the gRPC throughput.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation! Makes sense to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Eye Candy!
Don't wait up for me to review, working through a bit of a backlog atm and this already appears more than well reviewed by test eng :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some minor comments.
@@ -14,6 +14,7 @@ import ( | |||
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/roachtestutil/task" | |||
"github.com/cockroachdb/cockroach/pkg/roachprod/logger" | |||
"github.com/cockroachdb/cockroach/pkg/roachprod/prometheus" | |||
prominstaller "github.com/cockroachdb/cockroach/pkg/roachprod/prometheus/prominstaller" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: prominstaller
is redundant.
@@ -0,0 +1,53 @@ | |||
// Copyright 2024 The Cockroach Authors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Copyright 2024 The Cockroach Authors. | |
// Copyright 2025 The Cockroach Authors. |
"nit"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lol, stuck in the past :)
Previously, prometheus
Query
andQueryRange
was available only to roachtest; i.e., outside
of roachprod. This PR refactors the code s.t.
both can easily query prometheus.
Inspired by this new roachprod trick, we add
a new list option, namely
roachprod list --spark
.When enabled, a sparkline is rendered directly in the
terminal. (As of this time, only
iterm2
is supported.)The sparkline graph denotes the cluster aggregate
throughput of committed KV transactions, i.e.,
txn_commits
.Epic: none
Release note: None