Add storage scale testing kep #812
Conversation
superseded-by:
---

# Volume Scale Testing Plan
I would change this to "Scale and Performance".
Some of these tests don't really need to be run at scale.
* CSI mock driver
* TODO: Hostpath?
* TODO: Local?
* Provide a framework that vendors can use to run scale tests against their CSI drivers.
Can you explain that in a bit more detail? I would like to see the expected result (i.e., what a vendor would have to do to test that).
So let's say that I would like to test exactly the same scenarios that we will create tests for, but instead of using the mock driver, use e.g. GCE PD. I think that the ideal situation is to just:
- configure a cluster correctly (I guess running it on GCE should be enough here)
- use the appropriate PV type in the tests
And that's all.
Or do you have something else on your mind?
Before running the tests, a vendor would need to:
- define a StorageClass
- launch their CSI driver
The main requirement I wanted to convey here is that the tests should be written so that we don't need to hard-code the volume type. Maybe we can just require that the default StorageClass is set appropriately before the test. I will try to reword this a bit.
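(Illustrative only, not part of the KEP or this thread: a minimal Go sketch of what "don't hard-code the volume type" could look like in a test helper that resolves the cluster's default StorageClass via client-go. The function name is made up.)

```go
package storagescale

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// defaultStorageClassName returns the StorageClass annotated as the cluster
// default, which the vendor is expected to configure before running the tests.
func defaultStorageClassName(ctx context.Context, cs kubernetes.Interface) (string, error) {
	scs, err := cs.StorageV1().StorageClasses().List(ctx, metav1.ListOptions{})
	if err != nil {
		return "", err
	}
	for _, sc := range scs.Items {
		if sc.Annotations["storageclass.kubernetes.io/is-default-class"] == "true" {
			return sc.Name, nil
		}
	}
	return "", fmt.Errorf("no default StorageClass found; one must be installed before the test")
}
```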
### Phase 2

* Pod teardown tests with results published and thresholds established.
I would personally remove "thresholds established" from here (and below).
I think that we are going in a direction where we would like to fail the tests if they fail the SLO. And I think we are not (at least not in the short term) planning to add any SLO around that.
So I think that those metrics are useful and we may want to publish them on some graphs, but I would (at least for now) not talk about thresholds.
OK, I'll move defining an SLO to a future phase.
Thanks a lot for starting this doc @msau42!

The tests should be developed and run in sig-scalability’s test framework and infrastructure.

TODO details and links
@wojtek-t do you have any references to the scale test framework that I can link to?
I would bet that I added a comment, but apparently not.
This is the link to the design: https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/docs/design.md
This is the README: https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/README.md
I guess that's enough for this KEP. If you need anything else, let me know.
* Create many pods with 1 volume each.
* Create 1 pod with many volumes.
* For PVC types, test with and without PV.NodeAffinity as one more dimension.
* Measure pod startup time, with a breakdown of time spent in volume scheduling, attach and mount operations.
Is there already some way those times can be measured? All I can think of is looking at the events posted for the pod, but I have no idea how detailed and/or accurate that is.
There is the tracing KEP, but it no longer looks like that will make it into 1.14 at all.
Yes, we have scalability tests today that have the infrastructure in place to measure pod startup latency. As part of this work, we may have to enhance it if we also want to pull out finer grained metrics for volume operations.
I guess I should have this comment here: #812 (comment)
BTW - do you already generate events on e.g. attach and mount operations? Or would they have to be added? If the latter, we should probably think more about whether this is the right approach (I'm not convinced - the events trick was done 3+ years ago for lack of time to do anything better).
We have events on attach, and we recently removed events for mount because they were too noisy for the remount volume types (secrets, etc.).
I was not leaning towards using events for measuring attach/mount time because we have found that events can be unreliable and can be dropped or garbage collected.
> I was not leaning towards using events for measuring attach/mount time because we have found that events can be unreliable and can be dropped or garbage collected.

Yes - I'm also not a fan of using events, for the reasons you mentioned. They may be used to extend the information with some optional data that e.g. helps with debugging, but if we lose it, it's not the end of the world.
So yeah - if we can find a more robust mechanism, I'm all for it.
For each volume type:

* Create many pods with 1 volume each.
Is that expected to start one pod after the other, waiting for each to run before creating the next, or is it an "ask for n pods at once, wait for all of them to run" test?
How many is "many"? In general, a lot of this testing is likely to hit various scalability issues. Do we want to time-box each configuration to a certain configurable duration? That's easier when starting pods one-by-one, because then we can basically stop at any time. When creating n pods at once, n might be so large that we exceed the time limit. Then what? Looking at the currently running pods at the end of the time limit is racy. Consider it a test failure?
I believe the current scale tests create the pods in parallel.
"How many" needs to be figured out, as we currently don't know what our scale limits are. I imagine that the number of pods is configurable, and we run the test with various numbers, collect the results, and see when the numbers start to degrade significantly.
So we should have a timebox for the tests, but we'll need to do some experimentation to figure out what limit fits within it. I will add a section that describes this.
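(Illustrative sketch only: creating n pods in parallel while bounding the whole run with a configurable timebox, so the run fails with context.DeadlineExceeded instead of hanging. The clientset and the pod factory are assumed to be supplied by the test framework.)

```go
package storagescale

import (
	"context"
	"sync"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createPodsInParallel issues n pod creations concurrently, each pod built by
// newPod (e.g. a pod referencing its own PVC), bounded by the timebox.
func createPodsInParallel(cs kubernetes.Interface, namespace string, n int,
	timebox time.Duration, newPod func(i int) *corev1.Pod) error {

	ctx, cancel := context.WithTimeout(context.Background(), timebox)
	defer cancel()

	errs := make(chan error, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			if _, err := cs.CoreV1().Pods(namespace).Create(ctx, newPod(i), metav1.CreateOptions{}); err != nil {
				errs <- err
			}
		}(i)
	}
	wg.Wait()
	close(errs)

	// Surface the first failure; hitting the timebox shows up as context.DeadlineExceeded.
	for err := range errs {
		return err
	}
	return nil
}
```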
* Delete many pods with 1 volume each.
* Delete 1 pod with many volumes.
* Measure pod deletion time, with a breakdown of time spent in unmount and detach operations.
* Note: Detach can only be measured for CSI volumes by the removal of the VolumeAttachment object.
Isn't detach always decoupled from pod deletion? I.e. pod stops, some time later the volume gets detached from the node? I remember that this was driven by the control loop in kubelet, but it's quite likely that I don't remember correctly or misunderstood.
If it is indeed decoupled from pod deletion, then an approach for measuring time based on pod events won't work.
If CSI volumes need special handling, then this affects test portability.
Yes the problem is that Pod deletion is not tied to detach. But I think detach is still important to measure because for RWO volumes, it impacts the time it takes to launch a new Pod that uses the same volume.
We may have to make an exception on portability in order to collect detach metrics. It could just be that for non-CSI volumes, we won't be able to collect that data.
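(Sketch of the CSI-only measurement described above, assuming the test knows the VolumeAttachment name; the helper name is invented. Detach completion is inferred from the object disappearing.)

```go
package storagescale

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// measureDetach polls until the VolumeAttachment object is gone and returns
// how long that took. Non-CSI volumes have no equivalent object, so this
// measurement is CSI-only, as noted in the discussion.
func measureDetach(ctx context.Context, cs kubernetes.Interface, attachmentName string,
	timeout time.Duration) (time.Duration, error) {

	start := time.Now()
	err := wait.PollUntilContextTimeout(ctx, time.Second, timeout, true,
		func(ctx context.Context) (bool, error) {
			_, getErr := cs.StorageV1().VolumeAttachments().Get(ctx, attachmentName, metav1.GetOptions{})
			switch {
			case apierrors.IsNotFound(getErr):
				return true, nil // object removed -> detach completed
			case getErr != nil:
				return false, getErr // unexpected API error aborts the poll
			default:
				return false, nil // still attached, keep polling
			}
		})
	return time.Since(start), err
}
```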
Updated to address comments.
from kube-controller-manager
* Caveat: These metrics measure the time of a single operation attempt and not
  the e2e time across retries. These metrics may not be suitable to measure
  latency from a pod's perspective.
I think that we need to clearly distinguish two things:
(a) metrics that implement an SLI
(b) metrics that are useful for better understanding what is happening and/or debugging failures
Most of the questions here fall into category (b), which is not the most important thing for the scalability SIG - it is important for individual SIGs to debug their features. And how exactly that is done in their components is out of our responsibility (it is much more in the scope of sig-instrumentation).
The two main categories of options are:
- expose a set of aggregated metrics from your component (and then we can provide mechanisms for gathering them at the framework level - the problem isn't fully solved in ClusterLoader, but it's something we are looking at as this is really needed)
- track this information in a more custom way in your component by:
  - some kind of tracing (until we have it, maybe traces in logs: https://github.com/kubernetes/utils/tree/master/trace are good enough for this purpose)
  - extending objects to keep more information (though we should not increase the number of operations on the apiserver for that reason)
But this is more individual - there isn't one single good option that will work in all cases.
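(A minimal sketch of the "traces in logs" option mentioned above, using k8s.io/utils/trace; the operation and step names are made up, and this is not existing Kubernetes code.)

```go
package storagescale

import (
	"time"

	utiltrace "k8s.io/utils/trace"
)

// attachAndMountVolume shows where trace steps could be recorded so that slow
// operations get a per-step breakdown in the component's log.
func attachAndMountVolume(attach, mount func() error) error {
	tr := utiltrace.New("attach and mount volume")
	// Dump the trace (with per-step durations) only if it exceeds the threshold.
	defer tr.LogIfLong(10 * time.Second)

	if err := attach(); err != nil {
		return err
	}
	tr.Step("attach completed")

	if err := mount(); err != nil {
		return err
	}
	tr.Step("mount completed")
	return nil
}
```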
Agree. For pod startup SLI, we already have the necessary metrics in place.
We already have metrics for volume operations, but like I mentioned above, they're not really good for measuring e2e time to successfully complete the operation. So the options I see are:
- Add new metrics that can track total e2e time for the operation. This was attempted already in VolumeManager mount/unmount volume metrics kubernetes#57766, but it got complicated when we tried to handle operation cancellations/complete failure.
- Add tracing. Needs investigation if this could be leveraged for more than just scale tests without too much overhead to the system. I still think a "total latency" metric would be very useful beyond scale tests.
- Add timestamps to API objects. This would need to be carefully considered so that we don't trigger additional API updates. Ideally we already have some sort of status that we are already updating when the operation is complete. For something like attach, that can also be detached and reattached, we would need to make sure the timestamp makes sense in each of those states.
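(A hypothetical sketch of the first option above, a new metric tracking total e2e operation time across retries. The metric name, labels, and buckets are invented for illustration, and the hard part discussed in kubernetes#57766, carrying the first-attempt timestamp across retries and cancellations, is deliberately left out.)

```go
package storagescale

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// volumeOperationE2ESeconds is an invented example metric: total time from the
// first attempt of an operation to its final success, across retries.
var volumeOperationE2ESeconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "example_volume_operation_e2e_seconds",
		Help:    "Example only: end-to-end volume operation latency across retries.",
		Buckets: prometheus.ExponentialBuckets(0.1, 2, 12), // 0.1s .. ~205s
	},
	[]string{"operation", "plugin"},
)

func init() {
	prometheus.MustRegister(volumeOperationE2ESeconds)
}

// recordOperationE2E would be called exactly once per logical operation, with
// the timestamp of the first attempt carried across retries.
func recordOperationE2E(operation, plugin string, firstAttempt time.Time) {
	volumeOperationE2ESeconds.WithLabelValues(operation, plugin).Observe(time.Since(firstAttempt).Seconds())
}
```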
Also, I'll update the proposal to distinguish these two types of metrics
> I still think a "total latency" metric would be very useful beyond scale tests.

+1

> Ideally we already have some sort of status that we are already updating when the operation is complete. For something like attach, that can also be detached and reattached, we would need to make sure the timestamp makes sense in each of those states.

Yes - that makes perfect sense to me.
timestamps in PersistentVolumeClaim, VolumeAttachment, and Pod statuses,
respectively.
* TODO: how to measure time reliably for deletion events, where the API object
  deletion is the trigger?
Good question, though I don't think it's a blocker.
We don't have any SLI for pod deletion and I've never heard high-level complaints that this is too slow.
When using RWO volumes, pod deletion (and unmount/detach) time becomes more important because the next pod using the same volume cannot be started until the volume has been detached from the previous node.
OK - that's a great point, which I totally forgot about.
I will add a TODO to my SLI proposal to consider adding a deletion SLI for that reason.
I still think it's not a hard blocker - I would treat it more as a second step.
1. Start with the documented [max limits](https://kubernetes.io/docs/setup/cluster-large/)
   and [stateless pod
   SLO](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/pod_startup_latency.md)
   as a target SLO.
kubernetes/community#3242 is out for review
* Measure time starting from pod scheduling to volume operation completion times
  for provision, attach and mount. This requires changes in Kubernetes to add
  timestamps in PersistentVolumeClaim, VolumeAttachment, and Pod statuses,
  respectively.
With these timestamps, do you mean events or actual API changes of these types?
timestamps in PersistentVolumeClaim, VolumeAttachment, and Pod statuses,
respectively.
* TODO: how to measure time reliably for deletion events, where the API object
  deletion is the trigger?
I'm not sure whether relying on the API to expose internals is the right approach. It probably isn't going to get us all the way.
As it has been pointed out in the review of the tracing proposal, mapping to spans (whether in OpenTracing or OpenCensus) doesn't look like the right model either.
Perhaps we really just need an event collection mechanism that is independent of the Kubernetes API, and then do some data analysis on that?
I agree that extending the API doesn't sound like a good option.
Though it's possible to e.g. consider setting an annotation to expose that information - there is already a precedent for it: kubernetes/kubernetes#69946 - though we should be careful with things like this.
I'm not convinced that adding timestamps of important events to API objects is exposing internal details. I was picturing this after looking at how the pod startup SLI is calculated, and seeing that it looks at the Pod container's start timestamp.
Let me rephrase my thoughts (also, I don't have full insight into what exactly you're suggesting, so this may not make sense - let me know if that's the case).
I guess what I was trying to say is that I would be reluctant to extend the overall Pod API just to enable tracking of some internal metrics. If what you have in mind is appropriately setting timestamps e.g. in Pod conditions or something like that, I think that can make sense (though I would need to understand a bit better what exactly you're proposing).
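(To make the "timestamps in Pod conditions" idea concrete: the existing PodScheduled and ContainersReady conditions already carry LastTransitionTime, so a coarse scheduled-to-ready duration, which includes volume attach and mount, can be derived without any API changes. This is just a sketch, not a proposal for specific new fields.)

```go
package storagescale

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// scheduledToReady returns the time between the PodScheduled and
// ContainersReady conditions becoming true, or false if either is missing.
func scheduledToReady(pod *corev1.Pod) (time.Duration, bool) {
	var scheduled, ready time.Time
	for _, c := range pod.Status.Conditions {
		if c.Status != corev1.ConditionTrue {
			continue
		}
		switch c.Type {
		case corev1.PodScheduled:
			scheduled = c.LastTransitionTime.Time
		case corev1.ContainersReady:
			ready = c.LastTransitionTime.Time
		}
	}
	if scheduled.IsZero() || ready.IsZero() {
		return 0, false
	}
	return ready.Sub(scheduled), true
}
```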
@@ -107,7 +146,7 @@ These tests should measure how long it takes to start up a pod with volumes, ass

 For each volume type:

-* Create many pods with 1 volume each.
+* Create many pods with 1 volume each in parallel.
Some more points to consider:
- Do we create one volume per pod? Is that realistic (i.e. can the storage system scale to that number)? If we don't and volumes cannot be shared arbitrarily, then pods become unschedulable.
- The proposed test scales all pods evenly across all nodes, right? As a side effect, we get volumes also spread evenly. I am wondering whether it makes sense to have a test that stresses scalability of the local operations by running a high number of pods+volumes on a single node.
> Do we create one volume per pod? Is that realistic (i.e. can the storage system scale to that number)? If we don't and volumes cannot be shared arbitrarily, then pods become unschedulable.

If you look at our current tests for stateless pods, what we do is:
- saturate the cluster as fast as we can (during which we don't really measure anything)
- then generate a constant stream of pods (we actually don't create them in parallel) for which we measure things
I think that would address your concern?

> The proposed test scales all pods evenly across all nodes, right? As a side effect, we get volumes also spread evenly. I am wondering whether it makes sense to have a test that stresses scalability of the local operations by running a high number of pods+volumes on a single node.

+1
Though that's a slightly different kind of test, and a 1-node cluster is enough to run it.
What does "saturate the cluster" mean? Can you point me to the code which does it or a description?
The actual measurements are then done sequentially (create one pod, wait for it to be running, stop it, create next pod, etc.)?
That would work much better for pods using the same PV. But timing will vary greatly depending on the scheduling pattern: one pod running after another on the same node will start up much faster than a pod for which the volume first has to be moved from the old node to the new node.
I don't want to handle sharing PVs, in the first phase at least, because how it behaves depends heavily on the storage backend type (RWO vs RWX) and how the pods get scheduled.
What does saturating the cluster achieve? Is it for pre-pulling images onto the nodes?
I will add more detail that for these initial tests there is no sharing of volumes, and we can stress various dimensions (see the sketch after this comment for the first one):
- Total number of volumes per pod (1 pod with many volumes on one node)
- Total number of volumes per node (many pods with 1 volume each on one node)
- Total number of volumes per cluster (can we take the total # of volumes per node and multiply by the # of nodes?)
I also moved the other tests to the WIP future section, and I just want to focus on pod startup tests for this initial phase.
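(Sketch of the first dimension, volumes per pod: building a pod spec that mounts numVolumes PVCs. The image and the PVC naming convention are made up, and the PVCs are assumed to exist already.)

```go
package storagescale

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// podWithManyVolumes builds a pod that mounts numVolumes pre-created PVCs,
// stressing the "volumes per pod" dimension on a single node.
func podWithManyVolumes(name string, numVolumes int) *corev1.Pod {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "pause",
				Image: "registry.k8s.io/pause:3.9",
			}},
		},
	}
	for i := 0; i < numVolumes; i++ {
		volName := fmt.Sprintf("vol-%d", i)
		pod.Spec.Volumes = append(pod.Spec.Volumes, corev1.Volume{
			Name: volName,
			VolumeSource: corev1.VolumeSource{
				PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
					ClaimName: fmt.Sprintf("%s-pvc-%d", name, i), // assumed naming convention
				},
			},
		})
		pod.Spec.Containers[0].VolumeMounts = append(pod.Spec.Containers[0].VolumeMounts,
			corev1.VolumeMount{Name: volName, MountPath: "/mnt/" + volName})
	}
	return pod
}
```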
/lgtm
/hold
To allow other reviewers time to review.
@msau42 - sorry, I've been OOO for a couple of days (and am also trying to get through my backlog now). Looking now.
Just some minor comments (feel free to ignore them). Overall it looks great.
LGTM (please ping me if you want me to apply the label - not sure if you're waiting for @pohly too).
The framework already supports measurements for Pod startup latency and can be
used to determine a [Pod startup SLO with
volumes](https://github.com/kubernetes/community/pull/3242).
If any changes are needed (and I think there might be), this should be on us. Happy to be your main point of contact for that.
@krzysied - FYI
You may want to make it explicit too.
[volume scheduling
metrics](https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/persistentvolume/scheduler_bind_cache_metrics.go)
* Timing of volume events. However, events should not be considered a reliable
  measurement because they can be dropped or garbage collected. Also there are
For stateless pods, we use events as debugging info - if they are there, we report them; if not, well, we just have the e2e value (which is the e2e SLI, which is enough to check whether we meet the SLO - just harder/impossible to debug in that case).
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: msau42, saad-ali, wojtek-t.
Develop a plan on how to test scalability of K8s storage
@kubernetes/sig-storage-feature-requests
@kubernetes/sig-scalability-feature-requests