Add storage scale testing kep #812
Conversation
superseded-by:
---

# Volume Scale Testing Plan
I would change this to "Scale and Performance".
Some of these tests don't really need to be run at scale.
* CSI mock driver
* TODO: Hostpath?
* TODO: Local?
* Provide a framework that vendors can use to run scale tests against their CSI drivers.
Can you explain that in a bit more detail? I would like to see the expected result (i.e., what a vendor would have to do to test that).
So let's say that I would like to test exactly the same scenarios that we will create tests for, but instead of using the mock driver, use e.g. GCE PD. I think that the ideal situation is to just:
- configure a cluster correctly (I guess running it on GCE should be enough here)
- use the appropriate PV type in the tests
And that's all.
Or do you have something else on your mind?
Before running the tests, a vendor would need to:
- define a StorageClass
- launch their CSI driver
The main requirement I wanted to convey here is that the tests should be written so that we don't need to hard-code the volume type. Maybe we can just require that the default StorageClass is set appropriately before the test. I will try to reword this a bit.
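(Illustrative only, not part of the KEP or this thread: a minimal Go sketch of what "don't hard-code the volume type" could look like in a test helper that resolves the cluster's default StorageClass via client-go. The function name is made up.)

```go
package storagescale

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// defaultStorageClassName returns the StorageClass annotated as the cluster
// default, which the vendor is expected to configure before running the tests.
func defaultStorageClassName(ctx context.Context, cs kubernetes.Interface) (string, error) {
	scs, err := cs.StorageV1().StorageClasses().List(ctx, metav1.ListOptions{})
	if err != nil {
		return "", err
	}
	for _, sc := range scs.Items {
		if sc.Annotations["storageclass.kubernetes.io/is-default-class"] == "true" {
			return sc.Name, nil
		}
	}
	return "", fmt.Errorf("no default StorageClass found; one must be installed before the test")
}
```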
### Phase 2

* Pod teardown tests with results published and thresholds established.
I would personally remove "thresholds established" from here (and below).
I think that we are going in a direction where we would like to fail the tests if they fail the SLO. And I think we are not (at least not in the short term) planning to add any SLO around that.
So I think that those metrics are useful and we may want to publish them on some graphs, but I would (at least for now) not talk about thresholds.
OK, I'll move defining an SLO to a future phase.
Thanks a lot for starting this doc @msau42!

The tests should be developed and run in sig-scalability’s test framework and infrastructure.

TODO details and links
@wojtek-t do you have any references to the scale test framework that I can link to?
I would bet that I added a comment, but apparently not.
This is the link to the design: https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/docs/design.md
This is the README: https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/README.md
I guess that's enough for this KEP. If you need anything else, let me know.
* Create many pods with 1 volume each.
* Create 1 pod with many volumes.
* For PVC types, test with and without PV.NodeAffinity as one more dimension.
* Measure pod startup time, with a breakdown of time spent in volume scheduling, attach and mount operations.
Is there already some way those times can be measured? All I can think of is looking at the events posted for the pod, but I have no idea how detailed and/or accurate that is.
There is the tracing KEP, but it no longer looks like that will make it into 1.14 at all.
Yes, we have scalability tests today that have the infrastructure in place to measure pod startup latency. As part of this work, we may have to enhance it if we also want to pull out finer grained metrics for volume operations.
I guess I should have this comment here: #812 (comment)
BTW - do you already generate events on e.g. attach and mount operations? Or would they have to be added? If the latter, we should probably think more about whether this is the right approach (I'm not convinced - the events trick was done 3+ years ago for lack of time to do anything better).
We have events on attach, and we recently removed events for mount because they were too noisy for the remount volume types (secrets, etc.).
I was not leaning towards using events for measuring attach/mount time because we have found that events can be unreliable and can be dropped or garbage collected.
> I was not leaning towards using events for measuring attach/mount time because we have found that events can be unreliable and can be dropped or garbage collected.

Yes - I'm also not a fan of using events, for the reasons you mentioned. They may be used to extend the information with some optional data that e.g. helps with debugging, but if we lose it, it's not the end of the world.
So yeah - if we can find a more robust mechanism, I'm all for it.
For each volume type:

* Create many pods with 1 volume each.
Is that expected to start one pod after the other, waiting for each to run before creating the next, or is it an "ask for n pods at once, wait for all of them to run" test?
How many is "many"? In general, a lot of this testing is likely to hit various scalability issues. Do we want to time-box each configuration to a certain configurable duration? That's easier when starting pods one-by-one, because then we can basically stop at any time. When creating n pods at once, n might be so large that we exceed the time limit. Then what? Looking at the currently running pods at the end of the time limit is racy. Consider it a test failure?
I believe the current scale tests create the pods in parallel.
"How many" needs to be figured out, as we currently don't know what our scale limits are. I imagine that the number of pods is configurable, and we run the test with various numbers, collect the results, and see when the numbers start to degrade significantly.
So we should have a timebox for the tests, but we'll need to do some experimentation to figure out what limit fits within it. I will add a section that describes this.
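(Illustrative sketch only: creating n pods in parallel while bounding the whole run with a configurable timebox, so the run fails with context.DeadlineExceeded instead of hanging. The clientset and the pod factory are assumed to be supplied by the test framework.)

```go
package storagescale

import (
	"context"
	"sync"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createPodsInParallel issues n pod creations concurrently, each pod built by
// newPod (e.g. a pod referencing its own PVC), bounded by the timebox.
func createPodsInParallel(cs kubernetes.Interface, namespace string, n int,
	timebox time.Duration, newPod func(i int) *corev1.Pod) error {

	ctx, cancel := context.WithTimeout(context.Background(), timebox)
	defer cancel()

	errs := make(chan error, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			if _, err := cs.CoreV1().Pods(namespace).Create(ctx, newPod(i), metav1.CreateOptions{}); err != nil {
				errs <- err
			}
		}(i)
	}
	wg.Wait()
	close(errs)

	// Surface the first failure; hitting the timebox shows up as context.DeadlineExceeded.
	for err := range errs {
		return err
	}
	return nil
}
```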
* Delete many pods with 1 volume each.
* Delete 1 pod with many volumes.
* Measure pod deletion time, with a breakdown of time spent in unmount and detach operations.
* Note: Detach can only be measured for CSI volumes by the removal of the VolumeAttachment object.
Isn't detach always decoupled from pod deletion? I.e. pod stops, some time later the volume gets detached from the node? I remember that this was driven by the control loop in kubelet, but it's quite likely that I don't remember correctly or misunderstood.
If it is indeed decoupled from pod deletion, then an approach for measuring time based on pod events won't work.
If CSI volumes need special handling, then this affects test portability.
Yes the problem is that Pod deletion is not tied to detach. But I think detach is still important to measure because for RWO volumes, it impacts the time it takes to launch a new Pod that uses the same volume.
We may have to make an exception on portability in order to collect detach metrics. It could just be that for non-CSI volumes, we won't be able to collect that data.
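(Sketch of the CSI-only measurement described above, assuming the test knows the VolumeAttachment name; the helper name is invented. Detach completion is inferred from the object disappearing.)

```go
package storagescale

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// measureDetach polls until the VolumeAttachment object is gone and returns
// how long that took. Non-CSI volumes have no equivalent object, so this
// measurement is CSI-only, as noted in the discussion.
func measureDetach(ctx context.Context, cs kubernetes.Interface, attachmentName string,
	timeout time.Duration) (time.Duration, error) {

	start := time.Now()
	err := wait.PollUntilContextTimeout(ctx, time.Second, timeout, true,
		func(ctx context.Context) (bool, error) {
			_, getErr := cs.StorageV1().VolumeAttachments().Get(ctx, attachmentName, metav1.GetOptions{})
			switch {
			case apierrors.IsNotFound(getErr):
				return true, nil // object removed -> detach completed
			case getErr != nil:
				return false, getErr // unexpected API error aborts the poll
			default:
				return false, nil // still attached, keep polling
			}
		})
	return time.Since(start), err
}
```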
Updated to address comments.
from kube-controller-manager
* Caveat: These metrics measure the time of a single operation attempt and not
  the e2e time across retries. These metrics may not be suitable to measure
  latency from a pod's perspective.
I think that we need to clearly distinguish two things:
(a) metrics that implement an SLI
(b) metrics that are useful for better understanding what is happening and/or debugging failures
Most of the questions here fall into category (b), which is not the most important thing for the scalability SIG - it is important for individual SIGs to debug their features. And how exactly that is done in their components is out of our responsibility (it is much more in the scope of sig-instrumentation).
The two main categories of options are:
- expose a set of aggregated metrics from your component (and then we can provide mechanisms for gathering them at the framework level - the problem isn't fully solved in ClusterLoader, but it's something we are looking at as this is really needed)
- track this information in a more custom way in your component by:
  - some kind of tracing (until we have it, maybe traces in logs: https://github.com/kubernetes/utils/tree/master/trace are good enough for this purpose)
  - extending objects to keep more information (though we should not increase the number of operations on the apiserver for that reason)
But this is more individual - there isn't one single good option that will work in all cases.
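(A minimal sketch of the "traces in logs" option mentioned above, using k8s.io/utils/trace; the operation and step names are made up, and this is not existing Kubernetes code.)

```go
package storagescale

import (
	"time"

	utiltrace "k8s.io/utils/trace"
)

// attachAndMountVolume shows where trace steps could be recorded so that slow
// operations get a per-step breakdown in the component's log.
func attachAndMountVolume(attach, mount func() error) error {
	tr := utiltrace.New("attach and mount volume")
	// Dump the trace (with per-step durations) only if it exceeds the threshold.
	defer tr.LogIfLong(10 * time.Second)

	if err := attach(); err != nil {
		return err
	}
	tr.Step("attach completed")

	if err := mount(); err != nil {
		return err
	}
	tr.Step("mount completed")
	return nil
}
```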
Agree. For pod startup SLI, we already have the necessary metrics in place.
We already have metrics for volume operations, but like I mentioned above, they're not really good for measuring e2e time to successfully complete the operation. So the options I see are:
- Add new metrics that can track total e2e time for the operation. This was attempted already in VolumeManager mount/unmount volume metrics kubernetes#57766, but it got complicated when we tried to handle operation cancellations/complete failure.
- Add tracing. Needs investigation if this could be leveraged for more than just scale tests without too much overhead to the system. I still think a "total latency" metric would be very useful beyond scale tests.
- Add timestamps to API objects. This would need to be carefully considered so that we don't trigger additional API updates. Ideally we already have some sort of status that we are already updating when the operation is complete. For something like attach, that can also be detached and reattached, we would need to make sure the timestamp makes sense in each of those states.
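(A hypothetical sketch of the first option above, a new metric tracking total e2e operation time across retries. The metric name, labels, and buckets are invented for illustration, and the hard part discussed in kubernetes#57766, carrying the first-attempt timestamp across retries and cancellations, is deliberately left out.)

```go
package storagescale

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// volumeOperationE2ESeconds is an invented example metric: total time from the
// first attempt of an operation to its final success, across retries.
var volumeOperationE2ESeconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "example_volume_operation_e2e_seconds",
		Help:    "Example only: end-to-end volume operation latency across retries.",
		Buckets: prometheus.ExponentialBuckets(0.1, 2, 12), // 0.1s .. ~205s
	},
	[]string{"operation", "plugin"},
)

func init() {
	prometheus.MustRegister(volumeOperationE2ESeconds)
}

// recordOperationE2E would be called exactly once per logical operation, with
// the timestamp of the first attempt carried across retries.
func recordOperationE2E(operation, plugin string, firstAttempt time.Time) {
	volumeOperationE2ESeconds.WithLabelValues(operation, plugin).Observe(time.Since(firstAttempt).Seconds())
}
```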
Also, I'll update the proposal to distinguish these two types of metrics
> I still think a "total latency" metric would be very useful beyond scale tests.

+1

> Ideally we already have some sort of status that we are already updating when the operation is complete. For something like attach, that can also be detached and reattached, we would need to make sure the timestamp makes sense in each of those states.

Yes - that makes perfect sense to me.
timestamps in PersistentVolumeClaim, VolumeAttachment, and Pod statuses,
respectively.
* TODO: how to measure time reliably for deletion events, where the API object
  deletion is the trigger?
Good question, though I don't think it's a blocker.
We don't have any SLI for pod deletion and I've never heard high-level complaints that this is too slow.
When using RWO volumes, pod deletion (and unmount/detach) time becomes more important because the next pod using the same volume cannot be started until the volume has been detached from the previous node.
OK - that's a great point, which I totally forgot about.
I will add a TODO to my SLI proposal to consider adding a deletion SLI for that reason.
I still think it's not a hard blocker - I would treat it more as a second step.
1. Start with the documented [max limits](https://kubernetes.io/docs/setup/cluster-large/)
   and [stateless pod
   SLO](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/pod_startup_latency.md)
   as a target SLO.
kubernetes/community#3242 is out for review
* Measure time starting from pod scheduling to volume operation completion times
  for provision, attach and mount. This requires changes in Kubernetes to add
  timestamps in PersistentVolumeClaim, VolumeAttachment, and Pod statuses,
  respectively.
With these timestamps, do you mean events or actual API changes of these types?
timestamps in PersistentVolumeClaim, VolumeAttachment, and Pod statuses,
respectively.
* TODO: how to measure time reliably for deletion events, where the API object
  deletion is the trigger?
I'm not sure whether relying on the API to expose internals is the right approach. It probably isn't going to get us all the way.
As it has been pointed out in the review of the tracing proposal, mapping to spans (whether in OpenTracing or OpenCensus) doesn't look like the right model either.
Perhaps we really just need an event collection mechanism that is independent of the Kubernetes API, and then do some data analysis on that?
I agree that extending the API doesn't sound like a good option.
Though it's possible to e.g. consider setting an annotation to expose that information - there is already a precedent for it: kubernetes/kubernetes#69946 - though we should be careful with things like this.
I'm not convinced that adding timestamps of important events to API objects is exposing internal details. I was picturing this after looking at how the pod startup SLI is calculated, and seeing that it looks at the Pod container's start timestamp.
Let me rephrase my thoughts (also, I don't have full insight into what exactly you're suggesting, so this may not make sense - let me know if that's the case).
I guess what I was trying to say is that I would be reluctant to extend the overall Pod API just to enable tracking of some internal metrics. If what you have in mind is appropriately setting timestamps e.g. in Pod conditions or something like that, I think that can make sense (though I would need to understand a bit better what exactly you're proposing).
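(To make the "timestamps in Pod conditions" idea concrete: the existing PodScheduled and ContainersReady conditions already carry LastTransitionTime, so a coarse scheduled-to-ready duration, which includes volume attach and mount, can be derived without any API changes. This is just a sketch, not a proposal for specific new fields.)

```go
package storagescale

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// scheduledToReady returns the time between the PodScheduled and
// ContainersReady conditions becoming true, or false if either is missing.
func scheduledToReady(pod *corev1.Pod) (time.Duration, bool) {
	var scheduled, ready time.Time
	for _, c := range pod.Status.Conditions {
		if c.Status != corev1.ConditionTrue {
			continue
		}
		switch c.Type {
		case corev1.PodScheduled:
			scheduled = c.LastTransitionTime.Time
		case corev1.ContainersReady:
			ready = c.LastTransitionTime.Time
		}
	}
	if scheduled.IsZero() || ready.IsZero() {
		return 0, false
	}
	return ready.Sub(scheduled), true
}
```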
@@ -107,7 +146,7 @@ These tests should measure how long it takes to start up a pod with volumes, ass

 For each volume type:

-* Create many pods with 1 volume each.
+* Create many pods with 1 volume each in parallel.
Some more points to consider:
- Do we create one volume per pod? Is that realistic (i.e. can the storage system scale to that number)? If we don't and volumes cannot be shared arbitrarily, then pods become unschedulable.
- The proposed test scales all pods evenly across all nodes, right? As a side effect, we get volumes also spread evenly. I am wondering whether it makes sense to have a test that stresses scalability of the local operations by running a high number of pods+volumes on a single node.
> Do we create one volume per pod? Is that realistic (i.e. can the storage system scale to that number)? If we don't and volumes cannot be shared arbitrarily, then pods become unschedulable.

If you look at our current tests for stateless pods, what we do is:
- saturate the cluster as fast as we can (during which we don't really measure anything)
- then generate a constant stream of pods (we actually don't create them in parallel) for which we measure things
I think that would address your concern?

> The proposed test scales all pods evenly across all nodes, right? As a side effect, we get volumes also spread evenly. I am wondering whether it makes sense to have a test that stresses scalability of the local operations by running a high number of pods+volumes on a single node.

+1
Though that's a slightly different kind of test, and a 1-node cluster is enough to run it.
What does "saturate the cluster" mean? Can you point me to the code which does it or a description?
The actual measurements are then done sequentially (create one pod, wait for it to be running, stop it, create next pod, etc.)?
That would work much better for pods using the same PV. But timing will vary greatly depending on the scheduling pattern: one pod running after another on the same node will start up much faster than a pod for which the volume first has to be moved from the old node to the new node.
I don't want to handle sharing PVs, in the first phase at least, because how it behaves depends heavily on the storage backend type (RWO vs RWX) and how the pods get scheduled.
What does saturating the cluster achieve? Is it for pre-pulling images onto the nodes?
I will add more detail that for these initial tests there is no sharing of volumes, and we can stress various dimensions (see the sketch after this comment for the first one):
- Total number of volumes per pod (1 pod with many volumes on one node)
- Total number of volumes per node (many pods with 1 volume each on one node)
- Total number of volumes per cluster (can we take the total # of volumes per node and multiply by the # of nodes?)
I also moved the other tests to the WIP future section, and I just want to focus on pod startup tests for this initial phase.
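(Sketch of the first dimension, volumes per pod: building a pod spec that mounts numVolumes PVCs. The image and the PVC naming convention are made up, and the PVCs are assumed to exist already.)

```go
package storagescale

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// podWithManyVolumes builds a pod that mounts numVolumes pre-created PVCs,
// stressing the "volumes per pod" dimension on a single node.
func podWithManyVolumes(name string, numVolumes int) *corev1.Pod {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "pause",
				Image: "registry.k8s.io/pause:3.9",
			}},
		},
	}
	for i := 0; i < numVolumes; i++ {
		volName := fmt.Sprintf("vol-%d", i)
		pod.Spec.Volumes = append(pod.Spec.Volumes, corev1.Volume{
			Name: volName,
			VolumeSource: corev1.VolumeSource{
				PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
					ClaimName: fmt.Sprintf("%s-pvc-%d", name, i), // assumed naming convention
				},
			},
		})
		pod.Spec.Containers[0].VolumeMounts = append(pod.Spec.Containers[0].VolumeMounts,
			corev1.VolumeMount{Name: volName, MountPath: "/mnt/" + volName})
	}
	return pod
}
```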
/lgtm
/hold
To allow other reviewers time to review.
@msau42 - sorry, I've been OOO for a couple of days (and am also trying to get through my backlog now). Looking now.
Just some minor comments (feel free to ignore them). Overall it looks great.
LGTM (please ping me if you want me to apply the label - not sure if you're waiting for @pohly too).
The framework already supports measurements for Pod startup latency and can be
used to determine a [Pod startup SLO with
volumes](https://github.com/kubernetes/community/pull/3242).
If any changes are needed (and I think there might be), this should be on us. Happy to be your main point of contact for that.
@krzysied - FYI
You may want to make it explicit too.
[volume scheduling
metrics](https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/persistentvolume/scheduler_bind_cache_metrics.go)
* Timing of volume events. However, events should not be considered a reliable
  measurement because they can be dropped or garbage collected. Also there are
For stateless pods, we use events as debugging info - if they are there, we report them; if not, well, we just have the e2e value (which is the e2e SLI, which is enough to check whether we meet the SLO - just harder/impossible to debug in that case).
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: msau42, saad-ali, wojtek-t.
Develop a plan on how to test scalability of K8s storage
@kubernetes/sig-storage-feature-requests
@kubernetes/sig-scalability-feature-requests