
Add storage scale testing kep #812

Merged (1 commit) on Mar 11, 2019
Conversation

@msau42 (Member) commented Feb 5, 2019

Develop a plan on how to test scalability of K8s storage

@kubernetes/sig-storage-feature-requests
@kubernetes/sig-scalability-feature-requests

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. kind/feature Categorizes issue or PR as related to a new feature. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. labels Feb 5, 2019
@k8s-ci-robot k8s-ci-robot added sig/pm size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 5, 2019
@msau42 (Member Author) commented Feb 5, 2019

/assign @saad-ali @wojtek-t @pohly

It's still WIP, but I wanted to start review on at least the pod startup tests, which I want to target first.

@msau42 msau42 changed the title Add storage scale testing kep WIP: Add storage scale testing kep Feb 5, 2019
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 5, 2019
superseded-by:
---

# Volume Scale Testing Plan
Member:

I would change to Scale and Performance.
Some of them don't really need to be run at scale.

* CSI mock driver
* TODO: Hostpath?
* TODO: Local?
* Provide a framework that vendors can use to run scale tests against their CSI drivers.
Member:

Can you explain that in a bit more detail? I would like to see the expected result of it (i.e. what a vendor would have to do to test that).
So let's say that I would like to test exactly the same scenarios that we will create tests for, but instead of, say, using the mock driver, use e.g. GCE PD. I think that the ideal situation is to just:

  • configure a cluster correctly (I guess running it on GCE should be enough here)
  • use the appropriate PV type in the tests

And that's all. Or do you have something else in mind?

Member Author:

Before running the tests, a vendor would need to:

  • define a storageclass
  • launch their csi driver

The main requirement I wanted to convey here is that the tests should be written so that we don't need to hard code the volume type. Maybe we can just require that the default storageclass is set appropriately before the test. I will try to reword this a bit.
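
For illustration only, a minimal sketch (assuming client-go; the class name and provisioner are hypothetical and not part of this KEP) of the "define a StorageClass and make it the default" step a vendor would perform before running the tests:

```go
package main

import (
	"context"

	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	sc := &storagev1.StorageClass{
		ObjectMeta: metav1.ObjectMeta{
			Name: "scale-test-sc", // hypothetical name
			Annotations: map[string]string{
				// Standard annotation marking the default StorageClass, so the
				// tests can create PVCs without hard-coding a class name.
				"storageclass.kubernetes.io/is-default-class": "true",
			},
		},
		// Assumption: a vendor substitutes their own CSI driver name here.
		Provisioner: "csi.example.com",
	}
	if _, err := client.StorageV1().StorageClasses().Create(context.TODO(), sc, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```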


### Phase 2

* Pod teardown tests with results published and thresholds established.
Member:

I would personally remove "thresholds established" from here (and below).
I think that we are going in a direction where we would like to fail the tests if they fail the SLO. And I think we are not planning (at least not short term) to add any SLO around that.
So I think that those metrics are useful and we may want to publish them on some graphs, but I would (at least for now) not talk about thresholds.

Member Author:

OK, I'll move defining an SLO to a future phase.

@wojtek-t (Member) commented Feb 5, 2019

Thanks a lot for starting this doc, @msau42!


The tests should be developed and run in sig-scalability’s test framework and infrastructure.

TODO details and links
Member Author:

@wojtek-t do you have any references to the scale test framework that I can link to?

Member:

I would bet that I added a comment, but apparently not.

This is the link to the design: https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/docs/design.md

This is the README: https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/README.md

I guess that's enough for this KEP. If you need anything else, let me know.

* Create many pods with 1 volume each.
* Create 1 pod with many volumes.
* For PVC types, test with and without PV.NodeAffinity as one more dimension.
* Measure pod startup time, with a breakdown of time spent in volume scheduling, attach and mount operations.
Contributor:

Is there already some way those times can be measured? All I can think of is looking at the events posted for the pod, but I have no idea how detailed and/or accurate that is.

There is the tracing KEP, but it no longer looks like that'll make it into 1.14 at all.

Member Author:

Yes, we have scalability tests today that have the infrastructure in place to measure pod startup latency. As part of this work, we may have to enhance it if we also want to pull out finer grained metrics for volume operations.

Member:

I guess I should have this comment here: #812 (comment)

Member:

BTW - do you already generate events on e.g. attach and mount operations? Or would they have to be added? If the latter, we should probably think more about whether this is the right approach (I'm not convinced - the events trick was done 3+ years ago for lack of time to do anything better).

Member Author:

We have events on attach, and recently removed events for mount because they were too noisy for the remount volume types (secrets, etc).

I was not leaning towards using events for measuring attach/mount time because we have found that events can be unreliable and dropped/gc.

Member:

> I was not leaning towards using events for measuring attach/mount time because we have found that events can be unreliable and dropped/gc.

Yes - I'm also not a fan of using events due to the reasons you mentioned. They may be used to extend the information with some optional data which e.g. helps with debugging, but if we lose it, then it's not the end of the world.
So yeah - if we can find a more robust mechanism, I'm all for it.


For each volume type:

* Create many pods with 1 volume each.
Contributor:

Is that expected to start one pod after the other, waiting for it to run before creating the next, or is that an "ask for n pods at once, wait for all of them to run" test?

How many is "many"? In general, a lot of this testing is likely to hit various scalability issues. Do we want to time-box each configuration to a certain configurable duration? That's easier when starting pods one-by-one because then we can basically stop at any time. When creating n pods, n might be so large that we exceed the time limit. Then what? Looking at currently running tests at the end of the time limit is racy. Consider it a test failure?

Member Author:

I believe the current scale tests create the pods in parallel.

"How many" needs to be figured out, as we currently don't know what our scale limits are. I imagine that the number of pods is configurable, and we run the test with various numbers, collect the results, and see when the numbers start to degrade significantly.

So we should have a time box for the tests, but we'll need to do some experimentation to figure out what limit fits within it. I will add a section that describes this.
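
To make the shape of such a test concrete, here is a minimal sketch; it is not the actual ClusterLoader2 implementation, and the helper name, pause image, PVC size, and polling interval are assumptions. It creates n pods with one PVC each in parallel against the default StorageClass and waits, within a time box, for all of them to be Running:

```go
package scaletest

import (
	"context"
	"fmt"
	"sync"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createPodsWithVolumes creates n pods, each with its own PVC bound through the
// default StorageClass, and waits (up to timeout) for all of them to be Running.
func createPodsWithVolumes(ctx context.Context, client kubernetes.Interface, ns string, n int, timeout time.Duration) (time.Duration, error) {
	start := time.Now()
	var wg sync.WaitGroup
	errs := make(chan error, n)

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			name := fmt.Sprintf("scale-test-%d", i)

			pvc := &corev1.PersistentVolumeClaim{
				ObjectMeta: metav1.ObjectMeta{Name: name},
				Spec: corev1.PersistentVolumeClaimSpec{
					// No StorageClassName set: the default StorageClass is used.
					AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
					Resources: corev1.ResourceRequirements{
						Requests: corev1.ResourceList{corev1.ResourceStorage: resource.MustParse("1Gi")},
					},
				},
			}
			if _, err := client.CoreV1().PersistentVolumeClaims(ns).Create(ctx, pvc, metav1.CreateOptions{}); err != nil {
				errs <- err
				return
			}

			pod := &corev1.Pod{
				ObjectMeta: metav1.ObjectMeta{Name: name},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:         "pause",
						Image:        "k8s.gcr.io/pause:3.1",
						VolumeMounts: []corev1.VolumeMount{{Name: "vol", MountPath: "/data"}},
					}},
					Volumes: []corev1.Volume{{
						Name: "vol",
						VolumeSource: corev1.VolumeSource{
							PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{ClaimName: name},
						},
					}},
				},
			}
			if _, err := client.CoreV1().Pods(ns).Create(ctx, pod, metav1.CreateOptions{}); err != nil {
				errs <- err
			}
		}(i)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		return 0, err
	}

	// Poll until every pod reports Running or the time box expires.
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		pods, err := client.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{})
		if err != nil {
			return 0, err
		}
		running := 0
		for _, p := range pods.Items {
			if p.Status.Phase == corev1.PodRunning {
				running++
			}
		}
		if running >= n {
			return time.Since(start), nil
		}
		time.Sleep(5 * time.Second)
	}
	return 0, fmt.Errorf("timed out waiting for %d pods to be Running", n)
}
```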

* Delete many pods with 1 volume each.
* Delete 1 pod with many volumes.
* Measure pod deletion time, with a breakdown of time spent in unmount and detach operations.
* Note: Detach can only be measured for CSI volumes by the removal of the VolumeAttachment object.
Contributor:

Isn't detach always decoupled from pod deletion? I.e. pod stops, some time later the volume gets detached from the node? I remember that this was driven by the control loop in kubelet, but it's quite likely that I don't remember correctly or misunderstood.

If it is indeed decoupled from pod deletion, then an approach that measures time based on pod events won't work.

If CSI volumes need special handling, then this affects test portability.

Member Author:

Yes, the problem is that Pod deletion is not tied to detach. But I think detach is still important to measure because for RWO volumes, it impacts the time it takes to launch a new Pod that uses the same volume.

We may have to make an exception on portability in order to collect detach metrics. It could just be that for non-CSI volumes, we won't be able to collect that data.
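
As a sketch of how detach could be timed for CSI volumes (the function name and timeout are assumptions, and mapping a pod's volume to its VolumeAttachment name is left out), one option is to watch for the VolumeAttachment object's deletion:

```go
package scaletest

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// measureDetach watches for the deletion of a named VolumeAttachment object and
// returns how long it took from "now" (e.g. right after the pod was deleted)
// until the attachment disappeared, i.e. an upper bound on detach latency.
func measureDetach(ctx context.Context, client kubernetes.Interface, name string, timeout time.Duration) (time.Duration, error) {
	start := time.Now()
	w, err := client.StorageV1().VolumeAttachments().Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=" + name,
	})
	if err != nil {
		return 0, err
	}
	defer w.Stop()

	deadline := time.After(timeout)
	for {
		select {
		case ev, ok := <-w.ResultChan():
			if !ok {
				return 0, fmt.Errorf("watch closed before %s was deleted", name)
			}
			if ev.Type == watch.Deleted {
				return time.Since(start), nil
			}
		case <-deadline:
			return 0, fmt.Errorf("timed out waiting for VolumeAttachment %s to be deleted", name)
		}
	}
}
```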

@msau42 (Member Author) commented Feb 9, 2019

Updated to address comments

from kube-controller-manager
* Caveat: These metrics measure the time of a single operation attempt and not
the e2e time across retries. These metrics may not be suitable to measure
latency from a pod's perspective.
Member:

I think that we need to clearly distinguish two things:
(a) metrics that implement an SLI
(b) metrics that are useful for better understanding what is happening and/or debugging failures

Most of the questions here are for the (b) category, which is not the most important thing for the scalability SIG - this is important for individual SIGs to debug their features. And how exactly this will be done in their components is out of our responsibility (it is much more in the scope of sig instrumentation).
The main two categories of options are:

  • expose a set of aggregated metrics from your component (and then we can provide mechanisms for gathering them at the framework level - the problem isn't fully solved in ClusterLoader, but it's something we are looking at as this is really needed)
  • track this information in a more custom way in your component by:
    • some kind of tracing (until we have it, maybe traces in logs are good enough for this purpose: https://github.com/kubernetes/utils/tree/master/trace - see the sketch after this comment)
    • extending objects to keep more information (though we shouldn't increase the number of operations on the apiserver just for that)

But this is more individual - there isn't one single good option that will work in all cases.

@pohly @msau42
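
To illustrate the "traces in logs" option above, a minimal sketch using k8s.io/utils/trace; the operation wrapper, field names, and threshold are hypothetical and not the actual volume code paths:

```go
package scaletest

import (
	"time"

	utiltrace "k8s.io/utils/trace"
)

// attachVolume is a hypothetical operation wrapper; the real attach/mount code
// paths live in the volume controllers and kubelet, not here.
func attachVolume(volumeID, nodeName string) {
	tr := utiltrace.New("AttachVolume",
		utiltrace.Field{Key: "volume", Value: volumeID},
		utiltrace.Field{Key: "node", Value: nodeName})
	// Log the whole trace if the operation took longer than the threshold,
	// with per-step durations showing where the time went.
	defer tr.LogIfLong(500 * time.Millisecond)

	// ... call the cloud provider / CSI driver here ...
	tr.Step("ControllerPublishVolume returned")

	// ... update the VolumeAttachment status here ...
	tr.Step("VolumeAttachment status updated")
}
```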

Member Author:

Agree. For pod startup SLI, we already have the necessary metrics in place.

We already have metrics for volume operations, but like I mentioned above, they're not really good for measuring e2e time to successfully complete the operation. So the options I see are:

  • Add new metrics that can track total e2e time for the operation. This was attempted already in VolumeManager mount/unmount volume metrics kubernetes#57766, but it got complicated when we tried to handle operation cancellations/complete failure.
  • Add tracing. Needs investigation if this could be leveraged for more than just scale tests without too much overhead to the system. I still think a "total latency" metric would be very useful beyond scale tests.
  • Add timestamps to API objects. This would need to be carefully considered so that we don't trigger additional API updates. Ideally we already have some sort of status that we are already updating when the operation is complete. For something like attach, that can also be detached and reattached, we would need to make sure the timestamp makes sense in each of those states.
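
A sketch of what the first option could look like, using a plain Prometheus histogram; the metric name, labels, and the notion of remembering when the operation was first requested are assumptions, not an existing Kubernetes metric:

```go
package scaletest

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// volumeOperationE2ELatency would record the time from when an operation was
// first requested until it finally succeeded, across all retries - unlike the
// existing per-attempt operation metrics.
var volumeOperationE2ELatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "volume_operation_e2e_duration_seconds", // hypothetical name
		Help:    "End-to-end latency of a volume operation across retries.",
		Buckets: prometheus.ExponentialBuckets(0.1, 2, 12), // 0.1s .. ~200s
	},
	[]string{"operation", "plugin"},
)

func init() {
	prometheus.MustRegister(volumeOperationE2ELatency)
}

// observeCompletion is called once, when the operation finally succeeds.
// firstRequested must be remembered from the very first attempt, which is the
// tricky part alluded to above (cancellations, permanent failures).
func observeCompletion(operation, plugin string, firstRequested time.Time) {
	volumeOperationE2ELatency.WithLabelValues(operation, plugin).
		Observe(time.Since(firstRequested).Seconds())
}
```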

Member Author:

Also, I'll update the proposal to distinguish these two types of metrics

Member:

> I still think a "total latency" metric would be very useful beyond scale tests.

+1

> Ideally we already have some sort of status that we are already updating when the operation is complete. For something like attach, that can also be detached and reattached, we would need to make sure the timestamp makes sense in each of those states.

Yes - that makes perfect sense to me.

timestamps in PersistentVolumeClaim, VolumeAttachment, and Pod statuses,
respectively.
* TODO: how to measure time reliably for deletion events, where the API object
deletion is the trigger?
Member:

Good question, though I don't think it's a blocker.
We don't have any SLI for pod deletion and I've never heard high-level complaints that this is too slow.

Member Author:

When using RWO volumes, pod deletion (and unmount/detach) time becomes more important because the next pod using the same volume cannot be started until the volume has been detached from the previous node.

Member:

OK - that's a great point which I totally forgot about.
I will add a TODO to my SLI proposal to consider adding a deletion SLI for that reason.

I still think it's not a hard blocker - I would treat it more as a second step.

1. Start with the documented [max limits](https://kubernetes.io/docs/setup/cluster-large/)
and [stateless pod
SLO](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/pod_startup_latency.md)
as a target SLO.
Member:

kubernetes/community#3242 is out for review

* Measure time starting from pod scheduling to volume operation completion times
for provision, attach and mount. This requires changes in Kubernetes to add
timestamps in PersistentVolumeClaim, VolumeAttachment, and Pod statuses,
respectively.
Contributor:

With these timestamps, do you mean events or actual API changes of these types?

timestamps in PersistentVolumeClaim, VolumeAttachment, and Pod statuses,
respectively.
* TODO: how to measure time reliably for deletion events, where the API object
deletion is the trigger?
Contributor:

I'm not sure whether relying on the API to expose internals is the right approach. It probably isn't going to get us all the way.

As it has been pointed out in the review of the tracing proposal, mapping to spans (whether in OpenTracing or OpenCensus) doesn't look like the right model either.

Perhaps we really just need an event collection mechanism that is independent of the Kubernetes API, and then do some data analysis on that?

Member:

I agree that extending the API doesn't sound like a good option.
Though it's possible to, e.g., consider setting an annotation to expose that information - there is already a precedent for it (kubernetes/kubernetes#69946), though we should be careful with things like this.

Member Author:

I'm not convinced that adding timestamps of important events to API objects is exposing internal details. I was picturing this after looking at how the pod startup SLI is calculated, and seeing that it looks at the Pod container's start timestamp.

Member:

Let me rephrase my thoughts (also, I don't have full insight into what exactly you're suggesting, so this may not make sense - let me know if that's the case).
I guess what I was trying to say is that I would be reluctant to extend the overall Pod API just to enable tracking of some internal metrics. If what you have in mind is appropriately setting timestamps e.g. in Pod conditions or something like that, I think that can make sense (though I would need to understand a bit better what exactly you're proposing).
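
For reference, a sketch of the kind of timestamp-based calculation being discussed, using only fields that already exist on the Pod (this mirrors how pod startup latency can be derived from status data; it is not a proposal for a new API field):

```go
package scaletest

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// podStartupLatency derives a rough startup latency from timestamps already
// present in the Pod object: creation time, the PodScheduled condition, and
// the first container's running start time. Volume provision/attach/mount
// would need additional timestamps (the subject of this thread).
func podStartupLatency(pod *corev1.Pod) (scheduled, started time.Duration, ok bool) {
	var scheduledAt time.Time
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodScheduled && c.Status == corev1.ConditionTrue {
			scheduledAt = c.LastTransitionTime.Time
		}
	}
	var runningAt time.Time
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Running != nil {
			runningAt = cs.State.Running.StartedAt.Time
		}
	}
	if scheduledAt.IsZero() || runningAt.IsZero() {
		return 0, 0, false
	}
	created := pod.ObjectMeta.CreationTimestamp.Time
	return scheduledAt.Sub(created), runningAt.Sub(created), true
}
```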

@@ -107,7 +146,7 @@ These tests should measure how long it takes to start up a pod with volumes, ass

For each volume type:

* Create many pods with 1 volume each.
* Create many pods with 1 volume each in parallel.
Contributor:

Some more points to consider:

  • Do we create one volume per pod? Is that realistic (i.e. can the storage system scale to that number)? If we don't and volumes cannot be shared arbitrarily, then pods become unschedulable.
  • The proposed test scales all pods evenly across all nodes, right? As a side effect, we get volumes also spread evenly. I am wondering whether it makes sense to have a test that stresses scalability of the local operations by running a high number of pods+volumes on a single node.

Member:

> Do we create one volume per pod? Is that realistic (i.e. can the storage system scale to that number)? If we don't and volumes cannot be shared arbitrarily, then pods become unschedulable.

If you look into our current tests for stateless pods, what we do is:

  • saturate the cluster as fast as we can (during which we don't really measure anything)
  • then generate a constant stream of pods (actually we don't create them in parallel) for which we measure stuff

I think that would address your concern?

> The proposed test scales all pods evenly across all nodes, right? As a side effect, we get volumes also spread evenly. I am wondering whether it makes sense to have a test that stresses scalability of the local operations by running a high number of pods+volumes on a single node.

+1
Though it's a slightly different kind of test, for which a 1-node cluster is enough.

Contributor:

What does "saturate the cluster" mean? Can you point me to the code which does it or a description?

The actual measurements are then done sequentially (create one pod, wait for it to be running, stop it, create next pod, etc.)?

That would work much better for pods using the same PV. But timing will vary greatly depending on the scheduling pattern: one pod running after another on the same node will start up much faster than a pod for which the volume first has to be moved from the old node to the new node.

Member Author:

I don't want to handle sharing PVs in the first phase at least, because how it behaves is highly dependent on the storage backend type (RWO vs RWX) and how it gets scheduled.

What does saturating the cluster achieve? Is it for pre-pulling images onto the nodes?

I will add more detail that for these initial tests there is no sharing of volumes, and we can stress various dimensions (a sketch of the per-node case follows this list):

  • Total number of volumes per pod (1 pod with many volumes on one node)
  • Total number of volumes per node (many pods with 1 volume each on one node)
  • Total number of volumes per cluster (can we take total # volumes per node and multiply by # nodes?)
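
A small sketch of the per-node cases, under the same client-go assumptions as the earlier pod-creation sketch (the node label value is hypothetical): pinning the test pods to a single node isolates the per-node volume limits and the mount work of one kubelet from cluster-wide scheduling effects.

```go
package scaletest

import corev1 "k8s.io/api/core/v1"

// pinToNode constrains a test pod to a single node so that per-node limits
// (volumes per node, mounts handled by one kubelet) are what gets stressed.
// The node name is an assumption; a real test would discover it from the cluster.
// Setting pod.Spec.NodeName directly would bypass the scheduler entirely,
// while a nodeSelector keeps scheduling (and volume topology checks) in the path.
func pinToNode(pod *corev1.Pod, node string) {
	pod.Spec.NodeSelector = map[string]string{"kubernetes.io/hostname": node}
}
```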

@msau42 (Member Author) commented Feb 14, 2019

Added some more updates. PTAL @wojtek-t @pohly @saad-ali

@msau42 (Member Author) commented Feb 14, 2019

I also moved the other tests into a WIP section for the future. I just want to focus on the pod startup tests for this initial phase.

@msau42 msau42 changed the title WIP: Add storage scale testing kep Add storage scale testing kep Feb 14, 2019
@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Feb 14, 2019
@saad-ali (Member) left a comment:

/lgtm

/hold
To allow other reviewers time to review.

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 15, 2019
@msau42 (Member Author) commented Feb 21, 2019

ping @wojtek-t @pohly for review

@wojtek-t (Member):

@msau42 - sorry, I've been OOO for a couple of days (and am also trying to get through my backlog now). Looking now.

@wojtek-t (Member) left a comment:

Just some minor comments (feel free to ignore them). Overall looks great.

LGTM (please ping me if you want me to apply the label - not sure if you're waiting for @pohly too).


The framework already supports measurements for Pod startup latency and can be
used to determine a [Pod startup SLO with
volumes](https://github.com/kubernetes/community/pull/3242).
Member:

If any changes are needed (and I think there might be) - this should be on us. Happy to be your main point of contact for that.
@krzysied - FYI

Member:

You may want to make it explicit too.

[volume scheduling
metrics](https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/persistentvolume/scheduler_bind_cache_metrics.go)
* Timing of volume events. However, events should not be considered a reliable
measurement because they can be dropped or garbage collected. Also there are
Member:

For stateless pods, we are using events as debugging info - if they are there, we report them; if not, well - we just have the e2e value (which is the e2e SLI, which is enough to check whether we meet the SLO - just harder/impossible to debug in that case).

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 1, 2019
@msau42 (Member Author) commented Mar 1, 2019

Updated, squashed and changed to implementable. @wojtek-t @pohly PTAL.

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 1, 2019
@wojtek-t (Member):

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 11, 2019
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msau42, saad-ali, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 9d0f436 into kubernetes:master Mar 11, 2019