---
title: Volume Scale Testing Plan
authors:
- "@msau42"
owning-sig: sig-storage
participating-sigs:
- sig-scalability
reviewers:
- "@pohly"
- "@saad-ali"
- "@wojtek-t"
approvers:
- "@saad-ali"
- "@wojtek-t"
editor: TBD
creation-date: 2019-02-04
last-updated: 2019-02-04
status: provisional
see-also:
replaces:
superseded-by:
---

# Volume Scale Testing Plan

## Table of Contents

* [Summary](#summary)
* [Motivation](#motivation)
* [Goals](#goals)
* [Non-Goals](#non-goals)
* [Proposal](#proposal)
* [WIP Test Environment](#wip-test-environment)
* [Test Cases](#test-cases)
* [Pod Startup](#pod-startup)
* [Pod Teardown](#pod-teardown)
* [WIP: PV Binding Tests](#wip-pv-binding-tests)
* [WIP: PV Provisioning Tests](#wip-pv-provisioning-tests)
* [WIP: PV Deletion Tests](#wip-pv-deletion-tests)
* [Graduation Criteria](#graduation-criteria)
* [Phase 1](#phase-1)
* [Phase 2](#phase-2)
* [Implementation History](#implementation-history)


## Summary

This KEP outlines a plan for testing scalability of K8s storage components.

## Motivation

Adding storage scale tests will help:
* Understand the current scale limits of the Kubernetes storage system.
* Set expectations (SLOs) for consumers of the Volume API.
* Identify bottlenecks and help prioritize which ones need to be addressed.

### Goals

* Measure the overhead of K8s components owned by sig-storage:
* K8s volume controllers
* Kubelet volume manager
* CSI sidecars
* Stress various dimensions of volume operations to determine:
* Max volumes per pod
* Max volumes per node
* Max volumes per cluster
* Test with the following volume types:
* EmptyDir
* Secret
  * ConfigMap
* Downward API
* CSI mock driver
  * TODO: HostPath?
* TODO: Local?
* Provide a framework that vendors can use to run scale tests against their CSI drivers.

### Non-Goals

* Test and measure storage providers' drivers.

## Proposal

### WIP Test Environment

The tests should be developed and run in sig-scalability’s test framework and infrastructure.

TODO details and links

### Test Cases

#### Pod Startup

These tests should measure how long it takes to start up a pod with volumes, assuming that the volumes have already been provisioned. A sketch of such a test pod follows the list below.

For each volume type:

* Create many pods with 1 volume each.
* Create 1 pod with many volumes.
* For PVC-backed volumes, test with and without PV.NodeAffinity as an additional dimension.
* Measure pod startup time, with a breakdown of time spent in volume scheduling, attach and mount operations.
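To make the per-pod volume dimension concrete, below is a minimal sketch of one test pod that combines an ephemeral volume with a PVC-backed volume. The object names, image, and claim name are placeholders for illustration, not part of this KEP.

```yaml
# Hypothetical test pod: one ephemeral (emptyDir) volume plus one PVC-backed volume.
# Names and the claim are placeholders; real tests would generate them per pod.
apiVersion: v1
kind: Pod
metadata:
  name: scale-pod-0
spec:
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.1
    volumeMounts:
    - name: scratch
      mountPath: /scratch
    - name: data
      mountPath: /data
  volumes:
  - name: scratch
    emptyDir: {}
  - name: data
    persistentVolumeClaim:
      claimName: scale-pvc-0
```

The "many pods with 1 volume each" case would stamp out N copies of a pod like this with a single volume, while the "1 pod with many volumes" case would instead grow the volumes list within a single pod.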

#### Pod Teardown

These tests should measure how long it takes to delete a pod with volumes.

For each volume type:

* Delete many pods with 1 volume each.
* Delete 1 pod with many volumes.
* Measure pod deletion time, with a breakdown of time spent in unmount and detach operations.
  * Note: For CSI volumes, detach completion can only be measured by observing the removal of the VolumeAttachment object (an example object is sketched below).
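For reference, a VolumeAttachment for the CSI mock driver might look like the sketch below; the object name, driver name, node, and PV name are placeholders. Detach is considered complete once this object is deleted.

```yaml
# Hypothetical VolumeAttachment created by the attach/detach controller for a CSI volume.
# Its deletion signals that detach has finished.
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  name: csi-0123456789abcdef
spec:
  attacher: mock.csi.example.com
  nodeName: node-0
  source:
    persistentVolumeName: scale-pv-0
```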

#### WIP: PV Binding Tests

These tests should measure the time it takes to bind a PVC to a pre-provisioned, available PV.
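As an illustration, a pre-provisioned PV and a matching PVC might look like the following sketch; the storage class name, driver, and volume handle are placeholders, not part of this KEP.

```yaml
# Hypothetical pre-provisioned PV and the PVC that should bind to it.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: scale-pv-0
spec:
  capacity:
    storage: 1Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: scale-test
  csi:
    driver: mock.csi.example.com
    volumeHandle: vol-0
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: scale-pvc-0
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: scale-test
  resources:
    requests:
      storage: 1Gi
```

Binding latency would be measured from PVC creation until both the PVC and PV report the Bound phase.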

#### WIP: PV Provisioning Tests

These tests should measure the time it takes to bind a PVC to a dynamically provisioned PV.

For each volume type that supports provisioning (both binding modes are sketched after this list):

* Create many PVCs with immediate binding.
* Create many PVCs with delayed binding.
* Measure volume provisioning time.
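The two binding modes correspond to the StorageClass volumeBindingMode field. A sketch under assumed names follows; the provisioner string is a placeholder standing in for the CSI mock driver.

```yaml
# Hypothetical StorageClasses contrasting immediate and delayed (topology-aware) binding.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: scale-immediate
provisioner: mock.csi.example.com
volumeBindingMode: Immediate
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: scale-delayed
provisioner: mock.csi.example.com
volumeBindingMode: WaitForFirstConsumer
```

With WaitForFirstConsumer, provisioning is not triggered until a pod using the PVC is scheduled, so the delayed-binding case needs accompanying pods to drive the measurement.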

#### WIP: PV Deletion Tests

These tests should measure the time it takes to delete a PVC and PV.

For each volume type that supports deletion:

* Delete many PVCs whose bound PVs use the Delete reclaim policy (a sketch of such a StorageClass follows this list).
* Measure volume deletion time.
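As a sketch, the Delete reclaim policy would typically come from the StorageClass used to provision the volumes; the class and provisioner names here are placeholders.

```yaml
# Hypothetical StorageClass whose dynamically provisioned PVs are deleted
# along with their PVCs (reclaimPolicy: Delete).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: scale-delete
provisioner: mock.csi.example.com
reclaimPolicy: Delete
```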


## Graduation Criteria

### Phase 1

* Pod startup tests running in scale clusters for all the targeted volume types.
* Pod startup latency and max limits results published.
* Scale tests fail if latency exceeds the established thresholds.

### Phase 2

* Pod teardown tests with results published and thresholds established.
* PV provisioning and deletion tests with results published and thresholds established.
* PV binding tests with results published and thresholds established.


## Implementation History

