A BOSH Director is software deployed to a virtual machine (VM) which orchestrates the deployment of other VMs. We explore running the BOSH Director on Amazon Web Services (AWS) on VMs much smaller (fewer CPU cores, less RAM) than the default (m3.xlarge), and we find that a BOSH Director can be effective even when deployed to an instance as small as a t2.medium for small-to-medium sized deployments. This can reduce the annual cost of the Director by as much as 75%.
[We also explore installing the BOSH Director on an even-smaller instance, the t2.nano. The deployment failed ]
We compare three Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instance types: the default BOSH instance type, the m3.xlarge, [default_instance] and two t2 instance types, the t2.micro and the t2.nano (the two least expensive EC2 instance types).
Instance Type | vCPU | Mem (GiB) | Cost/yr [annual_cost] | Annual Savings |
---|---|---|---|---|
m3.xlarge | 4 | 15 | $1,485.60 | 0.0% |
t2.medium | 2 | 4 | $359.60 | 75.8% |
t2.small | 1 | 2 | $208.60 | 86.0% |
We found that t2.small instance could reliably deploy a 64-instance deployment
with max_threads: 8
(we were able to deploy 5 times. We had a failure the
sixth time, but it seemed to be infrastructure-related, not BOSH related (i.e.
failure message was Timed out pinging to c9ad83ee-f001-4ebc-bee3-1b56fae5b07c after 600 seconds
and AWS Console had a warning icon next to an instance)).
We found that a t2.small could not reliably deploy with max_threads: 8
; Of
six deployments, 3 failed (i.e. 50% success rate).
A t2.medium instance could reliable deploy a 64-instance deployment with
max_threads: 16
(we successfully deployed 6 times, though the results of
one of the deployments is not in the graph — the preceding delete deployment
did not finish, so its results were skewed).
A t2.medium instance with max_threads: 20
and max_in_flight: 64
would get 16
failed instances (up, but you need to ssh in & monit restart director
). Am
dropping max_in_flight: 32
.
16 of these messages. Always director job.
'bosh-smurf-ubuntu/8 (542d7e18-4658-4b56-9017-afbfd170cb01)' is not running after update. Review logs for failed jobs: director
Tried with max_in_flight: 32
. Failed again, 18 times. Will try max_in_flight: 16
failed again, but 5 directors, 2 health monitors.
I dropped update.update_watch_time
from 600s to 60s, and now I different get failures, Failed updating job bosh-smurf-centos > bosh-smurf-centos/1 (91099780-e522-43e7-ba64-22dd0d784480): Unknown CPI error 'Unknown' with message 'execution expired' (00:10:49)
. When I ssh into the VM and type monit status
,
I see the following:
/var/vcap/bosh/etc/monitrc:8: Warning: include files not found '/var/vcap/monit/job/*.monitrc'
The Monit daemon 5.2.5 uptime: 38m
System 'system_localhost' running
/var/vcap/{data,store}
are not mounted, obviously. Specifically, /dev/xvdb2 is
mounted to /tmp
instead of to /var/vcap/data
, and /dev/xvdf1
is not mounted
at all (should be mounted to /var/vcap/store
).
-bash-4.2# fdisk -l | grep "^Disk /dev"
Disk /dev/xvda: 3221 MB, 3221225472 bytes, 6291456 sectors
Disk /dev/xvdb: 2147 MB, 2147483648 bytes, 4194304 sectors
Disk /dev/xvdf: 3221 MB, 3221225472 bytes, 6291456 sectors
with max_threads
set to 16, I still see the
Failed updating job bosh-smurf-centos > bosh-smurf-centos/29 (eb19cc54-feef-4194-b6e9-493d2fe5a231): Unknown CPI error 'Unknown' with message 'execution expired' (00:08:31)
and also
System call error while talking to director: Network is down - connect(2) for "52.70.98.70" port 25555 (52.70.98.70:25555)
We deployed 64 VMs simultaneously using the BOSH "Dummy" Release (a minimal release with no jobs).
Our methodology has the following flaws:
- We stress but one function of the BOSH Director: deploying. We don't test exporting releases, cloud check, uploading releases or stemcells, recreating instances, etc....
- We don't test BOSH functions in parallel (e.g. uploading a release while deploying).
- Our benchmark is synthetic (i.e. the dummy release); it may have been of greater interest to deploy Cloud Foundry Elastic Runtime
- There may have been BOSH mishaps
Failure:
Started creating missing vms > dummy-centos/21 (ef0a7b45-f9b4-43ed-aa2e-9cd898405358)
Started creating missing vms > dummy-centos/29 (f4be42ec-9800-40bb-bd18-441fb69c88ff)
Failed creating missing vms > dummy-centos/26 (33b9b499-231d-4ca9-a8c9-e7e416021b9b): Invalid CPI response - SchemaValidationError: Expected instance of Hash, given instance of NilClass (00:03:21)
Failed creating missing vms > dummy-centos/25 (38ac84e7-6a00-44e7-a48a-2dc49685e0da): Invalid CPI response - SchemaValidationError: Expected instance of Hash, given instance of NilClass (00:03:17)
Error 100: Attempt to unlock a mutex which is not locked[WARNING] cannot access director, trying 4 more times...
while :; do sleep 30; uptime; done
13:39:05 up 9:24, 1 user, load average: 44.20, 19.70, 7.58
13:40:06 up 9:25, 1 user, load average: 45.31, 24.61, 10.08
13:40:48 up 9:26, 1 user, load average: 43.99, 26.78, 11.42
13:41:52 up 9:27, 1 user, load average: 43.36, 29.75, 13.47
13:42:34 up 9:28, 1 user, load average: 41.74, 30.96, 14.56
13:43:06 up 9:28, 1 user, load average: 38.24, 31.28, 15.27
13:44:06 up 9:29, 1 user, load average: 40.10, 32.94, 16.84
13:44:58 up 9:30, 1 user, load average: 37.52, 33.26, 17.79
13:45:30 up 9:31, 1 user, load average: 34.11, 32.85, 18.14
13:46:04 up 9:31, 1 user, load average: 35.84, 33.41, 18.87
13:46:53 up 9:32, 1 user, load average: 33.36, 32.96, 19.47
13:48:00 up 9:33, 1 user, load average: 34.42, 33.05, 20.60
13:49:21 up 9:35, 1 user, load average: 32.83, 32.87, 21.51
13:50:37 up 9:36, 1 user, load average: 32.18, 32.66, 22.26
13:51:37 up 9:37, 1 user, load average: 30.26, 31.83, 22.61
13:52:07 up 9:37, 1 user, load average: 26.69, 30.81, 22.57
13:52:44 up 9:38, 1 user, load average: 28.36, 30.71, 22.88
13:53:50 up 9:39, 1 user, load average: 31.43, 31.14, 23.51
13:54:23 up 9:40, 1 user, load average: 27.01, 30.00, 23.41
13:54:54 up 9:40, 1 user, load average: 23.29, 28.82, 23.23
13:55:47 up 9:41, 1 user, load average: 21.95, 27.38, 23.05
13:56:32 up 9:42, 1 user, load average: 22.25, 26.58, 22.98
13:57:20 up 9:43, 1 user, load average: 20.56, 25.30, 22.72
13:58:13 up 9:44, 1 user, load average: 17.18, 23.82, 22.37
13:59:11 up 9:45, 1 user, load average: 22.18, 23.78, 22.45
13:59:57 up 9:45, 1 user, load average: 14.05, 21.65, 21.79
14:00:27 up 9:46, 1 user, load average: 8.51, 19.58, 21.10
14:00:57 up 9:46, 1 user, load average: 5.24, 17.73, 20.43
14:01:27 up 9:47, 1 user, load average: 3.23, 16.05, 19.79
14:01:57 up 9:47, 1 user, load average: 2.04, 14.53, 19.16
14:02:27 up 9:48, 1 user, load average: 1.23, 13.14, 18.55
14:02:57 up 9:48, 1 user, load average: 0.75, 11.89, 17.96
14:03:27 up 9:49, 1 user, load average: 0.45, 10.75, 17.39
14:03:57 up 9:49, 1 user, load average: 0.33, 9.74, 16.85
14:04:27 up 9:50, 1 user, load average: 0.20, 8.81, 16.31
14:04:57 up 9:50, 1 user, load average: 0.12, 7.97, 15.79
14:05:27 up 9:51, 1 user, load average: 0.07, 7.21, 15.29
14:05:57 up 9:51, 1 user, load average: 0.04, 6.52, 14.80
14:06:27 up 9:52, 1 user, load average: 0.03, 5.89, 14.33
The director collapsed.
bosh tasks recent --no-filter
Acting as user 'admin' on 'BOSH'
+-----+---------+-------------------------+-------------------------+-----------+---------------+-------------------------------+--------------------------------------------------------------------------------+
| # | State | Started | Last Activity | User | Deployment | Description | Result |
+-----+---------+-------------------------+-------------------------+-----------+---------------+-------------------------------+--------------------------------------------------------------------------------+
| 109 | timeout | 2016-06-21 13:36:11 UTC | 2016-06-21 13:36:11 UTC | admin | dummy | create deployment | |
It was still using 270MB Swap
top - 14:14:10 up 10:00, 1 user, load average: 0.01, 1.28, 8.74
Tasks: 127 total, 2 running, 125 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.7 us, 0.0 sy, 0.0 ni, 99.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 499248 total, 402820 used, 96428 free, 4512 buffers
KiB Swap: 506040 total, 270900 used, 235140 free. 32320 cached Mem
t2.small deployed fine but the CLI lost contact with the server:
Started creating missing vms > dummy-centos/10 (38af1948-0891-4d7b-9d24-fc5cf13adaec)System call error while talking to director: Network is unreachable - connect(2) for "52.70.98.70" port 25555 (52.70.98.70:25555)
2nd t2.small deployment also failed:
Done creating missing vms > dummy-ubuntu/17 (cf8c1bbe-5a5d-4907-9b50-eeb715b3e861) (00:02:10)
Failed creating missing vms > dummy-ubuntu/22 (3173ad44-fe74-47b6-bcfb-4d7986284f03): Timed out pinging to 7809e02f-19dd-47bb-b07b-b32f0aa31edf after 600 seconds (00:12:02)
Failed creating missing vms (00:21:42)
3rd t2.small deployment also failed (also it created but 48 of the 64 VMs):
Done creating missing vms > dummy-centos/19 (978c7b63-e114-4c77-af82-bc71e19a06c8) (00:03:00)
Started creating missing vms > dummy-centos/25 (cbce0158-954f-423e-ba99-a2b03e6d70ef)System call error while talking to director: Network is unreachable - connect(2) for "52.70.98.70" port 25555 (52.70.98.70:25555)
In neither case was it the epic failure of the t2.micro:
grep "Unresponsive client detected" /var/vcap/sys/log/director/* # nothing
In fact, all 64 VMs were successfully deployed both times.
Failed updating job bosh-smurf-centos > bosh-smurf-centos/11 (dfb06de2-86bf-4ed4-8d53-b077fdf2493d): eventmachine not initialized: evma_signal_loopbreak (00:09:05)System call error while talking to director: Network is down - connect(2) for "52.70.98.70" port 25555 (52.70.98.70:25555)
Notes:
- t2.small managed to swap, but only a little
- a Director at rest (3 deployments, 66 VMs) shows 536MB
used by processes as
shown by [htop]. During a deploy with
max_threads: 10
, that number climbed as high as 1828MB, which indicates that a deploy can consume ~1292MB of RAM, which works out to ~128MB / thread.
a t2.
The author's BOSH director is deployed to a t2.nano instance and manages two single-VM deployments. The BOSH Director's CPU utilization is typically under 1% and the instance has maxed-out its CPU credits at 75.
Danny Berger, a BOSH developer, has also suggested stopping (i.e. shutting down) the BOSH Director as a mechanism to save money. One would still need pay $0.005 per hour to reserve the BOSH Director's elastic IP address and $0.10 per GB-month of provisioned storage Amazon EBS General Purpose SSD (gp2) volume. This works out to a total yearly cost of $97.80 — $54.00 for the storage and $43.80 for the Elastic IP address. Curiously, it's more expensive to reserve an Elastic IP address for a year ($43.80) than it is to spin up a t2.nano instance and assign the Elastic IP address to that instance ($38).
One large corporate user deploys BOSH Directors on t2.medium instances. Their experience indicates that the director property director.max_threads, which is set in the BOSH manifest, is too high for a t2.medium, and the cloudfoundry/bosh#1263
[default_instance] The instance type of the BOSH Director deployed to Amazon Web Services (AWS) is an m3.xlarge in the sample bosh-init manifest.
[annual_cost] AWS charges are broken into two components: EC2 (compute) costs and EBS (disk) costs:
Instance Type | EC2 cost/yr | EBS cost/yr | Total |
---|---|---|---|
m3.xlarge | $1428 | $57.60 | $1,485.60 |
t2.medium | $302 | $57.60 | $359.60 |
t2.small | $151 | $57.60 | $208.60 |
t2.micro | $75 | $57.60 | $132.60 |
t2.nano | $38 | $57.60 | $95.60 |
The EC2 cost/yr assumes 1-year term all-upfront reserved instances. Prices are in US dollars and are current as of 2016-06-14.
The EBS cost/yr assumes 3 volumes of a combined size of 48GB for an annual cost of $57.60:
BOSH Volume | Mount point | Type | Size (GB) | cost/yr |
---|---|---|---|---|
root | / |
gp2 | 3GB | $3.60 |
ephemeral | /var/vcap/data |
gp2 | 25GB | $30.00 |
persistent | /var/vcap/store |
gp2 | 20GB | $24.00 |
gp2 is an "Amazon EBS General Purpose SSD"; it costs $0.10 per GB-month of provisioned storage. Prices are in US dollars and are current as of 2016-06-14.
- The m3.xlarge instances use 2 x 40GB solid-state drives (SSDs) whereas the t2 instances use Amazon's Elastic Block Store (EBS), which should result in higher disk latency for the t2 instances (i.e. the t2 instances' disks should be slower).
- The m3 instance is a "Fixed Performance Instance type"; its vCPU is consistently available; the t2 instances' vCPUs are Burstable Performance Instances which "accrue CPU Credits when they are idle, and use CPU credits when they are active" (i.e. the t2 instance type may become sluggish or unresponsive when credits are low)