Merge pull request #22 from leongc/main
Showing 3 changed files with 30 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
Date and Time: 17 December 2024 | ||
|
||
Topics discussed: | ||
|
||
OpenAI Kubernetes failure Dec 11 | ||
https://status.openai.com/incidents/ctrsv3lwd797 | ||
[Quick takes on the recent OpenAI public incident write-up – Surfing Complexity](https://surfingcomplexity.blog/2024/12/14/quick-takes-on-the-recent-openai-public-incident-write-up/) | ||
- It was fun to use a well-written incident report from a different company to ask questions of our local cluster team | ||
- The control plane can create brittleness. E.g. people make claims about AZ robustness, but AWS control planes often run in us-east-1, so even people who "don’t deploy to us-east-1" can still be vulnerable to failures in that region through the AWS control plane | ||
- Designers often overlook the control plane when designing robustness for their systems | ||
- Question to stimulate wider design thinking: "if you lost access to the system, how would you restart with just a key to the datacenter?" | ||
|
||
LLM-based automation to manage infrastructure and its impact on safety | ||
- Known uses of LLMs so far have been as a buddy or assistant | ||
- We have seen proposals to have LLMs power decision-making, but haven’t seen stories of anyone doing this | ||
- We know LLMs have non-deterministic behaviors. | ||
- If LLMs do start making decisions, what will the consequences be for safety and reliability? For understanding incidents? | ||
|
||
Did seasonal holiday spending affect anyone? | ||
- Scaling could be automated, but not for databases: they cannot be resharded under load, because resharding itself requires additional capacity. | ||
- Use historical data, not current data, to predict seasonal load, and account for provisioning lead time. | ||
- Incidents might be more damaging to reputation and revenue than the cost of over-provisioning. | ||
- Live adjustment of prices is similar to a control plane capability layered on top of the primary functionality. | ||
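
The capacity-planning points above can be sketched as a small calculation. This is only an illustration, not something presented at the meeting: the function names, the growth and headroom parameters, and every number below are hypothetical assumptions. It forecasts a seasonal peak from last year's peak, converts it into a node count with headroom, notes how early provisioning has to start given a lead time, and compares the cost of spare capacity against the expected cost of an incident.

```python
import math
from dataclasses import dataclass


@dataclass
class SeasonalPlan:
    predicted_peak: float  # forecast peak load, e.g. requests per second
    nodes_needed: int      # capacity to have ready before the season starts
    start_by_day: int      # latest day to begin provisioning (season start = day 0)


def plan_seasonal_capacity(
    last_year_peak: float,
    yoy_growth: float,
    per_node_capacity: float,
    headroom: float,
    lead_time_days: int,
) -> SeasonalPlan:
    """Forecast from historical data (last year's peak), not from current load."""
    predicted_peak = last_year_peak * (1.0 + yoy_growth)
    # Round up: a fractional node still has to be provisioned as a whole node.
    nodes_needed = math.ceil(predicted_peak * (1.0 + headroom) / per_node_capacity)
    # Provisioning (or resharding) must begin at least lead_time_days early.
    return SeasonalPlan(predicted_peak, nodes_needed, -lead_time_days)


def overprovisioning_is_cheaper(
    extra_nodes: int,
    node_cost_per_season: float,
    incident_probability: float,
    incident_cost: float,
) -> bool:
    """Compare the cost of spare capacity with the expected cost of an incident."""
    return extra_nodes * node_cost_per_season < incident_probability * incident_cost


# Hypothetical numbers, chosen only to show the arithmetic.
plan = plan_seasonal_capacity(
    last_year_peak=8000.0,
    yoy_growth=0.5,
    per_node_capacity=400.0,
    headroom=0.25,
    lead_time_days=30,
)
print(plan)
print(overprovisioning_is_cheaper(8, 1000.0, 0.1, 200000.0))
```

The incident cost in such a comparison would need to include reputational damage as well as lost revenue, which is hard to estimate; the sketch treats it as a single assumed figure.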
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,8 @@ | ||
2023-08-15: 25 attendees, 41m average duration | ||
2023-09-19: 36 attendees, 68m average duration | ||
2023-10-18: 25 attendees, 46m average duration | ||
2024-01-16: 20 attendees, 46m average duration | ||
2024-09-17: 18 attendees, 35m average duration | ||
2024-10-15: 4 attendees, 44m average duration | ||
2024-11-19: 8 attendees, 44m average duration | ||
2024-12-17: 5 attendees, 56m average duration |