Merge pull request #22 from leongc/main
stevemcghee authored Jan 1, 2025
2 parents 5133595 + f18ce18 commit f169aa3
Showing 3 changed files with 30 additions and 1 deletion.
24 changes: 24 additions & 0 deletions discuss/2024-december.txt
@@ -0,0 +1,24 @@
Date and Time: 17 December 2024

Topics discussed:

OpenAI Kubernetes failure Dec 11
https://status.openai.com/incidents/ctrsv3lwd797
[Quick takes on the recent OpenAI public incident write-up – Surfing Complexity](https://surfingcomplexity.blog/2024/12/14/quick-takes-on-the-recent-openai-public-incident-write-up/)
- It was fun to use a well-written incident report from a different company to ask questions of our local cluster team
- The control plane can create brittleness. E.g. people make claims about AZ robustness, but many AWS control planes run in us-east-1, so even people who "don’t deploy to us-east-1" can still be vulnerable to failures in that AWS region because of the AWS control plane
- Designers often overlook the control plane when designing robustness for their systems
- Question to stimulate wider design thinking: "if you lost access to the system, how would you restart with just a key to the datacenter?"

LLM-based automation to manage infrastructure and its impact on safety
- Known uses of LLMs have been as a buddy or assistant
- Seen proposals to have LLMs power decision making. Haven’t seen stories of anyone doing this
- We know LLMs have non-deterministic behaviors.
- When LLMs do make these decisions, what will be the consequences for safety and reliability? For understanding the incidents?

Did seasonal holiday spending affect anyone?
- Could automate scaling, but not for databases: cannot reshard under load, because resharding requires additional capacity.
- Use historical, not current, data to predict seasonal load + lead time.
- Incidents might be more damaging to reputation and revenue than the cost of over-provisioning
- Live adjustment to prices is similar to control plane capabilities on top of the primary functionality
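
A minimal sketch of the "historical data + lead time" point above, not something presented at the meeting. It assumes last season's peak is known, and that growth, headroom, and provisioning lead time can be estimated. The function name, parameters, and all numbers are hypothetical.

```python
from datetime import date, timedelta


def seasonal_capacity_plan(peak_last_season, yoy_growth, headroom,
                           lead_time_days, season_start):
    """Return (target_capacity, order_by) for an upcoming seasonal peak.

    All inputs are illustrative assumptions:
      peak_last_season: peak load observed in the same season last year
      yoy_growth:       expected year-over-year growth, e.g. 0.2 for 20%
      headroom:         extra safety margin on top of the forecast
      lead_time_days:   days needed to provision capacity (or reshard)
      season_start:     date the seasonal load is expected to begin
    """
    # Forecast from last season's peak, not from current (off-season) load.
    target_capacity = peak_last_season * (1 + yoy_growth) * (1 + headroom)
    # Capacity must be requested before the season starts, while load is low.
    order_by = season_start - timedelta(days=lead_time_days)
    return target_capacity, order_by


target, order_by = seasonal_capacity_plan(
    peak_last_season=1000.0,
    yoy_growth=0.2,
    headroom=0.25,
    lead_time_days=30,
    season_start=date(2025, 11, 28),
)
print(f"provision for {target:.0f} units, order by {order_by.isoformat()}")
```

The headroom term reflects the trade-off noted above: some over-provisioning may cost less than an incident during peak season.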

1 change: 1 addition & 0 deletions discuss/index.md
@@ -41,6 +41,7 @@ This conversation is what you choose it to be! Participants are welcome to to pr

## Summaries of recent discussions

* [2024-December](2024-december.txt)
* [2024-January](2024-january.txt)
* [2023-October](2023-october.txt)
* [2023-September](2023-september.txt)
6 changes: 5 additions & 1 deletion discuss/stats.txt
@@ -1,4 +1,8 @@
2023-08-15: 25 attendees, 41m average duration
2023-09-19: 36 attendees, 68m average duration
2023-10-18: 25 attendees, 46m average duration
2024-01-16: 20 attendees, 46m average duration
2024-09-17: 18 attendees, 35m average duration
2024-10-15: 4 attendees, 44m average duration
2024-11-19: 8 attendees, 44m average duration
2024-12-17: 5 attendees, 56m average duration
