Alex fixes #157

Merged
merged 4 commits into from
Feb 1, 2024
2 changes: 1 addition & 1 deletion .github/workflows/inclusion.yml
@@ -26,4 +26,4 @@ jobs:
uses: actions/setup-node@v4

- name: Run inclusion
run: npx alex -q *.md || echo "Catch warnings and exit 0" # Once all warnings have been resolved, remove the second statement
run: npx alex -q *.md
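
Why dropping the `|| echo ...` fallback matters can be sketched in plain shell. The sketch below uses a hypothetical `fake_alex` function as a stand-in for `npx alex -q *.md`, and assumes (as the removed comment implies) that alex exits non-zero when it reports warnings. With the fallback, the step's exit status is that of `echo`, so warnings never fail the job; without it, the linter's own status is what the step reports.

```shell
# Hypothetical stand-in for `npx alex -q *.md` finding warnings: always exits 1.
fake_alex() { return 1; }

# Old form: the `||` fallback runs when the linter fails, and the overall
# status becomes that of `echo`, which succeeds.
fake_alex || echo "Catch warnings and exit 0"
masked_status=$?

# New form: nothing follows the linter, so its own exit status is reported.
if fake_alex; then unmasked_status=0; else unmasked_status=$?; fi

echo "masked=$masked_status unmasked=$unmasked_status"
```

Running this prints the fallback message followed by `masked=0 unmasked=1`, which is the difference between a job that always passes and one that fails on warnings.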
18 changes: 9 additions & 9 deletions AWS-costs.md
@@ -3,24 +3,24 @@ AWS Costs

### Trusted Advisor

Use the [Trusted Advisor](https://console.aws.amazon.com/trustedadvisor/home?#/dashboard) to identify instances that you can potentially downgrade to a smaller instance size or terminate. Trusted Advisor is a native AWS resource available to you when your account has Enterprise support. It gives recommendations for cost savings opportunities and also provides availability, security, and fault tolerance recommendations. Even simple tunings in CPU usage and provisioned IOPS can add up to significant savings.
Use the [Trusted Advisor](https://console.aws.amazon.com/trustedadvisor/home?#/dashboard) to identify instances that you can potentially downgrade to a smaller instance size or terminate. Trusted Advisor is a native AWS resource available to you when your account has Enterprise support. It gives recommendations for cost savings opportunities and also provides availability, security, and fault tolerance recommendations. Even the simplest tunings, such as to CPU usage and provisioned IOPS can add up to significant savings.

On the TA dashboard, click on **Low Utilization Amazon EC2 Instances** and sort the low utilisation instances table by the highest **Estimated Monthly Savings**.

### Billing & Cost management
You can use the [Bills](https://console.aws.amazon.com/billing/home?region=eu-west-1#/bill) and [Cost explorer](https://console.aws.amazon.com/billing/home?region=eu-west-1#/bill) to understand the breakdown of your AWS usage and possibly identify services you didn’t know you were using.

### Unattached Volumes
Volumes available but not in used costs the same price. You can easily find them in the [EC2 console](https://eu-west-1.console.aws.amazon.com/ec2/v2/home?region=eu-west-1#Volumes:state=available;sort=size) under Volumes section by filtering by state (available).
Volumes available but not in used costs the same price. You can find them in the [EC2 console](https://eu-west-1.console.aws.amazon.com/ec2/v2/home?region=eu-west-1#Volumes:state=available;sort=size) under Volumes section by filtering by state (available).

### Unused AMIs
Unused AMIs cost money. You can easily clean them up using the [AMI cleanup tool](https://github.com/guardian/deploy-tools-platform/tree/master/cleanup)
Unused AMIs cost money. You can clean them up using the [AMI cleanup tool](https://github.com/guardian/deploy-tools-platform/tree/master/cleanup)

### Unattached EIPs
Unattached Elastic IP addresses costs money. You can easily find them using the trust advisor, or looking at your bills as they are free if they are attached (so in use).
Unattached Elastic IP addresses costs money. You can find them using the trust advisor, or looking at your bills as they are free if they are attached (so in use).

### DynamoDB
It’s very easy to overcommit the reserved capacity on this service. You should frequently review the reserved capacity of all your dynamodb tables.
You should frequently review the reserved capacity of all your dynamodb tables to make sure it's not over-committed.
The easiest way to do this is to select the Metric tab and check the Provisioned vs. Consumed write and read capacity graphs and use the Capacity tab to adjust the Provisioned capacity accordingly.
Make sure the table capacity can handle traffic spikes. Use the time range on the graphs to see the past weeks usage.

@@ -38,7 +38,7 @@ Lower storage price, higher access price. Interesting for backups for instance.

* [Reduce Redundancy Storage](https://aws.amazon.com/s3/reduced-redundancy/)

Lower storage price, reduced redundancy. Interesting for easy reproducible data or non critical data such as logs for instance.
Lower storage price, reduced redundancy. Interesting for reproducible data or non-critical data such as logs.

* Glacier

@@ -51,9 +51,9 @@ Another useful feature to manage your buckets is the possibility to set [lifecyc
S3’s multipart upload feature accelerates the uploading of large objects by allowing you to split them up into logical parts that can be uploaded in parallel. However if you initiate a multipart upload but never finish it, the in-progress upload occupies some storage space and will incur storage charges.
And the thing is these uploads are not visible when you list the contents of a bucket through the console or the standard api (you have to use a special command)

There is 2 easy ways to solve this now and prevent it to happen in the future:
There are two ways to solve this now and prevent it from happening in the future:

* a [simple script](https://gist.github.com/mchv/9dccbd9245287b26e34ab78bad43ea6c) that can list them with size and potentially delete existing (based on [AWS API](http://docs.aws.amazon.com/cli/latest/reference/s3api/list-parts.html?highlight=list%20parts))
* a [script](https://gist.github.com/mchv/9dccbd9245287b26e34ab78bad43ea6c) that can list them with size and potentially delete existing (based on [AWS API](http://docs.aws.amazon.com/cli/latest/reference/s3api/list-parts.html?highlight=list%20parts))
* [Add a lifecycle rule](https://aws.amazon.com/blogs/aws/s3-lifecycle-management-update-support-for-multipart-uploads-and-delete-markers/) to each bucket to delete automatically incomplete multipart uploads after a few days ([official AWS doc](http://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html#mpu-abort-incomplete-mpu-lifecycle-config))

An example of how to cloud-form the lifecycle rule:
@@ -81,7 +81,7 @@ You can see savings of over `50%` on reserved instances vs. on-demand instances.
[More info on reserving instances](https://aws.amazon.com/ec2/purchasing-options/reserved-instances/getting-started/).

Reservations are set to a particular AWS region and to a particular instances type.
Therefore after making a reservation you are committing to run that particular region/instances combination until the reservation period finishes or you will swipe off all the financial benefits.
Therefore, after making a reservation you are committing to run that particular region/instances combination until the reservation period finishes, or you will swipe off all the financial benefits.

### Spot Instances

4 changes: 2 additions & 2 deletions AWS-lambda-metrics.md
@@ -3,15 +3,15 @@ Metrics for Lambdas
* AWS Embedded Metrics are an ideal solution for generating metrics for Lambda functions that will track historical data.
* They are a method for capturing Cloudwatch metrics as part of a logging request.
* This is good because it avoids the financial and performance cost of making a putMetricData() request.
* It also makes it easy to find the point at which the metric is updated in both the logs and in the code itself.
* It also makes it easier to find the point at which the metric is updated in both the logs and in the code itself.
* This does not work at all for our EC2 apps as their logs do not pass through Cloudwatch.
* [This pull request](https://github.com/guardian/mobile-n10n/pull/696) gives a working example of how to embed metrics in your logging request
* [This document](https://docs.google.com/document/d/1cL_t5NhO8J9Bwiu4rghoGh8i_um_sXDyKuq4COhdLEc/edit?usp=sharing) gives a good summary of why AWS embedded metrics are so useful
* Full details can be found in the [AWS Documentation](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format_Specification.html), but here are the highlights:
* To use AWS Embedded metrics, logs must be in JSON format.
* A metric is embedded in a JSON logging request by adding a root node named “_aws” to the start of the log request.
* The metric details are defined within this "_aws" node.
* The following code snippet shows a simple logging request updating a single metric:
* The following code snippet shows a logging request updating a single metric:

```json
{"_aws": {
8 changes: 4 additions & 4 deletions AWS.md
@@ -41,14 +41,14 @@ VPC

* To follow best practice for VPCs, ensure you have a single CDK-generated VPC in your account that is used to house your applications. You can find the docs for it [here](https://github.com/guardian/cdk/blob/main/src/constructs/vpc/vpc.ts#L32-L59).
* While generally discouraged, in some exceptional cases, such as security-sensitive services, you may want to use the construct to generate further VPCs in order to isolate specific applications. It is worth discussing with DevX Security and InfoSec if you think you have a service that requires this.
* Avoid using the default VPC - The default VPC is designed to make it easy to get up and running but with many negative tradeoffs:
* Avoid using the default VPC - The default VPC is designed to get you up and running quickly, but with many negative tradeoffs:
- It lacks the proper security and auditing controls.
- Network Access Control Lists (NACLs) are unrestricted.
- The default VPC does not enable flow logs. Flow logs allow users to track network flows in the VPC for auditing and troubleshooting purposes
- No tagging
- The default VPC enables the assignment of public addresses in public subnets by default. This is a security issue as a small mistake in setup could
then allow the instance to be reachable by the Internet.
* The account should be allocated a block of our IP address space to support peering. Often you may not know you need peering up front, so better to plan for it just in case. See [here](https://docs.aws.amazon.com/vpc/latest/peering/vpc-peering-basics.html) for more info on AWS peering rules.
* The account should be allocated a block of our IP address space to support peering. Often you may not know you need peering up front, so better to plan for it regardless. See [here](https://docs.aws.amazon.com/vpc/latest/peering/vpc-peering-basics.html) for more info on AWS peering rules.
* If it is likely that AWS resources will need to communicate with our on-prem infrastructure, then contact the networking team to request a CIDR allocation for the VPC.
* Ensure you have added the correct [Gateway Endpoints](https://docs.aws.amazon.com/vpc/latest/privatelink/vpce-gateway.html) for the AWS services being accessed from your private subnets to avoid incurring unnecessary networking costs.
* Security of the VPC and security groups must be considered. See [here](https://github.com/guardian/security-recommendations/blob/main/recommendations/aws.md#vpc--security-groups) for details.
@@ -116,7 +116,7 @@ and the the function does one or more of the following:

This started happening after a change in how the event loop works between NodeJS 8 and 10. The method AWS uses to freeze the lambda runtime after it has not been invoked for a while may not work correctly in the cases above.

The workaround is simple (if a little silly). Wrap your root handler in a setTimeout:
The workaround is to wrap your root handler in a setTimeout:

```javascript
exports.handler = function (event, context, callback) {
@@ -145,7 +145,7 @@ Your lambda will get triggered multiple times you trigger it synchronously using
#### Details
[`--cli-read-timeout`](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-options.html#:~:text=cli%2Dread%2Dtimeout) is a general CLI param that applies to all subcommands and determines how long it will wait for data to be read from a socket. It seems to default to 60 seconds.

In the case of a synchronously executed long-running lambda, this timeout can be exceeded. The first lambda invocation "fails" (though not in a way that is visible in any lambda metrics or logs), and the CLI will abort the request and retry. The first lambda invocation hasn't really failed though - it will continue to run, possibly successfully - it's just that the CLI client that initiated it has stopped waiting for a response.
In the case of a synchronously executed long-running lambda, this timeout can be exceeded. The first lambda invocation "fails" (though not in a way that is visible in any lambda metrics or logs), and the CLI will abort the request and retry. The first lambda invocation hasn't really failed though - it will continue to run, possibly successfully - but the CLI client that initiated it has stopped waiting for a response.

Setting `--cli-read-timeout` to `0` removes the timeout and makes the socket read wait indefinitely, meaning the CLI command will block until the lambda completes or times out.
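
As a rough sketch, a synchronous invocation with the read timeout disabled might look like the following. The function name and output path are hypothetical placeholders; `aws lambda invoke` and `--cli-read-timeout` are the subcommand and option discussed above. The command is assembled and printed for review rather than executed, so the snippet can be run without AWS credentials.

```shell
# Hypothetical values: substitute your own function name and output path.
FUNCTION_NAME="my-long-running-lambda"
OUTPUT_FILE="/tmp/lambda-response.json"

# --cli-read-timeout 0 disables the socket read timeout, so the CLI waits
# for the lambda to finish instead of aborting the request and retrying.
invoke_cmd="aws lambda invoke --function-name $FUNCTION_NAME --cli-read-timeout 0 $OUTPUT_FILE"

# Print the command for review; run it with: eval "$invoke_cmd"
echo "$invoke_cmd"
```

With the timeout disabled, the lambda's own configured timeout becomes the upper bound on how long the command blocks.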

3 changes: 2 additions & 1 deletion README.md
@@ -22,7 +22,8 @@ This repository document [principles](#principles), standards and [guidelines](#

- Publish services **availability SLA** to **communicate expectations** to users or dependent systems and identify area of improvements.

- Design for **simplicity** and **single responsibility**. Great designs model complex problems as simple discrete components. monitoring, self-healing and graceful degradation are simpler when responsibilities are not conflated.
<!-- alex ignore simple -->
- Design for **simplicity** and **single responsibility**. Great designs model complex problems as simple discrete components. Monitoring, self-healing and graceful degradation are simpler when responsibilities are not conflated.

- **Design for failures**. All things break, so the behaviour of a system when any of its components, collaborators or hosting infrastructure fail or respond slowly must be a key part of its design.

2 changes: 1 addition & 1 deletion RFCs.md
@@ -16,7 +16,7 @@ When you are ready, provide a clear description of:
2. Why the status quo does not address the problem
3. A proposed solution

This will help readers to more easily understand your rationale.
This will help readers to better understand your rationale.

Comments on Google docs or GitHub discussions are good ways of collecting people's thoughts alongside your original proposal.

2 changes: 1 addition & 1 deletion cdn.md
@@ -25,7 +25,7 @@ Fastly is highly programmable through its VCL configuration language, but VCL ca

A lot can be achieved with minimal Fastly configuration, and careful use of cache-control, surrogate-control and surrogate-key headers served by your application. This has the advantage that most of the caching logic is co-located with the rest of your application.

If this is insufficient, the next step is making use of [VCL Snippets](https://docs.fastly.com/en/guides/using-regular-vcl-snippets), which can be easily edited in the Fastly console and provide a useful way of providing a little extra functionality. You can try-out snippets of Fastly VCL functionality with https://fiddle.fastly.dev/ .
If this is insufficient, the next step is making use of [VCL Snippets](https://docs.fastly.com/en/guides/using-regular-vcl-snippets), which can be edited in the Fastly console and provide a useful way of providing a little extra functionality. You can try-out snippets of Fastly VCL functionality with https://fiddle.fastly.dev/ .

If you find that your VCL snippets are becoming large, you should consider switching to [custom VCL](https://docs.fastly.com/en/guides/uploading-custom-vcl), which should be versioned in Github, tested in CI and deployed using riff-raff, as in
https://github.com/guardian/fastly-edge-cache.
4 changes: 2 additions & 2 deletions client-side.md
@@ -14,7 +14,7 @@ See the separate [npm-packages.md](./npm-packages.md).
- Gzip all textual assets served, using GZip level 6 where possible
- Optimise images for size (e.g. jpegtran, pngquant, giflossy, svgo,
etc.)
- Favour SVGs where possible. What happens if images are disabled or
- Favour SVGs where possible. What happens if images aren't enabled or
unsupported?
- Avoid inlining encoded assets in CSS.

@@ -81,7 +81,7 @@ various areas below.

- Define what browsers and versions you support. What happens if using an unsupported browser?
- Define what viewports you support. What happens if using an unsupported viewport?
- What happens if JS/CSS is disabled or overridden in the client?
- What happens if JS/CSS is switched off or overridden in the client?

### Reporting

4 changes: 2 additions & 2 deletions domain-names.md
@@ -2,11 +2,11 @@

## Which DNS provider should I use?

NS1 is our preferred supplier for DNS hosting. We pay for their dedicated DNS service, which is independent from their shared platform. This means that even if their shared platform experiences a DDOS attack, our DNS will still be available. It is easy to cloudform DNS records in NS1 using the Guardian::DNS::RecordSet custom resource ([CDK](https://guardian.github.io/cdk/classes/constructs_dns.GuCname.html) / [Cloudformation](https://github.com/guardian/cfn-private-resource-types/tree/main/dns/guardian-dns-record-set-type/docs) docs)
NS1 is our preferred supplier for DNS hosting. We pay for their dedicated DNS service, which is independent from their shared platform. This means that even if their shared platform experiences a DDOS attack, our DNS will still be available. You can cloudform DNS records in NS1 using the Guardian::DNS::RecordSet custom resource ([CDK](https://guardian.github.io/cdk/classes/constructs_dns.GuCname.html) / [Cloudformation](https://github.com/guardian/cfn-private-resource-types/tree/main/dns/guardian-dns-record-set-type/docs) docs)

### Avoid Route53

In the past teams have delegated subdomains to Route53, but this approach is no longer recommended. It is now easy to manage DNS records in NS1 as infrastructure-in-code, so the main benefit of Route53 is eroded. Delegating to Route53 introduces an additional point of failure, since NS1 is authoritative for all of our key domain names. It also makes it harder for engineers and future tooling to reason about a domain.
In the past teams have delegated subdomains to Route53, but this approach is no longer recommended. It is now easier to manage DNS records in NS1 as infrastructure-in-code, so the main benefit of Route53 is eroded. Delegating to Route53 introduces an additional point of failure, since NS1 is authoritative for all of our key domain names. It also makes it harder for engineers and future tooling to reason about a domain.

### Exceptions where Route53 might be a good answer

1 change: 1 addition & 0 deletions elasticsearch.md
@@ -12,6 +12,7 @@ Regular snapshots of your cluster can provide restore points if data is lost. S

If you have the [AWS Plugin](https://github.com/elastic/elasticsearch-cloud-aws) installed you can perform snapshots to S3.

<!-- alex ignore master -->
Some examples of scripts used to setup and run S3 snapshots: https://github.com/guardian/grid/tree/master/elasticsearch/scripts

You can watch snapshots in progress: `curl $ES_URL:9200/_snapshot/_status`
2 changes: 1 addition & 1 deletion github.md
@@ -21,7 +21,7 @@ Bear in mind:
* The best visibility for most repositories is `Public`, rather than `Internal` or `Private`.
[Developing in the Open](https://www.theguardian.com/info/developer-blog/2014/nov/28/developing-in-the-open) makes better software!
* Make sure you grant an appropriate focussed [GitHub team](https://github.com/orgs/guardian/teams) full
[`Admin` access to the repo](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/managing-repository-settings/managing-teams-and-people-with-access-to-your-repository#filtering-the-list-of-teams-and-people) - this should be the just the dev team that will be owning this project, it shouldn't be a huge team with hundreds of members!
[`Admin` access to the repo](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/managing-repository-settings/managing-teams-and-people-with-access-to-your-repository#filtering-the-list-of-teams-and-people) - this should be the dev team that will be owning this project, not a huge team with hundreds of members!

We're no longer using https://repo-genesis.herokuapp.com/, as there are many different aspects to setting a GitHub repo up in the best possible
way, and repo-genesis only enforced a couple of them, and only at the point of creation. DevX have plans to enable a new repo-monitoring