
Add k8sattributesprocessor to otlp pipeline with workload type detection #1524

Closed
wants to merge 16 commits into from


musa-asad
Contributor

@musa-asad musa-asad commented Feb 3, 2025

Description of the issue

To support the Explore related feature in CloudWatch, the CloudWatch Agent sends an "Entity", which includes relevant metadata to correlate metrics or logs between resources (e.g., an EKS cluster) and services (e.g., a Java application). When the CloudWatch Agent runs in a Kubernetes cluster, we need to collect the namespace, workload name, and node name to populate the "Entity".

However, we currently only get Kubernetes metadata when Application Signals is enabled. For OTLP custom metrics, if Application Signals isn't configured, we have no way to fetch Kubernetes metadata. To close this gap, we must implement the Kubernetes Attributes Processor within the CloudWatch Agent.

Additionally, the process of fetching metadata with the Kubernetes Attributes Processor depends on the agent's workload type:

  • DaemonSet Mode:
    If the agent is running as a DaemonSet, we must configure a node filter. This prevents each agent pod from fetching metadata for pods on other nodes.

Hence, we must also implement workload type detection.
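A minimal sketch of how the translated processor config could differ by workload type, assuming the config is assembled as a generic map before being serialized. `buildK8sAttributesConfig`, the constant values, and the `K8S_NODE_NAME` variable are hypothetical; `pod_association` with `from: connection` and `filter.node_from_env_var` are options of the upstream k8sattributes processor.

```go
package main

import "fmt"

// Hypothetical workload-type constants (the PR adds constants for these,
// but their exact names and values are assumed here).
const (
	DaemonSet   = "DaemonSet"
	Deployment  = "Deployment"
	StatefulSet = "StatefulSet"
)

// buildK8sAttributesConfig returns a k8sattributes processor config as a
// generic map. The node filter is only added in DaemonSet mode, so each
// agent pod only watches pods scheduled on its own node.
func buildK8sAttributesConfig(workloadType string) map[string]any {
	cfg := map[string]any{
		// Associate incoming telemetry with a pod via the connection's source IP.
		"pod_association": []map[string]any{
			{"sources": []map[string]any{{"from": "connection"}}},
		},
	}
	if workloadType == DaemonSet {
		// K8S_NODE_NAME is an assumed env var, typically injected via the Downward API.
		cfg["filter"] = map[string]any{"node_from_env_var": "K8S_NODE_NAME"}
	}
	return cfg
}

func main() {
	for _, wt := range []string{DaemonSet, Deployment} {
		_, hasFilter := buildK8sAttributesConfig(wt)["filter"]
		fmt.Printf("%s: node filter = %v\n", wt, hasFilter)
	}
}
```

In Deployment or StatefulSet mode the agent is not co-located with every application pod, so a node filter would drop metadata it needs; that is why the filter is conditional.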

Description of changes

Revision 1

  • Implements k8sattributesprocessor:
    • Add translation logic for k8sattributesprocessor to extract metadata based on the application pod's IP, and set a node filter if the agent is a DaemonSet.
    • Add k8sattributesprocessor to otlp pipeline.
    • Update sample yaml files to include k8sattributesprocessor.
  • Implement workload type detection:
    • Add getWorkloadType() in translator/util/eksdetector/eksdetector.go to query the Kubernetes API with the POD_NAME and K8S_NAMESPACE environment variables and retrieve the workload type from the pod information.
    • Configure the Workload value in IsEKSCache.
    • Use IsEKSCache in DetectWorkloadType() to return the workload type.
    • Add getter and setter for workload type in translator/context/context.go.
    • Set workload type in the config-translator binary.
  • Add constants for DaemonSet, Deployment, and StatefulSet.
  • Update and add unit tests for new functionality.
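One way getWorkloadType() could derive the workload type from the pod information is via the pod's owner references. This is a sketch under assumptions: `OwnerRef` and `workloadTypeFromOwners` are hypothetical stand-ins (not the PR's code) so it runs without client-go, and it assumes a ReplicaSet owner implies a Deployment.

```go
package main

import "fmt"

const (
	DaemonSet   = "DaemonSet"
	Deployment  = "Deployment"
	StatefulSet = "StatefulSet"
)

// OwnerRef mirrors the two fields of a Kubernetes OwnerReference used here
// (hypothetical type, so the sketch runs without client-go).
type OwnerRef struct {
	Kind string
	Name string
}

// workloadTypeFromOwners maps a pod's owner references to a workload type.
// Assumption: a ReplicaSet owner implies a Deployment; a full implementation
// would fetch the ReplicaSet and check its own owner reference.
func workloadTypeFromOwners(owners []OwnerRef) string {
	for _, o := range owners {
		switch o.Kind {
		case "DaemonSet":
			return DaemonSet
		case "StatefulSet":
			return StatefulSet
		case "ReplicaSet":
			return Deployment
		}
	}
	// Bare pod or unrecognized owner: return "" rather than "Unknown".
	return ""
}

func main() {
	fmt.Println(workloadTypeFromOwners([]OwnerRef{{Kind: "DaemonSet", Name: "cloudwatch-agent"}}))
	fmt.Println(workloadTypeFromOwners([]OwnerRef{{Kind: "ReplicaSet", Name: "sample-app-69cdbb5f95"}}))
	fmt.Println(workloadTypeFromOwners(nil) == "")
}
```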

Revision 2

  • Add validation for SetWorkloadType().
  • Use constants for return values in getWorkloadType().
  • Change "Unknown" to "" in getWorkloadType() since it serves no functional purpose.
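A minimal sketch of what validation in SetWorkloadType() might look like. `Context` here is a hypothetical stand-in for the translator context, and accepting "" as a valid "not on Kubernetes" value is an assumption consistent with the change above, not confirmed by the PR.

```go
package main

import "fmt"

const (
	DaemonSet   = "DaemonSet"
	Deployment  = "Deployment"
	StatefulSet = "StatefulSet"
)

// Context is a hypothetical stand-in for the translator context.
type Context struct {
	workloadType string
}

// SetWorkloadType accepts only the known workload types, plus "" for
// "not running on Kubernetes / undetermined". Anything else is rejected
// and the previous value is kept.
func (c *Context) SetWorkloadType(wt string) error {
	switch wt {
	case DaemonSet, Deployment, StatefulSet, "":
		c.workloadType = wt
		return nil
	default:
		return fmt.Errorf("invalid workload type %q", wt)
	}
}

// WorkloadType returns the stored workload type.
func (c *Context) WorkloadType() string { return c.workloadType }

func main() {
	ctx := &Context{}
	fmt.Println(ctx.SetWorkloadType(DaemonSet), ctx.WorkloadType())
	fmt.Println(ctx.SetWorkloadType("Unknown") != nil, ctx.WorkloadType())
}
```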

Revision 3

  • Rename DetectWorkloadType() to GetWorkloadType().
  • Rename k8sattributes config files to indicate whether they include a node filter.
  • Adjust error messages.
  • Print out errors in getWorkloadType().

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

  1. Created an EKS cluster and deployed the Amazon CloudWatch Observability EKS add-on.
  2. Set up a sample application by following https://aws-otel.github.io/docs/getting-started/adot-eks-add-on/sample-app.
    • Removed resource attributes.
    • Changed OTEL_EXPORTER_OTLP_ENDPOINT to http://cloudwatch-agent.amazon-cloudwatch:4317.
  3. Built the agent image by running make docker-build-amd64 and changed the image in the AmazonCloudWatchAgent custom resource.
    • Added a debug exporter to the OTLP pipeline for testing.

Kubernetes Attributes Processor

Debug Exporter Output:
K8s Metadata:

k8s.pod.name: Str(sample-app-69cdbb5f95-sqks8)
k8s.namespace.name: Str(default)
k8s.replicaset.name: Str(sample-app-69cdbb5f95)
k8s.deployment.name: Str(sample-app)
k8s.node.name: Str(ip-XXX-XX-XX-XX.us-west-2.compute.internal)

Entity Fields:

com.amazonaws.cloudwatch.entity.internal.type: Str(Service)
com.amazonaws.cloudwatch.entity.internal.service.name: Str(unknown_service:java)
com.amazonaws.cloudwatch.entity.internal.deployment.environment: Str(k8s:entity-cluster-2/default)
com.amazonaws.cloudwatch.entity.internal.platform.type: Str(K8s)
com.amazonaws.cloudwatch.entity.internal.k8s.cluster.name: Str(entity-cluster-2)
com.amazonaws.cloudwatch.entity.internal.k8s.namespace.name: Str(default)
com.amazonaws.cloudwatch.entity.internal.k8s.workload.name: Str(sample-app)
com.amazonaws.cloudwatch.entity.internal.k8s.node.name: Str(ip-XXX-XX-XX-XX.us-west-2.compute.internal)
com.amazonaws.cloudwatch.entity.internal.instance.id: Str(i-0da7f196c5fa59a25)

EMF Output:
K8s Metadata:

"k8s.deployment.name": "sample-app",
"k8s.namespace.name": "default",
"k8s.node.name": "ip-XXX-XX-XX-XX.us-west-2.compute.internal",
"k8s.pod.ip": "XXX.XX.XX.XXX",
"k8s.pod.name": "sample-app-69cdbb5f95-sqks8",
"k8s.replicaset.name": "sample-app-69cdbb5f95",
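The outputs above suggest how Entity fields relate to the k8sattributes metadata: the workload name matches k8s.deployment.name, and the deployment environment has the form k8s:<cluster>/<namespace>. A sketch of that mapping, inferred from the output only; the attribute priority order and the helper names are assumptions.

```go
package main

import "fmt"

// workloadKeys lists k8sattributes resource attributes in an assumed
// priority order for deriving the entity's workload name.
var workloadKeys = []string{
	"k8s.deployment.name",
	"k8s.daemonset.name",
	"k8s.statefulset.name",
	"k8s.replicaset.name",
	"k8s.job.name",
}

// workloadName returns the first non-empty workload attribute, or "".
func workloadName(attrs map[string]string) string {
	for _, k := range workloadKeys {
		if v := attrs[k]; v != "" {
			return v
		}
	}
	return ""
}

// deploymentEnvironment builds the "k8s:<cluster>/<namespace>" value seen
// in the debug exporter output.
func deploymentEnvironment(cluster, namespace string) string {
	return fmt.Sprintf("k8s:%s/%s", cluster, namespace)
}

func main() {
	attrs := map[string]string{
		"k8s.pod.name":        "sample-app-69cdbb5f95-sqks8",
		"k8s.namespace.name":  "default",
		"k8s.replicaset.name": "sample-app-69cdbb5f95",
		"k8s.deployment.name": "sample-app",
	}
	fmt.Println(workloadName(attrs))
	fmt.Println(deploymentEnvironment("entity-cluster-2", attrs["k8s.namespace.name"]))
}
```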

Workload Type Detection

DaemonSet:
[Screenshot, 2025-02-03 at 1:17:06 AM]

Deployment:
[Screenshot, 2025-02-03 at 1:17:48 AM]

Requirements

Before committing the code, please complete the following steps:

  1. Run make fmt and make fmt-sh
  2. Run make lint

@musa-asad musa-asad changed the base branch from main to custom-metrics-entity February 3, 2025 00:05
@musa-asad musa-asad self-assigned this Feb 3, 2025
@musa-asad musa-asad changed the title Add k8sattributesprocessor to k8s otlp pipeline with workload type detection Add k8sattributesprocessor to otlp pipeline with workload type detection Feb 3, 2025
@musa-asad musa-asad requested review from lisguo and okankoAMZ February 3, 2025 00:12
@musa-asad musa-asad requested review from duhminick and removed request for okankoAMZ February 3, 2025 02:01
@musa-asad
Contributor Author

musa-asad commented Feb 3, 2025

It looks like the sample application I used returns unknown_service:java as the service name. Similar to JMX, should we use the resource processor to remove this resource attribute so that we can fall back to the K8s workload name?
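The question proposes removing the attribute with a resource processor so the existing fallback applies. As a different, swapped-in illustration of the same rule, here is the fallback written as a plain function; `entityServiceName` is hypothetical. OpenTelemetry SDKs default service.name to "unknown_service", optionally suffixed with the process executable name, which is where "unknown_service:java" comes from.

```go
package main

import (
	"fmt"
	"strings"
)

// entityServiceName prefers the SDK-reported service.name, but falls back to
// the Kubernetes workload name when the SDK only produced its default
// "unknown_service" placeholder (e.g. "unknown_service:java").
func entityServiceName(serviceName, workloadName string) string {
	isDefault := serviceName == "" || strings.HasPrefix(serviceName, "unknown_service")
	if isDefault && workloadName != "" {
		return workloadName
	}
	return serviceName
}

func main() {
	fmt.Println(entityServiceName("unknown_service:java", "sample-app"))
	fmt.Println(entityServiceName("checkout", "sample-app"))
}
```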

@musa-asad musa-asad marked this pull request as ready for review February 3, 2025 06:19
@musa-asad musa-asad requested a review from a team as a code owner February 3, 2025 06:19
@musa-asad musa-asad changed the base branch from custom-metrics-entity to main February 3, 2025 06:20
@musa-asad musa-asad changed the base branch from main to custom-metrics-entity February 3, 2025 06:40
@musa-asad musa-asad changed the base branch from custom-metrics-entity to main February 3, 2025 06:41
@musa-asad musa-asad changed the base branch from main to custom-metrics-entity February 3, 2025 06:41
@musa-asad musa-asad changed the base branch from feature-custom-metrics-entity to main February 3, 2025 06:44
@musa-asad musa-asad changed the base branch from main to feature-custom-metrics-entity February 3, 2025 06:44
@musa-asad musa-asad closed this Feb 3, 2025
@musa-asad musa-asad reopened this Feb 3, 2025
@musa-asad musa-asad removed the request for review from duhminick February 3, 2025 07:00
@musa-asad musa-asad requested review from JayPolanco and duhminick and removed request for JayPolanco February 3, 2025 07:00
@musa-asad musa-asad force-pushed the feature-custom-metrics-entity branch from 7f0a8f8 to 5e1b3aa February 5, 2025 08:18
- cumulativetodelta/hostOtlpMetrics/cloudwatchlogs
- k8sattributes/hostOtlpMetrics/cloudwatchlogs
Contributor

Should cumulativetodelta come after awsentity?

Contributor Author

@musa-asad musa-asad Feb 5, 2025

From my understanding, the cumulativetodelta processor only modifies numeric datapoint values, whereas awsentity only touches resource attributes (and sometimes string datapoint attributes, to get service name information), so the order should be fine.

That said, my changes shouldn't have changed the position of the entity processor. I can move it after cumulativetodelta if you think that's safer. We do the same thing in the "hostDeltaMetrics" pipeline, so we may have to change that too. I'm fine with either option; let me know.

@@ -90,6 +99,33 @@ func (d *EksDetector) getConfigMap(namespace string, name string) (map[string]st
return configMap.Data, nil
}

func (d *EksDetector) getWorkloadType() (string, error) {
podName := os.Getenv("POD_NAME")
Contributor

How will these env vars be set?

Contributor Author

@musa-asad musa-asad Feb 5, 2025

@@ -80,6 +80,16 @@ func DetectKubernetesMode(configuredMode string) string {

}

func DetectWorkloadType() string {
Contributor

If we are returning a value, we should name the func "Get*".

Also, if we are checking for an error, maybe it's worth returning an error as well?

Suggested change:
- func DetectWorkloadType() string {
+ func GetWorkloadType() (string, error) {

Contributor Author

@musa-asad musa-asad Feb 5, 2025

We don't want to return an error (or warning) here: if it defaults to "", that just means the agent is not running on Kubernetes, which is fine. I did rename it to GetWorkloadType(), though.
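A sketch of the error-free getter described above, assuming detection runs once and is cached (the PR caches via IsEKSCache; `sync.Once` is a swapped-in stand-in here). `detect` is a hypothetical placeholder for the Kubernetes API lookup; failures are logged, matching "Print out errors in getWorkloadType()", and surface to callers only as "".

```go
package main

import (
	"fmt"
	"log"
	"sync"
)

var (
	detectOnce   sync.Once
	workloadType string
)

// detect is a hypothetical stand-in for the Kubernetes API lookup. Here it
// simulates running outside a cluster.
var detect = func() (string, error) {
	return "", fmt.Errorf("not running on Kubernetes")
}

// GetWorkloadType returns the cached workload type. Detection errors are
// logged but not returned: "" simply means "not on Kubernetes".
func GetWorkloadType() string {
	detectOnce.Do(func() {
		wt, err := detect()
		if err != nil {
			log.Printf("workload type detection failed: %v", err)
			wt = ""
		}
		workloadType = wt
	})
	return workloadType
}

func main() {
	fmt.Printf("%q\n", GetWorkloadType())
}
```

Keeping the signature as `string` means callers treat "" as "no Kubernetes-specific config" without extra error handling.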

@musa-asad musa-asad closed this Feb 6, 2025