
Prometheus documentation deprecated and broken #2380

Open
fjammes opened this issue Jan 10, 2025 · 4 comments
Labels
kind/bug Something isn't working

Comments


fjammes commented Jan 10, 2025

What happened?

  • ✋ I have searched the open/closed issues and my issue is not listed.

Reproduction Code

Follow these instructions for setting up Prometheus metrics: https://kubeflow.github.io/spark-operator/docs/user-guide.html#monitoring
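
The relevant part of my configuration, following those instructions (full manifest under Additional context below), boils down to:

# Exporter jar listed in spec.deps.jars and referenced by path in
# monitoring.prometheus.jmxExporterJar, as set up per the linked guide.
spec:
  deps:
    jars:
    - https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.11.0/jmx_prometheus_javaagent-0.11.0.jar
  monitoring:
    exposeDriverMetrics: true
    exposeExecutorMetrics: true
    prometheus:
      jmxExporterJar: /home/fink/jmx_prometheus_javaagent-0.11.0.jar
      port: 8090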

Expected behavior

Prometheus metrics should be exported from the Spark driver and executor pods.

The documentation should be updated and fixed.

Actual behavior

The Spark driver is unable to start because the Prometheus JMX exporter jar is missing:

kubectl logs -n spark fink-broker-stream2raw-driver | tail -n 10
+ '[' -z ']'
+ '[' -z x ']'
+ SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*'
+ case "$1" in
+ shift 1
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.244.0.54 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner local:///home/fink/fink-broker/bin/stream2raw.py -log_level DEBUG -online_data_prefix hdfs://simple-hdfs-namenode-default-0.simple-hdfs-namenode-default.hdfs:8020///user/185 -producer sims -tinterval 2 --noscience -servers kafka-cluster-kafka-bootstrap.kafka:9092 -schema /home/fink/fink-alert-schemas/ztf/datasim_basic_alerts_all_distribute_topics.avro -startingoffsets_stream earliest -topic ztf_public_20200101 -night 20200101
Error opening zip file or JAR manifest missing : /home/fink/jmx_prometheus_javaagent-0.11.0.jar
Error occurred during initialization of VM
agent library failed to init: instrument

Environment & Versions

  • Kubernetes Version: v1.31.0
  • Spark Operator Version: docker.io/kubeflow/spark-operator:2.1.0
  • Apache Spark Version: 3.4.1

Additional context

Here is the YAML for the SparkApplication; it seems spec.deps downloads the jar only after the JVM has started, so the agent jar is not yet on disk when the driver comes up.

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"sparkoperator.k8s.io/v1beta2","kind":"SparkApplication","metadata":{"annotations":{},"labels":{"app.kubernetes.io/instance":"fink-broker"},"name":"fink-broker-stream2raw","namespace":"spark"},"spec":{"arguments":["-log_level","DEBUG","-online_data_prefix","hdfs://simple-hdfs-namenode-default-0.simple-hdfs-namenode-default.hdfs:8020///user/185","-producer","sims","-tinterval","2","--noscience","-servers","kafka-cluster-kafka-bootstrap.kafka:9092","-schema","/home/fink/fink-alert-schemas/ztf/datasim_basic_alerts_all_distribute_topics.avro","-startingoffsets_stream","earliest","-topic","ztf_public_20200101","-night","20200101"],"deps":{"jars":["https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.11.0/jmx_prometheus_javaagent-0.11.0.jar"]},"driver":{"coreRequest":"0","cores":1,"env":[{"name":"SPARK_USER","value":"185"}],"javaOptions":"-Divy.cache.dir=/tmp -Divy.home=/tmp -Dcom.amazonaws.sdk.disableCertChecking=true","labels":{"version":"3.4.1"},"memory":"1000m","serviceAccount":"spark"},"executor":{"coreRequest":"0","cores":1,"env":[{"name":"SPARK_USER","value":"185"}],"instances":1,"javaOptions":"-Dcom.amazonaws.sdk.disableCertChecking=true","labels":{"version":"3.4.1"},"memory":"512m"},"image":"gitlab-registry.in2p3.fr/astrolabsoftware/fink/fink-broker-noscience:v3.1.3-rc1-55-gf99bc5b","imagePullPolicy":"IfNotPresent","mainApplicationFile":"local:///home/fink/fink-broker/bin/stream2raw.py","mode":"cluster","monitoring":{"exposeDriverMetrics":true,"exposeExecutorMetrics":true,"prometheus":{"jmxExporterJar":"/home/fink/jmx_prometheus_javaagent-0.11.0.jar","port":8090}},"pythonVersion":"3","restartPolicy":{"onFailureRetries":3,"onFailureRetryInterval":10,"onSubmissionFailureRetries":5,"onSubmissionFailureRetryInterval":20,"type":"OnFailure"},"sparkConf":null,"sparkVersion":"3.4.1","type":"Python"}}
  creationTimestamp: "2025-01-10T09:31:38Z"
  generation: 2
  labels:
    app.kubernetes.io/instance: fink-broker
  name: fink-broker-stream2raw
  namespace: spark
  resourceVersion: "10356"
  uid: 17b9f1c7-b1e6-4f2c-bd6f-5f0993f28693
spec:
  arguments:
  - -log_level
  - DEBUG
  - -online_data_prefix
  - hdfs://simple-hdfs-namenode-default-0.simple-hdfs-namenode-default.hdfs:8020///user/185
  - -producer
  - sims
  - -tinterval
  - "2"
  - --noscience
  - -servers
  - kafka-cluster-kafka-bootstrap.kafka:9092
  - -schema
  - /home/fink/fink-alert-schemas/ztf/datasim_basic_alerts_all_distribute_topics.avro
  - -startingoffsets_stream
  - earliest
  - -topic
  - ztf_public_20200101
  - -night
  - "20200101"
  deps:
    jars:
    - https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.11.0/jmx_prometheus_javaagent-0.11.0.jar
  driver:
    coreRequest: "0"
    cores: 1
    env:
    - name: SPARK_USER
      value: "185"
    javaOptions: -Divy.cache.dir=/tmp -Divy.home=/tmp -Dcom.amazonaws.sdk.disableCertChecking=true
    labels:
      version: 3.4.1
    memory: 1000m
    serviceAccount: spark
  executor:
    coreRequest: "0"
    cores: 1
    env:
    - name: SPARK_USER
      value: "185"
    instances: 1
    javaOptions: -Dcom.amazonaws.sdk.disableCertChecking=true
    labels:
      version: 3.4.1
    memory: 512m
  image: gitlab-registry.in2p3.fr/astrolabsoftware/fink/fink-broker-noscience:v3.1.3-rc1-55-gf99bc5b
  imagePullPolicy: IfNotPresent
  mainApplicationFile: local:///home/fink/fink-broker/bin/stream2raw.py
  mode: cluster
  monitoring:
    exposeDriverMetrics: true
    exposeExecutorMetrics: true
    prometheus:
      jmxExporterJar: /home/fink/jmx_prometheus_javaagent-0.11.0.jar
      port: 8090
  pythonVersion: "3"
  restartPolicy:
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
    type: OnFailure
  sparkVersion: 3.4.1
  type: Python

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.


fjammes commented Jan 10, 2025

Here is a possible approach to export Prometheus metrics:

Add the latest version of jmx_prometheus_javaagent.jar to the Dockerfile of the Spark runner image:

# Bundle the JMX Prometheus exporter agent into the Spark image
ENV JMX_EXPORTER_AGENT_VERSION 1.1.0
ADD https://github.com/prometheus/jmx_exporter/releases/download/${JMX_EXPORTER_AGENT_VERSION}/jmx_prometheus_javaagent-${JMX_EXPORTER_AGENT_VERSION}.jar /opt/spark/jars/
RUN chmod 644 /opt/spark/jars/jmx_prometheus_javaagent-${JMX_EXPORTER_AGENT_VERSION}.jar

Configure the SparkApplication:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
...
spec:
  ...
  monitoring:
    exposeDriverMetrics: true
    exposeExecutorMetrics: true
    prometheus:
      jmxExporterJar: /opt/spark/jars/jmx_prometheus_javaagent-1.1.0.jar
      port: 8090

Test:

kubectl port-forward -n spark fink-broker-stream2raw-driver 8090&
curl localhost:8090/metrics | tail -n 5
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0Handling connection for 8090
100 22877  100 22877    0     0   365k      0 --:--:-- --:--:-- --:--:--  366k
# TYPE spark_driver_livelistenerbus_queue_streams_numdroppedevents_type_counters_count_total counter
spark_driver_livelistenerbus_queue_streams_numdroppedevents_type_counters_count_total{app_id="fink-broker-stream2raw",app_namespace="spark"} 0.0
# HELP spark_driver_livelistenerbus_queue_streams_size_type_gauges Attribute exposed for management metrics:name=spark.fink-broker-stream2raw.driver.LiveListenerBus.queue.streams.size,type=gauges,attribute=Value
# TYPE spark_driver_livelistenerbus_queue_streams_size_type_gauges gauge
spark_driver_livelistenerbus_queue_streams_size_type_gauges{app_id="fink-broker-stream2raw",app_namespace="spark"} 0.0
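
To have Prometheus actually collect these metrics, a scrape job along the following lines should work (just a sketch: it assumes a plain Prometheus deployment and relies on the spark-role=driver label that Spark sets on driver pods; the job name and namespace are examples):

scrape_configs:
- job_name: spark-driver-jmx
  kubernetes_sd_configs:
  - role: pod
    namespaces:
      names:
      - spark
  relabel_configs:
  # Keep only Spark driver pods.
  - source_labels: [__meta_kubernetes_pod_label_spark_role]
    regex: driver
    action: keep
  # Point the scrape at the JMX exporter port configured above (8090).
  - source_labels: [__meta_kubernetes_pod_ip]
    target_label: __address__
    replacement: "$1:8090"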


fjammes commented Jan 10, 2025

Is it possible to update the documentation with this procedure?


fjammes commented Jan 15, 2025

Any idea please?


fjammes commented Feb 7, 2025

Is this monitoring feature still maintained by anyone? It seems there is no Grafana dashboard for the metrics produced with jmx_prometheus_javaagent-1.1.0.jar. Could someone help improve the Prometheus integration and its documentation?
