Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding a Kafka producer for sending events to the Event Bus #20895

Open
wants to merge 39 commits into
base: master
Choose a base branch
from

Conversation

madebydna
Copy link

@madebydna madebydna commented Feb 20, 2025

Summary

  • This work is behind a feature toggle (flipper): YES
  • This PR introduces a multi-purpose Kafka producer enabled to send messages to the Event Bus. This is part of the "Submission Traceability" teams efforts to collect events ("traces") about various form submissions in the journey of the form through various VA systems. We are currently only targeting the dev environment with these changes.
  • A detailed writeup about the changes can be found here.
  • Note that this PR only adds a Kafka producer but does not actually send events. Code to send events will be added in future PRs.
  • The new gems being introduced are waterdrop for producing Kafka events, avro_turf for encoding messages in Avro format and integrating with a Schema Registry, avro gem for encoding messages in Avro format, aws-msk-iam-sasl-signer for authenticating to the Event Bus, and testcontainers-ruby for integration tests agains a dockerized Kafka cluster.
  • All new gems have been selected carefully and are actively maintained.

Update

  • testcontainers-ruby gem and the corresponding spec are being held back as they require the presence of Docker which is not part of the GHA runners.
  • The avro_turf introduces a new dependency on the excon gem for making HTTP requests. This caused some spec failures due to an issue with VCR's webmock adapter in the test environment. It required a monkey patch to undo the handling of VCR's request headers for webmock in the presence of the excon gem. Note that the issue would NOT affect production code as the excon gem is only used in the context of Avro Turf.
  • Second update on the Avro Turf gem: To address the issues described above, we have since replaced that gem with the lower-level avro gem that only encodes payloads without actually interacting with a schema registry.

Related issue(s)

Testing done

  • New code is covered by unit tests. Unit tests use a mock Kafka client named WaterDrop::Clients::Buffered in the test environment.
  • Integration tests use the ruby-testcontainer gem to run against a Kafka cluster in a short-lived Docker environment.

What areas of the site does it impact?

  • At the moment, no parts of the site should be impacted, since the Kafka producer is not actually used to send messages.

Acceptance criteria

  • I added unit tests and integration tests for the new code.
  • No error nor warning in the console.
  • Events are being sent to the appropriate logging solution
  • Documentation has been updated (link to documentation)
  • No sensitive information (i.e. PII/credentials/internal URLs/etc.) is captured in logging, hardcoded, or specs
  • Feature/bug has a monitor built into Datadog (if applicable)
  • If app impacted requires authentication, did you login to a local build and verify all authenticated routes work as expected
  • I added a screenshot of the developed feature

Requested Feedback

Once the code has been merged, the Kafka client will attempt to connect to the Event Bus based on the configured broker URLs and authentication features. This will only happen in the dev environment and when the feature toggle is turned on. However, connection issues are expected to occur until the connection details have been fully debugged. In that time, the Kafka client will produce some noisy logs in its attempt to connect.

@va-vfs-bot va-vfs-bot temporarily deployed to wip-kafka-producer/main/main February 20, 2025 22:21 Inactive
@madebydna madebydna self-assigned this Feb 20, 2025
# Any other errors. This should not happen and indicates trouble.
Rails.logger.error "An unexpected error occurred while producing a message to #{topic}. Please check the logs for more information. This dispatch will not reach Kafka"
raise e
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is in draft so please excuse me if you weren't expecting comments yet.

I'm curious about the pattern of catching the exception, sending it to the error log and then re-raising it. Why not just let the exception occur? To me, the exception class names are just as descriptive as the message that you prepend to the exception. I do see the utility of the else clause.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dellerbie thanks for taking a look! The PR is indeed still being worked on but this is a valid question. I was mainly rescuing and re-raising the errors to demonstrate the types of errors that can occur and to show that we can add specific log messages and custom error handling as appropriate. This is generally an experimental feature and one challenge is to ensure that we have proper instrumentation. I will likely remove this in the (near) future once we've had some trial runs

This comment was marked as resolved.

Copy link

Error: A file (or its parent directories) does not have a CODEOWNERS entry. Please update the .github/CODEOWNERS file and add the entry for the Offending file: app/sidekiq/kafka/example_job.rb

Copy link

Error: A file (or its parent directories) does not have a CODEOWNERS entry. Please update the .github/CODEOWNERS file and add the entry for the Offending file: app/sidekiq/kafka/example_job.rb

@va-vfs-bot va-vfs-bot temporarily deployed to wip-kafka-producer/main/main February 21, 2025 14:33 Inactive
Copy link

github-actions bot commented Feb 21, 2025

1 Warning
⚠️ This PR changes 298 LoC (not counting whitespace/newlines).

In order to ensure each PR receives the proper attention it deserves, we recommend not exceeding
200. Expect some delays getting reviews.

File Summary

Files

  • .github/CODEOWNERS (+4/-1)

  • Gemfile (+3/-0)

  • app/sidekiq/kafka/example_job.rb (+10/-0)

  • config/features.yml (+3/-0)

  • config/initializers/sidekiq.rb (+3/-0)

  • config/puma.rb (+3/-0)

  • config/settings.yml (+5/-0)

  • config/settings/development.yml (+5/-0)

  • config/settings/test.yml (+5/-0)

  • lib/kafka/avro_producer.rb (+52/-0)

  • lib/kafka/oauth_token_refresher.rb (+22/-0)

  • lib/kafka/producer_manager.rb (+49/-0)

  • lib/kafka/schemas/test-value-1.avsc (+12/-0)

  • spec/lib/kafka/avro_producer_spec.rb (+121/-0)

    Note: We exclude files matching the following when considering PR size:

    *.csv, *.json, *.tsv, *.txt, *.md, Gemfile.lock, app/swagger, modules/mobile/docs, spec/fixtures/, spec/support/vcr_cassettes/, modules/mobile/spec/support/vcr_cassettes/, db/seeds, modules/vaos/app/docs, modules/meb_api/app/docs, modules/appeals_api/app/swagger/, *.bru, *.pdf
    

Big PRs are difficult to review, often become stale, and cause delays.

Generated by 🚫 Danger

Copy link

Error: A file (or its parent directories) does not have a CODEOWNERS entry. Please update the .github/CODEOWNERS file and add the entry for the Offending file: config/initializers/waterdrop_client.rb

@va-vfs-bot va-vfs-bot temporarily deployed to wip-kafka-producer/main/main February 21, 2025 15:35 Inactive
Copy link

Error: A file (or its parent directories) does not have a CODEOWNERS entry. Please update the .github/CODEOWNERS file and add the entry for the Offending file: config/initializers/waterdrop_client.rb

@va-vfs-bot va-vfs-bot temporarily deployed to wip-kafka-producer/main/main February 21, 2025 16:56 Inactive
@va-vfs-bot va-vfs-bot temporarily deployed to wip-kafka-producer/main/main March 3, 2025 15:48 Inactive
Copy link
Contributor

@aherzberg aherzberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants