
Create build pipelines for Java dependencies #940

Open
4 tasks
Tracked by #583
dervoeti opened this issue Nov 19, 2024 · 6 comments

dervoeti commented Nov 19, 2024

Problem:
Let's say we apply a patch to e.g. Hadoop 3.4.1 to fix a vulnerability: we bump a dependency to the latest version and the vulnerability is gone. But all our products that depend on Hadoop Java artifacts will still pull the original Hadoop 3.4.1 components from the default public Maven repository, which does not contain our patched version.

We could instead contribute the patch upstream, which is nice, since we also get additional validation of the patch by the maintainers and other people can easily benefit from it as well. But to actually use the patch in all our products, we'd have to wait for the next Hadoop release.

Idea:
Build a patched version of Hadoop and publish it to our own Maven repo. Patch downstream products like Hive, Trino etc. to use that version of Hadoop. There might be multiple steps involved, for example: A vulnerability originating in Hadoop is present in a Trino image. It's in the Trino Phoenix plugin, so we'd have to build (and patch) that plugin ourselves. For that, we have to build and patch Phoenix ourselves first.

I think we should still try to contribute patches upstream in the long term, because that way we give something back, we get additional validation from the maintainers, and we have to maintain fewer custom patches.

To do:

  • Figure out whether we need a versioning scheme for our custom versions and, if so, what it should look like (e.g. Hadoop 3.4.1-stackable1.0.0)
  • If we have custom versions, how can we make sure the original version (e.g. Hadoop 3.4.1) is also present in the SBOM? That's needed because otherwise vulnerability scanners won't detect vulnerabilities filed directly against Hadoop 3.4.1.
  • Is building everything in one go a requirement? Example: if I create a patch for Hadoop and then build Druid, do I have to build Hadoop first and then trigger the Druid build (and patch it to use the new Hadoop version)? Or can I just build Druid and have it automatically build the latest version of Hadoop first and then build Druid with that version (e.g. with COPY --from..., see the sketch after this list)?
  • Do we want to have separate build targets for container images and JARs?
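
To illustrate the COPY --from idea from the list above, a rough sketch; the base images, stage names and paths are placeholders, not a concrete proposal:

# Hypothetical multi-stage build: rebuilding Druid transparently rebuilds Hadoop first.
FROM eclipse-temurin:11 AS hadoop-builder
# ... apply our patches and build Hadoop 3.4.1 here ...

FROM eclipse-temurin:11 AS druid-builder
# Pull the freshly built Hadoop artifacts straight out of the previous stage
COPY --from=hadoop-builder /stackable/hadoop/share/hadoop /stackable/shared/hadoop
# ... build Druid against the copied artifacts ...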

lfrancke commented:

This is related to (or a duplicate of) stackabletech/issues#674

dervoeti commented Jan 7, 2025

I thought about this a bit and can think of two possible solutions: an indirect and a direct relation between the build processes.

As an example: I want to build Druid and make it use a Hadoop version with custom patches

1. Indirect relation / remote Maven repo:

The Hadoop build process pushes JARs to a remote Maven repo and the Druid build pulls them from there.

Before pushing the JARs, we have to make sure that our fixed version of Hadoop has its own version identifier. Otherwise, if I fix something in Hadoop, push it, and someone else works on Druid in parallel, my change might interfere with their Druid build.

So we have to make the Druid build use a specific version of Hadoop (3.4.1-stackable1.0.0), or at least some pinned checksum.
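
For illustration, the pinning could look roughly like this in the Druid Dockerfile, assuming the product's Maven build exposes the Hadoop version as a property (hadoop.version is just an illustrative name, the real one depends on the product's pom.xml):

ARG HADOOP_VERSION=3.4.1-stackable1.0.0
# Resolve Hadoop at the pinned custom version from our remote Maven repo
RUN mvn clean package -DskipTests -Dhadoop.version=${HADOOP_VERSION}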

2. Direct relation / local Maven repo:

Directly copy the JARs over from another image, something like this in the Druid image:

COPY --chown=${STACKABLE_USER_UID}:0 --from=hadoop-builder /stackable/hadoop/share/hadoop/* /stackable/shared/artifacts/org/apache/hadoop

And then make Maven use them, for example by adding this to the build profile in pom.xml:

<repository>
    <id>custom-stackable</id>
    <url>file:///stackable/shared/artifacts</url>
</repository>
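
One caveat: a file:// repository only works if the copied directory follows the Maven repository layout (org/apache/hadoop/<artifactId>/<version>/...). A possible variant, sketched with placeholder paths, would be to let the hadoop-builder stage install into a dedicated local repository and copy that directory as a whole:

# In the hadoop-builder stage: install the built artifacts into a throwaway local repo
RUN mvn install -DskipTests -Dmaven.repo.local=/stackable/patched-m2

# In the Druid stage: copy the complete repository layout and point the file:// repository above at it
COPY --chown=${STACKABLE_USER_UID}:0 --from=hadoop-builder /stackable/patched-m2 /stackable/shared/artifacts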

Pros:

  • No need to host Maven repo
  • No need for custom version identifier for Hadoop
  • When I create my Hadoop patch and then build the Druid image, Docker will automatically rebuild Hadoop and then Druid afterwards

Cons:

  • How to do this for something other than Hadoop?
    Example:
    Druid has a direct dependency, for example jackson-dataformat-xml:2.12.7.
    jackson-dataformat-xml pulls in a vulnerable dependency.
    Maybe there is a newer version of the vulnerable dependency itself, but jackson-dataformat-xml has not released a new version with the fix yet.
    So we'd have to create our own fork of jackson-dataformat-xml with the bumped dependency.
    How do we get this forked version into Druid? It feels awkward to create a Docker image for it when it's really just a JAR (see the sketch below).
    Not sure if we'll really ever need this, but I wanted to mention it.
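
If we ever do need it, one option might be a small builder stage that never becomes a published image and only exists to produce the patched JAR; all names and paths here are hypothetical:

# Hypothetical stage that builds our jackson-dataformat-xml fork with the bumped dependency
FROM maven:3.9-eclipse-temurin-17 AS jackson-builder
COPY jackson-dataformat-xml/ /src/    # our mirrored + patched sources (hypothetical build-context path)
RUN mvn -f /src/pom.xml install -DskipTests -Dmaven.repo.local=/out

# In the Druid stage, merge the forked artifact into the shared file-based repository
COPY --from=jackson-builder /out/ /stackable/shared/artifacts/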

nightkr commented Jan 17, 2025

I think option 2 would generally be preferable (I really don't want the build process to have to bounce through Maven or depend on its current state...), but it does have the downside of slowing down the build a decent amount.

I think I'd rather treat the maven repo as a secondary output of the build step, instead of flatly saying "just copy the jars from in here". That would also be more applicable to the "turns out we need to rebuild jackson too" use-case.
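
A minimal sketch of that "secondary output" idea, reusing the hypothetical /stackable/patched-m2 path from the previous comment: a dedicated stage holds nothing but the Maven repository produced by the build, so product images can COPY --from it, and a separate publish step can export and push it without rebuilding the product.

# Stage whose only content is the Maven repository produced by the Hadoop build
FROM scratch AS hadoop-maven-repo
COPY --from=hadoop-builder /stackable/patched-m2/ /

# Export it on demand, e.g. before pushing to a remote Maven repo:
#   docker build --target hadoop-maven-repo --output type=local,dest=./m2-out .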

razvan commented Jan 17, 2025

Regarding the custom versions of artifacts, this Trivy page suggests that we might be able to use the version qualifier to label JARs while Trivy still looks up vulns for the base version.

dervoeti commented Jan 20, 2025

Regarding the custom versions of artifacts, this Trivy page suggests that we might be able to use the version qualifier to label JARs while Trivy still looks up vulns for the base version.

So Maven would pull version 1.2.3-stackable1.0.0 of a JAR and Trivy would still detect the version of the exact same JAR as 1.2.3? Sounds like a possible solution; we would need to check whether Syft handles it the same way (it probably does), in which case Syft would add version 1.2.3 to the SBOM automatically. It feels a bit "broken" to me for a JAR to have two versions at the same time, but I can't see any concrete negative effects yet.

The alternative would be to add the original JAR as a dummy component inside the image, so vulnerability scanners and Syft would find it as well. But then vulnerable dependencies of that JAR might still be present in the container image (at least if it's a fat JAR), even if we use newer versions of them in our forked Java dependency. The code can't really be executed, but the vulnerabilities in those dependencies would still be reported by scanners. So we wouldn't reduce the vulnerability count of the image; we could just issue VEX statements saying we're not affected because the vulnerable components are not used. Which is better than nothing, but not really what we want.
But maybe there's a way to create a fake JAR that's really slim and doesn't have any dependencies, really just a dummy to indicate that version 1.2.3 of the component is present in the image.
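
A hedged sketch of such a stub: SBOM tools typically read Maven coordinates from META-INF/maven/<groupId>/<artifactId>/pom.properties inside a JAR, so a metadata-only JAR might be enough, but whether Trivy and Syft actually pick it up would need to be verified (assumes a JDK with the jar tool in the build stage; paths are placeholders):

# Hypothetical: build a tiny stub JAR that only carries Maven metadata for the base version 3.4.1
RUN mkdir -p /tmp/stub/META-INF/maven/org.apache.hadoop/hadoop-common /stackable/metadata \
 && printf 'groupId=org.apache.hadoop\nartifactId=hadoop-common\nversion=3.4.1\n' \
      > /tmp/stub/META-INF/maven/org.apache.hadoop/hadoop-common/pom.properties \
 && cd /tmp/stub \
 && jar cf /stackable/metadata/hadoop-common-3.4.1-stub.jar META-INF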

dervoeti commented Jan 28, 2025

Discussion update

We had a meeting about this last week and agreed on the following things for now; they are still in "draft mode", however:

  • We need custom versions
  • Versions should have the SDP version as a suffix (e.g. 3.4.1-stackable25.3.0; see the sketch after this list)
  • COPY --from is the preferred solution for bringing in dependencies
  • When you build, for example, Hadoop locally, it should not be mandatory to publish it to Maven
  • Publishing to Maven should be a separate build step
  • We will discuss how to make scanners detect the original version once we have figured out all the other requirements of this issue; we are confident that we will find a way to achieve this
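
For illustration, stamping such a custom version during the build could be done with the versions-maven-plugin (just a sketch; the version value is the agreed example, not a decision on tooling):

# Rewrite the Maven project version to the SDP-suffixed one before building
RUN mvn versions:set -DnewVersion=3.4.1-stackable25.3.0 -DgenerateBackupPoms=false \
 && mvn install -DskipTests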

Current state of our builds

I also looked at how we currently build dependencies. In general, we have the choice of building the dependency directly inside a Dockerfile when building the product, or building it in a separate step, publishing the built artifact somewhere and downloading it in the Dockerfile.

Currently, we use both ways:

Variant 1, docker-images contains the build instructions:
The build happens either in a separate Dockerfile for the dependency (examples: kcat or Hadoop) or directly in the Dockerfile of the product (example: hbase-opa-authorizer). config-utils or containerdebug would also be examples of this variant.

Variant 2, the build instructions for the dependency live outside of docker-images:
The dependency is built somewhere else (for example locally or in a GitHub Action) and published somewhere (usually Nexus); the Dockerfile in docker-images pulls in the built artifact from there (examples: JMX exporter, kafka-opa-authorizer, druid-opa-authorizer).

The sources for both variants are mirrored either in one of our GitHub repositories or in Nexus.

Based on our discussion and the current state of how our builds work, I tried to create a draft for a possible way forward.

Possible way forward

  • Only use GitHub to mirror the source code: move all sources from Nexus into GitHub repositories and only use these repositories as sources in our build processes.
  • Use these repos purely as mirrors; apply custom patches to them as separate .patch files that are applied after cloning.
  • Build products and dependencies with Dockerfiles inside docker-images.
  • Each dependency should have its own Dockerfile. It might make sense to create separate folders for products (Dockerfiles that produce an image that is published, like Hadoop or Druid) and dependencies (the rest, like kcat or druid-opa-authorizer).
  • For some of these Dockerfiles, e.g. Hadoop or HBase, we need a "push to Maven repo" option, but not necessarily for all of them. For those where we need it, we should be able to set a build argument when running docker build to publish the built artifact to Maven (see the sketch after this list). We could automatically set this argument when doing a release.
  • As soon as we decide to publish a JAR that's used in SDP, the publishing of that JAR happens in docker-images and the version gets an SDP version suffix. Even if it's our own project (like druid-opa-authorizer), we don't apply patches on top of it, and the project already publishes its own JAR (which would technically be the same file), we always publish separate JARs from within docker-images as soon as they might be used by SDP clients.
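
To make the last two points more concrete, a rough sketch of such a Dockerfile; the repository URL, tag, paths and the PUBLISH_TO_MAVEN switch are placeholders, and a real Hadoop build would need more tooling than shown here:

# Sketch of a dependency Dockerfile in docker-images
FROM maven:3.9-eclipse-temurin-17 AS hadoop-builder
ARG PRODUCT=3.4.1
ARG PUBLISH_TO_MAVEN=false
RUN apt-get update && apt-get install -y --no-install-recommends git
# Clone our GitHub mirror of the sources and layer the .patch files on top
RUN git clone --depth 1 --branch rel/release-${PRODUCT} https://github.com/stackabletech/hadoop.git /src
COPY hadoop/patches/ /patches/
RUN cd /src && git apply /patches/*.patch \
 && mvn -f /src/pom.xml install -DskipTests
# Publishing is opt-in; a plain local build never pushes to Maven
# (a real deploy would also need the target repository configured, e.g. via settings.xml)
RUN if [ "${PUBLISH_TO_MAVEN}" = "true" ]; then mvn -f /src/pom.xml deploy -DskipTests; fi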

Advantages:

  • You can build everything in one go. If you create a new patch for Hadoop and rebuild Druid, it will automatically rebuild Hadoop first, and Druid will be built with the latest patched version of Hadoop.
  • You can build and test everything locally without publishing artifacts.
  • You can still publish built artifacts if you explicitly want to do this.

This solution only covers the next steps, not everything that is in scope of this issue.
