Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PoC] Add snapshot support to Pulp #6166

Closed
wants to merge 10 commits into from

Conversation

Moustafa-Moustafa
Copy link

This is just a PoC to get some early feedback on the proposed solution further investments. This is related to this proposal .
The change adds a new field "snapshot" to both the Distribution and Publication models. New publication can be made snapshot publications by setting the snapshot field. Snapshot distributions can be created by setting the snapshot field on the Distribution model and will expose all snapshot publications.
This PR also enables the change for pulp_file as an example.

@@ -98,12 +98,13 @@ class Publication(MasterModel):

complete = models.BooleanField(db_index=True, default=False)
pass_through = models.BooleanField(default=False)
snapshot = models.BooleanField(default=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fear this design collides with the idea that one repository can be distributed by multiple distributions.

What if we called this field "precious" or "protected" so it won't get autodeleted by retain-repo-version?

Can you take a look at this feature (which seems to be really pretty much similar):
https://github.com/pulp/pulp_deb/blob/main/pulp_deb/app/models/publication.py#L125

Also a repository version is already a kind of snapshot. As is any publication. I don't think the name "Snapshot Distribution" really catches the time travel aspect well.

Copy link
Contributor

@daviddavis daviddavis Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me see if I understand: if I have a repo with publications that I want distributed by two snapshot distributions (or whatever we call it) then I should have the ability to set different publications per distro? And this design doesn't allow for this?

So in that case I agree a join model is better, and we could look at creating something like DistributedPublication or extending the DistributedPublication design for this use case.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can still have a repository distributed by multiple distributions. Additionally, all the repository's snapshots can be distributed by multiple snapshot distributions, with each snapshot distribution distributing all the repository's snapshot publications. I don't see how this design conflicts with having a repository distributed by multiple distributions. Can you clarify that?

We can agree on a different name for the "snapshot" field if we want to generalize its usage to protect against deletion by retain_repo_versions. However, we would still need a separate field to identify whether a publication is a "snapshot" publication. The plan is to protect publications with snapshot=true, but that's not included in this PoC yet. I can add this next.

I chose the name "snapshot" to align with other repository services like snapshot.debian.org and snapshot.ubuntu.com.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole idea of decoupling repositories from remotes and distributions always seems to confuse everybody. But that's where we are. Sorry about that. (0.99999 of the times probably have one repository with one remote and one distribution.)
I think David captures it quite nicely. And maybe the DistributedPublication is just the right tool here. It would be up to the Distribution to decide what it means when serving content.

Copy link
Author

@Moustafa-Moustafa Moustafa-Moustafa Jan 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, a distribution can be set to distribute a repo version, a publication, a repo, or a remote. So, even though the distribution and the repository are decoupled, we can still set a distribution to serve a specific repo, which would serve the latest repo version of that repo. The snapshot distribution is similar in the sense that it would be related to a specific repo, but instead of serving the latest repo version, it would serve all the snapshot publications.

We can have the following distributions which will serve the exact content (i.e. all the snapshots for that repo):

  • snapshot/rhel/9/prod
  • snapshot/yumrepos/rhel9.0-prod

Similar to having the following distributions which will serve the latest repo version of the repo:

  • rhel/9/prod
  • yumrepos/rhel9.0-prod

I think using a DistributedPublication-like concept is a nice generalization. However, we don't need that for the snapshot scenario, and it might add unnecessary complexity. I also don't see actual use cases for the scenario where we need to serve a specific set of snapshots for different distributions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this over some and I don't have a strong opinion. I agree though that there is a separation in Pulp between repository and distribution but publication seems to be somewhat in a gray area I think. For one thing, if you delete a repo or repo version, the publication also gets deleted which I believe means that publication belongs to the repo.

Also, I tend to agree that I'm not sure that users will want or need to mark snapshots per distribution instead of per repository. So I think that would likely add some complexity both in the code and for users. That said, I think one option might be to go with the current design here and later add a join model if we ever do decide we need it.

I'm not sure how much my opinion matters here though since I don't work as closely with the code. And I would certainly prefer to see us implement a join model over not seeing this feature merged into Pulp if this PR won't be accepted without a join model.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You make a good observation here: Publications belong. I would even go as far as saying, they belong to the repository version (RV). Also, not all content types use publications (it's a mess). Do you think we can decide the "serve this as part of the snapshot" in terms of RV instead of publications? Would we just serve any available RV anyway and this is all about preventing to delete "precious" RV?

On more rather subtle thing from the consumers perspective: What the consumer could see when in the past really is defined by what a distribution was hooked up to when. So if we aim for an accurate history of that, we kind of need to make this a multi-layer distribution.

One more idea: In order to prevent RV from deletion, we could add a NullDistribution doing just that serving nothing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants