Create a sitemap #489

Closed
danielcompton opened this issue Feb 11, 2016 · 12 comments

Comments

@danielcompton
Member

It would be good to create a sitemap for Clojars to enumerate the many many pages on the site.
More details: https://support.google.com/webmasters/answer/156184?hl=en

@renatoalencar
Contributor

renatoalencar commented Nov 14, 2020

This could be a good issue to work on, but I'm not sure which solution would be best. I thought of three possible approaches.

Just querying the database and rendering XML

The simplest solution I could think of, but I guess it could be a viable vector for a DoS attack, since it would query the entire database every time someone hits the server.

Querying and rendering with cache

Caching the results in something like an atom or a file, but I don't know how much memory it would need given the size of the database, or how difficult it would be to handle and maintain internal mutable state.

Generating a sitemap.xml from time to time

Calling it from the command line at regular intervals. It could be a simple cron job, but the result wouldn't be as close to real time as with the other alternatives.

Tell me what you think about the solutions above and how we could go about building this.
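To make the second option concrete, here is a minimal sketch of a time-based cache held in an atom. It is only an illustration of the mutable-state concern raised above, not code from Clojars: `cached-call`, its arguments, and the idea of passing the current time in explicitly are all assumptions made for this example.

```clojure
;; Hypothetical helper, not part of Clojars.
;; `cache` is an atom holding {:value ... :expires-at ...}.
(defn cached-call
  "Returns the cached value while it is still fresh. Otherwise calls
  `compute`, stores the result with a new expiry time, and returns it."
  [cache ttl-ms now-ms compute]
  (let [{:keys [value expires-at]} @cache]
    (if (and expires-at (< now-ms expires-at))
      value
      (let [v (compute)]
        ;; Two threads may both reach this branch and compute twice.
        ;; That is acceptable for a sketch, but it is exactly the kind
        ;; of state handling the comment above worries about.
        (reset! cache {:value v :expires-at (+ now-ms ttl-ms)})
        v))))

;; Usage: the expensive render runs at most once per TTL window.
(def sitemap-cache (atom {}))

(defn sitemap-xml [render-fn]
  (cached-call sitemap-cache
               (* 24 60 60 1000)            ; 24 hours in milliseconds
               (System/currentTimeMillis)
               render-fn))
```

The whole rendered document stays in memory between refreshes, so the memory question raised above still applies.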

@danielcompton
Member Author

The last option of generating a sitemap.xml file seems good to me and matches what we do currently for generating data feeds:

(defn generate-feeds [dest db s3-bucket]
  (let [feed-file (str dest "/feed.clj.gz")]
    (apply put-files
           s3-bucket
           (write-to-file (full-feed db) feed-file :gzip)
           (write-sums feed-file)))
  (let [poms (pom-list s3-bucket)
        pom-file (str dest "/all-poms.txt")
        gz-file (str pom-file ".gz")]
    (apply put-files
           s3-bucket
           (write-to-file poms pom-file nil println)
           (write-to-file poms gz-file :gzip println)
           (concat
            (write-sums pom-file)
            (write-sums gz-file))))

There's no need for this data to be particularly fresh; every 24 hours seems plenty to me.
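For readers following along, a minimal sketch of what generating a sitemap.xml file on a schedule could look like. This is not the code that was eventually merged; `xml-escape`, `url-entry`, `render-sitemap`, and `write-sitemap!` are hypothetical names, and the sketch assumes the caller already has a sequence of absolute page URLs (for example, built from a database query). The only fixed parts are the `urlset` element and its namespace, which come from the sitemaps.org protocol.

```clojure
(require '[clojure.string :as string]
         '[clojure.java.io :as io])

(defn xml-escape
  "Escapes the five characters that are special in XML text."
  [s]
  (string/escape s {\& "&amp;" \< "&lt;" \> "&gt;" \" "&quot;" \' "&apos;"}))

(defn url-entry
  "Renders one <url> element for an absolute URL."
  [loc]
  (str "  <url><loc>" (xml-escape loc) "</loc></url>\n"))

(defn render-sitemap
  "Renders a seq of absolute URLs as a sitemap document string."
  [urls]
  (str "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
       "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n"
       (apply str (map url-entry urls))
       "</urlset>\n"))

(defn write-sitemap!
  "Writes sitemap.xml into the directory `dest` and returns its path."
  [dest urls]
  (let [f (io/file dest "sitemap.xml")]
    (spit f (render-sitemap urls))
    (.getPath f)))
```

A scheduled job could then call something like `(write-sitemap! dest urls)` once a day, next to the existing feed generation. The sitemaps.org protocol caps a single file at 50,000 URLs, so a site with more pages would need to split the output across several files and list them in a sitemap index.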

@renatoalencar
Contributor

Are there any pages that shouldn't go in the sitemap? Or are there any nontrivial details to consider? For example, in #482 you argued about using /:key for groups instead of users, since users also have /user/:key as their profile page. Which would be the actual canonical URL here?

@danielcompton
Member Author

I think user/:key makes more sense here and should be easier to generate.

@renatoalencar
Contributor

Great, I guess I'll just build what I have in mind first and get back to you later.

@opoku
Collaborator

opoku commented Oct 3, 2024

@danielcompton @tobias I've put up a PR to address this issue. Let me know if you have any feedback on it whenever you are able to take a look.

@tobias
Member

tobias commented Oct 3, 2024

Thanks @opoku! I'll take a look once things settle down a bit.

@tobias
Member

tobias commented Oct 17, 2024

Thanks for the PR @opoku! I merged it, but ran into a test failure locally, and noticed a couple of issues:

Sitemaps that are referenced in the sitemap index file must be in the same directory as the sitemap index file, or lower in the site hierarchy. For example, if the sitemap index file is at https://example.com/public/sitemap_index.xml, it can only contain sitemaps that are in the same or deeper directory, like https://example.com/public/shared/....

So that's a bigger issue. We may need to generate/store these in a different way. I'll think about it, and would be happy to chat about it tomorrow if you have a few minutes.
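The directory rule quoted above can be checked mechanically before an index is published. The sketch below is an illustration only, not the code from the PR: `directory-of`, `same-or-deeper?`, and `render-index` are hypothetical names, and the check is a plain string-prefix comparison that assumes both URLs use the same scheme and host spelling and are already XML-safe.

```clojure
(require '[clojure.string :as string])

(defn directory-of
  "Returns the URL up to and including its last slash."
  [url]
  (subs url 0 (inc (string/last-index-of url "/"))))

(defn same-or-deeper?
  "True when `sitemap-url` lives in the directory of `index-url` or below it."
  [index-url sitemap-url]
  (string/starts-with? sitemap-url (directory-of index-url)))

(defn render-index
  "Renders a sitemap index for `sitemap-urls`. Throws if any of them sits
  outside the directory that contains `index-url`."
  [index-url sitemap-urls]
  (doseq [u sitemap-urls]
    (when-not (same-or-deeper? index-url u)
      (throw (ex-info "Sitemap is outside the index directory"
                      {:index index-url :sitemap u}))))
  (str "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
       "<sitemapindex xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n"
       (apply str (map #(str "  <sitemap><loc>" % "</loc></sitemap>\n")
                       sitemap-urls))
       "</sitemapindex>\n"))
```

With the example from the quote, `(same-or-deeper? "https://example.com/public/sitemap_index.xml" "https://example.com/public/shared/sitemap-1.xml")` holds, while a sitemap at `https://example.com/sitemap-1.xml` would be rejected.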

@opoku
Collaborator

opoku commented Oct 18, 2024

Thanks for the feedback. I'll put up something to address the test failure. Happy to chat some more tomorrow.

@tobias
Member

tobias commented Nov 2, 2024

Thanks for the rework @opoku! I merged it and cleaned it up a bit. I then added a location block to our nginx config to serve these. I tested the sitemap with Google, and it complained that our robots.txt file disallowed accessing it, but without any other info. I reworked robots.txt a bit, and we'll give it some time to see if it starts working.

@tobias
Member

tobias commented Nov 3, 2024

Proxying the sitemap instead of redirecting to it fixed the issue. This is now working. Thanks @opoku!

@tobias tobias closed this as completed Nov 3, 2024
@opoku
Collaborator

opoku commented Nov 3, 2024

@tobias woohoo. Thanks for cleaning this up.
