
Add custom delay between requests (prevent ban) #989

Open

tooomm opened this issue Mar 11, 2023 · 15 comments

Labels: bug-bounty, enhancement (New feature or request)

@tooomm (Contributor) commented Mar 11, 2023

lychee v0.10.3

I'm seeing many [429] ... | Failed: Network error: Too Many Requests errors.

The issue is that the links we want to check are actually queries to an API which then return URLs again.
We are doing well over 1000 requests.

From their docs:

Rate Limits and Good Citizenship

We kindly ask that you insert 50 – 100 milliseconds of delay between the requests you send to the server

Submitting excessive requests to the server may result in a HTTP 429 Too Many Requests status code. Continuing to overload the API after this point may result in a temporary or permanent ban of your IP address.

Is there any default delay? I could not find any information on that.

The lychee docs list various mitigations to circumvent rate limits: https://lychee.cli.rs/#/troubleshooting/rate-limits?id=rate-limits
But none of them really helps here to prevent getting banned.

@mre (Member) commented Mar 11, 2023

There is no delay between the requests at the moment.
In fact, we run many requests concurrently; even to the same host.

We could add a --request-delay argument, which in combination with --max-concurrency 1 would resolve your issue, but it would just be a band-aid for the underlying problem, which is the rate limiting itself. You see, delays need to be tweaked by humans to find a sweet spot between throughput and stability. This can't be solved with a global delay.

Instead, my proposal is to add better rate-limiting support per website. I wrote https://github.com/mre/rate-limits a while ago and would like to integrate it into lychee. We would keep track of the current rate-limits for each host in a HashMap (a concurrent HashMap actually) or maybe even create one reqwest client per host; I don't know which one is the better option right now.

E.g. the hash map could look like this:

use rate_limits::ResetTime;
use std::collections::HashMap;
use time::OffsetDateTime;

// Track, for each host, the time at which its rate limit resets.
let mut rate_limits = HashMap::new();
rate_limits.insert(
    "github.com".to_string(),
    ResetTime::DateTime(OffsetDateTime::from_unix_timestamp(1350085394).unwrap()),
);

An entry would be inserted once we get rate-limited.
Before every request, we would check if the current time is after the ResetTime.
If not, we'd wait the difference and finally remove the rate limit entry from the map.
This would scale much better than a global delay and would cover more use cases.
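
To make that concrete, here is a minimal sketch of the check-before-request step. This is not real lychee code; the helper name is made up, and it stores a plain OffsetDateTime instead of the crate's ResetTime to keep the example short:

use std::collections::HashMap;
use std::time::Duration;
use time::OffsetDateTime;

// Sketch only: before sending a request to `host`, wait out any known
// rate limit and then drop the entry. In lychee, this map would have to be
// a concurrent one shared between tasks.
async fn wait_if_rate_limited(rate_limits: &mut HashMap<String, OffsetDateTime>, host: &str) {
    if let Some(&reset) = rate_limits.get(host) {
        let now = OffsetDateTime::now_utc();
        if reset > now {
            let remaining = Duration::from_secs_f64((reset - now).as_seconds_f64());
            tokio::time::sleep(remaining).await;
        }
        rate_limits.remove(host);
    }
}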

What do you think?

@tooomm (Contributor, Author) commented Mar 12, 2023

You are absolutely right that the delay would need to be tweaked to fit all queried hosts... and there is likely no common ground. And of course concurrency would need to be set to 1, which would slow everything down.

Your proposal sounds like a pretty smart solution!
How common is the usage of these headers already? I see that the newest IETF document also covers APIs, but everything is still a draft?

@mre (Member) commented Mar 12, 2023

It is pretty common for APIs, but not for websites, I would guess. Realistically, we might still need both: the rate-limit headers and a way to configure the delay.

Let's start with rate-limit headers, though, because that's a common way for a website to tell us to slow down. Another common way is the infamous 429. We currently retry those requests with an exponential backoff, which is a start. (We could do better by estimating an optimal request interval based on the response time with various request delays, but let's not get ahead of ourselves.)
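
For illustration, this is roughly what retrying with exponential backoff means; a sketch only, not lychee's actual retry code, and the function name and starting delay are made up:

use std::time::Duration;

// Sketch only: retry up to `max_retries` times, doubling the delay after
// every "429 Too Many Requests" response.
async fn get_with_backoff(url: &str, max_retries: u32) -> Result<reqwest::Response, reqwest::Error> {
    let mut delay = Duration::from_millis(500);
    for _ in 0..max_retries {
        let response = reqwest::get(url).await?;
        if response.status() != reqwest::StatusCode::TOO_MANY_REQUESTS {
            return Ok(response);
        }
        tokio::time::sleep(delay).await;
        delay *= 2;
    }
    // Give up after the final delay and return whatever comes back.
    reqwest::get(url).await
}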

mre added the enhancement (New feature or request) label Apr 18, 2023
@kdeldycke commented Jun 12, 2023

Instead, my proposal is to add better rate-limiting support per website.

👆This!

Checking my awesome-falsehood project for dead links reveals some false positives for the news.ycombinator.com and twitter.com domains. Of course these two protect themselves from abuse, and concurrent access by lychee is seen as such.

--max-concurrency 1 solves the issue.

But we forfeit all performance. The ideal solution would be either a --max-concurrency-per-domain 1 or a --delay-per-domain 1.5s option.
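
Neither flag exists today; as a rough sketch of what --max-concurrency-per-domain could mean internally, one semaphore per host would do (all names here are hypothetical):

use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::{Mutex, OwnedSemaphorePermit, Semaphore};

// Hypothetical limiter: at most `per_host` requests in flight per domain,
// while unrelated domains can still be checked in parallel.
#[derive(Clone)]
struct PerHostLimiter {
    per_host: usize,
    semaphores: Arc<Mutex<HashMap<String, Arc<Semaphore>>>>,
}

impl PerHostLimiter {
    fn new(per_host: usize) -> Self {
        Self {
            per_host,
            semaphores: Arc::new(Mutex::new(HashMap::new())),
        }
    }

    // Hold the returned permit for the duration of the request to `host`.
    async fn acquire(&self, host: &str) -> OwnedSemaphorePermit {
        let semaphore = {
            let mut map = self.semaphores.lock().await;
            map.entry(host.to_string())
                .or_insert_with(|| Arc::new(Semaphore::new(self.per_host)))
                .clone()
        };
        semaphore.acquire_owned().await.expect("semaphore closed")
    }
}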

kdeldycke added a commit to kdeldycke/workflows that referenced this issue Jun 12, 2023
Sacrifice performances, to prevent false positives. See:
lycheeverse/lychee#989 (comment)
@mre (Member) commented Jun 12, 2023

Did you manage to check twitter links lately? It's failing on my end, even with our workaround to use nitter instead. Maybe the concurrency is what's killing it for me.

Haven't encountered any issues with HN yet, even though it's probably a matter of not triggering their rate-limiting. Out of curiosity, how many requests to news.ycombinator.com does it take until you encounter any issues?

@kdeldycke

Did you manage to check twitter links lately?

It's more complicated than that: --max-concurrency 1 fixed it from my machine, but it doesn't from my GitHub Actions workflows. So there is hard rate-limiting by Twitter on requests originating from GitHub.

how many requests to news.ycombinator.com does it take until you encounter any issues?

Around 4 requests:

[Screenshot of the resulting errors, taken 2023-06-12]

Source: kdeldycke/awesome-falsehood#159

@lfrancke commented Jan 5, 2024

I'd like a feature like the rate-limit option in muffet, which applies across all requests.

I'm happy to offer a bounty of sorts, 100€ (payable via PayPal or SEPA), for whoever implements this; if multiple people work on it, I'm happy to split the money.

mre added the bug-bounty label Jan 5, 2024
@mre (Member) commented Jan 5, 2024

That's great to hear! To whoever might be interested in tackling that, feel free to post a comment here.
Generally, we can follow the approach of muffet that was linked above.
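
As a starting point, a global limit could be as simple as a shared ticker that every request awaits before being sent. A rough sketch, not an actual implementation:

use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Mutex;
use tokio::time::{interval, Interval, MissedTickBehavior};

// Sketch of a global rate limiter: every request awaits the shared ticker,
// so at most one request is released per `delay`, regardless of the host.
struct GlobalRateLimiter {
    ticker: Arc<Mutex<Interval>>,
}

impl GlobalRateLimiter {
    fn new(delay: Duration) -> Self {
        let mut ticker = interval(delay);
        ticker.set_missed_tick_behavior(MissedTickBehavior::Delay);
        Self {
            ticker: Arc::new(Mutex::new(ticker)),
        }
    }

    async fn wait(&self) {
        self.ticker.lock().await.tick().await;
    }
}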

@nathany commented Mar 14, 2024

Seeing this a lot from GitHub today -- I don't think it was happening the other day. Even with --max-concurrency 32.

e.g.

✗ [429] https://github.com/markdown-it/markdown-it-abbr | Failed: Network error: Too Many Requests

There are about 7000 links in the repository I'm scanning for broken links, and a lot of the links are hitting GitHub, so getting rate limited is no surprise. --max-concurrency 1 is working so far, but it's gonna take a long time.

@mre (Member) commented Mar 14, 2024

Random thoughts:

  • Did your token expire by any chance?
  • Have you changed the pipeline recently?
  • Is the token passed correctly?
  • Do you use the latest action?
  • How did you count the number of GitHub links? With --dump?

I didn't encounter any errors today... yet. At the latest by the end of the day, I should also be affected. Not aware of any changes on GitHub yet. Hope that helps. 😅

@nathany commented Mar 14, 2024

@mre Doh. I was running lychee from my local computer against the markdown files in an open source GitHub project. I didn't see the GitHub Token option in the README/help. I will try that.

That works much better. Sorry for not reading the readme 🤦🏼

I'm still seeing a few rate limits even with the GitHub token (far fewer, but not zero). This is without a max-concurrency option:

[proposals/p3646.md]:
✗ [429] https://github.com/rust-lang/rust/issues/60210 | Failed: Network error: Too Many Requests
✗ [429] https://github.com/intellij-rust/intellij-rust/commit/f82f6cd68567e574bf1e30f5e0d263ee15d1d36e | Failed: Network error: Too Many Requests
✗ [429] https://github.com/rust-lang/rust/pull/71322 | Failed: Network error: Too Many Requests

When it hits the rate limit, does it give up, so I need to manually check those links?

I could just add --accept 429, but that's not entirely ideal.

@mre (Member) commented Mar 14, 2024

Yes, that output is the final status output of lychee; there will be no further checks after that. That means you will have to check those links manually, try again later, reduce the number of concurrent requests, or "shard" the requests (split them into groups). You can do the latter, e.g., by using two lychee runs:

# Throttle GitHub check
lychee --include github --max-concurrency 20 <other settings>

# Don't throttle the rest
lychee --exclude github <other settings>

In the future, I hope that we can fix that once and for all with per-host rate-limiting.

@mre (Member) commented Jan 6, 2025

#1605 should cover this.

@lfrancke commented Jan 8, 2025

Just FYI: I'm still happy to honor my "pledge" above should this be implemented. I should notice when this is closed; if I don't reach out, feel free to ping me!

@mre (Member) commented Jan 9, 2025

That's awesome. I'll add the bug-bounty tag to the other issue as well so we don't forget. Thanks for the support!
