Add custom delay between requests (prevent ban) #989
There is no delay between the requests at the moment. We could add a fixed delay option, but instead my proposal is to add better rate-limiting support per website. I wrote https://github.com/mre/rate-limits a while ago and would like to integrate it into lychee. We would keep track of the current rate limits for each host in a map. E.g. the hash map could look like this:

```rust
use rate_limits::ResetTime;
use std::collections::HashMap;
use time::OffsetDateTime;

let mut rate_limits = HashMap::new();
rate_limits.insert(
    "github.com".to_string(),
    ResetTime::DateTime(OffsetDateTime::from_unix_timestamp(1350085394).unwrap()),
);
```

An entry would be inserted once we get rate-limited. What do you think?
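As a rough illustration of how such a map could be consulted before sending a request (a sketch only, not lychee's implementation: it uses `std::time::SystemTime` in place of the crate's `ResetTime`, and `wait_time` is a hypothetical helper):

```rust
use std::collections::HashMap;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Hypothetical sketch: before checking a URL, look up the host in the
// rate-limit map and compute how long to wait until the reset time.
// Returns `None` if the host has no entry or its reset time has passed.
fn wait_time(
    rate_limits: &HashMap<String, SystemTime>,
    host: &str,
    now: SystemTime,
) -> Option<Duration> {
    let reset = rate_limits.get(host)?;
    reset.duration_since(now).ok()
}

fn main() {
    let mut rate_limits = HashMap::new();
    // Pretend github.com told us to come back 30 seconds from "now".
    let now = UNIX_EPOCH + Duration::from_secs(1_350_085_364);
    rate_limits.insert(
        "github.com".to_string(),
        UNIX_EPOCH + Duration::from_secs(1_350_085_394),
    );

    assert_eq!(
        wait_time(&rate_limits, "github.com", now),
        Some(Duration::from_secs(30))
    );
    // Hosts without an entry are not delayed at all.
    assert_eq!(wait_time(&rate_limits, "example.com", now), None);
}
```

Hosts that never rate-limited us would stay out of the map entirely, so the common fast path is a single hash lookup.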
You are absolutely right that the needed delay would have to be tweaked to fit all queried hosts... and there is likely no common ground. Of course, concurrency would also need to be set to 1, and everything would slow down. Your proposal sounds like a pretty smart solution!
It is pretty common for APIs, but not for websites, I would guess. Realistically we might still need both: the rate-limit headers and a way to configure the delay. Let's start with rate-limit headers, though, because that's a common way for a website to tell us to slow down. Another common way is the infamous `Retry-After` header.
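One widely used slow-down signal is the `Retry-After` response header. As a rough sketch (not lychee's actual code; the real header may also carry an HTTP date, which this minimal version deliberately ignores, and `parse_retry_after_secs` is a hypothetical helper):

```rust
use std::time::Duration;

// Interpret the delay-seconds form of a Retry-After header value,
// e.g. "Retry-After: 120". The HTTP-date form returns `None` here.
fn parse_retry_after_secs(value: &str) -> Option<Duration> {
    value.trim().parse::<u64>().ok().map(Duration::from_secs)
}

fn main() {
    assert_eq!(parse_retry_after_secs("120"), Some(Duration::from_secs(120)));
    // The HTTP-date form would need a date parser; out of scope for this sketch.
    assert_eq!(parse_retry_after_secs("Wed, 21 Oct 2015 07:28:00 GMT"), None);
}
```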
👆 This! Checking my awesome-falsehood project for dead links reveals some false positives.
But then we forfeit performance across the board. The ideal solution would be a way to slow down only the hosts that require it, and keep full speed everywhere else.
Sacrificing performance to prevent false positives. See: lycheeverse/lychee#989 (comment)
Did you manage to check Twitter links lately? It's failing on my end, even with our workaround of using nitter instead. Maybe the concurrency is what's killing it for me. I haven't encountered any issues with HN yet, though that's probably just a matter of not triggering their rate limiting. Out of curiosity, how many requests does it take to trigger it?
It's more complicated than that: around 4 requests.

Source: kdeldycke/awesome-falsehood#159
I'd like a feature like this. I'm happy to offer a bounty of sorts, 100€ (payable via PayPal or SEPA), for whoever implements it. If multiple people work on it, I'm happy to split the money.
That's great to hear! Anyone interested in tackling this, feel free to post a comment here.
Seeing this a lot from GitHub today. I don't think it was happening the other day, and it persists even with e.g. a reduced concurrency setting.
There are about 7000 links in the repository I'm scanning for broken links, and a lot of the links are hitting GitHub, so getting rate limited is no surprise. |
Random thoughts:
I didn't encounter any errors today... yet. By the end of the day I should be affected as well. I'm not aware of any changes on GitHub's side. Hope that helps. 😅
@mre Doh. I was running lychee from my local computer against the markdown files in an open-source GitHub project. I didn't see the GitHub token option in the README/help; I will try that. That works much better. Sorry for not reading the readme 🤦🏼 I'm still seeing a few rate limits even with the GitHub token (far fewer, but it's not zero). This is without a `--max-concurrency` option.

When it hits the rate limit, does it give up, so that I need to manually check those links? I could just add `--accept 429`, but that's not entirely ideal.
Yes, the output is the final status output of lychee. There will be no further checks after that. That means you will have to check manually, try again later, reduce the number of concurrent requests, or "shard" the requests (split them up into groups). You can do the latter with e.g. two lychee runs:

```shell
# Throttle the GitHub check
lychee --include github --max-concurrency 20 <other settings>

# Don't throttle the rest
lychee --exclude github <other settings>
```

In the future, I hope that we can fix this once and for all with per-host rate-limiting.
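A minimal sketch of what such per-host throttling could look like (illustrative only, not lychee's implementation; `HostThrottle` is a hypothetical type): remember when each host was last contacted and enforce a minimum interval between requests to the same host, while leaving other hosts unaffected.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical per-host throttle: hosts are slowed down independently,
// so a rate-limited host doesn't penalize the rest of the run.
struct HostThrottle {
    min_interval: Duration,
    // Time at which the next request to each host is allowed.
    next_slot: HashMap<String, Instant>,
}

impl HostThrottle {
    fn new(min_interval: Duration) -> Self {
        Self { min_interval, next_slot: HashMap::new() }
    }

    /// How long the caller should sleep before hitting `host` at `now`.
    /// Also reserves the next slot for that host.
    fn delay_for(&mut self, host: &str, now: Instant) -> Duration {
        let delay = match self.next_slot.get(host) {
            Some(slot) => self.min_interval.saturating_sub(now.duration_since(*slot)),
            None => Duration::ZERO,
        };
        self.next_slot.insert(host.to_string(), now + delay);
        delay
    }
}

fn main() {
    let mut throttle = HostThrottle::new(Duration::from_secs(1));
    let t0 = Instant::now();
    // First request to a host goes out immediately.
    assert_eq!(throttle.delay_for("github.com", t0), Duration::ZERO);
    // An immediate second request must wait the full interval.
    assert_eq!(throttle.delay_for("github.com", t0), Duration::from_secs(1));
    // A different host is unaffected.
    assert_eq!(throttle.delay_for("example.com", t0), Duration::ZERO);
}
```

The interval could then be seeded from `Retry-After` or rate-limit headers instead of a fixed value.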
#1605 should cover this.
Just FYI that I'm still happy to honor my "pledge" above should this be implemented. I should notice when this is closed; if I don't reach out, feel free to ping me!
That's awesome. I'll add a note about it to the issue.
I'm seeing many

```
[429] ... | Failed: Network error: Too Many Requests
```

errors. The issue is that the links we want to check are actually queries to an API, which then returns URLs again.
We are doing well over 1000 requests.
From their docs:
Is there any default delay? I could not find information on that.
The lychee docs list various mitigations to circumvent rate limits: https://lychee.cli.rs/#/troubleshooting/rate-limits?id=rate-limits
But none of them really helps here to prevent getting banned.