Don't crawl similar URL #21
Comments
Thanks for the tool, but I'd like to ignore similar URLs while crawling, not afterwards.
@j3ssie is there any way to avoid duplicate URLs? On some domains the crawl never ends because it keeps going through duplicate URLs. A switch to filter/skip URLs that return a particular status code is also needed.
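A rough sketch of how such a status-code switch could look in a colly-based crawler (the `skipCodes` map, the `setupSkipFilter` helper and the v2 import are assumptions for illustration, not part of gospider):

```go
package main

import (
	"log"

	"github.com/gocolly/colly/v2"
)

// setupSkipFilter drops responses whose status code the user asked to skip.
// With colly's defaults, non-2xx responses are routed to OnError, so the
// filter can simply return there instead of logging or re-queuing them.
func setupSkipFilter(c *colly.Collector, skipCodes map[int]bool) {
	c.OnError(func(r *colly.Response, err error) {
		if r != nil && skipCodes[r.StatusCode] {
			return // silently skip URLs that return one of these codes
		}
		log.Printf("request to %s failed: %v", r.Request.URL, err)
	})
}
```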
Hi, I am having the same problem. The issue is go-colly itself, since it does not really take care of duplicates. When you scrape multiple domains it is actually pretty common to cause infinite loops. Colly has an optional Redis backend that can take care of this: https://github.com/gocolly/redisstorage

I think the idea of this project was to be portable, so it kind of makes sense not to force a database onto people. You could actually do this in memory as well. That whole Redis queue can be added in crawler.go right below `c := colly.NewCollector(` (just search for it). I'll share my code when I have fully implemented this.

I actually have it running without colly in a much simpler scraper that just uses plain HTTP and regex on the HTML. Here is how I solved the issue in my project (I have two queues, a toScrape and a hasScraped):
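A minimal sketch of the two-queue idea; `crawl`, `fetchAndExtractLinks` and the queue names are illustrative, and `RedisWrapper` is the hypothetical type sketched under "And my Redis Wrapper:" below:

```go
// Hypothetical crawl loop: every discovered URL goes through SADD-based
// deduplication before it is queued, so duplicates can never loop forever.
func crawl(seed string) {
	rw := NewRedisWrapper("localhost:6379")
	_ = rw.PushToScrape(seed)

	for {
		url, err := rw.PopToScrape()
		if err != nil || url == "" {
			break // toScrape queue drained, or Redis error
		}
		// fetchAndExtractLinks is a placeholder for the plain http + regex
		// scraping described above.
		for _, link := range fetchAndExtractLinks(url) {
			_ = rw.PushToScrape(link) // no-op for already-seen URLs
		}
	}
}
```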
And my Redis Wrapper:
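A hedged sketch of what such a wrapper could look like, assuming github.com/go-redis/redis/v8; the type and method names are made up for illustration:

```go
package main

import (
	"context"

	"github.com/go-redis/redis/v8"
)

// RedisWrapper keeps two structures: a "hasScraped" set for deduplication
// and a "toScrape" list used as the work queue.
type RedisWrapper struct {
	ctx context.Context
	rdb *redis.Client
}

func NewRedisWrapper(addr string) *RedisWrapper {
	return &RedisWrapper{
		ctx: context.Background(),
		rdb: redis.NewClient(&redis.Options{Addr: addr}),
	}
}

// PushToScrape enqueues a URL only if SADD actually added it to the
// hasScraped set (SADD returns 0 for members that are already present).
func (r *RedisWrapper) PushToScrape(url string) error {
	added, err := r.rdb.SAdd(r.ctx, "hasScraped", url).Result()
	if err != nil || added == 0 {
		return err
	}
	return r.rdb.LPush(r.ctx, "toScrape", url).Err()
}

// PopToScrape returns the next URL to crawl, or "" when the queue is empty.
func (r *RedisWrapper) PopToScrape() (string, error) {
	url, err := r.rdb.RPop(r.ctx, "toScrape").Result()
	if err == redis.Nil {
		return "", nil
	}
	return url, err
}
```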
Redis SADD is actually not slowing me down as much as I thought; you can read about what I mean here:

EDIT: Colly v2 doesn't support the queue anymore... lol
I actually found a simpler fix. The colly documentation sucks... I was searching through the code for how they implement that in-memory check:
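Presumably this refers to colly's public `HasVisited` helper, which (paraphrased from the colly source) reads roughly:

```go
// HasVisited exposes the same visited-URL bookkeeping the Collector
// uses internally before fetching a request.
func (c *Collector) HasVisited(URL string) (bool, error) {
	return c.checkHasVisited(URL)
}
```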
which then calls
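presumably the storage backend's `IsVisited`, which in the default in-memory storage looks roughly like this (again paraphrased from the colly source):

```go
// IsVisited checks the map of visited request IDs, guarded by a
// read-write mutex.
func (s *InMemoryStorage) IsVisited(requestID uint64) (bool, error) {
	s.lock.RLock()
	visited := s.visitedURLs[requestID]
	s.lock.RUnlock()
	return visited, nil
}
```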
and returns a bool, so in crawler.go you could do something like:
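A hedged example of what such a guard could look like, using only colly's public API; the `OnHTML` handler shown is illustrative, not gospider's actual code, and assumes an existing `c *colly.Collector`:

```go
// Hypothetical guard in the link handler: ask colly whether the absolute
// URL has already been visited before queueing it again.
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
	urlString := e.Request.AbsoluteURL(e.Attr("href"))
	if visited, err := c.HasVisited(urlString); err == nil && visited {
		return // already crawled, skip the duplicate
	}
	_ = e.Request.Visit(urlString)
})
```

Note that, if I'm reading the colly source right, `Visit` already returns `ErrAlreadyVisited` for duplicate URLs by default, so the explicit check mainly matters when you want to log or count the skipped duplicates.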
@j3ssie any plans to add this to avoid duplicate URLs? In my case:
Hi guys, I want to thank you for the great tool, and I have some suggestions. As the picture above shows, there are many similar URLs on one site. Is there some way to ignore them and fetch just one of them?