
Don't crawl similar URL #21

Open · Martin2877 opened this issue Oct 15, 2020 · 7 comments

@Martin2877

[screenshot: crawl output listing many similar URLs from the same site]

Hi, and thank you for the great tool. I have a suggestion: as the screenshot above shows, one site can produce many similar URLs. Is there some way to ignore them and fetch just one of them?
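For example, one way to treat URLs that only differ in IDs or query values as "the same" could be to reduce each URL to a pattern key and only crawl the first URL per key. A rough sketch (patternKey and its rules are just an illustration, nothing gospider actually has):

package main

import (
	"fmt"
	"net/url"
	"regexp"
	"sort"
	"strings"
)

var numSeg = regexp.MustCompile(`^\d+$`)

// patternKey collapses a URL to a coarse "shape": scheme, host, path with
// numeric segments replaced by {n}, and the sorted query parameter names
// (values dropped). URLs that only differ in IDs or parameter values map
// to the same key.
func patternKey(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	segs := strings.Split(u.Path, "/")
	for i, s := range segs {
		if numSeg.MatchString(s) {
			segs[i] = "{n}"
		}
	}
	q := u.Query()
	params := make([]string, 0, len(q))
	for name := range q {
		params = append(params, name)
	}
	sort.Strings(params)
	return u.Scheme + "://" + u.Host + strings.Join(segs, "/") + "?" + strings.Join(params, "&"), nil
}

func main() {
	seen := map[string]bool{}
	for _, raw := range []string{
		"https://example.com/post/1?ref=home",
		"https://example.com/post/2?ref=sidebar",
	} {
		key, err := patternKey(raw)
		if err != nil || seen[key] {
			continue // a URL with this pattern was already crawled
		}
		seen[key] = true
		fmt.Println("crawl:", raw)
	}
}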

@tibug

tibug commented Oct 19, 2020

@Martin2877 (Author)

https://github.com/tomnomnom/unfurl

Thanks for the tool, but I'd like to ignore them while crawling, not afterwards.

@jaikishantulswani

jaikishantulswani commented Oct 21, 2020

@j3ssie any way to avoid duplicate URLs? On some domains the crawl never ends and keeps continuing with duplicate URLs:
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/

It also needs a switch to filter/skip URLs returning a particular status code.
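A hedged sketch of what such a switch could do on top of colly (the skipCodes set and the --skip-code flag it stands in for are hypothetical, not existing gospider options):

package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	// Hypothetical deny-set, e.g. parsed from a --skip-code 404,403 flag.
	skipCodes := map[int]bool{404: true, 403: true}

	c := colly.NewCollector()
	c.OnResponse(func(r *colly.Response) {
		if skipCodes[r.StatusCode] {
			return // drop responses with filtered status codes from the output
		}
		fmt.Printf("[url] - [code-%d] - %s\n", r.StatusCode, r.Request.URL)
	})
	_ = c.Visit("https://example.com/")
}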

@StasonJatham

StasonJatham commented Oct 28, 2021

Hi, I am having the same problem. The issue is go-colly itself, since it does not really take care of duplicates. When you scrape multiple domains it is actually pretty common to end up in infinite loops.

Colly has an optional Redis backend that can take care of this: https://github.com/gocolly/redisstorage
I updated it to support go-redis v8 (gocolly/redisstorage#4 (comment))
...I sadly forgot to add ctx in one call, but if you open it in VS Code it will tell you.

I think the idea of this project was to stay portable, so it makes sense not to force a database onto people. You could actually do this in memory as well, for example with a simple set like the sketch below.
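A minimal in-memory sketch (my own naming, not part of colly or gospider), just a mutex-guarded set used the same way as the Redis set further down:

package main

import (
	"fmt"
	"sync"
)

// seenSet is a minimal, thread-safe in-memory substitute for a Redis set.
type seenSet struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newSeenSet() *seenSet {
	return &seenSet{seen: make(map[string]struct{})}
}

// Add records u and reports whether it was new (i.e. still needs crawling).
func (s *seenSet) Add(u string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[u]; ok {
		return false
	}
	s.seen[u] = struct{}{}
	return true
}

func main() {
	s := newSeenSet()
	for _, u := range []string{"https://example.com/", "https://example.com/"} {
		if s.Add(u) {
			fmt.Println("enqueue:", u) // only the first occurrence gets here
		}
	}
}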

That whole Redis storage setup can be added in crawler.go right below "c := colly.NewCollector(" (just search for it), roughly like the snippet below.
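If I read the redisstorage README correctly, wiring it in looks roughly like this (address, prefix, and error handling are placeholders; assumes "github.com/gocolly/redisstorage" is imported):

c := colly.NewCollector(
	// ... gospider's existing collector options ...
)

// Redis-backed visited/cookie storage shared between runs.
storage := &redisstorage.Storage{
	Address:  "127.0.0.1:6379",
	Password: "",
	DB:       0,
	Prefix:   "gospider_visited",
}
if err := c.SetStorage(storage); err != nil {
	panic(err)
}
defer storage.Client.Close()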

I'll share my code when I have fully implemented this. I actually have it running without colly in a much simpler scraper that just uses net/http and a regex on the HTML.

Here is how I solved the issue in my project (I have two queues, a toScrape and a hasScraped):

// (inside the crawler's main loop)
currentUrl := toScrapeQueue.Pop(nameOfQueue)
if currentUrl == "" {
	c.Status = statusIdle
	continue
}

c.Status = statusPreparing

if !notScrapepable(currentUrl) {
	log("Starting to crawl "+currentUrl, errorNotice)

	req := NewRequest(currentUrl)
	c.Status = statusResponse

	resp, err := req.Do()
	if err != nil {
		logError(err)
		continue
	}
	c.Status = statusParsing

	if resp.IsDiscarded {
		log("Request to "+currentUrl+" discarded", errorNotice)
		continue
	}
	log("Crawled "+currentUrl, errorNotice)

	// Redis: enqueue only URLs we have not scraped yet,
	// then mark the current URL as scraped.
	allUrlsExtracted := extractURLs(string(resp.Body))
	for urlToTest := range allUrlsExtracted {
		if !hasScrapedQueue.IsMember(urlToTest, redisHasScrapedQueue) {
			toScrapeQueue.Push(urlToTest, redisToScrapeQueue)
		}
	}
	hasScrapedQueue.UniqueAdd(currentUrl, redisHasScrapedQueue)
}
// ... rest of the crawl loop
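The extractURLs helper above isn't shown; a rough sketch of a regex-based version (deliberately naive, returning a set so it matches the "for urlToTest := range ..." loop) could be:

// Assumes: import "regexp".
var hrefRe = regexp.MustCompile(`href=["']?([^"'\s>]+)`)

// extractURLs pulls href values out of raw HTML and returns them as a set.
func extractURLs(body string) map[string]struct{} {
	urls := make(map[string]struct{})
	for _, m := range hrefRe.FindAllStringSubmatch(body, -1) {
		urls[m[1]] = struct{}{}
	}
	return urls
}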

And my Redis Wrapper:

// UniqueAdd adds value to the Redis set at key (SADD), so duplicates are ignored.
func (q *Queue) UniqueAdd(value string, key string) {
	newLength, err := q.red.SAdd(ctx, key, value).Result()
	if err != nil {
		log("Could not push item nr. ("+fmt.Sprint(newLength)+") ->"+err.Error(), errorError)
	}
}

// IsMember reports whether value is already in the set at key (SISMEMBER).
func (q *Queue) IsMember(value string, key string) bool {
	isMember, _ := q.red.SIsMember(ctx, key, value).Result()
	return isMember
}

// AllMembers returns every member of the set at key (SMEMBERS).
func (q *Queue) AllMembers(key string) []string {
	allMembers, _ := q.red.SMembers(ctx, key).Result()
	return allMembers
}

// Size returns the length of the list at key (LLEN).
func (q *Queue) Size(key string) int64 {
	queueLen, _ := q.red.LLen(ctx, key).Result()
	return queueLen
}

// SetSize returns the cardinality of the set at key (SCARD).
func (q *Queue) SetSize(key string) int64 {
	queueLen, _ := q.red.SCard(ctx, key).Result()
	return queueLen
}

// Pop removes and returns the first element of the list at key (LPOP).
// On an empty list go-redis returns redis.Nil and an empty string, which the
// crawl loop above uses as its "nothing to do" signal.
func (q *Queue) Pop(key string) string {
	poppedElement, err := q.red.LPop(ctx, key).Result()
	if err != nil {
		log("Could not pop ("+poppedElement+") ->"+err.Error(), errorError)
	}
	return poppedElement
}

// Push prepends value to the list at key (LPUSH).
func (q *Queue) Push(value string, key string) {
	newLength, err := q.red.LPush(ctx, key, value).Result()
	if err != nil {
		log("Could not push item nr. ("+fmt.Sprint(newLength)+") ->"+err.Error(), errorError)
	}
}
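The wrapper assumes a Queue struct and a package-level ctx that aren't shown; with go-redis v8 they could look roughly like this (NewQueue is a hypothetical constructor):

// Assumes: import "context" and "github.com/go-redis/redis/v8".
var ctx = context.Background()

// Queue wraps a go-redis v8 client; the field name red matches how the
// methods above use it.
type Queue struct {
	red *redis.Client
}

func NewQueue(addr string) *Queue {
	return &Queue{red: redis.NewClient(&redis.Options{Addr: addr})}
}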

Redis SADD is actually not slowing me down as much as I thought (it is O(1) per element added); you can read about what I mean here:
https://redis.io/commands/sadd

EDIT: Colly v2 doesn't support the queue anymore... lol

@StasonJatham

StasonJatham commented Oct 28, 2021

I actually found a simpler fix:
you can use HasVisited

The colly documentation sucks... I was searching through the code to find how they implement that in-memory check.

// HasVisited checks if the provided URL has been visited
func (c *Collector) HasVisited(URL string) (bool, error) {
	return c.checkHasVisited(URL, nil)
}

// HasPosted checks if the provided URL and requestData has been visited
// This method is useful more likely to prevent re-visit same URL and POST body
func (c *Collector) HasPosted(URL string, requestData map[string]string) (bool, error) {
	return c.checkHasVisited(URL, requestData)
}

which then calls

func (c *Collector) checkHasVisited(URL string, requestData map[string]string) (bool, error) {
	h := fnv.New64a()
	h.Write([]byte(URL))

	if requestData != nil {
		h.Write(streamToByte(createFormReader(requestData)))
	}

	return c.store.IsVisited(h.Sum64())
}

and returns a bool (plus an error)

so in crawler.go you could do something like

hasVisited, _ := crawler.C.HasVisited(urlString)
if !hasVisited {
	_ = e.Request.Visit(urlString)
}
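In context that check would sit inside the link-handling callback; the OnHTML callback below is a generic colly pattern standing in for gospider's actual one (crawler.C is the collector, as in the snippet above):

c.OnHTML("a[href]", func(e *colly.HTMLElement) {
	urlString := e.Request.AbsoluteURL(e.Attr("href"))
	if urlString == "" {
		return
	}
	visited, err := crawler.C.HasVisited(urlString)
	if err == nil && !visited {
		_ = e.Request.Visit(urlString)
	}
})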

@ocervell

ocervell commented Feb 3, 2023

@j3ssie any plans to add this to avoid duplicate URLs? In my case gospider keeps crawling the same URL and gets stuck/never ends, so I have to kill it manually.

@jaikishantulswani

@j3ssie any plans to add this to avoid duplicate URLs? In my case gospider keeps crawling the same URL and gets stuck/never ends, so I have to kill it manually.
