
Don't crawl similar URL #21

Open · Martin2877 opened this issue Oct 15, 2020 · 7 comments

@Martin2877

[screenshot: crawl output listing many similar URLs from the same site]

Hi, and thank you for the great tool. I have a suggestion: as the screenshot above shows, one site can produce many similar URLs. Is there some way to ignore them and fetch just one of them?
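For example, one way to treat URLs that only differ in IDs or query values as "the same" could be to reduce each URL to a pattern key and only crawl the first URL per key. A rough sketch (patternKey and its rules are just an illustration, nothing gospider actually has):

package main

import (
	"fmt"
	"net/url"
	"regexp"
	"sort"
	"strings"
)

var numSeg = regexp.MustCompile(`^\d+$`)

// patternKey collapses a URL to a coarse "shape": scheme, host, path with
// numeric segments replaced by {n}, and the sorted query parameter names
// (values dropped). URLs that only differ in IDs or parameter values map
// to the same key.
func patternKey(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	segs := strings.Split(u.Path, "/")
	for i, s := range segs {
		if numSeg.MatchString(s) {
			segs[i] = "{n}"
		}
	}
	q := u.Query()
	params := make([]string, 0, len(q))
	for name := range q {
		params = append(params, name)
	}
	sort.Strings(params)
	return u.Scheme + "://" + u.Host + strings.Join(segs, "/") + "?" + strings.Join(params, "&"), nil
}

func main() {
	seen := map[string]bool{}
	for _, raw := range []string{
		"https://example.com/post/1?ref=home",
		"https://example.com/post/2?ref=sidebar",
	} {
		key, err := patternKey(raw)
		if err != nil || seen[key] {
			continue // a URL with this pattern was already crawled
		}
		seen[key] = true
		fmt.Println("crawl:", raw)
	}
}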

@tibug

tibug commented Oct 19, 2020

@Martin2877 (Author)

https://github.com/tomnomnom/unfurl

Thanks for the tool, but I'd like to ignore them while crawling, not afterwards.

@jaikishantulswani

jaikishantulswani commented Oct 21, 2020

@j3ssie any way to avoid duplicate URLs? On some domains the crawl never ends and keeps continuing with duplicate URLs:
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/
[url] - [code-200] - https://example.com/

It also needs a switch to filter/skip URLs returning a particular status code.
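A hedged sketch of what such a switch could do on top of colly (the skipCodes set and the --skip-code flag it stands in for are hypothetical, not existing gospider options):

package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	// Hypothetical deny-set, e.g. parsed from a --skip-code 404,403 flag.
	skipCodes := map[int]bool{404: true, 403: true}

	c := colly.NewCollector()
	c.OnResponse(func(r *colly.Response) {
		if skipCodes[r.StatusCode] {
			return // drop responses with filtered status codes from the output
		}
		fmt.Printf("[url] - [code-%d] - %s\n", r.StatusCode, r.Request.URL)
	})
	_ = c.Visit("https://example.com/")
}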

@StasonJatham

StasonJatham commented Oct 28, 2021

Hi, I am having the same problem. The issue is go-colly itself, since it does not really take care of duplicates. When you scrape multiple domains it is actually pretty common to end up in infinite loops.

Colly has an optional Redis backend that can take care of this: https://github.com/gocolly/redisstorage
I updated it to support go-redis v8 (gocolly/redisstorage#4 (comment))
...I sadly forgot to add ctx in one call, but if you open it in VS Code it will tell you.

I think the idea of this project was to stay portable, so it makes sense not to force a database onto people. You could actually do this in memory as well, for example with a simple set like the sketch below.
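A minimal in-memory sketch (my own naming, not part of colly or gospider), just a mutex-guarded set used the same way as the Redis set further down:

package main

import (
	"fmt"
	"sync"
)

// seenSet is a minimal, thread-safe in-memory substitute for a Redis set.
type seenSet struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newSeenSet() *seenSet {
	return &seenSet{seen: make(map[string]struct{})}
}

// Add records u and reports whether it was new (i.e. still needs crawling).
func (s *seenSet) Add(u string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[u]; ok {
		return false
	}
	s.seen[u] = struct{}{}
	return true
}

func main() {
	s := newSeenSet()
	for _, u := range []string{"https://example.com/", "https://example.com/"} {
		if s.Add(u) {
			fmt.Println("enqueue:", u) // only the first occurrence gets here
		}
	}
}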

That whole Redis storage setup can be added in crawler.go right below "c := colly.NewCollector(" (just search for it), roughly like the snippet below.
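If I read the redisstorage README correctly, wiring it in looks roughly like this (address, prefix, and error handling are placeholders; assumes "github.com/gocolly/redisstorage" is imported):

c := colly.NewCollector(
	// ... gospider's existing collector options ...
)

// Redis-backed visited/cookie storage shared between runs.
storage := &redisstorage.Storage{
	Address:  "127.0.0.1:6379",
	Password: "",
	DB:       0,
	Prefix:   "gospider_visited",
}
if err := c.SetStorage(storage); err != nil {
	panic(err)
}
defer storage.Client.Close()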

I'll share my code when I have fully implemented this. I actually have it running without colly in a much simpler scraper that just uses net/http and a regex on the HTML.

Here is how I solved the issue in my project (I have two queues, a toScrape and a hasScraped):

// (inside the crawler's main loop)
currentUrl := toScrapeQueue.Pop(nameOfQueue)
if currentUrl == "" {
	c.Status = statusIdle
	continue
}

c.Status = statusPreparing

if !notScrapepable(currentUrl) {
	log("Starting to crawl "+currentUrl, errorNotice)

	req := NewRequest(currentUrl)
	c.Status = statusResponse

	resp, err := req.Do()
	if err != nil {
		logError(err)
		continue
	}
	c.Status = statusParsing

	if resp.IsDiscarded {
		log("Request to "+currentUrl+" discarded", errorNotice)
		continue
	}
	log("Crawled "+currentUrl, errorNotice)

	// Redis: enqueue only URLs we have not scraped yet,
	// then mark the current URL as scraped.
	allUrlsExtracted := extractURLs(string(resp.Body))
	for urlToTest := range allUrlsExtracted {
		if !hasScrapedQueue.IsMember(urlToTest, redisHasScrapedQueue) {
			toScrapeQueue.Push(urlToTest, redisToScrapeQueue)
		}
	}
	hasScrapedQueue.UniqueAdd(currentUrl, redisHasScrapedQueue)
}
// ... rest of the crawl loop
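The extractURLs helper above isn't shown; a rough sketch of a regex-based version (deliberately naive, returning a set so it matches the "for urlToTest := range ..." loop) could be:

// Assumes: import "regexp".
var hrefRe = regexp.MustCompile(`href=["']?([^"'\s>]+)`)

// extractURLs pulls href values out of raw HTML and returns them as a set.
func extractURLs(body string) map[string]struct{} {
	urls := make(map[string]struct{})
	for _, m := range hrefRe.FindAllStringSubmatch(body, -1) {
		urls[m[1]] = struct{}{}
	}
	return urls
}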

And my Redis Wrapper:

// UniqueAdd adds value to the Redis set at key (SADD), so duplicates are ignored.
func (q *Queue) UniqueAdd(value string, key string) {
	newLength, err := q.red.SAdd(ctx, key, value).Result()
	if err != nil {
		log("Could not push item nr. ("+fmt.Sprint(newLength)+") ->"+err.Error(), errorError)
	}
}

// IsMember reports whether value is already in the set at key (SISMEMBER).
func (q *Queue) IsMember(value string, key string) bool {
	isMember, _ := q.red.SIsMember(ctx, key, value).Result()
	return isMember
}

// AllMembers returns every member of the set at key (SMEMBERS).
func (q *Queue) AllMembers(key string) []string {
	allMembers, _ := q.red.SMembers(ctx, key).Result()
	return allMembers
}

// Size returns the length of the list at key (LLEN).
func (q *Queue) Size(key string) int64 {
	queueLen, _ := q.red.LLen(ctx, key).Result()
	return queueLen
}

// SetSize returns the cardinality of the set at key (SCARD).
func (q *Queue) SetSize(key string) int64 {
	queueLen, _ := q.red.SCard(ctx, key).Result()
	return queueLen
}

// Pop removes and returns the first element of the list at key (LPOP).
// On an empty list go-redis returns redis.Nil and an empty string, which the
// crawl loop above uses as its "nothing to do" signal.
func (q *Queue) Pop(key string) string {
	poppedElement, err := q.red.LPop(ctx, key).Result()
	if err != nil {
		log("Could not pop ("+poppedElement+") ->"+err.Error(), errorError)
	}
	return poppedElement
}

// Push prepends value to the list at key (LPUSH).
func (q *Queue) Push(value string, key string) {
	newLength, err := q.red.LPush(ctx, key, value).Result()
	if err != nil {
		log("Could not push item nr. ("+fmt.Sprint(newLength)+") ->"+err.Error(), errorError)
	}
}
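The wrapper assumes a Queue struct and a package-level ctx that aren't shown; with go-redis v8 they could look roughly like this (NewQueue is a hypothetical constructor):

// Assumes: import "context" and "github.com/go-redis/redis/v8".
var ctx = context.Background()

// Queue wraps a go-redis v8 client; the field name red matches how the
// methods above use it.
type Queue struct {
	red *redis.Client
}

func NewQueue(addr string) *Queue {
	return &Queue{red: redis.NewClient(&redis.Options{Addr: addr})}
}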

Redis SADD is actually not slowing me down as much as I thought (it is O(1) per element added); you can read about what I mean here:
https://redis.io/commands/sadd

EDIT: Colly v2 doesn't support the queue anymore... lol

@StasonJatham

StasonJatham commented Oct 28, 2021

I actually found a simpler fix:
you can use HasVisited

The colly documentation sucks... I was searching through the code to find how they implement that in-memory check.

// HasVisited checks if the provided URL has been visited
func (c *Collector) HasVisited(URL string) (bool, error) {
	return c.checkHasVisited(URL, nil)
}

// HasPosted checks if the provided URL and requestData has been visited
// This method is useful more likely to prevent re-visit same URL and POST body
func (c *Collector) HasPosted(URL string, requestData map[string]string) (bool, error) {
	return c.checkHasVisited(URL, requestData)
}

which then calls

func (c *Collector) checkHasVisited(URL string, requestData map[string]string) (bool, error) {
	h := fnv.New64a()
	h.Write([]byte(URL))

	if requestData != nil {
		h.Write(streamToByte(createFormReader(requestData)))
	}

	return c.store.IsVisited(h.Sum64())
}

and returns a bool (plus an error)

so in crawler.go you could do something like

hasVisited, _ := crawler.C.HasVisited(urlString)
if !hasVisited {
	_ = e.Request.Visit(urlString)
}
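In context that check would sit inside the link-handling callback; the OnHTML callback below is a generic colly pattern standing in for gospider's actual one (crawler.C is the collector, as in the snippet above):

c.OnHTML("a[href]", func(e *colly.HTMLElement) {
	urlString := e.Request.AbsoluteURL(e.Attr("href"))
	if urlString == "" {
		return
	}
	visited, err := crawler.C.HasVisited(urlString)
	if err == nil && !visited {
		_ = e.Request.Visit(urlString)
	}
})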

@ocervell

ocervell commented Feb 3, 2023

@j3ssie any plans to add this to avoid duplicate URLs? In my case gospider keeps crawling the same URL and gets stuck/never ends, so I have to kill it manually.

@jaikishantulswani

@j3ssie any plans to add this to avoid duplicate URLs? In my case gospider keeps crawling the same URL and gets stuck/never ends, so I have to kill it manually.
