
How to configure with Scrapy CrawlSpider #344

Closed
villeristi opened this issue Aug 15, 2018 · 6 comments

@villeristi
Following the instructions in the official documentation about using Frontera with Scrapy throws an exception with CrawlSpider.

Spider code:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class TestSpider(CrawlSpider):
    name = 'testspider'
    start_urls = ['https://example.com']
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # some code here...
        pass

Exception thrown:

File "/usr/local/lib/python3.6/site-packages/frontera/contrib/scrapy/schedulers/frontier.py", line 112, in process_spider_output
    frontier_request = response.meta[b'frontier_request']
KeyError: b'frontier_request'

So, how would one use Frontera properly with an existing Scrapy project?

Cheers, this definitely looks awesome!

@villeristi
Author

Closing; I read through the docs (which could be organized better).

@mautini

mautini commented Sep 16, 2018

Hi,

I've got the same issue. I looked into the docs but couldn't find an answer. Can you help me, @villeristi? How did you solve the problem, and can you provide a link to the documentation that explains it?

Thanks in advance

sibiryakov reopened this Sep 17, 2018
@pdeboer

pdeboer commented Oct 31, 2018

Hi, same issue here. @mautini, did you happen to figure it out already?

@sibiryakov
Member

sibiryakov commented Nov 1, 2018

The idea is that Scrapy shouldn't be scheduling any links, only parsing and extracting. All the scheduling logic should be implemented in the crawling strategy.

Example:
https://github.com/scrapinghub/frontera/blob/master/examples/cluster/bc/spiders/bc.py
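
For reference, a minimal sketch of such a spider (names are placeholders, modelled loosely on the linked bc.py): no start_urls and no CrawlSpider rules, just parsing and link extraction, so the crawling strategy stays in charge of scheduling:

import scrapy
from scrapy.linkextractors import LinkExtractor


class TestSpider(scrapy.Spider):
    name = 'testspider'  # placeholder name

    def parse(self, response):
        # Extract outgoing links and yield them as plain Requests;
        # Frontera's crawling strategy, not the spider, decides which
        # of them actually get scheduled for fetching.
        for link in LinkExtractor().extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse)

        # ...extract and yield items here...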

@mautini

mautini commented Nov 4, 2018

Hi @pdeboer,

I finally found a solution. As @sibiryakov mentioned, you must not provide links to Scrapy directly. So start by removing start_urls from your spiders.

Next, you must configure a backend so Frontera can send URLs to fetch to Scrapy. For this purpose, in your Frontera settings, change the backend to frontera.contrib.backends.sqlalchemy.Distributed (apparently the tutorial does not work with MemoryDistributedBackend) and set SQLALCHEMYBACKEND_ENGINE = 'sqlite:///<<fileName>>.db' to persist your backend state (and queues...) in a file. Otherwise it will be kept in memory and lost after seeding.
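
For example, the relevant lines of the Frontera settings module would look like this (a minimal sketch; the module path and database file name are placeholders for your project):

# Frontera settings module (path/name is up to your project)
BACKEND = 'frontera.contrib.backends.sqlalchemy.Distributed'

# Persist the backend state (queue, metadata, ...) in an SQLite file
# instead of in memory; replace the file name with your own.
SQLALCHEMYBACKEND_ENGINE = 'sqlite:///frontera.db'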

Now, generate the database using the add_seeds script (step 6 here: https://frontera.readthedocs.io/en/latest/topics/quick-start-single.html?highlight=add%20seeds).
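
For reference, the command in that guide looks roughly like this (the exact module path and flags depend on your Frontera version, so treat it as an approximation and check the linked page):

python -m frontera.utils.add_seeds --config <your_frontera_settings_module> --seeds-file seeds.txt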

You can then start the crawler; it should be working!
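
(For the single-process setup that just means running the usual Scrapy command, e.g. scrapy crawl testspider with your own spider's name.)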

@dkipping

Hi all!

I am getting the same error as @villeristi did initially: KeyError: b'frontier_request' as a spider response processing error.

Quick setup explanation: I mostly followed the distributed quick-start setup and config (Scrapy with Frontera) and am trying to use scrapy-selenium with it.

In line with @sibiryakov's example, the spider is also just yielding requests in the parse function; however, we use the SeleniumRequest from scrapy-selenium.

Requests are yielded in the parse() function and in the start_requests() function.

Are we also meant to avoid yielding requests in start_requests()? Or could the SeleniumRequest be causing it? Or is there something else in the configuration/settings that is crucial here?

More details are in #401 (opened because I did not find this issue before; happy to move or close it).

Thanks for all reactions and input! :)
