
How to configure with Scrapy CrawlSpider #344

Closed
villeristi opened this issue Aug 15, 2018 · 6 comments

@villeristi
Following the instructions in the official documentation about using Frontera with Scrapy throws an exception with CrawlSpider.

Spider code:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class TestSpider(CrawlSpider):
    name = 'testspider'
    start_urls = ['https://example.com']
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # some code here...
        pass

Exception thrown:

File "/usr/local/lib/python3.6/site-packages/frontera/contrib/scrapy/schedulers/frontier.py", line 112, in process_spider_output
    frontier_request = response.meta[b'frontier_request']
KeyError: b'frontier_request'

So, how would one use Frontera properly with an existing Scrapy project?

Cheers, this definitely looks awesome!

@villeristi
Author

Closing; I read through the docs (which could be organized better).

@mautini

mautini commented Sep 16, 2018

Hi,

I've got the same issue. I looked into the docs but couldn't find an answer. Can you help me, @villeristi? How did you solve the problem, and can you provide a link to the documentation that explains it?

Thanks in advance

sibiryakov reopened this Sep 17, 2018
@pdeboer

pdeboer commented Oct 31, 2018

Hi, same issue here. @mautini, did you happen to figure it out already?

@sibiryakov
Member

sibiryakov commented Nov 1, 2018

The idea is that Scrapy shouldn't be scheduling any links, only parsing and extracting. All the scheduling logic should be implemented in the crawling strategy.

Example:
https://github.com/scrapinghub/frontera/blob/master/examples/cluster/bc/spiders/bc.py
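
For reference, a minimal sketch of such a spider (names are placeholders, modelled loosely on the linked bc.py): no start_urls and no CrawlSpider rules, just parsing and link extraction, so the crawling strategy stays in charge of scheduling:

import scrapy
from scrapy.linkextractors import LinkExtractor


class TestSpider(scrapy.Spider):
    name = 'testspider'  # placeholder name

    def parse(self, response):
        # Extract outgoing links and yield them as plain Requests;
        # Frontera's crawling strategy, not the spider, decides which
        # of them actually get scheduled for fetching.
        for link in LinkExtractor().extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse)

        # ...extract and yield items here...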

@mautini

mautini commented Nov 4, 2018

Hi @pdeboer,

I finally found a solution. As @sibiryakov mentioned, you must not provide links to Scrapy directly. So start by removing start_urls from your spiders.

Next, you must configure a backend so Frontera can send URLs to fetch to Scrapy. For this purpose, in your Frontera settings, change the backend to frontera.contrib.backends.sqlalchemy.Distributed (apparently the tutorial does not work with MemoryDistributedBackend) and set SQLALCHEMYBACKEND_ENGINE = 'sqlite:///<<fileName>>.db' to persist your backend state (and queues...) in a file. Otherwise it will be kept in memory and lost after seeding.
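
For example, the relevant lines of the Frontera settings module would look like this (a minimal sketch; the module path and database file name are placeholders for your project):

# Frontera settings module (path/name is up to your project)
BACKEND = 'frontera.contrib.backends.sqlalchemy.Distributed'

# Persist the backend state (queue, metadata, ...) in an SQLite file
# instead of in memory; replace the file name with your own.
SQLALCHEMYBACKEND_ENGINE = 'sqlite:///frontera.db'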

Now, generate the database using the add_seeds script (step 6 here: https://frontera.readthedocs.io/en/latest/topics/quick-start-single.html?highlight=add%20seeds).
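
For reference, the command in that guide looks roughly like this (the exact module path and flags depend on your Frontera version, so treat it as an approximation and check the linked page):

python -m frontera.utils.add_seeds --config <your_frontera_settings_module> --seeds-file seeds.txt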

You can then start the crawler; it should be working!
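
(For the single-process setup that just means running the usual Scrapy command, e.g. scrapy crawl testspider with your own spider's name.)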

@dkipping

Hi all!

I am getting the same error as @villeristi did initially: KeyError: b'frontier_request' as a spider response processing error.

Quick setup explanation: I mostly followed the distributed quick-start setup and config (Scrapy with Frontera) and am trying to use scrapy-selenium with it.

In line with @sibiryakov's example, the spider is also just yielding requests in the parse function; however, we use the SeleniumRequest from scrapy-selenium.

Requests are yielded in the parse() function and in the start_requests() function.

Are we also meant to avoid yielding requests in start_requests()? Or could the SeleniumRequest be causing it? Or is there something else in the configuration/settings that is crucial here?

More details are in #401 (opened because I did not find this issue before; happy to move or close it).

Thanks for all reactions and input! :)
