Extending advertools to crawl dynamic websites #1
Todo:
Capture screenshots for a given URL list using the save_screenshot function.
Current request:
`from scrapy_playwright.page import PageMethod`
`url_list = ['https://www.wikipedia.org', "https://quotes.toscrape.com"]`
`output_dir = "/content/advertools/output"`
`meta = {`
`custom_settings = {`
`adv.save_screenshot(`

`import advertools as adv`
`url_list = ['https://www.wikipedia.org', "https://quotes.toscrape.com"]`
`output_dir = "/content/advertools/output"`
`adv.save_screenshot(`
Feature Proposal: Support for Scraping Dynamic Websites using Playwright
Problem
advertools currently does a great job of scraping static websites. However, it lacks support for dynamic websites that load or modify their content using JavaScript, and that support is needed.
Proposed Solution
To address this issue, I propose integrating Playwright into advertools. Playwright is a browser automation library that supports multiple browsers (Chromium, Firefox, and WebKit) and provides a high-level API to control them in headless or headed mode.
Playwright is also a library preferred by Scrapy itself. Read more here.
Using Playwright would enable advertools to load dynamic content by executing JavaScript, waiting for specific events, or even performing user-like interactions before scraping the page. This would extend the reach of advertools to more complex and modern websites.
Details of the Solution
To do this, a new module, plw_spider.py, is added by cloning the existing spider.py. It therefore has all of the existing crawling functionality plus the Playwright-supported features.
This keeps the new features entirely isolated, so using them is optional. The libraries that plw_spider depends on will be kept out of the main package, and users who want them will install them manually.
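One way to keep these dependencies optional is to check for them only when the Playwright features are requested. The following is a minimal sketch using only the standard library; the helper names `missing_modules` and `require_playwright_support` are hypothetical and are not part of advertools, and the module list assumes the import names `playwright` and `scrapy_playwright`.

```python
import importlib.util

# Import names of the optional packages (assumption: these are the two
# packages plw_spider would need; adjust the tuple if the list differs).
OPTIONAL_MODULES = ("playwright", "scrapy_playwright")


def missing_modules(names=OPTIONAL_MODULES):
    """Return the top-level module names in `names` that are not installed."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def require_playwright_support(names=OPTIONAL_MODULES):
    """Raise ImportError with a readable message if any module is missing.

    Hypothetical helper: it could be called at the top of the function that
    starts a dynamic crawl, so the main package still imports without these
    extras installed.
    """
    missing = missing_modules(names)
    if missing:
        raise ImportError(
            "Dynamic crawling needs extra packages that are not installed: "
            + ", ".join(missing)
        )
```

Calling `require_playwright_support()` before a dynamic crawl would give users a clear message instead of a failure deep inside the crawl.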
Browser Support: Leverage Playwright's ability to control multiple browsers, thereby offering users a choice of scraping engine.
Dynamic Content Handling: Implement functionality to interact with and scrape dynamically loaded content. This might include executing JavaScript, waiting for AJAX requests to complete, handling pop-ups, or clicking buttons.
Refer to the upstream docs for the Page class to see available methods.
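The two points above could be sketched roughly as follows. The setting names (`DOWNLOAD_HANDLERS`, `TWISTED_REACTOR`, `PLAYWRIGHT_BROWSER_TYPE`, `PLAYWRIGHT_LAUNCH_OPTIONS`) and the request meta keys (`playwright`, `playwright_page_methods`) come from scrapy-playwright. The helper functions `playwright_settings` and `build_request_meta`, and how plw_spider would consume their output, are assumptions made for illustration.

```python
SUPPORTED_BROWSERS = ("chromium", "firefox", "webkit")


def playwright_settings(browser_type="chromium", headless=True):
    """Build Scrapy settings that route requests through scrapy-playwright.

    Hypothetical helper: returns a plain dict, so it does not need
    scrapy-playwright to be installed at call time.
    """
    if browser_type not in SUPPORTED_BROWSERS:
        raise ValueError(
            f"browser_type must be one of {SUPPORTED_BROWSERS}, got {browser_type!r}"
        )
    handler = "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler"
    return {
        "DOWNLOAD_HANDLERS": {"http": handler, "https": handler},
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "PLAYWRIGHT_BROWSER_TYPE": browser_type,
        "PLAYWRIGHT_LAUNCH_OPTIONS": {"headless": headless},
    }


def build_request_meta(steps, page_method_factory=None):
    """Turn (method_name, *args) tuples into scrapy-playwright request meta.

    Each step names a method of Playwright's Page class, for example
    ("wait_for_selector", "div.quote"). The factory defaults to
    scrapy_playwright.page.PageMethod and is imported lazily so this module
    still loads when the optional dependency is absent.
    """
    if page_method_factory is None:
        from scrapy_playwright.page import PageMethod  # optional dependency
        page_method_factory = PageMethod
    return {
        "playwright": True,
        "playwright_page_methods": [
            page_method_factory(name, *args) for name, *args in steps
        ],
    }
```

A crawl could then pass `playwright_settings("firefox")` as its custom settings and `build_request_meta([("wait_for_selector", "div.quote")])` as the request meta, assuming plw_spider accepts both.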
See it in action in the Google Colab notebook here
Benefits
By adding support for dynamic website scraping via Playwright, advertools will become a more versatile tool, able to handle a wider range of websites and use cases. This would potentially attract more users to advertools and make it a stronger competitor in the web scraping tool market.
I look forward to further thoughts on this proposal and am ready to commence work on this feature as soon as we get the go-ahead.
Thank you.