
Version 0.9.3: Article parsing improvements and a huge jump in multi-language support (over 40 languages added)

@AndyTheFactory AndyTheFactory released this 18 Mar 00:10

Massive improvements in multi-language capabilities. Added over 40 new languages and completely reworked the language module, so adding new languages is now much easier. Additionally, added support for Google News as a source. You can now search and parse news based on keywords, topic, location or website.
Integrated cloudscraper as an optional dependency. If installed, it is used as a layer over requests. Cloudscraper tries to bypass Cloudflare protection.
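The optional-dependency behaviour described above can be pictured with a small sketch. This is not the library's actual code: `pick_http_backend` and `make_session` are hypothetical helper names used only to illustrate the "use cloudscraper if installed, otherwise plain requests" pattern.

```python
import importlib.util


def pick_http_backend(preferred="cloudscraper", fallback="requests"):
    """Return the name of the first importable backend, or None if neither is installed.

    Hypothetical helper: it only checks whether the module can be found,
    it does not import it.
    """
    for name in (preferred, fallback):
        if importlib.util.find_spec(name) is not None:
            return name
    return None


def make_session():
    """Create an HTTP session, preferring cloudscraper when it is available."""
    backend = pick_http_backend()
    if backend == "cloudscraper":
        import cloudscraper  # optional dependency

        # create_scraper() returns a requests.Session subclass
        return cloudscraper.create_scraper()
    if backend == "requests":
        import requests

        return requests.Session()
    raise ImportError("neither cloudscraper nor requests is installed")
```

Because the cloudscraper session is a subclass of `requests.Session`, code written against a requests session can use either one without changes.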
We now use two evaluation datasets: the one from scrapinghub and one we created from the top 200 most popular websites. This will help keep track of future improvements and give a clear view of the impact of the changes.

We see a steady improvement from version 0.9.0 up to 0.9.3. The evaluation results are available in the documentation. The evaluation dataset is also available in the following repository: Article Extraction Dataset

  • You can now install languages that need special packages as optional dependencies
  • Google News is now fully integrated into the scraping process.
  • You can now pickle sources and articles, which makes it easier to save and resume a scraping run
  • Bumped the minimum supported Python version to 3.8
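
Since sources and articles can now be pickled, a scraping run could be checkpointed with the standard `pickle` module. The sketch below is only an illustration: `save_checkpoint` and `load_checkpoint` are hypothetical helper names, not part of the library, and they work with any picklable object.

```python
import pickle
from pathlib import Path


def save_checkpoint(obj, path):
    """Serialize obj (e.g. a list of parsed articles) to a file at path."""
    Path(path).write_bytes(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def load_checkpoint(path):
    """Load an object previously written by save_checkpoint."""
    return pickle.loads(Path(path).read_bytes())
```

A list of downloaded and parsed articles could be passed to `save_checkpoint` at the end of a session and restored later with `load_checkpoint`. As with any pickle file, only load checkpoints that you created yourself.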