Version 0.9.3: article parsing improvements and a huge jump in multi-language support (over 40 languages added)
Massive improvements in multi-language capabilities. Added over 40 new languages and completely reworked the language module. Much easier to add new languages now. Additionally, added support for Google News as a source. You can now search and parse news based on keywords, topic, location or website.
Integrated cloudscraper as an optional dependency. If installed, it will be used as a layer over requests. Cloudscraper tries to bypass Cloudflare protection.
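The optional-dependency behaviour described above can be sketched as follows. This is an illustration of the general pattern, not the library's actual code: `first_available` and `make_session` are hypothetical helper names, and the only third-party calls used are `cloudscraper.create_scraper()` and `requests.Session()`.

```python
import importlib
import importlib.util


def first_available(names):
    """Import and return the first top-level module in `names` that is installed."""
    for name in names:
        if importlib.util.find_spec(name) is not None:
            return importlib.import_module(name)
    return None


def make_session():
    """Build an HTTP session, preferring cloudscraper when it is installed."""
    backend = first_available(["cloudscraper", "requests"])
    if backend is None:
        raise ImportError("neither cloudscraper nor requests is installed")
    if backend.__name__ == "cloudscraper":
        # create_scraper() returns a session object that subclasses requests.Session
        return backend.create_scraper()
    # Plain requests fallback when cloudscraper is absent
    return backend.Session()
```

Because the cloudscraper session subclasses `requests.Session`, calling code can use `session.get(url)` in both cases without knowing which backend was picked.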
We now use two evaluation datasets: the one from scrapinghub and one we created from the top 200 most popular websites. This will help keep track of future improvements and give a clear view of the impact of changes.
We see a steady improvement from version 0.9.0 up to 0.9.3. The evaluation results are available in the documentation. The evaluation dataset is also available in the following repository: Article Extraction Dataset
- You can now install languages that need special packages as optional dependencies
- Google News is fully integrated into the scraping process.
- You can now pickle sources and articles - easier to save and recover scraping
- Bumped the minimum supported Python version to 3.8
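The pickling support listed above makes it possible to checkpoint a scraping run and pick it up later. A minimal sketch using only the standard library: `save_checkpoint` and `load_checkpoint` are hypothetical helper names, not part of the library, and a plain dict stands in here for a source or article object.

```python
import pickle
from pathlib import Path


def save_checkpoint(obj, path):
    """Serialize any picklable object (e.g. a source or article) to `path`."""
    Path(path).write_bytes(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def load_checkpoint(path):
    """Restore an object previously written by save_checkpoint."""
    return pickle.loads(Path(path).read_bytes())


# Stand-in for a scraped article; with the library installed, the release
# notes indicate that source and article objects can be passed the same way.
state = {"url": "https://example.com/story", "title": "Example"}
save_checkpoint(state, "scrape_state.pkl")
restored = load_checkpoint("scrape_state.pkl")
```

Only load pickle files you created yourself, since unpickling untrusted data can execute arbitrary code.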