documentation update for 0.9.2

AndyTheFactory · Jan 14, 2024 · 327e10f · 327e10f
1 parent 0412f94
commit 327e10f
Showing 4 changed files with 93 additions and 62 deletions.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -0,0 +1,22 @@
+
+## Adding languages
+Interested in adding a new language for us? Refer to: ~~[Docs - Adding new
+languages](https://newspaper4k.readthedocs.io/en/latest/user_guide/advanced.html#adding-new-languages)~~
+
+At the moment we are not integrating new languages because the language API will change.
+You can still submit a PR with the language you want to add, and we will merge it once the language API is stable.
+
+## Submitting a PR
+Interested in submitting a PR? Refer to: [Docs - Submitting a PR](https://newspaper4k.readthedocs.io/en/latest/user_guide/advanced.html#submitting-a-pr)
+
+## Submitting an issue
+Before submitting an issue, please check if it has already been reported. Additionally, please check that:
+- The article website you are having trouble with is not paywalled [Docs - Paywall](https://newspaper4k.readthedocs.io/en/latest/user_guide/known_issues.html#paywall)
+- The article website is not generating the webpage dynamically (e.g. using JavaScript) [Docs - Dynamic content](https://newspaper4k.readthedocs.io/en/latest/user_guide/known_issues.html#dynamic-content)
+- The article website is using a language that is supported by newspaper4k [Docs - Supported languages](https://newspaper4k.readthedocs.io/en/latest/user_guide/languages.html)
+
+Also, in any case, please provide the following information:
+- The URL of the article you are trying to parse
+- The code you are using to parse the article
+- The error you are getting (if any)
+- The parsing result you are getting (if any)
diff --git a/README.md b/README.md
@@ -14,148 +14,155 @@ I have duplicated all issues on the original project and will try to fix them. I
     - Fixes for Python < 3.8 are low priority and might not be merged
 
 # Quick start
+
 ``` bash
 pip install newspaper4k
 ```
 
+## Using the CLI
+
 You can start directly from the command line, using the included CLI:
 ``` bash
 python -m newspaper --url="https://edition.cnn.com/2023/11/17/success/job-seekers-use-ai/index.html" --language=en --output-format=json --output-file=article.json
 
 ```
 
-Or use the Python API:
+## Using the Python API
+
+Alternatively, you can use the Python API:
+
+### Processing one article / url at a time
 
 ``` python
 import newspaper
 
 article = newspaper.article('https://edition.cnn.com/2023/10/29/sport/nfl-week-8-how-to-watch-spt-intl/index.html')
 
 print(article.authors)
-# ['Hannah Brewitt', 'Minute Read', 'Published', 'Am Edt', 'Sun October']
+>> ['Hannah Brewitt', 'Minute Read', 'Published', 'Am Edt', 'Sun October']
 
 print(article.publish_date)
-# 2023-10-29 09:00:15.717000+00:00
+>> 2023-10-29 09:00:15.717000+00:00
 
 print(article.text)
 # New England Patriots head coach Bill Belichick, right, embraces Buffalo Bills head coach Sean McDermott ...
 
 print(article.top_image)
-#https://media.cnn.com/api/v1/images/stellar/prod/231015223702-06-nfl-season-gallery-1015.jpg?c=16x9&q=w_800,c_fill
+>> https://media.cnn.com/api/v1/images/stellar/prod/231015223702-06-nfl-season-gallery-1015.jpg?c=16x9&q=w_800,c_fill
 
 print(article.movies)
-# []
+>> []
 
 article.nlp()
 print(article.keywords)
-# ['broncos', 'game', 'et', 'wide', 'chiefs', 'mahomes', 'patrick', 'denver', 'nfl', 'stadium', 'week', 'quarterback', 'win', 'history', 'images']
+>> ['broncos', 'game', 'et', 'wide', 'chiefs', 'mahomes', 'patrick', 'denver', 'nfl', 'stadium', 'week', 'quarterback', 'win', 'history', 'images']
 
 print(article.summary)
-# Kevin Sabitus/Getty Images Denver Broncos running back Javonte Williams evades Green Bay Packers safety Darnell Savage, bottom.
-# Kathryn Riley/Getty Images Kansas City Chiefs quarterback Patrick Mahomes calls a play during the Chiefs' 19-8 Thursday Night Football win over the Denver Broncos on October 12.
-# Paul Sancya/AP New York Jets running back Breece Hall carries the ball during a game against the Denver Broncos.
-# The Broncos have not beaten the Chiefs since 2015, and have never beaten Chiefs quarterback Patrick Mahomes.
-# Australia: NFL+, ESPN, 7Plus Brazil: NFL+, ESPN Canada: NFL+, CTV, TSN, RDS Germany: NFL+, ProSieben MAXX, DAZN Mexico: NFL+, TUDN, ESPN, Fox Sports, Sky Sports UK: NFL+, Sky Sports, ITV, Channel 5 US: NFL+, CBS, NBC, FOX, ESPN, Amazon Prime
+>> Kevin Sabitus/Getty Images Denver Broncos running back Javonte Williams evades Green Bay Packers safety Darnell Savage, bottom.
+>> Kathryn Riley/Getty Images Kansas City Chiefs quarterback Patrick Mahomes calls a play during the Chiefs' 19-8 Thursday Night Football win over the Denver Broncos on October 12.
+>> Paul Sancya/AP New York Jets running back Breece Hall carries the ball during a game against the Denver Broncos.
+>> The Broncos have not beaten the Chiefs since 2015, and have never beaten Chiefs quarterback Patrick Mahomes.
+>> Australia: NFL+, ESPN, 7Plus Brazil: NFL+, ESPN Canada: NFL+, CTV, TSN, RDS Germany: NFL+, ProSieben MAXX, DAZN Mexico: NFL+, TUDN, ESPN, Fox Sports, Sky Sports UK: NFL+, Sky Sports, ITV, Channel 5 US: NFL+, CBS, NBC, FOX, ESPN, Amazon Prime
 
 ```
-## Using the builder API
 
-This way you can build a Source object from a newspaper websites. This object will allow you to get all the articles and categories on the website. When you build the source, articles are not yet downloaded. You need to call `download_articles()` to download the articles, but note that it can take a significant time.
+## Parsing and scraping whole News Sources (websites) using the Source Class
+
+This way you can build a Source object from a newspaper website. This class lets you get all the articles and categories on the website. When you build the source, articles are not yet downloaded. The `build()` call parses the front page, detects category links (if possible), gets any RSS feeds published by the news site, and creates a list of article links.
+You need to call `download_articles()` to download the articles, but note that this can take a significant amount of time.
+
+`download_articles()` downloads the articles in a multithreaded fashion using `ThreadPoolExecutor` from the `concurrent.futures` package. The number of concurrent threads can be configured in `Configuration.number_threads` or passed as an argument to `build()`.
 
-``` python
 
 ``` python
 import newspaper
 
-cnn_paper = newspaper.build('http://cnn.com')
+cnn_paper = newspaper.build('http://cnn.com', number_threads=3)
 print(cnn_paper.category_urls())
-# ['https://cnn.com', 'https://money.cnn.com', 'https://arabic.cnn.com', 'https://cnnespanol.cnn.com', 'http://edition.cnn.com', 'https://edition.cnn.com', 'https://us.cnn.com', 'https://www.cnn.com']
+> ['https://cnn.com', 'https://money.cnn.com', 'https://arabic.cnn.com',
+> 'https://cnnespanol.cnn.com', 'http://edition.cnn.com',
+> 'https://edition.cnn.com', 'https://us.cnn.com', 'https://www.cnn.com']
 
 article_urls = [article.url for article in cnn_paper.articles]
 print(article_urls[:3])
-# ['https://arabic.cnn.com/middle-east/article/2023/10/30/number-of-hostages-held-in-gaza-now-up-to-239-idf-spokesperson', 'https://arabic.cnn.com/middle-east/video/2023/10/30/v146619-sotu-sullivan-hostage-negotiations', 'https://arabic.cnn.com/middle-east/article/2023/10/29/norwegian-pm-israel-gaza']
+> ['https://arabic.cnn.com/middle-east/article/2023/10/30/number-of-hostages-held-in-gaza-now-up-to-239-idf-spokesperson',
+> 'https://arabic.cnn.com/middle-east/video/2023/10/30/v146619-sotu-sullivan-hostage-negotiations',
+> 'https://arabic.cnn.com/middle-east/article/2023/10/29/norwegian-pm-israel-gaza']
 
 article = cnn_paper.articles[0]
 article.download()
 article.parse()
 
 print(article.title)
-# المتحدث باسم الجيش الإسرائيلي: عدد الرهائن المحتجزين في غزة يصل إلى
+> المتحدث باسم الجيش الإسرائيلي: عدد الرهائن المحتجزين في غزة يصل إلى
+
+```
+Or, if you want to get articles in bulk from the website (keep in mind that this could take a long time and could get your IP blocked by the news site):
+
+``` python
+import newspaper
+
+cnn_source = newspaper.build('http://cnn.com', number_threads=3)
+
+print(len(cnn_source.article_urls()))
+
+articles = cnn_source.download_articles()
 
-``` pycon
-from newspaper import fulltext
+print(len(articles))
 
-html = requests.get(...).text
-text = fulltext(html)
+print(articles[0].title)
 ```
-## Languages
 
-Newspaper can extract and detect languages *seamlessly*. If no language
-is specified, Newspaper will attempt to auto detect a language from the available meta data. The fallback language is English.
+## Multilanguage features
 
-``` python
+Newspaper can extract and detect languages *seamlessly* based on the article meta tags. Additionally, you can specify the language for the website / article. If no language is specified, Newspaper will attempt to auto-detect a language from the available metadata. The fallback language is English.
+
+Language detection is crucial for accurate article extraction. If the wrong language is detected or provided, chances are that no article text will be returned. Before parsing, check that your language is supported by our package.
 
-``` pycon
+``` python
 from newspaper import Article
 
 article = Article('https://www.bbc.com/zhongwen/simp/chinese-news-67084358')
 article.download()
 article.parse()
 
 print(article.title)
-# 晶片大战：台湾厂商助攻华为突破美国封锁？
+> 晶片大战：台湾厂商助攻华为突破美国封锁？
+
+if article.config.use_meta_language:
+  # If we use the autodetected language, this config attribute will be true
+  print(article.meta_lang)
+else:
+  print(article.config.language)
 
+> zh
 ```
 
 # Docs
 
 Check out [The Docs](https://newspaper4k.readthedocs.io) for full and
 detailed guides using newspaper.
 
-# Contributing
-
-## Adding languages
-Interested in adding a new language for us? Refer to: ~~[Docs - Adding new
-languages](https://newspaper4k.readthedocs.io/en/latest/user_guide/advanced.html#adding-new-languages)~~
-
-At the moment we are not integrating new languages, the language api will change.
-You can still submit a PR with the language you want to add and we will merge it once the language api is stable.
-
-## Submitting a PR
-Interested in submitting a PR? Refer to: [Docs - Submitting a PR](https://newspaper4k.readthedocs.io/en/latest/user_guide/advanced.html#submitting-a-pr)
-
-## Submitting an issue
-Before submitting an issue, please check if it has already been reported. Additionally, please check that:
-- The article website you have troubles with is not paywalled [Docs - Paywall](https://newspaper4k.readthedocs.io/en/latest/user_guide/known_issues.html#paywall)
-- The article website is not generating the webpage dynamically (e.g. using JavaScript) [Docs - Dynamic content](https://newspaper4k.readthedocs.io/en/latest/user_guide/known_issues.html#dynamic-content)
-- The article website is not using a language that is not supported by newspaper4k [Docs - Supported languages](https://newspaper4k.readthedocs.io/en/latest/user_guide/languages.html)
-
-Also, in any case, please provide the following information:
-- The URL of the article you are trying to parse
-- The code you are using to parse the article
-- The error you are getting (if any)
-- The parsing result you are getting (if any)
-
-
 # Features
 
 -   Multi-threaded article download framework
+-   Newspaper category detection
 -   News url identification
 -   Text extraction from html
 -   Top image extraction from html
 -   All image extraction from html
--   Keyword extraction from text
--   Summary extraction from text
+-   Keyword building from the extracted text
+-   Automatic article text summarization
 -   Author extraction from text
--   Google trending terms extraction
+-   Easy-to-use Command Line Interface (`python -m newspaper....`)
 -   Works in 10+ languages (English, Chinese, German, Arabic, \...)
 
 # Requirements and dependencies
 
 Following system packages are required:
 
--   PIL: `libjpeg-dev` `zlib1g-dev` `libpng12-dev`
--   lxml: `libxml2-dev` `libxslt-dev`
+-   **Pillow**: `libjpeg-dev` `zlib1g-dev` `libpng12-dev`
+-   **Lxml**: `libxml2-dev` `libxslt-dev`
 -   Python Development version: `python-dev`
 
 
@@ -184,9 +191,6 @@ NOTE: If you find problem installing `libpng12-dev`, try installing
 
         $ pip3 install newspaper4k
 
--   Download NLP (nltk) related corpora:
-
-        $ curl https://raw.githubusercontent.com/AndyTheFactory/newspaper4k/master/download_corpora.py | python3
 
 **If you are on OSX**, install using the following, you may use both
 homebrew or macports:
@@ -197,8 +201,10 @@ homebrew or macports:
 
     $ pip3 install newspaper4k
 
-    $ curl https://raw.githubusercontent.com/AndyTheFactory/newspaper4k/master/download_corpora.py | python3
 
+# Contributing
+
+See [CONTRIBUTING.md](CONTRIBUTING.md)
 
 # LICENSE
 

diff --git a/docs/requirements.in b/docs/requirements.in
@@ -1,3 +1,4 @@
 Sphinx>=5,<6
 sphinx_rtd_theme
 python-docs-theme
+sphinx-argparse
diff --git a/newspaper/api.py b/newspaper/api.py
@@ -29,6 +29,8 @@ def build(url="", dry=False, config=None, **kwargs) -> Source:
 def build_article(url="", config=None, **kwargs) -> Article:
     """Returns a constructed article object without downloading
     or parsing
+    .. deprecated:: 0.9.2
+                use :any:`Article` or :any:`newspaper.article` instead
     """
     config = config or Configuration()
     config.update(**kwargs)
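
The `.. deprecated::` directive in the hunk above only marks `build_article` as deprecated in the rendered documentation. Nothing in this commit shows a warning being raised at call time. As a minimal sketch of how a runtime warning could accompany such a docstring, the snippet below uses the standard library `warnings` module. The function `build_article_sketch` and its return value are hypothetical stand-ins for illustration and are not part of newspaper4k.

```python
import warnings


def build_article_sketch(url=""):
    """Hypothetical stand-in for a deprecated helper (not newspaper4k code).

    .. deprecated:: 0.9.2
        construct the object directly instead
    """
    # stacklevel=2 makes the warning point at the caller's line,
    # not at this function body.
    warnings.warn(
        "build_article_sketch is deprecated; construct the object directly",
        DeprecationWarning,
        stacklevel=2,
    )
    # Assumption: a plain dict stands in for the real return object.
    return {"url": url}


# Capture warnings so the deprecation can be inspected programmatically.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = build_article_sketch("https://example.com/story")

print(result["url"])
print(caught[0].category.__name__)
```

Python hides `DeprecationWarning` by default in many contexts, which is why the sketch switches the filter to `"always"` before calling the function.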