documentation update for 0.9.2
AndyTheFactory committed Jan 14, 2024
1 parent 0412f94 commit 327e10f
Showing 4 changed files with 93 additions and 62 deletions.
22 changes: 22 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,22 @@

## Adding languages
Interested in adding a new language for us? Refer to: ~~[Docs - Adding new
languages](https://newspaper4k.readthedocs.io/en/latest/user_guide/advanced.html#adding-new-languages)~~

At the moment we are not integrating new languages because the language API will change.
You can still submit a PR with the language you want to add, and we will merge it once the language API is stable.

## Submitting a PR
Interested in submitting a PR? Refer to: [Docs - Submitting a PR](https://newspaper4k.readthedocs.io/en/latest/user_guide/advanced.html#submitting-a-pr)

## Submitting an issue
Before submitting an issue, please check if it has already been reported. Additionally, please check that:
- The article website you are having trouble with is not paywalled [Docs - Paywall](https://newspaper4k.readthedocs.io/en/latest/user_guide/known_issues.html#paywall)
- The article website is not generating the webpage dynamically (e.g. using JavaScript) [Docs - Dynamic content](https://newspaper4k.readthedocs.io/en/latest/user_guide/known_issues.html#dynamic-content)
- The article website uses a language that is supported by newspaper4k [Docs - Supported languages](https://newspaper4k.readthedocs.io/en/latest/user_guide/languages.html)

Also, in any case, please provide the following information:
- The URL of the article you are trying to parse
- The code you are using to parse the article
- The error you are getting (if any)
- The parsing result you are getting (if any)
130 changes: 68 additions & 62 deletions README.md
@@ -14,148 +14,155 @@ I have duplicated all issues on the original project and will try to fix them. I
- Fixes for Python < 3.8 are low priority and might not be merged

# Quick start

``` bash
pip install newspaper4k
```

## Using the CLI

You can start directly from the command line, using the included CLI:
``` bash
python -m newspaper --url="https://edition.cnn.com/2023/11/17/success/job-seekers-use-ai/index.html" --language=en --output-format=json --output-file=article.json

```

Or use the Python API:
## Using the Python API

Alternatively, you can use the Python API:

### Processing one article / url at a time

``` python
import newspaper

article = newspaper.article('https://edition.cnn.com/2023/10/29/sport/nfl-week-8-how-to-watch-spt-intl/index.html')

print(article.authors)
# ['Hannah Brewitt', 'Minute Read', 'Published', 'Am Edt', 'Sun October']
>> ['Hannah Brewitt', 'Minute Read', 'Published', 'Am Edt', 'Sun October']

print(article.publish_date)
# 2023-10-29 09:00:15.717000+00:00
>> 2023-10-29 09:00:15.717000+00:00

print(article.text)
# New England Patriots head coach Bill Belichick, right, embraces Buffalo Bills head coach Sean McDermott ...

print(article.top_image)
#https://media.cnn.com/api/v1/images/stellar/prod/231015223702-06-nfl-season-gallery-1015.jpg?c=16x9&q=w_800,c_fill
>> https://media.cnn.com/api/v1/images/stellar/prod/231015223702-06-nfl-season-gallery-1015.jpg?c=16x9&q=w_800,c_fill

print(article.movies)
# []
>> []

article.nlp()
print(article.keywords)
# ['broncos', 'game', 'et', 'wide', 'chiefs', 'mahomes', 'patrick', 'denver', 'nfl', 'stadium', 'week', 'quarterback', 'win', 'history', 'images']
>> ['broncos', 'game', 'et', 'wide', 'chiefs', 'mahomes', 'patrick', 'denver', 'nfl', 'stadium', 'week', 'quarterback', 'win', 'history', 'images']

print(article.summary)
# Kevin Sabitus/Getty Images Denver Broncos running back Javonte Williams evades Green Bay Packers safety Darnell Savage, bottom.
# Kathryn Riley/Getty Images Kansas City Chiefs quarterback Patrick Mahomes calls a play during the Chiefs' 19-8 Thursday Night Football win over the Denver Broncos on October 12.
# Paul Sancya/AP New York Jets running back Breece Hall carries the ball during a game against the Denver Broncos.
# The Broncos have not beaten the Chiefs since 2015, and have never beaten Chiefs quarterback Patrick Mahomes.
# Australia: NFL+, ESPN, 7Plus Brazil: NFL+, ESPN Canada: NFL+, CTV, TSN, RDS Germany: NFL+, ProSieben MAXX, DAZN Mexico: NFL+, TUDN, ESPN, Fox Sports, Sky Sports UK: NFL+, Sky Sports, ITV, Channel 5 US: NFL+, CBS, NBC, FOX, ESPN, Amazon Prime
>> Kevin Sabitus/Getty Images Denver Broncos running back Javonte Williams evades Green Bay Packers safety Darnell Savage, bottom.
>> Kathryn Riley/Getty Images Kansas City Chiefs quarterback Patrick Mahomes calls a play during the Chiefs' 19-8 Thursday Night Football win over the Denver Broncos on October 12.
>> Paul Sancya/AP New York Jets running back Breece Hall carries the ball during a game against the Denver Broncos.
>> The Broncos have not beaten the Chiefs since 2015, and have never beaten Chiefs quarterback Patrick Mahomes.
>> Australia: NFL+, ESPN, 7Plus Brazil: NFL+, ESPN Canada: NFL+, CTV, TSN, RDS Germany: NFL+, ProSieben MAXX, DAZN Mexico: NFL+, TUDN, ESPN, Fox Sports, Sky Sports UK: NFL+, Sky Sports, ITV, Channel 5 US: NFL+, CBS, NBC, FOX, ESPN, Amazon Prime

```
## Using the builder API

This way you can build a Source object from a newspaper website. This object will allow you to get all the articles and categories on the website. When you build the source, articles are not yet downloaded. You need to call `download_articles()` to download the articles, but note that this can take a significant amount of time.
## Parsing and scraping whole News Sources (websites) using the Source Class

This way you can build a Source object from a newspaper website. This class lets you get all the articles and categories on the website. When you build the source, articles are not yet downloaded. The `build()` call parses the front page, detects category links (if possible), gets any RSS feeds published by the news site, and creates a list of article links.
You need to call `download_articles()` to download the articles, but note that this can take a significant amount of time.

`download_articles()` downloads the articles in a multithreaded fashion using `ThreadPoolExecutor` from the `concurrent.futures` module. The number of concurrent threads can be configured in `Configuration.number_threads` or passed as an argument to `build()`.
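
To illustrate the multithreaded download pattern described above, here is a minimal sketch built only on the standard library. It is not the actual newspaper4k implementation: `fetch` and `download_all` are hypothetical names, and `fetch` is a stand-in that performs no network request.

``` python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def fetch(url: str) -> str:
    # Hypothetical stand-in for a real download; it performs no network I/O.
    return f"<html>{url}</html>"


def download_all(urls: List[str], number_threads: int = 3) -> List[str]:
    # At most `number_threads` calls to fetch() run at the same time.
    # executor.map yields results in the same order as the input URLs.
    with ThreadPoolExecutor(max_workers=number_threads) as executor:
        return list(executor.map(fetch, urls))


pages = download_all(["https://example.com/a", "https://example.com/b"])
print(len(pages))
```

A small thread count keeps the load on the news site low, which is presumably why the number of threads is configurable.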

``` python

``` python
import newspaper

cnn_paper = newspaper.build('http://cnn.com')
cnn_paper = newspaper.build('http://cnn.com', number_threads=3)
print(cnn_paper.category_urls())
# ['https://cnn.com', 'https://money.cnn.com', 'https://arabic.cnn.com', 'https://cnnespanol.cnn.com', 'http://edition.cnn.com', 'https://edition.cnn.com', 'https://us.cnn.com', 'https://www.cnn.com']
> ['https://cnn.com', 'https://money.cnn.com', 'https://arabic.cnn.com',
> 'https://cnnespanol.cnn.com', 'http://edition.cnn.com',
> 'https://edition.cnn.com', 'https://us.cnn.com', 'https://www.cnn.com']

article_urls = [article.url for article in cnn_paper.articles]
print(article_urls[:3])
# ['https://arabic.cnn.com/middle-east/article/2023/10/30/number-of-hostages-held-in-gaza-now-up-to-239-idf-spokesperson', 'https://arabic.cnn.com/middle-east/video/2023/10/30/v146619-sotu-sullivan-hostage-negotiations', 'https://arabic.cnn.com/middle-east/article/2023/10/29/norwegian-pm-israel-gaza']
> ['https://arabic.cnn.com/middle-east/article/2023/10/30/number-of-hostages-held-in-gaza-now-up-to-239-idf-spokesperson',
> 'https://arabic.cnn.com/middle-east/video/2023/10/30/v146619-sotu-sullivan-hostage-negotiations',
> 'https://arabic.cnn.com/middle-east/article/2023/10/29/norwegian-pm-israel-gaza']

article = cnn_paper.articles[0]
article.download()
article.parse()

print(article.title)
# المتحدث باسم الجيش الإسرائيلي: عدد الرهائن المحتجزين في غزة يصل إلى
> المتحدث باسم الجيش الإسرائيلي: عدد الرهائن المحتجزين في غزة يصل إلى

```
Or, if you want to get articles in bulk from the website (keep in mind that this could take a long time and could get your IP blocked by the news site):

``` python
import newspaper

cnn_source = newspaper.build('http://cnn.com', number_threads=3)

print(len(cnn_source.article_urls()))

articles = cnn_source.download_articles()

``` pycon
from newspaper import fulltext
print(len(articles))

html = requests.get(...).text
text = fulltext(html)
print(articles[0].title)
```
## Languages

Newspaper can extract and detect languages *seamlessly*. If no language
is specified, Newspaper will attempt to auto detect a language from the available meta data. The fallback language is English.
## Multilanguage features

``` python
Newspaper can extract and detect languages *seamlessly* based on the article meta tags. Additionally, you can specify the language for the website / article. If no language is specified, Newspaper will attempt to auto-detect a language from the available metadata. The fallback language is English.

Language detection is crucial for accurate article extraction. If the wrong language is detected or provided, chances are that no article text will be returned. Before parsing, check that your language is supported by our package.
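
The order of precedence described above (explicitly specified language, then the language found in the metadata, then English) can be sketched as follows. `resolve_language` is a hypothetical helper written for illustration only; it is not part of the newspaper4k API.

``` python
from typing import Optional


def resolve_language(
    explicit: Optional[str], meta_lang: Optional[str], fallback: str = "en"
) -> str:
    # Hypothetical helper, not a newspaper4k function.
    # 1. A language specified by the user takes precedence.
    if explicit:
        return explicit
    # 2. Otherwise use the language detected from the article metadata.
    if meta_lang:
        return meta_lang
    # 3. If nothing is available, fall back to English.
    return fallback


print(resolve_language(None, "zh"))
print(resolve_language(None, None))
```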

``` pycon
``` python
from newspaper import Article

article = Article('https://www.bbc.com/zhongwen/simp/chinese-news-67084358')
article.download()
article.parse()

print(article.title)
# 晶片大战:台湾厂商助攻华为突破美国封锁?
> 晶片大战:台湾厂商助攻华为突破美国封锁?

if article.config.use_meta_language:
    # If we use the autodetected language, this config attribute will be true
    print(article.meta_lang)
else:
    print(article.config.language)

> zh
```

# Docs

Check out [The Docs](https://newspaper4k.readthedocs.io) for full and
detailed guides using newspaper.

# Contributing

## Adding languages
Interested in adding a new language for us? Refer to: ~~[Docs - Adding new
languages](https://newspaper4k.readthedocs.io/en/latest/user_guide/advanced.html#adding-new-languages)~~

At the moment we are not integrating new languages, the language api will change.
You can still submit a PR with the language you want to add and we will merge it once the language api is stable.

## Submitting a PR
Interested in submitting a PR? Refer to: [Docs - Submitting a PR](https://newspaper4k.readthedocs.io/en/latest/user_guide/advanced.html#submitting-a-pr)

## Submitting an issue
Before submitting an issue, please check if it has already been reported. Additionally, please check that:
- The article website you have troubles with is not paywalled [Docs - Paywall](https://newspaper4k.readthedocs.io/en/latest/user_guide/known_issues.html#paywall)
- The article website is not generating the webpage dynamically (e.g. using JavaScript) [Docs - Dynamic content](https://newspaper4k.readthedocs.io/en/latest/user_guide/known_issues.html#dynamic-content)
- The article website is not using a language that is not supported by newspaper4k [Docs - Supported languages](https://newspaper4k.readthedocs.io/en/latest/user_guide/languages.html)

Also, in any case, please provide the following information:
- The URL of the article you are trying to parse
- The code you are using to parse the article
- The error you are getting (if any)
- The parsing result you are getting (if any)


# Features

- Multi-threaded article download framework
- Newspaper category detection
- News url identification
- Text extraction from html
- Top image extraction from html
- All image extraction from html
- Keyword extraction from text
- Summary extraction from text
- Keyword building from the extracted text
- Automatic article text summarization
- Author extraction from text
- Google trending terms extraction
- Easy to use Command Line Interface (`python -m newspaper....`)
- Works in 10+ languages (English, Chinese, German, Arabic, \...)

# Requirements and dependencies

The following system packages are required:

- PIL: `libjpeg-dev` `zlib1g-dev` `libpng12-dev`
- lxml: `libxml2-dev` `libxslt-dev`
- **Pillow**: `libjpeg-dev` `zlib1g-dev` `libpng12-dev`
- **Lxml**: `libxml2-dev` `libxslt-dev`
- Python Development version: `python-dev`


@@ -184,9 +191,6 @@ NOTE: If you find problem installing `libpng12-dev`, try installing

$ pip3 install newspaper4k

- Download NLP (nltk) related corpora:

$ curl https://raw.githubusercontent.com/AndyTheFactory/newspaper4k/master/download_corpora.py | python3

**If you are on OSX**, install using the following. You may use either
Homebrew or MacPorts:
@@ -197,8 +201,10 @@ homebrew or macports:

$ pip3 install newspaper4k

$ curl https://raw.githubusercontent.com/AndyTheFactory/newspaper4k/master/download_corpora.py | python3

# Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md)

# LICENSE

1 change: 1 addition & 0 deletions docs/requirements.in
@@ -1,3 +1,4 @@
Sphinx>=5,<6
sphinx_rtd_theme
python-docs-theme
sphinx-argparse
2 changes: 2 additions & 0 deletions newspaper/api.py
@@ -29,6 +29,8 @@ def build(url="", dry=False, config=None, **kwargs) -> Source:
def build_article(url="", config=None, **kwargs) -> Article:
    """Returns a constructed article object without downloading
    or parsing
    .. deprecated:: 0.9.2
       use :any:`Article` or :any:`newspaper.article` instead
    """
    config = config or Configuration()
    config.update(**kwargs)
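
The change above only marks `build_article` as deprecated in its docstring. As a sketch of how such a deprecation could also be signalled at runtime, the standard `warnings` module can be used. This is an assumption for illustration, not something this commit does: `build_article_sketch` is a hypothetical stand-in that simply returns the URL it receives.

``` python
import warnings


def build_article_sketch(url: str = "") -> str:
    # Hypothetical stand-in for a deprecated function; not newspaper4k code.
    warnings.warn(
        "build_article is deprecated since 0.9.2; "
        "use Article or newspaper.article instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    return url


# Record the warning instead of printing it, so it can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = build_article_sketch("https://example.com/story")

print(caught[0].category.__name__)
```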