
always get 0 article #5

Open

Caesium-133 opened this issue May 9, 2022 · 5 comments

Comments

@Caesium-133

```
canrevan --category 100 --start_date 20220501 --end_date 20220507 --max_page 5
[*] navigation pages: 35
[*] collect article urls: 100%|█████████████████| 35/35 [00:00<00:00, 37.76it/s]
[*] total collected articles: 700
[*] crawl news article contents: 100%|████████| 700/700 [00:12<00:00, 56.38it/s]
[*] finish crawling 0 news articles to [articles.txt]
```

It always ends like this.

Is this a problem with my network, or with the website?

@Derek-tjhwang

I have the same issue. Has it been resolved?

@jonyejin
Contributor

As of January 31, 2023, the same issue still occurs.

@jonyejin
Contributor

This problem was caused by Naver News changing its HTML structure in a UI overhaul. In canrevan/parsing.py, changing the strainer to `strainer = SoupStrainer("div", attrs={"id": "dic_area"})` makes collection work again! I'll open a PR. @affjljoo3581
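For context, a minimal sketch of where that line sits (the `parse_article_content` signature is taken from a later comment in this thread; everything except the `SoupStrainer` line is assumed unchanged):

```python
from bs4 import BeautifulSoup, SoupStrainer

def parse_article_content(document: str, include_reporter_name: bool) -> str:
    # After the UI overhaul, the article body lives in <div id="dic_area">,
    # so restrict parsing to that element instead of the old container.
    strainer = SoupStrainer("div", attrs={"id": "dic_area"})
    document = BeautifulSoup(document, "lxml", parse_only=strainer)
    # ... (rest of the function is unchanged)
```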

@arangeblue

Has this issue been fixed yet? I still get `finish crawling 0 news articles to [articles.txt]`.

@pastebean

2024.03.21

Modifying the `parse_article_content` function in parsing.py as shown below makes it work again:

```python
from bs4 import BeautifulSoup, SoupStrainer

def parse_article_content(document: str, include_reporter_name: bool) -> str:
    # Naver News now serves the article body in <article id="dic_area">.
    strainer = SoupStrainer("article", attrs={"id": "dic_area"})
    document = BeautifulSoup(document, "lxml", parse_only=strainer)
    content = document.find("article")
    # ... (rest of the function is unchanged)
```
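As a quick sanity check (the HTML string below is a hypothetical stand-in for a Naver News page, not taken from the site), the new strainer does match the `<article id="dic_area">` markup:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical minimal document mimicking the current Naver News markup.
html = '<html><body><article id="dic_area">sample article body</article></body></html>'

strainer = SoupStrainer("article", attrs={"id": "dic_area"})
soup = BeautifulSoup(html, "lxml", parse_only=strainer)
print(soup.find("article").get_text())  # prints: sample article body
```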
