
always get 0 article #5

Open

Caesium-133 opened this issue May 9, 2022 · 5 comments

Comments

@Caesium-133

```
canrevan --category 100 --start_date 20220501 --end_date 20220507 --max_page 5
[*] navigation pages: 35
[*] collect article urls: 100%|█████████████████| 35/35 [00:00<00:00, 37.76it/s]
[*] total collected articles: 700
[*] crawl news article contents: 100%|████████| 700/700 [00:12<00:00, 56.38it/s]
[*] finish crawling 0 news articles to [articles.txt]
```

It always ends like this.

Is this a problem with my network, or with the website?

@Derek-tjhwang

I have the same issue. Has it been resolved?

@jonyejin
Contributor

As of January 31, 2023, the same issue still occurs.

@jonyejin
Contributor

This problem was caused by Naver News changing its HTML structure in a UI overhaul. In canrevan/parsing.py, changing the strainer to `strainer = SoupStrainer("div", attrs={"id": "dic_area"})` makes collection work again! I'll open a PR. @affjljoo3581
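For context, a minimal sketch of where that line sits (the `parse_article_content` signature is taken from a later comment in this thread; everything except the `SoupStrainer` line is assumed unchanged):

```python
from bs4 import BeautifulSoup, SoupStrainer

def parse_article_content(document: str, include_reporter_name: bool) -> str:
    # After the UI overhaul, the article body lives in <div id="dic_area">,
    # so restrict parsing to that element instead of the old container.
    strainer = SoupStrainer("div", attrs={"id": "dic_area"})
    document = BeautifulSoup(document, "lxml", parse_only=strainer)
    # ... (rest of the function is unchanged)
```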

@arangeblue

Has this issue been fixed yet? I still get `finish crawling 0 news articles to [articles.txt]`.

@pastebean

2024.03.21

Modifying the `parse_article_content` function in parsing.py as shown below makes it work again:

```python
from bs4 import BeautifulSoup, SoupStrainer

def parse_article_content(document: str, include_reporter_name: bool) -> str:
    # Naver News now serves the article body in <article id="dic_area">.
    strainer = SoupStrainer("article", attrs={"id": "dic_area"})
    document = BeautifulSoup(document, "lxml", parse_only=strainer)
    content = document.find("article")
    # ... (rest of the function is unchanged)
```
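As a quick sanity check (the HTML string below is a hypothetical stand-in for a Naver News page, not taken from the site), the new strainer does match the `<article id="dic_area">` markup:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical minimal document mimicking the current Naver News markup.
html = '<html><body><article id="dic_area">sample article body</article></body></html>'

strainer = SoupStrainer("article", attrs={"id": "dic_area"})
soup = BeautifulSoup(html, "lxml", parse_only=strainer)
print(soup.find("article").get_text())  # prints: sample article body
```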
