rdrview does not extract titles #39

Open
eliobtl opened this issue Jun 25, 2024 · 2 comments

eliobtl commented Jun 25, 2024

Hi! Thanks for rdrview.

I found that, on some websites, it does not extract titles.
An example:
This article looks normal in Firefox reader view:
screenshot-24-06-25-18-52-21

but with rdrview, there are no titles, only paragraphs:
screenshot-24-06-25-18-53-02

On other websites, it sometimes displays subtitles normally but not the main title.

I use rdrview built from the latest commit with gcc on Alpine Linux x86_64.

If you have any idea why this happens, I would be happy to know.

Owner

eafer commented Jul 6, 2024

What goes wrong here is that the page you linked uses h1 tags for the section titles. rdrview expects h1 to be used only for the main title, so those headings get removed. It seems that Firefox used to have this issue too, but it got fixed a few years ago: mozilla/readability@11093f011f57fa528a0. So I need to port that patch to rdrview, but it's not trivial because it uses a Unicode regex.
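
To make the idea concrete, here is a minimal sketch of the kind of check such a fix could use, assuming the goal is to drop an h1 or h2 only when its text closely matches the article title instead of dropping every h1. This is not the code from the linked commit or from rdrview: the function names, the token-overlap measure, and the 0.75 threshold are all illustrative assumptions. Python is used only because its `re` module treats `\W` as Unicode-aware for `str` patterns, which is the part that is harder to get in C.

```python
import re

# Split on runs of non-word characters. For str patterns in Python 3,
# \W is Unicode-aware, so accented letters stay inside tokens.
TOKEN_SPLIT = re.compile(r"\W+")


def tokens(text):
    """Lowercase the text and return its non-empty word tokens."""
    return [t for t in TOKEN_SPLIT.split(text.lower()) if t]


def text_similarity(a, b):
    """Return 1.0 when every token of b also appears in a, 0.0 when none do.

    Hypothetical helper: the measure is the share of b's characters that
    belong to tokens also found in a.
    """
    tokens_a = tokens(a)
    tokens_b = tokens(b)
    if not tokens_a or not tokens_b:
        return 0.0
    unique_b = [t for t in tokens_b if t not in tokens_a]
    distance = len(" ".join(unique_b)) / len(" ".join(tokens_b))
    return 1.0 - distance


def heading_duplicates_title(tag, heading_text, article_title, threshold=0.75):
    """Decide whether a heading merely repeats the article title.

    Only h1 and h2 are candidates; the threshold is an assumed value.
    """
    if tag.lower() not in ("h1", "h2"):
        return False
    return text_similarity(article_title, heading_text) > threshold
```

With a check like this, an h1 that reads "Installation" in an article titled "My project guide" would be kept as a section heading, while an h1 that repeats the title would still be removed.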

Author

eliobtl commented Jul 6, 2024

Ok, thanks for the explanation, I'll wait.
