ignore undateable domains more intentionally #34
Comments
Hi @rahulbot, it would be OK but I'd prefer to get the chance to tackle the problem first.
@coreydockser can you please provide an example of a Wikipedia page that does return a publication date, and one that does not?
Sorry for the delay, I ran into some odd issues of my own making. Anyway, here's a sample of four articles with different results:
- https://en.wikipedia.org/wiki/Among_Us returns None (this is the behavior we want).
- https://en.wikipedia.org/wiki/January_1969 returns 2018-06-19; this date appears as datePublished in the HTML.
- https://en.wikipedia.org/wiki/F-scale_(personality_test) returns 2005-07-05. The datePublished on this page is 2005-07-25, though, so I'm unsure where it came from.
- https://en.wikipedia.org/wiki/2021_United_States_Capitol_attack returns 2021-01-06. This is the date of the event, but it's also the datePublished.
@coreydockser Thanks, I'll look at it and see if I can find a solution.
Hi @coreydockser, I checked the cases and I don't agree with you at all:
So I fail to grasp where the problem lies. Could you please be more specific and/or provide further examples for other websites?
The library version issue could explain some of those specific results. However, the second piece is more a question of your intentions. In our projects, "publication date" means the date a news article was listed as being published online. That is rooted in ideas from the historical news industry (despite edits and iterations of online stories becoming more commonplace). Wikipedia articles are meant to be living documents, so for us they don't have a "publication date" in that sense. This is important for our time-series-based analysis of news attention.

So I guess one way to state the question is this: for this library, do you intend "publication date" to have a technology-informed definition, such as the date of last edit? Or do you want a more "news-ish" definition like the one we use? It sounds like it is more the former, in which case there are no "undateable" domains. If that is what you intend, then we can close this issue as won't-fix and handle the idea of "undateable" domains, based on our project's definition, in our own code before we pass content into htmldate.

Thanks for any clarifications and your great work on this library!
Thanks for the explanations, I get your point. Indeed, I guess it would be possible to focus on a "news-ish" understanding of publication date by setting an additional parameter prior to the extraction. What would be the formal requirements for it to happen? I'm leaving this thread open to see if we can address the issue.
In our testing, the current code produces unreliable results on Wikipedia articles. Sometimes it returns a date, sometimes it doesn't. Wikipedia articles are constantly updated, so @coreydockser and I would like to propose changing it so that it returns no date if the URL is a wikipedia.org one. In our broader experience with Media Cloud, this produces more useful results (for our open web news analysis context).
In terms of implementation, we could just copy the `filter_url_for_undateable` function from `date_guesser` and use that as is to include the other checks it does for undateable domains. We'd call it early on in `guess_date`.
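To make the proposal concrete, here is a minimal sketch of what such an early check could look like. It is not the `date_guesser` code and not htmldate's API: `UNDATEABLE_DOMAINS`, `is_undateable`, and `guess_date_or_none` are hypothetical names chosen for illustration, and the real extractor is passed in as a plain callable so the sketch makes no assumption about its signature.

```python
from typing import Callable, Optional
from urllib.parse import urlparse

# Hypothetical list for illustration; the date_guesser check covers more cases.
UNDATEABLE_DOMAINS = ("wikipedia.org",)


def is_undateable(url: str) -> bool:
    """Return True if the URL's host is a listed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in UNDATEABLE_DOMAINS)


def guess_date_or_none(
    url: str,
    html: str,
    extractor: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Skip extraction for undateable URLs; otherwise delegate to the extractor."""
    if is_undateable(url):
        return None
    return extractor(html)
```

Matching on the parsed hostname rather than on a substring of the URL keeps unrelated hosts that merely contain "wikipedia.org" in their name from being filtered out. The same check would work equally well on the caller's side, before any content is handed to the library.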