Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[twitter] Shorten title attribute for #29912 #31151

Closed
wants to merge 10 commits into from

Conversation

palewire
Copy link

@palewire palewire commented Aug 9, 2022

Please follow the guide below

  • You will be asked some questions, please read them carefully and answer honestly
  • Put an x into all the boxes [ ] relevant to your pull request (like that [x])
  • Use Preview tab to see how your pull request will actually look like

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

This request trims the standard Twitter ID field to workaround the bug reported in #29912

@palewire palewire changed the title Shorten Twitter title attribute for #29912 [twitter] Shorten title attribute for #29912 Aug 9, 2022
@palewire
Copy link
Author

palewire commented Aug 9, 2022

@dirkf, I believe I have achieved your request with a little less code. The uploader name is pretty strictly limited by Twitter, so I don't think we need to slice it. That means we can use a more straightforward slice like the one just pushed. Give it a look and let me know what you think.

Copy link
Contributor

@dirkf dirkf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM, with comments as below. I'll enable the CI tests so that you can check the results.

@palewire
Copy link
Author

@dirkf, those changes are in.

@palewire palewire requested a review from dirkf August 10, 2022 12:49
@palewire
Copy link
Author

The flake errors in the test run appear to be in other files.

@dirkf
Copy link
Contributor

dirkf commented Aug 10, 2022

Apparently the default flake8 settings that the project uses changed in some way so that a keyword must be separated from the next token by a space, which is basically good but irritating. Otherwise GH somehow removed some spaces from the code, which seems unlikely and would be much worse.

If you update from the current master the 2 places where this caused problems are fixed.

@palewire
Copy link
Author

I think I've done that.

@pukkandan
Copy link
Contributor

pukkandan commented Aug 11, 2022

imo, this should not be implemented.

  • Would all extractors be expected to do this?
  • title != filename. We shouldn't extract wrong metadata just because it is convenient for filenames. Output template should be used to control filename length instead. This is a poorly implemented bandaid.

As I see, changing the default outtmpl to something like %(title).200s-%(id)s.%(ext)s would be much better than this solution.

Or, if the ... is important, move this code to core

@dirkf
Copy link
Contributor

dirkf commented Aug 11, 2022

This extractor is unusual in that there is no title in the plain webpage, although a title and an og:title are set by JS. The full tweet text is extracted in the description. So:

  • Would all extractors be expected to do this?

No. But Twitter was the source of some of the original issues complaining about filename length errors with the default output template, so this is an opportunity to avoid those in future, without expecting users to check for a FAQ that tells them to use %(title).100s or whatever, and without changing the core code.

Instead of fixing the title, we could generate a warning if the title is "long", or possibly include the FAQ text in the error message if we could reliably detect an exception caused by an over-long filename (a generic "solution" in place of #29989).

Changing the default output template could be an alternative but I would be concerned that setting a sufficiently low value to avoid filename length issues would affect filenames on filesystems with no effective limit, attracting a different set of complaints. The extractor could manipulate the output template by inserting a length field into any %(title)s but that would weirdly affect all URLs in the download session.

  • title != filename. We shouldn't extract wrong metadata just because it is convenient for filenames. ...

I believe that the default output template was chosen on the assumption that title would be a short text field, for some definition of "short". Implicitly, yt-dl tries to cast all sites in the mould of YouTube, where the title is (a) required (b) not longer than 100 characters, although SEO fiends recommend no more than 70.

Apparently the page title for a tweet set by Twitter has the form (in terms of the extracted fields) {og_title}: "{clean_html(description)}", where og_title, the og:title meta field, is {uploader_id} on Twitter. We emulate that but remove on Twitter, presumably because, where else would it be, given the context? A profile page has {uploader_id} (@{uploader}) / Twitter, but we don't have an extractor for those 1.

Nitter tries to improve Twitter by making a tweet title thus {profile.fullname} (@{profile.username}): "{stripHtml(tweet.text)}", combining the two methods used by Twitter. Perhaps we should also do that, as {uploader_id} (@{uploader}): "{clean_html(description)}" (and stripping unwanted Twitter links from the description, as now).

In the browser title bar I see the Twitter title presented with ... replacing as much of the middle of the text as is needed to make the text fit the title bar. Something like that still seems appropriate.

Output template should be used to control filename length instead. This is a poorly implemented bandaid.

After all, yt-dl itself is just a bandaid in some sense. What would make for a better implementation? Surely even a harsh critic would call the current implementation at least "adequate"?

Footnotes

  1. We do have an extractor for twitter:card but it redirects to an invalid URL and all the test URLs but the single only_matching URL give a "nothing to see here" page; even that gets 403 for the API call to list video details, as does the only similar URL known to G.

@pukkandan
Copy link
Contributor

What would make for a better implementation? Surely even a harsh critic would call the current implementation at least "adequate"?

I gave one suggestion above. As you pointed out, that has it's own issues too. (I still think this is a better solution though). If a proper solution was easy, I doubt this issue would have existed for so long

From a user perspective, I do agree this is better than nothing. But I also think that it'll become a headache to maintain any kind of consistency b/w extractors over time.

And I disagree that the twitter is special in this regard. There are plenty of other extractors that face this issue. This search is a quick way to get some examples

That said, since twitter is so popular, I can understand why you'd consider this technical debt to be acceptable.

And for the record, while an FAQ entry might help some users, I don't think it will reduce the number of issues opened. The kind of people who actually read the FAQ are the same people who'd look for duplicate issues.

@palewire
Copy link
Author

palewire commented Aug 11, 2022

If you really want to get existential about it, I think the core problem is requiring a title for all sites. As detailed above, there's not a tidy title for a tweet analogous to a YouTube title. So there will never be tidy answer to this problem.

That said, making the title field optional seems like a can of worms that isn't worth opening either. Some kind of compromise is necessary and, as documented in the issue that prompted this pull request, the current solution is bad for users. So, IMHO, some alternative is needed. Something along the lines of what's proposed here does seem like the simplest solution. If universality is what you're after, I think enforcing a global trim on all filenames seems like the most natural extension.

@dirkf
Copy link
Contributor

dirkf commented Aug 11, 2022

The rationale for title is presumably what I said above, trying to make all sites fit the YT metadata schema. yt-dlp has allowed a None title while continuing to require the title field without the sky falling.

... If a proper solution was easy, I doubt this issue would have existed for so long

Amen to that. If we can get something like #29989 to work, that would fix the problem for all sites. Meanwhile:

... since twitter is so popular, I can understand why you'd consider this technical debt to be acceptable.

Exactly. The Twitter title field is up for grabs: its extraction shouldn't routinely crash the program. I guess that extractor is used more than all the other possibly problematic extractors together, which are also a small proportion of the ~800 (crikey) we have.

And

...while an FAQ entry might help some users, I don't think it will reduce the number of issues opened

Absolutely, why I mentioned putting some FAQ text in error message appears, if only we knew reliably what that would be.

@palewire
Copy link
Author

palewire commented Dec 1, 2024

I'm withdrawing this request.

@palewire palewire closed this Dec 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants