[twitter] Shorten title attribute for #29912 #31151

palewire · 2022-08-09T02:44:28Z

Please follow the guide below

You will be asked some questions, please read them carefully and answer honestly
Put an x into all the boxes [ ] relevant to your pull request (like that [x])
Use Preview tab to see how your pull request will actually look like

Before submitting a pull request make sure you have:

Searched the bugtracker for similar pull requests
Read adding new extractor tutorial
Read youtube-dl coding conventions and adjusted the code to meet them
Covered the code with tests (note that PRs without tests will be REJECTED)
Checked the code with flake8

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Bug fix
Improvement
New extractor
New feature

This request trims the standard Twitter ID field to workaround the bug reported in #29912

youtube_dl/extractor/twitter.py

palewire · 2022-08-09T19:23:19Z

@dirkf, I believe I have achieved your request with a little less code. The uploader name is pretty strictly limited by Twitter, so I don't think we need to slice it. That means we can use a more straightforward slice like the one just pushed. Give it a look and let me know what you think.

dirkf

Thanks, LGTM, with comments as below. I'll enable the CI tests so that you can check the results.

youtube_dl/extractor/twitter.py

Co-authored-by: dirkf <[email protected]>

palewire · 2022-08-10T12:29:09Z

@dirkf, those changes are in.

palewire · 2022-08-10T14:33:28Z

The flake errors in the test run appear to be in other files.

dirkf · 2022-08-10T14:46:15Z

Apparently the default flake8 settings that the project uses changed in some way so that a keyword must be separated from the next token by a space, which is basically good but irritating. Otherwise GH somehow removed some spaces from the code, which seems unlikely and would be much worse.

If you update from the current master the 2 places where this caused problems are fixed.

palewire · 2022-08-10T20:57:11Z

I think I've done that.

pukkandan · 2022-08-11T02:45:05Z

imo, this should not be implemented.

Would all extractors be expected to do this?
title != filename. We shouldn't extract wrong metadata just because it is convenient for filenames. Output template should be used to control filename length instead. This is a poorly implemented bandaid.

As I see, changing the default outtmpl to something like %(title).200s-%(id)s.%(ext)s would be much better than this solution.

Or, if the ... is important, move this code to core

dirkf · 2022-08-11T06:46:28Z

This extractor is unusual in that there is no title in the plain webpage, although a title and an og:title are set by JS. The full tweet text is extracted in the description. So:

Would all extractors be expected to do this?

No. But Twitter was the source of some of the original issues complaining about filename length errors with the default output template, so this is an opportunity to avoid those in future, without expecting users to check for a FAQ that tells them to use %(title).100s or whatever, and without changing the core code.

Instead of fixing the title, we could generate a warning if the title is "long", or possibly include the FAQ text in the error message if we could reliably detect an exception caused by an over-long filename (a generic "solution" in place of #29989).

Changing the default output template could be an alternative but I would be concerned that setting a sufficiently low value to avoid filename length issues would affect filenames on filesystems with no effective limit, attracting a different set of complaints. The extractor could manipulate the output template by inserting a length field into any %(title)s but that would weirdly affect all URLs in the download session.

title != filename. We shouldn't extract wrong metadata just because it is convenient for filenames. ...

I believe that the default output template was chosen on the assumption that title would be a short text field, for some definition of "short". Implicitly, yt-dl tries to cast all sites in the mould of YouTube, where the title is (a) required (b) not longer than 100 characters, although SEO fiends recommend no more than 70.

Apparently the page title for a tweet set by Twitter has the form (in terms of the extracted fields) {og_title}: "{clean_html(description)}", where og_title, the og:title meta field, is {uploader_id} on Twitter. We emulate that but remove on Twitter, presumably because, where else would it be, given the context? A profile page has {uploader_id} (@{uploader}) / Twitter, but we don't have an extractor for those ¹.

Nitter tries to improve Twitter by making a tweet title thus {profile.fullname} (@{profile.username}): "{stripHtml(tweet.text)}", combining the two methods used by Twitter. Perhaps we should also do that, as {uploader_id} (@{uploader}): "{clean_html(description)}" (and stripping unwanted Twitter links from the description, as now).

In the browser title bar I see the Twitter title presented with ... replacing as much of the middle of the text as is needed to make the text fit the title bar. Something like that still seems appropriate.

Output template should be used to control filename length instead. This is a poorly implemented bandaid.

After all, yt-dl itself is just a bandaid in some sense. What would make for a better implementation? Surely even a harsh critic would call the current implementation at least "adequate"?

We do have an extractor for twitter:card but it redirects to an invalid URL and all the test URLs but the single only_matching URL give a "nothing to see here" page; even that gets 403 for the API call to list video details, as does the only similar URL known to G. ↩

pukkandan · 2022-08-11T11:42:59Z

What would make for a better implementation? Surely even a harsh critic would call the current implementation at least "adequate"?

I gave one suggestion above. As you pointed out, that has it's own issues too. (I still think this is a better solution though). If a proper solution was easy, I doubt this issue would have existed for so long

From a user perspective, I do agree this is better than nothing. But I also think that it'll become a headache to maintain any kind of consistency b/w extractors over time.

And I disagree that the twitter is special in this regard. There are plenty of other extractors that face this issue. This search is a quick way to get some examples

That said, since twitter is so popular, I can understand why you'd consider this technical debt to be acceptable.

And for the record, while an FAQ entry might help some users, I don't think it will reduce the number of issues opened. The kind of people who actually read the FAQ are the same people who'd look for duplicate issues.

palewire · 2022-08-11T12:53:19Z

If you really want to get existential about it, I think the core problem is requiring a title for all sites. As detailed above, there's not a tidy title for a tweet analogous to a YouTube title. So there will never be tidy answer to this problem.

That said, making the title field optional seems like a can of worms that isn't worth opening either. Some kind of compromise is necessary and, as documented in the issue that prompted this pull request, the current solution is bad for users. So, IMHO, some alternative is needed. Something along the lines of what's proposed here does seem like the simplest solution. If universality is what you're after, I think enforcing a global trim on all filenames seems like the most natural extension.

dirkf · 2022-08-11T13:38:18Z

The rationale for title is presumably what I said above, trying to make all sites fit the YT metadata schema. yt-dlp has allowed a None title while continuing to require the title field without the sky falling.

... If a proper solution was easy, I doubt this issue would have existed for so long

Amen to that. If we can get something like #29989 to work, that would fix the problem for all sites. Meanwhile:

... since twitter is so popular, I can understand why you'd consider this technical debt to be acceptable.

Exactly. The Twitter title field is up for grabs: its extraction shouldn't routinely crash the program. I guess that extractor is used more than all the other possibly problematic extractors together, which are also a small proportion of the ~800 (crikey) we have.

And

...while an FAQ entry might help some users, I don't think it will reduce the number of issues opened

Absolutely, why I mentioned putting some FAQ text in error message appears, if only we knew reliably what that would be.

palewire · 2024-12-01T11:29:05Z

I'm withdrawing this request.

palewire added 4 commits August 8, 2022 19:42

Shorten title attribute for ytdl-org#29912

27e5c22

Flake8

05aa33b

use_generic

abc3a6a

use_generic

d2381be

palewire changed the title ~~Shorten Twitter title attribute for #29912~~ [twitter] Shorten title attribute for #29912 Aug 9, 2022

dirkf reviewed Aug 9, 2022

View reviewed changes

youtube_dl/extractor/twitter.py Outdated Show resolved Hide resolved

Switch trim method

c1c8791

dirkf requested changes Aug 10, 2022

View reviewed changes

youtube_dl/extractor/twitter.py Show resolved Hide resolved

youtube_dl/extractor/twitter.py Outdated Show resolved Hide resolved

palewire and others added 2 commits August 10, 2022 05:25

Update youtube_dl/extractor/twitter.py

9b6e394

Co-authored-by: dirkf <[email protected]>

Update twitter.py

e84f067

Tweaks

784f597

palewire requested a review from dirkf August 10, 2022 12:49

palewire added 2 commits August 10, 2022 15:08

Merge branch 'ytdl-org:master' into twitter-titles

42e6110

Trailing whitespace

099dcca

dirkf force-pushed the master branch from 01bf89e to 4c6fba3 Compare August 26, 2022 07:51

dirkf mentioned this pull request Sep 12, 2022

«File name too long» should not error #29912

Open

3 tasks

palewire closed this Dec 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[twitter] Shorten title attribute for #29912 #31151

[twitter] Shorten title attribute for #29912 #31151

palewire commented Aug 9, 2022

palewire commented Aug 9, 2022

dirkf left a comment

palewire commented Aug 10, 2022

palewire commented Aug 10, 2022

dirkf commented Aug 10, 2022

palewire commented Aug 10, 2022

pukkandan commented Aug 11, 2022 •

edited

Loading

dirkf commented Aug 11, 2022

pukkandan commented Aug 11, 2022

palewire commented Aug 11, 2022 •

edited

Loading

dirkf commented Aug 11, 2022

palewire commented Dec 1, 2024

[twitter] Shorten title attribute for #29912 #31151

[twitter] Shorten title attribute for #29912 #31151

Conversation

palewire commented Aug 9, 2022

Please follow the guide below

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

What is the purpose of your pull request?

palewire commented Aug 9, 2022

dirkf left a comment

Choose a reason for hiding this comment

palewire commented Aug 10, 2022

palewire commented Aug 10, 2022

dirkf commented Aug 10, 2022

palewire commented Aug 10, 2022

pukkandan commented Aug 11, 2022 • edited Loading

dirkf commented Aug 11, 2022

Footnotes

pukkandan commented Aug 11, 2022

palewire commented Aug 11, 2022 • edited Loading

dirkf commented Aug 11, 2022

palewire commented Dec 1, 2024

pukkandan commented Aug 11, 2022 •

edited

Loading

palewire commented Aug 11, 2022 •

edited

Loading